
embed's Introduction

tidymodels

R-CMD-check Codecov test coverage CRAN_Status_Badge Downloads lifecycle

Overview

tidymodels is a “meta-package” for modeling and statistical analysis that shares the underlying design philosophy, grammar, and data structures of the tidyverse.

It includes a core set of packages that are loaded on startup:

  • broom takes the messy output of built-in functions in R, such as lm, nls, or t.test, and turns them into tidy data frames.

  • dials has tools to create and manage values of tuning parameters.

  • dplyr contains a grammar for data manipulation.

  • ggplot2 implements a grammar of graphics.

  • infer is a modern approach to statistical inference.

  • parsnip is a tidy, unified interface to creating models.

  • purrr is a functional programming toolkit.

  • recipes is a general data preprocessor with a modern interface. It can create model matrices that incorporate feature engineering, imputation, and other helpful tools.

  • rsample has infrastructure for resampling data so that models can be assessed and empirically validated.

  • tibble has a modern re-imagining of the data frame.

  • tune contains the functions to optimize model hyper-parameters.

  • workflows has methods to combine pre-processing steps and models into a single object.

  • yardstick contains tools for evaluating models (e.g. accuracy, RMSE, etc.).

A list of all tidymodels functions across different CRAN packages can be found at https://www.tidymodels.org/find/.

You can install the released version of tidymodels from CRAN with:

install.packages("tidymodels")

Install the development version from GitHub with:

# install.packages("pak")
pak::pak("tidymodels/tidymodels")

When loading the package, the versions and conflicts are listed:

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──
#> ✔ broom        1.0.5      ✔ recipes      1.0.10
#> ✔ dials        1.2.1      ✔ rsample      1.2.0 
#> ✔ dplyr        1.1.4      ✔ tibble       3.2.1 
#> ✔ ggplot2      3.5.0      ✔ tidyr        1.3.1 
#> ✔ infer        1.0.6      ✔ tune         1.2.0 
#> ✔ modeldata    1.3.0      ✔ workflows    1.1.4 
#> ✔ parsnip      1.2.1      ✔ workflowsets 1.1.0 
#> ✔ purrr        1.0.2      ✔ yardstick    1.3.1
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> ✖ purrr::discard() masks scales::discard()
#> ✖ dplyr::filter()  masks stats::filter()
#> ✖ dplyr::lag()     masks stats::lag()
#> ✖ recipes::step()  masks stats::step()
#> • Learn how to get started at https://www.tidymodels.org/start/

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

embed's People

Contributors

asiripanich, athospd, davisvaughan, dfalbel, eddelbuettel, emilhvitfeldt, hfrick, juliasilge, klahrich, konradsemsch, smingerson, tmastny, topepo


embed's Issues

Move `master` branch to `main`

The master branch of this repository will soon be renamed to main, as part of a coordinated change across several GitHub organizations (including, but not limited to: tidyverse, r-lib, tidymodels, and sol-eng). We anticipate this will happen by the end of September 2021.

That will be preceded by a release of the usethis package, which will gain some functionality around detecting and adapting to a renamed default branch. There will also be a blog post at the time of this master --> main change.

The purpose of this issue is to:

  • Help us firm up the list of targeted repositories
  • Make sure all maintainers are aware of what's coming
  • Give us an issue to close when the job is done
  • Give us a place to put advice for collaborators re: how to adapt


Extend to general classification problems?

For prediction problems with K classes, it seems like a reasonable generalization would be to create K - 1 new predictor columns of class probabilities.

In the unpooled case, nnet::multinom would be an option at the cost of another dependency. I haven't actually played around with keras yet, but it might be possible to get a dependency-free softmax that way. Some small amount of regularization may be necessary, if I recall correctly.

In the partially pooled case, there's family = "categorical" in brms, or potentially K-1 binary fits from stan_glmer or glmer. In the latter case it'd probably be best to use K-1 binary fits for the unpooled case as well for consistency.

I haven't used this personally, so I'd love to hear from someone in the know whether this would actually be useful.
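
A rough sketch of the unpooled idea with nnet::multinom, illustrative only: fit a multinomial model of the K-class outcome on the categorical predictor and keep K - 1 of the fitted class-probability columns as new predictors (grp here is a stand-in predictor, not part of any proposal).

library(nnet)

# A K = 3 class outcome and a stand-in categorical predictor
df <- data.frame(
  outcome = iris$Species,
  grp     = cut(iris$Sepal.Length, 4)
)

fit <- multinom(outcome ~ grp, data = df, trace = FALSE)
probs <- fitted(fit)                   # n x K matrix of class probabilities
new_cols <- probs[, -1, drop = FALSE]  # keep K - 1 columns as new predictors
head(new_cols)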

Enable the Tidyselect functions in embed step_*

Hi, could we enable this recipes 0.1.15 enhancement in embed? Thanks!

The full tidyselect DSL is now allowed inside recipes step_*() functions. This includes the operators &, |, - and ! and the new where() function. Additionally, the restriction preventing user defined selectors from being used has been lifted (#572).
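
A sketch of what this would allow once enabled (hypothetical until embed's steps adopt the recipes >= 0.1.15 selection semantics), using the ames data from modeldata:

library(recipes)
library(embed)
data(ames, package = "modeldata")

# Select every factor predictor with where() inside an embed step
rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_lencode_glm(where(is.factor), outcome = vars(Sale_Price))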

Harmonize across retain/preserve/keep_original_cols

In tidymodels/recipes#635 we decided to unify on keep_original_cols for the idea of retaining/preserving variables but here we've got at least one other argument:

embed/R/umap.R

Lines 35 to 36 in d46dd82

#' @param retain A single logical for whether the original predictors should
#' be kept (in addition to the new embedding variables).

In that PR I have an example of deprecating a different option (preserve) for step_pls() and switching over to the new argument name.
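
Illustrative only: what the harmonized interface could look like for step_umap() once retain is deprecated in favor of keep_original_cols (keep_original_cols is not currently an argument of this step):

library(recipes)
library(embed)

rec <- recipe(~ ., data = mtcars) %>%
  step_umap(all_predictors(), keep_original_cols = TRUE)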

Reformat readme

The readme has most of the essential information but is missing the titles and sections used in most other tidymodels packages:

  • Name
  • Badges
  • Hex logo
  • Introduction
  • Installation
  • Examples / learning more / usage
  • Contributing

See https://github.com/tidymodels/broom for example

step_embed not working with parallel processing in caret

Hello, step_embed does not seem to work with parallel processing in caret. I think it may be related to topepo/caret#860

I am getting this error, which is very similar to the error in the previous issue:

Error in {: task 1 failed - "$ operator is invalid for atomic vectors"

Here is a reproducible example:

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(embed)
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
library(doParallel)
#> Loading required package: foreach
#> Loading required package: iterators
#> Loading required package: parallel

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#> [1] doParallel_1.0.15 iterators_1.0.12  foreach_1.4.7     embed_0.0.3      
#> [5] recipes_0.1.6     dplyr_0.8.3       caret_6.0-84      ggplot2_3.2.1    
#> [9] lattice_0.20-38  
#> 
#> loaded via a namespace (and not attached):
#>   [1] minqa_1.2.4           colorspace_1.4-1      class_7.3-15         
#>   [4] ggridges_0.5.1        rsconnect_0.8.15      markdown_1.0         
#>   [7] base64enc_0.1-3       rstan_2.19.2          DT_0.7               
#>  [10] prodlim_2018.04.18    lubridate_1.7.4       codetools_0.2-16     
#>  [13] splines_3.6.0         knitr_1.23            shinythemes_1.1.2    
#>  [16] zeallot_0.1.0         bayesplot_1.7.0       jsonlite_1.6         
#>  [19] nloptr_1.2.1          tfruns_1.4            uwot_0.1.3           
#>  [22] shiny_1.3.2           compiler_3.6.0        backports_1.1.4      
#>  [25] assertthat_0.2.1      Matrix_1.2-17         lazyeval_0.2.2       
#>  [28] cli_1.1.0             later_0.8.0           htmltools_0.3.6      
#>  [31] prettyunits_1.0.2     tools_3.6.0           igraph_1.2.4.1       
#>  [34] gtable_0.3.0          glue_1.3.1            reshape2_1.4.3       
#>  [37] Rcpp_1.0.2            vctrs_0.2.0           nlme_3.1-139         
#>  [40] crosstalk_1.0.0       timeDate_3043.102     gower_0.2.1          
#>  [43] xfun_0.8              stringr_1.4.0         ps_1.3.0             
#>  [46] lme4_1.1-21           lifecycle_0.1.0       mime_0.7             
#>  [49] miniUI_0.1.1.1        gtools_3.8.1          MASS_7.3-51.4        
#>  [52] zoo_1.8-6             scales_1.0.0          ipred_0.9-9          
#>  [55] rstanarm_2.18.2       colourpicker_1.0      promises_1.0.1       
#>  [58] inline_0.3.15         shinystan_2.5.0       yaml_2.2.0           
#>  [61] reticulate_1.13       gridExtra_2.3         loo_2.1.0            
#>  [64] StanHeaders_2.18.1-10 keras_2.2.4.1         rpart_4.1-15         
#>  [67] stringi_1.4.3         highr_0.8             tensorflow_1.13.1    
#>  [70] dygraphs_1.1.1.6      boot_1.3-22           pkgbuild_1.0.3       
#>  [73] lava_1.6.6            rlang_0.4.0           pkgconfig_2.0.2      
#>  [76] matrixStats_0.54.0    evaluate_0.14         purrr_0.3.2          
#>  [79] rstantools_1.5.1      htmlwidgets_1.3       tidyselect_0.2.5     
#>  [82] processx_3.4.1        plyr_1.8.4            magrittr_1.5.0.9000  
#>  [85] R6_2.4.0              generics_0.0.2        pillar_1.4.2         
#>  [88] whisker_0.3-2         withr_2.1.2           xts_0.11-2           
#>  [91] survival_2.44-1.1     nnet_7.3-12           tibble_2.1.3         
#>  [94] crayon_1.3.4          rmarkdown_1.13.6      grid_3.6.0           
#>  [97] data.table_1.12.2     callr_3.3.1           ModelMetrics_1.2.2   
#> [100] threejs_0.3.1         digest_0.6.20         xtable_1.8-4         
#> [103] tidyr_0.8.99.9000     httpuv_1.5.1          RcppParallel_4.4.3   
#> [106] stats4_3.6.0          munsell_0.5.0         shinyjs_1.0

mtcars2 <- as_tibble(mtcars)
mtcars2 <- mtcars2 %>%
  mutate(cyl = as.factor(paste0("num_", cyl))) %>%
  mutate(am = as.factor(ifelse(am == 1, "am", "not_am")))

rec <- recipe(am ~ cyl + hp, mtcars2) %>%
  step_embed(
    cyl,
    outcome = vars(am),
    options = embed_control(epochs = 75, validation_split = 0.2)
  )

ctrl <- trainControl(
  method = 'cv',
  number = 5,
  savePredictions = 'final',
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  sampling = NULL,
  returnData = FALSE
)

cl <- makePSOCKcluster(4)
registerDoParallel(cl)

train(
  rec,
  mtcars2,
  method = "glm",
  metric = "ROC",
  trControl = ctrl
)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 6913 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Error in {: task 1 failed - "$ operator is invalid for atomic vectors"

stopCluster(cl)

Created on 2019-08-26 by the reprex package (v0.3.0)

And here is the same example, working without parallel processing:

library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(embed)
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.6
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] embed_0.0.3     recipes_0.1.6   dplyr_0.8.3     caret_6.0-84   
#> [5] ggplot2_3.2.1   lattice_0.20-38
#> 
#> loaded via a namespace (and not attached):
#>   [1] minqa_1.2.4           colorspace_1.4-1      class_7.3-15         
#>   [4] ggridges_0.5.1        rsconnect_0.8.15      markdown_1.0         
#>   [7] base64enc_0.1-3       rstan_2.19.2          DT_0.7               
#>  [10] prodlim_2018.04.18    lubridate_1.7.4       codetools_0.2-16     
#>  [13] splines_3.6.0         knitr_1.23            shinythemes_1.1.2    
#>  [16] zeallot_0.1.0         bayesplot_1.7.0       jsonlite_1.6         
#>  [19] nloptr_1.2.1          tfruns_1.4            uwot_0.1.3           
#>  [22] shiny_1.3.2           compiler_3.6.0        backports_1.1.4      
#>  [25] assertthat_0.2.1      Matrix_1.2-17         lazyeval_0.2.2       
#>  [28] cli_1.1.0             later_0.8.0           htmltools_0.3.6      
#>  [31] prettyunits_1.0.2     tools_3.6.0           igraph_1.2.4.1       
#>  [34] gtable_0.3.0          glue_1.3.1            reshape2_1.4.3       
#>  [37] Rcpp_1.0.2            vctrs_0.2.0           nlme_3.1-139         
#>  [40] iterators_1.0.12      crosstalk_1.0.0       timeDate_3043.102    
#>  [43] gower_0.2.1           xfun_0.8              stringr_1.4.0        
#>  [46] ps_1.3.0              lme4_1.1-21           lifecycle_0.1.0      
#>  [49] mime_0.7              miniUI_0.1.1.1        gtools_3.8.1         
#>  [52] MASS_7.3-51.4         zoo_1.8-6             scales_1.0.0         
#>  [55] ipred_0.9-9           rstanarm_2.18.2       colourpicker_1.0     
#>  [58] promises_1.0.1        parallel_3.6.0        inline_0.3.15        
#>  [61] shinystan_2.5.0       yaml_2.2.0            reticulate_1.13      
#>  [64] gridExtra_2.3         loo_2.1.0             StanHeaders_2.18.1-10
#>  [67] keras_2.2.4.1         rpart_4.1-15          stringi_1.4.3        
#>  [70] highr_0.8             tensorflow_1.13.1     dygraphs_1.1.1.6     
#>  [73] foreach_1.4.7         boot_1.3-22           pkgbuild_1.0.3       
#>  [76] lava_1.6.6            rlang_0.4.0           pkgconfig_2.0.2      
#>  [79] matrixStats_0.54.0    evaluate_0.14         purrr_0.3.2          
#>  [82] rstantools_1.5.1      htmlwidgets_1.3       tidyselect_0.2.5     
#>  [85] processx_3.4.1        plyr_1.8.4            magrittr_1.5.0.9000  
#>  [88] R6_2.4.0              generics_0.0.2        pillar_1.4.2         
#>  [91] whisker_0.3-2         withr_2.1.2           xts_0.11-2           
#>  [94] survival_2.44-1.1     nnet_7.3-12           tibble_2.1.3         
#>  [97] crayon_1.3.4          rmarkdown_1.13.6      grid_3.6.0           
#> [100] data.table_1.12.2     callr_3.3.1           ModelMetrics_1.2.2   
#> [103] threejs_0.3.1         digest_0.6.20         xtable_1.8-4         
#> [106] tidyr_0.8.99.9000     httpuv_1.5.1          RcppParallel_4.4.3   
#> [109] stats4_3.6.0          munsell_0.5.0         shinyjs_1.0

mtcars2 <- as_tibble(mtcars)
mtcars2 <- mtcars2 %>%
  mutate(cyl = as.factor(paste0("num_", cyl))) %>%
  mutate(am = as.factor(ifelse(am == 1, "am", "not_am")))

rec <- recipe(am ~ cyl + hp, mtcars2) %>%
  step_embed(
    cyl,
    outcome = vars(am),
    options = embed_control(epochs = 75, validation_split = 0.2)
  )

ctrl <- trainControl(
  method = 'cv',
  number = 5,
  savePredictions = 'final',
  classProbs = TRUE,
  summaryFunction = twoClassSummary,
  sampling = NULL,
  returnData = FALSE
)

train(
  rec,
  mtcars2,
  method = "glm",
  metric = "ROC",
  trControl = ctrl
)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 8761 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 7367 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 1120 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 3630 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 6134 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 6598 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 8840 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Generalized Linear Model 
#> 
#> Recipe steps: embed 
#> Resampling: Cross-Validated (5 fold) 
#> Summary of sample sizes: 26, 25, 25, 25, 27 
#> Resampling results:
#> 
#>   ROC    Sens  Spec     
#>   0.775  0.7   0.7333333

Created on 2019-08-26 by the reprex package (v0.3.0)

Release embed 0.0.6

Prepare for release:

  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS
  • Review pkgdown reference index for, e.g., missing topics

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

Error loading and baking with step_umap for new data

Hi,

Thanks a lot for this embedding step! I'm a big fan of this package, and it makes it easy to accomplish a lot with just a few lines of code.

I recently encountered a problem whilst trying to reload a saved step_umap object to bake on new data. The error message is shown below:

Error in .External(list(name = "CppMethod__invoke_void", address = <pointer: (nil)>, : NULL value passed as symbol address

I have done a bit of Google searching for the said error and did find this git issue jlmelville/uwot#19, but I wanted to check if there is any workaround in the embed package implementation.

Thanks!
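
Not an embed-level fix, but a possible workaround at the uwot level (per jlmelville/uwot#19), sketched here under the assumption that the underlying UMAP model can be accessed directly: persist the model with uwot's own helpers instead of relying on saveRDS()/readRDS().

library(uwot)

model <- umap(iris[, 1:4], ret_model = TRUE)  # keep the model for later use

model_file <- tempfile("umap_model")
save_uwot(model, file = model_file)           # write with uwot's own helper
model2 <- load_uwot(model_file)               # reload in a fresh session

# project new data with the reloaded model
embedded_new <- umap_transform(iris[1:5, 1:4], model2)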

step_lencode_(lme|bayes) fails without at least rlang 0.4.10

According to the rlang news, rlang::check_installed() was added in version 0.4.10, the most recent CRAN version as of filing this issue. This function is used in the utility functions lme_coefs() and stan_coefs(). However, embed's DESCRIPTION does not impose any version restriction on rlang.

I think the DESCRIPTION should be updated accordingly, or some alternative to rlang::check_installed() should be used.
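
A minimal, dependency-free stand-in for rlang::check_installed(), in case the version floor is not bumped (a sketch, not embed's actual code):

# Base-R alternative to rlang::check_installed()
check_installed <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    stop("The '", pkg, "' package is required for this step but is not installed.",
         call. = FALSE)
  }
  invisible(TRUE)
}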

Use irlba's truncated SVD to speed up step_pca

step_pca is very useful, but it is slow and memory-intensive when run on more than a few hundred features, even if num_comp is much smaller than p. (In my experience this makes it especially time-consuming to tune the num_comp tuning parameter, which requires running the SVD preparation step many times.)

As a solution, this step could use the irlba package for truncated SVD, which is much faster and more memory efficient when the number of components is small compared to p.

I could imagine the step either automatically using irlba when num_comp is far smaller than p, or doing so only when the user requests something like truncated = TRUE, but in any case it would be very helpful!


Reproducible example, if we were trying to build a model to identify which Jane Austen book a line of text came from:

library(janeaustenr)
library(dplyr)
library(recipes)
library(textrecipes)
library(irlba)

# Train a model to match a single line to one of Jane Austen's books 
books <- austen_books() %>%
  filter(text != "")

rec <- recipe(book ~ text, books) %>%
  step_tokenize(text) %>%
  step_tokenfilter(text, max_tokens = 300) %>%
  step_tfidf(text)

# This is slow (~40s for me), and uses so much memory that it's hard to terminate
rec %>%
  step_pca(starts_with("tfidf"), num_comp = 5) %>%
  prep() %>%
  juice()

# But this is fast (~3.5s)
rec %>%
  prep() %>%
  juice() %>%
  select(-book) %>%
  as.matrix() %>%
  irlba(nv = 5)

Set Tensorflow session seed

Hi! Thanks for developing such a good (and needed) recipe extension!

I was looking into the documentation for a while and could not find a parameter to set the TensorFlow seed to get reproducible results. Each time I rerun the transformation a new seed is used; I'd prefer to be able to fix this behaviour, at least when transforming the training data set.

Keep it up!
Cris
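
A possible workaround until a seed option exists in step_embed(), assuming TensorFlow 2.x and the tensorflow R package: set the R and TensorFlow seeds yourself before calling prep().

library(tensorflow)

set.seed(42)               # R-side RNG
tf$random$set_seed(42L)    # TensorFlow 2.x graph-level seed
# ... then prep() the recipe containing step_embed()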

Release embed 0.1.0

Prepare for release:

  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS
  • Review pkgdown reference index for, e.g., missing topics
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

Error if load embed after tensorflow

I have been running through the code on this embed:Tensorflow page: https://tidymodels.github.io/embed/articles/Applications/Tensorflow.html

I installed tensorflow/keras in an environment r-tensorflow using conda (could not get install_tensorflow to work due to HTTPS issues).

I have found that the code runs successfully if I load the libraries and define the environment in the following order (prior to running the code on the embed:Tensorflow page):

library(embed)
# Note that need to load tensorflow after embed
reticulate::use_python("C:/Users/duartem/.conda/envs/r-tensorflow/python.exe", required = TRUE)

library(tensorflow)
Sys.setenv(TENSORFLOW_PYTHON="C:/Users/duartem/.conda/envs/r-tensorflow/python")

reticulate::py_config()

If I instead load embed after tensorflow I get the following error:

Error in UseMethod("compile") : no applicable method for 'compile' applied to an object of class "c('tensorflow.python.keras.engine.training.Model', 'tensorflow.python.keras.engine.network.Network', 'tensorflow.python.keras.engine.base_layer.Layer', 'tensorflow.python.training.checkpointable.base.CheckpointableBase', 'python.builtin.object')"

This error occurs in the following step on the embed:Tensorflow page

tf_embed <- 
  recipe(Sale_Price ~ ., data = ames) %>%
  step_log(Sale_Price, base = 10) %>%
  # Add some other predictors that can be used by the network. We
  # preprocess them first
  step_YeoJohnson(Lot_Area, Full_Bath, Gr_Liv_Area)  %>%
  step_range(Lot_Area, Full_Bath, Gr_Liv_Area)  %>%
  step_embed(
    Neighborhood, 
    outcome = vars(Sale_Price),
    predictors = vars(Lot_Area, Full_Bath, Gr_Liv_Area),
    num_terms = 5, 
    hidden_units = 10, 
    options = embed_control(epochs = 75, validation_split = 0.2)
  ) %>% 
  prep(training = ames)

Session Info:

sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 R2 x64 (build 9600)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252   
[3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] AmesHousing_0.0.3  yardstick_0.0.3    tibble_2.1.2       rsample_0.0.4     
 [5] tidyr_0.8.3        purrr_0.3.2        parsnip_0.0.2      infer_0.4.0.1     
 [9] ggplot2_3.1.1      dials_0.0.2        scales_1.0.0       broom_0.5.2       
[13] tidymodels_0.0.2   embed_0.0.2        recipes_0.1.5.9000 dplyr_0.8.1       
[17] tensorflow_1.13.1 

loaded via a namespace (and not attached):
  [1] minqa_1.2.4         colorspace_1.4-1    class_7.3-15       
  [4] ggridges_0.5.1      rsconnect_0.8.13    markdown_0.9       
  [7] base64enc_0.1-3     tidytext_0.2.0      rstudioapi_0.10    
 [10] rstan_2.18.2        SnowballC_0.6.0     DT_0.6             
 [13] prodlim_2018.04.18  lubridate_1.7.4     codetools_0.2-16   
 [16] splines_3.6.0       knitr_1.23          shinythemes_1.1.2  
 [19] zeallot_0.1.0       bayesplot_1.7.0     jsonlite_1.6       
 [22] nloptr_1.2.1        pROC_1.14.0         packrat_0.5.0      
 [25] tfruns_1.4          shiny_1.3.2         compiler_3.6.0     
 [28] backports_1.1.4     assertthat_0.2.1    Matrix_1.2-17      
 [31] lazyeval_0.2.2      cli_1.1.0           later_0.8.0        
 [34] htmltools_0.3.6     prettyunits_1.0.2   tools_3.6.0        
 [37] igraph_1.2.4.1      gtable_0.3.0        glue_1.3.1         
 [40] reshape2_1.4.3      Rcpp_1.0.1          nlme_3.1-139       
 [43] crosstalk_1.0.0     timeDate_3043.102   gower_0.2.1        
 [46] xfun_0.7            stringr_1.4.0       ps_1.3.0           
 [49] lme4_1.1-21         mime_0.6            miniUI_0.1.1.1     
 [52] gtools_3.8.1        tidypredict_0.3.0   MASS_7.3-51.4      
 [55] zoo_1.8-6           ipred_0.9-9         rstanarm_2.18.2    
 [58] colourpicker_1.0    promises_1.0.1      parallel_3.6.0     
 [61] inline_0.3.15       shinystan_2.5.0     tidyposterior_0.0.2
 [64] reticulate_1.12     gridExtra_2.3       loo_2.1.0          
 [67] StanHeaders_2.18.1  keras_2.2.4.1       rpart_4.1-15       
 [70] stringi_1.4.3       tokenizers_0.2.1    dygraphs_1.1.1.6   
 [73] boot_1.3-22         pkgbuild_1.0.3      lava_1.6.5         
 [76] rlang_0.3.4         pkgconfig_2.0.2     matrixStats_0.54.0 
 [79] lattice_0.20-38     labeling_0.3        rstantools_1.5.1   
 [82] htmlwidgets_1.3     processx_3.3.1      tidyselect_0.2.5   
 [85] plyr_1.8.4          magrittr_1.5        R6_2.4.0           
 [88] generics_0.0.2      pillar_1.4.1        whisker_0.3-2      
 [91] withr_2.1.2         xts_0.11-2          survival_2.44-1.1  
 [94] nnet_7.3-12         janeaustenr_0.1.5   crayon_1.3.4       
 [97] grid_3.6.0          callr_3.2.0         threejs_0.3.1      
[100] digest_0.6.19       xtable_1.8-4        httpuv_1.5.1       
[103] stats4_3.6.0        munsell_0.5.0       shinyjs_1.0   

step_embed throws Error: "Error in if (is.na(b)) return(1L) : argument is of length zero"

A fix is needed in embed because of a change in the tensorflow library.

Using

embed_0.1.0 
tensorflow_2.2.0
reticulate_1.16
tidyverse_1.3.0

prep.encoding <- prep(recipe.encoding, training = training.set, retain = TRUE)

throws Error:
Error in if (is.na(b)) return(1L) : argument is of length zero

from this line:
compareVersion("2.0", as.character(tensorflow::tf_version()))

because
tensorflow::tf_version() returns NULL
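
A defensive version of that check (a sketch, not embed's actual source) would guard against the NULL return value before comparing versions:

tf_ver <- tensorflow::tf_version()
uses_tf2 <- !is.null(tf_ver) &&
  utils::compareVersion(as.character(tf_ver), "2.0") >= 0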

okc data now in modeldata

Not in recipes anymore:

   Running examples in ‘embed-Ex.R’ failed
   The error most likely occurred in:
   
   > ### Name: step_embed
   > ### Title: Encoding Factors into Multiple Columns
   > ### Aliases: step_embed tidy.step_embed embed_control
   > ### Keywords: datagen
   > 
   > ### ** Examples
   > 
   > data(okc)
   Warning in data(okc) : data set ‘okc’ not found
   > 
   > rec <- recipe(Class ~ age + location, data = okc) %>%
   +   step_embed(location, outcome = vars(Class),
   +              options = embed_control(epochs = 10))
   Error in is_tibble(data) : object 'okc' not found
   Calls: %>% ... eval -> recipe -> recipe.formula -> form2args -> is_tibble
   Execution halted
   ```

*   checking tests ...
   ```
    ERROR
   Running the tests in ‘tests/testthat.R’ failed.
   Last 13 lines of output:
     
     > 
     > test_check(package = "embed")
     ── 1. Error: (unknown) (@test_woe.R#9)  ────────────────────────────────────────
     object 'credit_data' not found
     Backtrace:
      1. base::sample(1:nrow(credit_data), 2000)
      2. base::nrow(credit_data)
     
     ══ testthat results  ═══════════════════════════════════════════════════════════
     [ OK: 112 | SKIPPED: 10 | WARNINGS: 1 | FAILED: 1 ]
     1. Error: (unknown) (@test_woe.R#9) 
     
     Error: testthat unit tests failed
     Execution halted
   ```
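
The likely fix is to load the data set from modeldata in the examples and tests, assuming modeldata is added to Suggests:

data(okc, package = "modeldata")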

Release embed 0.1.2

Prepare for release:

  • devtools::build_readme()
  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

Intuition behind step_tfembed approach?

Hi Max, I have a question about the approach in step_tfembed. It looks like it only supports a single hidden layer, and while it may eventually handle multiple predictors at once (#5), only those categoricals will be fed to the TF/keras model as inputs. Is that correct?

This seems quite different from the approach in Guo/Berkhahn and others where the embeddings are learned jointly with the task, considering all predictors (including non-categorical). Is the idea that even much narrower and shallower models should learn similarly good categorical embeddings, or that this approach strikes a nice effort/benefit balance? I'm curious if you could say something about the intuition there or point me towards other examples of this approach. Is this something you've had success with in your own experiments?

Clustering: Add a step_kmeans()

Love embed. It would be super awesome if there was a step_kmeans() or step_cluster() that added cluster assignments to a data frame.

Why?

Cluster assignments are super important for segmentation. K-Means and similar algorithms (e.g. K-modes) can help us to identify customer groups.

Embed

Embed is a good spot for this. step_umap() is a similar algorithm that I often use in combination with K-Means.

Let me know what you think.

Thanks, Matt
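
A manual workaround until something like step_kmeans() exists, sketched here outside of a recipe: compute the cluster assignments with stats::kmeans() and add them as a column.

library(dplyr)

set.seed(123)
km <- kmeans(scale(mtcars[, c("mpg", "hp", "wt")]), centers = 3)

mtcars_clustered <- mtcars %>%
  mutate(cluster = factor(km$cluster))  # cluster assignment as a new factor column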

Release embed 0.1.4

Prepare for release:

  • devtools::build_readme()
  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

outcome name inconsistencies for step_umap()

step_umap() creates column names in a manner inconsistent with the dimensionality reduction steps in {recipes}. Adding a prefix argument to step_umap() could properly resolve this issue.

library(recipes)
library(embed)

recipe(~., data = mtcars) %>%
  step_pca(all_predictors()) %>%
  prep() %>%
  bake(new_data = mtcars) %>%
  names()
#> [1] "PC1" "PC2" "PC3" "PC4" "PC5"

recipe(~., data = mtcars) %>%
  step_ica(all_predictors()) %>%
  prep() %>%
  bake(new_data = mtcars) %>%
  names()
#> [1] "IC1" "IC2" "IC3" "IC4" "IC5"

recipe(~., data = mtcars) %>%
  step_nnmf(all_predictors()) %>%
  prep() %>%
  bake(new_data = mtcars) %>%
  names()
#> [1] "NNMF1" "NNMF2"

recipe(~., data = mtcars) %>%
  step_umap(all_predictors()) %>%
  prep() %>%
  bake(new_data = mtcars) %>%
  names()
#> [1] "umap_1" "umap_2"

Created on 2021-03-12 by the reprex package (v0.3.0)

0.0.2 checklist

Prepare for release:

  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • Polish NEWS
  • If new failures, update email.yml then revdepcheck::revdep_email_maintainers()

Perform release:

  • Create RC branch (for bigger releases)
  • Bump version (in DESCRIPTION and NEWS)
  • devtools::check_win_devel() (again!)
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Tag release
  • Merge RC back to master branch
  • pkgdown::build_site()
  • Bump dev version
  • Write blog post Deferred for a tidymodels general update.

Template from r-lib/usethis#338

Option to estimate models with lme4 (or similar)

It might be nice to have an option to estimate pooled likelihood encodings in lme4, which is faster at the cost of less flexibility.

This is a silly example, but:

library(rstanarm)
library(lme4)
library(microbenchmark)

fit_stan <- function() stan_lmer(Sepal.Width ~ 1|Species, data = iris)
fit_lme4 <- function() lmer(Sepal.Width ~ 1|Species, data = iris)

microbenchmark(
  fit_stan(),
  fit_lme4(),
  times = 5
)

Each stan_lmer fit takes 13 seconds on my computer, but lmer takes only a fifth of a second, which could add up quickly if users are calculating encodings across many resampled datasets.

I'd be happy to make a PR (I won't be able to start working on it until next Monday), but I wanted to see if you're interested first, and whether you have any thoughts on an interface.

Moving some packages from import to suggest

This package has some very heavy (system) dependencies that are required in any case (via Imports:). Is it an option for you to move some of them into Suggests: to make a base installation lighter? As far as I understand this package, most recipe steps from this package require just one or two of these dependencies (like tensorflow), so if I only use, say, one step from this package, I technically don't need the majority of the dependencies in Imports:. This matters in particular for model deployment.

Problem with embed installation

install.packages("embed")
Installing package into ‘/home/roxana/R/x86_64-pc-linux-gnu-library/3.6’
(as ‘lib’ is unspecified)
trying URL 'https://cloud.r-project.org/src/contrib/embed_0.1.1.tar.gz'
Content type 'application/x-gzip' length 46880 bytes (45 KB)
==================================================
downloaded 45 KB

* installing source package ‘embed’ ...
** package ‘embed’ successfully unpacked and MD5 sums checked
** using staged installation
** R
** byte-compile and prepare package for lazy loading

*** caught segfault ***
address 0x7fc0dd39d008, cause 'invalid permissions'

Traceback:
1: dyn.load(file, DLLpath = DLLpath, ...)
2: library.dynam(lib, package, package.lib)
3: loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]])
4: asNamespace(ns)
5: namespaceImportFrom(ns, loadNamespace(j <- i[[1L]], c(lib.loc, .libPaths()), versionCheck = vI[[j]]), i[[2L]], from = package)
6: loadNamespace(package = package, lib.loc = lib.loc, keep.source = keep.source, keep.parse.data = keep.parse.data, partial = TRUE)
7: withCallingHandlers(expr, packageStartupMessage = function(c) invokeRestart("muffleMessage"))
8: suppressPackageStartupMessages(loadNamespace(package = package, lib.loc = lib.loc, keep.source = keep.source, keep.parse.data = keep.parse.data, partial = TRUE))
9: code2LazyLoadDB(package, lib.loc = lib.loc, keep.source = keep.source, keep.parse.data = keep.parse.data, compress = compress, set.install.dir = set.install.dir)
10: tools:::makeLazyLoading("embed", "/home/roxana/R/x86_64-pc-linux-gnu-library/3.6/00LOCK-embed/00new", keep.source = FALSE, keep.parse.data = FALSE, set.install.dir = "/home/roxana/R/x86_64-pc-linux-gnu-library/3.6/embed")
An irrecoverable exception occurred. R is aborting now ...
Segmentation fault (core dumped)
ERROR: lazy loading failed for package ‘embed’

* removing ‘/home/roxana/R/x86_64-pc-linux-gnu-library/3.6/embed’
Warning in install.packages :
  installation of package ‘embed’ had non-zero exit status

The downloaded source packages are in
‘/tmp/RtmpgjvCcx/downloaded_packages’

My sessionInfo() details as follows:

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=es_AR.UTF-8 LC_NUMERIC=C LC_TIME=es_AR.UTF-8
[4] LC_COLLATE=es_AR.UTF-8 LC_MONETARY=es_AR.UTF-8 LC_MESSAGES=es_AR.UTF-8
[7] LC_PAPER=es_AR.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_3.6.3 Matrix_1.2-18 magrittr_1.5 R6_2.4.1 generics_0.0.2
[6] tools_3.6.3 whisker_0.4 base64enc_0.1-3 Rcpp_1.0.5 reticulate_1.16
[11] keras_2.3.0.0 tensorflow_2.2.0 grid_3.6.3 zeallot_0.1.0 jsonlite_1.7.0
[16] tfruns_1.4 lattice_0.20-40

Multiple features

Could you add an example of how you would use embed when you want to create/get embeddings from multiple categorical variables? My assumption is that you would have multiple embedding layers but train the model all at once. Is that possible with embed? Can you provide an example of that? I hope this makes sense. (This would be like Fig. 1 in the Guo & Berkhahn paper.)
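
A sketch only (whether the embeddings are trained jointly, as in Guo & Berkhahn, is exactly what this issue is asking): step_embed() does accept several categorical columns in one call, e.g. with the ames data from modeldata.

library(recipes)
library(embed)
data(ames, package = "modeldata")

rec <- recipe(Sale_Price ~ Neighborhood + MS_SubClass + Gr_Liv_Area, data = ames) %>%
  step_embed(
    Neighborhood, MS_SubClass,
    outcome = vars(Sale_Price),
    options = embed_control(epochs = 10)
  )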

0.0.3 checklist

Prepare for release:

  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • rhub::check(platform = 'ubuntu-rchk')
  • rhub::check_with_sanitizers()
  • revdepcheck::revdep_check(num_workers = 4)
  • Polish NEWS
  • Polish pkgdown reference index

Submit to CRAN:

  • usethis::use_version()
  • Update cran-comments.md
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Tweet

Allow setting step_woe's "too many levels" warning threshold

Issue

step_woe (via add_woe(), which is run during both prep() and bake()) warns if the dictionary contains more than 50 levels for the factor. It gets quite noisy when using the recipe in any resampling scheme.

Request

Condition the warning based on an argument to step_woe(), say max_levels. If any of the provided predictors have more unique values than max_levels, emit the warning during prep(). The default value would be 50 to match current behavior.
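
A sketch of the proposed interface; max_levels is hypothetical, not an existing step_woe() argument (y and fct as in the reprex below):

rec <- recipe(y ~ fct, data = data) %>%
  step_woe(fct, outcome = vars(y), max_levels = 100)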

Happy to make a PR if you agree. The reprex below demonstrates the current behavior on prep() and bake().

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(tibble)
library(embed)
packageVersion("embed")
#> [1] '0.1.4'
set.seed(100)
data <- tibble(y = factor(sample(c('yes', 'no'), size = 52*10, replace = TRUE)),
               fct = factor(sample(c(letters, LETTERS), size = 52*10, replace = TRUE)))

rec <- recipe(y~fct, data = data) %>% 
  step_woe(fct, outcome = vars(y))

# warns on `prep()`, which I appreciate
prec <- prep(rec)  
#> Warning: Variable fct has 52 unique values. Is this expected? In case of numeric
#> variable, see ?step_discretize().

# also warns on `bake()`, which I don't appreciate as much.
bake(prec, new_data = data)
#> Warning: Variable fct has 52 unique values. Is this expected? In case of numeric
#> variable, see ?step_discretize().
#> # A tibble: 520 x 2
#>    y     woe_fct
#>    <fct>   <dbl>
#>  1 no    -0.590 
#>  2 yes   -0.254 
#>  3 no     0.192 
#>  4 no     0.817 
#>  5 yes   -0.542 
#>  6 yes   -0.0308
#>  7 no     0.123 
#>  8 no    -0.282 
#>  9 no     1.22  
#> 10 yes   -0.318 
#> # ... with 510 more rows

Created on 2021-02-24 by the reprex package (v1.0.0)

outcome specification inconsistent between step_woe and other embed::step_*

All embed::step_* functions have an outcome parameter. However, step_woe doesn't accept a vars() column selection; it expects the bare name. In fact, the bare name is captured by an enquo:

https://github.com/tmastny/embed/blob/cdabcf28a8c5086237637d08ca86642b6fc2af50/R/woe.R#L127

This is inconsistent with the other step functions. I'm happy to help with a PR, but I don't know which is intended (bare name or vars).

Here is a reproducible example, which gives the error:

#> Can't find column `vars(diabetes)` in `.data`.
library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(embed)
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
library(dplyr)
library(mlbench)

set.seed(124)

data(PimaIndiansDiabetes)

d <- PimaIndiansDiabetes
d <- d %>%
  as_tibble() %>%
  select(diabetes, everything())

# make factor variables
d <- d %>%
  mutate(mass_fct = factor(ifelse(mass > 30, "large", "small"))) %>%
  mutate(pregnant_fct = as.factor(pregnant)) %>%
  mutate(pressure_fct = factor(case_when(
    pressure < 30 ~ "low",
    between(pressure, 30, 50) ~ "medium",
    pressure > 50 ~ "high"
  ))) %>%
  mutate(triceps_fct = factor(ifelse(triceps > 0, "has", "none"))) %>%
  mutate(insulin_fct = factor(insulin)) %>%
  mutate(age_fct = factor(age))

# steps in `embed`
embed_rec <- recipe(diabetes ~ ., d) %>%
  embed::step_woe(mass_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_bayes(pregnant_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_glm(pressure_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_mixed(triceps_fct, outcome = vars(diabetes)) %>%
  embed::step_embed(
    insulin_fct, 
    outcome = vars(diabetes),
    options = embed_control(epochs = 1)
  ) %>%
  embed::step_umap(pedigree, outcome = vars(diabetes))

embed_rec_prepped <- prep(embed_rec, d)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Can't find column `vars(diabetes)` in `.data`.

embed_rec <- recipe(diabetes ~ ., d) %>%
  embed::step_woe(mass_fct, outcome = diabetes) %>%
  embed::step_lencode_bayes(pregnant_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_glm(pressure_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_mixed(triceps_fct, outcome = vars(diabetes)) %>%
  embed::step_embed(
    insulin_fct, 
    outcome = vars(diabetes),
    options = embed_control(epochs = 1)
  ) %>%
  embed::step_umap(pedigree, outcome = vars(diabetes))

embed_rec_prepped <- prep(embed_rec, d)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: `.key` is deprecated
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> boundary (singular) fit: see ?isSingular
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 5396 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

devtools::session_info()
#> ─ Session info ──────────────────────────────────────────────────────────
#>  setting  value                       
#>  version  R version 3.6.0 (2019-04-26)
#>  os       macOS Mojave 10.14.6        
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  ctype    en_US.UTF-8                 
#>  tz       America/Chicago             
#>  date     2019-08-29                  
#> 
#> ─ Packages ──────────────────────────────────────────────────────────────
#>  package      * version     date       lib
#>  assertthat     0.2.1       2019-03-21 [1]
#>  backports      1.1.4       2019-04-10 [1]
#>  base64enc      0.1-3       2015-07-28 [1]
#>  bayesplot      1.7.0       2019-05-23 [1]
#>  boot           1.3-22      2019-04-02 [1]
#>  callr          3.3.1       2019-07-18 [1]
#>  class          7.3-15      2019-01-01 [1]
#>  cli            1.1.0       2019-03-19 [1]
#>  codetools      0.2-16      2018-12-24 [1]
#>  colorspace     1.4-1       2019-03-18 [1]
#>  colourpicker   1.0         2017-09-27 [1]
#>  crayon         1.3.4       2017-09-16 [1]
#>  crosstalk      1.0.0       2016-12-21 [1]
#>  desc           1.2.0       2018-05-01 [1]
#>  devtools       2.0.2       2019-04-08 [1]
#>  digest         0.6.20      2019-07-04 [1]
#>  dplyr        * 0.8.3       2019-07-04 [1]
#>  DT             0.8         2019-08-07 [1]
#>  dygraphs       1.1.1.6     2018-07-11 [1]
#>  ellipsis       0.2.0.1     2019-07-02 [1]
#>  embed        * 0.0.3.9000  2019-08-30 [1]
#>  evaluate       0.14        2019-05-28 [1]
#>  fs             1.3.1       2019-05-06 [1]
#>  generics       0.0.2       2018-11-29 [1]
#>  ggplot2        3.2.1       2019-08-10 [1]
#>  ggridges       0.5.1       2018-09-27 [1]
#>  glue           1.3.1       2019-03-12 [1]
#>  gower          0.2.1       2019-05-14 [1]
#>  gridExtra      2.3         2017-09-09 [1]
#>  gtable         0.3.0       2019-03-25 [1]
#>  gtools         3.8.1       2018-06-26 [1]
#>  highr          0.8         2019-03-20 [1]
#>  htmltools      0.3.6       2017-04-28 [1]
#>  htmlwidgets    1.3         2018-09-30 [1]
#>  httpuv         1.5.1       2019-04-05 [1]
#>  igraph         1.2.4.1     2019-04-22 [1]
#>  inline         0.3.15      2018-05-18 [1]
#>  ipred          0.9-9       2019-04-28 [1]
#>  jsonlite       1.6         2018-12-07 [1]
#>  keras          2.2.4.1     2019-04-05 [1]
#>  knitr          1.23        2019-05-18 [1]
#>  later          0.8.0       2019-02-11 [1]
#>  lattice        0.20-38     2018-11-04 [1]
#>  lava           1.6.6       2019-08-01 [1]
#>  lazyeval       0.2.2       2019-03-15 [1]
#>  lifecycle      0.1.0       2019-08-01 [1]
#>  lme4           1.1-21      2019-03-05 [1]
#>  loo            2.1.0       2019-03-13 [1]
#>  lubridate      1.7.4       2018-04-11 [1]
#>  magrittr       1.5.0.9000  2019-07-03 [1]
#>  markdown       1.1         2019-08-07 [1]
#>  MASS           7.3-51.4    2019-03-31 [1]
#>  Matrix         1.2-17      2019-03-22 [1]
#>  matrixStats    0.54.0      2018-07-23 [1]
#>  memoise        1.1.0       2017-04-21 [1]
#>  mime           0.7         2019-06-11 [1]
#>  miniUI         0.1.1.1     2018-05-18 [1]
#>  minqa          1.2.4       2014-10-09 [1]
#>  mlbench      * 2.1-1       2012-07-10 [1]
#>  munsell        0.5.0       2018-06-12 [1]
#>  nlme           3.1-139     2019-04-09 [1]
#>  nloptr         1.2.1       2018-10-03 [1]
#>  nnet           7.3-12      2016-02-02 [1]
#>  pillar         1.4.2       2019-06-29 [1]
#>  pkgbuild       1.0.5       2019-08-26 [1]
#>  pkgconfig      2.0.2       2018-08-16 [1]
#>  pkgload        1.0.2       2018-10-29 [1]
#>  plyr           1.8.4       2016-06-08 [1]
#>  prettyunits    1.0.2       2015-07-13 [1]
#>  processx       3.4.1       2019-07-18 [1]
#>  prodlim        2018.04.18  2018-04-18 [1]
#>  promises       1.0.1       2018-04-13 [1]
#>  ps             1.3.0       2018-12-21 [1]
#>  purrr          0.3.2       2019-03-15 [1]
#>  R6             2.4.0       2019-02-14 [1]
#>  Rcpp           1.0.2       2019-07-25 [1]
#>  RcppAnnoy      0.0.12      2019-05-12 [1]
#>  RcppParallel   4.4.3       2019-05-22 [1]
#>  recipes      * 0.1.6       2019-07-02 [1]
#>  remotes        2.1.0       2019-06-24 [1]
#>  reshape2       1.4.3       2017-12-11 [1]
#>  reticulate     1.13        2019-07-24 [1]
#>  rlang          0.4.0       2019-06-25 [1]
#>  rmarkdown      1.13.6      2019-07-09 [1]
#>  rpart          4.1-15      2019-04-12 [1]
#>  rprojroot      1.3-2       2018-01-03 [1]
#>  rsconnect      0.8.15      2019-07-22 [1]
#>  RSpectra       0.15-0      2019-06-11 [1]
#>  rstan          2.19.2      2019-07-09 [1]
#>  rstanarm       2.18.2      2018-11-10 [1]
#>  rstantools     1.5.1       2018-08-22 [1]
#>  scales         1.0.0       2018-08-09 [1]
#>  sessioninfo    1.1.1       2018-11-05 [1]
#>  shiny          1.3.2       2019-04-22 [1]
#>  shinyjs        1.0         2018-01-08 [1]
#>  shinystan      2.5.0       2018-05-01 [1]
#>  shinythemes    1.1.2       2018-11-06 [1]
#>  StanHeaders    2.18.1-10   2019-06-14 [1]
#>  stringi        1.4.3       2019-03-12 [1]
#>  stringr        1.4.0       2019-02-10 [1]
#>  survival       2.44-1.1    2019-04-01 [1]
#>  tensorflow     1.14.0      2019-08-01 [1]
#>  testthat       2.1.1       2019-04-23 [1]
#>  tfruns         1.4         2018-08-25 [1]
#>  threejs        0.3.1       2017-08-13 [1]
#>  tibble         2.1.3       2019-06-06 [1]
#>  tidyr          0.8.99.9000 2019-08-26 [1]
#>  tidyselect     0.2.5       2018-10-11 [1]
#>  timeDate       3043.102    2018-02-21 [1]
#>  usethis        1.5.0       2019-04-07 [1]
#>  uwot           0.1.3       2019-04-07 [1]
#>  vctrs          0.2.0       2019-07-05 [1]
#>  whisker        0.4         2019-08-28 [1]
#>  withr          2.1.2       2018-03-15 [1]
#>  xfun           0.9         2019-08-21 [1]
#>  xtable         1.8-4       2019-04-21 [1]
#>  xts            0.11-2      2018-11-05 [1]
#>  yaml           2.2.0       2018-07-25 [1]
#>  zeallot        0.1.0       2018-01-28 [1]
#>  zoo            1.8-6       2019-05-28 [1]
#>  source                             
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  Github (tidymodels/embed@cdabcf2)  
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  Github (tidyverse/magrittr@4104d6b)
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  Github (rstudio/rmarkdown@7b18786) 
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  Github (tidyverse/tidyr@a3431e3)   
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#>  CRAN (R 3.6.0)                     
#> 
#> [1] /Library/Frameworks/R.framework/Versions/3.6/Resources/library

Created on 2019-08-29 by the reprex package (v0.3.0)

Somewhere between the glm and stan

Would it be interesting to have an option that lets the user choose how to blend the prior and posterior estimates through a hyperparameter (versus lme4 or stan)?

Here's a description of the approach: https://kaggle2.blob.core.windows.net/forum-message-attachments/225952/7441/high%20cardinality%20categoricals.pdf.

I've seen it used on Kaggle a lot. Users usually also add random noise, controlled by another hyperparameter, to the coefficient estimates.

It should be easy to implement, and I'd be happy to do it. It could sit on top of the glm approach (as an optional last step after the glm coefficients have been estimated) or be a separate method.
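
For reference, the blending in that paper shrinks each level's posterior mean toward the global prior using a single smoothing hyperparameter. Below is a minimal sketch of the idea; the function name and the `m` parameter are illustrative, not an existing embed API.

library(dplyr)

# Blend the per-level posterior mean with the global prior; `m` controls how
# strongly levels with few observations are shrunk toward the prior.
target_encode <- function(data, col, outcome, m = 10) {
  prior <- mean(data[[outcome]])
  data %>%
    group_by(.data[[col]]) %>%
    summarise(n = n(), posterior = mean(.data[[outcome]]), .groups = "drop") %>%
    mutate(encoding = (n * posterior + m * prior) / (n + m))
}

target_encode(mtcars, "cyl", "am", m = 5)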

woe calculation in dictionary()

Why is the woe different if I sort by the outcome column? I get different woe values depending on whether the outcome column's value in the first row of the data is 1 or 0.

library(dplyr)

mtcars_outcome_1_first <- mtcars
mtcars_outcome_0_first <- mtcars %>% arrange(am)

embed::dictionary(.data = mtcars_outcome_1_first %>% select(cyl, am), outcome = "am")
embed::dictionary(.data = mtcars_outcome_0_first %>% select(cyl, am), outcome = "am")
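
For reference, the woe values in the dictionary are log(p_pos / p_neg) per level (this matches the tidy() output shown later on this page), and those proportions should not depend on row order. A quick manual check is sketched below; it assumes the class coded 1 in `am` is treated as the positive class, which makes it easier to see which class ends up as the reference.

library(dplyr)
library(tidyr)

mtcars %>%
  count(cyl, am) %>%
  group_by(am) %>%
  mutate(p = n / sum(n)) %>%   # proportion of each cyl level within am == 0 and am == 1
  ungroup() %>%
  select(cyl, am, p) %>%
  pivot_wider(names_from = am, values_from = p, names_prefix = "p_") %>%
  mutate(woe = log(p_1 / p_0))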

`CppMethod` error when applying prepped UMAP recipe after saving/reading as `.rds`

Seems like there is a bug 🐛 in step_umap() when trying to save a prepped recipe as .rds and reading it back to apply it to new data.

library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(tidyverse)
library(embed)

split <- seq.int(1, 150, by = 9)
tr <- iris[-split, ]
te <- iris[ split, ]

set.seed(11)
supervised <- 
   recipe(Species ~ ., data = tr) %>%
   step_center(all_predictors()) %>% 
   step_scale(all_predictors()) %>% 
   step_umap(all_predictors(), outcome = vars(Species), num_comp = 2) %>% 
   prep(training = tr)

write_rds(supervised, here::here(tempdir(), "umap.rds"))
saved_rec <- read_rds(here::here(tempdir(), "umap.rds"))
saved_rec %>% bake(new_data = te)
#> Error in .External(structure(list(name = "CppMethod__invoke_notvoid", : NULL value passed as symbol address

Created on 2021-08-02 by the reprex package (v2.0.0)

I'm sure this is not us (i.e., not the embed package), but I wonder if there is anything we can do about it.

The recipe works fine if you don't save it as .rds and read it back.
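
For what it's worth, the same symptom can be reproduced with uwot alone: the fitted model appears to hold external pointers (the Annoy nearest-neighbor index) that do not survive a plain readRDS() round trip. A sketch, assuming the current uwot API:

library(uwot)

fit <- umap(as.matrix(iris[, 1:4]), ret_model = TRUE)

rds <- tempfile(fileext = ".rds")
saveRDS(fit, rds)
reloaded <- readRDS(rds)
# umap_transform(as.matrix(iris[1:5, 1:4]), reloaded)  # fails the same way

# uwot ships save_uwot()/load_uwot() to persist the model correctly
mod_file <- tempfile()
save_uwot(fit, file = mod_file)
restored <- load_uwot(file = mod_file)
umap_transform(as.matrix(iris[1:5, 1:4]), restored)

So a fix on the embed side would presumably have to hook into save_uwot()/load_uwot() when the recipe is serialized, if that is feasible at all.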

step_lencode_bayes() takes forever

I successfully applied step_lencode_glm and step_embed to small datasets.
However, applying step_lencode_bayes takes forever: even on very small datasets like Ames housing it hadn't finished after 1 h, which is when I aborted.
Is this intrinsic to the method?
Furthermore, I am confused by the output message Linear embedding for factors via GLM for all_nominal().
What Bayesian model exactly is implemented here? A reference to a paper would be great, as I need it for a publication.

Here is a reproducible example.

library(caret)       # createDataPartition()
library(dplyr)
library(magrittr)    # %T>%
library(recipes)
library(embed)

dataset <- AmesHousing::make_ames()
target.label <- "Sale_Price"
target <- dataset[[target.label]]

train.index <- createDataPartition(target, p = 0.8, list = FALSE) %>% as.vector()
training.set <- dataset[train.index, ]
testing.set <- dataset[-train.index, ]

features.labels <- training.set %>% 
    select(-target.label) %>% names %T>% print
  
recipe.base <- features.labels %>% 
    paste(collapse = " + ") %>% 
    paste(target.label, "~", .) %>% 
    as.formula %>% 
    recipe(training.set)

recipe.encoding <- recipe.base %>% 
    step_lencode_bayes(all_nominal(), outcome = vars(target.label)) 

# this step takes forever
prep.encoding <- prep(recipe.encoding, training = training.set, retain = TRUE)

training.set.juiced <- juice(prep.encoding) %T>% print
testing.set.baked <- prep.encoding %>% bake(testing.set)

Running the prep step on just the first 50 rows of the training set throws this cryptic error, for which I would appreciate an explanation:

Error: grouping factors must have > 1 sampled level
In addition: Warning messages:
1: There were 1 divergent transitions after warmup. Increasing adapt_delta above 0.95 may help. See
http://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup

Any hints welcome!
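
Based on the step's documentation, step_lencode_bayes() appears to fit a Bayesian hierarchical generalized linear model per selected column, with one random intercept per factor level, estimated with rstanarm. Running full MCMC separately for every nominal column is expected to be much slower than the glm or lme4 approaches. A rough, illustrative sketch of that model shape is below; the column choice, family, and sampler settings are assumptions, not what the step actually uses. The same shape also suggests why the "grouping factors must have > 1 sampled level" error appears: with only 50 rows, at least one selected factor likely has a single observed level.

# A hedged sketch, not embed's exact call: a hierarchical model with one
# random intercept per factor level, fit with rstanarm.
library(rstanarm)

ames <- AmesHousing::make_ames()

fit <- stan_glmer(
  Sale_Price ~ (1 | Neighborhood),  # illustrative column
  data = ames,
  family = gaussian(),
  chains = 2, iter = 500            # small run, just to show the shape
)

# the per-level encodings would correspond to the group-level intercepts
head(coef(fit)$Neighborhood)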

Implementation of `step_tsne`?

Hi!

I was just wondering whether a function to conduct t-SNE analysis could be implemented.
Looking for dimensionality reduction options in the recipes package, I found that the embed package contains step_umap (and, of course, the recipes package contains step_pca).

I'd like to thank you guys for the work you are doing in this and in many other R packages. :)
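
In the meantime, t-SNE can be run outside a recipe with the Rtsne package; note that classic t-SNE has no out-of-sample transform, which would make bake() on new data awkward and may be part of why there is no step for it. A quick sketch:

library(Rtsne)

x <- scale(as.matrix(iris[, 1:4]))
x <- x[!duplicated(x), ]                 # Rtsne errors on duplicate rows by default

fit <- Rtsne(x, dims = 2, perplexity = 30)
head(fit$Y)                              # the 2-d embedding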

step_embed() "argument is of length zero" error

I'm trying to learn how to do entity embedding for categorical variables, but I keep getting this error and I can't figure out why.

library(recipes)
library(embed)

rec <- recipe(Case ~ AnimacyObj + Participants + AgencySubj, data = causee) %>%
  step_embed(AnimacyObj, Participants, AgencySubj,
             outcome = vars(Case),
             num_terms = 3,
             hidden_units = 10,
             options = embed_control(epochs = 25, validation_split = 0.2)) %>%
  prep()

Error in if (is.na(b)) return(1L) : argument is of length zero

Release embed 0.1.3

Prepare for release:

  • devtools::build_readme()
  • Check current CRAN check results
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • revdepcheck::revdep_check(num_workers = 4)
  • Update cran-comments.md
  • Polish NEWS

Submit to CRAN:

  • usethis::use_version('patch')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()

step_woe() printing

The problem

I'm having trouble printing a recipe that includes step_woe(). Other steps (printed via the recipe) produce a single line of explanation, but step_woe() looks like a raw print. Perhaps print.step_woe is not properly exported.

Reproducible example

library(recipes)
library(embed)
library(modeldata)
data("credit_data")

rec <- recipe(Status ~ ., data = credit_data) %>%
  step_center(all_numeric()) %>%
  step_woe(Job, Home, outcome = vars(Status)) %>%
  step_scale(all_numeric())

print("The printed recipe follows - Note that step_woe is a 'raw' print compared to step_center & step_scale")
print(rec)

tidy.step_woe inconsistent compared to other embed::tidy.step_*

I have a large set of recipes (you could call them a cookbook). To manage all the recipes, I like to programmatically extract the step function and all the columns the function applies to.

However, step_woe has an inconsistency in the naming that makes programmatically extracting this data more difficult.

It refers to the terms as variables. You can see it in the source code here:

embed/R/woe.R

Line 411 in cdabcf2

res <- tibble(variable = term_names,

Even though the local variable is named term_names, the resulting column is called variable rather than terms. This is inconsistent with the other embed::step_* functions.

For example:

embed/R/bayes.R

Line 294 in cdabcf2

terms = term_names

I'm happy to help out with a PR if you are interested. I'm also working through an audit of recipes::step_* functions, but I haven't had a chance to post yet.

Here is a reproducible example showing the output of all the tidy.step_* methods. Only step_woe uses the variable column name:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
library(embed)
#> Registered S3 method overwritten by 'xts':
#>   method     from
#>   as.zoo.xts zoo
library(dplyr)
library(mlbench)

set.seed(124)

data(PimaIndiansDiabetes)

d <- PimaIndiansDiabetes
d <- d %>%
  as_tibble() %>%
  select(diabetes, everything())

# make factor variables
d <- d %>%
  mutate(mass_fct = factor(ifelse(mass > 30, "large", "small"))) %>%
  mutate(pregnant_fct = as.factor(pregnant)) %>%
  mutate(pressure_fct = factor(case_when(
    pressure < 30 ~ "low",
    between(pressure, 30, 50) ~ "medium",
    pressure > 50 ~ "high"
  ))) %>%
  mutate(triceps_fct = factor(ifelse(triceps > 0, "has", "none"))) %>%
  mutate(insulin_fct = factor(insulin)) %>%
  mutate(age_fct = factor(age))

# steps in `embed`
embed_rec <- recipe(diabetes ~ ., d) %>%
  embed::step_woe(mass_fct, outcome = diabetes) %>%
  embed::step_lencode_bayes(pregnant_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_glm(pressure_fct, outcome = vars(diabetes)) %>%
  embed::step_lencode_mixed(triceps_fct, outcome = vars(diabetes)) %>%
  embed::step_embed(
    insulin_fct, 
    outcome = vars(diabetes),
    options = embed_control(epochs = 1)
  ) %>%
  embed::step_umap(pedigree, outcome = vars(diabetes))

embed_rec_prepped <- prep(embed_rec, d)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Warning: `.key` is deprecated
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> boundary (singular) fit: see ?isSingular
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?
#> Set session seed to 2636 (disabled GPU, CPU parallelism)
#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

#> Warning: All elements of `...` must be named.
#> Did you want `data = c(type, role, source)`?

for (i in seq_along(embed_rec_prepped$steps)) {
  embed_rec_prepped %>%
    tidy(i) %>%
    print()
}
#> # A tibble: 2 x 9
#>   variable predictor n_tot n_neg n_pos p_neg p_pos    woe id       
#>   <chr>    <chr>     <int> <dbl> <dbl> <dbl> <dbl>  <dbl> <chr>    
#> 1 mass_fct large       465   250   215   0.5 0.802  0.473 woe_Jltqf
#> 2 mass_fct small       303   250    53   0.5 0.198 -0.928 woe_Jltqf
#> # A tibble: 18 x 4
#>    level   value terms        id                 
#>    <chr>   <dbl> <chr>        <chr>              
#>  1 0      0.630  pregnant_fct lencode_bayes_Dh6Ag
#>  2 1      1.20   pregnant_fct lencode_bayes_Dh6Ag
#>  3 2      1.33   pregnant_fct lencode_bayes_Dh6Ag
#>  4 3      0.543  pregnant_fct lencode_bayes_Dh6Ag
#>  5 4      0.630  pregnant_fct lencode_bayes_Dh6Ag
#>  6 5      0.515  pregnant_fct lencode_bayes_Dh6Ag
#>  7 6      0.679  pregnant_fct lencode_bayes_Dh6Ag
#>  8 7     -0.0949 pregnant_fct lencode_bayes_Dh6Ag
#>  9 8     -0.139  pregnant_fct lencode_bayes_Dh6Ag
#> 10 9     -0.280  pregnant_fct lencode_bayes_Dh6Ag
#> 11 10     0.353  pregnant_fct lencode_bayes_Dh6Ag
#> 12 11    -0.0729 pregnant_fct lencode_bayes_Dh6Ag
#> 13 12     0.304  pregnant_fct lencode_bayes_Dh6Ag
#> 14 13     0.217  pregnant_fct lencode_bayes_Dh6Ag
#> 15 14     0.0188 pregnant_fct lencode_bayes_Dh6Ag
#> 16 15     0.166  pregnant_fct lencode_bayes_Dh6Ag
#> 17 17     0.173  pregnant_fct lencode_bayes_Dh6Ag
#> 18 ..new  0.341  pregnant_fct lencode_bayes_Dh6Ag
#> # A tibble: 4 x 4
#>   level  value terms        id                 
#>   <chr>  <dbl> <chr>        <chr>              
#> 1 high   0.634 pressure_fct lencode_bayes_4bgX7
#> 2 low    0.223 pressure_fct lencode_bayes_4bgX7
#> 3 medium 0.916 pressure_fct lencode_bayes_4bgX7
#> 4 ..new  0.591 pressure_fct lencode_bayes_4bgX7
#> # A tibble: 3 x 4
#>   level value terms       id                 
#>   <chr> <dbl> <chr>       <chr>              
#> 1 has   0.624 triceps_fct lencode_bayes_hv7LH
#> 2 none  0.624 triceps_fct lencode_bayes_hv7LH
#> 3 ..new 0.624 triceps_fct lencode_bayes_hv7LH
#> # A tibble: 187 x 5
#>    insulin_fct_embed_1 insulin_fct_embed_2 level terms     id              
#>                  <dbl>               <dbl> <chr> <chr>     <chr>           
#>  1            -0.0301             -0.0259  ..new insulin_… lencode_bayes_G…
#>  2             0.0146              0.0397  0     insulin_… lencode_bayes_G…
#>  3            -0.00895            -0.0290  14    insulin_… lencode_bayes_G…
#>  4            -0.0280             -0.00999 15    insulin_… lencode_bayes_G…
#>  5            -0.0267             -0.0226  16    insulin_… lencode_bayes_G…
#>  6            -0.0463              0.00294 18    insulin_… lencode_bayes_G…
#>  7            -0.0102             -0.0445  22    insulin_… lencode_bayes_G…
#>  8             0.00978             0.0479  23    insulin_… lencode_bayes_G…
#>  9             0.0182             -0.0329  25    insulin_… lencode_bayes_G…
#> 10            -0.00414            -0.0186  29    insulin_… lencode_bayes_G…
#> # … with 177 more rows
#> # A tibble: 1 x 2
#>   terms    id        
#>   <chr>    <chr>     
#> 1 pedigree umap_exfSj

Created on 2019-08-29 by the reprex package (v0.3.0)
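
For context, the programmatic extraction the issue refers to ends up looking something like this (just an illustration, run against embed_rec_prepped from the reprex above); the branch for step_woe's variable column is exactly the special case that would disappear if the column were named terms:

library(dplyr)
library(purrr)

step_columns <- map_dfr(
  seq_along(embed_rec_prepped$steps),
  function(i) {
    info <- tidy(embed_rec_prepped, number = i)
    # every other step exposes `terms`; step_woe alone exposes `variable`
    cols <- if ("terms" %in% names(info)) info$terms else info$variable
    tibble(
      step   = class(embed_rec_prepped$steps[[i]])[1],
      column = unique(cols)
    )
  }
)

step_columns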

Add novel-value handling method to step_woe

Hi

I noticed that all the glm-based steps have a method for handling novel values: "For novel levels, a slightly trimmed average of the coefficients is returned."

Would it make sense to add the same handling method to step_woe?
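
For what it's worth, a sketch of the idea, reusing the dictionary's column names as shown in the tidy() output above (the helper and the trimming fraction are illustrative, not a proposed API): give any unseen level the trimmed mean of the observed woe values.

library(dplyr)

add_novel_woe <- function(woe_dict, trim = 0.1) {
  novel <- woe_dict %>%
    group_by(variable) %>%
    summarise(predictor = "..new",
              woe = mean(woe, trim = trim),
              .groups = "drop")
  bind_rows(woe_dict, novel)
}

# e.g. add_novel_woe(embed::dictionary(mtcars %>% select(cyl, am), outcome = "am"))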
