templates_ml's Introduction

Templates for machine learning projects

This is a set of templates for machine learning projects. It is a work in progress, and definitely not a final or best version.

The examples in both Python and R contain 'pipelines': a sequence of steps, so that you perform both feature engineering and model training on the training set and then apply the trained artefacts to the test set. That way you do not leak information from the test set into the training set.
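To make the "train on the training set, reuse the artefacts on the test set" idea concrete, here is a minimal sketch in plain Python. The helper names `fit_scaler` and `apply_scaler` are made up for this illustration and are not part of the templates; the templates themselves use sklearn and tidymodels pipelines for this.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn centering and scaling parameters from the given values only."""
    return {"mean": mean(values), "sd": pstdev(values)}


def apply_scaler(params, values):
    """Transform values with parameters that were learned earlier."""
    return [(v - params["mean"]) / params["sd"] for v in values]


train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 20.0]

# The "trained artefact": learned from the training set and nothing else.
params = fit_scaler(train)

# The same artefact is applied to both sets; the test set never
# influences the parameters, so nothing leaks from test into train.
train_scaled = apply_scaler(params, train)
test_scaled = apply_scaler(params, test)
```

If the scaler were fitted on `train + test` instead, the test values would shift the mean and standard deviation, which is exactly the kind of leak a pipeline is meant to prevent.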

| example | python | R |
|---|---|---|
| simple categorical | simple categorical rf.py | categorical_prediction_rf.R |
| simple categorical with lightgbm | | categorical_prediction_lgbm.R |
| cross validation | | cross_validation |
| hyperparameter tuning | hyperparameter_tuning.py | hyperparameter_tuning |

Comparison R and Python templates

The R language was designed with data analysis in mind, and when ML came on the scene it flowed nicely into the language. Python is a general-purpose language with ML and some stats packages bolted on. Python is very readable, even for people who never program in Python. R has data.frames, missing values and statistical functionality built in, but its syntax looks a bit weird.

Categorical data

R has built-in support for categorical data (factors), and tree-based models like lightGBM and random forests can make efficient use of that. In Python, it is possible to mark a column as 'categorical', and some models (like lightGBM) can make use of that. But this needs attention, and you have to do the work yourself!
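As a rough sketch of the extra work on the Python side, the snippet below marks a column as categorical with the pandas `category` dtype and then reuses the categories learned from the training data on new data. The column names and values are invented for this example, and whether a given model actually exploits the dtype depends on that model's own documentation.

```python
import pandas as pd

train = pd.DataFrame(
    {"color": ["red", "green", "red", "blue"], "size": [1.0, 2.5, 0.7, 3.1]}
)

# Convert the text column to the pandas 'category' dtype.
train["color"] = train["color"].astype("category")

# The categories and their integer codes are stored with the column.
train_categories = train["color"].cat.categories
train_codes = train["color"].cat.codes.tolist()

# New data has to be encoded with the categories learned on the training set,
# otherwise the same label could end up with a different code.
new_values = pd.Series(["green", "purple"])
new_color = pd.Series(pd.Categorical(new_values, categories=train_categories))
new_codes = new_color.cat.codes.tolist()
```

Values that were never seen in the training data (here "purple") become missing, with code -1, so they need to be handled explicitly. R factors behave similarly, but there the bookkeeping is part of the language.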

Object oriented programming

Python loves encapsulated object-oriented programming: methods are part of objects, and objects need to be instantiated/initialized. You call the fit method on a sklearn model object: modelobject.fit(args). R loves functional object-oriented programming: methods belong to generic functions, and a call looks like a normal function call: fit(modelobject, args).

In practice that makes the code look and feel different.

So in Python (sklearn) you import a model class, instantiate it and manipulate that object; the model object keeps state.

from sklearn.ensemble import RandomForestClassifier

# X (features) and y (labels) are assumed to be defined already
# instantiate RandomForest classifier instance
clf = RandomForestClassifier(max_depth=2, random_state=0)
# fit that instance with features (X) and results (y)
clf.fit(X, y)
# the clf object contains the trained model now.
# use that object to predict new values.
clf.predict([[0, 0, 0, 0]])

In R (tidymodels) you set up a model and write the result to a new object, and if you want, you can even pipe the steps one after another.

library(tidymodels)

# instantiate a random forest model
rf_mod <-
  rand_forest(trees = 1000) %>%
  set_mode("classification") %>%
  set_engine("ranger")
# train the model (rf_mod),
# with species as target
# and all the other variables as features
# use the trainingset
trained_model <- fit(rf_mod, species ~ ., trainingset)
# the trained_model now contains enough information to predict new data
predictions <- predict(trained_model, testset)
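The difference between the two styles can also be shown inside one language. The sketch below imitates R's generic-function style in Python with `functools.singledispatch`, next to the usual method style. The classes `MeanModel`, `MeanSpec` and `MedianSpec` are toy models invented for this illustration; they are not sklearn or tidymodels objects.

```python
from dataclasses import dataclass, replace
from functools import singledispatch
from statistics import mean, median
from typing import Optional


class MeanModel:
    """Encapsulated style: the method lives on the object and changes its state."""

    def __init__(self):
        self.value = None

    def fit(self, y):
        self.value = mean(y)
        return self


@dataclass(frozen=True)
class MeanSpec:
    """Functional style: an immutable model specification."""
    value: Optional[float] = None


@dataclass(frozen=True)
class MedianSpec:
    value: Optional[float] = None


@singledispatch
def fit(spec, y):
    """Generic function: the implementation is chosen by the type of `spec`."""
    raise NotImplementedError(f"no fit method for {type(spec).__name__}")


@fit.register
def _(spec: MeanSpec, y):
    # Return a new, trained object instead of modifying the specification.
    return replace(spec, value=mean(y))


@fit.register
def _(spec: MedianSpec, y):
    return replace(spec, value=median(y))


y = [1.0, 2.0, 6.0]

model = MeanModel()
model.fit(y)                 # modelobject.fit(args): the object itself is updated

spec = MeanSpec()
trained = fit(spec, y)       # fit(modelobject, args): the result is a new object
```

After this runs, `model` holds the fitted value itself, while `spec` is unchanged and `trained` is a separate object, which mirrors the `trained_model <- fit(rf_mod, ...)` pattern in the R example.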

Concepts in different languages

What is it called (or: help me search)? See also this excellent article by Tim Mastny.

| Concept | Python (sklearn) | R (tidymodels) |
|---|---|---|
| combine feature engineering & modeling steps | Pipeline | workflow |
| split data into training and test set | train_test_split() | initial_split() |
| feature engineering | sklearn.preprocessing | recipes::step_* functions |
| tuning | create a dictionary yourself with tunable hyperparameters, e.g. {"decisiontreeclassifier__max_depth": [1, 4, 8, 11]} | dials::grid_* functions (max_entropy, latin_hypercube, random, regular) |
| cross-validation | from sklearn.model_selection import GridSearchCV | vfold_cv() |
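The sklearn column of the table can be tied together in one short sketch. The dataset (iris, which ships with sklearn), the choice of `StandardScaler` as the feature engineering step and the 5-fold setting are assumptions made for this example and are not taken from the templates.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# combine feature engineering and modeling steps in one Pipeline;
# make_pipeline names each step after its lowercased class name
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))

# tuning: a dictionary keyed by "<step name>__<parameter name>"
param_grid = {"decisiontreeclassifier__max_depth": [1, 4, 8, 11]}

# cross-validation: every candidate is evaluated with 5-fold CV on the training set
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

best_depth = search.best_params_["decisiontreeclassifier__max_depth"]
test_accuracy = search.score(X_test, y_test)
```

Because the scaler is part of the pipeline, it is refitted inside every cross-validation fold, so the held-out fold does not leak into the preprocessing either.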

templates_ml's People

Contributors

rmhogervorst

templates_ml's Issues

Warning message: All models failed in tune_grid(). See the `.notes` column.

Hi there,

I'm amazed by your blog post about using tidymodels with lightgbm, but I encountered a warning message when trying to reproduce your templates_ml code:
"Warning message: All models failed in tune_grid(). See the .notes column."

I don't know where it went wrong.

The following is my session info:
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Traditional)_Taiwan.950 LC_CTYPE=Chinese (Traditional)_Taiwan.950
[3] LC_MONETARY=Chinese (Traditional)_Taiwan.950 LC_NUMERIC=C
[5] LC_TIME=Chinese (Traditional)_Taiwan.950

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] doParallel_1.0.15 iterators_1.0.12 foreach_1.5.0 ggplot2_3.3.2 treesnip_0.1.0 yardstick_0.0.7
[7] workflows_0.1.2 dials_0.0.8 scales_1.1.1 tune_0.1.1 parsnip_0.1.2 recipes_0.1.13
[13] rsample_0.0.7 dplyr_1.0.0 janitor_2.0.1 AmesHousing_0.0.4

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lubridate_1.7.9 lattice_0.20-41 tidyr_1.1.1 listenv_0.8.0 class_7.3-17
[7] assertthat_0.2.1 digest_0.6.25 ipred_0.9-9 plyr_1.8.6 R6_2.4.1 pillar_1.4.6
[13] rlang_0.4.7 data.table_1.12.8 rstudioapi_0.11 DiceDesign_1.8-1 furrr_0.1.0 rpart_4.1-15
[19] Matrix_1.2-18 splines_4.0.2 gower_0.2.2 stringr_1.4.0 munsell_0.5.0 tinytex_0.24
[25] compiler_4.0.2 xfun_0.15 pkgconfig_2.0.3 globals_0.12.5 nnet_7.3-14 tidyselect_1.1.0
[31] tibble_3.0.3 prodlim_2019.11.13 codetools_0.2-16 GPfit_1.0-8 lightgbm_3.0.0 fansi_0.4.1
[37] future_1.18.0 crayon_1.3.4 withr_2.2.0 MASS_7.3-51.6 grid_4.0.2 jsonlite_1.7.0
[43] gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5 pROC_1.16.2 cli_2.0.2 stringi_1.4.6
[49] timeDate_3043.102 snakecase_0.11.0 ellipsis_0.3.1 lhs_1.0.2 generics_0.0.2 vctrs_0.3.2
[55] lava_1.6.7 tools_4.0.2 glue_1.4.1 purrr_0.3.4 survival_3.2-3 colorspace_1.4-1
