Feature Rank: ensemble feature ranking for variable selection

This package provides ensemble feature ranking for variable selection in SuperLearner ensembles (Polley et al. 2021), based on Effrosynidis and Arampatzis (2021). Multiple algorithms each rank the predictors by the strength of their relationship with the outcome in the training set, and these rankings are combined into a single ranking via an aggregation method (currently reciprocal ranking). The final ranking can then be cut at a chosen number of variables (e.g. top 10 predictors or top 70%) to create one or more feature selection wrappers for SuperLearner. The result should generally be more robust and stable than feature selection using a single algorithm. See also Neumann, Genze, and Heider (2017) for a similar method.
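To make the aggregation step concrete, here is a minimal sketch of reciprocal rank aggregation in base R. It assumes that each feature's score is the sum of its reciprocal ranks across rankers and that a higher score means a better final rank. The function `reciprocal_rank_sketch` and the toy data are hypothetical illustrations; the package's own `agg_reciprocal_rank()` may differ in details such as tie handling.

```r
# Hypothetical helper for illustration, not the package's agg_reciprocal_rank().
# `ranks` is a features x rankers matrix in which 1 = strongest predictor.
reciprocal_rank_sketch = function(ranks) {
  # Sum reciprocal ranks across rankers; features that are consistently
  # near the top receive larger scores.
  score = rowSums(1 / ranks)
  # Convert scores back into a ranking in which 1 is best.
  agg = rank(-score, ties.method = "min")
  names(agg) = rownames(ranks)
  agg
}

# Toy rankings from three imaginary rankers for four features.
toy_ranks = cbind(ranker_a = c(1, 2, 3, 4),
                  ranker_b = c(2, 1, 4, 3),
                  ranker_c = c(1, 3, 2, 4))
rownames(toy_ranks) = c("f1", "f2", "f3", "f4")

toy_agg = reciprocal_rank_sketch(toy_ranks)
print(toy_agg)
```

Because reciprocals shrink quickly, a top rank from one ranker contributes far more than a middling rank does, so disagreements near the bottom of the list have little influence on the aggregate.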

Install

# install.packages("remotes")
remotes::install_github("ck37/featurerank")

Algorithms

Currently implemented algorithms are:

  • Feature ranking: correlation, glm, glmnet, random forest, bart, xgboost + shap, variance
  • Rank aggregation: reciprocal ranking

Example

A minimal example to demonstrate how the package can be used.

Prepare dataset

# TODO: switch to a less problematic demo dataset.
data(Boston, package = "MASS")

# Use "chas" as our outcome variable, which is binary.
y = Boston$chas
x = subset(Boston, select = -chas)

Create feature ranking library

Specify the feature ranking wrappers for the ensemble library.

library(featurerank)

# Modify RF feature ranker to use 100 trees (faster than default of 500).
featrank_randomForest100 =
  function(...) featrank_randomForest(ntree = 100L, ...)

# Specify the set of feature ranking algorithms.
ensemble_rank_custom =
  function(top_vars, ...)
    ensemble_rank(fn_rank = c(featrank_cor, featrank_randomForest100,
                              featrank_glm, featrank_glmnet),
                              #featrank_shap, # too verbose currently
                              #featrank_dbarts), # skip for speed
                  top_vars = top_vars,
                  ...)

# There are 13 total vars so try dropping 1 of them.
top12 = function(...) ensemble_rank_custom(top_vars = 12, ...)

# Try dropping worst 2 predictors.
top11 = function(...) ensemble_rank_custom(top_vars = 11, ...)

# Drop worst 3 predictors.
top10 = function(...) ensemble_rank_custom(top_vars = 10, ...)
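The cut itself can be sketched in a few lines of base R. This is only an illustration of the idea, assuming a ranking where 1 is best and that a `top_vars` value below 1 is read as a proportion; `cut_ranking_sketch` is a hypothetical name, not a function exported by featurerank. The ranking values are copied from the aggregated ranking printed in the stability section of this README.

```r
# Hypothetical helper (not part of featurerank): keep the best-ranked features.
# `ranking`: named vector in which 1 = best.
# `top_vars`: a count (>= 1) or, by assumption, a proportion (< 1).
cut_ranking_sketch = function(ranking, top_vars) {
  # Translate a proportion into a count of features to keep.
  n_keep = if (top_vars < 1) ceiling(top_vars * length(ranking)) else top_vars
  # TRUE marks features that pass the screen.
  ranking <= n_keep
}

# Aggregated ranking copied from the stability section of this README.
agg_ranking = c(crim = 11, zn = 13, indus = 9, nox = 4, rm = 8, age = 10,
                dis = 6, rad = 3, tax = 7, ptratio = 5, black = 12,
                lstat = 2, medv = 1)

keep_top10 = cut_ranking_sketch(agg_ranking, top_vars = 10)
keep_top70pct = cut_ranking_sketch(agg_ranking, top_vars = 0.7)

# Features that would be dropped by the top-10 cut.
dropped = names(keep_top10)[!keep_top10]
print(dropped)
```

A named logical vector of this kind is the form SuperLearner expects a screening wrapper to return, which is why the `whichScreen` output in this README is a table of TRUE and FALSE values.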

Use in SuperLearner

library(SuperLearner)

set.seed(1)
# Takes about 90 seconds with 1 core.
sl = SuperLearner(y, x, family = binomial(),
                  # 10-fold cross-validation stratified on the outcome.
                  cvControl = list(V = 10L, stratifyCV = TRUE),
                  SL.library =
                    list("SL.glm", # Baseline estimator uses all predictors.
                         # Try three ensemble screening options, giving the
                         # screened variable list to logistic regression (SL.glm).
                         c("SL.glm", "top12", "top11", "top10")))

# Review timing.
sl$times$everything
##    user  system elapsed 
##  90.393   0.637  91.407
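# AUC rises slightly as predictors are dropped (0.743 with all 13 vs. 0.761 for
# the top 10), but the confidence intervals overlap substantially.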
ck37r::auc_table(sl, y = y)[, -6]
##        learner       auc         se  ci_lower  ci_upper
## 1   SL.glm_All 0.7426862 0.02930653 0.6852464 0.8001259
## 2 SL.glm_top12 0.7485151 0.02852544 0.6926062 0.8044239
## 3 SL.glm_top11 0.7535018 0.02760091 0.6994050 0.8075986
## 4 SL.glm_top10 0.7613032 0.02585664 0.7106251 0.8119813
# Which features were dropped (will show FALSE below)?
t(sl$whichScreen)
##          All top12 top11 top10
## crim    TRUE  TRUE  TRUE FALSE
## zn      TRUE FALSE FALSE FALSE
## indus   TRUE  TRUE  TRUE  TRUE
## nox     TRUE  TRUE  TRUE  TRUE
## rm      TRUE  TRUE  TRUE  TRUE
## age     TRUE  TRUE  TRUE  TRUE
## dis     TRUE  TRUE  TRUE  TRUE
## rad     TRUE  TRUE  TRUE  TRUE
## tax     TRUE  TRUE  TRUE  TRUE
## ptratio TRUE  TRUE  TRUE  TRUE
## black   TRUE  TRUE FALSE FALSE
## lstat   TRUE  TRUE  TRUE  TRUE
## medv    TRUE  TRUE  TRUE  TRUE

Assess ranking stability

# Check if we see stability across multiple runs,
# especially for comparison to individual feature ranking algorithms.
# (See stability scores in Table 3 of paper.)
set.seed(2)

# Takes about 90 seconds using 1 core.
system.time({
results =
  do.call(rbind.data.frame,
          lapply(1:10,
                 function(i) top12(y, x, family = binomial(),
                                   # The default is 3 replications; more replications increase stability.
                                   replications = 10,
                                   detailed_results = TRUE)$ranking))
})
##    user  system elapsed 
##  90.368   0.648  91.309
names(results) = names(x)
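# Stability looks excellent: the top three (medv, lstat, rad) and bottom three
# (crim, black, zn) are identical in every run; mid-ranked predictors shift by a few positions.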
results
##    crim zn indus nox rm age dis rad tax ptratio black lstat medv
## 1    11 13     8   5  9  10   6   3   7       4    12     2    1
## 2    11 13     7   4  9  10   8   3   6       5    12     2    1
## 3    11 13     7   4  9  10   6   3   8       5    12     2    1
## 4    11 13     7   4 10   9   6   3   8       5    12     2    1
## 5    11 13    10   4  7   9   6   3   8       5    12     2    1
## 6    11 13     8   4  9   7  10   3   6       5    12     2    1
## 7    11 13     9   5 10   7   6   3   8       4    12     2    1
## 8    11 13     9   4  6  10   7   3   8       5    12     2    1
## 9    11 13    10   4  6   8   7   3   9       5    12     2    1
## 10   11 13     9   4  6   8  10   3   7       5    12     2    1
# What if we treated each iteration as its own ranking and then aggregated?
agg_reciprocal_rank(t(results))
##    crim      zn   indus     nox      rm     age     dis     rad     tax ptratio 
##      11      13       9       4       8      10       6       3       7       5 
##   black   lstat    medv 
##      12       2       1
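Stability can also be summarized numerically rather than by eye. The sketch below is one possible approach, not the stability score used in the paper: it takes the ten rankings printed above and computes each feature's rank range across runs along with the mean pairwise Spearman correlation between runs. The names `rank_runs`, `rank_range`, and `mean_rho` are introduced here for illustration.

```r
# The ten rankings printed above, one row per run (copied from the output).
rank_runs = matrix(c(
  11, 13,  8, 5,  9, 10,  6, 3, 7, 4, 12, 2, 1,
  11, 13,  7, 4,  9, 10,  8, 3, 6, 5, 12, 2, 1,
  11, 13,  7, 4,  9, 10,  6, 3, 8, 5, 12, 2, 1,
  11, 13,  7, 4, 10,  9,  6, 3, 8, 5, 12, 2, 1,
  11, 13, 10, 4,  7,  9,  6, 3, 8, 5, 12, 2, 1,
  11, 13,  8, 4,  9,  7, 10, 3, 6, 5, 12, 2, 1,
  11, 13,  9, 5, 10,  7,  6, 3, 8, 4, 12, 2, 1,
  11, 13,  9, 4,  6, 10,  7, 3, 8, 5, 12, 2, 1,
  11, 13, 10, 4,  6,  8,  7, 3, 9, 5, 12, 2, 1,
  11, 13,  9, 4,  6,  8, 10, 3, 7, 5, 12, 2, 1),
  nrow = 10, byrow = TRUE,
  dimnames = list(NULL, c("crim", "zn", "indus", "nox", "rm", "age", "dis",
                          "rad", "tax", "ptratio", "black", "lstat", "medv")))

# Range of each feature's rank across runs; 0 means the rank never changed.
rank_range = apply(rank_runs, 2, function(r) max(r) - min(r))
print(rank_range)

# Mean pairwise Spearman correlation between runs (1 = identical rankings).
rho = cor(t(rank_runs), method = "spearman")
mean_rho = mean(rho[upper.tri(rho)])
print(mean_rho)
```

Running the same summary on the rankings from a single algorithm would give a like-for-like comparison with the ensemble.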

References

Effrosynidis, Dimitrios, and Avi Arampatzis. 2021. “An Evaluation of Feature Selection Methods for Environmental Data.” Ecological Informatics 61: 101224.

Neumann, Ursula, Nikita Genze, and Dominik Heider. 2017. “EFS: An Ensemble Feature Selection Tool Implemented as R-Package and Web-Application.” BioData Mining 10 (1): 1–9.

Polley, Eric, Erin LeDell, Chris J. Kennedy, Sam Lendle, and Mark van der Laan. 2021. “SuperLearner: Super Learner Prediction.” CRAN. https://CRAN.R-project.org/package=SuperLearner.

featurerank's Issues

Support observation weights

All wrappers should support observation weights if at all possible. The correlation wrapper needs to be upgraded at the very least.

Review other wrapper options

  • Additional feature selection algorithms to try from the main paper
  • Use tidymodels or mlr3 - do they have good feature selection/ranking options?
