
mlr3filters's Introduction

mlr3filters

Package website: release | dev

{mlr3filters} adds feature selection filters to mlr3. The implemented filters can be used stand-alone, or as part of a machine learning pipeline in combination with mlr3pipelines and the filter operator.

Wrapper methods for feature selection are implemented in mlr3fselect. Learners which support the extraction of feature importance scores can be combined with a filter from this package for embedded feature selection.

r-cmd-check CRAN Status StackOverflow Mattermost

Installation

CRAN version

install.packages("mlr3filters")

Development version

remotes::install_github("mlr-org/mlr3filters")

Filters

Filter Example

set.seed(1)
library("mlr3")
library("mlr3filters")

task = tsk("sonar")
filter = flt("auc")
head(as.data.table(filter$calculate(task)))
##    feature     score
## 1:     V11 0.2811368
## 2:     V12 0.2429182
## 3:     V10 0.2327018
## 4:     V49 0.2312622
## 5:      V9 0.2308442
## 6:     V48 0.2062784
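The calculated scores can be used directly to subset a task. A minimal sketch (assuming the `$scores` field, which holds the named score vector in decreasing order):

```r
library(mlr3)
library(mlr3filters)

task = tsk("sonar")
filter = flt("auc")
filter$calculate(task)

# $scores is a named numeric vector, sorted in decreasing order
top = names(head(filter$scores, 5))
task$select(top)
task$feature_names
```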

Implemented Filters

| Name | Label | Task Types | Feature Types | Package |
|---|---|---|---|---|
| anova | ANOVA F-Test | Classif | Integer, Numeric | stats |
| auc | Area Under the ROC Curve Score | Classif | Integer, Numeric | mlr3measures |
| carscore | Correlation-Adjusted coRrelation Score | Regr | Logical, Integer, Numeric | care |
| carsurvscore | Correlation-Adjusted coRrelation Survival Score | Surv | Integer, Numeric | carSurv, mlr3proba |
| cmim | Minimal Conditional Mutual Information Maximization | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
| correlation | Correlation | Regr | Integer, Numeric | stats |
| disr | Double Input Symmetrical Relevance | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
| find_correlation | Correlation-based Score | Universal | Integer, Numeric | stats |
| importance | Importance Score | Universal | Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct | |
| information_gain | Information Gain | Classif & Regr | Integer, Numeric, Factor, Ordered | FSelectorRcpp |
| jmi | Joint Mutual Information | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
| jmim | Minimal Joint Mutual Information Maximization | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
| kruskal_test | Kruskal-Wallis Test | Classif | Integer, Numeric | stats |
| mim | Mutual Information Maximization | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
| mrmr | Minimum Redundancy Maximal Relevancy | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
| njmim | Minimal Normalised Joint Mutual Information Maximization | Classif & Regr | Integer, Numeric, Factor, Ordered | praznik |
| performance | Predictive Performance | Universal | Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct | |
| permutation | Permutation Score | Universal | Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct | |
| relief | RELIEF | Classif & Regr | Integer, Numeric, Factor, Ordered | FSelectorRcpp |
| selected_features | Embedded Feature Selection | Universal | Logical, Integer, Numeric, Character, Factor, Ordered, POSIXct | |
| univariate_cox | Univariate Cox Survival Score | Surv | Integer, Numeric, Logical | survival |
| variance | Variance | Universal | Integer, Numeric | stats |

Variable Importance Filters

The following learners allow the extraction of variable importance and therefore are supported by FilterImportance:

## [1] "classif.featureless" "classif.ranger"      "classif.rpart"      
## [4] "classif.xgboost"     "regr.featureless"    "regr.ranger"        
## [7] "regr.rpart"          "regr.xgboost"

If your learner is not listed here but is capable of extracting variable importance from the fitted model, the reason is most likely that it is not yet integrated into the mlr3learners package or the extra learner extension. Please open an issue so we can add your package.

Some learners need to have their variable importance measure “activated” during learner creation. For example, to use the “impurity” measure of Random Forest via the {ranger} package:

task = tsk("iris")
lrn = lrn("classif.ranger", importance = "impurity", seed = 42)

filter = flt("importance", learner = lrn)
filter$calculate(task)
head(as.data.table(filter), 3)
##         feature     score
## 1: Petal.Length 44.682462
## 2:  Petal.Width 43.113031
## 3: Sepal.Length  9.039099

Performance Filter

FilterPerformance is a univariate filter method which calls resample() for each predictor variable in the dataset separately and ranks the outcomes using the supplied measure. Any learner can be passed to this filter, with classif.rpart being the default. Regression learners can also be passed if the task is of type "regr".
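A minimal sketch of configuring this filter; the `learner`, `resampling`, and `measure` constructor arguments are assumed to follow the filter's documented interface:

```r
library(mlr3)
library(mlr3filters)

task = tsk("sonar")

# resample a single-feature model per predictor and rank features
# by the resulting performance estimate
filter = flt("performance",
  learner = lrn("classif.rpart"),
  resampling = rsmp("cv", folds = 3),
  measure = msr("classif.auc")
)
filter$calculate(task)
head(as.data.table(filter))
```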

Filter-based Feature Selection

In many cases filtering is only one step in the modeling pipeline. To select features based on filter values, one can use PipeOpFilter from mlr3pipelines.

library(mlr3pipelines)
task = tsk("spam")

# the `filter.frac` should be tuned
graph = po("filter", filter = flt("auc"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))

learner = as_learner(graph)
rr = resample(task, learner, rsmp("holdout"))
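To actually tune `filter.frac`, something along these lines should work with mlr3tuning. This is a sketch: the parameter id `auc.filter.frac` assumes the PipeOp inherits the filter's id and prefixes its parameters, and the exact `tune()` signature may differ across package versions.

```r
library(mlr3)
library(mlr3filters)
library(mlr3pipelines)
library(mlr3tuning)
library(paradox)

task = tsk("spam")
graph = po("filter", filter = flt("auc"), filter.frac = 0.5) %>>%
  po("learner", lrn("classif.rpart"))
learner = as_learner(graph)

# the PipeOp prefixes its parameters, hence "auc.filter.frac"
search_space = ps(auc.filter.frac = p_dbl(lower = 0.1, upper = 1))

instance = tune(
  tuner = tnr("grid_search", resolution = 5),
  task = task,
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  search_space = search_space
)
instance$result
```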

mlr3filters's People

Contributors

bblodfon, be-marc, bommert, github-actions[bot], jakob-r, larskotthoff, lorenzwalthert, mb706, mislavsag, mllg, pat-s, pre-commit-ci[bot], sebffischer, sumny


mlr3filters's Issues

Survival filters of MLR to be available in MLR3

Hello there,

I want to run a number of machine learning algorithms with different feature selection methods on survival data using the MLR3 package. For that, I am using the Benchmark() function of MLR3.

Unfortunately, filter feature selection methods of MLR3 do not support survival, yet. However, MLR package supports survival filters as shown on this page.

I would like to request for adding the survival filters of MLR to MLR3.

Thank you

Filter request: Feature-Selection Perceptron

"This embedded method is based on a perceptron, a type of artificial neural network that can be seen as the simplest kind of feedforward neural network, namely, a linear classifier. This method is based on training a perceptron in a supervised learning context. Interconnection weights are used to indicate the most relevant features and provide a ranking."

Seijo-Pardo, B., Porto-Díaz, I., Bolón-Canedo, V., & Alonso-Betanzos, A. (2017). Ensemble feature selection: Homogeneous and heterogeneous approaches. Knowledge-Based Systems, 118, 124–139. https://doi.org/10/f9qgrv

Mejia-Lavalle, M., Sucar, L., & Arroyo-Figueroa, G. (2006). Feature selection with a perceptron neural net. Proceedings of the International Workshop on Feature Selection for Data Mining, 131–135.

-> No idea if there is an R implementation

Uninformative error message when filtering empty tasks

Filtering a 0-row task

> tsk = mlr_tasks$get("iris")$filter(integer(0))
> mlr3featsel::FilterVariance$new()$calculate(tsk)
Error in mlr3featsel::FilterVariance$new()$calculate(tsk) : 
  Assertion on 'fv' failed: Contains missing values (element 1).

Filtering a 0-column task

> tsk = mlr_tasks$get("iris")$select(integer(0))
> mlr3featsel::FilterVariance$new()$calculate(tsk)
Error in bmerge(i, x, leftcols, rightcols, xo, roll, rollends, nomatch,  : 
  x.'id' is a character column being joined to i.'V1' which is type 'integer'. Character 

The first one should probably give a more informative error message (the filtering may happen as part of a longer pipeline where the user would not immediately know that an empty task was given). The second one should imho not even throw an error.

`FilterVariableImportance()` fails for `classif.ranger`

library(mlr3)
library(mlr3learners)
library(mlr3featsel)
lrn = mlr_learners$get("classif.ranger", param_vals = list(importance = "impurity"))
task = mlr_tasks$get("iris")
filter = FilterVariableImportance$new(learner = lrn)
filter$calculate(task)
#> INFO  [16:30:10.241] Training learner 'classif.ranger' on task 'iris' ...
#> Error: No importance stored

Created on 2019-06-04 by the reprex package (v0.3.0)

The reason is that

https://github.com/mlr-org/mlr3featsel/blob/fc91dbc3540c2e05dae859990fbade854a6d7b06/R/FilterVariableImportance.R#L40

overwrites the param_set slot of the learner, which holds the importance = "impurity" information needed during training.
Do we need this line?
After the clone (in l. 39), everything should already be present in the learner object.

Also, we need a test for this. The current one does not catch this bug since classif.rpart, which is used in the test, does not rely on external param_vals being set.

Initialization methods of filters have unused arguments

E.g., FilterAUC has:

initialize = function(id, packages,
      feature_types,
      task_type,
      settings = list(na.rm = TRUE)) {
      super$initialize(
        id = "FilterAUC",
        packages = "stats",
        feature_types = "numeric",
        task_type = "classif",
        settings = settings)
    }

Only settings is used, the other arguments are silently ignored.

I'd suggest this:

initialize = function(id = "FilterAUC", settings = list(na.rm = TRUE)) {
      super$initialize(
        id = assert_string(id),
        packages = "stats",
        feature_types = "numeric",
        task_type = "classif",
        settings = assert_list(settings, names = "unique"))
    }

Missing filter / featsel methods

Filters

Pkg

No pkg

  • AUC

  • generic permutation

  • univariate.model.score

stats

  • anova

  • kruskal

  • linear.correlation

  • rank.correlation

  • variance

FSelector

Do we want to have these filters in again? They are slow and come with Java problems.

FSelectorRcpp

  • information.gain
  • gain.ratio
  • symmetrical.uncertainty

Learner integrated filters

  • ranger.impurity

  • ranger.permutation

  • cforest.importance

Do we want to add the randomForest and randomForestSRC ones?

mRMRe

- [ ] mrmr -> slow and no support for classif tasks mlr-org/mlr#2604

praznik

  • CMIM

  • DISR

  • JMI

  • JMIM

  • MIM

  • MRMR

  • NJMIM

care

  • carscore

spFSR

Need to check.

Ensemble filters

  • Min

  • Mean

  • Median

  • Max

  • Borda

  • Borda-staircase

  • Borda-power

Feature importance calculation

The name for the function to calculate a ranking for feature importance is calculate() -- this is very non-descriptive and non-intuitive. I propose to rename to ranking().

Release mlr3filters 0.1.0

Prepare for release:

  • Check that description is informative
  • Check licensing of included files
  • usethis::use_cran_comments()
  • devtools::check()
  • devtools::check_win_devel()
  • rhub::check_for_cran()
  • Update cran-comments.md
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • usethis::use_github_release()
  • usethis::use_dev_version()
  • Update install instructions in README
  • Finish blog post
  • Tweet
  • Add link to blog post in pkgdown news menu

FilterEmbedded should export the parameter set of the underlying Learner

FilterEmbedded$new(learner = mlr_learners$get("classif.ranger"))$param_set

is

ParamSet:
       id    class lower upper levels     default value
1:   frac ParamDbl     0     1        <NoDefault>
2: cutoff ParamDbl  -Inf   Inf        <NoDefault>
3:  nfeat ParamInt     1   Inf        <NoDefault>

but could be

> FilterEmbedded$new(learner = mlr_learners$get("classif.ranger"))$param_set
ParamSet:
                              id    class lower upper                                       levels     default value
 1:                         frac ParamDbl     0     1                                              <NoDefault>      
 2:                       cutoff ParamDbl  -Inf   Inf                                              <NoDefault>      
 3:                        nfeat ParamInt     1   Inf                                              <NoDefault>      
 4:                    num.trees ParamInt     1   Inf                                                      500      
 5:                         mtry ParamInt     1   Inf                                              <NoDefault>      
 6:                   importance ParamFct    NA    NA none,impurity,impurity_corrected,permutation <NoDefault>      
 7:                 write.forest ParamLgl    NA    NA                                   TRUE,FALSE        TRUE      
[...]

It is best to do this via active bindings and ParamSetCollection; see how PipeOpLearnerCV does this for inspiration.

Cross-link mlr3pipelines

This part is heavily underdocumented. I would suggest adding mlr3pipelines to Suggests to be able to link to PipeOpFilter, and also adding examples to the R docs and README.

FilterAUC and missing values

FilterAUC operates on features with missing values by simply ranking the missing values last (the default in rank()). I'm not sure that this is statistically sound.

I'd suggest removing them and calculating the AUC on the remaining observations.

@berndbischl @pat-s ?
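A sketch of the proposed behavior; the helper name `auc_complete` is hypothetical, and `mlr3measures::auc()` is used to score a numeric feature against the class labels:

```r
library(mlr3measures)

# hypothetical helper: drop missing feature values before scoring
auc_complete = function(x, truth) {
  keep = !is.na(x)
  auc = mlr3measures::auc(
    truth = truth[keep], prob = x[keep], positive = levels(truth)[1]
  )
  # the filter ranks by distance from 0.5, i.e. discriminative power
  abs(0.5 - auc)
}
```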

Implementation of "wrapper" methods

Sticking closely to the implementation in mlr3tuning.

Methods

Ported from what we have in mlr.

  • Exhaustive search makeFeatSelControlExhaustive

  • Genetic algorithm makeFeatSelControlGA

  • Random search makeFeatSelControlRandom

  • Deterministic forward or backward search makeFeatSelControlSequential

cross-ref #2

Support partial scoring features

This is the k argument for praznik filters. If we had the filter hyperparameters abs, perc and thres available, we could do something smart with it.

Filter request: ReliefF

"This filter is an extension of the original Relief algorithm [31] that works by randomly sampling an instance from the dataset and then locating its nearest neighbor from the same and opposite class. The values of the nearest neighbor attributes are compared to the sampled instance and used to update relevance scores for each attribute. The rationale is that a useful attribute should differentiate between instances from different classes, and have the same value for instances from the same class. Compared to Relief, ReliefF is more robust, better handles multiclass problems and incomplete and noisy data, can be applied in all situations, has low bias, allows interaction among features, and may capture local dependencies which other methods miss."

Seijo-Pardo, B., Porto-Díaz, I., Bolón-Canedo, V., & Alonso-Betanzos, A. (2017). Ensemble feature selection: Homogeneous and heterogeneous approaches. Knowledge-Based Systems, 118, 124–139. https://doi.org/10/f9qgrv

-> This is available in FSelector::relief() but we would like to avoid FSelector due to its Java dep.

Structure of this pkg

Should/could follow mlr3tuning structure.

  • Two base classes (Filter and Featsel) or one class with respective functions?

Arguments:

  • id
  • settings (at least for Filter)
  • result -> returns the full Filter Values (needed for eventual caching) and the subsetted task in case of Featsel
  • [...] more?

For filters a subclass per pkg that inherits from the main class?
How should we do the structuring for featsel methods?

@ja-thomas

`FilterResult` or `Filter`?

Following the logic of class mlr3::BenchmarkResult.

With public member functions .$get_best(), .$combine(), etc.

Use paradox for Filter settings

If Filter objects had an associated ParamSet (and used its $values slot etc.), then the parameter values could more easily be tuned over. Having a ParamSet with good type and range information would also be informative for the user.

Merge "information gain" filters

FSelectorRcpp::information_gain() has three types

  • info.gain
  • gain.ratio
  • sym uncert

All three are highly similar. Since they are provided by one function anyway, I suggest grouping them under the name "information.gain" filter and simply exposing the type argument.
This is also the variant most used in the literature; the other two are rarely used.

In addition, this saves us some code lines :)
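For reference, the single-function interface in question, assuming the current FSelectorRcpp names for the `type` argument:

```r
library(FSelectorRcpp)

# one function, three variants selected via `type`
information_gain(Species ~ ., data = iris, type = "infogain")
information_gain(Species ~ ., data = iris, type = "gainratio")
information_gain(Species ~ ., data = iris, type = "symuncert")
```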

Filter calculate and 'n'

What happens if a filter does not support partial scoring?

  • at least this needs to be documented for the filters that allow this
  • apparently a filter simply ignores 'n' if it does not support it. Bad and error-prone?
  • maybe a property would be better here? and then a check of the argument depending on the property?

Caching

In mlr we use the memoise pkg for caching.
I saw that Tasks are cached in mlr3 already via this PR.

@mllg Would this concept also be preferable for the filters, or does it have any limitations? And should we rely on the memoise pkg for caching?

`.$filter_*()`: Modifies task without assignment - do we want that?

library(mlr3featsel)
task = mlr3::mlr_tasks$get("iris")
filter = FilterJMIM$new()
filter$calculate(task)
filter$filter_abs(task, 2)
task
#> <TaskClassif:iris> (150 x 3)
#> Target: Species
#> Features (2):
#> * dbl (2): Petal.Width, Sepal.Length

Created on 2019-06-07 by the reprex package (v0.3.0)

I do not like that the supplied task gets modified in place by .$filter_abs().
If the object is not reassigned explicitly, e.g. task = filter$filter_abs(task, 2), we should only print the task to the console but not modify it?

Are we doing this somewhere else? I would expect only fields of an R6 class to be modified in place.

@mllg

Filter request: 'insignificant' factor levels

Scarcely occurring factor levels can distort modeling results. A filter that removes or imputes 'insignificant' factor levels would help prevent this. Insignificance could be defined as the ratio between a level's occurrence count and the total number of samples.
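A sketch of what such a filter could do internally; the function name `collapse_rare` and the "other" level are made up for illustration:

```r
# collapse factor levels occurring in less than `ratio` of rows
# into a single "other" level
collapse_rare = function(x, ratio = 0.05) {
  freq = table(x) / length(x)
  rare = names(freq)[freq < ratio]
  levels(x)[levels(x) %in% rare] = "other"
  x
}

x = factor(c(rep("a", 95), rep("b", 3), rep("c", 2)))
table(collapse_rare(x))
```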

Rename `FilterVariableImportance`

This filter uses the embedded methods of ML algorithms.
This should somehow be reflected in the name.
The current one is too generic.

How about FilterEmbedded?

Other suggestions are welcome!

Is it possible to add ks value in mlr3filters?

Hello, can you consider adding the KS value to the binary classification/regression model as an evaluation standard? It is useful for many business scenarios; for example, when building a scorecard model, the main observed indicator is KS.

Filters and mlr3pipelines and filters before building graph

Hi,

It is difficult to find examples of how to use mlr3filters with mlr3pipelines, that is, how to incorporate feature filtering with other preprocessing and modelling steps.

I have several doubts:

  1. If I use mlr3 filters first and then build a graph with preprocessing elements and learners, is this cheating, because I use the whole dataset to select features and then use those same features in my nested CV later?
  2. If the answer to 1. is yes (it is cheating), what would be the right approach to filter (select) the most important features? Should it be part of the graph?
  3. Do you have any reference on the performance of selection and filter methods? Is feature selection through models much better than filters?
