Comments (11)

mehdidc commented on August 22, 2024

I would like to implement this. I think it should be optional (we would need to add, for instance, a boolean parameter compute_feature_importances=True/False).

I have seen that in the earth R package (http://www.milbo.org/doc/earth-notes.pdf) they have three criteria. During the pruning pass, we obtain, for each size of subset of basis functions, either the best one (in terms of RSS) or an estimate of the best one, depending on whether we do an exhaustive search or a greedy search (as implemented here and in the original paper). The three criteria are, if I understand correctly, the following:

  1. Having the best subset of basis functions for each size (there is one subset per size, determined by the pruning pass), they count the number of times each variable occurs in these subsets.

  2. They compute the decrease in RSS when a basis function is pruned from a subset, and for each variable they sum this decrease over all the subsets that include the variable.

  3. The same as 2), except that they use GCV instead of RSS.

So, would it be valuable to add these criteria to py-earth? If not, any other ideas?
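
For concreteness, here is a rough sketch of how I read the three criteria (illustration only, not py-earth internals). It assumes the pruning pass yields, for each subset size, the set of input variables used by the retained basis functions together with that subset's RSS and GCV, ordered from the smallest subset to the largest:

from collections import defaultdict

def variable_importances(pruning_trace):
    # pruning_trace: list of (variables, rss, gcv) tuples, one per subset size,
    # ordered from the smallest subset to the largest; `variables` is the set of
    # input variables used by the basis functions retained in that subset.
    nb_subsets = defaultdict(int)        # criterion 1: occurrence counts
    rss_importance = defaultdict(float)  # criterion 2: summed RSS decrease
    gcv_importance = defaultdict(float)  # criterion 3: summed GCV decrease
    prev_rss = prev_gcv = None
    for variables, rss, gcv in pruning_trace:
        # decrease relative to the previous (smaller) subset
        d_rss = 0.0 if prev_rss is None else prev_rss - rss
        d_gcv = 0.0 if prev_gcv is None else prev_gcv - gcv
        for v in variables:
            nb_subsets[v] += 1           # the variable occurs in this subset
            rss_importance[v] += d_rss
            gcv_importance[v] += d_gcv
        prev_rss, prev_gcv = rss, gcv
    return nb_subsets, rss_importance, gcv_importance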

jcrudy commented on August 22, 2024

Yes, this would be a great contribution. I'd suggest that instead of a parameter of the Earth constructor, it should be its own class or function. That way users wouldn't need to decide up front whether or not they want variable importance, but could decide after fitting the model. This may require storing some additional information in the pruning record, but not too much, I think. Methods other than evimp could be included in a similar way, with perhaps additional small changes to the Earth class. To be more clear, I'm thinking of something like this:

from pyearth import Earth
from pyearth.importance import Importance
from your_data import X, y
model = Earth().fit(X, y)
importance = Importance(model).fit(X, y)

This is just a general idea. You can maybe come up with better names and think about whether it should be a class or a function, what should be returned, whether training data need to be used (not for evimp, but perhaps for other methods?), etc.

mehdidc commented on August 22, 2024

I implemented the internal code in _pruning.pyx for the three criteria I talked about, but not yet the external API you suggest. Just to be sure I understand your idea about what is done here in "importance = Importance(model).fit(X, y)": what would an Importance class contain? Is importance a model which is exactly the same as model but which "filters" only the relevant features?

Here is a plot of feature importance with these three criteria on the friedman1 dataset (sklearn.datasets.make_friedman1, http://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_friedman1.html):
[plot: feature importance under the three criteria on friedman1]

jcrudy commented on August 22, 2024

That's great. For the API, I was trying to think about how additional importance measurements might later be added. There are some that would require training data as well as a fitted model. However, the result would still just be a vector of importance measurements. So, I think there is no reason it needs to be a class. And, since for now it's just the evimp method and needs no training data, it's probably best to keep things simple and just do:

from pyearth import Earth
from pyearth.importance import variable_importance
from your_data import X, y
model = Earth().fit(X, y)
importance = variable_importance(model)

You could either specify the criterion as an argument or come up with separate function names for each.
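
For instance, the criterion-as-argument variant might look like this (the criterion names here are placeholders, not a settled API):

gcv_importance = variable_importance(model, criterion='gcv')
rss_importance = variable_importance(model, criterion='rss')
subset_counts = variable_importance(model, criterion='nb_subsets')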

Things worth thinking about:

  1. What happens if the model hasn't been pruned?
  2. If there are multiple outputs, is variable importance calculated separately for each output to give an importance matrix?

agramfort commented on August 22, 2024

jcrudy commented on August 22, 2024

In this case it is definitely not costly to compute, and if that's the sklearn way then I guess we should do it that way. My only concern with that approach is that someone could fit a model without feature importance and later decide it is actually needed. In the case of this particular set of feature importance measures, we can just compute them every time because they're very cheap.

There is still the issue of exactly which feature importance measure to use: gcv, mse, or subsets. How would sklearn handle such an option? Ideally, all three types of importance measurement should be available after model fit. Perhaps it should be an n x 3 array? Is there anywhere sklearn uses feature_importances_ and expects them to be a single value per feature? Another option is to include an init parameter to make the choice (for sklearn compatibility), but also have methods or functions that can compute them later if needed.

@mehdidc, given what @agramfort said, I suggest the following:

  1. Have an init parameter, importance_type (feel free to change the name), that can be 'gcv', 'mse', or 'subsets'; by default it is 'gcv'. Have a feature_importances_ attribute on any fitted and pruned model. Docstrings should point out that pruning is required for the currently implemented importance measurements.
  2. Add an importance module with all three importance functions. Have a function, importance, that takes an Earth object and an importance_type argument and returns an n x 1 array of importance values. When Earth computes importance at the end of the pruning pass, it can call this function. If someone decides later to compute a different importance measurement, the function can still be used on any pruned Earth model (see the rough sketch after this list).
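
A minimal usage sketch of that proposal (importance_type and the module layout are placeholders, not the final py-earth API):

from pyearth import Earth
from pyearth.importance import importance
from your_data import X, y

# 1. computed at the end of the pruning pass, sklearn-style
model = Earth(importance_type='gcv').fit(X, y)
print(model.feature_importances_)   # one value per input feature

# 2. recomputed later with a different criterion on the already pruned model
mse_importance = importance(model, importance_type='mse')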

agramfort commented on August 22, 2024

> In this case it is definitely not costly to compute, and if that's the sklearn way then I guess we should do it that way. My only concern with that approach is that someone could fit a model without feature importance and later decide it is actually needed. In the case of this particular set of feature importance measures, we can just compute them every time because they're very cheap.

That's what sklearn does for random forest.

> There is still the issue of exactly which feature importance measure to use: gcv, mse, or subsets. How would sklearn handle such an option?

Not a problem: in sklearn there is only one implemented.

> Ideally, all three types of importance measurement should be available after model fit. Perhaps it should be an n x 3 array? Is there anywhere sklearn uses feature_importances_ and expects them to be a single value per feature?

Keep it one float per feature. Is there one method that is more standard? That would be a good default. If we want to support all of them, I would add a param in init that takes None (no feature importance computed) | "mse" | "gcv" ...

jcrudy commented on August 22, 2024

@mehdidc what is the status of this? I seem to remember you did some work on it already, but I can't recall how far you got with it. I'm marking it for 0.2, but if you think it should be 0.1, feel free to change it. Also, I'm assigning it to you.

mehdidc commented on August 22, 2024

@jcrudy I will be preparing a PR for this very soon. I think it is fine for 0.1, as it is orthogonal to the large changes you made to the forward pass; it only affects the pruning pass.

jcrudy commented on August 22, 2024

@mehdidc Awesome. Just changed it to 0.1.

jcrudy commented on August 22, 2024

Implemented by @mehdidc in commit 2d70700. Closing this issue.
