
Comments (12)

CloseChoice commented on May 18, 2024

I do not see this as a bug. For the tree_path_dependent calculation method we need every leaf to be covered, and this precondition is not met in the example you showed.

Internally, lightgbm creates n_estimators * n_classes trees (see the lightgbm docs). In this case that is 100 * 10 = 1000 trees, and not all of their leaves are covered. So reducing the number of estimators solves the problem for me, e.g.:

model_mult = LGBMClassifier(**{'verbosity': -1, 'n_estimators': 10}).fit(data_mult.data, data_mult.target)
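
As a quick sanity check on that tree count (a minimal sketch, assuming the model_mult object above): lightgbm's Booster reports the total number of trees it holds.

# one tree is built per class per boosting round, so 10 estimators on a
# 10-class problem yield 10 * 10 = 100 trees
print(model_mult.booster_.num_trees())  # -> 100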

Edit: The problem I do see here, though, is that the error message is not really informative, and that even if one sets feature_perturbation='interventional' one does not get a warning about the missing background dataset. Will file a PR that fixes these issues.


NegatedObjectIdentity commented on May 18, 2024

@CloseChoice I am convinced that your previous comment is not right. Here are my arguments:

  1. Theoretical: tree_path_dependent does not require a background dataset at all, since it makes use of the tree paths (exactly as the name says) that were computed during training and, hence, were built with the training data. Since every leaf needs at least 1 data sample (otherwise the leaf is never created during training), the tree path is by definition always complete. This is even what your own documentation says:

The “tree_path_dependent” approach is to just follow the trees and use the number of training examples that went down each leaf to represent the background distribution. This approach does not require a background dataset, and so is used by default when no background dataset is provided.

  2. Practical: It does work if interactions is set to False. Hence, if your argumentation were true, it should not work there either, but it does.

  3. Practical 2: If I bypass the assert statement that throws the error with a dirty hack, by hard-coding it to True, it works as well. Please see #3187 where I explained that.

  4. Your argument with the lower number of estimators: This only demonstrates that the function that checks whether all leaves are covered by background data works, but this check should not be there in the first place. See argument 2, where it works if interactions is set to False, and argument 3, where a dirty bypass works as well.

Therefore my conclusion: this check of whether all leaves are covered by the background data should not be there at all if the background data is set to None (as done in my example), because then the tree_path_dependent option is used, which does not require a background dataset (see argument 1). Argument 2 shows that this is already the case when interactions is set to False. Only when interactions is set to True does this error occur, and it can be functionally bypassed (see argument 3). Finally, your argument (argument 4) does not prove that the error message is right; it only shows that the check works as intended. Please let me know if my argumentation makes sense to you. Thank you very much for your invaluable work!

Addendum: That tree_path_dependent does not require a background dataset is also mentioned in the original SHAP paper.


CloseChoice commented on May 18, 2024

I believe there is a central flaw in your argumentation. But let me explain this step by step. First of all, nobody said that you need a background dataset; you can extract the cover of each node directly from the model. Now, if a leaf is uncovered (cover = 0), a problem arises in the algorithm (see Algorithm 2 in the paper "Consistent Individualized Feature Attribution for Tree Ensembles"), where one divides by the cover. Obviously one cannot do that if the leaf is uncovered. The flaw in your example is that you explain the same data you trained on, so while explaining you never evaluate the (in training) uncovered leaves. I suspect you will run into an error when you evaluate examples that run through uncovered leaves (but I did not test that).
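
To make the concern concrete, here is a toy sketch (not shap's actual implementation): the recursion weights each branch by the fraction of the parent's cover that flows through it, which is undefined once a cover is zero.

# toy illustration only: tree-path-dependent SHAP weights each branch by
# child_cover / parent_cover
def branch_fractions(parent_cover, left_cover, right_cover):
    return left_cover / parent_cover, right_cover / parent_cover

print(branch_fractions(100.0, 60.0, 40.0))  # fine: (0.6, 0.4)
try:
    branch_fractions(0.0, 0.0, 0.0)  # an uncovered node
except ZeroDivisionError:
    print("undefined for uncovered nodes")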


NegatedObjectIdentity commented on May 18, 2024

Thank you @CloseChoice for your answer! I modified my example from above to have dedicated train and test sets, as you suggested in your comment, and as you can see it still works to compute SHAP explanations with the background dataset set to None and tree_path_dependent, also for data not in the training set (the test data). Only the one case where interactions is set to True in the multiclass case throws the error reported above again (which I argue should not be tested for). Therefore, I don't see a flaw in my argumentation. Furthermore, I want to point out that I use tree_path_dependent with background data=None to explain data not seen during training on a regular basis, and I never ran into issues. Here is the code:

from sklearn.datasets import load_diabetes
from sklearn.datasets import load_digits
from sklearn.datasets import load_breast_cancer
from lightgbm import LGBMRegressor
from lightgbm import LGBMClassifier
from shap.explainers import Tree as TreeExplainer

# data | regression | binary | multi class
data_reg = load_diabetes(as_frame=True)
data_bin = load_breast_cancer(as_frame=True)
data_mult = load_digits(as_frame=True)

# Train data
data_reg_trn = data_reg.data.iloc[:399, :]
data_bin_trn = data_bin.data.iloc[:499, :]
data_mult_trn = data_mult.data.iloc[:1749, :]

# Train target
target_reg_trn = data_reg.target.iloc[:399]
target_bin_trn = data_bin.target.iloc[:499]
target_mult_trn = data_mult.target.iloc[:1749]

# Test data
data_reg_tst = data_reg.data.iloc[400:, :]
data_bin_tst = data_bin.data.iloc[500:, :]
data_mult_tst = data_mult.data.iloc[1750:, :]

# Test target
target_reg_tst = data_reg.target.iloc[400:]
target_bin_tst = data_bin.target.iloc[500:]
target_mult_tst = data_mult.target.iloc[1750:]

# train models | regression | binary | multi class
model_reg = LGBMRegressor(**{'verbosity': -1,}).fit(data_reg_trn, target_reg_trn)
model_bin = LGBMClassifier(**{'verbosity': -1,}).fit(data_bin_trn, target_bin_trn)
model_mult = LGBMClassifier(**{'verbosity': -1,}).fit(data_mult_trn, target_mult_trn)

# Explainer
explainer_reg = TreeExplainer(
    model_reg,
    data=None,
    feature_perturbation='tree_path_dependent')
explainer_bin = TreeExplainer(
    model_bin,
    data=None,
    feature_perturbation='tree_path_dependent')
explainer_mult = TreeExplainer(
    model_mult,
    data=None,
    feature_perturbation='tree_path_dependent')

# Explanations
explanations_reg = explainer_reg(data_reg_tst, interactions=False)
explanations_reg_inter = explainer_reg(data_reg_tst, interactions=True)
explanations_bin = explainer_bin(data_bin_tst, interactions=False)
explanations_bin_inter = explainer_bin(data_bin_tst, interactions=True)
explanations_mult = explainer_mult(data_mult_tst, interactions=False)
explanations_mult_inter = explainer_mult(data_mult_tst, interactions=True)

I suspect that in your comment above you are mixing up two different things. The point is that there are (at least) two methods to compute SHAP values with the TreeExplainer. The difference lies in how the conditional expectations are computed, which we need in order to decide how to handle correlated (or otherwise dependent) input features (https://shap.readthedocs.io/en/latest/generated/shap.TreeExplainer.html).

First, interventional: this method relies on a background dataset to compute the conditional expectations. In this case one has to test that the background set covers all the leaves of the trees, otherwise we run into the issue that some test samples cannot be computed (as you mentioned in your post above).

However, there is a second method to compute the conditional expectations: the tree_path_dependent option. Here the conditional expectation is computed via the tree paths, hence via the tree structure that was built during training; the conditional expectations are therefore computed indirectly via the training set, and no background dataset is required (because it is implicitly given by the tree structure). Since during training there must be at least one data sample in each leaf, coverage is always guaranteed in this case. And please be aware that this is how it is done in every case where the background dataset is None AND feature_perturbation is tree_path_dependent, except for the one case of multiclass and interactions=True, where an additional assert is made (the multiclass case even works when interactions is set to False). That these are two different approaches is explained in the SHAP documentation (https://shap.readthedocs.io/en/latest/generated/shap.TreeExplainer.html) and also in the SHAP paper.
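
To make the distinction concrete, here is a minimal sketch reusing model_mult and data_mult_trn from the snippet above (the background sample size of 100 is an arbitrary choice):

# interventional: integrates features out against an explicit background dataset
explainer_int = TreeExplainer(
    model_mult,
    data=data_mult_trn.sample(100, random_state=0),
    feature_perturbation='interventional')

# tree_path_dependent: no background data; uses the per-node training counts
# stored in the model itself
explainer_tpd = TreeExplainer(
    model_mult,
    data=None,
    feature_perturbation='tree_path_dependent')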

Therefore, I am arguing again that in this case of multiclass AND interactions set to True AND background dataset None AND feature_perturbation tree_path_dependent, one should not test for coverage at all (as it is also not done when multiclass AND interactions set to False AND background dataset None AND feature_perturbation tree_path_dependent, and in all other cases). This is because the test is not required: the conditional expectations are computed via the tree structure and not via a background dataset. Please let me know what you think! Thank you again for your amazing work!

PS: Please also remember #3187, where I show that it works if one bypasses the error with a dirty hack.


NegatedObjectIdentity commented on May 18, 2024

BTW, if one comments out lines 346 to 357 of _tree.py, i.e. this code where the error is generated,

# if self.feature_perturbation == "tree_path_dependent":
#     if not self.model.fully_defined_weighting:
#         emsg = (
#             "The background dataset you provided does "
#             "not cover all the leaves in the model, "
#             "so TreeExplainer cannot run with the "
#             "feature_perturbation=\"tree_path_dependent\" option! "
#             "Try providing a larger background "
#             "dataset, no background dataset, or using "
#             "feature_perturbation=\"interventional\"."
#         )
#         raise ExplainerError(emsg)

then it works.


CloseChoice commented on May 18, 2024

Hmm, I will take a deeper look into this when I am back from vacation, but there are three things:

  1. It's not the case that during training (at least for lightgbm) each leaf is covered. That is what I explicitly checked. When you reduce the number of trees (as I've shown in my previous comment) all leaves are covered and there is no error.
  2. Your test still might just reach already-covered leaves; you would need to verify explicitly that your example reaches an uncovered leaf. Could you try to set the train size to 1% and explain the remaining 99%?
  3. I pointed you towards the paper and the algorithm that is used in tree_path_dependent. There we divide by the cover, so I do not see how this can be calculated otherwise. I suspect that there might be a difference between our implementation and the one lightgbm provides.

You raised a valid concern that all examples except for multiclass interactions are working. That is what I'll take a look at. One difference certainly is that for interactions=False we use lightgbm's SHAP calculation, but for interactions=True we use our own algorithm.
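
For reference, a minimal sketch of what "lightgbm's SHAP calculation" refers to here, reusing model_mult and data_mult_tst from the earlier snippet:

# lightgbm's built-in tree SHAP, used when interactions=False:
# pred_contrib=True returns one contribution per feature plus a bias term,
# stacked per class for multiclass models
contribs = model_mult.predict(data_mult_tst, pred_contrib=True)
print(contribs.shape)  # (n_samples, n_classes * (n_features + 1))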


NegatedObjectIdentity commented on May 18, 2024

@CloseChoice thank you very much again for responding, I really appreciate your effort!

  1. That is very interesting! Could you elaborate on that? In LightGBM's documentation (e.g. https://lightgbm.readthedocs.io/en/stable/pythonapi/lightgbm.LGBMRegressor.html), the parameter min_child_samples is described as (int, optional (default=20)) – Minimum number of data needed in a child (leaf). So as long as this parameter is >= 1, all leaves should be covered at all times; otherwise no split should occur. From my understanding of how decision trees are built, it is not possible to make a split if no sample lands in one of the resulting leaf nodes.
  2. OK, I tried it out: only 1% of the data for training and 99% for testing. It still works if I comment out lines 346 to 357 of _tree.py to bypass the coverage check (which I argue should not be there at all). Here is the code:
from sklearn.datasets import load_diabetes
from sklearn.datasets import load_digits
from sklearn.datasets import load_breast_cancer
from lightgbm import LGBMRegressor
from lightgbm import LGBMClassifier
from shap.explainers import Tree as TreeExplainer

# data | regression | binary | multi class
data_reg = load_diabetes(as_frame=True)  # 442 samples / 100 ~ 4
data_bin = load_breast_cancer(as_frame=True)  # 569 samples / 100 ~ 6
data_mult = load_digits(as_frame=True)  # 1797 samples / 100 ~ 18

# Train data
data_reg_trn = data_reg.data.iloc[:4, :]
data_bin_trn = data_bin.data.iloc[:6, :]
data_mult_trn = data_mult.data.iloc[:18, :]

# Train target
target_reg_trn = data_reg.target.iloc[:4]
target_bin_trn = data_bin.target.iloc[:6]
target_mult_trn = data_mult.target.iloc[:18]

# Test data
data_reg_tst = data_reg.data.iloc[5:, :]
data_bin_tst = data_bin.data.iloc[7:, :]
data_mult_tst = data_mult.data.iloc[19:, :]

# Test target
target_reg_tst = data_reg.target.iloc[5:]
target_bin_tst = data_bin.target.iloc[7:]
target_mult_tst = data_mult.target.iloc[19:]

# train models | regression | binary | multi class
model_reg = LGBMRegressor(**{'verbosity': -1,}).fit(data_reg_trn, target_reg_trn)
model_bin = LGBMClassifier(**{'verbosity': -1,}).fit(data_bin_trn, target_bin_trn)
model_mult = LGBMClassifier(**{'verbosity': -1,}).fit(data_mult_trn, target_mult_trn)

# Explainer
explainer_reg = TreeExplainer(
    model_reg,
    data=None,
    feature_perturbation='tree_path_dependent')
explainer_bin = TreeExplainer(
    model_bin,
    data=None,
    feature_perturbation='tree_path_dependent')
explainer_mult = TreeExplainer(
    model_mult,
    data=None,
    feature_perturbation='tree_path_dependent')

# Explanations
explanations_reg = explainer_reg(data_reg_tst, interactions=False)
explanations_reg_inter = explainer_reg(data_reg_tst, interactions=True)
explanations_bin = explainer_bin(data_bin_tst, interactions=False)
explanations_bin_inter = explainer_bin(data_bin_tst, interactions=True)
explanations_mult = explainer_mult(data_mult_tst, interactions=False)
explanations_mult_inter = explainer_mult(data_mult_tst, interactions=True)
  3. Right, but I argue that in lightgbm with min_child_samples >= 1 all leaves are covered during training and, hence, tree_path_dependent always works if used without background data but with the tree's structure.

Thank you for the information that interactions=False uses a different implementation, I was not aware of that! Have a nice vacation!


CloseChoice commented on May 18, 2024

@NegatedObjectIdentity thanks for responding and for the code examples. Your 99%/1% example is a strong indication that we can in fact remove the check. I will provide code examples concerning point 1 once I am back. We'll figure that out ;)


CloseChoice commented on May 18, 2024

So I have looked deeply into this, and while I cannot find an example where the code actually breaks if we remove the check, I would still not like to remove it. There is an indication that it could break, and I do not understand the code well enough to verify that it cannot break if we remove the test. I suspect that running into this error happens less often than the problems we would run into if we removed the check, so I would like to keep it, unless you can help verify that removing it does not cause problems.
Here are a couple of points I promised to give you feedback on:

  1. lightgbm actually gives us leaf nodes that are not covered. You can verify that by running the following code:
from sklearn.datasets import load_digits
from lightgbm import LGBMClassifier
from shap.explainers import Tree as TreeExplainer
import numpy as np

data_mult = load_digits(as_frame=True)

# Train data
data_mult_trn = data_mult.data.iloc[:, :]

target_mult_trn = data_mult.target.iloc[:]

data_mult_tst = data_mult.data.iloc[1750:, :]

target_mult_tst = data_mult.target.iloc[1750:]

model_mult = LGBMClassifier(**{'verbosity': -1,}).fit(data_mult_trn, target_mult_trn)

explainer_mult = TreeExplainer(
    model_mult,
    data=None,
    feature_perturbation='tree_path_dependent')

pred = model_mult.predict_proba(data_mult_tst, raw_score=True)
tree_idx_with_uncovered_leafs = [idx for idx, k in enumerate(explainer_mult.model.node_sample_weight) if np.sum(np.abs(k)) == 0]
# check this to find an uncovered leaf
explainer_mult.model.node_sample_weight[tree_idx_with_uncovered_leafs[0]]

explanations_mult_inter = explainer_mult(data_mult_tst, interactions=True)
assert np.allclose(explanations_mult_inter.base_values + explanations_mult_inter.sum((1, 2)).values, pred)
  2. We are dividing by the cover (see the algorithm referenced above), and if we set the cover manually to all zeros by adding the line
self.model.node_sample_weight = np.zeros_like(self.model.node_sample_weight)

then the assert above breaks. This is a rather artificial example, but I guess we are just not finding the right examples that actually run through the uncovered leaves.

So from my side this is a no-fix until we have more information.


NegatedObjectIdentity commented on May 18, 2024

@CloseChoice thank you again for your reply!

  1. Thank you very much for your code! I see a misconception here. A LightGBM node has several attributes: e.g. its value (the average value of the samples ending up in the node), its weight (e.g. down-weighting due to the learning rate, etc.), but also the number of samples ending up in the node. You are talking about node weights, and yes, those can become zero. The same is true for the node value. But I am talking about the number of samples in nodes, hence the coverage, and that cannot become zero. Here is a code example that illustrates this:
from sklearn.datasets import load_digits
from lightgbm import LGBMClassifier
import numpy as np

data_mult = load_digits(as_frame=True)

# Train data
data_mult_trn = data_mult.data.iloc[:, :]
target_mult_trn = data_mult.target.iloc[:]
data_mult_tst = data_mult.data.iloc[1750:, :]
target_mult_tst = data_mult.target.iloc[1750:]

model_mult = LGBMClassifier(**{'verbosity': -1,}).fit(data_mult_trn, target_mult_trn)

TreeNodes = model_mult.booster_.trees_to_dataframe()

Please see the TreeNodes dataframe. Some of the nodes have a value of 0 (which is OK, since it means the prediction of the node is zero) and some have a weight of 0 (which is OK as well, because the weight measures the importance of a node, and with a learning rate of 0.1 compounded over 100 iterations some nodes become very unimportant). However, the last column gives you the sample counts for all ~45000 nodes in the lightgbm model. None of them has a count below 20, since that is the min_child_samples parameter used during training. Hence, coverage is given for all nodes; there is not a single node without samples associated with it. I even argue that it is not possible to build a decision tree with zero samples in a node, especially as there is a hyperparameter that checks exactly for that (min_child_samples). A quick check is sketched after this list.

  2. It makes sense that dividing by zero does not work, but please see point 1: the coverage cannot become 0 when using the tree structure.
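
Here is a minimal check of the claim in point 1, reusing the TreeNodes dataframe from above (column names as produced by lightgbm's trees_to_dataframe):

# 'count' holds the number of training samples that reached each node; with
# min_child_samples=20 (the default) no node should fall below 20
print(TreeNodes[['value', 'weight', 'count']].describe())
assert (TreeNodes['count'] >= 20).all(), "found an undercovered node"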

Therefore, I argue again that the assert should not be there for the case of tree_path_dependent and background data = None.
(But to be clear, it should be there in every other case; hence, if one passes a background dataset, then coverage needs to be checked.) I also want to reiterate that this is exactly what the error message tells us:

ExplainerError: The background dataset you provided does not cover all the leaves in the model, so TreeExplainer cannot run with the feature_perturbation="tree_path_dependent" option! Try providing a larger background dataset, no background dataset, or using feature_perturbation="interventional".

Specifically, it says "no background dataset".


CloseChoice commented on May 18, 2024

If you can point me to the lightgbm docs where it says what you claim, I'll be convinced. But please understand that I have invested 5 hours into this and I do not really see an issue here. So, as mentioned, without further information I see this as a no-fix.


NegatedObjectIdentity commented on May 18, 2024

@CloseChoice @jameslamb answered the question about nodes with no coverage in microsoft/LightGBM#6388. He confirms that there are several checks in place to prevent nodes without samples.

I think we could solve this not by removing the check completely, but by adding "and self.data is not None", like:

if self.feature_perturbation == "tree_path_dependent" and self.data is not None:
    if not self.model.fully_defined_weighting:
        emsg = (
            "The background dataset you provided does "
            "not cover all the leaves in the model, "
            "so TreeExplainer cannot run with the "
            "feature_perturbation=\"tree_path_dependent\" option! "
            "Try providing a larger background "
            "dataset, no background dataset, or using "
            "feature_perturbation=\"interventional\"."
        )
        raise ExplainerError(emsg)

In my local copy this fix works. Happy to hear your opinion!
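
For concreteness, under this patch the previously failing multiclass call from the earlier snippets goes through (reusing model_mult and data_mult_tst):

explainer_mult = TreeExplainer(
    model_mult,
    data=None,
    feature_perturbation='tree_path_dependent')
# raised ExplainerError before the patch; with data=None the coverage check
# is now skipped
explanations_mult_inter = explainer_mult(data_mult_tst, interactions=True)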

