cerlymarco / linear-tree
A python library to build Model Trees with Linear Models at the leaves.
License: MIT License
How can someone extract the coefficients of each linear model implemented in each leaf?
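One possible way to approach this, sketched under assumptions: the summary dump posted further down this page shows each leaf carrying a single fitted estimator under 'models', and scikit-learn linear models expose `coef_` and `intercept_`. The helper `leaf_coefficients` and the stand-in objects below are hypothetical; with a real fitted tree the dictionary would come from the library's summary output rather than being built by hand.

```python
from types import SimpleNamespace


def leaf_coefficients(summary):
    """Map each leaf id to the (coef_, intercept_) of its linear model.

    Assumes split nodes have a 'children' key and leaves do not,
    as in the summary dump shown on this page.
    """
    coefs = {}
    for node_id, info in summary.items():
        if "children" in info:
            continue  # split node: skip, only leaves make predictions
        model = info["models"]
        coefs[node_id] = (model.coef_, model.intercept_)
    return coefs


# Stand-in objects with made-up values, used only to make the sketch runnable.
left_model = SimpleNamespace(coef_=[1.0, 0.0], intercept_=0.1)
right_model = SimpleNamespace(coef_=[0.0, 2.0], intercept_=-0.3)
fake_summary = {
    0: {"col": 1, "th": 0.5, "children": (1, 2),
        "models": (left_model, right_model)},
    1: {"models": left_model},
    2: {"models": right_model},
}

for leaf_id, (coef, intercept) in leaf_coefficients(fake_summary).items():
    print(leaf_id, coef, intercept)
```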
Dear Marco,
Please consider comparing LinXGBoost with your linear-tree:
https://github.com/ldv1/LinXGBoost
https://arxiv.org/abs/1710.03634
thank you in advance
Will Linear Boosting work with categorical features?
Hello,
I have a dataframe with a column X >= 0. I passed its index in the split_features parameter of LinearTreeRegressor.
I set max_depth to 1 and used LinearRegression() as the base estimator.
When I count the number of samples at node_1, which should hold the samples with X <= the threshold reported at node_0, the count does not match my data for column X.
When I increase max_depth, some negative splitting thresholds appear, even though column X is >= 0 as stated above.
Do you normalize or scale the data in some way before training?
Thanks in advance!
Hi all, I am having a hard time finding out which method linear-tree uses to traverse the tree. Sometimes, when I plot the tree and compare the plot with the summary, the mapping makes no sense: a node that the summary lists as a left child is displayed as a right child in the plot, and vice versa.
You can compare the summary below with the plot and let me know if I am mistaken somewhere.
{0: {'col': 1,
'th': 0.0127,
'loss': 0.1937,
'samples': 160,
'children': (1, 2),
'models': (RidgeClassifier(), RidgeClassifier())},
1: {'col': 6,
'th': 0.1461,
'loss': 0.1,
'samples': 80,
'children': (3, 4),
'models': (RidgeClassifier(), RidgeClassifier())},
2: {'col': 0,
'th': 2.6051,
'loss': 0.05,
'samples': 80,
'children': (9, 10),
'models': (RidgeClassifier(), RidgeClassifier())},
4: {'col': 0,
'th': -0.0708,
'loss': 0.0364,
'samples': 55,
'children': (5, 6),
'models': (RidgeClassifier(), RidgeClassifier())},
6: {'col': 2,
'th': -0.7986,
'loss': 0.0,
'samples': 32,
'children': (7, 8),
'models': (RidgeClassifier(), RidgeClassifier())},
9: {'col': 2,
'th': -0.0865,
'loss': 0.0,
'samples': 59,
'children': (11, 12),
'models': (RidgeClassifier(), RidgeClassifier())},
3: {'loss': 0.08,
'samples': 25,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
5: {'loss': 0.0,
'samples': 23,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
7: {'loss': 0.0,
'samples': 16,
'models': RidgeClassifier(),
'classes': array([0, 1])},
8: {'loss': 0.0,
'samples': 16,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
11: {'loss': 0.0,
'samples': 32,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
12: {'loss': 0.0,
'samples': 27,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])},
10: {'loss': 0.0476,
'samples': 21,
'models': RidgeClassifier(),
'classes': array([0, 1, 2])}}
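To compare a dump like this with a plot, it can help to walk the dictionary explicitly. The sketch below copies only the 'col', 'th', 'samples' and 'children' entries from the summary above and assumes, without confirmation from the library, that the first entry of 'children' is the `<=` side of the threshold and the second is the `>` side. The helper names are hypothetical.

```python
# Structure copied from the summary above (models and losses omitted).
summary = {
    0: {"col": 1, "th": 0.0127, "samples": 160, "children": (1, 2)},
    1: {"col": 6, "th": 0.1461, "samples": 80, "children": (3, 4)},
    2: {"col": 0, "th": 2.6051, "samples": 80, "children": (9, 10)},
    4: {"col": 0, "th": -0.0708, "samples": 55, "children": (5, 6)},
    6: {"col": 2, "th": -0.7986, "samples": 32, "children": (7, 8)},
    9: {"col": 2, "th": -0.0865, "samples": 59, "children": (11, 12)},
    3: {"samples": 25}, 5: {"samples": 23}, 7: {"samples": 16},
    8: {"samples": 16}, 11: {"samples": 32}, 12: {"samples": 27},
    10: {"samples": 21},
}


def leaf_paths(summary, root=0):
    """Return {leaf_id: [conditions from the root down to that leaf]}."""
    paths = {}
    stack = [(root, [])]
    while stack:
        node_id, path = stack.pop()
        node = summary[node_id]
        if "children" not in node:
            paths[node_id] = path  # reached a leaf
            continue
        left, right = node["children"]
        # ASSUMPTION: first child is the "<=" branch, second is ">".
        stack.append((left, path + [f"X[{node['col']}] <= {node['th']}"]))
        stack.append((right, path + [f"X[{node['col']}] > {node['th']}"]))
    return paths


def sample_counts_consistent(summary):
    """Check that every split node's samples equal the sum of its children's."""
    for node in summary.values():
        if "children" in node:
            left, right = node["children"]
            total = summary[left]["samples"] + summary[right]["samples"]
            if total != node["samples"]:
                return False
    return True


print(sample_counts_consistent(summary))
for leaf_id, path in sorted(leaf_paths(summary).items()):
    print(leaf_id, " and ".join(path))
```

The sample-count check is independent of the left/right assumption, so it is a quick way to confirm that the parent-child mapping itself is being read correctly before looking at branch direction.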
Hey, I have been playing around a lot with your linear trees. Like them very much. Thanks!
Nevertheless, I am somewhat disappointed by the runtime performance. Compared to XGBoost regressors (I know it's not a fair comparison) or linear regressions (also not fair), the linear tree is really slow.
With 50k observations and 80 features: 2s for linear regression, 27s for XGBoost, and 300s for the linear tree.
Have you seen similar runtimes, or might I be using it wrong?
Another aspect that interests me is whether it is possible to limit the features that are used for splits. I haven't found this in the code. Any chance of seeing it in the future?
Hello,
Thank you for your nice tool. I am using LinearTreeRegressor to fit a continuous piecewise linear function. It works well. I am wondering whether it is possible to show the location (the coordinates) of the breakpoints?
thank you
Hello! Thank you for this useful package!
I think I might have found a potential bug in LinearForestClassifier.
I expected 'predict_proba' to use 'self.decision_function', as 'predict' does, so that it includes predictions from both estimators (base + forest). Is that a potential bug, or am I wrong here?
linear-tree/lineartree/lineartree.py
Line 1560 in 8d5beca
I noticed that the implementations of _parallel_binning_fit and _grow internally round loss values to 5 decimal places. This makes the regression results dependent on the scale of the labels, as data with a lower natural loss value will result in many different splits of the data having the same loss when rounded to 5 decimal places. Is there a reason why this is the case?
This behavior can be observed by fitting a LinearTreeRegressor using the default loss function and multiplying the scale of the labels by a small number (like 1e-9). This will result in the regressor no longer learning any splits.
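The effect described above can be reproduced without the library. The following is a simplified sketch, not the library's code: it uses a mean predictor in each child instead of a linear model, and the function names and the `ndigits` parameter are hypothetical. It only illustrates how rounding a squared-error loss to 5 decimal places makes distinct splits indistinguishable once the labels are scaled down.

```python
def mse(values):
    """Mean squared error of predicting the mean of `values`."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_loss(y, idx, scale=1.0, ndigits=5):
    """Weighted child loss for splitting y at position idx.

    `ndigits=None` skips the rounding step entirely.
    """
    left = [v * scale for v in y[:idx]]
    right = [v * scale for v in y[idx:]]
    n = len(y)
    loss = len(left) / n * mse(left) + len(right) / n * mse(right)
    return loss if ndigits is None else round(loss, ndigits)


y = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]

# Natural scale: the split at 3 is clearly better than the split at 2.
print(split_loss(y, 3), split_loss(y, 2))

# Labels multiplied by 1e-9: both losses round to the same value,
# so the two candidate splits can no longer be told apart.
print(split_loss(y, 3, scale=1e-9), split_loss(y, 2, scale=1e-9))

# Without rounding, the ordering is preserved at the small scale.
print(split_loss(y, 3, scale=1e-9, ndigits=None)
      < split_loss(y, 2, scale=1e-9, ndigits=None))
```

Because squared error scales with the square of the label scale, a factor of 1e-9 on the labels shrinks the loss by 1e-18, far below what 5 decimal places can resolve.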
Hi, I was going through each leaf node to see how the coefficients for each feature behave. While doing so, I realised that each node returns three arrays of coefficients, one value per feature in each array.
You can see above how one node behaves. I assume this is correct, but I am not able to understand it properly. Any insight would be appreciated.
I am using the LinearTreeClassifier and ran into an issue where it throws an error due to the split having a Gini Index of 0 and only a single class in the node. See below
When looking at the splits with a decision tree we see the following:
The colab notebook I used to create this issue is here: https://colab.research.google.com/drive/1NLWKZItwdRCmt6Dxmesqu75DLkXaJvs6?usp=sharing
Hi,
I am trying the usage-LinearBoost notebook on Colab.
In the third cell:
regr = LinearBoostRegressor(Ridge(), loss='linear')
regr.fit(X, y)
I have a problem:
`TypeError Traceback (most recent call last)
in ()
1 regr = LinearBoostRegressor(Ridge(), loss='linear')
----> 2 regr.fit(X, y)
1 frames
/usr/local/lib/python3.7/dist-packages/lineartree/_classes.py in _fit(self, X, y, sample_weight)
943 min_impurity_decrease=self.min_impurity_decrease,
944 min_impurity_split=self.min_impurity_split,
--> 945 ccp_alpha=self.ccp_alpha
946 )
947
TypeError: __init__() got an unexpected keyword argument 'min_impurity_split'`
I installed linear-tree with:
%pip install linear-tree
I have an example where it performs a split on a node with a loss of 0. Take a look at the example below. It performs a split on node 1 (where the loss = 0). This split does not add any value to the results, and the parent node (node 1) already gives perfect results.
Is this the intended behavior? Or should it not perform splits when the results are already perfect?
Hi, I really like your linear-tree library. I have been looking for something like this for a while and it perfectly fits my use case.
If I understand the LinearTreeRegressor correctly a node is split when the weighted loss of the child nodes is less than the loss of the parent node.
What I would like to do is to only split a node if the decrease in loss is over a certain threshold. Scikit-learn has something called min_impurity_decrease which could be used.
I implemented a small suggestion in a PR. So I would be happy to expand on this and improve it (e.g. input validation, maybe extend to classification), if you find that useful.
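The rule being proposed can be written down in a few lines. This is a minimal sketch of the idea, not the code from the PR and not scikit-learn's exact definition (scikit-learn additionally weights the decrease by the fraction of samples reaching the node). The function name and signature are hypothetical.

```python
def should_split(parent_loss, left_loss, right_loss, n_left, n_right,
                 min_impurity_decrease=0.0):
    """Accept a split only if it lowers the loss by at least the threshold.

    The child loss is the sample-weighted average of the two child losses.
    """
    n = n_left + n_right
    weighted_child_loss = (n_left * left_loss + n_right * right_loss) / n
    decrease = parent_loss - weighted_child_loss
    # Require a strictly positive decrease, so a parent that is
    # already perfect (loss 0) is never split.
    return decrease > 0 and decrease >= min_impurity_decrease


# Loss drops from 1.0 to 0.5: accepted with the default threshold...
print(should_split(1.0, 0.4, 0.6, 50, 50))
# ...but rejected when a decrease of at least 0.6 is required.
print(should_split(1.0, 0.4, 0.6, 50, 50, min_impurity_decrease=0.6))
# A node with zero loss cannot improve, so it is left alone.
print(should_split(0.0, 0.0, 0.0, 10, 10))
```

The strict `decrease > 0` check also covers the case raised in another issue on this page, where a node with a loss of 0 was still split.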
I am using your Linear Tree code in a research paper related to audio, and I would like to include a reference to your work.
Is there a specific way to reference your work in the bibliography of the paper?
Hi, I am wondering how to perform a GridSearchCV to find the best parameters for both the tree and the regression model?
For now I am able to tune the tree component of my model:
`
param_grid={
'n_estimators': [50, 100, 500, 700],
'max_depth': [10, 20, 30, 50],
'min_samples_split' : [2, 4, 8, 16, 32],
'max_features' : ['sqrt', 'log2', None]
}
cv = RepeatedKFold(n_repeats=3,
n_splits=3,
random_state=1)
model = GridSearchCV(
LinearForestRegressor(ElasticNet(random_state = 0), random_state=42),
param_grid=param_grid,
n_jobs=-1,
cv=cv,
scoring='neg_root_mean_squared_error'
)
`
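One way to also tune the regression component relies on scikit-learn's nested-parameter convention, where `<component>__<parameter>` addresses a parameter of a nested estimator. The grid below is a sketch under the assumption that the linear model is exposed through a constructor argument named `base_estimator`, as in the snippets elsewhere on this page; `alpha` and `l1_ratio` are ElasticNet parameters, and the candidate values are arbitrary.

```python
# Grid covering both the forest and the nested linear model.
param_grid = {
    # Parameters of the forest itself.
    "n_estimators": [50, 100],
    "max_depth": [10, 20],
    # Parameters of the nested ElasticNet, addressed with a double underscore.
    # ASSUMPTION: the nested estimator is stored under `base_estimator`.
    "base_estimator__alpha": [0.01, 0.1, 1.0],
    "base_estimator__l1_ratio": [0.2, 0.5, 0.8],
}

# Number of parameter combinations a grid search would evaluate per CV fold.
n_candidates = 1
for values in param_grid.values():
    n_candidates *= len(values)
print(n_candidates)
```

Passing this dictionary as `param_grid` to the GridSearchCV call above would search over both groups at once; the candidate count grows multiplicatively, so the grid is kept small here.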
Hi all, each time I fit a regression model (mostly Linear Tree regression) on any dataset, the loss at each node is always 0 for some reason. Is such behaviour normal?
Currently, the precision of the threshold is hard-coded as 5 here. Making this customizable would allow the linear tree model to be used when the numbers involved are generally smaller. My suggestion is to add another parameter that defaults to 5; people who want to use the model with smaller numbers could then set this parameter to the precision they need.
Let me know if this change sounds good and I can create a PR for it. I'm open to discussion.
Hi
Thanks for writing this great package!
When I try to display the decision tree with graphviz, I get this error:
AttributeError: 'LinearTreeRegressor' object has no attribute 'n_features_'
from lineartree import LinearTreeRegressor
from sklearn.linear_model import LinearRegression
reg = LinearTreeRegressor(base_estimator=LinearRegression())
reg.fit(train[x_cols], train["y"])
from graphviz import Source
from sklearn import tree
graph = Source( tree.export_graphviz(reg, out_file=None,feature_names=train.columns))
Hi @cerlymarco, thanks for developing this method into a good library.
I'm thinking that in some cases, maybe most cases, we need a maximum slope for each regressor. The main idea is to prevent over-optimistic extrapolation in the prediction output.
If slope > max slope, then split that node into new nodes.
Hi, is there a way to set the learning_rate in the boosting regressors and classifiers?
EDIT:
Also, does LinearBoostRegressor fit a linear regression first and then boost the residual via regression trees, or does it boost via a series of linear regression trees?
It may be helpful to other users if the following relevant literature is cited in the README:
Hi There!
I am very interested in the linear-tree package and find it inspiring for my research. However, when using LinearForestRegressor in my study, I found that its base estimator produced biased coefficients (with absolute values that are too small), so the prediction was essentially fitted by the forest estimator alone. As a result, the linear forest behaves very much like a random forest regressor. I suspect this is due to round-off error in the source code function self._validate_data, where the dtype "float32" is used.
I generated a synthetic dataset to compare scikit-learn's LinearRegression model with LinearForestRegressor. By the way, how should we deal with features that span multiple orders of magnitude? Will the base_estimator parameter support an sklearn Pipeline, to allow preprocessing such as StandardScaler, in a future release?
Thank you for your excellent work!
import numpy as np
from lineartree import LinearForestRegressor
from sklearn.linear_model import LinearRegression
SEED = 1234
# Generate a synthetic dataset
X1 = np.random.randn(1000, 1) * 1 + 10
X2 = np.random.randn(1000, 1) * 1e7 + 3e7
X3 = np.random.randn(1000, 1) * 100 + 200
X4 = np.random.randn(1000, 1) + 500
X5 = np.random.randn(1000, 1) + 1000
X6 = np.random.randn(1000, 1)
X7 = np.random.randn(1000, 1)
X8 = np.random.rand(1000, 1)
X = np.concatenate([X1, X2, X3, X4, X5, X6, X7, X8], axis=1)
y = X1 + np.sin(X2 * X6) + (X3 / 1e6) ** 2 + X4 / 1e3 + X2 / 1e7 + \
X7 * X8 + np.random.randn(1000, 1) * 0.1
y = np.log(y)
# Fit a linear regression model
lr = LinearRegression()
lr.fit(X, y)
lr_coef = lr.coef_
print(lr_coef)
# this will give [[ 7.49327164e-02 7.59350553e-09 -5.17630150e-06 -1.67616079e-05
# -1.73796325e-03 3.13294480e-04 4.07092831e-02 -7.15923013e-03]]
# Fit a linear forest model
lf = LinearForestRegressor(base_estimator=LinearRegression(),
n_estimators=100, max_depth=5,
max_features=1.0, random_state=SEED)
lf.fit(X, y)
lf_coef = lf.coef_
print(lf_coef)
# this will give [ 1.3074668e-09 7.2390938e-09 -2.1693744e-05 9.1071959e-09
# -6.6003052e-09 -7.7589535e-09 7.1229582e-09 5.3837756e-09]
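As a side illustration of the round-off concern raised in this issue, the sketch below shows how much resolution single precision has at the magnitude of the X2 feature (around 3e7). It does not confirm that the float32 cast is the cause of the small coefficients; it only demonstrates the numeric effect, using the standard library's struct module to round-trip a value through a 4-byte float. The helper name `to_float32` is hypothetical.

```python
import struct


def to_float32(x):
    """Round-trip a Python float through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Values of order 1e1 survive the cast unchanged when exactly representable.
print(to_float32(10.0) == 10.0)

# Around 3e7, consecutive float32 values are more than 1 apart,
# so a change of 1 in the feature can be lost entirely.
big = 30_000_001.0
print(to_float32(big) == big)
print(big - to_float32(big))
```

Standardizing features before fitting, as the question about StandardScaler suggests, keeps values in a range where single precision resolves them far more finely.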
Got this error when trying to follow examples to train model
It seems that any two tree models in a forest can be trained in parallel. Is there a way to pass n_jobs=-1 as a parameter, or to wrap the whole fit in a with block that uses joblib multiprocessing with n_jobs=-1?
Is it possible to replace the linear fit with an SGD fit for large-scale data? Should we (in terms of speed and model equivalence)?
Also, is it possible to use the GPU to solve each linear fit (either the traditional way or with gradient-based optimizers)?
I think this type of model, applied to tabular data, can have traceable error sensitivity (because the derivatives, i.e. the linear slopes, are known and the jumps are finite). One thing to try is to use these models on a wide range of biostatistics tabular datasets (some of them are very small (<2k obs, <50 vars), but have good local correlations and need good interpretability). So I am planning to use it at scale.
Hi, let's say you're considering to split a node along a certain float-valued dimension. How do you choose the candidate split values (that is, the value to which you compare the column values to decide if they end up in the right or left subtree)?
To choose among the candidates, you compare the error values - but how do you choose the candidates themselves?
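One common way to generate candidates, which the quantile call mentioned in the DeprecationWarning report on this page hints at, is to take quantiles of the column rather than every observed value. The sketch below illustrates that general technique with the standard library; it is not the library's code, and the function name and the `max_bins` parameter as used here are illustrative assumptions.

```python
import statistics


def candidate_thresholds(values, max_bins=10):
    """Return sorted, de-duplicated quantile cut points of a column.

    Each cut point is a candidate threshold: rows with a value <= the cut
    go to one child and the rest go to the other.
    """
    if len(set(values)) < 2:
        return []  # a constant column offers no useful split
    # `max_bins` bins are separated by `max_bins - 1` cut points.
    cuts = statistics.quantiles(values, n=max_bins, method="inclusive")
    return sorted(set(cuts))


column = [float(v) for v in range(1, 101)]
cuts = candidate_thresholds(column, max_bins=5)
print(cuts)

# Each candidate would then be scored by fitting a model on each side
# and comparing the weighted losses; the best-scoring candidate wins.
```

Using quantiles bounds the number of model fits per feature by the number of bins instead of the number of distinct values, which matters because every candidate requires fitting two linear models.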
I have been using your library for quite a while and am super happy with it. So first, thanks a lot!
Lately, I used my framework (which also uses your library) on a modern many-core server with many jobs. It worked fine. Now I have updated everything via pip, and with 8 jobs on my MacBook I get the following error.
This error does not occur when using only a single job (I pass the number of jobs to n_jobs).
I cannot nail down the actual problem, but since it occurred right after the upgrade, I assume the upgrade might be the reason?
Am I doing something wrong here?
"""
Traceback (most recent call last):
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 436, in _process_worker
r = call_item()
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/externals/loky/process_executor.py", line 288, in __call__
return self.fn(*self.args, **self.kwargs)
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 595, in __call__
return self.func(*args, **kwargs)
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 263, in __call__
for func, args, kwargs in self.items]
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 263, in <listcomp>
for func, args, kwargs in self.items]
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 56, in __call__
with config_context(**self.config):
File "/Users/martin/opt/anaconda3/lib/python3.7/contextlib.py", line 239, in helper
return _GeneratorContextManager(func, args, kwds)
File "/Users/martin/opt/anaconda3/lib/python3.7/contextlib.py", line 82, in __init__
self.gen = func(*args, **kwds)
TypeError: config_context() got an unexpected keyword argument 'target_offload'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "compression_selection_pipeline.py", line 41, in <module>
model_pipeline.learn_runtime_models(calibration_result_dir)
File "/Users/martin/Programming/compression_selection_v3/hyrise_calibration/model_pipeline.py", line 670, in learn_runtime_models
non_splitting_models("table_scan", table_scans)
File "/Users/martin/Programming/compression_selection_v3/hyrise_calibration/model_pipeline.py", line 590, in non_splitting_models
fitted_model = model_dict["model"].fit(X_train, y_train)
File "/Users/martin/Programming/compression_selection_v3/hyrise_calibration/model_pipeline.py", line 209, in fit
return self.regression.fit(X, y)
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/lineartree.py", line 187, in fit
self._fit(X, y, sample_weight)
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 576, in _fit
self._grow(X, y, sample_weight)
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 387, in _grow
loss=loss)
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/lineartree/_classes.py", line 285, in _split
for feat in split_feat)
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 1056, in __call__
self.retrieve()
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 935, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/Users/martin/opt/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
return future.result(timeout=timeout)
File "/Users/martin/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 435, in result
return self.__get_result()
File "/Users/martin/opt/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
TypeError: config_context() got an unexpected keyword argument 'target_offload'
PS: I have already left a star. :D
Hi,
I really like your work on Linear Trees. I would like to ask whether there are any papers that describe the split procedure precisely, in the form of equations.
Thank you in advance,
George Moiragias
Thanks for the good library.
When using LinearTreeRegressor, I think that max_depth is often optimized by cross-validation.
This library allows max_depth in the range 1-20. However, depending on the dataset, simple linear regression may be the most suitable model. Even for such a dataset, max_depth is forced to be 1 or more, so simple linear regression cannot be selected with LinearTreeRegressor.
My suggestion is to change the behaviour so that, when "max_depth = 0", the regression is performed with base_estimator alone.
With this change, LinearTreeRegressor can flexibly respond to both segmented regression and simple regression by changing hyperparameters.
/lineartree/_classes.py:338: DeprecationWarning: the interpolation= argument to quantile was renamed to method=, which has additional options.
Users of the modes 'nearest', 'lower', 'higher', or 'midpoint' are encouraged to review the method they used. (Deprecated NumPy 1.22)
Seems like a quick update here would get this warning to stop showing up, right? I can always ignore it, but figured I would mention it in case it is actually an error on my side.
Also, sorry, I don't actually know what the best open source etiquette is. If I'm supposed to create a pull request with a proposed fix instead of just mentioning it, feel free to correct me.
Hello there!
This is a great package that I just found. I'm still experimenting with it, but it's working nicely.
I was trying to use categorical text features, but it seems the package only accepts numerical attributes and bins them internally to get the categories. Am I doing something wrong?
I’d love to give this project 5 stars.
Thanks!