Comments (10)
Hi @IncubatorShokuhou, may I ask what the purpose of storing the model in a pandas DataFrame is?
from deep-forest.
@xuyxu Actually, I am trying to integrate deep-forest into PyCaret. In theory, PyCaret supports any ML algorithm with a scikit-learn-compatible API. In practice, most models, including xgboost, lightgbm, catboost, ngboost, and Explainable Boosting Machine, can be integrated easily.
Here is the example code:
from pycaret.datasets import get_data
boston = get_data('boston')
from pycaret.regression import *
from deepforest import CascadeForestRegressor
from ngboost import NGBRegressor
# setup, data preprocessing
exp_name = setup(data = boston, target = 'medv',silent = True)
# establish regressors
ngr = NGBRegressor()
ngboost = create_model(ngr)
cfr = CascadeForestRegressor()
casforest = create_model(cfr)
# compare models
best_model = compare_models(include=[ngboost,casforest,"xgboost","lightgbm"])
# save model
save_model(best_model, 'best_model')
During the integration, I ran into 2 errors:
1. Deep-Forest only accepts np.array and cannot take a pd.DataFrame as input, which could be easily fixed by #86 .
2. In line 2219 of https://github.com/pycaret/pycaret/blob/c76f4b7699474bd16a2e2a6d0f52759ae29898b6/pycaret/internal/tabular.py#L2219 , the model object is put into a pd.DataFrame, and the bug described above occurs, which seems quite strange to me.
I guess there might be something wrong with the initialization. I would appreciate any suggestions.
from deep-forest.
Thanks for your kind explanations! I will take a look at your PR first ;-)
from deep-forest.
BTW, could you please tell me why a local implementation of RandomForestClassifier instead of sklearn.ensemble.RandomForestClassifier is used in line 50 of https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L50 ? And in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#91, is lgb = __import__("lightgbm.sklearn") simply equal to import lightgbm.sklearn as lgb ?
from deep-forest.
why a local implementation of RandomForestClassifier instead of sklearn.ensemble.RandomForestClassifier is used
sklearn.ensemble.RandomForestClassifier is too slow when fitted on large datasets with millions of samples.
lgb = __import__("lightgbm.sklearn")
We prefer to treat lightgbm as a soft dependency. If we used import lightgbm.sklearn as lgb at the top of the module, the program would raise an ImportError whenever lightgbm is not installed, which is not what we want.
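A minimal sketch of the soft-dependency idea, using the standard library's os.path in place of lightgbm so that it runs anywhere. The helper name load_optional and the missing package name are hypothetical, and this is not the code Deep-Forest itself uses. It also shows one difference between the two spellings: __import__("a.b") returns the top-level package a, whereas import a.b as x (like importlib.import_module("a.b")) binds the submodule, so the two are not strictly equivalent.

```python
import importlib


def load_optional(name):
    """Return the imported module, or None if it is not installed.

    Hypothetical helper: the import happens only when this is called,
    so a missing package does not break importing the caller's module.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# __import__ with a dotted name returns the top-level package ...
top = __import__("os.path")
# ... while import_module returns the submodule itself.
sub = importlib.import_module("os.path")

print(top.__name__)      # name of the top-level package
print(sub is top.path)   # the submodule is reachable as an attribute

# A package that is not installed yields None instead of an ImportError.
missing = load_optional("surely_not_installed_pkg_xyz")
print(missing)
```

With this pattern the caller can check for None and raise a friendlier error only when the optional backend is actually requested.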
from deep-forest.
why a local implementation of RandomForestClassifier instead of sklearn.ensemble.RandomForestClassifier is used
sklearn.ensemble.RandomForestClassifier is too slow when fitted on large datasets with millions of samples
lgb = __import__("lightgbm.sklearn")
We prefer to treat lightgbm as a soft dependency. If we use import lightgbm.sklearn as lgb in the front, the program will raise an ImportError if lightgbm is not installed, which is not the case we want.
I see. So maybe I can write a simple GPU version of the three models using cuML.ensemble.RandomForest and gpu_hist?
from deep-forest.
The performance would be much worse since Random Forest in cuML is not designed for the case where we want the forest to be as complex as possible (it does not support unlimited tree depth).
from deep-forest.
The performance would be much worse since Random Forest in cuML is not designed for the case where we want the forest to be as complex as possible (it does not support unlimited tree depth).
OK, I see.
from deep-forest.
@xuyxu I think I have figured out the cause of this error.
In https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#112 , pandas will first check whether the object has a __len__ method. If it does, pandas will try to convert this list-like object (here, CascadeForestRegressor()) into a 1-dimensional numpy array of object dtype via construct_1d_object_array_from_listlike in https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#L1970 .
So the error actually occurs in
result = np.empty(0, dtype="object")
result[:] = CascadeForestRegressor()
When trying to put CascadeForestRegressor() into an empty np.array, __getitem__ in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#540 is called, and that is where the error is raised.
Actually, the error can be reproduced more clearly in another way:
# basic example
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from deepforest import CascadeForestClassifier
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("\nTesting Accuracy: {:.3f} %".format(acc))
# now the model has 2 layers. Iterate over it.
for i, j in enumerate(model):
    print("i = ")
    print(i)
    print("j = ")
    print(j)
    print("ok")
and here is the error:
i =
0
j =
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
criterion='gini', layer_idx=0, n_estimators=4,
n_outputs=10, random_state=1)
ok
i =
1
j =
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
criterion='gini', layer_idx=1, n_estimators=4,
n_outputs=10, random_state=1)
ok
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-29-c9f04ba43562> in <module>
----> 1 for i,j in enumerate(model):
2 print("i = ")
3 print(i)
4 print("j = ")
5 print(j)
/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in __getitem__(self, index)
518
519 def __getitem__(self, index):
--> 520 return self._get_layer(index)
521
522 def _get_n_output(self, y):
/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in _get_layer(self, layer_idx)
561 logger.debug("self.n_layers_ = "+ str(self.n_layers_))
562 logger.debug("layer_idx = "+ str(layer_idx))
--> 563 raise ValueError(msg.format(self.n_layers_ - 1, layer_idx))
564
565 layer_key = "layer_{}".format(layer_idx)
ValueError: The layer index should be in the range [0, 1], but got 2 instead.
Then I noticed that https://docs.python.org/zh-cn/3/reference/datamodel.html#object.__setitem__ states:
Note: for loops expect that an IndexError will be raised for illegal indexes to allow proper detection of the end of the sequence.
That's it! Deep-Forest raises a ValueError instead of an IndexError by mistake. After I changed it, everything works!
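The behaviour can be sketched without Deep-Forest at all. The Layers class below is a hypothetical stand-in, not the library's code: it only defines __len__ and __getitem__, so Python iterates it by calling __getitem__ with 0, 1, 2, ... until an IndexError is raised. Any other exception type is not treated as the end of the sequence and propagates to the caller.

```python
class Layers:
    """Hypothetical stand-in for a model exposing layers via __getitem__."""

    def __init__(self, n_layers, error_type):
        self.n_layers = n_layers
        self.error_type = error_type  # exception raised for a bad index

    def __len__(self):
        return self.n_layers

    def __getitem__(self, index):
        if not 0 <= index < self.n_layers:
            raise self.error_type(f"layer index {index} out of range")
        return f"layer_{index}"


# IndexError signals the end of the sequence, so iteration stops cleanly.
good = Layers(2, IndexError)
print(list(good))

# ValueError is not a stop signal: it escapes after the last valid layer.
bad = Layers(2, ValueError)
try:
    list(bad)
except ValueError as exc:
    print("iteration failed:", exc)
```

This matches the output above: both layers are yielded, and the failure only appears when index 2 is requested.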
from deep-forest.
I am going to create a PR and fix this error ASAP.
from deep-forest.