[BUG] `CascadeForestRegressor` somehow cannot be inserted into a DataFrame

lamda-nju commented on May 27, 2024

[BUG] `CascadeForestRegressor` somehow cannot be inserted into a DataFrame

from deep-forest.

Comments (10)

xuyxu commented on May 27, 2024

Hi @IncubatorShokuhou, may I ask what the purpose of storing the model in a pandas DataFrame is?

IncubatorShokuhou commented on May 27, 2024

@xuyxu Actually, I am trying to integrate deep-forest into PyCaret. In theory, PyCaret supports any ML algorithm with a scikit-learn-compatible API. In practice, most models, including xgboost, lightgbm, catboost, ngboost, and Explainable Boosting Machine, can be integrated easily.

Here is the example code:

from pycaret.datasets import get_data
from pycaret.regression import *
from deepforest import CascadeForestRegressor
from ngboost import NGBRegressor

# load the Boston housing dataset
boston = get_data('boston')

# setup, data preprocessing
exp_name = setup(data=boston, target='medv', silent=True)

# establish regressors
ngr = NGBRegressor()
ngboost = create_model(ngr)

cfr = CascadeForestRegressor()
casforest = create_model(cfr)

# compare models
best_model = compare_models(include=[ngboost, casforest, "xgboost", "lightgbm"])

# save model
save_model(best_model, 'best_model')

During the integration, I ran into two errors:
1. Deep-Forest only accepts np.array input and rejects pd.DataFrame, which can easily be fixed by #86.
2. In line 2219 of https://github.com/pycaret/pycaret/blob/c76f4b7699474bd16a2e2a6d0f52759ae29898b6/pycaret/internal/tabular.py#L2219 , the model object is put into a pd.DataFrame, and the bug described above occurs, which seems quite weird to me.

I guess there might be something wrong with the initialization. I would appreciate any suggestions.

xuyxu commented on May 27, 2024

Thanks for your kind explanations! I will take a look at your PR first ;-)

IncubatorShokuhou commented on May 27, 2024

BTW, could you please tell me why a local implementation of RandomForestClassifier is used instead of sklearn.ensemble.RandomForestClassifier in line 50 of https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L50 ? Also, in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#91, is lgb = __import__("lightgbm.sklearn") simply equivalent to import lightgbm.sklearn as lgb?

xuyxu commented on May 27, 2024

why a local implementation of RandomForestClassifier instead of sklearn.ensemble.RandomForestClassifier is used

sklearn.ensemble.RandomForestClassifier is too slow when fitted on large datasets with millions of samples

lgb = __import__("lightgbm.sklearn")

We prefer to treat lightgbm as a soft dependency. If we put import lightgbm.sklearn as lgb at the top of the module, the program will raise an ImportError whenever lightgbm is not installed, which is not what we want.
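
As a rough illustration of the soft-dependency idea, here is a minimal sketch that uses only standard-library modules as stand-ins for lightgbm. The helper name load_optional and the placeholder package name are hypothetical and do not come from Deep-Forest. The sketch also shows one difference worth knowing: for a dotted name, __import__ returns the top-level package, while importlib.import_module (like import a.b as c) returns the submodule itself.

```python
import importlib


def load_optional(name):
    """Import `name` only when called; return None if it is not installed.

    Hypothetical helper for illustration, not part of Deep-Forest.
    """
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# For a dotted name, __import__ returns the TOP-LEVEL package ...
top = __import__("json.decoder")

# ... whereas import_module returns the submodule that was asked for.
sub = importlib.import_module("json.decoder")

# A package that is not installed yields None instead of failing at import time.
# The name below is a made-up placeholder.
missing = load_optional("a_package_that_is_not_installed_xyz")

print(top.__name__)
print(sub.__name__)
print(missing)
```

Because the import happens inside a function, a missing optional package only matters when the code path that needs it is actually taken.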

IncubatorShokuhou commented on May 27, 2024

I see. So maybe I can write a simple GPU version of the three models using cuML.ensemble.RandomForest and gpu_hist?

xuyxu commented on May 27, 2024

The performance would be much worse since Random Forest in cuML is not designed for the case where we want the forest to be as complex as possible (it does not support unlimited tree depth).

IncubatorShokuhou commented on May 27, 2024

OK, I see.

IncubatorShokuhou commented on May 27, 2024

@xuyxu I think I have figured out the reason for this error.
In https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#112 , pandas first checks whether the object has a __len__ method. If it does, pandas tries to convert this list-like object (here, CascadeForestRegressor()) into a 1-dimensional numpy array of object dtype via construct_1d_object_array_from_listlike in https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#L1970 .
So the error actually occurs in

import numpy as np
from deepforest import CascadeForestRegressor

result = np.empty(0, dtype="object")
result[:] = CascadeForestRegressor()

and when trying to put CascadeForestRegressor() into an empty np.array, __getitem__ in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#540 is called, and the error occurs.

Actually, the error can be reproduced more directly in another way:

# basic example
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from deepforest import CascadeForestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("\nTesting Accuracy: {:.3f} %".format(acc))

# now the model has 2 layers; iterate over it
for i,j in enumerate(model):
    print("i = ")
    print(i)
    print("j = ")
    print(j)
    print("ok")

Here is the output, followed by the error:

i = 
0
j = 
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
                           criterion='gini', layer_idx=0, n_estimators=4,
                           n_outputs=10, random_state=1)
ok
i = 
1
j = 
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
                           criterion='gini', layer_idx=1, n_estimators=4,
                           n_outputs=10, random_state=1)
ok
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-c9f04ba43562> in <module>
----> 1 for i,j in enumerate(model):
      2     print("i = ")
      3     print(i)
      4     print("j = ")
      5     print(j)

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in __getitem__(self, index)
    518 
    519     def __getitem__(self, index):
--> 520         return self._get_layer(index)
    521 
    522     def _get_n_output(self, y):

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in _get_layer(self, layer_idx)
    561             logger.debug("self.n_layers_ = "+ str(self.n_layers_))
    562             logger.debug("layer_idx = "+ str(layer_idx))
--> 563             raise ValueError(msg.format(self.n_layers_ - 1, layer_idx))
    564 
    565         layer_key = "layer_{}".format(layer_idx)

ValueError: The layer index should be in the range [0, 1], but got 2 instead.

Then I noticed that https://docs.python.org/zh-cn/3/reference/datamodel.html#object.__setitem__ says:

Note: for loops expect that an IndexError will be raised for illegal indexes to allow proper detection of the end of the sequence.

That's it! Deep-Forest raises a ValueError instead of an IndexError by mistake. After I changed it, everything works!
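
To make the mechanism concrete, here is a minimal sketch with a toy container rather than the real Deep-Forest classes. The class name Layers and its error_type parameter are made up for illustration. An object that defines __getitem__ but no __iter__ is iterated by calling __getitem__ with 0, 1, 2, ... until an IndexError is raised, so any other exception type escapes the loop as a real error.

```python
class Layers:
    """Toy container iterated through the __getitem__ protocol (illustrative only)."""

    def __init__(self, items, error_type):
        self._items = list(items)
        self._error_type = error_type  # exception raised for an out-of-range index

    def __len__(self):
        return len(self._items)

    def __getitem__(self, index):
        if not 0 <= index < len(self._items):
            msg = "The layer index should be in the range [0, {}], but got {} instead."
            raise self._error_type(msg.format(len(self._items) - 1, index))
        return self._items[index]


good = Layers(["layer_0", "layer_1"], IndexError)
bad = Layers(["layer_0", "layer_1"], ValueError)

# IndexError at index 2 is treated as the normal end of the sequence.
good_result = list(good)

# ValueError at index 2 is not swallowed, so iteration fails after two items.
try:
    list(bad)
    bad_outcome = "no error"
except ValueError as exc:
    bad_outcome = "ValueError: {}".format(exc)

print(good_result)
print(bad_outcome)
```

The second case mirrors the traceback above: both layers are visited, and the failure only appears when index 2 is requested.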

IncubatorShokuhou commented on May 27, 2024

I am going to create a PR and fix this error ASAP.
