Comments (10)

xuyxu commented on May 27, 2024

Hi @IncubatorShokuhou, may I ask what the purpose of storing the model in a pandas DataFrame is?

from deep-forest.

IncubatorShokuhou commented on May 27, 2024

@xuyxu Actually, I am trying to integrate deep-forest into PyCaret. In theory, PyCaret supports any ML algorithm with a scikit-learn-compatible API. In practice, most models, including xgboost, lightgbm, catboost, ngboost, and Explainable Boosting Machine, can be integrated easily.

Here is the example code:

from deepforest import CascadeForestRegressor
from ngboost import NGBRegressor
from pycaret.datasets import get_data
from pycaret.regression import compare_models, create_model, save_model, setup

# load the Boston housing data bundled with PyCaret
boston = get_data('boston')

# setup, data preprocessing
exp_name = setup(data=boston, target='medv', silent=True)

# establish regressors
ngr = NGBRegressor()
ngboost = create_model(ngr)

cfr = CascadeForestRegressor()
casforest = create_model(cfr)

# compare models
best_model = compare_models(include=[ngboost, casforest, "xgboost", "lightgbm"])

# save model
save_model(best_model, 'best_model')

During the integration, I ran into two errors:

1. Deep-Forest only accepts np.array inputs and cannot take a pd.DataFrame, which can easily be fixed by #86 .
2. In line 2219 of https://github.com/pycaret/pycaret/blob/c76f4b7699474bd16a2e2a6d0f52759ae29898b6/pycaret/internal/tabular.py#L2219 , the model object is put into a pd.DataFrame, and the bug described above occurs, which seems quite strange to me.

I guess there might be something wrong with the initialization. I would appreciate any suggestions.
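For readers hitting the first error (DataFrame inputs), here is a minimal sketch of a workaround on the caller's side. It is not the change made in #86, and the class name ArrayInputAdapter is hypothetical. It relies only on the fact that pandas DataFrame and Series objects expose a to_numpy() method. As written it would likely not be enough for PyCaret itself, since it does not implement the full scikit-learn estimator interface.

```python
class ArrayInputAdapter:
    """Hypothetical wrapper that converts DataFrame-like inputs before delegating."""

    def __init__(self, estimator):
        self.estimator = estimator

    @staticmethod
    def _as_array(data):
        # pandas DataFrame and Series expose to_numpy(); other inputs pass through.
        to_numpy = getattr(data, "to_numpy", None)
        return to_numpy() if callable(to_numpy) else data

    def fit(self, X, y):
        self.estimator.fit(self._as_array(X), self._as_array(y))
        return self

    def predict(self, X):
        return self.estimator.predict(self._as_array(X))
```

Under these assumptions, something like ArrayInputAdapter(CascadeForestRegressor()).fit(df_X, df_y) would pass plain arrays to the underlying model.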

xuyxu commented on May 27, 2024

Thanks for your kind explanations! I will take a look at your PR first ;-)

IncubatorShokuhou commented on May 27, 2024

BTW, could you please tell me why a local implementation of RandomForestClassifier is used instead of sklearn.ensemble.RandomForestClassifier in line 50 of https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#L50 ? Also, in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#91 , is lgb = __import__("lightgbm.sklearn") simply equivalent to import lightgbm.sklearn as lgb?

xuyxu commented on May 27, 2024

> why a local implementation of RandomForestClassifier instead of sklearn.ensemble.RandomForestClassifier is used

sklearn.ensemble.RandomForestClassifier is too slow when fitted on large datasets with millions of samples.

> lgb = __import__("lightgbm.sklearn")

We prefer to treat lightgbm as a soft dependency. If we put import lightgbm.sklearn as lgb at the top of the module, the program raises an ImportError whenever lightgbm is not installed, which is not what we want.
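As a side note on the equivalence question, here is a small sketch using only the standard library. The helper name load_optional is hypothetical and this is not Deep-Forest's code. It shows one common way to handle a soft dependency, and it shows that __import__ with a dotted name returns the top-level package rather than the submodule, so it is not strictly the same as import a.b as c.

```python
import importlib


def load_optional(module_name):
    """Return the named module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# __import__ with a dotted name returns the top-level package ...
top_level = __import__("os.path")
# ... while importlib.import_module returns the submodule itself.
submodule = importlib.import_module("os.path")

print(top_level.__name__)  # os
print(submodule.__name__)

# A package that is absent yields None instead of raising at import time.
missing = load_optional("a_package_that_is_not_installed")
print(missing)  # None
```

With this pattern, the ImportError is deferred until the optional feature is actually requested, and the caller can raise a clearer message at that point.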

IncubatorShokuhou commented on May 27, 2024

I see. So maybe I could write a simple GPU version of the three models using cuML.ensemble.RandomForest and gpu_hist?

xuyxu commented on May 27, 2024

The performance would be much worse since Random Forest in cuML is not designed for the case where we want the forest to be as complex as possible (it does not support unlimited tree depth).

IncubatorShokuhou commented on May 27, 2024

OK, I see.

IncubatorShokuhou commented on May 27, 2024

@xuyxu I think I have figured out the cause of this error.
In https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#112 , pandas first checks whether the object has a __len__ method. If it does, pandas tries to convert this list-like object (here, CascadeForestRegressor()) into a 1-dimensional numpy array of object dtype via construct_1d_object_array_from_listlike in https://github.com/pandas-dev/pandas/blob/64559124a4de977e1d5cd09e6d80fbb110d3a6ea/pandas/core/dtypes/cast.py#L1970 .
So the error actually occurs in:

result = np.empty(0, dtype="object")
result[:] = CascadeForestRegressor()

When numpy tries to put CascadeForestRegressor() into the empty np.array, __getitem__ in https://github.com/LAMDA-NJU/Deep-Forest/blob/master/deepforest/cascade.py#540 is called, and that is where the error is raised.

In fact, the error can be reproduced more directly in another way:

# basic example
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from deepforest import CascadeForestClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
model = CascadeForestClassifier(random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
acc = accuracy_score(y_test, y_pred) * 100
print("\nTesting Accuracy: {:.3f} %".format(acc))

# now the model has 2 layers; iterate over it.
for i,j in enumerate(model):
    print("i = ")
    print(i)
    print("j = ")
    print(j)
    print("ok")

Here is the output, followed by the error:

i = 
0
j = 
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
                           criterion='gini', layer_idx=0, n_estimators=4,
                           n_outputs=10, random_state=1)
ok
i = 
1
j = 
ClassificationCascadeLayer(buffer=<deepforest._io.Buffer object at 0x7efa7e4f5fd0>,
                           criterion='gini', layer_idx=1, n_estimators=4,
                           n_outputs=10, random_state=1)
ok
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-29-c9f04ba43562> in <module>
----> 1 for i,j in enumerate(model):
      2     print("i = ")
      3     print(i)
      4     print("j = ")
      5     print(j)

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in __getitem__(self, index)
    518 
    519     def __getitem__(self, index):
--> 520         return self._get_layer(index)
    521 
    522     def _get_n_output(self, y):

/mnt/hdd2/lvhao/miniconda3/envs/pycaret/lib/python3.7/site-packages/deepforest/cascade.py in _get_layer(self, layer_idx)
    561             logger.debug("self.n_layers_ = "+ str(self.n_layers_))
    562             logger.debug("layer_idx = "+ str(layer_idx))
--> 563             raise ValueError(msg.format(self.n_layers_ - 1, layer_idx))
    564 
    565         layer_key = "layer_{}".format(layer_idx)

ValueError: The layer index should be in the range [0, 1], but got 2 instead.

Then I noticed that the Python data model documentation ( https://docs.python.org/zh-cn/3/reference/datamodel.html#object.__setitem__ ) notes:

Note: for loops expect that an IndexError will be raised for illegal indexes to allow proper detection of the end of the sequence.

That's it! Deep-Forest raises a ValueError instead of an IndexError by mistake. After I changed it to IndexError, everything works!
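The behavior can be illustrated without Deep-Forest at all. The sketch below uses a toy container (the names Layers and collect are made up for illustration) that supports only len() and integer indexing, so a for loop falls back to calling __getitem__ with 0, 1, 2, ... until an IndexError is raised. Any other exception type escapes the loop.

```python
class Layers:
    """Toy container that supports len() and integer indexing only."""

    def __init__(self, n_layers, error_type):
        self.n_layers = n_layers
        self.error_type = error_type

    def __len__(self):
        return self.n_layers

    def __getitem__(self, index):
        if not 0 <= index < self.n_layers:
            raise self.error_type("layer index {} is out of range".format(index))
        return "layer_{}".format(index)


def collect(container):
    """Iterate with a for loop and report how the loop ended."""
    seen = []
    try:
        for item in container:
            seen.append(item)
    except ValueError:
        return seen, "ValueError escaped the loop"
    return seen, "loop ended normally"


# IndexError is treated as the end of the sequence.
print(collect(Layers(2, IndexError)))
# ValueError is not, so it propagates after the last valid item.
print(collect(Layers(2, ValueError)))
```

Both calls visit layer_0 and layer_1; only the IndexError variant lets the loop finish cleanly, which matches the traceback above where the failure appears at index 2.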

IncubatorShokuhou commented on May 27, 2024

I am going to create a PR and fix this error ASAP.
