
hummingbird's Issues

Randomization in tests

Right now all of our tests use random data. This sometimes causes GitHub Actions to fail in unexpected ways.

We should use deterministic data in our tests rather than random data so that we get consistent results across runs.
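
A minimal sketch of the fix, assuming the tests build their inputs with NumPy: seed the generator (so failures are at least reproducible), or better, replace the random arrays with fixed fixtures.

import numpy as np

# Option 1: seed the generator so the "random" inputs are reproducible.
rng = np.random.default_rng(seed=0)
X = rng.uniform(size=(100, 20))
y = rng.integers(low=0, high=2, size=100)

# Option 2 (fully deterministic): replace random arrays with fixed fixtures.
X_fixed = np.linspace(0.0, 1.0, num=100 * 20).reshape(100, 20)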

Notebooks don't work as-is

For example, the lightgbm example notebook has:

from hummingbird.ml import convert

and then

hb_model = convert(model, 'pytorch')

This raises an error:

In [14]: hb_model = convert(model, 'pytorch')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-5658f007aeed> in <module>
----> 1 hb_model = convert(model, 'pytorch')

TypeError: 'module' object is not callable

Should it instead be the following?

from hummingbird.ml.convert import convert_lightgbm as convert
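
For context, the TypeError means the name convert ended up bound to the hummingbird.ml.convert submodule rather than to a function; calling any module object fails the same way, as this standalone sketch shows:

import types

m = types.ModuleType("demo")  # a bare module object
m()  # TypeError: 'module' object is not callable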

params to determine tree type, and max_depth

Right now we hardcode 3 or 4 as the low threshold and 10 as the high one. We need to do more testing and tuning, and we need to document our choices.

Also, there are many inconsistencies and duplications around the max_depth parameter. We need to revisit it for all four tree implementations.
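
A minimal sketch of the kind of heuristic being discussed; the function name and the cutoffs (4 and 10, mirroring the hardcoded values above) are illustrative, not the actual implementation:

# Hypothetical heuristic for picking a tree implementation from max_depth.
# The cutoffs below are the hardcoded values mentioned above, and are
# exactly what needs tuning and documenting.
def choose_tree_implementation(max_depth):
    if max_depth is not None and max_depth <= 4:
        return "gemm"            # shallow trees vectorize well as matrix ops
    if max_depth is None or max_depth > 10:
        return "tree_trav"       # deep or unbounded trees
    return "perf_tree_trav"      # medium-depth trees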

consistency in returned types

It's not clear when we should use:

pytorch_model(torch.from_numpy(X))[0]  #  RandomForestClassifier

vs. say

pytorch_model(torch.from_numpy(X))[1]   # DecisionTreeClassifier

In the end, we want something as simple as what we compare against, e.g.:

model.predict(X)
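
A minimal sketch of one way to get there: hide the tuple indexing behind a sklearn-like wrapper. The class and the assumption that every converted model returns a (labels, probabilities) pair are hypothetical:

import torch

# Hypothetical wrapper giving converted models a sklearn-style interface.
# Assumes the underlying model returns (predicted_labels, class_probabilities).
class SklearnLikeModel:
    def __init__(self, pytorch_model):
        self.pytorch_model = pytorch_model

    def predict(self, X):
        labels, _ = self.pytorch_model(torch.from_numpy(X))
        return labels.data.cpu().numpy()

    def predict_proba(self, X):
        _, probabilities = self.pytorch_model(torch.from_numpy(X))
        return probabilities.data.cpu().numpy()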

Minimum set of dependencies

Right now, users must install LightGBM, XGBoost, etc. Can we add options in setup.py, or some other mechanism, so that users are not forced to install dependencies they won't be using?
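
A minimal sketch of the setup.py option, using setuptools extras (the package lists are illustrative):

# setup.py (sketch): keep only core packages in install_requires and move
# optional backends into extras_require, so "pip install hummingbird-ml"
# stays lean while "pip install hummingbird-ml[extra]" pulls everything.
from setuptools import find_packages, setup

setup(
    name="hummingbird-ml",
    packages=find_packages(),
    install_requires=["numpy", "torch>=1.4.0"],
    extras_require={
        "extra": ["xgboost", "lightgbm"],
    },
)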

pytorch problem with pip install on Python3.7 or Python3.8

A user running pip install hummingbird-ml on Python 3.7 or Python 3.8 reported:

ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from hummingbird-ml) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from hummingbird-ml)

Looks like the newest torch available on PyPI for this platform is 0.1.2.post2, while on the conda main channel it's 1.3.1.
Maybe linking to the PyTorch installation page would be useful.
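
For reference, the PyTorch installation page suggests installing from PyTorch's own wheel index; something like the following (the exact URL and version pin are from memory and may have changed) should sidestep the PyPI issue:

$ pip install torch==1.4.0 -f https://download.pytorch.org/whl/torch_stable.html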

Simplify convert_sklearn API

In its current implementation, converting a sklearn model looks something like:

convert_sklearn(model, initial_types=[('input', FloatTensorType([4, 3]))])

but we actually don't need the input-type specification (that is more of an ONNX converter thing). So we can have something like:

convert_sklearn(model)

which is nice and short. The problem with this is that XGBoostRegressor models do not surface the number of input features (while XGBoostClassifier does). So if we go with the above API, we will need a workaround for XGBoostRegressor. One possibility is to have the following, specifically for XGBoostRegressor models:

extra_config["n_features"] = 200
pytorch_model = convert_sklearn(model, extra_config=extra_config)

Another possibility is to pass some input data as for other converters:

pytorch_model = convert_sklearn(model, some_input_data)

One last possibility is to have a different API for each converter (Sklearn, LightGBM, and XGBoost, as ONNXMLTools does right now). Then for Sklearn we would have:

pytorch_model = convert_sklearn(model)

For LightGBM we would have:

pytorch_model = convert_lightgbm(model)

And for XGBoost we would have to pass either an extra param or the input data. For example:

pytorch_model = convert_xgboost(model, some_input_data)

Cleanup Linear Converter Code

We have the original draft of the linear code uploaded in this branch. All credit here goes to @scnakandala, the original author and brains behind this.

It contains an unedited implementation of the following scikit-learn converters:

There is a single test file here that needs to be cleaned up and separated out.

Remove not required dependencies

Should we keep sklearn? On one side, having sklearn will help first-time users; on the other side, everyone taking a dependency on hummingbird will have to install sklearn even if it is not used.

Automate document generation

  • We have been manually refreshing the documentation (generated using pdoc) until now.
  • However, this can be automated using GitHub Actions.

Open questions:

  • Do we want to update GitHub Pages, too? This may require us to monitor the quality of docstrings closely, which can be a deterrent to a first-time contributor.
  • What do we do with runtime warnings?
  • Is documentation generation a pre-push or post-push step? If it's post-push, what do we do if we get errors?

Multiclass rounding errors

With multiclass datasets (such as covtype or iris), we sometimes get rounding errors:

import numpy as np
import torch
from hummingbird import convert_sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_covtype

# Multiclass dataset; keep only the first 15000 rows to speed up the repro.
X, y = fetch_covtype(return_X_y=True)
nrows = 15000
X = X[0:nrows]
y = y[0:nrows]
X_torch = torch.from_numpy(X).float()

model = RandomForestClassifier(n_estimators=10, max_depth=10)
model.fit(X, y)

# Convert using the GEMM tree implementation.
pytorch_model = convert_sklearn(
    model,
    extra_config={"tree_implementation": "gemm"})

# Compare scikit-learn probabilities against the converted model on GPU.
skl = model.predict_proba(X)
pytorch_model.to('cuda')
hum_gpu = pytorch_model(X_torch.to('cuda'))

np.testing.assert_allclose(skl, hum_gpu[1].data.to('cpu').numpy(), rtol=1e-6, atol=1e-6)

gives this error:

AssertionError: 
Not equal to tolerance rtol=1e-06, atol=1e-06

Mismatched elements: 332 / 105000 (0.316%)
Max absolute difference: 0.11943346
Max relative difference: 5.82971106
 x: array([[0.121156, 0.200913, 0.008188, ..., 0.643138, 0.00637 , 0.020236],
       [0.110779, 0.207474, 0.008188, ..., 0.646954, 0.00637 , 0.020236],
       [0.1959  , 0.707151, 0.008188, ..., 0.050266, 0.00637 , 0.032125],...
 y: array([[0.121156, 0.200913, 0.008188, ..., 0.643138, 0.00637 , 0.020236],
       [0.110779, 0.207474, 0.008188, ..., 0.646954, 0.00637 , 0.020236],
       [0.1959  , 0.707152, 0.008188, ..., 0.050266, 0.00637 , 0.032125],...

Rounding in small datasets

With small datasets, we often get a rounding error that we don't see with larger datasets.

repro:

import numpy as np
import torch
from hummingbird import convert_sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Small multiclass dataset (150 rows).
data = load_iris()
X, y = data.data, data.target
X_torch = torch.from_numpy(X)

model = RandomForestClassifier(n_estimators=10)
model.fit(X, y)

# Convert using the perf_tree_trav tree implementation.
pytorch_model = convert_sklearn(
    model,
    extra_config={"tree_implementation": "perf_tree_trav"})

# Compare scikit-learn probabilities against the converted model on GPU.
skl = model.predict_proba(X)
pytorch_model.to('cuda')
hum_gpu = pytorch_model(X_torch.to('cuda'))

np.testing.assert_allclose(skl, hum_gpu[1].data.to('cpu').numpy(), rtol=1e-06, atol=1e-06)

you get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=1e-06

Mismatched elements: 10 / 450 (2.22%)
Max absolute difference: 0.1
Max relative difference: 1.
 x: array([[1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],...
 y: array([[1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],...

Add support for Visual Studio Codespaces

One way to resolve #60 is to add explicit support for dev containers to the Hummingbird repo. The documentation for that is here. Specifically, we could add a Dockerfile and settings for VS Code to use it. I am happy to give that a try if someone has a Dockerfile I can base that on.

Make the dev container more comfy

The current dev container is derived from a super barebones image. Not even `less` is installed. I will be looking into finding a more comfy base image, probably starting with the ones found here.

upgrade and test new sklearn version

We need to upgrade the sklearn version (currently scikit-learn==0.21.3). To do this, we need to accommodate some API changes in the newer version.

For example, Imputer (formerly in sklearn.preprocessing) is deprecated. Now, do: from sklearn.impute import SimpleImputer
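
A minimal sketch of the migration (the strategy parameter and the data are illustrative):

import numpy as np
# Old (deprecated/removed in newer scikit-learn):
#   from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
X = np.array([[1.0, np.nan], [3.0, 4.0]])
X_imputed = imputer.fit_transform(X)  # NaNs replaced by column means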

`wheel` missing from `requirements.txt`

I tried to run the following on a fresh VS Code Spaces instance:

$ pip install -r requirements.txt

And got the following errors:

Could not build wheels for numpy, since package 'wheel' is not installed.
Could not build wheels for scikit-learn, since package 'wheel' is not installed.
Could not build wheels for torch, since package 'wheel' is not installed.
Could not build wheels for xgboost, since package 'wheel' is not installed.
Could not build wheels for lightgbm, since package 'wheel' is not installed.
Could not build wheels for onnxconverter-common, since package 'wheel' is not installed.
Could not build wheels for scipy, since package 'wheel' is not installed.
Could not build wheels for joblib, since package 'wheel' is not installed.
Could not build wheels for future, since package 'wheel' is not installed.
Could not build wheels for protobuf, since package 'wheel' is not installed.
Could not build wheels for onnx, since package 'wheel' is not installed.
Could not build wheels for six, since package 'wheel' is not installed.
Could not build wheels for setuptools, since package 'wheel' is not installed.
Could not build wheels for typing-extensions, since package 'wheel' is not installed.

A quick

$ pip install wheel

fixed that. Should `wheel` be in requirements.txt?

RF beam++ single node

There is a bug in beam++ for RF with node size 1, somewhere around tree_commons.py:481.

float64 issue

At the moment, HB only works with float32. You must cast float64 inputs to float32 for correct results (with the gemm algorithm you will get an outright error). We need to fix this.

For example, in scikit-learn-random-forest-example.ipynb we must cast X as follows:

X = X[0:nrows].astype('|f4')
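
The same cast can be written with the NumPy dtype alias; '|f4' is just a 4-byte (float32) dtype string:

import numpy as np

X = np.random.rand(100, 4)    # float64 by default
X = X.astype(np.float32)      # equivalent to X.astype('|f4')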

Fix xgb converter for version > 0.90

XGBoost regression is failing for version 1.0.2.

Apparently there was:

  • a change in the API: alpha is now an array/list
  • a change in the model structure, because we are getting different results for regression
  • a change in max_depth, because None does not work as max_depth in 0.90

Minor documentation improvements

When generating documentation using pdoc, we observe a few warnings:

/home/neumann/miniconda3/envs/bird/lib/python3.7/site-packages/pdoc/html_helpers.py:498: ReferenceWarning: Code reference `hummingbird.supported` in module "hummingbird.ml.convert" does not match any documented object.
  linked = re.sub(r'[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\(\))?', handle_refname, code_span)
/home/neumann/miniconda3/envs/bird/lib/python3.7/site-packages/pdoc/html_helpers.py:498: ReferenceWarning: Code reference `hummingbird.supported_configurations` in module "hummingbird.ml.convert" does not match any documented object.
  linked = re.sub(r'[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\(\))?', handle_refname, code_span)
  1. We should change hummingbird.supported to hummingbird.ml.supported for the documentation site to correctly link them.
  2. Docstrings mention that hummingbird.supported_configurations shows the set of supported extra configurations.
    However, supported_configurations is not found in the repo (within any submodule).
    Is it not yet released?

Cannot use torch==1.5.0 due to breaking change

When using torch==1.5.0 (instead of the current torch==1.4.0), hummingbird sometimes gets stuck in an infinite loop in forward:

  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/hummingbird/hummingbird/operator_converters/_tree_implementations.py", line 349, in forward
    gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)

This is potentially due to the [BC-BREAKING] change "index_select scalar_check to retain dimensionality of input" (pytorch/pytorch#30790) to the index_select function in torch==1.5.0. The change makes index_select return a 0-dimensional tensor iff the input is 0-dimensional.

However, after digging around a bit more in our code, I separated out

 gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)

into:

gather_indices = torch.index_select(nodes, 0, prev_indices)  #  now gets stuck here
if gather_indices.shape == torch.Size([]):
    gather_indices = gather_indices.view(-1)
gather_indices = gather_indices.view(-1, self.num_trees)

and found that the code is getting stuck on the index_select itself rather than any problem with the changed return type.

So maybe there is some issue related to "optimize index_select performance on CPU with TensorIterator" (pytorch/pytorch#30598), which is also new in torch==1.5.

Tree classifiers return wrong labels while probabilities match

Apparently torch.argmax returns the last matching value, while np.argmax returns the first one. In general this is not a problem, except when we get the same probability for 2 or more classes.

This will hopefully get fixed once the linked PyTorch issue is closed.

Once this is fixed, we should add assertions over the label results as well. At this point such assertions won't be useful, since they would likely fail often.
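
A minimal demonstration of the mismatch (torch's choice among tied maxima is not guaranteed and may vary across versions):

import numpy as np
import torch

probs = np.array([0.5, 0.5])  # two classes tied on probability

print(np.argmax(probs))                        # 0: NumPy picks the first maximum
print(torch.argmax(torch.from_numpy(probs)))   # may print 1: torch can pick the last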
