
hummingbird's Issues

Randomization in tests

Right now all of our tests use random data. This sometimes causes GitHub Actions to fail in unexpected ways.

We should use deterministic data in our tests rather than random data so that we get consistent results across runs.
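
A minimal sketch of the fix, assuming the tests build their inputs with NumPy: seed the generator (so failures are at least reproducible), or better, replace the random arrays with fixed fixtures.

import numpy as np

# Option 1: seed the generator so the "random" inputs are reproducible.
rng = np.random.default_rng(seed=0)
X = rng.uniform(size=(100, 20))
y = rng.integers(low=0, high=2, size=100)

# Option 2 (fully deterministic): replace random arrays with fixed fixtures.
X_fixed = np.linspace(0.0, 1.0, num=100 * 20).reshape(100, 20)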

Notebooks don't work as-is

For example, the lightgbm example notebook has:

from hummingbird.ml import convert

and then

hb_model = convert(model, 'pytorch')

This raises an error:

In [14]: hb_model = convert(model, 'pytorch')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-5658f007aeed> in <module>
----> 1 hb_model = convert(model, 'pytorch')

TypeError: 'module' object is not callable

Should it instead be the following?

from hummingbird.ml.convert import convert_lightgbm as convert
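
For context, the TypeError means the name convert ended up bound to the hummingbird.ml.convert submodule rather than to a function; calling any module object fails the same way, as this standalone sketch shows:

import types

m = types.ModuleType("demo")  # a bare module object
m()  # TypeError: 'module' object is not callable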

params to determine tree type, and max_depth

Right now we hardcode 3 or 4 as the low threshold and 10 as the high one. We need to do more testing and tuning, and we need to document our choices.

Also, there are many inconsistencies and duplications around the max_depth parameter. We need to revisit it for all four tree implementations.
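
A minimal sketch of the kind of heuristic being discussed; the function name and the cutoffs (4 and 10, mirroring the hardcoded values above) are illustrative, not the actual implementation:

# Hypothetical heuristic for picking a tree implementation from max_depth.
# The cutoffs below are the hardcoded values mentioned above, and are
# exactly what needs tuning and documenting.
def choose_tree_implementation(max_depth):
    if max_depth is not None and max_depth <= 4:
        return "gemm"            # shallow trees vectorize well as matrix ops
    if max_depth is None or max_depth > 10:
        return "tree_trav"       # deep or unbounded trees
    return "perf_tree_trav"      # medium-depth trees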

consistency in returned types

It's not clear when we should use:

pytorch_model(torch.from_numpy(X))[0]  #  RandomForestClassifier

vs. say

pytorch_model(torch.from_numpy(X))[1]   # DecisionTreeClassifier

In the end, we want something as simple as what we compare against, e.g.:

model.predict(X)
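
A minimal sketch of one way to get there: hide the tuple indexing behind a sklearn-like wrapper. The class and the assumption that every converted model returns a (labels, probabilities) pair are hypothetical:

import torch

# Hypothetical wrapper giving converted models a sklearn-style interface.
# Assumes the underlying model returns (predicted_labels, class_probabilities).
class SklearnLikeModel:
    def __init__(self, pytorch_model):
        self.pytorch_model = pytorch_model

    def predict(self, X):
        labels, _ = self.pytorch_model(torch.from_numpy(X))
        return labels.data.cpu().numpy()

    def predict_proba(self, X):
        _, probabilities = self.pytorch_model(torch.from_numpy(X))
        return probabilities.data.cpu().numpy()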

Minimum set of dependencies

Right now, users must install LightGBM, XGBoost, etc. Can we add options in setup.py, or some other mechanism, so that users are not forced to install dependencies they won't be using?
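
A minimal sketch of the setup.py option, using setuptools extras (the package lists are illustrative):

# setup.py (sketch): keep only core packages in install_requires and move
# optional backends into extras_require, so "pip install hummingbird-ml"
# stays lean while "pip install hummingbird-ml[extra]" pulls everything.
from setuptools import find_packages, setup

setup(
    name="hummingbird-ml",
    packages=find_packages(),
    install_requires=["numpy", "torch>=1.4.0"],
    extras_require={
        "extra": ["xgboost", "lightgbm"],
    },
)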

pytorch problem with pip install on Python3.7 or Python3.8

A user running pip install hummingbird-ml on Python 3.7 or Python 3.8 reported:

ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from hummingbird-ml) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from hummingbird-ml)

Looks like the newest torch available on PyPI for this platform is 0.1.2.post2, while on the conda main channel it's 1.3.1.
Maybe linking to the PyTorch installation page would be useful.
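
For reference, the PyTorch installation page suggests installing from PyTorch's own wheel index; something like the following (the exact URL and version pin are from memory and may have changed) should sidestep the PyPI issue:

$ pip install torch==1.4.0 -f https://download.pytorch.org/whl/torch_stable.html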

Simplify convert_sklearn API

In its current implementation, converting a sklearn model looks something like:

convert_sklearn(model, initial_types=[('input', FloatTensorType([4, 3]))])

but we actually don't need the input-type specification (that is more of an ONNX converter thing). So we can have something like:

convert_sklearn(model)

which is nice and short. The problem with this is that XGBoostRegressor models do not surface the number of input features (while XGBoostClassifier does). So if we go with the above API, we will need a workaround for XGBoostRegressor. One possibility is to have the following, specifically for XGBoostRegressor models:

extra_config["n_features"] = 200
pytorch_model = convert_sklearn(model, extra_config=extra_config)

Another possibility is to pass some input data as for other converters:

pytorch_model = convert_sklearn(model, some_input_data)

One last possibility is to have a different API for each converter (Sklearn, LightGBM, and XGBoost, as ONNXMLTools does right now). Then for Sklearn we would have:

pytorch_model = convert_sklearn(model)

For LightGBM we would have:

pytorch_model = convert_lightgbm(model)

And for XGBoost we would have to pass either an extra param or the input data. For example:

pytorch_model = convert_xgboost(model, some_input_data)

Cleanup Linear Converter Code

We have the original draft of the linear code uploaded in this branch. All credit here goes to @scnakandala, the original author and brains behind this.

It contains an unedited implementation of the following scikit-learn converters:

There is a single test file here that needs to be cleaned up and separated out.

Remove not required dependencies

Should we keep sklearn? On one side, having sklearn will help first-time users; on the other side, everyone taking a dependency on hummingbird will have to install sklearn even if it is not used.

Automate document generation

  • We have been manually refreshing the documentation (generated using pdoc) until now.
  • However, this can be automated using GitHub Actions.

Open questions:

  • Do we want to update GitHub Pages, too? This may require us to monitor the quality of docstrings closely, which can be a deterrent to a first-time contributor.
  • What do we do with runtime warnings?
  • Is documentation generation a pre-push or post-push step? If it's post-push, what do we do if we get errors?

Multiclass rounding errors

With multiclass datasets (such as covtype or iris), we sometimes get rounding errors:

import numpy as np
import torch
from hummingbird import convert_sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_covtype

# Multiclass dataset; keep only the first 15000 rows to speed up the repro.
X, y = fetch_covtype(return_X_y=True)
nrows = 15000
X = X[0:nrows]
y = y[0:nrows]
X_torch = torch.from_numpy(X).float()

model = RandomForestClassifier(n_estimators=10, max_depth=10)
model.fit(X, y)

# Convert using the GEMM tree implementation.
pytorch_model = convert_sklearn(
    model,
    extra_config={"tree_implementation": "gemm"})

# Compare scikit-learn probabilities against the converted model on GPU.
skl = model.predict_proba(X)
pytorch_model.to('cuda')
hum_gpu = pytorch_model(X_torch.to('cuda'))

np.testing.assert_allclose(skl, hum_gpu[1].data.to('cpu').numpy(), rtol=1e-6, atol=1e-6)

gives this error:

AssertionError: 
Not equal to tolerance rtol=1e-06, atol=1e-06

Mismatched elements: 332 / 105000 (0.316%)
Max absolute difference: 0.11943346
Max relative difference: 5.82971106
 x: array([[0.121156, 0.200913, 0.008188, ..., 0.643138, 0.00637 , 0.020236],
       [0.110779, 0.207474, 0.008188, ..., 0.646954, 0.00637 , 0.020236],
       [0.1959  , 0.707151, 0.008188, ..., 0.050266, 0.00637 , 0.032125],...
 y: array([[0.121156, 0.200913, 0.008188, ..., 0.643138, 0.00637 , 0.020236],
       [0.110779, 0.207474, 0.008188, ..., 0.646954, 0.00637 , 0.020236],
       [0.1959  , 0.707152, 0.008188, ..., 0.050266, 0.00637 , 0.032125],...

Rounding in small datasets

With small datasets, we often get a rounding error that we don't see with larger datasets.

repro:

import numpy as np
import torch
from hummingbird import convert_sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Small multiclass dataset (150 rows).
data = load_iris()
X, y = data.data, data.target
X_torch = torch.from_numpy(X)

model = RandomForestClassifier(n_estimators=10)
model.fit(X, y)

# Convert using the perf_tree_trav tree implementation.
pytorch_model = convert_sklearn(
    model,
    extra_config={"tree_implementation": "perf_tree_trav"})

# Compare scikit-learn probabilities against the converted model on GPU.
skl = model.predict_proba(X)
pytorch_model.to('cuda')
hum_gpu = pytorch_model(X_torch.to('cuda'))

np.testing.assert_allclose(skl, hum_gpu[1].data.to('cpu').numpy(), rtol=1e-06, atol=1e-06)

you get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=1e-06

Mismatched elements: 10 / 450 (2.22%)
Max absolute difference: 0.1
Max relative difference: 1.
 x: array([[1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],...
 y: array([[1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],...

Add support for Visual Studio Codespaces

One way to resolve #60 is to add explicit support for dev containers to the Hummingbird repo. The documentation for that is here. Specifically, we could add a Dockerfile and settings for VS Code to use it. I am happy to give that a try if someone has a Dockerfile I can base that on.

Make the dev container more comfy

The current dev container is derived from a super barebones image. Not even `less` is installed. I will be looking into finding a more comfy base image, probably starting with the ones found here.

upgrade and test new sklearn version

We need to upgrade the sklearn version (currently scikit-learn==0.21.3). To do this, we need to accommodate some API changes in the newer version.

For example, Imputer (formerly in sklearn.preprocessing) is deprecated. Now, do: from sklearn.impute import SimpleImputer
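
A minimal sketch of the migration (the strategy parameter and the data are illustrative):

import numpy as np
# Old (deprecated/removed in newer scikit-learn):
#   from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")
X = np.array([[1.0, np.nan], [3.0, 4.0]])
X_imputed = imputer.fit_transform(X)  # NaNs replaced by column means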

`wheel` missing from `requirements.txt`

I tried to run the following on a fresh VS Code Spaces instance:

$ pip install -r requirements.txt

And got the following errors:

Could not build wheels for numpy, since package 'wheel' is not installed.
Could not build wheels for scikit-learn, since package 'wheel' is not installed.
Could not build wheels for torch, since package 'wheel' is not installed.
Could not build wheels for xgboost, since package 'wheel' is not installed.
Could not build wheels for lightgbm, since package 'wheel' is not installed.
Could not build wheels for onnxconverter-common, since package 'wheel' is not installed.
Could not build wheels for scipy, since package 'wheel' is not installed.
Could not build wheels for joblib, since package 'wheel' is not installed.
Could not build wheels for future, since package 'wheel' is not installed.
Could not build wheels for protobuf, since package 'wheel' is not installed.
Could not build wheels for onnx, since package 'wheel' is not installed.
Could not build wheels for six, since package 'wheel' is not installed.
Could not build wheels for setuptools, since package 'wheel' is not installed.
Could not build wheels for typing-extensions, since package 'wheel' is not installed.

A quick

$ pip install wheel

fixed that. Should `wheel` be in requirements.txt?

RF beam++ single node

There is a bug in beam++ for RF with node size 1, somewhere around tree_commons.py:481.

float64 issue

At the moment, HB only works with float32. You must cast float64 inputs to float32 for correct results (with the gemm algorithm you will get an outright error). We need to fix this.

For example, in scikit-learn-random-forest-example.ipynb we must cast X as follows:

X = X[0:nrows].astype('|f4')
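
The same cast can be written with the NumPy dtype alias; '|f4' is just a 4-byte (float32) dtype string:

import numpy as np

X = np.random.rand(100, 4)    # float64 by default
X = X.astype(np.float32)      # equivalent to X.astype('|f4')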

Fix xgb converter for version > 0.90

XGBoost regression is failing for version 1.0.2.

Apparently there was:

  • a change in the API: alpha is now an array/list
  • a change in the model structure, because we are getting different results for regression
  • a change in max_depth, because None does not work as max_depth in 0.90

Minor documentation improvements

When generating documentation using pdoc, we observe a few warnings:

/home/neumann/miniconda3/envs/bird/lib/python3.7/site-packages/pdoc/html_helpers.py:498: ReferenceWarning: Code reference `hummingbird.supported` in module "hummingbird.ml.convert" does not match any documented object.
  linked = re.sub(r'[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\(\))?', handle_refname, code_span)
/home/neumann/miniconda3/envs/bird/lib/python3.7/site-packages/pdoc/html_helpers.py:498: ReferenceWarning: Code reference `hummingbird.supported_configurations` in module "hummingbird.ml.convert" does not match any documented object.
  linked = re.sub(r'[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\(\))?', handle_refname, code_span)
  1. We should change hummingbird.supported to hummingbird.ml.supported for the documentation site to correctly link them.
  2. Docstrings mention that hummingbird.supported_configurations shows the set of supported extra configurations.
    However, supported_configurations is not found in the repo (within any submodule).
    Is it not yet released?

Cannot use torch==1.5.0 due to breaking change

When using torch==1.5.0 (instead of the current torch==1.4.0), hummingbird sometimes gets stuck in an infinite loop in forward:

  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/hummingbird/hummingbird/operator_converters/_tree_implementations.py", line 349, in forward
    gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)

This is potentially due to the [BC-BREAKING] change "index_select scalar_check to retain dimensionality of input" (pytorch/pytorch#30790) to the index_select function in torch==1.5.0. The change makes index_select return a 0-dimensional tensor iff the input is 0-dimensional.

However, after digging around a bit more in our code, I separated out

 gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)

into:

gather_indices = torch.index_select(nodes, 0, prev_indices)  #  now gets stuck here
if gather_indices.shape == torch.Size([]):
    gather_indices = gather_indices.view(-1)
gather_indices = gather_indices.view(-1, self.num_trees)

and found that the code is getting stuck on the index_select itself rather than any problem with the changed return type.

So maybe there is some issue related to "optimize index_select performance on CPU with TensorIterator" (pytorch/pytorch#30598), which is also new in torch==1.5.

Tree classifiers return wrong labels while probabilities match

Apparently torch.argmax returns the last matching value, while np.argmax returns the first one. In general this is not a problem, except when we get the same probability for 2 or more classes.

This will hopefully get fixed once the linked PyTorch issue is closed.

Once this is fixed, we should add assertions over the label results as well. At this point such assertions won't be useful, since they would likely fail often.
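
A minimal demonstration of the mismatch (torch's choice among tied maxima is not guaranteed and may vary across versions):

import numpy as np
import torch

probs = np.array([0.5, 0.5])  # two classes tied on probability

print(np.argmax(probs))                        # 0: NumPy picks the first maximum
print(torch.argmax(torch.from_numpy(probs)))   # may print 1: torch can pick the last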
