microsoft / hummingbird
Hummingbird compiles trained ML models into tensor computation for faster inference.
License: MIT License
Right now all of our tests use random data, which sometimes causes GitHub Actions to fail in unexpected ways.
We should use deterministic data in our tests rather than random data so that we get consistency across runs.
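A minimal sketch of one way to do this, assuming we keep random generation but pin the seeds (the helper name and seed value are illustrative, not project code):
import random
import numpy as np
import torch

def seed_everything(seed=0):
    # Pin every random source our tests touch so runs are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)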
For example, the lightgbm example notebook has:
from hummingbird.ml import convert
and then
hb_model = convert(model, 'pytorch')
This returns an error:
In [14]: hb_model = convert(model, 'pytorch')
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-14-5658f007aeed> in <module>
----> 1 hb_model = convert(model, 'pytorch')
TypeError: 'module' object is not callable
Should it instead be:
from hummingbird.ml.convert import convert_lightgbm as convert
? (The TypeError suggests that the name convert is resolving to the hummingbird.ml.convert submodule rather than to a callable.)
Right now, we are hardcoding 3 or 4 as low and 10 as high. We need to do more testing and tuning, and document our choices.
Also, there are a lot of inconsistencies and duplications with the max_depth parameter. We need to revisit that for all 4 trees.
It's not clear when we should take:
pytorch_model(torch.from_numpy(X))[0] # RandomForestClassifier
vs., say,
pytorch_model(torch.from_numpy(X))[1] # DecisionTreeClassifier
In the end, we want it to be something simple, like what we compare against:
model.predict(X)
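A minimal sketch of the kind of interface we are after (a hypothetical wrapper, not the project's API; the output_index parameter encodes the per-converter difference noted above):
import torch

class PredictWrapper:
    # Wraps a converted model so callers can write model.predict(X), as with sklearn.
    def __init__(self, pytorch_model, output_index=0):
        self.pytorch_model = pytorch_model
        self.output_index = output_index

    def predict(self, X):
        with torch.no_grad():
            out = self.pytorch_model(torch.from_numpy(X))
            return out[self.output_index].numpy()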
Right now, users must install LightGBM, XGBoost, etc. Can we have options in the setup.py file, or some other mechanism, so that users are not forced to install dependencies they won't be using?
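A minimal sketch of one way to do this with setuptools extras (package names and the extras key are illustrative, not the project's actual setup.py):
from setuptools import setup

setup(
    name="hummingbird-ml",
    install_requires=["numpy", "torch>=1.4.0"],  # core dependencies only
    extras_require={
        # Installed only via: pip install hummingbird-ml[extra]
        "extra": ["xgboost", "lightgbm"],
    },
)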
As discussed here, we can use base_prediction, base_score, or initial_prediction.
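For reference, a minimal sketch of where such a base value already lives in XGBoost's sklearn wrapper (the data and parameter values are illustrative):
import numpy as np
import xgboost as xgb

X, y = np.random.rand(20, 3), np.random.rand(20)
model = xgb.XGBRegressor(n_estimators=2, base_score=0.5).fit(X, y)
print(model.get_xgb_params()["base_score"])  # 0.5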
When installing hummingbird via pip, users often run into:
OSError: dlopen(lib_lightgbm.so, 6): Library not loaded
This is documented in our troubleshooting guide and is due to this fixed issue #1369 in LGBM, but a cleaner solution would be a better user experience.
Check the gist Colab notebook here: https://gist.github.com/ucalyptus/bcf09b711009b87e94f989cb13035909
I installed sklearn 0.21.3 beforehand and enabled the GPU.
Since we are re-using part of the skl2onnx code, let's see how many changes we have made and, where we overlap, remove our files and add an include instead.
We need to rename this to match the paper. Ex: Gemm
Doing pip install hummingbird-ml on Python 3.7 or Python 3.8, a user reported:
ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from hummingbird-ml) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from hummingbird-ml)
Looks like pytorch on PyPI is 1.0.2 and on the conda main channel it's 1.3.1.
Maybe linking to the installation page for pytorch would be useful.
In the current implementation, to convert a sklearn model we have something like:
convert_sklearn(model, initial_types=[('input', FloatTensorType([4, 3]))])
but we actually don't need the specification of input types (this is more of an onnx converter thing), so we can have something like:
convert_sklearn(model)
which is nice and short. The problem with this is that XGBoostRegressor models do not surface information on the number of input features (while XGBoostClassifier does). Then if we go with the above API, we will need a workaround for XGBoostRegressor. One possibility is to have the following, specifically for XGBoostRegressor models:
extra_config["n_features"] = 200
pytorch_model = convert_sklearn(model, extra_config=extra_config)
Another possibility is to pass some input data as for other converters:
pytorch_model = convert_sklearn(model, some_input_data)
One last possibility is to have a different API for each converter (Sklearn, LightGBM, and XGBoost, as ONNXMLTools does right now). Then for Sklearn we will have:
pytorch_model = convert_sklearn(model)
For LightGBM we will have:
pytorch_model = convert_lightgbm(model)
And for XGBoost we will have to either pass an extra param or the input data. For example:
pytorch_model = convert_xgboost(model, some_input_data)
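A minimal sketch (hypothetical helper, not project code) of how the XGBoostRegressor workaround could combine the two options above, preferring an explicit extra_config override and falling back to sample input:
import numpy as np

def infer_n_features(sample_input=None, extra_config=None):
    # An explicit override wins.
    if extra_config and "n_features" in extra_config:
        return extra_config["n_features"]
    # Otherwise derive the feature count from the sample input's shape.
    if sample_input is not None:
        return np.asarray(sample_input).shape[1]
    raise ValueError("XGBoostRegressor does not expose its number of input "
                     "features; pass extra_config={'n_features': ...} or sample input.")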
We have the original draft of the linear code uploaded in this branch. All credit here goes to @scnakandala, the original author and brains behind this.
It contains an unedited implementation of the following scikit-learn converters:
There is a single test file here that needs to be cleaned up and separated out.
Should we leave sklearn in? On one side, having sklearn will help first-time users; on the other side, everyone taking a dependency on hummingbird will have to install sklearn even if it is not used.
Open questions:
Right now, we are hardcoding 3 or 4 as low, and 10 as high. We need to do more testing and tuning and document our choices.
With multiclass datasets (such as covtype or iris), we sometimes get rounding errors:
import numpy as np
import torch
from hummingbird import convert_sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_covtype

# Train a random forest on a 15000-row slice of covtype.
X, y = fetch_covtype(return_X_y=True)
nrows = 15000
X = X[0:nrows]
y = y[0:nrows]
X_torch = torch.from_numpy(X).float()
model = RandomForestClassifier(n_estimators=10, max_depth=10)
model.fit(X, y)

# Convert with the gemm tree implementation, run on GPU, and compare
# the probabilities against sklearn's predict_proba.
pytorch_model = convert_sklearn(
    model,
    extra_config={"tree_implementation": "gemm"})
skl = model.predict_proba(X)
pytorch_model.to('cuda')
hum_gpu = pytorch_model(X_torch.to('cuda'))
np.testing.assert_allclose(skl, hum_gpu[1].data.to('cpu').numpy(), rtol=1e-6, atol=1e-6)
gives error:
AssertionError:
Not equal to tolerance rtol=1e-06, atol=1e-06
Mismatched elements: 332 / 105000 (0.316%)
Max absolute difference: 0.11943346
Max relative difference: 5.82971106
x: array([[0.121156, 0.200913, 0.008188, ..., 0.643138, 0.00637 , 0.020236],
[0.110779, 0.207474, 0.008188, ..., 0.646954, 0.00637 , 0.020236],
[0.1959 , 0.707151, 0.008188, ..., 0.050266, 0.00637 , 0.032125],...
y: array([[0.121156, 0.200913, 0.008188, ..., 0.643138, 0.00637 , 0.020236],
[0.110779, 0.207474, 0.008188, ..., 0.646954, 0.00637 , 0.020236],
[0.1959 , 0.707152, 0.008188, ..., 0.050266, 0.00637 , 0.032125],...
With small datasets, we often get a rounding error that we don't see with larger datasets.
Repro:
import numpy as np
import torch
from hummingbird import convert_sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Train a small random forest on iris.
data = load_iris()
X, y = data.data, data.target
X_torch = torch.from_numpy(X)
model = RandomForestClassifier(n_estimators=10)
model.fit(X, y)

# Convert with the perf_tree_trav implementation, run on GPU, and compare
# the probabilities against sklearn's predict_proba.
pytorch_model = convert_sklearn(
    model,
    extra_config={"tree_implementation": "perf_tree_trav"})
skl = model.predict_proba(X)
pytorch_model.to('cuda')
hum_gpu = pytorch_model(X_torch.to('cuda'))
np.testing.assert_allclose(skl, hum_gpu[1].data.to('cpu').numpy(), rtol=1e-06, atol=1e-06)
you get
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
verbose=verbose, header=header, equal_nan=equal_nan)
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=1e-06
Mismatched elements: 10 / 450 (2.22%)
Max absolute difference: 0.1
Max relative difference: 1.
x: array([[1. , 0. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],...
y: array([[1. , 0. , 0. ],
[1. , 0. , 0. ],
[1. , 0. , 0. ],...
The current dev container is derived from a super barebones image; not even less (the pager) is installed. I will be looking into finding a comfier base image, probably starting with the ones found here.
There are several instances in the (test) code where we do assert(... is not None), e.g., here. It is better to change these to assertIsNotNone.
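A minimal sketch of the change inside a unittest.TestCase method (class and variable names are illustrative):
import unittest

class TestConversion(unittest.TestCase):
    def test_model_is_created(self):
        model = object()  # stand-in for a converted model
        # Before: assert model is not None
        # After: the dedicated assertion gives a clearer failure message.
        self.assertIsNotNone(model)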
We need to upgrade the sklearn version (currently scikit-learn==0.21.3). To do this, we need to accommodate some API changes in the newer version.
E.g., Imputer is deprecated (it was in preprocessing); now do: from sklearn.impute import SimpleImputer
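A minimal sketch of the change, assuming a mean-imputation use of the old Imputer:
# Old (scikit-learn <= 0.21):
# from sklearn.preprocessing import Imputer
# imputer = Imputer(strategy="mean")

# New (scikit-learn >= 0.22):
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="mean")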
Apparently I am still getting a lot of warnings when running tests.
I tried to run the following on a fresh VS Code Spaces instance:
$ pip install -r requirements.txt
And got the following errors:
Could not build wheels for numpy, since package 'wheel' is not installed.
Could not build wheels for scikit-learn, since package 'wheel' is not installed.
Could not build wheels for torch, since package 'wheel' is not installed.
Could not build wheels for xgboost, since package 'wheel' is not installed.
Could not build wheels for lightgbm, since package 'wheel' is not installed.
Could not build wheels for onnxconverter-common, since package 'wheel' is not installed.
Could not build wheels for scipy, since package 'wheel' is not installed.
Could not build wheels for joblib, since package 'wheel' is not installed.
Could not build wheels for future, since package 'wheel' is not installed.
Could not build wheels for protobuf, since package 'wheel' is not installed.
Could not build wheels for onnx, since package 'wheel' is not installed.
Could not build wheels for six, since package 'wheel' is not installed.
Could not build wheels for setuptools, since package 'wheel' is not installed.
Could not build wheels for typing-extensions, since package 'wheel' is not installed.
A quick
$ pip install wheel
fixed that. Should wheel be in requirements.txt?
I am thinking about following something along these lines.
There is a bug in beam++ for RF with node size 1 somewhere related to tree_commons.py:481
At the moment, HB only works with float32; you must cast float64 to float32 for correct results (otherwise you will get an error with the gemm algorithm). We need to fix this.
E.g., in scikit-learn-random-forest-example.ipynb we must cast X as follows:
X = X[0:nrows].astype('|f4')
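For readability, an equivalent cast using the named dtype ('|f4' is the numpy dtype string for 4-byte floats; the data here is a stand-in):
import numpy as np

X = np.random.rand(100, 4)         # float64 by default
nrows = 50
X = X[0:nrows].astype(np.float32)  # same effect as .astype('|f4')
assert X.dtype == np.float32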
Regression for XGBoost is failing for 1.0.2. Apparently there were changes:
alpha is now an array/list
max_depth needs handling, because None does not work as max_depth in 0.90
When generating documentation using pdoc, we observe a few warnings:
/home/neumann/miniconda3/envs/bird/lib/python3.7/site-packages/pdoc/html_helpers.py:498: ReferenceWarning: Code reference `hummingbird.supported` in module "hummingbird.ml.convert" does not match any documented object.
linked = re.sub(r'[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\(\))?', handle_refname, code_span)
/home/neumann/miniconda3/envs/bird/lib/python3.7/site-packages/pdoc/html_helpers.py:498: ReferenceWarning: Code reference `hummingbird.supported_configurations` in module "hummingbird.ml.convert" does not match any documented object.
linked = re.sub(r'[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\(\))?', handle_refname, code_span)
hummingbird.supported should change to hummingbird.ml.supported for the documentation site to correctly link them.
hummingbird.supported_configurations shows the set of supported extra configurations.
supported_configurations is not found in the repo (within any submodule).
When using torch==1.5.0 (instead of the current torch==1.4.0), hummingbird sometimes gets stuck in an infinite loop in forward:
File "/opt/conda/envs/rapids/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
result = self.forward(*input, **kwargs)
File "/root/hummingbird/hummingbird/operator_converters/_tree_implementations.py", line 349, in forward
gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)
This is potentially due to the BC-breaking change "index_select scalar_check to retain dimensionality of input" (pytorch/pytorch#30790) in torch==1.5.0's index_select function. The change makes index_select return a 0-dimensional tensor iff the input is 0-dimensional.
However, after digging around a bit more in our code, I separated out
gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)
into:
gather_indices = torch.index_select(nodes, 0, prev_indices) # now gets stuck here
if gather_indices.shape == torch.Size([]):
gather_indices = gather_indices.view(-1)
gather_indices = gather_indices.view(-1, self.num_trees)
and found that the code is getting stuck on the index_select
itself rather than any problem with the changed return type.
So maybe there is some issue related to "optimize index_select performance on CPU with TensorIterator" (pytorch/pytorch#30598), which is also new in torch==1.5.
Apparently torch.argmax returns the last matching value, while np.argmax returns the first one. In general this is not a problem, except when we get the same probabilities for 2 or more classes.
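A minimal sketch of the tie case (numpy's first-index tie-breaking is documented numpy behavior; the torch result shown is as reported above for the torch version in use):
import numpy as np
import torch

probs = np.array([0.5, 0.5, 0.0])             # classes 0 and 1 tie
print(np.argmax(probs))                        # 0: first maximal index
print(torch.argmax(torch.from_numpy(probs)))   # reportedly 1 here: last maximal index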
This will hopefully get fixed once this gets closed.
Once it is fixed, we should add assertions over the label results as well; at this point such assertions won't be useful, since they will likely often fail.