
hummingbird's Introduction

Hummingbird



Introduction

Hummingbird is a library for compiling trained traditional ML models into tensor computations. Hummingbird allows users to seamlessly leverage neural network frameworks (such as PyTorch) to accelerate traditional ML models. Thanks to Hummingbird, users can benefit from: (1) all the current and future optimizations implemented in neural network frameworks; (2) native hardware acceleration; (3) a single platform that supports both traditional and neural network models; and (4) all of this without having to re-engineer their models.

Currently, you can use Hummingbird to convert your trained traditional ML models into PyTorch, TorchScript, ONNX, and TVM. Hummingbird supports a variety of ML models and featurizers, including scikit-learn Decision Trees and Random Forests, as well as LightGBM and XGBoost Classifiers/Regressors. Support for other neural network backends and models is on our roadmap.

Hummingbird also provides a convenient uniform "inference" API following the Sklearn API. This allows swapping Sklearn models with Hummingbird-generated ones without having to change the inference code. By converting the models to PyTorch and TorchScript it also becomes possible to serve them using TorchServe.

How Hummingbird Works

Hummingbird works by reconfiguring algorithmic operators so that we can perform more regular computations that are amenable to vectorized and GPU execution. Each operator is slightly different, and we incorporate multiple strategies. This example explains one of Hummingbird's strategies for translating a decision tree into tensors using GEMM (GEneral Matrix Multiplication), where we implement the traversal of the tree with matrix multiplications. (GEMM is one of the three tree conversion strategies we currently support.)


Simple decision tree

In this example, the decision tree has four decision nodes (orange) and five leaf nodes (blue). The tree takes a feature vector with five elements as input. For example, assume that we want to calculate the output for a given input observation.

Step 1: Multiply the input tensor with tensor A (computed from the decision tree model above), which captures the relationship between input features and internal nodes. Then compare the result with tensor B, which holds the threshold value of each internal node (orange), to create the input path tensor that records which conditions hold for the input. In this case, the tree has 4 conditions and the input vector has 5 elements; therefore, the shape of tensor A is 5x4 and that of tensor B is 1x4.

Step 2: Multiply the input path tensor with tensor C, which captures, for each leaf node, whether an internal node is an ancestor of that leaf and, if so, whether the leaf lies in its left or right sub-tree (left = 1, right = -1, otherwise = 0). Then check for equality with tensor D, which holds, for each leaf, the count of left-child edges on the path from the root to that leaf, to create the output path tensor that identifies the leaf reached by the input. In this case, the tree has 5 leaves and 4 conditions; therefore, the shape of tensor C is 4x5 and that of tensor D is 1x5.

Step 3: Multiply the output path tensor with tensor E, which maps each leaf node to its output value, to infer the final prediction. In this case, the tree has 5 leaves; therefore, the shape of tensor E is 5x1.

And now Hummingbird has compiled a tree-based model using the GEMM strategy! For more details, please see Figure 3 of our paper.
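To make the three steps concrete, below is a minimal NumPy sketch of the GEMM strategy. The tensor values are illustrative and encode an assumed tree shape (the root tests x2, its children test x0 and x4, and a fourth node tests x1) rather than the exact figure above; in Hummingbird they are computed from the trained model.

import numpy as np

# Tensors for a tree with 4 internal nodes and 5 leaves (illustrative values).
# A: which feature each internal node tests (5 features x 4 nodes).
A = np.array([[0, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 0]], dtype=np.float32)
# B: the threshold of each internal node (1 x 4).
B = np.array([0.5, 2.0, 1.0, 0.3], dtype=np.float32)
# C: for each leaf, is the internal node an ancestor, and on which side?
# (left sub-tree = 1, right sub-tree = -1, not an ancestor = 0; 4 x 5.)
C = np.array([[1,  1, -1, -1, -1],
              [1, -1,  0,  0,  0],
              [0,  0,  1,  1, -1],
              [0,  0,  1, -1,  0]], dtype=np.float32)
# D: number of left-child edges on the root-to-leaf path of each leaf (1 x 5).
D = np.array([2, 1, 2, 1, 0], dtype=np.float32)
# E: the output value of each leaf (5 x 1; made-up regression values).
E = np.array([[0.0], [1.0], [0.2], [0.7], [0.4]], dtype=np.float32)

x = np.array([[3.0, 0.1, 0.4, 7.0, 2.0]], dtype=np.float32)  # one input row

input_path = (x @ A < B).astype(np.float32)             # Step 1: which conditions hold
output_path = (input_path @ C == D).astype(np.float32)  # Step 2: one-hot leaf selector
prediction = output_path @ E                            # Step 3: leaf -> output
print(prediction)  # value of the single leaf selected by the traversal

Because every step is a matrix multiplication or an element-wise comparison, a whole batch of inputs (and a whole forest of trees) can be evaluated with a handful of large tensor operations, which is exactly what makes vectorized and GPU execution effective.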

Thank you to Chien Vu for contributing the graphics and descriptions in his blog for this example!

Installation

Hummingbird was tested on Python 3.8, 3.9, 3.10 and 3.11 on Linux, Windows and macOS machines. (TVM only works through Python 3.10.) It is recommended to use a virtual environment (see: python3 venv doc or Using Python environments in VS Code).

Hummingbird requires PyTorch >= 1.6.0. Please go here for instructions on how to install PyTorch based on your platform and hardware.

Once PyTorch is installed, you can get Hummingbird from pip with:

python -m pip install hummingbird-ml

If you require the optional dependencies lightgbm and xgboost, you can use:

python -m pip install hummingbird-ml[extra]

See also Troubleshooting for common problems.

Examples

See the notebooks section for examples that demonstrate use and speedups.

In general, Hummingbird syntax is very intuitive and minimal. To run your traditional ML model on DNN frameworks, you only need to import hummingbird.ml and add convert(model, 'dnn_framework') to your code. Below is an example using a scikit-learn random forest model and PyTorch as the target framework.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert, load

# Create some random data for binary classification
num_classes = 2
X = np.random.rand(100000, 28)
y = np.random.randint(num_classes, size=100000)

# Create and train a model (scikit-learn RandomForestClassifier in this case)
skl_model = RandomForestClassifier(n_estimators=10, max_depth=10)
skl_model.fit(X, y)

# Use Hummingbird to convert the model to PyTorch
model = convert(skl_model, 'pytorch')

# Run predictions on CPU
model.predict(X)

# Run predictions on GPU
model.to('cuda')
model.predict(X)

# Save the model
model.save('hb_model')

# Load the model back
model = load('hb_model')
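The same convert call also targets the other backends. TorchScript, ONNX, and TVM need some sample input for tracing/compilation; a hedged sketch (the backend strings follow the Hummingbird API documentation, and the ONNX backend additionally requires onnxruntime to be installed):

# Convert to TorchScript (the sample input X is used for tracing)
model_ts = convert(skl_model, 'torch.jit', X)
model_ts.predict(X)

# Convert to ONNX (requires the onnxruntime package)
model_onnx = convert(skl_model, 'onnx', X)
model_onnx.predict(X)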

Documentation

The API documentation is here.

You can also read about Hummingbird in our blog post here.

For more details on the vision and the technical details behind Hummingbird, please check our papers.

Contributing

We welcome contributions! Please see the guide on Contributing.

Also, see our roadmap of planned features.

Community

Join our community on Gitter!

Authors

Special Thanks

  • Masahiro Hiramori (@mshr-h) for the ongoing contributions
  • Masahiro Masuda (@masahi) for the TVM and batching contributions

License

MIT License

hummingbird's People

Contributors

@ahmedkrmn, @ananiask8, @bfgray3, @fd0r, @giriprasad51, @grafail, @interesaaat, @jspisak, @kernc, @kranthigv, @ksaur, @liangfu, @marsupialtail, @masahi, @microsoftopensource, @mmbhatk, @mshr-h, @parulnith, @qin-xiong, @rathijit, @romanbredehoft, @sangamswadik, @scnakandala, @sleepy-owl, @thvasilo, @tuannguyen27, @ucalyptus, @vumichien, @xadupre, @zhanjiezhu


hummingbird's Issues

Notebooks don't work as-is

For example, the lightgbm example notebook has:

from hummingbird.ml import convert

and then

hb_model = convert(model, 'pytorch')

This returns an error:

In [14]: hb_model = convert(model, 'pytorch')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-5658f007aeed> in <module>
----> 1 hb_model = convert(model, 'pytorch')

TypeError: 'module' object is not callable

Should it instead be:
from hummingbird.ml.convert import convert_lightgbm as convert?

upgrade and test new sklearn version

We need to upgrade the sklearn version (currently scikit-learn==0.21.3). To do this, we need to accommodate some API changes in the newer version.

For example, Imputer (formerly in preprocessing) is deprecated. Now, use: from sklearn.impute import SimpleImputer

Automate document generation

  • We have been manually refreshing the documentation (generated using pdoc) until now.
  • However, this can be automated using GitHub Actions.

Open questions:

  • Do we want to update GitHub Pages, too? This may require us to monitor the quality of docstrings closely, which can be a deterrent to a first-time contributor.
  • What do we do with runtime warnings?
  • Is generation of documentation a pre-push or post-push item? If it's post-push, what if we get errors?

params to determine tree type, and max_depth

Right now, we are hardcoding 3 or 4 as low and 10 as high. We need to do more testing and tuning, and document our choices.

Also, there are a lot of inconsistencies and duplications with the max_depth parameter. We need to revisit that for all 4 trees.

Cleanup Linear Converter Code

We have the original draft of the linear code uploaded in this branch. All credit here goes to @scnakandala, the original author and brains behind this.

It contains an unedited implementation of a number of scikit-learn converters.

There is a single test file here that needs to be cleaned up and separated out.

Minimum set of dependencies

Right now, users must install LGBM/XGBoost, etc. Can we have options in the setup.py file (or some other mechanism) so that users are not forced to install dependencies they won't be using?
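One common way to express this (a sketch using setuptools extras_require; the "extra" name mirrors the hummingbird-ml[extra] target shown in the Installation section above, and the version pins are illustrative):

# setup.py (sketch): keep the core install minimal and move the heavy
# converter dependencies behind an optional "extra" target.
from setuptools import setup, find_packages

setup(
    name="hummingbird-ml",
    packages=find_packages(),
    install_requires=["numpy", "scikit-learn", "torch>=1.6.0"],
    # installed only via: pip install hummingbird-ml[extra]
    extras_require={"extra": ["lightgbm", "xgboost"]},
)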

Simplify convert_sklearn API

In the current implementation, converting a sklearn model looks like:

convert_sklearn(model, initial_types=[('input', FloatTensorType([4, 3]))])

but we actually don't need the specification of input types (this is more a onnx converter thing). So we can have something like:

convert_sklearn(model)

which is nice and short. The problem with this is that XGBoostRegressor models do not surface information on the number of input features (whereas XGBoostClassifier does). If we go with the above API, we will need a workaround for XGBoostRegressor. One possibility is to have the following specifically for XGBoostRegressor models:

extra_config["n_features"] = 200
pytorch_model = convert_sklearn(model, extra_config=extra_config)

Another possibility is to pass some input data as for other converters:

pytorch_model = convert_sklearn(model, some_input_data)

One last possibility is to have a different API for each converter (Sklearn, LightGBM and XGBoost, as ONNXMLTools does right now). Then for Sklearn we would have:

pytorch_model = convert_sklearn(model)

For LightGBM we would have:

pytorch_model = convert_lightgbm(model)

And for XGBoost we would have to pass either an extra param or the input data. For example:

pytorch_model = convert_xgboost(model, some_input_data)
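For the some_input_data option, the number of features can simply be read off the sample. A hedged sketch of that fallback logic (infer_n_features is a hypothetical helper, not an existing API):

import numpy as np

def infer_n_features(model, sample_input=None, extra_config=None):
    # Hypothetical helper: resolve the number of input features.
    # Prefer what the model itself reports (e.g., XGBoostClassifier).
    n = getattr(model, "n_features_in_", None)
    if n is not None:
        return n
    # Fall back to user-supplied sample data (covers XGBoostRegressor).
    if sample_input is not None:
        return np.asarray(sample_input).shape[1]
    # Last resort: an explicit entry in extra_config.
    if extra_config and "n_features" in extra_config:
        return extra_config["n_features"]
    raise ValueError("Cannot determine the number of input features.")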

Cannot use torch==1.5.0 due to breaking change

When using torch==1.5.0 (instead of the current torch==1.4.0), hummingbird sometimes gets stuck in an infinite loop in forward:

  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/hummingbird/hummingbird/operator_converters/_tree_implementations.py", line 349, in forward
    gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)

This is potentially due to the [BC-BREAKING] change "index_select scalar_check to retain dimensionality of input" (pytorch/pytorch#30790), new in torch==1.5.0, in the index_select function. The change makes index_select return a 0-dimensional tensor iff the input is 0-dimensional.

However, after digging around a bit more in our code, I separated out

 gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)

into:

gather_indices = torch.index_select(nodes, 0, prev_indices)  #  now gets stuck here
if gather_indices.shape == torch.Size([]):
    gather_indices = gather_indices.view(-1)
gather_indices = gather_indices.view(-1, self.num_trees)

and found that the code is getting stuck on the index_select itself rather than any problem with the changed return type.

So maybe there is some issue related to "optimize index_select performance on CPU with TensorIterator" (#30598), also new in torch==1.5.

Randomization in tests

Right now all of our tests use random data. This sometimes causes GitHub Actions to fail in unexpected ways.

We should use deterministic data in our tests rather than random data so that we get consistency across runs.
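A minimal sketch of what this implies (the fixture below is illustrative, not existing test code; a fixed seed makes the random draws reproducible across runs):

import numpy as np

def make_test_data(n_rows=100, n_features=28, num_classes=2, seed=0):
    # Illustrative fixture: the same data on every test run.
    rng = np.random.RandomState(seed)  # fixed seed -> reproducible draws
    X = rng.rand(n_rows, n_features).astype(np.float32)
    y = rng.randint(num_classes, size=n_rows)
    return X, y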

`wheel` missing from `requirements.txt`

I tried to run the following on a fresh VS Code Spaces instance:

$ pip install -r requirements.txt

And got the following errors:

Could not build wheels for numpy, since package 'wheel' is not installed.
Could not build wheels for scikit-learn, since package 'wheel' is not installed.
Could not build wheels for torch, since package 'wheel' is not installed.
Could not build wheels for xgboost, since package 'wheel' is not installed.
Could not build wheels for lightgbm, since package 'wheel' is not installed.
Could not build wheels for onnxconverter-common, since package 'wheel' is not installed.
Could not build wheels for scipy, since package 'wheel' is not installed.
Could not build wheels for joblib, since package 'wheel' is not installed.
Could not build wheels for future, since package 'wheel' is not installed.
Could not build wheels for protobuf, since package 'wheel' is not installed.
Could not build wheels for onnx, since package 'wheel' is not installed.
Could not build wheels for six, since package 'wheel' is not installed.
Could not build wheels for setuptools, since package 'wheel' is not installed.
Could not build wheels for typing-extensions, since package 'wheel' is not installed.

A quick

$ pip install wheel

Fixed that. Should wheel be in requirements.txt?

Add support for Visual Studio Codespaces

One way to resolve #60 is to add explicit support for dev containers to the Hummingbird repo. The documentation for that is here. Specifically, we could add a Dockerfile and settings for VS Code to use it. I am happy to give that a try if someone has a Dockerfile I can base that on.

float64 issue

At the moment, HB only works with float32. You must cast float64 to float32 for correct results. (You will get an error with the gemm algorithm.) We need to fix this.

Ex: in scikit-learn-random-forest-example.ipynb we must cast X as follows:
X = X[0:nrows].astype('|f4')
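Equivalently, and perhaps more readably (the '|f4' string is just the NumPy spelling for a 4-byte float):

import numpy as np
X = X[0:nrows].astype(np.float32)  # same cast as '|f4'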

Make the dev container more comfy

The current dev container is derived from a super barebones image. Not even less (the pager) is installed. I will be looking into finding a more comfortable base image, probably starting with the ones found here.

Minor documentation improvements

When generating documentation using pdoc, we observe a few warnings:

/home/neumann/miniconda3/envs/bird/lib/python3.7/site-packages/pdoc/html_helpers.py:498: ReferenceWarning: Code reference `hummingbird.supported` in module "hummingbird.ml.convert" does not match any documented object.
  linked = re.sub(r'[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\(\))?', handle_refname, code_span)
/home/neumann/miniconda3/envs/bird/lib/python3.7/site-packages/pdoc/html_helpers.py:498: ReferenceWarning: Code reference `hummingbird.supported_configurations` in module "hummingbird.ml.convert" does not match any documented object.
  linked = re.sub(r'[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\(\))?', handle_refname, code_span)
  1. We should change hummingbird.supported to hummingbird.ml.supported for the documentation site to correctly link them.
  2. Docstrings mention that hummingbird.supported_configurations shows the set of supported extra configurations.
    However, supported_configurations is not found in the repo (within any submodule).
    Is it not yet released?

Multiclass rounding errors

With multiclass datasets (such as covtype or iris), we sometimes get rounding errors:

import numpy as np
import torch
from hummingbird import convert_sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_covtype

X, y = fetch_covtype(return_X_y=True)
nrows=15000
X = X[0:nrows]
y = y[0:nrows]
X_torch = torch.from_numpy(X).float()

model = RandomForestClassifier(n_estimators=10, max_depth=10)
model.fit(X, y)

pytorch_model = convert_sklearn(
    model, 
    extra_config = {"tree_implementation": "gemm"})


skl = model.predict_proba(X)
pytorch_model.to('cuda')
hum_gpu = pytorch_model(X_torch.to('cuda'))

np.testing.assert_allclose(skl, hum_gpu[1].data.to('cpu').numpy(), rtol=1e-6, atol=1e-6)

gives error:


AssertionError: 
Not equal to tolerance rtol=1e-06, atol=1e-06

Mismatched elements: 332 / 105000 (0.316%)
Max absolute difference: 0.11943346
Max relative difference: 5.82971106
 x: array([[0.121156, 0.200913, 0.008188, ..., 0.643138, 0.00637 , 0.020236],
       [0.110779, 0.207474, 0.008188, ..., 0.646954, 0.00637 , 0.020236],
       [0.1959  , 0.707151, 0.008188, ..., 0.050266, 0.00637 , 0.032125],...
 y: array([[0.121156, 0.200913, 0.008188, ..., 0.643138, 0.00637 , 0.020236],
       [0.110779, 0.207474, 0.008188, ..., 0.646954, 0.00637 , 0.020236],
       [0.1959  , 0.707152, 0.008188, ..., 0.050266, 0.00637 , 0.032125],...

Remove not required dependencies

Should we keep sklearn? On one side, having sklearn will help first-time users, but on the other side, everyone taking a dependency on hummingbird will have to install sklearn even if it is not used.

Rounding in small datasets

With small datasets, we often get a rounding error that we don't see with larger datasets.

repro:

import numpy as np
import torch, pickle
from hummingbird import convert_sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris


data = load_iris()
X, y = data.data, data.target
X_torch = torch.from_numpy(X)


model = RandomForestClassifier(n_estimators=10)
model.fit(X, y)
pytorch_model = convert_sklearn(
    model, 
    extra_config = {"tree_implementation": "perf_tree_trav"})

skl = model.predict_proba(X)

pytorch_model.to('cuda')
hum_gpu = pytorch_model(X_torch.to('cuda'))

np.testing.assert_allclose(skl, hum_gpu[1].data.to('cpu').numpy(), rtol=1e-06, atol=1e-06)

you get

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=1e-06

Mismatched elements: 10 / 450 (2.22%)
Max absolute difference: 0.1
Max relative difference: 1.
 x: array([[1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],...
 y: array([[1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],...

RF beam++ single node

There is a bug in beam++ for RF with node size 1 somewhere related to tree_commons.py:481

pytorch problem with pip install on Python3.7 or Python3.8

Doing pip install hummingbird-ml on Python 3.7 or Python 3.8, a user reported:

ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from hummingbird-ml) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from hummingbird-ml)

Looks like pytorch on PyPI is 1.0.2 and on the conda main channel it's 1.3.1.
Maybe linking to the installation page for pytorch would be useful.

consistency in returned types

It's not clear when we should take:

pytorch_model(torch.from_numpy(X))[0]  #  RandomForestClassifier

vs. say

pytorch_model(torch.from_numpy(X))[1]   # DecisionTreeClassifier

In the end, we want it to be something simple, like what we compare against:

model.predict(X)
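One possible direction (a hypothetical sketch, not an existing or planned API) is a thin wrapper that hides the tuple indexing and mirrors the sklearn call:

import torch

class SklearnLikePredictor:
    # Hypothetical wrapper giving converted models a uniform predict().
    def __init__(self, torch_model, label_index):
        self.model = torch_model
        self.label_index = label_index  # which tuple element holds the labels

    def predict(self, X):
        with torch.no_grad():
            out = self.model(torch.from_numpy(X))
        return out[self.label_index].cpu().numpy()

With something like this, SklearnLikePredictor(pytorch_model, 0).predict(X) would behave like model.predict(X) regardless of the underlying classifier type.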

Tree classifiers return wrong labels while probabilities match

Apparently torch.argmax returns the last matching value, while np.argmax returns the first one. In general this is not a problem, except when we get the same probabilities for 2 or more classes.

This will hopefully get fixed once the upstream PyTorch issue gets closed.

Once this is fixed, we should add assertions over the label results as well. At this point such assertions won't be useful since they will likely often fail.
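The discrepancy is easy to reproduce with tied probabilities (hedged: the torch result depends on the version; on the affected versions the issue reports the last index being returned):

import numpy as np
import torch

probs = [0.2, 0.4, 0.4]  # classes 1 and 2 tie
print(np.argmax(np.array(probs)))         # 1: numpy returns the first maximal index
print(torch.argmax(torch.tensor(probs)))  # may print 2 (last index) on affected versions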

Fix xgb converter for version > 0.90

Regression for XGBoost is failing for 1.0.2.

Apparently there was:

  • a change in the API: alpha is now an array/list
  • a change in the model structure because we are getting different results for regression
  • a change in max_depth because None does not work as max_depth in 0.90
