
hummingbird's Introduction

Hummingbird



Introduction

Hummingbird is a library for compiling trained traditional ML models into tensor computations. Hummingbird allows users to seamlessly leverage neural network frameworks (such as PyTorch) to accelerate traditional ML models. Thanks to Hummingbird, users can benefit from: (1) all the current and future optimizations implemented in neural network frameworks; (2) native hardware acceleration; (3) a single platform that supports both traditional and neural network models; and (4) all of this without having to re-engineer their models.

Currently, you can use Hummingbird to convert your trained traditional ML models into PyTorch, TorchScript, ONNX, and TVM. Hummingbird supports a variety of ML models and featurizers, including scikit-learn Decision Trees and Random Forests, as well as LightGBM and XGBoost Classifiers/Regressors. Support for other neural network backends and models is on our roadmap.

Hummingbird also provides a convenient uniform "inference" API following the Sklearn API. This allows swapping Sklearn models with Hummingbird-generated ones without having to change the inference code. By converting the models to PyTorch and TorchScript it also becomes possible to serve them using TorchServe.

How Hummingbird Works

Hummingbird works by reconfiguring algorithmic operators so that we can perform more regular computations that are amenable to vectorized and GPU execution. Each operator is slightly different, and we incorporate multiple strategies. This example explains one of Hummingbird's strategies for translating a decision tree into tensors using GEMM (GEneral Matrix Multiplication), where we implement the traversal of the tree with matrix multiplications. (GEMM is one of the three tree conversion strategies we currently support.)


Simple decision tree

In this example, the decision tree has four decision nodes (orange) and five leaf nodes (blue). The tree takes a feature vector with five elements as input. For example, assume that we want to calculate the output for a given input observation.

Step 1: Multiply the input tensor with tensor A (computed from the decision tree model above), which captures the relationship between input features and internal nodes. Then compare the result with tensor B, which holds the threshold value of each internal node (orange), to create the input path tensor that records which conditions hold for the input. In this case, the tree has 4 conditions and the input vector has 5 elements; therefore, the shape of tensor A is 5x4 and that of tensor B is 1x4.

Step 2: Multiply the input path tensor with tensor C, which captures, for each leaf node, whether an internal node is an ancestor of that leaf and, if so, whether the leaf lies in its left or right sub-tree (left = 1, right = -1, otherwise = 0). Then check for equality with tensor D, which holds, for each leaf, the count of left-child edges on the path from the root to that leaf, to create the output path tensor that identifies the leaf reached by the input. In this case, the tree has 5 leaves and 4 conditions; therefore, the shape of tensor C is 4x5 and that of tensor D is 1x5.

Step 3: Multiply the output path tensor with tensor E, which maps each leaf node to its output value, to infer the final prediction. In this case, the tree has 5 leaves; therefore, the shape of tensor E is 5x1.

And now Hummingbird has compiled a tree-based model using the GEMM strategy! For more details, please see Figure 3 of our paper.
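To make the three steps concrete, below is a minimal NumPy sketch of the GEMM strategy. The tensor values are illustrative and encode an assumed tree shape (the root tests x2, its children test x0 and x4, and a fourth node tests x1) rather than the exact figure above; in Hummingbird they are computed from the trained model.

import numpy as np

# Tensors for a tree with 4 internal nodes and 5 leaves (illustrative values).
# A: which feature each internal node tests (5 features x 4 nodes).
A = np.array([[0, 1, 0, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 0, 1, 0]], dtype=np.float32)
# B: the threshold of each internal node (1 x 4).
B = np.array([0.5, 2.0, 1.0, 0.3], dtype=np.float32)
# C: for each leaf, is the internal node an ancestor, and on which side?
# (left sub-tree = 1, right sub-tree = -1, not an ancestor = 0; 4 x 5.)
C = np.array([[1,  1, -1, -1, -1],
              [1, -1,  0,  0,  0],
              [0,  0,  1,  1, -1],
              [0,  0,  1, -1,  0]], dtype=np.float32)
# D: number of left-child edges on the root-to-leaf path of each leaf (1 x 5).
D = np.array([2, 1, 2, 1, 0], dtype=np.float32)
# E: the output value of each leaf (5 x 1; made-up regression values).
E = np.array([[0.0], [1.0], [0.2], [0.7], [0.4]], dtype=np.float32)

x = np.array([[3.0, 0.1, 0.4, 7.0, 2.0]], dtype=np.float32)  # one input row

input_path = (x @ A < B).astype(np.float32)             # Step 1: which conditions hold
output_path = (input_path @ C == D).astype(np.float32)  # Step 2: one-hot leaf selector
prediction = output_path @ E                            # Step 3: leaf -> output
print(prediction)  # value of the single leaf selected by the traversal

Because every step is a matrix multiplication or an element-wise comparison, a whole batch of inputs (and a whole forest of trees) can be evaluated with a handful of large tensor operations, which is exactly what makes vectorized and GPU execution effective.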

Thank you to Chien Vu for contributing the graphics and descriptions in his blog for this example!

Installation

Hummingbird was tested on Python 3.8, 3.9, 3.10 and 3.11 on Linux, Windows and macOS machines. (TVM only works through Python 3.10.) It is recommended to use a virtual environment (see: python3 venv doc or Using Python environments in VS Code).

Hummingbird requires PyTorch >= 1.6.0. Please go here for instructions on how to install PyTorch based on your platform and hardware.

Once PyTorch is installed, you can get Hummingbird from pip with:

python -m pip install hummingbird-ml

If you require the optional dependencies lightgbm and xgboost, you can use:

python -m pip install hummingbird-ml[extra]

See also Troubleshooting for common problems.

Examples

See the notebooks section for examples that demonstrate use and speedups.

In general, Hummingbird syntax is very intuitive and minimal. To run your traditional ML model on DNN frameworks, you only need to import hummingbird.ml and add convert(model, 'dnn_framework') to your code. Below is an example using a scikit-learn random forest model and PyTorch as the target framework.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert, load

# Create some random data for binary classification
num_classes = 2
X = np.random.rand(100000, 28)
y = np.random.randint(num_classes, size=100000)

# Create and train a model (scikit-learn RandomForestClassifier in this case)
skl_model = RandomForestClassifier(n_estimators=10, max_depth=10)
skl_model.fit(X, y)

# Use Hummingbird to convert the model to PyTorch
model = convert(skl_model, 'pytorch')

# Run predictions on CPU
model.predict(X)

# Run predictions on GPU
model.to('cuda')
model.predict(X)

# Save the model
model.save('hb_model')

# Load the model back
model = load('hb_model')
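The same convert call also targets the other backends. TorchScript, ONNX, and TVM need some sample input for tracing/compilation; a hedged sketch (the backend strings follow the Hummingbird API documentation, and the ONNX backend additionally requires onnxruntime to be installed):

# Convert to TorchScript (the sample input X is used for tracing)
model_ts = convert(skl_model, 'torch.jit', X)
model_ts.predict(X)

# Convert to ONNX (requires the onnxruntime package)
model_onnx = convert(skl_model, 'onnx', X)
model_onnx.predict(X)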

Documentation

The API documentation is here.

You can also read about Hummingbird in our blog post here.

For more details on the vision and the technical details behind Hummingbird, please check our papers.

Contributing

We welcome contributions! Please see the guide on Contributing.

Also, see our roadmap of planned features.

Community

Join our community on Gitter!

Authors

Special Thanks

  • Masahiro Hiramori (@mshr-h) for the ongoing contributions
  • Masahiro Masuda (@masahi) for the TVM and batching contributions

License

MIT License

hummingbird's People

Contributors

@ahmedkrmn, @ananiask8, @bfgray3, @fd0r, @giriprasad51, @grafail, @interesaaat, @jspisak, @kernc, @kranthigv, @ksaur, @liangfu, @marsupialtail, @masahi, @microsoftopensource, @mmbhatk, @mshr-h, @parulnith, @qin-xiong, @rathijit, @romanbredehoft, @sangamswadik, @scnakandala, @sleepy-owl, @thvasilo, @tuannguyen27, @ucalyptus, @vumichien, @xadupre, @zhanjiezhu


hummingbird's Issues

Notebooks don't work as-is

For example, the lightgbm example notebook has:

from hummingbird.ml import convert

and then

hb_model = convert(model, 'pytorch')

This returns an error:

In [14]: hb_model = convert(model, 'pytorch')
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-14-5658f007aeed> in <module>
----> 1 hb_model = convert(model, 'pytorch')

TypeError: 'module' object is not callable

Should it instead be:
from hummingbird.ml.convert import convert_lightgbm as convert?

upgrade and test new sklearn version

We need to upgrade the sklearn version (currently scikit-learn==0.21.3). To do this, we need to accommodate some API changes in the newer version.

For example, Imputer (formerly in preprocessing) is deprecated. Now, use: from sklearn.impute import SimpleImputer

Automate document generation

  • We have been manually refreshing the documentation (generated using pdoc) until now.
  • However, this can be automated using GitHub Actions.

Open questions:

  • Do we want to update GitHub Pages, too? This may require us to monitor the quality of docstrings closely, which can be a deterrent to a first-time contributor.
  • What do we do with runtime warnings?
  • Is generation of documentation a pre-push or post-push item? If it's post-push, what if we get errors?

params to determine tree type, and max_depth

Right now, we are hardcoding 3 or 4 as low and 10 as high. We need to do more testing and tuning, and document our choices.

Also, there are a lot of inconsistencies and duplications with the max_depth parameter. We need to revisit that for all 4 trees.

Cleanup Linear Converter Code

We have the original draft of the linear code uploaded in this branch. All credit here goes to @scnakandala, the original author and brains behind this.

It contains an unedited implementation of a number of scikit-learn converters.

There is a single test file here that needs to be cleaned up and separated out.

Minimum set of dependencies

Right now, users must install LGBM/XGBoost, etc. Can we have options in the setup.py file (or some other mechanism) so that users are not forced to install dependencies they won't be using?
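One common way to express this (a sketch using setuptools extras_require; the "extra" name mirrors the hummingbird-ml[extra] target shown in the Installation section above, and the version pins are illustrative):

# setup.py (sketch): keep the core install minimal and move the heavy
# converter dependencies behind an optional "extra" target.
from setuptools import setup, find_packages

setup(
    name="hummingbird-ml",
    packages=find_packages(),
    install_requires=["numpy", "scikit-learn", "torch>=1.6.0"],
    # installed only via: pip install hummingbird-ml[extra]
    extras_require={"extra": ["lightgbm", "xgboost"]},
)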

Simplify convert_sklearn API

In the current implementation, converting a sklearn model looks like:

convert_sklearn(model, initial_types=[('input', FloatTensorType([4, 3]))])

but we actually don't need the specification of input types (this is more a onnx converter thing). So we can have something like:

convert_sklearn(model)

which is nice and short. The problem with this is that XGBoostRegressor models do not surface information on the number of input features (whereas XGBoostClassifier does). If we go with the above API, we will need a workaround for XGBoostRegressor. One possibility is to have the following specifically for XGBoostRegressor models:

extra_config["n_features"] = 200
pytorch_model = convert_sklearn(model, extra_config=extra_config)

Another possibility is to pass some input data as for other converters:

pytorch_model = convert_sklearn(model, some_input_data)

One last possibility is to have a different API for each converter (Sklearn, LightGBM and XGBoost, as ONNXMLTools does right now). Then for Sklearn we would have:

pytorch_model = convert_sklearn(model)

For LightGBM we would have:

pytorch_model = convert_lightgbm(model)

And for XGBoost we would have to pass either an extra param or the input data. For example:

pytorch_model = convert_xgboost(model, some_input_data)
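For the some_input_data option, the number of features can simply be read off the sample. A hedged sketch of that fallback logic (infer_n_features is a hypothetical helper, not an existing API):

import numpy as np

def infer_n_features(model, sample_input=None, extra_config=None):
    # Hypothetical helper: resolve the number of input features.
    # Prefer what the model itself reports (e.g., XGBoostClassifier).
    n = getattr(model, "n_features_in_", None)
    if n is not None:
        return n
    # Fall back to user-supplied sample data (covers XGBoostRegressor).
    if sample_input is not None:
        return np.asarray(sample_input).shape[1]
    # Last resort: an explicit entry in extra_config.
    if extra_config and "n_features" in extra_config:
        return extra_config["n_features"]
    raise ValueError("Cannot determine the number of input features.")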

Cannot use torch==1.5.0 due to breaking change

When using torch==1.5.0 (instead of the current torch==1.4.0), hummingbird sometimes gets stuck in an infinite loop in forward:

  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/root/hummingbird/hummingbird/operator_converters/_tree_implementations.py", line 349, in forward
    gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)

This is potentially due to the [BC-BREAKING] change "index_select scalar_check to retain dimensionality of input" (pytorch/pytorch#30790), new in torch==1.5.0, in the index_select function. The change makes index_select return a 0-dimensional tensor iff the input is 0-dimensional.

However, after digging around a bit more in our code, I separated out

 gather_indices = torch.index_select(nodes, 0, prev_indices).view(-1, self.num_trees)

into:

gather_indices = torch.index_select(nodes, 0, prev_indices)  #  now gets stuck here
if gather_indices.shape == torch.Size([]):
    gather_indices = gather_indices.view(-1)
gather_indices = gather_indices.view(-1, self.num_trees)

and found that the code is getting stuck on the index_select itself rather than any problem with the changed return type.

So maybe there is some issue related to "optimize index_select performance on CPU with TensorIterator" (#30598), also new in torch==1.5.

Randomization in tests

Right now all of our tests use random data. This sometimes causes GitHub Actions to fail in unexpected ways.

We should use deterministic data in our tests rather than random data so that we get consistency across runs.
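A minimal sketch of what this implies (the fixture below is illustrative, not existing test code; a fixed seed makes the random draws reproducible across runs):

import numpy as np

def make_test_data(n_rows=100, n_features=28, num_classes=2, seed=0):
    # Illustrative fixture: the same data on every test run.
    rng = np.random.RandomState(seed)  # fixed seed -> reproducible draws
    X = rng.rand(n_rows, n_features).astype(np.float32)
    y = rng.randint(num_classes, size=n_rows)
    return X, y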

`wheel` missing from `requirements.txt`

I tried to run the following on a fresh VS Code Spaces instance:

$ pip install -r requirements.txt

And got the following errors:

Could not build wheels for numpy, since package 'wheel' is not installed.
Could not build wheels for scikit-learn, since package 'wheel' is not installed.
Could not build wheels for torch, since package 'wheel' is not installed.
Could not build wheels for xgboost, since package 'wheel' is not installed.
Could not build wheels for lightgbm, since package 'wheel' is not installed.
Could not build wheels for onnxconverter-common, since package 'wheel' is not installed.
Could not build wheels for scipy, since package 'wheel' is not installed.
Could not build wheels for joblib, since package 'wheel' is not installed.
Could not build wheels for future, since package 'wheel' is not installed.
Could not build wheels for protobuf, since package 'wheel' is not installed.
Could not build wheels for onnx, since package 'wheel' is not installed.
Could not build wheels for six, since package 'wheel' is not installed.
Could not build wheels for setuptools, since package 'wheel' is not installed.
Could not build wheels for typing-extensions, since package 'wheel' is not installed.

A quick

$ pip install wheel

Fixed that. Should wheel be in requirements.txt?

Add support for Visual Studio Codespaces

One way to resolve #60 is to add explicit support for dev containers to the Hummingbird repo. The documentation for that is here. Specifically, we could add a Dockerfile and settings for VS Code to use it. I am happy to give that a try if someone has a Dockerfile I can base that on.

float64 issue

At the moment, HB only works with float32. You must cast float64 to float32 for correct results. (You will get an error with the gemm algorithm.) We need to fix this.

Ex: in scikit-learn-random-forest-example.ipynb we must cast X as follows:
X = X[0:nrows].astype('|f4')
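Equivalently, and perhaps more readably (the '|f4' string is just the NumPy spelling for a 4-byte float):

import numpy as np
X = X[0:nrows].astype(np.float32)  # same cast as '|f4'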

Make the dev container more comfy

The current dev container is derived from a super barebones image. Not even less (the pager) is installed. I will be looking into finding a more comfortable base image, probably starting with the ones found here.

Minor documentation improvements

When generating documentation using pdoc, we observe a few warnings:

/home/neumann/miniconda3/envs/bird/lib/python3.7/site-packages/pdoc/html_helpers.py:498: ReferenceWarning: Code reference `hummingbird.supported` in module "hummingbird.ml.convert" does not match any documented object.
  linked = re.sub(r'[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\(\))?', handle_refname, code_span)
/home/neumann/miniconda3/envs/bird/lib/python3.7/site-packages/pdoc/html_helpers.py:498: ReferenceWarning: Code reference `hummingbird.supported_configurations` in module "hummingbird.ml.convert" does not match any documented object.
  linked = re.sub(r'[a-zA-Z_]\w*(?:\.[a-zA-Z_]\w*)*(?:\(\))?', handle_refname, code_span)
  1. We should change hummingbird.supported to hummingbird.ml.supported for the documentation site to correctly link them.
  2. Docstrings mention that hummingbird.supported_configurations shows the set of supported extra configurations.
    However, supported_configurations is not found in the repo (within any submodule).
    Is it not yet released?

Multiclass rounding errors

With multiclass datasets (such as covtype or iris), we sometimes get rounding errors:

import numpy as np
import torch
from hummingbird import convert_sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import fetch_covtype

X, y = fetch_covtype(return_X_y=True)
nrows=15000
X = X[0:nrows]
y = y[0:nrows]
X_torch = torch.from_numpy(X).float()

model = RandomForestClassifier(n_estimators=10, max_depth=10)
model.fit(X, y)

pytorch_model = convert_sklearn(
    model, 
    extra_config = {"tree_implementation": "gemm"})


skl = model.predict_proba(X)
pytorch_model.to('cuda')
hum_gpu = pytorch_model(X_torch.to('cuda'))

np.testing.assert_allclose(skl, hum_gpu[1].data.to('cpu').numpy(), rtol=1e-6, atol=1e-6)

gives error:


AssertionError: 
Not equal to tolerance rtol=1e-06, atol=1e-06

Mismatched elements: 332 / 105000 (0.316%)
Max absolute difference: 0.11943346
Max relative difference: 5.82971106
 x: array([[0.121156, 0.200913, 0.008188, ..., 0.643138, 0.00637 , 0.020236],
       [0.110779, 0.207474, 0.008188, ..., 0.646954, 0.00637 , 0.020236],
       [0.1959  , 0.707151, 0.008188, ..., 0.050266, 0.00637 , 0.032125],...
 y: array([[0.121156, 0.200913, 0.008188, ..., 0.643138, 0.00637 , 0.020236],
       [0.110779, 0.207474, 0.008188, ..., 0.646954, 0.00637 , 0.020236],
       [0.1959  , 0.707152, 0.008188, ..., 0.050266, 0.00637 , 0.032125],...

Remove not required dependencies

Should we keep sklearn? On one side, having sklearn will help first-time users, but on the other side, everyone taking a dependency on hummingbird will have to install sklearn even if it is not used.

Rounding in small datasets

With small datasets, we often get a rounding error that we don't see with larger datasets.

repro:

import numpy as np
import torch, pickle
from hummingbird import convert_sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris


data = load_iris()
X, y = data.data, data.target
X_torch = torch.from_numpy(X)


model = RandomForestClassifier(n_estimators=10)
model.fit(X, y)
pytorch_model = convert_sklearn(
    model, 
    extra_config = {"tree_implementation": "perf_tree_trav"})

skl = model.predict_proba(X)

pytorch_model.to('cuda')
hum_gpu = pytorch_model(X_torch.to('cuda'))

np.testing.assert_allclose(skl, hum_gpu[1].data.to('cpu').numpy(), rtol=1e-06, atol=1e-06)

you get

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 1533, in assert_allclose
    verbose=verbose, header=header, equal_nan=equal_nan)
  File "/opt/conda/envs/rapids/lib/python3.7/site-packages/numpy/testing/_private/utils.py", line 846, in assert_array_compare
    raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-06, atol=1e-06

Mismatched elements: 10 / 450 (2.22%)
Max absolute difference: 0.1
Max relative difference: 1.
 x: array([[1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],...
 y: array([[1. , 0. , 0. ],
       [1. , 0. , 0. ],
       [1. , 0. , 0. ],...

RF beam++ single node

There is a bug in beam++ for RF with node size 1 somewhere related to tree_commons.py:481

pytorch problem with pip install on Python3.7 or Python3.8

Doing pip install hummingbird-ml on Python 3.7 or Python 3.8, a user reported:

ERROR: Could not find a version that satisfies the requirement torch>=1.4.0 (from hummingbird-ml) (from versions: 0.1.2, 0.1.2.post1, 0.1.2.post2)
ERROR: No matching distribution found for torch>=1.4.0 (from hummingbird-ml)

Looks like pytorch on PyPI is 1.0.2 and on the conda main channel it's 1.3.1.
Maybe linking to the installation page for pytorch would be useful.

consistency in returned types

It's not clear when we should take:

pytorch_model(torch.from_numpy(X))[0]  #  RandomForestClassifier

vs. say

pytorch_model(torch.from_numpy(X))[1]   # DecisionTreeClassifier

In the end, we want it to be something simple, like what we compare against:

model.predict(X)
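One possible direction (a hypothetical sketch, not an existing or planned API) is a thin wrapper that hides the tuple indexing and mirrors the sklearn call:

import torch

class SklearnLikePredictor:
    # Hypothetical wrapper giving converted models a uniform predict().
    def __init__(self, torch_model, label_index):
        self.model = torch_model
        self.label_index = label_index  # which tuple element holds the labels

    def predict(self, X):
        with torch.no_grad():
            out = self.model(torch.from_numpy(X))
        return out[self.label_index].cpu().numpy()

With something like this, SklearnLikePredictor(pytorch_model, 0).predict(X) would behave like model.predict(X) regardless of the underlying classifier type.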

Tree classifiers return wrong labels while probabilities match

Apparently torch.argmax returns the last matching value, while np.argmax returns the first one. In general this is not a problem, except when we get the same probabilities for 2 or more classes.

This will hopefully get fixed once the upstream PyTorch issue gets closed.

Once this is fixed, we should add assertions over the label results as well. At this point such assertions won't be useful since they will likely often fail.
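The discrepancy is easy to reproduce with tied probabilities (hedged: the torch result depends on the version; on the affected versions the issue reports the last index being returned):

import numpy as np
import torch

probs = [0.2, 0.4, 0.4]  # classes 1 and 2 tie
print(np.argmax(np.array(probs)))         # 1: numpy returns the first maximal index
print(torch.argmax(torch.tensor(probs)))  # may print 2 (last index) on affected versions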

Fix xgb converter for version > 0.90

Regression for XGBoost is failing for 1.0.2.

Apparently there was:

  • a change in the API: alpha is now an array/list
  • a change in the model structure because we are getting different results for regression
  • a change in max_depth because None does not work as max_depth in 0.90
