BayesWitnesses / m2cgen

Transform ML models into native code (Java, C, Python, Go, JavaScript, Visual Basic, C#, R, PowerShell, PHP, Dart, Haskell, Ruby, F#, Rust) with zero dependencies

License: MIT License

Python 50.79% Java 3.80% C 3.20% Dockerfile 0.22% Makefile 0.10% Go 3.50% JavaScript 1.37% C# 3.69% Visual Basic .NET 3.69% VBA 0.14% PowerShell 4.45% Shell 0.03% R 3.56% PHP 3.21% Dart 4.19% Haskell 3.35% Ruby 3.03% F# 3.14% FreeBasic 0.40% Rust 4.15%

machine-learning scikit-learn statistical-learning xgboost lightgbm java python c javascript go


m2cgen's Issues

In scikit-learn SVC convert One-vs-one decisions to One-vs-rest

Refer to

m2cgen/m2cgen/assemblers/svm.py

Lines 110 to 127 in d73048e

# One-vs-one decisions.
decisions = []
for i in range(n_support_len):
    for j in range(i + 1, n_support_len):
        kernel_weight_mul_ops = [
            utils.mul(kernel_exprs[k], ast.NumVal(coef[i][k]))
            for k in range(*support_ranges[j])
        ]
        kernel_weight_mul_ops.extend([
            utils.mul(kernel_exprs[k], ast.NumVal(coef[j - 1][k]))
            for k in range(*support_ranges[i])
        ])
        decision = utils.apply_op_to_expressions(
            ast.BinNumOpType.ADD,
            ast.NumVal(intercept[len(decisions)]),
            *kernel_weight_mul_ops
        )
        decisions.append(decision)

import m2cgen as m2c error

File "", line 1, in <module>
File "/anaconda2/lib/python2.7/site-packages/m2cgen/__init__.py", line 1, in <module>
    from .exporters import export_to_java, export_to_python, export_to_c
File "/anaconda2/lib/python2.7/site-packages/m2cgen/exporters.py", line 1, in <module>
    from m2cgen import assemblers
File "/anaconda2/lib/python2.7/site-packages/m2cgen/assemblers/__init__.py", line 1, in <module>
    from .linear import LinearModelAssembler
File "/anaconda2/lib/python2.7/site-packages/m2cgen/assemblers/linear.py", line 2, in <module>
    from m2cgen.assemblers import utils
File "/anaconda2/lib/python2.7/site-packages/m2cgen/assemblers/utils.py", line 36
    def apply_op_to_expressions(op, *exprs, to_reuse=False):
    ^
SyntaxError: invalid syntax
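
The line flagged in the traceback declares a keyword-only argument (`to_reuse=False` after `*exprs`). That syntax exists only in Python 3, so a Python 2.7 interpreter rejects the file while parsing it. A minimal sketch of the same construct follows; `apply_op` is a hypothetical stand-in for illustration, not m2cgen's implementation.

```python
# Hypothetical stand-in for illustration; not m2cgen's actual implementation.
def apply_op(op, *exprs, to_reuse=False):
    """Fold exprs left to right with op; to_reuse can only be passed by keyword."""
    result = exprs[0]
    for expr in exprs[1:]:
        result = op(result, expr)
    return result, to_reuse


# Python 3 accepts the definition above. Python 2.7 stops with
# "SyntaxError: invalid syntax" before any code in the module runs.
total, reuse = apply_op(lambda a, b: a + b, 1, 2, 3, to_reuse=True)
```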

Dart language support

For those building Flutter apps that would like to be able to utilize static models trained in scikit on-device, this tool would be a perfect fit. And if the Flutter dev team decides to add a hot code push feature to the framework, models from m2cgen could be updated on the fly.

Converted version outputs index of class instead of class

If I train a classifier with non-consecutive numbers for classes, the resulting converted code (C in my case) will not output the classes but the index of the class. In my case I simply don't have an example for class 1 in all cases, so the classifier will not know this class exists. This creates discrepancies between Python and C.

import numpy as np
import m2cgen
from sklearn.ensemble import RandomForestClassifier
# linear mapping: x->x
# NB: my goal is not regression, this is just an example
x_train = np.repeat([0,1,2,3,4,5], 100).reshape([-1,1])
y_train = np.repeat([0,1,2,3,4,5], 100)

# however, class 1 is missing in training!
x_train = x_train[y_train!=1]
y_train = y_train[y_train!=1]

clf = RandomForestClassifier().fit(x_train, y_train)

# convert it
code = m2cgen.export_to_c(clf)

result = clf.predict(np.atleast_2d([0,1,2,3,4,5]).T)
# result =[0,0,2,3,4,5]

Calling it in C will give different results:

# Pseudocode for C
double result[5] = score([0,1,2,3,4,5])

#result = [0,0,1,2,3,4]

Do you think there is any feasible way to keep the original class labels?

(see also nok/sklearn-porter#37 having the same problem)
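
One possible workaround on the caller's side, sketched under assumptions: the generated code returns one score per class in the order the classifier saw the classes during training (scikit-learn exposes that order as `clf.classes_`), so the winning index can be mapped back through that list. The helper `index_to_label` and the score values are hypothetical.

```python
def index_to_label(scores, classes):
    """Return the original label for the highest-scoring class index."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return classes[best_index]


# Labels seen during training (class 1 is absent), in the order a fitted
# scikit-learn classifier would report them via clf.classes_.
classes = [0, 2, 3, 4, 5]
# Hypothetical per-class scores as returned by the generated code.
scores = [0.05, 0.80, 0.05, 0.05, 0.05]
label = index_to_label(scores, classes)
```

The same lookup table could be exported alongside the generated C code as a constant array.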

Support for Categorical Variables in LightGBM

LightGBM supports categorical variables using an integer encoding. https://lightgbm.readthedocs.io/en/latest/Advanced-Topics.html#categorical-feature-support

The way this is represented in the tree is using the equals operator and a pipe delimited string for the categorical set.

Example

# Normal split
"threshold": 4.285125194551891,
"decision_type": "<=",


# Categorical split
"threshold": "4||6||7||8||9||22||28||32||63||64",
"decision_type": "=="

It's important to optimize the set membership check for performance reasons. At a minimum, I think we'll need to:

use a data structure like a set (or binary search an array)
hoist the set to the top level so it's only initialized once

I tried a naive solution (inlined ORs: feat == a || feat == b ...) in addition to hoisted sets in Java. The performance difference was ~6x on my model, which has a large number of categorical operations (~30 members in each set).

If my understanding of m2cgen is right, I think this should be a new operator in interpreters/interpreter.py

I have a PoC here, master...Zocdoc:cs_categorical, if you'd like to look at the code. It's just a hack, and I would like your guidance on making these changes in a better way.

If it's helpful, I can open a PR and we can discuss the code there (GitHub also allows maintainer commits to PR branches now, which is nice).
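
The two points above (a set-like structure, hoisted so it is built once) can be sketched as follows. This is only an illustration of the idea, not the PoC's code: the names `parse_categorical_threshold`, `NODE_0_CATEGORIES`, and `matches_split` are hypothetical, and which child a match leads to is left to the caller.

```python
def parse_categorical_threshold(threshold):
    """Turn a LightGBM-style "a||b||c" threshold string into a frozenset of ints."""
    return frozenset(int(value) for value in threshold.split("||"))


# Hoisted to module level so the set is built once, not on every call.
NODE_0_CATEGORIES = parse_categorical_threshold("4||6||7||8||9||22||28||32||63||64")


def matches_split(feature_value):
    """Membership test for a categorical split, using a hash lookup."""
    return feature_value in NODE_0_CATEGORIES
```

A hash lookup keeps the cost per split roughly constant, whereas a chain of inlined comparisons grows with the number of categories.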

Support scikit-learn GLM models

Wow, the newest scikit-learn release has introduced some GLM models!

They should be a good competitor for the recently added GLMs from statsmodels.

https://scikit-learn.org/stable/modules/classes.html#generalized-linear-models-glm-for-regression

linear_model.PoissonRegressor
linear_model.TweedieRegressor
linear_model.GammaRegressor

Planned support for sklearn pipelines?

Is there any conceivable way to convert a pipeline that includes other steps like feature extractions, etc?

I know this would be quite the undertaking if not currently supported. Just really love the idea of converting to no dependencies to move ML functionality to the edge. Great work!

Thanks!

Move Dart language to a different bucket of E2E tests on Travis

Although Dart has been configured in .travis.yaml, I don't see it being executed. E.g., a recent master build: https://travis-ci.org/BayesWitnesses/m2cgen/jobs/660185796.
Error in the output:

Unknown pytest.mark.dart - is this a typo?  You can register custom marks to avoid this warning

CC: @StrikerRUS

add option to save generated code into file

I'm sorry if I missed this functionality, but the CLI version certainly doesn't have it (I saw the related code only in generate_code_examples.py). I think it would be very useful for eliminating the copy-paste step, especially for large models.

Of course, piping is a solution, but not for development in Jupyter Notebook, for example.
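
Until such an option exists, the string returned by an exporter can be written to disk by hand, which also works from a notebook. A minimal sketch: `save_generated_code` is a hypothetical helper, and the `code` value is a placeholder standing in for the output of a call such as `m2cgen.export_to_c(model)`.

```python
import tempfile
from pathlib import Path


def save_generated_code(code, path):
    """Write generated source code to path and return the path."""
    target = Path(path)
    target.write_text(code, encoding="utf-8")
    return target


# Placeholder for the string an exporter would return.
code = "double score(double * input) {\n    return 0.0;\n}\n"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_generated_code(code, Path(tmp) / "model.c")
    round_trip = saved.read_text(encoding="utf-8")
```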

Planned support for raw xgb.Booster.model and raw lgb.Booster.model?

Can Support Batch Predictor?

Prepare for release 0.1.0

Classification support for ensemble models.
Classification for Python.
setup.py and release procedure.
Enable more sklearn models (like remaining linear models).
README + docs + examples.
Revisit the library API (exporters).
Implement CLI.
Deal with Python limitations on nested function calls (#57)

Optional:

C language support
XGBoost/LightGBM

Large LightGBM causes javac error "Code too Large"

When generating code for a large number of trees, the generated code exceeds the 64KB limit in Java.

From Stack Overflow:

A single method in a Java class may be at most 64KB of bytecode.

One solution is to add subfunctions https://github.com/BayesWitnesses/m2cgen/blob/master/m2cgen/assemblers/boosting.py#L43-L48 instead of having the body of every tree inside subroutine0. The amount of code that will fit inside each function depends on its depth + width, so we might require some heuristic or tunable parameter. In my case, I ended up with 10 trees per subfunction.

I'm not sure if there are similar limits in other languages.
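
The splitting idea can be sketched like this. It is an illustration only and does not reflect how m2cgen's assemblers are built: `chunk` and `emit_java_methods` are hypothetical helpers, each tree is assumed to be available as a Java expression string, and `trees_per_method` is the tunable parameter mentioned above.

```python
def chunk(items, size):
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def emit_java_methods(tree_exprs, trees_per_method=10):
    """Emit one Java method per group of trees plus a score() that sums them.

    tree_exprs are assumed to be Java expressions (as strings) that each
    evaluate one tree against `input`.
    """
    lines = []
    method_names = []
    for index, group in enumerate(chunk(tree_exprs, trees_per_method)):
        name = f"subroutine{index}"
        method_names.append(name)
        lines.append(f"static double {name}(double[] input) {{")
        lines.append("    return " + " + ".join(group) + ";")
        lines.append("}")
    calls = " + ".join(f"{name}(input)" for name in method_names)
    lines.append("public static double score(double[] input) {")
    lines.append(f"    return {calls};")
    lines.append("}")
    return "\n".join(lines)


# 25 toy trees end up in three methods of 10, 10, and 5 trees.
trees = [f"(input[0] <= {i}.5 ? 0.1 : -0.1)" for i in range(25)]
java_source = emit_java_methods(trees)
```

A fixed tree count is the simplest heuristic; a size-based one would estimate the emitted code per tree instead.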

drop numpy dependency from Python code for cases without vectors

According to this line, it seems that numpy is used as a default math library for runtime even when we do not operate with vectors.

m2cgen/m2cgen/interpreters/python/interpreter.py

Lines 30 to 31 in 2475f3c

if self.with_vectors or self.with_math_module:
    self._cg.add_dependency("numpy", alias="np")

Let me describe two advantages of dropping numpy where it's possible.

The first one is an extra dependency. Even though numpy is a sort of "classic" dependency and there should be no problems installing it, it requires additional effort on the user's side. Also, some companies have very strict security policies that prohibit using pip (conda, brew, and other package managers). So, I guess, for them raw Python may be the preferable solution where it's possible.

The second one is speed. numpy is built for efficient vector math; in other cases it only adds redundant computational cost. Consider the following example. Take this generated Python code from the repo, change the return type from np.array to a plain list, and make the following replacements in the script:

numpy -> math
np.exp -> math.exp
np.power -> math.pow

Here is what we get after removing numpy:

import math
def score_raw(input):
    var0 = (0) - (0.25)
    var1 = math.exp((var0) * ((((math.pow((5.4) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((4.5) - (input[2]), 2))) + (math.pow((1.5) - (input[3]), 2))))
    var2 = math.exp((var0) * ((((math.pow((6.2) - (input[0]), 2)) + (math.pow((2.2) - (input[1]), 2))) + (math.pow((4.5) - (input[2]), 2))) + (math.pow((1.5) - (input[3]), 2))))
    var3 = math.exp((var0) * ((((math.pow((5.0) - (input[0]), 2)) + (math.pow((2.3) - (input[1]), 2))) + (math.pow((3.3) - (input[2]), 2))) + (math.pow((1.0) - (input[3]), 2))))
    var4 = math.exp((var0) * ((((math.pow((5.9) - (input[0]), 2)) + (math.pow((3.2) - (input[1]), 2))) + (math.pow((4.8) - (input[2]), 2))) + (math.pow((1.8) - (input[3]), 2))))
    var5 = math.exp((var0) * ((((math.pow((5.0) - (input[0]), 2)) + (math.pow((2.0) - (input[1]), 2))) + (math.pow((3.5) - (input[2]), 2))) + (math.pow((1.0) - (input[3]), 2))))
    var6 = math.exp((var0) * ((((math.pow((6.7) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((5.0) - (input[2]), 2))) + (math.pow((1.7) - (input[3]), 2))))
    var7 = math.exp((var0) * ((((math.pow((7.0) - (input[0]), 2)) + (math.pow((3.2) - (input[1]), 2))) + (math.pow((4.7) - (input[2]), 2))) + (math.pow((1.4) - (input[3]), 2))))
    var8 = math.exp((var0) * ((((math.pow((4.9) - (input[0]), 2)) + (math.pow((2.4) - (input[1]), 2))) + (math.pow((3.3) - (input[2]), 2))) + (math.pow((1.0) - (input[3]), 2))))
    var9 = math.exp((var0) * ((((math.pow((6.3) - (input[0]), 2)) + (math.pow((2.5) - (input[1]), 2))) + (math.pow((4.9) - (input[2]), 2))) + (math.pow((1.5) - (input[3]), 2))))
    var10 = math.exp((var0) * ((((math.pow((6.0) - (input[0]), 2)) + (math.pow((2.7) - (input[1]), 2))) + (math.pow((5.1) - (input[2]), 2))) + (math.pow((1.6) - (input[3]), 2))))
    var11 = math.exp((var0) * ((((math.pow((5.7) - (input[0]), 2)) + (math.pow((2.6) - (input[1]), 2))) + (math.pow((3.5) - (input[2]), 2))) + (math.pow((1.0) - (input[3]), 2))))
    var12 = math.exp((var0) * ((((math.pow((5.1) - (input[0]), 2)) + (math.pow((3.8) - (input[1]), 2))) + (math.pow((1.9) - (input[2]), 2))) + (math.pow((0.4) - (input[3]), 2))))
    var13 = math.exp((var0) * ((((math.pow((4.4) - (input[0]), 2)) + (math.pow((2.9) - (input[1]), 2))) + (math.pow((1.4) - (input[2]), 2))) + (math.pow((0.2) - (input[3]), 2))))
    var14 = math.exp((var0) * ((((math.pow((5.7) - (input[0]), 2)) + (math.pow((4.4) - (input[1]), 2))) + (math.pow((1.5) - (input[2]), 2))) + (math.pow((0.4) - (input[3]), 2))))
    var15 = math.exp((var0) * ((((math.pow((5.8) - (input[0]), 2)) + (math.pow((4.0) - (input[1]), 2))) + (math.pow((1.2) - (input[2]), 2))) + (math.pow((0.2) - (input[3]), 2))))
    var16 = math.exp((var0) * ((((math.pow((5.1) - (input[0]), 2)) + (math.pow((3.3) - (input[1]), 2))) + (math.pow((1.7) - (input[2]), 2))) + (math.pow((0.5) - (input[3]), 2))))
    var17 = math.exp((var0) * ((((math.pow((5.7) - (input[0]), 2)) + (math.pow((3.8) - (input[1]), 2))) + (math.pow((1.7) - (input[2]), 2))) + (math.pow((0.3) - (input[3]), 2))))
    var18 = math.exp((var0) * ((((math.pow((4.3) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((1.1) - (input[2]), 2))) + (math.pow((0.1) - (input[3]), 2))))
    var19 = math.exp((var0) * ((((math.pow((4.5) - (input[0]), 2)) + (math.pow((2.3) - (input[1]), 2))) + (math.pow((1.3) - (input[2]), 2))) + (math.pow((0.3) - (input[3]), 2))))
    var20 = math.exp((var0) * ((((math.pow((6.3) - (input[0]), 2)) + (math.pow((2.7) - (input[1]), 2))) + (math.pow((4.9) - (input[2]), 2))) + (math.pow((1.8) - (input[3]), 2))))
    var21 = math.exp((var0) * ((((math.pow((6.0) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((4.8) - (input[2]), 2))) + (math.pow((1.8) - (input[3]), 2))))
    var22 = math.exp((var0) * ((((math.pow((6.3) - (input[0]), 2)) + (math.pow((2.8) - (input[1]), 2))) + (math.pow((5.1) - (input[2]), 2))) + (math.pow((1.5) - (input[3]), 2))))
    var23 = math.exp((var0) * ((((math.pow((5.8) - (input[0]), 2)) + (math.pow((2.8) - (input[1]), 2))) + (math.pow((5.1) - (input[2]), 2))) + (math.pow((2.4) - (input[3]), 2))))
    var24 = math.exp((var0) * ((((math.pow((6.1) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((4.9) - (input[2]), 2))) + (math.pow((1.8) - (input[3]), 2))))
    var25 = math.exp((var0) * ((((math.pow((7.7) - (input[0]), 2)) + (math.pow((2.6) - (input[1]), 2))) + (math.pow((6.9) - (input[2]), 2))) + (math.pow((2.3) - (input[3]), 2))))
    var26 = math.exp((var0) * ((((math.pow((6.9) - (input[0]), 2)) + (math.pow((3.1) - (input[1]), 2))) + (math.pow((5.1) - (input[2]), 2))) + (math.pow((2.3) - (input[3]), 2))))
    var27 = math.exp((var0) * ((((math.pow((6.3) - (input[0]), 2)) + (math.pow((3.3) - (input[1]), 2))) + (math.pow((6.0) - (input[2]), 2))) + (math.pow((2.5) - (input[3]), 2))))
    var28 = math.exp((var0) * ((((math.pow((4.9) - (input[0]), 2)) + (math.pow((2.5) - (input[1]), 2))) + (math.pow((4.5) - (input[2]), 2))) + (math.pow((1.7) - (input[3]), 2))))
    var29 = math.exp((var0) * ((((math.pow((6.0) - (input[0]), 2)) + (math.pow((2.2) - (input[1]), 2))) + (math.pow((5.0) - (input[2]), 2))) + (math.pow((1.5) - (input[3]), 2))))
    var30 = math.exp((var0) * ((((math.pow((7.9) - (input[0]), 2)) + (math.pow((3.8) - (input[1]), 2))) + (math.pow((6.4) - (input[2]), 2))) + (math.pow((2.0) - (input[3]), 2))))
    var31 = math.exp((var0) * ((((math.pow((7.2) - (input[0]), 2)) + (math.pow((3.0) - (input[1]), 2))) + (math.pow((5.8) - (input[2]), 2))) + (math.pow((1.6) - (input[3]), 2))))
    var32 = math.exp((var0) * ((((math.pow((7.7) - (input[0]), 2)) + (math.pow((3.8) - (input[1]), 2))) + (math.pow((6.7) - (input[2]), 2))) + (math.pow((2.2) - (input[3]), 2))))
    return [(((((((((((((((((((-0.08359187780790468) + ((var1) * (-0.0))) + ((var2) * (-0.0))) + ((var3) * (-0.4393498355605194))) + ((var4) * (-0.009465620856664334))) + ((var5) * (-0.16223369966927))) + ((var6) * (-0.26861888775075243))) + ((var7) * (-0.4393498355605194))) + ((var8) * (-0.4393498355605194))) + ((var9) * (-0.0))) + ((var10) * (-0.0))) + ((var11) * (-0.19673905328606292))) + ((var12) * (0.3340655283922188))) + ((var13) * (0.3435087305152051))) + ((var14) * (0.4393498355605194))) + ((var15) * (0.0))) + ((var16) * (0.28614124535416424))) + ((var17) * (0.11269159286168087))) + ((var18) * (0.0))) + ((var19) * (0.4393498355605194)), (((((((((((((((((((((-0.18563912331454907) + ((var20) * (-0.0))) + ((var21) * (-0.06014273244194299))) + ((var22) * (-0.0))) + ((var23) * (-0.031132453078851926))) + ((var24) * (-0.0))) + ((var25) * (-0.3893079321588921))) + ((var26) * (-0.06738007627290196))) + ((var27) * (-0.1225075748937126))) + ((var28) * (-0.3893079321588921))) + ((var29) * (-0.29402231709614085))) + ((var30) * (-0.3893079321588921))) + ((var31) * (-0.0))) + ((var32) * (-0.028242141062729226))) + ((var12) * (0.16634667752431267))) + ((var13) * (0.047772685163074764))) + ((var14) * (0.3893079321588921))) + ((var15) * (0.3893079321588921))) + ((var16) * (0.0))) + ((var17) * (0.0))) + ((var18) * (0.3893079321588921))) + ((var19) * (0.3893079321588921)), ((((((((((((((((((((((((0.5566649875797668) + ((var20) * (-25.563066587228416))) + ((var21) * (-38.35628154976547))) + ((var22) * (-38.35628154976547))) + ((var23) * (-0.0))) + ((var24) * (-38.35628154976547))) + ((var25) * (-0.0))) + ((var26) * (-0.0))) + ((var27) * (-0.0))) + ((var28) * (-6.2260303727828745))) + ((var29) * (-18.42781911624364))) + ((var30) * (-0.14775026537286423))) + ((var31) * (-7.169755983020096))) + ((var32) * (-0.0))) + ((var1) * (12.612328267927264))) + ((var2) * (6.565812506955159))) + ((var3) * (0.0))) + ((var4) * (38.35628154976547))) + ((var5) * (0.0))) + ((var6) * 
(38.35628154976547))) + ((var7) * (0.0))) + ((var8) * (0.0))) + ((var9) * (38.35628154976547))) + ((var10) * (38.35628154976547))) + ((var11) * (0.0))]

And here are some timings:

%%timeit -n 10000
score([1, 2, 3, 4])

310 µs ± 658 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit -n 10000
score_raw([1, 2, 3, 4])

39.4 µs ± 136 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

The results seem to be identical:

np.testing.assert_allclose(score([1, 2, 3, 4]), score_raw([1, 2, 3, 4]))

Please share your thoughts about this refactoring.

Add support for Gradient Boosting from scikit-learn

Is there any way to transpile sklearn GBM models? I know it's not supported in the library. But maybe some way to convert it into one of the supported models, and then transpile. Any leads would be appreciated. Thanks!

Any plan for Rust generation?

Code generated for XGBoost models returns invalid scores when tree_method is set to "hist"

I have trained XGBoost models in Python and am using the CLI to convert the serialized models to pure Python. However, when I use the generated pure Python code, the results differ from the predictions made with the model directly.

Python 3.7
xgboost 0.90

My model has a large number of parameters (somewhat over 500).
Here are predicted class probabilities from the original model:

Here are the same predicted probabilities using the generated python code via m2cgen:

We can see that the results are similar but not the same. The result is a significant number of cases that are moved into different classes between the two sets of predictions.

I have also tested this with binary classification models and have the same issues.

What might cause an invalid load key on conversion?

When calling m2cgen tpot_classify.pkl --language go on a perfectly fine Pickle file, I receive the following error:

Traceback (most recent call last):
  File "/usr/local/bin/m2cgen", line 10, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/site-packages/m2cgen/cli.py", line 86, in main
    print(generate_code(args))
  File "/usr/local/lib/python3.7/site-packages/m2cgen/cli.py", line 71, in generate_code
    model = pickle.load(f)
_pickle.UnpicklingError: invalid load key, '\x00'.

I'm curious if you might know what would cause this NULL byte to appear?

Add support for XGBoost Random Forest for multiclass task

Initial support for regression and binary classification tasks was added in #157. Unfortunately, the multiclass case is not so trivial, and adding support for it requires deeper knowledge of XGBoost internals (dumped model representation and prediction logic).

Add support for probit and cauchy link functions in statsmodels GLM

Refer to #195 (comment).

probit
cauchy

Better document the usage of OOP API

I can't find any documentation on how to use m2cgen via the OOP API. If I'm not mistaken, we only have examples of how to use it via the functional API. However, the OOP API gives more options to customize code generation. I mean, the only way to change e.g. bin_depth_threshold is to change attributes of the interpreter class. I believe this is very important because in most cases we set default values more or less arbitrarily, just to pass particular tests. For instance, refer to

m2cgen/m2cgen/interpreters/r/interpreter.py

Lines 11 to 20 in 8115243

# R doesn't allow to have more than 50 nested if, [, [[, {, ( calls.
# It raises contextstack overflow error not only for explicitly nested
# calls, but also if met above mentioned number of parentheses
# in one expression. Given that there is no way to control
# the number of parentheses in one expression for now,
# the following variable set to 50 / 2 value is expected to prevent
# contextstack overflow error occurrence.
# This value is just a heuristic and is subject to change in the future
# based on the users' feedback.
bin_depth_threshold = 25

Also, we may not know about various other limitations of the supported languages. For example, today I learned from a recent great blog post that C# has a limit on the number of local variables.

There is a decent Python library, m2cgen, which can export to C, C#, Dart, Go, Haskell, Java, JavaScript, PHP, PowerShell, Python, R, Ruby, and Visual Basic. The output is a ready-made module that can be compiled with your favorite compiler (i.e., without using any DLLs!). m2cgen has some limitations on complexity (for example, C# can hit the limit of 64 thousand local variables; you can try to work around it by creating several small procedures instead of one large one).
https://imageman72.livejournal.com/47186.html

One can easily overcome similar limitations with the help of our mixins by inheriting them in a custom class, without the need to modify the package source code. This can't be done via the functional API.

NotImplementedError: Model int is not supported OpenNMT-py

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/bin/m2cgen", line 10, in <module>
    sys.exit(main())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/m2cgen/cli.py", line 85, in main
    print(generate_code(args))
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/m2cgen/cli.py", line 80, in generate_code
    return exporter(model, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/m2cgen/exporters.py", line 47, in export_to_python
    return _export(model, interpreter)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/m2cgen/exporters.py", line 70, in _export
    assembler_cls = assemblers.get_assembler_cls(model)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/m2cgen/assemblers/__init__.py", line 76, in get_assembler_cls
    "Model {} is not supported".format(model_name))
NotImplementedError: Model int is not supported

Code generated for XGBoost models returns wrong scores when the feature input includes zeros, which XGBoost treats as "missing"

I'm trying to use m2cgen to generate JS code for an XGBoost model, but I found that if the feature input includes zeros, the result calculated by the generated JS differs considerably from the result predicted by the model. For example, if the feature input is [0.4444, 0.55555, 0.3545, 0.22333], the result calculated by the generated JS equals the model's prediction, but if the feature input is [0.4444, 0, 0, 0.22333], the two results are very different: one may be 0.22 while the other is 0.04. After validating with a demo, we found that m2cgen does not handle the "missing" condition: when XGBoost treats a value as "missing", m2cgen processes it as "yes".
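
To make the reported difference concrete, here is a sketch of a single split that honours a separate branch for missing values. It rests on assumptions: each node records which child missing values follow (XGBoost's text dump lists `yes`, `no`, and `missing` per split), and the model treats a marker value, here zero, as missing. The helpers `is_missing` and `next_branch` are hypothetical and are not part of m2cgen or XGBoost.

```python
import math


def is_missing(value, missing_marker):
    """True if value should be treated as missing for this model."""
    if math.isnan(missing_marker):
        return math.isnan(value)
    return value == missing_marker


def next_branch(value, threshold, missing_branch, missing_marker=float("nan")):
    """Pick the "yes" or "no" child of a split node.

    missing_branch is the child recorded for missing values at this node.
    """
    if is_missing(value, missing_marker):
        return missing_branch
    return "yes" if value < threshold else "no"


# With zero as the missing marker, 0.0 follows the node's missing branch.
with_marker = next_branch(0.0, 0.5, missing_branch="no", missing_marker=0.0)
# Without it, 0.0 is an ordinary value and 0.0 < 0.5 selects "yes".
without_marker = next_branch(0.0, 0.5, missing_branch="no")
```

Whenever a node's missing branch is "no", the two conventions send a zero input down different subtrees, which would explain diverging scores.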

Remove numpy from default PythonInterpreter and potentially introduce PythonNumpyInterpreter

Right now only Python uses a third-party library (specifically numpy) for linear algebra. This is inconsistent with:

our mission "with zero dependencies"
other languages.

The first step was to drop numpy from cases without vectors, implemented in PR #111.

As the second step, I want to suggest dropping numpy altogether from PythonInterpreter and potentially implementing PythonNumpyInterpreter for use in cases where it would be beneficial.

As for the user API, I see 2 options:

Adding a new method export_to_python_with_numpy
Adding a parameter with_numpy to the existing export_to_python method, which would be False by default.

I personally think the first option is better, as users would have a higher chance of noticing an extra method than an extra parameter with a default value.

memcpy instead of assign_array

For the generated C code, you could use memcpy instead of your assign_array function.

Split BaseCodeGenerator into 2: CLike and Python (approx.)

Support for LightGBM Booster and XGBoost Booster

We're training our LightGBM model outside of Python (in Spark), so we need to load it from a model file before passing it to m2c. I don't believe LightGBM can load it directly into LGBMRegressor, though; it must be loaded into lgb.Booster.

It would be nice if m2cgen supported lgb.Booster.

Example

import lightgbm as lgb
import m2cgen as m2c

model = lgb.Booster(model_file='model.txt')

# this fails
# m2c.export_to_java(model)

# This works but is awkward 
from lightgbm.sklearn import LGBMRegressor
r = LGBMRegressor()
r._Booster = model

code = m2c.export_to_java(r)

Can support native xgboost.core.Booster model?

convert lightgbm gbdt bug

The export function fails with a "Model Booster is not supported" error when using a LightGBM model trained with 'gbdt'. Below is a code snippet.

import os
import h5py

import lightgbm as lgb
import numpy as np
import m2cgen as m2c

with h5py.File('./sample.hdf5') as f:
    X, y = f['X'][()], f['y'][()]

dtrain=lgb.Dataset(X[:1000,:],label=y[:1000])

param = {
    'boosting_type': 'gbdt',
    'objective': 'binary',
    'num_leaves':2**4,
    'max_depth': 4, 
    'learning_rate': 0.1,
    'verbose': 0}
n_estimators = 5

bst = lgb.train(param, dtrain, n_estimators)
code = m2c.export_to_c(bst)

Additionally, here is the error output:

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-5-226faacde9f0> in <module>
----> 1 code = m2c.export_to_java(bst)
      2 print(code)

C:\ProgramData\Anaconda3\envs\light_gbm\lib\site-packages\m2cgen\exporters.py in export_to_java(model, package_name, class_name, indent)
     26         class_name=class_name,
     27         indent=indent)
---> 28     return _export(model, interpreter)
     29 
     30 

C:\ProgramData\Anaconda3\envs\light_gbm\lib\site-packages\m2cgen\exporters.py in _export(model, interpreter)
     87 
     88 def _export(model, interpreter):
---> 89     assembler_cls = assemblers.get_assembler_cls(model)
     90     model_ast = assembler_cls(model).assemble()
     91     return interpreter.interpret(model_ast)

C:\ProgramData\Anaconda3\envs\light_gbm\lib\site-packages\m2cgen\assemblers\__init__.py in get_assembler_cls(model)
     74     if not assembler_cls:
     75         raise NotImplementedError(
---> 76             "Model {} is not supported".format(model_name))
     77 
     78     return assembler_cls

NotImplementedError: Model Booster is not supported

sigmoid and softmax as language-specific functions

Maybe it would be better to require supported languages to implement sigmoid and softmax functions rather than defining them as expressions? It would improve the readability of the generated code and speed it up through more efficient native implementations.

Also, we can fall back to the current expressions when the functions are missing.

I'm raising this issue because I think we currently have an inconsistency: we require languages to implement a Tanh function, but at the same time define sigmoid as an expression, while both can be written via the Exp function.

m2cgen/m2cgen/assemblers/utils.py

Lines 83 to 100 in dcd62f4

def sigmoid_expr(expr, to_reuse=False):
    neg_expr = ast.BinNumExpr(ast.NumVal(0), expr, ast.BinNumOpType.SUB)
    exp_expr = ast.ExpExpr(neg_expr)
    return ast.BinNumExpr(
        ast.NumVal(1),
        ast.BinNumExpr(ast.NumVal(1), exp_expr, ast.BinNumOpType.ADD),
        ast.BinNumOpType.DIV,
        to_reuse=to_reuse)


def softmax_exprs(exprs):
    exp_exprs = [ast.ExpExpr(e, to_reuse=True) for e in exprs]
    exp_sum_expr = apply_op_to_expressions(ast.BinNumOpType.ADD, *exp_exprs,
                                           to_reuse=True)
    return [
        ast.BinNumExpr(e, exp_sum_expr, ast.BinNumOpType.DIV)
        for e in exp_exprs
    ]

m2cgen/m2cgen/interpreters/visual_basic/tanh.bas

Lines 1 to 11 in dcd62f4

Function Tanh(ByVal number As Double) As Double
    If number > 44.0 Then ' exp(2*x) <= 2^127
        Tanh = 1.0
        Exit Function
    End If
    If number < -44.0 Then
        Tanh = -1.0
        Exit Function
    End If
    Tanh = (Math.Exp(2 * number) - 1) / (Math.Exp(2 * number) + 1)
End Function
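
As a rough illustration of what a language-specific sigmoid could look like, and of the claim that both functions reduce to Exp, here is a sketch in Python. The function names are hypothetical and this is not m2cgen code; the branch on the sign of `x` is a common way to avoid overflow in `math.exp` and is my addition, not something the issue proposes.

```python
import math


def sigmoid(x):
    """Logistic function written so that math.exp never receives a large positive argument."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def tanh_via_sigmoid(x):
    """tanh expressed through the same primitive: tanh(x) = 2 * sigmoid(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0
```

A native function like this also removes the need for the hard-coded cut-offs used in the Visual Basic Tanh above, since saturation falls out of the arithmetic.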

how to give input to the generated code

Error: Main method not found in class Extratreesregressor, please define the main method as:
public static void main(String[] args)
or a JavaFX application class must extend javafx.application.Application
How do I run it without a main file, and what is the input to the generated code? Kindly help, I'm a newbie.

Migrate to f-strings

m2cgen's codebase heavily utilizes string formatting and concatenation mechanisms. f-strings are known to be the fastest way to format strings, and they increase code readability a lot.

Source: https://cito.github.io/blog/f-strings/.

The only problem is that they are supported starting from Python 3.6, while m2cgen currently supports Python 3.5.

m2cgen/setup.py

Line 28 in 7626a60

"Programming Language :: Python :: 3.5",

My suggestion is the following: mark the next release, 0.8.0, as the last one that supports Python 3.5 and drop the support in the 0.9.0 release. That will be at approximately the same time as Python 3.5 reaches its EOL (2020-09-13).
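
For illustration, here is the same message built both ways. The message text is the "Model {} is not supported" error that appears in tracebacks elsewhere on this page; `model_name` is a placeholder value.

```python
model_name = "Booster"  # placeholder value

# str.format style, as used in the codebase today.
old_style = "Model {} is not supported".format(model_name)

# Equivalent f-string (requires Python 3.6+).
new_style = f"Model {model_name} is not supported"
```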

C# support

Is there a roadmap for converting models to C# code? I work at a Microsoft shop and this would be great to use instead of ML.NET since this is so much more lightweight. Thanks!

When I try to export an SVC model to C, it takes a long time without responding

Compared to exporting XGBoost, exporting the sklearn SVC model takes very long and doesn't show any error or information. I don't know how long I should wait.

PCA

Could you support PCA transformation (as it's just a matrix multiplication when the algorithm is fitted)?
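
For reference, the transformation in question can be sketched in dependency-free Python. It assumes the fitted model exposes a mean vector and a component matrix (scikit-learn's PCA calls them `mean_` and `components_`) and that whitening is disabled; `pca_transform` is a hypothetical helper, not an existing m2cgen function.

```python
def pca_transform(x, mean, components):
    """Project one sample: centre it, then take a dot product with each component row."""
    centered = [xi - mi for xi, mi in zip(x, mean)]
    return [sum(c * v for c, v in zip(row, centered)) for row in components]


# Toy values for illustration only.
mean = [1.0, 2.0]
components = [[1.0, 0.0], [0.0, 1.0]]
projected = pca_transform([3.0, 5.0], mean, components)
```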

Code generated from XGBoost model includes "None"

When transpiling XGBRegressor and XGBClassifier models such as the following basic example:

from xgboost import XGBRegressor
from sklearn import datasets
import m2cgen as m2c

iris_data = datasets.load_iris(return_X_y=True)

mod = XGBRegressor(booster="gblinear", max_depth=2)
X, y = iris_data
mod.fit(X[:120], y[:120])

code = m2c.export_to_c(mod)

print(code)

the resulting C code includes a Pythonesque None:

double score(double * input) {
    return (None) + (((((-0.391196) + ((input[0]) * (-0.0196191))) + ((input[1]) * (-0.11313))) + ((input[2]) * (0.137024))) + ((input[3]) * (0.645197)));
}

Probably I am missing some basic step?

Reduce RAM and ROM footprint

I'm using m2cgen to convert some classifier to C. It works great and results are consistent, thanks for the library!

My problem is that the compiled binaries are too large to fit on my embedded device. I checked, and the binaries are around double the size of those created with e.g. sklearn_porter. However, m2cgen is the only library that can convert my Python classifiers to C without introducing errors into the classification.
Even if I reduce the size of the classifier, I run into the problem that the RAM of the device is exceeded (think of something in the kB range).

Do you have any idea how the footprint of the C code could be reduced?

m2cgen output for xgboost with binary:logistic objective returns raw (not transformed) scores

Our xgboost models use the 'binary:logistic' objective function; however, the m2cgen-converted versions of the models return raw scores instead of the transformed scores.

This is fine as long as the user knows this is happening! I didn't, so it took a while to figure out what was going on. I'm wondering if perhaps a useful warning could be raised for users to alert them of this issue? A warning could include a note that they can transform these scores back to the expected probabilities [0, 1] by prob = logistic.cdf(score - base_score) where base_score is an attribute of the xgboost model.

In our case, I'd like to minimize unnecessary processing on the device, so I am actually happy with the current m2cgen output and will instead inverse transform our threshold when evaluating the model output from the transpiled model...but it did take me a bit before I figured out what was going on, which is why I'm suggesting that a user friendly message might be raised when an unsupported objective function is encountered.
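The threshold trick described above can be sketched in plain Python. This is an illustration under stated assumptions, not m2cgen or xgboost code: `sigmoid` and `logit` are hypothetical helper names, the threshold value is made up, and any `base_score` offset mentioned in this issue is deliberately left out. Since the sigmoid is strictly increasing, comparing a raw score with the logit of the threshold gives the same decision as comparing the probability with the threshold itself.

```python
import math


def sigmoid(score):
    # Maps a raw score (log-odds) to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-score))


def logit(prob):
    # Inverse of sigmoid: maps a probability back to a raw score.
    return math.log(prob / (1.0 - prob))


# Hypothetical decision threshold on the probability scale.
threshold = 0.8
# Equivalent threshold on the raw-score scale, computed once offline.
raw_threshold = logit(threshold)

for raw_score in (-3.0, 0.0, 1.2, 2.5):
    # Both comparisons agree, but the second one needs no exp() on the device.
    print(raw_score, sigmoid(raw_score) >= threshold, raw_score >= raw_threshold)
```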

Thanks for creating & sharing this great tool!

How to run generated java code without main method?

Sorry, I'm a newbie in Java, but I need to run this code for an ML program.

How do I run this code?
public class Model {

public static double score(double[] input) {
    return (((((((((((((36.45948838508965) + ((input[0]) * (-0.10801135783679647))) + ((input[1]) * (0.04642045836688297))) + ((input[2]) * (0.020558626367073608))) + ((input[3]) * (2.6867338193449406))) + ((input[4]) * (-17.76661122830004))) + ((input[5]) * (3.8098652068092163))) + ((input[6]) * (0.0006922246403454562))) + ((input[7]) * (-1.475566845600257))) + ((input[8]) * (0.30604947898516943))) + ((input[9]) * (-0.012334593916574394))) + ((input[10]) * (-0.9527472317072884))) + ((input[11]) * (0.009311683273794044))) + ((input[12]) * (-0.5247583778554867));
}
}

Without a main method, I'm a bit confused. Thanks!

Flaky e2e test for the XGBoost model with the 'gblinear' booster

Context from @StrikerRUS:

Now Go is failing (refer to #200 (comment)):

=================================== FAILURES ===================================
_ test_e2e[xgboost_XGBClassifier - go_lang - train_model_classification_binary2] _
estimator = XGBClassifier(base_score=0.6, booster='gblinear', colsample_bylevel=None,
              colsample_bynode=None, colsamp...ambda=0, scale_pos_weight=1, subsample=None,
              tree_method=None, validate_parameters=False, verbosity=None)
executor_cls = <class 'tests.e2e.executors.go.GoExecutor'>

...

expected=[0.04761511 0.9523849 ], actual=[0.047615, 0.952385]
expected=[0.06296992 0.9370301 ], actual=[0.06297, 0.93703]
expected=[0.12447995 0.87552005], actual=[0.124479, 0.875521]
expected=[0.0757848 0.9242152], actual=[0.075784, 0.924216]
expected=[0.8092151  0.19078489], actual=[0.809212, 0.190788]

BTW, in attempts to check my guess from #200 (comment), I found that coefs in gblinear are also float32:
https://github.com/dmlc/xgboost/blob/67d267f9da3b15a6e5a8393afae9be921a4e224b/src/gbm/gblinear_model.h#L110

https://github.com/dmlc/xgboost/blob/67d267f9da3b15a6e5a8393afae9be921a4e224b/src/gbm/gblinear_model.h#L120

https://github.com/dmlc/xgboost/blob/67d267f9da3b15a6e5a8393afae9be921a4e224b/src/gbm/gblinear_model.h#L82

https://github.com/dmlc/xgboost/blob/67d267f9da3b15a6e5a8393afae9be921a4e224b/src/gbm/gblinear_model.h#L91

and from #188 (comment) we know that bst_float is actually float
https://github.com/dmlc/xgboost/blob/8d06878bf9b778db68ae98f68d99a3557c7ea885/include/xgboost/base.h#L110-L111

Created dmlc/xgboost#5634.
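To illustrate why float32 storage matters when predictions are compared against double-precision arithmetic, the sketch below rounds a Python float (a 64-bit double) to the nearest 32-bit float using only the standard library. The helper name `to_float32` is an assumption made for this example; it is not part of xgboost or m2cgen.

```python
import struct


def to_float32(value):
    # Pack as a 32-bit IEEE 754 float and unpack again; the result is the
    # nearest float32 value, widened back to a Python float (64-bit).
    return struct.unpack("f", struct.pack("f", value))[0]


coef = 0.1  # hypothetical coefficient
as_float32 = to_float32(coef)

# The two values differ slightly because 0.1 is not exactly representable
# and float32 keeps fewer significant bits than float64.
print(repr(coef), repr(as_float32), abs(coef - as_float32))
```

Small per-coefficient differences of this kind can accumulate in a linear combination, which is one plausible source of mismatches in the last printed digits.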

RecursionError: maximum recursion depth exceeded

The problem here is the number of columns. My df shape is [1428 rows x 3100 columns]

CODE:

#%%

import pandas as pd

df = pd.read_csv("doc_vector.csv")

#%%

df = df.drop(columns=["document", "keywords"])

#%%

X = df.drop(columns=["category"]).values
y = df.filter(["category"]).values.ravel()

#%%

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.33,
    random_state=42
)

#%%

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]
svc = SVC(gamma='auto')
clf = GridSearchCV(svc, parameters)
clf.fit(X_train, y_train)
sorted(clf.cv_results_.keys())
"""
OUTPUT:
['mean_fit_time',
 'mean_score_time',
 'mean_test_score',
 'param_C',
 'param_gamma',
 'param_kernel',
 'params',
 'rank_test_score',
 'split0_test_score',
 'split1_test_score',
 'split2_test_score',
 'split3_test_score',
 'split4_test_score',
 'std_fit_time',
 'std_score_time',
 'std_test_score']
"""

#%%

clf.cv_results_
"""
OUTPUT:

{'mean_fit_time': array([1.81106772, 2.2082305 , 1.76409402, 1.627356  , 1.75857201,
        1.56271949, 1.75669813, 1.56454921, 1.49936023, 1.51295609,
        1.53478389, 1.54050641]),
 'std_fit_time': array([0.02475751, 0.01825799, 0.03386408, 0.01870898, 0.03761543,
        0.01331777, 0.03751674, 0.01426082, 0.01782674, 0.03306172,
        0.04617239, 0.03143336]),
 'mean_score_time': array([0.33980088, 0.41657252, 0.33987026, 0.32597055, 0.3384655 ,
        0.32319117, 0.33914285, 0.32390838, 0.31457753, 0.31692972,
        0.32040634, 0.32276688]),
 'std_score_time': array([0.00426356, 0.00054301, 0.00069592, 0.00271754, 0.00489743,
        0.00426302, 0.00481799, 0.00548289, 0.00591126, 0.00769205,
        0.00599371, 0.01014503]),
 'param_C': masked_array(data=[1, 1, 10, 10, 100, 100, 1000, 1000, 1, 10, 100, 1000],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, False, False],
        fill_value='?',
             dtype=object),
 'param_gamma': masked_array(data=[0.001, 0.0001, 0.001, 0.0001, 0.001, 0.0001, 0.001,
                    0.0001, --, --, --, --],
              mask=[False, False, False, False, False, False, False, False,
                     True,  True,  True,  True],
        fill_value='?',
             dtype=object),
 'param_kernel': masked_array(data=['rbf', 'rbf', 'rbf', 'rbf', 'rbf', 'rbf', 'rbf', 'rbf',
                    'linear', 'linear', 'linear', 'linear'],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'C': 1, 'gamma': 0.001, 'kernel': 'rbf'},
  {'C': 1, 'gamma': 0.0001, 'kernel': 'rbf'},
  {'C': 10, 'gamma': 0.001, 'kernel': 'rbf'},
  {'C': 10, 'gamma': 0.0001, 'kernel': 'rbf'},
  {'C': 100, 'gamma': 0.001, 'kernel': 'rbf'},
  {'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'},
  {'C': 1000, 'gamma': 0.001, 'kernel': 'rbf'},
  {'C': 1000, 'gamma': 0.0001, 'kernel': 'rbf'},
  {'C': 1, 'kernel': 'linear'},
  {'C': 10, 'kernel': 'linear'},
  {'C': 100, 'kernel': 'linear'},
  {'C': 1000, 'kernel': 'linear'}],
 'split0_test_score': array([0.78645833, 0.63020833, 0.85416667, 0.80208333, 0.84375   ,
        0.84895833, 0.84375   , 0.83854167, 0.83854167, 0.83854167,
        0.83854167, 0.83854167]),
 'split1_test_score': array([0.79057592, 0.64397906, 0.85863874, 0.80104712, 0.85340314,
        0.85340314, 0.85863874, 0.85340314, 0.85863874, 0.85863874,
        0.85863874, 0.85863874]),
 'split2_test_score': array([0.79057592, 0.64921466, 0.85863874, 0.81151832, 0.85863874,
        0.86387435, 0.85863874, 0.85863874, 0.85863874, 0.85863874,
        0.85863874, 0.85863874]),
 'split3_test_score': array([0.80628272, 0.64397906, 0.86387435, 0.80628272, 0.86387435,
        0.86910995, 0.85863874, 0.86387435, 0.86387435, 0.86387435,
        0.86387435, 0.86387435]),
 'split4_test_score': array([0.77486911, 0.60732984, 0.84816754, 0.78534031, 0.84293194,
        0.84816754, 0.84293194, 0.84816754, 0.84293194, 0.84293194,
        0.84293194, 0.84293194]),
 'mean_test_score': array([0.7897524 , 0.63494219, 0.85669721, 0.80125436, 0.85251963,
        0.85670266, 0.85251963, 0.85252509, 0.85252509, 0.85252509,
        0.85252509, 0.85252509]),
 'std_test_score': array([0.01006947, 0.01517817, 0.00525755, 0.00877064, 0.00819737,
        0.00835564, 0.00749881, 0.00873473, 0.00991084, 0.00991084,
        0.00991084, 0.00991084]),
 'rank_test_score': array([11, 12,  2, 10,  8,  1,  8,  3,  3,  3,  3,  3], dtype=int32)}
"""

#%%

clf.best_params_
"""
OUTPUT:
{'C': 100, 'gamma': 0.0001, 'kernel': 'rbf'}
"""

#%%

clf.best_estimator_
"""
OUTPUT:

SVC(C=100, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)
"""

#%%

import m2cgen as m2c

code = m2c.export_to_c(clf.best_estimator_)

print(code)

ERROR:


---------------------------------------------------------------------------

RecursionError                            Traceback (most recent call last)

<ipython-input-24-43271f44552b> in <module>
      1 import m2cgen as m2c
      2 
----> 3 code = m2c.export_to_c(clf.best_estimator_)
      4 
      5 print(code)

~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/exporters.py in export_to_c(model, indent)
     64     """
     65     interpreter = interpreters.CInterpreter(indent=indent)
---> 66     return _export(model, interpreter)
     67 
     68 

~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/exporters.py in _export(model, interpreter)
    197 def _export(model, interpreter):
    198     assembler_cls = assemblers.get_assembler_cls(model)
--> 199     model_ast = assembler_cls(model).assemble()
    200     return interpreter.interpret(model_ast)

~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/svm.py in assemble(self)
     37     def assemble(self):
     38         if self._output_size > 1:
---> 39             return self._assemble_multi_class_output()
     40         else:
     41             return self._assemble_single_output()

~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/svm.py in _assemble_multi_class_output(self)
     66         n_support_len = len(n_support)
     67 
---> 68         kernel_exprs = self._apply_kernel(support_vectors, to_reuse=True)
     69 
     70         support_ranges = []

~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/svm.py in _apply_kernel(self, support_vectors, to_reuse)
    100         kernel_exprs = []
    101         for v in support_vectors:
--> 102             kernel = self._kernel_fun(v)
    103             kernel_exprs.append(ast.SubroutineExpr(kernel, to_reuse=to_reuse))
    104         return kernel_exprs

~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/svm.py in _rbf_kernel(self, support_vector)
    113         ]
    114         kernel = utils.apply_op_to_expressions(ast.BinNumOpType.ADD,
--> 115                                                *elem_wise)
    116         kernel = utils.mul(self._neg_gamma_expr, kernel)
    117         return ast.ExpExpr(kernel)

~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/utils.py in apply_op_to_expressions(op, to_reuse, *exprs)
     55             apply_bin_op(current_expr, rest_exprs[0], op), *rest_exprs[1:])
     56 
---> 57     result = _inner(apply_bin_op(exprs[0], exprs[1], op), *exprs[2:])
     58     result.to_reuse = to_reuse
     59     return result

~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/utils.py in _inner(current_expr, *rest_exprs)
     53 
     54         return _inner(
---> 55             apply_bin_op(current_expr, rest_exprs[0], op), *rest_exprs[1:])
     56 
     57     result = _inner(apply_bin_op(exprs[0], exprs[1], op), *exprs[2:])

... last 1 frames repeated, from the frame below ...

~/PycharmProjects/jupyter_test/venv/lib/python3.6/site-packages/m2cgen/assemblers/utils.py in _inner(current_expr, *rest_exprs)
     53 
     54         return _inner(
---> 55             apply_bin_op(current_expr, rest_exprs[0], op), *rest_exprs[1:])
     56 
     57     result = _inner(apply_bin_op(exprs[0], exprs[1], op), *exprs[2:])

RecursionError: maximum recursion depth exceeded
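The traceback shows a helper that folds a list of expressions into a chain of binary operations by calling itself once per operand, so a row with 3100 features needs about 3100 nested Python frames, more than the default recursion limit of 1000. The sketch below reproduces that shape with plain tuples and contrasts it with an iterative fold. The names `fold_recursive` and `fold_iterative` and the tuple representation are assumptions for illustration; this is not m2cgen's code, and later stages that walk the resulting nested expression may still need their own treatment.

```python
from functools import reduce


def fold_recursive(op, first, *rest):
    # Same shape as the helper in the traceback: one Python frame per operand.
    if not rest:
        return first
    return fold_recursive(op, (op, first, rest[0]), *rest[1:])


def fold_iterative(op, *exprs):
    # Builds the same left-nested structure in a loop, so the call stack
    # does not grow with the number of operands.
    return reduce(lambda acc, expr: (op, acc, expr), exprs)


# Small inputs: both variants build the same nested expression.
print(fold_recursive("add", 1, 2, 3))
print(fold_iterative("add", 1, 2, 3))

# Large input: the iterative variant handles thousands of operands.
big = fold_iterative("add", *range(5000))
```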

C code generated from XGBoost model includes "None"

When running the following code:

from xgboost import XGBRegressor
from sklearn import datasets
import m2cgen as m2c

iris_data = datasets.load_iris(return_X_y=True)

mod = XGBRegressor()
X, y = iris_data
mod.fit(X, y)

code = m2c.export_to_c(mod)

print(code)

the printed C code includes a Pythonesque None:

...
 return (((((((((((((((((((((((((((((((((((((((((((((((((((((((
(((((((((((((((((((((((((((((((((((((((((((((None) + (var0)) +
 (var1)) + (var2)) + (var3)) + (var4)) + (var5)) + (var6)) +
 (var7)) + (var8)) + (var9)) + (var10)) + (var11)) + (var12)) +  ...
...

Refactor count_exprs function

Exclude IdExpr: #208 (comment).
Fix compatibility with fallback expressions: #208 (comment).

gcc crashes compiling output

Similar to #88, but in my case the problem isn't the size of the binary but that gcc sometimes runs out of memory building the C output.

Could the output be broken up over multiple files to reduce compiler memory use?

Will there be an R interface

Hello, do you plan to provide an interface for the R language? I think R is comparable to Python in some respects. Could you give R an interface?

Travis 50min limit... again

Today I saw our jobs on master hit the 50-minute Travis limit per job 3 times. I guess it's time to either review #243 or reorganize the jobs on Travis. Refer to #125 for past experience and to #114 for some further ideas.

cc @izeigerman

Do not set with_math_module flag for fallback expressions

Refer to #225 (comment).

In Boosting Assembler wrapping each estimator into a subroutine causes a performance degradation

I've recalled the real motivation behind not wrapping every individual estimator into its own subroutine: generating many nested function calls leads to a performance degradation in Java. The observed difference reaches 4x for larger models (e.g. XGBoost with 1000 estimators). The basic test I created (sorry about the Scala):

@ import com.github.m2cgen.ModelOld
import com.github.m2cgen.ModelOld

@ import com.github.m2cgen.ModelNew
import com.github.m2cgen.ModelNew

@ def nextRandomData(): Array[Double] = (0 until 4).map(_ => Random.nextDouble).toArray
defined function nextRandomData

@ def testScore: Unit = {
    val start = System.currentTimeMillis()
    (0 until 100000).foreach(_ => <ModelNew|ModelOld>.score(nextRandomData))
    println("Runtime: " + (System.currentTimeMillis() - start).toString)
  }

Results for ModelOld:

@ testScore
Runtime: 2973

For ModelNew:

@ testScore
Runtime: 10747

The test model was trained on the sklearn.datasets.load_iris() dataset. The classifier was created as follows:

model = XGBClassifier(n_estimators=1000)

In the attached archive I included the following:

ModelNew.java - Java code generated with the most recent master.
ModelOld.java - Java code generated with the release 0.5.0 version.
Models.jar - the jar containing both compiled sources.
xgboost_model2 - the trained estimator in Pickle format.

CC: @StrikerRUS FYI

add support for declarative languages

It seems that, especially for functional languages, it is very common to write a Score() function and apply it to some data.

At present, the assemblers assume that the target language is imperative.

LightGBM's original predictions differ from the transpiled JS predictions

I have an LGBMRegressor model (LightGBM v2.3.1) and I transpiled it to JavaScript code.

My dataset has about 60 numeric features (no categorical features). Features can contain missing values (they are np.nan in Python code and null in JavaScript).

This is my LightGBM model:

lgb.LGBMRegressor(
    objective='mse',
    max_depth=5,
    first_metric_only=True,
    boosting_type='gbdt',
    importance_type='gain',
    feature_fraction=np.sqrt(len(features))/len(features),
    subsample=0.4,
    seed=RANDOM_STATE,
    eta=0.02,
    nthread=0,
    reg_alpha=0.1,
    reg_lambda=0.1,
    num_leaves=31,
    n_estimators=50
)

Unfortunately I can't share the dataset.

The problem is that LightGBM's original predictions (from Python) differ from the transpiled JS predictions. They are similar, but different. If I plot a distribution of the differences between the original predictions and the JS predictions, I get a sort of normal distribution with mean=0, but in the tail of the distribution there are some pretty large differences.

	# One-vs-one decisions.
	decisions = []
	for i in range(n_support_len):
	    for j in range(i + 1, n_support_len):
	        kernel_weight_mul_ops = [
	            utils.mul(kernel_exprs[k], ast.NumVal(coef[i][k]))
	            for k in range(*support_ranges[j])
	        ]
	        kernel_weight_mul_ops.extend([
	            utils.mul(kernel_exprs[k], ast.NumVal(coef[j - 1][k]))
	            for k in range(*support_ranges[i])
	        ])
	        decision = utils.apply_op_to_expressions(
	            ast.BinNumOpType.ADD,
	            ast.NumVal(intercept[len(decisions)]),
	            *kernel_weight_mul_ops
	        )
	        decisions.append(decision)

	if self.with_vectors or self.with_math_module:
	    self._cg.add_dependency("numpy", alias="np")

	# R doesn't allow to have more than 50 nested if, [, [[, {, ( calls.
	# It raises contextstack overflow error not only for explicitly nested
	# calls, but also if met above mentioned number of parentheses
	# in one expression. Given that there is no way to control
	# the number of parentheses in one expression for now,
	# the following variable set to 50 / 2 value is expected to prevent
	# contextstack overflow error occurrence.
	# This value is just a heuristic and is subject to change in the future
	# based on the users' feedback.
	bin_depth_threshold = 25

	def sigmoid_expr(expr, to_reuse=False):
	    neg_expr = ast.BinNumExpr(ast.NumVal(0), expr, ast.BinNumOpType.SUB)
	    exp_expr = ast.ExpExpr(neg_expr)
	    return ast.BinNumExpr(
	        ast.NumVal(1),
	        ast.BinNumExpr(ast.NumVal(1), exp_expr, ast.BinNumOpType.ADD),
	        ast.BinNumOpType.DIV,
	        to_reuse=to_reuse)


	def softmax_exprs(exprs):
	    exp_exprs = [ast.ExpExpr(e, to_reuse=True) for e in exprs]
	    exp_sum_expr = apply_op_to_expressions(ast.BinNumOpType.ADD, *exp_exprs,
	                                           to_reuse=True)
	    return [
	        ast.BinNumExpr(e, exp_sum_expr, ast.BinNumOpType.DIV)
	        for e in exp_exprs
	    ]

	Function Tanh(ByVal number As Double) As Double
	    If number > 44.0 Then ' exp(2*x) <= 2^127
	        Tanh = 1.0
	        Exit Function
	    End If
	    If number < -44.0 Then
	        Tanh = -1.0
	        Exit Function
	    End If
	    Tanh = (Math.Exp(2 * number) - 1) / (Math.Exp(2 * number) + 1)
	End Function