szilard / benchm-ml

A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

License: MIT License

R 83.98% Python 16.02%
machine-learning data-science r python gradient-boosting-machine random-forest deep-learning xgboost h2o spark

benchm-ml's Introduction

Simple/limited/incomplete benchmark for scalability, speed and accuracy of machine learning libraries for classification

All benchmarks are wrong, but some are useful

This project aims at a minimal benchmark for scalability, speed and accuracy of commonly used implementations of a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of limited cardinality, i.e. not very sparse) and no missing data, perhaps the most common problem in business applications (e.g. credit scoring, fraud detection or churn prediction). If the input matrix is n x p, n is varied as 10K, 100K, 1M, 10M, while p is ~1K (after expanding the categoricals into dummy variables/one-hot encoding). This particular type of data structure/size (the largest) stems from this author's interest in some particular business applications.

A large part of this benchmark was done in 2015, with a number of updates later on as things have changed. Make sure you read the summary at the end of this repo of how the focus has changed over time, and why I started a new benchmark instead of updating this one (and where to find it).

The algorithms studied are

  • linear (logistic regression, linear SVM)
  • random forest
  • boosting
  • deep neural network

in various commonly used open source implementations like

  • R packages
  • Python scikit-learn
  • Vowpal Wabbit
  • H2O
  • xgboost
  • lightgbm (added in 2017)
  • Spark MLlib.

Update (June 2015): It turns out these are indeed the most popular tools used for machine learning. If your software tool of choice is not here, you can run a minimal benchmark yourself with little work by following these instructions.

Random forest, boosting and more recently deep neural networks are the algos expected to perform the best on the structure/sizes described above (e.g. vs alternatives such as k-nearest neighbors, naive-Bayes, decision trees, linear models etc). Non-linear SVMs are also among the best in accuracy in general, but become slow/cannot scale for the larger n sizes we want to deal with. The linear models are less accurate in general and are used here only as a baseline (but they can scale better and some of them can deal with very sparse features, so they are great in other use cases).

By scalability we mean here that the algos are able to complete (in decent time) for the given data sizes with the main constraint being RAM (a given algo/implementation will crash if it runs out of memory). Some of the algos/implementations can work in a distributed setting, although the largest dataset in this study (n = 10M) is less than 1GB, so scaling out to multiple machines should not be necessary and is not the focus of this study. (Also, some of the algos perform relatively poorly speedwise in the multi-node setting, where communication is over the network rather than via updating shared memory.) Speed (in the single-node setting) is determined by computational complexity but also by whether the algo/implementation can use multiple processor cores. Accuracy is measured by AUC. The interpretability of models is not of concern in this project.

In summary, we are focusing on which algos/implementations can be used to train relatively accurate binary classifiers for data with millions of observations and thousands of features processed on commodity hardware (mainly one machine with decent RAM and several cores).

Data

Training datasets of sizes 10K, 100K, 1M, 10M are generated from the well-known airline dataset (using years 2005 and 2006). A test set of size 100K is generated from the same (using year 2007). The task is to predict whether a flight will be delayed by more than 15 minutes. While we study primarily the scalability of algos/implementations, it is also interesting to see how much more information and consequently accuracy the same model can obtain with more data (more observations).
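
As an illustration, generating such a training sample takes only a few lines of R. This is a minimal sketch, assuming the raw per-year airline CSV files (with columns such as Month, DayofMonth, DayOfWeek, DepTime, UniqueCarrier, Origin, Dest, Distance and DepDelay) have already been downloaded; file names and the exact column selection are illustrative, the actual code is in the 0-init folder of this repo.

library(data.table)

d <- rbind(fread("2005.csv"), fread("2006.csv"))
cols <- c("Month", "DayofMonth", "DayOfWeek", "DepTime", "UniqueCarrier",
          "Origin", "Dest", "Distance", "DepDelay")
d <- d[, ..cols]
d <- d[complete.cases(d)]

# binary target: departure delayed by more than 15 minutes
d[, dep_delayed_15min := ifelse(DepDelay > 15, "Y", "N")]
d[, DepDelay := NULL]

set.seed(123)
fwrite(d[sample(.N, 1e6)], "train-1m.csv")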

Setup

The tests have been carried out on an Amazon EC2 c3.8xlarge instance (32 cores, 60GB RAM). The tools are freely available and their installation is trivial (version information here). For some of the models that ran out of memory for the larger data sizes, an r3.8xlarge instance (32 cores, 250GB RAM) has been used occasionally. For deep learning on GPUs, a p2.xlarge instance (1 GPU with 12GB video memory, 4 CPU cores, 60GB RAM) has been used.

Update (January 2018): A more modern approach would use docker for fully automated installation of all the ML software and automated timing/running of the tests (which would make it easier to rerun the tests on new versions of the tools, make them more reproducible etc.). This approach has actually been used in a successor of this benchmark focusing on the top performing GBM implementations only, see here.

Results

For each algo/tool and each size n we observe the following: training time, maximum memory usage during training, CPU usage on the cores, and AUC as a measure for predictive accuracy. Times to read the data, pre-process the data, score the test data are also observed but not reported (not the bottleneck).
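
For concreteness, here is a minimal sketch of how a single measurement is taken: training time via system.time() and AUC on the test set via the ROCR package. The train_model/predict_model calls are placeholders for the tool-specific code in the per-algo folders of this repo; the use of ROCR here is an assumption of this sketch.

library(ROCR)

timing <- system.time({
  md <- train_model(d_train)              # placeholder: tool-specific training call
})
print(timing["elapsed"])                  # training time in seconds

phat <- predict_model(md, d_test)         # placeholder: predicted probabilities on the test set
rocr_pred <- prediction(phat, d_test$dep_delayed_15min)
print(performance(rocr_pred, "auc")@y.values[[1]])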

Linear Models

The linear models are not the primary focus of this study because of their not so great accuracy vs the more complex models (on this type of data). They are analyzed here only to get some sort of baseline.

The R glm function (the basic R tool for logistic regression) is very slow, 500 seconds on n = 0.1M (AUC 70.6). Therefore, for R the glmnet package is used. For Python, scikit-learn's LogisticRegression (based on the LIBLINEAR C++ library) has been used.
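
A minimal sketch of the R side with glmnet, assuming the train/test data frames (d_train, d_test) are already loaded; lambda = 0 gives plain (unpenalized) logistic regression, and the exact calls in the 1-linear folder may differ.

library(glmnet)   # also attaches Matrix, which provides sparse.model.matrix
library(ROCR)

# one-hot encode the categoricals into sparse matrices
# (note: factor levels must match between train and test)
X_train <- sparse.model.matrix(dep_delayed_15min ~ . - 1, data = d_train)
X_test  <- sparse.model.matrix(dep_delayed_15min ~ . - 1, data = d_test)

system.time({
  md <- glmnet(X_train, as.factor(d_train$dep_delayed_15min),
               family = "binomial", lambda = 0)
})

phat <- as.numeric(predict(md, newx = X_test, type = "response"))
performance(prediction(phat, d_test$dep_delayed_15min), "auc")@y.values[[1]]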

Tool n Time (sec) RAM (GB) AUC
R 10K 0.1 1 66.7
. 100K 0.5 1 70.3
. 1M 5 1 71.1
. 10M 90 5 71.1
Python 10K 0.2 2 67.6
. 100K 2 3 70.6
. 1M 25 12 71.1
. 10M crash/360 71.1
VW 10K 0.3 (/10) 66.6
. 100K 3 (/10) 70.3
. 1M 10 (/10) 71.0
. 10M 15 71.0
H2O 10K 1 1 69.6
. 100K 1 1 70.3
. 1M 2 2 70.8
. 10M 5 3 71.0
Spark 10K 1 1 66.6
. 100K 2 1 70.2
. 1M 5 2 70.9
. 10M 35 10 70.9

Python crashes on the 60GB machine, but completes when RAM is increased to 250GB (using a sparse format would help with memory footprint and likely runtime as well). The Vowpal Wabbit (VW) running times are reported in the table for 10 passes (online learning) over the data for the smaller sizes. While VW can be run on multiple cores (as multiple processes communicating with each other), it has been run here in the simplest possible way (1 core). Also keep in mind that VW reads the data on the fly while for the other tools the times reported exclude reading the data into memory.

One can play with various parameters (such as regularization) and even do some search in the parameter space with cross-validation to get better accuracy. However, very quick experimentation shows that at least for the larger sizes regularization does not increase accuracy significantly (which is expected since n >> p).

[plots: training time, AUC]

The main conclusion here is that it is trivial to train linear models even for n = 10M rows in virtually any of these tools on a single machine in a matter of seconds. H2O and VW are the most memory efficient (VW needs only 1 observation in memory at a time and is therefore the ultimately scalable solution). H2O and VW are also the fastest (for VW the time reported includes the time to read the data, as it is read on the fly). Again, the differences in memory efficiency and speed will start to really matter only for larger sizes, which are beyond the scope of this study.

Learning Curve of Linear vs Non-Linear Models

For this dataset the accuracy of the linear model tops off at moderate sizes, while the accuracy of non-linear models (e.g. random forest) continues to increase with increasing data size. This is because a simple linear structure can be extracted already from a smaller dataset and having more data points will not change the classification boundary significantly. On the other hand, more complex models such as random forests can improve further with increasing data size by further refining the classification boundary.

This means that having more data ("big data") does not improve further the accuracy of the linear model (at least for this dataset).

Note also that the random forest model is more accurate than the linear one for any size, and contrary to the conventional wisdom of "more data beats better algorithms", the random forest model on 1% of the data (100K records) beats the linear model on all the data (10M records).

[plot: AUC]

Similar behavior can be observed in other non-sparse datasets, e.g. the Higgs dataset. Contact me (e.g. submit a github issue) if you have learning curves for linear vs non-linear models on other datasets (dense or sparse).

On the other hand, there is certainly a price for higher accuracy in terms of larger required training (CPU) time.

Ultimately, there is a data size - algo (complexity) - cost (CPU time) - accuracy tradeoff (to be studied in more detail later). Some quick results for H2O:

n Model Time (sec) AUC
10M Linear 5 71.0
0.1M RF 150 72.5
10M RF 4000 77.8

Random Forest

Note: The random forests results have been published in a more organized and self-contained form in this blog post.

Random forests with 500 trees have been trained with each tool, using the default of square root of p as the number of variables to split on.
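
As an illustration, the R run looks roughly like this (a sketch assuming the d_train data frame is in memory; mtry is left at its classification default of floor(sqrt(p)), and the actual script is in the 2-rf folder):

library(randomForest)

X_train <- model.matrix(dep_delayed_15min ~ . - 1, data = d_train)   # one-hot encoding
y_train <- as.factor(d_train$dep_delayed_15min)

system.time({
  md <- randomForest(X_train, y_train, ntree = 500)
})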

Tool n Time (sec) RAM (GB) AUC
R 10K 50 10 68.2
. 100K 1200 35 71.2
. 1M crash
Python 10K 2 2 68.4
. 100K 50 5 71.4
. 1M 900 20 73.2
. 10M crash
H2O 10K 15 2 69.8
. 100K 150 4 72.5
. 1M 600 5 75.5
. 10M 4000 25 77.8
Spark 10K 50 10 69.1
. 100K 270 30 71.3
. 1M crash/2000 71.4
xgboost 10K 4 1 69.9
. 100K 20 1 73.2
. 1M 170 2 75.3
. 10M 3000 9 76.3

[plots: training time, AUC]

The R implementation (randomForest package) is slow and inefficient in memory use. It cannot cope by default with a large number of categories, therefore the data had to be one-hot encoded. The implementation uses 1 processor core, but with 2 lines of extra code it is easy to build the trees in parallel using all the cores and combine them at the end. However, it runs out of memory already for n = 1M. I have to emphasize that this has nothing to do with R per se (I still stand by R as the best data science platform, especially when it comes to data munging of structured data or visualization); it is just this particular (C and Fortran) RF implementation used by the randomForest package that is inefficient.
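
The "2 lines of extra code" refer to growing the trees in parallel chunks and merging the resulting forests with randomForest::combine(). A hedged sketch of the idea, reusing the X_train/y_train from the sketch above; the chunk sizes and the foreach/doParallel approach are assumptions, not necessarily the exact code used:

library(randomForest)
library(foreach)
library(doParallel)

registerDoParallel(cores = 32)

# 32 chunks of 16 trees each (~500 trees total), merged into one forest
md <- foreach(ntree = rep(16, 32), .combine = randomForest::combine,
              .multicombine = TRUE, .packages = "randomForest") %dopar%
  randomForest(X_train, y_train, ntree = ntree)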

The Python (scikit-learn) implementation is faster, more memory efficient and uses all the cores. Variables needed to be one-hot encoded (which is more involved than for R) and for n = 10M doing this exhausted all the memory. Even when using a larger machine with 250GB of memory (and 140GB free for RF after transforming all the data), the Python implementation runs out of memory and crashes for this larger size. The algo finished successfully though when run on the larger box with simple integer encoding (which for some datasets/cases might actually be a good approximation/choice).

The H2O implementation is fast, memory efficient and uses all cores. It deals with categorical variables automatically. It is also more accurate than the studied R/Python packages, which may be because it deals with the categorical variables properly, i.e. internally in the algo rather than working from a previously 1-hot encoded dataset (where the link between the dummies belonging to the same original variable is lost).
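
A minimal sketch of the H2O run from R (the memory setting is an assumption; the actual script is in the 2-rf folder). Note that H2O keeps the categorical columns as factors, so no one-hot encoding is needed:

library(h2o)
h2o.init(nthreads = -1, max_mem_size = "50g")

dx_train <- h2o.importFile("train-10m.csv")
dx_test  <- h2o.importFile("test.csv")
Xnames <- setdiff(names(dx_train), "dep_delayed_15min")

system.time({
  md <- h2o.randomForest(x = Xnames, y = "dep_delayed_15min",
                         training_frame = dx_train, ntrees = 500)
})
h2o.auc(h2o.performance(md, dx_test))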

The Spark (MLlib) implementation is slower and has a larger memory footprint. It runs out of memory already at n = 1M (with 250GB of RAM it finishes for n = 1M, but it crashes for n = 10M). However, as Spark can run on a cluster, one can throw in even more RAM by using more nodes. I also tried to provide the categorical variables encoded simply as integers and passing the categoricalFeaturesInfo parameter, but that made training much slower. As a convenience issue, reading the data takes more than one line of code, and at the start of this benchmark project Spark did not provide a one-hot encoder for the categorical data (therefore I used R for that). This has since been amended, thanks @jkbradley for the native 1-hot encoding code. In earlier versions of this benchmark there was an issue of Spark random forests having low prediction accuracy vs the other methods. This was due to aggregating votes rather than probabilities, and it has been addressed by @jkbradley in this code (it will be included in the next Spark release). There is still an open issue on the accuracy for n = 1M (see the breaking trend in the AUC graph). To get more insights on the issues above see more comments by Joseph Bradley @jkbradley of the Databricks/Spark project (thanks, Joseph).

Update (September 2016): Spark 2.0 introduces a new API (Pipelines/"Spark ML" vs "Spark MLlib") and the code becomes significantly simpler. Furthermore, Spark 1.5, 1.6 and 2.0 introduced several optimizations ("Tungsten") that have significantly improved, for example, the speed of queries (Spark SQL). However, there is no speed improvement for random forests; they actually got a bit slower.

I also tried xgboost, a popular library for boosting which is capable of building random forests as well. It is fast, memory efficient and of high accuracy. Note the different shapes of the AUC and runtime vs dataset size curves for H2O and xgboost; some discussions here.

Both H2O and xgboost have interfaces from R and Python.

A few other RF implementations (open source and commercial as well) have been benchmarked quickly on 1M records and runtime and AUC are reported here.

It would be nice to study the dependence of running time and accuracy on the (hyper)parameter values of the algorithm, but a quick idea can be obtained easily for the H2O implementation from this table (n = 10M on 250GB RAM):

ntree depth nbins mtries Time (hrs) AUC
500 20 20 -1 (2) 1.2 77.8
500 50 200 -1 (2) 4.5 78.9
500 50 200 3 5.5 78.9
5000 50 200 -1 (2) 45 79.0
500 100 1000 -1 (2) 8.3 80.1

Other hyperparameters are the sample rate (at each tree), the minimum number of observations in the nodes, and the impurity function.

One can see that the AUC could be improved further and the best AUC from this dataset with random forests seems to be around 80 (the best AUC from linear models seems to be around 71, and we will compare with boosting and deep learning later).

Boosting (Gradient Boosted Trees/Gradient Boosting Machines)

Compared to random forests, GBMs have a more complex relationship between hyperparameters and accuracy (and also runtime). The main hyperparameters are the learning (shrinkage) rate, the number of trees and the max depth of trees, while some others are the number of bins, the sample rate (at each tree) and the min number of observations in the nodes. To add to the complexity, GBMs can overfit in the sense that adding more trees at some point will result in decreasing accuracy on a test set (while on the training set "accuracy" keeps increasing).

For example, using xgboost with n = 100K, learn_rate = 0.01 and max_depth = 16 (and the printEveryN = 100 and eval_metric = "auc" options), the AUC on the train and test sets after n_trees iterations is:

[plot: overfitting - train vs test AUC vs number of trees]

One can see the AUC on the test set decreases after 1000 iterations (overfitting). xgboost has a handy early stopping option (early_stop_round = k: training stops if performance, e.g. on a holdout set, keeps getting worse for k consecutive rounds). If one does not know where to stop, one might underfit (too few iterations) or overfit (too many iterations) and the resulting model will be suboptimal in accuracy (see Fig. above).
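
A hedged sketch of how early stopping looks with the xgboost R API (argument names follow a recent version of the package and may differ slightly from the scripts in the 3-boosting folder; X_train/X_valid and y_train/y_valid are assumed to be one-hot encoded matrices and their labels):

library(xgboost)

dxgb_train <- xgb.DMatrix(X_train, label = ifelse(y_train == "Y", 1, 0))
dxgb_valid <- xgb.DMatrix(X_valid, label = ifelse(y_valid == "Y", 1, 0))

md <- xgb.train(params = list(objective = "binary:logistic", eval_metric = "auc",
                              eta = 0.01, max_depth = 16),
                data = dxgb_train, nrounds = 5000,
                watchlist = list(valid = dxgb_valid),
                early_stopping_rounds = 100, print_every_n = 100)
md$best_iteration   # stops once validation AUC has not improved for 100 rounds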

Doing an extensive search for the best model is not the main goal of this project. Nevertheless, a quick exploratory search in the hyperparameter space has been conducted using xgboost (with the early stopping option). For this, a separate validation set of size 100K has been generated from 2007 data not used in the test set. The goal is to find parameter values that provide decent accuracy and then run all GBM implementations (R, Python scikit-learn, etc.) with those parameter values to compare speed/scalability (and accuracy).

The smaller the learn_rate, the better the AUC, but for very small values training time increases dramatically, therefore we use learn_rate = 0.01 as a compromise. Contrary to what is recommended in much of the literature, shallow trees do not produce the best (or close to best) results; the grid search showed better accuracy e.g. with max_depth = 16. The number of trees that produces optimal results for the above hyperparameter values depends, though, on the training set size. For n_trees = 1000 we don't reach the overfitting regime for either size, and we use this value for studying the speed/scalability of the different implementations. (Values for the other hyperparameters that seem to work well are: sample_rate = 0.5, min_obs_node = 1.) We call this experiment A (in the table below).

Unfortunately some implementations take too much time to run for the above parameter values (and Spark runs out of memory). Therefore, another set of parameter values (that provides lower accuracy but faster training times) has also been used to study speed/scalability: learn_rate = 0.1, max_depth = 6, n_trees = 300. We call this experiment B.

I have to emphasize that while I make the effort to match parameter values for all algos/implementations, every implementation is different, some don't have all the above parameters, while some might use the existing ones in a slightly different way (you can also see the resulting model/AUC is somewhat different). Nevertheless, the results below give us a pretty good idea of how the implementations compare to each other.

Tool n Time (s) A Time (s) B AUC A AUC B RAM(GB) A RAM(GB) B
R 10K 20 3 64.9 63.1 1 1
. 100K 200 30 72.3 71.6 1 1
. 1M 3000 400 74.1 73.9 1 1
. 10M 5000 74.3 4
Python 10K 1100 120 69.9 69.1 2 2
. 100K 1500 72.9 3
. 1M
. 10M
H2O 10K 90 7 68.2 67.7 3 2
. 100K 500 40 71.8 72.3 3 2
. 1M 900 60 75.9 74.3 9 2
. 10M 3500 300 78.3 74.6 11 20
Spark 10K 180000 700 66.4 67.8 30 10
. 100K 1200 72.3 30
. 1M 6000 73.8 30
. 10M (60000) (74.1) crash (110)
xgboost 10K 6 1 70.3 69.8 1 1
. 100K 40 4 74.1 73.5 1 1
. 1M 400 45 76.9 74.5 1 1
. 10M 9000 1000 78.7 74.7 6 5

[plots: training time, AUC]

The memory footprint of GBMs is in general smaller than for random forests, therefore the bottleneck is mainly training time (although besides being slow, Spark is also memory inefficient, especially for deeper trees, and therefore it crashes).

Similar to random forests, H2O and xgboost are the fastest (both use multithreading). R does relatively well considering that it's a single-threaded implementation. Python is very slow with one-hot encoding of categoricals, but almost as fast as R (just 1.5x slower) with simple/integer encoding. Spark is slow and memory inefficient, but at least for shallow trees it achieves similar accuracy to the other methods (unlike in the case of random forests, where Spark provides lower accuracy than its peers).

Compared to random forests, boosting requires more tuning to get a good choice of hyperparameters. Quick results for H2O and xgboost with n = 10M (the largest data), learn_rate = 0.01 (the smaller the rate, the better the AUC, but also the longer the training time), max_depth = 20 (after a rough search with max_depth = 2, 5, 10, 20, 50), n_trees = 5000 (close to the xgboost early stop), min_obs_node = 1 (and sample_rate = 0.5 for xgboost, n_bins = 1000 for H2O):

Tool Time (hr) AUC
H2O 7.5 79.8
H2O-3 9.5 81.2
xgboost 14 81.1

Compare with H2O random forest from previous section (Time 8.3 hr, AUC 80.1). H2O-3 is the new generation/version of H2O.

Update (May 2017): A new tool for GBMs, LightGBM came out recently. While it's not (yet) as widely used as the tools above, it is now the fastest one. There is also recent work in running xgboost and LightGBM on GPUs. Therefore I started a new (leaner) github repo to keep track of the best GBM tools here (and ignore mediocre tools such as Spark).

Update (January 2018): I dockerized the GBM measurements for h2o, xgboost and lightgbm (both CPU and GPU versions). The repo linked in the paragraph above will contain all further development w.r.t. GBM implementations. GBMs are typically the most accurate algos for supervised learning on structured/tabular data and are therefore my main interest (compared e.g. with the other 3 algos discussed in this benchmark - linear models, random forests and neural networks), and the dockerization makes it easier to keep that other repo up to date with tests on the newest versions of the tools and potentially to add new ML tools. Therefore this new GBM-perf repo can be considered a "successor" of the current one.

Deep neural networks

Deep learning has been extremely successful on a few classes of data/machine learning problems such as those involving images, speech and text (supervised learning) and games (reinforcement learning). However, it seems that in "traditional" machine learning problems such as fraud detection, credit scoring or churn, deep learning is not as successful and provides lower accuracy than random forests or gradient boosting machines. My experiments (November 2015) on the airline dataset used in this repo and also on another commercial dataset support this, but unfortunately most of the hype surrounding deep learning and "artificial intelligence" overwhelms this reality, and there are only a few references in this direction, e.g. here, here or here.

Here are the results of a few fully connected network architectures trained with various optimization schemes (adaptive, rate annealing, momentum etc.) and various regularizers (dropout, L1, L2) using H2O with early stopping on the 10M dataset:

Params AUC Time (s) Epochs
default: activation = "Rectifier", hidden = c(200,200) 73.1 270 1.8
hidden = c(50,50,50,50), input_dropout_ratio = 0.2 73.2 140 2.7
hidden = c(50,50,50,50) 73.2 110 1.9
hidden = c(20,20) 73.1 100 4.6
hidden = c(20) 73.1 120 6.7
hidden = c(10) 73.2 150 12
hidden = c(5) 72.9 110 9.3
hidden = c(1) (~logistic regression) 71.2 120 13
hidden = c(200,200), l1 = 1e-5, l2 = 1e-5 73.1 260 1.8
RectifierWithDropout, c(200,200,200,200), dropout=c(0.2,0.1,0.1,0) 73.3 440 2.0
ADADELTA rho = 0.95, epsilon = 1e-06 71.1 240 1.7
rho = 0.999, epsilon = 1e-08 73.3 270 1.9
adaptive = FALSE default: rate = 0.005, decay = 1, momentum = 0 73.0 340 1.1
rate = 0.001, momentum = 0.5 / 1e5 / 0.99 73.2 410 0.7
rate = 0.01, momentum = 0.5 / 1e5 / 0.99 73.3 280 0.9
rate = 0.01, rate_annealing = 1e-05, momentum = 0.5 / 1e5 / 0.99 73.5 360 1
rate = 0.01, rate_annealing = 1e-04, momentum = 0.5 / 1e5 / 0.99 72.7 3700 8.7
rate = 0.01, rate_annealing = 1e-05, momentum = 0.5 / 1e5 / 0.9 73.4 350 0.9

It looks like the neural nets are underfitting and are not able to capture the same structure in the data as the random forests/GBMs can (AUC 80-81). Therefore adding various forms of regularization does not improve accuracy (see above). Note also that by using early stopping (based on the decrease of accuracy on a validation dataset during training iterations) the training takes relatively short time (compared to RF/GBM), also a sign of effectively low model complexity. Remarkably, the nets with more layers (deep) are not performing better than a simple net with 1 hidden layer and a small number of neurons in that layer (10).
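
For reference, a minimal sketch of one of the configurations above with h2o.deeplearning, using the same dx_train/Xnames objects as in the random forest sketch earlier; the validation frame and the early-stopping arguments here are assumptions, the actual runs are in the 4-DL folder:

md <- h2o.deeplearning(x = Xnames, y = "dep_delayed_15min",
                       training_frame = dx_train, validation_frame = dx_valid,
                       activation = "Rectifier", hidden = c(50, 50, 50, 50),
                       input_dropout_ratio = 0.2, epochs = 100,
                       stopping_rounds = 3, stopping_metric = "AUC")
h2o.auc(h2o.performance(md, dx_test))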

Timing on the 1M dataset of various tools (fully connected networks, 2 hidden layers, 200 neurons each, ReLU, SGD, learning rate 0.01, momentum 0.9, 1 epoch), code here:

Tool Time GPU Time CPU
h2o - 50
mxnet 35 65
keras+TF 35 60
keras+theano 25 70

(GPU = p2.xlarge, CPU = r3.8xlarge 32c for h2o/mxnet, p2.xlarge 4c for TF/theano, theano uses 1 core only)

Despite not being great (in accuracy) on tabular data of the type above, deep learning has been a blast in domains such as image, speech and somewhat text, and I'm planning to do a benchmark of tools in that area as well (mostly conv-nets and RNNs/LSTMs).

Big(ger) Data

While my primary interest is in machine learning on datasets of 10M records, you might be interested in larger datasets. Some problems might need a cluster, though there has been a tendency recently to solve every problem with distributed computing, needed or not. As a reminder, sending data over a network is much slower than using shared memory. Also, several popular distributed systems have significant computation and memory overhead, and more fundamentally, their communication patterns (e.g. map-reduce style) are not the best fit for many of the machine learning algos.

Larger Data Sizes (on a Single Server)

For linear models, most tools, including single-core R, still work well on 100M records on a single server (an r3.8xlarge instance with 32 cores, 250GB RAM was used here). (A 10x copy of the 10M dataset has been used, therefore information on AUC vs size is invalid and is not considered here.)

Linear models, 100M rows:

Tool Time[s] RAM[GB]
R 1000 60
Spark 160 120
H2O 40 20
VW 150

Some tools can handle 1B records on a single machine (in fact VW never runs out of memory, so if larger runtimes are acceptable, you can go further still on one machine).

Linear models, 1B rows:

Tool Time[s] RAM[GB]
H2O 500 100
VW 1400

For tree-based ensembles (RF, GBM) H2O and xgboost can train on 100M records on a single server, though the training times become several hours:

RF/GBM, 100M rows:

Algo Tool Time[s] Time[hr] RAM[GB]
RF H2O 40000 11 80
. xgboost 36000 10 60
GBM H2O 35000 10 100
. xgboost 110000 30 50

One usually hopes here (and most often gets) much better accuracy for the 1000x in training time vs linear models.

Distributed Systems

Some quick results:

H2O logistic runtime (sec):

size 1 node 5 nodes
100M 42 9.9
1B 480 101

H2O RF runtime (sec) (5 trees):

size 1 node 5 nodes
10M 42 41
100M 405 122

Summary

As of January 2018:

When I started this benchmark in March 2015, the "big data" hype was all the rage, and the fanboys wanted to do machine learning on "big data" with distributed computing (Hadoop, Spark etc.), while for the datasets most people had, single-machine tools were not only good enough, but also faster, with more features and fewer bugs. I gave quite a few talks at conferences and meetups about these benchmarks starting in 2015, and while at the beginning I had several people asking angrily about my results on Spark, by 2017 most people realized single-machine tools are much better for solving most of their ML problems. While Spark is a decent tool for ETL on raw data (which often is indeed "big"), its ML libraries are totally garbage and outperformed (in training time, memory footprint and even accuracy) by much better tools by orders of magnitude. Furthermore, the increase in available RAM over the last years in servers and also in the cloud, and the fact that for machine learning one typically refines the raw data into a much smaller data matrix, is making the mostly single-machine, highly-performing tools (such as xgboost, lightgbm, VW but also h2o) the best choice for most practical applications now. The big data hype is finally over.

What's happening now is a new wave of hype, namely deep learning. The fanboys now think deep learning (or as they miscall it: AI) is the best solution to all machine learning problems. While deep learning has indeed been extremely successful on a few classes of data/machine learning problems such as those involving images, speech and somewhat text (supervised learning) and games/virtual environments (reinforcement learning), in more "traditional" machine learning problems encountered in business such as fraud detection, credit scoring or churn (with structured/tabular data), deep learning is not as successful and provides lower accuracy than random forests or gradient boosting machines (GBM). Therefore, lately I'm concentrating my benchmarking efforts mostly on GBM implementations and I have started a new github repo GBM-perf that's more "focused" and lean and also uses more modern tools (such as docker) to make the benchmarks more maintainable and reproducible. Also, it has become apparent recently that GPUs can be a powerful computing platform for GBMs too, and the new repo includes benchmarks of the available GPU implementations as well.

I started these benchmarks mostly out of curiosity and the desire to learn (and also in order to be able to choose good tools for my projects). It's been quite some experience and I'd like to thank all the folks (especially the developers of the tools) for helping me in tuning and getting the most out of their ML tools. As a side effect of this work I had the pleasure to be invited to talk at several conferences (KDD, R/Finance, useR!, eRum, H2O World, Crunch, Predictive Analytics World, EARL, Domino Data Science Popup, Big Data Day LA, Budapest Data Forum) and to over 10 meetups, e.g.:

  • KDD Invited Talk - Machine Learning Software in Practice: Quo Vadis? - Halifax, Canada, August 2017
  • R in Finance Keynote - No-Bullshit Data Science - Chicago, May 2017
  • LA Data Science Meetup - Machine Learning in Production - Los Angeles, May 2017
  • useR! 2016 - Size of Datasets for Analytics and Implications for R - Stanford, June 2016
  • H2O World - Benchmarking open source ML platforms - Mountain View, November 2015
  • LA Machine Learning Meetup - Benchmarking ML Tools for Scalability, Speed and Accuracy - LA, June 2015

(see code/slides and some video recordings here). These talks/materials are also probably the best place to get a grasp of the findings of this benchmark (and if you want to pick the one that is most up to date and most comprehensive, watch the video of my KDD talk). The work goes on, expect more results...

Citation

If benchm-ml was useful for your research, please consider citing it, for instance using the latest commit:

@misc{,
	author = {Pafka, Szilard},
	title = {benchm-ml},
	publisher = {GitHub},
	year = {2019},
	journal = {GitHub repository},
	url = {https://github.com/szilard/benchm-ml},
	howpublished = {\url{https://github.com/szilard/benchm-ml}},
	commit = {13325ce3edd7c902390197f43bcc7938c306bbe3}
}

benchm-ml's People

Contributors

dirmeier, earino, hetong007, jangorecki, jstokes, nicolaskruchten, szilard, xhudik, yinxusen


benchm-ml's Issues

benchmarking with autosklearn (zeroconf)

Great initiative, thanks for making this public!
You might be interested in extending your benchmarking to auto-sklearn: https://github.com/automl/auto-sklearn
I have created a script that can take in a sparse dataset in the pandas HDF5 dataframe .h5 format and run a binary classification on it on a multiprocessing cluster with auto-sklearn: https://github.com/Motorrat/autosklearn-zeroconf I will try to duplicate your benchmark myself, but just in case you are already on it, you might want to try it out yourself.

GBM variable 1: Month is not of type numeric, ordered, or factor.

For gbm_2.1.1 and R 3.3.1 I get the following error for
benchm-ml/3-boosting/1-gbm.R

> system.time({
+   md <- gbm(dep_delayed_15min ~ ., data = d_train, distribution = "bernoulli", 
+             n.trees = 1000, 
+             interaction.depth = 16, shrinkage = 0.01, n.minobsinnode = 1,
+             bag.fraction = 0.5)
+ })

Error in gbm.fit(x, y, offset = offset, distribution = distribution, w = w,  : 
  variable 1: Month is not of type numeric, ordered, or factor.
Timing stopped at: 0.02 0 0.01 

Tobias

Spark random forest issues

This is to collaborate on some issues with Spark RF also addressed by @jkbradley in comments to this post http://datascience.la/benchmarking-random-forest-implementations/ (see comments by Joseph Bradley). cc: @mengxr

Please see “Absolute Minimal Benchmark” for random forests https://github.com/szilard/benchm-ml/tree/master/z-other-tools and let's use the 1M row training set and the test set linked in from there.

@jkbradley says: One-hot encoder: Spark 1.4 includes this, plus a lot more feature transformers. Preprocessing should become ever-easier, especially using DataFrames (Spark 1.3+).

Yes, indeed. Can you please provide code that reads in the original dataset (pre 1-hot encoding) and does the 1-hot encoding in Spark? Also, if the random forest 1.4 API can use data frames, I guess we should use that for the training. Can you please provide code for that too?

@jkbradley says: AUC/accuracy: The AUC issue appears to be caused by MLlib tree ensembles aggregating votes, rather than class probabilities, as you suggested. I re-ran your test using class probabilities (which can be aggregated by hand), and then got the same AUC as other libraries. We’re planning on including this fix in Spark 1.5 (and thanks for providing some evidence of its importance!).

Fantastic. Can you please share code that does that already? I would be happy to check it out.

Possible data leakage

Hi, szilard!
thanks for your benchmarks, I think that you found an interesting dataset for comparison.

HOWEVER

The time of departure present in the data is exact time when aircraft takes off.
Thus, by analyzing the aircraft from airport X to airport Y by carrier Z, one can establish at which time aircraft should take off to be on time (and that is, I believe, what deep trees do).

At least, I could easily see such patterns in data.

It doesn't seem to be very useful to predict whether an aircraft departs on time given that you already know this information.

So, my suggestion is either to replace DepTime with PlannedDepTime (if you know how to get this information) or to set DepTime = DepTime // 200 to reduce the possibility of using this information, while this altered feature still gives approximate information about the flight schedule.

License for datasets

Hi
Could you please help me understand whether the MIT license covers the datasets mentioned here:
Training datasets of sizes 10K, 100K, 1M, 10M are generated
from the well-known airline dataset (using years 2005 and 2006). A test set of size 100K is generated from the same (using year 2007).

mxnet sparse data format

Motivation: I can't run mxnet on the 10M records airline set #29 because model.matrix crashes out of RAM (on g2.8xlarge with 60GB of RAM - the largest available for GPU instances).

Using Matrix::sparse.model.matrix to encode the categorical data would be great (uses <2GB RAM), but I get:

Error in asMethod(object) : 
  Cholmod error 'problem too large' at file ../Core/cholmod_dense.c, line 105

Strangely on the 1M dataset I get another error:

Error: io.cc:50: Seems X, y was passed in a Row major way, MXNetR adopts a column major convention.

How to time the algorithms?

How do you time the training time of the different algorithms? Are there tools available that can be used for all ML software?

Spark logistic regression issues

Splitting #5 in two: logistic regression here and random forest in a different issue.

Summary: Logistic regression has lower AUC in Spark.

For n=1M Spark gets AUC = 0.703 while R/Python etc. AUC = 0.711.

Code here https://github.com/szilard/benchm-ml/blob/master/1a-spark-logistic/spark.txt
Train data here https://s3.amazonaws.com/benchm-ml--spark/spark-train-1m.csv test data here https://s3.amazonaws.com/benchm-ml--spark/spark-test-1m.csv

Spark version used: 1.3.0

Spark Random forest accuracy --spam?

Hi guys

I was running random forest using Spark in R.

Can anyone tell me how I get the accuracy?

I would have got the normal r-squared, but it drops certain rows when the random forest runs,

so to get the r-squared I need equal rows in the original data and the predicted data.

other dataset of such type for benchmarking?

@tqchen I moved your last question to a new issue:

Thanks for the clarification! BTW, do you have any idea if there is any other dataset of such type for benchmarking? For example, a dataset with more columns and rows.

One thing I noticed about this dataset is that the output seems to be very dependent on one variable (when the features are randomly dropped at a rate of 50%, one output tree could be very bad). This might make the result a singular case where the trees simply repeatedly cut on a single feature.

LightGBM results

New GBM implementation released by Microsoft: https://github.com/Microsoft/LightGBM

on 10M dataset, r3.8xlarge

trying to match xgboost & LightGBM params

xgboost:    nround = 100, max_depth = 10, eta = 0.1
LightGBM 1: num_iterations=100  learning_rate=0.1  num_leaves=1024  min_data_in_leaf=100
LightGBM 2: num_iterations=100  learning_rate=0.1  num_leaves=512   min_data_in_leaf=100
LightGBM 3: num_iterations=100  learning_rate=0.1  num_leaves=1024  min_data_in_leaf=0
Tool time (s) AUC
xgboost 350 0.7511
LightGBM 1 500 0.7848
LightGBM 2 350 0.7729
LightGBM 3 450 0.7897

Code to get the results here

benchm-ml/z-other-tools/4-h2o.R change in import format

Hi,
for H2O cluster version: 3.8.3.3
the import function in benchm-ml/z-other-tools/4-h2o.R should be corrected to

dx_train <- h2o.importFile(path = "train-1m.csv")

otherwise the following error occurs

> dx_train <- h2o.importFile(h2oServer, path = "train-1m.csv")
Error: is.character(key) && length(key) == 1L && !is.na(key) is not TRUE

Cheers
Tobias

Citation

Hey,
thanks for this repository. It's tremendously useful. Would it be possible to maybe add info on how to cite this repository? Maybe something like:

@misc{,
	author = {Pafka, Szilard},
	title = {benchm-ml},
	publisher = {GitHub},
	year = {2019},
	journal = {GitHub repository},
	howpublished = {\url{https://github.com/szilard/benchm-ml}},
	commit = {13325ce3edd7c902390197f43bcc7938c306bbe3}
}

Best,
Simon

DL with h2o

Trying to see if DL can match RF/GBM in accuracy on the airline dataset (where the train set is sampled from years 2005-2006, while the validation and test sets are sampled disjointly from 2007). Also, some variables are kept categorical artificially and are intentionally not encoded as ordinal variables (to better match the structure of business datasets).

Recap: with 10M records training (largest in the benchmark) RF AUC 0.80 GBM 0.81 (on test set).

So far I get 0.73 with DL with h2o on 1M and 10M train as well:
https://github.com/szilard/benchm-ml/blob/master/4-DL/1-h2o.R

I tried a few architectures/activation/regularizations, but it won't beat the default. Runs about 2-3 minutes with early stopping (using validation set) on a 32 cores EC2 box.

The "problem" is DL learns very fast, the best AUC reached after 1.3 epochs on 1M rows train and 0.15 epochs on 10M (and early stopping kicks in around 9 and 0.9, rsp). On the other hand RF/GBM runs ~1hr to get good accuracy. That is the DL model seems underfitted to me.

Surely, DL might not beat GBM on this kind of data (proxy for general business data such as credit risk or fraud detection), but it should do better than 0.73.

Datasets:
https://s3.amazonaws.com/benchm-ml--main/train-1m.csv
https://s3.amazonaws.com/benchm-ml--main/train-10m.csv
https://s3.amazonaws.com/benchm-ml--main/valid.csv
https://s3.amazonaws.com/benchm-ml--main/test.csv

Update Latest version of XGBoost

Thanks to this benchmark, we now have a good understanding of what is going on in #14 in #2
Specifically, cacheline related issues for exact greedy algorithm. See our detailed analysis in this paper http://arxiv.org/abs/1603.02754

In short, the exact greedy algorithm will suffer from cache line issues when facing datasets larger than 1M, which we can counterbalance with pre-fetching, but it is still not perfect.

We are adding a new option called tree_method to xgboost, which will allow the user to choose the algorithm. By default it will choose the faster one, and will send a message to the user when the approximate algorithm is chosen. I think it might be interesting to rerun the benchmark on this latest version.

See https://github.com/dmlc/xgboost/tree/master/R-package for instructions. The drat or install from source should work. To confirm you are using the latest version, check if the following message occurs when running on the 10M data:

Tree method is automatically selected to be 'approx'...

SMILE

Thanks for the great work! We have an open source machine learning library called SMILE (https://github.com/haifengl/smile). We have incorporated your benchmark (https://github.com/haifengl/smile/blob/master/benchmark/src/main/scala/smile/benchmark/Airline.scala). We found that our system is much faster for this data set. For 100K training data on a 4 core machine, we can train a random forest with 500 trees in 100 seconds, and gradient boosted trees with 300 trees in 180 seconds. Projected to 32 cores, I think that we will be much faster than all the tools you tested. You can try it out by cloning our project. Then

sbt benchmark/run

This also includes a benchmark on the USPS data, which you may ignore. Thanks!

comment:re sklearn -- integer encoding vs 1-hot (py)

(Your post popped up in my twitter feed)
I'm not sure why you said you needed to one-hot encode categorical variables for scikit's random forest; I'm fairly certain you do not need to (and probably shouldn't). It's been a while since I looked at the source, but I'm pretty sure it handles categorical variables encoded as a single vector of numbers just fine from empirical tests; performance is almost always worse if the features were one-hot encoded.

Add Rborist

Could you add Rborist in serial and parallel mode to add another (fast?) random forest implementation?

Great project. Very useful to have comparisons.

Datacratic MLDB results

This code gives an AUC of 0.7417 in 12.1s for the 1M training set on an r3.8xlarge EC2 instance with the latest release of Datacratic's Machine Learning Database (MLDB), available at http://mldb.ai/

from pymldb import Connection
mldb = Connection("http://localhost/")

mldb.v1.datasets("bench-train-1m").put({
    "type": "text.csv.tabular",
    "params": { "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/train-1m.csv" }
})

mldb.v1.datasets("bench-test").put({
    "type": "text.csv.tabular",
    "params": { "dataFileUrl": "https://s3.amazonaws.com/benchm-ml--main/test.csv" }
})

mldb.v1.procedures("benchmark").put({
    "type": "classifier.experiment",
    "params": {
        "experimentName": "benchm_ml",
        "training_dataset": {"id": "bench-train-1m"},
        "testing_dataset": {"id": "bench-test"},
        "configuration": {
            "type": "bagging",
            "num_bags": 100,
            "validation_split": 0.50,
            "weak_learner": {
                "type": "decision_tree",
                "max_depth": 20,
                "random_feature_propn": 0.5
            }
        },
        "modelFileUrlPattern": "file://tmp/models/benchml_$runid.cls",
        "label": "dep_delayed_15min = 'Y'",
        "select": "* EXCLUDING(dep_delayed_15min)",
        "mode": "boolean"
    }
})

import time

start_time = time.time()

result = mldb.v1.procedures("benchmark").runs.post({})

run_time = time.time() - start_time
auc = result.json()["status"]["folds"][0]["results"]["auc"]

print "\n\nAUC = %0.4f, time = %0.4f\n\n" % (auc, run_time)

Spark random forest low AUC etc

Splitting #5 in two: random forest here, logistic regression in different issue.

Summary: Random forest in Spark has low AUC (and is slower/larger memory footprint).

For n = 100K Spark gets AUC = 0.65 vs e.g. 0.72/0.73 in H2O/xgboost.

Code here https://github.com/szilard/benchm-ml/blob/master/2-rf/5b-spark.txt
Train data here https://s3.amazonaws.com/benchm-ml--spark/spark-train-0.1m.csv test data here https://s3.amazonaws.com/benchm-ml--spark/spark-test-0.1m.csv

Originally ran on 1.3.0, but same in 1.4.0 (a bit faster, but same AUC).

Can you guys look at the code and optimize it/make it better, especially get better AUC?

Rborist

Thanks @suiji for Rborist code. If I run it with 100 trees as in https://github.com/szilard/benchm-ml/tree/master/z-other-tools (on 32 core box) I get:
Time: 87 sec
AUC: 66.43
Something is wrong, the AUC is very low.

I checked out the latest github version, then in the ArboristBridgeR/Package dir I ran ./dev.sh, which created Rborist.tar.gz, and then I installed it with R CMD INSTALL.

Question on the metric of AUC

It seems a little confusing that the evaluation on classification tasks uses the probability outputs directly in calculating the AUC.
For example, in 6-xgboost.R#L39,
would it be better to do that with (phat > 0.5)?

New bench-ml 4-h2o.R for H2O cluster version: 3.8.3.3

Hi,
this is the corrected R code for H2O cluster version 3.8.3.3 and R 3.3.1.
The old code would not run under these versions. The final AUC with sample_rate = 1.0 for 1 million records is 0.77, which tops the old results.

For 10M the AUC is 0.7922 for a quad core CPU @ 4GHz in 1676.02 seconds (more accurate and also 2x faster than the current report with a 32-thread machine, and using only two GB of RAM).

This needs code validation.

# works for H2O Flow 3.8.3.3 and R 3.3.1 (July 2016)
# load H2O Flow at http://localhost:54321/flow/index.html
library(h2o)
# because H2O is limited to two cores by default we need to assign all cores/threads
h2o.init(nthreads = -1)

# load data from current directory
dx_train <- h2o.importFile(path = "train-1m.csv")
dx_test <- h2o.importFile( path = "test.csv")

# assign variables
Xnames <- names(dx_train)[which(names(dx_train)!="dep_delayed_15min")]

# start training H2O random forest 
system.time({
    md <- h2o.randomForest(x = Xnames, y = "dep_delayed_15min", training_frame= dx_train, sample_rate = 0.632, ntrees = 100, max_depth = 20)
    })

# prediction
phat <- h2o.predict(md, dx_test)

# extract  accuracy and compare against test set
phat$Accuracy <- phat$predict == dx_test$dep_delayed_15min
# display Accuracy (0.70)
mean(phat$Accuracy)

# display AUC (0.73)
system.time({
  print(h2o.performance(md, dx_test)@metrics$AUC)
})

best boosting AUC?

@tqchen @hetong007 I'm trying to get a good AUC with boosting for the largest dataset (n = 10M). Would be nice to beat random forests :)

So far I did some basic grid search https://github.com/szilard/benchm-ml/blob/master/3-boosting/0-xgboost-init-grid.R for n = 1M (not the largest dataset) and seems like deeper trees, min_child_weight = 1 subsample = 0.5 work well.

I'm running now https://github.com/szilard/benchm-ml/blob/master/3-boosting/6a-xgboost-grid.R with n = 10M by just looping over max_depth = c(2,5,10,20,50) but it's been running for a while.

Any suggestions?

Smallest learning rate I'm using is eta = 0.01, any experience with smaller values?

PS: See results so far here: https://github.com/szilard/benchm-ml#boosting-gradient-boosted-treesgradient-boosting-machines

DL with mxnet

Trying to see if DL can match RF/GBM in accuracy on the airline dataset (where the train set is sampled from years 2005-2006, while the validation and test sets are sampled disjointly from 2007). Also, some variables are kept categorical artificially and are intentionally not encoded as ordinal variables (to better match the structure of business datasets).

Recap: with 10M records training (largest in the benchmark) RF AUC 0.80 GBM 0.81 (on test set).

So far I get 0.72 with DL with mxnet on 1M train:
https://github.com/szilard/benchm-ml/blob/master/4-DL/2-mxnet.R

Comparably on the 1M train xgboost has achieved 0.77 and with some tuning I think it can get 0.79.

I tried a few architectures (#hidden layers etc), but it won't beat the settings I took from an mxnet example. It runs about 1 minute to train on an EC2 g2.8xlarge box using 1 GPU (if using all 4 GPUs it was slower). nvidia-smi shows GPU utilization ~20% and memory usage ~2GB (out of 4GB). On CPU (32 cores) training takes about 5 mins.

The "problem" is DL learns very fast, the best AUC (on a validation set) is reached after 2 epochs. On the other hand xgboost runs ~1hr to get good accuracy. That is the DL model seems underfitted to me.

Surely, DL might not beat GBM on this kind of data (proxy for general business data such as credit risk or fraud detection), but it should do better than 0.72.

Datasets:
https://s3.amazonaws.com/benchm-ml--main/train-1m.csv
https://s3.amazonaws.com/benchm-ml--main/train-10m.csv
https://s3.amazonaws.com/benchm-ml--main/valid.csv
https://s3.amazonaws.com/benchm-ml--main/test.csv

5-spark.txt: spark-train-10m.csv

file: 1-linear/5-spark.txt contains lines:
val d_train = load("spark-train-10m.csv").repartition(32).cache()
val d_test = load("spark-test-10m.csv").repartition(32).cache()

However, those files are not created anywhere else (0-init). I'm wondering, shouldn't it be
train-10m.csv and test.csv instead? (Those files are in 0-init.)

running your benchmarks from beginning to end

Hey Szilard,

I'd like to replicate your code from beginning to end, perhaps on Google Compute Engine (GCE), mainly to test out GCE with Vagrant. Do you have a sense of how long the entire process would take assuming a similar server size as what you used on EC2?

Is there a convenient way to run all your scripts from folder 0 to 4? That is, is there a master script that executes them all?

I notice that the results are written out to the console. Do you have a script that scrapes all the AUC's for your comparison analysis?

Thanks!

mllib test code - RAM / AUC improvements needed

@szilard For MLlib, you should repartition the data to match the number of cores. For example, try train.repartition(32).cache(). Otherwise, you may not use all the cores. Also, if the data is sparse, you should create sparse vectors instead of dense vectors.

xgboost RF bump for n=10M

Moved "something weird happens for the largest data size (n=10M) - the trend for Run time and AUC "breaks", see figures main README" issue from #2 here.

RandomForest Example

Hello,
I'm trying to train a RandomForest model, but I am getting the same result for each test (about 300 entries).
Here's the Java code:

RandomForest model;
model = new RandomForest(X,Y,500);                                                        

For this data, Python sklearn works as expected:

clf = RandomForestClassifier(n_estimators=500)
clf = clf.fit(X, Y)

What am I missing?

Thanks
