h2oai / h2o4gpu Goto Github PK

H2Oai GPU Edition

License: Apache License 2.0

Makefile 1.77% C++ 26.17% C 7.97% Cuda 17.90% Python 23.50% Shell 1.67% R 3.10% Jupyter Notebook 2.61% Ruby 0.01% JavaScript 0.33% CSS 4.29% HTML 0.70% Batchfile 0.03% Groovy 7.93% CMake 0.63% SWIG 0.50% Sass 0.92%

gpu glm python cuda c-plus-plus cpu lasso elastic-net svd pca

h2o4gpu's Issues

Implement quantiles regression

Logging framework

Add a proper logging framework for C++ and Python enabling debug prints and (maybe) performance timings.

GLM documentation

Add docs.

GPU support for data.table

Use GPUs as a backend for data.table, to make sorting, grouping and joining faster.
https://github.com/h2oai/h2oai-prototypes/tree/master/sorting

automatic lambda_min_ratio search

Start out at machine precision lambda_min_ratio, and adaptively line-search + bisect to find smallest (say) validation error.

GLM: pass training data as int (both X and y)

X = np.array([1,2,3])
y = np.array([1,2,3,4,5,6,7,8,9,10])
lm = h2o4gpu.LinearRegression()
lm.fit(X, y)

gives the following error

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-40-b5e451bf1fa8> in <module>()
      1 lm = h2o4gpu.LinearRegression()
----> 2 lm.fit(X, y)

~/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/elastic_net_base.py in fit(self, train_x, train_y, valid_x, valid_y, weight, give_full_path, do_predict, free_input_data, tol, lambda_stop_early, glm_stop_early, glm_stop_early_error_fraction, max_iterations, verbose, order)
    711             valid_x_np,
    712             valid_y_np,
--> 713             weight_np,
    714         )
    715         precision = 0  # won't be used

~/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/elastic_net_base.py in _upload_data(self, source_dev, train_x, train_y, valid_x, valid_y, weight)
   1355         d = c_void_p(0)
   1356         e = c_void_p(0)
-> 1357         if self.double_precision == 1:
   1358             c_ftype = c_double
   1359 

AttributeError: 'LinearRegression' object has no attribute 'double_precision'

if X is float but y is integer, no errors are raised but the model doesn't train properly

X = np.array([1,2,3])
X = X.astype(float)
y = np.array([1,2,3,4,5,6,7,8,9,10])

lm = h2o4gpu.LinearRegression()
lm.fit(X, y)

print(lm.predict(np.array([15.0, 16.0])))
print(lm.X)

the output is

[[ 0.  0.]]
[[ 0.]]

Improve accuracy of GLM

Review wamsi's notebook

GLM: Pass alpha and lambda as single value, tuple, or list.

And choose what to return for predict or transform (i.e. full path or not) based upon what user inputted.

Standard and Truncated SVD

Standard SVD on GPU is in place without API, but holding until can see how to do truncated SVD. Also, seems sklearn truncated SVD uses lots of memory despite being truncated -- can one optimize more for memory vs. time?

Unify CPU/GPU default count

KMeans by default uses 1 GPU, GLM uses by default all GPUs.

Both should use all gpus by default.

subkmeans for improved handling of high-dimensional data

KMeans R API

GLM not as accurate on MAPD's Wamsi's data set

Implement linear and general SVM

CUDA implementation takes priority. Then Python API, C++ and R.
Add unit tests
Add performance tests comparing to other frameworks

xgboost isn't properly handling massive-sized data on multiple GPUs

Fails with NCCL related errors once over 60GB on DGX-1

GPU support for MOJO scoring

Add a GPU backend for tree model scoring (GBM/DRF).

The goal is to increase the throughput for scoring of large datasets (in mini-batch) and for large numbers of trees.

Possibly predict for all possible paths down the tree and pick the "right" path at the end, to avoid branch misprediction issues.

Might be related to the TF MOJO: https://github.com/h2oai/h2oai-prototypes/tree/master/tf-mojo

KMeans documentation

Add docs.

Directly solve GLM in case of no regularization

For now can use sklearn as backup.

Implement PCA

CUDA implementation takes priority. Then Python API, C++ and R.
Add unit tests
Add performance tests comparing to other frameworks

GLM (all models): should accept integer for model parameters

logreg = h2o4gpu.LogisticRegression(alpha_max = 1, alpha_min = 1)

error:

---------------------------------------------------------------------------
H2O4GPUTypeError                          Traceback (most recent call last)
<ipython-input-55-55d86fdd97e3> in <module>()
      1 logreg = h2o4gpu.LogisticRegression(alpha_max = 1, 
----> 2                                     alpha_min = 1)

~/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/logistic.py in __init__(self, n_threads, n_gpus, fit_intercept, lambda_min_ratio, n_lambdas, n_folds, n_alphas, tol, lambda_stop_early, glm_stop_early, glm_stop_early_error_fraction, max_iter, verbose, give_full_path, lambda_max, alpha_max, alpha_min)
     60             alpha_max=alpha_max,
     61             alpha_min=alpha_min,
---> 62             order=None,)

~/anaconda3/lib/python3.6/site-packages/h2o4gpu/solvers/elastic_net_base.py in __init__(self, n_threads, n_gpus, intercept, lambda_min_ratio, n_lambdas, n_folds, n_alphas, tol, lambda_stop_early, glm_stop_early, glm_stop_early_error_fraction, max_iterations, verbose, family, give_full_path, lambda_max, alpha_max, alpha_min, order)
     86                           'elasticnet'], "family should be set to 'logistic' or 'elasticnet' but got " + family
     87         assert_is_type(lambda_max, float, None)
---> 88         assert_is_type(alpha_max, float, None)
     89         assert_is_type(alpha_min, float, None)
     90 

~/anaconda3/lib/python3.6/site-packages/h2o4gpu/util/typechecks.py in assert_is_type(var, *types, **kwargs)
    448     vtn = _get_type_name(type(var))
    449     raise H2O4GPUTypeError(var_name=vname, var_value=var, var_type_name=vtn, exp_type_name=etn, message=message,
--> 450                        skip_frames=skip_frames)
    451 
    452 

H2O4GPUTypeError: Argument `alpha_max` should be a ?float, got integer 1

while the following works:
logreg = h2o4gpu.LogisticRegression(alpha_max = 1.0, alpha_min = 1.0)

Connector to cuDF

TBA

Implement Kalman Filters

KMeans CUDA code

Implementation should be in cpp/cu files mostly
DRY the code so we don't duplicate it

GLM C++/CUDA code

Implementation should be in cpp/cu files mostly
DRY the code so we don't duplicate it

GLM R API

Add necessary open-source artifacts

We need:

See: https://help.github.com/articles/setting-guidelines-for-repository-contributors/

Implement PCA

Add a GPU backend to H2O PCA, by letting the GPU do the SVD, for example here:
https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/svd/SVD.java#L325-L325

This could be done via a simple JNI call, similar to the XGBoost/DeepWater integration.

Proof of concept:
https://github.com/arnocandel/cuda-rpca-admm

H2O-3 XGBoost inaccurate

H2O's fork and DMLC XGboost version mismatch on accuracy.

RandomForest wrapper

Add a RandomForest wrapper, API should resemble SciKit-Learn's RandomForestRegressor and RandomForestClassifier (up for discussion)
Implementation should for now just call XGBoost but be as generic as possible should we want to run other algorithms instead (for example performance based heuristics etc.).
Don't forget tests.

Review licensing

Make sure all the license/legal related notes are in place (in the README, LICENSE and code files).

Document pip install process at top of README and warn that overwrite various packages due to requirements.

GLM wrappers for sklearn

Multi GPU - Coordinate Descent GPU

Investigate and (if possible) implement.

Sphinx and pillow for automatic documentation like sklearn uses

https://pillow.readthedocs.io/en/4.2.x/
http://www.sphinx-doc.org/en/stable/

Progress bar for each solver

Would be good to know progress of k-means, glm, etc as it is solving.

KMeans minibatch

As in sklearn

Fix intercept to be like sklearn -- handle simple linear fit

KMeans invalid pointer call.

How to reproduce:

run kmeans.fit(X) first using kmeans object with 1 GPU, then twice (or more) with 2 GPUs.

Error:

GPUassert: invalid device pointer gpu/include/kmeans_labels.h 170

Seems like we are deallocating something too fast or not checking if every pointer we use got properly allocated.

KMeans DAAL implementation.

Implement KMeans on DAAL, preferably in C++ so it's accessible in Python and R.

GradientBoosting wrapper

Add a GradientBoosting wrapper, API should resemble SciKit-Learn's GradientBoostingRegressor and GradientBoostingClassifier (up for discussion)
Implementation should for now just call XGBoost but be as generic as possible should we want to run other algorithms instead (for example performance based heuristics etc.).
Don't forget tests.

If input pandas, output should be pandas (i.e. output should have pandas names for columns put back in)

NCCL Support for multi GPU and multi machine

TBA

Jenkins do all make tests

Separate builds for make test, make testbig, make testperf, make testbigperf . Jenkins should use dotest, dotestbig, dotestperf, dotestbigperf

Ying-Yang and other K-Means

O(N^2) algorithm of K-Means and Nearest Neighbors can be massively accelerated using GPUs.

Proof of concept: https://github.com/arnocandel/kmcuda

Improve contributor's guide

Add a proper contributors guide with details (best practices used in the project, our PR strategy, how to run CI tests, where to find issues etc.)

Jenkins TV board for accuracy and speed of h2o4gpu

For all of glm vs. h2o-3, kmeans vs sklearn/Intel daal, xgboost vs. lightgbm

Drop to sklearn in case we have no such implementation

Unsure how to do technically, but should be possible.

Use Intel DAAL for CPU side of many algorithms

https://software.intel.com/en-us/intel-daal/details

Algorithms

Data Analysis: Characterization, Summarization, and Transformation

Low Order Moments
Computes the basic dataset characteristics such as sums, means, second order raw moments, variances, standard deviations, etc.

Quantile
Computes quantiles that summarize the distribution of data across equal-sized groups as defined by quantile orders.

Correlation and Variance-Covariance Matrices
Quantifies pairwise statistical relationship between feature vectors.

Cosine Distance Matrix
Measures pairwise similarity between feature vectors using cosine distances.

Correlation Distance Matrix
Measures pairwise similarity between feature vectors using correlation distances.

Cholesky Decomposition
Decomposes a symmetric positive-definite matrix into a product of a lower triangular matrix and its transpose. This decomposition is a basic operation used in solving linear systems, non-linear optimization, Kalman filtration, etc.

QR Decomposition
Decomposes a general matrix into a product of an orthogonal matrix and an upper triangular matrix. This decomposition is used in solving linear inverse and least squares problems. It is also a fundamental operation in finding eigenvalues and eigenvectors.

SVD
Singular Value Decomposition decomposes a matrix into a product of a left singular vector, singular values, and a right singular vector. It is the basis of Principal Component Analysis, solving linear inverse problems, and data fitting.

PCA
Principal Component Analysis reduces the dimensionality of data by transforming input feature vectors into a new set of principal components orthogonal to each other.

K-Means
Partitions a dataset into clusters of similar data points. Each cluster is represented by a centroid, which is the mean of all data points in the cluster.

Expectation-Maximization
Finds maximum-likelihood estimate of the parameters in models. It is used for the Gaussian Mixture Model as a clustering method. It can also be used in non-linear dimensionality reduction, missing value problems, etc.

Outlier Detection
Identifies observations that are abnormally distant from other observations. An entire feature vector (multivariate) or a single feature value (univariate), can be considered in determining if the corresponding observation is an outlier.

Association Rules
Discovers a relationship between variables with certain level of confidence.

Linear and Radial Basis Function Kernel Functions
Map data onto higher-dimensional space.

Quality Metrics
Compute a set of numeric values to characterize quantitative properties of the results returned by analytical algorithms. These metrics include Confusion Matrix, Accuracy, Precision, Recall, F-score, etc.

Machine Learning: Regression, Classification, and More

Neural Networks for Deep Learning
A programming paradigm which enables a computer to learn from observational data.

Linear and Ridge Regressions
Models relationship between dependent variables and one or more explanatory variables by fitting linear equations to observed data.

Naïve Bayes Classifier
Splits observations into distinct classes by assigning labels. Naïve Bayes is a probabilistic classifier that assumes independence between features. Often used in text classification and medical diagnosis, it works well even when there are some level of dependence between features.

Boosting
Builds a strong classifier from an ensemble of weighted weak classifiers, by iteratively re-weighting according to the accuracy measured for the weak classifiers. A decision stump is provided as a weak classifier. Available boosting algorithms include AdaBoost (a binary classifier), BrownBoost (a binary classifier), and LogitBoost (a multi-class classifier).

SVM
Support Vector Machine is a popular binary classifier. It computes a hyperplane that separates observed feature vectors into two classes.

Multi-Class Classifier
Builds a multi-class classifier using a binary classifier such as SVM.

ALS
Alternating Linear Squares is a collaborative filtering method of making predictions about the preferences of a user, based on preference information collected from many users.

Use improved adaptive rho for GLM

http://www.cs.umd.edu/sites/default/files/scholarly_papers/ZhengXu.pdf
ZhengXu.pdf

@pseudotensor already started thinking/implementing, and not hard, just needs time/focus/energy.

h2oai / h2o4gpu Goto Github PK

h2o4gpu's Issues

Recommend Projects

Recommend Topics

Recommend Org