Giter Club home page Giter Club logo

spark-sklearn's Introduction

Deprecation

This project is deprecated. We now recommend using scikit-learn and Joblib Apache Spark Backend to distribute scikit-learn hyperparameter tuning tasks on a Spark cluster:

You need pyspark>=2.4.4 and scikit-learn>=0.21 to use Joblib Apache Spark Backend, which can be installed using pip:

pip install joblibspark

The following example shows how to distributed GridSearchCV on a Spark cluster using joblibspark. Same applies to RandomizedSearchCV.

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from joblibspark import register_spark
from sklearn.utils import parallel_backend

register_spark() # register spark backend

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC(gamma='auto')

clf = GridSearchCV(svr, parameters, cv=5)

with parallel_backend('spark', n_jobs=3):
    clf.fit(iris.data, iris.target)

Scikit-learn integration package for Apache Spark

This package contains some tools to integrate the Spark computing framework with the popular scikit-learn machine library. Among other things, it can:

  • train and evaluate multiple scikit-learn models in parallel. It is a distributed analog to the multicore implementation included by default in scikit-learn
  • convert Spark's Dataframes seamlessly into numpy ndarray or sparse matrices
  • (experimental) distribute Scipy's sparse matrices as a dataset of sparse vectors

It focuses on problems that have a small amount of data and that can be run in parallel. For small datasets, it distributes the search for estimator parameters (GridSearchCV in scikit-learn), using Spark. For datasets that do not fit in memory, we recommend using the distributed implementation in `Spark MLlib.

This package distributes simple tasks like grid-search cross-validation. It does not distribute individual learning algorithms (unlike Spark MLlib).

Installation

This package is available on PYPI:

pip install spark-sklearn

This project is also available as Spark package.

The developer version has the following requirements:

  • scikit-learn 0.18 or 0.19. Later versions may work, but tests currently are incompatible with 0.20.
  • Spark >= 2.1.1. Spark may be downloaded from the Spark website. In order to use this package, you need to use the pyspark interpreter or another Spark-compliant python interpreter. See the Spark guide for more details.
  • nose (testing dependency only)
  • pandas, if using the pandas integration or testing. pandas==0.18 has been tested.

If you want to use a developer version, you just need to make sure the python/ subdirectory is in the PYTHONPATH when launching the pyspark interpreter:

PYTHONPATH=$PYTHONPATH:./python:$SPARK_HOME/bin/pyspark

You can directly run tests:

cd python && ./run-tests.sh

This requires the environment variable SPARK_HOME to point to your local copy of Spark.

Example

Here is a simple example that runs a grid search with Spark. See the Installation section on how to install the package.

from sklearn import svm, datasets
from spark_sklearn import GridSearchCV
iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC(gamma='auto')
clf = GridSearchCV(sc, svr, parameters)
clf.fit(iris.data, iris.target)

This classifier can be used as a drop-in replacement for any scikit-learn classifier, with the same API.

Documentation

API documentation is currently hosted on Github pages. To build the docs yourself, see the instructions in docs/.

https://travis-ci.org/databricks/spark-sklearn.svg?branch=master

spark-sklearn's People

Contributors

ajaysaini725 avatar digital10111 avatar hahnicity avatar jkbradley avatar kautenja avatar kornosk avatar mengxr avatar shaunswanson avatar smurching avatar srowen avatar thunterdb avatar vlad17 avatar weichenxu123 avatar zero323 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.