dask-ml's Introduction

Dask

Dask is a flexible parallel computing library for analytics. See documentation for more information.

LICENSE

New BSD. See License File.

dask-ml's People

Contributors

abduhbm, cmarmo, datajanko, dsevero, eric-czech, fujiisoup, glemaitre, hristog, jacobtomlinson, jakirkham, jameslamb, jcrist, jjerphan, jorisvandenbossche, jrbourbeau, jsignell, kchare, lebedov, mmccarty, mrocklin, prasunanand, raybellwaves, rmsare, rth, ryan-deak-zefr, stsievert, thomasjpfan, tomaugspurger, vibhujawa, yl-1993

dask-ml's Issues

Inconsistent inherited docstrings

A few estimators subclass estimators from other ML packages (e.g. scikit-learn). When a method is not overridden, the docstring from the parent package is used directly in the documentation, and it is not necessarily consistent with the rest of the docs. For instance,

I wonder if some automatic rewriting of the docstrings of non-overridden methods could partially address this...
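One way such rewriting might look is sketched below. It is only an illustration, not something dask-ml does: `rewrite_inherited_docstrings`, `ParentEstimator`, and `ChildEstimator` are hypothetical names, and the substitution table is an assumption. The decorator replaces each inherited public method with a thin forwarder that carries an edited copy of the parent docstring.

```python
import inspect


def _make_forwarder(parent, doc):
    """Build a method that forwards to `parent` but carries its own docstring."""
    def forwarder(self, *args, **kwargs):
        return parent(self, *args, **kwargs)

    forwarder.__name__ = parent.__name__
    forwarder.__doc__ = doc
    # Keep the parent's signature visible to help() and similar tools.
    forwarder.__signature__ = inspect.signature(parent)
    return forwarder


def rewrite_inherited_docstrings(replacements):
    """Hypothetical class decorator: rewrite docstrings of non-overridden methods."""
    def decorate(cls):
        for name, member in inspect.getmembers(cls, inspect.isfunction):
            # Skip private methods, methods the subclass defines itself,
            # and methods without a docstring.
            if name.startswith("_") or name in cls.__dict__ or not member.__doc__:
                continue
            doc = member.__doc__
            for old, new in replacements.items():
                doc = doc.replace(old, new)
            setattr(cls, name, _make_forwarder(member, doc))
        return cls
    return decorate


class ParentEstimator:
    def predict(self, X):
        """Predict using a numpy.ndarray X."""
        return [x * 2 for x in X]


@rewrite_inherited_docstrings({"numpy.ndarray": "dask.array.Array"})
class ChildEstimator(ParentEstimator):
    pass
```

The parent class is left untouched, so the upstream documentation is unaffected; only the subclass sees the rewritten text.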

Collect routines from other packages?

In some cases we import routines from other packages. In some cases we don't.

For example we import LogisticRegression from dask_glm but don't import GridSearchCV from dask_searchcv. Which approach should we take here? I'm slightly inclined to import from everywhere and consolidate here.

Candidate transformers

I think it'd be nice to have some transformers that work on dask and numpy arrays, & dask and pandas DataFrames. This would be good since

  1. We can depend on dask and pandas, scikit-learn can't
  2. A basic transformer is generally much less work than a full-blown estimator

API

See https://github.com/tomaugspurger/sktransformers for some inspiration (and maybe some tests)?

  • We should match scikit-learn as closely as possible where things overlap
  • All transformers should take an optional columns argument. When specified, the transformation will be limited to just columns (e.g. if doing a standard scaling, and columns=['A', 'B'], only 'A' and 'B' are scaled). By default, all columns are scaled
  • We should operate on np.ndarray, dask.array.Array, pandas.core.NDFrame, dask.dataframe._Frame.
  • Should our operations be nan-safe?
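To make the `columns` idea concrete, here is a minimal sketch for the pandas case only. `ColumnStandardScaler` is a hypothetical class written for this illustration, not the dask-ml implementation, and it assumes population standard deviation (`ddof=0`) to mirror scikit-learn's scaler.

```python
import pandas as pd


class ColumnStandardScaler:
    """Hypothetical scaler that only touches the requested columns."""

    def __init__(self, columns=None):
        self.columns = columns

    def fit(self, X, y=None):
        # Default to every column when none are requested.
        cols = list(X.columns) if self.columns is None else list(self.columns)
        self.columns_ = cols
        self.mean_ = X[cols].mean()
        self.scale_ = X[cols].std(ddof=0)
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's frame untouched
        X[self.columns_] = (X[self.columns_] - self.mean_) / self.scale_
        return X


df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [10.0, 20.0, 30.0],
    "C": [5.0, 6.0, 7.0],
})
scaled = ColumnStandardScaler(columns=["A", "B"]).fit(df).transform(df)
```

Here 'A' and 'B' are centered and scaled while 'C' passes through unchanged. Supporting arrays and dask collections would need extra dispatch that this sketch leaves out.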

The big question right now is should fitting be eager, and fitted values concrete? e.g.

scaler = StandardScaler()
scaler.fit(X)  # X is a dask.array

So, has scaler.mean_ been computed, and is it a dask.array or a numpy.array? This is a big decision.

Candidates

Add ICA

Would be nice to have some form of ICA available for large Dask Arrays. Ideally could just use an implementation or a few as appropriate from scikit-learn.

LogisticRegression cannot train from Dask DataFrame

A simple example:

from dask import dataframe as dd
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression

X, y = make_classification(n_samples=10000, n_features=2)

X = dd.from_dask_array(X, columns=["a","b"])
y = dd.from_array(y)

lr = LogisticRegression()
lr.fit(X, y)

Returns KeyError: (<class 'dask.dataframe.core.DataFrame'>,)

I did not have time to check whether this is also the case for other models.

No module named 'daskml.cluster._k_means'

Hey guys,

Just got back and I tried to run pytest and got:

ImportError while importing test module '/home/severo/Projects/dask-ml/tests/test_kmeans.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_kmeans.py:7: in <module>
    from daskml.cluster import KMeans as DKKMeans
daskml/cluster/__init__.py:4: in <module>
    from .k_means import KMeans  # noqa
daskml/cluster/k_means.py:12: in <module>
    from ._k_means import _centers_dense
E   ModuleNotFoundError: No module named 'daskml.cluster._k_means'

I understand that it is a Cythonized .pyx file, but shouldn't we be calling pyximport.install() somewhere?

Once I added this to the top of daskml/cluster/k_means.py everything went back to normal.
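For reference, the workaround described above might look roughly like the sketch below. The guard around the import is an addition for this illustration, so the module still imports when Cython is not installed; it is not taken from the dask-ml source.

```python
# Sketch of the workaround: let pyximport compile .pyx modules on import.
try:
    import pyximport
except ImportError:
    # Cython (which ships pyximport) is not available in this environment.
    pyximport = None

if pyximport is not None:
    # Installs an import hook so that `from ._k_means import ...`
    # can build the extension on the fly.
    pyximport.install()
```

Building the extension ahead of time (for example with `python setup.py build_ext --inplace`) avoids the import hook altogether and is the more common setup for development installs.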

k-means Memory usage

Debugging a memory usage issue I'm seeing with k-means initialization. The issue is in this loop. A small example is

import numpy as np
import dask.array as da
from dask import delayed
from distributed import Client, wait

from sklearn.datasets import make_classification


if __name__ == '__main__':

    c = Client()
    s = c.cluster.scheduler
    N_SAMPLES = 1_000_000
    N_BLOCKS = 24

    def mem():
        print("{:.2f}".format(sum(s.worker_bytes.values()) / 10**9), "GB")

    def make_block(n_samples):
        X, y = make_classification(n_samples=n_samples)
        return X

    blocks = [delayed(make_block)(N_SAMPLES) for i in range(N_BLOCKS)]

    arrays = [da.from_delayed(block, dtype='f8', shape=(N_SAMPLES, 20))
              for block in blocks]
    stacked = da.vstack(arrays)

    print(stacked.nbytes / 10**9, "GB")
    X = c.persist(stacked)
    wait(X)

    for i in range(5):
        idx = np.random.randint(0, len(X), size=5)
        centers = X[idx].compute()
        mem()

Which outputs

3.84 GB
4.32 GB
4.80 GB
5.12 GB
5.60 GB
6.08 GB

@mrocklin is that increasing memory usage expected? Inside that for loop, the result is always going to be a small NumPy array.

(side-note, that generates some exceptions I've left out of the output):

distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
  File "/Users/taugspurger/.virtualenvs/dask-dev/lib/python3.6/site-packages/distributed/distributed/protocol/core.py", line 122, in loads
    value = _deserialize(head, fs)
  File "/Users/taugspurger/.virtualenvs/dask-dev/lib/python3.6/site-packages/distributed/distributed/protocol/serialize.py", line 160, in deserialize
    f = deserializers[header.get('type')]
KeyError: 'numpy.ndarray'

looking into those now.

Shuffle blocks for partial_fit wrappers

https://twitter.com/shoyer/status/912387291749822465

The incremental learners are susceptible to bias when the blocks aren't IID (say the dataset is ordered by time, and X or y is correlated with time).

How best to handle this? A full shuffle is costly, and unnecessary. A shuffle of the blocks (plus maybe an in-memory shuffle of each block) should be fine. Actually... I should check now if the scheduler goes through the blocks in order.
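A block-level shuffle of that kind can be sketched with plain Python lists standing in for array blocks. `shuffled_blocks` is a hypothetical helper for illustration; a real version would have to shuffle the matching X and y blocks with the same permutation.

```python
import random


def shuffled_blocks(blocks, seed=None):
    """Yield blocks in random order, each with its rows shuffled in memory."""
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)  # cheap: permutes block indices only
    for i in order:
        block = list(blocks[i])  # copy so the input is not mutated
        rng.shuffle(block)  # in-memory shuffle within one block
        yield block


# Three "time-ordered" blocks of three rows each.
time_ordered = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
visited = list(shuffled_blocks(time_ordered, seed=0))
```

Rows never move between blocks, so no data has to be exchanged between workers; only the visiting order and the order within each block change.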

DOC: Add install page

Before I forget, it'd be good to document all the required / optional dependencies. I'll do this later today.

@wraps breaks introspection

@wraps(func)

The use of @wraps here leads to wrong behavior of some introspection tools, like ?? in Jupyter (will show you the code for the sklearn version).

I feel that it's best to wrap things manually (assigning the __name__, __doc__ and __module__ attributes of inner by hand), after line 17, before the return.

This issue is making some unpleasant workarounds necessary in the xarray_filters codebase (same datasets.py file), where we are trying to give users more helpful docstrings than what's in sklearn. The issue for me revolves around knowing whether I'm going to wrap the make_blobs from dask_ml or from sklearn. I could do it by inspecting the signature, or the __module__ attribute, but those attributes are set equal to the ones in sklearn right now.
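The difference between the two approaches can be sketched as follows. All names here are stand-ins: `upstream_make_blobs` plays the role of the sklearn function, and its `__module__` is set by hand to simulate living in another package. The manual variant deliberately leaves `__module__` alone, which is one reading of the proposal above, so that callers can tell the wrapper apart from the original.

```python
import functools


def upstream_make_blobs(n_samples=3):
    """Docstring written by the upstream package."""
    return list(range(n_samples))


# Pretend the function lives in another package (stand-in for sklearn).
upstream_make_blobs.__module__ = "upstream_pkg.datasets"


def wrap_with_wraps(func):
    @functools.wraps(func)
    def inner(*args, **kwargs):
        return func(*args, **kwargs)
    return inner


def wrap_manually(func):
    def inner(*args, **kwargs):
        return func(*args, **kwargs)
    # Copy only the attributes we want; inner keeps its own __module__
    # and gets no __wrapped__ attribute pointing back at func.
    inner.__name__ = func.__name__
    inner.__doc__ = func.__doc__
    return inner


auto = wrap_with_wraps(upstream_make_blobs)
manual = wrap_manually(upstream_make_blobs)
```

`functools.wraps` copies `__module__` and also sets `__wrapped__`, which tools built on `inspect` follow back to the original function; that is likely why source lookups land on the upstream code. The manual wrapper reports the upstream name and docstring but keeps the module of the file that defines it.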

I'll put links to the relevant parts of the xarray_filters codebase here as soon as I push the new code up.

Joblib, parallel_backend, and performance

Hi y'all - Matthew suggested I ask some questions here.

I'm a little confused on some results I'm seeing - and am wondering if you guys can help me figure out why I am seemingly executing grid searches in serial.

I modified dask-docker and am using containers on a docker swarm with 64GB RAM and 48 cores each.

search = RandomizedSearchCV(self.estimator, self.estimator.param_space, cv=3, n_iter=20, verbose=10)

if self.estimator.use_dask:

    address = 'tcp://dask-cluster:8786'
    c = Client(address)

    with parallel_backend('dask.distributed', scheduler_host=address,
                          scatter=[X.values, y.values]):
        search.fit(X.values, y.values)

where X is pd.DataFrame and y is pd.Series.

When I run the above with a small number of rows, I can see in the status page that lots of tasks get executed in parallel. The blocks of work on the task stream end up looking like a straight vertical since they all get dispatched nearly simultaneously.

Once I start increasing the number of rows, I seem to get more and more serial execution, where the status page shows that really one additional task gets added at a time.

In this screenshot, I started at the full dataset and subtracted 10k rows on each run to see the effect on execution time and parallelism. For some reason the work occasionally gets distributed a little differently, e.g. at 80k and 40k.

[Screenshot, 2018-01-10]

When the number is higher, there is never more than one task active. When the number is lower, I see more (up to 4) getting triggered simultaneously.

Anyhow, my question ultimately is:

  • am I doing something in particular wrong?
  • does this pattern look indicative of something I set up incorrectly?

Sparse arrays support

While sparse arrays are supported in dask, this issue aims to open the discussion on how this could be applied in the context of dask-ml.

In particular, even if #5 about TF-IDF gets resolved, the estimators downstream in the pipeline would also need to support sparse arrays for this to be of any use. The simplest example of such a pipeline could, for instance, be #115: some text vectorizers, training a wrapped out-of-core scikit-learn model (e.g. PartialMultinomialNB).

Potentially relevant estimators

TruncatedSVD, text vectorizers, some estimators in dask_ml.preprocessing, and wrapped scikit-learn models that support incremental learning and sparse arrays natively.

Sparse array format

Several choices here,

  1. Should mrocklin/sparse be added as a hard dependency, as the package that works out of the box for sparse arrays?
  2. Should sparse arrays from scipy be wrapped to make them compatible in some limited fashion (if possible at all)?

In particular, as far as I understand, for the application at hand there is no need for the ND sparse COO arrays provided by the sparse package; 2D would be enough. Furthermore, scikit-learn mostly uses CSR, and while it's relatively easy to convert between COO and CSR/CSC in the non-distributed case, I'm not sure if that's still the case for dask. Then there is the partitioning strategy (see next section).

Partitioning strategy

At least as far as text vectorizers and incremental learning estimators are concerned, I imagine it might be easier to partition the arrays row-wise (using the full column width), which might also be natural with the CSR format.

File storage format

For instance, once someone manages to compute a distributed TF-IDF, the question arises of how to store it on disk without loading everything in memory at once. At present, it doesn't look like there is a canonical way to do this (dask/dask#2562 (comment)). zarr-developers/zarr-python#152 might be relevant, but as far as I understand it essentially stores a dense format with compression, which I believe makes later computation with the data difficult.

Just a few general thoughts. I'm not sure what your vision of the project is in this respect, @mrocklin @TomAugspurger, how much work this would represent, or what might be the easiest to start with.

cc @jorisvandenbossche @ogrisel

Alternating least squares

Someone mentioned wanting to implement ALS (I think @daniel-severo ?)

I thought I'd raise an issue to solicit and discuss algorithms.

Distributed TFIDF

Greetings!

I recently used dask to implement a distributed version of tfidf. I want to contribute to the dask project by putting it somewhere.

Would this be the correct repo?

I thought maybe a feature_extraction directory would be appropriate.

First Release

I'd like to do an initial release to PyPI and conda forge this weekend.

Any objections @mrocklin, @daniel-severo? I'll merge #51 shortly, and #11 once I have another look over it and if you're comfortable with it @daniel-severo.

Revise documentation on hyperparameter search

Right around the time that I stumbled upon dask-searchcv, scikit-learn 0.19 was released where pipelines now support the memory parameter:
http://scikit-learn.org/stable/whats_new.html#version-0-19

As such, perhaps this section of the docs should be revised:

With the regular scikit-learn version, each stage of the pipeline must be fit for each of the combinations of the parameters, even if that step isn’t being searched over.

I'd be interested to hear if any work was done to compare/contrast dask-searchcv vs scikit-learn's changes in 0.19. Presumably dask-searchcv allows you to more easily harness multiple machines, but perhaps some rudimentary benchmarks could/would be interesting/appropriate.

From:
http://dask-ml.readthedocs.io/en/latest/hyper-parameter-search.html#efficient-search

sklearn-xarray compatibility

There was a question over in pangeo-data/pangeo#61 (comment) about how dask-ml works with sklearn-xarray

When I first tried it out, things like dask_ml.preprocessing.StandardScaler() failed since we do X.mean(0) instead of X.mean(axis=0). I think I tried fixing that locally and a small example worked, though my memory is hazy.

I don't personally have much experience with xarray, but would certainly welcome PRs to make things work smoothly.

cc @phausamann, @gmaze

Docs reorganization

I've been walking through the documentation and had a few notes. I'd be happy to make these changes but wanted to check in before I submitted anything.

The separation between single machine and distributed learning seems odd to me. Many of the topics listed in single machine (grid search, pipelining, possibly even incremental learning) are still relevant when on a cluster.

I might instead flatten the TOC to just remove the single-machine/distributed distinction, and give all of the subsections their own home. This might flatten the TOC to something like the following:

  1. Pipelines
  2. Hyper-parameter search
  3. Incremental learning
  4. Generalized Linear Models
  5. Joblib
  6. XGBoost
  7. Clustering
  8. Examples
  9. API

I've also found it pleasant recently to start sections with the API relevant for that section; see the Futures docs for an example. This gives a quick list at the beginning of each section of the API relevant for that section. Those functions still link to the main API doc page.

Any thoughts or objections to this reorganization @TomAugspurger ?

Pandoc required to build docs

I was trying to build the dask-ml documentation locally and got the following output from make html

building [mo]: targets for 0 po files that are out of date
building [html]: targets for 45 source files that are out of date
updating environment: 45 added, 0 changed, 0 removed
reading sources... [ 11%] examples/dask-glm                                                                                                                                   
Notebook error:
PandocMissing in examples/dask-glm.ipynb:
Pandoc wasn't found.
Please check that pandoc is installed:
http://pandoc.org/installing.html
make: *** [html] Error 1

It looks like Pandoc is missing from my dev environment (created using conda env create -f ci/environment.yml --name=dask-ml_dev). After manually installing Pandoc from conda-forge, I was able to build the documentation successfully.

Not sure if I missed something when setting up my environment. If not, I'd be happy to submit a PR adding Pandoc to ci/environment.yml and ci/environment-27.yml

Cython should be added as a dependency

Hey Guys,

Just attempting to install in a fresh virtualenv:

➜ pip install dask-ml
Collecting dask-ml
  Downloading dask-ml-0.3.1.tar.gz (262kB)
    100% |████████████████████████████████| 266kB 1.4MB/s 
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/cy/fp9dph0x14x9s7fszwjjjh4c0000gn/T/pip-build-HNpvGT/dask-ml/setup.py", line 5, in <module>
        from Cython.Build import cythonize
    ImportError: No module named Cython.Build
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/cy/fp9dph0x14x9s7fszwjjjh4c0000gn/T/pip-build-HNpvGT/dask-ml/

Seems like pip install Cython was all I needed - this should be added to the requirements in setup.py, no?

Rename to dask_ml?

Most dask packages use the following naming convention:

  1. dask-foo for package name on PyPI, and github
  2. dask_foo for import name in Python

The dask-ml package is currently imported as daskml rather than dask_ml. The underscore is a bit of a pain, so I can see how the current name would be preferred. Still though, I thought I'd bring this up. We may want to switch for conformity's sake.

cc @TomAugspurger @jcrist

Add License

I apparently forgot to include a license when creating the repo. The intent was BSD-3.

@daniel-severo, @mrocklin, as the other contributors up to this point, are you OK with adding that?

How does dask-ml handle large datasets?

From the latest demos of dask-sklearn (and related projects) that I've seen, dask has primarily focused on parallelizing grid/random search and CV, and provides some smart caching mechanisms for transformers and estimators that have already been evaluated on a dataset.

However, that approach breaks down when, say, a slow algorithm is being trained on a very large dataset. With some slower algorithms, training one instance of the algorithm on one fold of the dataset can take hours or days, which makes training the algorithm unviable for all but the most patient practitioner.

Can dask-ml speed up the underlying ML algorithms that it supports?
