
dask-searchcv's Introduction

Dask


Dask is a flexible parallel computing library for analytics. See documentation for more information.

LICENSE

New BSD. See License File.

dask-searchcv's People

Contributors

alvarouc, bnaul, jcrist, milescranmer, mrocklin, tomaugspurger


dask-searchcv's Issues

Failure on model pipeline that succeeds using stock scikit-learn

I can't reliably get a model pipeline that runs fine under stock scikit-learn to work using dask-searchcv. Is there an expectation of how much memory overhead dask will require compared to scikit-learn? My test dataset is fairly small (< 10MB TSV) and my VM has on the order of 8GB available to it.

I'm attempting to use dask-searchcv's GridSearchCV to build a model pipeline without redundantly executing a few expensive pipeline components. scikit-learn 0.19 has made strides in this area, but I'm looking to a future where I'll want to further parallelize my workload across more machines and/or handle larger datasets and more complicated pipelines.

Since the dataset is small, I'm trying to distribute the execution and avoid redundant computation, but I don't necessarily need distributed access to a dataframe at this time.

I modified my code to use dask-searchcv's GridSearchCV in "local distributed" mode as follows:

# from sklearn.model_selection import GridSearchCV
from dask_searchcv import GridSearchCV
from dask.diagnostics import ProgressBar  # used below; this import was missing
import sklearn.pipeline
...
    pipe = sklearn.pipeline.Pipeline(
    ....
    )
    # Use dask in distributed mode
    from distributed import Client
    c = Client()
    best_model = GridSearchCV(
            pipe,
            param_grid=param_grid,
            cv=10,
            scoring=my_error_metrics_function,
            n_jobs=cpu_count
    )

    with ProgressBar():
        best_model.fit(train_x, train_y)

I skimmed the documentation for distributed and it seems to support the idea that no additional setup is required. Since it didn't immediately die on me, I'm assuming my config/setup is correct.
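One way to rule variables in or out when the OOM killer is the suspect: pin the local cluster's size explicitly instead of taking Client()'s machine-derived defaults. A minimal sketch, assuming a distributed release where these LocalCluster keywords exist (the values are illustrative, not a recommendation):

from distributed import Client

# Four single-threaded workers capped at 1.5GB each stays well inside
# an 8GB VM, leaving headroom for the scheduler and the client process.
c = Client(n_workers=4, threads_per_worker=1, memory_limit='1.5GB')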

| gridsearchcv implementation | result | notes |
| --- | --- | --- |
| scikit-learn 0.19 | reliably succeeds | Used in default mode without saving state. |
| dask-searchcv (default mode) | occasionally fails | From reading the code, this appears to use threaded_get. Failures seem more likely when more than one core is available. |
| dask-searchcv in "local distributed" mode | ran for "a while", then died with distributed.scheduler.KilledWorker | Appears to have been killed by the OOM killer. |

Example dask-searchcv failure in default mode

I believe this is related to scikit-learn/scikit-learn#2755

  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 315, in fit
    random_state=random_state)
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/ensemble/base.py", line 127, in _make_estimator
    for p in self.estimator_params))
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/base.py", line 264, in set_params
    valid_params = self.get_params(deep=True)
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/base.py", line 240, in get_params
    warnings.filters.pop(0)
IndexError: pop from empty list

Example failure in local distributed mode

I'm pretty sure this was a victim of the OOM killer.

File "MY_MODULE_build_models.py", line 426, in <module>
    best_model.fit(train_x, train_y)
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/dask_searchcv/model_selection.py", line 793, in fit
    out = scheduler(dsk, keys, num_workers=n_jobs)
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/client.py", line 1923, in get
    results = self.gather(packed, asynchronous=asynchronous)
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/client.py", line 1368, in gather
    asynchronous=asynchronous)
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/client.py", line 540, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/utils.py", line 239, in sync
    six.reraise(*error[0])
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/six.py", line 686, in reraise
    raise value
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/utils.py", line 227, in f
    result[0] = yield make_coro()
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/client.py", line 1246, in _gather
    traceback)
  File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/six.py", line 686, in reraise
    raise value
distributed.scheduler.KilledWorker: ("('mypipelinestep-fit-dfe1861ca5d06ad69d926709c897ffc9', 34, 0)", 'tcp://127.0.0.1:46091')

My environment

$ conda list | egrep '^(numpy|scipy|dask|scikit|distributed|python |tornado)'
dask                      0.15.2                   py36_0
dask-searchcv             0.0.2+8.g2c2056d           <pip>
distributed               1.18.1                   py36_0
numpy                     1.13.1                   py36_0
python                    3.6.1                         2
scikit-learn              0.19.0              np113py36_0
scipy                     0.19.1              np113py36_0
tornado                   4.5.2                    py36_0

What happened to Pipeline

The Pipeline class seems to have disappeared. Are we now just introspecting sklearn pipelines to avoid recomputation of earlier stages?

Incompatibility With Keras Scikit-Learn Wrapper

The Keras neural-network package has a sklearn wrapper that works with the sklearn RandomizedSearchCV and GridSearchCV classes. However, it fails with the dask-searchcv equivalents. Thus, there seem to be additional requirements beyond the sklearn estimator interface that must be met for dask-searchcv to work. Would it be possible to list these, so that other projects could be adapted for use with dask-searchcv?
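There doesn't seem to be an official list, but judging from the tracebacks elsewhere in this tracker, a plausible sketch of the extra requirements (inferences, not documented guarantees) is that the estimator, and everything reachable from its parameters, must survive cloning, deep-copying, and pickling. A quick check one could run against the Keras wrapper:

import copy
import pickle

from sklearn.base import clone


def check_searchcv_friendly(estimator):
    """Raise if `estimator` violates one of the assumed requirements."""
    clone(estimator)          # get_params/set_params must round-trip
    copy.deepcopy(estimator)  # estimators appear to be deep-copied when parameters are substituted
    pickle.dumps(estimator)   # needed when tasks are shipped to distributed workers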

GridSearchCV vs DaskGridSearchCV

Thoughts on one name over the other? I have a slight preference for GridSearchCV in order to make switching out imports easier (though I can also accomplish this with import ... as ...)

dask-searchcv incompatible with Dask v0.18

Running dask-searchcv 0.2.0 with Dask 0.18.1, I'm finding that the 'distributed' scheduling back-end no longer picks up the tasks; they run in the submitting Python kernel instead of on the registered worker. The web UI confirms as much: the tasks fail to show up there, and I see console output in the submitting kernel rather than in the dask-worker.

Downgrading to Dask 0.17.5 resolves the issue.
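Until the incompatibility is fixed, one workaround sketch is to hand the Client to dask-searchcv explicitly rather than relying on the globally registered scheduler. This assumes the scheduler keyword of dask-searchcv 0.2 accepts a distributed Client; the names below (pipe, param_grid, X, y) and the address are placeholders:

from distributed import Client
import dask_searchcv as dcv

client = Client('tcp://scheduler-address:8786')  # hypothetical address
search = dcv.GridSearchCV(pipe, param_grid, scheduler=client)
search.fit(X, y)  # tasks should now be routed through the client explicitly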

Fold dask-searchcv into Dask-ML

Currently dask-ml.model_selection just imports dask-searchcv. For ease of maintenance, I'd like to move dask-searchcv into dask-ml. Does anyone have objections?

cc @jcrist

Difference between `dklearn.Pipeline` and `sklearn.Pipeline`

What's the difference between dklearn.Pipeline and sklearn.Pipeline? Does dklearn cache the intermediate results?

Reason I ask: I'm trying to fit a large number of sklearn Pipelines in parallel, comparing building the computational graph with dask against a hand-rolled (way less sophisticated) joblib implementation. dask is ~50% slower, and I wonder whether that's because joblib memmaps large objects to reduce communication overhead. Any thoughts? Happy to share code if it's useful.

Multi-threading or -processing doesn't work for simple sklearn Pipeline

Hello,

I'm having trouble using this nice tool dask-searchcv on a simple Pipeline.

I tried it on a simple sklearn Pipeline (StandardScaler + SVC with an RBF kernel):

Pipeline(memory=None,
         steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
                            decision_function_shape='ovr', degree=3, gamma='auto',
                            kernel='rbf', max_iter=-1, probability=False,
                            random_state=42, shrinking=True, tol=0.001,
                            verbose=False))])

with n_jobs=-1, scheduler="threading" or scheduler="multiprocessing", and a search grid over the C and gamma parameters, I only ever see one process used (out of my 16 available).

Whereas when I use dask-searchcv on a composed Pipeline that also includes PCA, I see, as expected, one process at 1600% CPU or 16 processes launched.

I don't understand why dask-searchcv multi-threading or multi-processing doesn't work in the first case...

Any explanation?

Matt

Works with stock Pipeline now?

Hey.
From what I can see, this now works with the stock pipeline. Is that correct? That's amazing!
Could you maybe add an example to make that clear? All the blog posts etc. show a dask-sklearn pipeline.
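A minimal sketch of what such an example might look like (not taken from the project's docs; the grid values are illustrative). The key point is that pipe is a plain sklearn.pipeline.Pipeline, and the three PCA fits should be shared across the SVC C values in the task graph:

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from dask_searchcv import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

pipe = Pipeline([('pca', PCA()), ('svc', SVC())])
param_grid = {'pca__n_components': [2, 5, 10],
              'svc__C': [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)  # each PCA fit is reused across the SVC C values
print(search.best_params_)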

Keep up the great work!

COMPAT: /Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/scikit-learn/sklearn/base.py:114: DeprecationWarning: Estimator Pipeline modifies parameters in __init__.

Seen in, e.g., pytest dask_searchcv/tests/test_model_selection.py::test_pipeline_feature_union

Lots of warnings like

  /Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/scikit-learn/sklearn/base.py:114: DeprecationWarning: Estimator Pipeline modifies parameters in __init__. This behavior is deprecated as of 0.18 and support for this behavior will be removed in 0.20.
    % type(estimator).__name__, DeprecationWarning)

I'll take a look this afternoon.

Compatibility with scikit-learn 0.19

Does dask-searchcv work with scikit-learn 0.19+? I'm getting the following error:

  File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/dask_searchcv/methods.py", line 16, in <module>
    from sklearn.utils.fixes import rankdata
ImportError: cannot import name 'rankdata'
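The shim sklearn.utils.fixes.rankdata wrapped a scipy function that still exists, so a patched import is the likely fix. This mirrors what I'd expect the dask-searchcv fix to look like; treat it as an assumption, not the released patch:

# In dask_searchcv/methods.py, the failing line
#     from sklearn.utils.fixes import rankdata
# can be replaced by importing directly from scipy, which the
# scikit-learn shim wrapped in the first place:
from scipy.stats import rankdata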

re Custom Sklearn Transforms with Dask Apply inside

I've created some custom sklearn transforms that I'm putting into a pipeline. These custom transforms take a pandas object and apply some function over it, as in this example, which extracts text from HTML strings to pass into a CountVectorizer:

import lxml.html
from sklearn.base import BaseEstimator, TransformerMixin


class GetText(BaseEstimator, TransformerMixin):

    def get_text(self, html_string):
        return lxml.html.document_fromstring(html_string).text_content()

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        return X.apply(self.get_text)

I realize that a dask apply can be swapped in generically for the pandas apply, especially if I convert the data to dask dataframes. But since these run inside a sklearn pipeline, I wasn't sure how much parallelism I'd get.
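For what it's worth, here is a hedged sketch (not from the issue) of the same transform with the apply itself parallelized through dask.dataframe, independently of whatever parallelism the grid-search graph provides. It assumes X arrives as a pandas Series of HTML strings:

import dask.dataframe as dd
import lxml.html
from sklearn.base import BaseEstimator, TransformerMixin


class GetTextDask(BaseEstimator, TransformerMixin):
    """Like GetText, but fans the row-wise apply out over dask partitions."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        ddf = dd.from_pandas(X, npartitions=8)  # assumption: X is a pandas Series
        text = ddf.map_partitions(
            lambda s: s.apply(
                lambda h: lxml.html.document_fromstring(h).text_content()),
            meta=('text', object))
        return text.compute()  # hand a plain pandas Series to the next step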

TypeError: can't pickle NotImplementedType objects on sklearn.metrics.make_scorer and FeatureUnion

I'm implementing a grid search on a pipeline where the classifiers are swapped out. Using sklearn's grid search, it worked OK:

gridSearch_features = GridSearchCV(estimator=pipeline,
                                   param_grid=feature_parameters,
                                   scoring='neg_log_loss',
                                   n_jobs=-1, refit=True)

from sklearn.metrics import make_scorer, brier_score_loss
brier = make_scorer(brier_score_loss, greater_is_better=False)

gridSearch_features = GridSearchCV(estimator=pipeline,
                                   param_grid=feature_parameters,
                                   scoring=brier,
                                   n_jobs=-1, refit=True)

I tried to drop in dask-searchcv's GridSearchCV and got an error that I believe is connected to the make_scorer object:

  File "hal_crossval_search_features_multi.py", line 397, in <module>
    gridSearch_features_model = gridSearch_features.fit(train_x,train_y)
  File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask_searchcv/model_selection.py", line 801, in fit
    out = scheduler(dsk, keys, num_workers=n_jobs)
  File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask_searchcv/methods.py", line 216, in fit_transform
    est = set_params(est, fields, params)
  File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask_searchcv/methods.py", line 182, in set_params
    est = copy_estimator(est)
  File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask_searchcv/utils.py", line 65, in copy_estimator
    return copy.deepcopy(est)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 230, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
    y.append(deepcopy(a, memo))
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 230, in _deepcopy_list
    y.append(deepcopy(a, memo))
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
    y.append(deepcopy(a, memo))
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
    y = _reconstruct(x, rv, 1, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
    state = deepcopy(state, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
    y = copier(x, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 182, in deepcopy
    rv = reductor(2)
TypeError: can't pickle NotImplementedType objects

What will it take to get this production ready?

This is exactly what I've been looking for (OK, "looking for" is too strong; half-assedly hacking together pieces of this myself is probably a more accurate description).

I'm running into a bunch of the issues you call out with sklearn's native pipelines.

I'd love to integrate this into auto_ml, which we use in production at DoorDash. So I'm wondering what it would take for you to happily certify this production ready?

Dask Learn Fails for Random Forest Grid Search

Hi,

I'm comparing grid search between sklearn and dask-learn, per the example documented at https://github.com/pinjutien/grid-search:

  • dask-learn is invoked by: python dklearn_grid_search_script.py
  • scikit learn is invoked by: python sklearn_grid_search_script.py

Dask Learn fails with the following error:

Traceback (most recent call last):
  File "dklearn_grid_search_script.py", line 41, in <module>
    grid_search.fit(X, y)
  File "/home/pinju/.conda/envs/dask_env/lib/python3.5/site-packages/dask_learn-0+untagged.82.gb5f2bab-py3.5.egg/dklearn/model_se$
  File "/home/pinju/.conda/envs/dask_env/lib/python3.5/site-packages/dask/threaded.py", line 76, in get
    **kwargs)
  File "/home/pinju/.conda/envs/dask_env/lib/python3.5/site-packages/dask/async.py", line 493, in get_async
    raise(remote_exception(res, tb))
dask.async.IndexError: pop from empty list

Thank you!
Pin-Ju

Replace `Matrix` with `da.Array` with unknown shapes

The Matrix class was created to support array-like things with unknown shapes, as well as sparse matrices. Since dask.array supports these now, we may be able to remove this class.

One issue is that the Matrix class does support sparse matrices, while dask.array doesn't. In most places we know whether the output of an operation is sparse or dense. The best solution might be to use Matrix only for sparse matrices and da.Array elsewhere.
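A small sketch of the dense side of that split, showing that dask.array now tolerates unknown chunk sizes (the main thing Matrix was created for):

import pandas as pd
import dask.dataframe as dd

df = dd.from_pandas(pd.DataFrame({'x': range(10)}), npartitions=3)
arr = df.values      # a dask array whose chunks along axis 0 are unknown
print(arr.chunks)    # something like ((nan, nan, nan), (1,)) -- sizes only known lazily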

Asynchronous algorithms and "Good enough" RandomSearchCV

It would be interesting to start exploring asynchronous algorithms within this project using the dask.distributed API. Because this API is somewhat different it might be wise to start with something simple.

One simple application would be to build a variant of RandomSearchCV that, instead of taking a number of candidates to try, instead took a stopping criterion like "have tried 100 options and not improved more than 1%" and then continued submitting computations while this has not been met.

My initial approach would be to periodically check the number of cores I have

client = distributed.client.default_client()
ncores = sum(client.ncores().values())

and try to keep roughly twice that many candidates in flight

candidate_pool = create_infinite_candidates(parameterspace)
futures = client.map(try_and_score, list(toolz.take(ncores * 2, candidate_pool)))

Then I would consume those futures as they finished

af = distributed.as_completed(futures)
for future in af:
    score, params = future.result()
    if score > best:
        best = score
        best_params = params
        ...

and then submit new futures as necessary

    future = client.submit(try_and_score, next(candidate_pool))
    af.add(future)

If we wanted to be cool, we could also check the number of cores periodically and submit more or less accordingly.
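A sketch of that periodic resizing, continuing the hypothetical try_and_score / candidate_pool names from above (it assumes as_completed exposes a count() of still-pending futures, which current distributed releases do):

def top_up(client, af, candidate_pool, factor=2):
    # Re-read the cluster size and refill until roughly
    # factor * ncores candidates are in flight again.
    ncores = sum(client.ncores().values())
    while af.count() < factor * ncores:
        af.add(client.submit(try_and_score, next(candidate_pool)))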

cc @jcrist @amueller

Reassure users of solid parts of this library

It seems that the pipeline and gridsearch components of this library are more stable than the "do not use" disclaimer would lead users to believe. We should find some graceful way of reassuring users that they can use these components with more confidence than other parts of the library.

Some options:

  1. Remove the "do not use" label entirely.
  2. Create a dask.learn submodule in the core dask project and import only the solid parts. This is what we do for dask.distributed.
  3. ?

cc @jcrist

The horrible journey to building Elemental with conda on OS X

This should be significantly easier on Linux, but maybe not.

Problem #1: cmake doesn't like conda's MPI very much, so we'll have to hardcode some MPI variables into the build system. Here's what it looks like on my box:

cmake -D CMAKE_INSTALL_PREFIX="~/anaconda3/envs/elemental" \
            -D MPI_C_INCLUDE_PATH:STRING=/Users/qmj240/anaconda3/envs/elemental/include \
            -D MPI_C_LINK_FLAGS:STRING=-L/Users/qmj240/anaconda3/envs/elemental/lib/ \
            -D "MPI_C_LIBRARIES:STRING=-lmpi -lpmpi \
            -L/Users/qmj240/anaconda3/envs/elemental/lib/" ..

Rename this library?

This library originally came out of experiments I did last summer trying various ways to make dask and scikit-learn play well together. Some things were nice (and useful), others were less so.

Recently, in an effort to clean things up, I've removed everything except the GridSearchCV and RandomizedSearchCV functionality. These implementations have been improved, and are now (almost) 100% compatible with their scikit-learn counterparts. There are a few unsupported parameters (e.g. verbose), and the output doesn't include the timings, but other than that these should be full drop-ins.

I like this limited scope, and would be slightly against expanding the scope of this library to include other things. Other machine-learning functionality should live in other libraries IMO (e.g. dask-glm). That said, the name "dask-learn" implies a larger scope than I think we can/should provide here. I'd like to rename this library to reflect the limited scope (just hyper-parameter searching).

A few ideas:

  • dask-cv
  • dask-crossval
  • dask-gridsearch

Import names could be one word (e.g. daskcv) or use an underscore (e.g. dask_crossval).

Naming things is hard. Ping @mrocklin, @amueller for thoughts/other name ideas.

Support lazy evaluation

I was trying to see the graph produced by DaskGridSearchCV.fit. My first attempt was to add a compute=False keyword. Looking at the code it's not clear that this is possible. Is this difficult to support?

BaseSearchCV throws IndexError for particular sized optional arguments to BaseSearchCV.fit

For certain-sized optional arguments to estimator.fit, dask-searchcv throws an IndexError. It looks like it's trying to index the optional array with the number of examples.

Minimal working example:

from sklearn.base import BaseEstimator
import numpy as np
import dask_searchcv as dcv
from sklearn.datasets import make_classification
import pytest


class Dummy(BaseEstimator):
    def __init__(self, alpha=0):
        pass

    def fit(self, X, y, classes=None):
        return self

    def score(self, X, y):
        return 1


if __name__ == "__main__":
    X, y = make_classification(n_samples=25,
                               n_classes=2, random_state=0)

    clf = Dummy()
    grid = {'alpha': np.logspace(-3, 0)}
    classes = np.linspace(0, 1, num=24)

    gs = dcv.RandomizedSearchCV(clf, grid)

    with pytest.raises(IndexError):
        gs.fit(X, y, classes=classes)

    gs.fit(X, y)

Exceptions are raised only when len(classes) >= 25. As expected, these cases pass with stock scikit-learn, and the same exceptions are also thrown with dask_searchcv.GridSearchCV.

I ran into this issue while integrating #72.

Traceback when the pytest check is removed:

Traceback (most recent call last):
  File "test2.py", line 35, in <module>
    gs.fit(X, y, classes=classes)
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask_searchcv/model_selection.py", line 867, in fit
    out = scheduler(dsk, keys, num_workers=n_jobs)
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/threaded.py", line 75, in get
    pack_exception=pack_exception, **kwargs)
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/local.py", line 521, in get_async
    raise_exception(exc, tb)
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/compatibility.py", line 67, in reraise
    raise exc
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/local.py", line 290, in execute_task
    result = _execute_task(task, data)
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/local.py", line 270, in _execute_task
    args2 = [_execute_task(a, cache) for a in args]
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/local.py", line 270, in <listcomp>
    args2 = [_execute_task(a, cache) for a in args]
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/local.py", line 271, in _execute_task
    return func(*args2)
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask_searchcv/methods.py", line 141, in cv_extract_params
    return {k: cvs.extract_param(tok, v, n) for (k, tok), v in zip(keys, vals)}
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask_searchcv/methods.py", line 141, in <dictcomp>
    return {k: cvs.extract_param(tok, v, n) for (k, tok), v in zip(keys, vals)}
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask_searchcv/methods.py", line 93, in extract_param
    out = safe_indexing(x, self.splits[n][0]) if _is_arraylike(x) else x
  File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/sklearn/utils/__init__.py", line 160, in safe_indexing
    return X.take(indices, axis=0)
IndexError: index 9 is out of bounds for size 2

Memory error when the number of workers increases in dask grid search

Hello,

I encounter a memory error when I increase the number of workers. This example uses the code here

Invoked by : python dklearn_grid_search_script.py

  • env: python3
  • cv = 3
  • training set: ~27 MB
  • grid search points: ~5 × 20 × 4 × 5 ≈ 2000 points

    'DecisionTreeClassifier': [
        DecisionTreeClassifier(),
        {
            'max_depth': np.arange(1, 5),
            'random_state': np.arange(1, 20),
            'max_features': ['auto', 'sqrt', 'log2', None],
            'min_samples_leaf': np.arange(1, 5)
        }
    ],

When the number of workers is around 7 or 8, the memory error shows up:

[screenshots: two memory-error tracebacks]

To understand this in more detail, I would like to know the memory load for each worker.
With the number of workers fixed at 1, I tried different values of cv:

number of workers: 1
number of grid points: 2000

cv | running time | size of graph
 2 | 237 s        | 7322
 3 | 359 s        | 10980 (≈ 7322 × 1.5)
 4 | 478 s        | 14638 (≈ 7322 × 2)

So far so good: as we increase cv, running time scales linearly. Each dask-worker creates a process that does the work. I took a snapshot of top to capture this worker's memory usage at cv = 2 and cv = 4.

When cv = 2, it was ~3.178 GB at peak:
[screenshot: top output at cv = 2]

When cv = 4, it was ~5.2 GB at peak:
[screenshot: top output at cv = 4]

So, a few questions:

  1. When cv = 2, the single worker's memory grows to ~3.178 GB. What is the cause of this?

    • It cannot be the data, since there is only one copy of the training set, i.e. ~27 MB.
    • Is it the graph that the dask-scheduler has distributed to it?
      • Does the dask-scheduler give the whole graph to the worker, or just part of it? With multiple workers, does each worker still get the complete graph?
      • What factors decide the size of the graph? Only the number of grid-search points, or grid-search points + training-set size + cv?
  2. Why does every worker need a complete copy of the graph, scaled by cv? Is it because every worker behaves like a process, and processes have a hard time sharing data with each other?

For example, in this comment @jcrist pointed out that the size of the graph is linear in cv. Does that mean that if cv = 2, a dask-worker holds two copies of the graph? In my understanding, cv just decides how many sub-training sets I want to divide the data into, and a dask-worker applies a graph to every sub-training set.
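As a sanity check on that linear-in-cv observation, the three reported graph sizes fit an affine model exactly (pure arithmetic; no dask required):

# size(cv) = a * cv + b, fit from the cv=2 and cv=4 measurements:
a = (14638 - 7322) / 2   # 3658.0 extra tasks per additional fold
b = 7322 - 2 * a         # 6.0 cv-independent tasks
print(a * 3 + b)         # 10980.0 -- matches the measured cv=3 graph exactly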

Please let me know if my understanding is correct. I appreciate any thoughts on this.

Thank you!
Pin-Ju
