Dask is a flexible parallel computing library for analytics. See documentation for more information.
New BSD. See License File.
dask-searchcv is now part of dask-ml: https://github.com/dask/dask-ml
License: BSD 3-Clause "New" or "Revised" License
I can't reliably get my model pipeline, which runs under scikit-learn, to work using dask-searchcv. Is there an expectation of how much memory overhead dask will require compared to scikit-learn? My test dataset is fairly small (< 10MB TSV) and my VM has something on the order of 8GB available to it.
I'm attempting to use dask-searchcv's GridSearchCV to build a model pipeline without redundantly executing a few expensive pipeline components. scikit-learn 0.19 has made strides in this area, but I'm looking to the future where I'll want to further parallelize my workload across more machines and/or handle larger data sets and more complicated pipelines.
My dataset is fairly small (< 10MB TSV), so I'm trying to distribute the execution and avoid redundant computation, but don't necessarily need a distributed access to a dataframe at this time.
I modified my code to use dask-searchcv's GridSearchCV in "local distributed" mode as follows:
# from sklearn.model_selection import GridSearchCV
from dask_searchcv import GridSearchCV
...
pipe = sklearn.pipeline.Pipeline(
    ....
)
# Use dask in distributed mode
from distributed import Client
c = Client()
best_model = GridSearchCV(
    pipe,
    param_grid=param_grid,
    cv=10,
    scoring=my_error_metrics_function,
    n_jobs=cpu_count
)
with ProgressBar():
    best_model.fit(train_x, train_y)
I skimmed the documentation for distributed and it seems to support the idea that no additional setup is required. Since it didn't immediately die on me, I'm assuming my config/setup is correct.
gridsearchcv implementation | result | notes |
---|---|---|
scikit-learn 0.19 | reliably succeeds | Used in default mode without saving state |
dask-searchcv | occasionally fails | From reading the code, this appears to use threaded_get. I've noticed that failures are more likely to appear if more than one core is available. |
dask-searchcv in "local distributed" mode | Ran for "a while" and then died with distributed.scheduler.KilledWorker. | Appeared to get smashed by the oomkiller. |
I believe this is related to scikit-learn/scikit-learn#2755
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/ensemble/forest.py", line 315, in fit
random_state=random_state)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/ensemble/base.py", line 127, in _make_estimator
for p in self.estimator_params))
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/base.py", line 264, in set_params
valid_params = self.get_params(deep=True)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/sklearn/base.py", line 240, in get_params
warnings.filters.pop(0)
IndexError: pop from empty list
"local distributed" mode
I'm pretty sure this was the victim of the oomkiller.
File "MY_MODULE_build_models.py", line 426, in <module>
best_model.fit(train_x, train_y)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/dask_searchcv/model_selection.py", line 793, in fit
out = scheduler(dsk, keys, num_workers=n_jobs)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/client.py", line 1923, in get
results = self.gather(packed, asynchronous=asynchronous)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/client.py", line 1368, in gather
asynchronous=asynchronous)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/client.py", line 540, in sync
return sync(self.loop, func, *args, **kwargs)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/utils.py", line 239, in sync
six.reraise(*error[0])
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/six.py", line 686, in reraise
raise value
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/utils.py", line 227, in f
result[0] = yield make_coro()
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
value = future.result()
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
raise_exc_info(self._exc_info)
File "<string>", line 4, in raise_exc_info
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
yielded = self.gen.throw(*exc_info)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/distributed/client.py", line 1246, in _gather
traceback)
File "MY_VIRTUALENV_PATH/lib/python3.6/site-packages/six.py", line 686, in reraise
raise value
distributed.scheduler.KilledWorker: ("('mypipelinestep-fit-dfe1861ca5d06ad69d926709c897ffc9', 34, 0)", 'tcp://127.0.0.1:46091')
$ conda list | egrep '^(numpy|scipy|dask|scikit|distributed|python |tornado)'
dask 0.15.2 py36_0
dask-searchcv 0.0.2+8.g2c2056d <pip>
distributed 1.18.1 py36_0
numpy 1.13.1 py36_0
python 3.6.1 2
scikit-learn 0.19.0 np113py36_0
scipy 0.19.1 np113py36_0
tornado 4.5.2 py36_0
The Pipeline class seems to have disappeared. Are we now just introspecting sklearn pipelines to avoid recomputations of earlier stages?
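To make the question concrete, here is a rough sketch of where the savings would come from if earlier stages are shared between candidates. This is not dask-searchcv's code: `count_fits`, the step names, and the example grid are made up for illustration. It only counts fits per CV fold, under the assumption that a step's fit can be reused whenever that step and every step before it have identical parameters.

```python
from itertools import product


def count_fits(grid, steps):
    """Count step fits per CV fold, naively and with shared pipeline prefixes.

    grid  -- maps 'step__param' names to lists of candidate values
    steps -- ordered list of pipeline step names (hypothetical names)
    """
    candidates = [dict(zip(grid, values)) for values in product(*grid.values())]

    # Naive search: every candidate refits every step.
    naive = len(candidates) * len(steps)

    # Shared search: a step is refit only when its parameters, or the
    # parameters of any earlier step, differ from what was already fit.
    seen = set()
    for candidate in candidates:
        prefix = ()
        for step in steps:
            step_params = tuple(sorted(
                (name, value) for name, value in candidate.items()
                if name.startswith(step + "__")
            ))
            prefix = prefix + ((step, step_params),)
            seen.add(prefix)
    return naive, len(seen)


grid = {"pca__n_components": [2, 4], "svc__C": [0.1, 1, 10]}
naive, shared = count_fits(grid, ["scaler", "pca", "svc"])
print(naive, shared)
```

In this toy grid the scaler is fit once, the PCA twice, and only the final SVC once per candidate, so the count drops from 18 to 9 fits per fold.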
The Keras neural-network package has a sklearn wrapper that works with the sklearn RandomizedSearchCV and GridSearchCV classes. However, it fails with the dask-searchcv equivalents. Thus, there seem to be additional requirements beyond the sklearn estimator interface that must be met in order for dask-searchcv to work. Would it be possible to list these, such that other projects could be adapted to be used with dask-searchcv?
Thoughts on one name over the other? I have a slight preference for GridSearchCV in order to make switching out imports easier (though I can also accomplish this with import ... as ...)
Hi,
We have another issue with DaskGridSearchCV. The example uses XGBRegressor (https://github.com/pinjutien/grid-search/tree/master/grid_search_issue2).
In this case, when DaskGridSearchCV is on, nothing shows up in the dashboard (for example, http://173.208.222.74:8866/status/). Neither Client nor Executor shows anything in the dashboard.
Thanks,
Pin-Ju
I'm trying to run dask-searchcv 0.2.0 with Dask 0.18.1, and I'm finding that the 'distributed' scheduling back-end no longer picks up the tasks. Instead, they run in the submitting Python kernel rather than on the registered worker. The web UI confirms as much: the tasks fail to show there, and I see console output in the submitting kernel instead of in the dask-worker.
By downgrading to Dask 0.17.5, this issue is resolved.
Currently dask-ml.model_selection just imports dask-searchcv. For ease of maintenance, I'd like to move dask-searchcv into dask-ml. Does anyone have objections?
cc @jcrist
What's the difference between dklearn.Pipeline and sklearn.Pipeline? Does dklearn cache the intermediate results?
Reason I ask -- I'm trying to fit a large number of sklearn Pipelines in parallel, comparing building the computational graph w/ dask with a handrolled (way less sophisticated) joblib implementation. dask is ~50% slower, and I wonder whether that's a result of the fact that joblib memmaps large objects to reduce communication overhead. Any thoughts? Happy to share code if it's useful.
Hello,
I am having trouble using this nice tool, dask-searchcv, on a simple Pipeline.
I tried it on a simple sklearn Pipeline (StandardScaler + SVC_rbf):
Pipeline(memory=None,
steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svc', SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
max_iter=-1, probability=False, random_state=42, shrinking=True,
tol=0.001, verbose=False))])
with n_jobs=-1 and scheduler="threading" or scheduler="multiprocessing", and a search grid over the C and gamma parameters. Every time, only one process was used (out of my 16 available).
Whereas when I used dask-searchcv on a composed Pipeline that also includes PCA, I got, as expected, either one process at 1600% CPU or 16 processes launched.
I don't understand why dask-searchcv multi-threading or multi-processing doesn't work in the first case...
Any explanation?
Matt
It looks like this will require passing a keyword argument to method.py#fit.
I think there also needs to be some specification of the random state to make sure the train/test split is preserved on repeated calls.
I encountered this bug in a dask-searchcv LooseVersion Python 2.7 usage with elm in CI. It appears to be the same idea as this LooseVersion issue: python.org issue 14894.
To avoid attribute error, fix this line for Python 2.7:
if LooseVersion(dask.__version__) > '0.15.4'
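One way to sidestep LooseVersion quirks altogether would be to compare plain integer tuples. The sketch below is only an illustration, not the fix that went into dask-searchcv: `version_tuple` and `newer_than` are hypothetical helpers, and they assume that only the leading numeric components matter (anything from the first non-numeric component onwards, such as a dev or git suffix, is ignored).

```python
import re


def version_tuple(version):
    """Return the leading numeric components of a dotted version as ints."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first component with no leading digits
        parts.append(int(match.group()))
    return tuple(parts)


def newer_than(installed, reference):
    """True if `installed` is strictly newer than `reference`."""
    return version_tuple(installed) > version_tuple(reference)


print(newer_than("0.16.0", "0.15.4"))
print(newer_than("0.15.4", "0.15.4"))
```

Tuples of ints compare the same way on Python 2.7 and Python 3, which is the property the version check needs here.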
Hey.
From what I can see this now works with the stock pipeline. Is that correct? That's amazing!
Can you maybe add an example to make that clear? All the blog-posts etc have a dask-sklearn pipeline.
Keep up the great work!
Should we consider incorporating Bayesian optimization for hyperparameter search?
It could be distributed across different machines, just as with grid search, but would converge much faster.
Here is the NIPS 2012 paper (http://papers.nips.cc/paper/4522-practical-bayesian-optimization-of-machine-learning-algorithms.pdf)
in, e.g. pytest dask_searchcv/tests/test_model_selection.py::test_pipeline_feature_union
Lots of warnings like
/Users/taugspurger/Envs/dask-dev/lib/python3.6/site-packages/scikit-learn/sklearn/base.py:114: DeprecationWarning: Estimator Pipeline modifies parameters in __init__. This behavior is deprecated as of 0.18 and support for this behavior will be removed in 0.20.
% type(estimator).__name__, DeprecationWarning)
I'll take a look this afternoon.
Does dask-searchcv work with scikit-learn 0.19+? Getting the following error
File "/home/vagrant/.pyenv/versions/miniconda3-4.1.11/envs/gambit/lib/python3.6/site-packages/dask_searchcv/methods.py", line 16, in <module>
from sklearn.utils.fixes import rankdata
ImportError: cannot import name 'rankdata'
I've created some custom Sklearn transforms that I'm putting into a pipeline. These custom transforms take a pandas object and apply some function over it like this example to extract text from html strings to pass into a CountVectorizer.
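import lxml.html
from sklearn.base import BaseEstimator, TransformerMixin

class GetText(BaseEstimator, TransformerMixin):
    # Strip the markup from a single HTML string
    def get_text(self, html_string):
        return lxml.html.document_fromstring(html_string).text_content()
    def fit(self, x, y=None):
        return self
    # X is a pandas Series of HTML strings
    def transform(self, X):
        return X.apply(self.get_text)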
I realize that dask.apply can be generically swapped in for pandas apply, especially if I convert the data to dask dataframes. But since these are in sklearn pipeline, I wasn't sure how much parallelism I'd get.
I'm implementing a gridsearch on pipeline where the classifiers are swapped out. Using sklearn gridsearch, it worked ok.
gridSearch_features = GridSearchCV(estimator=pipeline,
                                   param_grid=feature_parameters,
                                   scoring='neg_log_loss',
                                   n_jobs=-1, refit=True)
from sklearn.metrics import make_scorer, brier_score_loss
brier = make_scorer(brier_score_loss, greater_is_better=False)
gridSearch_features = GridSearchCV(estimator=pipeline,
                                   param_grid=feature_parameters,
                                   scoring=brier,
                                   n_jobs=-1, refit=True)
I tried to drop in dask.GridSearchCV and got an error that I believe is connected to make_scorer object:
File "hal_crossval_search_features_multi.py", line 397, in <module>
gridSearch_features_model = gridSearch_features.fit(train_x,train_y)
File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask_searchcv/model_selection.py", line 801, in fit
out = scheduler(dsk, keys, num_workers=n_jobs)
File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask/threaded.py", line 75, in get
pack_exception=pack_exception, **kwargs)
File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask/local.py", line 521, in get_async
raise_exception(exc, tb)
File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask/local.py", line 290, in execute_task
result = _execute_task(task, data)
File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask/local.py", line 271, in _execute_task
return func(*args2)
File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask_searchcv/methods.py", line 216, in fit_transform
est = set_params(est, fields, params)
File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask_searchcv/methods.py", line 182, in set_params
est = copy_estimator(est)
File "/mnt/var/lib/anaconda2/lib/python2.7/site-packages/dask_searchcv/utils.py", line 65, in copy_estimator
return copy.deepcopy(est)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
state = deepcopy(state, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 230, in _deepcopy_list
y.append(deepcopy(a, memo))
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
y.append(deepcopy(a, memo))
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
state = deepcopy(state, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 230, in _deepcopy_list
y.append(deepcopy(a, memo))
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 237, in _deepcopy_tuple
y.append(deepcopy(a, memo))
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
state = deepcopy(state, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
state = deepcopy(state, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
state = deepcopy(state, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
state = deepcopy(state, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 190, in deepcopy
y = _reconstruct(x, rv, 1, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 334, in _reconstruct
state = deepcopy(state, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 163, in deepcopy
y = copier(x, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 257, in _deepcopy_dict
y[deepcopy(key, memo)] = deepcopy(value, memo)
File "/mnt/var/lib/anaconda2/lib/python2.7/copy.py", line 182, in deepcopy
rv = reductor(2)
TypeError: can't pickle NotImplementedType objects
First, thanks for putting this library up. I find it very useful.
Could you make this library pip installable so I can add it as a requirement to one of my packages? (https://github.com/alvarouc/polyssifier)
This is exactly what I've been looking for (ok, looking for is too strong- half-assedly-hacking-together-pieces-of-this-myself is probably a more accurate description).
I'm running into a bunch of the issues you call out with sklearn's native pipelines.
I'd love to integrate this into auto_ml, which we use in production at DoorDash. So I'm wondering what it would take for you to happily certify this production ready?
Hi,
I compared grid search between sklearn and dask-learn.
With the example documented at https://github.com/pinjutien/grid-search,
dask-learn fails with the following error:
Traceback (most recent call last):
File "dklearn_grid_search_script.py", line 41, in
grid_search.fit(X, y)
File "/home/pinju/.conda/envs/dask_env/lib/python3.5/site-packages/dask_learn-0+untagged.82.gb5f2bab-py3.5.egg/dklearn/model_se$
File "/home/pinju/.conda/envs/dask_env/lib/python3.5/site-packages/dask/threaded.py", line 76, in get
**kwargs)
File "/home/pinju/.conda/envs/dask_env/lib/python3.5/site-packages/dask/async.py", line 493, in get_async
raise(remote_exception(res, tb))
dask.async.IndexError: pop from empty list
Thank you!
Pin-Ju
The Matrix class was created to support array-like things with unknown shapes, as well as sparse matrices. Since dask.array supports these now, we may be able to remove this class.
One issue is that the Matrix class does support sparse matrices, while dask array doesn't. In most places we know whether the output of an operation is sparse or dense. The best solution might be to only use Matrix for sparse matrices, and use da.Array elsewhere.
It would be interesting to start exploring asynchronous algorithms within this project using the dask.distributed API. Because this API is somewhat different it might be wise to start with something simple.
One simple application would be to build a variant of RandomizedSearchCV that, instead of taking a number of candidates to try, took a stopping criterion like "have tried 100 options and not improved more than 1%" and then continued submitting computations while this has not been met.
My initial approach to do this would be to periodically check the number of cores I had
client = distributed.client.default_client()
ncores = sum(client.ncores().values())
and try to keep roughly twice that many candidates in flight
candidate_pool = create_infinite_candidates(parameterspace)
futures = client.map(try_and_score, list(toolz.take(ncores * 2, candidate_pool)))
Then I would consume those futures as they finished
af = distributed.client.as_finished(futures)
for future in af:
    score, params = future.result()
    if score > best:
        best = score
        best_params = params
    ...
and then submit new futures as necessary
future = client.submit(try_and_score, next(candidate_pool))
af.add(future)
If we wanted to be cool, we could also check the number of cores periodically and submit more or less accordingly.
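The pieces above can be put together in a runnable form without a cluster. The sketch below swaps in the standard library's `concurrent.futures` for the dask.distributed client, so it only illustrates the control flow, not the distributed API. Everything in it is an assumption made for the example: the toy `try_and_score` objective, the `infinite_candidates` generator, and the stopping rule, which counts consecutive results that fail to beat the best score by an absolute `min_gain` (rather than a relative 1%).

```python
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def try_and_score(params):
    """Toy objective (hypothetical): the score peaks at alpha == 0.5."""
    return -(params["alpha"] - 0.5) ** 2, params


def infinite_candidates(seed=0):
    """Endless stream of randomly sampled parameter settings."""
    rng = random.Random(seed)
    while True:
        yield {"alpha": rng.uniform(0.0, 1.0)}


def async_random_search(n_workers=4, patience=100, min_gain=0.01):
    """Search until `patience` results in a row fail to improve by `min_gain`."""
    candidates = infinite_candidates()
    best, best_params, stale, tried = float("-inf"), None, 0, 0
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Keep roughly twice as many candidates in flight as there are workers.
        in_flight = {
            pool.submit(try_and_score, next(candidates))
            for _ in range(2 * n_workers)
        }
        while stale < patience:
            done, in_flight = wait(in_flight, return_when=FIRST_COMPLETED)
            for future in done:
                score, params = future.result()
                tried += 1
                if score > best + min_gain:
                    best, best_params, stale = score, params, 0
                else:
                    stale += 1
                # Replace each finished candidate with a fresh one.
                in_flight.add(pool.submit(try_and_score, next(candidates)))
        # Stopping criterion met: drop whatever has not started yet.
        for future in in_flight:
            future.cancel()
    return best, best_params, tried


best, best_params, tried = async_random_search()
print(best, best_params, tried)
```

Because results are consumed in completion order, the exact best parameters can vary from run to run; only the stopping behaviour is fixed.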
It seems that the pipeline and gridsearch components of this library are more stable than the "do not use" disclaimer would lead users to believe. We should find some graceful way of reassuring users that they can use these components with more confidence than other parts of the library.
Some options:
- A dask.learn submodule in the core dask project that only imports the solid parts. This is what we do for dask.distributed
cc @jcrist
should be significantly easier for Linux, but maybe not.
Problem #1, cmake doesn't like conda's MPI very much, so we'll have to hardcode some MPI variables into the build system. Here's what it looks like on my box:
cmake -D CMAKE_INSTALL_PREFIX="~/anaconda3/envs/elemental" \
-D MPI_C_INCLUDE_PATH:STRING=/Users/qmj240/anaconda3/envs/elemental/include \
-D MPI_C_LINK_FLAGS:STRING=-L/Users/qmj240/anaconda3/envs/elemental/lib/ \
-D "MPI_C_LIBRARIES:STRING=-lmpi -lpmpi \
-L/Users/qmj240/anaconda3/envs/elemental/lib/" ..
I like the verbose parameter to get a feeling for how long the grid search will take.
If you point me in the right direction, I can also try to submit a PR. However, I do not have any experience with dask so far.
This library originally came out of experiments I did last summer trying various ways to make dask and scikit-learn play well together. Some things were nice (and useful), others were less so.
Recently, in an effort to clean things up, I've removed everything except the GridSearchCV and RandomizedSearchCV functionality. These implementations have been improved, and are now (almost) 100% compatible with their scikit-learn counterparts. There are a few unsupported parameters (e.g. verbose), and the output doesn't include the timings, but other than that these should be full drop-ins.
I like this limited scope, and would be slightly against expanding the scope of this library to include other things. Other machine-learning functionality should live in other libraries IMO (e.g. dask-glm). That said, the name "dask-learn" implies a larger scope than I think we can/should provide here. I'd like to rename this library to reflect the limited scope (just hyper-parameter searching).
A few ideas:
dask-cv
dask-crossval
dask-gridsearch
Import names could be one word (e.g. daskcv) or use an underscore (e.g. dask_crossval).
Naming things is hard. Ping @mrocklin, @amueller for thoughts/other name ideas.
I was trying to see the graph produced by DaskGridSearchCV.fit. My first attempt was to add a compute=False keyword. Looking at the code it's not clear that this is possible. Is this difficult to support?
For certain sized optional arguments to estimator.fit, dask-searchcv throws an IndexError. It looks like it's trying to index the optional array with the number of examples.
Minimal working example:
from sklearn.base import BaseEstimator
import numpy as np
import dask_searchcv as dcv
from sklearn.datasets import make_classification
import pytest

class Dummy(BaseEstimator):
    def __init__(self, alpha=0):
        pass
    def fit(self, X, y, classes=None):
        return self
    def score(self, X, y):
        return 1

if __name__ == "__main__":
    X, y = make_classification(n_samples=25,
                               n_classes=2, random_state=0)
    clf = Dummy()
    grid = {'alpha': np.logspace(-3, 0)}
    classes = np.linspace(0, 1, num=24)
    gs = dcv.RandomizedSearchCV(clf, grid)
    with pytest.raises(IndexError):
        gs.fit(X, y, classes=classes)
    gs.fit(X, y)
Exceptions are raised only when len(classes) >= 25. As expected, these pass with scikit-learn, and exceptions are also thrown with dask_searchcv.GridSearchCV.
I ran into this issue while integrating #72.
Traceback when the pytest check is removed:
Traceback (most recent call last):
File "test2.py", line 35, in <module>
gs.fit(X, y, classes=classes)
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask_searchcv/model_selection.py", line 867, in fit
out = scheduler(dsk, keys, num_workers=n_jobs)
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/threaded.py", line 75, in get
pack_exception=pack_exception, **kwargs)
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/local.py", line 521, in get_async
raise_exception(exc, tb)
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/compatibility.py", line 67, in reraise
raise exc
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/local.py", line 290, in execute_task
result = _execute_task(task, data)
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/local.py", line 270, in _execute_task
args2 = [_execute_task(a, cache) for a in args]
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/local.py", line 270, in <listcomp>
args2 = [_execute_task(a, cache) for a in args]
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask/local.py", line 271, in _execute_task
return func(*args2)
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask_searchcv/methods.py", line 141, in cv_extract_params
return {k: cvs.extract_param(tok, v, n) for (k, tok), v in zip(keys, vals)}
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask_searchcv/methods.py", line 141, in <dictcomp>
return {k: cvs.extract_param(tok, v, n) for (k, tok), v in zip(keys, vals)}
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/dask_searchcv/methods.py", line 93, in extract_param
out = safe_indexing(x, self.splits[n][0]) if _is_arraylike(x) else x
File "/Users/ssievert/anaconda3/lib/python3.6/site-packages/sklearn/utils/__init__.py", line 160, in safe_indexing
return X.take(indices, axis=0)
IndexError: index 9 is out of bounds for size 2
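The last frames of the traceback suggest that any array-like fit parameter gets indexed with the training indices of the split, whether or not it has one entry per sample. Below is a simplified stand-in for that behaviour using plain lists. It is not the dask-searchcv implementation: `extract_param_naive` and `extract_param_guarded` are hypothetical names, and the length check in the second function is just one possible guard, assumed here for illustration.

```python
def extract_param_naive(value, train_indices):
    """Index every list-like fit parameter with the training indices."""
    if isinstance(value, (list, tuple)):
        return [value[i] for i in train_indices]
    return value


def extract_param_guarded(value, train_indices, n_samples):
    """Only slice parameters that have exactly one entry per sample."""
    if isinstance(value, (list, tuple)) and len(value) == n_samples:
        return [value[i] for i in train_indices]
    return value


n_samples = 25
train_indices = [9, 10, 11]          # a few training indices from one split
classes = [0, 1]                     # per-problem parameter, not per-sample
sample_weight = [1.0] * n_samples    # genuinely per-sample parameter

try:
    extract_param_naive(classes, train_indices)
except IndexError as exc:
    print("naive indexing failed:", exc)

print(extract_param_guarded(classes, train_indices, n_samples))
print(extract_param_guarded(sample_weight, train_indices, n_samples))
```

With the guard, `classes` is passed through untouched while `sample_weight` is still split along with the data.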
Hello,
I encounter a memory error when I increase the number of workers. This example uses the code here
Invoked by : python dklearn_grid_search_script.py
When the number of workers is around 7 or 8, then the memory error shows up, such as
To understand this in more detail, I would like to know the memory load of each worker.
With the number of workers = 1, I tried different values of cv:
number of workers: 1
number of grid points: 2000
cv = 2
running time: 237s
size of graph: 7322
cv = 3
running time: 359s
size of graph: 10980 ~ 7322 * 1.5
cv = 4
running time: 478s
size of graph: 14638 ~ 7322 * 2
So far so good: as we increase cv, the running time scales linearly.
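As a quick check of the numbers above, the per-fold increments can be computed directly. This is only arithmetic on the reported measurements; the variable names and the reading of the result as "tasks per fold plus a small fixed part" are my own interpretation.

```python
cv = [2, 3, 4]
graph_size = [7322, 10980, 14638]   # reported size of graph
run_time = [237, 359, 478]          # reported running time in seconds


def increments(values):
    """Differences between consecutive measurements."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


size_steps = increments(graph_size)
time_steps = increments(run_time)

# If the graph grows linearly with cv, each extra fold adds the same number
# of tasks and whatever is left over does not depend on cv.
tasks_per_fold = size_steps[0]
fixed_tasks = graph_size[0] - cv[0] * tasks_per_fold

print(size_steps, time_steps)
print(tasks_per_fold, fixed_tasks)
```

Each extra fold adds the same 3658 tasks and roughly 120 seconds, which matches the linear scaling described above.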
So each dask-worker creates a process that does the work. I took a snapshot of top to capture the memory usage of this worker at cv = 2 and cv = 4.
When cv = 2, it was ~3.178 GB at peak.
When cv = 4, it was ~5.2 GB at peak.
So, a few questions.
When cv = 2, the single worker's memory grows to ~3.178 GB. What is the cause of this?
Why does every worker need a complete copy of the graph as many times as cv?
Is it because every worker behaves like a process, and it is hard for processes to share data with each other?
For example, in this comment, @jcrist pointed out that the size of the graph is linear in cv. Does that mean that if cv = 2, the dask-worker holds two copies of the graph? In my understanding, cv just decides how many sub-training sets the data is divided into, and the dask-worker applies a graph to every sub-training set?
Please let me know if my understanding is correct. I appreciate any thoughts on this.
Thank you!
Pin-Ju