Comments (5)
Thanks, I'll try to debug this later today or tomorrow. Can you say a bit more about
- The graph backing X and y. Do they involve disk IO? Do all the workers share a file system?
- The estimator you're fitting? This probably isn't the issue, but it may help with debugging.
from dask-ml.
1.) X and y are already in memory, so there is no additional disk IO after they've been read from std.io into a dataframe earlier in the process. The workers above are 10 containers spread across three physical hosts, so on average each worker shares a filesystem with 2 other workers.
2.) The estimator here was RandomForestClassifier.
Sorry for the delay on this! Let's try to narrow this down to see if it's just the scheduler that's not working properly. Could you set up a cluster / client and try out the following:
import dask.distributed
import pandas as pd
import numpy as np
import dask_ml.joblib  # imported for its side effect: registers the 'dask.distributed' joblib backend
from sklearn.externals import joblib
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats
from sklearn.base import BaseEstimator


class DummyEstimator(BaseEstimator):
    # Does no real work, so any time spent is scheduling / communication overhead.
    def __init__(self, parameter=None):
        self.parameter = parameter

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return np.zeros(len(X))

    def score(self, X, y=None):
        return 0


search = RandomizedSearchCV(DummyEstimator(), {"parameter": stats.uniform}, cv=3, n_iter=20, verbose=10)

%%time
# `client` is the dask.distributed Client you set up beforehand.
N = 100_000
X = pd.DataFrame(np.random.randn(N, 10))
y = pd.Series(np.random.uniform(size=N))
addr = client.scheduler_info()['address']

with joblib.parallel_backend('dask.distributed', addr,
                             scatter=[X.values, y.values]) as pb:
    search.fit(X, y)
For me, that finishes in ~4 seconds, just using a local cluster.
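If you don't have a client handy, a minimal sketch for creating a local one might look like the following. The worker counts and `processes=False` (which keeps the scheduler and workers in the current process) are assumptions made for illustration; they are not taken from the setup described in this issue.

```python
from dask.distributed import Client, LocalCluster

# Assumption: a small in-process cluster is enough for a quick check.
# processes=False runs the scheduler and workers as threads in this process.
cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
client = Client(cluster)

# The scheduler address is what gets passed to joblib.parallel_backend.
addr = client.scheduler_info()["address"]
print(addr)

# Shut everything down once the experiment is finished.
client.close()
cluster.close()
```

To test against the real deployment instead, connect with `Client("<scheduler address>")` and leave out the `LocalCluster` lines.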
FYI, if you're able to try out joblib master and dask.distributed master, things may have improved in the last couple weeks. Nothing specific to this issue, but we were making changes to that code and it might have fixed things magically :)
Hey Tom - Sorry I sorta dropped off the face of the planet. I appreciate your responses - I might not get around to re-checking this out for a bit - so I will close this issue for now :)