Dask is a flexible parallel computing library for analytics. See documentation for more information.
New BSD. See License File.
Scalable Machine Learning with Dask
Home Page: http://ml.dask.org
License: BSD 3-Clause "New" or "Revised" License
Dask is a flexible parallel computing library for analytics. See documentation for more information.
New BSD. See License File.
A few estimators subclass estimators from other ML packages (e.g. scikit-learn) and when a method is not overwritten, the docstring from the parent package is used directly in the documentation, which is not necessarily consistent. For instance,
I wonder if some automatic re-writing of the non-overwritten methods could partially address this...
I have no direct experience with it but this NIPS 2016 paper looks very interesting and has some theoretical guarantees on the approximation.
Fast and Provably Good Seedings for k-Means
http://papers.nips.cc/paper/6478-fast-and-provably-good-seedings-for-k-means.pdf
Here are some benchmarks:
Just pick a good default.
Would be nice to have scikit-learn's DictionaryLearning wrapped up for use with large Dask Arrays.
In some cases we import routines from other packages. In some cases we don't.
For example we import LogisticRegression from dask_glm but don't import GridSearchCV from dask_searchcv. Which approach should we take here? I'm slightly inclined to import from everywhere and consolidate here.
This would both be useful / not too difficult (hopefully). Can build off the SVD that's in dask.array.
I think it'd be nice to have some transformers that work on dask and numpy arrays, & dask and pandas DataFrames. This would be good since
See https://github.com/tomaugspurger/sktransformers for some inspiration (and maybe some tests)?
columns
argument. When specified, the transformation will be limited to just columns
(e.g. if doing a standard scaling, and columns=['A', 'B']
, only 'A'
and 'B'
are scaled). By default, all columns are scalednp.ndarray
, dask.array.Array
, pandas.core.NDFrame
, dask.dataframe._Frame
.The big question right now is should fitting be eager, and fitted values concrete? e.g.
scaler = StandardScaler()
scaler.fit(X) # X is a dask.array
So, has scaler.mean_
been computed, and is it a dask.array
or a numpy.array
? This is a big decision.
Candidates
Maybe we should consider using hypothesis for testing. I use it here at work to test ETL and data pipelines and it works like a charm.
Validate that the dtypes match, and that the categories match for Categorical columns.
Hi all,
Is there any plan to implement a parallel version of GMM (Gaussian Mixture Modelling) ?
Thanks
g
Would be nice to have some form of ICA available for large Dask Arrays. Ideally could just use an implementation or a few as appropriate from scikit-learn
.
A simple example:
from dask import dataframe as dd
from dask_glm.datasets import make_classification
from dask_ml.linear_model import LogisticRegression
X, y = make_classification(n_samples=10000, n_features=2)
X = dd.from_dask_array(X, columns=["a","b"])
y = dd.from_array(y)
lr = LogisticRegression()
lr.fit(X, y)
Returns KeyError: (<class 'dask.dataframe.core.DataFrame'>,)
I did not have time to try if it is also the case for other models.
Hey guys,
Just got back and I tried to run pytest
and got:
ImportError while importing test module '/home/severo/Projects/dask-ml/tests/test_kmeans.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/test_kmeans.py:7: in <module>
from daskml.cluster import KMeans as DKKMeans
daskml/cluster/__init__.py:4: in <module>
from .k_means import KMeans # noqa
daskml/cluster/k_means.py:12: in <module>
from ._k_means import _centers_dense
E ModuleNotFoundError: No module named 'daskml.cluster._k_means'
I understand that it is a Cythonized .pyx file, but shouldn't we be calling pyximport.install()
somewhere?
Once I added this to the top of daskml/cluster/k_means.py
everything went back to normal.
Debugging a memory usage issue I'm seeing with k-means initialization. The issue is in this loop. A small example is
import numpy as np
import dask.array as da
from dask import delayed
from distributed import Client, wait
from sklearn.datasets import make_classification
if __name__ == '__main__':
c = Client()
s = c.cluster.scheduler
N_SAMPLES = 1_000_000
N_BLOCKS = 24
def mem():
print("{:.2f}".format(sum(s.worker_bytes.values()) / 10**9), "GB")
def make_block(n_samples):
X, y = make_classification(n_samples=n_samples)
return X
blocks = [delayed(make_block)(N_SAMPLES) for i in range(N_BLOCKS)]
arrays = [da.from_delayed(block, dtype='f8', shape=(N_SAMPLES, 20))
for block in blocks]
stacked = da.vstack(arrays)
print(stacked.nbytes / 10**9, "GB")
X = c.persist(stacked)
wait(X)
for i in range(5):
idx = np.random.randint(0, len(X), size=5)
centers = X[idx].compute()
mem()
Which outputs
3.84 GB
4.32 GB
4.80 GB
5.12 GB
5.60 GB
6.08 GB
@mrocklin is that increasing memory usage expected? Inside that for loop, the result is always going to be a small NumPy array.
(side-note, that generates some exceptions I've left out of the output):
distributed.protocol.core - CRITICAL - Failed to deserialize
Traceback (most recent call last):
File "/Users/taugspurger/.virtualenvs/dask-dev/lib/python3.6/site-packages/distributed/distributed/protocol/core.py", line 122, in loads
value = _deserialize(head, fs)
File "/Users/taugspurger/.virtualenvs/dask-dev/lib/python3.6/site-packages/distributed/distributed/protocol/serialize.py", line 160, in deserialize
f = deserializers[header.get('type')]
KeyError: 'numpy.ndarray'
looking into those now.
https://twitter.com/shoyer/status/912387291749822465
The incremental learn's are susceptible to bias when the blocks aren't IID (the dataset is ordered by time, say, and X or y is correlated with time).
How best to handle this? A full shuffle is costly, and unnecessary. A shuffle of the blocks (plus maybe an in-memory shuffle of each block) should be fine. Actually... I should check now if the scheduler goes through the blocks in order.
See make_poisson and similar functions in dask-glm's datasets.py. Transfer that approach into dask-ml
's datasets.py
so larger synthetic data sets can be created. Currently the approach in dask-ml
's datasets.py
just wraps sklearn.datasets
.
Would be good to add Lars.
Before I forget, it'd be good to document all the required / optional dependencies. I'll do this later today.
Line 11 in cb3d0ea
The use of @wraps
here leads to wrong behavior of some introspection tools, like ??
in Jupyter (will show you the code for the sklearn version).
I feel that it's best to wrap thing manually (assigning the __name__
, __doc__
and __module__
attributes for inner
manually), after line 17, before the return
.
This issue is making some unpleasant workarounds necessary in the xarray_filters
codebase (same datasets.py
file) (where we are trying to give more helpful docstrings users beyond what's in sklearn). The issue for me revolves around knowing whether I'm going to wrap the make_blobs
from dask_ml
or from sklearn
. I could do it by inspecting the signature, or the __module__
attribute, but those attributes are set equal to the ones in sklearn right now.
I'll put links to relevants parts of the xarray_filters
codebase here as soon as I push the new code up.
Would be nice to have scikit-learn's FA wrapped up for use with large Dask Arrays.
Probably through a partial_fit API.
Misfire:
This line uses NumPy:
informative_idx = np.random.choice(n_features, n_informative)
However now I see that this is likely small. Closing.
Would be good to add Lasso.
https://readthedocs.org/projects/dask-ml/builds/6364862/
Fails on the sphinx-gallery examples with OOM errors. We'll probably need to just generate those ahead of time and include them statically :/
would be state of the art and very useful if Dask could natively handle distributed TF IDF matrices as input to a multinomial naive bayes model. I know this is a difficult problem to solve because for most implementations of computing TF IDF you need the entire Term Document matrix in memory so I'm not sure know how to solve this problem tbh.
Problem referenced here: https://stackoverflow.com/questions/25145552/tfidf-for-large-dataset
Would be good to add ElasticNet.
hi ya'll - Matthew suggested I ask some questions here.
I'm a little confused on some results I'm seeing - and am wondering if you guys can help me figure out why I am seemingly executing grid searches in serial.
I modified dask-docker and am using containers on a docker swarm with 64GB RAM and 48 cores each.
search = RandomizedSearchCV(self.estimator, self.estimator.param_space, cv=3, n_iter=20, verbose=10)
if self.estimator.use_dask:
address = 'tcp://dask-cluster:8786'
c = Client(address)
with parallel_backend('dask.distributed', scheduler_host=address,
scatter=[X.values, y.values]):
search.fit(X.values, y.values)
where X
is pd.DataFrame
and y is pd.Series
.
When I run the above with a small number of rows, I can see in the status page that lots of tasks get executed in parallel. The blocks of work on the task stream end up looking like a straight vertical since they all get dispatched nearly simultaneously.
Once I start increasing the number of rows, I seem to get more and more serial execution, where the status page shows that really one additional task gets added at a time.
In this screenshot, I started at the full dataset, and started subtracting 10k on each run to see the affect on the execution time / parallelism. For some reason occasionally, e.g. on 80k and 40k, the work gets distributed a little differently?
When the number is higher, there is never more than one task active. When the number is lower, I see more (up to 4) getting triggered simultaneously.
Anyhow, my question ultimately is:
While sparse arrays are supported in dask, this issue aims to open the discussion on how this could be applied in the the context of dask-ml.
In particular, even if #5 about TF-IDF gets resolved, the estimators downstream in the pipeline would also need to support sparse arrays for this to be of any use. The simplest example of such pipeline could for instance be #115 : some text vectorizers, training a wrapped out-of code scikit-learn model (e.g. PartialMultinomialNB
).
TruncatedSVD
, text vectorizer, some estimators in dask_ml.preprocessing
, and wrapped scikit learn models that support incremental learning and sparse arrays natively.
Several choices here,
In particular, as far as I understand, for the application at hand there is not need for ND sparse COO arrays provided by sparse
package, 2D would be enough. Furthermore, scikit learns mostly uses CSR and while it's relatively easy to convert between COO and CSR/CSC in the non distrubuted case, I'm not sure if it's still the case for dask. Then there is the partitioning strategy (see next section).
At least as far text vectorizers and incremental learning estimators are concerned, I imagine, it might be easier to partition the arrays row wise (using all the column width), which might also be natural with the CSR format.
For instance once someone manages to compute a distributed TF-IDF , the question arises how can one store it on disk without loading everything in memory at once. At present, it doesn't look like there is a canonical way to do this (dask/dask#2562 (comment)). zarr-developers/zarr-python#152 might be relevant but it essentially stores dense format with compression, as far as I understand, which make difficult to do later computation with the data, I believe.
Just a few general thoughts. I'm not sure what's your vision of the project in this respect is @mrocklin @TomAugspurger ; how much work this would represent or what might the easiest to start with..
http://scikit-learn.org/stable/developers/contributing.html#rolling-your-own-estimator
It'd be nice if these passed.
Someone mentioned wanting to implement ALS (I think @daniel-severo ?)
I thought I'd raise an issue to solicit and discuss algorithms.
Would be good to add LassoLars
https://pdfs.semanticscholar.org/c2ef/1651aaa2bdbf4c49d18682ac6f4401f95750.pdf
Once spectral clustering with Nystrom is done, this will be straightforward since we have all the building blocks
And register it with pytest so we still have assertion rewriting:
pytest.register_assert_rewrite('daskml.utils')
Greetings!
I recently used dask to implement a distributed version of tfidf. I want to contribute to the dask project by putting it somewhere.
Would this be the correct repo.?
I thought maybe a feature_extraction
directory would be appropriate.
Would be nice to have scikit-learn's NMF wrapped up for use with large Dask Arrays.
Right around the time that I stumbled upon dask-searchcv, scikit-learn 0.19 was released where pipelines now support the memory parameter:
http://scikit-learn.org/stable/whats_new.html#version-0-19
As such, perhaps this section of the docs should be revised:
With the regular scikit-learn version, each stage of the pipeline must be fit for each of the combinations of the parameters, even if that step isn’t being searched over.
I'd be interested to hear if any work was done to compare/contrast dask-searchcv vs scikit-learn's changes in 0.19. Presumably dask-searchcv allows you to more easily harness multiple machines, but perhaps some rudimentary benchmarks could/would be interesting/appropriate.
From:
http://dask-ml.readthedocs.io/en/latest/hyper-parameter-search.html#efficient-search
There was a question over in pangeo-data/pangeo#61 (comment) about how dask-ml works with sklearn-xarray
When I first tried it out, things like dask_ml.preprocessing.StandardScalar()
failed since we do X.mean(0)
, instead of X.mean(axis=0)
. I think I tried fixing that locally and a small example worked, though my memory is hazy.
I don't personally have much experience with xarray , but would certainly welcome PRs to make things work smoothly.
cc @phausamann , @gmaze
See #6 (comment) for more. I think the eventual goal is to have a compute
keyword in each estimators __init__
method that defaults to True.
Need to update MinMaxScaler
and StandardScaler
.
I've been walking through the documentation and had a few notes. I'd be happy to make these changes but wanted to check in before I submitted anything.
The separation between single machine and distributed learning seems odd to me. Many of the topics listed in single machine (grid search, pipelining, possibly even incremental learning) are still relevant when on a cluster.
I might instead flatten the TOC to just remove the single-machine/distributed distinction, and give all of the subsections their own home. This might flatten the TOC to something like the following:
I've also found it pleasant recently to start sections with the API relevant for that section. Futures docs for an example. This gives a quick list at the beginning of each section on the API relevant for that section. Those functions still link to the main API doc page.
Any thoughts or objections to this reorganization @TomAugspurger ?
https://github.com/Microsoft/LightGBM
They have parallel and distributed modes.
I was trying to build the dask-ml
documentation locally and got the following output from make html
building [mo]: targets for 0 po files that are out of date
building [html]: targets for 45 source files that are out of date
updating environment: 45 added, 0 changed, 0 removed
reading sources... [ 11%] examples/dask-glm
Notebook error:
PandocMissing in examples/dask-glm.ipynb:
Pandoc wasn't found.
Please check that pandoc is installed:
http://pandoc.org/installing.html
make: *** [html] Error 1
It looks like Pandoc is missing from my dev environment (created using conda env create -f ci/environment.yml --name=dask-ml_dev
). After manually installing Pandoc from conda-forge
, I was able to build the documentation successfully.
Not sure if I missed something when setting up my environment. If not, I'd be happy to submit a PR adding Pandoc to ci/environment.yml
and ci/environment-27.yml
Hey Guys,
Just attempting to install in a fresh virtualenv:
➜ pip install dask-ml
Collecting dask-ml
Downloading dask-ml-0.3.1.tar.gz (262kB)
100% |████████████████████████████████| 266kB 1.4MB/s
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/var/folders/cy/fp9dph0x14x9s7fszwjjjh4c0000gn/T/pip-build-HNpvGT/dask-ml/setup.py", line 5, in <module>
from Cython.Build import cythonize
ImportError: No module named Cython.Build
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/cy/fp9dph0x14x9s7fszwjjjh4c0000gn/T/pip-build-HNpvGT/dask-ml/
Seems like pip install Cython
was all I needed - this should be added to the requirements in setup.py
, no?
Most dask packages follow the following naming convention:
The dask-ml
package is currently imported as daskml
rather than dask_ml
. The underscore is a bit of a pain, so I can see how the current name would be preferred. Still though, I thought I'd bring this up. We may want to switch for conformity's sake.
I apparently forgot to include a license when creating the repo. The intent was BSD-3.
@daniel-severo, @mrocklin, as the other contributors up to this point, are you OK with adding that?
From the latest demos of dask-sklearn (and related projects) that I've seen, dask has primarily focused on parallelizing grid/random search and CV, and provides some smart caching mechanisms for transformers and estimators that have already been evaluated on a dataset.
However, that approach breaks down when, say, a slow algorithm is being trained on a very large dataset. With some slower algorithms, training one instance of the algorithm on one fold of the dataset can take hours or days, which makes training the algorithm unviable for all but the most patient practitioner.
Can dask-ml speed up the underlying ML algorithms that it supports?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.