adalabucsd / cerebro-system
Data System for Optimized Deep Learning Model Selection
License: Apache License 2.0
Add support for training DL models on text data. For text data pre-processing, rely on the Spark NLP library. Also, add an example to the examples directory.
Currently, the Spark backend assumes each Spark worker has a single GPU and that at most one task runs on a worker. In the general case, workers can have multiple GPUs, and multiple tasks can run on a single worker. In such a setting, each Spark task should be pinned to a specific GPU (or CPU core) to avoid resource contention.
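One common way to implement such pinning is to restrict GPU visibility per task before the DL framework creates a CUDA context. A minimal sketch (the function name and round-robin assignment scheme are assumptions for illustration, not Cerebro's API):

```python
import os

def pin_task_to_gpu(task_index, num_gpus):
    """Assign this task a GPU round-robin and hide all other devices.

    Setting CUDA_VISIBLE_DEVICES before TensorFlow (or any CUDA library)
    initializes makes the process see only the assigned GPU, which avoids
    contention when multiple tasks share a worker.
    """
    gpu_id = task_index % num_gpus
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return gpu_id
```

A CPU-only variant could instead set core affinity (e.g., via `os.sched_setaffinity` on Linux).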
ModelSearchModel should store the trained tf.keras models and expose them to the end user, who can then store and manipulate them however they like.
Cerebro should be able to handle Spark task failures.
Spark handles worker fault tolerance; Cerebro should detect task failures and re-execute the affected models.
Faults can also stem from user bugs, which have to be identified and reported separately.
Currently, Cerebro lets users specify the model at the beginning of the model selection workload and doesn't allow any changes after that. But some workloads, such as fine-tuning, need to freeze/unfreeze parts of the model at the end of different epochs. To achieve this, we need to add support for specifying a callback function that can modify the Keras model at the end of each epoch.
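One possible shape for such a hook is a user-supplied function invoked at every epoch boundary. The hook signature, the stub classes, and the `epoch_end_callback` parameter below are all assumptions for illustration, not an existing Cerebro API (minimal stubs stand in for a tf.keras model so the sketch is self-contained):

```python
class LayerStub:
    """Stand-in for a tf.keras layer with a name and a trainable flag."""
    def __init__(self, name, trainable=False):
        self.name = name
        self.trainable = trainable

class ModelStub:
    """Stand-in for a tf.keras model: just a list of layers."""
    def __init__(self, layers):
        self.layers = layers

def unfreeze_after(epoch, model, unfreeze_epoch=3, prefix="block5"):
    """Example user callback: once training reaches `unfreeze_epoch`,
    unfreeze all layers whose names match `prefix` (fine-tuning pattern)."""
    if epoch >= unfreeze_epoch:
        for layer in model.layers:
            if layer.name.startswith(prefix):
                layer.trainable = True
    return model

# Cerebro could invoke the hook at every epoch boundary, e.g.:
# est = SparkEstimator(..., epoch_end_callback=unfreeze_after)  # assumed parameter
```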
Cerebro is currently developed as a library.
To support use cases such as web UIs, we need to implement a REST API.
The current Keras estimator and Params classes are tightly coupled with the Spark backend.
Refactor them to extract common interfaces that can also be implemented for other backends (e.g., Dask).
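The refactoring could extract an abstract base class that each backend implements. A sketch under assumed names (`ModelSelectionBackend` and its methods are illustrative, not Cerebro's actual classes):

```python
from abc import ABC, abstractmethod

class ModelSelectionBackend(ABC):
    """Hypothetical backend-agnostic interface extracted from the
    Spark-specific code; names are illustrative only."""

    @abstractmethod
    def prepare_data(self, store, dataset):
        """Partition and materialize the dataset for this backend."""

    @abstractmethod
    def train_for_one_epoch(self, models, store):
        """Run one sub-epoch of training for each model."""

class DaskBackendStub(ModelSelectionBackend):
    """Placeholder showing how a Dask backend would plug into the interface."""

    def prepare_data(self, store, dataset):
        return f"{len(dataset)} rows partitioned for Dask"

    def train_for_one_epoch(self, models, store):
        return {m: "trained" for m in models}
```

Estimators would then depend only on `ModelSelectionBackend`, never on Spark types directly.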
Cerebro should be integrated with MLflow for logging and storing final models.
We will support two logging mechanisms. TensorBoard (TB) will be the primary one and will be local to a particular model selection job.
For MLflow, users will provide an MLflow instance to which Cerebro will log. MLflow logging will span model selection jobs.
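A minimal sketch of this two-tier design (all class and method names are assumptions, and simple stubs replace the real TensorBoard/MLflow writers): a per-job logger that is always active, plus an optional cross-job logger behind one interface.

```python
class TensorBoardLogger:
    """Per-job logger stub; a real implementation would write TB event files
    under a directory local to this model selection job."""
    def __init__(self, job_id):
        self.job_id = job_id
        self.records = []

    def log_metric(self, model_id, name, value):
        self.records.append((self.job_id, model_id, name, value))

class CompositeLogger:
    """Fans each metric out to TB (always) and MLflow (only if configured),
    so the cross-job MLflow instance stays optional."""
    def __init__(self, tb_logger, mlflow_logger=None):
        self.loggers = [tb_logger] + ([mlflow_logger] if mlflow_logger else [])

    def log_metric(self, model_id, name, value):
        for logger in self.loggers:
            logger.log_metric(model_id, name, value)
```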
```
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
     45
     46 backend = SparkBackend(spark_context=spark.sparkContext, num_workers=1)
---> 47 store = HDFSStore('hdfs:///master:9000/tmp')
     48
     49 search_space = {'lr': hp_choice([0.01, 0.001, 0.0001])}

/mnt/local/cerebro-system/cerebro/storage/hdfs.py in __init__(self, prefix_path, host, port, user, kerb_ticket, driver, extra_conf, temp_dir, *args, **kwargs)
     68                 driver=driver,
     69                 extra_conf=extra_conf)
---> 70         self._hdfs = self._get_filesystem_fn()()
     71
     72         super(HDFSStore, self).__init__(prefix_path, *args, **kwargs)

/mnt/local/cerebro-system/cerebro/storage/hdfs.py in fn()
    147
    148         def fn():
--> 149             return pa.hdfs.connect(**hdfs_kwargs)
    150         return fn
    151

TypeError: connect() got an unexpected keyword argument 'driver'
```
pyarrow==0.17.0
Per https://arrow.apache.org/docs/python/generated/pyarrow.hdfs.connect.html, `driver` appears to be a deprecated argument.
The same issue was faced in Ray:
tensorflow/tensorflow#32159
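A likely fix is to drop the removed argument before calling `pyarrow.hdfs.connect`. A sketch (the filtering helper is an assumption, not Cerebro's code; `pa.hdfs.connect` is the pyarrow call from the traceback):

```python
def strip_removed_kwargs(hdfs_kwargs):
    """Return a copy of the connect() kwargs without arguments that newer
    pyarrow versions no longer accept (`driver` was deprecated, then removed)."""
    removed = {"driver"}
    return {k: v for k, v in hdfs_kwargs.items() if k not in removed}

# self._hdfs would then be built via:
# pa.hdfs.connect(**strip_removed_kwargs(hdfs_kwargs))
```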
ASHA is an AutoML procedure that combines random search with early stopping.
More details: https://arxiv.org/pdf/1810.05934.pdf
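To make the promotion logic concrete, here is a simplified synchronous successive-halving sketch; ASHA itself promotes configurations asynchronously, and all names here are illustrative:

```python
def successive_halving(configs, evaluate, min_epochs=1, eta=2, rounds=3):
    """Evaluate all configs briefly, keep the best 1/eta each round, and
    give survivors eta-times more epochs: random search + early stopping.

    `evaluate(config, epochs)` returns a loss (lower is better).
    """
    survivors = list(configs)
    epochs = min_epochs
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: evaluate(c, epochs))
        survivors = scored[:max(1, len(scored) // eta)]  # early-stop the rest
        epochs *= eta                                    # promote survivors
    return survivors
```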
Currently, data shuffling for each sub-epoch is disabled.
We need to add a configurable parameter to enable/disable partition-level data shuffling at the sub-epoch level.
Full data shuffling (across partitions) will be handled later if needed.
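A minimal sketch of what the proposed flag could look like at the partition reader level (the function and its `shuffle` parameter are assumptions for illustration):

```python
import random

def read_partition(rows, shuffle=False, seed=None):
    """Return the rows of one partition, optionally shuffled per sub-epoch.

    Shuffling stays within the partition, matching the issue's
    partition-level scope; cross-partition shuffling is out of scope here.
    """
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    return rows
```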
Create a shell script to automate the release of the Python package to the PyPI package repository.
Cerebro experiences low GPU utilization when reading remote data.
Lack of pre-fetching may be the main reason.
The Keras pre-fetching thread pools should be user-configurable.
Currently, we assume the entire dataset will be used to train every model.
But this is not always the case: a particular model may select only some columns, only some rows (e.g., grouped learning), or both.
Data selection logic should be part of the hyperparameter search space, and estimators should take care of it.
We also have to handle data partitions being of different sizes.
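A sketch of what per-model data selection could look like if it lived in the search space (the helper, the search-space keys, and the dict-based rows are all assumptions for illustration):

```python
def select_data(rows, columns=None, row_filter=None):
    """Apply a per-model data selection: filter rows, then project columns.

    An estimator could run this on each partition before training, driven
    by selection entries drawn from the hyperparameter search space.
    """
    if row_filter:
        rows = [r for r in rows if row_filter(r)]
    if columns:
        rows = [{c: r[c] for c in columns} for r in rows]
    return rows

# Hypothetical search space carrying data selection alongside hyperparameters:
search_space = {
    "lr": [0.01, 0.001],
    "columns": [["Temperature", "Temperature_1"], None],  # None = all columns
}
```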
I'm attempting to run Cerebro GridSearch on an MLP with 4 inputs and 1 output. The dataset has the following schema:

```
root
 |-- Temperature: double (nullable = true)
 |-- Occupancy: double (nullable = true)
 |-- Temperature_1: double (nullable = true)
 |-- Temperature_2: double (nullable = true)
 |-- Temperature_3: double (nullable = true)
```
The objective is to predict `Occupancy` from the `Temperature` columns. When we run the Cerebro `model_selection.fit` method, we get this error on each worker:
```
Traceback (most recent call last):
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/tune/base.py", line 200, in fit
    result = self._fit_on_prepared_data(metadata)
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/tune/grid.py", line 86, in _fit_on_prepared_data
    return _fit_on_prepared_data(self, metadata)
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/tune/grid.py", line 226, in _fit_on_prepared_data
    self.label_cols)
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/backend/spark/backend.py", line 260, in train_for_one_epoch
    raise Exception(status.sub_epoch_result['error'])
Exception: in user code:

    /Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/keras/spark/util.py:155 prep  *
        tuple(
    /Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/keras/spark/util.py:148 get_col_from_row_fn  *
        return getattr(row, col)

    AttributeError: 'petastorm_schema_view_view' object has no attribute 'input_layer'
```
Here `input_layer` is the name of the first layer in the model.
Targeting various versions of PySpark (2.4.4, 2.4.8, and 3.2.0), as well as installing Cerebro from source and from PyPI, does not fix this issue. Any ideas on how to get around it are appreciated; I will post the relevant source code in a comment on this issue.