adalabucsd / cerebro-system
Data System for Optimized Deep Learning Model Selection
License: Apache License 2.0
Add support for training DL models on text data. For text data pre-processing, rely on the Spark NLP library. Also, add an example to the examples directory.
Currently, the Spark backend assumes each Spark worker has a single GPU and that at most one task runs on a worker. In the general case, workers can have multiple GPUs, and multiple tasks can run on a single worker. In such a setting, each Spark task should be pinned to a specific GPU (or CPU core) to avoid resource contention.
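One common way to implement such pinning is to restrict GPU visibility per task before the DL framework creates a CUDA context. A minimal sketch (the function name and round-robin assignment scheme are assumptions for illustration, not Cerebro's API):

```python
import os

def pin_task_to_gpu(task_index, num_gpus):
    """Assign this task a GPU round-robin and hide all other devices.

    Setting CUDA_VISIBLE_DEVICES before TensorFlow (or any CUDA library)
    initializes makes the process see only the assigned GPU, which avoids
    contention when multiple tasks share a worker.
    """
    gpu_id = task_index % num_gpus
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    return gpu_id
```

A CPU-only variant could instead set core affinity (e.g., via `os.sched_setaffinity` on Linux).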
ModelSearchModel should store the trained tf.keras models and expose them to the end user, who can then store and manipulate them however they like.
Cerebro should be able to handle Spark task failures.
Spark handles worker fault tolerance; Cerebro should detect task failures and re-execute the affected models.
Faults can also stem from user bugs, which have to be identified and reported separately.
Currently, Cerebro lets users specify the model at the beginning of the model selection workload and doesn't allow any changes after that. But some workloads, such as fine-tuning, need to freeze/unfreeze parts of the model at the end of different epochs. To achieve this, we need to add support for specifying a callback function that can modify the Keras model at the end of each epoch.
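One possible shape for such a hook is a user-supplied function invoked at every epoch boundary. The hook signature, the stub classes, and the `epoch_end_callback` parameter below are all assumptions for illustration, not an existing Cerebro API (minimal stubs stand in for a tf.keras model so the sketch is self-contained):

```python
class LayerStub:
    """Stand-in for a tf.keras layer with a name and a trainable flag."""
    def __init__(self, name, trainable=False):
        self.name = name
        self.trainable = trainable

class ModelStub:
    """Stand-in for a tf.keras model: just a list of layers."""
    def __init__(self, layers):
        self.layers = layers

def unfreeze_after(epoch, model, unfreeze_epoch=3, prefix="block5"):
    """Example user callback: once training reaches `unfreeze_epoch`,
    unfreeze all layers whose names match `prefix` (fine-tuning pattern)."""
    if epoch >= unfreeze_epoch:
        for layer in model.layers:
            if layer.name.startswith(prefix):
                layer.trainable = True
    return model

# Cerebro could invoke the hook at every epoch boundary, e.g.:
# est = SparkEstimator(..., epoch_end_callback=unfreeze_after)  # assumed parameter
```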
Cerebro is currently developed as a library.
To support use cases such as web UIs, we need to implement a REST API.
The current Keras estimator and Params classes are tightly coupled with the Spark backend.
Refactor them to extract common interfaces that can also be implemented for other backends (e.g., Dask).
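The refactoring could extract an abstract base class that each backend implements. A sketch under assumed names (`ModelSelectionBackend` and its methods are illustrative, not Cerebro's actual classes):

```python
from abc import ABC, abstractmethod

class ModelSelectionBackend(ABC):
    """Hypothetical backend-agnostic interface extracted from the
    Spark-specific code; names are illustrative only."""

    @abstractmethod
    def prepare_data(self, store, dataset):
        """Partition and materialize the dataset for this backend."""

    @abstractmethod
    def train_for_one_epoch(self, models, store):
        """Run one sub-epoch of training for each model."""

class DaskBackendStub(ModelSelectionBackend):
    """Placeholder showing how a Dask backend would plug into the interface."""

    def prepare_data(self, store, dataset):
        return f"{len(dataset)} rows partitioned for Dask"

    def train_for_one_epoch(self, models, store):
        return {m: "trained" for m in models}
```

Estimators would then depend only on `ModelSelectionBackend`, never on Spark types directly.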
Cerebro should be integrated with MLflow for logging and storing final models.
We will support two logging mechanisms. TensorBoard (TB) will be the primary one and will be local to a particular model selection job.
For MLflow, users will provide an MLflow instance to which Cerebro will log. MLflow logging will span model selection jobs.
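A minimal sketch of this two-tier design (all class and method names are assumptions, and simple stubs replace the real TensorBoard/MLflow writers): a per-job logger that is always active, plus an optional cross-job logger behind one interface.

```python
class TensorBoardLogger:
    """Per-job logger stub; a real implementation would write TB event files
    under a directory local to this model selection job."""
    def __init__(self, job_id):
        self.job_id = job_id
        self.records = []

    def log_metric(self, model_id, name, value):
        self.records.append((self.job_id, model_id, name, value))

class CompositeLogger:
    """Fans each metric out to TB (always) and MLflow (only if configured),
    so the cross-job MLflow instance stays optional."""
    def __init__(self, tb_logger, mlflow_logger=None):
        self.loggers = [tb_logger] + ([mlflow_logger] if mlflow_logger else [])

    def log_metric(self, model_id, name, value):
        for logger in self.loggers:
            logger.log_metric(model_id, name, value)
```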
```
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
     45
     46 backend = SparkBackend(spark_context=spark.sparkContext, num_workers=1)
---> 47 store = HDFSStore('hdfs:///master:9000/tmp')
     48
     49 search_space = {'lr': hp_choice([0.01, 0.001, 0.0001])}

/mnt/local/cerebro-system/cerebro/storage/hdfs.py in __init__(self, prefix_path, host, port, user, kerb_ticket, driver, extra_conf, temp_dir, *args, **kwargs)
     68                 driver=driver,
     69                 extra_conf=extra_conf)
---> 70         self._hdfs = self._get_filesystem_fn()()
     71
     72         super(HDFSStore, self).__init__(prefix_path, *args, **kwargs)

/mnt/local/cerebro-system/cerebro/storage/hdfs.py in fn()
    147
    148         def fn():
--> 149             return pa.hdfs.connect(**hdfs_kwargs)
    150         return fn
    151

TypeError: connect() got an unexpected keyword argument 'driver'
```
pyarrow==0.17.0
Per https://arrow.apache.org/docs/python/generated/pyarrow.hdfs.connect.html, `driver` appears to be a deprecated argument.
The same issue was faced in Ray:
tensorflow/tensorflow#32159
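A likely fix is to drop the removed argument before calling `pyarrow.hdfs.connect`. A sketch (the filtering helper is an assumption, not Cerebro's code; `pa.hdfs.connect` is the pyarrow call from the traceback):

```python
def strip_removed_kwargs(hdfs_kwargs):
    """Return a copy of the connect() kwargs without arguments that newer
    pyarrow versions no longer accept (`driver` was deprecated, then removed)."""
    removed = {"driver"}
    return {k: v for k, v in hdfs_kwargs.items() if k not in removed}

# self._hdfs would then be built via:
# pa.hdfs.connect(**strip_removed_kwargs(hdfs_kwargs))
```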
ASHA is an AutoML procedure that combines random search with early stopping.
More details: https://arxiv.org/pdf/1810.05934.pdf
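To make the promotion logic concrete, here is a simplified synchronous successive-halving sketch; ASHA itself promotes configurations asynchronously, and all names here are illustrative:

```python
def successive_halving(configs, evaluate, min_epochs=1, eta=2, rounds=3):
    """Evaluate all configs briefly, keep the best 1/eta each round, and
    give survivors eta-times more epochs: random search + early stopping.

    `evaluate(config, epochs)` returns a loss (lower is better).
    """
    survivors = list(configs)
    epochs = min_epochs
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: evaluate(c, epochs))
        survivors = scored[:max(1, len(scored) // eta)]  # early-stop the rest
        epochs *= eta                                    # promote survivors
    return survivors
```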
Currently, data shuffling for each sub-epoch is disabled.
We need to add a configurable parameter to enable/disable partition-level data shuffling at the sub-epoch level.
Full data shuffling (across partitions) will be handled later if needed.
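A minimal sketch of what the proposed flag could look like at the partition reader level (the function and its `shuffle` parameter are assumptions for illustration):

```python
import random

def read_partition(rows, shuffle=False, seed=None):
    """Return the rows of one partition, optionally shuffled per sub-epoch.

    Shuffling stays within the partition, matching the issue's
    partition-level scope; cross-partition shuffling is out of scope here.
    """
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)
    return rows
```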
Create a shell script to automate the release of the Python package to the PyPI package repository.
Cerebro experiences low GPU utilization when reading remote data.
Lack of pre-fetching may be the main reason.
The Keras pre-fetching thread pools should be user-configurable.
Currently, we assume the entire dataset will be used to train every model.
But this is not always the case: a particular model may select only some columns, only some rows (e.g., grouped learning), or both.
Data selection logic should be part of the hyperparameter search space, and estimators should take care of it.
We also have to handle data partitions being of different sizes.
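A sketch of what per-model data selection could look like if it lived in the search space (the helper, the search-space keys, and the dict-based rows are all assumptions for illustration):

```python
def select_data(rows, columns=None, row_filter=None):
    """Apply a per-model data selection: filter rows, then project columns.

    An estimator could run this on each partition before training, driven
    by selection entries drawn from the hyperparameter search space.
    """
    if row_filter:
        rows = [r for r in rows if row_filter(r)]
    if columns:
        rows = [{c: r[c] for c in columns} for r in rows]
    return rows

# Hypothetical search space carrying data selection alongside hyperparameters:
search_space = {
    "lr": [0.01, 0.001],
    "columns": [["Temperature", "Temperature_1"], None],  # None = all columns
}
```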
I'm attempting to run Cerebro GridSearch on an MLP with 4 inputs and 1 output. The dataset has the following schema:

```
root
 |-- Temperature: double (nullable = true)
 |-- Occupancy: double (nullable = true)
 |-- Temperature_1: double (nullable = true)
 |-- Temperature_2: double (nullable = true)
 |-- Temperature_3: double (nullable = true)
```
The objective is to predict `Occupancy` from the `Temperature` columns. When we run the Cerebro `model_selection.fit` method, we get this error on each worker:
```
Traceback (most recent call last):
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/tune/base.py", line 200, in fit
    result = self._fit_on_prepared_data(metadata)
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/tune/grid.py", line 86, in _fit_on_prepared_data
    return _fit_on_prepared_data(self, metadata)
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/tune/grid.py", line 226, in _fit_on_prepared_data
    self.label_cols)
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/backend/spark/backend.py", line 260, in train_for_one_epoch
    raise Exception(status.sub_epoch_result['error'])
Exception: in user code:

    /Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/keras/spark/util.py:155 prep  *
        tuple(
    /Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/keras/spark/util.py:148 get_col_from_row_fn  *
        return getattr(row, col)

    AttributeError: 'petastorm_schema_view_view' object has no attribute 'input_layer'
```
Here `input_layer` is the name of the first layer in the model.
Targeting various versions of PySpark (2.4.4, 2.4.8, and 3.2.0), as well as installing Cerebro from source and from PyPI, does not fix this issue. Any ideas on how to get around it are appreciated; I will post the relevant source code in a comment on this issue.