
cerebro-system's People

Contributors

arunkk09, makemebitter, scnakandala


cerebro-system's Issues

Implement CPU and GPU pinning to the Spark backend

Currently, the Spark backend assumes each Spark worker has a single GPU and that at most one task runs on a worker. In the general case, workers can have multiple GPUs and multiple tasks can run on a single worker. In such a setting, each Spark task should be pinned to a specific GPU (or CPU core) to avoid resource contention.
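
One common way to implement this kind of pinning is to set CUDA_VISIBLE_DEVICES per task before TensorFlow initializes. The sketch below is illustrative only (the function name and slot-index argument are assumptions, not Cerebro's API):

```python
import os

def pin_task_to_device(local_task_index, num_gpus_per_worker):
    """Pin a Spark task to one GPU on its worker, or to CPU if the worker has no GPUs.

    local_task_index: the task's slot index on this worker (hypothetical input).
    """
    if num_gpus_per_worker > 0:
        # Round-robin tasks over the worker's GPUs so no two tasks contend
        os.environ['CUDA_VISIBLE_DEVICES'] = str(local_task_index % num_gpus_per_worker)
    else:
        # CPU-only worker: hide all GPUs so TensorFlow falls back to CPU
        os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
    return os.environ['CUDA_VISIBLE_DEVICES']
```

CPU-core pinning could be handled similarly with an affinity mask (e.g., os.sched_setaffinity on Linux).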

Improve the ModelSelectionResult class

ModelSelectionResult should store the trained tf.keras models and expose them to the end user.
End users can then store and manipulate them however they like.
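
A minimal sketch of what such a result object could look like (the class shape and method names here are assumptions, not Cerebro's actual API):

```python
class ModelSelectionResult:
    """Holds the outcome of a model selection run and exposes the tf.keras models."""

    def __init__(self, best_model, metrics, all_models):
        self.best_model = best_model    # tf.keras.Model of the best configuration
        self.metrics = metrics          # per-run training/validation metrics
        self._all_models = all_models   # {run_id: tf.keras.Model}

    def keras_model(self, run_id=None):
        """Return the underlying tf.keras model for a run (best model by default)."""
        if run_id is None:
            return self.best_model
        return self._all_models[run_id]
```

Users could then call, e.g., result.keras_model().save(path) to persist a model themselves.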

Implement Fault Tolerance in the Spark Backend

Cerebro should be able to handle Spark task failures.
Spark handles fault tolerance at the worker level; Cerebro should detect task failures and re-execute the affected models.

Faults can also be caused by user bugs, which have to be identified and reported separately.
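
The retry policy described above could look roughly like this sketch (the exception class and function names are hypothetical, not Cerebro's actual code):

```python
class UserCodeError(Exception):
    """A failure traced back to the user's model or estimator code."""

def run_with_retries(task_fn, max_retries=3):
    """Retry transient sub-epoch task failures, but surface user bugs immediately."""
    last_err = None
    for attempt in range(max_retries):
        try:
            return task_fn()
        except UserCodeError:
            raise                  # a user bug will fail again; report it instead
        except Exception as e:     # transient fault (lost executor, I/O error, ...)
            last_err = e
    raise last_err
```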

Add support for modifying the model specification after each epoch

Currently, Cerebro lets users specify the model at the beginning of the model selection workload and doesn't allow any changes after that. But some workloads, such as fine-tuning, need to freeze/unfreeze parts of the model at the end of different epochs. To support this, we need to let users specify a callback function that can modify the Keras model at the end of each epoch.
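
A sketch of what such a user-supplied hook could look like (the hook signature is an assumption; Cerebro would invoke it on each tf.keras model at the end of every epoch):

```python
def make_epoch_end_hook(unfreeze_at_epoch):
    """Build an epoch-end callback that unfreezes all layers from a given epoch on."""
    def hook(model, epoch):
        if epoch >= unfreeze_at_epoch:
            for layer in model.layers:
                layer.trainable = True   # unfreeze the frozen base for fine-tuning
        return model
    return hook
```

Note that after changing layer.trainable, a real tf.keras model must be re-compiled for the change to take effect.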

Implement REST API for Cerebro

Cerebro is currently developed as a library.
In order to support use cases such as web UIs we need to implement a REST API.

Refactor Keras Estimator and Params interfaces

The current Keras Estimator and Params classes are tightly coupled with the Spark backend.
Refactor them to extract common interfaces that can also be implemented by other backends (e.g., Dask).
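
An illustrative sketch of the kind of backend-agnostic contract the refactor could extract (class and method names are hypothetical):

```python
import abc

class ModelSelectionBackend(abc.ABC):
    """Abstract contract that Spark, Dask, etc. backends would implement."""

    @abc.abstractmethod
    def prepare_data(self, store, dataset, validation):
        """Partition and materialize the dataset for this backend."""

    @abc.abstractmethod
    def train_for_one_epoch(self, models, store, feature_cols, label_cols):
        """Run one epoch of training for every model and return per-model metrics."""
```

The Keras Estimator would then depend only on this interface rather than on Spark types directly.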

Add Support for MLFlow

  • Cerebro should be integrated with MLFlow for logging and for storing final models.

  • We will support two logging mechanisms. TensorBoard (TB) will be the primary logging mechanism and will be local to a particular model selection job.

  • For MLFlow, users will provide an MLFlow instance to which Cerebro will log. MLFlow logging will span model selection jobs.
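
One way the integration could shape its data: flatten each trial's hyperparameters and per-epoch metric history into the records MLflow expects. This is a sketch, not Cerebro's code; the function name and record shapes are assumptions:

```python
def trial_to_mlflow_records(hparams, history):
    """Convert one trial into MLflow-style params and (key, value, step) metrics.

    hparams: dict of hyperparameters, e.g. {'lr': 0.001}
    history: list of per-epoch metric dicts, e.g. [{'loss': 1.5}, {'loss': 0.9}]
    """
    params = {k: str(v) for k, v in hparams.items()}    # MLflow stores params as strings
    metrics = [(name, float(value), epoch)              # (key, value, step) triples
               for epoch, epoch_metrics in enumerate(history)
               for name, value in epoch_metrics.items()]
    return params, metrics
```

A caller could then replay these records with mlflow.log_params(params) and mlflow.log_metric(name, value, step=epoch) inside an mlflow.start_run() block on the user-provided tracking instance.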

HDFS storage backend is not working


TypeError                                 Traceback (most recent call last)
in
     45
     46 backend = SparkBackend(spark_context=spark.sparkContext, num_workers=1)
---> 47 store = HDFSStore('hdfs:///master:9000/tmp')
     48
     49 search_space = {'lr': hp_choice([0.01, 0.001, 0.0001])}

/mnt/local/cerebro-system/cerebro/storage/hdfs.py in __init__(self, prefix_path, host, port, user, kerb_ticket, driver, extra_conf, temp_dir, *args, **kwargs)
     68             driver=driver,
     69             extra_conf=extra_conf)
---> 70         self._hdfs = self._get_filesystem_fn()()
     71
     72         super(HDFSStore, self).__init__(prefix_path, *args, **kwargs)

/mnt/local/cerebro-system/cerebro/storage/hdfs.py in fn()
    147
    148         def fn():
--> 149             return pa.hdfs.connect(**hdfs_kwargs)
    150         return fn
    151

TypeError: connect() got an unexpected keyword argument 'driver'

pyarrow==0.17.0
https://arrow.apache.org/docs/python/generated/pyarrow.hdfs.connect.html
The driver argument appears to have been deprecated and removed from pyarrow.hdfs.connect.
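
A possible workaround (a sketch, not a tested patch) is to drop keyword arguments that newer pyarrow versions no longer accept before calling pa.hdfs.connect:

```python
# The TypeError above indicates 'driver' is no longer an accepted keyword
REMOVED_HDFS_KWARGS = {'driver'}

def strip_removed_kwargs(hdfs_kwargs):
    """Filter out kwargs that current pyarrow's hdfs.connect() rejects."""
    return {k: v for k, v in hdfs_kwargs.items() if k not in REMOVED_HDFS_KWARGS}

def connect_hdfs(hdfs_kwargs):
    import pyarrow as pa  # lazy import so pyarrow stays an optional dependency here
    return pa.hdfs.connect(**strip_removed_kwargs(hdfs_kwargs))
```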

Enable per-epoch level partitioned data shuffling

Currently, data shuffling within each sub-epoch is disabled.
We need a configurable parameter to enable/disable partition-level data shuffling at the sub-epoch level.

Full data shuffling (across partitions) will be handled later if needed.
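
The proposed knob could look like this sketch (the parameter name and reader shape are assumptions, not Cerebro's actual reader code):

```python
import random

def iterate_partition(rows, shuffle=False, seed=None):
    """Yield the rows of one data partition, optionally shuffled per sub-epoch.

    shuffle: the proposed configurable parameter; seed makes runs reproducible.
    """
    rows = list(rows)
    if shuffle:
        random.Random(seed).shuffle(rows)  # shuffle within the partition only
    for row in rows:
        yield row
```

Cross-partition (full) shuffling would require moving data between workers, which is why it is deferred here.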

PyPi release script

Create a shell script to automate the release of the python package to the PyPi package repository.

Distributed training: low GPU utilization

Cerebro experiences low GPU utilization when reading remote data.
Lack of prefetching may be the main reason.
The Keras prefetching thread pools should be user-configurable.
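
To illustrate the idea (not Cerebro's reader code): a small background-thread prefetcher with a user-configurable buffer keeps the GPU fed while the next batch is fetched from remote storage:

```python
import queue
import threading

def prefetch(iterable, buffer_size=4):
    """Yield items from iterable while a background thread fetches ahead.

    buffer_size is the kind of user-configurable knob proposed in this issue.
    """
    q = queue.Queue(maxsize=buffer_size)
    END = object()  # sentinel marking the end of the stream

    def producer():
        for item in iterable:
            q.put(item)     # blocks when the buffer is full
        q.put(END)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is END:
            return
        yield item
```

In a real TensorFlow pipeline the equivalent would be tf.data's prefetch() transformation with a tunable buffer size.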

Add support for data transformations

Currently, we assume the entire dataset will be used to train every model.
But this is not always the case: a particular model may select only some columns, only some rows (e.g., grouped learning), or both.

Data selection logic should be a part of the hyperparameter search space and estimators should take care of it.

We also have to handle data partitions of different sizes.
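
The selection logic described above could be sketched as follows (the function shape is hypothetical; the feature columns and row filter would come from the hyperparameter search space):

```python
def select_data(rows, feature_cols, row_filter=None):
    """Project each row to the model's chosen columns, optionally filtering rows.

    rows: iterable of dict-like rows, e.g. {'a': 1, 'b': 2, 'group': 0}
    row_filter: optional predicate for row selection (e.g., grouped learning).
    """
    for row in rows:
        if row_filter is None or row_filter(row):
            yield {c: row[c] for c in feature_cols}
```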

AttributeError when running MLP on Cerebro

I'm attempting to run Cerebro GridSearch on an MLP with 4 inputs and 1 output. The dataset has the below schema:

root
 |-- Temperature: double (nullable = true)
 |-- Occupancy: double (nullable = true)
 |-- Temperature_1: double (nullable = true)
 |-- Temperature_2: double (nullable = true)
 |-- Temperature_3: double (nullable = true)

The objective is to predict Occupancy with the Temperature columns.

When we run the cerebro model_selection.fit method, we get this error (on each Worker):

Traceback (most recent call last):
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/tune/base.py", line 200, in fit
    result = self._fit_on_prepared_data(metadata)
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/tune/grid.py", line 86, in _fit_on_prepared_data
    return _fit_on_prepared_data(self, metadata)
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/tune/grid.py", line 226, in _fit_on_prepared_data
    self.label_cols)
  File "/Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/backend/spark/backend.py", line 260, in train_for_one_epoch
    raise Exception(status.sub_epoch_result['error'])
Exception: in user code:

    /Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/keras/spark/util.py:155 prep  *
        tuple(
    /Users/arunavgupta/anaconda3/envs/cerebro/lib/python3.7/site-packages/cerebro_dl-1.0.0-py3.7.egg/cerebro/keras/spark/util.py:148 get_col_from_row_fn  *
        return getattr(row, col)

    AttributeError: 'petastorm_schema_view_view' object has no attribute 'input_layer'

Where input_layer is the name of the first layer in the model.

Targeting various versions of pyspark (2.4.4, 2.4.8, and 3.2.0), as well as installing cerebro both from source and from PyPI, does not fix this issue. Any ideas on how to get around it are appreciated; I will post the relevant source code in a comment on this issue.
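
The traceback shows the failing call is getattr(row, col) on a Petastorm row, so every configured feature and label column must exist in the dataset schema; the error suggests a layer name ('input_layer') was used where a DataFrame column name was expected. A small hypothetical pre-flight check along those lines:

```python
def validate_columns(schema_fields, feature_cols, label_cols):
    """Fail fast if any configured column is absent from the dataset schema."""
    missing = [c for c in list(feature_cols) + list(label_cols)
               if c not in schema_fields]
    if missing:
        raise ValueError('columns not found in dataset schema: %s' % missing)
```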
