studioml / studio
Studio: Simplify and expedite model building process
Home Page: https://studio.ml
License: Apache License 2.0
Users shouldn't have to worry about the names of queues, etc. Can we have a setting in default_config.yaml with the name of the queue being used?
I don't think each project needs its own queue either. By default, all jobs should go to the same queue.
If an expert user wants to side-step this queue, they can provide one as a command-line argument.
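A hedged sketch of what such a setting might look like in default_config.yaml (the key name and default value here are assumptions, not the actual schema):

```yaml
# default_config.yaml -- hypothetical entry; key name not confirmed
queue: studioml-default    # all jobs go to this queue unless overridden
```

Expert users could then side-step it with something like a --queue command-line flag (flag name also hypothetical).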
Illia suggested some interesting approaches:
https://github.com/gstaff/tfzip
https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/
Per @arshak's request
By @karlmutch's request
Using a queue-based autoscaling group could be a nice feature: it would avoid running GPU instances continuously, and possibly avoid the 10-minute minimum billing overhead every time a new instance is started.
However, the alpha queue-based autoscaler has some undesirable restrictions (https://cloud.google.com/compute/docs/autoscaler/scaling-queue-based).
The biggest issue I see is this sentence from their documentation:
"Currently, only topics with a constant message flow (at least 1 per minute) are supported. This issue is being addressed in future releases."
This is obviously not realistic in a data science environment, where job submission is bursty. Another issue is that queue-scaling-acceptable-backlog-per-instance can only be specified as a number of queued tasks, not as a wait time. This may result in a scenario where my job waits indefinitely for enough other jobs to be queued before it can be processed (let's say the acceptable backlog was set so that a single job never triggers a scale-up).
Therefore, I'd like to request we make the cloud worker smarter and ditch the autoscaler. Can it process multiple jobs when it spins up an instance? Running one job and killing the instance incurs the minimum 10-minute billing increment (https://cloud.google.com/compute/pricing), plus the overhead of setting up the software environment each time.
So when a cloud worker instance is spun up, it should process a batch of jobs and not be too eager to shut itself off before the 10 minutes are up. Does this make sense @pzhokhov @karlmutch ?
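The batching behaviour described above could be sketched roughly like this (the `run_job` callable and the worker loop are hypothetical, not the actual studio worker API):

```python
import queue


def drain_jobs(job_queue, run_job, idle_timeout=600):
    """Process queued jobs in a batch instead of one job per instance.

    Keeps pulling work until the queue has been empty for `idle_timeout`
    seconds (roughly the 10-minute minimum billing increment), then
    returns so the instance can shut itself down. `run_job` is a
    hypothetical callable; the real worker API may differ.
    """
    done = 0
    while True:
        try:
            job = job_queue.get(timeout=idle_timeout)
        except queue.Empty:
            return done  # idle long enough -> safe to terminate
        run_job(job)
        done += 1
```

The point of the sketch is that the per-instance setup and billing overhead gets amortized over many jobs, and the instance only dies after a genuinely idle period.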
Both tensorflow and keras provide great toolsets for training deep learning models. But for using these models, the tools feel a bit sub-par. In particular, a lot of data preprocessing has to be done manually: keras has handy background generator reading, but only supports numeric data types; tensorflow supports non-numeric data types and operations with custom code, but offers no buffering within the graph, and the data types have to be specified upfront (which is very non-pythonic).
The latter two limitations are inherently tied to automatic differentiation - basically, tensorflow needs to know which variables are back-propagatable, and be able to backprop through them. But for inference we need neither.
The use case that I have in mind feels fairly standard - data comes in as a list of urls and comes out as a dictionary
{url: annotation}.
The urls have to be downloaded and resized using multiple processes, in parallel with inference (which can be done in batches on a gpu). Bad urls have to be handled, and there may be additional post-processing (also using multiple cpu processes).
The user code should look approximately like this:
import urllib
from io import BytesIO
from PIL import Image
from studio import model_util

mw = model_util.KerasModelWrapper(checkpoint_file)
mw.add_preprocessing(model_util.resize_image_to_input(mw), num_workers=10)
mw.add_preprocessing(lambda bytes: Image.open(BytesIO(bytes)))
mw.add_preprocessing(lambda url: urllib.urlopen(url).read())
output = mw.apply(<list_of_urls>)
output = mw.apply(<generator_of_urls>)
output = mw.apply(<set_of_urls>)
That should add an input pre-processing pipeline with 10 workers filling the inference queue: they read the urls, convert them to image tensors, and resize the tensors to the proper input size (handling image dimension order, etc.). Items for which preprocessing throws an exception should return None, and should not be passed to inference (so that they don't spoil the entire inference batch).
We can also try to write it in a more graph-building style, like this:
mw = KerasModelWrapper(checkpoint_file)
mp = ModelPipe() # analog of keras.models.Sequential
mp.add(lambda url: urllib.urlopen(url).read())
mp.add(lambda bytes: Image.open(BytesIO(bytes)))
mp.add(lambda img: resize_image_to_input(mw)(img), num_workers=10)
mp.add(mw)
output = mp.apply(<list_of_urls>)
output = mp.apply(<generator_of_urls>)
output = mp.apply(<set_of_urls>)
Note that in both cases the first three calls are fused together, and a preprocessing queue will only be inserted when the num_workers argument is specified.
I like the second option a bit better because the order of adding operations is more logical and more coherent with keras.
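A minimal toy sketch of the ModelPipe semantics proposed above - fused stages, and items whose preprocessing raises mapping to None instead of reaching the model. This is an illustration of the proposal, not the actual studio implementation, and it runs serially (num_workers is accepted only for API parity):

```python
class ModelPipe(object):
    """Toy sketch of the proposed pipeline API (not the real implementation).

    Stages added with add() are fused and run inline, in order; a stage
    that raises maps its item to None, so bad inputs never spoil a batch.
    """

    def __init__(self):
        self._stages = []

    def add(self, func, num_workers=None):
        # num_workers is accepted for API parity; this sketch runs serially.
        self._stages.append(func)
        return self

    def apply(self, inputs):
        out = {}
        for item in inputs:
            value = item
            try:
                for stage in self._stages:
                    value = stage(value)
            except Exception:
                value = None  # failed preprocessing -> None, skip inference
            out[item] = value
        return out
```

A real version would insert a multiprocessing queue between the fused preprocessing stages and the model whenever num_workers is given, as the issue describes.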
@arshak @ilblackdragon @michael-leece-st @nieoh your thoughts on this are very appreciated :)
The UI uses materialize, but we need to have the right divs in the html to get decent formatting. Maybe we can use the structure of one of the materialize templates, like:
http://materializecss.com/templates/starter-template/preview.html
http://materializecss.com/themes.html
We just need to follow the standard materialize hierarchy in each section.
@pzhokhov grab me if you want to spend 5-10 minutes to fix this.
When deleting experiments, they are not being deleted from the cache, causing the UI to think that they still exist and creating failures.
We need per-user authorization rules, so that one user cannot delete the experiments of another. Firebase does not allow one to create these (unless one has administrator privileges, in which case one can delete all experiments of another user anyway).
So far, two options are available:
1. Hide firebase behind an API server (the server will handle REST requests like "add / get experiment", "add / get experiment artifact"). The server will have full access to the database and manage access rights for users. Pros: simple. Cons: scalability is difficult; may have security vulnerabilities; requires server maintenance.
2. Have a firebase app that creates permissions for users on the fly, i.e. the user requests experiment creation once, the app sets up permissions for the experiment, and then the user interacts with firebase directly. Pros: scalability is much easier (traffic through the app is small, and it can rely on standard firebase solutions); more secure; less brittle; can run in firebase app engine. Cons: more complex to deploy.
Tried a clean install of master in a virtualenv - there are some package version problems for the googleapi packages.
Ensure that arguments to the script are not being accidentally attributed to the runner
Getting the following when I try to log in:
email:[email protected]
password:
Traceback (most recent call last):
  File "/Users/arshak.navruzyan/miniconda2/bin/studio", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/Users/arshak.navruzyan/studio/studio/scripts/studio", line 2, in <module>
    from studio import studio
  File "/Users/arshak.navruzyan/studio/studio/studio.py", line 7, in <module>
    db_provider = model.get_db_provider()
  File "/Users/arshak.navruzyan/studio/studio/model.py", line 322, in get_db_provider
    return FirebaseProvider(db_config)
  File "/Users/arshak.navruzyan/studio/studio/model.py", line 77, in __init__
    self.auth = FirebaseAuth(app)
  File "/Users/arshak.navruzyan/studio/studio/auth.py", line 18, in __init__
    self._update_user()
  File "/Users/arshak.navruzyan/studio/studio/auth.py", line 29, in _update_user
    self.user = self.firebase.auth().sign_in_with_email_and_password(email, password)
  File "/Users/arshak.navruzyan/miniconda2/lib/python2.7/site-packages/pyrebase/pyrebase.py", line 85, in sign_in_with_email_and_password
    raise_detailed_error(request_object)
  File "/Users/arshak.navruzyan/miniconda2/lib/python2.7/site-packages/pyrebase/pyrebase.py", line 448, in raise_detailed_error
    raise HTTPError(e, request_object.text)
requests.exceptions.HTTPError: [Errno 400 Client Error: Bad Request for url: https://www.googleapis.com/identitytoolkit/v3/relyingparty/verifyPassword?key=AIzaSyCLQbp5X2B4SWzBw-sz9rUnGHNSdMl0Yx8] {
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "invalid",
        "message": "EMAIL_NOT_FOUND"
      }
    ],
    "code": 400,
    "message": "EMAIL_NOT_FOUND"
  }
}
Exception AttributeError: "'FirebaseAuth' object has no attribute 'sched'" in <bound method FirebaseAuth.__del__ of <studio.auth.FirebaseAuth object at 0x112800bd0>> ignored
As an experimenter, or owner of a python runner deployment
I want to install TFStudio from a well known public package repository
In order that deployment for TFStudio can be curated, automated and version managed
Notes
Using arbitrary naming inside PyPI (https://pypi.python.org/pypi) until longer-term decisions are made.
Per @nieoh's request - sometimes the working folder may be too big and the user might not want to capture it. We could add something like --capture=null:workspace to disable workspace capture.
Per @karlmutch's request: add config values that refer to environment variables. The example use case is
serviceAccount: $GOOGLE_APPLICATION_CREDENTIALS
which should read the location of the service account credentials JSON from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
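The requested behaviour could be sketched with the standard library's os.path.expandvars; whether the real loader expands all values or only specific keys such as serviceAccount is an open design question, so this is only an illustration:

```python
import os


def expand_env_values(config):
    """Recursively expand $VAR references in string config values.

    Sketch of the requested feature: any string value containing a
    $VARIABLE reference is replaced using the current environment.
    Nested dicts are handled; non-string values pass through untouched.
    """
    if isinstance(config, dict):
        return {k: expand_env_values(v) for k, v in config.items()}
    if isinstance(config, str):
        return os.path.expandvars(config)
    return config
```

Note that os.path.expandvars leaves unknown variables as-is, which conveniently avoids clobbering values when an env variable is unset.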
When we get to a pip-installable version of the app, we should think about where default_config.yaml lives. I'm sort of in favor of how keras does it (~/.keras/keras.json); maybe we can follow a similar convention: ~/.tfstudio/tfstudio.yaml.
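The keras-style lookup suggested above could be sketched as follows; both the path and the fallback behaviour are a proposal, not current studio behaviour:

```python
import os


def config_file_path(packaged_default='default_config.yaml'):
    """Resolve the user config file, keras-style (a proposed convention).

    Prefer ~/.tfstudio/tfstudio.yaml when it exists; otherwise fall back
    to the default_config.yaml shipped with the package (the fallback
    path here is a placeholder).
    """
    user_path = os.path.join(os.path.expanduser('~'), '.tfstudio', 'tfstudio.yaml')
    return user_path if os.path.exists(user_path) else packaged_default
```

This mirrors how keras resolves ~/.keras/keras.json and falls back to built-in defaults when the user file is absent.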
Currently all experiment data is rooted at ~/.tfstudio/.... In order to run on shared infrastructure, the default values inside the JSON need to be ignored and a scratch $HOME equivalent created for each pubsub job being received. This will allow the runner to destroy all data when the keras or TF experiment is done. Currently the python code is responsible for pushing results back to storage; we need a better way, as experiments need to run completely cloud-agnostically, including pushing results back.
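The per-job scratch $HOME equivalent described above could be sketched as a context manager; the class name and prefix are illustrative only:

```python
import shutil
import tempfile


class ScratchHome(object):
    """Context manager giving each pubsub job its own scratch $HOME equivalent.

    Sketch of the proposal: experiment data is rooted in a throwaway
    directory instead of ~/.tfstudio, and everything is destroyed when
    the experiment finishes, so nothing leaks between jobs on shared
    infrastructure.
    """

    def __init__(self, prefix='studioml-job-'):
        self.path = None
        self._prefix = prefix

    def __enter__(self):
        self.path = tempfile.mkdtemp(prefix=self._prefix)
        return self.path

    def __exit__(self, exc_type, exc, tb):
        # Destroy all job data, even if the experiment raised.
        shutil.rmtree(self.path, ignore_errors=True)
```

The runner would point the experiment's data root at the yielded path instead of the values in the default config, then let the context manager clean up.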
It's a little painful to pass credentials around to all team members who just want to see experiment results. Should we consider doing a studio ui deploy that pushes the app to the Google App Engine standard environment?
Obviously people run more important apps under their projects, so we should not make TFStudio the default service.
https://cloud.google.com/appengine/docs/standard/python/microservices-on-app-engine
Is it possible to provide a link to the gcloud storage location, like https://console.cloud.google.com/storage/browser/bucket/?project=project ?
Sometimes artifacts are huge, and downloading them just to see results isn't as easy as being able to browse the bucket/directory.
Can we support python 3.5?
"You also need to have Python 2.7 or 3.3+ to run the Google Python Client Library."
https://cloud.google.com/compute/docs/tutorials/python-guide
"Pyrebase was written for python 3 and will not work correctly with python 2."
https://github.com/thisbejim/Pyrebase
Studio-runner prints out a bunch of stuff that is not necessarily useful to the user and may clutter the output of the script being run. It would be nice to have verbosity-controlling flags.
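One conventional shape for such flags, sketched with argparse (the -q/-v flag names and the default WARNING level are assumptions, not the actual studio-runner CLI):

```python
import argparse
import logging


def parse_verbosity(argv):
    """Sketch of -q/-v verbosity flags for studio-runner (proposed, not actual).

    Default is WARNING so the user script's own output isn't cluttered;
    -v raises logging to INFO, -vv to DEBUG, and -q silences everything
    but errors. parse_known_args leaves the user script's arguments alone.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-v', '--verbose', action='count', default=0)
    parser.add_argument('-q', '--quiet', action='store_true')
    args, _ = parser.parse_known_args(argv)
    if args.quiet:
        return logging.ERROR
    return {0: logging.WARNING, 1: logging.INFO}.get(args.verbose, logging.DEBUG)
```

Using parse_known_args here also dovetails with the separate issue about runner flags being accidentally attributed to the user's script.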
Per discussion with @asaliou0809. For remote machines with docker-only access it may not be possible to set up keys and credentials separately and then load them into the container. It would be convenient to be able to bake keys (the firebase authentication key, google application credentials, aws credentials) into the docker image, and then disable loading of those keys. Ideally, the docker image with keys should inherit from the docker image without keys, so users can rebuild it quickly.
Let's say you are running experiments mostly on a cloud / remote machine, and at some point you hit a package version mismatch. Right now that means you have to fix the local python environment first, and only then can you proceed. But the remote / cloud workers install the environment from scratch anyway, so it would be convenient to add an option that allows customizing python packages, say --python-pkg=keras=2.0.5. Note that these packages might need to be installed after the rest of the environment (for their dependencies to be resolved correctly).
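Parsing the proposed flag into pip requirement specs could look like this; the --python-pkg=name=version syntax comes from this issue and is not an existing studio flag:

```python
def parse_pkg_overrides(args):
    """Turn proposed --python-pkg=keras=2.0.5 style overrides into pip specs.

    Hypothetical flag syntax from this issue; the resulting specs would
    be pip-installed after the rest of the environment so that their
    pinned versions win over whatever the base install pulled in.
    """
    specs = []
    prefix = '--python-pkg='
    for arg in args:
        if arg.startswith(prefix):
            name, _, version = arg[len(prefix):].partition('=')
            specs.append('%s==%s' % (name, version) if version else name)
    return specs
```

Installing these last matches the note above about dependency resolution order.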
If we are (re-)using large artifacts (say, the imagenet dataset), it would be cool to have the option of starting the experiment before the artifact download is complete and finishing the download in the background. Of course, the user code then has to check that the particular shard of data (say, an image) is in place before using it.
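The user-side check described above could be as simple as polling for the shard on disk; the function name and polling approach are illustrative (a real implementation might instead consult the downloader's manifest):

```python
import os
import time


def wait_for_shard(path, timeout=300, poll=0.5):
    """Block until a lazily-downloaded artifact shard exists on disk.

    Sketch of the check this issue asks for: the experiment starts before
    the artifact download completes, so code touching a particular shard
    (e.g. one image file) waits for it to appear, with a timeout.
    """
    deadline = time.time() + timeout
    while not os.path.exists(path):
        if time.time() >= deadline:
            raise RuntimeError('shard %s not downloaded in time' % path)
        time.sleep(poll)
    return path
```

Existence of the file is a weak signal (a partially written shard also "exists"), so a real version would want the background downloader to write shards atomically, e.g. via rename.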
@arshak suggested mkdocs. If we are migrating to tensorflow/contrib, do we still need the docs as a separate page? Where should they be hosted?
The firebase storage gets pricey after 5 GB of data; we need an option to use google cloud storage (directly, without the firebase layer) or S3.
When the number of experiments hits ~50-100, loading them one by one (even as simple database reads, with no storage access) takes a few seconds, making the dashboard annoyingly slow to load. We can avoid this by caching experiment data within the FirebaseProvider class.
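A minimal sketch of what such a cache inside FirebaseProvider might look like; the class, the `fetch` callable, and the TTL are assumptions for illustration. Note it exposes invalidate(), which the delete path would need to call to avoid the stale-cache UI failures reported in the deletion issue above:

```python
import time


class ExperimentCache(object):
    """Simple TTL cache to avoid re-reading every experiment per dashboard load.

    Sketch of caching within a provider class: `fetch` is a hypothetical
    callable that does the actual (slow) database read; entries are
    reused until `ttl` seconds old, and invalidate() drops an entry
    when an experiment is deleted.
    """

    def __init__(self, fetch, ttl=60):
        self._fetch = fetch
        self._ttl = ttl
        self._cache = {}

    def get(self, key):
        hit = self._cache.get(key)
        if hit is not None and time.time() - hit[0] < self._ttl:
            return hit[1]  # fresh enough, skip the database read
        value = self._fetch(key)
        self._cache[key] = (time.time(), value)
        return value

    def invalidate(self, key):
        # Must be called on experiment deletion, or the UI keeps seeing it.
        self._cache.pop(key, None)
```

Listing 50-100 experiments then costs one database read per experiment per TTL window instead of one per page load.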
There is usually access to one or more servers (with GPUs, for example) that the user wants to run their jobs on.
Right now one needs to develop locally, check in code, push, pull on the server, make small modifications, and then run there.
Ideally something like studio-runner --worker=my_gpu_server my_job.py
should schedule and execute the job on the server and stream logs (via the db) back to the user.
Per @jasonzliang's request
For studio runner, is there a way to use the compute engine service account ([email protected]) instead of the firebase key (credentials.json)?
Keeping that key secure for short-lived studio runner instances is a bit of a devops headache, whereas the service account credentials are already installed on every instance.