
studio's People

Contributors

andreidenissov-cog, arshak, dependabot[bot], ilblackdragon, jasonzliang, karlmutch, mafia-server, mistobaan, nieoh, nzw0301, pzhokhov, staubda, trdeal


studio's Issues

default_config.yaml should contain the name of the default queue

Users shouldn't have to worry about the names of queues, etc. Can we have a setting in the default_config.yaml with the name of the queue being used?

I don't think each project needs its own queue either. By default, all jobs go to the same queue.

If an expert user wants to side-step this queue they can provide it as a command-line argument.
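A minimal sketch of what this could look like in default_config.yaml (the key and queue names are hypothetical, just to make the proposal concrete):

```yaml
cloud:
  queue: tfstudio-default   # all jobs go here unless overridden

# An expert user would side-step it on the command line, e.g.:
#   studio-runner --queue=my-special-queue my_job.py
```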

Autoscaling vs cloud worker

Using a queue-based autoscaling group could be a nice feature to avoid running GPU instances continuously and possibly to avoid the 10 minute minimum billing overhead every time a new instance is started.

However, the alpha queue-based autoscaler has some undesirable restrictions (https://cloud.google.com/compute/docs/autoscaler/scaling-queue-based).

The biggest issue I see is this sentence from their documentation:

Currently, only topics with a constant message flow (at least 1 per minute) are supported. This issue is being addressed in future releases.

This is obviously not realistic in a data science environment. Another issue is that queue-scaling-acceptable-backlog-per-instance can only be specified as a number of queued tasks, not as a wait time. This may result in a scenario where my job waits indefinitely for enough other jobs to be queued before it can be processed (say the acceptable backlog was set to < 1).

Therefore! I'd like to request that we make the cloud worker smarter and ditch the autoscaler. Could it process multiple jobs when it spins up an instance? Running one job and killing the instance incurs the minimum 10-minute billing increment (https://cloud.google.com/compute/pricing); the other overhead is the time it takes to set up the software environment.

So when a cloud worker instance is spun up it should process a batch of jobs and not be too eager to shut itself off before the 10 minutes are up. Does this make sense @pzhokhov @karlmutch ?
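The batching behavior described above could be sketched roughly like this (a simplified stand-in: the real worker would pull from pub/sub rather than a list, and the timeout value is just illustrative):

```python
import time

def worker_loop(queue, run_job, idle_timeout=600, poll_interval=1.0):
    """Process jobs in batches; only shut down after the queue has been
    idle for idle_timeout seconds (comparable to the 10-minute minimum
    billing increment), amortizing instance start-up over many jobs."""
    last_job_time = time.time()
    while time.time() - last_job_time < idle_timeout:
        if queue:                      # stand-in for a pub/sub pull
            run_job(queue.pop(0))
            last_job_time = time.time()
        else:
            time.sleep(poll_interval)  # wait for more work before giving up
```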

keras-like API for models

Both tensorflow and keras provide great toolsets for training deep learning models. But for using these models, the tools feel a bit sub-par. In particular, a lot of data preprocessing has to be done manually: keras has a handy background generator for reading data, but only supports numeric data types; tensorflow supports non-numeric data types and operations via custom code, but offers no buffering within the graph, and the data types have to be specified upfront (which is very non-pythonic). The latter two constraints stem from automatic differentiation: tensorflow needs to know which variables are back-propagatable, and be able to backprop through them. For inference, we need neither.

The use case that I have in mind feels fairly standard - data comes in as a list of urls and comes out as a dictionary {url: annotation}. The urls have to be downloaded and resized using multiple processes, in parallel with inference (which can be done in batches on gpu). Bad urls have to be handled; also, there may be additional post-processing (again using multiple cpu processes).

The user code should look approximately like this:

import urllib

from PIL import Image
from io import BytesIO
from studio import model_util

mw = model_util.KerasModelWrapper(checkpoint_file)
mw.add_preprocessing(model_util.resize_image_to_input(mw), num_workers=10)
mw.add_preprocessing(lambda bytes: Image.open(BytesIO(bytes)))
mw.add_preprocessing(lambda url: urllib.urlopen(url).read())

output = mw.apply(<list_of_urls>)
output = mw.apply(<generator_of_urls>)
output = mw.apply(<set_of_urls>)

That should add an input pre-processing pipeline with 10 workers filling the inference queue, which read the urls, convert them to image tensors, and resize the tensors to the proper input size (and handle the image dimensions order etc). The items for which preprocessing throws an exception should return None, and they should not be passed to the inference (so that they don't spoil the entire batch of inference).
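The error-handling part of this contract (items whose preprocessing fails become None and are skipped downstream) could be sketched like this; safe_apply is a hypothetical helper, not an existing studio function:

```python
def safe_apply(fn, items):
    """Run one preprocessing step over items; an item whose step raises
    becomes None and is skipped by later stages, so one bad URL does
    not spoil the whole inference batch."""
    out = []
    for item in items:
        if item is None:           # already failed in an earlier stage
            out.append(None)
            continue
        try:
            out.append(fn(item))
        except Exception:
            out.append(None)
    return out
```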

We can also try to write it in more graph-building style, like this:

mw = KerasModelWrapper(checkpoint_file)
mp = ModelPipe()   # analog of keras.models.Sequential
mp.add(lambda url: urllib.urlopen(url).read())
mp.add(lambda bytes: Image.open(BytesIO(bytes)))
mp.add(lambda img: resize_image_to_input(mw)(img), num_workers=10)
mp.add(mw)

output = mp.apply(<list_of_urls>)
output = mp.apply(<generator_of_urls>)
output = mp.apply(<set_of_urls>)

Note that in both cases the first three calls are fused together; a preprocessing queue will only be inserted where the num_workers argument is specified.
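The fusion of consecutive per-item steps into one callable could look like this (a hypothetical helper, just to illustrate the semantics):

```python
def fuse(*steps):
    """Collapse consecutive per-item steps into one callable; a queue
    boundary would only be introduced where num_workers is given."""
    def fused(item):
        for step in steps:
            item = step(item)
        return item
    return fused
```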

I like the second option a bit better because the order of adding operations is more logical and more coherent with keras.

@arshak @ilblackdragon @michael-leece-st @nieoh your thoughts on this are very appreciated :)

Authentication / authorization phase 3

We need user-grained authorization rules, so that one user cannot delete another user's experiments. Firebase does not allow one to create these (unless one has administrator privileges, in which case one can delete all of another user's experiments anyway).
So far, 2 options are available:

  1. Hide firebase behind an API server (the server will handle REST requests like "add / get experiment", "add / get experiment artifact"). The server will have full access to the database and manage access rights for users. Pros: simple. Cons: scalability is difficult; may have security vulnerabilities; requires server maintenance.

  2. Have a firebase app that creates permissions for users on the fly. That is, the user requests experiment creation once, the app sets up permissions for the experiment, and then the user interacts with firebase directly. Pros: scalability is much easier (traffic through the app is small, and it can rely on standard firebase solutions); more secure; less brittle; can run in firebase app engine. Cons: more complex to deploy.
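For option 2, the per-user permissions could translate into Firebase security rules along these lines (the data layout under experiments/$uid is purely hypothetical; the actual database structure may differ):

```json
{
  "rules": {
    "experiments": {
      "$uid": {
        ".read": "auth != null && auth.uid === $uid",
        ".write": "auth != null && auth.uid === $uid"
      }
    }
  }
}
```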

Fix the dependencies

Tried a clean install from master in a virtualenv - hit some package version problems with the googleapi packages.

FirebaseAuth error

Getting the following when I try to log in:

email:[email protected]
password:
Traceback (most recent call last):
  File "/Users/arshak.navruzyan/miniconda2/bin/studio", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/Users/arshak.navruzyan/studio/studio/scripts/studio", line 2, in <module>
    from studio import studio
  File "/Users/arshak.navruzyan/studio/studio/studio.py", line 7, in <module>
    db_provider = model.get_db_provider()
  File "/Users/arshak.navruzyan/studio/studio/model.py", line 322, in get_db_provider
    return FirebaseProvider(db_config)
  File "/Users/arshak.navruzyan/studio/studio/model.py", line 77, in __init__
    self.auth = FirebaseAuth(app)
  File "/Users/arshak.navruzyan/studio/studio/auth.py", line 18, in __init__
    self._update_user()
  File "/Users/arshak.navruzyan/studio/studio/auth.py", line 29, in _update_user
    self.user = self.firebase.auth().sign_in_with_email_and_password(email, password)
  File "/Users/arshak.navruzyan/miniconda2/lib/python2.7/site-packages/pyrebase/pyrebase.py", line 85, in sign_in_with_email_and_password
    raise_detailed_error(request_object)
  File "/Users/arshak.navruzyan/miniconda2/lib/python2.7/site-packages/pyrebase/pyrebase.py", line 448, in raise_detailed_error
    raise HTTPError(e, request_object.text)
requests.exceptions.HTTPError: [Errno 400 Client Error: Bad Request for url: https://www.googleapis.com/identitytoolkit/v3/relyingparty/verifyPassword?key=AIzaSyCLQbp5X2B4SWzBw-sz9rUnGHNSdMl0Yx8] {
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "invalid",
        "message": "EMAIL_NOT_FOUND"
      }
    ],
    "code": 400,
    "message": "EMAIL_NOT_FOUND"
  }
}

Exception AttributeError: "'FirebaseAuth' object has no attribute 'sched'" in <bound method FirebaseAuth.__del__ of <studio.auth.FirebaseAuth object at 0x112800bd0>> ignored

Add PyPI support for pip install

As an experimenter, or owner of a python runner deployment
I want to install TFStudio from a well known public package repository
In order that deployment for TFStudio can be curated, automated and version managed

Notes

Using arbitrary naming inside PyPI (https://pypi.python.org/pypi) until longer-term decisions are made.

disable capturing of a default artifact

Per @nieoh's request - sometimes the working folder may be too big, and the user might not want to capture it. We could add something like --capture=null:workspace to disable workspace capture.
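Parsing such flags could look roughly like this (parse_capture and the 'null' convention are hypothetical, mirroring the proposal above):

```python
def parse_capture(values):
    """Parse repeated --capture=<path>:<tag> flag values; the special
    path 'null' disables capture of that artifact entirely."""
    captured, disabled = {}, set()
    for value in values:
        path, tag = value.split(':', 1)
        if path == 'null':
            disabled.add(tag)
        else:
            captured[tag] = path
    return captured, disabled
```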

read config values from environment variables

Per @karlmutch's request,
add config values that refer to environment variables. The example use case:
serviceAccount: $GOOGLE_APPLICATION_CREDENTIALS should read the location of the service account credentials JSON from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
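A minimal sketch of how the config loader could do this (expand_env is a hypothetical helper; os.path.expandvars leaves unknown variables untouched, which seems like reasonable fallback behavior):

```python
import os

def expand_env(config):
    """Recursively replace $VAR references in config values with the
    corresponding environment variables."""
    if isinstance(config, dict):
        return {k: expand_env(v) for k, v in config.items()}
    if isinstance(config, str):
        return os.path.expandvars(config)
    return config
```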

default_config.yaml location

When we get to a pip-installable version of the app, we should think about where default_config.yaml lives. I'm sort of in favor of how keras does it (~/.keras/keras.json); maybe we can follow a similar convention, e.g. ~/.tfstudio/tfstudio.yaml.

Add support for a TFSTUDIO_HOME env var

Currently all experiment data is rooted at ~/.tfstudio/.... In order to run on shared infrastructure, the default values inside the JSON need to be ignored, and a scratch $HOME equivalent created for each pubsub job being received. This will allow the runner to destroy all data when the keras or TF experiment is done. Currently it is the python code that is responsible for pushing results back to storage; we need a better way, as experiments need to run completely cloud-agnostic, including pushing results back.
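Resolving the root directory could be as simple as the following sketch (studio_home is a hypothetical helper name):

```python
import os

def studio_home():
    """Root for experiment data: honor TFSTUDIO_HOME if set, otherwise
    fall back to ~/.tfstudio, so a shared runner can point each pubsub
    job at a disposable scratch directory."""
    return os.environ.get('TFSTUDIO_HOME',
                          os.path.join(os.path.expanduser('~'), '.tfstudio'))
```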

Add verbosity levels to studio-runner

Studio-runner prints out a bunch of stuff that is not necessarily useful to the user and may clutter the output of the script being run. It would be nice to have verbosity-controlling flags.
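One common convention, sketched here with hypothetical flag semantics (repeated -v flags raise the log level from WARNING up to DEBUG):

```python
import logging

def configure_verbosity(verbose_count):
    """Map repeated -v flags to logging levels: default WARNING,
    -v INFO, -vv (or more) DEBUG."""
    level = {0: logging.WARNING, 1: logging.INFO}.get(verbose_count,
                                                      logging.DEBUG)
    logging.basicConfig(level=level)
    return level
```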

make a recipe to bake credentials into a docker image

Per discussion with @asaliou0809. For remote machines with docker-only access, it may not be possible to set up keys and credentials separately and then load them into the container. It would be convenient to be able to bake keys (firebase authentication key, google application credentials, aws credentials) into the docker image and then disable loading of those keys. Ideally, the docker image with keys should inherit from the docker image without keys, so it can be rebuilt quickly by users.
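The layering could look roughly like this (image name and credential file names are hypothetical; such an image must of course be kept in a private registry):

```dockerfile
# Keyed image inherits from the keyless base, so users rebuild only
# this thin credentials layer.
FROM tfstudio-base:latest

COPY firebase_key.json /credentials/firebase_key.json
COPY gcloud_credentials.json /credentials/gcloud_credentials.json
ENV GOOGLE_APPLICATION_CREDENTIALS=/credentials/gcloud_credentials.json
```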

Customize python environment from the command line

Let's say you are running experiments mostly on a cloud / remote machine, and at some point you hit a package version mismatch. Right now that means you have to fix the local python environment first, and only then can you proceed. But the remote / cloud workers install the environment from scratch anyway, so it would be convenient to add an option that allows customizing python packages, say --python-pkg=keras=2.0.5. Note that these packages might need to be installed after the rest of the environment (for their dependencies to be processed correctly).

large artifacts and streaming

If we are (re-)using large artifacts (say, the imagenet dataset), it would be cool to have the option of starting the experiment before the download of the artifact is complete and finishing the download in the background. Of course, the user code then has to check whether a particular shard of data (say, an image) is in place before using it.
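The check in user code could be as simple as polling for the shard on disk while a background thread finishes the download (wait_for_shard is a hypothetical helper sketching the idea):

```python
import os
import time

def wait_for_shard(path, timeout=300, poll_interval=1.0):
    """Block until a shard of a streamed artifact has landed on disk,
    or give up after timeout seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        if os.path.exists(path):
            return True
        time.sleep(poll_interval)
    return False
```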

caching of experiment info

When the number of experiments hits ~50-100, loading them one by one (even simple database reads, with no storage access) takes as long as a few seconds, making the dashboard annoyingly slow to load. We can avoid this by caching the data about experiments within the FirebaseProvider class.
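A minimal sketch of such a cache (class and method names are hypothetical; in practice it would wrap FirebaseProvider's raw reads with a time-to-live):

```python
import time

class ExperimentCache(object):
    """Cache experiment records so the dashboard does not re-read
    every experiment from the database on each page load."""

    def __init__(self, fetch, ttl=60):
        self._fetch = fetch      # e.g. the provider's raw database read
        self._ttl = ttl          # seconds before an entry goes stale
        self._cache = {}

    def get(self, key):
        entry = self._cache.get(key)
        if entry and time.time() - entry[1] < self._ttl:
            return entry[0]      # fresh enough, skip the database
        value = self._fetch(key)
        self._cache[key] = (value, time.time())
        return value
```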

Remote worker

There is usually access to 1 or more servers [with GPUs, for example] that a user wants to run their jobs on.

Right now one needs to develop locally, check in code, push, pull on the server, make small modifications, and then run it there.

Ideally, something like studio-runner --worker=my_gpu_server my_job.py should schedule and execute the job on the server and stream logs (via the db) back to the user.

Compute engine service account instead of credentials.json

For studio runner, is there a way to use the compute engine service account ([email protected]) instead of the firebase key (credentials.json)?

Keeping that key secure for short lived studio runner instances is a bit of a devops headache whereas the service account credentials are already installed on every instance.
