studioml / studio
Studio: Simplify and expedite model building process
Home Page: https://studio.ml
License: Apache License 2.0
Users shouldn't have to worry about the names of queues, etc. Can we have a setting in default_config.yaml with the name of the queue being used?
I don't think each project needs its own queue either. By default, all jobs should go to the same queue.
If an expert user wants to side-step this queue, they can provide one as a command-line argument.
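A hedged sketch of what such a setting might look like in default_config.yaml (the key name and default value here are assumptions, not the actual schema):

```yaml
# default_config.yaml -- hypothetical entry; key name not confirmed
queue: studioml-default    # all jobs go to this queue unless overridden
```

Expert users could then side-step it with something like a --queue command-line flag (flag name also hypothetical).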
Illia suggested some interesting approaches:
https://github.com/gstaff/tfzip
https://petewarden.com/2016/05/03/how-to-quantize-neural-networks-with-tensorflow/
Per @arshak's request
By @karlmutch's request
Using a queue-based autoscaling group could be a nice feature: it would avoid running GPU instances continuously, and possibly avoid the 10-minute minimum billing overhead every time a new instance is started.
However, the alpha queue-based autoscaler has some undesirable restrictions (https://cloud.google.com/compute/docs/autoscaler/scaling-queue-based).
The biggest issue I see is this sentence from their documentation:
"Currently, only topics with a constant message flow (at least 1 per minute) are supported. This issue is being addressed in future releases."
This is obviously not realistic in a data science environment, where job submission is bursty. Another issue is that queue-scaling-acceptable-backlog-per-instance can only be specified as a number of queued tasks, not as a wait time. This may result in a scenario where my job waits indefinitely for enough other jobs to be queued before it can be processed (let's say the acceptable backlog was set so that a single job never triggers a scale-up).
Therefore, I'd like to request we make the cloud worker smarter and ditch the autoscaler. Can it process multiple jobs when it spins up an instance? Running one job and killing the instance incurs the minimum 10-minute billing increment (https://cloud.google.com/compute/pricing), plus the overhead of setting up the software environment each time.
So when a cloud worker instance is spun up, it should process a batch of jobs and not be too eager to shut itself off before the 10 minutes are up. Does this make sense @pzhokhov @karlmutch ?
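The batching behaviour described above could be sketched roughly like this (the `run_job` callable and the worker loop are hypothetical, not the actual studio worker API):

```python
import queue


def drain_jobs(job_queue, run_job, idle_timeout=600):
    """Process queued jobs in a batch instead of one job per instance.

    Keeps pulling work until the queue has been empty for `idle_timeout`
    seconds (roughly the 10-minute minimum billing increment), then
    returns so the instance can shut itself down. `run_job` is a
    hypothetical callable; the real worker API may differ.
    """
    done = 0
    while True:
        try:
            job = job_queue.get(timeout=idle_timeout)
        except queue.Empty:
            return done  # idle long enough -> safe to terminate
        run_job(job)
        done += 1
```

The point of the sketch is that the per-instance setup and billing overhead gets amortized over many jobs, and the instance only dies after a genuinely idle period.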
Both tensorflow and keras provide great toolsets for training deep learning models. But for using these models, the tools feel a bit sub-par. In particular, a lot of data preprocessing has to be done manually: keras has handy background generator reading, but only supports numeric data types; tensorflow supports non-numeric data types and operations with custom code, but offers no buffering within the graph, and the data types have to be specified upfront (which is very non-pythonic).
The latter two limitations are inherently tied to automatic differentiation - basically, tensorflow needs to know which variables are back-propagatable, and be able to backprop through them. But for inference we need neither.
The use case that I have in mind feels fairly standard - data comes in as a list of urls and comes out as a dictionary
{url: annotation}.
The urls have to be downloaded and resized using multiple processes, in parallel with inference (which can be done in batches on a gpu). Bad urls have to be handled, and there may be additional post-processing (also using multiple cpu processes).
The user code should look approximately like this:
import urllib
from io import BytesIO
from PIL import Image
from studio import model_util

mw = model_util.KerasModelWrapper(checkpoint_file)
mw.add_preprocessing(model_util.resize_image_to_input(mw), num_workers=10)
mw.add_preprocessing(lambda bytes: Image.open(BytesIO(bytes)))
mw.add_preprocessing(lambda url: urllib.urlopen(url).read())
output = mw.apply(<list_of_urls>)
output = mw.apply(<generator_of_urls>)
output = mw.apply(<set_of_urls>)
That should add an input pre-processing pipeline with 10 workers filling the inference queue: they read the urls, convert them to image tensors, and resize the tensors to the proper input size (handling image dimension order, etc.). Items for which preprocessing throws an exception should return None, and should not be passed to inference (so that they don't spoil the entire inference batch).
We can also try to write it in a more graph-building style, like this:
mw = KerasModelWrapper(checkpoint_file)
mp = ModelPipe() # analog of keras.models.Sequential
mp.add(lambda url: urllib.urlopen(url).read())
mp.add(lambda bytes: Image.open(BytesIO(bytes)))
mp.add(lambda img: resize_image_to_input(mw)(img), num_workers=10)
mp.add(mw)
output = mp.apply(<list_of_urls>)
output = mp.apply(<generator_of_urls>)
output = mp.apply(<set_of_urls>)
Note that in both cases the first three calls are fused together, and a preprocessing queue will only be inserted when the num_workers argument is specified.
I like the second option a bit better because the order of adding operations is more logical and more coherent with keras.
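A minimal toy sketch of the ModelPipe semantics proposed above - fused stages, and items whose preprocessing raises mapping to None instead of reaching the model. This is an illustration of the proposal, not the actual studio implementation, and it runs serially (num_workers is accepted only for API parity):

```python
class ModelPipe(object):
    """Toy sketch of the proposed pipeline API (not the real implementation).

    Stages added with add() are fused and run inline, in order; a stage
    that raises maps its item to None, so bad inputs never spoil a batch.
    """

    def __init__(self):
        self._stages = []

    def add(self, func, num_workers=None):
        # num_workers is accepted for API parity; this sketch runs serially.
        self._stages.append(func)
        return self

    def apply(self, inputs):
        out = {}
        for item in inputs:
            value = item
            try:
                for stage in self._stages:
                    value = stage(value)
            except Exception:
                value = None  # failed preprocessing -> None, skip inference
            out[item] = value
        return out
```

A real version would insert a multiprocessing queue between the fused preprocessing stages and the model whenever num_workers is given, as the issue describes.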
@arshak @ilblackdragon @michael-leece-st @nieoh your thoughts on this are very appreciated :)
The UI uses materialize, but we need to have the right divs in the html to get decent formatting. Maybe we can use the structure of one of the materialize templates, like:
http://materializecss.com/templates/starter-template/preview.html
http://materializecss.com/themes.html
We just need to follow the standard materialize hierarchy in each section.
@pzhokhov grab me if you want to spend 5-10 minutes to fix this.
When deleting experiments, they are not being deleted from the cache, causing the UI to think that they still exist and creating failures.
We need per-user authorization rules, so that one user cannot delete the experiments of another. Firebase does not allow one to create these (unless one has administrator privileges, in which case one can delete all experiments of another user anyway).
So far, two options are available:
1. Hide firebase behind an API server (the server will handle REST requests like "add / get experiment", "add / get experiment artifact"). The server will have full access to the database and manage access rights for users. Pros: simple. Cons: scalability is difficult; may have security vulnerabilities; requires server maintenance.
2. Have a firebase app that creates permissions for users on the fly, i.e. the user requests experiment creation once, the app sets up permissions for the experiment, and then the user interacts with firebase directly. Pros: scalability is much easier (traffic through the app is small, and it can rely on standard firebase solutions); more secure; less brittle; can run in firebase app engine. Cons: more complex to deploy.
Tried a clean install of master in a virtualenv - there are some package version problems for the googleapi packages.
Ensure that arguments to the script are not being accidentally attributed to the runner
Getting the following when I try to log in:
email:[email protected]
password:
Traceback (most recent call last):
  File "/Users/arshak.navruzyan/miniconda2/bin/studio", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/Users/arshak.navruzyan/studio/studio/scripts/studio", line 2, in <module>
    from studio import studio
  File "/Users/arshak.navruzyan/studio/studio/studio.py", line 7, in <module>
    db_provider = model.get_db_provider()
  File "/Users/arshak.navruzyan/studio/studio/model.py", line 322, in get_db_provider
    return FirebaseProvider(db_config)
  File "/Users/arshak.navruzyan/studio/studio/model.py", line 77, in __init__
    self.auth = FirebaseAuth(app)
  File "/Users/arshak.navruzyan/studio/studio/auth.py", line 18, in __init__
    self._update_user()
  File "/Users/arshak.navruzyan/studio/studio/auth.py", line 29, in _update_user
    self.user = self.firebase.auth().sign_in_with_email_and_password(email, password)
  File "/Users/arshak.navruzyan/miniconda2/lib/python2.7/site-packages/pyrebase/pyrebase.py", line 85, in sign_in_with_email_and_password
    raise_detailed_error(request_object)
  File "/Users/arshak.navruzyan/miniconda2/lib/python2.7/site-packages/pyrebase/pyrebase.py", line 448, in raise_detailed_error
    raise HTTPError(e, request_object.text)
requests.exceptions.HTTPError: [Errno 400 Client Error: Bad Request for url: https://www.googleapis.com/identitytoolkit/v3/relyingparty/verifyPassword?key=AIzaSyCLQbp5X2B4SWzBw-sz9rUnGHNSdMl0Yx8] {
  "error": {
    "errors": [
      {
        "domain": "global",
        "reason": "invalid",
        "message": "EMAIL_NOT_FOUND"
      }
    ],
    "code": 400,
    "message": "EMAIL_NOT_FOUND"
  }
}
Exception AttributeError: "'FirebaseAuth' object has no attribute 'sched'" in <bound method FirebaseAuth.__del__ of <studio.auth.FirebaseAuth object at 0x112800bd0>> ignored
As an experimenter, or owner of a python runner deployment
I want to install TFStudio from a well known public package repository
In order that deployment for TFStudio can be curated, automated and version managed
Notes
Using arbitrary naming inside PyPI (https://pypi.python.org/pypi) until longer-term decisions are made.
Per @nieoh's request - sometimes the working folder may be too big and the user might not want to capture it. We could add something like --capture=null:workspace to disable workspace capture.
Per @karlmutch's request: add config values that refer to environment variables. The example use case is
serviceAccount: $GOOGLE_APPLICATION_CREDENTIALS
which should read the location of the service account credentials JSON from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
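The requested behaviour could be sketched with the standard library's os.path.expandvars; whether the real loader expands all values or only specific keys such as serviceAccount is an open design question, so this is only an illustration:

```python
import os


def expand_env_values(config):
    """Recursively expand $VAR references in string config values.

    Sketch of the requested feature: any string value containing a
    $VARIABLE reference is replaced using the current environment.
    Nested dicts are handled; non-string values pass through untouched.
    """
    if isinstance(config, dict):
        return {k: expand_env_values(v) for k, v in config.items()}
    if isinstance(config, str):
        return os.path.expandvars(config)
    return config
```

Note that os.path.expandvars leaves unknown variables as-is, which conveniently avoids clobbering values when an env variable is unset.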
When we get to a pip-installable version of the app, we should think about where default_config.yaml lives. I'm sort of in favor of how keras does it (~/.keras/keras.json); maybe we can follow a similar convention: ~/.tfstudio/tfstudio.yaml.
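The keras-style lookup suggested above could be sketched as follows; both the path and the fallback behaviour are a proposal, not current studio behaviour:

```python
import os


def config_file_path(packaged_default='default_config.yaml'):
    """Resolve the user config file, keras-style (a proposed convention).

    Prefer ~/.tfstudio/tfstudio.yaml when it exists; otherwise fall back
    to the default_config.yaml shipped with the package (the fallback
    path here is a placeholder).
    """
    user_path = os.path.join(os.path.expanduser('~'), '.tfstudio', 'tfstudio.yaml')
    return user_path if os.path.exists(user_path) else packaged_default
```

This mirrors how keras resolves ~/.keras/keras.json and falls back to built-in defaults when the user file is absent.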
Currently all experiment data is rooted at ~/.tfstudio/.... In order to run on shared infrastructure, the default values inside the JSON need to be ignored and a scratch $HOME equivalent created for each pubsub job being received. This will allow the runner to destroy all data when the keras or TF experiment is done. Currently the python code is responsible for pushing results back to storage; we need a better way, as experiments need to run completely cloud-agnostically, including pushing results back.
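The per-job scratch $HOME equivalent described above could be sketched as a context manager; the class name and prefix are illustrative only:

```python
import shutil
import tempfile


class ScratchHome(object):
    """Context manager giving each pubsub job its own scratch $HOME equivalent.

    Sketch of the proposal: experiment data is rooted in a throwaway
    directory instead of ~/.tfstudio, and everything is destroyed when
    the experiment finishes, so nothing leaks between jobs on shared
    infrastructure.
    """

    def __init__(self, prefix='studioml-job-'):
        self.path = None
        self._prefix = prefix

    def __enter__(self):
        self.path = tempfile.mkdtemp(prefix=self._prefix)
        return self.path

    def __exit__(self, exc_type, exc, tb):
        # Destroy all job data, even if the experiment raised.
        shutil.rmtree(self.path, ignore_errors=True)
```

The runner would point the experiment's data root at the yielded path instead of the values in the default config, then let the context manager clean up.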
It's a little painful to pass credentials around to all team members who just want to see experiment results. Should we consider doing a studio ui deploy that pushes the app to the Google App Engine standard environment?
Obviously people run more important apps under their projects, so we should not make TFStudio the default service.
https://cloud.google.com/appengine/docs/standard/python/microservices-on-app-engine
Is it possible to provide a link to the gcloud storage location, like https://console.cloud.google.com/storage/browser/bucket/?project=project ?
Sometimes artifacts are huge, and downloading them just to see results isn't as easy as being able to browse the bucket/directory.
Can we support python 3.5?
"You also need to have Python 2.7 or 3.3+ to run the Google Python Client Library."
https://cloud.google.com/compute/docs/tutorials/python-guide
"Pyrebase was written for python 3 and will not work correctly with python 2."
https://github.com/thisbejim/Pyrebase
Studio-runner prints out a bunch of stuff that is not necessarily useful to the user and may clutter the output of the script being run. It would be nice to have verbosity-controlling flags.
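One conventional shape for such flags, sketched with argparse (the -q/-v flag names and the default WARNING level are assumptions, not the actual studio-runner CLI):

```python
import argparse
import logging


def parse_verbosity(argv):
    """Sketch of -q/-v verbosity flags for studio-runner (proposed, not actual).

    Default is WARNING so the user script's own output isn't cluttered;
    -v raises logging to INFO, -vv to DEBUG, and -q silences everything
    but errors. parse_known_args leaves the user script's arguments alone.
    """
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument('-v', '--verbose', action='count', default=0)
    parser.add_argument('-q', '--quiet', action='store_true')
    args, _ = parser.parse_known_args(argv)
    if args.quiet:
        return logging.ERROR
    return {0: logging.WARNING, 1: logging.INFO}.get(args.verbose, logging.DEBUG)
```

Using parse_known_args here also dovetails with the separate issue about runner flags being accidentally attributed to the user's script.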
Per discussion with @asaliou0809. For remote machines with docker-only access it may not be possible to set up keys and credentials separately and then load them into the container. It would be convenient to be able to bake keys (the firebase authentication key, google application credentials, aws credentials) into the docker image, and then disable loading of those keys. Ideally, the docker image with keys should inherit from the docker image without keys, so users can rebuild it quickly.
Let's say you are running experiments mostly on a cloud / remote machine, and at some point you hit a package version mismatch. Right now that means you have to fix the local python environment first, and only then can you proceed. But the remote / cloud workers install the environment from scratch anyway, so it would be convenient to add an option that allows customizing python packages, say --python-pkg=keras=2.0.5. Note that these packages might need to be installed after the rest of the environment (for their dependencies to be resolved correctly).
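Parsing the proposed flag into pip requirement specs could look like this; the --python-pkg=name=version syntax comes from this issue and is not an existing studio flag:

```python
def parse_pkg_overrides(args):
    """Turn proposed --python-pkg=keras=2.0.5 style overrides into pip specs.

    Hypothetical flag syntax from this issue; the resulting specs would
    be pip-installed after the rest of the environment so that their
    pinned versions win over whatever the base install pulled in.
    """
    specs = []
    prefix = '--python-pkg='
    for arg in args:
        if arg.startswith(prefix):
            name, _, version = arg[len(prefix):].partition('=')
            specs.append('%s==%s' % (name, version) if version else name)
    return specs
```

Installing these last matches the note above about dependency resolution order.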
If we are (re-)using large artifacts (say, the imagenet dataset), it would be cool to have the option of starting the experiment before the artifact download is complete and finishing the download in the background. Of course, the user code then has to check that the particular shard of data (say, an image) is in place before using it.
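The user-side check described above could be as simple as polling for the shard on disk; the function name and polling approach are illustrative (a real implementation might instead consult the downloader's manifest):

```python
import os
import time


def wait_for_shard(path, timeout=300, poll=0.5):
    """Block until a lazily-downloaded artifact shard exists on disk.

    Sketch of the check this issue asks for: the experiment starts before
    the artifact download completes, so code touching a particular shard
    (e.g. one image file) waits for it to appear, with a timeout.
    """
    deadline = time.time() + timeout
    while not os.path.exists(path):
        if time.time() >= deadline:
            raise RuntimeError('shard %s not downloaded in time' % path)
        time.sleep(poll)
    return path
```

Existence of the file is a weak signal (a partially written shard also "exists"), so a real version would want the background downloader to write shards atomically, e.g. via rename.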
@arshak suggested mkdocs. If we are migrating to tensorflow/contrib, do we still need the docs as a separate page? Where should they be hosted?
The firebase storage gets pricey after 5 GB of data; we need an option to use google cloud storage (directly, without the firebase layer) or S3.
When the number of experiments hits ~50-100, loading them one by one (even as simple database reads, with no storage access) takes a few seconds, making the dashboard annoyingly slow to load. We can avoid this by caching experiment data within the FirebaseProvider class.
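A minimal sketch of what such a cache inside FirebaseProvider might look like; the class, the `fetch` callable, and the TTL are assumptions for illustration. Note it exposes invalidate(), which the delete path would need to call to avoid the stale-cache UI failures reported in the deletion issue above:

```python
import time


class ExperimentCache(object):
    """Simple TTL cache to avoid re-reading every experiment per dashboard load.

    Sketch of caching within a provider class: `fetch` is a hypothetical
    callable that does the actual (slow) database read; entries are
    reused until `ttl` seconds old, and invalidate() drops an entry
    when an experiment is deleted.
    """

    def __init__(self, fetch, ttl=60):
        self._fetch = fetch
        self._ttl = ttl
        self._cache = {}

    def get(self, key):
        hit = self._cache.get(key)
        if hit is not None and time.time() - hit[0] < self._ttl:
            return hit[1]  # fresh enough, skip the database read
        value = self._fetch(key)
        self._cache[key] = (time.time(), value)
        return value

    def invalidate(self, key):
        # Must be called on experiment deletion, or the UI keeps seeing it.
        self._cache.pop(key, None)
```

Listing 50-100 experiments then costs one database read per experiment per TTL window instead of one per page load.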
There is usually access to one or more servers (with GPUs, for example) that the user wants to run their jobs on.
Right now one needs to develop locally, check in code, push, pull on the server, make small modifications, and then run there.
Ideally something like studio-runner --worker=my_gpu_server my_job.py
should schedule and execute the job on the server and stream logs (via the db) back to the user.
Per @jasonzliang's request
For studio runner, is there a way to use the compute engine service account ([email protected]) instead of the firebase key (credentials.json)?
Keeping that key secure for short-lived studio runner instances is a bit of a devops headache, whereas the service account credentials are already installed on every instance.