examples's Introduction

Notice

Blog post: HELP WANTED: Repackaging Kaggle Getting Started into Kubeflow Examples

Highlights:

  • We'd like to help bolster the kubeflow/examples repo
  • Help people get involved in open source/kubeflow project/community
  • Give people an opportunity to make a little side hustle income

kubeflow-examples

A repository to share extended Kubeflow examples and tutorials to demonstrate machine learning concepts, data science workflows, and Kubeflow deployments. The examples illustrate the happy path, acting as a starting point for new users and a reference guide for experienced users.

This repository is home to the following types of examples and demos:

End-to-end

Author: Sascha Heyer

This example covers the following concepts:

  1. Build reusable pipeline components
  2. Run Kubeflow Pipelines with Jupyter notebooks
  3. Train a Named Entity Recognition model on a Kubernetes cluster
  4. Deploy a Keras model to AI Platform
  5. Use Kubeflow metrics
  6. Use Kubeflow visualizations

Author: Hamel Husain

This example covers the following concepts:

  1. Natural Language Processing (NLP) with Keras and Tensorflow
  2. Connecting to Jupyterhub
  3. Shared persistent storage
  4. Training a Tensorflow model
    1. CPU
    2. GPU
  5. Serving with Seldon Core
  6. Flask front-end

Author: Nick Harvey & Daniel Whitenack

This example covers the following concepts:

  1. A production pipeline for pre-processing, training, and model export
  2. CI/CD for model binaries, building and deploying a docker image for serving in Seldon
  3. Full tracking of what data produced which model, and what model is being used for inference
  4. Automatic updates of models based on changes to training data or code
  5. Training with single node Tensorflow and distributed TF-jobs

Author: David Sabater

This example covers the following concepts:

  1. Distributed Data Parallel (DDP) training with Pytorch on CPU and GPU
  2. Shared persistent storage
  3. Training a Pytorch model
    1. CPU
    2. GPU
  4. Serving with Seldon Core
  5. Flask front-end

Author: Elson Rodriguez

This example covers the following concepts:

  1. Image recognition of handwritten digits
  2. S3 storage
  3. Training automation with Argo
  4. Monitoring with Argo UI and Tensorboard
  5. Serving with Tensorflow

Author: Daniel Castellanos

This example covers the following concepts:

  1. Gathering and preparing the data for model training using K8s jobs
  2. Using Kubeflow tf-job and tf-operator to launch a distributed object training job
  3. Serving the model through Kubeflow's tf-serving

Author: Sven Degroote

This example covers the following concepts:

  1. Deploying Kubeflow to a GKE cluster
  2. Exploration via JupyterHub (prospect data, preprocess data, develop ML model)
  3. Training several tensorflow models at scale with TF-jobs
  4. Deploy and serve with TF-serving
  5. Iterate training and serving
  6. Training on GPU
  7. Using Kubeflow Pipelines to automate ML workflow

Author: Zane Durante

This example covers the following concepts:

  1. How to create pipeline components from python functions in jupyter notebook
  2. How to compile and run a pipeline from jupyter notebook

Author: Dan Sanche and Jin Chi He

This example covers the following concepts:

  1. Run the MNIST Pipelines sample on Google Cloud Platform (GCP).
  2. Run the MNIST Pipelines sample on an on-premises cluster.

Component-focused

Author: Puneith Kaul

This example covers the following concepts:

  1. Training an XGBoost model
  2. Shared persistent storage
  3. GCS and GKE
  4. Serving with Seldon Core

Demos

Demos are for showing Kubeflow or one of its components publicly, with the intent of highlighting product vision, not necessarily teaching. In contrast, the goal of the examples is to provide a self-guided walkthrough of Kubeflow or one of its components, for the purpose of teaching you how to install and use the product.

In an example, all commands should be embedded in the process and explained. In a demo, most details should be done behind the scenes, to optimize for on-stage rhythm and limited timing.

You can find the demos in the /demos directory.

Third-party hosted

Source Example Description

Get Involved

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

The Kubeflow community is guided by our Code of Conduct, which we encourage everybody to read before participating.

examples's People

Contributors

abcdefgs0324, activatedgeek, amygdala, cwbeitel, daniel-sanche, dependabot[bot], dsdinter, elsonrodriguez, gabrielwen, govindkag, hougangliu, iancoffey, ironpan, jinchihe, jlewi, josepholaide, js-ts, kbthu, kimwnasptd, ldcastell, lluunn, neokish, oblynx, puneith, richardsliu, sarahmaddox, svendegroote91, texasmichelle, tomcli, zhenghuiwang

examples's Issues

GCR Registry examples

Some examples are publishing Docker images. We should probably create a GCR registry to host these.

Example:
GitHub issue example referring to an image in gcr.io/agwl-kubeflow

[GH Issue] E2E test

We should add E2E testing for the GH issue example to make sure it doesn't break.

Some things to test

  • Make sure we can deploy all the components using ksonnet
  • Make sure training runs (just run a couple steps)
  • Test that the predict RPC generates responses

DeepVariant Example

DeepVariant might make an interesting example.

This could be a nice example to illustrate various aspects such as

  • Preprocessing
  • Pipelines
  • Large scale training

[GH Example] Friction points from Katacoda

Katacoda recently created a scenario out of the GitHub issue summarization example and ran into a number of frustrating issues. The following items need to be addressed in order to turn this example into a platform-independent self-contained unit for use in a wide number of environments.

[GH Label Prediction] Extend GH Issue Summarization to Predict Labels

Should we extend our example on GH issue summarization to predict issue labels?

One of the points of the original blog post was to train useful word embeddings on the entire corpus. So we could potentially use this to learn features that would then allow us to train models specific to repositories/orgs which likely have less data and their own taxonomy of labels.

Predicting issue labels would be useful for creating examples that highlight model analysis tooling. A lot of model analysis tools assume you can compute metrics like true positive/true negative etc... If we predict labels we can use actual labels to compute this.

With our existing text summarization example there's no obvious way to compute whether a summarization is accurate or not, which limits our ability to use it as an example of model analysis.
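
To make the point about label-based metrics concrete, here is a minimal sketch of computing true/false positive and negative counts, plus precision and recall, for one label. It assumes each issue has exactly one actual and one predicted label, which is a simplification since GitHub issues can carry several labels. The function names and the sample labels are made up for illustration.

```python
def confusion_counts(actual, predicted, label):
    """Count TP, FP, FN, TN for one label, treating it as the positive class."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == label:
            counts["tp" if a == label else "fp"] += 1
        else:
            counts["fn" if a == label else "tn"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall), using 0.0 when a denominator is zero."""
    predicted_pos = counts["tp"] + counts["fp"]
    actual_pos = counts["tp"] + counts["fn"]
    precision = counts["tp"] / predicted_pos if predicted_pos else 0.0
    recall = counts["tp"] / actual_pos if actual_pos else 0.0
    return precision, recall


# Made-up labels for four issues; not taken from any real repository.
actual = ["bug", "feature", "bug", "question"]
predicted = ["bug", "bug", "feature", "question"]
counts = confusion_counts(actual, predicted, "bug")
print(counts, precision_recall(counts))
```

Nothing comparable is available for free-text summaries, which is the limitation described above.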

This would also be useful as a proxy for a large class of ecommerce problems where the goal is to dedupe related posts (e.g. different ebay postings for the same product).

@hamelsmu Any idea how difficult this would be and whether it would be valuable?

[Agents RL] Demonstrate Kubeflow with an E2E RL example

The purpose of this example is to showcase the benefits of the Kubeflow infrastructure in training a reinforcement learning agent.

Core tasks:

  • Case study described that communicates the business value of the example; who will care about this example and why?
  • Illustrate the config, submit, monitor, render workflow for single-node training
  • Prow test verifies model trains in notebook container
  • Illustration of practice for building and pushing containers efficiently
  • Distributed training with TFJob operator (e.g. using @danijar's idea)
  • Illustration of simple hyperparameter tuning
  • Uses accelerators

Optionally:

  • Build a custom gym environment that captures a business problem of interest, e.g. reinforcement learning in the context of datacenter cooling, scheduling, hyperparameter tuning, etc.
  • Deploy the agent and custom environment, e.g. if this environment concerns kubernetes scheduling then use it to schedule resources on a cluster and measure whether there was a benefit

/cc @nkashy1 @danijar @aronchick @jlewi

Linting instructions

Create instructions for running pylint locally and troubleshooting presubmit failures that prevent PRs from being merged. Follow-on from merged PR #61.

Document which tools to use locally (tf-operator describes yapf here) and how to find and use the versions in our test infrastructure. Since the Prow UI only shows file-level granularity, explain how to access the test cluster, find the right pods, & view the logs containing line-level failure details.

Katacoda scenario on github summarization example; friction log

Running through the scenario

step 1: no issue

step 2:

  • [Major issue] Creating GCP token (secret) not working --> error: invalid literal source test, expected key=value
  • [Major issue] As a result, the tfjob failed to start: secrets "gcp-credentials" not found
  • git clone https://github.com/kubeflow/examples.git; cd examples/github_issue_summarization/notebooks/ks-app
    This is not working. The app is at examples/github_issue_summarization/ks-kubeflow
  • It's creating an environment called tfjob. I think the name should be changed.
    Using tfjob as an env name is confusing.
  • Once an environment has been delayed --> typo ? s/delayed/created ?
  • Points to IssueSummarization.py for the code. But training code should be training.py
  • The image is currently gcr.io/agwl-kubeflow/tf-job-issue-summarization. We should use
    gcr.io/kubeflow-images-public
  • The kubectl log command (very last one) needs -nkubeflow

step 3:

  • Got "The environment has expired. Please refresh to get a new environment." when I started. So I
    refreshed, redid step 1, and went straight to step 3.
  • I was able to get the prediction back. But it took ~5min for the pod to be ready. We should probably
    mention that (and also how to check the status, e.g. kubectl describe ....).

step 4:

  • [Major issue] ks apply frontendenv -c ui failed because github_token is not set.
    Needs ks param set ui github_token $GITHUB_TOKEN

cc @jlewi @ankushagarwal

[GH Issue] Use Pachyderm to launch TFJobs

See kubeflow/kubeflow#151

We'd like to provide an example of Pachyderm + TFJob to illustrate

  • Combining Pachyderm's orchestration capabilities with TFJob for distributed training
  • Highlighting Pachyderm's data provenance features with TFJob

The current thought (see kubeflow/kubeflow#151) is to create a simple Pachyderm pipeline to launch a TFJob to train the model.

The main challenge is that the data needs to be exported from Pachyderm and the resulting model imported into Pachyderm.

There's lots of discussion in kubeflow/kubeflow#151 about how to do this. The basic idea is

  • Pachyderm invokes a script that launches a TFJob

  • As part of the TFJob we export/import data from the Pachyderm data store

    • A variety of ideas have been suggested; e.g. using an Argo workflow, init containers, sidecars etc...

To use Pachyderm we would also need to deploy Pachyderm on K8s.

[GH Issue Summarization] Hyperparameter tuning (grid search)?

@hamelsmu @ankushagarwal

For the GitHub issue summarization model would it make sense to do a simple grid search for hyperparameter tuning?

I think this would be pretty straightforward to implement

  • Create a simple Python program (controller) to do a grid search

    • Launch N TFJobs at a time and wait to complete.
  • Controller can store information in a file on a PD for resilience

    • SQLite might be a good choice
  • Run the controller as a K8s Job.
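
The controller described above could be sketched roughly as follows. This is only an illustration under several assumptions: `launch_trial` is a hypothetical callable standing in for "submit a TFJob and wait for its metric", trials inside a batch run one after another rather than concurrently, and a lower metric is treated as better. The table layout and all names are made up.

```python
import itertools
import json
import sqlite3


def grid(space):
    """Yield one dict per combination of hyperparameter values."""
    names = sorted(space)
    for values in itertools.product(*(space[name] for name in names)):
        yield dict(zip(names, values))


def run_grid_search(space, launch_trial, db_path, batch_size=2):
    """Run all combinations in batches and return the best (params, metric) row.

    Finished trials are stored in SQLite, so a restarted controller
    skips combinations that already have a recorded metric.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trials (params TEXT PRIMARY KEY, metric REAL)"
    )
    done = {row[0] for row in conn.execute("SELECT params FROM trials")}
    pending = [p for p in grid(space) if json.dumps(p, sort_keys=True) not in done]

    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        # Placeholder: a real controller would launch the whole batch as
        # TFJobs and wait for all of them; here each trial runs in turn.
        for params in batch:
            metric = launch_trial(params)
            conn.execute(
                "INSERT INTO trials VALUES (?, ?)",
                (json.dumps(params, sort_keys=True), metric),
            )
        conn.commit()  # Persist progress after every batch.

    best = conn.execute(
        "SELECT params, metric FROM trials ORDER BY metric ASC LIMIT 1"
    ).fetchone()
    conn.close()
    return best
```

Pointing `db_path` at a file on the PD gives the resilience mentioned above: if the K8s Job restarts, only the combinations without a stored metric are launched again.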

My main question: does the model have hyperparameters worth tuning? Are there suitable metrics for deciding which model is best?

[GH Example] Use katib/modeldb to store model results

It looks like katib has some very nice features for keeping track of your models and then surfacing metrics for those models, e.g. by launching TensorBoard.

It would be great to combine this with our GH issue summarization example. In particular it would be great if we could load the trained model in a DB and then use the katib/modeldb UI to browse models and look at results.

/cc @gaocegege @YujiOshima

tensorflow serving not working with minio

👋 Great work guys!!!

Running the mnist example with minio, getting this error:

FileSystemStoragePathSource encountered a file-system access error: Could not find base path s3://mybucket/models/myjob-8b52d/export/mnist/ 

Any chance you can publish code for elsonrodriguez/model-server:1.0?

[GH Issue] Use persistent volumes for the data

To support Katacoda #89 we should remove the need for GCP credentials.

  1. For the output we should support using PD and making it easy for users to set the PD by parameters.

  2. The input is trickier. I think GCS requires an account even for public buckets. If it's a single file we could use the http URL to access it.

    • My suggestion would be to create a script that would copy the data using curl to a PD. We could then run that script as a K8s job.
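
A copy step along these lines could look like the sketch below. It uses Python's urllib instead of curl so that it stays self-contained, and it assumes the PD is already mounted at a local path inside the job's container. The function name and the DATA_URL and DATA_DIR environment variables are hypothetical.

```python
import os
import shutil
import urllib.request


def copy_to_volume(url, mount_path):
    """Download `url` into `mount_path` and return the local file path."""
    os.makedirs(mount_path, exist_ok=True)
    filename = os.path.basename(url.rstrip("/")) or "data"
    target = os.path.join(mount_path, filename)
    # Stream the response to disk so large files are not held in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


if __name__ == "__main__":
    # DATA_URL and DATA_DIR are hypothetical names chosen for this sketch.
    data_url = os.environ.get("DATA_URL")
    if data_url:
        print(copy_to_volume(data_url, os.environ.get("DATA_DIR", "/data")))
```

Run as a K8s job with the PD mounted at the target directory, this would leave the file where later training steps can read it without GCP credentials.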

[GH Issue] Don't require the model is baked into the docker image

Currently the model hard-codes the paths of

  • seq2seq_model_tutorial.h5 - the keras model
  • body_pp.dpkl - the serialized body preprocessor
  • title_pp.dpkl - the serialized title preprocessor

This means users have to rebuild the docker image just to try out their model. This makes things more difficult from the perspective of rerunning the demo on their own model.

A better approach would be to allow these files to be overridden, e.g. using environment variables.
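
One way this could look is sketched below. The three default file names come from the list above; the environment variable names (MODEL_FILE, BODY_PP_FILE, TITLE_PP_FILE) and the function name are assumptions made for illustration, not names used by the example's code.

```python
import os

# Defaults match the files currently baked into the image.
# The environment variable names are hypothetical.
DEFAULT_PATHS = {
    "MODEL_FILE": "seq2seq_model_tutorial.h5",
    "BODY_PP_FILE": "body_pp.dpkl",
    "TITLE_PP_FILE": "title_pp.dpkl",
}


def resolve_paths(environ=None):
    """Return the file paths to load, preferring environment overrides."""
    environ = os.environ if environ is None else environ
    return {
        name: environ.get(name, default)
        for name, default in DEFAULT_PATHS.items()
    }


# Only the model is overridden; both preprocessors keep their defaults.
print(resolve_paths({"MODEL_FILE": "/mnt/models/my_model.h5"}))
```

Keeping the baked-in files as defaults means serving still works out of the box, while a user's own model only needs a mounted volume and an environment variable rather than a rebuilt image.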

We might still want to bake the data into the Docker image so that users can try serving without having to train.

See #89

Create tf-serving examples for models trained using higher level APIs

Example: Estimator and Keras API training to serving + Unit testing
Priority: P1 - It's not a must have, but could be a great contribution to the TF community, where many data scientists train using one of these two APIs.

The k8s-model-server component in Kubeflow currently contains an inception client example that interacts with a custom model graph: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/inception_saved_model.py

ML practitioners often use TF estimator and Keras APIs to train models, as it greatly simplifies the training and validation process. However, converting these to servable models can be trickier and harder to debug. Add some examples of how to build servable models trained using Estimator and Keras APIs, and unit test examples.

[GH Issue Summarization] Train model distributed using TFJob

We should be able to train the model using TFJob so that we can take advantage of K8s to train the model distributed.

Right now the instructions only describe how to train inside the Jupyter notebook
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/training_the_model.md

The Argo workflow does train the model as a batch job but it's not running distributed
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/workflow/github_issues_summarization.yaml

Sketch out recommendation example

I think it would be very valuable to have a recommendation example. The purpose of this issue is to identify a scenario and dataset around which we could build a solution.

Some possible datasets

GitHub Data

  • We could recommend repositories based on stars
  • Issues/PRs (comments could be used to indicate a user was interested in an issue)
  • Recommend reviewers for PRs

Hacker News or Reddit

  • I think both datasets are available publicly in BigQuery

The MovieLens data seems less interesting because it isn't updated frequently.

Keras model exported as TensorFlow model doesn't work with TensorFlow serving

I am using the keras model as defined in this tutorial: https://github.com/hamelsmu/Seq2Seq_Tutorial/blob/master/notebooks/Tutorial.ipynb

I exported the encoder model using extract_encoder_model and exported it as a Tensorflow model. When used with TensorFlow serving, I get the following error

AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="Expected multiples argument to be a vector of length 3 but got length 2
[[Node: Encoder-Last-GRU_1/Tile = Tile[T=DT_FLOAT, Tmultiples=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Encoder-Last-GRU_1/ExpandDims, Encoder-Last-GRU_1/Tile/multiples)]]")

Create buckets and other resources to store example data

We should create buckets and other resources (GCR) to store data for our examples.

We currently have bucket gs://kubeflow-examples but that's owned by project kubeflow-dev which isn't really the best thing.

I created project: kubeflow-examples

[GH Issue Summarization] E2E Solution on Kubeflow

Replicated from Kubeflow #157.

@hamelsmu published a great blog post about using sequence to sequence models to summarize GitHub issues.

It would be great to turn this into an E2E solution using Kubeflow that highlights the benefit of using Kubeflow and K8s for data science.

There are lots of reasons why I think this blog post would make for a fantastic E2E solution

  • Text summarization has a lot of applications
  • It uses GitHub data which is a very rich dataset
  • Training and preprocessing take enough time (~30 minutes and ~1 hour respectively) that I think it makes sense to run these as K8s jobs but not so much time to be a barrier.

Here's a stab at what an E2E solution might look like

  • Entrypoint would be a notebook (based on the one in the blogpost).
  • Notebook would walk through the various steps but instead of (or in addition to) running code directly in the notebook.

I think there's quite a bit of work to be done but I think we can split this into tasks

  • Set up a shared Kubeflow cluster and Cloud project for dev team to use
  • Create a Docker image to be used with Jupyter with all dependencies installed
  • Refactor the notebook into libraries with suitable main functions so that relevant steps (preprocessing and training) can be invoked in K8s Jobs and TFJobs
  • Build Docker images (using Argo) to be used by TFJob, TFServing, etc.
  • Create a model server using TFServing Seldon core
  • Create a web app to serve as the front end
  • Create a ksonnet component for deploying the model and web app

[GH Issue] Scale out preprocessing using Apache Beam

In the original blog post, Hamel filtered down the number of issues from 5M to 2M.

Can we use Apache Beam to run the preprocessing on all 5 million issues and scale out horizontally?

This would be the first step in giving us a very nice scaling out story.

[GH Issue] Document preprocessing and training times

It would be good to provide a rough estimate of how long it takes to preprocess and train the model using datasets of different sizes; e.g.

  • Using 2M issues sampled from the dataset per the original blog post
  • Using the entire dataset

The original blog post had some measurements of how long things took and how many resources they used.

[Enhance] Image enhancement example

Goals:

  • Demonstrate a high-impact biomedical imaging use case
  • Demonstrate distributed training
  • Demonstrate hyperparameter tuning, potentially informing future design of a TFStudy hptuning CRD
  • Demonstrate batch inference
  • Demonstrate the use of tensor2tensor, positioning for increased leverage in developing additional examples

Steps:

  • Launcher interface for running component steps in batch and testing for job success; each step smoke tested to run in batch at least displaying help message
  • Illustrate a tfhub-based development workflow (primarily in regard to how model code and dependencies are shipped to jobs) that sufficiently minimizes friction, has support of community
  • Batch data downloader pulls raw data to NFS
  • Correct definitions of t2t Problems for the image identity mapping and super-resolution problems
  • Batch example generator uses t2t-datagen to generate examples
  • T2TExperiment object abstracts interface for triggering a TFJob running t2t-trainer that is amenable to strategy for hyperparameter tuning; launches a job that makes use of a stock t2t model that trains in distributed form
  • Minimal prototype for creating a new hyperparameter study (e.g. registering in redis)
  • StudyRunner runs in batch, periodically launching newly registered experiments
  • Inference step runs in batch, wrapping t2t-decoder to allow user to enhance images by simply providing input and output paths (on NFS)
  • Model actually performs well on the provided task

Potential additional or non-steps:

  • Generalizing beyond NFS to support a variety of storage types
  • Implementing a production-caliber hyperparameter tuning solution

Current PR: #60
Readme: https://github.com/cwbeitel/examples/tree/enhance/enhance

Kubeflow examples needs to have an appropriate directory structure

We need to have an example directory structure. For example, do we organize top-level folders by framework (tensorflow, xgboost, scikit, etc.), by problem type, or by something else? This needs to be figured out soon so that we can add examples in appropriate folders, as the current flat structure will soon be hard to navigate.
