kubeflow / examples

A repository to host extended examples and tutorials

License: Apache License 2.0

Jupyter Notebook 31.32% Python 5.05% Shell 0.34% HTML 0.24% Makefile 0.20% CSS 0.14% JavaScript 0.09% Dockerfile 0.29% Jsonnet 62.20% Jinja 0.14%

examples's Introduction

Notice

Blog post: HELP WANTED: Repackaging Kaggle Getting Started into Kubeflow Examples

Highlights:

We'd like to help bolster the kubeflow/examples repo
Help people get involved in open source/kubeflow project/community
Give people an opportunity to make a little side hustle income

kubeflow-examples

A repository to share extended Kubeflow examples and tutorials to demonstrate machine learning concepts, data science workflows, and Kubeflow deployments. The examples illustrate the happy path, acting as a starting point for new users and a reference guide for experienced users.

This repository is home to the following types of examples and demos:

End-to-end
Component-focused
Demos

End-to-end

Named Entity Recognition

Author: Sascha Heyer

This example covers the following concepts:

Build reusable pipeline components
Run Kubeflow Pipelines with Jupyter notebooks
Train a Named Entity Recognition model on a Kubernetes cluster
Deploy a Keras model to AI Platform
Use Kubeflow metrics
Use Kubeflow visualizations

GitHub issue summarization

Author: Hamel Husain

This example covers the following concepts:

Natural Language Processing (NLP) with Keras and Tensorflow
Connecting to Jupyterhub
Shared persistent storage
Training a Tensorflow model
1. CPU
2. GPU
Serving with Seldon Core
Flask front-end

Pachyderm Example - GitHub issue summarization

Author: Nick Harvey & Daniel Whitenack

This example covers the following concepts:

A production pipeline for pre-processing, training, and model export
CI/CD for model binaries, building and deploying a docker image for serving in Seldon
Full tracking of what data produced which model, and what model is being used for inference
Automatic updates of models based on changes to training data or code
Training with single node Tensorflow and distributed TF-jobs

Pytorch MNIST

Author: David Sabater

This example covers the following concepts:

Distributed Data Parallel (DDP) training with Pytorch on CPU and GPU
Shared persistent storage
Training a Pytorch model
1. CPU
2. GPU
Serving with Seldon Core
Flask front-end

MNIST

Author: Elson Rodriguez

This example covers the following concepts:

Image recognition of handwritten digits
S3 storage
Training automation with Argo
Monitoring with Argo UI and Tensorboard
Serving with Tensorflow

Distributed Object Detection

Author: Daniel Castellanos

This example covers the following concepts:

Gathering and preparing the data for model training using K8s jobs
Using Kubeflow tf-job and tf-operator to launch a distributed object training job
Serving the model through Kubeflow's tf-serving

Financial Time Series

Author: Sven Degroote

This example covers the following concepts:

Deploying Kubeflow to a GKE cluster
Exploration via JupyterHub (prospect data, preprocess data, develop ML model)
Training several tensorflow models at scale with TF-jobs
Deploy and serve with TF-serving
Iterate training and serving
Training on GPU
Using Kubeflow Pipelines to automate ML workflow

Pipelines

Simple notebook pipeline

Author: Zane Durante

This example covers the following concepts:

How to create pipeline components from python functions in jupyter notebook
How to compile and run a pipeline from jupyter notebook

MNIST Pipelines

Author: Dan Sanche and Jin Chi He

This example covers the following concepts:

Run the MNIST Pipelines sample on Google Cloud Platform (GCP).
Run the MNIST Pipelines sample on an on-premises cluster.

Component-focused

XGBoost - Ames housing price prediction

Author: Puneith Kaul

This example covers the following concepts:

Training an XGBoost model
Shared persistent storage
GCS and GKE
Serving with Seldon Core

Demos

Demos are for showing Kubeflow or one of its components publicly, with the intent of highlighting product vision, not necessarily teaching. In contrast, the goal of the examples is to provide a self-guided walkthrough of Kubeflow or one of its components, for the purpose of teaching you how to install and use the product.

In an example, all commands should be embedded in the process and explained. In a demo, most details should be done behind the scenes, to optimize for on-stage rhythm and limited timing.

You can find the demos in the /demos directory.

Third-party hosted

Source	Example	Description

Get Involved

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, education, socio-economic status, nationality, personal appearance, race, religion, or sexual identity and orientation.

The Kubeflow community is guided by our Code of Conduct, which we encourage everybody to read before participating.

examples's Issues

GCR Registry examples

Some examples are publishing Docker images. We should probably create a GCR registry to host these.

Example:
GitHub issue example referring to an image in gcr.io/agwl-kubeflow

[GH Issue] Add links and info about Kubeflow to web app

The GH web app should include links and information about Kubeflow.

It's free advertising.

/assign @ankushagarwal

[GH Issue Summarization] Create a ksonnet app

Create a ksonnet app for deploying the model & web app.

Component of #14.

[GH Issue] Link to IssueSummarization.py is broken

The link on this page
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/serving_the_model.md

to IssueSummarization is broken. Looks like the correct link should be

https://github.com/kubeflow/examples/blob/master/github_issue_summarization/notebooks/IssueSummarization.py

/assign @ankushagarwal

[GH Issue] E2E test

We should add E2E testing for the GH issue example to make sure it doesn't break.

Some things to test

Make sure we can deploy all the components using ksonnet
Make sure training runs (just run a couple steps)
Test that the predict RPC generates responses

DeepVariant Example

DeepVariant might make an interesting example.

This could be a nice example to illustrate various aspects such as

Preprocessing
Pipelines
Large scale training

[GH Issue Summarization] Create docker image for use with Jupyter

Component of #14.

[GH Issue Summarization] Organize notebook into components

Refactor the example notebook into libraries to be invoked with K8sJob & TFJobs:

Preprocessing
Training

Component of #14.

[GH Example] Friction points from Katacoda

Katacoda recently created a scenario out of the GitHub issue summarization example and ran into a number of frustrating issues. The following items need to be addressed in order to turn this example into a platform-independent self-contained unit for use in a wide number of environments.

@BenHall to add his list

Create XGBoost Zillow housing prediction example from Kaggle kernel

Create XGBoost Zillow housing prediction example from Kaggle Zestimate kernel.

[GH Label Prediction] Extend GH Issue Summarization to Predict Labels

Should we extend our example on GH issue summarization to predict issue labels?

One of the points of the original blog post was to train useful word embeddings on the entire corpus. So we could potentially use this to learn features that would then allow us to train models specific to repositories/orgs which likely have less data and their own taxonomy of labels.

Predicting issue labels would be useful for creating examples that highlight model analysis tooling. A lot of model analysis tools assume you can compute metrics like true positive/true negative etc... If we predict labels we can use actual labels to compute this.
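
As a rough illustration of the metrics mentioned above, the sketch below counts true/false positives and negatives for a single label and derives precision and recall. The function names and the sample data are hypothetical and are not taken from the example's code; it only assumes that each issue has a set of actual labels and a set of predicted labels.

```python
from collections import Counter


def confusion_counts(actual, predicted, label):
    """Count TP/FP/FN/TN for one label across a list of issues.

    `actual` and `predicted` are lists of label sets, one set per issue.
    """
    counts = Counter()
    for true_labels, guessed_labels in zip(actual, predicted):
        is_true = label in true_labels
        is_guessed = label in guessed_labels
        if is_true and is_guessed:
            counts["tp"] += 1
        elif is_guessed:
            counts["fp"] += 1
        elif is_true:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


def precision_recall(counts):
    """Return (precision, recall); 0.0 when a denominator is zero."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical data: actual vs. predicted labels for four issues.
actual = [{"bug"}, {"bug", "docs"}, {"feature"}, set()]
predicted = [{"bug"}, {"docs"}, {"bug"}, set()]
bug_counts = confusion_counts(actual, predicted, "bug")
```

With summarization there is no comparable ground truth to compare against, which is the limitation the next paragraph describes.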

With our existing text summarization example there's no obvious way to compute whether a summarization is accurate or not, which limits our ability to use it as an example of model analysis.

This would also be useful as a proxy for a large class of ecommerce problems where the goal is to dedupe related posts (e.g. different ebay postings for the same product).

@hamelsmu Any idea how difficult this would be and whether it would be valuable?

[GH Issue] GitHub API token should be set via a secret and not as a parameter

See here

We currently set the GitHub token as a parameter, which means it ends up checked into source control. GitHub tokens should be kept secret.

So instead we should modify the app to use a K8s secret to supply it.
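
One possible shape for the app side of this change, as a minimal sketch: it assumes the K8s secret is exposed to the container as an environment variable, and the variable name `GITHUB_TOKEN` and the helper `load_github_token` are illustrative, not names from the existing app.

```python
import os


def load_github_token(env_var="GITHUB_TOKEN"):
    """Return the token from the environment, or raise if it is unset.

    Assumption: the deployment injects the secret's value into the
    container under `env_var`; nothing is read from app parameters.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"{env_var} is not set; expected it to be injected from a K8s secret"
        )
    return token
```

Failing loudly at startup when the variable is missing makes a misconfigured secret easier to spot than a later authentication error.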

/assign @texasmichelle

[Agents RL] Demonstrate Kubeflow with an E2E RL example

The purpose of this example is to showcase the benefits of the Kubeflow infrastructure in training a reinforcement learning agent.

Core tasks:

Case study described that communicates the business value of the example; who will care about this example and why?
Illustrate the config, submit, monitor, render workflow for single-node training
Prow test verifies model trains in notebook container
Illustration of practice for building and pushing containers efficiently
Distributed training with TFJob operator (e.g. using @danijar's idea)
Illustration of simple hyperparameter tuning
Uses accelerators

Optionally:

Build a custom gym environment that captures a business problem of interest, e.g. reinforcement learning in the context of datacenter cooling, scheduling, hyperparameter tuning, etc.
Deploy the agent and custom environment, e.g. if this environment concerns kubernetes scheduling then use it to schedule resources on a cluster and measure whether there was a benefit

/cc @nkashy1 @danijar @aronchick @jlewi

[Request] Move examples in tf-operator to this repo

I think we could move the examples in tf-operator to this repo to maintain all examples in one repo.

Writing model directly to GCS instead of saving locally and then copying to GCS

The tfjob example for github issue summarization saves the model locally and then uploads it to GCS. We should try to write the model directly to GCS.

Linting instructions

Create instructions for running pylint locally and troubleshooting presubmit failures that prevent PRs from being merged. Follow-on from merged PR #61.

Document which tools to use locally (tf-operator describes yapf here) and how to find and use the versions in our test infrastructure. Since the Prow UI only shows file-level granularity, explain how to access the test cluster, find the right pods, & view the logs containing line-level failure details.

Katacoda scenario on github summarization example; friction log

Running through the scenario

step 1: no issue

step 2:

[Major issue] Creating GCP token (secret) not working --> error: invalid literal source test, expected key=value
[Major issue] As a result, the tfjob failed to start: secrets "gcp-credentials" not found
git clone https://github.com/kubeflow/examples.git; cd examples/github_issue_summarization/notebooks/ks-app
This is not working. The app is at examples/github_issue_summarization/ks-kubeflow
It's creating an environment called tfjob. I think the name should be changed.
Using tfjob as an env name is confusing.
Once an environment has been delayed --> typo ? s/delayed/created ?
Points to IssueSummarization.py for the code. But training code should be training.py
The image is currently gcr.io/agwl-kubeflow/tf-job-issue-summarization. We should use gcr.io/kubeflow-images-public
The kubectl log command (very last one) needs -nkubeflow

step 3:

Got "The environment has expired. Please refresh to get a new environment." when started. So I refreshed, redid step 1, and went straight to step 3.
I was able to get the prediction back, but it took ~5 min for the pod to be ready. We should probably mention that (and how to check the status, e.g. kubectl describe ...).

step 4:

[Major issue] ks apply frontendenv -c ui failed because github_token is not set.
Needs ks param set ui github_token $GITHUB_TOKEN

cc @jlewi @ankushagarwal

[GH Issue] Use Pachyderm to launch TFJobs

See kubeflow/kubeflow#151

We'd like to provide an example of Pachyderm + TFJob to illustrate

Combining Pachyderm's orchestration capabilities with TFJob for distributed training
Highlighting Pachyderm's data provenance features with TFJob

The current thought (see kubeflow/kubeflow#151) is to create a simple Pachyderm pipeline that launches a TFJob to train the model.

The main challenge is that the data needs to be exported from Pachyderm and the resulting model imported into Pachyderm.

There's lots of discussion in kubeflow/kubeflow#151 about how to do this. The basic idea is

Pachyderm invokes a script that launches a TFJob
As part of the TFJob we export/import data from the Pachyderm data store
- A variety of ideas have been suggested; e.g. using an Argo workflow, init containers, sidecars etc...

To use Pachyderm we would also need to deploy Pachyderm on K8s.

[GH Issue Summarization] Hyperparameter tuning(grid search)?

@hamelsmu @ankushagarwal

For the GitHub issue summarization model would it make sense to do a simple grid search for hyperparameter tuning?

I think this would be pretty straightforward to implement

Create a simple Python program(controller) to do a grid search
- Launch N TFJobs at a time and wait to complete.
Controller can store information in a file on a PD for resilience
- SQLite might be a good choice
Run the controller as a K8s Job.
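
The controller outlined above could look roughly like the sketch below. It is only an illustration under stated assumptions: `launch_and_wait` is a hypothetical callable standing in for "launch N TFJobs and wait for them to complete", the metric is assumed to be "lower is better", and the table layout is made up for this example.

```python
import itertools
import sqlite3


def grid(param_space):
    """Yield one dict per point in the Cartesian product of the space."""
    names = sorted(param_space)
    for values in itertools.product(*(param_space[n] for n in names)):
        yield dict(zip(names, values))


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


def run_grid_search(param_space, launch_and_wait, db_path, batch_size=2):
    """Run every grid point in batches, recording each result in SQLite.

    `launch_and_wait` (hypothetical) takes a list of parameter dicts,
    launches one job per dict, blocks until all finish, and returns one
    metric per dict.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS trials (params TEXT PRIMARY KEY, metric REAL)"
    )
    for batch in batches(grid(param_space), batch_size):
        keyed = [(repr(sorted(p.items())), p) for p in batch]
        # Skip trials already recorded so a restarted controller resumes.
        pending = [
            (key, params)
            for key, params in keyed
            if conn.execute(
                "SELECT 1 FROM trials WHERE params = ?", (key,)
            ).fetchone() is None
        ]
        if not pending:
            continue
        metrics = launch_and_wait([params for _, params in pending])
        conn.executemany(
            "INSERT INTO trials VALUES (?, ?)",
            [(key, metric) for (key, _), metric in zip(pending, metrics)],
        )
        conn.commit()
    best = conn.execute(
        "SELECT params, metric FROM trials ORDER BY metric LIMIT 1"
    ).fetchone()
    conn.close()
    return best
```

Pointing `db_path` at a file on the PD is what gives the resilience mentioned above: after a restart, grid points that already have a row are not launched again.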

My main question: does the model have hyperparameters worth tuning? Are there suitable metrics for deciding which model is best?

[GH Example] Use katib/modeldb to store model results

It looks like Katib has some very nice features for keeping track of your models and then surfacing metrics for those models, e.g. by launching TensorBoard.

It would be great to combine this with our GH issue summarization example. In particular it would be great if we could load the trained model in a DB and then use the Katib/ModelDB UI to browse models and look at results.

/cc @gaocegege @YujiOshima

[GH Issue] Instructions reference build_image.sh; but that script is missing

https://github.com/kubeflow/examples/blob/master/github_issue_summarization/serving_the_model.md#wrap-the-model-into-a-seldon-core-microservice

Instructions say

The build/ directory contains all the necessary files to build the seldon-core microservice image

But I don't see that directory or build_image.sh

@ankushagarwal @texasmichelle is there something I'm missing? I assume the instructions are just outdated?

tensorflow serving not working with minio

👋 Great work guys!!!

Running the mnist example with minio, getting this error:

FileSystemStoragePathSource encountered a file-system access error: Could not find base path s3://mybucket/models/myjob-8b52d/export/mnist/

Any chance you can publish code for elsonrodriguez/model-server:1.0?

[GH Issue] Use persistent volumes for the data

To support Katacoda #89 we should remove the need for GCP credentials.

For the output we should support using PD and making it easy for users to set the PD by parameters.
The input is trickier. I think GCS requires an account even for public buckets. If it's a single file we could use the HTTP URL to access it.
- My suggestion would be to create a script that would copy the data using curl to a PD. We could then run that script as a K8s job.
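
The suggestion above uses curl; an equivalent sketch in Python using only the standard library is shown below. The function name `fetch_to_volume` and the idea of passing the PD mount point as `dest_dir` are assumptions made for illustration, not part of the existing example.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_to_volume(url, dest_dir, filename=None):
    """Download `url` into `dest_dir` and return the local path.

    `dest_dir` stands in for the mount point of the persistent disk.
    """
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / (filename or url.rstrip("/").rsplit("/", 1)[-1])
    # Stream to a temporary name first so a failed download never leaves
    # a partial file under the final name.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target
```

A script like this could be packaged in a small image and run once as a K8s job before training starts.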

[GH Issue] Don't require that the model be baked into the docker image

Currently the model hard-codes the paths of the following files

seq2seq_model_tutorial.h5 - the keras model
body_pp.dpkl - the serialized body preprocessor
title_pp.dpkl - the serialized title preprocessor

This means users have to rebuild the docker image just to try out their model. This makes things more difficult from the perspective of rerunning the demo on their own model.

A better approach would be to allow these files to be overwritten, e.g. using environment variables.
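
A minimal sketch of that approach is shown below. The defaults are the three file names listed above; the environment variable names (`MODEL_FILE`, `BODY_PP_FILE`, `TITLE_PP_FILE`) and the helper `resolve_model_paths` are hypothetical.

```python
import os

# Defaults match the file names listed above; the environment variable
# names are hypothetical.
DEFAULT_PATHS = {
    "MODEL_FILE": "seq2seq_model_tutorial.h5",
    "BODY_PP_FILE": "body_pp.dpkl",
    "TITLE_PP_FILE": "title_pp.dpkl",
}


def resolve_model_paths(environ=None):
    """Return the three artifact paths, letting env vars override defaults."""
    environ = os.environ if environ is None else environ
    return {
        name: environ.get(name) or default
        for name, default in DEFAULT_PATHS.items()
    }
```

Because unset or empty variables fall back to the defaults, an image that still ships the baked-in files keeps working unchanged.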

We might still want to bake the data into the Docker image so that users can try serving without having to train.

See #89

Create a distributed object detection training example

Create tf-serving examples for models trained using higher level APIs

Example: Estimator and Keras API training to serving + Unit testing
Priority: P1 - It's not a must have, but could be a great contribution to the TF community, where many data scientists train using one of these two APIs.

The k8s-model-server component in Kubeflow currently contains an inception client example that interacts with a custom model graph: https://github.com/tensorflow/serving/blob/master/tensorflow_serving/example/inception_saved_model.py

ML practitioners often use TF estimator and Keras APIs to train models, as it greatly simplifies the training and validation process. However, converting these to servable models can be trickier and harder to debug. Add some examples of how to build servable models trained using Estimator and Keras APIs, and unit test examples.

[GH Issue Summarization] Train model distributed using TFJob

We should be able to train the model using TFJob so that we can take advantage of K8s to train the model distributed.

Right now the instructions only describe how to train inside the Jupyter notebook
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/training_the_model.md

The Argo workflow does train the model as a batch job but it's not running distributed
https://github.com/kubeflow/examples/blob/master/github_issue_summarization/workflow/github_issues_summarization.yaml

Set up a shared environment

Set up a k8s cluster & cloud project for use in creating examples.

Component of Kubeflow #157.

Sketch out recommendation example

I think it would be very valuable to have a recommendation example. The purpose of this issue is to identify a scenario and dataset around which we could build a solution.

Some possible datasets

GitHub Data

We could recommend repositories based on stars
Issues/PRs (comments could be used to indicate a user was interested in an issue)
Recommend reviewers for PRs

Hacker News or Reddit

I think both datasets are available publicly in BigQuery

The MovieLens data seems less interesting because it isn't updated frequently.

Switch to project kubeflow-ci

We need to move out of mlkube-testing and into kubeflow-ci

See kubeflow/testing#18

[GH Issue Summarization] Deploy GH web app on kubeflow.org and make it widely accessible

I think it would be interesting to deploy the web app and model on a K8s cluster and make it publicly available.

This would be a good test bed for a variety of things; e.g. periodic training and rollouts; monitoring etc...

Create E2E example of distributed model training and serving.

Currently most of the examples do not show how to train a model to completion within Kubeflow and then serve that trained model with Kubeflow.

We need an example that covers this from start to finish.

[GH Issue Summarization] Deploy on dev.kubeflow.org

We should deploy the webserver and model on our dev instance of Kubeflow (dev.kubeflow.org) and provide a public URL for accessing the app.

Keras model exported as TensorFlow model doesn't work with TensorFlow serving

I am using the keras model as defined in this tutorial: https://github.com/hamelsmu/Seq2Seq_Tutorial/blob/master/notebooks/Tutorial.ipynb

I exported the encoder model using extract_encoder_model and exported it as a Tensorflow model. When used with TensorFlow serving, I get the following error

AbortionError: AbortionError(code=StatusCode.INVALID_ARGUMENT, details="Expected multiples argument to be a vector of length 3 but got length 2
[[Node: Encoder-Last-GRU_1/Tile = Tile[T=DT_FLOAT, Tmultiples=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](Encoder-Last-GRU_1/ExpandDims, Encoder-Last-GRU_1/Tile/multiples)]]")

[GH Issue Summarization] Create a model server

Create a model server using TFServing.

Component of #14.

Create buckets and other resources to store example data

We should create buckets and other resources (GCR) to store data for our examples.

We currently have bucket gs://kubeflow-examples but that's owned by project kubeflow-dev which isn't really the best thing.

I created project: kubeflow-examples

[GH Issue Summarization] E2E Solution on Kubeflow

Replicated from Kubeflow #157.

@hamelsmu published a great blog post about using sequence to sequence models to summarize GitHub issues.

It would be great to turn this into an E2E solution using Kubeflow that highlights the benefit of using Kubeflow and K8s for data science.

There are lots of reasons why I think this blog post would make for a fantastic E2E solution

Text summarization has a lot of applications
It uses GitHub data which is a very rich dataset
Training and preprocessing take enough time (~30 minutes and ~1 hour respectively) that I think it makes sense to run these as K8s jobs but not so much time to be a barrier.

Here's a stab at what an E2E solution might look like

Entrypoint would be a notebook (based on the one in the blogpost).
Notebook would walk through the various steps, but instead of (or in addition to) running code directly in the notebook, it would invoke the relevant steps (preprocessing and training) as K8s Jobs and TFJobs.

I think there's quite a bit of work to be done but I think we can split this into tasks

Setup a shared Kubeflow cluster and Cloud project for dev team to use
Create a Docker image to be used with Jupyter with all dependencies installed
~~Refactor the notebook into libraries with suitable main functions so that the relevant steps (preprocessing and training) can be invoked in K8s Jobs and TFJobs~~
Build a Docker image (using Argo) to be used by TFJob, TFServing, etc.
Create a model server using ~~TFServing~~ Seldon core
Create a web app to serve as the front end
Create a ksonnet component for deploying the model and web app

[GH Issue Summarization] Create a front-end web app

Component of #14.

[GH Issue] ksonnet app is missing vendor directory

It doesn't look like the vendor directory got checked in.
https://github.com/kubeflow/examples/tree/master/github_issue_summarization/ks-kubeflow

I'm guessing because of our .gitignore.

We should check it in.

Updating `Training MNIST using Kubeflow, S3, and Argo` documentation

We went through this documentation using minio as S3 storage and a Kubernetes Cluster hosted on Azure and we found some issues.

I am planning to open a PR to share our learnings from that.

To avoid duplicate sections, I was thinking of creating a new md file for minio and referring to it from the README.md in the same folder, as a few tweaks are necessary to make it work.

WDYT?

@jlewi @wbuchwalter @ritazh @sozercan

Fix the linear training mode in the mnist example, upstream to Tensorflow

I never got the Linear training mode working, so I stuck to CNN for the example.

examples/mnist/model.py

Line 142 in 1be7ccb

if TF_MODEL_TYPE == "LINEAR":

If the Linear portion of the model were fixed, it could be upstreamed in tensorflow to replace their example and reduce duplication.

Also the upstream model does not support distributed training in its current form, or exporting.

[GH Example] Surface RPC metrics

It would be great to be able to show how we collect/surface RPC metrics for models.

The current thinking is to use ISTIO; see kubeflow/kubeflow#464

@lluunn Once we have ISTIO working

[GH Issue] Scale out preprocessing using Apache Beam

In the original blog post, Hamel filtered down the number of issues from 5M to 2M.

Can we use Apache Beam to run the preprocessing on all 5 million issues and scale out horizontally?

This would be the first step in giving us a very nice scaling out story.

[GH Issue] Document preprocessing and training times

It would be good to provide a rough estimate of how long it takes to preprocess and train the model using datasets of different sizes; e.g.

Using 2M issues sampled from the dataset per the original blog post
Using the entire dataset

The original blog post had some measurements of how long things took and how many resources were used.

[Enhance] Image enhancement example

Goals:

Demonstrate a high-impact biomedical imaging use case
Demonstrate distributed training
Demonstrate hyperparameter tuning, potentially informing future design of a TFStudy hptuning CRD
Demonstrate batch inference
Demonstrate the use of tensor2tensor, positioning for increased leverage in developing additional examples

Steps:

Potential additional or non-steps:

Generalizing beyond NFS to support a variety of storage types
Implementing a production-caliber hyperparameter tuning solution

Current PR: #60
Readme: https://github.com/cwbeitel/examples/tree/enhance/enhance

Move tf-controller-examples from kubeflow/kubeflow to kubeflow/examples

Kubeflow examples needs to have an appropriate directory structure

We need to have an example directory structure. For example, do we organize the top-level folders by framework (tensorflow, xgboost, scikit, etc.), by problem type, or by something else? This needs to be figured out soon so that we can add examples in appropriate folders, as the current flat structure will soon be hard to navigate.

[GH Issue Summarization] Create build image

Build a docker image using Argo to be used by K8sJob, TFJob, TFServing, etc.

Component of #14.

[GH Issue Summarization] Train and push periodically

I think the GitHub data is updated regularly. So we could try setting up a cron job or other solution to periodically retrain and push the model.

This depends on deploying it first (#39).

[Agents RL] Remove unnecessary tools/ directory
