PaddlePaddle EDL: Elastic Deep Learning

While many hardware and software manufacturers are working on improving the running time of deep learning jobs, EDL optimizes

  1. the global utilization of the cluster, and
  2. the waiting time of job submitters.

For more about the EDL project, please refer to this invited blog post on the official Kubernetes blog.

EDL includes two parts:

  1. a Kubernetes controller for the elastic scheduling of distributed deep learning jobs, and

  2. making PaddlePaddle a fault-tolerant deep learning framework.

This directory contains the Kubernetes controller. For more information about fault tolerance, please refer to the design.

We deployed EDL on a real Kubernetes cluster, dlnel.com, which is open to graduate students of Tsinghua University. The performance test report of EDL on this cluster is here.

Build

glide install --strip-vendor
go build -o path/to/output github.com/paddlepaddle/edl/cmd/edl

Usage

Deploying EDL to your Kubernetes cluster takes two major steps:

  1. Create a Third Party Resource "Training-job" so that a PaddlePaddle machine learning job can be defined in a single yaml file.
  2. Deploy the EDL controller to monitor and control how cluster resources are distributed between online services and PaddlePaddle training jobs.

Please note that TPR (Third Party Resource) is deprecated as of Kubernetes 1.7 and was removed in Kubernetes 1.8. We are working to support CRD (Custom Resource Definition, the successor of TPR). Stay tuned!

Prepare your cluster

Before anything else, make sure you have a running Kubernetes v1.7.* cluster and a working kubectl.

If you just want to try EDL on your laptop, minikube is good enough. Start it with the following command:

minikube start --kubernetes-version v1.7.5

To verify that minikube and kubectl work, run the following command:

kubectl version

If you can see both the client and the server version, and the server version is v1.7.5, you are good to go.
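For scripted setups, the same check can be automated. The following is a minimal sketch and not part of EDL: the helper name is_supported_server_version is hypothetical, and the usage comment assumes a kubectl release that still accepts the --short flag.

```shell
#!/bin/sh
# Hypothetical helper (not part of EDL): succeeds only when the given
# version string belongs to the v1.7.x series that this guide targets.
is_supported_server_version() {
    case "$1" in
        v1.7.*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Example usage against a live cluster (assumes `kubectl version --short`
# prints a line of the form "Server Version: v1.7.5"):
#   server_version=$(kubectl version --short | awk '/Server Version/ {print $3}')
#   is_supported_server_version "$server_version" && echo "good to go"
```

The check is deliberately strict: a string such as v1.8.0 is rejected, because the TPR used in this guide is only available on the 1.7 series.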

Create TPR "Training-job"

This is as simple as running the following command:

kubectl create -f thirdpartyresource.yaml

To verify the creation of the resource, run the following command:

kubectl describe ThirdPartyResource training-job

If no error is returned, your training-job TPR has been created successfully.

Deploy EDL controller

In most cases EDL is meant to run as a docker image inside the Kubernetes cluster, but it is always possible to run the EDL binary outside the cluster with a Kubernetes config file. This section assumes that EDL runs as a docker image in the Kubernetes cluster.

Before we get to the docker image part, we recommend running the EDL controller within a Kubernetes namespace, which provides better isolation among apps. By default, EDL runs under the namespace "paddlecloud". If the namespace does not exist yet, create it with the following command:

kubectl create namespace paddlecloud

There are 2 ways to obtain the EDL docker image:

  1. Directly pull the pre-built image from docker hub's paddle repo
  2. Build your own

If you decide to use the pre-built image, there is nothing more to do now; you can skip to the deployment part.

To build your own docker image, run the following command:

docker build -t yourRepoName/edl-controller .

This command takes the Dockerfile, builds the EDL docker image, and tags it as yourRepoName/edl-controller.

Now push it to your docker hub repository so that the Kubernetes cluster can pull and deploy it:

docker push yourRepoName/edl-controller

Before deploying your EDL controller, open edl_controller.yaml in any text editor and change the docker image uri from paddlepaddle/edl-controller to yourRepoName/edl-controller.
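If you prefer not to edit the file by hand, the substitution can be scripted. This is a minimal sketch, assuming the image uri appears literally as paddlepaddle/edl-controller in edl_controller.yaml; the helper name retarget_image and the output file name in the usage comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper (not part of EDL): read a manifest on stdin and
# replace the pre-built image name with one under your own repository.
# $1 is your docker hub repository name, e.g. yourRepoName.
retarget_image() {
    sed "s|paddlepaddle/edl-controller|$1/edl-controller|g"
}

# Example usage (writes a modified copy and leaves the original untouched):
#   retarget_image yourRepoName < edl_controller.yaml > edl_controller.local.yaml
```

Writing to a separate file keeps the original manifest available in case you later want to switch back to the pre-built image.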

Now let's deploy the EDL controller:

kubectl create -f edl_controller.yaml

To verify the deployment, first check that the deployment's pod has been created successfully:

kubectl get pods --namespace paddlecloud

NAME                                       READY     STATUS    RESTARTS   AGE
training-job-controller-2033113564-w80q6   1/1       Running   0          4m
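The STATUS column of output like the above can also be read from a script. The sketch below is an illustration only: the helper name pod_status is hypothetical, and it assumes the default table layout shown here, where STATUS is the third column.

```shell
#!/bin/sh
# Hypothetical helper (not part of EDL): read `kubectl get pods` table
# output on stdin and print the STATUS column of the first pod whose
# name starts with the given prefix. The header row is skipped.
pod_status() {
    awk -v prefix="$1" 'NR > 1 && index($1, prefix) == 1 { print $3; exit }'
}

# Example usage against a live cluster:
#   kubectl get pods --namespace paddlecloud | pod_status training-job-controller
```

Matching on a name prefix avoids hard-coding the generated suffix of the pod name, which differs between deployments.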

Wait until STATUS is Running, then run the following command to see the controller's log:

kubectl logs training-job-controller-2033113564-w80q6 --namespace paddlecloud

When you see logs like this:

t=2018-03-13T22:13:19+0000 lvl=dbug msg="Cluster.InquiryResource done" resource="{NodeCount:1 GPURequest:0 GPULimit:0 GPUTotal:0 CPURequestMilli:265 CPULimitMilli:0 CPUTotalMilli:2000 MemoryRequestMega:168 MemoryLimitMega:179 MemoryTotalMega:1993 Nodes:{NodesCPUIdleMilli:map[minikube:1735] NodesMemoryFreeMega:map[minikube:1824]}}" stack="[github.com/paddlepaddle/edl/pkg/autoscaler.go:466 github.com/paddlepaddle/edl/pkg/controller.go:72]"

That means your EDL controller is actively monitoring and adjusting the resource distribution.
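To get a quick feel for the numbers in such a log line, the fields can be pulled out with standard tools. The following is a rough sketch: the helper names log_field and cpu_request_percent are hypothetical, and reading CPURequestMilli and CPUTotalMilli as requested and total CPU is an interpretation of the field names, not something documented here.

```shell
#!/bin/sh
# Hypothetical helpers (not part of EDL).

# Print the integer that follows "NAME:" in a log line.
# $1 is the log line, $2 is the field name.
log_field() {
    printf '%s\n' "$1" | sed -n "s/.*[^A-Za-z]$2:\([0-9][0-9]*\).*/\1/p"
}

# Print CPURequestMilli as a whole-number percentage of CPUTotalMilli.
# Fails if either field is missing or the total is zero.
cpu_request_percent() {
    req=$(log_field "$1" CPURequestMilli)
    total=$(log_field "$1" CPUTotalMilli)
    [ -n "$req" ] && [ -n "$total" ] && [ "$total" -gt 0 ] || return 1
    echo $(( 100 * $req / $total ))
}

# Example usage against a live cluster (pod name taken from the text above):
#   line=$(kubectl logs training-job-controller-2033113564-w80q6 \
#            --namespace paddlecloud | grep 'Cluster.InquiryResource done' | tail -n 1)
#   cpu_request_percent "$line"
```

For the sample line above, this yields 13, since 265 of 2000 millicores are requested and the shell arithmetic rounds down.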

Deploying a training-job

Now that we have a resource type training-job defined in Kubernetes and EDL watching and optimizing the resource distribution, let's create a training job to see how it works.

First, let's create your training job's docker image, which contains the training logic in example/train_ft.py:

cd example
docker build -t yourRepoName/my_edl_training_job .

Then push it to docker hub so that Kubernetes can access it:

docker push yourRepoName/my_edl_training_job

Please note that docker build uses the Dockerfile in the example directory, which bases my_edl_training_job on the docker image paddlepaddle/paddlecloud-job. This image has PaddlePaddle installed and configured, so you do not have to install it yourself.

Now that we have defined "what to run" for Kubernetes, it's time to define "how to run" the training job, which is configured in a yaml file. You can find an example yaml definition of a training job in example/examplejob.yaml.

In this file, change the image uri from paddlepaddle/paddlecloud-job to yourRepoName/my_edl_training_job.

In the spec section you will see two major members, trainer and pserver. Their configurations define how "distributed" this job is. For example, the min-instance and max-instance of trainer and pserver give the desired range of instance counts, and EDL adjusts the instance count based on this information. We'll have a separate document to describe these fields soon.
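To make the role of the range concrete, here is a minimal sketch of bounding an instance count by min-instance and max-instance. It only illustrates the idea of a range; it is not EDL's actual scaling logic, and the function name clamp_instances is hypothetical.

```shell
#!/bin/sh
# Hypothetical illustration (not EDL's implementation): keep a desired
# instance count inside the [min-instance, max-instance] range.
# $1 = desired count, $2 = min-instance, $3 = max-instance
clamp_instances() {
    if [ "$1" -lt "$2" ]; then
        echo "$2"
    elif [ "$1" -gt "$3" ]; then
        echo "$3"
    else
        echo "$1"
    fi
}

# Example: with min-instance 2 and max-instance 6, a request for 10
# instances is capped at the upper bound.
#   clamp_instances 10 2 6
```

Whatever count a scheduler would like to run, a job configured this way never drops below its minimum or grows beyond its maximum.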

Now let's start the training job by running the command below:

kubectl create -f examplejob.yaml

Resource Adjustments by EDL

TBD

FAQ

TBD

License

PaddlePaddle EDL is provided under the Apache-2.0 license.
