
mxnet-operator: Kubernetes operator for Apache MXNet jobs

⚠️ kubeflow/mxnet-operator is not maintained

This operator has been merged into Kubeflow Training Operator. This repository is not maintained and has been archived.

Overview

MXNet Operator provides a Kubernetes custom resource, MXJob, that makes it easy to run distributed or non-distributed Apache MXNet (incubating) jobs (training and tuning), as well as jobs for extended frameworks such as BytePS, on Kubernetes. Using a Custom Resource Definition (CRD) gives users the ability to create and manage Apache MXNet jobs just like built-in K8s resources.

Prerequisites

Installing the MXJob CRD and operator on your k8s cluster

Deploy MXJob CRD and Apache MXNet Operator

kustomize build manifests/overlays/v1 | kubectl apply -f -

Verify that MXJob CRD and Apache MXNet Operator are installed

Check that the Apache MXNet custom resource is installed via:

kubectl get crd

The output should include mxjobs.kubeflow.org like the following:

NAME                                           AGE
...
mxjobs.kubeflow.org                            4d
...

Check that the Apache MXNet operator is running via:

kubectl get pods

The output should include mxnet-operator-xxx like the following:

NAME                             READY   STATUS    RESTARTS   AGE
mxnet-operator-d466b46bc-xbqvs   1/1     Running   0          4m37s

Creating an Apache MXNet training job

You create a training job by defining an MXJob with MXTrain mode and then creating it with:

kubectl create -f examples/train/mx_job_dist_gpu_v1.yaml
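
Once the job is submitted, you can list the MXJobs in the namespace and watch the pods the operator creates for it:

kubectl get mxjobs
kubectl get pods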

Each replicaSpec defines a set of Apache MXNet processes. The mxReplicaType defines the semantics for the set of processes. The semantics are as follows:

scheduler

  • A job must have 1 and only 1 scheduler
  • The pod must contain a container named mxnet
  • The overall status of the MXJob is determined by the exit code of the mxnet container
    • 0 = success
    • 1 || 2 || 126 || 127 || 128 || 139 = permanent errors:
      • 1: general errors
      • 2: misuse of shell builtins
      • 126: command invoked cannot execute
      • 127: command not found
      • 128: invalid argument to exit
      • 139: container terminated by SIGSEGV (invalid memory reference)
    • 130 || 137 || 143 = retryable errors for unexpected system signals:
      • 130: container terminated by Control-C
      • 137: container received a SIGKILL
      • 143: container received a SIGTERM
    • 138 = reserved in tf-operator for user-specified retryable errors
    • others = undefined, with no guarantee of behavior

worker

  • A job can have 0 to N workers
  • The pod must contain a container named mxnet
  • Workers are automatically restarted if they exit

server

  • A job can have 0 to N servers
  • Parameter servers are automatically restarted if they exit

For each replica you define a template, which is a standard K8s PodTemplateSpec. The template allows you to specify the containers, volumes, etc. that should be created for each replica.
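
Because the overall MXJob status is determined by the exit code of the mxnet container in the scheduler pod (see the exit codes listed above), it can be useful to inspect that exit code directly when a job ends unexpectedly. A minimal sketch; replace <scheduler-pod> with the actual pod name from kubectl get pods:

# Print the exit code of the terminated `mxnet` container in the scheduler pod
kubectl get pod <scheduler-pod> \
  -o jsonpath='{.status.containerStatuses[?(@.name=="mxnet")].state.terminated.exitCode}'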

Creating a TVM tuning job (AutoTVM)

TVM is an end-to-end deep learning compiler stack, and you can easily run AutoTVM with mxnet-operator. You create an auto-tuning job by defining an MXJob with MXTune mode and then creating it with:

kubectl create -f examples/tune/mx_job_tune_gpu_v1.yaml

Before you use the auto-tuning example, some preparatory work needs to be finished in advance. To let TVM tune your network, you should create a Docker image that contains the TVM module. You also need an auto-tuning script that specifies which network will be tuned and sets the auto-tuning parameters; for more details, please see the TVM tutorials. Finally, you need a startup script to launch the auto-tuning program: mxnet-operator sets all the parameters as environment variables, and the startup script needs to read these variables and forward them to the auto-tuning script. We provide an example under examples/tune/; the tuning result is saved in a log file such as resnet-18.log in that example. You can refer to it for details.
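
For illustration only, such a startup script might look roughly like the sketch below; the TUNE_SCRIPT and TUNE_LOG variable names and the --log-file flag are placeholders for your own tuning setup, not names defined by mxnet-operator.

#!/bin/bash
# Hypothetical startup script: the operator passes tuning parameters to the
# container as environment variables; read them here and hand them to your
# own auto-tuning script. Variable names below are placeholders.
set -e
TUNE_SCRIPT=${TUNE_SCRIPT:-/workspace/auto_tuning.py}
TUNE_LOG=${TUNE_LOG:-resnet-18.log}
python3 "${TUNE_SCRIPT}" --log-file "${TUNE_LOG}"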

Using GPUs

MXNet Operator supports training with GPUs.

Please verify that your image supports distributed training with GPUs.

For example, if your container spec includes the following, MXNet Operator will schedule the pods onto nodes that can satisfy the GPU limit.

command: ["python"]
args: ["/incubator-mxnet/example/image-classification/train_mnist.py","--num-epochs","1","--num-layers","2","--kv-store","dist_device_sync","--gpus","0"]
resources:
  limits:
    nvidia.com/gpu: 1
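
Before submitting a GPU job, you may also want to confirm that your nodes actually advertise nvidia.com/gpu to the scheduler (this assumes the NVIDIA device plugin is already installed on the cluster):

# Show the GPU capacity and allocatable counts reported by each node
kubectl describe nodes | grep "nvidia.com/gpu"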

Monitoring your Apache MXNet job

To get the status of your job:

kubectl get -o yaml mxjobs $JOB

Here is sample output for an example job:

apiVersion: kubeflow.org/v1
kind: MXJob
metadata:
  creationTimestamp: 2021-03-24T15:37:27Z
  generation: 1
  name: mxnet-job
  namespace: default
  resourceVersion: "5123435"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/mxjobs/mxnet-job
  uid: xx11013b-4a28-11e9-s5a1-704d7bb912f91
spec:
  cleanPodPolicy: All
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - /incubator-mxnet/example/image-classification/train_mnist.py
            - --num-epochs
            - "10"
            - --num-layers
            - "2"
            - --kv-store
            - dist_device_sync
            - --gpus
            - "0"
            command:
            - python
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources:
              limits:
                nvidia.com/gpu: "1"
status:
  completionTime: 2021-03-24T09:25:11Z
  conditions:
  - lastTransitionTime: 2021-03-24T15:37:27Z
    lastUpdateTime: 2021-03-24T15:37:27Z
    message: MXJob mxnet-job is created.
    reason: MXJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2021-03-24T15:37:27Z
    lastUpdateTime: 2021-03-24T15:37:29Z
    message: MXJob mxnet-job is running.
    reason: MXJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2021-03-24T15:37:27Z
    lastUpdateTime: 2021-03-24T09:25:11Z
    message: MXJob mxnet-job is successfully completed.
    reason: MXJobSucceeded
    status: "True"
    type: Succeeded
  mxReplicaStatuses:
    Scheduler: {}
    Server: {}
    Worker: {}
  startTime: 2021-03-24T15:37:29Z

The first thing to note is the RuntimeId. This is a random unique string which is used to give names to all the K8s resources (e.g. Job controllers and services) that are created by the MXJob.

As with other K8s resources, status provides information about the state of the resource.

phase - Indicates the phase of a job and will be one of

  • Creating
  • Running
  • CleanUp
  • Failed
  • Done

state - Provides the overall status of the job and will be one of

  • Running
  • Succeeded
  • Failed

For each replica type in the job, there will be a ReplicaStatus that provides the number of replicas of that type in each state.
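
If you only need the overall outcome rather than the full YAML, a JSONPath query against the conditions shown in the sample output above works too; a minimal sketch using the mxnet-job example:

# Prints "True" once the job has completed successfully
kubectl get mxjobs mxnet-job -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'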

For each replica type, the job creates a set of K8s Job Controllers named

${REPLICA-TYPE}-${RUNTIME_ID}-${INDEX}

For example, if you have 2 servers and the runtime id is "76no", then MXJob will create the following two jobs:

server-76no-0
server-76no-1

Contributing

Please refer to this document for contributing guidelines.

Community

Please check out the Kubeflow community page for more information on how to get involved in our community.


mxnet-operator's Issues

Migrate to use kustomize to deploy mxnet-operator

ksonnet was deprecated a long time ago. Let's migrate to kustomize and have clear guidance for deployment setups. Currently, MXNet is not in the manifests applications, which means it's not a default option. That's why we need clear instructions.

Request for roadmap

Hello @suleisl2000, great work! Excited to see MXNet with Kubeflow.
I am very interested in working on MXNet + Kubeflow. Do you see the need for new contributions that can be helpful in this project? Can you please share the roadmap/ToDos/how to get started with the contributions?

Thanks for working on this.

Change to follow kubeflow/common convention and reuse implementation

Currently, the mxnet-operator API doesn't follow the kubeflow/common convention, but the controller does. It imports the tf-operator implementation: https://github.com/kubeflow/mxnet-operator/blob/c718707b304dc1ed0210a740c8efe7071d4ebb3e/pkg/controller.v1/mxnet/controller.go#L42

To graduate mxnet-operator to v1, we'd like to migrate to follow the kubeflow/common convention. It already has all the logic for ActiveDeadlineSeconds, BackoffLimit, etc. This will simplify the implementation on the mxnet-operator side.

Release a new container image with v1 binary

docker.io/mxjob/mxnet-operator:v1 exists but doesn't seem to have been built correctly in the past: there is no binary under /opt/kubeflow, and the Go version is 1.10.

docker.io/mxjob/mxnet-operator:v1beta1 seems to have the binary /opt/kubeflow/mxnet-operator.v1beta1 in the right place.

I think it would be better to onboard GCR to host these images.

the status of worker-0 is error, but the status of mxjob is Succeeded

kubeflow version: 0.5.0
mxnet-operator version: v1beta1

Kubernetes dashboard screenshot (image not included)

worker-0 log:
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, data_dir='/admin/public/model/mxnet_distributed/data', disp_batches=10, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='1, 28, 28', initializer='default', kv_store='dist_device_sync', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=6000, num_layers=2, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
Traceback (most recent call last):
File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 99, in
fit.fit(args, sym, get_mnist_iter)
File "/admin/public/model/mxnet_model/mxnet_distributed/common/fit.py", line 180, in fit
(train, val) = data_loader(args, kv)
File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 57, in get_mnist_iter
'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 37, in read_data
with gzip.open(os.path.join(args.data_dir,label)) as flbl:
File "/opt/conda/lib/python3.6/gzip.py", line 53, in open
binary_file = GzipFile(filename, gz_mode, compresslevel)
File "/opt/conda/lib/python3.6/gzip.py", line 163, in init
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/admin/public/model/mxnet_distributed/data/train-labels-idx1-ubyte.gz'

mxjob status:

{
  "status": {
        "completionTime": "2019-05-21T08:37:24Z",
        "conditions": [
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:36:41Z",
                "message": "MXJob mxnet-8d1f211e is created.",
                "reason": "MXJobCreated",
                "status": "True",
                "type": "Created"
            },
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:36:46Z",
                "message": "MXJob mxnet-8d1f211e is running.",
                "reason": "MXJobRunning",
                "status": "False",
                "type": "Running"
            },
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:37:24Z",
                "message": "MXJob mxnet-8d1f211e is successfully completed.",
                "reason": "MXJobSucceeded",
                "status": "True",
                "type": "Succeeded"
            }
        ],
        "mxReplicaStatuses": {
            "Scheduler": {},
            "Server": {},
            "Worker": {}
        },
        "startTime": "2019-05-21T08:36:44Z"
	}
}

MXJob Kubernetes resource was created successfully, but SCHEDULER, SERVER, and WORKER jobs were not created

I followed the README to create an MXJob. The MXJob Kubernetes resource was created successfully, but the SCHEDULER, SERVER, and WORKER objects were not created.

kubernetes version: 1.12.2
kubeflow version: 0.3.1
ksonnet version: dev-2018-11-20T22:21:08+0800
jsonnet version: v0.11.2
client-go version: kubernetes-1.10.4

The following information may be useful:

# kubectl  get crd
NAME                     CREATED AT
mxjobs.kubeflow.org      2018-11-21T14:07:37Z
studyjobs.kubeflow.org   2018-11-20T14:35:52Z
tfjobs.kubeflow.org      2018-11-20T14:35:36Z
workflows.argoproj.io    2018-11-20T14:35:44Z
# kubectl  get pods -n kubeflow
NAME                                                      READY   STATUS             RESTARTS   AGE
ambassador-c97f7b448-cdqdx                                3/3     Running            1          7h9m
ambassador-c97f7b448-f8t8v                                3/3     Running            1          7h9m
ambassador-c97f7b448-hlnrg                                3/3     Running            1          7h9m
argo-ui-7495b79b59-96xkq                                  1/1     Running            0          7h9m
centraldashboard-798f8d68d5-swg7v                         1/1     Running            0          7h9m
modeldb-backend-d69695b66-fkxgs                           1/1     Running            0          7h9m
modeldb-db-975db58f7-4dbck                                1/1     Running            0          7h9m
modeldb-frontend-78ccff78b7-pr8kv                         1/1     Running            0          7h9m
mxnet-operator-6c49b767bc-5mpjg                           1/1     Running            0          12m
spartakus-volunteer-ffdfcdb5c-dlvz2                       1/1     Running            0          7h9m
studyjob-controller-7df5754ddf-5fdgk                      1/1     Running            0          7h9m
tf-hub-0                                                  1/1     Running            0          7h8m
tf-job-dashboard-7499d5cbcf-52dfr                         1/1     Running            0          7h9m
tf-job-operator-v1alpha2-644c5f7db7-vnglr                 1/1     Running            0          7h9m
# cat mx_job_dist.yaml
apiVersion: "kubeflow.org/v1alpha1"
kind: "MXJob"
metadata:
  name: "gpu-dist-job"
spec:
  jobMode: "dist"
  replicaSpecs:
    - replicas: 1
      mxReplicaType: SCHEDULER
      PsRootPort: 9000
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
          restartPolicy: OnFailure
    - replicas: 2
      mxReplicaType: SERVER
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
          restartPolicy: OnFailure
    - replicas: 2
      mxReplicaType: WORKER
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
              command: ["python"]
              args: ["/incubator-mxnet/example/image-classification/train_mnist.py","--num-epochs","1","--num-layers","2","--kv-store","dist_device_sync","--gpus","0,1"]
              resources:
                limits:
                  nvidia.com/gpu: 2
          restartPolicy: OnFailure

After the above preparation was completed, creating the mxjob gpu-dist-job succeeded, but the scheduler, server, and worker jobs were not created :(

# kubectl  get mxjobs
NAME           AGE
gpu-dist-job   12m
# kubectl  get jobs
No resources found.

How should I solve it? thx.

@gaocegege @suleisl2000

MXNet Operator: Move manifests development upstream

Background

Umbrella-Issue: kubeflow/manifests#1740

As part of the work of wg-manifests for 1.3
(kubeflow/manifests#1735), we are moving manifests
development to upstream repos. This gives the application developers full
ownership of their manifests, tracked in a single place.

To give an example from the notebook-controller, the diagram shows the current
situation. Manifests are in two repos, blurring separation of responsibilities
and making things harder for application developers.

diagram_1

Instead, we will copy all manifests from the manifests repo back to each
upstream's repo. From there, they will be imported in the manifests repo. The
following diagram presents the desired state:

diagram_2

Current State

MXNet Operator has manifests:

The proposed folder in the upstream repo to receive the manifests is:
manifests.

The goal is to consolidate all manifests development in the upstream repo.
The manifests repo will include a copy of the manifests under apps/mxnet-job/upstream.
This copy will be synced periodically.

Success Criteria

The manifests in apps/mxnet-job/upstream should be a copy of the upstream manifests
folder. To do that, the application manifests in kubeflow/manifests should be
moved to the upstream application repo.

/assign @yanniszark
cc @kubeflow/wg-training-leads

Move presubmit jobs to AWS CI

The community is asking the different WGs to own their infra, and the community won't provide a common shared testing infra anymore. See kubeflow/testing#752 for more details.

The PyTorch migration works well, and here's the PR that updates the test case to work on the AWS Prow + Argo cluster:
kubeflow/pytorch-operator#305

However, mxnet-operator currently doesn't really have any jobs defined. https://github.com/kubeflow/mxnet-operator/blob/master/prow_config.yaml

If someone can contribute e2e tests, we can follow what pytorch-operator did and kick them off in the new testing infra.

Graduate mxnet-operator to v1

I created this parent ticket to track all the changes needed to graduate mxnet-operator to v1.

Configuration and deployment

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Kustomize package | Required | Done | kubeflow/manifests#279 |
| Application CR | Required | Not Done | |
| Images listed in kustomization.yaml | Required | Not Done | |
| Upgradeability | Required | Done | |
| Separate cluster scoped and namespace scoped resources | Recommended | Done | N/A |
| Kustomize package should be deployable on its own | Recommended | Not Done | |

Custom Resources

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Version stability | Required | Not Done | |
| Backward compatibility | Required | Not Done | |
| Supports status subresource | Required | Not Done | |
| CRD schema validation | Required | Not Done | |
| Training operators follow kubeflow/common conventions | Required | Not Done | |

Logging and monitoring

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Liveness/Readiness signals | Required | Not Done | |
| Prometheus metrics | Required | Not Done | |
| Json logging | Recommended | Not Done | |

CI/CD

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| E2E tests | Required | Not Done | |
| Scalability / load testing | Required | Not Done | |
| Continuous building of docker images | Recommended | Not Done | |
| Continuous updating of Kustomize manifests | Recommended | Not Done | |

Docs

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| API Reference docs | Required | Not Done | |
| Application docs | Required | Not Done | |

Owners/Maintenance

| Description | Category | Explanation | Status | Issue |
| --- | --- | --- | --- | --- |
| Healthy number of committers and commits | Required | Committers are listed as approvers in owners files. Number to be determined by TOC based on size and scope of application. | Not Done | |
| At least 2 different organizations are committers | Required | AWS, Caicloud, Huawei, Qihoo 360, TuSimple | Done | #63 |

Adoption

| Description | Category | Explanation |
| --- | --- | --- |
| List of users running the application | Recommended | Suggest listing adopters willing to be identified publicly in ADOPTERS.md |

XGBoost Operator: Move manifests development upstream

Background

Umbrella-Issue: kubeflow/manifests#1740

As part of the work of wg-manifests for 1.3
(kubeflow/manifests#1735), we are moving manifests
development to upstream repos. This gives the application developers full
ownership of their manifests, tracked in a single place.

To give an example from the notebook-controller, the diagram shows the current
situation. Manifests are in two repos, blurring separation of responsibilities
and making things harder for application developers.

diagram_1

Instead, we will copy all manifests from the manifests repo back to each
upstream's repo. From there, they will be imported in the manifests repo. The
following diagram presents the desired state:

diagram_2

Current State

XGBoost Operator has manifests:

Desired State

The proposed folder in the upstream repo to receive the manifests is:
config.

The goal is to consolidate all manifests development in the upstream repo.
The manifests repo will include a copy of the manifests under apps/xgboost-job/upstream.
This copy will be synced periodically.

Success Criteria

The manifests in apps/xgboost-job/upstream should be a copy of the upstream manifests
folder. To do that, the application manifests in kubeflow/manifests should be
moved to the upstream application repo.

/assign @yanniszark
cc @kubeflow/wg-training-leads

Customizing environment variables of the pods in mxnetjob crashes the operator

apiVersion: kubeflow.org/v1alpha1
kind: MXJob
metadata:
  name: mxnet-gpu-dist-job
spec:
  jobMode: dist
  replicaSpecs:
  - mxReplicaType: SCHEDULER
    PsRootPort: 9000
    replicas: 1
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          name: mxnet
          env:
          - name: PS_VERBOSE
            value: "2"
        restartPolicy: OnFailure
  - mxReplicaType: SERVER
    replicas: 2
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          name: mxnet
#          env:
#          - name: PS_VERBOSE
#            value: "2"
  - mxReplicaType: WORKER
    replicas: 4
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          args:
            - /incubator-mxnet/example/image-classification/train_imagenet.py
            - --num-epochs
            - '1'
            - --benchmark
            - '1'
            - --kv-store
            - dist_device_sync
            - --network
            - inception-v3
            - --batch-size
            - '64'
            - --image-shape
            - '3,299,299'
            - --gpus
            - '0'
          command:
            - python
#          env:
#            - name: PS_VERBOSE
#              value: "2"
          name: mxnet
          resources:
            limits:
                nvidia.com/gpu: 1
        restartPolicy: OnFailure

Enabling PS_VERBOSE on any of the pods crashes the operator:

f018986aae72:baictl stsukrov$ kubectl logs mxnet-operator-f46557c4f-wfklx
{"filename":"app/server.go:64","level":"info","msg":"KUBEFLOW_NAMESPACE not set, using default namespace","time":"2019-03-26T09:33:49Z"}
{"filename":"app/server.go:69","level":"info","msg":"[API Version: v1alpha1 Version: v0.1.0-alpha Git SHA: Not provided. Go Version: go1.10.2 Go OS/Arch: linux/amd64]","time":"2019-03-26T09:33:49Z"}
{"filename":"app/server.go:153","level":"info","msg":"No controller_config_file provided; using empty config.","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:138","level":"info","msg":"Setting up event handlers","time":"2019-03-26T09:33:49Z"}
I0326 09:33:49.093823       1 leaderelection.go:174] attempting to acquire leader lease...
E0326 09:33:49.115391       1 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"mxnet-operator", GenerateName:"", Namespace:"default", SelfLink:"/api/v1/namespaces/default/endpoints/mxnet-operator", UID:"c7a02cee-4efb-11e9-a1a1-025004746b4c", ResourceVersion:"204330", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63689114690, loc:(*time.Location)(0x18bc1a0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"mxnet-operator-f46557c4f-wfklx\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2019-03-25T12:44:50Z\",\"renewTime\":\"2019-03-26T09:33:49Z\",\"leaderTransitions\":0}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'no kind is registered for the type v1.Endpoints'. Will not report event: 'Normal' 'LeaderElection' 'mxnet-operator-f46557c4f-wfklx became leader'
I0326 09:33:49.115722       1 leaderelection.go:184] successfully acquired lease default/mxnet-operator
{"filename":"controller/controller.go:176","level":"info","msg":"Starting MXJob controller","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:179","level":"info","msg":"Waiting for informer caches to sync","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:184","level":"info","msg":"Starting 1 workers","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:190","level":"info","msg":"Started workers","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:273","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Creating new job default/mxnet-gpu-dist-job","time":"2019-03-26T09:33:49Z"}
{"filename":"trainer/replicas.go:507","job":"default/mxnet-gpu-dist-job","job_type":"SCHEDULER","level":"info","msg":"Job mxnet-gpu-dist-job missing pod for replica SCHEDULER index 0, creating a new one.","mx_job_name":"mxnet-gpu-dist-job","runtime_id":"pub8","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:245","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Finished syncing job \"default/mxnet-gpu-dist-job\" (7.526234ms)","time":"2019-03-26T09:33:49Z"}
E0326 09:33:49.223718       1 runtime.go:66] Observed a panic: "index out of range" (runtime error: index out of range)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/home/tusimple/go/src/runtime/asm_amd64.s:573
/home/tusimple/go/src/runtime/panic.go:502
/home/tusimple/go/src/runtime/panic.go:28
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/home/tusimple/go/src/runtime/asm_amd64.s:2361
panic: runtime error: index out of range [recovered]
	panic: runtime error: index out of range

goroutine 102 [running]:
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0xfb2f60, 0x18aabd0)
	/home/tusimple/go/src/runtime/panic.go:502 +0x229
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).CreatePodWithIndex(0xc42041f980, 0x0, 0x3f, 0xc4205ab378, 0x3)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218 +0x11cd
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).SyncPods(0xc42041f980, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509 +0x3a8
github.com/kubeflow/mxnet-operator/pkg/trainer.(*TrainingJob).Reconcile(0xc420692370, 0xc4205dc1d0, 0xc420711300, 0x1a, 0xc420358158)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362 +0x11f
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).syncMXJob(0xc4205dc1b0, 0xc420711300, 0x1a, 0xc420086c00, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291 +0xaee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.syncMXJob)-fm(0xc420711300, 0x1a, 0xc4205bd380, 0xf58b20, 0xc4203227c0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162 +0x3e
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).processNextWorkItem(0xc4205dc1b0, 0xc4205ca100)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215 +0xee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).runWorker(0xc4205dc1b0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201 +0x2b
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.runWorker)-fm()
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x2a
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42042e5b0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42042e5b0, 0x3b9aca00, 0x0, 0x1, 0xc4205e6600)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42042e5b0, 0x3b9aca00, 0xc4205e6600)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).Run
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x22b

no matches for kind "MXJob" in version "kubeflow.org/v1beta1"

I can run the tf-job, but cannot run the mxnet-job.

The command I use to list crd is
kubectl get crd

The output of crd is
applications.app.k8s.io 2019-03-18T17:54:51Z
...
mxjobs.kubeflow.org 2019-03-24T23:20:39Z
pytorchjobs.kubeflow.org 2019-03-18T17:53:18Z
scheduledworkflows.kubeflow.org 2019-03-18T17:53:43Z
tfjobs.kubeflow.org 2019-03-18T21:21:14Z
workflows.argoproj.io 2019-03-18T17:53:36Z

The command I use to launch the task is
kubectl create -f mxnet-operator/examples/v1beta1/train/mx_job_dist_gpu.yaml

The output is
error: unable to recognize "mxnet-operator/examples/v1beta1/train/mx_job_dist_gpu.yaml": no matches for kind "MXJob" in version "kubeflow.org/v1beta1"

I really appreciate any help.

Next fine grain roles for mxnet-operator

Currently, we bind the cluster-admin ClusterRole to the default service account in the default namespace. This is coarse, and it would be great to understand exactly what policies are needed to run the controller.

roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: default
  namespace: default

Failed to build container image in examples

The Docker image doesn't build. It's probably because of a base image change.
https://github.com/kubeflow/mxnet-operator/blob/master/examples/v1beta1/train/Dockerfile

$ docker build -t myusername/mxnet:gpu .
Sending build context to Docker daemon  8.192kB
Step 1/10 : FROM mxnet/python:gpu
 ---> 31528c92407c
Step 2/10 : RUN apt-get update     && apt-get install -y git     && apt-get install -y build-essential git     && apt-get install -y libopenblas-dev liblapack-dev     && apt-get install -y libopencv-dev
 ---> Using cache
 ---> c7e42befb56d
Step 3/10 : RUN rm -rf /mxnet     && git clone --recursive https://github.com/apache/incubator-mxnet.git -b v1.2.0
 ---> Running in a2275a3a24eb
fatal: could not create work tree dir 'incubator-mxnet': No such file or directory
The command '/bin/sh -c rm -rf /mxnet     && git clone --recursive https://github.com/apache/incubator-mxnet.git -b v1.2.0' returned a non-zero code: 128

It actually clones the source into the root directory /, which is not a best practice. We can add a WORKDIR and point it somewhere else.

Adding an initContainer to a worker pod crashes the operator

I added an initContainer to an MXJob. Similar code works fine with kubeflow/mpijob.

  - mxReplicaType: WORKER
    replicas: 4
    template:
      spec:
        initContainers:
        - image: alpine
          command:
            - echo
            - Hello world!
        containers:
        - image: stsukrov/mxnetbench
          args:
            - /incubator-mxnet/example/image-classification/train_imagenet.py
            - --num-epochs
            - '1'
            - --benchmark
            - '1'
            - --kv-store
            - dist_device_sync
            - --network
            - inception-v3
            - --batch-size
            - '64'
            - --image-shape
            - '3,299,299'
            - --gpus
            - '0'
          command:
            - python
#          env:
#            - name: PS_VERBOSE
#              value: "2"
          name: mxnet
          resources:
            limits:
                nvidia.com/gpu: 1
        restartPolicy: OnFailure

mxnet operator crashes with:

{"filename":"app/server.go:64","level":"info","msg":"KUBEFLOW_NAMESPACE not set, using default namespace","time":"2019-03-26T09:43:50Z"}
{"filename":"app/server.go:69","level":"info","msg":"[API Version: v1alpha1 Version: v0.1.0-alpha Git SHA: Not provided. Go Version: go1.10.2 Go OS/Arch: linux/amd64]","time":"2019-03-26T09:43:50Z"}
{"filename":"app/server.go:153","level":"info","msg":"No controller_config_file provided; using empty config.","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:138","level":"info","msg":"Setting up event handlers","time":"2019-03-26T09:43:50Z"}
I0326 09:43:50.859250       1 leaderelection.go:174] attempting to acquire leader lease...
E0326 09:43:50.882160       1 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"mxnet-operator", GenerateName:"", Namespace:"default", SelfLink:"/api/v1/namespaces/default/endpoints/mxnet-operator", UID:"c7a02cee-4efb-11e9-a1a1-025004746b4c", ResourceVersion:"205844", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63689114690, loc:(*time.Location)(0x18bc1a0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"mxnet-operator-f46557c4f-4cxr5\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2019-03-26T09:43:27Z\",\"renewTime\":\"2019-03-26T09:43:50Z\",\"leaderTransitions\":1}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'no kind is registered for the type v1.Endpoints'. Will not report event: 'Normal' 'LeaderElection' 'mxnet-operator-f46557c4f-4cxr5 became leader'
I0326 09:43:50.882755       1 leaderelection.go:184] successfully acquired lease default/mxnet-operator
{"filename":"controller/controller.go:176","level":"info","msg":"Starting MXJob controller","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:179","level":"info","msg":"Waiting for informer caches to sync","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:184","level":"info","msg":"Starting 1 workers","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:190","level":"info","msg":"Started workers","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:273","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Creating new job default/mxnet-gpu-dist-job","time":"2019-03-26T09:43:50Z"}
{"filename":"trainer/replicas.go:507","job":"default/mxnet-gpu-dist-job","job_type":"SERVER","level":"info","msg":"Job mxnet-gpu-dist-job missing pod for replica SERVER index 0, creating a new one.","mx_job_name":"mxnet-gpu-dist-job","runtime_id":"1eg0","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:245","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Finished syncing job \"default/mxnet-gpu-dist-job\" (13.789497ms)","time":"2019-03-26T09:43:50Z"}
E0326 09:43:50.997170       1 runtime.go:66] Observed a panic: "index out of range" (runtime error: index out of range)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/home/tusimple/go/src/runtime/asm_amd64.s:573
/home/tusimple/go/src/runtime/panic.go:502
/home/tusimple/go/src/runtime/panic.go:28
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/home/tusimple/go/src/runtime/asm_amd64.s:2361
panic: runtime error: index out of range [recovered]
	panic: runtime error: index out of range

goroutine 110 [running]:
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0xfb2f60, 0x18aabd0)
	/home/tusimple/go/src/runtime/panic.go:502 +0x229
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).CreatePodWithIndex(0xc4203d43c0, 0x0, 0x3f, 0xc42078d378, 0x3)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218 +0x11cd
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).SyncPods(0xc4203d43c0, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509 +0x3a8
github.com/kubeflow/mxnet-operator/pkg/trainer.(*TrainingJob).Reconcile(0xc4207aa420, 0xc42040c260, 0xc420440000, 0x1a, 0xc4204fa568)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362 +0x11f
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).syncMXJob(0xc42040c240, 0xc420440060, 0x1a, 0xc420508c00, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291 +0xaee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.syncMXJob)-fm(0xc420440060, 0x1a, 0xc420442c80, 0xf58b20, 0xc420538090)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162 +0x3e
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).processNextWorkItem(0xc42040c240, 0xc4203ac400)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215 +0xee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).runWorker(0xc42040c240)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201 +0x2b
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.runWorker)-fm()
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x2a
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc420398160)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc420398160, 0x3b9aca00, 0x0, 0x1, 0xc4207b0000)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc420398160, 0x3b9aca00, 0xc4207b0000)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).Run
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x22b
f018986aae72:baictl stsukrov$ kubectl delete -f mxnet.yaml

RestartPolicy does not work

When setting the RestartPolicy to Never, if I kill a pod (no matter what role it is), the pod still restarts. Besides, if I kill the worker, the job keeps running, which is incorrect; it should actually be in the Failed state.

Support gang-scheduling by kube-batch

Gang scheduling is a common requirement for training jobs, and kube-batch supports it right now, so it would be good to do the integration for training jobs.

Mxnet operator v1 API

There are a couple of minor API changes that have been suggested. We can incorporate all of these changes in the next API version.

Related: kubeflow/training-operator#935

  • Requires support of Status subresource in CRD
  • Add ActiveDeadlineSeconds and BackoffLimit
  • Use pod group instead of PDB for gang scheduling
  • Supporting multiple versions of CRD

@suleisl2000 @gaocegege

error in judging status of mxjob

I set scheduler replicas to 0, server replicas to 0, and worker replicas to 1 to run a simple (non-distributed) MXNet training script. The moment I created the MXJob, its status became "Succeeded", but the worker pod was still running.

The MXJob detail is as follows:

{

"apiVersion": "kubeflow.org/v1beta1",
"kind": "MXJob",
"metadata": {
    "creationTimestamp": "2019-05-07T05:36:22Z",
    "generation": 1,
    "name": "mxnet-0da7ef98",
    "namespace": "86872fb0-64d1-11e9-96c1-1ee8ba783f60",
    "resourceVersion": "44498209",
    "selfLink": "/apis/kubeflow.org/v1beta1/namespaces/86872fb0-64d1-11e9-96c1-1ee8ba783f60/mxjobs/mxnet-0da7ef98",
    "uid": "0c639d60-708a-11e9-991a-005056ae0022"
},
"spec": {
    "cleanPodPolicy": "None",
    "jobMode": "MXTrain",
    "mxReplicaSpecs": {
        "Scheduler": {
            "replicas": 0,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "0",
                                    "memory": "0"
                                },
                                "requests": {
                                    "cpu": "0",
                                    "memory": "0"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        },
        "Server": {
            "replicas": 0,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "0",
                                    "memory": "0"
                                },
                                "requests": {
                                    "cpu": "0",
                                    "memory": "0"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        },
        "Worker": {
            "replicas": 1,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "1",
                                    "memory": "1Gi"
                                },
                                "requests": {
                                    "cpu": "1",
                                    "memory": "1Gi"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        }
    }
},
"status": {
    "completionTime": "2019-05-07T05:36:22Z",
    "conditions": [
        {
            "lastTransitionTime": "2019-05-07T05:36:22Z",
            "lastUpdateTime": "2019-05-07T05:36:22Z",
            "message": "MXJob mxnet-0da7ef98 is created.",
            "reason": "MXJobCreated",
            "status": "True",
            "type": "Created"
        },
        {
            "lastTransitionTime": "2019-05-07T05:36:22Z",
            "lastUpdateTime": "2019-05-07T05:36:22Z",
            "message": "MXJob mxnet-0da7ef98 is successfully completed.",
            "reason": "MXJobSucceeded",
            "status": "True",
            "type": "Succeeded"
        }
    ],
    "mxReplicaStatuses": {
        "Scheduler": {},
        "Server": {},
        "Worker": {}
    },
    "startTime": "2019-05-07T05:36:22Z"
}

}

Migrate from Dep to Go modules

The mxnet-operator project is still using Dep. It would be great to migrate to Go modules.

  • Upgrade Golang version #53
  • Upgrade k8s.io dependencies to 1.15.x
  • Remove Gopkg.lock, Gopkg.toml and vendor files
  • Update development guidance

Cannot start up the example training job

kubernetes version: v1.15.2
kubeflow version: v0.6.1
mxnet-operator version: v1beta1

I cannot start up the example training job after following the steps in README.md.

First, I installed Kubeflow in the k8s cluster and got mxjob support:

$ kubectl get crd | grep kube            
experiments.kubeflow.org                      2019-09-04T04:05:29Z
mpijobs.kubeflow.org                          2019-09-18T04:08:45Z
mxjobs.kubeflow.org                           2019-09-04T04:05:30Z
notebooks.kubeflow.org                        2019-09-04T04:05:29Z
poddefaults.kubeflow.org                      2019-09-04T04:05:28Z
profiles.kubeflow.org                         2019-09-04T04:05:31Z
pytorchjobs.kubeflow.org                      2019-09-04T04:05:29Z
scheduledworkflows.kubeflow.org               2019-09-04T04:05:31Z
tfjobs.kubeflow.org                           2019-09-04T04:05:30Z
trials.kubeflow.org                           2019-09-04T04:05:29Z
viewers.kubeflow.org                          2019-09-04T04:05:31Z

After creating the mxjob with the following, I got the pod status:

$ kubectl create -f examples/v1beta1/train/mx_job_dist_gpu.yaml
$ kubectl get pods -n dml -o wide
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
mxnet-job-scheduler-0   2/2     Running   0          4h36m   192.168.107.59   p40-4   <none>           <none>
mxnet-job-server-0      2/2     Running   0          4h36m   192.168.107.60   p40-4   <none>           <none>
mxnet-job-worker-0      2/2     Running   0          4h36m   192.168.107.61   p40-4   <none>           <none>

All pods of the created mxjob are running, and the containers in those pods even report healthy conditions:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:47:58Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:48:02Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:48:02Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:47:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://91d6fccccc646b70022e4dc1e840140096d2566e12a547ed01c31a7f2fff5e62
    image: istio/proxyv2:1.2.5
    imageID: docker-pullable://istio/proxyv2@sha256:8f210c3d09beb6b8658a4255d9ac30e25549295834a44083ed67d652ad7453e4
    lastState: {}
    name: istio-proxy
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-09-24T08:47:59Z"
  - containerID: docker://b1892b0750dbae31b77c53ddf5883f35a1245b6d165f22ad5c6358ec4ced830b
    image: mxjob/mxnet:gpu
    imageID: docker-pullable://mxjob/mxnet@sha256:f0ab7315578dbcddab6af926855d2586190f4f0c3dd5f4bb34f28a5a15ac7c84
    lastState: {}
    name: mxnet
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-09-24T08:47:59Z"
  hostIP: 10.1.2.14
  initContainerStatuses:
  - containerID: docker://fd7d003ef7ae3a8da1b12ebdbb64e731ed2aab767a9f230fc4cd840207f9f7fb
    image: istio/proxy_init:1.2.5
    imageID: docker-pullable://istio/proxy_init@sha256:c9964a8c1c28b85cc631bbc90390eac238c90f82c8f929495d1e9f9a9135b724
    lastState: {}
    name: istio-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: docker://fd7d003ef7ae3a8da1b12ebdbb64e731ed2aab767a9f230fc4cd840207f9f7fb
        exitCode: 0
        finishedAt: "2019-09-24T08:47:58Z"
        reason: Completed
        startedAt: "2019-09-24T08:47:57Z"
  phase: Running
  podIP: 192.168.107.61
  qosClass: Burstable
  startTime: "2019-09-24T08:47:55Z"

After that, I stepped into the mxnet container in the mxnet-job-worker-0 pod and found that the training script was blocked on the code kv = mx.kvstore.create(args.kv_store) in L149 of /incubator-mxnet/example/image-classification/common/fit.py:

141 def fit(args, network, data_loader, **kwargs):
142     """
143     train a model
144     args : argparse returns
145     network : the symbol definition of the nerual network
146     data_loader : function that returns the train and val data iterators
147     """
148     # kvstore
149     logging.info('creating kvstore')  # added to locate the blocked point
150     kv = mx.kvstore.create(args.kv_store)
151     logging.info('created kvstore')   # added to locate the blocked point  

The possible reason I can think of for the above situation is that communication between the pods or containers is disabled, so I manually pinged the mxnet-job-scheduler-0 and mxnet-job-server-0 pods, which are exposed by k8s services bound to the inner mxnet-job containers, and the connection between those containers is OK. But I cannot ssh into those containers using hostnames like mxnet-job-server-0, which means the precondition for MXNet distributed training is not met, as there is no ssh service available for MXNet distributed training.

Is there any incorrect operation in starting up the mxjob with kubectl, or any wrong configuration in mx_job_dist_gpu.yaml? If so, please let me know. Thank you for any suggestions.
