
mxnet-operator: Kubernetes operator for Apache MXNet jobs

⚠️ kubeflow/mxnet-operator is not maintained

This operator has been merged into Kubeflow Training Operator. This repository is not maintained and has been archived.

Overview

MXNet Operator provides a Kubernetes custom resource, MXJob, that makes it easy to run distributed or non-distributed Apache MXNet (incubating) jobs (training and tuning), as well as jobs for extended frameworks such as BytePS, on Kubernetes. Using a Custom Resource Definition (CRD) gives users the ability to create and manage Apache MXNet jobs just like built-in K8s resources.

Prerequisites

Installing the MXJob CRD and operator on your k8s cluster

Deploy MXJob CRD and Apache MXNet Operator

kustomize build manifests/overlays/v1 | kubectl apply -f -

Verify that MXJob CRD and Apache MXNet Operator are installed

Check that the Apache MXNet custom resource is installed via:

kubectl get crd

The output should include mxjobs.kubeflow.org like the following:

NAME                                           AGE
...
mxjobs.kubeflow.org                            4d
...

Check that the Apache MXNet operator is running via:

kubectl get pods

The output should include mxnet-operator-xxx like the following:

NAME                             READY   STATUS    RESTARTS   AGE
mxnet-operator-d466b46bc-xbqvs   1/1     Running   0          4m37s

Creating an Apache MXNet training job

You create a training job by defining an MXJob with MXTrain mode and then creating it with:

kubectl create -f examples/train/mx_job_dist_gpu_v1.yaml
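
Once the job is submitted, you can list the MXJobs in the namespace and watch the pods the operator creates for it:

kubectl get mxjobs
kubectl get pods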

Each replicaSpec defines a set of Apache MXNet processes. The mxReplicaType defines the semantics for the set of processes. The semantics are as follows:

scheduler

  • A job must have 1 and only 1 scheduler
  • The pod must contain a container named mxnet
  • The overall status of the MXJob is determined by the exit code of the mxnet container
    • 0 = success
    • 1 || 2 || 126 || 127 || 128 || 139 = permanent errors:
      • 1: general errors
      • 2: misuse of shell builtins
      • 126: command invoked cannot execute
      • 127: command not found
      • 128: invalid argument to exit
      • 139: container terminated by SIGSEGV (invalid memory reference)
    • 130 || 137 || 143 = retryable errors for unexpected system signals:
      • 130: container terminated by Control-C
      • 137: container received a SIGKILL
      • 143: container received a SIGTERM
    • 138 = reserved in tf-operator for user-specified retryable errors
    • others = undefined, with no guarantee of behavior

worker

  • A job can have 0 to N workers
  • The pod must contain a container named mxnet
  • Workers are automatically restarted if they exit

server

  • A job can have 0 to N servers
  • Parameter servers are automatically restarted if they exit

For each replica you define a template, which is a standard K8s PodTemplateSpec. The template allows you to specify the containers, volumes, etc. that should be created for each replica.
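
Because the overall MXJob status is determined by the exit code of the mxnet container in the scheduler pod (see the exit codes listed above), it can be useful to inspect that exit code directly when a job ends unexpectedly. A minimal sketch; replace <scheduler-pod> with the actual pod name from kubectl get pods:

# Print the exit code of the terminated `mxnet` container in the scheduler pod
kubectl get pod <scheduler-pod> \
  -o jsonpath='{.status.containerStatuses[?(@.name=="mxnet")].state.terminated.exitCode}'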

Creating a TVM tuning job (AutoTVM)

TVM is an end-to-end deep learning compiler stack, and you can easily run AutoTVM with mxnet-operator. You create an auto-tuning job by defining an MXJob with MXTune mode and then creating it with:

kubectl create -f examples/tune/mx_job_tune_gpu_v1.yaml

Before you use the auto-tuning example, some preparatory work needs to be finished in advance. To let TVM tune your network, you should create a Docker image that contains the TVM module. You also need an auto-tuning script that specifies which network will be tuned and sets the auto-tuning parameters; for more details, please see the TVM tutorials. Finally, you need a startup script to launch the auto-tuning program: mxnet-operator sets all the parameters as environment variables, and the startup script needs to read these variables and forward them to the auto-tuning script. We provide an example under examples/tune/; the tuning result is saved in a log file such as resnet-18.log in that example. You can refer to it for details.
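
For illustration only, such a startup script might look roughly like the sketch below; the TUNE_SCRIPT and TUNE_LOG variable names and the --log-file flag are placeholders for your own tuning setup, not names defined by mxnet-operator.

#!/bin/bash
# Hypothetical startup script: the operator passes tuning parameters to the
# container as environment variables; read them here and hand them to your
# own auto-tuning script. Variable names below are placeholders.
set -e
TUNE_SCRIPT=${TUNE_SCRIPT:-/workspace/auto_tuning.py}
TUNE_LOG=${TUNE_LOG:-resnet-18.log}
python3 "${TUNE_SCRIPT}" --log-file "${TUNE_LOG}"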

Using GPUs

MXNet Operator supports training with GPUs.

Please verify that your image supports distributed training with GPUs.

For example, if your container spec includes the following, MXNet Operator will schedule the pods onto nodes that can satisfy the GPU limit.

command: ["python"]
args: ["/incubator-mxnet/example/image-classification/train_mnist.py","--num-epochs","1","--num-layers","2","--kv-store","dist_device_sync","--gpus","0"]
resources:
  limits:
    nvidia.com/gpu: 1
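
Before submitting a GPU job, you may also want to confirm that your nodes actually advertise nvidia.com/gpu to the scheduler (this assumes the NVIDIA device plugin is already installed on the cluster):

# Show the GPU capacity and allocatable counts reported by each node
kubectl describe nodes | grep "nvidia.com/gpu"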

Monitoring your Apache MXNet job

To get the status of your job:

kubectl get -o yaml mxjobs $JOB

Here is sample output for an example job:

apiVersion: kubeflow.org/v1
kind: MXJob
metadata:
  creationTimestamp: 2021-03-24T15:37:27Z
  generation: 1
  name: mxnet-job
  namespace: default
  resourceVersion: "5123435"
  selfLink: /apis/kubeflow.org/v1/namespaces/default/mxjobs/mxnet-job
  uid: xx11013b-4a28-11e9-s5a1-704d7bb912f91
spec:
  cleanPodPolicy: All
  jobMode: MXTrain
  mxReplicaSpecs:
    Scheduler:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Server:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources: {}
    Worker:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - /incubator-mxnet/example/image-classification/train_mnist.py
            - --num-epochs
            - "10"
            - --num-layers
            - "2"
            - --kv-store
            - dist_device_sync
            - --gpus
            - "0"
            command:
            - python
            image: mxjob/mxnet:gpu
            name: mxnet
            ports:
            - containerPort: 9091
              name: mxjob-port
            resources:
              limits:
                nvidia.com/gpu: "1"
status:
  completionTime: 2021-03-24T09:25:11Z
  conditions:
  - lastTransitionTime: 2021-03-24T15:37:27Z
    lastUpdateTime: 2021-03-24T15:37:27Z
    message: MXJob mxnet-job is created.
    reason: MXJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: 2021-03-24T15:37:27Z
    lastUpdateTime: 2021-03-24T15:37:29Z
    message: MXJob mxnet-job is running.
    reason: MXJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: 2021-03-24T15:37:27Z
    lastUpdateTime: 2021-03-24T09:25:11Z
    message: MXJob mxnet-job is successfully completed.
    reason: MXJobSucceeded
    status: "True"
    type: Succeeded
  mxReplicaStatuses:
    Scheduler: {}
    Server: {}
    Worker: {}
  startTime: 2021-03-24T15:37:29Z

The first thing to note is the RuntimeId. This is a random unique string which is used to give names to all the K8s resources (e.g. Job controllers and services) that are created by the MXJob.

As with other K8s resources, status provides information about the state of the resource.

phase - Indicates the phase of a job and will be one of

  • Creating
  • Running
  • CleanUp
  • Failed
  • Done

state - Provides the overall status of the job and will be one of

  • Running
  • Succeeded
  • Failed

For each replica type in the job, there will be a ReplicaStatus that provides the number of replicas of that type in each state.
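
If you only need the overall outcome rather than the full YAML, a JSONPath query against the conditions shown in the sample output above works too; a minimal sketch using the mxnet-job example:

# Prints "True" once the job has completed successfully
kubectl get mxjobs mxnet-job -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'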

For each replica type, the job creates a set of K8s Job Controllers named

${REPLICA-TYPE}-${RUNTIME_ID}-${INDEX}

For example, if you have 2 servers and the runtime id is "76no", then MXJob will create the following two jobs:

server-76no-0
server-76no-1

Contributing

Please refer to this document for contributing guidelines.

Community

Please check out the Kubeflow community page for more information on how to get involved in our community.


mxnet-operator's Issues

Migrate to use kustomize to deploy mxnet-operator

ksonnet was deprecated a long time ago. Let's migrate to kustomize and have clear guidance for deployment setups. Currently, MXNet is not in the manifests applications, which means it's not a default option. That's why we need clear instructions.

Request for roadmap

Hello @suleisl2000, great work! Excited to see MXNet with Kubeflow.
I am very interested in working on MXNet + Kubeflow. Do you see the need for new contributions that can be helpful in this project? Can you please share the roadmap/ToDos/how to get started with the contributions?

Thanks for working on this.

Change to follow kubeflow/common convention and reuse implementation

Currently, the mxnet-operator API doesn't follow the kubeflow/common convention, but the controller does. It imports the tf-operator implementation: https://github.com/kubeflow/mxnet-operator/blob/c718707b304dc1ed0210a740c8efe7071d4ebb3e/pkg/controller.v1/mxnet/controller.go#L42

To graduate mxnet-operator to v1, we'd like to migrate to follow the kubeflow/common convention. It already has all the logic for ActiveDeadlineSeconds, BackoffLimit, etc. This will simplify the implementation on the mxnet-operator side.

Release a new container image with v1 binary

docker.io/mxjob/mxnet-operator:v1 exists but doesn't seem to have been built correctly in the past: there is no binary under /opt/kubeflow, and the Go version is 1.10.

docker.io/mxjob/mxnet-operator:v1beta1 seems to have the binary /opt/kubeflow/mxnet-operator.v1beta1 in the right place.

I think it would be better to onboard GCR to host these images.

the status of worker-0 is error, but the status of mxjob is Succeeded

kubeflow version: 0.5.0
mxnet-operator version: v1beta1

Kubernetes dashboard screenshot (image not included)

worker-0 log:
INFO:root:start with arguments Namespace(add_stn=False, batch_size=64, data_dir='/admin/public/model/mxnet_distributed/data', disp_batches=10, dtype='float32', gc_threshold=0.5, gc_type='none', gpus='0', image_shape='1, 28, 28', initializer='default', kv_store='dist_device_sync', load_epoch=None, loss='', lr=0.05, lr_factor=0.1, lr_step_epochs='10', macrobatch_size=0, model_prefix=None, mom=0.9, monitor=0, network='mlp', num_classes=10, num_epochs=2, num_examples=6000, num_layers=2, optimizer='sgd', profile_server_suffix='', profile_worker_suffix='', save_period=1, test_io=0, top_k=0, warmup_epochs=5, warmup_strategy='linear', wd=0.0001)
Traceback (most recent call last):
File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 99, in
fit.fit(args, sym, get_mnist_iter)
File "/admin/public/model/mxnet_model/mxnet_distributed/common/fit.py", line 180, in fit
(train, val) = data_loader(args, kv)
File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 57, in get_mnist_iter
'train-labels-idx1-ubyte.gz', 'train-images-idx3-ubyte.gz')
File "/admin/public/model/mxnet_model/mxnet_distributed/train_mnist.py", line 37, in read_data
with gzip.open(os.path.join(args.data_dir,label)) as flbl:
File "/opt/conda/lib/python3.6/gzip.py", line 53, in open
binary_file = GzipFile(filename, gz_mode, compresslevel)
File "/opt/conda/lib/python3.6/gzip.py", line 163, in init
fileobj = self.myfileobj = builtins.open(filename, mode or 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/admin/public/model/mxnet_distributed/data/train-labels-idx1-ubyte.gz'

mxjob status:

{
  "status": {
        "completionTime": "2019-05-21T08:37:24Z",
        "conditions": [
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:36:41Z",
                "message": "MXJob mxnet-8d1f211e is created.",
                "reason": "MXJobCreated",
                "status": "True",
                "type": "Created"
            },
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:36:46Z",
                "message": "MXJob mxnet-8d1f211e is running.",
                "reason": "MXJobRunning",
                "status": "False",
                "type": "Running"
            },
            {
                "lastTransitionTime": "2019-05-21T08:36:41Z",
                "lastUpdateTime": "2019-05-21T08:37:24Z",
                "message": "MXJob mxnet-8d1f211e is successfully completed.",
                "reason": "MXJobSucceeded",
                "status": "True",
                "type": "Succeeded"
            }
        ],
        "mxReplicaStatuses": {
            "Scheduler": {},
            "Server": {},
            "Worker": {}
        },
        "startTime": "2019-05-21T08:36:44Z"
	}
}

MXJob Kubernetes resource was created successfully, but SCHEDULER, SERVER, and WORKER jobs were not created

I followed the README to create an MXJob. The MXJob Kubernetes resource was created successfully, but the SCHEDULER, SERVER, and WORKER objects were not created.

kubernetes version: 1.12.2
kubeflow version: 0.3.1
ksonnet version: dev-2018-11-20T22:21:08+0800
jsonnet version: v0.11.2
client-go version: kubernetes-1.10.4

The following information may be useful:

# kubectl  get crd
NAME                     CREATED AT
mxjobs.kubeflow.org      2018-11-21T14:07:37Z
studyjobs.kubeflow.org   2018-11-20T14:35:52Z
tfjobs.kubeflow.org      2018-11-20T14:35:36Z
workflows.argoproj.io    2018-11-20T14:35:44Z
# kubectl  get pods -n kubeflow
NAME                                                      READY   STATUS             RESTARTS   AGE
ambassador-c97f7b448-cdqdx                                3/3     Running            1          7h9m
ambassador-c97f7b448-f8t8v                                3/3     Running            1          7h9m
ambassador-c97f7b448-hlnrg                                3/3     Running            1          7h9m
argo-ui-7495b79b59-96xkq                                  1/1     Running            0          7h9m
centraldashboard-798f8d68d5-swg7v                         1/1     Running            0          7h9m
modeldb-backend-d69695b66-fkxgs                           1/1     Running            0          7h9m
modeldb-db-975db58f7-4dbck                                1/1     Running            0          7h9m
modeldb-frontend-78ccff78b7-pr8kv                         1/1     Running            0          7h9m
mxnet-operator-6c49b767bc-5mpjg                           1/1     Running            0          12m
spartakus-volunteer-ffdfcdb5c-dlvz2                       1/1     Running            0          7h9m
studyjob-controller-7df5754ddf-5fdgk                      1/1     Running            0          7h9m
tf-hub-0                                                  1/1     Running            0          7h8m
tf-job-dashboard-7499d5cbcf-52dfr                         1/1     Running            0          7h9m
tf-job-operator-v1alpha2-644c5f7db7-vnglr                 1/1     Running            0          7h9m
# cat mx_job_dist.yaml
apiVersion: "kubeflow.org/v1alpha1"
kind: "MXJob"
metadata:
  name: "gpu-dist-job"
spec:
  jobMode: "dist"
  replicaSpecs:
    - replicas: 1
      mxReplicaType: SCHEDULER
      PsRootPort: 9000
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
          restartPolicy: OnFailure
    - replicas: 2
      mxReplicaType: SERVER
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
          restartPolicy: OnFailure
    - replicas: 2
      mxReplicaType: WORKER
      template:
        spec:
          containers:
            - image: mxjob/mxnet:gpu
              name: mxnet
              command: ["python"]
              args: ["/incubator-mxnet/example/image-classification/train_mnist.py","--num-epochs","1","--num-layers","2","--kv-store","dist_device_sync","--gpus","0,1"]
              resources:
                limits:
                  nvidia.com/gpu: 2
          restartPolicy: OnFailure

After the above preparation was completed, creating the mxjob gpu-dist-job succeeded, but the scheduler, server, and worker jobs were not created :(

# kubectl  get mxjobs
NAME           AGE
gpu-dist-job   12m
# kubectl  get jobs
No resources found.

How should I solve it? thx.

@gaocegege @suleisl2000

MXNet Operator: Move manifests development upstream

Background

Umbrella-Issue: kubeflow/manifests#1740

As part of the work of wg-manifests for 1.3
(kubeflow/manifests#1735), we are moving manifests
development to upstream repos. This gives the application developers full
ownership of their manifests, tracked in a single place.

To give an example from the notebook-controller, the diagram shows the current
situation. Manifests are in two repos, blurring separation of responsibilities
and making things harder for application developers.

diagram_1

Instead, we will copy all manifests from the manifests repo back to each
upstream's repo. From there, they will be imported in the manifests repo. The
following diagram presents the desired state:

diagram_2

Current State

MXNet Operator has manifests:

The proposed folder in the upstream repo to receive the manifests is:
manifests.

The goal is to consolidate all manifests development in the upstream repo.
The manifests repo will include a copy of the manifests under apps/mxnet-job/upstream.
This copy will be synced periodically.

Success Criteria

The manifests in apps/mxnet-job/upstream should be a copy of the upstream manifests
folder. To do that, the application manifests in kubeflow/manifests should be
moved to the upstream application repo.

/assign @yanniszark
cc @kubeflow/wg-training-leads

Move presubmit jobs to AWS CI

The community is asking the different WGs to own their infra, and the community won't provide a common shared testing infra anymore. See kubeflow/testing#752 for more details.

The PyTorch migration works well, and here's the PR that updates the test case to work on the AWS Prow + Argo cluster:
kubeflow/pytorch-operator#305

However, mxnet-operator currently doesn't really have any jobs defined. https://github.com/kubeflow/mxnet-operator/blob/master/prow_config.yaml

If someone can contribute e2e tests, we can follow what pytorch-operator did and kick them off in the new testing infra.

Graduate mxnet-operator to v1

I created this parent ticket to track all the changes needed to graduate mxnet-operator to v1.

Configuration and deployment

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Kustomize package | Required | Done | kubeflow/manifests#279 |
| Application CR | Required | Not Done | |
| Images listed in kustomization.yaml | Required | Not Done | |
| Upgradeability | Required | Done | |
| Separate cluster scoped and namespace scoped resources | Recommended | Done | N/A |
| Kustomize package should be deployable on its own | Recommended | Not Done | |

Custom Resources

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Version stability | Required | Not Done | |
| Backward compatibility | Required | Not Done | |
| Supports status subresource | Required | Not Done | |
| CRD schema validation | Required | Not Done | |
| Training operators follow kubeflow/common conventions | Required | Not Done | |

Logging and monitoring

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| Liveness/Readiness signals | Required | Not Done | |
| Prometheus metrics | Required | Not Done | |
| Json logging | Recommended | Not Done | |

CI/CD

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| E2E tests | Required | Not Done | |
| Scalability / load testing | Required | Not Done | |
| Continuous building of docker images | Recommended | Not Done | |
| Continuous updating of Kustomize manifests | Recommended | Not Done | |

Docs

| Description | Category | Status | Issue |
| --- | --- | --- | --- |
| API Reference docs | Required | Not Done | |
| Application docs | Required | Not Done | |

Owners/Maintenance

| Description | Category | Explanation | Status | Issue |
| --- | --- | --- | --- | --- |
| Healthy number of committers and commits | Required | Committers are listed as approvers in owners files. Number to be determined by TOC based on size and scope of application. | Not Done | |
| At least 2 different organizations are committers | Required | AWS, Caicloud, Huawei, Qihoo 360, TuSimple | Done | #63 |

Adoption

| Description | Category | Explanation |
| --- | --- | --- |
| List of users running the application | Recommended | Suggest listing adopters willing to be identified publicly in ADOPTERS.md |

XGBoost Operator: Move manifests development upstream

Background

Umbrella-Issue: kubeflow/manifests#1740

As part of the work of wg-manifests for 1.3
(kubeflow/manifests#1735), we are moving manifests
development to upstream repos. This gives the application developers full
ownership of their manifests, tracked in a single place.

To give an example from the notebook-controller, the diagram shows the current
situation. Manifests are in two repos, blurring separation of responsibilities
and making things harder for application developers.

diagram_1

Instead, we will copy all manifests from the manifests repo back to each
upstream's repo. From there, they will be imported in the manifests repo. The
following diagram presents the desired state:

diagram_2

Current State

XGBoost Operator has manifests:

Desired State

The proposed folder in the upstream repo to receive the manifests is:
config.

The goal is to consolidate all manifests development in the upstream repo.
The manifests repo will include a copy of the manifests under apps/xgboost-job/upstream.
This copy will be synced periodically.

Success Criteria

The manifests in apps/xgboost-job/upstream should be a copy of the upstream manifests
folder. To do that, the application manifests in kubeflow/manifests should be
moved to the upstream application repo.

/assign @yanniszark
cc @kubeflow/wg-training-leads

Customizing environment variables of the pods in mxnetjob crashes the operator

apiVersion: kubeflow.org/v1alpha1
kind: MXJob
metadata:
  name: mxnet-gpu-dist-job
spec:
  jobMode: dist
  replicaSpecs:
  - mxReplicaType: SCHEDULER
    PsRootPort: 9000
    replicas: 1
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          name: mxnet
          env:
          - name: PS_VERBOSE
            value: "2"
        restartPolicy: OnFailure
  - mxReplicaType: SERVER
    replicas: 2
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          name: mxnet
#          env:
#          - name: PS_VERBOSE
#            value: "2"
  - mxReplicaType: WORKER
    replicas: 4
    template:
      spec:
        containers:
        - image: stsukrov/mxnetbench
          args:
            - /incubator-mxnet/example/image-classification/train_imagenet.py
            - --num-epochs
            - '1'
            - --benchmark
            - '1'
            - --kv-store
            - dist_device_sync
            - --network
            - inception-v3
            - --batch-size
            - '64'
            - --image-shape
            - '3,299,299'
            - --gpus
            - '0'
          command:
            - python
#          env:
#            - name: PS_VERBOSE
#              value: "2"
          name: mxnet
          resources:
            limits:
                nvidia.com/gpu: 1
        restartPolicy: OnFailure

Enabling PS_VERBOSE on any of the pods crashes the operator:

f018986aae72:baictl stsukrov$ kubectl logs mxnet-operator-f46557c4f-wfklx
{"filename":"app/server.go:64","level":"info","msg":"KUBEFLOW_NAMESPACE not set, using default namespace","time":"2019-03-26T09:33:49Z"}
{"filename":"app/server.go:69","level":"info","msg":"[API Version: v1alpha1 Version: v0.1.0-alpha Git SHA: Not provided. Go Version: go1.10.2 Go OS/Arch: linux/amd64]","time":"2019-03-26T09:33:49Z"}
{"filename":"app/server.go:153","level":"info","msg":"No controller_config_file provided; using empty config.","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:138","level":"info","msg":"Setting up event handlers","time":"2019-03-26T09:33:49Z"}
I0326 09:33:49.093823       1 leaderelection.go:174] attempting to acquire leader lease...
E0326 09:33:49.115391       1 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"mxnet-operator", GenerateName:"", Namespace:"default", SelfLink:"/api/v1/namespaces/default/endpoints/mxnet-operator", UID:"c7a02cee-4efb-11e9-a1a1-025004746b4c", ResourceVersion:"204330", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63689114690, loc:(*time.Location)(0x18bc1a0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"mxnet-operator-f46557c4f-wfklx\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2019-03-25T12:44:50Z\",\"renewTime\":\"2019-03-26T09:33:49Z\",\"leaderTransitions\":0}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'no kind is registered for the type v1.Endpoints'. Will not report event: 'Normal' 'LeaderElection' 'mxnet-operator-f46557c4f-wfklx became leader'
I0326 09:33:49.115722       1 leaderelection.go:184] successfully acquired lease default/mxnet-operator
{"filename":"controller/controller.go:176","level":"info","msg":"Starting MXJob controller","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:179","level":"info","msg":"Waiting for informer caches to sync","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:184","level":"info","msg":"Starting 1 workers","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:190","level":"info","msg":"Started workers","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:273","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Creating new job default/mxnet-gpu-dist-job","time":"2019-03-26T09:33:49Z"}
{"filename":"trainer/replicas.go:507","job":"default/mxnet-gpu-dist-job","job_type":"SCHEDULER","level":"info","msg":"Job mxnet-gpu-dist-job missing pod for replica SCHEDULER index 0, creating a new one.","mx_job_name":"mxnet-gpu-dist-job","runtime_id":"pub8","time":"2019-03-26T09:33:49Z"}
{"filename":"controller/controller.go:245","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Finished syncing job \"default/mxnet-gpu-dist-job\" (7.526234ms)","time":"2019-03-26T09:33:49Z"}
E0326 09:33:49.223718       1 runtime.go:66] Observed a panic: "index out of range" (runtime error: index out of range)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/home/tusimple/go/src/runtime/asm_amd64.s:573
/home/tusimple/go/src/runtime/panic.go:502
/home/tusimple/go/src/runtime/panic.go:28
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/home/tusimple/go/src/runtime/asm_amd64.s:2361
panic: runtime error: index out of range [recovered]
	panic: runtime error: index out of range

goroutine 102 [running]:
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0xfb2f60, 0x18aabd0)
	/home/tusimple/go/src/runtime/panic.go:502 +0x229
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).CreatePodWithIndex(0xc42041f980, 0x0, 0x3f, 0xc4205ab378, 0x3)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218 +0x11cd
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).SyncPods(0xc42041f980, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509 +0x3a8
github.com/kubeflow/mxnet-operator/pkg/trainer.(*TrainingJob).Reconcile(0xc420692370, 0xc4205dc1d0, 0xc420711300, 0x1a, 0xc420358158)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362 +0x11f
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).syncMXJob(0xc4205dc1b0, 0xc420711300, 0x1a, 0xc420086c00, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291 +0xaee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.syncMXJob)-fm(0xc420711300, 0x1a, 0xc4205bd380, 0xf58b20, 0xc4203227c0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162 +0x3e
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).processNextWorkItem(0xc4205dc1b0, 0xc4205ca100)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215 +0xee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).runWorker(0xc4205dc1b0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201 +0x2b
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.runWorker)-fm()
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x2a
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc42042e5b0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc42042e5b0, 0x3b9aca00, 0x0, 0x1, 0xc4205e6600)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc42042e5b0, 0x3b9aca00, 0xc4205e6600)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).Run
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x22b

no matches for kind "MXJob" in version "kubeflow.org/v1beta1"

I can run the tf-job, but cannot run the mxnet-job.

The command I use to list crd is
kubectl get crd

The output of crd is
applications.app.k8s.io 2019-03-18T17:54:51Z
...
mxjobs.kubeflow.org 2019-03-24T23:20:39Z
pytorchjobs.kubeflow.org 2019-03-18T17:53:18Z
scheduledworkflows.kubeflow.org 2019-03-18T17:53:43Z
tfjobs.kubeflow.org 2019-03-18T21:21:14Z
workflows.argoproj.io 2019-03-18T17:53:36Z

The command I use to launch the task is
kubectl create -f mxnet-operator/examples/v1beta1/train/mx_job_dist_gpu.yaml

The output is
error: unable to recognize "mxnet-operator/examples/v1beta1/train/mx_job_dist_gpu.yaml": no matches for kind "MXJob" in version "kubeflow.org/v1beta1"

I really appreciate any help.

Next fine grain roles for mxnet-operator

Currently, we bind the cluster-admin ClusterRole to the default service account in the default namespace. This is coarse, and it would be great to understand exactly what policies are needed to run the controller.

roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: default
  namespace: default

Failed to build container image in examples

The Docker image doesn't build. It's probably because of a base image change.
https://github.com/kubeflow/mxnet-operator/blob/master/examples/v1beta1/train/Dockerfile

$ docker build -t myusername/mxnet:gpu .
Sending build context to Docker daemon  8.192kB
Step 1/10 : FROM mxnet/python:gpu
 ---> 31528c92407c
Step 2/10 : RUN apt-get update     && apt-get install -y git     && apt-get install -y build-essential git     && apt-get install -y libopenblas-dev liblapack-dev     && apt-get install -y libopencv-dev
 ---> Using cache
 ---> c7e42befb56d
Step 3/10 : RUN rm -rf /mxnet     && git clone --recursive https://github.com/apache/incubator-mxnet.git -b v1.2.0
 ---> Running in a2275a3a24eb
fatal: could not create work tree dir 'incubator-mxnet': No such file or directory
The command '/bin/sh -c rm -rf /mxnet     && git clone --recursive https://github.com/apache/incubator-mxnet.git -b v1.2.0' returned a non-zero code: 128

It actually clones the source into the root directory /, which is not a best practice. We can add a WORKDIR and point it somewhere else.

Adding an initContainer to a worker pod crashes the operator

I added an initContainer to an MXJob. Similar code works fine with kubeflow/mpijob.

  - mxReplicaType: WORKER
    replicas: 4
    template:
      spec:
        initContainers:
        - image: alpine
          command:
            - echo
            - Hello world!
        containers:
        - image: stsukrov/mxnetbench
          args:
            - /incubator-mxnet/example/image-classification/train_imagenet.py
            - --num-epochs
            - '1'
            - --benchmark
            - '1'
            - --kv-store
            - dist_device_sync
            - --network
            - inception-v3
            - --batch-size
            - '64'
            - --image-shape
            - '3,299,299'
            - --gpus
            - '0'
          command:
            - python
#          env:
#            - name: PS_VERBOSE
#              value: "2"
          name: mxnet
          resources:
            limits:
                nvidia.com/gpu: 1
        restartPolicy: OnFailure

mxnet operator crashes with:

{"filename":"app/server.go:64","level":"info","msg":"KUBEFLOW_NAMESPACE not set, using default namespace","time":"2019-03-26T09:43:50Z"}
{"filename":"app/server.go:69","level":"info","msg":"[API Version: v1alpha1 Version: v0.1.0-alpha Git SHA: Not provided. Go Version: go1.10.2 Go OS/Arch: linux/amd64]","time":"2019-03-26T09:43:50Z"}
{"filename":"app/server.go:153","level":"info","msg":"No controller_config_file provided; using empty config.","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:138","level":"info","msg":"Setting up event handlers","time":"2019-03-26T09:43:50Z"}
I0326 09:43:50.859250       1 leaderelection.go:174] attempting to acquire leader lease...
E0326 09:43:50.882160       1 event.go:260] Could not construct reference to: '&v1.Endpoints{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"mxnet-operator", GenerateName:"", Namespace:"default", SelfLink:"/api/v1/namespaces/default/endpoints/mxnet-operator", UID:"c7a02cee-4efb-11e9-a1a1-025004746b4c", ResourceVersion:"205844", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63689114690, loc:(*time.Location)(0x18bc1a0)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string{"control-plane.alpha.kubernetes.io/leader":"{\"holderIdentity\":\"mxnet-operator-f46557c4f-4cxr5\",\"leaseDurationSeconds\":15,\"acquireTime\":\"2019-03-26T09:43:27Z\",\"renewTime\":\"2019-03-26T09:43:50Z\",\"leaderTransitions\":1}"}, OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, Subsets:[]v1.EndpointSubset(nil)}' due to: 'no kind is registered for the type v1.Endpoints'. Will not report event: 'Normal' 'LeaderElection' 'mxnet-operator-f46557c4f-4cxr5 became leader'
I0326 09:43:50.882755       1 leaderelection.go:184] successfully acquired lease default/mxnet-operator
{"filename":"controller/controller.go:176","level":"info","msg":"Starting MXJob controller","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:179","level":"info","msg":"Waiting for informer caches to sync","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:184","level":"info","msg":"Starting 1 workers","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:190","level":"info","msg":"Started workers","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:273","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Creating new job default/mxnet-gpu-dist-job","time":"2019-03-26T09:43:50Z"}
{"filename":"trainer/replicas.go:507","job":"default/mxnet-gpu-dist-job","job_type":"SERVER","level":"info","msg":"Job mxnet-gpu-dist-job missing pod for replica SERVER index 0, creating a new one.","mx_job_name":"mxnet-gpu-dist-job","runtime_id":"1eg0","time":"2019-03-26T09:43:50Z"}
{"filename":"controller/controller.go:245","job":"default/mxnet-gpu-dist-job","level":"info","msg":"Finished syncing job \"default/mxnet-gpu-dist-job\" (13.789497ms)","time":"2019-03-26T09:43:50Z"}
E0326 09:43:50.997170       1 runtime.go:66] Observed a panic: "index out of range" (runtime error: index out of range)
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:72
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:65
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:51
/home/tusimple/go/src/runtime/asm_amd64.s:573
/home/tusimple/go/src/runtime/panic.go:502
/home/tusimple/go/src/runtime/panic.go:28
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88
/home/tusimple/go/src/runtime/asm_amd64.s:2361
panic: runtime error: index out of range [recovered]
	panic: runtime error: index out of range

goroutine 110 [running]:
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime.HandleCrash(0x0, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/runtime/runtime.go:58 +0x107
panic(0xfb2f60, 0x18aabd0)
	/home/tusimple/go/src/runtime/panic.go:502 +0x229
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).CreatePodWithIndex(0xc4203d43c0, 0x0, 0x3f, 0xc42078d378, 0x3)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:218 +0x11cd
github.com/kubeflow/mxnet-operator/pkg/trainer.(*MXReplicaSet).SyncPods(0xc4203d43c0, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/replicas.go:509 +0x3a8
github.com/kubeflow/mxnet-operator/pkg/trainer.(*TrainingJob).Reconcile(0xc4207aa420, 0xc42040c260, 0xc420440000, 0x1a, 0xc4204fa568)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/trainer/training.go:362 +0x11f
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).syncMXJob(0xc42040c240, 0xc420440060, 0x1a, 0xc420508c00, 0x0, 0x0)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:291 +0xaee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.syncMXJob)-fm(0xc420440060, 0x1a, 0xc420442c80, 0xf58b20, 0xc420538090)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:162 +0x3e
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).processNextWorkItem(0xc42040c240, 0xc4203ac400)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:215 +0xee
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).runWorker(0xc42040c240)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:201 +0x2b
github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).(github.com/kubeflow/mxnet-operator/pkg/controller.runWorker)-fm()
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x2a
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1(0xc420398160)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133 +0x54
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc420398160, 0x3b9aca00, 0x0, 0x1, 0xc4207b0000)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134 +0xbd
github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait.Until(0xc420398160, 0x3b9aca00, 0xc4207b0000)
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88 +0x4d
created by github.com/kubeflow/mxnet-operator/pkg/controller.(*Controller).Run
	/home/tusimple/gopath/src/github.com/kubeflow/mxnet-operator/pkg/controller/controller.go:187 +0x22b
f018986aae72:baictl stsukrov$ kubectl delete -f mxnet.yaml

RestartPolicy does not work

When setting the RestartPolicy to Never, if I kill a pod (no matter what role it is), the pod still restarts. Besides, if I kill the worker, the job keeps running, which is incorrect; it should actually be in the Failed state.

Support gang-scheduling by kube-batch

Gang scheduling is a common requirement for training jobs, and kube-batch supports it right now, so it would be good to do the integration for training jobs.

Mxnet operator v1 API

There are a couple of minor API changes that have been suggested. We can incorporate all of these changes in the next API version.

Related: kubeflow/training-operator#935

  • Requires support of Status subresource in CRD
  • Add ActiveDeadlineSeconds and BackoffLimit
  • Use pod group instead of PDB for gang scheduling
  • Supporting multiple versions of CRD

@suleisl2000 @gaocegege

error in judging status of mxjob

I set scheduler replicas to 0, server replicas to 0, and worker replicas to 1 to run a simple (non-distributed) MXNet training script. The moment I created the MXJob, its status became "Succeeded", but the worker pod was still running.

The MXJob detail is as follows:

{

"apiVersion": "kubeflow.org/v1beta1",
"kind": "MXJob",
"metadata": {
    "creationTimestamp": "2019-05-07T05:36:22Z",
    "generation": 1,
    "name": "mxnet-0da7ef98",
    "namespace": "86872fb0-64d1-11e9-96c1-1ee8ba783f60",
    "resourceVersion": "44498209",
    "selfLink": "/apis/kubeflow.org/v1beta1/namespaces/86872fb0-64d1-11e9-96c1-1ee8ba783f60/mxjobs/mxnet-0da7ef98",
    "uid": "0c639d60-708a-11e9-991a-005056ae0022"
},
"spec": {
    "cleanPodPolicy": "None",
    "jobMode": "MXTrain",
    "mxReplicaSpecs": {
        "Scheduler": {
            "replicas": 0,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "0",
                                    "memory": "0"
                                },
                                "requests": {
                                    "cpu": "0",
                                    "memory": "0"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        },
        "Server": {
            "replicas": 0,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "0",
                                    "memory": "0"
                                },
                                "requests": {
                                    "cpu": "0",
                                    "memory": "0"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        },
        "Worker": {
            "replicas": 1,
            "restartPolicy": "Never",
            "template": {
                "metadata": {
                    "creationTimestamp": null
                },
                "spec": {
                    "containers": [
                        {
                            "args": [
                                "python3",
                                "/mxnet_test/example/image-classification/train_mnist.py",
                                "--num-layers=2",
                                "--num-epochs=20"
                            ],
                            "env": [
                                {
                                    "name": "NVIDIA_VISIBLE_DEVICES",
                                    "value": "none"
                                }
                            ],
                            "image": "100.2.28.186/admin/mxnet:cpu",
                            "name": "mxnet",
                            "ports": [
                                {
                                    "containerPort": 9091,
                                    "name": "mxjob-port"
                                }
                            ],
                            "resources": {
                                "limits": {
                                    "cpu": "1",
                                    "memory": "1Gi"
                                },
                                "requests": {
                                    "cpu": "1",
                                    "memory": "1Gi"
                                }
                            },
                            "volumeMounts": [
                                {
                                    "mountPath": "/mxnet_test/example/image-classification/",
                                    "name": "host-mount"
                                },
                                {
                                    "mountPath": "/mxnet_test/data",
                                    "name": "data-mount"
                                }
                            ]
                        }
                    ],
                    "nodeSelector": {
                        "group_label": "24c489bc-64d1-11e9-96c1-1ee8ba783f60"
                    },
                    "restartPolicy": "OnFailure",
                    "volumes": [
                        {
                            "hostPath": {
                                "path": "/mxnet_test/example/image-classification/",
                                "type": "Directory"
                            },
                            "name": "host-mount"
                        },
                        {
                            "hostPath": {
                                "path": "/mxnet_test/data",
                                "type": "Directory"
                            },
                            "name": "data-mount"
                        }
                    ]
                }
            }
        }
    }
},
"status": {
    "completionTime": "2019-05-07T05:36:22Z",
    "conditions": [
        {
            "lastTransitionTime": "2019-05-07T05:36:22Z",
            "lastUpdateTime": "2019-05-07T05:36:22Z",
            "message": "MXJob mxnet-0da7ef98 is created.",
            "reason": "MXJobCreated",
            "status": "True",
            "type": "Created"
        },
        {
            "lastTransitionTime": "2019-05-07T05:36:22Z",
            "lastUpdateTime": "2019-05-07T05:36:22Z",
            "message": "MXJob mxnet-0da7ef98 is successfully completed.",
            "reason": "MXJobSucceeded",
            "status": "True",
            "type": "Succeeded"
        }
    ],
    "mxReplicaStatuses": {
        "Scheduler": {},
        "Server": {},
        "Worker": {}
    },
    "startTime": "2019-05-07T05:36:22Z"
}

}

Migrate from Dep to Go modules

The mxnet-operator project is still using Dep. It would be great to migrate to Go modules.

  • Upgrade Golang version #53
  • Upgrade k8s.io dependencies to 1.15.x
  • Remove Gopkg.lock, Gopkg.toml and vendor files
  • Update development guidance

Cannot start up the example training job

kubernetes version: v1.15.2
kubeflow version: v0.6.1
mxnet-operator version: v1beta1

I cannot start up the example training job after following the steps in README.md.

First, I installed Kubeflow in the k8s cluster and got mxjob support:

$ kubectl get crd | grep kube            
experiments.kubeflow.org                      2019-09-04T04:05:29Z
mpijobs.kubeflow.org                          2019-09-18T04:08:45Z
mxjobs.kubeflow.org                           2019-09-04T04:05:30Z
notebooks.kubeflow.org                        2019-09-04T04:05:29Z
poddefaults.kubeflow.org                      2019-09-04T04:05:28Z
profiles.kubeflow.org                         2019-09-04T04:05:31Z
pytorchjobs.kubeflow.org                      2019-09-04T04:05:29Z
scheduledworkflows.kubeflow.org               2019-09-04T04:05:31Z
tfjobs.kubeflow.org                           2019-09-04T04:05:30Z
trials.kubeflow.org                           2019-09-04T04:05:29Z
viewers.kubeflow.org                          2019-09-04T04:05:31Z

After creating the mxjob with the following, I got the pod status:

$ kubectl create -f examples/v1beta1/train/mx_job_dist_gpu.yaml
$ kubectl get pods -n dml -o wide
NAME                    READY   STATUS    RESTARTS   AGE     IP               NODE    NOMINATED NODE   READINESS GATES
mxnet-job-scheduler-0   2/2     Running   0          4h36m   192.168.107.59   p40-4   <none>           <none>
mxnet-job-server-0      2/2     Running   0          4h36m   192.168.107.60   p40-4   <none>           <none>
mxnet-job-worker-0      2/2     Running   0          4h36m   192.168.107.61   p40-4   <none>           <none>

All pods of the created mxjob are running, and the containers in those pods even report healthy conditions:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:47:58Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:48:02Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:48:02Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2019-09-24T08:47:55Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: docker://91d6fccccc646b70022e4dc1e840140096d2566e12a547ed01c31a7f2fff5e62
    image: istio/proxyv2:1.2.5
    imageID: docker-pullable://istio/proxyv2@sha256:8f210c3d09beb6b8658a4255d9ac30e25549295834a44083ed67d652ad7453e4
    lastState: {}
    name: istio-proxy
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-09-24T08:47:59Z"
  - containerID: docker://b1892b0750dbae31b77c53ddf5883f35a1245b6d165f22ad5c6358ec4ced830b
    image: mxjob/mxnet:gpu
    imageID: docker-pullable://mxjob/mxnet@sha256:f0ab7315578dbcddab6af926855d2586190f4f0c3dd5f4bb34f28a5a15ac7c84
    lastState: {}
    name: mxnet
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: "2019-09-24T08:47:59Z"
  hostIP: 10.1.2.14
  initContainerStatuses:
  - containerID: docker://fd7d003ef7ae3a8da1b12ebdbb64e731ed2aab767a9f230fc4cd840207f9f7fb
    image: istio/proxy_init:1.2.5
    imageID: docker-pullable://istio/proxy_init@sha256:c9964a8c1c28b85cc631bbc90390eac238c90f82c8f929495d1e9f9a9135b724
    lastState: {}
    name: istio-init
    ready: true
    restartCount: 0
    state:
      terminated:
        containerID: docker://fd7d003ef7ae3a8da1b12ebdbb64e731ed2aab767a9f230fc4cd840207f9f7fb
        exitCode: 0
        finishedAt: "2019-09-24T08:47:58Z"
        reason: Completed
        startedAt: "2019-09-24T08:47:57Z"
  phase: Running
  podIP: 192.168.107.61
  qosClass: Burstable
  startTime: "2019-09-24T08:47:55Z"

After that, I stepped into the mxnet container in the mxnet-job-worker-0 pod and found that the training script was blocked on the code kv = mx.kvstore.create(args.kv_store) in L149 of /incubator-mxnet/example/image-classification/common/fit.py:

141 def fit(args, network, data_loader, **kwargs):
142     """
143     train a model
144     args : argparse returns
145     network : the symbol definition of the nerual network
146     data_loader : function that returns the train and val data iterators
147     """
148     # kvstore
149     logging.info('creating kvstore')  # added to locate the blocked point
150     kv = mx.kvstore.create(args.kv_store)
151     logging.info('created kvstore')   # added to locate the blocked point  

The possible reason I can think of for the above situation is that communication between the pods or containers is disabled, so I manually pinged the mxnet-job-scheduler-0 and mxnet-job-server-0 pods, which are exposed by k8s services bound to the inner mxnet-job containers, and the connection between those containers is OK. But I cannot ssh into those containers using hostnames like mxnet-job-server-0, which means the precondition for MXNet distributed training is not met, as there is no ssh service available for MXNet distributed training.

Is there any incorrect operation in starting up the mxjob with kubectl, or any wrong configuration in mx_job_dist_gpu.yaml? If so, please let me know. Thank you for any suggestions.
