Comments (12)

Jeffwan commented on August 18, 2024

I think the problem is on the EFS side. Could you help verify the EFS mount settings and the security group settings? Make sure you use a public subnet and the ClusterSharedNodeGroup security group:

https://github.com/pytorch/elastic/blob/master/docs/source/_static/img/efs-setup.jpg
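If it helps, the subnet and security group checks can be scripted with the aws CLI. A sketch, where fs-12345678 and fsmt-12345678 are placeholder IDs, and the commands are guarded so the script degrades gracefully in a shell without the CLI configured:

```shell
# Verify the EFS mount targets and their security groups.
# fs-12345678 / fsmt-12345678 are placeholders -- substitute your own IDs.
if command -v aws >/dev/null 2>&1; then
    # Which subnets host the mount targets, and are they available?
    aws efs describe-mount-targets --file-system-id fs-12345678 \
        --query 'MountTargets[].{Id:MountTargetId,Subnet:SubnetId,State:LifeCycleState}'
    # The attached security groups must allow inbound NFS (TCP 2049)
    # from the cluster's node security group (e.g. ClusterSharedNodeGroup).
    aws efs describe-mount-target-security-groups --mount-target-id fsmt-12345678
else
    echo "aws CLI not found; run these checks from a shell with AWS credentials."
fi
```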

from elastic.

chauhang commented on August 18, 2024

@Jeffwan Can you please take a look at the EKS settings needed to make the EFS mount points work inside EKS?

Jeffwan commented on August 18, 2024

/assign @Jeffwan

chauhang commented on August 18, 2024

@Jeffwan I tried out the steps for EFS mounting. By default the VPC security group gets selected, so I changed it to the ClusterSharedNodeGroup.

The Imagenet sample is still not working, and the pod logs show a FailedScheduling error for the GPU nodes:

kubectl describe pod  imagenet-worker-1
Name:         imagenet-worker-1
Namespace:    default
Priority:     0
Node:         <none>
Labels:       job-name=imagenet
              worker=1
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"job-name":"imagenet","worker":"1"},"name":"imagenet-worker-1","nam...
              kubernetes.io/psp: eks.privileged
Status:       Pending
IP:           
Containers:
  elastic-trainer-worker:
    Image:      torchelastic/examples:0.1.0rc1
    Port:       <none>
    Host Port:  <none>
    Args:
      s3://torchelastic-ubuntu-m6cdt-s3bucket-1tbgx2u77u2j4/petctl/ubuntu/ubuntu-job/main.py
      --input_path
      /data/tiny-imagenet-200/train
      --epochs
      10
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:
      RDZV_ENDPOINT:  etcd:2379
      JOB_ID:         imagenet
      MIN_SIZE:       1
      MAX_SIZE:       3
      SIZE:           2
    Mounts:
      /data from persistent-storage (rw)
      /dev/shm from dshm (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6s4bz (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  efs-claim
    ReadOnly:   false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  default-token-6s4bz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6s4bz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  8s (x2 over 10s)  default-scheduler  0/2 nodes are available: 2 Insufficient nvidia.com/gpu.

The data download step is crashing because the CentOS 'yum update' gets stuck waiting for a Y/N response. Please change it to 'yum update -y' in the data download steps. Also include troubleshooting steps like the ones below to make it easier for developers to identify problems, and mention that it takes ~1 hour for the data to download and inflate even for the tiny-imagenet dataset. It would be better to switch to an example with a smaller dataset.

kubectl get pods
NAME                    READY   STATUS             RESTARTS   AGE
download-dataset-task   0/1     CrashLoopBackOff   4          3m35s
etcd                    1/1     Running            0          2m44s
imagenet-worker-1       0/1     Pending            0          94s
imagenet-worker-2       0/1     Pending            0          93s

Details of the crashing pod:

kubectl describe pod download-dataset-task 
Name:         download-dataset-task
Namespace:    default
Priority:     0
Node:         ip-192-168-81-41.us-east-2.compute.internal/192.168.81.41
Start Time:   Sun, 29 Mar 2020 19:50:49 -0700
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"download-dataset-task","namespace":"default"},"spec":{"containers":[{...
              kubernetes.io/psp: eks.privileged
Status:       Running
IP:           192.168.91.84
Containers:
  app:
    Container ID:  docker://74e0187cdcf70a847aa776f3a495a6226d4e38cf1a032cee184e5dcbadaf12cc
    Image:         centos:latest
    Image ID:      docker-pullable://centos@sha256:fe8d824220415eed5477b63addf40fb06c3b049404242b31982106ac204f6700
    Port:          <none>
    Host Port:     <none>
    Command:
      /bin/sh
      -c
      /bin/bash <<'EOF'
      yum update
      yum install -y wget unzip
      wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
      unzip tiny-imagenet-200.zip -d /data
      EOF
      
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      Sun, 29 Mar 2020 20:02:42 -0700
      Finished:     Sun, 29 Mar 2020 20:02:47 -0700
    Ready:          False
    Restart Count:  7
    Environment:    <none>
    Mounts:
      /data from persistent-storage (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-6s4bz (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  persistent-storage:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  efs-claim
    ReadOnly:   false
  default-token-6s4bz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-6s4bz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason     Age                  From                                                  Message
  ----     ------     ----                 ----                                                  -------
  Normal   Scheduled  12m                  default-scheduler                                     Successfully assigned default/download-dataset-task to ip-192-168-81-41.us-east-2.compute.internal
  Normal   Created    11m (x4 over 12m)    kubelet, ip-192-168-81-41.us-east-2.compute.internal  Created container app
  Normal   Started    11m (x4 over 12m)    kubelet, ip-192-168-81-41.us-east-2.compute.internal  Started container app
  Normal   Pulling    10m (x5 over 12m)    kubelet, ip-192-168-81-41.us-east-2.compute.internal  Pulling image "centos:latest"
  Normal   Pulled     10m (x5 over 12m)    kubelet, ip-192-168-81-41.us-east-2.compute.internal  Successfully pulled image "centos:latest"
  Warning  BackOff    2m6s (x44 over 11m)  kubelet, ip-192-168-81-41.us-east-2.compute.internal  Back-off restarting failed container

kubectl logs download-dataset-task app
CentOS-8 - AppStream                            8.5 MB/s | 6.6 MB     00:00    
CentOS-8 - Base                                  11 MB/s | 5.0 MB     00:00    
CentOS-8 - Extras                                21 kB/s | 4.8 kB     00:00    
Dependencies resolved.
================================================================================
 Package                Arch   Version                          Repo       Size
================================================================================
Upgrading:
 audit-libs             x86_64 3.0-0.13.20190507gitf58ec40.el8  BaseOS    116 k
 binutils               x86_64 2.30-58.el8_1.1                  BaseOS    5.7 M
 centos-gpg-keys        noarch 8.1-1.1911.0.9.el8               BaseOS     12 k
 centos-release         x86_64 8.1-1.1911.0.9.el8               BaseOS     21 k
 centos-repos           x86_64 8.1-1.1911.0.9.el8               BaseOS     13 k
 glibc                  x86_64 2.28-72.el8_1.1                  BaseOS    3.7 M
 glibc-common           x86_64 2.28-72.el8_1.1                  BaseOS    836 k
 glibc-minimal-langpack x86_64 2.28-72.el8_1.1                  BaseOS     48 k
 libarchive             x86_64 3.3.2-8.el8_1                    BaseOS    359 k
 openldap               x86_64 2.4.46-11.el8_1                  BaseOS    352 k
 sqlite-libs            x86_64 3.26.0-4.el8_1                   BaseOS    579 k
 systemd                x86_64 239-18.el8_1.4                   BaseOS    3.5 M
 systemd-libs           x86_64 239-18.el8_1.4                   BaseOS    562 k
 systemd-pam            x86_64 239-18.el8_1.4                   BaseOS    232 k
 systemd-udev           x86_64 239-18.el8_1.4                   BaseOS    1.3 M
Installing dependencies:
 xkeyboard-config       noarch 2.24-3.el8                       AppStream 828 k
 kbd-legacy             noarch 2.0.4-8.el8                      BaseOS    481 k
 kbd-misc               noarch 2.0.4-8.el8                      BaseOS    1.4 M
Installing weak dependencies:
 libxkbcommon           x86_64 0.8.2-1.el8                      AppStream 116 k
 diffutils              x86_64 3.6-5.el8                        BaseOS    359 k
 glibc-langpack-en      x86_64 2.28-72.el8_1.1                  BaseOS    818 k
 kbd                    x86_64 2.0.4-8.el8                      BaseOS    392 k

Transaction Summary
================================================================================
Install   7 Packages
Upgrade  15 Packages

Total download size: 22 M
Operation aborted.
Is this ok [y/N]: Is this ok [y/N]: Is this ok [y/N]: Is this ok [y/N]:
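Given the log above, the fix on the pod side is just the missing -y flag. A sketch of the corrected download pod, reconstructed from the kubectl describe output above (image, command, and volume names are taken from that output; the YAML layout and the OnFailure restart policy are assumptions):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: download-dataset-task
spec:
  restartPolicy: OnFailure   # assumption; not visible in the describe output
  containers:
  - name: app
    image: centos:latest
    command:
    - /bin/sh
    - -c
    - |
      /bin/bash <<'EOF'
      yum update -y            # -y added so yum no longer waits for Y/N
      yum install -y wget unzip
      wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
      unzip tiny-imagenet-200.zip -d /data
      EOF
    volumeMounts:
    - name: persistent-storage
      mountPath: /data
  volumes:
  - name: persistent-storage
    persistentVolumeClaim:
      claimName: efs-claim
```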

chauhang commented on August 18, 2024

@Jeffwan For Nvidia GPUs it would be better to switch to the new GPU operator, to make things easier for developers. Can we update the example to use this mechanism instead?

https://devblogs.nvidia.com/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes/

Jeffwan commented on August 18, 2024

> @Jeffwan For Nvidia GPUs it would be better to switch to the new GPU operator, to make things easier for developers. Can we update the example to use this mechanism instead?
>
> https://devblogs.nvidia.com/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes/

I will update the pod to fetch the data. BTW, are you using the device plugin? The error shows insufficient GPUs, which suggests your cluster doesn't have any allocatable GPUs.

The GPU operator is at an early stage, and we don't use it yet.

chauhang commented on August 18, 2024

@Jeffwan Yes, I used the Nvidia plugin as per the instructions and spun up the cluster with p3.2xlarge nodes. Is there any way to troubleshoot the daemonset setup and check the actual GPU resources?

Jeffwan commented on August 18, 2024

> @Jeffwan Yes, I used the Nvidia plugin as per the instructions and spun up the cluster with p3.2xlarge nodes. Is there any way to troubleshoot the daemonset setup and check the actual GPU resources?

  1. Can you run kubectl get ds and check whether the daemonsets are up?
  2. Check your node status: kubectl get nodes, then kubectl describe node <node-name> to see the node's allocatable resources.
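A concrete sketch of those two checks (this assumes kubectl is pointed at the cluster; the device plugin daemonset's exact name depends on how it was installed, so the sketch just greps for it):

```shell
# Troubleshoot GPU scheduling: is the device plugin running, and do the
# nodes actually advertise nvidia.com/gpu as an allocatable resource?
if command -v kubectl >/dev/null 2>&1; then
    # Look for the NVIDIA device plugin daemonset in any namespace.
    kubectl get ds --all-namespaces | grep -i nvidia || echo "no nvidia daemonset found"
    # Print each node's allocatable GPU count (blank means no GPUs registered).
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
else
    echo "kubectl not found; run these against the EKS cluster."
fi
```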

chauhang commented on August 18, 2024

@Jeffwan kubectl get ds returns "No resources found". Nothing shows up in the EKS console in AWS, or in the logs, to indicate whether it is running or not. Is there something missing in our config?

Jeffwan commented on August 18, 2024

@chauhang

What about the following command? I forgot to ask you to specify the namespace:

kubectl get ds -n kube-system

chauhang commented on August 18, 2024

@Jeffwan That works, thanks!

kiukchung commented on August 18, 2024

We have a new imagenet example in v0.2.0. Feel free to open an issue if you run into problems with it on EKS.
