Comments (12)
I think the problem is on the EFS side. Could you help verify the EFS mount settings and the security group settings? Make sure you use the public subnets and the ClusterSharedNodeGroup security group:
https://github.com/pytorch/elastic/blob/master/docs/source/_static/img/efs-setup.jpg
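If the mount still fails, one way to verify the setup is with the AWS CLI. This is only a sketch; the fs-, fsmt-, and sg- IDs below are hypothetical placeholders you must replace with your own:

```shell
# Hypothetical filesystem ID -- replace with your EFS filesystem's ID.
EFS_ID=fs-12345678

# List the mount targets for the filesystem and the subnets they live in.
aws efs describe-mount-targets --file-system-id "$EFS_ID" \
  --query 'MountTargets[].{Id:MountTargetId,Subnet:SubnetId,State:LifeCycleState}' \
  --output table

# Check which security groups a given mount target uses
# (mount-target ID is a placeholder).
aws efs describe-mount-target-security-groups --mount-target-id fsmt-abcdef12

# That security group must allow inbound NFS (TCP 2049) from the node
# security group; inspect its inbound rules (group ID is a placeholder).
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[].IpPermissions'
```

If TCP 2049 is not open from the nodes to the mount targets, the pods will hang or fail when mounting the volume.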
from elastic.
@Jeffwan Can you please take a look at the EKS settings needed to make the EFS mount points work inside EKS?
/assign @Jeffwan
@Jeffwan I tried out the steps for EFS mounting. By default the VPC security group gets selected, so I changed it to the ClusterSharedNodeGroup.
The ImageNet sample is still not working, and the pod logs show a FailedScheduling error for the GPU nodes:
kubectl describe pod imagenet-worker-1
Name: imagenet-worker-1
Namespace: default
Priority: 0
Node: <none>
Labels: job-name=imagenet
worker=1
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"labels":{"job-name":"imagenet","worker":"1"},"name":"imagenet-worker-1","nam...
kubernetes.io/psp: eks.privileged
Status: Pending
IP:
Containers:
elastic-trainer-worker:
Image: torchelastic/examples:0.1.0rc1
Port: <none>
Host Port: <none>
Args:
s3://torchelastic-ubuntu-m6cdt-s3bucket-1tbgx2u77u2j4/petctl/ubuntu/ubuntu-job/main.py
--input_path
/data/tiny-imagenet-200/train
--epochs
10
Limits:
nvidia.com/gpu: 1
Requests:
nvidia.com/gpu: 1
Environment:
RDZV_ENDPOINT: etcd:2379
JOB_ID: imagenet
MIN_SIZE: 1
MAX_SIZE: 3
SIZE: 2
Mounts:
/data from persistent-storage (rw)
/dev/shm from dshm (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-6s4bz (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
persistent-storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: efs-claim
ReadOnly: false
dshm:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
default-token-6s4bz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-6s4bz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 8s (x2 over 10s) default-scheduler 0/2 nodes are available: 2 Insufficient nvidia.com/gpu.
The data download step is crashing because the CentOS 'yum update' gets stuck waiting for a Y/N prompt. Please change it to 'yum update -y' in the data download steps. Also include troubleshooting steps like the ones below to make it easier for developers to identify problems, and mention that it will take ~1 hr for the data to download and inflate even for the tiny-imagenet dataset. It would be better to switch to an example with a smaller dataset.
kubectl get pods
NAME READY STATUS RESTARTS AGE
download-dataset-task 0/1 CrashLoopBackOff 4 3m35s
etcd 1/1 Running 0 2m44s
imagenet-worker-1 0/1 Pending 0 94s
imagenet-worker-2 0/1 Pending 0 93s
Details of the crashing pod:
kubectl describe pod download-dataset-task
Name: download-dataset-task
Namespace: default
Priority: 0
Node: ip-192-168-81-41.us-east-2.compute.internal/192.168.81.41
Start Time: Sun, 29 Mar 2020 19:50:49 -0700
Labels: <none>
Annotations: kubectl.kubernetes.io/last-applied-configuration:
{"apiVersion":"v1","kind":"Pod","metadata":{"annotations":{},"name":"download-dataset-task","namespace":"default"},"spec":{"containers":[{...
kubernetes.io/psp: eks.privileged
Status: Running
IP: 192.168.91.84
Containers:
app:
Container ID: docker://74e0187cdcf70a847aa776f3a495a6226d4e38cf1a032cee184e5dcbadaf12cc
Image: centos:latest
Image ID: docker-pullable://centos@sha256:fe8d824220415eed5477b63addf40fb06c3b049404242b31982106ac204f6700
Port: <none>
Host Port: <none>
Command:
/bin/sh
-c
/bin/bash <<'EOF'
yum update
yum install -y wget unzip
wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
unzip tiny-imagenet-200.zip -d /data
EOF
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Sun, 29 Mar 2020 20:02:42 -0700
Finished: Sun, 29 Mar 2020 20:02:47 -0700
Ready: False
Restart Count: 7
Environment: <none>
Mounts:
/data from persistent-storage (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-6s4bz (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
persistent-storage:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: efs-claim
ReadOnly: false
default-token-6s4bz:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-6s4bz
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 12m default-scheduler Successfully assigned default/download-dataset-task to ip-192-168-81-41.us-east-2.compute.internal
Normal Created 11m (x4 over 12m) kubelet, ip-192-168-81-41.us-east-2.compute.internal Created container app
Normal Started 11m (x4 over 12m) kubelet, ip-192-168-81-41.us-east-2.compute.internal Started container app
Normal Pulling 10m (x5 over 12m) kubelet, ip-192-168-81-41.us-east-2.compute.internal Pulling image "centos:latest"
Normal Pulled 10m (x5 over 12m) kubelet, ip-192-168-81-41.us-east-2.compute.internal Successfully pulled image "centos:latest"
Warning BackOff 2m6s (x44 over 11m) kubelet, ip-192-168-81-41.us-east-2.compute.internal Back-off restarting failed container
kubectl logs download-dataset-task app
CentOS-8 - AppStream 8.5 MB/s | 6.6 MB 00:00
CentOS-8 - Base 11 MB/s | 5.0 MB 00:00
CentOS-8 - Extras 21 kB/s | 4.8 kB 00:00
Dependencies resolved.
================================================================================
Package Arch Version Repo Size
================================================================================
Upgrading:
audit-libs x86_64 3.0-0.13.20190507gitf58ec40.el8 BaseOS 116 k
binutils x86_64 2.30-58.el8_1.1 BaseOS 5.7 M
centos-gpg-keys noarch 8.1-1.1911.0.9.el8 BaseOS 12 k
centos-release x86_64 8.1-1.1911.0.9.el8 BaseOS 21 k
centos-repos x86_64 8.1-1.1911.0.9.el8 BaseOS 13 k
glibc x86_64 2.28-72.el8_1.1 BaseOS 3.7 M
glibc-common x86_64 2.28-72.el8_1.1 BaseOS 836 k
glibc-minimal-langpack x86_64 2.28-72.el8_1.1 BaseOS 48 k
libarchive x86_64 3.3.2-8.el8_1 BaseOS 359 k
openldap x86_64 2.4.46-11.el8_1 BaseOS 352 k
sqlite-libs x86_64 3.26.0-4.el8_1 BaseOS 579 k
systemd x86_64 239-18.el8_1.4 BaseOS 3.5 M
systemd-libs x86_64 239-18.el8_1.4 BaseOS 562 k
systemd-pam x86_64 239-18.el8_1.4 BaseOS 232 k
systemd-udev x86_64 239-18.el8_1.4 BaseOS 1.3 M
Installing dependencies:
xkeyboard-config noarch 2.24-3.el8 AppStream 828 k
kbd-legacy noarch 2.0.4-8.el8 BaseOS 481 k
kbd-misc noarch 2.0.4-8.el8 BaseOS 1.4 M
Installing weak dependencies:
libxkbcommon x86_64 0.8.2-1.el8 AppStream 116 k
diffutils x86_64 3.6-5.el8 BaseOS 359 k
glibc-langpack-en x86_64 2.28-72.el8_1.1 BaseOS 818 k
kbd x86_64 2.0.4-8.el8 BaseOS 392 k
Transaction Summary
================================================================================
Install 7 Packages
Upgrade 15 Packages
Total download size: 22 M
Operation aborted.
Is this ok [y/N]: Is this ok [y/N]: Is this ok [y/N]: Is this ok [y/N]:
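A minimal fix for the script run by the download pod, assuming the same centos:latest image and /data mount shown above. The -y flag answers the interactive prompt, and set -e makes the pod fail fast on a real error instead of hanging:

```shell
#!/bin/bash
# Fail immediately on any error or unset variable instead of hanging.
set -euo pipefail

# -y answers the "Is this ok [y/N]" prompt automatically, so yum
# no longer aborts the transaction inside the non-interactive container.
yum update -y
yum install -y wget unzip

# ~240 MB download; unzipping onto EFS can take a long time
# (roughly an hour for the full download-and-inflate step).
wget http://cs231n.stanford.edu/tiny-imagenet-200.zip
unzip -q tiny-imagenet-200.zip -d /data
```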
@Jeffwan For NVIDIA GPUs it would be better to switch to the new GPU Operator to make setup easier for developers. Can we update the example to use this mechanism instead?
https://devblogs.nvidia.com/nvidia-gpu-operator-simplifying-gpu-management-in-kubernetes/
I will update the pod to fetch the data. BTW, are you using the device plugin? The event shows insufficient GPUs, so it looks like your cluster doesn't have any available GPUs.
The GPU Operator is at an early stage and we don't use it yet.
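For reference, the device plugin runs as a DaemonSet in kube-system. A typical install looks like the sketch below; the v0.5.0 release tag is only an example, so pick the release from the NVIDIA/k8s-device-plugin repo that matches your Kubernetes version:

```shell
# Deploy the NVIDIA device plugin DaemonSet (version tag is an example;
# check the k8s-device-plugin releases for one matching your cluster).
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.5.0/nvidia-device-plugin.yml

# The plugin pods should be Running on every GPU node.
kubectl get pods -n kube-system -l name=nvidia-device-plugin-ds
```

Until the plugin pods are running on the GPU nodes, the nodes will not advertise nvidia.com/gpu and the scheduler will report "Insufficient nvidia.com/gpu".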
@Jeffwan Yes, I used the NVIDIA plugin as per the instructions and spun up the cluster with p3.2xlarge nodes. Is there any way to troubleshoot the daemonset setup and check the actual GPU resources?
- Can you run 'kubectl get ds' and check whether the daemonsets are up?
- Check your node status: run 'kubectl get nodes' and then 'kubectl describe node xxx' to check the node's allocatable resources.
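The checks above can be sketched as follows; the node name is taken from the describe output earlier in the thread, so adjust it for your cluster:

```shell
# DaemonSets (including the NVIDIA device plugin) live in kube-system.
kubectl get ds -n kube-system

# List the nodes, then inspect one for its allocatable GPU count.
kubectl get nodes
kubectl describe node ip-192-168-81-41.us-east-2.compute.internal | grep -A 6 Allocatable

# Or query all nodes at once; an empty second column means the node
# is not advertising any nvidia.com/gpu resources.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'
```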
@Jeffwan The 'kubectl get ds' command returns No resources found.
Nothing shows up in the EKS console in AWS, or in the logs, that indicates whether it is running or not. Is there something missing in our config?
What about the following command? I forgot to ask you to specify the namespace:
kubectl get ds -n kube-system
@Jeffwan That works, thanks!
We have a new ImageNet example in v0.2.0. Feel free to open an issue if you see problems running it on EKS.