cloud-native-stack's Introduction

NVIDIA Cloud Native Stack

NVIDIA Cloud Native Stack (formerly known as Cloud Native Core) is a collection of software to run cloud native workloads on NVIDIA GPUs. NVIDIA Cloud Native Stack is based on Ubuntu, Kubernetes, Helm and the NVIDIA GPU and Network Operator.

Interested in deploying NVIDIA Cloud Native Stack? This repository has install guides for manual installations and Ansible playbooks for automated installations.

Interested in a pre-provisioned NVIDIA Cloud Native Stack environment? NVIDIA LaunchPad provides pre-provisioned environments so that you can quickly get started.

Getting Started

Prerequisites

Please make sure the following prerequisites are met before installing Cloud Native Stack (a quick sanity check for several of them is shown after the list):

  • The system has direct internet access
  • The operating system is Ubuntu 20.04 or later, or RHEL 8.7
  • The system has adequate internet bandwidth
  • A working DNS server is configured on the system
  • The system can access the Google package repositories (for the Kubernetes installation)
  • The system has only one network interface configured with internet access, and its IP address is static
  • UEFI Secure Boot is disabled
  • The root file system has at least 40 GB of capacity
  • The system has at least 2 CPUs and 4 GB of memory
  • At least one NVIDIA GPU is attached to the system
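
The commands below are a minimal, hedged sketch for sanity-checking several of these prerequisites (CPU count, memory, root file system size, internet access, Secure Boot state, and GPU presence); adjust them to your environment, and note that mokutil may need to be installed separately.

nproc                                           # expect 2 or more CPUs
free -g | awk '/^Mem:/{print $2 " GB RAM"}'     # expect roughly 4 GB or more
df -h /                                         # expect at least 40 GB on the root file system
curl -sI https://github.com | head -n 1         # basic internet connectivity check
mokutil --sb-state                              # should report SecureBoot disabled
lspci | grep -i nvidia                          # confirm at least one NVIDIA GPU is attached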

Installation

Run the commands below to clone the NVIDIA Cloud Native Stack repository.

git clone https://github.com/NVIDIA/cloud-native-stack.git
cd cloud-native-stack/playbooks

Update the hosts file in the playbooks directory with the IP addresses of the master node and worker nodes (if any), along with the SSH username and password, as shown below:

nano hosts

[master]
<master-IP> ansible_ssh_user=nvidia ansible_ssh_pass=nvidiapass ansible_sudo_pass=nvidiapass ansible_ssh_common_args='-o StrictHostKeyChecking=no'
[node]
<worker-IP> ansible_ssh_user=nvidia ansible_ssh_pass=nvidiapass ansible_sudo_pass=nvidiapass ansible_ssh_common_args='-o StrictHostKeyChecking=no'
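
Optionally, before running the installation, you can verify that Ansible can reach every host in the inventory. This is a hedged sketch; it assumes Ansible (and sshpass, for password-based SSH) is already installed on the provisioning machine, which setup.sh otherwise handles.

ansible all -i hosts -m ping     # every host should answer with "pong"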

Install NVIDIA Cloud Native Stack by running the command below. "Skipping" in the Ansible output indicates that the Kubernetes cluster is already up and running.

bash setup.sh install
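
Once the playbooks complete, a hedged way to confirm the cluster is healthy is to check that the nodes are Ready and the system pods are running, assuming kubectl has access to the cluster's kubeconfig (the playbooks copy it to the home directory on the master node):

kubectl get nodes -o wide     # all nodes should report Ready
kubectl get pods -A           # system pods should be Running or Completed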

For more information about customizing the values, please refer to the Installation guide.

NVIDIA Cloud Native Stack Component Matrix

Branch/Release | Version | Initial Release Date | Platform | OS | Containerd | CRI-O | K8s | Helm | NVIDIA GPU Operator | NVIDIA Network Operator | NVIDIA Data Center Driver
24.5.0/master | 13.0 | 14 May 2024 | NVIDIA Certified Server (x86 & arm64) | Ubuntu 22.04 LTS | 1.7.16 | 1.30.0 | 1.30.0 | 3.14.4 | 24.3.0 | 24.1.1 (x86 only) | 550.54.15
24.5.0/master | 13.0 | 14 May 2024 | NVIDIA Certified Server (x86 & arm64) | RHEL 8.9 | 1.7.16 | 1.30.0 | 1.30.0 | 3.14.4 | 24.3.0 | N/A | 550.54.15
24.5.0/master | 13.0 | 14 May 2024 | Jetson Devices (AGX, NX, Orin) | JetPack 5.1 and JetPack 5.0 | 1.7.16 | 1.30.0 | 1.30.0 | 3.14.4 | N/A | N/A | N/A
24.5.0/master | 13.0 | 14 May 2024 | DGX Server | DGX OS 6.0 (Ubuntu 22.04 LTS) | 1.7.16 | 1.30.0 | 1.30.0 | 3.14.4 | 24.3.0 | N/A | N/A
24.5.0/master | 12.1 | 14 May 2024 | NVIDIA Certified Server (x86 & arm64) | Ubuntu 22.04 LTS | 1.7.16 | 1.29.4 | 1.29.4 | 3.14.4 | 24.3.0 | 24.1.1 (x86 only) | 550.54.15
24.5.0/master | 12.1 | 14 May 2024 | NVIDIA Certified Server (x86 & arm64) | RHEL 8.9 | 1.7.16 | 1.29.4 | 1.29.4 | 3.14.4 | 24.3.0 | N/A | 550.54.15
24.5.0/master | 12.1 | 14 May 2024 | Jetson Devices (AGX, NX, Orin) | JetPack 5.1 and JetPack 5.0 | 1.7.16 | 1.29.4 | 1.29.4 | 3.14.4 | N/A | N/A | N/A
24.5.0/master | 12.1 | 14 May 2024 | DGX Server | DGX OS 6.0 (Ubuntu 22.04 LTS) | 1.7.16 | 1.29.4 | 1.29.4 | 3.14.4 | 24.3.0 | N/A | N/A
24.5.0/master | 11.2 | 14 May 2024 | NVIDIA Certified Server (x86 & arm64) | Ubuntu 22.04 LTS | 1.7.16 | 1.28.6 | 1.28.8 | 3.14.4 | 24.3.0 | 24.1.1 (x86 only) | 550.54.15
24.5.0/master | 11.2 | 14 May 2024 | NVIDIA Certified Server (x86 & arm64) | RHEL 8.9 | 1.7.16 | 1.28.6 | 1.28.8 | 3.14.4 | 24.3.0 | N/A | 550.54.15
24.5.0/master | 11.2 | 14 May 2024 | Jetson Devices (AGX, NX, Orin) | JetPack 5.1 and JetPack 5.0 | 1.7.16 | 1.28.6 | 1.28.8 | 3.14.4 | N/A | N/A | N/A
24.5.0/master | 11.2 | 14 May 2024 | DGX Server | DGX OS 6.0 (Ubuntu 22.04 LTS) | 1.7.16 | 1.28.6 | 1.28.8 | 3.14.4 | 24.3.0 | N/A | N/A
24.5.0/master | 10.5 | 14 May 2024 | NVIDIA Certified Server (x86 & arm64) | Ubuntu 22.04 LTS | 1.7.16 | 1.27.6 | 1.27.12 | 3.14.4 | 24.3.0 | 24.1.1 (x86 only) | 550.54.15
24.5.0/master | 10.5 | 14 May 2024 | NVIDIA Certified Server (x86 & arm64) | RHEL 8.9 | 1.7.16 | 1.27.6 | 1.27.12 | 3.14.4 | 24.3.0 | N/A | 550.54.15
24.5.0/master | 10.5 | 14 May 2024 | Jetson Devices (AGX, NX, Orin) | JetPack 5.1 and JetPack 5.0 | 1.7.16 | 1.27.6 | 1.27.12 | 3.14.4 | N/A | N/A | N/A
24.5.0/master | 10.5 | 14 May 2024 | DGX Server | DGX OS 6.0 (Ubuntu 22.04 LTS) | 1.7.16 | 1.27.6 | 1.27.12 | 3.14.4 | 24.3.0 | N/A | N/A

To find information about other CNS releases, please refer to the Cloud Native Stack Component Matrix.

NOTE: The CNS versions above are also available on the master branch, but it is recommended to use the specific branch for the respective release, as shown below.
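
A minimal sketch of checking out a release branch before running the playbooks, assuming the branch is named after the release as listed in the matrix above:

git clone https://github.com/NVIDIA/cloud-native-stack.git
cd cloud-native-stack
git checkout 24.5.0     # pick the Branch/Release that matches the CNS version you want
cd playbooks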

Cloud Native Stack Topologies

  • Cloud Native Stack allows you to deploy:
    • 1 node with both control plane and worker functionalities
    • 1 control plane node and any number of worker nodes

NOTE: Cloud Native Stack does not support deploying multiple control plane nodes.
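
For the single-node topology, a hedged sketch of the hosts file is shown below; it assumes the playbooks configure the control plane node to also schedule workloads when no workers are listed under [node]:

[master]
<node-IP> ansible_ssh_user=nvidia ansible_ssh_pass=nvidiapass ansible_sudo_pass=nvidiapass ansible_ssh_common_args='-o StrictHostKeyChecking=no'
[node]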

Cloud Native Stack Features

Branch/Release | CNS Version | Release Date | Kserve | LoadBalancer | Storage | Monitoring
24.5.0/master | 13.0, 12.1, 11.2, 10.5 | 9 July 2024 | 0.13 (Istio: 1.20.4, Knative: 1.13.1, CertManager: 1.9.0) | MetalLB: 0.14.5 | NFS: 4.0.18, Local Path: 0.0.26 | Prometheus: 61.3.0, Elastic: 8.14.1
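
A hedged way to confirm which of these optional components ended up on the cluster is to list the Helm releases and filter the running pods by the component names above; the exact namespaces depend on how the playbooks deploy them:

helm list -A
kubectl get pods -A | grep -Ei 'metallb|istio|knative|kserve|prometheus|elastic|nfs|local-path'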

Getting Help or Providing Feedback

Please open an issue on the GitHub project for any questions. Your feedback is appreciated.


cloud-native-stack's Issues

Failed to init kubeadm

I am following exactly the steps from the guide https://github.com/NVIDIA/egx-platform/blob/master/install-guides/Jetson_Xavier_NX_v2.0.md

I failed at this command:

sudo kubeadm init --pod-network-cidr=10.244.0.0/16

I did try sudo kubeadm reset -f, but the error is still the same. Any clue what the problem is?

output:
[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"
I0524 18:07:42.638020 13185 local.go:69] [etcd] wrote Static Pod manifest for a local etcd member to "/etc/kubernetes/manifests/etcd.yaml"
I0524 18:07:42.638133 13185 waitcontrolplane.go:80] [wait-control-plane] Waiting for the API server to be healthy
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.

Unfortunately, an error has occurred:
timed out waiting for the condition

This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
couldn't initialize a Kubernetes cluster
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init.runWaitControlPlanePhase
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/init/waitcontrolplane.go:100
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:234
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:422
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.NewCmdInit.func1
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/init.go:147
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:826
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:914
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:864
k8s.io/kubernetes/cmd/kubeadm/app.Run
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:203
runtime.goexit
/usr/local/go/src/runtime/asm_arm64.s:1128
error execution phase wait-control-plane
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run.func1
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:235
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).visitAll
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:422
k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow.(*Runner).Run
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/phases/workflow/runner.go:207
k8s.io/kubernetes/cmd/kubeadm/app/cmd.NewCmdInit.func1
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/cmd/init.go:147
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).execute
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:826
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).ExecuteC
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:914
k8s.io/kubernetes/vendor/github.com/spf13/cobra.(*Command).Execute
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/vendor/github.com/spf13/cobra/command.go:864
k8s.io/kubernetes/cmd/kubeadm/app.Run
/workspace/anago-v1.17.5-beta.0.65+a04e7d7c202142/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/app/kubeadm.go:50
main.main
_output/dockerized/go/src/k8s.io/kubernetes/cmd/kubeadm/kubeadm.go:25
runtime.main
/usr/local/go/src/runtime/proc.go:203
runtime.goexit
/usr/local/go/src/runtime/asm_arm64.s:1128

sudo systemctl restart docker failed

Hi,
Environment: Xavier, JetPack 4.4

The error is: (screenshot attached in the original issue)

When I run sudo systemctl restart docker, it fails if I set "default-runtime": "nvidia" in daemon.json.
But if I remove "default-runtime": "nvidia" from daemon.json, no error occurs.
Thanks for any help.
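
For reference, a hedged sketch of the /etc/docker/daemon.json configuration being described, based on the commonly documented NVIDIA Container Toolkit layout (the runtime path may differ between JetPack versions):

sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker        # if the restart still fails, check: journalctl -xeu docker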

Playbook for Ubuntu Server 4.2 fails to set up Kubernetes

Following https://github.com/NVIDIA/egx-platform/blob/master/playbooks/Ubuntu_Server_v4.2.md on a g4dn.2xlarge EC2 instance running Ubuntu 20.04.3 LTS fails to install successfully.

$ cat hosts 
[master]
172.31.24.228 ansible_ssh_user=ubuntu ansible_ssh_common_args='-o StrictHostKeyChecking=no'
[nodes]
172.31.24.228 ansible_ssh_user=ubuntu ansible_ssh_common_args='-o StrictHostKeyChecking=no'


$ bash setup.sh install
Ansible Already Installed


EGX DIY Stack Version 4.2

Installing EGX Stack

PLAY [all] **********************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] **********************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Checking Nouveau is disabled] *********************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [unload nouveau] ***********************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [blacklist nouveau] ********************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Add an Kubernetes apt signing key for Ubuntu] *****************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Adding Kubernetes apt repository for Ubuntu] ******************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Install kubernetes components for Ubuntu on EGX Stack 1.2 or 1.3] *********************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Install kubernetes components for Ubuntu on EGX Stack 2.0] ****************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Install kubernetes components for Ubuntu on EGX Stack 3.1] ****************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Install kubernetes components for Ubuntu on EGX Stack 4.x] ****************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Install kubernetes components for Ubuntu on EGX Stack 4.2] ****************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Hold the installed Packages] **********************************************************************************************************************************************************************************************************
changed: [172.31.24.228] => (item=kubelet)
changed: [172.31.24.228] => (item=kubectl)
changed: [172.31.24.228] => (item=kubeadm)

TASK [Creating a Kubernetes repository file for RHEL/CentOS] ********************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Adding repository details in Kubernetes repo file for RHEL/CentOS] ********************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Installing required packages for RHEL/CentOS] *****************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Validate whether Kubernetes cluster installed] ****************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Add Docker GPG key for Ubuntu] ********************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Add Docker APT repository for Ubuntu] *************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Install Docker-CE Engine for Ubuntu 20.04 on EGX Stack 3.1 or 1.3] ********************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Install Docker-CE Engine for Ubuntu 18.04 on EGX Stack 3.1 or 1.3] ********************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Install Docker-CE Engine for Ubuntu on EGX Stack 2.0] *********************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Install Docker-CE Engine for Ubuntu on EGX Stack 1.2] *********************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Configuring Docker-CE repo for RHEL/CentOS] *******************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Install Docker-CE Engine on RHEL/CentOS] **********************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Adding Docker to Current User] ********************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [SetEnforce for RHEL/CentOS] ***********************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [SELinux for RHEL/CentOS] **************************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Enable Firewall Service for RHEL/CentOS] **********************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Allow Network Ports in Firewalld for RHEL/CentOS] *************************************************************************************************************************************************************************************
skipping: [172.31.24.228] => (item=6443/tcp) 
skipping: [172.31.24.228] => (item=10250/tcp) 

TASK [Remove swapfile from /etc/fstab] ******************************************************************************************************************************************************************************************************
ok: [172.31.24.228] => (item=swap)
ok: [172.31.24.228] => (item=none)

TASK [Disable swap] *************************************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Create containerd.conf] ***************************************************************************************************************************************************************************************************************
changed: [172.31.24.228] => (item=overlay)
changed: [172.31.24.228] => (item=br_netfilter)

TASK [Modprobe for overlay and br_netfilter] ************************************************************************************************************************************************************************************************
ok: [172.31.24.228] => (item=overlay)
ok: [172.31.24.228] => (item=br_netfilter)

TASK [Add sysctl parameters to /etc/sysctl.conf] ********************************************************************************************************************************************************************************************
ok: [172.31.24.228] => (item={'name': 'net.bridge.bridge-nf-call-ip6tables', 'value': '1', 'reload': False})
ok: [172.31.24.228] => (item={'name': 'net.bridge.bridge-nf-call-iptables', 'value': '1', 'reload': False})
ok: [172.31.24.228] => (item={'name': 'net.ipv4.ip_forward', 'value': '1', 'reload': True})

TASK [Install libseccomp2] ******************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Create /etc/containerd] ***************************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Create /etc/default/kubelet] **********************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Download cri-containerd-cni] **********************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Untar cri-containerd-cni] *************************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Get defaults from containerd] *********************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Write defaults to config.toml] ********************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [restart containerd] *******************************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Download cri-containerd-cni] **********************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Untar cri-containerd-cni] *************************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Get defaults from containerd] *********************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Write defaults to config.toml] ********************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [restart containerd] *******************************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Download cri-containerd-cni] **********************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Untar cri-containerd-cni] *************************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Get defaults from containerd] *********************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Write defaults to config.toml] ********************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [restart containerd] *******************************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Starting and enabling the required services] ******************************************************************************************************************************************************************************************
ok: [172.31.24.228] => (item=docker)
ok: [172.31.24.228] => (item=kubelet)
ok: [172.31.24.228] => (item=containerd)

PLAY RECAP **********************************************************************************************************************************************************************************************************************************
172.31.24.228              : ok=24   changed=12   unreachable=0    failed=0    skipped=29   rescued=0    ignored=0   


PLAY [master] *******************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] **********************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Validate whether Kubernetes cluster installed] ****************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Reset Kubernetes component] ***********************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [remove etcd directory] ****************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Iniitialize the Kubernetes cluster using kubeadm and containerd] **********************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Initialize the Kubernetes cluster using kubeadm] **************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Create kube directory] ****************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [admin permissions] ********************************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Copy kubeconfig to home] **************************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Install networking plugin to kubernetes cluster on EGX DIY Stack 3.1 or 4.0 or 4.1 or 4.2] ********************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Install networking plugin to kubernetes cluster on EGX DIY Stack 2.0] *****************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Install networking plugin to kubernetes cluster on EGX DIY Stack 1.2 or 1.3] **********************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Update Networking Plugin for  on EGX DIY Stack 1.2 or 1.3] ****************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Taint the Kubernetes Control Plane node] **********************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Generate join token] ******************************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [set_fact] *****************************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Store join command] *******************************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

PLAY [nodes] ********************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] **********************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Reset Kubernetes component] ***********************************************************************************************************************************************************************************************************
changed: [172.31.24.228]

TASK [Create kube directory] ****************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Copy kubeadm-join command to node] ****************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Get the Active Mellanox NIC on nodes] *************************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

TASK [Copy Mellanox NIC Active File to master] **********************************************************************************************************************************************************************************************
skipping: [172.31.24.228]

PLAY [nodes] ********************************************************************************************************************************************************************************************************************************

TASK [Gathering Facts] **********************************************************************************************************************************************************************************************************************
ok: [172.31.24.228]

TASK [Run kubeadm join] *********************************************************************************************************************************************************************************************************************
fatal: [172.31.24.228]: FAILED! => {"changed": true, "cmd": "kubeadm join 172.31.24.228:6443 --token ddhvan.6rb88wwwx33gxre0 --discovery-token-ca-cert-hash sha256:d3a8c4144c5b08c1c34ef20696db23063682fc5b74dd4d42d184cd33fbba0d2f ", "delta": "0:05:00.161901", "end": "2022-02-19 19:40:31.153453", "msg": "non-zero return code", "rc": 1, "start": "2022-02-19 19:35:30.991552", "stderr": "error execution phase preflight: couldn't validate the identity of the API Server: Get \"https://172.31.24.228:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s\": dial tcp 172.31.24.228:6443: connect: connection refused\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["error execution phase preflight: couldn't validate the identity of the API Server: Get \"https://172.31.24.228:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s\": dial tcp 172.31.24.228:6443: connect: connection refused", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[preflight] Running pre-flight checks", "stdout_lines": ["[preflight] Running pre-flight checks"]}

PLAY RECAP **********************************************************************************************************************************************************************************************************************************
172.31.24.228              : ok=18   changed=10   unreachable=0    failed=1    skipped=6    rescued=0    ignored=0   
$ systemctl status containerd
● containerd.service - containerd container runtime
     Loaded: loaded (/etc/systemd/system/containerd.service; enabled; vendor preset: enabled)
     Active: active (running) since Sat 2022-02-19 19:34:38 UTC; 13min ago
       Docs: https://containerd.io
   Main PID: 37046 (containerd)
      Tasks: 24
     Memory: 923.5M
     CGroup: /system.slice/containerd.service
             └─37046 /usr/local/bin/containerd

Feb 19 19:35:29 ip-172-31-24-228 containerd[37046]: time="2022-02-19T19:35:29.140155035Z" level=info msg="cleaning up dead shim"
Feb 19 19:35:29 ip-172-31-24-228 containerd[37046]: time="2022-02-19T19:35:29.150767759Z" level=warning msg="cleanup warnings time=\"2022-02-19T19:35:29Z\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=39253\n"
Feb 19 19:35:29 ip-172-31-24-228 containerd[37046]: time="2022-02-19T19:35:29.198470266Z" level=info msg="shim disconnected" id=6a5b60d0c24e1a1748c9dd7fc98b94f957967e6a00ea4679be269c67596d18d4
Feb 19 19:35:29 ip-172-31-24-228 containerd[37046]: time="2022-02-19T19:35:29.198527288Z" level=warning msg="cleaning up after shim disconnected" id=6a5b60d0c24e1a1748c9dd7fc98b94f957967e6a00ea4679be269c67596d18d4 namespace=k8s.io
Feb 19 19:35:29 ip-172-31-24-228 containerd[37046]: time="2022-02-19T19:35:29.198540466Z" level=info msg="cleaning up dead shim"
Feb 19 19:35:29 ip-172-31-24-228 containerd[37046]: time="2022-02-19T19:35:29.209116727Z" level=warning msg="cleanup warnings time=\"2022-02-19T19:35:29Z\" level=info msg=\"starting signal loop\" namespace=k8s.io pid=39292\n"
Feb 19 19:35:29 ip-172-31-24-228 containerd[37046]: time="2022-02-19T19:35:29.209466026Z" level=info msg="TearDown network for sandbox \"6a5b60d0c24e1a1748c9dd7fc98b94f957967e6a00ea4679be269c67596d18d4\" successfully"
Feb 19 19:35:29 ip-172-31-24-228 containerd[37046]: time="2022-02-19T19:35:29.209491081Z" level=info msg="StopPodSandbox for \"6a5b60d0c24e1a1748c9dd7fc98b94f957967e6a00ea4679be269c67596d18d4\" returns successfully"
Feb 19 19:35:29 ip-172-31-24-228 containerd[37046]: time="2022-02-19T19:35:29.223992525Z" level=info msg="RemovePodSandbox for \"6a5b60d0c24e1a1748c9dd7fc98b94f957967e6a00ea4679be269c67596d18d4\""
Feb 19 19:35:29 ip-172-31-24-228 containerd[37046]: time="2022-02-19T19:35:29.230819595Z" level=info msg="RemovePodSandbox \"6a5b60d0c24e1a1748c9dd7fc98b94f957967e6a00ea4679be269c67596d18d4\" returns successfully"
$ systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: inactive (dead) since Sat 2022-02-19 19:35:28 UTC; 12min ago
       Docs: https://kubernetes.io/docs/home/
    Process: 38103 ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS (code=exited, status=0/SUCCESS)
   Main PID: 38103 (code=exited, status=0/SUCCESS)

Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340896   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"etc-ca-certificates\" (UniqueName: \"kubernetes.io/host-path/8163a2>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340945   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"etc-ca-certificates\" (UniqueName: \"kubernetes.io/host-path/930396>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340977   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"k8s-certs\" (UniqueName: \"kubernetes.io/host-path/930396a63ed739fc>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.341011   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"usr-share-ca-certificates\" (UniqueName: \"kubernetes.io/host-path/>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.341041   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kubeconfig\" (UniqueName: \"kubernetes.io/host-path/1ed545894a1d8d6>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.923817   38103 apiserver.go:52] "Watching apiserver"
Feb 19 19:35:28 ip-172-31-24-228 kubelet[38103]: I0219 19:35:28.147621   38103 reconciler.go:157] "Reconciler: start to sync state"
Feb 19 19:35:28 ip-172-31-24-228 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Feb 19 19:35:28 ip-172-31-24-228 systemd[1]: kubelet.service: Succeeded.
Feb 19 19:35:28 ip-172-31-24-228 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
$ journalctl -n100 -u kubelet
-- Logs begin at Sat 2022-02-19 11:33:31 UTC, end at Sat 2022-02-19 19:47:02 UTC. --
Feb 19 19:35:18 ip-172-31-24-228 kubelet[37621]: E0219 19:35:18.304412   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:18 ip-172-31-24-228 kubelet[37621]: E0219 19:35:18.404542   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:18 ip-172-31-24-228 kubelet[37621]: E0219 19:35:18.504989   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:18 ip-172-31-24-228 kubelet[37621]: E0219 19:35:18.605633   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:18 ip-172-31-24-228 kubelet[37621]: E0219 19:35:18.706066   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:18 ip-172-31-24-228 kubelet[37621]: E0219 19:35:18.806337   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:18 ip-172-31-24-228 kubelet[37621]: E0219 19:35:18.807446   37621 nodelease.go:49] "Failed to get node when trying to set owner ref to the node lease" err="nodes \"ip-172-31-24-228\" not found" node="ip-172-31-24-228"
Feb 19 19:35:18 ip-172-31-24-228 kubelet[37621]: E0219 19:35:18.906865   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:18 ip-172-31-24-228 kubelet[37621]: I0219 19:35:18.911712   37621 kubelet_node_status.go:74] "Successfully registered node" node="ip-172-31-24-228"
Feb 19 19:35:19 ip-172-31-24-228 kubelet[37621]: E0219 19:35:19.007803   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:19 ip-172-31-24-228 kubelet[37621]: E0219 19:35:19.108114   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:19 ip-172-31-24-228 kubelet[37621]: E0219 19:35:19.208318   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:19 ip-172-31-24-228 kubelet[37621]: E0219 19:35:19.308797   37621 kubelet.go:2291] "Error getting node" err="node \"ip-172-31-24-228\" not found"
Feb 19 19:35:20 ip-172-31-24-228 kubelet[37621]: I0219 19:35:20.342050   37621 apiserver.go:52] "Watching apiserver"
Feb 19 19:35:20 ip-172-31-24-228 kubelet[37621]: I0219 19:35:20.515471   37621 reconciler.go:157] "Reconciler: start to sync state"
Feb 19 19:35:21 ip-172-31-24-228 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Feb 19 19:35:21 ip-172-31-24-228 systemd[1]: kubelet.service: Succeeded.
Feb 19 19:35:21 ip-172-31-24-228 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.
Feb 19 19:35:21 ip-172-31-24-228 systemd[1]: Started kubelet: The Kubernetes Node Agent.
Feb 19 19:35:21 ip-172-31-24-228 kubelet[38103]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluste>
Feb 19 19:35:21 ip-172-31-24-228 kubelet[38103]: I0219 19:35:21.882857   38103 server.go:197] "Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead"
Feb 19 19:35:21 ip-172-31-24-228 kubelet[38103]: Flag --cgroup-driver has been deprecated, This parameter should be set via the config file specified by the Kubelet's --config flag. See https://kubernetes.io/docs/tasks/administer-cluste>
Feb 19 19:35:21 ip-172-31-24-228 kubelet[38103]: I0219 19:35:21.891922   38103 server.go:440] "Kubelet version" kubeletVersion="v1.21.7"
Feb 19 19:35:21 ip-172-31-24-228 kubelet[38103]: I0219 19:35:21.892170   38103 server.go:851] "Client rotation is on, will bootstrap in background"
Feb 19 19:35:21 ip-172-31-24-228 kubelet[38103]: I0219 19:35:21.893803   38103 certificate_store.go:130] Loading cert/key pair from "/var/lib/kubelet/pki/kubelet-client-current.pem".
Feb 19 19:35:21 ip-172-31-24-228 kubelet[38103]: I0219 19:35:21.894711   38103 dynamic_cafile_content.go:167] Starting client-ca-bundle::/etc/kubernetes/pki/ca.crt
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922132   38103 server.go:660] "--cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922355   38103 container_manager_linux.go:278] "Container manager verified user specified cgroup-root exists" cgroupRoot=[]
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922423   38103 container_manager_linux.go:283] "Creating Container Manager object based on Node Config" nodeConfig={RuntimeCgroupsName: SystemCgroupsName: KubeletCgroupsNam>
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922439   38103 topology_manager.go:120] "Creating topology manager with policy per scope" topologyPolicyName="none" topologyScopeName="container"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922449   38103 container_manager_linux.go:314] "Initializing Topology Manager" policy="none" scope="container"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922456   38103 container_manager_linux.go:319] "Creating device plugin manager" devicePluginEnabled=true
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922539   38103 remote_runtime.go:62] parsed scheme: ""
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922547   38103 remote_runtime.go:62] scheme "" not registered, fallback to default scheme
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922578   38103 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922588   38103 clientconn.go:948] ClientConn switching balancer to "pick_first"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922644   38103 remote_image.go:50] parsed scheme: ""
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922651   38103 remote_image.go:50] scheme "" not registered, fallback to default scheme
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922659   38103 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{/run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922664   38103 clientconn.go:948] ClientConn switching balancer to "pick_first"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922736   38103 kubelet.go:404] "Attempting to sync node with API server"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922753   38103 kubelet.go:272] "Adding static pod path" path="/etc/kubernetes/manifests"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922812   38103 kubelet.go:283] "Adding apiserver pod source"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.922833   38103 apiserver.go:42] "Waiting for node sync before watching apiserver pods"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.933337   38103 kuberuntime_manager.go:222] "Container runtime initialized" containerRuntime="containerd" version="v1.5.8" apiVersion="v1alpha2"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.936118   38103 server.go:1190] "Started kubelet"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.937526   38103 server.go:149] "Starting to listen" address="0.0.0.0" port=10250
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.938794   38103 fs_resource_analyzer.go:67] "Starting FS ResourceAnalyzer"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.938846   38103 server.go:409] "Adding debug handlers to kubelet server"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.938881   38103 volume_manager.go:271] "Starting Kubelet Volume Manager"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.938912   38103 desired_state_of_world_populator.go:141] "Desired state populator starts to run"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: E0219 19:35:26.940346   38103 cri_stats_provider.go:369] "Failed to get the info of the filesystem with mountpoint" err="unable to find data in memory cache" mountpoint="/var/lib/containe>
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: E0219 19:35:26.940389   38103 kubelet.go:1306] "Image garbage collection failed once. Stats initialization may not have completed yet" err="invalid capacity 0 on image filesystem"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.940448   38103 client.go:86] parsed scheme: "unix"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.940800   38103 client.go:86] scheme "unix" not registered, fallback to default scheme
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.941095   38103 passthrough.go:48] ccResolverWrapper: sending update to cc: {[{unix:///run/containerd/containerd.sock  <nil> 0 <nil>}] <nil> <nil>}
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.941242   38103 clientconn.go:948] ClientConn switching balancer to "pick_first"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.965658   38103 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv4
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.981806   38103 kubelet_network_linux.go:56] "Initialized protocol iptables rules." protocol=IPv6
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.981830   38103 status_manager.go:157] "Starting to sync pod status with apiserver"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: I0219 19:35:26.981853   38103 kubelet.go:1846] "Starting kubelet main sync loop"
Feb 19 19:35:26 ip-172-31-24-228 kubelet[38103]: E0219 19:35:26.981906   38103 kubelet.go:1870] "Skipping pod synchronization" err="[container runtime status check may not have completed yet, PLEG is not healthy: pleg has yet to be succ>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.041741   38103 kubelet_node_status.go:71] "Attempting to register node" node="ip-172-31-24-228"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.052702   38103 kubelet_node_status.go:109] "Node was previously registered" node="ip-172-31-24-228"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.052798   38103 kubelet_node_status.go:74] "Successfully registered node" node="ip-172-31-24-228"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: E0219 19:35:27.082391   38103 kubelet.go:1870] "Skipping pod synchronization" err="container runtime status check may not have completed yet"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.086107   38103 cpu_manager.go:199] "Starting CPU manager" policy="none"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.086135   38103 cpu_manager.go:200] "Reconciling" reconcilePeriod="10s"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.086153   38103 state_mem.go:36] "Initialized new in-memory state store"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.086326   38103 state_mem.go:88] "Updated default CPUSet" cpuSet=""
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.086343   38103 state_mem.go:96] "Updated CPUSet assignments" assignments=map[]
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.086352   38103 policy_none.go:44] "None policy: Start"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.090812   38103 manager.go:600] "Failed to retrieve checkpoint" checkpoint="kubelet_internal_checkpoint" err="checkpoint is not found"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.091118   38103 plugin_manager.go:114] "Starting Kubelet Plugin Manager"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.283039   38103 topology_manager.go:187] "Topology Admit Handler"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.283181   38103 topology_manager.go:187] "Topology Admit Handler"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.283244   38103 topology_manager.go:187] "Topology Admit Handler"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.283295   38103 topology_manager.go:187] "Topology Admit Handler"
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340336   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"ca-certs\" (UniqueName: \"kubernetes.io/host-path/8163a26df170b890f>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340394   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"etc-pki\" (UniqueName: \"kubernetes.io/host-path/8163a26df170b890f5>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340424   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"usr-local-share-ca-certificates\" (UniqueName: \"kubernetes.io/host>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340538   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"usr-share-ca-certificates\" (UniqueName: \"kubernetes.io/host-path/>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340592   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"etc-pki\" (UniqueName: \"kubernetes.io/host-path/930396a63ed739fcfd>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340622   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"etcd-data\" (UniqueName: \"kubernetes.io/host-path/91c2306dae8c7092>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340650   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"k8s-certs\" (UniqueName: \"kubernetes.io/host-path/8163a26df170b890>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340677   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"ca-certs\" (UniqueName: \"kubernetes.io/host-path/930396a63ed739fcf>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340722   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"flexvolume-dir\" (UniqueName: \"kubernetes.io/host-path/930396a63ed>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340753   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kubeconfig\" (UniqueName: \"kubernetes.io/host-path/930396a63ed739f>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340799   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"usr-local-share-ca-certificates\" (UniqueName: \"kubernetes.io/host>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340842   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"etcd-certs\" (UniqueName: \"kubernetes.io/host-path/91c2306dae8c709>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340896   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"etc-ca-certificates\" (UniqueName: \"kubernetes.io/host-path/8163a2>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340945   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"etc-ca-certificates\" (UniqueName: \"kubernetes.io/host-path/930396>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.340977   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"k8s-certs\" (UniqueName: \"kubernetes.io/host-path/930396a63ed739fc>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.341011   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"usr-share-ca-certificates\" (UniqueName: \"kubernetes.io/host-path/>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.341041   38103 reconciler.go:224] "operationExecutor.VerifyControllerAttachedVolume started for volume \"kubeconfig\" (UniqueName: \"kubernetes.io/host-path/1ed545894a1d8d6>
Feb 19 19:35:27 ip-172-31-24-228 kubelet[38103]: I0219 19:35:27.923817   38103 apiserver.go:52] "Watching apiserver"
Feb 19 19:35:28 ip-172-31-24-228 kubelet[38103]: I0219 19:35:28.147621   38103 reconciler.go:157] "Reconciler: start to sync state"
Feb 19 19:35:28 ip-172-31-24-228 systemd[1]: Stopping kubelet: The Kubernetes Node Agent...
Feb 19 19:35:28 ip-172-31-24-228 systemd[1]: kubelet.service: Succeeded.
Feb 19 19:35:28 ip-172-31-24-228 systemd[1]: Stopped kubelet: The Kubernetes Node Agent.

kubeadm join command

I'm not sure why, but copying the kubeadm join command doesn't seem to work unless I run the install from one of the master nodes. If it's run from a third-party system, the files aren't copied to the worker nodes, and the cluster join then fails.
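If the copied join command is unusable, one hedged workaround is to regenerate it on a control plane node and run it on each worker by hand (a sketch; the token and hash below are placeholders printed by the first command):

sudo kubeadm token create --print-join-command
# On each worker, run the printed command with sudo, e.g.:
# sudo kubeadm join <master-IP>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>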

EGX Platform 4.2 on DGX2 can not install Kubernetes

I am using the Ansible playbook scripts for EGX Platform 4.2 on a DGX2 (in colossus) and following the prerequisites, with the exception of the OS: DGX OS was preinstalled and can't be changed by the user.

The installation fails at the same step.
Background and Logs
cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.3 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.3 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

Logs

See Attached Ansible logs


TASK [Iniitialize the Kubernetes cluster using kubeadm and containerd] *******************************************************************************************************************************
fatal: [10.34.3.223]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--pod-network-cidr=192.168.32.0/22", "--cri-socket=/run/containerd/containerd.sock"], "delta": "0:04:03.700622", "end": "2022-02-15 19:13:04.510635", "msg": "non-zero return code", "rc": 1, "start": "2022-02-15 19:09:00.810013", "stderr": "I0215 19:09:01.126459 562444 version.go:254] remote version is much newer: v1.23.3; falling back to: stable-1.21\nerror execution phase wait-control-plane: couldn't initialize a Kubernetes cluster\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["I0215 19:09:01.126459 562444 version.go:254] remote version is much newer: v1.23.3; falling back to: stable-1.21", "error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.21.9\n[preflight] Running pre-flight checks\n[preflight] Pulling images required for setting up a Kubernetes cluster\n[preflight] This might take a minute or two, depending on the speed of your internet connection\n[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'\n[certs] Using certificateDir folder "/etc/kubernetes/pki"\n[certs] Generating "ca" certificate and key\n[certs] Generating "apiserver" certificate and key\n[certs] apiserver serving cert is signed for DNS names [demobox kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.34.3.223]\n[certs] Generating "apiserver-kubelet-client" certificate and key\n[certs] Generating "front-proxy-ca" certificate and key\n[certs] Generating "front-proxy-client" certificate and key\n[certs] Generating "etcd/ca" certificate and key\n[certs] Generating "etcd/server" certificate and key\n[certs] etcd/server serving cert is signed for DNS names [demobox localhost] and IPs [10.34.3.223 127.0.0.1 ::1]\n[certs] Generating "etcd/peer" certificate and key\n[certs] etcd/peer serving cert is signed for DNS names [demobox localhost] and IPs [10.34.3.223 127.0.0.1 ::1]\n[certs] Generating "etcd/healthcheck-client" certificate and key\n[certs] Generating "apiserver-etcd-client" certificate and key\n[certs] Generating "sa" key and public key\n[kubeconfig] Using kubeconfig folder "/etc/kubernetes"\n[kubeconfig] Writing "admin.conf" kubeconfig file\n[kubeconfig] Writing "kubelet.conf" kubeconfig file\n[kubeconfig] Writing "controller-manager.conf" kubeconfig file\n[kubeconfig] Writing "scheduler.conf" kubeconfig file\n[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"\n[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"\n[kubelet-start] Starting the kubelet\n[control-plane] Using manifest folder "/etc/kubernetes/manifests"\n[control-plane] Creating static Pod manifest for "kube-apiserver"\n[control-plane] Creating static Pod manifest for "kube-controller-manager"\n[control-plane] Creating static Pod manifest for "kube-scheduler"\n[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"\n[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". 
This can take up to 4m0s\n[kubelet-check] Initial timeout of 40s passed.\n\n\tUnfortunately, an error has occurred:\n\t\ttimed out waiting for the condition\n\n\tThis error is likely caused by:\n\t\t- The kubelet is not running\n\t\t- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)\n\n\tIf you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:\n\t\t- 'systemctl status kubelet'\n\t\t- 'journalctl -xeu kubelet'\n\n\tAdditionally, a control plane component may have crashed or exited when started by the container runtime.\n\tTo troubleshoot, list all containers using your preferred container runtimes CLI.\n\n\tHere is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:\n\t\t- 'crictl --runtime-endpoint /run/containerd/containerd.sock ps -a | grep kube | grep -v pause'\n\t\tOnce you have found the failing container, you can inspect its logs with:\n\t\t- 'crictl --runtime-endpoint /run/containerd/containerd.sock logs CONTAINERID'", "stdout_lines": ["[init] Using Kubernetes version: v1.21.9", "[preflight] Running pre-flight checks", "[preflight] Pulling images required for setting up a Kubernetes cluster", "[preflight] This might take a minute or two, depending on the speed of your internet connection", "[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'", "[certs] Using certificateDir folder "/etc/kubernetes/pki"", "[certs] Generating "ca" certificate and key", "[certs] Generating "apiserver" certificate and key", "[certs] apiserver serving cert is signed for DNS names [demobox kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.96.0.1 10.34.3.223]", "[certs] Generating "apiserver-kubelet-client" certificate and key", "[certs] Generating "front-proxy-ca" certificate and key", "[certs] Generating "front-proxy-client" certificate and key", "[certs] Generating "etcd/ca" certificate and key", "[certs] Generating "etcd/server" certificate and key", "[certs] etcd/server serving cert is signed for DNS names [demobox localhost] and IPs [10.34.3.223 127.0.0.1 ::1]", "[certs] Generating "etcd/peer" certificate and key", "[certs] etcd/peer serving cert is signed for DNS names [demobox localhost] and IPs [10.34.3.223 127.0.0.1 ::1]", "[certs] Generating "etcd/healthcheck-client" certificate and key", "[certs] Generating "apiserver-etcd-client" certificate and key", "[certs] Generating "sa" key and public key", "[kubeconfig] Using kubeconfig folder "/etc/kubernetes"", "[kubeconfig] Writing "admin.conf" kubeconfig file", "[kubeconfig] Writing "kubelet.conf" kubeconfig file", "[kubeconfig] Writing "controller-manager.conf" kubeconfig file", "[kubeconfig] Writing "scheduler.conf" kubeconfig file", "[kubelet-start] Writing kubelet environment file with flags to file "/var/lib/kubelet/kubeadm-flags.env"", "[kubelet-start] Writing kubelet configuration to file "/var/lib/kubelet/config.yaml"", "[kubelet-start] Starting the kubelet", "[control-plane] Using manifest folder "/etc/kubernetes/manifests"", "[control-plane] Creating static Pod manifest for "kube-apiserver"", "[control-plane] Creating static Pod manifest for "kube-controller-manager"", "[control-plane] Creating static Pod manifest for "kube-scheduler"", "[etcd] Creating static Pod manifest for local etcd in "/etc/kubernetes/manifests"", "[wait-control-plane] Waiting for the kubelet to boot up the control plane as 
static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s", "[kubelet-check] Initial timeout of 40s passed.", "", "\tUnfortunately, an error has occurred:", "\t\ttimed out waiting for the condition", "", "\tThis error is likely caused by:", "\t\t- The kubelet is not running", "\t\t- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)", "", "\tIf you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:", "\t\t- 'systemctl status kubelet'", "\t\t- 'journalctl -xeu kubelet'", "", "\tAdditionally, a control plane component may have crashed or exited when started by the container runtime.", "\tTo troubleshoot, list all containers using your preferred container runtimes CLI.", "", "\tHere is one example how you may list all Kubernetes containers running in cri-o/containerd using crictl:", "\t\t- 'crictl --runtime-endpoint /run/containerd/containerd.sock ps -a | grep kube | grep -v pause'", "\t\tOnce you have found the failing container, you can inspect its logs with:", "\t\t- 'crictl --runtime-endpoint /run/containerd/containerd.sock logs CONTAINERID'"]}

dgx.log

systemctl status kubelet
● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Tue 2022-02-15 19:09:04 PST; 4min 54s ago
Docs: https://kubernetes.io/docs/home/
Main PID: 562664 (kubelet)
Tasks: 60 (limit: 618884)
Memory: 81.8M
CGroup: /system.slice/kubelet.service
└─562664 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --cgroup-driver=cgroupfs --config=/var/lib/kubelet/con>
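As the kubeadm error above suggests, the failing control-plane container can be located and its logs read with crictl (commands taken from that error output; CONTAINERID is a placeholder, and sudo may be needed):

sudo crictl --runtime-endpoint /run/containerd/containerd.sock ps -a | grep kube | grep -v pause
sudo crictl --runtime-endpoint /run/containerd/containerd.sock logs CONTAINERID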

pod CrashLoopBackOff

cloud-native-stack/install-guides
/Jetson_Xavier_v11.1.md
I referred to this article, and after deploying k8s the pod "iva-video-analytics-demo-l4t" went into the "CrashLoopBackOff" state. When I tried opening the web page, it showed "Stream will start playing automatically when it is live". The pod does eventually reach Running after a while, but there is still no video when accessing the web page, and VLC doesn't work either.
My DeepStream was installed according to NVIDIA's instructions, but when I viewed the pod logs, the following errors appeared:
root@master:/home/nvidia# kubectl logs iva-video-analytics-demo-l4t-84c9df6766-q4t55

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

debconf: delaying package configuration, since apt-utils is not installed
cp: cannot stat 'deepstream_app_tao_configs/*': No such file or directory

No NGC Configuration Provided

sed: can't read /opt/nvidia/deepstream/deepstream-6.2/samples/configs/tao_pretrained_models//config_infer_primary_trafficcamnet.txt: No such file or directory
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
debconf: unable to initialize frontend: Dialog
debconf: (TERM is not set, so the dialog frontend is not usable.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin:
deepstream-app: error while loading shared libraries: libnvdla_compiler.so: cannot open shared object file: No such file or directory

Failed to setup NVIDIA Cloud Native Stack v11.0 for Developers

Hello,

I am trying to setup NVIDIA Cloud Native Stack v11.0 for Developers on AWS with Ubuntu v22.04. I followed the install guide but the GPU Operator doesn't start correctly.

Platform: AWS
Instance type: g4dn.4xlarge
OS: Ubuntu v22.04 (AMI: ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20231207)

Once I finished the Installing GPU Operator section and started verifying the GPU Operator, I got the following error message:

ubuntu@ip-172-31-34-87:~$ helm -n nvidia-gpu-operator ls
NAME                    NAMESPACE               REVISION        UPDATED                                 STATUS          CHART                   APP VERSION
gpu-operator-1702604969 nvidia-gpu-operator     1               2023-12-15 01:49:30.08757447 +0000 UTC  deployed        gpu-operator-v23.9.1    v23.9.1

The gpu-operator is not functioning:

ubuntu@ip-172-31-34-87:~$ kubectl -n nvidia-gpu-operator get all
NAME                                                                  READY   STATUS                  RESTARTS        AGE
pod/gpu-feature-discovery-5hcg5                                       0/1     Init:0/1                0               16m
pod/gpu-operator-1702604969-node-feature-discovery-gc-f7bc979f2cs2l   1/1     Running                 0               16m
pod/gpu-operator-1702604969-node-feature-discovery-master-86c8f2rwx   1/1     Running                 0               16m
pod/gpu-operator-1702604969-node-feature-discovery-worker-pth4p       1/1     Running                 0               16m
pod/gpu-operator-75fb9db9fd-92ljm                                     1/1     Running                 0               16m
pod/nvidia-dcgm-exporter-vb9lm                                        0/1     Init:0/1                0               16m
pod/nvidia-device-plugin-daemonset-fmv95                              0/1     Init:0/1                0               16m
pod/nvidia-operator-validator-cst7k                                   0/1     Init:CrashLoopBackOff   7 (4m54s ago)   16m

I tried launching another instance, but it ended with the same result. Then I tried to print the logs:

ubuntu@ip-172-31-47-239:~$ kubectl logs -n nvidia-gpu-operator pod/nvidia-container-toolkit-daemonset-wp6kj
Defaulted container "nvidia-container-toolkit-ctr" out of: nvidia-container-toolkit-ctr, driver-validation (init)
Error from server (BadRequest): container "nvidia-container-toolkit-ctr" in pod "nvidia-container-toolkit-daemonset-wp6kj" is terminated
ubuntu@ip-172-31-47-239:~$
ubuntu@ip-172-31-47-239:~$ kubectl logs -n nvidia-gpu-operator pod/nvidia-operator-validator-45xxh
Defaulted container "nvidia-operator-validator" out of: nvidia-operator-validator, driver-validation (init), toolkit-validation (init), cuda-validation (init), plugin-validation (init)
Error from server (BadRequest): container "nvidia-operator-validator" in pod "nvidia-operator-validator-45xxh" is terminated

I had followed another install guide for Ubuntu Server, which does not install the CUDA driver, and that one works. What did I miss in this guide? I also tried matching the CUDA driver version (the version that nvidia-smi prints out) when running the helm install command for the GPU Operator, but that doesn't help either.
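Since the failing containers are init containers, plain kubectl logs on the pod returns little; a hedged next step is to ask for the init container explicitly (sketch; <validator-pod> stands for the actual nvidia-operator-validator pod name from kubectl get pods):

kubectl -n nvidia-gpu-operator logs <validator-pod> -c driver-validation
kubectl -n nvidia-gpu-operator describe pod <validator-pod>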

With egx-platform 4.1 on Ubuntu, Validating CUDA with GPU task hangs

Host OS: Ubuntu 20.04 LTS
EGX-Platform 4.1 (installed via playbook)

When running setup.sh validate, I'm seeing the Validating the CUDA with GPU task hanging.

TASK [Collecting Number of GPU's] **************************************************************************************
changed: [172.16.100.10]

TASK [Validating the nvidia-smi on Kubernetes] *************************************************************************
changed: [172.16.100.10]

TASK [Validating the CUDA with GPU] ************************************************************************************

It just hangs here. If I cancel and rerun, all the other validation works, but this task fails, saying it is already running.

Looking at the logs of the cuda-vector-add pod, things look good.

nvidia@kejones-egx-stack-01:~$ kubectl get pods
NAME                                                              READY   STATUS      RESTARTS   AGE
cuda-vector-add                                                   0/1     Completed   0          5m29s
gpu-operator-1634331603-node-feature-discovery-master-6cccwnnmg   1/1     Running     0          31m
gpu-operator-1634331603-node-feature-discovery-worker-nkkdg       1/1     Running     0          31m
gpu-operator-7d5bf78f5c-z9xfs                                     1/1     Running     0          31m
nvidia@kejones-egx-stack-01:~$ kubectl logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
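One hedged workaround, assuming the leftover Completed pod is what makes the re-run report that the task is already running, is to delete it before validating again:

kubectl delete pod cuda-vector-add
bash setup.sh validate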

Not able to connect to Kubernetes Cluster on System Restart

I am having an issue connecting to the Kubernetes cluster after a system restart.

On running the playbook again, I get the following error.

TASK [Iniitialize the Kubernetes cluster using kubeadm and containerd for Cloud Native Core 6.2] *******************************
fatal: [localhost]: FAILED! => {"changed": true, "cmd": ["kubeadm", "init", "--pod-network-cidr=192.168.32.0/22", "--cri-socket=/run/containerd/containerd.sock", "--kubernetes-version=v1.23.7", "--image-repository=k8s.gcr.io"], "delta": "0:00:06.605533", "end": "2023-02-23 07:11:48.306510", "msg": "non-zero return code", "rc": 1, "start": "2023-02-23 07:11:41.700977", "stderr": "error execution phase preflight: [preflight] Some fatal errors occurred:\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.23.7: output: E0223 07:11:43.840167   21550 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-apiserver:v1.23.7\\\": blob sha256:a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99: blob not found: not found\" image=\"k8s.gcr.io/kube-apiserver:v1.23.7\"\ntime=\"2023-02-23T07:11:43Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-apiserver:v1.23.7\\\": blob sha256:a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.23.7: output: E0223 07:11:44.604162   21601 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-controller-manager:v1.23.7\\\": blob sha256:db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068: blob not found: not found\" image=\"k8s.gcr.io/kube-controller-manager:v1.23.7\"\ntime=\"2023-02-23T07:11:44Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-controller-manager:v1.23.7\\\": blob sha256:db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.23.7: output: E0223 07:11:45.351086   21666 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-scheduler:v1.23.7\\\": blob sha256:5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe: blob not found: not found\" image=\"k8s.gcr.io/kube-scheduler:v1.23.7\"\ntime=\"2023-02-23T07:11:45Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-scheduler:v1.23.7\\\": blob sha256:5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe: blob not found: not 
found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.23.7: output: E0223 07:11:46.097771   21716 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-proxy:v1.23.7\\\": blob sha256:26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663: blob not found: not found\" image=\"k8s.gcr.io/kube-proxy:v1.23.7\"\ntime=\"2023-02-23T07:11:46Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-proxy:v1.23.7\\\": blob sha256:26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/pause:3.6: output: E0223 07:11:46.815906   21774 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/pause:3.6\\\": blob sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db: blob not found: not found\" image=\"k8s.gcr.io/pause:3.6\"\ntime=\"2023-02-23T07:11:46Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/pause:3.6\\\": blob sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/etcd:3.5.1-0: output: E0223 07:11:47.565614   21831 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/etcd:3.5.1-0\\\": blob sha256:64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263: blob not found: not found\" image=\"k8s.gcr.io/etcd:3.5.1-0\"\ntime=\"2023-02-23T07:11:47Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/etcd:3.5.1-0\\\": blob sha256:64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263: blob not found: not found\"\n, error: exit status 1\n\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns/coredns:v1.8.6: output: E0223 07:11:48.302979   21889 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/coredns/coredns:v1.8.6\\\": blob sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e: blob not found: not 
found\" image=\"k8s.gcr.io/coredns/coredns:v1.8.6\"\ntime=\"2023-02-23T07:11:48Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/coredns/coredns:v1.8.6\\\": blob sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e: blob not found: not found\"\n, error: exit status 1\n[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`\nTo see the stack trace of this error execute with --v=5 or higher", "stderr_lines": ["error execution phase preflight: [preflight] Some fatal errors occurred:", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.23.7: output: E0223 07:11:43.840167   21550 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-apiserver:v1.23.7\\\": blob sha256:a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99: blob not found: not found\" image=\"k8s.gcr.io/kube-apiserver:v1.23.7\"", "time=\"2023-02-23T07:11:43Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-apiserver:v1.23.7\\\": blob sha256:a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/a7206717f4a04b2dbd33505c0c4eb93907b892f5a610488ef7bfe94c6692be99: blob not found: not found\"", ", error: exit status 1", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.23.7: output: E0223 07:11:44.604162   21601 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-controller-manager:v1.23.7\\\": blob sha256:db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068: blob not found: not found\" image=\"k8s.gcr.io/kube-controller-manager:v1.23.7\"", "time=\"2023-02-23T07:11:44Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-controller-manager:v1.23.7\\\": blob sha256:db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/db4970df9c7a657d31299a6bc96b86d9d36c1d91e914b3428a43392fa4299068: blob not found: not found\"", ", error: exit status 1", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.23.7: output: E0223 07:11:45.351086   21666 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-scheduler:v1.23.7\\\": blob sha256:5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe: blob not found: not found\" image=\"k8s.gcr.io/kube-scheduler:v1.23.7\"", "time=\"2023-02-23T07:11:45Z\" level=fatal msg=\"pulling image: rpc error: code = 
NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-scheduler:v1.23.7\\\": blob sha256:5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b0fc360fb9abd0a33e31c3ed37c713d40dc5d3089b38e4092495510d45dcafe: blob not found: not found\"", ", error: exit status 1", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.23.7: output: E0223 07:11:46.097771   21716 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-proxy:v1.23.7\\\": blob sha256:26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663: blob not found: not found\" image=\"k8s.gcr.io/kube-proxy:v1.23.7\"", "time=\"2023-02-23T07:11:46Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/kube-proxy:v1.23.7\\\": blob sha256:26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/26d9bc653f99e54163685a93f440a774802c3ce00eea16d46147bb8dc4d88663: blob not found: not found\"", ", error: exit status 1", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/pause:3.6: output: E0223 07:11:46.815906   21774 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/pause:3.6\\\": blob sha256:3d380ca8864549e74af4b29c10f9cb0956236dfb01c40ca076fb6c37253234db expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/3d380ca8864549e7e failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/etcd:3.5.1-0\\\": blob sha256:64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263: blob not found: not found\" image=\"k8s.gcr.io/etcd:3.5.1-0\"", "time=\"2023-02-23T07:11:47Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/etcd:3.5.1-0\\\": blob sha256:64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263 expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/64b9ea357325d5db9f8a723dcf503b5a449177b17ac87d69481e126bb724c263: blob not found: not found\"", ", error: exit status 1", "\t[ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns/coredns:v1.8.6: output: E0223 07:11:48.302979   21889 remote_image.go:238] \"PullImage from image service failed\" err=\"rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/coredns/coredns:v1.8.6\\\": blob sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e expected at /var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e: blob not found: not found\" image=\"k8s.gcr.io/coredns/coredns:v1.8.6\"", "time=\"2023-02-23T07:11:48Z\" level=fatal msg=\"pulling image: rpc error: code = NotFound desc = failed to pull and unpack image \\\"k8s.gcr.io/coredns/coredns:v1.8.6\\\": blob sha256:5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e expected at 
/var/lib/containerd/io.containerd.content.v1.content/blobs/sha256/5b6ec0d6de9baaf3e92d0f66cd96a25b9edbce8716f5f15dcd1a616b3abd590e: blob not found: not found\"", ", error: exit status 1", "[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`", "To see the stack trace of this error execute with --v=5 or higher"], "stdout": "[init] Using Kubernetes version: v1.23.7\n[preflight] Running pre-flight checks\n[preflight] Pulling images required for setting up a Kubernetes cluster\n[preflight] This might take a minute or two, depending on the speed of your internet connection\n[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'", "stdout_lines": ["[init] Using Kubernetes version: v1.23.7", "[preflight] Running pre-flight checks", "[preflight] Pulling images required for setting up a Kubernetes cluster", "[preflight] This might take a minute or two, depending on the speed of your internet connection", "[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'"]}

The main concern is understanding why the cluster becomes unreachable after a restart and how to resolve it.

CNC deployment fails with error message: Unsupported parameters for (apt) module: allow_change_held_packages

TASK [Install kubernetes components for Ubuntu on NVIDIA Cloud Native Core 7.1] ***********************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Unsupported parameters for (apt) module: allow_change_held_packages. Supported parameters include: dpkg_options, state, install_recommends (install-recommends), update_cache (update-cache), autoclean, lock_timeout, force_apt_get, deb, allow_unauthenticated (allow-unauthenticated), purge, update_cache_retry_max_delay, package (name, pkg), autoremove, only_upgrade, force, update_cache_retries, allow_downgrade (allow-downgrade, allow-downgrades, allow_downgrades), cache_valid_time, fail_on_autoremove, policy_rc_d, upgrade, default_release (default-release)."}

cnc_values.yaml references old version of CNS

CNC_Values.yaml sets cnc_version: 6.3, however there is another input with this value inside cnc_version.yaml which sets cnc_version: 8.1.

This might be confusing to users.

Is it possible to have just one value setting cnc_version and have it default to the latest (e.g. 9.0)?

Ansible 7.0.0 removes "warn" parameter for ansible.legacy.command

"setup.sh install" pip install the latest ansible package, which is currently 7.0.0 (recently released). Apparently support for the "warn" parameter to ansible.legacy.command has been removed.

Suggestion is to remove the "warn" parameters in all *.yaml files, and perhaps pin the ansible version to a specific version in setup.sh to avoid unexpected changes in the future.

As a quick fix for a local checkout, the current CNS will work if ansible 6.0.0 is installed; for example, in setup.sh replace the ansible install with the following line:

pip3 install ansible==6.0.0 2>&1 >/dev/null
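Alternatively, the warn parameters can be located and stripped from a local checkout before running with Ansible 7.x (a hedged sketch; review the matches before deleting anything):

grep -rn "warn:" playbooks/
sed -i '/^\s*warn:/d' playbooks/*.yaml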

Video Analytics Demo and Nvidia-smi pod just "pending" after being deployed

I have followed the steps outlined in the Ubuntu Server V3.0 install guide, and the validation steps are not working.

First I tried the nvidia-smi example, but the nvidia-smi pod just showed as "Pending" while all other pods were shown as "Running".

Second, I tried the Video Analytics Demo. The install from the helm chart seemed to be successful, and helm reported that the pod was deployed. When I check the pod status, it shows as "pending" as well, while all other pods (other than nvidia-smi) show as "Running".

None of the install steps showed an error, but does this mean that my installation was not successful? How can I find some information on debugging a "Pending" pod?
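A common starting point for a Pending pod (a generic sketch, not specific to this guide) is to read the scheduler events:

kubectl describe pod nvidia-smi
kubectl get events --sort-by=.lastTimestamp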

Cannot validate install; GPU not available

Good day fellow NVIDIANs (jwyatt here)

I am following the instructions to the letter in EGX Stack v3.1 for AWS - Install Guide for Ubuntu Server x86-64 and am running into an issue when trying to validate the installation with nvidia-smi.

I'm running on a g4dn.2xlarge.

In summary, the nvidia-smi pod is hanging indefinitely. kubectl describe pod nvidia-smi reads:

Name:         nvidia-smi
Namespace:    default
Priority:     0
Node:         <none>
Labels:       run=nvidia-smi
Annotations:  <none>
Status:       Pending
IP:           
IPs:          <none>
Containers:
  nvidia-smi:
    Image:      nvidia/cuda:11.1.1-base
    Port:       <none>
    Host Port:  <none>
    Args:
      nvidia-smi
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-v5g2r (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-token-v5g2r:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-v5g2r
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  35s (x103 over 151m)  default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

I think the last line there is most relevant.

Describe Node

The output of kubectl describe node is:

Name:               ip-172-31-21-241
Roles:              master
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    feature.node.kubernetes.io/cpu-cpuid.ADX=true
                    feature.node.kubernetes.io/cpu-cpuid.AESNI=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX2=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
                    feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
                    feature.node.kubernetes.io/cpu-cpuid.FMA3=true
                    feature.node.kubernetes.io/cpu-cpuid.MPX=true
                    feature.node.kubernetes.io/cpu-hardware_multithreading=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ=true
                    feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
                    feature.node.kubernetes.io/kernel-version.full=5.11.0-1020-aws
                    feature.node.kubernetes.io/kernel-version.major=5
                    feature.node.kubernetes.io/kernel-version.minor=11
                    feature.node.kubernetes.io/kernel-version.revision=0
                    feature.node.kubernetes.io/pci-10de.present=true
                    feature.node.kubernetes.io/pci-1d0f.present=true
                    feature.node.kubernetes.io/storage-nonrotationaldisk=true
                    feature.node.kubernetes.io/system-os_release.ID=ubuntu
                    feature.node.kubernetes.io/system-os_release.VERSION_ID=20.04
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.major=20
                    feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-172-31-21-241
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/master=
                    nvidia.com/gpu.present=true
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    nfd.node.kubernetes.io/extended-resources: 
                    nfd.node.kubernetes.io/feature-labels:
                      cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ,cpu-cpuid.AVX512F,cpu-...
                    nfd.node.kubernetes.io/master.version: v0.6.0
                    nfd.node.kubernetes.io/worker.version: v0.6.0
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 172.31.21.241/20
                    projectcalico.org/IPv4IPIPTunnelAddr: 192.168.198.128
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Thu, 09 Dec 2021 19:00:28 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  ip-172-31-21-241
  AcquireTime:     <unset>
  RenewTime:       Thu, 09 Dec 2021 22:04:22 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Thu, 09 Dec 2021 19:05:06 +0000   Thu, 09 Dec 2021 19:05:06 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:00:27 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Thu, 09 Dec 2021 22:01:00 +0000   Thu, 09 Dec 2021 19:05:00 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  172.31.21.241
  Hostname:    ip-172-31-21-241
Capacity:
  cpu:                8
  ephemeral-storage:  64989720Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32407904Ki
  pods:               110
Allocatable:
  cpu:                8
  ephemeral-storage:  59894525853
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             32305504Ki
  pods:               110
System Info:
  Machine ID:                 ec218707683e25881184af68445ebd87
  System UUID:                ec218707-683e-2588-1184-af68445ebd87
  Boot ID:                    d59c4692-4971-43f2-a237-2d2fe49434cc
  Kernel Version:             5.11.0-1020-aws
  OS Image:                   Ubuntu 20.04.3 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  docker://19.3.13
  Kubelet Version:            v1.18.14
  Kube-Proxy Version:         v1.18.14
PodCIDR:                      192.168.0.0/24
PodCIDRs:                     192.168.0.0/24
Non-terminated Pods:          (14 in total)
  Namespace                   Name                                                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                   ----                                                               ------------  ----------  ---------------  -------------  ---
  default                     gpu-operator-1639077567-node-feature-discovery-master-84485st4x    0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  default                     gpu-operator-1639077567-node-feature-discovery-worker-ffljl        0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  default                     gpu-operator-76fb8d5c55-rq7jj                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  gpu-operator-resources      nvidia-container-toolkit-daemonset-g4rzv                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  gpu-operator-resources      nvidia-driver-daemonset-gsl5l                                      0 (0%)        0 (0%)      0 (0%)           0 (0%)         164m
  kube-system                 calico-kube-controllers-7f94cf5997-zr46g                           0 (0%)        0 (0%)      0 (0%)           0 (0%)         179m
  kube-system                 calico-node-8w9jt                                                  250m (3%)     0 (0%)      0 (0%)           0 (0%)         179m
  kube-system                 coredns-66bff467f8-hnbb6                                           100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     3h3m
  kube-system                 coredns-66bff467f8-msqdc                                           100m (1%)     0 (0%)      70Mi (0%)        170Mi (0%)     3h3m
  kube-system                 etcd-ip-172-31-21-241                                              0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-apiserver-ip-172-31-21-241                                    250m (3%)     0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-controller-manager-ip-172-31-21-241                           200m (2%)     0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-proxy-mssxq                                                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         3h3m
  kube-system                 kube-scheduler-ip-172-31-21-241                                    100m (1%)     0 (0%)      0 (0%)           0 (0%)         3h3m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests    Limits
  --------           --------    ------
  cpu                1 (12%)     0 (0%)
  memory             140Mi (0%)  340Mi (1%)
  ephemeral-storage  0 (0%)      0 (0%)
  hugepages-1Gi      0 (0%)      0 (0%)
  hugepages-2Mi      0 (0%)      0 (0%)
Events: 

I notice there is no nvidia.com/gpu listed under Allocated resources.
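Since nvidia.com/gpu is advertised by the NVIDIA device plugin, and no device-plugin pod appears in the list above, a hedged check is whether that daemonset ever came up and whether the resource shows on the node (node name taken from the output above):

kubectl -n gpu-operator-resources get pods
kubectl describe node ip-172-31-21-241 | grep nvidia.com/gpu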

Thanks for your support.

NOTICE: NVIDIA Cloud Native Core fails due to CUDA linux repository GPG key rotation

Issue

Pre-existing Cloud Native Core installation stopped working after a reboot.
Note: Installations deployed on 04/29 or later are not impacted.

What happened:

The NVIDIA team rotated the GPG keys for the CUDA Linux repositories on 4/28. More information on this can be found here. The CUDA repository is included in the NVIDIA driver images deployed through the GPU Operator, which causes failures during apt-get update on Ubuntu 20.04. This happens whenever currently running driver containers are restarted or the node reboots.

The following error message will be seen from the driver Pod (nvidia-driver-ctr container):

Updating the package cache...
W: GPG error: https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY A4B469963BF863CC
E: The repository 'https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64  InRelease' is no longer signed.

Fix:

Option 1 - Uninstall and re-install

  1. bash setup.sh uninstall
  2. bash setup.sh install

Option 2 - Repair

  • Repair existing Cloud Native Core installation by updating the clusterpolicy.
  • Go to the GPU Operator Issue #344 to see the exact steps.

GPU Operator Installation Failed

When trying to follow this step:
helm install --version 22.09 --create-namespace --namespace gpu-operator-resources --devel nvidia/gpu-operator --wait --generate-name

I got this error message:
Error: INSTALLATION FAILED: Get "https://10.158.210.53:6443/apis/apiextensions.k8s.io/v1/customresourcedefinitions/nodefeaturerules.nfd.k8s-sigs.io": dial tcp 10.158.210.53:6443: connect: connection refused - error from a previous attempt: http2: server sent GOAWAY and closed the connection; LastStreamID=661, ErrCode=NO_ERROR, debug=""

I thought it may be due to a backend timeout, so I tried again, but I got connection refused:
Error: INSTALLATION FAILED: Kubernetes cluster unreachable: Get "https://10.158.210.53:6443/version": dial tcp 10.158.210.53:6443: connect: connection refused
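A connection refused on port 6443 usually means the API server itself is down rather than a Helm problem; a few hedged checks that can narrow this down (sketch):

kubectl get nodes
sudo systemctl status kubelet containerd
sudo crictl --runtime-endpoint unix:///run/containerd/containerd.sock ps -a | grep kube-apiserver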

Please advise how to proceed. Thanks!

Thanks,

Frank

Set enable_mig: no by default.

The NVIDIA LaunchPad uses cnc. The cnc install operation turns MIG on during install by default, but the cnc_uninstall script doesn't turn MIG off. This causes dirty hardware reservations inside LaunchPad, so we should make sure enable_mig is set to no by default.
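Until the uninstall script handles this, a hedged manual cleanup on each affected node is to disable MIG mode with nvidia-smi (this may require draining GPU workloads and a GPU reset or reboot to take effect):

sudo nvidia-smi -mig 0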

MicroK8s alias wasn't working end to end

I ran through an 11.1 install with microk8s enabled.

Directly running the validate script after running the setup script failed.

Rather than creating an alias in the Ansible playbook and putting it into .bashrc, it would be more consistent to add the Ansible equivalent of the lines below, because the .bashrc aliases do not seem to get picked up by the Ansible user.

sudo snap alias microk8s.kubectl kubectl
sudo snap alias microk8s.helm helm

Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future.

Hi there,

I am new to AWS and trying to follow the instructions to set up Kubernetes.

After this step:
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 --cri-socket=/run/containerd/containerd.sock --kubernetes-version="v1.24.1"

I got the error and warning below:
W1117 21:08:30.007456 35829 initconfiguration.go:119] Usage of CRI endpoints without URL scheme is deprecated and can cause kubelet errors in the future. Automatically prepending scheme "unix" to the "criSocket" with value "/run/containerd/containerd.sock". Please update your configuration!
[init] Using Kubernetes version: v1.24.1
[preflight] Running pre-flight checks
[WARNING SystemVerification]: missing optional cgroups: blkio
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR KubeletVersion]: the kubelet version is higher than the control plane version. This is not a supported version skew and may lead to a malfunctional cluster. Kubelet version: "1.25.2" Control plane version: "1.24.1"
[preflight] If you know what you are doing, you can make a check non-fatal with --ignore-preflight-errors=...
To see the stack trace of this error execute with --v=5 or higher

Attempt:
Changed --kubernetes-version to v1.25.2.

Outcome:
I was able to get through this step, but I am not sure whether I am interpreting the error message properly.
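That reading matches the error: kubeadm refuses a kubelet that is newer than the requested control-plane version. A hedged sketch of aligning the flag with whatever kubelet is installed, and adding the unix:// scheme the warning asks for:

kubelet --version
kubeadm version -o short
sudo kubeadm init --pod-network-cidr=192.168.0.0/16 \
  --cri-socket=unix:///run/containerd/containerd.sock \
  --kubernetes-version="$(kubelet --version | awk '{print $2}')"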

Please advise whether I am doing this correctly. Thanks!

Thanks,

Frank

Preferred single node setup

This may not actually be a bug; it is more a question seeking guidance.

We are running a single-node EGX "cluster". It seems that by default, in this configuration, the single node will only have the "master" label and not the "worker" label.

When running only a single node, is the suggested solution simply to add the worker label to the node manually?
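If manual labeling is the route taken, a minimal sketch (the key follows the usual node-role convention; <node-name> is a placeholder, and an empty value is enough for the ROLES column in kubectl get nodes):

kubectl label node <node-name> node-role.kubernetes.io/worker=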

GPU Driver Container Won't Start

Essentially I'm seeing what's in this ticket: NVIDIA/gpu-operator#564 (when I start up my machine running a cluster with a version of CNS installed, currently an old one, like 9.x)

...and because I'm using one of the playbooks from this repo, I'm not sure how to resolve this issue.

I'm also unsure as to why the issue is happening now...I've been running this on a machine since last fall, but the issue linked above pre-dates it.

Will updating to the latest CNS version solve this issue? Or will it still be a problem, given that it looks like the install.sh and Dockerfile(s) are pretty much the same? (I'll probably try doing this anyway on a test box, but I wanted to ask here as well.)

Thank you.

How to compile deviceQuery?

guide followed

Now compile the cuda examples to validate from pod.

$ cd /usr/local/cuda/samples/1_Utilities/deviceQuery
$ sudo make
$ cd ~

Where should this be compiled? kubectl just applies a YAML file to create a container from a Docker image.
Should it be compiled inside the Docker image?
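The samples are compiled inside a running CUDA pod rather than via kubectl itself; a hedged sketch, assuming a pod (here called <cuda-pod>) whose image ships the CUDA samples and build tools:

kubectl exec -it <cuda-pod> -- /bin/bash
# then, inside the pod:
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery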

untar of cri-containerd-cni fails

fatal: [10.225.0.212]: FAILED! => {"msg": "Unable to execute ssh command line on a controller due to: [Errno 13] Permission denied: b'sshpass'"}
fatal: [10.225.0.213]: FAILED! => {"msg": "Unable to execute ssh command line on a controller due to: [Errno 13] Permission denied: b'sshpass'"}
fatal: [10.225.0.211]: FAILED! => {"msg": "Unable to execute ssh command line on a controller due to: [Errno 13] Permission denied: b'sshpass'"}
changed: [10.225.0.210]

The issue is hard to diagnose, but it could be one of two problems:

  1. The --no-overwrite-dir option seems to cause an issue even when become is used in the Ansible playbook. Removing this extra_opts section results in a successful playbook run.
  2. Using the full Linux command with sudo works (see the sketch after this list), but the Ansible playbook outputs an error as though sudo isn't being used, despite become being set on the task.
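For reference, the manual extraction that works when run with sudo, mirroring the tar invocation shown in the failure below:

sudo tar --no-overwrite-dir -xzf /tmp/cri-containerd-cni-1.7.3-linux-amd64.tar.gz -C /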

fatal: [10.225.0.73]: FAILED! => {"changed": false, "dest": "/", "extract_results": {"cmd": ["/usr/bin/tar", "--extract", "-C", "/", "-z", "--show-transformed-names", "--no-overwrite-dir", "-f", "/tmp/cri-containerd-cni-1.7.3-linux-amd64.tar.gz"], "err": "/usr/bin/tar: cri-containerd.DEPRECATED.txt: Cannot open: File exists\n/usr/bin/tar: etc: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: etc/cni: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: etc/cni/net.d: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: etc/cni/net.d/10-containerd-net.conflist: Cannot open: File exists\n/usr/bin/tar: etc/systemd: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: etc/systemd/system: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: etc/systemd/system/containerd.service: Cannot open: File exists\n/usr/bin/tar: etc/crictl.yaml: Cannot open: File exists\n/usr/bin/tar: usr: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: usr/local: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: usr/local/bin: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: usr/local/bin/crictl: Cannot open: File exists\n/usr/bin/tar: usr/local/bin/containerd-shim-runc-v2: Cannot open: File exists\n/usr/bin/tar: usr/local/bin/containerd-stress: Cannot open: File exists\n/usr/bin/tar: usr/local/bin/containerd-shim-runc-v1: Cannot open: File exists\n/usr/bin/tar: usr/local/bin/containerd-shim: Cannot open: File exists\n/usr/bin/tar: usr/local/bin/critest: Cannot open: File exists\n/usr/bin/tar: usr/local/bin/containerd: Cannot open: File exists\n/usr/bin/tar: usr/local/bin/ctr: Cannot open: File exists\n/usr/bin/tar: usr/local/bin/ctd-decoder: Cannot open: File exists\n/usr/bin/tar: usr/local/sbin: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: usr/local/sbin/runc: Cannot open: File exists\n/usr/bin/tar: opt: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: opt/cni: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: opt/cni/bin: Cannot change mode to rwxrwxr-x: Operation not permitted\n/usr/bin/tar: opt/cni/bin/dhcp: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/macvlan: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/host-device: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/dummy: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/loopback: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/ipvlan: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/host-local: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/vlan: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/firewall: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/ptp: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/portmap: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/vrf: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/bandwidth: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/static: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/tuning: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/bridge: Cannot open: File exists\n/usr/bin/tar: opt/cni/bin/sbr: Cannot open: File exists\n/usr/bin/tar: opt/containerd: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: opt/containerd/cluster: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: opt/containerd/cluster/gce: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: 
opt/containerd/cluster/gce/env: Cannot open: File exists\n/usr/bin/tar: opt/containerd/cluster/gce/configure.sh: Cannot open: File exists\n/usr/bin/tar: opt/containerd/cluster/gce/cni.template: Cannot open: File exists\n/usr/bin/tar: opt/containerd/cluster/gce/cloud-init: Cannot change mode to rwxr-xr-x: Operation not permitted\n/usr/bin/tar: opt/containerd/cluster/gce/cloud-init/node.yaml: Cannot open: File exists\n/usr/bin/tar: opt/containerd/cluster/gce/cloud-init/master.yaml: Cannot open: File exists\n/usr/bin/tar: opt/containerd/cluster/version: Cannot open: File exists\n/usr/bin/tar: Exiting with failure status due to previous errors\n", "out": "", "rc": 2}, "gid": 0, "group": "root", "handler": "TgzArchive", "mode": "0755", "msg": "failed to unpack /tmp/cri-containerd-cni-1.7.3-linux-amd64.tar.gz to /", "owner": "root", "size": 4096, "src": "/tmp/cri-containerd-cni-1.7.3-linux-amd64.tar.gz", "state": "directory", "uid": 0}
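A minimal sketch of how this extraction step is typically privileged, assuming the playbook uses the unarchive module (the task name is illustrative; the archive path is taken from the error above). If the "Operation not permitted" errors persist even with become set, it is worth checking that the become password (ansible_sudo_pass / ansible_become_pass) is supplied in the inventory and that become is not being overridden at the play level.

- name: Extract the cri-containerd-cni archive to / (sketch)
  become: true
  ansible.builtin.unarchive:
    src: /tmp/cri-containerd-cni-1.7.3-linux-amd64.tar.gz
    dest: /
    remote_src: yes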

Task in v3.0 Ubuntu 20.04 installation playbook has both shell and command module calls

- name: Taint the Kubernetes Control Plane node
  when: "'running' not in k8sup.stdout"
  command: kubectl taint nodes --all node-role.kubernetes.io/master-
  shell: command -v helm >/dev/null 2>&1
  register: helm_exists
  ignore_errors: yes

This results in the following error:
Installing EGX Stack
[WARNING]: provided hosts list is empty, only localhost is available. Note that the implicit localhost does not match 'all'
ERROR! conflicting action statements: command, shell

The error appears to be in '/home/ubuntu/egx-platform/playbooks/egx-installation.yaml': line 252, column 6, but may
be elsewhere in the file depending on the exact syntax problem.

The offending line appears to be:

  - name: Taint the Kubernetes Control Plane node
    ^ here
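An Ansible task may contain only one action module, which is why Ansible reports "conflicting action statements". A hedged sketch of splitting the snippet above into two tasks (the second task's name is illustrative; everything else is taken from the original):

- name: Taint the Kubernetes Control Plane node
  when: "'running' not in k8sup.stdout"
  command: kubectl taint nodes --all node-role.kubernetes.io/master-

- name: Check whether helm is installed (illustrative name)
  shell: command -v helm >/dev/null 2>&1
  register: helm_exists
  ignore_errors: yes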

Spelling mistake

cnc-installation.yaml has a spelling mistake: Networking is spelled Netwroking. I tried to edit it myself but don't have permissions to this repo.

Eric

Remote Ansible Installs Fail

The k8s-install.yaml file used in playbooks/ makes use of ansible_user_dir, but for a remote install the playbook fails with a permissions issue around line 276 unless the username used for the remote host also exists on the localhost (i.e. is already a user on the local system from which the playbook is kicked off).

I verified this by creating a matching user on the local machine that kicks off the playbook (in fact, just a user directory in /home/ with the appropriate name and temporary read/write permissions). Uninstalling and then re-running the playbook then works.

It looks like the change was made sometime last year: 55bdd53
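A hedged sketch of the pattern that avoids this, assuming the failing step copies something back to the control node (the task and paths below are illustrative, not the playbook's actual line 276): ansible_user_dir resolves to the remote user's home, so any path on the control node should be resolved separately.

- name: Fetch the kubeconfig to the control node (illustrative)
  ansible.builtin.fetch:
    src: "{{ ansible_user_dir }}/.kube/config"          # remote user's home
    dest: "{{ lookup('env', 'HOME') }}/.kube/config"    # local user's home on the control node
    flat: yes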

Lost internet connection after run cnc-uninstall

Hi,

I was running cnc-uninstall, but after the iptables -F command my machine lost its internet connection and I had to restart it. The subsequent installation cannot proceed because of this step. Could someone explain what might cause this and how to resolve it, please?

Thanks,
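One possible cause (an assumption, not a confirmed diagnosis) is that a chain's default policy is DROP, so flushing the rules removes the ACCEPT exceptions and leaves traffic blocked until the rules are restored or the machine is rebooted. A minimal sketch of a safer flush under that assumption:

- name: Reset iptables default policies before flushing (sketch)
  become: true
  ansible.builtin.shell: |
    iptables -P INPUT ACCEPT
    iptables -P FORWARD ACCEPT
    iptables -P OUTPUT ACCEPT
    iptables -F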

Kubernetes Repository Switch

I'm not 100% sure whether this is an issue for the playbooks or not, but according to this announcement:

https://kubernetes.io/blog/2023/08/31/legacy-package-repository-deprecation/

...the K8s package repositories have moved and the legacy ones have now been removed.

The cns_values_*.yaml files should probably have the following reference changed to default to the new package repository:

k8s_apt_repository: " https://apt.kubernetes.io/ kubernetes-xenial main"

The other k8s_ entries in the same YAML files may also need updating.
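A hedged sketch of what the updated default might look like, assuming the values file keeps the same key name (the second key below is hypothetical and only illustrates that the signing-key URL changes as well; the community repositories are versioned per Kubernetes minor release, and the exact format depends on how the playbook composes the apt sources entry):

k8s_apt_repository: "deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.30/deb/ /"
# Hypothetical key name; the GPG key is also published per minor release.
k8s_apt_key: "https://pkgs.k8s.io/core:/stable:/v1.30/deb/Release.key"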

kubeadm init failed

Using this command:

sudo kubeadm init --kubernetes-version=v1.17.5 --pod-network-cidr=10.244.0.0/16 --ignore-preflight-errors=Swap

The log shows:
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.

Unfortunately, an error has occurred:
timed out waiting for the condition

This error is likely caused by:
- The kubelet is not running
- The kubelet is unhealthy due to a misconfiguration of the node in some way (required cgroups disabled)

If you are on a systemd-powered system, you can try to troubleshoot the error with the following commands:
- 'systemctl status kubelet'
- 'journalctl -xeu kubelet'

Additionally, a control plane component may have crashed or exited when started by the container runtime.
To troubleshoot, list all containers using your preferred container runtimes CLI, e.g. docker.
Here is one example how you may list all Kubernetes containers running in docker:
- 'docker ps -a | grep kube | grep -v pause'
Once you have found the failing container, you can inspect its logs with:
- 'docker logs CONTAINERID'
error execution phase wait-control-plane: couldn't initialize a Kubernetes cluster
To see the stack trace of this error execute with --v=5 or higher
inspur@inspur-desktop:~$ docker ps -a | grep kube | grep -v pause
37ffc3b5e519 fe3d691efbf3 "kube-controller-man…" 1 second ago Created k8s_kube-controller-manager_kube-controller-manager-inspur-desktop_kube-system_7158f0b27236f6415eb7355a7e8c9f00_12
65dfa9955714 303ce5db0e90 "etcd --advertise-cl…" 2 seconds ago Created k8s_etcd_etcd-inspur-desktop_kube-system_266e3cbe15890ca893ce1a3189d0b9bc_12
7ed40b88bed5 f648efaff966 "kube-scheduler --au…" 2 seconds ago Created k8s_kube-scheduler_kube-scheduler-inspur-desktop_kube-system_b95be9ddff954501b1709fd9f2e0a160_11
2df386c10608 f640481f6db3 "kube-apiserver --ad…" 3 seconds ago Created k8s_kube-apiserver_kube-apiserver-inspur-desktop_kube-system_4433de2f50862654422c69416687939f_14
a7a9c2339dee 303ce5db0e90 "etcd --advertise-cl…" 29 seconds ago Created k8s_etcd_etcd-inspur-desktop_kube-system_266e3cbe15890ca893ce1a3189d0b9bc_11
539bc5e2e4e4 f640481f6db3 "kube-apiserver --ad…" 30 seconds ago Created k8s_kube-apiserver_kube-apiserver-inspur-desktop_kube-system_4433de2f50862654422c69416687939f_13
b3923c474abc fe3d691efbf3 "kube-controller-man…" 33 seconds ago Created k8s_kube-controller-manager_kube-controller-manager-inspur-desktop_kube-system_7158f0b27236f6415eb7355a7e8c9f00_11
fc745bf1b74b f648efaff966 "kube-scheduler --au…" 40 seconds ago Created k8s_kube-scheduler_kube-scheduler-inspur-desktop_kube-system_b95be9ddff954501b1709fd9f2e0a160_10
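For what it's worth, a common cause of control-plane containers cycling in the Created state on a Docker-based node is a cgroup-driver mismatch between Docker and the kubelet (the kubeadm output above also hints at cgroup misconfiguration). A minimal sketch of aligning Docker on the systemd driver, purely illustrative and not taken from these playbooks; Docker and the kubelet then need to be restarted before retrying kubeadm init:

- name: Align Docker on the systemd cgroup driver (illustrative)
  become: true
  ansible.builtin.copy:
    dest: /etc/docker/daemon.json
    content: |
      {
        "exec-opts": ["native.cgroupdriver=systemd"]
      }
  # Afterwards: systemctl restart docker kubelet, then retry kubeadm init.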

helm install missing namespace

It seems that the helm install command doesn't specify the newly created gpu-operator-resources namespace, so the release ends up in the default namespace.
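A hedged sketch of the fix, assuming the playbook drives helm through a shell task (the task name and chart reference are illustrative, not the playbook's exact wording):

- name: Install the GPU Operator into its own namespace (illustrative)
  ansible.builtin.shell: >
    helm install gpu-operator nvidia/gpu-operator
    --namespace gpu-operator-resources --create-namespace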
