
jetlag's Introduction

jetlag

Tooling to install clusters for testing via an on-prem Assisted Installer in the Red Hat Scale/Alias Lab and on bare metal servers in IBMcloud.

Three separate layouts of clusters can be deployed:

  • BM - Bare Metal - 3 control-plane nodes, X number of worker nodes
  • RWN - Remote Worker Node - 3 control-plane/worker nodes, X number of remote worker nodes
  • SNO - Single Node OpenShift - 1 OpenShift Master/Worker Node "cluster" per available hardware resource

Each cluster layout requires a bastion machine, which is the first machine in your lab "cloud" allocation. The bastion machine hosts the assisted-installer service and serves as a router for clusters with a private machine network. BM and RWN layouts produce a single cluster consisting of 3 control-plane nodes and X worker or remote worker nodes. The worker node count can also be 0, in which case your bare metal cluster is a compact 3-node cluster with schedulable control-plane nodes. The SNO layout creates one SNO cluster per available machine after fulfilling the bastion machine requirement. Lastly, BM/RWN cluster types allocate any unused machines to the hv Ansible group, which stands for hypervisor nodes. The hv nodes can host VMs for additional clusters that can be deployed from the hub cluster (for ACM/MCE testing).

Tested Labs/Hardware

The listed hardware has been used successfully for cluster deployments. Other hardware may have been tested but is not documented here.

Alias Lab

Hardware BM RWN SNO
740xd Yes No Yes
Dell r750 Yes No Yes

Scale Lab

Hardware BM RWN SNO
Dell r650 Yes No Yes
Dell r640 Yes Yes Yes
Dell fc640 Yes No Yes
Supermicro 1029p Yes Yes No
Supermicro 1029U Yes No Yes
Supermicro 5039ms Yes No Yes

IBMcloud

Hardware BM SNO
Supermicro E5-2620 Yes Yes
Lenovo ThinkSystem SR630 Yes Yes

For guidance on how to order hardware on IBMcloud, see order-hardware-ibmcloud.md in the docs directory.

Prerequisites

Versions:

  • Ansible 4.10+ (core >= 2.11.12) (on machine running jetlag playbooks)
  • ibmcloud cli >= 2.0.1 (IBMcloud environments)
  • ibmcloud plugin install sl (IBMcloud environments)
  • RHEL >= 8.6 (Bastion)
  • podman 3 / 4 (Bastion)

Update to RHEL 8.9

[root@xxx-xxx-xxx-r640 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.2 (Ootpa)

[root@xxx-xxx-xxx-r640 ~]# ./update-latest-rhel-release.sh 8.9
...
[root@xxx-xxx-xxx-r640 ~]# dnf update -y
...
[root@xxx-xxx-xxx-r640 ~]# reboot
...
[root@xxx-xxx-xxx-r640 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux release 8.9 (Ootpa)

Installing Ansible via bootstrap (requires python3-pip)

[root@xxx-xxx-xxx-r640 jetlag]# source bootstrap.sh
...
(.ansible) [root@xxx-xxx-xxx-r640 jetlag]#

Pre-reqs for Supermicro hardware:

  • SMCIPMITool downloaded to jetlag repo, renamed to smcipmitool.tar.gz, and placed under ansible/

Cluster Deployment Usage

There are three main files to configure. The inventory file is generated, but may need to be edited for specific scenarios or hardware:

  • ansible/vars/all.yml - An ansible vars file (Sample provided ansible/vars/all.sample.yml)
  • pull_secret.txt - Your OCP pull secret, download from console.redhat.com/openshift/downloads
  • ansible/inventory/$CLOUDNAME.local - The generated inventory file (Samples provided in ansible/inventory)

Start by editing the vars

(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# cp ansible/vars/all.sample.yml ansible/vars/all.yml
(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# vi ansible/vars/all.yml

Make sure to set/review the following vars:

  • lab - either alias or scalelab
  • lab_cloud - the cloud within the lab environment (Example: cloud42)
  • cluster_type - either bm, rwn, or sno for the respective cluster layout
  • worker_node_count - applies to bm and rwn cluster types; sets the desired worker count, useful for leaving leftover inventory hosts available for other purposes
  • sno_node_count - applies to the sno cluster type; sets the desired SNO count, useful for leaving leftover inventory hosts available for other purposes
  • bastion_lab_interface - set to the bastion machine's lab-accessible interface
  • bastion_controlplane_interface - set to the interface over which the bastion is networked to the deployed OCP cluster
  • controlplane_lab_interface - applies to bm and rwn cluster types and should map to the node interface on which the lab provides DHCP; also required for public routable VLAN-based SNO deployments (so this interface can be disabled)
  • Further customization such as cluster_network and service_network can be supplied as extra vars; check each Ansible role's default vars file for variable names and options
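
For illustration, a minimal ansible/vars/all.yml for a compact bare metal cluster might look like the sketch below. This is an assumption-laden sample, not a drop-in file: the interface names and cloud number are placeholders, and you should start from ansible/vars/all.sample.yml.

```yaml
# Illustrative sketch only; start from ansible/vars/all.sample.yml.
# Interface names and the cloud number below are placeholders.
lab: scalelab
lab_cloud: cloud99
cluster_type: bm
worker_node_count: 0                     # 0 workers = compact 3-node cluster
bastion_lab_interface: eno1              # placeholder interface name
bastion_controlplane_interface: ens1f0   # placeholder interface name
controlplane_lab_interface: eno1         # placeholder interface name
```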

Set your pull secret in pull_secret.txt in the repo's base directory. Example:

(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# cat pull_secret.txt
{
  "auths": {
...
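
Since a malformed pull secret only surfaces later during deployment, an optional sanity check (a suggestion, not part of jetlag) is to validate the JSON up front:

```shell
# Optional sanity check (not part of jetlag): confirm pull_secret.txt parses as
# JSON and contains an "auths" key before running any playbooks.
if python3 -c 'import json,sys; sys.exit(0 if "auths" in json.load(open("pull_secret.txt")) else 1)'; then
  echo "pull_secret.txt looks valid"
else
  echo "pull_secret.txt is missing or malformed" >&2
fi
```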

Run create-inventory playbook

(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# ansible-playbook ansible/create-inventory.yml

Run setup-bastion playbook

(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# ansible-playbook -i ansible/inventory/cloud99.local ansible/setup-bastion.yml

Run the deploy playbook for bm, rwn, or sno with the inventory created by the create-inventory playbook

Bare Metal Cluster:

(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# ansible-playbook -i ansible/inventory/cloud99.local ansible/bm-deploy.yml

See troubleshooting.md in the docs directory for BM install-related issues

Remote Worker Node Cluster:

(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# ansible-playbook -i ansible/inventory/cloud99.local ansible/rwn-deploy.yml

Single Node OpenShift:

(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# ansible-playbook -i ansible/inventory/cloud99.local ansible/sno-deploy.yml

Interact with your cluster from your bastion machine:

(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# export KUBECONFIG=/root/bm/kubeconfig
(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# oc get no
NAME               STATUS   ROLES                         AGE    VERSION
xxx-h02-000-r650   Ready    control-plane,master,worker   73m    v1.25.7+eab9cc9
xxx-h03-000-r650   Ready    control-plane,master,worker   103m   v1.25.7+eab9cc9
xxx-h05-000-r650   Ready    control-plane,master,worker   105m   v1.25.7+eab9cc9
(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# cat /root/bm/kubeadmin-password
xxxxx-xxxxx-xxxxx-xxxxx

And for SNO

(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# export KUBECONFIG=/root/sno/xxx-h02-000-r650/kubeconfig
(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# oc get no
NAME      STATUS   ROLES                         AGE   VERSION
xxx-h02-000-r650   Ready    control-plane,master,worker   30h   v1.28.6+0fb4726
(.ansible) [root@xxx-xxx-xxx-r640 jetlag]# cat /root/sno/xxx-h02-000-r650/kubeadmin-password
xxxxx-xxxxx-xxxxx-xxxxx
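
With several SNOs deployed, each gets its own kubeconfig under /root/sno/<node>/, as shown above. A small hypothetical helper (not shipped with jetlag) can loop over all of them from the bastion:

```shell
# Hypothetical helper, not part of jetlag: run `oc get no` against every
# SNO kubeconfig found under the given directory (e.g. /root/sno).
check_snos() {
  dir=$1
  for kc in "$dir"/*/kubeconfig; do
    [ -f "$kc" ] || continue
    echo "== $kc =="
    KUBECONFIG=$kc oc get no
  done
}
# usage: check_snos /root/sno
```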

Tips and Troubleshooting

See tips-and-vars.md in the docs directory.

See troubleshooting.md in the docs directory.

Disconnected API/Console Access

See disconnected-ipv6-cluster-access.md in the docs directory.

Jetlag Hypervisors

See hypervisors.md in the docs directory.

jetlag's People

Contributors

achuzhoy, akrzos, bbenshab, dbutenhof, ebattat, ekuric, hughnhan, iranzo, jaredoconnell, jtaleric, kambiz-aghaiepour, mrbojangles3, mukrishn, noreen21, radez, robertkrawitz, sadsfae, sarahbx, sferlin, shashank-boyapally, sjug, smalleni, tingxueca


jetlag's Issues

Document buggy r650 idrac version

Virtual media cannot be mounted on:

iDRAC Firmware Version | 5.10.10.00

It works fine with this version:

iDRAC Firmware Version | 5.10.10.05

You can check idrac versions using racadm:

# sshpass -p "xxxxxx" ssh -o StrictHostKeyChecking=no [email protected] "racadm get iDRAC.Info"
[Key=iDRAC.Embedded.1#Info.1]
#Build=26
#CPLDVersion=1.0.7
#Description=This system component provides a complete set of remote management functions for Dell PowerEdge Servers
#HWRev=0.01
#IPMIVersion=2.0
#Name=iDRAC
#Product=Integrated Dell Remote Access Controller
#RollbackBuild=26
#RollbackVersion=5.10.10.00
#ServerGen=15G
#Type=15G Monolithic
#Version=5.10.10.00

Fix race condition on waiting for initial nodes to be discovered.

Currently there is a race condition between waiting for nodes to be discovered and assigning roles to each node. We currently only wait until the cluster's node count equals the expected number of nodes. In fact, we should also check that all expected nodes are pending-for-input and thus ready for assignment. The fix should be implemented here.
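
A sketch of the intended check: succeed only when the expected number of hosts exists and every one of them is pending-for-input. The JSON shape and the string-counting approach here are assumptions for illustration, not the actual assisted-installer API client code.

```shell
# Hedged sketch of the fix described above: pass only when the host count
# matches AND every host reports "pending-for-input". The JSON shape is an
# assumption, not the real assisted-installer payload.
hosts_ready() {
  expected=$1
  hosts_json=$2
  total=$(printf '%s' "$hosts_json" | grep -o '"status"' | wc -l)
  ready=$(printf '%s' "$hosts_json" | grep -o '"pending-for-input"' | wc -l)
  [ "$total" -eq "$expected" ] && [ "$ready" -eq "$expected" ]
}
```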

Add tips readme document

Add a tips readme document to cover things such as:

  • Change version of OCP deployed by assisted-installer (minor version, major versions, nightly/ci versions)
  • Setting specific machines for ibmcloud (vars: bastion_hardware_id, controlplane_hardware_ids, worker_hardware_ids, and sno_hardware_ids)
  • Increasing max-pods during deployment timeframe (vars: kubelet_config, kubelet_config_max_pods, and cluster_network_host_prefix)
  • Deploying etcd on nvme storage
  • Using localstorage for PVs
  • Deploying a 2nd cluster using the same bastion maybe

Clean up how to get Ansible setup for Jetlag

More people are using Jetlag, and a typical tripping point is how to install Ansible and which version to use with Jetlag. We need a clear statement of which Ansible version to use and how to install it (either rpm or pip+virtualenv) to get Jetlag set up on the bastion.

Logic issue with boot_iso for dell nodes

@mkarg75 found an issue in that the resolution of idrac versions is actually incorrect. See #13 for the lines which are not resolving correctly. Although the when statements are logically incorrect, the r640 machines actually boot correctly; however, fc640 machines do not seem to boot correctly. We need to make sure the when statements are corrected, and also that ISO booting via redfish is set up correctly for both fc640s and r640s.

podman version 3.4.2 may cause setup-bastion.yml playbook failure

Synopsis: the Assisted Installer's 'service' container exits within seconds of starting when using podman version 3.4.2.

Environment:
I used Jetlag to set up my SNO in a private lab, and therefore had to hand-craft the inventory file and run the Jetlag bastion setup as below:
$ ansible-playbook -i my-inventory-sno ansible/setup-bastion.yml

It is unlikely that the issue is unique to my env.

Condition:
After the playbook ran successfully, about ten seconds later, 'podman ps -a' shows the "assisted-service" container exited.

CONTAINER ID  IMAGE                                                       COMMAND               CREATED        STATUS                    PORTS       NAMES
9e046fbee711  k8s.gcr.io/pause:3.5                                                              3 minutes ago  Up 3 minutes ago                      5326b23064a5-infra
a52c71c6bc97  registry.centos.org/centos/httpd-24-centos7:latest          /usr/bin/run-http...  3 minutes ago  Up 3 minutes ago                      httpd
bb02daef013c  k8s.gcr.io/pause:3.5                                                              2 minutes ago  Up 2 minutes ago                      9ca9acdfc341-infra
1d4c1355f3a4  quay.io/edge-infrastructure/postgresql-12-centos7:latest    run-postgresql        2 minutes ago  Up 2 minutes ago                      db
7c6a0bdf714c  quay.io/edge-infrastructure/assisted-installer-ui:v2.0.7    /deploy/start.sh      2 minutes ago  Up 2 minutes ago                      gui
b184dd40183f  quay.io/edge-infrastructure/assisted-service:v2.0.10        /assisted-service     2 minutes ago  Exited (1) 2 minutes ago              service
864db440ac4c  quay.io/edge-infrastructure/assisted-image-service:v2.0.10  /assisted-image-s...  2 minutes ago  Up 2 minutes ago                      image-service
[root@perf176d jetlag]# 

Debug:
Step 1: 'service' could not reach "db" container port 5432

[root@perf176d jetlag]# podman logs b184dd40183f
false
time="2022-02-21T19:00:17Z" level=info msg="Setting log format: text" func=main.InitLogs file="/go/src/github.com/openshift/origin/cmd/main.go:161"
time="2022-02-21T19:00:17Z" level=info msg="Setting Log Level: info" func=main.InitLogs file="/go/src/github.com/openshift/origin/cmd/main.go:167"
time="2022-02-21T19:00:17Z" level=info msg="Starting bm service" func=main.main file="/go/src/github.com/openshift/origin/cmd/main.go:205"
time="2022-02-21T19:00:17Z" level=info msg="Started service with OS images [{\"openshift_version\": \"4.8\", \"cpu_architecture\": \"x86_64\", \"url\": \"http://perf176d.perf.lab.eng.bos.redhat.com:8081/4.8.32/rhcos-48.84.202109241901-0-live.x86_64.iso\", \"rootfs_url\": \"http://perf176d.perf.lab.eng.bos.redhat.com:8081/4.8.32/rhcos-48.84.202109241901-0-live-rootfs.x86_64.img\", \"version\": \"48.84.202109241901-0\"}], Release images [{\"openshift_version\": \"4.8\", \"cpu_architecture\": \"x86_64\", \"url\": \"quay.io/openshift-release-dev/ocp-release:4.8.32-x86_64\", \"version\": \"4.8.32\"}]" func=main.main file="/go/src/github.com/openshift/origin/cmd/main.go:225"
time="2022-02-21T19:00:17Z" level=info msg="Connecting to DB" func=main.setupDB file="/go/src/github.com/openshift/origin/cmd/main.go:627"
time="2022-02-21T19:00:17Z" level=info msg="Failed to connect to DB, retrying" func=main.setupDB.func1 file="/go/src/github.com/openshift/origin/cmd/main.go:634" error="failed to connect to `host=127.0.0.1 user=admin database=installer`: dial error (dial tcp 127.0.0.1:5432: connect: connection refused)"

Step 2: DB logs shows that it is listening on port 5432

root@perf176d jetlag]# podman logs 1d4c1355f3a4
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....2022-02-21 19:00:14.841 UTC [25] LOG:  starting PostgreSQL 12.1 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2022-02-21 19:00:14.841 UTC [25] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-21 19:00:14.842 UTC [25] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-02-21 19:00:14.846 UTC [25] LOG:  redirecting log output to logging collector process
2022-02-21 19:00:14.846 UTC [25] HINT:  Future log output will appear in directory "log".
 done
server started
/var/run/postgresql:5432 - accepting connections
=> sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
ALTER ROLE
waiting for server to shut down.... done
server stopped
Starting server...
2022-02-21 19:00:15.055 UTC [1] LOG:  starting PostgreSQL 12.1 on x86_64-redhat-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39), 64-bit
2022-02-21 19:00:15.056 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-02-21 19:00:15.056 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2022-02-21 19:00:15.056 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-21 19:00:15.056 UTC [1] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2022-02-21 19:00:15.060 UTC [1] LOG:  redirecting log output to logging collector process
2022-02-21 19:00:15.060 UTC [1] HINT:  Future log output will appear in directory "log".

Step 3: netstat on the host shows nothing listening on 5432

[root@perf176d jetlag]# netstat -a | grep 5432
[root@perf176d jetlag]# 

Workaround:
Reinstalling an older podman version such as podman-3.2.3 fixes the problem. Everything works fine.
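
Combined with the podman 3/4 prerequisite listed earlier, a hypothetical preflight check (not part of jetlag) could flag this known-bad version before running setup-bastion:

```shell
# Hypothetical preflight check, not part of jetlag: flag the podman version
# this issue identifies as broken for the assisted-service container.
podman_version_ok() {
  case "$1" in
    3.4.2) echo "known-bad: downgrade, e.g. to 3.2.3"; return 1 ;;
    *)     echo "ok"; return 0 ;;
  esac
}
# usage: podman_version_ok "$(podman --version | awk '{print $3}')"
```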

[RFE] Enable regular cluster deploys

Right now, jetlag is able to deploy SNO and remote worker node environments. It would be great to be able to drive regular cluster installs (not rwn) by getting rid of the remote worker node specific node stuff.

Determine what changes we need to support r750 in the Alias lab

It seems the r750 nodes in the Alias lab ended up installing a cluster however the playbook timed out on max recursion. We should root cause why those machines took so long to boot into their post discovery install. Perhaps the machines have a really long boot cycle until they boot the disk we installed on?

Jetlag doesn't handle a cloud being expanded well

We recently had a scalelab cloud expanded, and the fact that we always pick the first machine as the bastion and subsequent machines as control-plane and worker nodes meant our inventory was essentially jumbled. We should consider other methods of picking machines for bastion/control-plane/worker that would handle that situation.

Need to fix post install tasks for provisioningOSDownloadURL for 4.10 clusters

It seems the 4.10 builds lack the proper URL to perform these tasks.

Error:

TASK [bm-post-cluster-install : Determine metal3 provisioning URL] *********************************************************************************************************************************
Tuesday 22 February 2022  17:06:02 +0000 (0:00:00.032)       1:01:40.518 ******
changed: [f04-h17-b07-5039ms.rdu2.scalelab.redhat.com]

TASK [bm-post-cluster-install : Set download url] **************************************************************************************************************************************************
Tuesday 22 February 2022  17:06:02 +0000 (0:00:00.349)       1:01:40.867 ******
fatal: [f04-h17-b07-5039ms.rdu2.scalelab.redhat.com]: FAILED! => {"msg": "The task includes an option with an undefined variable. The error was: 'dict object' has no attribute 'provisioningOSDownloadURL'\n\nThe error appears to be in '/root/acm-ztp/jetlag/ansible/roles/bm-post-cluster-install/tasks/main.yml': line 114, column 5, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n  - name: Set download url\n    ^ here\n"}

And the provisionings object:

# oc get provisionings provisioning-configuration -o yaml
apiVersion: metal3.io/v1alpha1
kind: Provisioning
metadata:
  creationTimestamp: "2022-02-22T16:19:28Z"
  finalizers:
  - provisioning.metal3.io
  generation: 2
  name: provisioning-configuration
  resourceVersion: "9759"
  uid: e5cb49a0-4514-4289-8653-ccc95225c001
spec:
  preProvisioningOSDownloadURLs: {}
  provisioningNetwork: Disabled
status:
  generations:
  - group: apps
    hash: ""
    lastGeneration: 1
    name: metal3
    namespace: openshift-machine-api
    resource: deployments
  - group: apps
    hash: ""
    lastGeneration: 1
    name: metal3-image-cache
    namespace: openshift-machine-api
    resource: daemonsets
  - group: admissionregistration.k8s.io
    hash: ""
    lastGeneration: 1
    name: baremetal-operator-validating-webhook-configuration
    namespace: ""
    resource: validatingwebhookconfigurations
  - group: apps
    hash: ""
    lastGeneration: 1
    name: metal3-image-customization
    namespace: openshift-machine-api
    resource: deployments
  observedGeneration: 2
  readyReplicas: 0

hv-install : Install python packages failed to install libvirt-python 6.0.0

Hi,

It is happened in ALIAS lab machines
e23-h19-740xd.alias.bos.scalelab.redhat.com
e23-h17-740xd.alias.bos.scalelab.redhat.com

TASK [hv-install : Install python packages] *********************************************************************************************************************************
Monday 06 February 2023 16:46:16 +0000 (0:01:01.714) 0:01:04.258 *******
failed: [e23-h19-740xd.alias.bos.scalelab.redhat.com] (item={'name': 'libvirt-python', 'version': '6.0.0'}) => {"ansible_loop_var": "item", "changed": false, "cmd": ["/usr/local/bin/pip3", "install", "libvirt-python==6.0.0"], "item": {"name": "libvirt-python", "version": "6.0.0"}, "msg": "stdout: Collecting libvirt-python==6.0.0\n Downloading libvirt-python-6.0.0.tar.gz (196 kB)\n Preparing metadata (setup.py): started\n Preparing metadata (setup.py): finished with status 'done'\nUsing legacy 'setup.py install' for libvirt-python, since package 'wheel' is not installed.\nInstalling collected packages: libvirt-python\n Attempting uninstall: libvirt-python\n Found existing installation: libvirt-python 8.0.0\n\n:stderr: ERROR: Cannot uninstall 'libvirt-python'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.\n"}
failed: [e23-h17-740xd.alias.bos.scalelab.redhat.com] (item={'name': 'libvirt-python', 'version': '6.0.0'}) => {"ansible_loop_var": "item", "changed": false, "cmd": ["/usr/local/bin/pip3", "install", "libvirt-python==6.0.0"], "item": {"name": "libvirt-python", "version": "6.0.0"}, "msg": "stdout: Collecting libvirt-python==6.0.0\n Downloading libvirt-python-6.0.0.tar.gz (196 kB)\n Preparing metadata (setup.py): started\n Preparing metadata (setup.py): finished with status 'done'\nUsing legacy 'setup.py install' for libvirt-python, since package 'wheel' is not installed.\nInstalling collected packages: libvirt-python\n Attempting uninstall: libvirt-python\n Found existing installation: libvirt-python 8.0.0\n\n:stderr: ERROR: Cannot uninstall 'libvirt-python'. It is a distutils installed project and thus we cannot accurately determine which files belong to it which would lead to only a partial uninstall.\n"}

PLAY RECAP ******************************************************************************************************************************************************************
e23-h17-740xd.alias.bos.scalelab.redhat.com : ok=2 changed=1 unreachable=0 failed=1 skipped=5 rescued=0 ignored=0
e23-h19-740xd.alias.bos.scalelab.redhat.com : ok=2 changed=1 unreachable=0 failed=1 skipped=5 rescued=0 ignored=0

Monday 06 February 2023 16:46:21 +0000 (0:00:04.878) 0:01:09.136 *******

hv-install : Install packages --------------------------------------------------------------------------------------------------------------------------------------- 61.71s
hv-install : Install python packages --------------------------------------------------------------------------------------------------------------------------------- 4.88s
Gathering Facts ------------------------------------------------------------------------------------------------------------------------------------------------------ 2.11s
install-tc : Wait for machine to be ready after reboot --------------------------------------------------------------------------------------------------------------- 0.11s
install-tc : Install packages ---------------------------------------------------------------------------------------------------------------------------------------- 0.09s
install-tc : Wait for machine rebooting ------------------------------------------------------------------------------------------------------------------------------ 0.07s
install-tc : Reboot hypervisor --------------------------------------------------------------------------------------------------------------------------------------- 0.07s
install-tc : Install kernel modules ---------------------------------------------------------------------------------------------------------------------------------- 0.06s

As a workaround, replace in /root/jetlag/ansible/roles/hv-install/tasks/main.yml:

  - name: Install python packages
    pip:
      name: pip
      extra_args: --upgrade
    loop:

  - name: Install python packages
    pip:
      name: "{{ item.name }}"
    loop:
      - name: libvirt-python
      - name: setuptools_rust
      - name: sushy-tools

Determine what issues jetlag might have creating large clusters

We attempted to deploy a large cluster today via Jetlag and ran into at least one issue in which the discovery image generation would take longer than 30s (the default timeout for the uri module in Ansible).

We should:

  • Document in the troubleshooting and/or tips document limits to larger clusters (Seems that larger than 25 nodes will result in this timeout)
  • Bump the timeout either by default to something much larger or bump only when above a specific threshold of nodes
  • See issue #141 and see if the image generation service fixes this problem
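
The timeout bump in the second bullet could look roughly like the fragment below. This is an illustrative sketch, not the actual jetlag task: the task name, URL variable, and 25-node threshold are assumptions; only the uri module's `timeout` parameter (default 30s) is standard Ansible.

```yaml
# Illustrative sketch, not the actual jetlag task. The uri module's timeout
# parameter (default 30) is real Ansible; everything else here is assumed.
- name: Generate discovery image
  uri:
    url: "{{ assisted_service_url }}/..."   # placeholder URL variable
    method: POST
    timeout: "{{ 120 if (worker_node_count | int) > 25 else 30 }}"
```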

Jetlag should specifically detect the virtual media mounted in order to unmount cleanly

If a host such as a Dell node has a virtualmedia resource left over from a previous run or a different bastion, the virtualmedia will fail to eject and won't be able to be replaced by a new discovery ISO. We should see if we can detect and remove the specific virtualmedia, or remove all virtualmedia regardless of where it came from.

If there is old virtualmedia, you will get an error resembling:

fatal: [example.rdu2.scalelab.redhat.com]: FAILED! => {"changed": false, "msg": "Unable to find an available VirtualMedia resource supporting ['CD', 'DVD']"}
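
One possible cleanup shape, sketched as a helper that builds a Redfish eject URL. The manager path below is the common iDRAC one; the path, credentials, and host are all assumptions that vary by vendor and firmware:

```shell
# Hedged sketch: eject whatever virtual media is mounted before inserting a new
# discovery ISO. The Redfish path below is typical for Dell iDRAC but varies by
# vendor/firmware; BMC_HOST/BMC_USER/BMC_PASS are placeholders.
eject_url() {
  echo "https://$1/redfish/v1/Managers/iDRAC.Embedded.1/VirtualMedia/CD/Actions/VirtualMedia.EjectMedia"
}
# usage:
#   curl -sk -u "$BMC_USER:$BMC_PASS" -X POST -H 'Content-Type: application/json' \
#     -d '{}' "$(eject_url "$BMC_HOST")"
```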

Check ibmcloud hardware if ipmi privilege level is administrator vs operator as validation

Goal would be to prevent attempting to boot an ISO when you cannot actually perform the unmount/mount commands to do so without the proper privileges.

[root@jetlag-bm0 ~]# SMCIPMITool xxxxxxxx root xxxxxxxxx user list
Maximum number of Users          : 10
Count of currently enabled Users : 8
 User ID | User Name       | Privilege Level    | Enable       
 ------- | -----------     | ---------------    | ------       
       3 | root            | Operator           | Yes          
[root@jetlag-bm0 ~]# SMCIPMITool xxxxxxxx root xxxxxxxxx user list
Maximum number of Users          : 10
Count of currently enabled Users : 8
 User ID | User Name       | Privilege Level    | Enable       
 ------- | -----------     | ---------------    | ------       
       2 | ADMIN           | Administrator      | Yes          
       3 | root            | Administrator      | Yes

Identify hardware vendor in create-inventory playbook (ibmcloud)

We can identify the hardware vendor by this hardwareChassis field for ibmcloud:

Example Supermicro:

$ ibmcloud sl hardware detail 960237 -p --output json | jq '.hardwareChassis'
{
  "id": 30,
  "name": "SMCH12"
}

Example Lenovo:

$ ibmcloud sl hardware detail 3145980 -p --output json | jq '.hardwareChassis'
{
  "id": 793,
  "name": "SR630"
}
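
A hedged sketch of the mapping: the prefix patterns below are inferred only from the two examples above and may not cover other chassis names:

```shell
# Hedged sketch: infer the hardware vendor from the ibmcloud hardwareChassis
# name. Prefix patterns are inferred from the two examples above only.
chassis_vendor() {
  case "$1" in
    SMC*) echo "Supermicro" ;;
    SR*)  echo "Lenovo" ;;
    *)    echo "unknown" ;;
  esac
}
# usage:
#   chassis_vendor "$(ibmcloud sl hardware detail 960237 -p --output json | jq -r '.hardwareChassis.name')"
```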

Supermicro boot order issues

As tested several times, the Supermicro tasks do not always work as intended. Although BootSourceOverrideEnabled is set to once, the machines do not immediately boot the virtual ISO and instead follow the typical pattern of running through the entire boot order. I am unsure if this is an issue with the tasks, the hardware, redfish, or the firmware. The only known workaround at the moment is to ensure the cdrom is earlier in the boot order than the hard disk, observe the status of the install, and manually eject the virtual cdrom once the install completes writing to the disk.

Separate out kube-burner testing

I'm thinking that we should at this point separate out the remote worker node/node-density testing stuff from jetlag to keep it focused as a deployment tool. Thoughts?

Flexible deployment option

New Feature Request:

  1. Recognize hardware that fails or likely to fail during the install.
  2. Continue to deploy on the healthy set of nodes skipping the failed hardware.
  3. Report which nodes failed to deploy

Large-scale cluster deployments get stuck due to some bad nodes (bad hardware or an incorrect/unsupported configuration). The installation should be able to skip such nodes.

Add support for non alias/scalelab systems?

Maybe this is not a common use case, but adding the request here. I'd like to do SNO installs on arbitrary hosts which are not in alias or scalelab, is there some way of supporting that? I could supply a file containing whatever info you normally automatically capture from alias/scalelab reservations.

[Bug] Dry run has a jsondecode error on post impairment pod eviction check

Issue is here.

Example of issue:

$ ./rwn-workload.py --dry-run -B 100 -P 10 -L 100
2021-06-02 20:41:05,107 : INFO : ###############################################################################
2021-06-02 20:41:05,108 : INFO : RWN Workload
2021-06-02 20:41:05,108 : INFO : ###############################################################################
2021-06-02 20:41:05,108 : INFO : Indexing server is unset, disabling all indexing
2021-06-02 20:41:05,108 : INFO : Scenario Phases:
2021-06-02 20:41:05,108 : INFO : * Workload Phase - No measurement indexing
2021-06-02 20:41:05,108 : INFO :   * 12 Namespaces/Deployments/Pods
2021-06-02 20:41:05,108 : INFO :   * Guaranteed QOS Pods w/ 29 cores and 120 memory
2021-06-02 20:41:05,108 : INFO :   * 100 Shared Node-Selectors
2021-06-02 20:41:05,108 : INFO :   * 100 Unique Node-Selectors
2021-06-02 20:41:05,108 : INFO :   * RWN tolerations
2021-06-02 20:41:05,108 : INFO : * Impairment Phase - 30s Duration
2021-06-02 20:41:05,108 : INFO :   * Link Latency: 100ms
2021-06-02 20:41:05,108 : INFO :   * Packet Loss: 10%
2021-06-02 20:41:05,108 : INFO :   * Bandwidth Limit: 100kbits
2021-06-02 20:41:05,108 : INFO : * Cleanup Phase - No measurement indexing
2021-06-02 20:41:05,108 : INFO : ###############################################################################
2021-06-02 20:41:05,108 : INFO : Workload phase starting
2021-06-02 20:41:05,108 : INFO : ###############################################################################
2021-06-02 20:41:05,111 : INFO : Created /tmp/tmpwizgakxg
2021-06-02 20:41:05,111 : INFO : Created /tmp/tmpwizgakxg/rwn-create.yml
2021-06-02 20:41:05,111 : INFO : Created /tmp/tmpwizgakxg/rwn-deployment-selector.yml
2021-06-02 20:41:05,111 : INFO : Command: echo kube-burner init -c rwn-create.yml --uuid 84181276-82e3-44ed-b529-e0ad21d001ab
2021-06-02 20:41:05,112 : INFO : Output : kube-burner init -c rwn-create.yml --uuid 84181276-82e3-44ed-b529-e0ad21d001ab
2021-06-02 20:41:05,112 : INFO : Workload phase complete
2021-06-02 20:41:05,112 : INFO : ###############################################################################
2021-06-02 20:41:05,112 : INFO : Impairment phase starting
2021-06-02 20:41:05,112 : INFO : ###############################################################################
2021-06-02 20:41:05,112 : INFO : Impairment start: 1622666465.1, end: 1622666495.1, duration: 30
2021-06-02 20:41:05,112 : INFO : Applying latency, packet loss, bandwidth limit impairments
2021-06-02 20:41:05,112 : INFO : Command: echo tc qdisc add dev ens1f1.100 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,113 : INFO : Output : tc qdisc add dev ens1f1.100 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,113 : INFO : Command: echo tc qdisc add dev ens1f1.101 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,114 : INFO : Output : tc qdisc add dev ens1f1.101 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,114 : INFO : Command: echo tc qdisc add dev ens1f1.102 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,115 : INFO : Output : tc qdisc add dev ens1f1.102 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,115 : INFO : Command: echo tc qdisc add dev ens1f1.103 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,116 : INFO : Output : tc qdisc add dev ens1f1.103 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,116 : INFO : Command: echo tc qdisc add dev ens1f1.104 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,117 : INFO : Output : tc qdisc add dev ens1f1.104 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,117 : INFO : Command: echo tc qdisc add dev ens1f1.105 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:05,118 : INFO : Output : tc qdisc add dev ens1f1.105 root netem delay 100ms loss 10% rate 100kbit
2021-06-02 20:41:15,162 : INFO : Remaining impairment duration: 20.1
2021-06-02 20:41:25,200 : INFO : Remaining impairment duration: 10.0
2021-06-02 20:41:35,129 : INFO : Removing bandwidth, latency, and packet loss impairments
2021-06-02 20:41:35,129 : INFO : Command: echo tc qdisc del dev ens1f1.100 root netem
2021-06-02 20:41:35,134 : INFO : Output : tc qdisc del dev ens1f1.100 root netem
2021-06-02 20:41:35,135 : INFO : Command: echo tc qdisc del dev ens1f1.101 root netem
2021-06-02 20:41:35,139 : INFO : Output : tc qdisc del dev ens1f1.101 root netem
2021-06-02 20:41:35,140 : INFO : Command: echo tc qdisc del dev ens1f1.102 root netem
2021-06-02 20:41:35,145 : INFO : Output : tc qdisc del dev ens1f1.102 root netem
2021-06-02 20:41:35,145 : INFO : Command: echo tc qdisc del dev ens1f1.103 root netem
2021-06-02 20:41:35,150 : INFO : Output : tc qdisc del dev ens1f1.103 root netem
2021-06-02 20:41:35,151 : INFO : Command: echo tc qdisc del dev ens1f1.104 root netem
2021-06-02 20:41:35,155 : INFO : Output : tc qdisc del dev ens1f1.104 root netem
2021-06-02 20:41:35,156 : INFO : Command: echo tc qdisc del dev ens1f1.105 root netem
2021-06-02 20:41:35,161 : INFO : Output : tc qdisc del dev ens1f1.105 root netem
2021-06-02 20:41:35,162 : INFO : Impairment phase complete
2021-06-02 20:41:35,162 : INFO : ###############################################################################
2021-06-02 20:41:35,162 : INFO : Post impairment pod eviction check
2021-06-02 20:41:35,162 : INFO : ###############################################################################
2021-06-02 20:41:35,163 : INFO : Command: echo oc get ev -A --field-selector reason=TaintManagerEviction -o json
Traceback (most recent call last):
  File "/home/akrzos/akrh/code/akrzos/jetlag/workload/./rwn-workload.py", line 739, in <module>
    sys.exit(main())
  File "/home/akrzos/akrh/code/akrzos/jetlag/workload/./rwn-workload.py", line 621, in main
    json_data = json.loads(output)
  File "/usr/lib64/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Set up haproxy for ibmcloud clusters

Similar to disconnected lab clusters, we should set up haproxy to enable communication with clusters built on the private network in ibmcloud.
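
A minimal sketch of what such an haproxy frontend/backend could look like (addresses, ports, and server names are placeholders for illustration, not actual jetlag configuration):

```
defaults
    mode tcp
    timeout connect 10s
    timeout client 1m
    timeout server 1m

# Forward the OpenShift API from the bastion's public interface
# to the control-plane nodes on the private network
frontend openshift-api
    bind *:6443
    default_backend openshift-api

backend openshift-api
    server cp0 192.168.x.10:6443 check
    server cp1 192.168.x.11:6443 check
    server cp2 192.168.x.12:6443 check
```

An equivalent frontend/backend pair would be needed for the ingress ports (80/443).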

Check if vars are actually needed

It seems these vars may not actually require configuration, since we have used more than one machine vendor without needing to customize them. We should investigate whether they are required or whether we can remove them.

Enforcing SELinux policy would not let containers access the mounted host files

SNO deploy failed due to an unavailable http service on the bastion node; the httpd container logs complained about a file permission error:

# podman logs 610a32ce203d
sed: couldn't open temporary file /etc/httpd/conf/sedCDaZFQ: Permission denied

This was found to be an issue with the enforcing SELinux setting on the bastion node.
Setting the policy to permissive works around the permission issue:

# setenforce 0 
# getenforce
Permissive
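
A less invasive alternative to permissive mode (an untested sketch; paths and image name are illustrative) would be to relabel the mounted content for container access, or have podman relabel it at run time via a `:Z` volume suffix:

# chcon -R -t container_file_t /opt/http_store
# podman run -d -v /opt/http_store:/var/www/html:Z <httpd-image>

This keeps SELinux enforcing on the bastion while still letting the httpd container read and write the mounted files.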

Use `ansible.utils.nthhost` instead of `ansible.netcommon.nthhost`

Ansible says netcommon is being deprecated, and I suspect it is one of the major tripping points across Ansible versions with jetlag. We should move to the latest filters and ensure we install and use them from a virtualenv.

TASK [create-inventory : Place inventory file named cloud29.local into inventory directory] **************************************************************************************************
Tuesday 21 June 2022  12:05:50 +0000 (0:00:00.031)       0:00:06.610 ********** 
[DEPRECATION WARNING]: Use 'ansible.utils.nthhost' module instead. This feature will be removed from ansible.netcommon in a release after 2024-01-01. Deprecation warnings can be disabled by
 setting deprecation_warnings=False in ansible.cfg.

Regular releases?

Can we have some releases tagged? I'm watching for new content via new releases, and this project does not have any.

Change `next_nth_usable` to `ansible.netcommon.nthhost`

For instance:

# pip3 freeze | grep ansible
ansible==4.10.0
ansible-core==2.11.9

Results:

# ansible -i 'localhost,' all -m shell -a "echo {{ 'fc00:1000::/64' | ansible.netcommon.nthhost(228) }}" 2>/dev/null
localhost | CHANGED | rc=0 >>
fc00:1000::e4
# ansible -i 'localhost,' all -m shell -a "echo {{ 'fc00:1000::/64' | next_nth_usable(228) }}" 2>/dev/null
localhost | FAILED | rc=-1 >>
template error while templating string: no filter named 'next_nth_usable'. String: echo {{ 'fc00:1000::/64' | next_nth_usable(228) }}

Vs

# pip3 freeze | grep ansible
ansible==4.10.0
ansible-core==2.11.7

Results in:

# ansible -i 'localhost,' all -m shell -a "echo {{ 'fc00:1000::/64' | ansible.netcommon.nthhost(228) }}" 2>/dev/null
localhost | CHANGED | rc=0 >>
fc00:1000::e4
# ansible -i 'localhost,' all -m shell -a "echo {{ 'fc00:1000::/64' | next_nth_usable(228) }}" 2>/dev/null
localhost | CHANGED | rc=0 >>
fc00:1000::e4

Looks like next_nth_usable is being deprecated.

UI pod and dnsmasq services fail on bastion node

Possible Production environment scenario

  • An unclean bastion node can have previously started services holding port 53. Maybe check whether the port is already in use and, if so, stop the stale service or re-assign dnsmasq to another port on the bastion node.

  • httpd services can also still be running; clean up the previously running services.
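
The port check mentioned above could be sketched as a small pre-flight script on the bastion (the port number and messages are assumptions for illustration, not jetlag defaults):

```shell
#!/bin/sh
# Pre-flight sketch: detect whether something already listens on the
# dnsmasq port (53) before (re)deploying services on the bastion.
port_in_use() {
  # $1 = port number; reads listener addresses on stdin (e.g. column 5
  # of `ss -H -tuln`) and succeeds if any entry ends in :<port>
  grep -Eq "[:.]$1([[:space:]]|\$)"
}

if command -v ss >/dev/null 2>&1; then
  if ss -H -tuln | awk '{print $5}' | port_in_use 53; then
    echo "port 53 already in use - stop the stale dnsmasq before redeploying"
  fi
fi
```

The same check could be repeated for the httpd ports before starting the UI pod.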

Research increasing size of neighbor table in discovery image

Research whether we should bump the discovery image's neighbor ARP table sizes from the defaults:

[root@localhost ~]# sysctl net.ipv6.neigh.default.gc_thresh1
net.ipv6.neigh.default.gc_thresh1 = 128
[root@localhost ~]# sysctl net.ipv6.neigh.default.gc_thresh2
net.ipv6.neigh.default.gc_thresh2 = 512
[root@localhost ~]# sysctl net.ipv6.neigh.default.gc_thresh3
net.ipv6.neigh.default.gc_thresh3 = 1024
[root@localhost ~]# sysctl net.ipv4.neigh.default.gc_thresh1
net.ipv4.neigh.default.gc_thresh1 = 128
[root@localhost ~]# sysctl net.ipv4.neigh.default.gc_thresh2
net.ipv4.neigh.default.gc_thresh2 = 512
[root@localhost ~]# sysctl net.ipv4.neigh.default.gc_thresh3
net.ipv4.neigh.default.gc_thresh3 = 1024

compared to what OCP ships with via the cluster-node-tuning-operator:

[root@e38-h06-000-r650 ~]# sysctl net.ipv6.neigh.default.gc_thresh1
net.ipv6.neigh.default.gc_thresh1 = 8192
[root@e38-h06-000-r650 ~]# sysctl net.ipv6.neigh.default.gc_thresh2
net.ipv6.neigh.default.gc_thresh2 = 32768
[root@e38-h06-000-r650 ~]# sysctl net.ipv6.neigh.default.gc_thresh3
net.ipv6.neigh.default.gc_thresh3 = 65536
[root@e38-h06-000-r650 ~]# sysctl net.ipv4.neigh.default.gc_thresh1
net.ipv4.neigh.default.gc_thresh1 = 8192
[root@e38-h06-000-r650 ~]# sysctl net.ipv4.neigh.default.gc_thresh2
net.ipv4.neigh.default.gc_thresh2 = 32768
[root@e38-h06-000-r650 ~]# sysctl net.ipv4.neigh.default.gc_thresh3
net.ipv4.neigh.default.gc_thresh3 = 65536
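
If we do decide to bump these, a sysctl.d fragment like the following (file name illustrative; values copied from the OCP defaults above) could be injected into the discovery image:

```
# /etc/sysctl.d/90-neigh-gc-thresh.conf - match OCP's tuned defaults
net.ipv4.neigh.default.gc_thresh1 = 8192
net.ipv4.neigh.default.gc_thresh2 = 32768
net.ipv4.neigh.default.gc_thresh3 = 65536
net.ipv6.neigh.default.gc_thresh1 = 8192
net.ipv6.neigh.default.gc_thresh2 = 32768
net.ipv6.neigh.default.gc_thresh3 = 65536
```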

Set hypervisor network impairments to handle different hardware hypervisors

Currently the hypervisor impairments playbook only allows one "nic" (impaired_nic) to be set. This should instead be set per host (hostvars[inventory_hostname]['impaired_nic']) rather than as a "global" var, so that each hypervisor configures the correct nic to be impaired in a heterogeneous hypervisor environment.
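
A sketch of the per-host variable in the inventory (hostnames and nic names are placeholders):

```
[hv]
hv0.example.com impaired_nic=ens1f1
hv1.example.com impaired_nic=eno2
```

The playbook would then reference {{ hostvars[inventory_hostname]['impaired_nic'] }} instead of the single global impaired_nic var.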

Use SMCIPMITool for Supermicro machines to mount/unmount virtual media

The Supermicro provided tool called SMCIPMITool can mount/unmount virtual media and do many other things. We should automate using it with IBMCloud Supermicro hardware so we can have a more automated install process.

Example:

$ ./SMCIPMITool x.x.x.x root xxxxxxxxx wsiso umount
Done
$ ./SMCIPMITool x.x.x.x root xxxxxxxxx wsiso mount "http://x.x.x.x:8081" /iso/discovery.iso
mounting ISO file ...
Mount ISO successfully

It also seems we need to place the ISO files one level deep in the HTTP server's directory tree for it to behave as expected.
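
The umount/mount sequence above could be wrapped for scripting across many BMCs. This is a hypothetical dry-run sketch (the SMC_TOOL path, credentials, and URLs are placeholders); the commands are echoed rather than executed, so remove the echo to actually run them:

```shell
#!/bin/sh
# Dry-run wrapper around SMCIPMITool virtual media commands.
SMC_TOOL=${SMC_TOOL:-./SMCIPMITool}

mount_iso() {
  bmc=$1; user=$2; pass=$3; http_base=$4; iso_path=$5
  # unmount any stale media first, then mount the new discovery ISO
  echo "$SMC_TOOL $bmc $user $pass wsiso umount"
  echo "$SMC_TOOL $bmc $user $pass wsiso mount \"$http_base\" $iso_path"
}

mount_iso 10.0.0.5 root secret "http://10.0.0.2:8081" /iso/discovery.iso
```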
