
cluster-nfd-operator's Introduction

Node Feature Discovery Operator

The Node Feature Discovery Operator is a tool for OpenShift administrators that makes it easy to detect and understand the hardware features and configuration of a cluster's nodes. By managing the life cycle of NFD, the Operator lets administrators easily gather information about their nodes that can be used for scheduling, resource management, and more.

Upstream Project

Node Feature Discovery – a Kubernetes add-on for detecting hardware features and system configuration.

The Node Feature Discovery and the Node Feature Discovery Operator are upstream projects under the kubernetes-sigs organization.

Getting started with the Node Feature Discovery Operator

Prerequisite: a running OpenShift cluster 4.6+

Get the source code

git clone https://github.com/openshift/cluster-nfd-operator

Deploy the operator

IMAGE_REGISTRY=quay.io/<your-personal-registry>
make image push deploy

Create a NodeFeatureDiscovery instance

oc apply -f config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml
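For orientation, a minimal NodeFeatureDiscovery instance looks roughly like the sketch below. The field names follow the nfd.openshift.io/v1 API, but the operand image and worker config shown here are illustrative; consult the sample file in config/samples for the exact content.

apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  operand:
    image: quay.io/openshift/origin-node-feature-discovery:latest   # illustrative; use the image matching your OCP version
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s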

Verify

The Operator will deploy NFD based on the information in the NodeFeatureDiscovery CR instance. After a moment you should be able to see:

$ oc -n openshift-nfd get ds,deploy
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/nfd-worker   3         3         3       3            3           <none>          5s
NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nfd-master   1/1     1            1           17s

Check that NFD feature labels have been created

$ oc get node -o json | jq .items[].metadata.labels
{
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
...

Extending NFD with sidecar containers and hooks

First, see the upstream documentation of the hook feature and how to create a correct hook file: https://github.com/kubernetes-sigs/node-feature-discovery#local-user-specific-features.

The DaemonSet running on the worker nodes mounts the hostPath /etc/kubernetes/node-feature-discovery/source.d. Additional hooks can then be provided by a sidecar container that also runs on the workers, mounts the same hostPath, and writes the hook executable (shell script, compiled code, ...) into this directory.

NFD will execute any file in this directory. If a hook needs its own configuration, a separate configuration directory can be created under /etc/kubernetes/node-feature-discovery/source.d, e.g. /etc/kubernetes/node-feature-discovery/source.d/own-hook-conf; NFD will not recurse deeper into the file hierarchy.
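As a concrete illustration, a sidecar container that ships a hook could look roughly like the snippet below. The container name, image, and hook file name are made up for the example; only the hostPath location comes from the text above.

containers:
- name: hook-provider                          # hypothetical sidecar name
  image: quay.io/example/my-nfd-hook:latest    # placeholder image
  command: ["/bin/sh", "-c"]
  args:
  - cp /opt/hooks/my-hook.sh /host-hooks/ && sleep infinity
  volumeMounts:
  - name: nfd-hooks
    mountPath: /host-hooks
volumes:
- name: nfd-hooks
  hostPath:
    path: /etc/kubernetes/node-feature-discovery/source.d

The hook executable itself simply prints one feature per line on stdout (e.g. my.feature=true), as described in the upstream documentation linked above.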

Building NFD operator for ARM locally

There are two options:

  1. Using an ARM server - the process is the same, but use Dockerfile.arm instead of Dockerfile
  2. Using an x86 server/laptop - the process is the same, but before running the build command, copy Makefile.arm over Makefile

cluster-nfd-operator's People

Contributors

andymcc, arangogutierrez, chr15p, courtneypacheco, egallen, joepvd, jschintag, kpouget, linuxstudio, marquiz, mresvanis, niconosenzo, openshift-bot, openshift-ci-robot, openshift-ci[bot], openshift-merge-bot[bot], openshift-merge-robot, psap-ci-robot, sosiouxme, stevekuznetsov, thiagoalessio, thrasher-redhat, ybettan, yevgeny-shnaidman, yselkowitz, zwpaper


cluster-nfd-operator's Issues

Make variables declaration consistent across Makefile

#58 (review)

We should make variable references (${} vs. $()) consistent across the Makefile. Traditionally, $() has mostly been used in GNU Makefiles and still seems to be the preferred form over ${}; BSD make, on the other hand, seems to prefer ${}. The choice is ours, but let's be consistent.

how enable NFD extended resources

Hi,
Not sure it is the right place to ask this question.
We are working on enabling the Intel SGX feature on OpenShift.
The Intel SGX device plugin uses NFD extended resources to report the EPC size. To achieve this, we need to start nfd-master with the following parameters:
--resource-labels=sgx.intel.com/epc
--extra-label-ns=sgx.intel.com

see
https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/deployments/sgx_nfd/nfd-master.yaml
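For context, in that upstream example the two flags are simply extra container arguments on the nfd-master deployment; a trimmed sketch is shown below (everything except the two flags quoted above is illustrative).

containers:
- name: nfd-master
  image: <nfd-image>          # placeholder image/tag
  command: ["nfd-master"]
  args:
  - --extra-label-ns=sgx.intel.com
  - --resource-labels=sgx.intel.com/epc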

But I can't find any clue how to enable the NFD extended resources feature through the OpenShift NFD operator.
As I understand it, on OpenShift NFD should be managed and configured by the NFD Operator. Right?

Thanks in advance!

Node Feature Discovery Operator failing to properly default image in operator

It looks like this fork/mirror is updated from the upstream project from time to time. The sync for 4.11 was done very poorly, and while the merge problems that were committed were being finished/fixed, the PR [1] to sync with upstream 0.5.0 removed the required patch [2][3] that allowed things to work without overrides to NodeFeatureDiscovery objects.

[1] #266
[2] 055e871
[3] #187

OLM-installed nfd.v4.10.0's nfd-controller-manager fails to start

❯ oc logs -n openshift-nfd nfd-controller-manager-576765f974-krxgc -c manager
I0622 11:03:24.298094       1 main.go:61] Operator Version: a42b581a-dirty
2022-06-22T11:03:24.298Z        DPANIC  setup   non-string key argument passed to logging, ignoring all later arguments {"invalid key": "POD_NAMESPACE must be set"}
panic: non-string key argument passed to logging, ignoring all later arguments

goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00002e300, {0xc00036ee00, 0x1, 0x1})
        /go/src/github.com/openshift/cluster-nfd-operator/vendor/go.uber.org/zap/zapcore/entry.go:232 +0x446
go.uber.org/zap.(*Logger).DPanic(0x15de809, {0x163560e, 0x14170a0}, {0xc00036ee00, 0x1, 0x1})
        /go/src/github.com/openshift/cluster-nfd-operator/vendor/go.uber.org/zap/logger.go:220 +0x59
github.com/go-logr/zapr.handleFields(0x0, {0xc000255480, 0x2, 0x40cfe7}, {0x0, 0x0, 0xc000217b01})
        /go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:110 +0x3f1
github.com/go-logr/zapr.(*zapLogger).WithValues(0xc0004cd5a0, {0xc000255480, 0x1, 0x15df7c6})
        /go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:145 +0x31
main.main()
        /go/src/github.com/openshift/cluster-nfd-operator/main.go:130 +0x3e2

Setting POD_NAMESPACE via the downward API makes it start successfully.
I don't know why it was missing from my installation.
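For reference, the workaround is a standard downward API environment variable on the manager container, along these lines:

env:
- name: POD_NAMESPACE
  valueFrom:
    fieldRef:
      fieldPath: metadata.namespace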

Broken on 3 node clusters

As of this commit, three-node clusters with GPU acceleration no longer seem to be supported. In a three-node cluster, all nodes carry both the master and worker labels. Since they have the master label, the NFD workers will not be scheduled. The comment in the commit indicates that the intent was to allow NFD to run on servers that may be tagged with roles other than worker, but the effect of the change is broader than that.

Might I suggest running the NFD worker on the worker nodes by default and allowing the nodeSelector to be set in the spec of the NodeFeatureDiscovery resource?
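A sketch of that suggested default on the nfd-worker DaemonSet pod template (standard Kubernetes fields, shown only to illustrate the proposal):

spec:
  template:
    spec:
      nodeSelector:
        node-role.kubernetes.io/worker: ""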

RH nfd-operator excessively logs at over 15,000 logged lines per minute

After installing the RH nfd-operator and gpu-operator in an OCP 4.5 cluster, the nfd-operator logs at a rate of over 15,000 lines per minute.

$ oc get csv nfd.4.5.0-202011132127.p0
NAME                        DISPLAY                  VERSION   REPLACES   PHASE
nfd.4.5.0-202011132127.p0   Node Feature Discovery   4.5.0                Succeeded
$ date ; oc logs nfd-operator-7c69cf4f44-jkbwq | wc -l
Wed, Feb  3, 2021  2:44:30 PM
171957

$ date ; oc logs nfd-operator-7c69cf4f44-jkbwq | wc -l
Wed, Feb  3, 2021  2:47:01 PM
203070

2 minutes, 31 seconds since the last count:

203,070 - 171,957 = 31,113 new messages

31,113 messages / ~2.5 minutes ≈ 12,400 log lines per minute

Typical contents of logs:

{"level":"info","ts":1612381861.9975307,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","Service":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0078075,"logger":"controller_nodefeaturediscovery","msg":"Found, skpping update","ServiceAccount":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0079691,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","Role":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.01564,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","RoleBinding":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0217285,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","ConfigMap":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0276768,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","SecurityContextConstraints":"nfd-worker","Namespace":"default"}
{"level":"info","ts":1612381862.0387275,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","DaemonSet":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0495317,"logger":"controller_nodefeaturediscovery","msg":"Looking for","ServiceAccount":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.049615,"logger":"controller_nodefeaturediscovery","msg":"Found, skpping update","ServiceAccount":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.04964,"logger":"controller_nodefeaturediscovery","msg":"Looking for","ClusterRole":"nfd-master","Namespace":""}
{"level":"info","ts":1612381862.04966,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","ClusterRole":"nfd-master","Namespace":""}
{"level":"info","ts":1612381862.0553772,"logger":"controller_nodefeaturediscovery","msg":"Looking for","ClusterRoleBinding":"nfd-master","Namespace":""}
{"level":"info","ts":1612381862.0554087,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","ClusterRoleBinding":"nfd-master","Namespace":""}
{"level":"info","ts":1612381862.0618849,"logger":"controller_nodefeaturediscovery","msg":"Looking for","DaemonSet":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0619338,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","DaemonSet":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.1010854,"logger":"controller_nodefeaturediscovery","msg":"Looking for","Service":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.1012244,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","Service":"nfd-master","Namespace":"openshift-operators"}

OCP version label missing after upgrade to 4.10.x (NFD version - v4.10.0-202204010807)

For our product deployment we use some NFD labels on nodes; one of those labels has the key "feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION".
After upgrading OCP from v4.9.x to 4.10.6, this label was missing on the worker nodes.

# oc get no ocp-w-2.lab.ocp.lan -ojsonpath={.metadata.labels} | jq
{
  "IS-infoscalecluster-dev": "true",
  "IS-infoscalecluster-dev-33221": "true",
  "beta.kubernetes.io/arch": "amd64",
  "beta.kubernetes.io/os": "linux",
  "feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
  "feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
  "feature.node.kubernetes.io/cpu-cpuid.CLZERO": "true",
  "feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
  "feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SSE4A": "true",
  "feature.node.kubernetes.io/cpu-cpuid.SUCCOR": "true",
  "feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
  "feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
  "feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
  "feature.node.kubernetes.io/kernel-config.NO_HZ_FULL": "true",
  "feature.node.kubernetes.io/kernel-selinux.enabled": "true",
  "feature.node.kubernetes.io/kernel-version.full": "4.18.0-305.40.2.el8_4.x86_64",
  "feature.node.kubernetes.io/kernel-version.major": "4",
  "feature.node.kubernetes.io/kernel-version.minor": "18",
  "feature.node.kubernetes.io/kernel-version.revision": "0",
  "feature.node.kubernetes.io/pci-15ad.present": "true",
  "feature.node.kubernetes.io/system-os_release.ID": "rhcos",
  "feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "410.84.202203221702-0",
  "feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "8.4",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.10",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
  "feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "10",
  "kubernetes.io/arch": "amd64",
  "kubernetes.io/hostname": "ocp-w-2.lab.ocp.lan",
  "kubernetes.io/os": "linux",
  "node-role.kubernetes.io/worker": "",
  "node.openshift.io/os_id": "rhcos",
  "specialresource.openshift.io/state-infoscale-vtas-1000": "Ready"
}

# oc get clusterversions.config.openshift.io
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.6    True        False         10d     Cluster version is 4.10.6

# oc get csv
NAME                      DISPLAY                           VERSION               REPLACES                  PHASE
nfd.4.10.0-202204010807   Node Feature Discovery Operator   4.10.0-202204010807   nfd.4.10.0-202203271029   Succeeded

Assign a priority class to pods

Priority classes docs:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class

Example: https://github.com/openshift/cluster-monitoring-operator/search?q=priority&unscoped_q=priority

Notes: The pre-configured system priority classes (system-node-critical and system-cluster-critical) can only be assigned to pods in kube-system or openshift-* namespaces. Most likely, core operators and their pods should be assigned system-cluster-critical. Please do not assign system-node-critical (the highest priority) unless you are really sure about it.
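Concretely, the suggestion amounts to adding a priorityClassName to the operator's and operands' pod templates, for example:

spec:
  template:
    spec:
      priorityClassName: system-cluster-critical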

NFD 4.9.0-202202211131 instance is going to degraded state

The NodeFeatureDiscovery CR status shows Degraded because the nfd-worker pods are not coming up.

Status:
  Conditions:
  - Last Heartbeat Time:  2022-03-09T04:37:58Z
    Last Transition Time: 2022-03-09T04:37:58Z
    Status:               False
    Type:                 Available
  - Last Heartbeat Time:  2022-03-09T04:37:58Z
    Last Transition Time: 2022-03-09T04:37:58Z
    Status:               False
    Type:                 Upgradeable
  - Last Heartbeat Time:  2022-03-09T04:37:58Z
    Last Transition Time: 2022-03-09T04:37:58Z
    Status:               False
    Type:                 Progressing
  - Last Heartbeat Time:  2022-03-09T04:37:58Z
    Last Transition Time: 2022-03-09T04:37:58Z
    Message:              NFDWorkerDaemonSetCorrupted
    Reason:               FailedGettingNFDWorkerDaemonSet
    Status:               True
    Type:                 Degraded

Events:
Type      Reason         Age                   From                   Message


Warning FailedCreate 25s (x15 over 107s) daemonset-controller Error creating: pods "nfd-worker-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider "infoscale-fencing-scc": Forbidden: not usable by user or serviceaccount, provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "infoscale-vtas-licensing-controller": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "infoscale-vtas-csi-driver-controller": Forbidden: not usable by user or serviceaccount, provider "infoscale-kubefencing-scc": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "nfd-worker": Forbidden: not usable by user or serviceaccount, provider "infoscale-vtas-csi-driver-node": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "infoscale-vtas-driver-container-rhel8": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

NFD Operator on OperatorHub not installing

Description of problem:

After installing NFD Operator, pods keep crashing.

{"level":"info","ts":1579886681.9943285,"logger":"cmd","msg":"Go Version: go1.12.12"}
{"level":"info","ts":1579886681.9943712,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1579886681.9943788,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0+git"}
{"level":"info","ts":1579886681.9950516,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1579886682.1206093,"logger":"leader","msg":"Found existing lock with my name. I was likely restarted."}
{"level":"info","ts":1579886682.120643,"logger":"leader","msg":"Continuing as the leader."}
{"level":"info","ts":1579886682.2016144,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1579886682.2017787,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1579886682.2019343,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1579886682.202073,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1579886682.2021816,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1579886682.20229,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"error","ts":1579886682.202312,"logger":"cmd","msg":"","error":"no kind is registered for the type v1.SecurityContextConstraints in scheme \"k8s.io/client-go/kubernetes/scheme/register.go:60\"","stacktrace":"github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nmain.main\n\t/go/src/github.com/openshift/cluster-nfd-operator/cmd/manager/main.go:92\nruntime.main\n\t/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/proc.go:200"}

At first glance, the controller appears to be attempting to watch /api/v1/securitycontextconstraints, which is not available in OpenShift 4.x.

The OperatorHub controller image should be updated to watch /apis/security.openshift.io/v1/securitycontextconstraints.

The command line option to disable leader election does not function as intended

In main.go in the main directory, a command line option exists for disabling leader election, and its default value indicates that leader election should be turned off. However, leader election is still enabled anyway:

# oc logs pod/nfd-controller-manager-7cd8d54686-nllqc manager
I0816 17:01:22.272019       1 main.go:55] Operator Version: 783214b0-dirty
I0816 17:01:23.326238       1 request.go:655] Throttling request took 1.021121017s, request: GET:https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s
2021-08-16T17:01:25.093Z	INFO	controller-runtime.metrics	metrics server is starting to listen	{"addr": "127.0.0.1:8080"}
2021-08-16T17:01:25.094Z	INFO	setup	starting manager
I0816 17:01:25.094661       1 leaderelection.go:243] attempting to acquire leader lease openshift-nfd/39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io...
2021-08-16T17:01:25.095Z	INFO	controller-runtime.manager	starting metrics server	{"path": "/metrics"}
I0816 17:01:25.121966       1 leaderelection.go:253] successfully acquired lease openshift-nfd/39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io
2021-08-16T17:01:25.122Z	INFO	controller-runtime.manager.controller.nodefeaturediscovery	Starting EventSource	{"reconciler group": "nfd.openshift.io", "reconciler kind": "NodeFeatureDiscovery", "source": "kind source: /, Kind="}
2021-08-16T17:01:25.122Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"ConfigMap","namespace":"openshift-nfd","name":"39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io","uid":"e8ebfd1a-d7d7-44cc-9038-70bd5f1ae380","apiVersion":"v1","resourceVersion":"53680499"}, "reason": "LeaderElection", "message": "nfd-controller-manager-7cd8d54686-nllqc_c3d12674-228d-4747-9d07-f025b8aaf51e became leader"}
2021-08-16T17:01:25.122Z	DEBUG	controller-runtime.manager.events	Normal	{"object": {"kind":"Lease","namespace":"openshift-nfd","name":"39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io","uid":"10ebb14c-84f8-42be-ae4c-9eef1316e853","apiVersion":"coordination.k8s.io/v1","resourceVersion":"53680500"}, "reason": "LeaderElection", "message": "nfd-controller-manager-7cd8d54686-nllqc_c3d12674-228d-4747-9d07-f025b8aaf51e became leader"}
2021-08-16T17:01:25.223Z	INFO	controller-runtime.manager.controller.nodefeaturediscovery	Starting EventSource	{"reconciler group": "nfd.openshift.io", "reconciler kind": "NodeFeatureDiscovery", "source": "kind source: /, Kind="}

I think this problem occurs because LeaderElectionID is provided in the manager creation step:

        mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
                Scheme:                 scheme,
                MetricsBindAddress:     metricsAddr,
                Port:                   9443,
                LeaderElection:         enableLeaderElection,
                LeaderElectionID:       "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
                HealthProbeBindAddress: probeAddr,
                Namespace:              watchNamespace, // namespaced-scope when the value is not an empty string
        })

If we remove both LeaderElectionID and LeaderElection, then everything works as intended. If we remove just LeaderElectionID and leave LeaderElection in place, an error occurs saying that LeaderElectionID must be defined, even though LeaderElection is set to false by default.

A better approach to this problem is an if-else conditional -- "if leader election is enabled, then [...]; else [...]" -- rather than always providing a leader election ID and relying on LeaderElection to disable leader election (since the true/false value of LeaderElection is effectively ignored when LeaderElectionID is provided).

Will provide a PR fix for this soon.

Clean up devel branches

We should clean up devel branches that have been merged or tested, and stick with the release-x branches to keep repository maintenance clean.

The proper way to deploy NFD on OCP through NFD-operator

Our operator running on OCP depends on NFD and needs to deploy NFD with some specific configurations. We'd like to know what is the proper way for us to deploy NFD through NFD-operator on OCP.

We plan to add the NFD-operator as a dependency in our operator's bundle image metadata and let our operator create a CR (based on the NFD-operator CRD) that includes our configuration and deploys NFD.

But a potential problem is that other users on the same cluster might do the same thing and create their own CRs. Can these CRs coexist?

If these CRs can't coexist, does that mean we need a cluster administrator to manually merge the different requirements from users and create a single CR to deploy NFD?

Any suggestions and help are appreciated!

toleration list?

This pod must tolerate any taint. For example, the SDN pod has this list:

  tolerations:
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
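For comparison, a single catch-all toleration (an empty key with operator: Exists) matches every taint and may be simpler than enumerating them:

  tolerations:
  - operator: Exists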

OpenShift NFD --enable-taints option not supported

Summary

OpenShift NFD operator version 4.14.0-202402081809 on OCP 4.14.11 does not support the --enable-taints feature.

Details

We are currently creating accelerator profiles on OpenShift AI for multiple Intel GPU cards through taints, as described in the RH accelerator profile doc (current accelerator profile for one of the Intel GPU cards: link). Tolerations are already supported in accelerator profiles.

For multiple GPU card support, we would need to add taints to the nodes using the NFD operator's NodeFeatureRule instance, so that the accelerator profiles integrate with the Intel GPU device plugin's resource. We cannot taint the nodes directly, because we need matchFeatures together with taints from a NodeFeatureRule.
According to the NFD upstream documentation, tainting seems to be an experimental feature. We have tried adding --enable-taints to the nfd-worker pods/DaemonSet manually, but this flag is not supported.
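For reference, a taint-setting rule in upstream NFD looks roughly like the sketch below (based on the upstream NodeFeatureRule API; the rule name, taint key, and PCI vendor ID are illustrative, and the taints field only takes effect when taint support is enabled on the NFD side):

apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: intel-gpu-taint                  # illustrative name
spec:
  rules:
  - name: "taint nodes with an Intel GPU"
    taints:
    - key: "gpu.intel.com/present"       # illustrative taint key
      value: "true"
      effect: NoSchedule
    matchFeatures:
    - feature: pci.device
      matchExpressions:
        vendor: {op: In, value: ["8086"]}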

Is it possible to enable this feature on OCP 4.14? If not, when would it be supported on OCP? Thanks!

nfd-workers crashes under an ipv6 environment

I applied this operator on an environment with IPv6. It deployed correctly on the masters, but the nfd-worker pods keep crashing.
When I look at the logs inside the container, I can see:

2020/04/08 10:44:05 Sendng labeling request nfd-master
2020/04/08 10:44:05 failed to set node labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address fd02::d19a:12000: too many colons in address"

This seems to be a problem with IPv6 address handling: the master address is not bracketed before the port is appended, so the resulting host:port string cannot be parsed.

NFD logs crash approximately 7 hours after installation without touching the cluster

Approximately 7 hours after NFD has been installed, the following error shows up in the NFD controller manager's logs:

E0903 02:52:07.981627       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: Failed to watch *v1.NodeFeatureDiscovery: the server has received too many requests and has asked us to try again later (get nodefeaturediscoveries.nfd.openshift.io)
E0903 02:52:10.566841       1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: Failed to watch *v1.NodeFeatureDiscovery: the server has received too many requests and has asked us to try again later (get nodefeaturediscoveries.nfd.openshift.io)

This issue is easily repeatable by letting NFD run for several hours, such as overnight. I have reproduced it with and without using the cluster during that 7 hour period.

Seems like a controller manager issue?

Operator's controller logs are repeated when they shouldn't be. Possible duplicate function calls.

The operator's logs are repeated twice when launching NFD -- meaning some commands are actually called twice -- and sometimes the repetitions occur three, four, or more times. This issue could have existed before my PR #183, or it could have been introduced by it. Either way, the repetition problem needs to be fixed because it will likely have an impact on NFD functionality in the future.

Sample:

2021-08-05T15:05:48.251Z	INFO	controllers.NodeFeatureDiscovery	Fetch the NodeFeatureDiscovery instance
2021-08-05T15:05:48.251Z	INFO	controllers.NodeFeatureDiscovery	Ready to apply components
2021-08-05T15:05:48.251Z	INFO	controller_nodefeaturediscovery	Looking for	{"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.251Z	INFO	controller_nodefeaturediscovery	Found, skipping update	{"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.251Z	INFO	controller_nodefeaturediscovery	Looking for	{"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.251Z	INFO	controller_nodefeaturediscovery	Found, updating	{"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.343Z	INFO	controller_nodefeaturediscovery	Looking for	{"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.343Z	INFO	controller_nodefeaturediscovery	Found, updating	{"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.434Z	INFO	controller_nodefeaturediscovery	Looking for	{"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.434Z	INFO	controller_nodefeaturediscovery	Found, updating	{"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.458Z	INFO	controller_nodefeaturediscovery	Looking for	{"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.458Z	INFO	controller_nodefeaturediscovery	Found, updating	{"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.470Z	INFO	controller_nodefeaturediscovery	Looking for	{"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.470Z	INFO	controller_nodefeaturediscovery	Found, skipping update	{"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.470Z	INFO	controller_nodefeaturediscovery	Looking for	{"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.470Z	INFO	controller_nodefeaturediscovery	Found, updating	{"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.479Z	INFO	controller_nodefeaturediscovery	Looking for	{"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.479Z	INFO	controller_nodefeaturediscovery	Found, updating	{"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.486Z	INFO	controller_nodefeaturediscovery	Looking for	{"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.486Z	INFO	controller_nodefeaturediscovery	Found, updating	{"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.493Z	INFO	controller_nodefeaturediscovery	Looking for	{"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.493Z	INFO	controller_nodefeaturediscovery	Found, updating	{"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.506Z	INFO	controller_nodefeaturediscovery	Looking for	{"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
2021-08-05T15:05:48.506Z	INFO	controller_nodefeaturediscovery	Found, updating	{"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
2021-08-05T15:05:48.516Z	INFO	controllers.NodeFeatureDiscovery	Fetch the NodeFeatureDiscovery instance
2021-08-05T15:05:48.516Z	INFO	controllers.NodeFeatureDiscovery	Ready to apply components
2021-08-05T15:05:48.516Z	INFO	controller_nodefeaturediscovery	Looking for	{"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.516Z	INFO	controller_nodefeaturediscovery	Found, skipping update	{"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.516Z	INFO	controller_nodefeaturediscovery	Looking for	{"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.516Z	INFO	controller_nodefeaturediscovery	Found, updating	{"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.587Z	INFO	controller_nodefeaturediscovery	Looking for	{"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.587Z	INFO	controller_nodefeaturediscovery	Found, updating	{"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.659Z	INFO	controller_nodefeaturediscovery	Looking for	{"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.659Z	INFO	controller_nodefeaturediscovery	Found, updating	{"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.666Z	INFO	controller_nodefeaturediscovery	Looking for	{"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.666Z	INFO	controller_nodefeaturediscovery	Found, updating	{"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.679Z	INFO	controller_nodefeaturediscovery	Looking for	{"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.679Z	INFO	controller_nodefeaturediscovery	Found, skipping update	{"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.679Z	INFO	controller_nodefeaturediscovery	Looking for	{"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.679Z	INFO	controller_nodefeaturediscovery	Found, updating	{"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.687Z	INFO	controller_nodefeaturediscovery	Looking for	{"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.687Z	INFO	controller_nodefeaturediscovery	Found, updating	{"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.693Z	INFO	controller_nodefeaturediscovery	Looking for	{"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.693Z	INFO	controller_nodefeaturediscovery	Found, updating	{"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.700Z	INFO	controller_nodefeaturediscovery	Looking for	{"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.700Z	INFO	controller_nodefeaturediscovery	Found, updating	{"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.710Z	INFO	controller_nodefeaturediscovery	Looking for	{"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
2021-08-05T15:05:48.710Z	INFO	controller_nodefeaturediscovery	Found, updating	{"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}

Duplicate key "version" in CRD

version: v1alpha1 is defined twice, in https://github.com/openshift/cluster-nfd-operator/blob/release-4.5/manifests/olm-catalog/4.5/nfd.crd.yaml#L13 and https://github.com/openshift/cluster-nfd-operator/blob/release-4.5/manifests/olm-catalog/4.5/nfd.crd.yaml#L38, which is failing validation in brew.

Image build failed. Error in plugin orchestrate_build: {"x86_64": {"pin_operator_digest": "while constructing a mapping\n  in \"/tmp/tmpXBzS7p/cluster-nfd-operator-bundle/manifests/nfd.crd.yaml\", line 6, column 3\nfound duplicate key \"version\" with value \"v1alpha1\" (original value: \"v1alpha1\")\n  in \"/tmp/tmpXBzS7p/cluster-nfd-operator-bundle/manifests/nfd.crd.yaml\", line 38, column 3\n\nTo suppress this check see:\n    http://yaml.readthedocs.io/en/latest/api.html#duplicate-keys\n\nDuplicate keys will become and error in future releases, and are errors\nby default when using the new API.\n"}}. OSBS build id: cluster-nfd-operator-bundle-rhaos-45-rhel-7-7ffff-4

Failed brew build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28602159
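The fix is to drop one of the duplicate keys; for comparison, the newer CRD schema declares served versions exactly once, as a list, roughly like this (a generic sketch, not the repo's actual manifest):

spec:
  versions:
  - name: v1alpha1
    served: true
    storage: true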

Please prepare bundle for OCP 4.12

We are going to release the 4.12 RC, and this operator is currently disabled because its bundle dir is not ready. Please update the manifest dir for the 4.12 branch if needed; you can ping us with @release-artist in the Slack channel #aos-art.

Operator deployment uses rolling but can never complete

The operator Deployment is set to strategy: RollingUpdate, which means that when an update is rolled out, the pods in the new ReplicaSet have to become ready before the old ReplicaSet is scaled down. This creates a deadlock scenario:

  • deployment rolls out new version
  • a new RS is created which creates a new operator pod
  • new operator pod sees controller lock and waits for lease to expire
  • old operator pod is never scaled down and maintains lock

My fix is to manually scale down the old ReplicaSet, which fixes everything, but I think the best answer would be for the operator Deployment to use strategy: Recreate.
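The suggested change is a one-line switch on the operator Deployment:

spec:
  strategy:
    type: Recreate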

NFD Operator's controller manager crashes when an existing NFD instance is deleted

The NFD Operator's controller manager is unable to handle the case where the user creates an NFD instance and then deletes it.

This problem occurs because the current error-checking library, criapierrors, does not check for errors related to missing resources. Missing resources will have their metav1 values/attributes set to empty or nil (depending on whether the attribute is a pointer). Thus, we want to check for metav1 problems, not gRPC problems, as gRPC problems are not relevant to checking for missing resources.

Sample existing output:

2021-08-06T15:04:12.190Z	INFO	controller_nodefeaturediscovery	Looking for	{"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.190Z	INFO	controller_nodefeaturediscovery	Found, skipping update	{"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.190Z	INFO	controller_nodefeaturediscovery	Looking for	{"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-06T15:04:12.190Z	INFO	controller_nodefeaturediscovery	Found, updating	{"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-06T15:04:12.482Z	INFO	controller_nodefeaturediscovery	Looking for	{"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-06T15:04:12.482Z	INFO	controller_nodefeaturediscovery	Found, updating	{"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-06T15:04:12.626Z	INFO	controller_nodefeaturediscovery	Looking for	{"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.626Z	INFO	controller_nodefeaturediscovery	Found, updating	{"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.652Z	INFO	controller_nodefeaturediscovery	Looking for	{"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.652Z	INFO	controller_nodefeaturediscovery	Found, updating	{"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.666Z	INFO	controller_nodefeaturediscovery	Looking for	{"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.666Z	INFO	controller_nodefeaturediscovery	Found, skipping update	{"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.666Z	INFO	controller_nodefeaturediscovery	Looking for	{"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.666Z	INFO	controller_nodefeaturediscovery	Found, updating	{"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.687Z	INFO	controller_nodefeaturediscovery	Looking for	{"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.687Z	INFO	controller_nodefeaturediscovery	Found, updating	{"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.703Z	INFO	controller_nodefeaturediscovery	Looking for	{"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.704Z	INFO	controller_nodefeaturediscovery	Found, updating	{"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.720Z	INFO	controller_nodefeaturediscovery	Looking for	{"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.720Z	INFO	controller_nodefeaturediscovery	Found, updating	{"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.740Z	INFO	controller_nodefeaturediscovery	Looking for	{"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
2021-08-06T15:04:12.740Z	INFO	controller_nodefeaturediscovery	Found, updating	{"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
2021-08-06T17:11:55.170Z	INFO	controllers.NodeFeatureDiscovery	Fetch the NodeFeatureDiscovery instance
2021-08-06T17:11:55.170Z	ERROR	controllers.NodeFeatureDiscovery	requeueing event since there was an error reading object	{"error": "NodeFeatureDiscovery.nfd.openshift.io \"nfd-instance\" not found"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:132
github.com/openshift/cluster-nfd-operator/controllers.(*NodeFeatureDiscoveryReconciler).Reconcile
	/go/src/github.com/openshift/cluster-nfd-operator/controllers/nodefeaturediscovery_controller.go:169
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99
2021-08-06T17:11:55.170Z	ERROR	controller-runtime.manager.controller.nodefeaturediscovery	Reconciler error	{"reconciler group": "nfd.openshift.io", "reconciler kind": "NodeFeatureDiscovery", "name": "nfd-instance", "namespace": "openshift-nfd", "error": "NodeFeatureDiscovery.nfd.openshift.io \"nfd-instance\" not found"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:267
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99
2021-08-06T17:11:55.175Z	INFO	controllers.NodeFeatureDiscovery	Fetch the NodeFeatureDiscovery instance
2021-08-06T17:11:55.176Z	ERROR	controllers.NodeFeatureDiscovery	requeueing event since there was an error reading object	{"error": "NodeFeatureDiscovery.nfd.openshift.io \"nfd-instance\" not found"}

Customizing nfd-worker's config

Hi,

Is there any way to customize the nfd-worker config without building our own images?
I wanted to change the labels from vendor to vendor+device ID (and possibly others), but it seems that after editing the nfd-worker ConfigMap in the openshift-nfd namespace, the operator immediately overwrites it with the definition embedded in its container image.

Thanks
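For what it's worth, in current operator versions the worker configuration can be set from the NodeFeatureDiscovery CR itself rather than by editing the generated ConfigMap. A sketch, assuming the workerConfig.configData field and the upstream pci source options (check the installed CRD for the exact schema):

apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  workerConfig:
    configData: |
      sources:
        pci:
          deviceLabelFields: ["class", "vendor", "device"]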

Name on CSV doesn't match entry in art.yaml (4.8 only)

In all release-4.* branches, the CSV has name: nfd..., so the string replacement defined in art.yaml works fine, creating a unique name for every new release of the operator's bundle.

But in release-4.8, the name in the CSV is different:

    name: node-feature-discovery-operator.v4.8.0

https://github.com/openshift/cluster-nfd-operator/blob/release-4.8/manifests/4.8/manifests/nfd.v4.8.0.clusterserviceversion.yaml#L44

So the art.yaml replacements don't happen, and pushing the bundles fails for lack of a unique name.

Similar issue happened kubernetes-nmstate: openshift/kubernetes-nmstate#203

We can either change the entry in art.yaml only for release-4.8 branch, to something like this:

    update_list:
-    - search: "nfd.v{MAJOR}.{MINOR}.0"
-      replace: "nfd.{FULL_VER}"
+    - search: "node-feature-discovery-operator.v{MAJOR}.{MINOR}.0"
+      replace: "node-feature-discovery-operator.{FULL_VER}"

Or change the name to "nfd" in the CSV

    operatorframework.io/arch.ppc64le: supported
    operatorframework.io/arch.s390x: supported
-  name: node-feature-discovery-operator.v4.8.0
+  name: nfd.v4.8.0
  namespace: placeholder

A way to customize the deployment

We would like to deploy the NFD service using the operator, but with some customization (additional plugins and related pods). You can find the way we do that here: https://github.com/ksimon1/cpu-model-nfd-plugin/blob/master/cpu-model-nfd-plugin.yaml.template

The issue is, there is no way to configure NFD when it is deployed and started using this operator.

The basic requirements are:

  • an emptyDir: {} volume for plugins mounted at /etc/kubernetes/node-feature-discovery/source.d/
  • list of extra containers to populate the plugin volume in the initContainers section
  • list of extra containers to run alongside NFD with that plugin volume mounted to a well known location

We can use our own custom deployment and manage the lifecycle ourselves, but we would lose the changes provided by this operator and it would no longer be vanilla OpenShift.
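To make the request concrete, the shape being asked for on the nfd-worker pod spec is roughly the following (names and images are placeholders; these are not existing operator CR fields, just ordinary pod-spec fragments illustrating the three requirements):

volumes:
- name: nfd-source-d
  emptyDir: {}
initContainers:
- name: cpu-model-nfd-plugin             # placeholder; populates the plugin volume
  image: quay.io/example/cpu-model-nfd-plugin:latest
  command: ["cp", "/plugin/cpu-model-hook", "/destination/"]
  volumeMounts:
  - name: nfd-source-d
    mountPath: /destination
containers:
- name: nfd-worker
  volumeMounts:
  - name: nfd-source-d
    mountPath: /etc/kubernetes/node-feature-discovery/source.d/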

Documentation suggestion

Since this and https://github.com/openshift/node-feature-discovery are the two repositories linked to in the operator as deployed in OperatorHub, it would be useful to have more explicit instructions here as to what a user, having subscribed to the operator through the CLI as described in https://docs.openshift.com/container-platform/4.6/operators/admin/olm-adding-operators-to-cluster.html#olm-installing-operator-from-operatorhub-using-cli_olm-adding-operators-to-a-cluster, would do next to instantiate and configure the server & worker pods in a particular namespace.

For the record, all the user needs to do to instantiate the server and worker pods is oc apply -f 0700_cr.yaml with the namespace specified, and the configuration instructions are at https://kubernetes-sigs.github.io/node-feature-discovery/v0.7/get-started/deployment-and-usage.html#configuration

Creating, deleting, then recreating an NFD instance causes the controller manager to crash

The controller manager crashes when an NFD instance is created, then deleted, then recreated again. Relevant logs are below. Then below these logs is a description that explains how to replicate the bug.

2021-08-12T15:29:22.053Z	ERROR	controller-runtime.manager.controller.nodefeaturediscovery	Reconciler error	{"reconciler group": "nfd.openshift.io", "reconciler kind": "NodeFeatureDiscovery", "name": "nfd-instance", "namespace": "openshift-nfd", "error": "Operation cannot be fulfilled on nodefeaturediscoveries.nfd.openshift.io \"nfd-instance\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:267
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
	/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99

To replicate the bug:

  1. Create the NFD instance:
     oc create -f config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml
  2. Delete the NFD instance:
     oc delete -f config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml
  3. Recreate the NFD instance:
     oc create -f config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml

Small mistake in the README

I just noticed a slight error in the README.md file. It says that to create an NFD instance, you should run:

oc apply -f config/samples/nfd.kubernetes.io_v1_nodefeaturediscovery.yaml

However, this file (nfd.kubernetes.io_v1_nodefeaturediscovery.yaml) does not exist. The actual file is nfd.openshift.io_v1_nodefeaturediscovery.yaml. Thus, the command should be:

oc apply -f config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml

OpenShift NFD extendedResources for SGX on 4.14

Summary

We use the 4.14.0-202402081809 NFD operator version on OCP 4.14.11. It does not add the NFD extendedResources for SGX EPC with a NodeFeatureRule.

Details

  1. For the 4.14.0-202402081809 NFD operator version on OCP 4.14.11, we use an NFR instance to add the SGX EPC sgx.intel.com/epc resource to the node (this was previously done with the help of the SGX device plugin's init container). The NFR does not create the SGX EPC resource on the node. Is the extendedResources option for EPC not supported in this operator version? Please confirm, thanks.

  2. For the 4.14.0-202402081809 NFD operator version, we use the SGX init container along with the NFD instance and NFR instance for the SGX EPC sgx.intel.com/epc resource on the node. This works as expected with all current files, but if the quay image for NFD is updated to 4.14, the resource is not created. It only works with image tag 4.13, even on OCP 4.14.

The feature is supported in upstream NFD (kubernetes-sigs/node-feature-discovery#1129); is it yet to be supported on OCP 4.14? Please confirm and suggest. Also, when would this be supported on OCP?
Thanks!

Could not install NodeFeatureDiscovery on OKD4.5 and 4.6 Cluster by operatorHub.

On my OKD cluster, I found the NFD operator on OperatorHub, so I tried to install it from there, but it was not installed.
(Before I found it on OperatorHub, I used to install it manually: I cloned the repo from git and used the make command to deploy.)

Installation was attempted in two ways (install in all namespaces (the default) / in a specific namespace), but the following error occurred (screenshot omitted):

{"level":"info","ts":1616045944.9220765,"logger":"cmd","msg":"Go Version: go1.15.5"}
{"level":"info","ts":1616045944.9221418,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1616045944.9221659,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0+git"}
{"level":"info","ts":1616045944.922823,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1616045946.8118556,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1616045946.8415146,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1616045950.8142676,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1616045950.8169394,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1616045950.817697,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1616045950.818323,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8187845,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8190293,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8192632,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8194685,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8199208,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1616045950.8208203,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8210897,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8213153,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.821549,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8217416,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8219752,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8221507,"logger":"controller-runtime.controller","msg":"Starting Controller","controller":"nodefeaturediscovery-controller"}
{"level":"info","ts":1616045951.0225158,"logger":"controller-runtime.controller","msg":"Starting workers","controller":"nodefeaturediscovery-controller","worker count":1}

How can I solve this? Does anyone have the same error?

NFD 4.10 / broken ServiceMonitor

Hi,

I'm now testing the NFD operator (current CSV: nfd.4.10.0-202205120735, although I also see it replaces nfd.4.10.0-202204251358; the cluster was installed 9 days ago).

Looking ahead to upgrading several OCP clusters from 4.8 to 4.10, I'm going through Prometheus alerts.
I see one about Prometheus being unable to scrape metrics from an "nfd-controller-manager-metrics-service" Service.

prometheus-pod$ curl http://127.0.0.1:9090/api/v1/alerts
{"status":"success","data":{"alerts":[{"labels":{"alertname":"TargetDown","job":"nfd-controller-manager-metrics-service","namespace":"openshift-operators","service":"nfd-controller-manager-metrics-service","severity":"warning"},"annotations":{"description":"100% of the nfd-controller-manager-metrics-service/nfd-controller-manager-metrics-service targets in openshift-operators namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.","summary":"Some targets were not reachable from the monitoring server for an extended period of time."},"state":"firing","activeAt":"2022-05-17T14:33:11Z","value":"1e+02"}, ...

In the openshift-operators namespace, I found an "nfd-controller-manager-metrics-monitor" ServiceMonitor, defined with the following:

spec:
  endpoints:
  - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    interval: 30s
    path: /metrics
    port: https
    scheme: https
    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: nfd-controller-manager-metrics-service.openshift-nfd.svc
  selector:
    matchLabels:
      control-plane: controller-manager

The thing is: my NFD operator is installed in the "openshift-operators" namespace.
The tlsConfig.serverName does not match: it is nfd-controller-manager-metrics-service.openshift-nfd.svc but should be nfd-controller-manager-metrics-service.openshift-operators.svc.
Shouldn't this configuration reflect whichever namespace the operator is installed into?
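For an operator installed in openshift-operators, the corrected endpoint configuration would be:

    tlsConfig:
      caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
      serverName: nfd-controller-manager-metrics-service.openshift-operators.svc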

Later on, as the alert wouldn't go away, I noticed there are two ServiceMonitors with the exact same configuration; the previous remark also applies to "controller-manager-metrics-monitor". There's no reason to have two configurations, and it doesn't look related to the CSV upgrade, since the creation timestamps almost match:

$ oc get servicemonitor -n openshift-operators -o yaml controller-manager-metrics-monitor nfd-controller-manager-metrics-monitor | grep creationTimest
    creationTimestamp: "2022-05-17T14:30:49Z"
    creationTimestamp: "2022-05-17T14:30:50Z"

After fixing the tlsConfig.serverName on both ServiceMonitors, I can confirm Prometheus no longer complains about those metrics:

prometheus-pod$ curl http://127.0.0.1:9090/api/v1/alerts
[nfd servicemonitor alert is gone]

prometheus-pod$ curl "http://127.0.0.1:9090/api/v1/query?query=nfd_degraded_info"
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"nfd_degraded_info","container":"kube-rbac-proxy","endpoint":"https","instance":"10.94.13.23:8443","job":"nfd-controller-manager-metrics-service","namespace":"openshift-operators","pod":"nfd-controller-manager-747c569f45-gdvdn","service":"nfd-controller-manager-metrics-service"},"value":[1653645665.843,"0"]}]}}

Thanks.
