openshift / cluster-nfd-operator
The Cluster Node Feature Discovery operator manages detection of hardware features and configuration in an OpenShift cluster.
License: Apache License 2.0
We would like to deploy the NFD service using the operator, but with some customization (additional plugins and related pods). You can find how we do that here: https://github.com/ksimon1/cpu-model-nfd-plugin/blob/master/cpu-model-nfd-plugin.yaml.template
The issue is that there is no way to configure NFD when it is deployed and started using this operator.
The basic requirements are:
We can use our custom deployment and manage the lifecycle ourselves, but we would lose the changes provided by this operator and it would not be vanilla OpenShift anymore.
We use NFD operator version 4.14.0-202402081809 on OCP 4.14.11. It does not add the NFD extendedResources for SGX EPC via the NodeFeatureRule.
With NFD operator version 4.14.0-202402081809 on OCP 4.14.11, we use an NFR (NodeFeatureRule) instance to add the SGX EPC sgx.intel.com/epc resource on the node (previously this was done with the help of the SGX device plugin's init container). The NFR does not create the SGX EPC resource on the node. Is the extendedResources option for EPC not supported in this operator version? Please confirm, thanks.
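For reference, the kind of NodeFeatureRule we are attempting would look roughly like the sketch below, following the upstream extendedResources support from kubernetes-sigs/node-feature-discovery#1129. The feature and attribute names (cpu.security, sgx.epc) are assumptions and should be checked against the NFD version in use:
apiVersion: nfd.k8s.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: intel-sgx-epc
spec:
  rules:
    - name: "intel sgx epc extended resource"
      # Advertise EPC size as an extended resource, taking the value from
      # the discovered cpu.security.sgx.epc attribute (assumed name).
      extendedResources:
        sgx.intel.com/epc: "@cpu.security.sgx.epc"
      matchFeatures:
        - feature: cpu.security
          matchExpressions:
            sgx.enabled: {op: IsTrue}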
For the 4.14.0-202402081809 NFD operator version, we use the SGX init container along with the NFD instance and NFR instance for the SGX EPC sgx.intel.com/epc resource on the node. This works as expected with all current files, but if the quay image for NFD is updated to 4.14, the resource is not created. It only works with image tag 4.13, even on OCP 4.14.
The feature is supported in upstream NFD (kubernetes-sigs/node-feature-discovery#1129); is it yet to be supported on OCP 4.14? Please confirm and suggest. Also, when would this be supported on OCP?
Thanks!
Our operator running on OCP depends on NFD and needs to deploy NFD with some specific configurations. We'd like to know what is the proper way for us to deploy NFD through NFD-operator on OCP.
We plan to add the NFD-operator as a dependency in our operator's bundle image metadata and let our operator create a CR from the NFD-operator CRD, including our configuration, to deploy NFD.
A potential problem is that other users on the same cluster might do the same thing and create their own CRs. Can these CRs coexist?
If these CRs can't coexist, does that mean we need a cluster administrator to manually merge the different requirements from users and create a single CR to deploy NFD?
Any suggestions and help are appreciated!
Description of problem:
After installing NFD Operator, pods keep crashing.
{"level":"info","ts":1579886681.9943285,"logger":"cmd","msg":"Go Version: go1.12.12"} {"level":"info","ts":1579886681.9943712,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"} {"level":"info","ts":1579886681.9943788,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0+git"} {"level":"info","ts":1579886681.9950516,"logger":"leader","msg":"Trying to become the leader."} {"level":"info","ts":1579886682.1206093,"logger":"leader","msg":"Found existing lock with my name. I was likely restarted."} {"level":"info","ts":1579886682.120643,"logger":"leader","msg":"Continuing as the leader."} {"level":"info","ts":1579886682.2016144,"logger":"cmd","msg":"Registering Components."} {"level":"info","ts":1579886682.2017787,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="} {"level":"info","ts":1579886682.2019343,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="} {"level":"info","ts":1579886682.202073,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="} {"level":"info","ts":1579886682.2021816,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="} {"level":"info","ts":1579886682.20229,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="} {"level":"error","ts":1579886682.202312,"logger":"cmd","msg":"","error":"no kind is registered for the type v1.SecurityContextConstraints in scheme \"k8s.io/client-go/kubernetes/scheme/register.go:60\"","stacktrace":"github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nmain.main\n\t/go/src/github.com/openshift/cluster-nfd-operator/cmd/manager/main.go:92\nruntime.main\n\t/opt/rh/go-toolset-1.12/root/usr/lib/go-toolset-1.12-golang/src/runtime/proc.go:200"}
Upon first glance, the controller appears to be attempting to watch /api/v1/securitycontextconstraints, which is not available in OpenShift 4.x. The OperatorHub controller image should be updated to watch /apis/security.openshift.io/v1/securitycontextconstraints.
@zvonkok, can we use this operator on upstream Kubernetes? We are working on Linux Foundation projects and would like to leverage the NFD operator in our project.
NodeFeatureDiscovery CR status is showing degraded as nfd-worker pods are not coming up.
Status:
  Conditions:
    Last Heartbeat Time:   2022-03-09T04:37:58Z
    Last Transition Time:  2022-03-09T04:37:58Z
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2022-03-09T04:37:58Z
    Last Transition Time:  2022-03-09T04:37:58Z
    Status:                False
    Type:                  Upgradeable
    Last Heartbeat Time:   2022-03-09T04:37:58Z
    Last Transition Time:  2022-03-09T04:37:58Z
    Status:                False
    Type:                  Progressing
    Last Heartbeat Time:   2022-03-09T04:37:58Z
    Last Transition Time:  2022-03-09T04:37:58Z
    Message:               NFDWorkerDaemonSetCorrupted
    Reason:                FailedGettingNFDWorkerDaemonSet
    Status:                True
    Type:                  Degraded
Events:
Type Reason Age From Message
Warning FailedCreate 25s (x15 over 107s) daemonset-controller Error creating: pods "nfd-worker-" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider "infoscale-fencing-scc": Forbidden: not usable by user or serviceaccount, provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, spec.volumes[0]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[1]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[2]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[3]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[4]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[5]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.volumes[6]: Invalid value: "hostPath": hostPath volumes are not allowed to be used, spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "infoscale-vtas-licensing-controller": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "infoscale-vtas-csi-driver-controller": Forbidden: not usable by user or serviceaccount, provider "infoscale-kubefencing-scc": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "nfd-worker": Forbidden: not usable by user or serviceaccount, provider "infoscale-vtas-csi-driver-node": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "infoscale-vtas-driver-container-rhel8": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]
We should clean up devel branches that have been merged or tested, and stick with the release-x branches for clean repo maintenance.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/master
              operator: DoesNotExist
        - matchExpressions:
            - key: node-role.kubernetes.io/worker
              operator: Exists
I just noticed a slight error in the README.md file. It says that to create an NFD instance, you should do:
oc apply -f config/samples/nfd.kubernetes.io_v1_nodefeaturediscovery.yaml
However, this file (nfd.kubernetes.io_v1_nodefeaturediscovery.yaml) does not exist. The actual file is nfd.openshift.io_v1_nodefeaturediscovery.yaml. Thus, the command should be:
oc apply -f config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml
Bundle build for cluster-nfd-operator is failing on OSBS. It seems to me that OSBS introduced a check that disallows null values in annotations. They could be replaced with "", though I am not sure whether the check will allow empty strings or not.
Since this and https://github.com/openshift/node-feature-discovery are the two repositories linked to in the operator as deployed in OperatorHub, it would be useful to have more explicit instructions here as to what a user, having subscribed to the operator through the CLI as described in https://docs.openshift.com/container-platform/4.6/operators/admin/olm-adding-operators-to-cluster.html#olm-installing-operator-from-operatorhub-using-cli_olm-adding-operators-to-a-cluster, would do next to instantiate and configure the server & worker pods in a particular namespace.
For the record, all the user needs to do to instantiate the server and worker pods is oc apply -f 0700_cr.yaml with the namespace specified; the configuration instructions are in https://kubernetes-sigs.github.io/node-feature-discovery/v0.7/get-started/deployment-and-usage.html#configuration
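For illustration only, a minimal NodeFeatureDiscovery CR of the kind referenced above (0700_cr.yaml) might look roughly like this; the exact apiVersion and spec fields vary by operator release and are assumptions here:
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  # operand settings are assumed; check the sample CR shipped with your operator version
  operand:
    namespace: openshift-nfd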
The following branches are being fast-forwarded from the current development branch (master) as placeholders for future releases. No merging is allowed into these release branches until they are unfrozen for production release.
release-4.18
release-4.19
For more information, see the branching documentation.
As of this commit, three-node clusters with GPU acceleration seem to no longer be supported. In a three-node cluster, all nodes have the master and worker labels. Since they have the master label, the NFD workers will not be scheduled. The comment from the commit indicates that the intent was to allow NFD to run on servers that may be tagged with roles other than worker, but the effect of the change is greater than that.
Might I suggest running the NFD worker on the worker nodes by default and allowing the nodeSelector to be set in the spec of the NodeFeatureDiscovery resource?
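A minimal sketch of what that default could look like, keeping only the worker-role requirement so that compact three-node clusters (where nodes carry both labels) still schedule the workers:
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node-role.kubernetes.io/worker
              operator: Exists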
Priority classes docs:
https://docs.openshift.com/container-platform/3.11/admin_guide/scheduling/priority_preemption.html#admin-guide-priority-preemption-priority-class
Example: https://github.com/openshift/cluster-monitoring-operator/search?q=priority&unscoped_q=priority
Notes: The pre-configured system priority classes (system-node-critical and system-cluster-critical) can only be assigned to pods in kube-system or openshift-* namespaces. Most likely, core operators and their pods should be assigned system-cluster-critical. Please do not assign system-node-critical (the highest priority) unless you are really sure about it.
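For illustration, assigning the class to the operator's pods is a single field in the Deployment pod template; this is a fragment sketch, not the operator's actual manifest:
spec:
  template:
    spec:
      # system-cluster-critical is only accepted in kube-system / openshift-* namespaces
      priorityClassName: system-cluster-critical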
The following branches are being fast-forwarded from the current development branch (master) as placeholders for future releases. No merging is allowed into these release branches until they are unfrozen for production release.
release-4.5
release-4.4
Contact the Test Platform or Automated Release teams for more information.
We should make variables
It looks like this fork/mirror is updated from the upstream projects from time to time. The sync for 4.11 was done very poorly, and then, while finishing/fixing the sync merge problems that were committed, the PR [1] to sync with upstream 0.5.0 removed the required patch [2][3] that allowed things to work without overrides to NodeFeatureDiscovery objects.
version: v1alpha1 is defined twice, in https://github.com/openshift/cluster-nfd-operator/blob/release-4.5/manifests/olm-catalog/4.5/nfd.crd.yaml#L13 and https://github.com/openshift/cluster-nfd-operator/blob/release-4.5/manifests/olm-catalog/4.5/nfd.crd.yaml#L38, which is failing validation in brew.
Image build failed. Error in plugin orchestrate_build: {"x86_64": {"pin_operator_digest": "while constructing a mapping\n in \"/tmp/tmpXBzS7p/cluster-nfd-operator-bundle/manifests/nfd.crd.yaml\", line 6, column 3\nfound duplicate key \"version\" with value \"v1alpha1\" (original value: \"v1alpha1\")\n in \"/tmp/tmpXBzS7p/cluster-nfd-operator-bundle/manifests/nfd.crd.yaml\", line 38, column 3\n\nTo suppress this check see:\n http://yaml.readthedocs.io/en/latest/api.html#duplicate-keys\n\nDuplicate keys will become and error in future releases, and are errors\nby default when using the new API.\n"}}. OSBS build id: cluster-nfd-operator-bundle-rhaos-45-rhel-7-7ffff-4
Failed brew build: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=28602159
The patch that fixes detection of the Intel SGX feature has been merged into NFD upstream; see kubernetes-sigs/node-feature-discovery@8a39434.
We need this fix to enable the SGX feature on OCP; see intel/intel-device-plugins-for-kubernetes#780.
Could you please let us know when and how this fix can be merged into the nfd-operator and OCP?
Thanks!
❯ oc logs -n openshift-nfd nfd-controller-manager-576765f974-krxgc -c manager
I0622 11:03:24.298094 1 main.go:61] Operator Version: a42b581a-dirty
2022-06-22T11:03:24.298Z DPANIC setup non-string key argument passed to logging, ignoring all later arguments {"invalid key": "POD_NAMESPACE must be set"}
panic: non-string key argument passed to logging, ignoring all later arguments
goroutine 1 [running]:
go.uber.org/zap/zapcore.(*CheckedEntry).Write(0xc00002e300, {0xc00036ee00, 0x1, 0x1})
/go/src/github.com/openshift/cluster-nfd-operator/vendor/go.uber.org/zap/zapcore/entry.go:232 +0x446
go.uber.org/zap.(*Logger).DPanic(0x15de809, {0x163560e, 0x14170a0}, {0xc00036ee00, 0x1, 0x1})
/go/src/github.com/openshift/cluster-nfd-operator/vendor/go.uber.org/zap/logger.go:220 +0x59
github.com/go-logr/zapr.handleFields(0x0, {0xc000255480, 0x2, 0x40cfe7}, {0x0, 0x0, 0xc000217b01})
/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:110 +0x3f1
github.com/go-logr/zapr.(*zapLogger).WithValues(0xc0004cd5a0, {0xc000255480, 0x1, 0x15df7c6})
/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:145 +0x31
main.main()
/go/src/github.com/openshift/cluster-nfd-operator/main.go:130 +0x3e2
Setting POD_NAMESPACE via the downward API makes it start successfully. I don't know why it was missing from my installation.
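For reference, a sketch of the downward API wiring described above (the container name is an assumption):
spec:
  template:
    spec:
      containers:
        - name: manager
          env:
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace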
Now that we are fully branched for 4.4, please prepare your operator to supply a 4.4 bundle, so that 4.4 operator publishing works and doesn't overwrite 4.3 bundles. This means at least updating the package.yaml under https://github.com/openshift/cluster-nfd-operator/tree/master/manifests
@zvonkok
Hi,
Is there any way to customize the nfd-worker config without building my own images?
I wanted to change the labels from vendor to vendor + device id (and possibly others), but it seems that after editing the nfd-worker ConfigMap in the openshift-nfd namespace, the operator immediately overwrites it with the definition embedded in its container image.
Thanks
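One possible route, sketched below, is to set the worker configuration through the NodeFeatureDiscovery CR instead of editing the ConfigMap directly, so the operator reconciles your values rather than overwriting them. The spec.workerConfig.configData field and the pci deviceLabelFields option are assumptions based on the operator CRD and the upstream worker config; verify them against your operator version:
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  workerConfig:
    configData: |
      sources:
        pci:
          # label PCI devices by vendor and device id instead of vendor only
          deviceLabelFields:
            - vendor
            - device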
Now that we are fully branched for 4.7, please prepare your operator to supply a 4.7 bundle, so that 4.7 operator publishing works and doesn't overwrite 4.6 bundles. This means at least updating the package.yaml under
https://github.com/openshift/cluster-nfd-operator/tree/master/manifests
Reference: openshift-eng/ocp-build-data#708
The operator Deployment is set to strategy: RollingUpdate, which means that when an update is rolled out, the pods in the new ReplicaSet have to become ready before the old ReplicaSet is scaled down. This creates a deadlock scenario:
My fix is to manually scale down the old ReplicaSet, which fixes everything, but I think the best answer would be for the operator Deployment to be set to strategy: Recreate.
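For illustration, the suggested change amounts to switching the Deployment strategy; a fragment sketch of the relevant field:
spec:
  strategy:
    type: Recreate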
OpenShift NFD operator version 4.14.0-202402081809 on OCP 4.14.11 does not support the --enable-taints feature.
We are currently creating accelerator profiles on OpenShift AI for multiple Intel GPU cards through taints, as mentioned in the RH accelerator profile doc (current accelerator profile for one of the Intel GPU cards: link). Tolerations are already supported in the accelerator profile.
To support multiple GPU cards, we would need to add taints on the nodes using the NFD operator's NodeFeatureRule instance, in order to integrate and use the accelerator profiles with the Intel GPU device plugin's resource. We cannot taint the nodes directly, since we need matchFeatures together with the taints from the NodeFeatureRule.
According to the NFD upstream documentation, tainting seems to be an experimental feature. We have tried adding --enable-taints to the nfd-worker pods/DaemonSet manually, but this flag is not supported.
Is it possible to enable this feature on OCP 4.14? If not, when would it be supported on OCP? Thanks!
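For context, when taint support is enabled upstream, a NodeFeatureRule can request taints roughly as in the sketch below. The rule name, taint key/value, and match expression here are made-up placeholders, not a tested configuration:
apiVersion: nfd.k8s.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: intel-gpu-taint-example
spec:
  rules:
    - name: "example intel gpu taint"
      # taint nodes that expose an Intel (vendor 8086) PCI device
      taints:
        - key: "example.com/intel-gpu"
          value: "present"
          effect: NoSchedule
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["8086"]}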
Hi,
I'm now testing the NFD operator (current CSV: nfd.4.10.0-202205120735, although I also see it replaces nfd.4.10.0-202204251358; the cluster was installed 9 days ago).
Looking forward to upgrading several OCP clusters from 4.8 to 4.10, I'm going through Prometheus alerts.
I see one regarding Prometheus being unable to scrape metrics from an "nfd-controller-manager-metrics-service" Service.
prometheus-pod$ curl http://127.0.0.1:9090/api/v1/alerts
{"status":"success","data":{"alerts":[{"labels":{"alertname":"TargetDown","job":"nfd-controller-manager-metrics-service","namespace":"openshift-operators","service":"nfd-controller-manager-metrics-service","severity":"warning"},"annotations":{"description":"100% of the nfd-controller-manager-metrics-service/nfd-controller-manager-metrics-service targets in openshift-operators namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.","summary":"Some targets were not reachable from the monitoring server for an extended period of time."},"state":"firing","activeAt":"2022-05-17T14:33:11Z","value":"1e+02"}, ...
In the openshift-operators namespace, I found an "nfd-controller-manager-metrics-monitor" ServiceMonitor, defined with the following:
spec:
  endpoints:
    - bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
      interval: 30s
      path: /metrics
      port: https
      scheme: https
      tlsConfig:
        caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
        serverName: nfd-controller-manager-metrics-service.openshift-nfd.svc
  selector:
    matchLabels:
      control-plane: controller-manager
The thing is: my NFD operator is installed in the "openshift-operators" namespace.
The tlsConfig.serverName does not match (it is nfd-controller-manager-metrics-service.openshift-nfd.svc, but should be nfd-controller-manager-metrics-service.openshift-operators.svc).
Shouldn't this configuration reflect whichever namespace the operator is installed into?
Later on, as the alert wouldn't go away, I noticed there are two ServiceMonitors with the exact same configuration.
The previous remark also applies to "controller-manager-metrics-monitor", although there's no reason to have two configurations. It doesn't seem related to a CSV upgrade, as the creation timestamps almost match:
get servicemonitor -n openshift-operators -o yaml controller-manager-metrics-monitor nfd-controller-manager-metrics-monitor |grep creationTimest
creationTimestamp: "2022-05-17T14:30:49Z"
creationTimestamp: "2022-05-17T14:30:50Z"
After fixing the spec.tlsConfig.serverName on both ServiceMonitors, I can confirm Prometheus no longer complains about those metrics.
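For reference, the corrected tlsConfig on both ServiceMonitors, with the serverName pointing at the namespace the operator is actually installed into:
tlsConfig:
  caFile: /etc/prometheus/configmaps/serving-certs-ca-bundle/service-ca.crt
  serverName: nfd-controller-manager-metrics-service.openshift-operators.svc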
prometheus-pod$ curl http://127.0.0.1:9090/api/v1/alerts
[nfd servicemonitor alert is gone]
prometheus-pod$ curl "http://127.0.0.1:9090/api/v1/query?query=nfd_degraded_info"
{"status":"success","data":{"resultType":"vector","result":[{"metric":{"__name__":"nfd_degraded_info","container":"kube-rbac-proxy","endpoint":"https","instance":"10.94.13.23:8443","job":"nfd-controller-manager-metrics-service","namespace":"openshift-operators","pod":"nfd-controller-manager-747c569f45-gdvdn","service":"nfd-controller-manager-metrics-service"},"value":[1653645665.843,"0"]}]}}
Thanks.
https://github.com/openshift/cluster-nfd-operator/tree/release-4.3/manifests/olm-catalog
file does not exist: manifests/olm-catalog/4.3/image-references
This pod must tolerate any taint. For example, the SDN pod has this list:
tolerations:
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists
  - effect: NoSchedule
    key: node-role.kubernetes.io/master
    operator: Exists
  - effect: NoSchedule
    key: node.kubernetes.io/not-ready
    operator: Exists
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
The NFD Operator's controller manager is unable to handle the case where the user creates an NFD instance and then deletes it.
This problem occurs because the current error-checking library, criapierrors, does not check for errors related to missing resources. Missing resources will have their metav1 values/attributes set to empty or nil (depending on whether the attribute is a pointer or not). Thus, we want to check for metav1 problems, not GRPC problems, as GRPC problems are not relevant to checking for missing resources.
Sample existing output:
2021-08-06T15:04:12.190Z INFO controller_nodefeaturediscovery Looking for {"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.190Z INFO controller_nodefeaturediscovery Found, skipping update {"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.190Z INFO controller_nodefeaturediscovery Looking for {"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-06T15:04:12.190Z INFO controller_nodefeaturediscovery Found, updating {"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-06T15:04:12.482Z INFO controller_nodefeaturediscovery Looking for {"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-06T15:04:12.482Z INFO controller_nodefeaturediscovery Found, updating {"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-06T15:04:12.626Z INFO controller_nodefeaturediscovery Looking for {"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.626Z INFO controller_nodefeaturediscovery Found, updating {"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.652Z INFO controller_nodefeaturediscovery Looking for {"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.652Z INFO controller_nodefeaturediscovery Found, updating {"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.666Z INFO controller_nodefeaturediscovery Looking for {"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.666Z INFO controller_nodefeaturediscovery Found, skipping update {"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.666Z INFO controller_nodefeaturediscovery Looking for {"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.666Z INFO controller_nodefeaturediscovery Found, updating {"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.687Z INFO controller_nodefeaturediscovery Looking for {"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.687Z INFO controller_nodefeaturediscovery Found, updating {"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.703Z INFO controller_nodefeaturediscovery Looking for {"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.704Z INFO controller_nodefeaturediscovery Found, updating {"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.720Z INFO controller_nodefeaturediscovery Looking for {"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.720Z INFO controller_nodefeaturediscovery Found, updating {"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-06T15:04:12.740Z INFO controller_nodefeaturediscovery Looking for {"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
2021-08-06T15:04:12.740Z INFO controller_nodefeaturediscovery Found, updating {"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
2021-08-06T17:11:55.170Z INFO controllers.NodeFeatureDiscovery Fetch the NodeFeatureDiscovery instance
2021-08-06T17:11:55.170Z ERROR controllers.NodeFeatureDiscovery requeueing event since there was an error reading object {"error": "NodeFeatureDiscovery.nfd.openshift.io \"nfd-instance\" not found"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:132
github.com/openshift/cluster-nfd-operator/controllers.(*NodeFeatureDiscoveryReconciler).Reconcile
/go/src/github.com/openshift/cluster-nfd-operator/controllers/nodefeaturediscovery_controller.go:169
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:263
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99
2021-08-06T17:11:55.170Z ERROR controller-runtime.manager.controller.nodefeaturediscovery Reconciler error {"reconciler group": "nfd.openshift.io", "reconciler kind": "NodeFeatureDiscovery", "name": "nfd-instance", "namespace": "openshift-nfd", "error": "NodeFeatureDiscovery.nfd.openshift.io \"nfd-instance\" not found"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:267
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99
2021-08-06T17:11:55.175Z INFO controllers.NodeFeatureDiscovery Fetch the NodeFeatureDiscovery instance
2021-08-06T17:11:55.176Z ERROR controllers.NodeFeatureDiscovery requeueing event since there was an error reading object {"error": "NodeFeatureDiscovery.nfd.openshift.io \"nfd-instance\" not found"}
The following branches are being fast-forwarded from the current development branch (master) as placeholders for future releases. No merging is allowed into these release branches until they are unfrozen for production release.
release-4.16
release-4.17
For more information, see the branching documentation.
After installing the RH nfd-operator and gpu-operator in an OCP 4.5 cluster, the nfd-operator logs at a rate of over 12,000 lines per minute.
$ oc get csv nfd.4.5.0-202011132127.p0
NAME DISPLAY VERSION REPLACES PHASE
nfd.4.5.0-202011132127.p0 Node Feature Discovery 4.5.0 Succeeded
$ date ; oc logs nfd-operator-7c69cf4f44-jkbwq | wc -l
Wed, Feb 3, 2021 2:44:30 PM
171957
$ date ; oc logs nfd-operator-7c69cf4f44-jkbwq | wc -l
Wed, Feb 3, 2021 2:47:01 PM
203070
2 minutes, 31 seconds since the last count
203,070 - 171,957 = 31,113 messages
31,113 logs / ~2.5 minutes ≈ 12,400 logs per minute
Typical contents of logs:
{"level":"info","ts":1612381861.9975307,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","Service":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0078075,"logger":"controller_nodefeaturediscovery","msg":"Found, skpping update","ServiceAccount":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0079691,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","Role":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.01564,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","RoleBinding":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0217285,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","ConfigMap":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0276768,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","SecurityContextConstraints":"nfd-worker","Namespace":"default"}
{"level":"info","ts":1612381862.0387275,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","DaemonSet":"nfd-worker","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0495317,"logger":"controller_nodefeaturediscovery","msg":"Looking for","ServiceAccount":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.049615,"logger":"controller_nodefeaturediscovery","msg":"Found, skpping update","ServiceAccount":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.04964,"logger":"controller_nodefeaturediscovery","msg":"Looking for","ClusterRole":"nfd-master","Namespace":""}
{"level":"info","ts":1612381862.04966,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","ClusterRole":"nfd-master","Namespace":""}
{"level":"info","ts":1612381862.0553772,"logger":"controller_nodefeaturediscovery","msg":"Looking for","ClusterRoleBinding":"nfd-master","Namespace":""}
{"level":"info","ts":1612381862.0554087,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","ClusterRoleBinding":"nfd-master","Namespace":""}
{"level":"info","ts":1612381862.0618849,"logger":"controller_nodefeaturediscovery","msg":"Looking for","DaemonSet":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.0619338,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","DaemonSet":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.1010854,"logger":"controller_nodefeaturediscovery","msg":"Looking for","Service":"nfd-master","Namespace":"openshift-operators"}
{"level":"info","ts":1612381862.1012244,"logger":"controller_nodefeaturediscovery","msg":"Found, updating","Service":"nfd-master","Namespace":"openshift-operators"}
Approximately 7 hours after NFD has been installed, the following error shows up in the NFD controller manager's logs:
E0903 02:52:07.981627 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: Failed to watch *v1.NodeFeatureDiscovery: the server has received too many requests and has asked us to try again later (get nodefeaturediscoveries.nfd.openshift.io)
E0903 02:52:10.566841 1 reflector.go:138] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:225: Failed to watch *v1.NodeFeatureDiscovery: the server has received too many requests and has asked us to try again later (get nodefeaturediscoveries.nfd.openshift.io)
This issue is easily repeatable by letting NFD run for several hours, such as overnight. I have reproduced it both with and without using the cluster during that 7-hour period.
Seems like a controller manager issue?
I applied this operator in an environment with IPv6. It deployed correctly on the masters, but the nfd-worker pods keep crashing.
When I look at the logs inside the container, I can see:
2020/04/08 10:44:05 Sendng labeling request nfd-master
2020/04/08 10:44:05 failed to set node labels: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp: address fd02::d19a:12000: too many colons in address"
This seems to be a problem with IPv6 address handling: the address is not bracketed before the port is appended, so dialing fd02::d19a:12000 fails with "too many colons in address".
The operator's logs are repeated twice when launching NFD -- meaning some commands are actually called twice -- but sometimes the repetitions occur three, four, or more times. This issue could have existed before my PR #183, or it could have been introduced with it. Either way, the repetition problem needs to be fixed, because it will likely have an impact on NFD functionality in the future.
Sample:
2021-08-05T15:05:48.251Z INFO controllers.NodeFeatureDiscovery Fetch the NodeFeatureDiscovery instance
2021-08-05T15:05:48.251Z INFO controllers.NodeFeatureDiscovery Ready to apply components
2021-08-05T15:05:48.251Z INFO controller_nodefeaturediscovery Looking for {"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.251Z INFO controller_nodefeaturediscovery Found, skipping update {"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.251Z INFO controller_nodefeaturediscovery Looking for {"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.251Z INFO controller_nodefeaturediscovery Found, updating {"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.343Z INFO controller_nodefeaturediscovery Looking for {"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.343Z INFO controller_nodefeaturediscovery Found, updating {"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.434Z INFO controller_nodefeaturediscovery Looking for {"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.434Z INFO controller_nodefeaturediscovery Found, updating {"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.458Z INFO controller_nodefeaturediscovery Looking for {"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.458Z INFO controller_nodefeaturediscovery Found, updating {"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.470Z INFO controller_nodefeaturediscovery Looking for {"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.470Z INFO controller_nodefeaturediscovery Found, skipping update {"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.470Z INFO controller_nodefeaturediscovery Looking for {"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.470Z INFO controller_nodefeaturediscovery Found, updating {"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.479Z INFO controller_nodefeaturediscovery Looking for {"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.479Z INFO controller_nodefeaturediscovery Found, updating {"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.486Z INFO controller_nodefeaturediscovery Looking for {"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.486Z INFO controller_nodefeaturediscovery Found, updating {"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.493Z INFO controller_nodefeaturediscovery Looking for {"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.493Z INFO controller_nodefeaturediscovery Found, updating {"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.506Z INFO controller_nodefeaturediscovery Looking for {"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
2021-08-05T15:05:48.506Z INFO controller_nodefeaturediscovery Found, updating {"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
2021-08-05T15:05:48.516Z INFO controllers.NodeFeatureDiscovery Fetch the NodeFeatureDiscovery instance
2021-08-05T15:05:48.516Z INFO controllers.NodeFeatureDiscovery Ready to apply components
2021-08-05T15:05:48.516Z INFO controller_nodefeaturediscovery Looking for {"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.516Z INFO controller_nodefeaturediscovery Found, skipping update {"ServiceAccount": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.516Z INFO controller_nodefeaturediscovery Looking for {"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.516Z INFO controller_nodefeaturediscovery Found, updating {"ClusterRole": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.587Z INFO controller_nodefeaturediscovery Looking for {"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.587Z INFO controller_nodefeaturediscovery Found, updating {"ClusterRoleBinding": "nfd-master", "Namespace": ""}
2021-08-05T15:05:48.659Z INFO controller_nodefeaturediscovery Looking for {"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.659Z INFO controller_nodefeaturediscovery Found, updating {"DaemonSet": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.666Z INFO controller_nodefeaturediscovery Looking for {"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.666Z INFO controller_nodefeaturediscovery Found, updating {"Service": "nfd-master", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.679Z INFO controller_nodefeaturediscovery Looking for {"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.679Z INFO controller_nodefeaturediscovery Found, skipping update {"ServiceAccount": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.679Z INFO controller_nodefeaturediscovery Looking for {"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.679Z INFO controller_nodefeaturediscovery Found, updating {"Role": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.687Z INFO controller_nodefeaturediscovery Looking for {"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.687Z INFO controller_nodefeaturediscovery Found, updating {"RoleBinding": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.693Z INFO controller_nodefeaturediscovery Looking for {"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.693Z INFO controller_nodefeaturediscovery Found, updating {"ConfigMap": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.700Z INFO controller_nodefeaturediscovery Looking for {"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.700Z INFO controller_nodefeaturediscovery Found, updating {"DaemonSet": "nfd-worker", "Namespace": "openshift-nfd"}
2021-08-05T15:05:48.710Z INFO controller_nodefeaturediscovery Looking for {"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
2021-08-05T15:05:48.710Z INFO controller_nodefeaturediscovery Found, updating {"SecurityContextConstraints": "nfd-worker", "Namespace": "default"}
Hi cluster-nfd-operator team:
https://github.com/openshift/cluster-nfd-operator/blob/master/config/manager/manager.yaml
I tried to deploy cluster-nfd-operator in my OpenShift cluster, but the controller manager is running on a worker node.
Then I found the following in the definition of the nfd-controller-manager YAML file:
Is there any reason that the controller manager should be running on a worker node?
In all release-4.* branches, the CSV has name: nfd..., so the string replacement defined in art.yaml works fine, creating a unique name for every new release of the operator's bundle.
But in release-4.8, the name in the CSV is different:
name: node-feature-discovery-operator.v4.8.0
So the art.yaml replacements don't happen, and pushing the bundles fails without a unique name.
Similar issue happened kubernetes-nmstate: openshift/kubernetes-nmstate#203
We can either change the entry in art.yaml only for the release-4.8 branch, to something like this:
 update_list:
-  - search: "nfd.v{MAJOR}.{MINOR}.0"
-    replace: "nfd.{FULL_VER}"
+  - search: "node-feature-discovery-operator.v{MAJOR}.{MINOR}.0"
+    replace: "node-feature-discovery-operator.{FULL_VER}"
Or change the name to "nfd" in the CSV:
   operatorframework.io/arch.ppc64le: supported
   operatorframework.io/arch.s390x: supported
-  name: node-feature-discovery-operator.v4.8.0
+  name: nfd.v4.8.0
   namespace: placeholder
The nfd-controller-manager-... pod cannot run, because it can't pull registry.redhat.io/openshift4/ose-kube-rbac-proxy, which needs Red Hat registry credentials.
Now that we are fully branched for 4.5, please prepare your operator to supply a 4.5 bundle, so that 4.5 operator publishing works and doesn't overwrite 4.4 bundles. This means at least updating the package.yaml under https://github.com/openshift/cluster-nfd-operator/blob/master/manifests/olm-catalog/nfd.package.yaml
Reference: Get OLM operator owners to update their CSV channels
In the main.go file in the main directory, a command-line option exists for disabling leader election, and the default value indicates that leader election should be turned off. However, leader election is still enabled anyway:
# oc logs pod/nfd-controller-manager-7cd8d54686-nllqc manager
I0816 17:01:22.272019 1 main.go:55] Operator Version: 783214b0-dirty
I0816 17:01:23.326238 1 request.go:655] Throttling request took 1.021121017s, request: GET:https://172.30.0.1:443/apis/security.openshift.io/v1?timeout=32s
2021-08-16T17:01:25.093Z INFO controller-runtime.metrics metrics server is starting to listen {"addr": "127.0.0.1:8080"}
2021-08-16T17:01:25.094Z INFO setup starting manager
I0816 17:01:25.094661 1 leaderelection.go:243] attempting to acquire leader lease openshift-nfd/39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io...
2021-08-16T17:01:25.095Z INFO controller-runtime.manager starting metrics server {"path": "/metrics"}
I0816 17:01:25.121966 1 leaderelection.go:253] successfully acquired lease openshift-nfd/39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io
2021-08-16T17:01:25.122Z INFO controller-runtime.manager.controller.nodefeaturediscovery Starting EventSource {"reconciler group": "nfd.openshift.io", "reconciler kind": "NodeFeatureDiscovery", "source": "kind source: /, Kind="}
2021-08-16T17:01:25.122Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"ConfigMap","namespace":"openshift-nfd","name":"39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io","uid":"e8ebfd1a-d7d7-44cc-9038-70bd5f1ae380","apiVersion":"v1","resourceVersion":"53680499"}, "reason": "LeaderElection", "message": "nfd-controller-manager-7cd8d54686-nllqc_c3d12674-228d-4747-9d07-f025b8aaf51e became leader"}
2021-08-16T17:01:25.122Z DEBUG controller-runtime.manager.events Normal {"object": {"kind":"Lease","namespace":"openshift-nfd","name":"39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io","uid":"10ebb14c-84f8-42be-ae4c-9eef1316e853","apiVersion":"coordination.k8s.io/v1","resourceVersion":"53680500"}, "reason": "LeaderElection", "message": "nfd-controller-manager-7cd8d54686-nllqc_c3d12674-228d-4747-9d07-f025b8aaf51e became leader"}
2021-08-16T17:01:25.223Z INFO controller-runtime.manager.controller.nodefeaturediscovery Starting EventSource {"reconciler group": "nfd.openshift.io", "reconciler kind": "NodeFeatureDiscovery", "source": "kind source: /, Kind="}
I think this problem occurs because LeaderElectionID is provided in the manager creation step:
mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
    Scheme:                 scheme,
    MetricsBindAddress:     metricsAddr,
    Port:                   9443,
    LeaderElection:         enableLeaderElection,
    LeaderElectionID:       "39f5e5c3.nodefeaturediscoveries.nfd.kubernetes.io",
    HealthProbeBindAddress: probeAddr,
    Namespace:              watchNamespace, // namespaced-scope when the value is not an empty string
})
If we remove LeaderElectionID and LeaderElection, then everything works as intended. If we remove just LeaderElectionID and leave LeaderElection there, an error occurs saying that LeaderElectionID must be defined, even though LeaderElection is set to false by default.
A better approach to this problem is to write an if-else conditional that says "If leader election is enabled, then [...]. Else, [...]", rather than providing a leader election ID and relying on LeaderElection to disable leader election (since the true/false value of LeaderElection is ignored when LeaderElectionID is provided).
Will provide a PR fix for this soon.
The controller manager crashes when an NFD instance is created, then deleted, then recreated again. Relevant logs are below. Then below these logs is a description that explains how to replicate the bug.
2021-08-12T15:29:22.053Z ERROR controller-runtime.manager.controller.nodefeaturediscovery Reconciler error {"reconciler group": "nfd.openshift.io", "reconciler kind": "NodeFeatureDiscovery", "name": "nfd-instance", "namespace": "openshift-nfd", "error": "Operation cannot be fulfilled on nodefeaturediscoveries.nfd.openshift.io \"nfd-instance\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/go/src/github.com/openshift/cluster-nfd-operator/vendor/github.com/go-logr/zapr/zapr.go:132
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:267
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:235
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1
/go/src/github.com/openshift/cluster-nfd-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:198
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:155
k8s.io/apimachinery/pkg/util/wait.BackoffUntil
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:156
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
k8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:185
k8s.io/apimachinery/pkg/util/wait.UntilWithContext
/go/src/github.com/openshift/cluster-nfd-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:99
To replicate the bug:
1. oc create -f config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml
2. oc delete -f config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml
3. oc create -f config/samples/nfd.openshift.io_v1_nodefeaturediscovery.yaml
On my OKD cluster, I found the NFD operator in OperatorHub, so I tried to install it from there, but it was not installed.
(Before I found it in OperatorHub, I used to install it manually: I cloned it from git and used the make command to deploy.)
I tried the installation in two ways (Install on All Namespaces (default) / a specific namespace), but the following error occurred:
{"level":"info","ts":1616045944.9220765,"logger":"cmd","msg":"Go Version: go1.15.5"}
{"level":"info","ts":1616045944.9221418,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1616045944.9221659,"logger":"cmd","msg":"Version of operator-sdk: v0.4.0+git"}
{"level":"info","ts":1616045944.922823,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1616045946.8118556,"logger":"leader","msg":"No pre-existing lock was found."}
{"level":"info","ts":1616045946.8415146,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1616045950.8142676,"logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"info","ts":1616045950.8169394,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1616045950.817697,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1616045950.818323,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8187845,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8190293,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8192632,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8194685,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8199208,"logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"info","ts":1616045950.8208203,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8210897,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8213153,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.821549,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8217416,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8219752,"logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"nodefeaturediscovery-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1616045950.8221507,"logger":"controller-runtime.controller","msg":"Starting Controller","controller":"nodefeaturediscovery-controller"}
{"level":"info","ts":1616045951.0225158,"logger":"controller-runtime.controller","msg":"Starting workers","controller":"nodefeaturediscovery-controller","worker count":1}
How can I solve this? Does anyone have the same error?
Hi,
Not sure if this is the right place to ask this question.
We are working on enabling the Intel SGX feature on OpenShift.
The Intel SGX device plugin uses NFD extended resources to report the EPC size. To achieve this, we need to pass the parameters below when starting nfd-master:
--resource-labels=sgx.intel.com/epc
--extra-label-ns=sgx.intel.com
But I can't find any clue as to how to enable the NFD extended resources feature through the OpenShift NFD operator.
As I understand it, on OpenShift NFD should be managed and configured by the NFD Operator. Right?
Thanks in advance!
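For what it's worth, a sketch of how this might be expressed through the operator's NodeFeatureDiscovery CR; the extraLabelNs and resourceLabels fields are assumptions based on the operator CRD and may not exist in every release:
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  # allow labels/resources in the sgx.intel.com namespace (assumed field names)
  extraLabelNs:
    - sgx.intel.com
  resourceLabels:
    - sgx.intel.com/epc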
For product deployment we use some NFD labels on nodes; one of those labels has the key "feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION".
After upgrading OCP from v4.9.x to 4.10.6, the above label was missing on the worker nodes:
# oc get no ocp-w-2.lab.ocp.lan -ojsonpath={.metadata.labels} | jq
{
"IS-infoscalecluster-dev": "true",
"IS-infoscalecluster-dev-33221": "true",
"beta.kubernetes.io/arch": "amd64",
"beta.kubernetes.io/os": "linux",
"feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
"feature.node.kubernetes.io/cpu-cpuid.CLZERO": "true",
"feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
"feature.node.kubernetes.io/cpu-cpuid.HYPERVISOR": "true",
"feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW": "true",
"feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
"feature.node.kubernetes.io/cpu-cpuid.SSE4A": "true",
"feature.node.kubernetes.io/cpu-cpuid.SUCCOR": "true",
"feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
"feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
"feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
"feature.node.kubernetes.io/kernel-config.NO_HZ_FULL": "true",
"feature.node.kubernetes.io/kernel-selinux.enabled": "true",
"feature.node.kubernetes.io/kernel-version.full": "4.18.0-305.40.2.el8_4.x86_64",
"feature.node.kubernetes.io/kernel-version.major": "4",
"feature.node.kubernetes.io/kernel-version.minor": "18",
"feature.node.kubernetes.io/kernel-version.revision": "0",
"feature.node.kubernetes.io/pci-15ad.present": "true",
"feature.node.kubernetes.io/system-os_release.ID": "rhcos",
"feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "410.84.202203221702-0",
"feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "8.4",
"feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.10",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "10",
"kubernetes.io/arch": "amd64",
"kubernetes.io/hostname": "ocp-w-2.lab.ocp.lan",
"kubernetes.io/os": "linux",
"node-role.kubernetes.io/worker": "",
"node.openshift.io/os_id": "rhcos",
"specialresource.openshift.io/state-infoscale-vtas-1000": "Ready"
}
# oc get clusterversions.config.openshift.io
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.10.6 True False 10d Cluster version is 4.10.6
# oc get csv
NAME DISPLAY VERSION REPLACES PHASE
nfd.4.10.0-202204010807 Node Feature Discovery Operator 4.10.0-202204010807 nfd.4.10.0-202203271029 Succeeded