As of <a href="https://github.com/openshift/cluster-nfd-operator/commit/5b142e291b5475

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Broken on 3 node clusters about cluster-nfd-operator HOT 24 CLOSED

openshift commented on June 24, 2024

Broken on 3 node clusters

from cluster-nfd-operator.

Comments (24)

rmkraus commented on June 24, 2024 3

@ArangoGutierrez Yes, the latest version of the nfd operator works for me!

@deanpeterson For the sale of K8s debugging tricks: I just duplicated the DaemonSet and made my desired changes. That’s definitely not something you’d want to do on a production cluster, but it’s useful for quick debugging.

from cluster-nfd-operator.

rmkraus commented on June 24, 2024

I created my own nfd-worker ds and I was able to work around the issue. Obviously not a great permanent fix, though:

kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: nfd-worker-new-hotness
  namespace: gpu-operator-resources
  labels:
    app: nfd-worker
spec:
  selector:
    matchLabels:
      app: nfd-worker
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: nfd-worker
    spec:
      restartPolicy: Always
      serviceAccountName: nfd-worker
      schedulerName: default-scheduler
      hostNetwork: true
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - resources: {}
          terminationMessagePath: /dev/termination-log
          name: nfd-worker
          command:
            - nfd-worker
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: spec.nodeName
          securityContext:
            capabilities:
              drop:
                - ALL
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
          imagePullPolicy: Always
          volumeMounts:
            - name: host-boot
              readOnly: true
              mountPath: /host-boot
            - name: host-os-release
              readOnly: true
              mountPath: /host-etc/os-release
            - name: host-sys
              mountPath: /host-sys
            - name: config
              mountPath: /etc/kubernetes/node-feature-discovery
            - name: nfd-hooks
              mountPath: /etc/kubernetes/node-feature-discovery/source.d
            - name: nfd-features
              mountPath: /etc/kubernetes/node-feature-discovery/features.d
          terminationMessagePolicy: File
          image: 'quay.io/openshift/origin-node-feature-discovery:latest'
          args:
            - '--sleep-interval=60s'
            - '--server=$(NFD_MASTER_SERVICE_HOST):$(NFD_MASTER_SERVICE_PORT)'
      serviceAccount: nfd-worker
      volumes:
        - name: host-boot
          hostPath:
            path: /boot
            type: ''
        - name: host-os-release
          hostPath:
            path: /etc/os-release
            type: ''
        - name: host-sys
          hostPath:
            path: /sys
            type: ''
        - name: nfd-hooks
          hostPath:
            path: /etc/kubernetes/node-feature-discovery/source.d
            type: ''
        - name: nfd-features
          hostPath:
            path: /etc/kubernetes/node-feature-discovery/features.d
            type: ''
        - name: config
          configMap:
            name: nfd-worker
            items:
              - key: nfd-worker-conf
                path: nfd-worker.conf
            defaultMode: 420
      dnsPolicy: ClusterFirst
      tolerations:
        - operator: Exists
          effect: NoSchedule
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  revisionHistoryLimit: 10

from cluster-nfd-operator.

zvonkok commented on June 24, 2024

I think we can do the following to support both configurations.

multiple nodeSelectorTerms, the pod can be scheduled if one of the terms can be satisfied
multiple matchExpressions, the pod can be scheduled only if all expression are satisfied

Since we're tolerating all Taints we need to exclude masters but allow scheduling on a worker with a label or without a label.

      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
              - key:  node-role.kubernetes.io/master
                operator: DoesNotExist
          nodeSelectorTerms:
            - matchExpressions:
              - key:  node-role.kubernetes.io/worker
                operator: Exists

If we have a three-node cluster with nodes that have the worker and master label the first term would fail but the second would be valid -> pod scheduled on the node.

If we have three masters with label master the first term would repel them from the master(s) and nodes with worker label would get assigned because the second term is also valid.

If we have three masters with label master the first term would repel them from the master(s) which means the first term is valid and the second term will fail on workers that do not have a worker label. On nodes, without any label, the first term would be valid and hence the pod scheduled on the nodes.

I hope I do not have a mistake in thinking. @rmkraus @marquiz @ArangoGutierrez PTAL.

@marquiz We may need to update the upstream version as well.

from cluster-nfd-operator.

rmkraus commented on June 24, 2024

@zvonkok I have confirmed that the following affinity rules work as specified. These rules are exactly what you posted with the duplicate nodeSelectorTerms key removed.

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/master
                    operator: DoesNotExist
              - matchExpressions:
                  - key: node-role.kubernetes.io/worker
                    operator: Exists

from cluster-nfd-operator.

zvonkok commented on June 24, 2024

@rmkraus Wait, this should not work since all matchExpressions need to be satisfied for a Pod to be scheduled.
Can you show me the labels/annotations of your nodes? Which in the end means you have a node with a worker and without a master label, which makes "no sense" if you have the three node cluster, where each node should have the master and worker label ....

from cluster-nfd-operator.

rmkraus commented on June 24, 2024

Negative, from K8s docs:

If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms can be satisfied.

If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions is satisfied.

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

from cluster-nfd-operator.

rmkraus commented on June 24, 2024

So the conditions in the nodeSelectorTerms list are combined with OR. The conditions in the matchExpressions list are combined with AND.

from cluster-nfd-operator.

zvonkok commented on June 24, 2024

Negative, from K8s docs:

If you specify multiple nodeSelectorTerms associated with nodeAffinity types, then the pod can be scheduled onto a node if one of the nodeSelectorTerms can be satisfied.
If you specify multiple matchExpressions associated with nodeSelectorTerms, then the pod can be scheduled onto a node only if all matchExpressions is satisfied.

Here: ^^ only if all matchExpressions are satisfied ... Am I misunderstanding ? ... well english is not my first language so I may miss something :)

https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/

from cluster-nfd-operator.

zvonkok commented on June 24, 2024

Yes and the matchExpressions ANDed means label master does not exist and label worker does exist. That is why I am confused.

from cluster-nfd-operator.

rmkraus commented on June 24, 2024

English is my first language and it is still not super clear. ;-)

I understand that to mean:
When you define a matchExpressions clause, it contains a list. All of the items in the list must be true to get pod placement.
When you define a nodeSelectorTerms clause, it also contains a list. Only one of the items in that list needs to be true in order to get pod placement.

In the selector clause I posted, we have one nodeSelectorTerms clause that contains two distinct matchExpressions clauses. The matchExpresions clauses are evaluated separately and then the parent nodeSelectorTerms clause is evaluated.

from cluster-nfd-operator.

zvonkok commented on June 24, 2024

Alrighty, this makes sense, I was thinking on the wrong level of abstraction ... Make sense thanks for verifying this, we need to create a fix for 4.7 and backport it to 4.6.x z-stream.

from cluster-nfd-operator.

zvonkok commented on June 24, 2024

@ArangoGutierrez we need a fix ASAP for this, please also cherry-pick it for 4.6, thanks!

from cluster-nfd-operator.

rmkraus commented on June 24, 2024

👍 Thanks all!

from cluster-nfd-operator.

ArangoGutierrez commented on June 24, 2024

Ack

from cluster-nfd-operator.

ArangoGutierrez commented on June 24, 2024

We would need a fully reported Bugzilla to perform the backport
@zvonkok @rmkraus

from cluster-nfd-operator.

rmkraus commented on June 24, 2024

I unfortunately do not have BZ access. 😢

from cluster-nfd-operator.

ArangoGutierrez commented on June 24, 2024

No worries, @zvonkok I am testing the fix on the latest kube upstream. could you create the BZ and Assign to me

from cluster-nfd-operator.

deanpeterson commented on June 24, 2024

@rmkraus, when I try to modify the Daemonset directly in Openshift it will not let me as it is being managed by nfd-master-server. How were you able to apply your affinity overrides?

from cluster-nfd-operator.

ArangoGutierrez commented on June 24, 2024

Hi @deanpeterson the new version of the operator will come with full support for 3 node clusters
the patch was merged yesterday and will be part of the payload for OCP 4.7. in the mean wile, I am more than happy to assist you in your use case

from cluster-nfd-operator.

deanpeterson commented on June 24, 2024

Thanks @ArangoGutierrez! Is there an easy way to modify a 4.6 cluster to overwrite the affinity rules in the nfd-worker daemonset?

from cluster-nfd-operator.

ArangoGutierrez commented on June 24, 2024

current master branch on this repo is able to do so.
are you familiar on how to deploy the operator using the Makefile on this repo?

from cluster-nfd-operator.

ArangoGutierrez commented on June 24, 2024

@rmkraus does the state of current master branch satisfy your use case?
if so please feel free to close this issue
thanks for your interest in the project and helping us enhance it!

from cluster-nfd-operator.

deanpeterson commented on June 24, 2024

No I wasn't aware of the Makefile to replace the existing nfd operator

from cluster-nfd-operator.

ArangoGutierrez commented on June 24, 2024

Awesome !!
Thanks for your feedback, is always appreciated and needed to enhance thee operator

from cluster-nfd-operator.

Broken on 3 node clusters about cluster-nfd-operator HOT 24 CLOSED

Comments (24)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent