Comments (22)

AxeZhan avatar AxeZhan commented on June 29, 2024 2

/sig scheduling

/assign

🤦. I'll come up with a fix.

alculquicondor avatar alculquicondor commented on June 29, 2024 1

For comparison, in the node affinity plugin, we only PreFilter nodes based on the metadata.name field selector:

if r.Key == metav1.ObjectNameField && r.Operator == v1.NodeSelectorOpIn {

Maybe we should revert #109877?

Or somehow limit it to field selectors, as opposed to label selectors.
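
For reference, here is a paraphrased sketch of that approach (hypothetical helper name, not the exact upstream code): only matchFields requirements on metadata.name with an In operator narrow the candidate set, and a term without one leaves the node set unrestricted.

	// Assumes:
	//   v1 "k8s.io/api/core/v1"
	//   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	//   "k8s.io/apimachinery/pkg/util/sets"
	func nodeNamesFromFieldSelectors(terms []v1.NodeSelectorTerm) sets.Set[string] {
		var all sets.Set[string]
		for _, t := range terms {
			var termNames sets.Set[string]
			for _, r := range t.MatchFields { // field selectors only, never label-based MatchExpressions
				if r.Key == metav1.ObjectNameField && r.Operator == v1.NodeSelectorOpIn {
					s := sets.New(r.Values...)
					if termNames == nil {
						termNames = s
					} else {
						termNames = termNames.Intersection(s) // requirements within a term are ANDed
					}
				}
			}
			if termNames == nil {
				return nil // this term can match arbitrary nodes, so don't narrow at all
			}
			all = all.Union(termNames) // terms are ORed
		}
		return all
	}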

liggitt avatar liggitt commented on June 29, 2024 1

Actually, should we support this?

Is this one of the validations we are lacking? cc @liggitt @thockin

I'm not sure what is being asked.

Various cloud providers have a variety of methods of naming nodes (documented in kubernetes/website#8873, though that file appears to be gone now... maybe with the out-of-tree cloud provider move)

Are you asking if the way the kubelet sets the hostname label value should change? That will almost certainly break some existing users of the label on clouds where the current label value != the node name.

Are you asking if the API server should disallow the label being different than the node name? That will break nodes in environments where those differ.

alculquicondor avatar alculquicondor commented on June 29, 2024 1

You also have to consider what's feasible to cherry-pick. But now we are late for this patch release.

AxeZhan avatar AxeZhan commented on June 29, 2024 1

I have created #125398 as the fix for the recent releases, which reverts #109877 and adds a unit test for this.

In the long run, I think we need to proceed with the following steps:

  1. Change the way we handle a PV's nodeAffinity: take the node's name and the affinity's matchFields into consideration. Today CheckNodeAffinity only builds a Node from the labels:

	func CheckNodeAffinity(pv *v1.PersistentVolume, nodeLabels map[string]string) error {
		if pv.Spec.NodeAffinity == nil {
			return nil
		}
		if pv.Spec.NodeAffinity.Required != nil {
			node := &v1.Node{ObjectMeta: metav1.ObjectMeta{Labels: nodeLabels}}
			terms := pv.Spec.NodeAffinity.Required
			if matches, err := corev1.MatchNodeSelectorTerms(node, terms); err != nil {
				return err
			} else if !matches {
				return fmt.Errorf("no matching NodeSelectorTerms")
			}
		}
		return nil
	}

     It would also take the node name (e.g. an extra nodeName argument) so that matchFields on metadata.name can match:

	if pv.Spec.NodeAffinity.Required != nil {
		node := &v1.Node{ObjectMeta: metav1.ObjectMeta{
			Name:   nodeName,
			Labels: nodeLabels,
		}}
		terms := pv.Spec.NodeAffinity.Required
		if matches, err := corev1.MatchNodeSelectorTerms(node, terms); err != nil {
			return err
		} else if !matches {
			return fmt.Errorf("no matching NodeSelectorTerms")
		}
	}
  2. Update the documentation to recommend that users and provisioners use matchFields with metadata.name to define the affinity, instead of matchExpressions with the hostname label (a sketch of that shape follows after this list).
  3. Finally, bring back #109877, but extract node names from the affinity's matchFields with metadata.name instead of from the hostname label. The current helper looks like this:
	func GetLocalPersistentVolumeNodeNames(pv *v1.PersistentVolume) []string {
		if pv == nil || pv.Spec.NodeAffinity == nil || pv.Spec.NodeAffinity.Required == nil {
			return nil
		}
		var result sets.Set[string]
		for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
			var nodes sets.Set[string]
			for _, matchExpr := range term.MatchExpressions {
				if matchExpr.Key == v1.LabelHostname && matchExpr.Operator == v1.NodeSelectorOpIn {
					if nodes == nil {
						nodes = sets.New(matchExpr.Values...)
					} else {
						nodes = nodes.Intersection(sets.New(matchExpr.Values...))
					}
				}
			}
			result = result.Union(nodes)
		}
		return sets.List(result)
	}

     The extraction loop would instead walk matchFields keyed by metadata.name:

	for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
		var nodes sets.Set[string]
		for _, matchExpr := range term.MatchFields {
			if matchExpr.Key == "metadata.name" && matchExpr.Operator == v1.NodeSelectorOpIn {
				if nodes == nil {
					nodes = sets.New(matchExpr.Values...)
				} else {
					nodes = nodes.Intersection(sets.New(matchExpr.Values...))
				}
			}
		}
		result = result.Union(nodes)
	}
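
A minimal sketch of the affinity shape that step 2 recommends, assuming a made-up helper and node name (illustration only, not any provisioner's actual code):

	// Pin a local PV to a node via a metadata.name field selector
	// rather than via the kubernetes.io/hostname label.
	// Assumes:
	//   v1 "k8s.io/api/core/v1"
	//   metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	func nameBasedAffinity(nodeName string) *v1.VolumeNodeAffinity {
		return &v1.VolumeNodeAffinity{
			Required: &v1.NodeSelector{
				NodeSelectorTerms: []v1.NodeSelectorTerm{{
					MatchFields: []v1.NodeSelectorRequirement{{
						Key:      metav1.ObjectNameField, // "metadata.name"
						Operator: v1.NodeSelectorOpIn,
						Values:   []string{nodeName},
					}},
					// instead of MatchExpressions on v1.LabelHostname
				}},
			},
		}
	}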

Sounds reasonable?

k8s-ci-robot avatar k8s-ci-robot commented on June 29, 2024

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

jan-g avatar jan-g commented on June 29, 2024

Call path for prefilter:

volume-binding prefilter https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/volumebinding/volume_binding.go#L176
calls GetEligibleNodes here https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/volumebinding/volume_binding.go#L199
calls GetLocalPersistentVolumeNodeNames here https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/volumebinding/binder.go#L392
which does not do what it claims here https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/util.go#L573 - returns a set of hostname label contents, not node names.

Scheduler calls schedulePod here https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one.go#L390
calls findNodesThatFitPod here https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one.go#L442
which treats the PreFilterResult.NodeNames as node names (not labels) here: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one.go#L486-L495
nodes is empty beyond that point.
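
A toy illustration of that mismatch (made-up names and values, not scheduler code): the snapshot is keyed by metadata.name, so hostname label values coming out of the prefilter simply don't resolve.

	package main

	import "fmt"

	func main() {
		// Node snapshot keyed by metadata.name.
		snapshot := map[string]bool{"node-abc123": true}
		// What the prefilter actually put into PreFilterResult.NodeNames: hostname label values.
		preFilterNodeNames := []string{"ip-10-0-0-1.ec2.internal"}

		for _, name := range preFilterNodeNames {
			if !snapshot[name] {
				fmt.Printf("nodeinfo not found for %q; no feasible nodes remain\n", name)
			}
		}
	}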

jan-g avatar jan-g commented on June 29, 2024

The behaviour for 1.27 is to bail out with a "nodeinfo not found" error; the result with the master branch looks less obvious.

alculquicondor avatar alculquicondor commented on June 29, 2024

cc @gabesaba

alculquicondor avatar alculquicondor commented on June 29, 2024

Actually, should we support this?

Is this one of the validations we are lacking? cc @liggitt @thockin

AxeZhan avatar AxeZhan commented on June 29, 2024

I found a (maybe) related issue from a long time ago: #61410 (comment)

That is not a valid assumption. Node name and hostname are distinct and are definitively known to be different in some clusters.

It seems we allowed distinct nodeName and hostname before. Not sure what the situation is now.

jan-g avatar jan-g commented on June 29, 2024

The situation is this: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#built-in-node-labels

The value of these labels is cloud provider specific and is not guaranteed to be reliable. For example, the value of kubernetes.io/hostname may be the same as the node name in some environments and a different value in other environments.

If this is going to be changed I think it should be done via a proper deprecation cycle; there are clusters at the moment that are broken by the bug outlined in this issue.

(I understand the desire here - originally k8s had a design constraint that clusters would have a maximum size of about 10k entities; under those circumstances, simple algorithms - i.e., for loops over everything - work well. Some of the commentary about this prefilter mentioned testing under simulation with 15k nodes and 60k pods. I think that was one of the drivers to include this shortcut. As I understand it there's still not really a first-class notion of attribute indexing in k8s' data model; if there's a real need to move k8s from being suitable as a substrate for appliances to a large-scale general-purpose compute platform, then efficiently supporting attribute-based queries seems to be pretty critical, but I think that is beyond the scope of this issue.)

alculquicondor avatar alculquicondor commented on June 29, 2024

Are you asking if the API server should disallow the label being different than the node name? That will break nodes in environments where those differ.

I was asking about this.

It sounds like we should be reverting #109877 then.

Note that the cherry-pick deadline is tomorrow. I think that will be the last 1.27 patch release.

alculquicondor avatar alculquicondor commented on June 29, 2024

FYI @msau42 and @Huang-Wei as reviewers of that PR.

AxeZhan avatar AxeZhan commented on June 29, 2024

It sounds like we should be reverting #109877 then.

So, we do not consider converting hostname to nodeName in the prefilter for performance reasons?

alculquicondor avatar alculquicondor commented on June 29, 2024

So, we do not consider converting hostname to nodeName in the prefilter for performance reasons?

How do you suggest we do it? If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.
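
For concreteness, a hypothetical sketch of what that conversion would involve (helper name made up): inverting the hostname label back to node names needs a pass over every node, which is roughly the same O(nodes) work that Filter already does.

	// Assumes:
	//   v1 "k8s.io/api/core/v1"
	//   "k8s.io/apimachinery/pkg/util/sets"
	func hostnamesToNodeNames(allNodes []*v1.Node, hostnames sets.Set[string]) sets.Set[string] {
		result := sets.New[string]()
		for _, n := range allNodes {
			// kubernetes.io/hostname is only a label; map it back to metadata.name.
			if hostnames.Has(n.Labels[v1.LabelHostname]) {
				result.Insert(n.Name)
			}
		}
		return result
	}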

AxeZhan avatar AxeZhan commented on June 29, 2024

If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.

I think this will be a bit faster than running the filter as this is a quick check. But yea, maybe it's not worth it. No strong opinion here.

liggitt avatar liggitt commented on June 29, 2024

If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.

I think this will be a bit faster than running the filter as this is a quick check. But yea, maybe it's not worth it. No strong opinion here.

but that requires the prefilter step to have access to all nodes, which it doesn't today... that seems like a big change

AxeZhan avatar AxeZhan commented on June 29, 2024

Yes, this doesn't seem to align with the meaning of 'prefilter'. 😂

It looks like our only option is to revert #109877.

msau42 avatar msau42 commented on June 29, 2024

Agree let's revert now and we can take more time to discuss and see if there's a better solution.

jan-g avatar jan-g commented on June 29, 2024

I don't want to get in the way of mitigating this in the short term. Having said that, better approaches might include fixing up the local-path provisioner for these cases to provide a direct constraint - eg, rancher/local-path-provisioner#414 and its attached PR.

AxeZhan avatar AxeZhan commented on June 29, 2024

/sig storage
