Comments (22)
/sig scheduling
/assign
🤦 I'll come up with a fix.
from kubernetes.
For comparison, in the node affinity plugin, we only PreFilter nodes based on the `metadata.name` field selector:
Maybe we should revert #109877?
Or somehow limit it to field selectors, as opposed to label selectors.
Actually, should we support this?
Is this one of the validations we are lacking? cc @liggitt @thockin
I'm not sure what is being asked.
Various cloud providers have a variety of methods of naming nodes (documented in kubernetes/website#8873, though that file appears to be gone now... maybe it was removed with the out-of-tree cloud provider move)
Are you asking if the way the kubelet sets the hostname label value should change? That will almost certainly break some existing users of the label on clouds where the current label value != the node name.
Are you asking if the API server should disallow the label being different than the node name? That will break nodes in environments where those differ.
You also have to consider what's feasible to cherry-pick. But now we are late for this patch release.
I have created #125398 as the fix for the recent releases; it reverts #109877 and adds a unit test for this.
In the long run, I think we need to proceed with the following steps:
- Change the way we handle the PV's `nodeAffinity`, taking the node's name and the affinity's `matchFields` into consideration:

```go
if pv.Spec.NodeAffinity.Required != nil {
	node := &v1.Node{ObjectMeta: metav1.ObjectMeta{
		Name:   nodeName,
		Labels: nodeLabels,
	}}
	terms := pv.Spec.NodeAffinity.Required
	matches, err := corev1.MatchNodeSelectorTerms(node, terms)
	// ... use matches/err to decide whether the PV can serve this node
}
```
- Update the documentation to recommend that users and provisioners use `matchFields` with `metadata.name` to define affinity, instead of using `matchExpressions` with `hostname`.
- Finally, bring back #109877, but extract `matchFields` with `metadata.name` from the PV's affinity instead of `hostname`.
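As an illustration of that documentation recommendation, a local PV could pin to the Node object name via `matchFields` rather than the hostname label. This is a hedged sketch; the PV name, device path, and node name are made up:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv        # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes: ["ReadWriteOnce"]
  local:
    path: /mnt/disks/ssd1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchFields:          # recommended: matches the Node object name
            - key: metadata.name
              operator: In
              values: ["i-0abc123"]
        # instead of:
        # - matchExpressions:   # label value may differ from the node name
        #     - key: kubernetes.io/hostname
        #       operator: In
        #       values: ["worker-1"]
```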
kubernetes/pkg/volume/util/util.go
Lines 573 to 594 in 6b73ccc
```go
for _, term := range pv.Spec.NodeAffinity.Required.NodeSelectorTerms {
	var nodes sets.Set[string]
	for _, matchExpr := range term.MatchFields {
		if matchExpr.Key == "metadata.name" && matchExpr.Operator == v1.NodeSelectorOpIn {
			if nodes == nil {
				nodes = sets.New(matchExpr.Values...)
			} else {
				nodes = nodes.Intersection(sets.New(matchExpr.Values...))
			}
		}
	}
	result = result.Union(nodes)
}
```
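The intersection/union semantics of that excerpt can be sketched with a minimal, self-contained model. The types below are hypothetical simplifications of the real `k8s.io/api/core/v1` types, using plain maps instead of `sets.Set`:

```go
package main

import "fmt"

// Simplified stand-ins for the Kubernetes NodeSelector types.
type nodeSelectorRequirement struct {
	Key      string
	Operator string
	Values   []string
}

type nodeSelectorTerm struct {
	MatchFields []nodeSelectorRequirement
}

// eligibleNodes mirrors the excerpt's logic: within one term, multiple
// metadata.name requirements are intersected; across terms, the results
// are unioned (terms are ORed in NodeSelector semantics).
func eligibleNodes(terms []nodeSelectorTerm) map[string]bool {
	result := map[string]bool{}
	for _, term := range terms {
		var nodes map[string]bool
		for _, req := range term.MatchFields {
			if req.Key == "metadata.name" && req.Operator == "In" {
				set := map[string]bool{}
				for _, v := range req.Values {
					set[v] = true
				}
				if nodes == nil {
					nodes = set
				} else {
					for n := range nodes {
						if !set[n] {
							delete(nodes, n)
						}
					}
				}
			}
		}
		for n := range nodes {
			result[n] = true
		}
	}
	return result
}

func main() {
	terms := []nodeSelectorTerm{
		{MatchFields: []nodeSelectorRequirement{
			{Key: "metadata.name", Operator: "In", Values: []string{"node-a", "node-b"}},
			{Key: "metadata.name", Operator: "In", Values: []string{"node-b", "node-c"}},
		}},
		{MatchFields: []nodeSelectorRequirement{
			{Key: "metadata.name", Operator: "In", Values: []string{"node-d"}},
		}},
	}
	// Term 1 intersects to {node-b}; term 2 contributes {node-d}.
	fmt.Println(eligibleNodes(terms))
}
```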
Sounds reasonable?
This issue is currently awaiting triage.
If a SIG or subproject determines this is a relevant issue, they will accept it by applying the `triage/accepted` label and provide further guidance.
The `triage/accepted` label can be added by org members by writing `/triage accepted` in a comment.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Call path for prefilter:
volume-binding prefilter https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/volumebinding/volume_binding.go#L176
calls GetEligibleNodes here https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/volumebinding/volume_binding.go#L199
calls GetLocalPersistentVolumeNodeNames here https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/volumebinding/binder.go#L392
which does not do what it claims here https://github.com/kubernetes/kubernetes/blob/master/pkg/volume/util/util.go#L573 - returns a set of hostname label contents, not node names.
Scheduler calls schedulePod here https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one.go#L390
calls findNodesThatFitPod here https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one.go#L442
which treats the PreFilterResult.NodeNames as node names (not labels) here: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/schedule_one.go#L486-L495
`nodes` is empty beyond that point.
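The mismatch in that call path can be shown with a minimal, self-contained model. The types and functions below (`prefilterNodeNames`, `feasibleNodes`) are hypothetical stand-ins for the real code linked above, not the actual implementations:

```go
package main

import "fmt"

// node pairs a Node object name with its kubernetes.io/hostname label value;
// on some clouds these differ, which is the scenario that breaks.
type node struct {
	name     string
	hostname string
}

// prefilterNodeNames mimics the buggy GetLocalPersistentVolumeNodeNames path:
// it returns hostname *label values* taken from the PV's affinity.
func prefilterNodeNames(pvHostnames []string) []string {
	return pvHostnames
}

// feasibleNodes mimics findNodesThatFitPod: it keeps only cluster nodes
// whose *object name* appears in the PreFilterResult.
func feasibleNodes(all []node, prefilter []string) []node {
	allowed := map[string]bool{}
	for _, n := range prefilter {
		allowed[n] = true
	}
	var out []node
	for _, n := range all {
		if allowed[n.name] {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	cluster := []node{{name: "i-0abc123", hostname: "worker-1"}}
	// The PV's affinity was written against the hostname label value.
	result := feasibleNodes(cluster, prefilterNodeNames([]string{"worker-1"}))
	fmt.Println(len(result)) // 0: the intersection is empty, so the pod is unschedulable
}
```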
The behaviour for 1.27 is to bail out with a `nodeinfo not found` error; the result with the master branch looks less obvious.
cc @gabesaba
> Actually, should we support this?
> Is this one of the validations we are lacking? cc @liggitt @thockin
I found a related(maybe) issue from a long time ago: #61410 (comment)
> That is not a valid assumption. Node name and hostname are distinct and are definitively known to be different in some clusters.

It seems we allowed a distinct nodeName and hostname before. Not sure what the situation is now.
The situation is this: https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#built-in-node-labels
> The value of these labels is cloud provider specific and is not guaranteed to be reliable. For example, the value of kubernetes.io/hostname may be the same as the node name in some environments and a different value in other environments.
If this is going to be changed I think it should be done via a proper deprecation cycle; there are clusters at the moment that are broken by the bug outlined in this issue.
(I understand the desire here; originally k8s had a design constraint that clusters would have a maximum size of about 10k entities, and under those circumstances simple algorithms, i.e. `for` loops over everything, work well. Some of the commentary about this prefilter mentioned testing under simulation with 15k nodes, 60k pods. I think that was one of the drivers to include this shortcut. As I understand it there's still not really a first-class notion of attribute indexing in k8s' data model; if there's a real need to move k8s from being suitable as a substrate for appliances to a large-scale general-purpose compute platform, then efficiently supporting attribute-based queries seems to be pretty critical, but I think that is beyond the scope of this issue.)
> Are you asking if the API server should disallow the label being different than the node name? That will break nodes in environments where those differ.
I was asking about this.
It sounds like we should be reverting #109877 then.
Note that the cherry-pick deadline is tomorrow. I think that will be the last 1.27 patch release.
FYI @msau42 and @Huang-Wei as reviewers of that PR.
> It sounds like we should be reverting #109877 then.
So, we do not consider converting hostname to nodeName in the prefilter for performance reasons?
> So, we do not consider converting hostname to nodeName in the prefilter for performance reasons?
How do you suggest we do it? If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.
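For reference, the translation being discussed would look roughly like the sketch below. The types are simplified stand-ins for `*v1.Node`, and `hostnameIndex` is hypothetical; building it requires seeing every node, and keeping it current is the hard part:

```go
package main

import "fmt"

// labeledNode is a simplified stand-in for *v1.Node: an object name plus labels.
type labeledNode struct {
	name   string
	labels map[string]string
}

// hostnameIndex builds a kubernetes.io/hostname -> node-name lookup.
// A slice is used per hostname because the label value is not guaranteed
// to be unique across nodes.
func hostnameIndex(nodes []labeledNode) map[string][]string {
	idx := map[string][]string{}
	for _, n := range nodes {
		if h, ok := n.labels["kubernetes.io/hostname"]; ok {
			idx[h] = append(idx[h], n.name)
		}
	}
	return idx
}

func main() {
	nodes := []labeledNode{
		{name: "i-0abc123", labels: map[string]string{"kubernetes.io/hostname": "worker-1"}},
		{name: "i-0def456", labels: map[string]string{"kubernetes.io/hostname": "worker-2"}},
	}
	fmt.Println(hostnameIndex(nodes)["worker-1"]) // [i-0abc123]
}
```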
> If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.
I think this will be a bit faster than running the filter as this is a quick check. But yea, maybe it's not worth it. No strong opinion here.
> If we are going to list all nodes to see which ones match, that might be similar to just letting Filter run.
>
> I think this will be a bit faster than running the filter as this is a quick check. But yea, maybe it's not worth it. No strong opinion here.
but that requires the prefilter step to have access to all nodes, which it doesn't today... that seems like a big change
Yes, this doesn't seem to align with the meaning of 'prefilter'. 😂
It looks like our only option is to revert #109877.
Agreed, let's revert now, and we can take more time to discuss and see if there's a better solution.
I don't want to get in the way of mitigating this in the short term. Having said that, better approaches might include fixing up the local-path provisioner for these cases to provide a direct constraint, e.g. rancher/local-path-provisioner#414 and its attached PR.
/sig storage