kubernetes / node-problem-detector
This is a place for various problem detectors running on the Kubernetes nodes.
License: Apache License 2.0
Write Readme.md for KernelMonitor to demonstrate:
/cc @kubernetes/node-problem-detector-maintainers
I have a node running CoreOS 1221.0.0 with kernel version 4.8.6-coreos.
The node-problem-detector marked it with "KernelDeadlock True Sun, 04 Dec 2016 18:56:20 -0800 Wed, 16 Nov 2016 00:03:33 -0800 UnregisterNetDeviceIssue unregister_netdevice: waiting for lo to become free. Usage count = 1".
If I check my kernel log, I see the following:
$ dmesg -T | grep -i unregister_netdevice -C 3
[Wed Nov 16 08:02:19 2016] docker0: port 5(vethfd2807b) entered blocking state
[Wed Nov 16 08:02:19 2016] docker0: port 5(vethfd2807b) entered forwarding state
[Wed Nov 16 08:02:19 2016] IPv6: eth0: IPv6 duplicate address fe80::42:aff:fe02:1206 detected!
[Wed Nov 16 08:03:33 2016] unregister_netdevice: waiting for lo to become free. Usage count = 1
[Wed Nov 16 08:14:35 2016] vethafecb94: renamed from eth0
[Wed Nov 16 08:14:35 2016] docker0: port 2(veth807b9e2) entered disabled state
[Wed Nov 16 08:14:35 2016] docker0: port 2(veth807b9e2) entered disabled state
Clearly, the node managed to continue to perform operations after printing that message. In addition, pods continue to function just fine and there aren't any long-term issues for me on this node.
I know that the config of what counts as a deadlock is configurable, but perhaps the default configuration shouldn't include this, or the check should be more advanced for it, since as-is it could be quite confusing.
Ideally, we should not use the journald config on a non-systemd node. If we do, NPD is expected to error out.
However, for some unknown reason, NPD will instead create a /var/log/journal directory and keep going. Related code: https://github.com/kubernetes/node-problem-detector/blob/master/pkg/systemlogmonitor/logwatchers/journald/log_watcher.go#L141
This should not matter much in practice, because nothing will actually write any log there, so NPD will just hang and not consume more resources. But we should still fix this, as sketched below.
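For illustration, a minimal sketch of the fix (function and package names are illustrative), failing fast instead of creating the directory:

```go
package journald

import (
	"fmt"
	"os"
)

// checkJournalDir fails fast when the journal directory is missing, so the
// journald watcher errors out on non-systemd nodes instead of creating it.
func checkJournalDir(dir string) error {
	if _, err := os.Stat(dir); os.IsNotExist(err) {
		return fmt.Errorf("journal directory %q does not exist; is systemd-journald running on this node?", dir)
	}
	return nil
}
```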
@kubernetes/node-problem-detector-maintainers
I cannot get the node problem detector to change a node status by injecting messages.
I am using kubernetes 1.5.2, Ubuntu 16.04, kernel 4.4.0-51-generic.
I run npd as a daemonset. I have attempted to get this to work with npd versions 0.3.0 and 0.4.0. I start npd with the default command, using /config/kernel-monitor.json because my nodes use journald.
I have /dev/kmsg mounted into the pod, and I echo expressions matching the regexes in kernel-monitor.json to /dev/kmsg on the node. I can view the fake logs I've echoed to /dev/kmsg in the pod.
Steps to reproduce:
# as root on the node where your pod is running
echo "task umount.aufs:123 blocked for more than 120 seconds." >> /dev/kmsg
# I have verified that these logs show up in journalctl -k
# this should match the following permanent condition in /config/kernel-monitor.json
# {
# "type": "permanent",
# "condition": "KernelDeadlock",
# "reason": "AUFSUmountHung",
# "pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
# },
# check the node status of the node where you ran this on
kubectl get node <node>
# status will still be Ready
# for further detail examine the json
kubectl get node <node> -o json | jq .status.conditions
# you will see that the KernelDeadlock condition is still "False"
# I would expect the general status to change to "KernelDeadlock"
If I am not testing this properly, could you please give a detailed breakdown of how to test that the node problem detector is working properly for kernel logs AND docker logs?
I have also reproduced this behavior using a custom docker_monitor.json and having the systemd docker service write to the journald docker logs. I have still been unsuccessful in getting the node status to change.
Use CurrentNodeName to get the cloud-provider-agnostic hostname.
I'm disabling this test on AWS for now, but we're marching towards full support for AWS. Please treat this as if it had broken a build when it was checked in originally.
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
28m 28m 1 {kubelet 10.8.65.157} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "kubernetes.io/host-path/5203c2d8-1b73-11e7-8754-46c88707ea0b-log" (spec.Name: "log") pod "5203c2d8-1b73-11e7-8754-46c88707ea0b" (UID: "5203c2d8-1b73-11e7-8754-46c88707ea0b").
28m 28m 1 {kubelet 10.8.65.157} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "kubernetes.io/host-path/5203c2d8-1b73-11e7-8754-46c88707ea0b-localtime" (spec.Name: "localtime") pod "5203c2d8-1b73-11e7-8754-46c88707ea0b" (UID: "5203c2d8-1b73-11e7-8754-46c88707ea0b").
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id e65759efc680; Security:[seccomp=unconfined]
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id e65759efc680
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id a7e78dfbe63a; Security:[seccomp=unconfined]
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id a7e78dfbe63a
27m 27m 2 {kubelet 10.8.65.157} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with CrashLoopBackOff: "Back-off 10s restarting failed container=node-problem-detector pod=node-problem-detector-gfx8q_default(5203c2d8-1b73-11e7-8754-46c88707ea0b)"
27m 27m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id 6d3ac4da1b6e; Security:[seccomp=unconfined]
27m 27m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id 6d3ac4da1b6e
27m 27m 2 {kubelet 10.8.65.157} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with CrashLoopBackOff: "Back-off 20s restarting failed container=node-problem-detector pod=node-problem-detector-gfx8q_default(5203c2d8-1b73-11e7-8754-46c88707ea0b)"
What goes wrong with the node after creating the daemonset? Is the node set to unschedulable?
Now I am running the daemonset on all nodes, but how do I verify it is useful?
It is a bit difficult to simulate these problems:
Hardware issues: Bad cpu, memory or disk;
Kernel issues: Kernel deadlock, corrupted file system;
Container runtime issues: Unresponsive runtime daemon;
...
Broken out from here
Currently, the config has a default of "startPattern": "Initializing cgroup subsys cpuset",
This pattern is meant to detect a node's boot process. Prior to the 4.5 kernel, this message was typically printed during boot of a node. After 4.5 however, due to this change, it is quite unlikely for that message to appear.
Furthermore, there's rarely a reason to detect whether a message is for the current boot in such a fragile way.
With the kern.log reader, every message is for the current boot because kern.log is usually handled such that each kern.log file corresponds to one boot (e.g. kern.log is this boot, kern.log.1 is the boot before, kern.log.2.gz the one before that, etc.). (EDIT: I'm wrong about this for gci at least)
With journald, the boot id is annotated in messages, and so it can accurately be correlated with the current boot id (see the "_BOOT_ID" record in journald messages).
With a kmsg reader, all messages will only be the current boot because kmsg is not persistent.
In none of those cases is startPattern useful. I think each kernel log parsing plugin should be responsible for doing the right thing itself.
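For illustration, a minimal sketch of the journald correlation, assuming the standard /proc boot_id file and a journal entry exposed as a field map (names are illustrative):

```go
package logwatcher

import (
	"os"
	"strings"
)

// currentBootID reads the kernel's boot id. Note that the proc file
// contains dashes while journald's _BOOT_ID field does not.
func currentBootID() (string, error) {
	raw, err := os.ReadFile("/proc/sys/kernel/random/boot_id")
	if err != nil {
		return "", err
	}
	return strings.ReplaceAll(strings.TrimSpace(string(raw)), "-", ""), nil
}

// isCurrentBoot reports whether a journal entry, represented here by its
// field map, was produced during the current boot.
func isCurrentBoot(fields map[string]string) (bool, error) {
	id, err := currentBootID()
	if err != nil {
		return false, err
	}
	return fields["_BOOT_ID"] == id, nil
}
```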
Problem:
kubelet, node-controller, and now the node-problem-detector all update the Node.Status field.
Normally this is not a problem: if changes happen rapidly, the resource-version mismatch fails the conflicting update and everything is ok.
When a new field is added to Node.Status, however:
- Since kubelet and node-controller are in the same repository, they are recompiled with the new field and Status updates continue to operate normally.
- node-problem-detector is in a separate repository and has not been recompiled with the new version, so its Update calls end up squashing the new (unknown) field, resetting it to nil.
Suggested Solution:
node-problem-detector should do a patch instead of an update to prevent wiping out fields it is not aware of, as sketched below. Incidentally, kubelet and node-controller should do this as well, but that is not as critical since they are in the same repository and new types are normally added there first.
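A minimal sketch of the suggested patch approach, assuming a recent client-go layout (the exact Patch signature varies across client-go versions; newer ones also take a context.Context):

```go
package problemclient

import (
	"encoding/json"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// patchCondition sends a strategic merge patch that touches only the
// condition NPD owns, instead of updating the whole Node.Status.
// NodeCondition lists are merged by their "type" key, so fields and
// conditions owned by other components are left intact.
func patchCondition(client kubernetes.Interface, node string, cond v1.NodeCondition) error {
	patch, err := json.Marshal(map[string]interface{}{
		"status": map[string]interface{}{
			"conditions": []v1.NodeCondition{cond},
		},
	})
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Nodes().Patch(node, types.StrategicMergePatchType, patch, "status")
	return err
}
```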
CC @kubernetes/sig-node @kubernetes/sig-api-machinery @Random-Liu @bgrant0607
Hello,
I am testing out the node-problem-detector cluster addon and am seeing this error on only 1 node:
failed to update node conditions: Timeout: request did not complete within allowed duration
This cluster has had issues in the past due to some etcd3 nodes becoming unhealthy, and flanneld being unable to connect to etcd3 at one point causing kube-apiserver to fail to respond, and hence kubelet, kube-controller-manager, kube-proxy, and kube-scheduler to log various api timeout errors.
After restarting everything, the rest of the nodes are healthy & both etcd3 + kube-apiserver appear to be contactable by the kube-system services. I then started npd-0.4.0 using the above manifest and am seeing only 1 node with this issue. The rest are sometimes logging what appears to be the race condition for updating node status exactly as in #108.
I've checked the 5 node etcd cluster health and only 1 of 5 is in unhealthy state. Requests from kubectl return most of the time, however I am seeing some timeouts from some kubectl describe node XXX commands.
Relevant log sections below:
/var/log/node-problem-detector.log:
[... BEGIN LOG ...]
I0617 05:08:14.386945 7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/var/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0617 05:08:14.387062 7 log_watchers.go:40] Use log watcher of plugin "journald"
I0617 05:08:14.387253 7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:filelog PluginConfig:map[timestamp:^.{15} message:kernel: \[.*\] (.*) timestampFormat:Jan _2 15:04:05] LogPath:/var/log/kern.log Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0617 05:08:14.387286 7 log_watchers.go:40] Use log watcher of plugin "filelog"
I0617 05:08:14.387923 7 log_monitor.go:81] Start log monitor
E0617 05:08:14.387946 7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor-filelog.json": failed to stat the file "/var/log/kern.log": stat /var/log/kern.log: no such file or directory
I0617 05:08:14.387952 7 log_monitor.go:81] Start log monitor
I0617 05:08:14.389906 7 log_watcher.go:69] Start watching journald
I0617 05:08:14.389947 7 problem_detector.go:74] Problem detector started
I0617 05:08:14.390046 7 log_monitor.go:173] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2017-06-17 05:08:14.390031295 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
E0617 06:41:07.988201 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2163685", currentResourceVersion: "2163729"):
diff1={"metadata":{"resourceVersion":"2163729"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T06:41:02Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 08:10:02.768550 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2181151", currentResourceVersion: "2181185"):
diff1={"metadata":{"resourceVersion":"2181185"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T08:10:02Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 08:40:02.392305 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 08:40:32.394968 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 08:40:52.688694 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 10:19:47.886026 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2206191", currentResourceVersion: "2206223"):
diff1={"metadata":{"resourceVersion":"2206223"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:47Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:20:03.868248 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2206223", currentResourceVersion: "2206276"):
diff1={"metadata":{"resourceVersion":"2206276"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:58Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:27:14.278874 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2207596", currentResourceVersion: "2207627"):
diff1={"metadata":{"resourceVersion":"2207627"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:27:11Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:49:49.999965 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:04:58.654339 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2214802", currentResourceVersion: "2214843"):
diff1={"metadata":{"resourceVersion":"2214843"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:04:58Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:13:25.349869 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:13:27.448321 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2216376", currentResourceVersion: "2216452"):
diff1={"metadata":{"resourceVersion":"2216452"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:13:25Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:27:54.707593 7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:28:03.676434 7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:28:11.687756 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2219181", currentResourceVersion: "2219220"):
diff1={"metadata":{"resourceVersion":"2219220"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:27:56Z","type":"KernelDeadlock"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:28:07Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:35:22.216998 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2220568", currentResourceVersion: "2220603"):
diff1={"metadata":{"resourceVersion":"2220603"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:35:21Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:35:39.515617 7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:36:44.370694 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2220826", currentResourceVersion: "2220882"):
diff1={"metadata":{"resourceVersion":"2220882"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:36:43Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:43:04.549531 7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:52:29.011584 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:55:55.741212 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
[...SNIP...]
E0617 20:48:43.758632 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:50:08.649485 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:50:36.742814 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:51:06.745291 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:51:18.310623 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:51:25.507013 7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:51:35.682252 7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:51:40.731491 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2329653", currentResourceVersion: "2329819"):
diff1={"metadata":{"resourceVersion":"2329819"},"status":{"conditions":[{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"MemoryPressure"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"OutOfDisk"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T20:51:39Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 20:52:05.903239 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:52:35.941987 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:53:05.990698 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:53:35.993274 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:54:05.995654 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:54:35.998026 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:55:06.009227 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:55:36.011974 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:56:01.937423 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:56:12.384973 7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:56:42.390952 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:57:12.393356 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:57:42.395943 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:58:12.405949 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:58:22.411682 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:58:33.305720 7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:58:34.805954 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2330223", currentResourceVersion: "2330380"):
diff1={"metadata":{"resourceVersion":"2330380"},"status":{"conditions":[{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"MemoryPressure"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"OutOfDisk"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T20:58:33Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 20:59:13.391379 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
[...SNIP...]
[... MESSAGE REPEATS APPROX EVERY ~1/2 SECOND ...]
[...SNIP...]
E0617 21:22:43.790298 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:23:13.792892 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:23:43.795469 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:24:13.798194 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:24:43.800739 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:25:13.803335 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:25:43.805904 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:26:13.808357 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:26:43.810985 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:27:13.813560 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:27:43.816150 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:28:13.886384 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:28:43.889035 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:29:13.891666 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:29:43.894478 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:30:13.897056 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:30:43.922019 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:31:13.947162 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:31:43.949919 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:32:13.952527 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:32:43.955028 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:33:13.957574 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:33:43.960186 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:34:13.962807 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:34:43.965335 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:35:13.967982 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:35:43.971220 7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
[...SNIP...]
[... MESSAGE REPEATS APPROX EVERY ~1/2 SECOND ...]
Write Readme.md for NodeProblemDetector to demonstrate:
/cc @kubernetes/node-problem-detector-maintainers
The default path to the kernel monitor config in node-problem-detector.go is '/config/kernel_monitor.json'. However, the file in the config/ directory of the repo is named 'kernel-monitor.json' (hyphen at one place and underscore at the other). They need to be named the same.
cc @Random-Liu
We ran a 4k-node scalability test and observed that fluentd gets oom-killed frequently. Because of that, npd seems to send too many patch node-status requests (~3k qps out of ~11k total qps).
Ref: kubernetes/kubernetes#47344 (comment) kubernetes/kubernetes#47865 (comment)
Why is it reporting OOM events when the kubelet already seems to do that?
Can we make npd report ooms for just system processes and let kubelet alone take care of k8s containers (by modifying npd's 'OOMKilling' rule)?
cc @Random-Liu @gmarek @kubernetes/sig-node-bugs
Panic crash:
kubectl get pods -o wide --all-namespaces
NAMESPACE NAME READY STATUS RESTARTS AGE NODE
default node-problem-detector-0kgkw 0/1 CrashLoopBackOff 3 1m 192.168.78.15
default node-problem-detector-ar3tk 0/1 CrashLoopBackOff 3 1m 192.168.78.16
kubectl logs node-problem-detector-0kgkw
I0623 01:02:18.560287 1 kernel_monitor.go:86] Finish parsing log file: {WatcherConfig:{KernelLogPath:/log/kern.log} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:UnregisterNetDeviceIssue Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+}]}
I0623 01:02:18.560413 1 kernel_monitor.go:93] Got system boot time: 2016-06-17 17:51:02.560408109 +0000 UTC
panic: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
goroutine 1 [running]:
panic(0x15fc280, 0xc8204fc000)
/usr/local/go/src/runtime/panic.go:464 +0x3e6
k8s.io/node-problem-detector/pkg/problemclient.NewClientOrDie(0x0, 0x0)
/usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/pkg/problemclient/problem_client.go:56 +0x132
k8s.io/node-problem-detector/pkg/problemdetector.NewProblemDetector(0x7faa4f155140, 0xc8202a6900, 0x0, 0x0)
/usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/pkg/problemdetector/problem_detector.go:45 +0x36
main.main()
/usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/node_problem_detector.go:33 +0x56
We already have a NPD e2e test in kubernetes repo: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node_problem_detector.go.
However, the e2e test doesn't actually test the NPD deployed in the cluster.
Although this test is still necessary, we should add a new test that really injects logs into the kernel log, or even triggers a kernel problem, and tests the real NPD deployed in the cluster.
The test is disruptive, so we may want to add a [Feature] tag and create a dedicated Jenkins job for it.
/cc @dchen1107
Discussed with @apatil.
Kernel monitor was initially introduced to monitor the kernel log and detect kernel issues.
However, it could in fact be extended to monitor other logs, such as the docker log, systemd log, etc., by adding a new translator. Currently this is already doable, but not very intuitive, because everything is named kernel xxx.
We should refactor the code to make it easier and more intuitive to extend kernel monitor, e.g. by renaming kernelmonitor to logmonitor. We'll only use log monitor to monitor the kernel log for K8s, but it should be easy for other users to reconfigure and extend it to monitor other logs.
/cc @kubernetes/sig-node
Failures happen.
The Node Problem Detector should not just panic when it fails to connect to the apiserver.
After adding the standalone mode in #49, I created and pushed a new docker image gcr.io/google_containers/node-problem-detector:v0.3 which includes these changes for use by kubemark. However, this version bump has to be reflected in:
@Random-Liu Hope this push doesn't disturb any plans you might be having for npd release schedule. If yes, it's not difficult to revert this change, as I just did it for use in kubemark.
Hi,
great initiative. Here is my experience on RHEL/CentOS 7
Apparently node-problem-detector is not able to work with the /run/log/journal/ (unfortunately) - correct? If so, it seems like a big limitation, no?
The issue is that even if it cannot do that, it also fails to start, as it depends on the existence of "/var/log/journal". Knowing that it's not able to use /run/log/journal, is it reasonable to expect that it will create "/var/log/journal" on each node?
"/var/log/kern.log" is missing on my OS. I guess it's not in use any longer or it should be configured?
Here are the logs:
I0621 14:32:46.173199 7 log_watchers.go:40] Use log watcher of plugin "journald"
I0621 14:32:46.186164 7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:filelog PluginConfig:map[timestamp:^.{15} message:kernel: \[.*\] (.*) timestampFormat:Jan _2 15:04:05] LogPath:/var/log/kern.log Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0621 14:32:46.186204 7 log_watchers.go:40] Use log watcher of plugin "filelog"
I0621 14:32:46.186926 7 log_monitor.go:81] Start log monitor
E0621 14:32:46.186945 7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor.json": failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0621 14:32:46.186951 7 log_monitor.go:81] Start log monitor
E0621 14:32:46.186958 7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor-filelog.json": failed to stat the file "/var/log/kern.log": stat /var/log/kern.log: no such file or directory
F0621 14:32:46.186964 7 node_problem_detector.go:88] Problem detector failed with error: no log montior is successfully setup
It would be good to put some documentation as to how to write a new detector. Maybe give an example of a trivial detector in the code base that contributors can take and extend.
Forked from #63 (comment).
We need a release instruction to standardize the release process. The release instruction should cover:
This has relatively lower priority than the features; we will file the doc before cutting the release.
Hi, I deployed npd yesterday on my k8s 1.6 cluster. Now I find it can't update the node condition. Below is its log:
E0418 09:48:32.826374 7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "cs55": there is a meaningful conflict (firstResourceVersion: "881081", currentResourceVersion: "881106"):
diff1={"metadata":{"resourceVersion":"881106"},"status":{"conditions":[{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-04-18T01:47:32Z","type":"KernelDeadlock"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-04-18T01:48:32Z","lastTransitionTime":"2017-04-17T09:10:43Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
I wonder why this happens? @Random-Liu thx
Currently NPD gets the hostname override from the NODE_NAME env. https://github.com/kubernetes/node-problem-detector/blob/master/pkg/problemclient/problem_client.go#L73
This is not friendly for standalone mode.
We should add a --hostname-override flag, and set the flag with NODE_NAME from the downward API in the yaml file. https://github.com/kubernetes/node-problem-detector/blob/master/node-problem-detector.yaml#L21
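For illustration, a minimal sketch of such a flag (variable names are hypothetical), falling back to NODE_NAME for compatibility:

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

var hostnameOverride = flag.String("hostname-override", "",
	"Custom node name used when reporting conditions and events; "+
		"falls back to the NODE_NAME environment variable.")

// nodeName resolves the node name: flag first, then NODE_NAME (e.g. set
// from the downward API in the daemonset yaml), then the OS hostname.
func nodeName() string {
	if *hostnameOverride != "" {
		return *hostnameOverride
	}
	if name := os.Getenv("NODE_NAME"); name != "" {
		return name
	}
	name, _ := os.Hostname()
	return name
}

func main() {
	flag.Parse()
	fmt.Println("reporting as node:", nodeName())
}
```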
As part of the effort to make testing using kubemark mimic real clusters as closely as possible, we are planning to add container for a "hollow-node-problem-detector" inside hollow-node alongside the existing containers "hollow-kubelet" and "hollow-kubeproxy". For this, it is required to have a node-problem-detector image which essentially has conditions and rules inside the kernel monitoring config set to an empty list. Also, this image should eventually (once it is tested to work fine in Kubemark) be pushed to gcr.io/google-containers.
@kubernetes/sig-scalability @wojtek-t @gmarek @Random-Liu
To move fast, we used https://github.com/hpcloud/tail for the first version.
However, it has proven to have some bugs, such as hpcloud/tail#21.
We should reconsider whether to try to fix those bugs or to implement a simple tail lib ourselves, based on google/cadvisor#1264.
We should add a changelog to make the release more clear and official.
The master depends on having a similar view of time as the nodes, and the controller manager will not properly roll out deployments if the node and master clocks are skewed. We should figure out a way of detecting this condition.
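One possible detection sketch (the probe target and threshold here are assumptions, not an agreed design): estimate skew by comparing the local clock against the Date header of an HTTPS response from the master. The Date header has one-second resolution and includes network latency, so this is only a rough estimate:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// clockSkew estimates local-vs-server clock skew from the HTTP Date header.
func clockSkew(url string) (time.Duration, error) {
	resp, err := http.Head(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	serverTime, err := http.ParseTime(resp.Header.Get("Date"))
	if err != nil {
		return 0, err
	}
	return time.Since(serverTime), nil
}

func main() {
	// URL and threshold are illustrative only.
	skew, err := clockSkew("https://kubernetes.default.svc")
	if err == nil && (skew > 30*time.Second || skew < -30*time.Second) {
		fmt.Println("node clock appears skewed from the master:", skew)
	}
}
```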
We should add presubmit/postsubmit e2e test for NPD.
For now, we can just download newest kubernetes version and run the node problem detector e2e test.
After introducing a new pattern, we usually need to verify it.
However, it's hard to trigger a real problem for a given service, so we usually have to inject logs to verify it.
It's hard to inject logs into a given systemd service. We should have a test systemd service which does nothing but generate specified log lines, as sketched below. We could let NPD parse the log of the test service to verify the newly introduced pattern.
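For illustration, a trivial generator that could back such a test service (run as a systemd unit, its stdout lands in the journal under that unit, where NPD can pick it up):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// Hedged sketch of a log injector: prints whatever message it is given, so
// that when run as a systemd service the message is captured by journald
// under the service's unit.
func main() {
	// e.g. ./log-injector "task docker:1234 blocked for more than 120 seconds."
	fmt.Println(strings.Join(os.Args[1:], " "))
}
```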
make failed with output:
CGO_ENABLED=0 GOOS=linux godep go build -a -installsuffix cgo -ldflags '-w' -o node-problem-detector
vendor/github.com/coreos/go-systemd/sdjournal/functions.go:19:2: no buildable Go source files in /Users/sandflee/goproject/src/k8s.io/node-problem-detector/vendor/github.com/coreos/pkg/dlopen
godep: go exit status 1
With Linux 4.9, and perhaps earlier versions, the OOM-kill message has a trailing ", shmem-rss:\\d+kB". Should it be optional in the built-in regex, are admins expected to edit their configuration, or something else?
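For illustration, a sketch of making that suffix optional (showing only the second line of the default two-line OOMKilling pattern):

```go
package main

import (
	"fmt"
	"regexp"
)

// The shmem-rss suffix is made optional, so the pattern matches both the
// pre-4.9 and the 4.9-style OOM-kill messages.
var oomPattern = regexp.MustCompile(
	`Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB(?:, shmem-rss:\d+kB)?`)

func main() {
	pre49 := "Killed process 1234 (foo) total-vm:100kB, anon-rss:10kB, file-rss:1kB"
	post49 := "Killed process 1234 (foo) total-vm:100kB, anon-rss:10kB, file-rss:1kB, shmem-rss:0kB"
	fmt.Println(oomPattern.MatchString(pre49), oomPattern.MatchString(post49)) // true true
}
```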
I'm an ABRT developer and I would love to create a problem daemon reporting problems detected by ABRT to node-problem-detector.
ABRT's architecture is similar to node-problem-detector's - there are agents reporting detected problems to abrtd. An ABRT agent is either a tiny daemon watching logs (or systemd-journal) or a language error handler (Python sys.excepthook, Ruby at_exit callback, /proc/sys/kernel/core_pattern, Node.js uncaughtException event handler, Java JNI agent).
I've created a docker image that is capable of detecting kernel oopses, vmcores and core files on a host:
https://github.com/jfilak/docker-abrt/tree/atomic_minimal
(It should be possible to detect uncaught [Python, Ruby, Java] exceptions in the future)
ABRT provides several ways of reporting the detected problems to users - e-mail, FTP|SCP upload, D-Bus signal, Bugzilla bug, micro-Report, systemd-journal catalog message - and it is trivial to add another report destination.
The Design Doc defines a "Problem Report Interface", but I've failed to find out how to register a new problem daemon with node-problem-detector or how to use the "Problem Report Interface" from a third-party daemon.
So I create npd containers inside hollow-nodes (just for clarity: these are actually pods) of kubemark by passing the address of the kubemark master and the kubeconfig file through the override flag as follows:
"name": "hollow-node-problem-detector",
"image": "gcr.io/google_containers/node-problem-detector:v0.3",
"env": [
{
"name": "NODE_NAME",
"valueFrom": {
"fieldRef": {
"fieldPath": "metadata.name"
}
}
}
],
"command": [
"/node-problem-detector",
"--kernel-monitor=/config/kernel-monitor.json",
"--apiserver-override=https://104.198.41.48:443?inClusterConfig=false&auth=/kubeconfig/npd_kubeconfig",
"--alsologtostderr",
"1>>/var/logs/npd_$(NODE_NAME).log 2>&1"
],
"volumeMounts": ....
I even checked inside the container using kubectl exec that the right kubeconfig file is present on the container and the NODE_NAME env var is defined. Here's the kubeconfig:
apiVersion: v1
kind: Config
users:
- name: node-problem-detector
user:
token: 8qt8RLdwIKeZQ0QeVrSWg4BrqK3Cs8H8
clusters:
- name: kubemark
cluster:
insecure-skip-tls-verify: true
server: https://104.198.41.48
contexts:
- context:
cluster: kubemark
user: node-problem-detector
name: kubemark-npd-context
current-context: kubemark-npd-context
Despite this, npd seems to communicate with the wrong master (the master of the real cluster underneath the kubemark cluster). It is probably still using the inClusterConfig from the real node underneath the hollow-node, as that seems to be the only way it could get hold of the wrong master's IP address.
Currently, NPD sends a PATCH every 30 seconds, which could be a significant overhead for the apiserver in a large cluster.
Even in the periodic resync, we should GET first and only send a PATCH when the conditions are actually different, as sketched below.
Furthermore, we could use the same trick as the kubelet to leverage the apiserver cache to reduce the load even more. https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/kubelet_node_status.go#L313-L322
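In sketch form, assuming old-style client-go signatures without a context (helper names are illustrative):

```go
package problemclient

import (
	"encoding/json"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// conditionUpToDate reports whether the node already carries the wanted
// condition with the same status and reason (heartbeat times aside).
func conditionUpToDate(conds []v1.NodeCondition, want v1.NodeCondition) bool {
	for _, c := range conds {
		if c.Type == want.Type {
			return c.Status == want.Status && c.Reason == want.Reason
		}
	}
	return false
}

// resync GETs the node and only sends a PATCH when something changed.
func resync(client kubernetes.Interface, nodeName string, want v1.NodeCondition) error {
	node, err := client.CoreV1().Nodes().Get(nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if conditionUpToDate(node.Status.Conditions, want) {
		return nil // conditions unchanged; skip the PATCH entirely
	}
	patch, err := json.Marshal(map[string]interface{}{
		"status": map[string]interface{}{"conditions": []v1.NodeCondition{want}},
	})
	if err != nil {
		return err
	}
	_, err = client.CoreV1().Nodes().Patch(nodeName, types.StrategicMergePatchType, patch, "status")
	return err
}
```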
NPD (node problem detector) was introduced in Kubernetes 1.3 as a default add-on in GCE clusters.
At that time, it was mainly targeted at the default GCE Kubernetes setup. However, as time went by, some limitations were found, such as journald support, an authentication issue, and a scalability issue, which affected the adoption of NPD in many other environments.
In Kubernetes 1.6, we plan to invest some time to improve NPD, make it production ready, and roll it out in GKE.
Here are the working items and priorities:
- NPD currently uses InClusterConfig to access the apiserver. However, this does not work when a service account is not available. (Issue: #27, #21.) We should make the apiserver client option configurable, so that users can customize it based on their cluster setup. This is a prerequisite of standalone mode, as sketched below. (PR: #49, @andyxning)
- (live-restore is still in validation.) Because of this, NPD may not be able to detect problems when docker is unresponsive. (Issue: #76)

Note that only P0s are release blockers.
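For illustration, a sketch of the configurable client option (flag names are illustrative, assuming client-go's clientcmd helpers):

```go
package options

import (
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/clientcmd"
)

// buildConfig falls back to the in-cluster service account only when no
// apiserver/kubeconfig override is given, enabling standalone mode.
func buildConfig(apiserver, kubeconfig string) (*rest.Config, error) {
	if apiserver != "" || kubeconfig != "" {
		return clientcmd.BuildConfigFromFlags(apiserver, kubeconfig)
	}
	return rest.InClusterConfig()
}
```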
@dchen1107 @fabioy @ajitak
/cc @kubernetes/sig-node-misc
We've got some data from #2.
However, we've made a lot of changes since then. We should run a benchmark to measure the CPU and memory usage, and optimize the log parsing code of kernel monitor if the resource usage is too high.
An accurate and acceptable number is important for the NPD rollout.
/cc @dchen1107
Our k8s API server is running with a self-signed certificate to enable SSL. I am wondering what's the best way to specify a root CA when running node-problem-detector.
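For illustration, one way to do that with client-go (host and paths are illustrative):

```go
package client

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// newClientWithCA builds a client that trusts a custom root CA for an
// apiserver serving a self-signed certificate.
func newClientWithCA(host, caFile string) (*kubernetes.Clientset, error) {
	cfg := &rest.Config{
		Host: host, // e.g. "https://10.0.0.1:443"
		TLSClientConfig: rest.TLSClientConfig{
			CAFile: caFile, // root CA that signed the apiserver's cert
		},
	}
	return kubernetes.NewForConfig(cfg)
}
```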
We want NPD to support journald log even when it's running as daemonset inside a container.
However, it's hard to consume the host journal log inside a container. Previously, we used the fedora base image and relied on the journald library inside it to understand and read the host journald log. However, there are several limitations:
A more generic solution may be to mount the host / inside the container and chroot into it. In this way, we could use the journald on the host directly, which eliminates the problems above.
$ docker run -it --privileged -v /:/rootfs --entrypoint=/bin/bash gcr.io/google_containers/node-problem-detector:v0.4.0
[root@07fb0f31a048 /]# chroot /rootfs
sh-4.3# journalctl -k
-- Logs begin at Sun 2017-06-11 12:12:02 UTC, end at Mon 2017-06-12 18:59:06 UTC. --
Jun 12 18:19:54 e2e-test-lantaol-master kernel: device veth43d8449 entered promiscuous mode
Jun 12 18:19:54 e2e-test-lantaol-master kernel: IPv6: ADDRCONF(NETDEV_UP): veth43d8449: link is not ready
Jun 12 18:19:54 e2e-test-lantaol-master kernel: eth0: renamed from veth75b3ac0
...
I haven't thought through the potential security problem yet, but since NPD is a privileged daemonset, it's reasonable to grant host fs access to it.
This would allow it to be used with clusters that don't have service accounts enabled (they are disabled in our cluster because they do not play well with namespaced accounts yet)
To roll out the NodeProblemDetector and run it as an addon pods of Kubernetes, we need to understand its overhead, at least including cpu and memory overhead.
/cc @kubernetes/sig-node
The default Kubernetes setup has removed ABAC support and only uses RBAC now.
kubernetes/kubernetes#39092
This is fine for the NPD in kube-system, which has enough permissions. However, the NPD setup in the example and Readme.md may not have enough.
We should add documentation and example to demonstrate how to deploy NPD in a RBAC K8s cluster.
I was reading through http://blog.scoutapp.com/articles/2013/07/25/understanding-cpu-steal-time-when-should-you-be-worried and I'm curious if the node-problem-detector should be watching for CPU steal?
For better reliability, in Kubernetes 1.6, we plan to add standalone NPD support to run NPD as a system daemon on each node.
To achieve this, we need to:
- Add an apiserver-override option to make it possible to use a customized client config instead of InClusterConfig. #49
- Adjust NodeAllocatable according to the resource usage accordingly.
- ~~system:kube-proxy.~~
If the kubelet is installed and a security policy on a critical Kubernetes component is later changed for some reason, it might be wise to watch out for that.
I am running k8s 1.3.0 with the docker image node-problem-detector:v0.1.
I changed the kubelet --hostname-override to the node IP on each minion:
root@SZX1000116607:~# cat /etc/default/kubelet
KUBELET_OPTS='--hostname-override=10.22.109.119 --api-servers=http://10.22.109.119:8080,http://10.22.69.237:8080,http://10.22.117.82:8080 --pod-infra-container-image=xxxxx/kubernetes/pause:latest --cluster-dns=192.168.1.1 --cluster-domain=test1 --low-diskspace-threshold-mb=2048 --cert-dir=/var/run/kubelet --allow-privileged=true'
when I start node-problem-detector as a daemon set, I got a lot of error messages:
2016-07-11T07:49:17.310422203Z I0711 07:49:17.309761 1 kernel_monitor.go:86] Finish parsing log file: {WatcherConfig:{KernelLogPath:/log/kern.log} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:UnregisterNetDeviceIssue Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+}]}
2016-07-11T07:49:17.310510490Z I0711 07:49:17.310026 1 kernel_monitor.go:93] Got system boot time: 2016-07-08 01:40:54.310019188 +0000 UTC
2016-07-11T07:49:17.311878174Z I0711 07:49:17.311436 1 kernel_monitor.go:102] Start kernel monitor
2016-07-11T07:49:17.311907663Z I0711 07:49:17.311654 1 kernel_log_watcher.go:173] unable to parse line: "", can't find timestamp prefix "kernel: [" in line ""
2016-07-11T07:49:17.311926515Z I0711 07:49:17.311696 1 kernel_log_watcher.go:110] Start watching kernel log
2016-07-11T07:49:17.311942118Z I0711 07:49:17.311720 1 problem_detector.go:60] Problem detector started
2016-07-11T07:49:17.314020808Z 2016/07/11 07:49:17 Seeked /log/kern.log - &{Offset:0 Whence:0}
2016-07-11T07:49:18.355974460Z E0711 07:49:18.355712 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:19.314395255Z E0711 07:49:19.314110 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:20.314302138Z E0711 07:49:20.313982 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:21.315151897Z E0711 07:49:21.314849 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:22.313977681Z E0711 07:49:22.313834 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:23.313906286Z E0711 07:49:23.313619 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:24.314428608Z E0711 07:49:24.314141 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:25.316626549Z E0711 07:49:25.316326 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:26.314346471Z E0711 07:49:26.314142 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:27.313901894Z E0711 07:49:27.313759 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:28.314356686Z E0711 07:49:28.314198 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:29.314747334Z E0711 07:49:29.314450 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:30.314562756Z E0711 07:49:30.314235 1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
It seems node-problem-detector still uses the original OS hostname to access the node instead of the overridden hostname that is registered in etcd.
/wangyumi
Currently, to get the node name, we have to run the node problem detector in the host network.
After kubernetes/kubernetes#27880 lands, we can change the node problem detector to get the node name from the downward API. That will also fix #23.
I'm seeing logs like these:
E0110 22:51:21.706489 1 manager.go:130] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-0-12-121.ec2.internal": <nil>
Add journald support.
Consider reusing the code in cadvisor, and maybe replace hpcloud/tail with the simpler implementation in cadvisor #3
I cloned the repo and ran:
$ make node-problem-detector
CGO_ENABLED=0 GOOS=linux godep go build -a -installsuffix cgo -ldflags '-w' -o node-problem-detector
vendor/gopkg.in/fsnotify.v1/inotify.go:19:2: cannot find package "golang.org/x/sys/unix" in any of:
/home/decarr/go/src/k8s.io/node-problem-detector/vendor/golang.org/x/sys/unix (vendor tree)
/usr/local/go/src/golang.org/x/sys/unix (from $GOROOT)
/home/decarr/go/src/k8s.io/node-problem-detector/Godeps/_workspace/src/golang.org/x/sys/unix (from $GOPATH)
/home/decarr/go/src/golang.org/x/sys/unix
godep: go exit status 1
Makefile:9: recipe for target 'node-problem-detector' failed
make: *** [node-problem-detector] Error 1
Looks like this is missing a godep in master:
https://github.com/kubernetes/node-problem-detector/tree/master/vendor
We should create a latest NPD image with each release.