node-problem-detector's Issues

Write Readme.md for KernelMonitor

Write a Readme.md for KernelMonitor that covers:

  • The motivation for KernelMonitor
  • How to use the current KernelMonitor
  • Future plans for KernelMonitor

/cc @kubernetes/node-problem-detector-maintainers

"unregister_netdevice" isn't necessarily a KernelDeadlock

I have a node running CoreOS 1221.0.0 with kernel version 4.8.6-coreos.

The node-problem-detector marked it with "KernelDeadlock True Sun, 04 Dec 2016 18:56:20 -0800 Wed, 16 Nov 2016 00:03:33 -0800 UnregisterNetDeviceIssue unregister_netdevice: waiting for lo to become free. Usage count = 1".

If I check my kernel log, I see the following:

$ dmesg -T | grep -i unregister_netdevice -C 3
[Wed Nov 16 08:02:19 2016] docker0: port 5(vethfd2807b) entered blocking state
[Wed Nov 16 08:02:19 2016] docker0: port 5(vethfd2807b) entered forwarding state
[Wed Nov 16 08:02:19 2016] IPv6: eth0: IPv6 duplicate address fe80::42:aff:fe02:1206 detected!
[Wed Nov 16 08:03:33 2016] unregister_netdevice: waiting for lo to become free. Usage count = 1
[Wed Nov 16 08:14:35 2016] vethafecb94: renamed from eth0
[Wed Nov 16 08:14:35 2016] docker0: port 2(veth807b9e2) entered disabled state
[Wed Nov 16 08:14:35 2016] docker0: port 2(veth807b9e2) entered disabled state

Clearly, the node managed to continue to perform operations after printing that message. In addition, pods continue to function just fine and there aren't any long-term issues for me on this node.

I know that what counts as a deadlock is configurable, but perhaps the default configuration shouldn't include this message, or the check for it should be more sophisticated, since as-is it can be quite confusing.
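One reason the check is so easily confused: the rule fires on a single matching log line, with no follow-up check that the condition persists. A quick offline demonstration (the pattern below is copied from the default kernel-monitor config quoted elsewhere in this report; the node's actual dmesg line matches it immediately):

```python
import re

# Pattern from the stock kernel-monitor config for UnregisterNetDevice.
# Note the unescaped '.' before "Usage" is present in the shipped pattern.
pattern = re.compile(
    r"unregister_netdevice: waiting for \w+ to become free. Usage count = \d+"
)

# The exact line from the node's kernel log.
msg = "unregister_netdevice: waiting for lo to become free. Usage count = 1"

# A single occurrence is enough to flip the condition, even though the
# kernel may recover moments later.
assert pattern.search(msg)
```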

DaemonSet NPD creates /var/log/journal on non-journald node.

Ideally, we should not use the journald config on a non-systemd node. If we do use it, it is expected to error out.

However, for some as-yet-unknown reason, NPD will create a /var/log/journal directory and keep going. Related code: https://github.com/kubernetes/node-problem-detector/blob/master/pkg/systemlogmonitor/logwatchers/journald/log_watcher.go#L141

This should not matter much, because nothing will actually write any logs there, so NPD will just hang without consuming more resources. But we should still fix this.
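One possible guard (a minimal sketch in Python, not the actual Go fix): before starting the journald watcher, detect whether the host was booted with systemd using the convention systemd itself documents in sd_booted(3) — the directory /run/systemd/system exists only under systemd. The `run_dir` parameter here is just for testability.

```python
import os

def sd_booted(run_dir: str = "/run/systemd/system") -> bool:
    """Return True if the host was booted with systemd.

    Mirrors the sd_booted(3) convention: /run/systemd/system exists
    only on systemd hosts. NPD could refuse to start the journald
    log watcher (instead of creating /var/log/journal) when this
    check fails.
    """
    return os.path.isdir(run_dir)
```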

@kubernetes/node-problem-detector-maintainers

status doesn't change when injecting log messages

I cannot get the node problem detector to change a node status by injecting messages.

I am using kubernetes 1.5.2, Ubuntu 16.04, kernel 4.4.0-51-generic.

I run NPD as a DaemonSet. I have attempted to get this to work with NPD versions 0.3.0 and 0.4.0. I start NPD with the default command, using /config/kernel-monitor.json, because my nodes use journald.

I have /dev/kmsg mounted into the pod, and I echo expressions matching the regexes in kernel-monitor.json to /dev/kmsg on the node. I can view the fake logs I've echoed to /dev/kmsg from inside the pod.

Steps to reproduce:

# as root on the node where your pod is running
echo "task umount.aufs:123 blocked for more than 120 seconds." >> /dev/kmsg
# I have verified that these logs show up in journalctl -k

# this should match the following permanent condition in /config/kernel-monitor.json
#	{
#		"type": "permanent",
#		"condition": "KernelDeadlock",
#		"reason": "AUFSUmountHung",
#		"pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
#	},

# check the node status of the node where you ran this on
kubectl get node <node>
# status will still be Ready

# for further detail examine the json
kubectl get node <node> -o json | jq .status.conditions
# you will see that the KernelDeadlock condition is still "False"

# I would expect the general status to change to "KernelDeadlock"
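To rule out a regex mismatch as the cause, the injected line can be checked offline against the pattern from the config snippet above (JSON escaping removed). It does match, so the problem is somewhere downstream of pattern matching:

```python
import re

# Pattern from /config/kernel-monitor.json (AUFSUmountHung rule),
# with the JSON string escaping removed.
pattern = re.compile(r"task umount\.aufs:\w+ blocked for more than \w+ seconds\.")

# The exact line echoed to /dev/kmsg in the repro steps.
injected = "task umount.aufs:123 blocked for more than 120 seconds."

assert pattern.search(injected)  # the injected line does match the rule
```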

If I am not testing this properly, could you please give a detailed breakdown of how to test that the node problem detector is working properly for both kernel logs AND docker logs?

I have also reproduced this behavior using a custom docker_monitor.json and having the systemd docker service write to the journald docker logs. I have still been unsuccessful in getting the node status to change.

NodeProblemDetector tests failing on AWS

c.nodeName, err = os.Hostname()
assumes that the hostname is the same as the cloud provider instance name. This is false on AWS. NPD needs to go through CurrentNodeName to get the cloud-provider-agnostic node name.

I'm disabling this test on AWS for now, but we're marching towards full support for AWS. Please treat this as if it had broken a build when it was originally checked in.

Now I cannot run it in Kubernetes 1.5.3

`Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message


28m 28m 1 {kubelet 10.8.65.157} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "kubernetes.io/host-path/5203c2d8-1b73-11e7-8754-46c88707ea0b-log" (spec.Name: "log") pod "5203c2d8-1b73-11e7-8754-46c88707ea0b" (UID: "5203c2d8-1b73-11e7-8754-46c88707ea0b").
28m 28m 1 {kubelet 10.8.65.157} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "kubernetes.io/host-path/5203c2d8-1b73-11e7-8754-46c88707ea0b-localtime" (spec.Name: "localtime") pod "5203c2d8-1b73-11e7-8754-46c88707ea0b" (UID: "5203c2d8-1b73-11e7-8754-46c88707ea0b").
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id e65759efc680; Security:[seccomp=unconfined]
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id e65759efc680
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id a7e78dfbe63a; Security:[seccomp=unconfined]
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id a7e78dfbe63a
27m 27m 2 {kubelet 10.8.65.157} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with CrashLoopBackOff: "Back-off 10s restarting failed container=node-problem-detector pod=node-problem-detector-gfx8q_default(5203c2d8-1b73-11e7-8754-46c88707ea0b)"

27m 27m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id 6d3ac4da1b6e; Security:[seccomp=unconfined]
27m 27m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id 6d3ac4da1b6e
27m 27m 2 {kubelet 10.8.65.157} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with CrashLoopBackOff: "Back-off 20s restarting failed container=node-problem-detector pod=node-problem-detector-gfx8q_default(5203c2d8-1b73-11e7-8754-46c88707ea0b)"

`

Now I am running the DaemonSet on all nodes, but how do I verify that it is useful?

What happens when something goes wrong on a node after the DaemonSet is created? Will the node be set to unschedulable?

Now I am running the DaemonSet on all nodes, but how do I verify that it is useful?

It is a bit difficult to simulate these problems:

Hardware issues: bad CPU, memory, or disk;
Kernel issues: kernel deadlock, corrupted file system;
Container runtime issues: unresponsive runtime daemon;
...
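For the kernel-level problems, one practical way to verify the detector end-to-end is to inject lines matching its rules into the watched log source (e.g. echo to /dev/kmsg as root, as other issues here describe). The patterns below are copied from the default kernel-monitor config quoted in this report's logs; the sample lines are made-up illustrations of what to inject:

```python
import re

# A few of the stock kernel-monitor rules, paired with sample lines one
# could inject into the watched log source to simulate each problem.
rules = {
    "TaskHung": r"task \S+:\w+ blocked for more than \w+ seconds\.",
    "UnregisterNetDevice": r"unregister_netdevice: waiting for \w+ to become free. Usage count = \d+",
    "DockerHung": r"task docker:\w+ blocked for more than \w+ seconds\.",
}

# Hypothetical injected lines (PIDs and interface names invented).
samples = {
    "TaskHung": "task nfsd:1234 blocked for more than 120 seconds.",
    "UnregisterNetDevice": "unregister_netdevice: waiting for eth0 to become free. Usage count = 3",
    "DockerHung": "task docker:7890 blocked for more than 120 seconds.",
}

for reason, pattern in rules.items():
    assert re.search(pattern, samples[reason]), reason
```

After injecting such a line, the corresponding event or condition should appear in `kubectl describe node <node>`. Note that NPD only reports conditions; it does not itself cordon the node.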

"startPattern" is fragile and wrong on newer kernels

Broken out from here

Currently, the config has a default of "startPattern": "Initializing cgroup subsys cpuset",

This pattern is meant to detect a node's boot. Prior to the 4.5 kernel, this message was typically printed during boot. After 4.5, however, due to this change, it is quite unlikely for that message to appear.

Furthermore, there's rarely a reason to detect whether a message is for the current boot in such a fragile way.

With the kern.log reader, every message is for the current boot because kern.log is usually rotated per boot (e.g. kern.log is this boot, kern.log.1 the boot before, kern.log.2.gz the one before that, etc.). (EDIT: I'm wrong about this for GCI at least)

With journald, the boot id is annotated in messages, and so it can accurately be correlated with the current boot id (see the "_BOOT_ID" record in journald messages).

With a kmsg reader, all messages are from the current boot because kmsg is not persistent.

In none of those cases is startPattern useful. I think each kernel log parsing plugin should be responsible for doing the right thing itself.
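For the journald case, the filtering a plugin would do is straightforward. A sketch (in Python for brevity, not the actual Go watcher), assuming journal records as JSON objects like those emitted by `journalctl -o json`; the current boot id is available on Linux from /proc/sys/kernel/random/boot_id (journald's _BOOT_ID field stores it without dashes):

```python
import json

def current_boot_entries(lines, current_boot_id):
    """Yield journal entries belonging to the current boot.

    `lines` are JSON records as emitted by `journalctl -o json`;
    entries are kept only when their _BOOT_ID matches the machine's
    current boot id -- no startPattern needed.
    """
    for line in lines:
        entry = json.loads(line)
        if entry.get("_BOOT_ID") == current_boot_id:
            yield entry

# Two fake records: one from a previous boot, one from the current boot.
records = [
    '{"_BOOT_ID": "aaaa", "MESSAGE": "old boot message"}',
    '{"_BOOT_ID": "bbbb", "MESSAGE": "current boot message"}',
]
msgs = [e["MESSAGE"] for e in current_boot_entries(records, "bbbb")]
assert msgs == ["current boot message"]
```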

Node-Problem-Detector should Patch NodeStatus not Update

Problem:
kubelet, node-controller, and now the node-problem-detector all update the Node.Status field.

Normally this is not a problem: if changes happen rapidly, the update fails on a resource version mismatch and everything is OK.

When a new field is added to Node Status, however:

  • Since kubelet and node-controller are in the same repository they are recompiled with the new field and Status updates continue to operate normally.
  • However, since node-problem-detector is in a separate repository and has not been recompiled with the new version, its Update calls end up squashing the new (unknown) field, resetting it to nil.

Suggested Solution:
node-problem-detector should do a patch instead of an update to prevent wiping out fields it is not aware of. Incidentally, kubelet and node-controller should do this as well, but that is not as critical since they are in the same repository and new types are normally added there first.
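The squashing behavior can be illustrated abstractly (a sketch in Python with plain dicts, not the actual client code): a full update writes back the client's entire view of the status, dropping any field its compiled-in types don't know about, while a merge-patch (in the spirit of RFC 7386 JSON Merge Patch, sketched here for flat keys only) touches only the keys the client sends.

```python
import copy

# Server-side status includes a field the old client doesn't know about.
server_status = {"conditions": [], "newField": "important"}

# Old client's view, deserialized into types compiled without newField.
client_status = {"conditions": [{"type": "KernelDeadlock", "status": "False"}]}

def full_update(server, client_view):
    # PUT/update semantics: replace the whole object with the client's view.
    return copy.deepcopy(client_view)

def merge_patch(server, patch):
    # Merge-patch semantics, flat keys only: modify just the keys present
    # in the patch, leaving everything else on the server untouched.
    result = copy.deepcopy(server)
    for k, v in patch.items():
        result[k] = copy.deepcopy(v)
    return result

after_update = full_update(server_status, client_status)
assert "newField" not in after_update          # update squashed the new field

after_patch = merge_patch(server_status, client_status)
assert after_patch["newField"] == "important"  # patch preserved it
```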

CC @kubernetes/sig-node @kubernetes/sig-api-machinery @Random-Liu @bgrant0607

node-problem-detector:v0.4.0 - failed to update node conditions: Timeout

Hello,

I am testing out the node-problem-detector cluster addon and am seeing this error on only 1 node:

  • failed to update node conditions: Timeout: request did not complete within allowed duration

This cluster has had issues in the past: some etcd3 nodes became unhealthy, and flanneld was at one point unable to connect to etcd3, causing kube-apiserver to fail to respond, and hence kubelet, kube-controller-manager, kube-proxy, and kube-scheduler to log various API timeout errors.

After restarting everything, the rest of the nodes are healthy & both etcd3 + kube-apiserver appear to be contactable by the kube-system services. I then started npd-0.4.0 using the above manifest and am seeing only 1 node with this issue. The rest are sometimes logging what appears to be the race condition for updating node status exactly as in #108.

I've checked the health of the 5-node etcd cluster and only 1 of 5 is in an unhealthy state. Requests from kubectl return most of the time; however, I am seeing some timeouts from kubectl describe node XXX commands.

Relevant log sections below:

/var/log/node-problem-detector.log:

[... BEGIN LOG ...]

I0617 05:08:14.386945       7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/var/log/journal
Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:
kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, a
non-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNe
tDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL po
inter dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Patt
ern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ second
s\.}]}
I0617 05:08:14.387062       7 log_watchers.go:40] Use log watcher of plugin "journald"
I0617 05:08:14.387253       7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:filelog PluginConfig:map[timestamp:^.{15} message:kernel: \[.*\] (
.*) timestampFormat:Jan _2 15:04:05] LogPath:/var/log/kern.log Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:00
01-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d
+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more tha
n \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition:
 Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Typ
e:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:Dock
erHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0617 05:08:14.387286       7 log_watchers.go:40] Use log watcher of plugin "filelog"
I0617 05:08:14.387923       7 log_monitor.go:81] Start log monitor
E0617 05:08:14.387946       7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor-filelog.json": failed to stat the file "/var/log/kern.log": stat /var/lo
g/kern.log: no such file or directory
I0617 05:08:14.387952       7 log_monitor.go:81] Start log monitor
I0617 05:08:14.389906       7 log_watcher.go:69] Start watching journald
I0617 05:08:14.389947       7 problem_detector.go:74] Problem detector started
I0617 05:08:14.390046       7 log_monitor.go:173] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2017-06-17 05:08:14.390031295 +0000 UTC Reason:Ker
nelHasNoDeadlock Message:kernel has no deadlock}]
E0617 06:41:07.988201       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2163685", currentResourceVersion: "2163729"):
 diff1={"metadata":{"resourceVersion":"2163729"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T06:41
:01Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T06:41:02Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoD
eadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 08:10:02.768550       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2181151", currentResourceVersion: "2181185"):
 diff1={"metadata":{"resourceVersion":"2181185"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T08:10
:02Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T08:10:02Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoD
eadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 08:40:02.392305       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 08:40:32.394968       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 08:40:52.688694       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 10:19:47.886026       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2206191", currentResourceVersion: "2206223"):
 diff1={"metadata":{"resourceVersion":"2206223"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:47Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:20:03.868248       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2206223", currentResourceVersion: "2206276"):
 diff1={"metadata":{"resourceVersion":"2206276"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:58Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:27:14.278874       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2207596", currentResourceVersion: "2207627"):
 diff1={"metadata":{"resourceVersion":"2207627"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:27:11Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:49:49.999965       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:04:58.654339       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2214802", currentResourceVersion: "2214843"):
 diff1={"metadata":{"resourceVersion":"2214843"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:04:58Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:13:25.349869       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:13:27.448321       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2216376", currentResourceVersion: "2216452"):
 diff1={"metadata":{"resourceVersion":"2216452"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:13:25Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:27:54.707593       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:28:03.676434       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:28:11.687756       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2219181", currentResourceVersion: "2219220"):
 diff1={"metadata":{"resourceVersion":"2219220"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:27:56Z","type":"KernelDeadlock"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:28:07Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:35:22.216998       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2220568", currentResourceVersion: "2220603"):
 diff1={"metadata":{"resourceVersion":"2220603"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:35:21Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:35:39.515617       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:36:44.370694       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2220826", currentResourceVersion: "2220882"):
 diff1={"metadata":{"resourceVersion":"2220882"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:36:43Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:43:04.549531       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:52:29.011584       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:55:55.741212       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost

[...SNIP...]

E0617 20:48:43.758632       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:50:08.649485       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:50:36.742814       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:51:06.745291       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:51:18.310623       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:51:25.507013       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:51:35.682252       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:51:40.731491       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2329653", currentResourceVersion: "2329819"):
 diff1={"metadata":{"resourceVersion":"2329819"},"status":{"conditions":[{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"N
odeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnkn
own","status":"Unknown","type":"MemoryPressure"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","statu
s":"Unknown","type":"OutOfDisk"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","ty
pe":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T20:51:39Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoD
eadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 20:52:05.903239       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:52:35.941987       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:53:05.990698       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:53:35.993274       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:54:05.995654       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:54:35.998026       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:55:06.009227       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:55:36.011974       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:56:01.937423       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:56:12.384973       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:56:42.390952       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:57:12.393356       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:57:42.395943       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:58:12.405949       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:58:22.411682       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:58:33.305720       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:58:34.805954       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2330223", currentResourceVersion: "2330380"):
 diff1={"metadata":{"resourceVersion":"2330380"},"status":{"conditions":[{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"N
odeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnkn
own","status":"Unknown","type":"MemoryPressure"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"OutOfDisk"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T20:58:33Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 20:59:13.391379       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration

[...SNIP...]
[... MESSAGE REPEATS APPROX EVERY ~1/2 SECOND ...]
[...SNIP...]

E0617 21:22:43.790298       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:23:13.792892       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:23:43.795469       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:24:13.798194       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:24:43.800739       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:25:13.803335       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:25:43.805904       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:26:13.808357       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:26:43.810985       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:27:13.813560       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:27:43.816150       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:28:13.886384       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:28:43.889035       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:29:13.891666       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:29:43.894478       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:30:13.897056       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:30:43.922019       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:31:13.947162       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:31:43.949919       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:32:13.952527       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:32:43.955028       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:33:13.957574       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:33:43.960186       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:34:13.962807       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:34:43.965335       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:35:13.967982       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:35:43.971220       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration

[...SNIP...]
[... MESSAGE REPEATS APPROX EVERY ~1/2 SECOND ...]

Write Readme.md for NodeProblemDetector

Write Readme.md for NodeProblemDetector to demonstrate:

  • Motivation and scope of NodeProblemDetector
  • Usage of current NodeProblemDetector
  • Future plan of NodeProblemDetector.

/cc @kubernetes/node-problem-detector-maintainers

NPD sending too many node-status updates in scale tests

We ran a 4k-node scalability test and observed that fluentd gets oom-killed frequently.
As a result, NPD sends too many patch node-status requests (~3k qps out of ~11k total qps).
Ref: kubernetes/kubernetes#47344 (comment) kubernetes/kubernetes#47865 (comment)

Why is NPD reporting OOM events when the kubelet already seems to do that?
Can we make NPD report OOMs only for system processes and let the kubelet alone take care of k8s containers (by modifying NPD's 'OOMKilling' rule)?

cc @Random-Liu @gmarek @kubernetes/sig-node-bugs

does this tool work outside Google?

Panic crash:

kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                                READY     STATUS             RESTARTS   AGE       NODE
default       node-problem-detector-0kgkw         0/1       CrashLoopBackOff   3          1m        192.168.78.15
default       node-problem-detector-ar3tk         0/1       CrashLoopBackOff   3          1m        192.168.78.16

kubectl logs node-problem-detector-0kgkw
I0623 01:02:18.560287       1 kernel_monitor.go:86] Finish parsing log file: {WatcherConfig:{KernelLogPath:/log/kern.log} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:UnregisterNetDeviceIssue Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+}]}
I0623 01:02:18.560413       1 kernel_monitor.go:93] Got system boot time: 2016-06-17 17:51:02.560408109 +0000 UTC
panic: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory

goroutine 1 [running]:
panic(0x15fc280, 0xc8204fc000)
    /usr/local/go/src/runtime/panic.go:464 +0x3e6
k8s.io/node-problem-detector/pkg/problemclient.NewClientOrDie(0x0, 0x0)
    /usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/pkg/problemclient/problem_client.go:56 +0x132
k8s.io/node-problem-detector/pkg/problemdetector.NewProblemDetector(0x7faa4f155140, 0xc8202a6900, 0x0, 0x0)
    /usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/pkg/problemdetector/problem_detector.go:45 +0x36
main.main()
    /usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/node_problem_detector.go:33 +0x56

We should add real case e2e test.

We already have a NPD e2e test in kubernetes repo: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node_problem_detector.go.

However, the e2e test doesn't actually test the NPD deployed in the cluster. It only:

  • Deploys an NPD pod with a test config.
  • Generates test log entries in the test log file.
  • Checks whether the test node condition is set.

Although this test is still necessary, we should add a new test that actually injects logs into the kernel log, or even triggers a real kernel problem, and verifies the real NPD deployed in the cluster.

The test is disruptive, so we may want to add a [Feature] tag and create a dedicated Jenkins job for it.

/cc @dchen1107

Generalize kernel monitor.

Discussed with @apatil.

Kernel monitor was initially introduced to monitor the kernel log and detect kernel issues.

However, it could in fact be extended to monitor other logs, such as the docker log, the systemd log, etc., by adding new translators. This is already doable today, but not very intuitive because:

  1. All files, types and functions are named kernel xxx.
  2. Translator is not configurable.

We should refactor the code to make it easier and more intuitive to extend kernel monitor:

  • Change kernelmonitor to logmonitor. We'll only use log monitor to monitor kernel log for K8s, but it should be easy for other users to reconfigure and extend it to monitor other logs.
  • Extend the configuration to make translator and log source configurable after #41 landed, including:
    • Make the journald log filter configurable.
    • Make the translate function configurable.
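For illustration, a monitor config for docker logs under such a generalized scheme might mirror the existing kernel-monitor config structure. This is only a sketch: the field names follow the kernel monitor's config struct, and the timestamp layout and example rule are hypothetical, not a finalized API:

```json
{
  "plugin": "filelog",
  "pluginConfig": {
    "timestamp": "^time=\"(\\S*)\"",
    "message": "msg=\"([^\"]*)\"",
    "timestampFormat": "2006-01-02T15:04:05.999999999Z07:00"
  },
  "logPath": "/var/log/docker.log",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "docker-monitor",
  "conditions": [],
  "rules": [
    {
      "type": "temporary",
      "reason": "CorruptDockerImage",
      "pattern": "Error trying v2 registry: failed to register layer: .*"
    }
  ]
}
```

The point is that nothing would be kernel-specific once the watcher plugin, the translator and the patterns are all configuration.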

/cc @kubernetes/sig-node

Bump the version to v0.3 in makefile, manifest and readme

After adding the standalone mode in #49, I created and pushed a new docker image gcr.io/google_containers/node-problem-detector:v0.3 which includes these changes for use by kubemark. However, this version bump has to be reflected in:

  1. version tag inside the Makefile
  2. image inside the node-problem-detector.yaml
  3. readme, to indicate that v0.3 is the latest version and/or suggested for use?

@Random-Liu Hope this push doesn't disturb any plans you might have for the NPD release schedule. If it does, the change is easy to revert, as I made it just for use in kubemark.

does not automatically create /var/log/journal

Hi,

  1. great initiative. Here is my experience on RHEL/CentOS 7

  2. Apparently node-problem-detector is not able to work with the /run/log/journal/ (unfortunately) - correct? If so, it seems like a big limitation, no?

  3. The issue is that even if it cannot do that, it also fails to start, as it depends on the existence of "/var/log/journal". Given that it's not able to use /run/log/journal, is it reasonable to expect it to create "/var/log/journal" on each node?

  4. "/var/log/kern.log" is missing on my OS. I guess it's no longer in use, or does it need to be configured?

Here are the logs:

I0621 14:32:46.173199       7 log_watchers.go:40] Use log watcher of plugin "journald"
I0621 14:32:46.186164       7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:filelog PluginConfig:map[timestamp:^.{15} message:kernel: \[.*\] (.*) timestampFormat:Jan _2 15:04:05] LogPath:/var/log/kern.log Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0621 14:32:46.186204       7 log_watchers.go:40] Use log watcher of plugin "filelog"
I0621 14:32:46.186926       7 log_monitor.go:81] Start log monitor
E0621 14:32:46.186945       7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor.json": failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0621 14:32:46.186951       7 log_monitor.go:81] Start log monitor
E0621 14:32:46.186958       7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor-filelog.json": failed to stat the file "/var/log/kern.log": stat /var/log/kern.log: no such file or directory
F0621 14:32:46.186964       7 node_problem_detector.go:88] Problem detector failed with error: no log montior is successfully setup

Release instructions are needed.

Forked from #63 (comment).
We need release instructions to standardize the release process. The instructions should cover:

  • How to build release packages
    • Build docker image.
    • Build tar ball for standalone.
    • ...
  • How to cut release
    • Version a release
    • Update CHANGELOG.md
    • Create branch/tag
    • Create github release.
    • ...
  • How to update kubernetes
    • Update e2e test
    • Update addon pods in cluster
    • ...

This has lower priority than the features; I will file the doc before cutting a release.

can't update node condition

Hi, I deployed NPD yesterday on my k8s 1.6 cluster. Now I find it can't update the node condition. Below is its log:

E0418 09:48:32.826374       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "cs55": there is a meaningful conflict (firstResourceVersion: "881081", currentResourceVersion: "881106"):
 diff1={"metadata":{"resourceVersion":"881106"},"status":{"conditions":[{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-04-18T01:47:32Z","type":"KernelDeadlock"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-04-18T01:48:32Z","lastTransitionTime":"2017-04-17T09:10:43Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}

I wonder why this happens? @Random-Liu thx

Feature request for a "hollow"-node-problem-detector having an empty list of conditions and rules inside kernel monitor config

As part of the effort to make testing using kubemark mimic real clusters as closely as possible, we are planning to add a container for a "hollow-node-problem-detector" inside the hollow-node, alongside the existing containers "hollow-kubelet" and "hollow-kubeproxy". For this, we need a node-problem-detector image which essentially has the conditions and rules inside the kernel monitoring config set to empty lists. Also, this image should eventually (once it is tested to work fine in Kubemark) be pushed to gcr.io/google-containers.
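Concretely, the kernel monitor config baked into such a "hollow" image would presumably be a copy of kernel-monitor.json with the detection lists emptied out, roughly (a sketch; field names assumed from the existing config):

```json
{
  "logPath": "/var/log/kern.log",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [],
  "rules": []
}
```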

@kubernetes/sig-scalability @wojtek-t @gmarek @Random-Liu

A systemd service for test.

After introducing a new pattern, we usually need to verify it.

However, it's hard to trigger a real problem for a given service, so we usually have to inject logs to verify it.

It's hard to inject logs into a given systemd service. We should have a test systemd service that does nothing but generate specified log lines. We could then let NPD parse the test service's log to verify the newly introduced pattern.
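A minimal sketch of such a test service (the unit name and message below are hypothetical): a oneshot unit whose only job is to write an arbitrary line into the journal under its own unit name:

```ini
# npd-test-log-generator.service (hypothetical unit name)
[Unit]
Description=NPD test log generator

[Service]
Type=oneshot
# Emit whatever line the pattern under test should match.
ExecStart=/bin/echo "task docker:1234 blocked for more than 120 seconds."
```

Starting the unit would put the message into systemd-journal attributed to that unit, so an NPD journald monitor filtered on the test unit could verify the new pattern without touching real services.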

failed to build node-problem-detector

  • go version go1.7 darwin/amd64
  • godep v74 (darwin/amd64/go1.7)

make failed with output:

CGO_ENABLED=0 GOOS=linux godep go build -a -installsuffix cgo -ldflags '-w' -o node-problem-detector
vendor/github.com/coreos/go-systemd/sdjournal/functions.go:19:2: no buildable Go source files in /Users/sandflee/goproject/src/k8s.io/node-problem-detector/vendor/github.com/coreos/pkg/dlopen
godep: go exit status 1

OOMKilling not triggering with recent kernels

With Linux 4.9, and perhaps earlier versions, there's a trailing

, shmem-rss:\\d+kB

Should it be optional in the built-in regex, are admins expected to edit their configuration, or something else?
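To sanity-check the optional-field idea: assuming the pattern is matched against the full message line, a non-capturing optional group covers both the old and the new kernel formats. A quick illustration (NPD itself matches in Go, but the regex behaves the same):

```python
import re

# Second line of NPD's built-in OOMKilling pattern (file-rss is the last field).
old = r"Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB"
# Same pattern with the trailing shmem-rss field made optional.
new = old + r"(?:, shmem-rss:\d+kB)?"

pre_49 = "Killed process 1234 (stress) total-vm:1024kB, anon-rss:512kB, file-rss:0kB"
post_49 = pre_49 + ", shmem-rss:0kB"

# With anchored (full-line) matching, the old pattern misses the newer format:
print(bool(re.fullmatch(old, post_49)))   # False
# The adjusted pattern matches both formats:
print(bool(re.fullmatch(new, pre_49)))    # True
print(bool(re.fullmatch(new, post_49)))   # True
```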

How to hook up third-party daemons?

I'm an ABRT developer and I would love to create a problem daemon reporting problems detected by ABRT to node-problem-detector.

ABRT's architecture is similar to node-problem-detector's - there are agents reporting detected problems to abrtd. An ABRT agent is either a tiny daemon watching logs (or systemd-journal) or a language error handler (Python sys.excepthook, Ruby at_exit callback, /proc/sys/kernel/core_pattern, Node.js uncaughtException event handler, Java JNI agent).

I've created a docker image that is capable of detecting kernel oopses, vmcores and core files on a host:
https://github.com/jfilak/docker-abrt/tree/atomic_minimal

(It should be possible to detect uncaught [Python, Ruby, Java] exceptions in the future)

ABRT provides several ways of reporting the detected problems to users - e-mail, FTP|SCP upload, D-Bus signal, Bugzilla bug, micro-Report, systemd-journal catalog message - and it is trivial to add another report destination.

The Design Doc defines a "Problem Report Interface", but I've failed to find out how to register a new problem daemon with node-problem-detector, or how to use the "Problem Report Interface" from a third-party daemon.

apiserver-override flag in standalone mode not working in kubemark

So I created NPD containers inside the hollow-nodes of kubemark (just for clarity: these are actually pods) by passing the address of the kubemark master and the kubeconfig file through the override flag as follows:

"name": "hollow-node-problem-detector",
"image": "gcr.io/google_containers/node-problem-detector:v0.3",
"env": [
						{
							"name": "NODE_NAME",
							"valueFrom": {
								"fieldRef": {
									"fieldPath": "metadata.name"
								}
							}
						}
],
"command": [
						"/node-problem-detector",
						"--kernel-monitor=/config/kernel-monitor.json",
						"--apiserver-override=https://104.198.41.48:443?inClusterConfig=false&auth=/kubeconfig/npd_kubeconfig",
						"--alsologtostderr",
						"1>>/var/logs/npd_$(NODE_NAME).log 2>&1"
],
"volumeMounts": ....

I even checked inside the container using kubectl exec that the right kubeconfig file is present on the container and NODE_NAME env var is defined. Here's the kubeconfig:

apiVersion: v1
kind: Config
users:
- name: node-problem-detector
  user:
    token: 8qt8RLdwIKeZQ0QeVrSWg4BrqK3Cs8H8
clusters:
- name: kubemark
  cluster:
    insecure-skip-tls-verify: true
    server: https://104.198.41.48
contexts:
- context:
    cluster: kubemark
    user: node-problem-detector
  name: kubemark-npd-context
current-context: kubemark-npd-context

Despite this, NPD seems to communicate with the wrong master (the master of the real cluster underneath the kubemark cluster). It is probably still using the InClusterConfig from the real node underneath the hollow-node, as that seems to be the only way it could get hold of the wrong master's IP address.

cc @Random-Liu @andyxning

NPD Kubernetes 1.6 Planning

NPD (node-problem-detector) was introduced in Kubernetes 1.3 as a default add-on in GCE clusters.

At that time, it mainly targeted the default GCE Kubernetes setup. However, over time, limitations were found (missing journald support, an authentication issue, a scalability issue) that affected the adoption of NPD in many other environments.

In Kubernetes 1.6, we plan to invest some time to improve NPD, make it production-ready, and roll it out in GKE.

Here are the working items and priorities:

  • [P0] Journald support. Many important OS distros are using systemd now, such as GCI, CoreOS, CentOS etc. This is essential for NPD adoption. (Issue: #14, PR: #39, #33, @adohe)
  • [P0] Apiserver client option override. By default, NPD runs as a DaemonSet and uses InClusterConfig to access the apiserver. However, this does not work when a service account is not available. (Issue: #27, #21). We should make the apiserver client options configurable, so that users can customize them based on their cluster setup. This is a prerequisite of standalone mode (PR: #49, @andyxning)
  • [P0] Standalone mode. Make it possible to run NPD standalone, possibly as a systemd service. DaemonSet is easy to deploy and manage. However, docker still stops all containers when it's dead (live-restore is still in validation). Because of this, NPD may not be able to detect problems when docker is unresponsive. (Issue: #76)
  • [P1] Integrate NPD with K8s e2e framework. NPD is already running in e2e cluster, but the information it collects is not well-surfaced from the test framework. We should make it visible by failing the test or collecting via a dashboard (Issue: kubernetes/kubernetes#30811).
  • [P1] Scalability and performance. #85
    • Some known performance issue needs to be fixed in NPD, such as reduce apiserver access (#37), and improve log parsing efficiency. (#79, #84)
    • More benchmark to verify the performance of NPD. Both benchmark for NPD resource usage and apiserver load introduced by NPD (#50, @shyamjvs, #85).
  • [P2] Formalize the Project. Formalize the process of the project, including:
    • #66 Add change log. (#45) [P2]
    • Define release process. (#67) [P2]
    • Add pre/post submit e2e test. (#43) [P3]
  • [P2] Docker problem detection. Although kernel monitor could be extended to monitor other logs, it still needs some code change to achieve that. We should cleanup the code to make it easier to monitor other logs and add clear documentation for it. (#44) (#88) (PR: #88, #92, #94)
  • [P3] 3rd party problem daemon integration. Kernel monitor is designed to detect known kernel problems with minimum overhead, it is not expected to be a comprehensive solution. NPD should be extensible to integrate with more small problem daemons or more mature solution. (#35)

Note that only P0s are release blocker.

@dchen1107 @fabioy @ajitak
/cc @kubernetes/sig-node-misc

Performance benchmark and optimization

We've got some data from #2.

However, after that we've made a lot of changes. We should run benchmarks to measure the CPU and memory usage, and optimize the kernel monitor's log-parsing code if the resource usage is too high.

An accurate and acceptable number is important for NPD rollout.

/cc @dchen1107

Support SSL certificates

Our k8s API server is running with a self-signed certificate to enable SSL. I am wondering what's the best way to specify a root CA when running node-problem-detector.
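One possible approach, assuming NPD is pointed at a kubeconfig (e.g. via the apiserver-override flag's auth parameter mentioned elsewhere in this tracker), is to set certificate-authority in the cluster entry instead of insecure-skip-tls-verify. A sketch, with placeholder names, paths and token:

```yaml
apiVersion: v1
kind: Config
clusters:
- name: mycluster
  cluster:
    # Trust the self-signed root CA instead of skipping TLS verification.
    certificate-authority: /etc/kubernetes/pki/ca.crt
    server: https://10.0.0.1:443
users:
- name: node-problem-detector
  user:
    token: <service-account-token>
contexts:
- context:
    cluster: mycluster
    user: node-problem-detector
  name: npd-context
current-context: npd-context
```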

Use chroot instead of relying on a systemd base image.

We want NPD to support journald log even when it's running as daemonset inside a container.

However, it's hard to consume the host journal log inside a container. Previously, we used a Fedora base image and relied on the journald library inside it to understand and read the host journald log. This has several limitations:

  1. The fedora base image is pretty big, and there are a lot of things we don't actually need.
  2. The journald version inside the container may not match the host's, which sometimes causes problems. E.g. the NPD image doesn't work with the new GCI image now; see also the user bug report #114.

A more generic solution may be to mount the host / inside the container and chroot into it. This way, we could use the host's journald directly, which eliminates the problems above.

$ docker run -it --privileged -v /:/rootfs --entrypoint=/bin/bash gcr.io/google_containers/node-problem-detector:v0.4.0
[root@07fb0f31a048 /]# chroot /rootfs
sh-4.3# journalctl -k
-- Logs begin at Sun 2017-06-11 12:12:02 UTC, end at Mon 2017-06-12 18:59:06 UTC. --
Jun 12 18:19:54 e2e-test-lantaol-master kernel: device veth43d8449 entered promiscuous mode
Jun 12 18:19:54 e2e-test-lantaol-master kernel: IPv6: ADDRCONF(NETDEV_UP): veth43d8449: link is not ready
Jun 12 18:19:54 e2e-test-lantaol-master kernel: eth0: renamed from veth75b3ac0
...

I haven't thought through the potential security implications yet, but since NPD is a privileged DaemonSet, it seems reasonable to grant it host filesystem access.

Add RBAC in the example yaml file and document

The default Kubernetes setup removed ABAC support and only uses RBAC now.
kubernetes/kubernetes#39092

This is fine for the NPD in kube-system, which has enough permissions. However, the permissions for the NPD setup in the example and Readme.md may not be enough.

We should add documentation and example to demonstrate how to deploy NPD in a RBAC K8s cluster.
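A sketch of what such an example might include (using the rbac.authorization.k8s.io/v1beta1 API of that era; the exact resources and verbs should be checked against what NPD actually does, namely patching node conditions and creating events):

```yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: node-problem-detector
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/status"]
  verbs: ["get", "patch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: node-problem-detector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
```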

Standalone NPD Support

For better reliability, in Kubernetes 1.6, we plan to add standalone NPD support to run NPD as a system daemon on each node.

To achieve this, we need to:

  • Add apiserver-override option to make it possible to use customized client config instead of InClusterConfig. #49
  • Make NPD wait for: #79
    • Apiserver to come up. #19
    • Kubelet to register node.
      This is necessary, because in standalone mode, NPD may be deployed before apiserver and kubelet are functioning. It should wait for them to come up.
  • NPD should host a simple http server for readiness check and health monitor. #83
    • Required: Add /health for health monitoring.
    • Optional: Add /pprof for performance debugging.
    • Optional: Add debug endpoint to list internal state.
    • Optional: Add grpc problem report endpoint.
  • Resource Limit
    • Get benchmark data about the NPD resource usage #85
    • ~~Adjust NodeAllocatable according to the resource usage.~~
  • Add standalone NPD into K8s GCI cloud init script. kubernetes/kubernetes#40206
    • Package NPD binary as tar ball and upload it to google storage. #71, #74
    • Download and setup NPD binary in K8s GCI cloud init script.
  • NPD should have a specific user account which has necessary permission in K8s cluster. Similar with system:kube-proxy.

node-problem-detector does not work if I change kubelet hostname to node ip.

I am running k8s 1.3.0 with the docker image node-problem-detector:v0.1.

I set the kubelet --hostname-override to the node IP on each minion:

root@SZX1000116607:~# cat /etc/default/kubelet
KUBELET_OPTS='--hostname-override=10.22.109.119 --api-servers=http://10.22.109.119:8080,http://10.22.69.237:8080,http://10.22.117.82:8080 --pod-infra-container-image=xxxxx/kubernetes/pause:latest --cluster-dns=192.168.1.1  --cluster-domain=test1 --low-diskspace-threshold-mb=2048 --cert-dir=/var/run/kubelet --allow-privileged=true'

when I start node-problem-detector as a daemon set, I got a lot of error messages:

2016-07-11T07:49:17.310422203Z I0711 07:49:17.309761       1 kernel_monitor.go:86] Finish parsing log file: {WatcherConfig:{KernelLogPath:/log/kern.log} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:UnregisterNetDeviceIssue Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+}]}
2016-07-11T07:49:17.310510490Z I0711 07:49:17.310026       1 kernel_monitor.go:93] Got system boot time: 2016-07-08 01:40:54.310019188 +0000 UTC
2016-07-11T07:49:17.311878174Z I0711 07:49:17.311436       1 kernel_monitor.go:102] Start kernel monitor
2016-07-11T07:49:17.311907663Z I0711 07:49:17.311654       1 kernel_log_watcher.go:173] unable to parse line: "", can't find timestamp prefix "kernel: [" in line ""
2016-07-11T07:49:17.311926515Z I0711 07:49:17.311696       1 kernel_log_watcher.go:110] Start watching kernel log
2016-07-11T07:49:17.311942118Z I0711 07:49:17.311720       1 problem_detector.go:60] Problem detector started
2016-07-11T07:49:17.314020808Z 2016/07/11 07:49:17 Seeked /log/kern.log - &{Offset:0 Whence:0}
2016-07-11T07:49:18.355974460Z E0711 07:49:18.355712       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:19.314395255Z E0711 07:49:19.314110       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:20.314302138Z E0711 07:49:20.313982       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:21.315151897Z E0711 07:49:21.314849       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:22.313977681Z E0711 07:49:22.313834       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:23.313906286Z E0711 07:49:23.313619       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:24.314428608Z E0711 07:49:24.314141       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:25.316626549Z E0711 07:49:25.316326       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:26.314346471Z E0711 07:49:26.314142       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:27.313901894Z E0711 07:49:27.313759       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:28.314356686Z E0711 07:49:28.314198       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:29.314747334Z E0711 07:49:29.314450       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:30.314562756Z E0711 07:49:30.314235       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found

It seems node-problem-detector still uses the original OS hostname to access the node, instead of the overridden hostname that is registered in etcd.

/wangyumi

Unable to build code, missing godep

I cloned the repo and ran:

$ make node-problem-detector
CGO_ENABLED=0 GOOS=linux godep go build -a -installsuffix cgo -ldflags '-w' -o node-problem-detector
vendor/gopkg.in/fsnotify.v1/inotify.go:19:2: cannot find package "golang.org/x/sys/unix" in any of:
    /home/decarr/go/src/k8s.io/node-problem-detector/vendor/golang.org/x/sys/unix (vendor tree)
    /usr/local/go/src/golang.org/x/sys/unix (from $GOROOT)
    /home/decarr/go/src/k8s.io/node-problem-detector/Godeps/_workspace/src/golang.org/x/sys/unix (from $GOPATH)
    /home/decarr/go/src/golang.org/x/sys/unix
godep: go exit status 1
Makefile:9: recipe for target 'node-problem-detector' failed
make: *** [node-problem-detector] Error 1

Looks like this is missing a godep in master:
https://github.com/kubernetes/node-problem-detector/tree/master/vendor
