node-problem-detector's Issues

Write Readme.md for KernelMonitor

Write a Readme.md for KernelMonitor that covers:

  • The motivation for KernelMonitor
  • How to use the current KernelMonitor
  • Future plans for KernelMonitor

/cc @kubernetes/node-problem-detector-maintainers

"unregister_netdevice" isn't necessarily a KernelDeadlock

I have a node running CoreOS 1221.0.0 with kernel version 4.8.6-coreos.

The node-problem-detector marked it with "KernelDeadlock True Sun, 04 Dec 2016 18:56:20 -0800 Wed, 16 Nov 2016 00:03:33 -0800 UnregisterNetDeviceIssue unregister_netdevice: waiting for lo to become free. Usage count = 1".

If I check my kernel log, I see the following:

$ dmesg -T | grep -i unregister_netdevice -C 3
[Wed Nov 16 08:02:19 2016] docker0: port 5(vethfd2807b) entered blocking state
[Wed Nov 16 08:02:19 2016] docker0: port 5(vethfd2807b) entered forwarding state
[Wed Nov 16 08:02:19 2016] IPv6: eth0: IPv6 duplicate address fe80::42:aff:fe02:1206 detected!
[Wed Nov 16 08:03:33 2016] unregister_netdevice: waiting for lo to become free. Usage count = 1
[Wed Nov 16 08:14:35 2016] vethafecb94: renamed from eth0
[Wed Nov 16 08:14:35 2016] docker0: port 2(veth807b9e2) entered disabled state
[Wed Nov 16 08:14:35 2016] docker0: port 2(veth807b9e2) entered disabled state

Clearly, the node managed to continue to perform operations after printing that message. In addition, pods continue to function just fine and there aren't any long-term issues for me on this node.

I know that what counts as a deadlock is configurable, but perhaps the default configuration shouldn't include this message, or the check for it should be more sophisticated, since as-is it can be quite confusing.
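One reason the check is so easily confused: the rule fires on a single matching log line, with no follow-up check that the condition persists. A quick offline demonstration (the pattern below is copied from the default kernel-monitor config quoted elsewhere in this report; the node's actual dmesg line matches it immediately):

```python
import re

# Pattern from the stock kernel-monitor config for UnregisterNetDevice.
# Note the unescaped '.' before "Usage" is present in the shipped pattern.
pattern = re.compile(
    r"unregister_netdevice: waiting for \w+ to become free. Usage count = \d+"
)

# The exact line from the node's kernel log.
msg = "unregister_netdevice: waiting for lo to become free. Usage count = 1"

# A single occurrence is enough to flip the condition, even though the
# kernel may recover moments later.
assert pattern.search(msg)
```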

DaemonSet NPD creates /var/log/journal on non-journald node.

Ideally, we should not use the journald config on a non-systemd node. If we do use it, it is expected to error out.

However, for some as-yet-unknown reason, NPD will create a /var/log/journal directory and keep going. Related code: https://github.com/kubernetes/node-problem-detector/blob/master/pkg/systemlogmonitor/logwatchers/journald/log_watcher.go#L141

This should not matter much, because nothing will actually write any logs there, so NPD will just hang without consuming more resources. But we should still fix this.
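One possible guard (a minimal sketch in Python, not the actual Go fix): before starting the journald watcher, detect whether the host was booted with systemd using the convention systemd itself documents in sd_booted(3) — the directory /run/systemd/system exists only under systemd. The `run_dir` parameter here is just for testability.

```python
import os

def sd_booted(run_dir: str = "/run/systemd/system") -> bool:
    """Return True if the host was booted with systemd.

    Mirrors the sd_booted(3) convention: /run/systemd/system exists
    only on systemd hosts. NPD could refuse to start the journald
    log watcher (instead of creating /var/log/journal) when this
    check fails.
    """
    return os.path.isdir(run_dir)
```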

@kubernetes/node-problem-detector-maintainers

status doesn't change when injecting log messages

I cannot get the node problem detector to change a node status by injecting messages.

I am using kubernetes 1.5.2, Ubuntu 16.04, kernel 4.4.0-51-generic.

I run NPD as a DaemonSet. I have attempted to get this to work with NPD versions 0.3.0 and 0.4.0. I start NPD with the default command, using /config/kernel-monitor.json, because my nodes use journald.

I have /dev/kmsg mounted into the pod, and I echo expressions matching the regexes in kernel-monitor.json to /dev/kmsg on the node. I can view the fake logs I've echoed to /dev/kmsg from inside the pod.

Steps to reproduce:

# as root on the node where your pod is running
echo "task umount.aufs:123 blocked for more than 120 seconds." >> /dev/kmsg
# I have verified that these logs show up in journalctl -k

# this should match the following permanent condition in /config/kernel-monitor.json
#	{
#		"type": "permanent",
#		"condition": "KernelDeadlock",
#		"reason": "AUFSUmountHung",
#		"pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
#	},

# check the node status of the node where you ran this on
kubectl get node <node>
# status will still be Ready

# for further detail examine the json
kubectl get node <node> -o json | jq .status.conditions
# you will see that the KernelDeadlock condition is still "False"

# I would expect the general status to change to "KernelDeadlock"
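To rule out a regex mismatch as the cause, the injected line can be checked offline against the pattern from the config snippet above (JSON escaping removed). It does match, so the problem is somewhere downstream of pattern matching:

```python
import re

# Pattern from /config/kernel-monitor.json (AUFSUmountHung rule),
# with the JSON string escaping removed.
pattern = re.compile(r"task umount\.aufs:\w+ blocked for more than \w+ seconds\.")

# The exact line echoed to /dev/kmsg in the repro steps.
injected = "task umount.aufs:123 blocked for more than 120 seconds."

assert pattern.search(injected)  # the injected line does match the rule
```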

If I am not testing this properly, could you please give a detailed breakdown of how to test that the node problem detector is working properly for both kernel logs AND docker logs?

I have also reproduced this behavior using a custom docker_monitor.json and having the systemd docker service write to the journald docker logs. I have still been unsuccessful in getting the node status to change.

NodeProblemDetector tests failing on AWS

c.nodeName, err = os.Hostname()
assumes that the hostname is the same as the cloud provider instance name. This is false on AWS. NPD needs to go through CurrentNodeName to get the cloud-provider-agnostic node name.

I'm disabling this test on AWS for now, but we're marching towards full support for AWS. Please treat this as if it had broken a build when it was originally checked in.

Now I cannot run it in Kubernetes 1.5.3

`Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message


28m 28m 1 {kubelet 10.8.65.157} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "kubernetes.io/host-path/5203c2d8-1b73-11e7-8754-46c88707ea0b-log" (spec.Name: "log") pod "5203c2d8-1b73-11e7-8754-46c88707ea0b" (UID: "5203c2d8-1b73-11e7-8754-46c88707ea0b").
28m 28m 1 {kubelet 10.8.65.157} Normal SuccessfulMountVolume MountVolume.SetUp succeeded for volume "kubernetes.io/host-path/5203c2d8-1b73-11e7-8754-46c88707ea0b-localtime" (spec.Name: "localtime") pod "5203c2d8-1b73-11e7-8754-46c88707ea0b" (UID: "5203c2d8-1b73-11e7-8754-46c88707ea0b").
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id e65759efc680; Security:[seccomp=unconfined]
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id e65759efc680
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id a7e78dfbe63a; Security:[seccomp=unconfined]
28m 28m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id a7e78dfbe63a
27m 27m 2 {kubelet 10.8.65.157} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with CrashLoopBackOff: "Back-off 10s restarting failed container=node-problem-detector pod=node-problem-detector-gfx8q_default(5203c2d8-1b73-11e7-8754-46c88707ea0b)"

27m 27m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Created Created container with docker id 6d3ac4da1b6e; Security:[seccomp=unconfined]
27m 27m 1 {kubelet 10.8.65.157} spec.containers{node-problem-detector} Normal Started Started container with docker id 6d3ac4da1b6e
27m 27m 2 {kubelet 10.8.65.157} Warning FailedSync Error syncing pod, skipping: failed to "StartContainer" for "node-problem-detector" with CrashLoopBackOff: "Back-off 20s restarting failed container=node-problem-detector pod=node-problem-detector-gfx8q_default(5203c2d8-1b73-11e7-8754-46c88707ea0b)"

`

Now I am running the DaemonSet on all nodes, but how do I verify that it is useful?

What happens when something goes wrong on a node after the DaemonSet is created? Will the node be set to unschedulable?

Now I am running the DaemonSet on all nodes, but how do I verify that it is useful?

It is a bit difficult to simulate these problems:

Hardware issues: bad CPU, memory, or disk;
Kernel issues: kernel deadlock, corrupted file system;
Container runtime issues: unresponsive runtime daemon;
...
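For the kernel-level problems, one practical way to verify the detector end-to-end is to inject lines matching its rules into the watched log source (e.g. echo to /dev/kmsg as root, as other issues here describe). The patterns below are copied from the default kernel-monitor config quoted in this report's logs; the sample lines are made-up illustrations of what to inject:

```python
import re

# A few of the stock kernel-monitor rules, paired with sample lines one
# could inject into the watched log source to simulate each problem.
rules = {
    "TaskHung": r"task \S+:\w+ blocked for more than \w+ seconds\.",
    "UnregisterNetDevice": r"unregister_netdevice: waiting for \w+ to become free. Usage count = \d+",
    "DockerHung": r"task docker:\w+ blocked for more than \w+ seconds\.",
}

# Hypothetical injected lines (PIDs and interface names invented).
samples = {
    "TaskHung": "task nfsd:1234 blocked for more than 120 seconds.",
    "UnregisterNetDevice": "unregister_netdevice: waiting for eth0 to become free. Usage count = 3",
    "DockerHung": "task docker:7890 blocked for more than 120 seconds.",
}

for reason, pattern in rules.items():
    assert re.search(pattern, samples[reason]), reason
```

After injecting such a line, the corresponding event or condition should appear in `kubectl describe node <node>`. Note that NPD only reports conditions; it does not itself cordon the node.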

"startPattern" is fragile and wrong on newer kernels

Broken out from here

Currently, the config has a default of "startPattern": "Initializing cgroup subsys cpuset",

This pattern is meant to detect a node's boot. Prior to the 4.5 kernel, this message was typically printed during boot. After 4.5, however, due to this change, it is quite unlikely for that message to appear.

Furthermore, there's rarely a reason to detect whether a message is for the current boot in such a fragile way.

With the kern.log reader, every message is for the current boot because kern.log is usually rotated per boot (e.g. kern.log is this boot, kern.log.1 the boot before, kern.log.2.gz the one before that, etc.). (EDIT: I'm wrong about this for GCI at least)

With journald, the boot id is annotated in messages, and so it can accurately be correlated with the current boot id (see the "_BOOT_ID" record in journald messages).

With a kmsg reader, all messages are from the current boot because kmsg is not persistent.

In none of those cases is startPattern useful. I think each kernel log parsing plugin should be responsible for doing the right thing itself.
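For the journald case, the filtering a plugin would do is straightforward. A sketch (in Python for brevity, not the actual Go watcher), assuming journal records as JSON objects like those emitted by `journalctl -o json`; the current boot id is available on Linux from /proc/sys/kernel/random/boot_id (journald's _BOOT_ID field stores it without dashes):

```python
import json

def current_boot_entries(lines, current_boot_id):
    """Yield journal entries belonging to the current boot.

    `lines` are JSON records as emitted by `journalctl -o json`;
    entries are kept only when their _BOOT_ID matches the machine's
    current boot id -- no startPattern needed.
    """
    for line in lines:
        entry = json.loads(line)
        if entry.get("_BOOT_ID") == current_boot_id:
            yield entry

# Two fake records: one from a previous boot, one from the current boot.
records = [
    '{"_BOOT_ID": "aaaa", "MESSAGE": "old boot message"}',
    '{"_BOOT_ID": "bbbb", "MESSAGE": "current boot message"}',
]
msgs = [e["MESSAGE"] for e in current_boot_entries(records, "bbbb")]
assert msgs == ["current boot message"]
```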

Node-Problem-Detector should Patch NodeStatus not Update

Problem:
kubelet, node-controller, and now the node-problem-detector all update the Node.Status field.

Normally this is not a problem: if changes happen rapidly, the update fails on a resource version mismatch and everything is OK.

When a new field is added to Node Status, however:

  • Since kubelet and node-controller are in the same repository they are recompiled with the new field and Status updates continue to operate normally.
  • However, since node-problem-detector is in a separate repository and has not been recompiled with the new version, its Update calls end up squashing the new (unknown) field, resetting it to nil.

Suggested Solution:
node-problem-detector should do a patch instead of an update to prevent wiping out fields it is not aware of. Incidentally, kubelet and node-controller should do this as well, but that is not as critical since they are in the same repository and new types are normally added there first.
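The squashing behavior can be illustrated abstractly (a sketch in Python with plain dicts, not the actual client code): a full update writes back the client's entire view of the status, dropping any field its compiled-in types don't know about, while a merge-patch (in the spirit of RFC 7386 JSON Merge Patch, sketched here for flat keys only) touches only the keys the client sends.

```python
import copy

# Server-side status includes a field the old client doesn't know about.
server_status = {"conditions": [], "newField": "important"}

# Old client's view, deserialized into types compiled without newField.
client_status = {"conditions": [{"type": "KernelDeadlock", "status": "False"}]}

def full_update(server, client_view):
    # PUT/update semantics: replace the whole object with the client's view.
    return copy.deepcopy(client_view)

def merge_patch(server, patch):
    # Merge-patch semantics, flat keys only: modify just the keys present
    # in the patch, leaving everything else on the server untouched.
    result = copy.deepcopy(server)
    for k, v in patch.items():
        result[k] = copy.deepcopy(v)
    return result

after_update = full_update(server_status, client_status)
assert "newField" not in after_update          # update squashed the new field

after_patch = merge_patch(server_status, client_status)
assert after_patch["newField"] == "important"  # patch preserved it
```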

CC @kubernetes/sig-node @kubernetes/sig-api-machinery @Random-Liu @bgrant0607

node-problem-detector:v0.4.0 - failed to update node conditions: Timeout

Hello,

I am testing out the node-problem-detector cluster addon and am seeing this error on only 1 node:

  • failed to update node conditions: Timeout: request did not complete within allowed duration

This cluster has had issues in the past: some etcd3 nodes became unhealthy, and flanneld was at one point unable to connect to etcd3, causing kube-apiserver to fail to respond, and hence kubelet, kube-controller-manager, kube-proxy, and kube-scheduler to log various API timeout errors.

After restarting everything, the rest of the nodes are healthy & both etcd3 + kube-apiserver appear to be contactable by the kube-system services. I then started npd-0.4.0 using the above manifest and am seeing only 1 node with this issue. The rest are sometimes logging what appears to be the race condition for updating node status exactly as in #108.

I've checked the health of the 5-node etcd cluster and only 1 of 5 is in an unhealthy state. Requests from kubectl return most of the time; however, I am seeing some timeouts from kubectl describe node XXX commands.

Relevant log sections below:

/var/log/node-problem-detector.log:

[... BEGIN LOG ...]

I0617 05:08:14.386945       7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/var/log/journal
Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:
kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, a
non-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNe
tDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL po
inter dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Patt
ern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ second
s\.}]}
I0617 05:08:14.387062       7 log_watchers.go:40] Use log watcher of plugin "journald"
I0617 05:08:14.387253       7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:filelog PluginConfig:map[timestamp:^.{15} message:kernel: \[.*\] (
.*) timestampFormat:Jan _2 15:04:05] LogPath:/var/log/kern.log Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:00
01-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d
+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more tha
n \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition:
 Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Typ
e:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:Dock
erHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0617 05:08:14.387286       7 log_watchers.go:40] Use log watcher of plugin "filelog"
I0617 05:08:14.387923       7 log_monitor.go:81] Start log monitor
E0617 05:08:14.387946       7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor-filelog.json": failed to stat the file "/var/log/kern.log": stat /var/lo
g/kern.log: no such file or directory
I0617 05:08:14.387952       7 log_monitor.go:81] Start log monitor
I0617 05:08:14.389906       7 log_watcher.go:69] Start watching journald
I0617 05:08:14.389947       7 problem_detector.go:74] Problem detector started
I0617 05:08:14.390046       7 log_monitor.go:173] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2017-06-17 05:08:14.390031295 +0000 UTC Reason:Ker
nelHasNoDeadlock Message:kernel has no deadlock}]
E0617 06:41:07.988201       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2163685", currentResourceVersion: "2163729"):
 diff1={"metadata":{"resourceVersion":"2163729"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T06:41
:01Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T06:41:01Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T06:41:02Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoD
eadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 08:10:02.768550       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2181151", currentResourceVersion: "2181185"):
 diff1={"metadata":{"resourceVersion":"2181185"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T08:10
:02Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T08:10:02Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T08:10:02Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoD
eadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 08:40:02.392305       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 08:40:32.394968       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 08:40:52.688694       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 10:19:47.886026       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2206191", currentResourceVersion: "2206223"):
 diff1={"metadata":{"resourceVersion":"2206223"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:19:44Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:47Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:20:03.868248       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2206223", currentResourceVersion: "2206276"):
 diff1={"metadata":{"resourceVersion":"2206276"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:19:57Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:19:58Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:27:14.278874       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2207596", currentResourceVersion: "2207627"):
 diff1={"metadata":{"resourceVersion":"2207627"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T10:27:09Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T10:27:11Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 10:49:49.999965       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:04:58.654339       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2214802", currentResourceVersion: "2214843"):
 diff1={"metadata":{"resourceVersion":"2214843"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:04:56Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:04:58Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:13:25.349869       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:13:27.448321       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2216376", currentResourceVersion: "2216452"):
 diff1={"metadata":{"resourceVersion":"2216452"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:13:26Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:13:25Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:27:54.707593       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:28:03.676434       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:28:11.687756       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2219181", currentResourceVersion: "2219220"):
 diff1={"metadata":{"resourceVersion":"2219220"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:27:56Z","type":"KernelDeadlock"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:28:07Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:35:22.216998       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2220568", currentResourceVersion: "2220603"):
 diff1={"metadata":{"resourceVersion":"2220603"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:35:12Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:35:21Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:35:39.515617       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:36:44.370694       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful conflict (firstResourceVersion: "2220826", currentResourceVersion: "2220882"):
 diff1={"metadata":{"resourceVersion":"2220882"},"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-06-17T11:36:40Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T11:36:43Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 11:43:04.549531       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 11:52:29.011584       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 11:55:55.741212       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost

[...SNIP...]

E0617 20:48:43.758632       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:50:08.649485       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:50:36.742814       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:51:06.745291       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:51:18.310623       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:51:25.507013       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:51:35.682252       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:51:40.731491       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2329653", currentResourceVersion: "2329819"):
 diff1={"metadata":{"resourceVersion":"2329819"},"status":{"conditions":[{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"N
odeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnkn
own","status":"Unknown","type":"MemoryPressure"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","statu
s":"Unknown","type":"OutOfDisk"},{"lastTransitionTime":"2017-06-17T20:49:33Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","ty
pe":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T20:51:39Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoD
eadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 20:52:05.903239       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:52:35.941987       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:53:05.990698       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:53:35.993274       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:54:05.995654       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:54:35.998026       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:55:06.009227       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:55:36.011974       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:56:01.937423       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:56:12.384973       7 manager.go:160] failed to update node conditions: etcdserver: request timed out
E0617 20:56:42.390952       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:57:12.393356       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:57:42.395943       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:58:12.405949       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 20:58:22.411682       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:58:33.305720       7 manager.go:160] failed to update node conditions: etcdserver: request timed out, possibly due to connection lost
E0617 20:58:34.805954       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "ip-10-123-1-104.ec2.internal": there is a meaningful confli
ct (firstResourceVersion: "2330223", currentResourceVersion: "2330380"):
 diff1={"metadata":{"resourceVersion":"2330380"},"status":{"conditions":[{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"N
odeStatusUnknown","status":"Unknown","type":"DiskPressure"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnkn
own","status":"Unknown","type":"MemoryPressure"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"OutOfDisk"},{"lastTransitionTime":"2017-06-17T20:52:28Z","message":"Kubelet stopped posting node status.","reason":"NodeStatusUnknown","status":"Unknown","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-06-17T20:58:33Z","lastTransitionTime":"2017-06-17T05:08:14Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}
E0617 20:59:13.391379       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration

[...SNIP...]
[... MESSAGE REPEATS APPROX EVERY ~1/2 SECOND ...]
[...SNIP...]

E0617 21:22:43.790298       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:23:13.792892       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:23:43.795469       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:24:13.798194       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:24:43.800739       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:25:13.803335       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:25:43.805904       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:26:13.808357       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:26:43.810985       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:27:13.813560       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:27:43.816150       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:28:13.886384       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:28:43.889035       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:29:13.891666       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:29:43.894478       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:30:13.897056       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:30:43.922019       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:31:13.947162       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:31:43.949919       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:32:13.952527       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:32:43.955028       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:33:13.957574       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:33:43.960186       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:34:13.962807       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:34:43.965335       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:35:13.967982       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration
E0617 21:35:43.971220       7 manager.go:160] failed to update node conditions: Timeout: request did not complete within allowed duration

[...SNIP...]
[... MESSAGE REPEATS APPROX EVERY ~1/2 SECOND ...]

Write Readme.md for NodeProblemDetector

Write Readme.md for NodeProblemDetector to demonstrate:

  • Motivation and scope of NodeProblemDetector
  • Usage of current NodeProblemDetector
  • Future plan of NodeProblemDetector.

/cc @kubernetes/node-problem-detector-maintainers

NPD sending too many node-status updates in scale tests

We ran a 4k-node scalability test and observed that fluentd gets oom-killed frequently.
As a result, NPD sends too many patch node-status requests (~3k qps out of ~11k total qps).
Ref: kubernetes/kubernetes#47344 (comment) kubernetes/kubernetes#47865 (comment)

Why is NPD reporting OOM events when the kubelet already seems to do that?
Can we make NPD report OOMs only for system processes and let the kubelet alone take care of k8s containers (by modifying NPD's 'OOMKilling' rule)?

cc @Random-Liu @gmarek @kubernetes/sig-node-bugs

does this tool work outside Google?

Panic crash:

kubectl get pods -o wide --all-namespaces
NAMESPACE     NAME                                READY     STATUS             RESTARTS   AGE       NODE
default       node-problem-detector-0kgkw         0/1       CrashLoopBackOff   3          1m        192.168.78.15
default       node-problem-detector-ar3tk         0/1       CrashLoopBackOff   3          1m        192.168.78.16

kubectl logs node-problem-detector-0kgkw
I0623 01:02:18.560287       1 kernel_monitor.go:86] Finish parsing log file: {WatcherConfig:{KernelLogPath:/log/kern.log} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:UnregisterNetDeviceIssue Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+}]}
I0623 01:02:18.560413       1 kernel_monitor.go:93] Got system boot time: 2016-06-17 17:51:02.560408109 +0000 UTC
panic: open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory

goroutine 1 [running]:
panic(0x15fc280, 0xc8204fc000)
    /usr/local/go/src/runtime/panic.go:464 +0x3e6
k8s.io/node-problem-detector/pkg/problemclient.NewClientOrDie(0x0, 0x0)
    /usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/pkg/problemclient/problem_client.go:56 +0x132
k8s.io/node-problem-detector/pkg/problemdetector.NewProblemDetector(0x7faa4f155140, 0xc8202a6900, 0x0, 0x0)
    /usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/pkg/problemdetector/problem_detector.go:45 +0x36
main.main()
    /usr/local/google/home/lantaol/workspace/src/k8s.io/node-problem-detector/node_problem_detector.go:33 +0x56

We should add real case e2e test.

We already have a NPD e2e test in kubernetes repo: https://github.com/kubernetes/kubernetes/blob/master/test/e2e/node_problem_detector.go.

However, the e2e test doesn't actually test the NPD deployed in the cluster. It only:

  • Deploys an NPD pod with a test config.
  • Generates test log entries in the test log file.
  • Checks whether the test node condition is set.

Although this test is still necessary, we should add a new test that actually injects logs into the kernel log, or even triggers a real kernel problem, and verifies the real NPD deployed in the cluster.

The test is disruptive, so we may want to add a [Feature] tag and create a dedicated Jenkins job for it.

/cc @dchen1107

Generalize kernel monitor.

Discussed with @apatil.

Kernel monitor was initially introduced to monitor the kernel log and detect kernel issues.

However, it could in fact be extended to monitor other logs, such as the docker log, the systemd log, etc., by adding new translators. This is already doable today, but not very intuitive because:

  1. All files, types and functions are named kernel xxx.
  2. Translator is not configurable.

We should refactor the code to make it easier and more intuitive to extend kernel monitor:

  • Change kernelmonitor to logmonitor. We'll only use log monitor to monitor kernel log for K8s, but it should be easy for other users to reconfigure and extend it to monitor other logs.
  • Extend the configuration to make translator and log source configurable after #41 landed, including:
    • Make the journald log filter configurable.
    • Make the translate function configurable.
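For illustration, a monitor config for docker logs under such a generalized scheme might mirror the existing kernel-monitor config structure. This is only a sketch: the field names follow the kernel monitor's config struct, and the timestamp layout and example rule are hypothetical, not a finalized API:

```json
{
  "plugin": "filelog",
  "pluginConfig": {
    "timestamp": "^time=\"(\\S*)\"",
    "message": "msg=\"([^\"]*)\"",
    "timestampFormat": "2006-01-02T15:04:05.999999999Z07:00"
  },
  "logPath": "/var/log/docker.log",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "docker-monitor",
  "conditions": [],
  "rules": [
    {
      "type": "temporary",
      "reason": "CorruptDockerImage",
      "pattern": "Error trying v2 registry: failed to register layer: .*"
    }
  ]
}
```

The point is that nothing would be kernel-specific once the watcher plugin, the translator and the patterns are all configuration.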

/cc @kubernetes/sig-node

Bump the version to v0.3 in makefile, manifest and readme

After adding the standalone mode in #49, I created and pushed a new docker image gcr.io/google_containers/node-problem-detector:v0.3 which includes these changes for use by kubemark. However, this version bump has to be reflected in:

  1. version tag inside the Makefile
  2. image inside the node-problem-detector.yaml
  3. readme, to indicate that v0.3 is the latest version and/or suggested for use?

@Random-Liu Hope this push doesn't disturb any plans you might have for the NPD release schedule. If it does, the change is easy to revert, as I made it just for use in kubemark.

does not automatically create /var/log/journal

Hi,

  1. great initiative. Here is my experience on RHEL/CentOS 7

  2. Apparently node-problem-detector is not able to work with the /run/log/journal/ (unfortunately) - correct? If so, it seems like a big limitation, no?

  3. The issue is that even if it cannot do that, it also fails to start, as it depends on the existence of "/var/log/journal". Given that it's not able to use /run/log/journal, is it reasonable to expect it to create "/var/log/journal" on each node?

  4. "/var/log/kern.log" is missing on my OS. I guess it's no longer in use, or does it need to be configured?

Here are the logs:

I0621 14:32:46.173199       7 log_watchers.go:40] Use log watcher of plugin "journald"
I0621 14:32:46.186164       7 log_monitor.go:72] Finish parsing log monitor config file: {WatcherConfig:{Plugin:filelog PluginConfig:map[timestamp:^.{15} message:kernel: \[.*\] (.*) timestampFormat:Jan _2 15:04:05] LogPath:/var/log/kern.log Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0621 14:32:46.186204       7 log_watchers.go:40] Use log watcher of plugin "filelog"
I0621 14:32:46.186926       7 log_monitor.go:81] Start log monitor
E0621 14:32:46.186945       7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor.json": failed to stat the log path "/var/log/journal": stat /var/log/journal: no such file or directory
I0621 14:32:46.186951       7 log_monitor.go:81] Start log monitor
E0621 14:32:46.186958       7 problem_detector.go:65] Failed to start log monitor "/config/kernel-monitor-filelog.json": failed to stat the file "/var/log/kern.log": stat /var/log/kern.log: no such file or directory
F0621 14:32:46.186964       7 node_problem_detector.go:88] Problem detector failed with error: no log montior is successfully setup

Release instructions are needed.

Forked from #63 (comment).
We need release instructions to standardize the release process. The instructions should cover:

  • How to build release packages
    • Build docker image.
    • Build tar ball for standalone.
    • ...
  • How to cut release
    • Version a release
    • Update CHANGELOG.md
    • Create branch/tag
    • Create github release.
    • ...
  • How to update kubernetes
    • Update e2e test
    • Update addon pods in cluster
    • ...

This has lower priority than the features; I will file the doc before cutting a release.

can't update node condition

Hi, I deployed NPD yesterday on my k8s 1.6 cluster. Now I find it can't update the node condition. Below is its log:

E0418 09:48:32.826374       7 manager.go:160] failed to update node conditions: Operation cannot be fulfilled on nodes "cs55": there is a meaningful conflict (firstResourceVersion: "881081", currentResourceVersion: "881106"):
 diff1={"metadata":{"resourceVersion":"881106"},"status":{"conditions":[{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"DiskPressure"},{"lastHeartbeatTime":"2017-04-18T01:47:32Z","type":"KernelDeadlock"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"MemoryPressure"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"OutOfDisk"},{"lastHeartbeatTime":"2017-04-18T01:48:32Z","type":"Ready"}]}}
, diff2={"status":{"conditions":[{"lastHeartbeatTime":"2017-04-18T01:48:32Z","lastTransitionTime":"2017-04-17T09:10:43Z","message":"kernel has no deadlock","reason":"KernelHasNoDeadlock","status":"False","type":"KernelDeadlock"}]}}

I wonder why this happens? @Random-Liu thx

Feature request for a "hollow"-node-problem-detector having an empty list of conditions and rules inside kernel monitor config

As part of the effort to make testing using kubemark mimic real clusters as closely as possible, we are planning to add a container for a "hollow-node-problem-detector" inside the hollow-node, alongside the existing containers "hollow-kubelet" and "hollow-kubeproxy". For this, we need a node-problem-detector image which essentially has the conditions and rules inside the kernel monitoring config set to empty lists. Also, this image should eventually (once it is tested to work fine in Kubemark) be pushed to gcr.io/google-containers.
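Concretely, the kernel monitor config baked into such a "hollow" image would presumably be a copy of kernel-monitor.json with the detection lists emptied out, roughly (a sketch; field names assumed from the existing config):

```json
{
  "logPath": "/var/log/kern.log",
  "lookback": "5m",
  "bufferSize": 10,
  "source": "kernel-monitor",
  "conditions": [],
  "rules": []
}
```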

@kubernetes/sig-scalability @wojtek-t @gmarek @Random-Liu

A systemd service for test.

After introducing a new pattern, we usually need to verify it.

However, it's hard to trigger a real problem for a given service, so we usually have to inject logs to verify it.

It's hard to inject logs into a given systemd service. We should have a test systemd service that does nothing but generate specified log lines. We could then let NPD parse the test service's log to verify the newly introduced pattern.
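A minimal sketch of such a test service (the unit name and message below are hypothetical): a oneshot unit whose only job is to write an arbitrary line into the journal under its own unit name:

```ini
# npd-test-log-generator.service (hypothetical unit name)
[Unit]
Description=NPD test log generator

[Service]
Type=oneshot
# Emit whatever line the pattern under test should match.
ExecStart=/bin/echo "task docker:1234 blocked for more than 120 seconds."
```

Starting the unit would put the message into systemd-journal attributed to that unit, so an NPD journald monitor filtered on the test unit could verify the new pattern without touching real services.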

failed to build node-problem-detector

  • go version go1.7 darwin/amd64
  • godep v74 (darwin/amd64/go1.7)

make failed with output:

CGO_ENABLED=0 GOOS=linux godep go build -a -installsuffix cgo -ldflags '-w' -o node-problem-detector
vendor/github.com/coreos/go-systemd/sdjournal/functions.go:19:2: no buildable Go source files in /Users/sandflee/goproject/src/k8s.io/node-problem-detector/vendor/github.com/coreos/pkg/dlopen
godep: go exit status 1

OOMKilling not triggering with recent kernels

With Linux 4.9, and perhaps earlier versions, there's a trailing

, shmem-rss:\\d+kB

Should it be optional in the built-in regex, are admins expected to edit their configuration, or something else?
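To sanity-check the optional-field idea: assuming the pattern is matched against the full message line, a non-capturing optional group covers both the old and the new kernel formats. A quick illustration (NPD itself matches in Go, but the regex behaves the same):

```python
import re

# Second line of NPD's built-in OOMKilling pattern (file-rss is the last field).
old = r"Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB"
# Same pattern with the trailing shmem-rss field made optional.
new = old + r"(?:, shmem-rss:\d+kB)?"

pre_49 = "Killed process 1234 (stress) total-vm:1024kB, anon-rss:512kB, file-rss:0kB"
post_49 = pre_49 + ", shmem-rss:0kB"

# With anchored (full-line) matching, the old pattern misses the newer format:
print(bool(re.fullmatch(old, post_49)))   # False
# The adjusted pattern matches both formats:
print(bool(re.fullmatch(new, pre_49)))    # True
print(bool(re.fullmatch(new, post_49)))   # True
```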

How to hook up third-party daemons?

I'm an ABRT developer and I would love to create a problem daemon reporting problems detected by ABRT to node-problem-detector.

ABRT's architecture is similar to node-problem-detector's - there are agents reporting detected problems to abrtd. An ABRT agent is either a tiny daemon watching logs (or systemd-journal) or a language error handler (Python sys.excepthook, Ruby at_exit callback, /proc/sys/kernel/core_pattern, Node.js uncaughtException event handler, Java JNI agent).

I've created a docker image that is capable of detecting kernel oopses, vmcores and core files on a host:
https://github.com/jfilak/docker-abrt/tree/atomic_minimal

(It should be possible to detect uncaught [Python, Ruby, Java] exceptions in the future)

ABRT provides several ways of reporting the detected problems to users - e-mail, FTP|SCP upload, D-Bus signal, Bugzilla bug, micro-Report, systemd-journal catalog message - and it is trivial to add another report destination.

The Design Doc defines a "Problem Report Interface", but I've failed to find out how to register a new problem daemon with node-problem-detector, or how to use the "Problem Report Interface" from a third-party daemon.

apiserver-override flag in standalone mode not working in kubemark

So I created NPD containers inside the hollow-nodes of kubemark (just for clarity: these are actually pods) by passing the address of the kubemark master and the kubeconfig file through the override flag as follows:

"name": "hollow-node-problem-detector",
"image": "gcr.io/google_containers/node-problem-detector:v0.3",
"env": [
						{
							"name": "NODE_NAME",
							"valueFrom": {
								"fieldRef": {
									"fieldPath": "metadata.name"
								}
							}
						}
],
"command": [
						"/node-problem-detector",
						"--kernel-monitor=/config/kernel-monitor.json",
						"--apiserver-override=https://104.198.41.48:443?inClusterConfig=false&auth=/kubeconfig/npd_kubeconfig",
						"--alsologtostderr",
						"1>>/var/logs/npd_$(NODE_NAME).log 2>&1"
],
"volumeMounts": ....

I even checked inside the container using kubectl exec that the right kubeconfig file is present on the container and NODE_NAME env var is defined. Here's the kubeconfig:

apiVersion: v1
kind: Config
users:
- name: node-problem-detector
  user:
    token: 8qt8RLdwIKeZQ0QeVrSWg4BrqK3Cs8H8
clusters:
- name: kubemark
  cluster:
    insecure-skip-tls-verify: true
    server: https://104.198.41.48
contexts:
- context:
    cluster: kubemark
    user: node-problem-detector
  name: kubemark-npd-context
current-context: kubemark-npd-context

Despite this, NPD seems to communicate with the wrong master (the master of the real cluster underneath the kubemark cluster). It is probably still using the InClusterConfig from the real node underneath the hollow-node, as that seems to be the only way it could get hold of the wrong master's IP address.

cc @Random-Liu @andyxning

NPD Kubernetes 1.6 Planning

NPD (node-problem-detector) was introduced in Kubernetes 1.3 as a default add-on in GCE clusters.

At that time, it mainly targeted the default GCE Kubernetes setup. However, over time, limitations were found (missing journald support, an authentication issue, a scalability issue) that affected the adoption of NPD in many other environments.

In Kubernetes 1.6, we plan to invest some time to improve NPD, make it production-ready, and roll it out in GKE.

Here are the working items and priorities:

  • [P0] Journald support. Many important OS distros are using systemd now, such as GCI, CoreOS, CentOS etc. This is essential for NPD adoption. (Issue: #14, PR: #39, #33, @adohe)
  • [P0] Apiserver client option override. By default, NPD runs as a DaemonSet and uses InClusterConfig to access the apiserver. However, this does not work when a service account is not available. (Issue: #27, #21). We should make the apiserver client options configurable, so that users can customize them based on their cluster setup. This is a prerequisite of standalone mode (PR: #49, @andyxning)
  • [P0] Standalone mode. Make it possible to run NPD standalone, possibly as a systemd service. DaemonSet is easy to deploy and manage. However, docker still stops all containers when it's dead (live-restore is still in validation). Because of this, NPD may not be able to detect problems when docker is unresponsive. (Issue: #76)
  • [P1] Integrate NPD with K8s e2e framework. NPD is already running in e2e cluster, but the information it collects is not well-surfaced from the test framework. We should make it visible by failing the test or collecting via a dashboard (Issue: kubernetes/kubernetes#30811).
  • [P1] Scalability and performance. #85
    • Some known performance issue needs to be fixed in NPD, such as reduce apiserver access (#37), and improve log parsing efficiency. (#79, #84)
    • More benchmark to verify the performance of NPD. Both benchmark for NPD resource usage and apiserver load introduced by NPD (#50, @shyamjvs, #85).
  • [P2] Formalize the Project. Formalize the process of the project, including:
    • #66 Add change log. (#45) [P2]
    • Define release process. (#67) [P2]
    • Add pre/post submit e2e test. (#43) [P3]
  • [P2] Docker problem detection. Although kernel monitor could be extended to monitor other logs, it still needs some code change to achieve that. We should cleanup the code to make it easier to monitor other logs and add clear documentation for it. (#44) (#88) (PR: #88, #92, #94)
  • [P3] 3rd party problem daemon integration. Kernel monitor is designed to detect known kernel problems with minimum overhead, it is not expected to be a comprehensive solution. NPD should be extensible to integrate with more small problem daemons or more mature solution. (#35)

Note that only P0s are release blocker.

@dchen1107 @fabioy @ajitak
/cc @kubernetes/sig-node-misc

Performance benchmark and optimization

We've got some data from #2.

However, after that we've made a lot of changes. We should run benchmarks to measure the CPU and memory usage, and optimize the kernel monitor's log-parsing code if the resource usage is too high.

An accurate and acceptable number is important for NPD rollout.

/cc @dchen1107

Support SSL certificates

Our k8s API server is running with a self-signed certificate to enable SSL. I am wondering what's the best way to specify a root CA when running node-problem-detector.
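One possible approach, assuming NPD is pointed at a kubeconfig (e.g. via the apiserver-override flag's auth parameter mentioned elsewhere in this tracker), is to set certificate-authority in the cluster entry instead of insecure-skip-tls-verify. A sketch, with placeholder names, paths and token:

```yaml
apiVersion: v1
kind: Config
clusters:
- name: mycluster
  cluster:
    # Trust the self-signed root CA instead of skipping TLS verification.
    certificate-authority: /etc/kubernetes/pki/ca.crt
    server: https://10.0.0.1:443
users:
- name: node-problem-detector
  user:
    token: <service-account-token>
contexts:
- context:
    cluster: mycluster
    user: node-problem-detector
  name: npd-context
current-context: npd-context
```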

Use chroot instead of relying on a systemd base image.

We want NPD to support journald log even when it's running as daemonset inside a container.

However, it's hard to consume the host journal log inside a container. Previously, we used a Fedora base image and relied on the journald library inside it to understand and read the host journald log. This has several limitations:

  1. The fedora base image is pretty big, and there are a lot of things we don't actually need.
  2. The journald version inside the container may not match the host's, which sometimes causes problems. E.g. the NPD image doesn't work with the new GCI image now; see also the user bug report #114.

A more generic solution may be to mount the host / inside the container and chroot into it. This way, we could use the host's journald directly, which eliminates the problems above.

$ docker run -it --privileged -v /:/rootfs --entrypoint=/bin/bash gcr.io/google_containers/node-problem-detector:v0.4.0
[root@07fb0f31a048 /]# chroot /rootfs
sh-4.3# journalctl -k
-- Logs begin at Sun 2017-06-11 12:12:02 UTC, end at Mon 2017-06-12 18:59:06 UTC. --
Jun 12 18:19:54 e2e-test-lantaol-master kernel: device veth43d8449 entered promiscuous mode
Jun 12 18:19:54 e2e-test-lantaol-master kernel: IPv6: ADDRCONF(NETDEV_UP): veth43d8449: link is not ready
Jun 12 18:19:54 e2e-test-lantaol-master kernel: eth0: renamed from veth75b3ac0
...

I haven't thought through the potential security implications yet, but since NPD is a privileged DaemonSet, it seems reasonable to grant it host filesystem access.

Add RBAC in the example yaml file and document

The default Kubernetes setup removed ABAC support and only uses RBAC now.
kubernetes/kubernetes#39092

This is fine for the NPD in kube-system, which has enough permissions. However, the permissions for the NPD setup in the example and Readme.md may not be enough.

We should add documentation and example to demonstrate how to deploy NPD in a RBAC K8s cluster.
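A sketch of what such an example might include (using the rbac.authorization.k8s.io/v1beta1 API of that era; the exact resources and verbs should be checked against what NPD actually does, namely patching node conditions and creating events):

```yaml
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: node-problem-detector
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/status"]
  verbs: ["get", "patch"]
- apiGroups: [""]
  resources: ["events"]
  verbs: ["create", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: node-problem-detector
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
```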

Standalone NPD Support

For better reliability, in Kubernetes 1.6, we plan to add standalone NPD support to run NPD as a system daemon on each node.

To achieve this, we need to:

  • Add apiserver-override option to make it possible to use customized client config instead of InClusterConfig. #49
  • Make NPD wait for: #79
    • Apiserver to come up. #19
    • Kubelet to register node.
      This is necessary, because in standalone mode, NPD may be deployed before apiserver and kubelet are functioning. It should wait for them to come up.
  • NPD should host a simple http server for readiness check and health monitor. #83
    • Required: Add /health for health monitoring.
    • Optional: Add /pprof for performance debugging.
    • Optional: Add debug endpoint to list internal state.
    • Optional: Add grpc problem report endpoint.
  • Resource Limit
    • Get benchmark data about the NPD resource usage #85
    • ~~Adjust NodeAllocatable according to the resource usage.~~
  • Add standalone NPD into K8s GCI cloud init script. kubernetes/kubernetes#40206
    • Package NPD binary as tar ball and upload it to google storage. #71, #74
    • Download and setup NPD binary in K8s GCI cloud init script.
  • NPD should have a specific user account which has necessary permission in K8s cluster. Similar with system:kube-proxy.

node-problem-detector does not work if I change kubelet hostname to node ip.

I am running k8s 1.3.0 with the docker image node-problem-detector:v0.1.

I set the kubelet --hostname-override to the node IP on each minion:

root@SZX1000116607:~# cat /etc/default/kubelet
KUBELET_OPTS='--hostname-override=10.22.109.119 --api-servers=http://10.22.109.119:8080,http://10.22.69.237:8080,http://10.22.117.82:8080 --pod-infra-container-image=xxxxx/kubernetes/pause:latest --cluster-dns=192.168.1.1  --cluster-domain=test1 --low-diskspace-threshold-mb=2048 --cert-dir=/var/run/kubelet --allow-privileged=true'

when I start node-problem-detector as a daemon set, I got a lot of error messages:

2016-07-11T07:49:17.310422203Z I0711 07:49:17.309761       1 kernel_monitor.go:86] Finish parsing log file: {WatcherConfig:{KernelLogPath:/log/kern.log} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:UnregisterNetDeviceIssue Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+}]}
2016-07-11T07:49:17.310510490Z I0711 07:49:17.310026       1 kernel_monitor.go:93] Got system boot time: 2016-07-08 01:40:54.310019188 +0000 UTC
2016-07-11T07:49:17.311878174Z I0711 07:49:17.311436       1 kernel_monitor.go:102] Start kernel monitor
2016-07-11T07:49:17.311907663Z I0711 07:49:17.311654       1 kernel_log_watcher.go:173] unable to parse line: "", can't find timestamp prefix "kernel: [" in line ""
2016-07-11T07:49:17.311926515Z I0711 07:49:17.311696       1 kernel_log_watcher.go:110] Start watching kernel log
2016-07-11T07:49:17.311942118Z I0711 07:49:17.311720       1 problem_detector.go:60] Problem detector started
2016-07-11T07:49:17.314020808Z 2016/07/11 07:49:17 Seeked /log/kern.log - &{Offset:0 Whence:0}
2016-07-11T07:49:18.355974460Z E0711 07:49:18.355712       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:19.314395255Z E0711 07:49:19.314110       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:20.314302138Z E0711 07:49:20.313982       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:21.315151897Z E0711 07:49:21.314849       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:22.313977681Z E0711 07:49:22.313834       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:23.313906286Z E0711 07:49:23.313619       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:24.314428608Z E0711 07:49:24.314141       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:25.316626549Z E0711 07:49:25.316326       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:26.314346471Z E0711 07:49:26.314142       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:27.313901894Z E0711 07:49:27.313759       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:28.314356686Z E0711 07:49:28.314198       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:29.314747334Z E0711 07:49:29.314450       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found
2016-07-11T07:49:30.314562756Z E0711 07:49:30.314235       1 manager.go:130] failed to update node conditions: nodes "SZX1000116591" not found

It seems node-problem-detector still uses the original OS hostname to access the node, instead of the overridden hostname that is registered in etcd.

/wangyumi

Unable to build code, missing godep

I cloned the repo and ran:

$ make node-problem-detector
CGO_ENABLED=0 GOOS=linux godep go build -a -installsuffix cgo -ldflags '-w' -o node-problem-detector
vendor/gopkg.in/fsnotify.v1/inotify.go:19:2: cannot find package "golang.org/x/sys/unix" in any of:
    /home/decarr/go/src/k8s.io/node-problem-detector/vendor/golang.org/x/sys/unix (vendor tree)
    /usr/local/go/src/golang.org/x/sys/unix (from $GOROOT)
    /home/decarr/go/src/k8s.io/node-problem-detector/Godeps/_workspace/src/golang.org/x/sys/unix (from $GOPATH)
    /home/decarr/go/src/golang.org/x/sys/unix
godep: go exit status 1
Makefile:9: recipe for target 'node-problem-detector' failed
make: *** [node-problem-detector] Error 1

Looks like this is missing a godep in master:
https://github.com/kubernetes/node-problem-detector/tree/master/vendor
