Comments (10)
Alternatively, you can use the script https://github.com/kubernetes/node-problem-detector/blob/master/test/kernel_log_generator/generator.sh,
either via Docker or by setting the PROBLEM environment variable manually in the script and running it on the host machine.
For verification, it is also useful to watch the events and look for the ones coming from Node:
kubectl get events | grep Node
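The grep matches any line that happens to contain "Node"; filtering on the event's structured fields is stricter. A minimal Python sketch, assuming the input is the parsed output of `kubectl get events -o json`. The helper name `node_events` and the sample events are illustrative only and are not part of NPD:

```python
import json


def node_events(event_list):
    """Return (reason, message) pairs for events whose involved object is a Node."""
    found = []
    for item in event_list.get("items", []):
        if item.get("involvedObject", {}).get("kind") == "Node":
            found.append((item.get("reason", ""), item.get("message", "")))
    return found


# Illustrative sample shaped like `kubectl get events -o json` output.
sample = json.loads("""
{"items": [
  {"involvedObject": {"kind": "Node", "name": "node-1"},
   "reason": "OOMKilling", "message": "Killed process 1012 (heapster)"},
  {"involvedObject": {"kind": "Pod", "name": "web-0"},
   "reason": "Started", "message": "Started container"}
]}
""")

for reason, message in node_events(sample):
    print(reason, "-", message)
```

In practice the JSON would be read from the kubectl output rather than from an inline string.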
from node-problem-detector.
/reopen
Hi @Random-Liu, I have the same question as @xiuli2: how can I verify that NPD is working?
I have docker-monitor.json and kernel-monitor.json enabled: --system-log-monitors=/etc/npd/kernel-monitor.json,/etc/npd/docker-monitor.json
Then I generated fake journald log entries:
# echo "BUG: unable to handle kernel NULL pointer dereference at 12345" |systemd-cat
# echo "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/aufs/layerdb/tmp/layer-632022140 /var/lib/docker/image/aufs/layerdb/sha256/011b303988d241a4ae28a6b82b0d8262751ef02910f0ae2265cb637504b72e36: directory not empty" | systemd-cat
But I cannot see any output from NPD in kubectl get events | grep Node
or in describe node NodeName.
The same goes for kubectl logs:
# kubectl logs node-problem-detector-72c4w
I0409 01:24:20.074715 1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0409 01:24:20.074881 1 log_watchers.go:40] Use log watcher of plugin "journald"
I0409 01:24:20.074999 1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:docker] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*}]}
I0409 01:24:20.075013 1 log_watchers.go:40] Use log watcher of plugin "journald"
I0409 01:24:20.075872 1 log_monitor.go:72] Start log monitor
I0409 01:24:20.097735 1 log_watcher.go:69] Start watching journald
I0409 01:24:20.097789 1 log_monitor.go:72] Start log monitor
I0409 01:24:20.099870 1 log_monitor.go:163] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2018-04-09 01:24:20.099830399 -0400 EDT m=+0.043253068 Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
I0409 01:24:20.114934 1 log_watcher.go:69] Start watching journald
I0409 01:24:20.115008 1 problem_detector.go:73] Problem detector started
I0409 01:24:20.117235 1 log_monitor.go:163] Initialize condition generated: []
Thanks a lot.
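One way to narrow this down is to check the injected messages against the rule patterns printed in the log output above. A sketch in Python, with the patterns copied from that output. NPD itself uses Go's regexp package, so Python's `re` is only an approximation here, and `matching_reasons` is a hypothetical helper:

```python
import re

# Patterns copied from the "Finish parsing log monitor config file" lines.
RULES = {
    "KernelOops": r"BUG: unable to handle kernel NULL pointer dereference at .*",
    "CorruptDockerImage": (
        r"Error trying v2 registry: failed to register layer: "
        r"rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): "
        r"directory not empty.*"
    ),
}


def matching_reasons(message):
    """Return the rule reasons whose pattern is found in the message."""
    return [reason for reason, pattern in RULES.items() if re.search(pattern, message)]


# The two messages that were piped into systemd-cat.
injected = [
    "BUG: unable to handle kernel NULL pointer dereference at 12345",
    "Error trying v2 registry: failed to register layer: rename "
    "/var/lib/docker/image/aufs/layerdb/tmp/layer-632022140 "
    "/var/lib/docker/image/aufs/layerdb/sha256/"
    "011b303988d241a4ae28a6b82b0d8262751ef02910f0ae2265cb637504b72e36"
    ": directory not empty",
]

for message in injected:
    print(matching_reasons(message))
```

If both messages match, the patterns are probably not the problem. The next thing to check would be whether the journal entries carry the source that each watcher is configured for (`source:kernel` and `source:docker` in the config above), which this sketch does not cover.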
The first thing to check is kubectl describe nodes, to see whether the KernelDeadlock node condition shows up.
Currently, kernel problem detection is based purely on the kernel log. You can inject kernel log entries to see whether NPD reacts accordingly, e.g. the problems in https://github.com/kubernetes/node-problem-detector/tree/master/test/kernel_log_generator/problems.
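The condition check can also be scripted against the node object. A minimal sketch, assuming the input is the parsed output of `kubectl get node <name> -o json`. `find_condition` and the sample node document are illustrative; the KernelDeadlock values mirror the default condition NPD logs at startup:

```python
import json


def find_condition(node, condition_type):
    """Return the node condition with the given type, or None if it is absent."""
    for cond in node.get("status", {}).get("conditions", []):
        if cond.get("type") == condition_type:
            return cond
    return None


# Illustrative sample shaped like `kubectl get node <name> -o json` output.
sample_node = json.loads("""
{"status": {"conditions": [
  {"type": "Ready", "status": "True", "reason": "KubeletReady"},
  {"type": "KernelDeadlock", "status": "False",
   "reason": "KernelHasNoDeadlock", "message": "kernel has no deadlock"}
]}}
""")

cond = find_condition(sample_node, "KernelDeadlock")
if cond is None:
    print("KernelDeadlock condition not reported")
else:
    print(cond["status"], cond["reason"])
```

A missing KernelDeadlock condition suggests NPD is not reporting to the API server at all, which is a different failure from a rule that never fires.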
I tested with Kubernetes 1.6 and injected kernel log entries into /var/log/dmesg, but nothing happened. Am I doing this right?
@strugglingyouth Did you change the configuration accordingly? https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor-filelog.json
You may want to inject the log into /var/log/kern.log, or change the configuration to point to /var/log/dmesg.
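To confirm which file a monitor actually reads, the config can be inspected before injecting anything. A sketch assuming the JSON keys `plugin`, `logPath`, `rules`, and `reason`, which are inferred from the fields NPD prints at startup and should be checked against your own config file. The inline config is illustrative and is not a copy of kernel-monitor-filelog.json:

```python
import json


def load_monitor(path):
    """Read a monitor config file from disk (path is supplied by the caller)."""
    with open(path) as handle:
        return json.load(handle)


def summarize_monitor(config):
    """Report which plugin and log path a monitor config uses, plus its rule reasons."""
    reasons = sorted({rule["reason"] for rule in config.get("rules", []) if "reason" in rule})
    return {
        "plugin": config.get("plugin"),
        "logPath": config.get("logPath"),
        "reasons": reasons,
    }


# Illustrative config; key names are assumptions, see the note above.
sample_config = {
    "plugin": "filelog",
    "logPath": "/var/log/kern.log",
    "rules": [
        {"type": "temporary", "reason": "OOMKilling"},
        {"type": "temporary", "reason": "KernelOops"},
    ],
}

print(summarize_monitor(sample_config))
```

If the reported log path is not the file you are writing to, the injected lines are never read, regardless of their format.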
@Random-Liu I simulated the problems by adding the following to /var/log/kern.log:
kernel:[534024.040037] unregister_netdevice: waiting for lo to become free. Usage count = 2
kernel: Memory cgroup out of memory: Kill process 1012 (heapster) score 1035 or sacrifice child
kernel: Killed process 1012 (heapster) total-vm:327128kB, anon-rss:306328kB, file-rss:11132kB
I am not sure whether these log lines are suitable for validating NPD. Is the format right? Is a specific style required? Do they need a timestamp?
After manually modifying kern.log:
A: I ran "kubectl get events | grep Node", but the expected OOM events did not appear.
B: I ran "oc describe node NodeName" to view the Conditions and Events sections and check whether the OOM and unregister_netdevice errors were caught, but there were no related Conditions or Events. Are these the correct steps to check that NPD works?
Could you please show the detailed steps to verify it? Thanks in advance!
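On the format question, the OOMKilling rule shown earlier in this thread spans two log lines joined by a newline, so both lines have to reach the matcher together. A sketch of that check in Python, assuming the `kernel:` tag and an optional `[timestamp]` are stripped before matching. How NPD's filelog plugin extracts the message is controlled by its own config and is not reproduced here; `strip_prefix` is a hypothetical helper:

```python
import re

# OOMKilling pattern as printed in the NPD startup log earlier in the thread.
OOM_PATTERN = (
    r"Kill process \d+ (.+) score \d+ or sacrifice child\n"
    r"Killed process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB"
)


def strip_prefix(line):
    """Drop a leading 'kernel:' tag and optional '[timestamp]' (assumed format)."""
    return re.sub(r"^kernel:\s*(\[\s*\d+\.\d+\]\s*)?", "", line)


lines = [
    "kernel: Memory cgroup out of memory: Kill process 1012 (heapster) score 1035 or sacrifice child",
    "kernel: Killed process 1012 (heapster) total-vm:327128kB, anon-rss:306328kB, file-rss:11132kB",
]

# Both lines joined, as the two-line pattern expects.
joined = "\n".join(strip_prefix(line) for line in lines)
print(bool(re.search(OOM_PATTERN, joined)))

# The first line on its own cannot satisfy the two-line pattern.
print(bool(re.search(OOM_PATTERN, strip_prefix(lines[0]))))
```

Under these assumptions the message text itself fits the pattern, so the remaining suspects are the line prefix and timestamp format the configured log parser expects, and whether NPD is reading kern.log at all.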
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Prevent issues from auto-closing with an /lifecycle frozen comment.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
@weinliu: you can't re-open an issue/PR unless you authored it or you are assigned to it.