Comments (10)

hardikdr commented on June 27, 2024

Alternatively, you can use the script https://github.com/kubernetes/node-problem-detector/blob/master/test/kernel_log_generator/generator.sh,
either via Docker or by setting the PROBLEM environment variable manually in the script and running it on the host machine.
For verification, it is also useful to watch the events and look for the ones coming from the node:
kubectl get events | grep Node
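
As a complement to the grep above, here is a minimal Python sketch that filters events by the kind of the involved object instead of by text. It assumes the JSON shape produced by `kubectl get events -o json` (an `items` list whose entries carry `involvedObject`, `reason` and `message`). The sample event, the node name `node-1` and the function name `node_events` are hypothetical and only there for illustration.

```python
import json


def node_events(event_list: dict) -> list[tuple[str, str, str]]:
    """Return (node name, reason, message) for every event attached to a Node."""
    results = []
    for item in event_list.get("items", []):
        involved = item.get("involvedObject", {})
        if involved.get("kind") == "Node":
            results.append(
                (involved.get("name", ""), item.get("reason", ""), item.get("message", ""))
            )
    return results


# Hypothetical sample in the shape of `kubectl get events -o json`.
SAMPLE = json.loads("""
{
  "items": [
    {
      "involvedObject": {"kind": "Node", "name": "node-1"},
      "reason": "KernelOops",
      "message": "BUG: unable to handle kernel NULL pointer dereference at 12345"
    },
    {
      "involvedObject": {"kind": "Pod", "name": "some-pod"},
      "reason": "Scheduled",
      "message": "pod was scheduled"
    }
  ]
}
""")

for name, reason, message in node_events(SAMPLE):
    print(name, reason, message)
```

To use it against a live cluster, the output of `kubectl get events -o json` could be parsed with `json.loads` and passed to `node_events` in place of `SAMPLE`.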

from node-problem-detector.

weinliu commented on June 27, 2024

/reopen
Hi @Random-Liu, I have the same question as @xiuli2: how can I verify that NPD is working?
I have docker-monitor.json and kernel-monitor.json enabled: --system-log-monitors=/etc/npd/kernel-monitor.json,/etc/npd/docker-monitor.json

Then I generated fake journald log entries:

# echo "BUG: unable to handle kernel NULL pointer dereference at 12345" |systemd-cat
# echo "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/aufs/layerdb/tmp/layer-632022140 /var/lib/docker/image/aufs/layerdb/sha256/011b303988d241a4ae28a6b82b0d8262751ef02910f0ae2265cb637504b72e36: directory not empty" | systemd-cat

But I cannot see any output from NPD in kubectl get events | grep Node or in describe node NodeName.

The same is true of kubectl logs:

# kubectl logs node-problem-detector-72c4w
I0409 01:24:20.074715       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}] Rules:[{Type:temporary Condition: Reason:OOMKilling Pattern:Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB} {Type:temporary Condition: Reason:TaskHung Pattern:task \S+:\w+ blocked for more than \w+ seconds\.} {Type:temporary Condition: Reason:UnregisterNetDevice Pattern:unregister_netdevice: waiting for \w+ to become free. Usage count = \d+} {Type:temporary Condition: Reason:KernelOops Pattern:BUG: unable to handle kernel NULL pointer dereference at .*} {Type:temporary Condition: Reason:KernelOops Pattern:divide error: 0000 \[#\d+\] SMP} {Type:permanent Condition:KernelDeadlock Reason:AUFSUmountHung Pattern:task umount\.aufs:\w+ blocked for more than \w+ seconds\.} {Type:permanent Condition:KernelDeadlock Reason:DockerHung Pattern:task docker:\w+ blocked for more than \w+ seconds\.}]}
I0409 01:24:20.074881       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0409 01:24:20.074999       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:docker] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:docker-monitor DefaultConditions:[] Rules:[{Type:temporary Condition: Reason:CorruptDockerImage Pattern:Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*}]}
I0409 01:24:20.075013       1 log_watchers.go:40] Use log watcher of plugin "journald"
I0409 01:24:20.075872       1 log_monitor.go:72] Start log monitor
I0409 01:24:20.097735       1 log_watcher.go:69] Start watching journald
I0409 01:24:20.097789       1 log_monitor.go:72] Start log monitor
I0409 01:24:20.099870       1 log_monitor.go:163] Initialize condition generated: [{Type:KernelDeadlock Status:false Transition:2018-04-09 01:24:20.099830399 -0400 EDT m=+0.043253068 Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]
I0409 01:24:20.114934       1 log_watcher.go:69] Start watching journald
I0409 01:24:20.115008       1 problem_detector.go:73] Problem detector started
I0409 01:24:20.117235       1 log_monitor.go:163] Initialize condition generated: []

Thanks a lot.

Random-Liu commented on June 27, 2024

The first thing you may want to check is kubectl describe nodes, to see whether the KernelDeadlock node condition shows up.

Currently, kernel problem detection is based purely on the kernel log. You can inject kernel log entries to see whether NPD reacts accordingly, e.g. the problems in https://github.com/kubernetes/node-problem-detector/tree/master/test/kernel_log_generator/problems.
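
Before injecting a line, it can help to check that it would hit one of the configured rules at all. The sketch below copies the rule patterns from the kernel-monitor configuration printed in the NPD startup log quoted in this thread and tests sample messages against them. It is only an approximation: NPD compiles these patterns with Go's regexp package and may anchor them differently, whereas this uses Python's `re.search`. The function name `matching_reasons` is hypothetical.

```python
import re

# Patterns copied from the kernel-monitor rules shown in the NPD startup log.
# ASSUMPTION: Python's `re` behaves closely enough to Go's regexp for these.
KERNEL_RULES = [
    ("OOMKilling", r"Kill process \d+ (.+) score \d+ or sacrifice child\nKilled process \d+ (.+) total-vm:\d+kB, anon-rss:\d+kB, file-rss:\d+kB"),
    ("TaskHung", r"task \S+:\w+ blocked for more than \w+ seconds\."),
    ("UnregisterNetDevice", r"unregister_netdevice: waiting for \w+ to become free. Usage count = \d+"),
    ("KernelOops", r"BUG: unable to handle kernel NULL pointer dereference at .*"),
    ("KernelOops", r"divide error: 0000 \[#\d+\] SMP"),
    ("AUFSUmountHung", r"task umount\.aufs:\w+ blocked for more than \w+ seconds\."),
    ("DockerHung", r"task docker:\w+ blocked for more than \w+ seconds\."),
]


def matching_reasons(message: str) -> list[str]:
    """Return the reasons of all rules whose pattern is found in the message."""
    reasons = []
    for reason, pattern in KERNEL_RULES:
        if re.search(pattern, message) and reason not in reasons:
            reasons.append(reason)
    return reasons


samples = [
    "BUG: unable to handle kernel NULL pointer dereference at 12345",
    "unregister_netdevice: waiting for lo to become free. Usage count = 2",
    "some unrelated kernel message",
]
for sample in samples:
    print(matching_reasons(sample), "<-", sample)
```

A message that matches here can still be missed by NPD if it never reaches the watcher, for example because it is written to a different log source than the one the monitor is configured to read.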

strugglingyouth commented on June 27, 2024

I tested with Kubernetes 1.6 and injected kernel log entries into /var/log/dmesg, but nothing happened. Am I doing this right?

Random-Liu commented on June 27, 2024

@strugglingyouth Did you change the configuration accordingly? https://github.com/kubernetes/node-problem-detector/blob/master/config/kernel-monitor-filelog.json

You may want to inject the log into /var/log/kern.log, or change the configuration to point to /var/log/dmesg.

xiuli2 commented on June 27, 2024

@Random-Liu I simulated the problems by adding the following lines to /var/log/kern.log:

kernel:[534024.040037] unregister_netdevice: waiting for lo to become free. Usage count = 2
kernel: Memory cgroup out of memory: Kill process 1012 (heapster) score 1035 or sacrifice child
kernel: Killed process 1012 (heapster) total-vm:327128kB, anon-rss:306328kB, file-rss:11132kB

I am not sure whether these log lines are suitable for validating NPD. Is the format right? Is a specific style required? Do the lines need a timestamp?

After manually modifying kern.log:

A: I ran "kubectl get events | grep Node", but the expected OOM events did not appear.

B: I ran "oc describe node NodeName" to view the Conditions and Events sections and check whether the OOM and unregister_netdevice errors were caught, but there were no related conditions or events. Are these the correct steps to check that NPD works?

Could you please show the detailed steps to verify that it works? Thanks in advance!
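
On the format question, a rough way to check a kern.log line is to parse it the way a file-based log watcher might. The sketch below is built on assumptions, not taken from this thread: it supposes the filelog configuration extracts a 15-character syslog-style timestamp (such as `Apr  9 01:24:20`) with `^.{15}` and the message with `kernel: \[.*\] (.*)`. Check these values against your own kernel-monitor-filelog.json before relying on it. The hostname `node1` and the function name `parse_kern_line` are hypothetical.

```python
import re
from datetime import datetime

# ASSUMPTION: these two patterns mirror the filelog plugin settings in use;
# verify them against the actual configuration file.
TIMESTAMP_RE = re.compile(r"^.{15}")
MESSAGE_RE = re.compile(r"kernel: \[.*\] (.*)")


def parse_kern_line(line: str):
    """Return (timestamp, message) if the line has the assumed shape, else None."""
    ts = TIMESTAMP_RE.search(line)
    msg = MESSAGE_RE.search(line)
    if ts is None or msg is None:
        return None
    try:
        # Syslog timestamps carry no year; a dummy year is prepended
        # only so that the month, day and time can be validated.
        datetime.strptime("2000 " + ts.group(0), "%Y %b %d %H:%M:%S")
    except ValueError:
        return None
    return ts.group(0), msg.group(1)


lines = [
    # Line as posted above: no timestamp, no space after "kernel:".
    "kernel:[534024.040037] unregister_netdevice: waiting for lo to become free. Usage count = 2",
    # Same message with a syslog-style prefix (hypothetical hostname).
    "Apr  9 01:24:20 node1 kernel: [534024.040037] unregister_netdevice: waiting for lo to become free. Usage count = 2",
]
for line in lines:
    print(parse_kern_line(line))
```

Under these assumptions, the lines as posted would not parse because they lack the timestamp prefix and the `kernel: [...]` shape, which may be one reason nothing was reported.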

fejta-bot commented on June 27, 2024

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

Prevent issues from auto-closing with a /lifecycle frozen comment.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or @fejta.
/lifecycle stale

fejta-bot commented on June 27, 2024

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot commented on June 27, 2024

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

k8s-ci-robot commented on June 27, 2024

@weinliu: you can't re-open an issue/PR unless you authored it or you are assigned to it.
