
Comments (10)

SuperQ commented on May 24, 2024

In order to narrow it down, I would look at your scrape durations: start with scrape_duration_seconds, then look at node_scrape_collector_duration_seconds. That may help narrow down which collector is slow.
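For example, something along these lines should rank the collectors by their recent average duration (a sketch assuming a Prometheus server at http://prometheus:9090 and a job="node" label on the target; adjust both to your setup):

# Overall scrape time for the node_exporter targets (hypothetical server URL and job label).
promtool query instant http://prometheus:9090 'scrape_duration_seconds{job="node"}'

# Per-collector breakdown, slowest first, averaged over the last five minutes.
promtool query instant http://prometheus:9090 \
  'sort_desc(avg_over_time(node_scrape_collector_duration_seconds{job="node"}[5m]))'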


siebenmann commented on May 24, 2024

It looks like the collectors are pervasively slow when compared to other systems:

3.35820933985   systemd
3.10417672055   processes
2.35972708165   textfile
2.2794126131    cpu
2.2210433008000003      thermal_zone
2.1956482858    logind
2.1758462679000004      cpufreq
1.2243237645    ethtool
1.1644715419000002      schedstat
1.0535111773    filesystem
0.8092852216499999      stat
0.7347117214    hwmon
0.6140579104    tcpstat
0.57323082345   softnet
0.50068511215   udp_queues
0.40303370180000003     netstat
0.26249596775000006     rapl
[...]

This is from "avg_over_time(node_scrape_collector_duration_seconds[5m])" for the server. This server isn't running anything different from our next biggest server (literally, neither is currently an active SLURM node), and on that other server the top collector (systemd) averages 0.43 seconds.
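To compare a single collector head-to-head across the two servers, the same expression can be filtered down, e.g. (the server URL and instance names are placeholders for the real targets):

# Average systemd collector duration over the last five minutes on both machines.
promtool query instant http://prometheus:9090 \
  'avg_over_time(node_scrape_collector_duration_seconds{collector="systemd",instance=~"slow-node:9100|other-node:9100"}[5m])'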


SuperQ commented on May 24, 2024

Just to confirm, you haven't changed the GOMAXPROCS env var, have you? The node_exporter defaults to GOMAXPROCS=1 in order to avoid race conditions when reading kernel files. We've seen issues in the past where the kernel has a bad time with large numbers of parallel IOs to procfs/sysfs.

This has a "kernel bug" smell to it. Have you tried using an HWE kernel version?
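If you want to double-check, something like this should show whether anything in the unit file or environment overrides it (the service name is a guess at your setup, and the runtime flag only exists on recent releases, so check --help first):

# Look for a GOMAXPROCS override in the systemd unit or its environment.
systemctl show node_exporter --property=Environment
grep -r GOMAXPROCS /etc/systemd/system/node_exporter.service* 2>/dev/null

# Recent node_exporter releases also expose the setting as a runtime flag.
./node_exporter --help 2>&1 | grep -i gomaxprocs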


siebenmann commented on May 24, 2024

I haven't changed GOMAXPROCS (and until now I didn't know that node_exporter locks it to one). Unfortunately the HWE kernel isn't an option, because we run our entire fleet with consistent kernel versions and this isn't an important enough issue to deviate from that. Would it be interesting to turn GOMAXPROCS up to 2 or 4 on the command line here, to see whether allowing more concurrent GC cuts the times down? Or to raise the Go GC thresholds? I took a look at a 30-second trace and it doesn't show any smoking guns (I can attach it here if it would be interesting).

/debug/pprof/profile says:

Duration: 30.01s, Total samples = 6.20s (20.66%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 3500ms, 56.45% of 6200ms total
Dropped 366 nodes (cum <= 31ms)
Showing top 10 nodes out of 252
      flat  flat%   sum%        cum   cum%
    2350ms 37.90% 37.90%     2350ms 37.90%  runtime/internal/syscall.Syscall6
     190ms  3.06% 40.97%      190ms  3.06%  runtime.nextFreeFast (inline)
     180ms  2.90% 43.87%      520ms  8.39%  runtime.scanobject
     170ms  2.74% 46.61%      260ms  4.19%  compress/flate.(*compressor).findMatch
     140ms  2.26% 48.87%      140ms  2.26%  runtime.memclrNoHeapPointers
     120ms  1.94% 50.81%      190ms  3.06%  runtime.findObject
     120ms  1.94% 52.74%      870ms 14.03%  runtime.mallocgc
      90ms  1.45% 54.19%       90ms  1.45%  compress/flate.matchLen (inline)
      70ms  1.13% 55.32%       90ms  1.45%  bytes.(*Buffer).ReadRune
      70ms  1.13% 56.45%      340ms  5.48%  compress/flate.(*compressor).deflate
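(For reference, a profile like this can be grabbed with go tool pprof against the exporter's /debug/pprof endpoint; the host and port below are placeholders for the real target.)

# Pull a 30-second CPU profile from the running node_exporter.
go tool pprof 'http://node01:9100/debug/pprof/profile?seconds=30'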


SuperQ commented on May 24, 2024

Increasing GOMAXPROCS tends to be worse on high CPU count nodes. See #1880.


SuperQ commented on May 24, 2024

I would recommend turning off, one at a time, the optional collectors (the ones that are disabled by default) that you have enabled, for example tcpstat or processes. Basically, bisect the collector options to see whether any of them is responsible for the slowness, along the lines of the sketch below.
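One round of that bisection might look like this (the flag set is illustrative; use whatever you normally pass):

# Start node_exporter with your usual optional collectors except one,
# then re-check node_scrape_collector_duration_seconds and scrape_duration_seconds.
./node_exporter \
  --collector.systemd \
  --collector.processes \
  --collector.ethtool
# --collector.tcpstat is left out for this round; add it back and drop another next time.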


siebenmann commented on May 24, 2024

It turns out that even with none of the extra collectors enabled, I still see significant collector times on this machine (and the overall scrape time is still much higher than on any other machine, hovering around two seconds where even the most demanding of our other machines take under a second):

1.9443551424    textfile
1.91965697425   cpu
1.8569991597000004      cpufreq
1.8250815595500003      thermal_zone
1.42256327665   schedstat
1.18568954785   filesystem
1.106938035     hwmon
1.0897670761999998      softnet
0.8030116289999999      btrfs
0.7954851884    stat
0.6860131567    netstat
0.6763594620500001      netdev
0.64222134065   sockstat
[...]

The node_exporter process now uses less CPU time but still a visible amount of it. As far as the textfile collector goes, I've verified that this machine isn't collecting any more lines of textfiles than other machines are.
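A quick way to confirm that cross-machine scrape-time comparison from Prometheus (the server URL and job label are placeholders for whatever your setup uses):

# Show the slowest node_exporter scrapes fleet-wide, averaged over the last five minutes.
promtool query instant http://prometheus:9090 \
  'topk(10, avg_over_time(scrape_duration_seconds{job="node"}[5m]))'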


siebenmann commented on May 24, 2024

My apologies; it looks like this was a temporary system state issue, where for some reason the kernel did something peculiar for a while. Now that I look at a long-term graph of scrape times for node_exporter on this machine, it's clear that it used to be much faster and much less visible in system metrics, and node_exporter on this system has now returned to that level even with the full set of collectors. It is somewhat slower than our other systems, but then there's a lot of CPU information to scrape from /sys for, e.g., the cpufreq collector.


SuperQ commented on May 24, 2024

Yes, the cpufreq collector's speed is a known issue. The kernel artificially slows down access to those sysfs files for "security" reasons. Sadly, I don't think there's anything that can be done about that.
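(If anyone wants to see the raw cost directly, timing the reads outside the exporter gives a rough idea; the exact paths vary by kernel version and cpufreq driver.)

# Time raw sysfs reads of the current frequency across all CPUs.
time cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq > /dev/null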


discordianfish commented on May 24, 2024

Closing, then.

