Comments (10)
To narrow it down, I would look at your scrape durations (`scrape_duration_seconds`), then at `node_scrape_collector_duration_seconds`. That may help pinpoint which collector is slow.
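For example, something along these lines surfaces the worst offenders (a sketch; the Prometheus address and `instance` value are placeholders):

```sh
# Sketch: the 10 slowest collectors on the suspect host, averaged over 5m.
promtool query instant http://prometheus:9090 \
  'topk(10, avg_over_time(node_scrape_collector_duration_seconds{instance="slowhost:9100"}[5m]))'
```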
It looks like the collectors are pervasively slow when compared to other systems:
3.35820933985 systemd
3.10417672055 processes
2.35972708165 textfile
2.2794126131 cpu
2.2210433008000003 thermal_zone
2.1956482858 logind
2.1758462679000004 cpufreq
1.2243237645 ethtool
1.1644715419000002 schedstat
1.0535111773 filesystem
0.8092852216499999 stat
0.7347117214 hwmon
0.6140579104 tcpstat
0.57323082345 softnet
0.50068511215 udp_queues
0.40303370180000003 netstat
0.26249596775000006 rapl
[...]
This is from `avg_over_time(node_scrape_collector_duration_seconds[5m])` for the server. This server isn't running anything different from our next biggest server (literally, both are currently inactive SLURM nodes), and on that other server the top collector (systemd) averages 0.43 seconds.
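A per-collector comparison across the two hosts can be pulled with something along these lines (a sketch; the Prometheus address is a placeholder):

```sh
# Sketch: the same 5m average for one collector, broken out per instance,
# to compare this host against its otherwise-identical twin.
promtool query instant http://prometheus:9090 \
  'avg_over_time(node_scrape_collector_duration_seconds{collector="systemd"}[5m])'
```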
Just to confirm, you haven't changed the `GOMAXPROCS` env var, have you? node_exporter defaults to `GOMAXPROCS=1` in order to avoid race conditions when reading kernel files. We've seen issues in the past where the kernel has a bad time with the amount of parallel I/O to procfs/sysfs.
This has a "kernel bug" smell to it. Have you tried using an HWE kernel version?
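If you want to experiment with it, the override is just the env var (a sketch; recent versions also expose this as a `--runtime.gomaxprocs` flag, which is an assumption worth checking against your build):

```sh
# Sketch: run node_exporter with more OS threads than the pinned default of 1,
# purely as an experiment; watch the collector durations for regressions.
GOMAXPROCS=4 ./node_exporter
```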
I haven't changed `GOMAXPROCS` (and until now I didn't know that node_exporter locks it to one). Unfortunately the HWE kernel isn't an option, because we run our entire fleet on consistent kernel versions and this isn't an important enough issue to deviate from that. Would it be worth raising `GOMAXPROCS` to 2 or 4 on the command line, to see whether enough concurrent GC cuts the times down? Or raising the Go GC thresholds? I took a look at a 30-second trace and it doesn't show any smoking guns (I can attach it here if it would be interesting).
/debug/pprof/profile says:
Duration: 30.01s, Total samples = 6.20s (20.66%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 3500ms, 56.45% of 6200ms total
Dropped 366 nodes (cum <= 31ms)
Showing top 10 nodes out of 252
flat flat% sum% cum cum%
2350ms 37.90% 37.90% 2350ms 37.90% runtime/internal/syscall.Syscall6
190ms 3.06% 40.97% 190ms 3.06% runtime.nextFreeFast (inline)
180ms 2.90% 43.87% 520ms 8.39% runtime.scanobject
170ms 2.74% 46.61% 260ms 4.19% compress/flate.(*compressor).findMatch
140ms 2.26% 48.87% 140ms 2.26% runtime.memclrNoHeapPointers
120ms 1.94% 50.81% 190ms 3.06% runtime.findObject
120ms 1.94% 52.74% 870ms 14.03% runtime.mallocgc
90ms 1.45% 54.19% 90ms 1.45% compress/flate.matchLen (inline)
70ms 1.13% 55.32% 90ms 1.45% bytes.(*Buffer).ReadRune
70ms 1.13% 56.45% 340ms 5.48% compress/flate.(*compressor).deflate
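(For anyone reproducing this: a 30-second profile like the one above comes straight off the exporter's pprof endpoint; the host below is a placeholder, 9100 being the default listen port.)

```sh
# Capture and interactively inspect a 30s CPU profile from the running exporter.
go tool pprof 'http://slowhost:9100/debug/pprof/profile?seconds=30'
```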
Increasing `GOMAXPROCS` tends to be worse on high-CPU-count nodes. See #1880.
I would recommend disabling the optional (default-disabled) collectors you have enabled, one at a time; for example, `tcpstat` or `processes`. Basically, bisect the collector options to see whether any of them are responsible.
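A sketch of that bisection, assuming the extras were enabled on the command line (`--collector.<name>` enables an optional collector, `--no-collector.<name>` disables one):

```sh
# Re-run with one suspect collector dropped per round and compare the
# node_scrape_collector_duration_seconds series between runs.
./node_exporter --collector.systemd --collector.processes   # baseline: extras on
./node_exporter --collector.systemd                         # processes off
./node_exporter                                             # all extras off
```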
It turns out that even with all of the extra collectors disabled, I still see significant collector times on this machine (and the overall scrape time is still much higher than on any other machine, hovering around two seconds where even our busiest other machines take under a second):
1.9443551424 textfile
1.91965697425 cpu
1.8569991597000004 cpufreq
1.8250815595500003 thermal_zone
1.42256327665 schedstat
1.18568954785 filesystem
1.106938035 hwmon
1.0897670761999998 softnet
0.8030116289999999 btrfs
0.7954851884 stat
0.6860131567 netstat
0.6763594620500001 netdev
0.64222134065 sockstat
[...]
The node_exporter process now uses less CPU time, but still a visible amount. As for the textfile collector, I've verified that this machine isn't collecting any more lines of textfile metrics than other machines are.
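(A check like the following sketch is enough to compare textfile volume between hosts; the directory is whatever `--collector.textfile.directory` points at, and the path here is an example.)

```sh
# Compare the raw volume of textfile-collector input between machines.
wc -l /var/lib/node_exporter/textfile/*.prom
```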
My apologies; it looks like this was a temporary system-state issue, where for some reason the kernel did something peculiar for a while. Now that I look at a long-term graph of node_exporter scrape times on this machine, it's clear that it used to be much faster and less visible in system metrics, and that it has now returned to that level even with the full set of collectors. It is somewhat slower than our other systems, but then there's a lot of CPU information to scrape from /sys for, e.g., the cpufreq collector.
Yes, cpufreq speed is a known issue. The kernel artificially slows down access for "security" reasons. Sadly, I don't think there's anything that can be done about that.
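You can see that cost outside node_exporter too; timing raw reads of the sysfs files the cpufreq collector walks shows it directly (a sketch; which files are slow varies by driver and kernel):

```sh
# Time raw reads of the per-CPU cpufreq files; on affected kernels each read
# can take milliseconds, which adds up on high-core-count machines.
time cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq > /dev/null
```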
Closing then