Comments (15)
In the meantime I managed to reproduce the issue, and we are working on a fix. Please be patient.
from pdns.
Thanks, the first one seems incomplete; the second one confirms what I'm seeing locally. I have deleted the comment to avoid exposing potentially sensitive information.
from pdns.
All: thanks for your patience. Due to the nature of this bug, we could not tell everything we knew. But the fix is now public (see the link phoneph1 posted) and I'll close this issue.
from pdns.
Please try the following when this happens: look with perf top to see where the recursor is spending time, and/or attach gdb:
gdb --pid=processIdOfRecursor
and then execute thread apply all bt on the gdb command line.
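For anyone who wants to capture the backtraces without an interactive session, here is a minimal Python sketch. It assumes gdb is installed and that the caller is allowed to attach to the recursor process (typically root). The helper names gdb_backtrace_command and capture_backtraces are made up for this example; --batch and -ex are standard gdb options, and --pid is the option used above.

```python
import subprocess
from typing import List


def gdb_backtrace_command(pid: int) -> List[str]:
    """Build a non-interactive gdb call that prints all thread backtraces."""
    # --batch makes gdb exit after running the -ex command(s).
    return ["gdb", "--batch", f"--pid={pid}", "-ex", "thread apply all bt"]


def capture_backtraces(pid: int, timeout: float = 60.0) -> str:
    """Attach gdb to the running process and return its output as text.

    The target process is paused while gdb is attached, so keep this short.
    """
    result = subprocess.run(
        gdb_backtrace_command(pid),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=False,
    )
    return result.stdout
```

Calling capture_backtraces(processIdOfRecursor) and saving the returned text to a file gives something that can be attached to the issue once it has been checked for sensitive data.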
from pdns.
I tried to reproduce the issue on a CentOS 7 machine, but have failed so far. I am hammering the recursor instance with requests for metrics and regular DNS queries. I also decreased the statistics-interval so that statistics are printed every 5 seconds.
An observation: the metrics collection log lines stop while DNS queries themselves are still processed, which suggests the issue is with a particular thread internally called the handler (I know that is not a very good name). The external name, as seen in ps and other tools that inspect processes or threads, is rec/web+stat.
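To locate that thread on a running system, one option is to read the per-thread names that Linux exposes under /proc/PID/task/TID/comm. The following Python sketch does that; the function name threads_named and its proc_root parameter are assumptions of this example, not part of any PowerDNS tooling.

```python
from pathlib import Path
from typing import List


def threads_named(pid: int, name: str, proc_root: str = "/proc") -> List[int]:
    """Return the thread IDs of process `pid` whose thread name equals `name`."""
    task_dir = Path(proc_root) / str(pid) / "task"
    matches = []
    for comm in task_dir.glob("*/comm"):
        try:
            if comm.read_text().strip() == name:
                # The directory name is the thread ID (TID).
                matches.append(int(comm.parent.name))
        except OSError:
            # The thread may have exited while we were scanning.
            continue
    return sorted(matches)
```

For example, threads_named(processIdOfRecursor, "rec/web+stat") should return the ID of the handler thread, which can then be matched against the LWP numbers in the gdb backtrace output.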
from pdns.
We still have to wait for the problem to occur again before I can send you the requested information. The first time it happened was about 2 days after the update.
Looking at the metrics we have, we see about 50 req/s on average, with some short peaks of up to ~120 req/s across the recursors in total. The "primary" recursor handles roughly 60% of the requests. The second recursor hit this strange problem about 1.5 days later.
from pdns.
Wow, thank you very much.
In the meantime we had the problem again on Saturday at about 00:30. A colleague executed the suggested commands; I'll post the output asap. I just have to check something first, as something is not clear to me and would probably cause some confusion.
from pdns.
Hi, we're also affected quite regularly. @omoerbeek, we're happy to help, so let me know if you need any information or assistance. Is it possible to get the reproduction steps (if there are any)?
from pdns.
We are still working on this. Please post the version and other relevant info, so we have more data points. You also could try to figure out if reverting to a previous version makes a difference.
from pdns.
We had a similar issue. Below is some information.
Last stable version we have tried: 4.9.3
Current unstable version for us: 5.0.3
We are running pdns-recursor with 8 threads. At the moment 2 of these threads are at 100% CPU usage in user space. It took more than 3 days for one of the servers to reach this state. I tried performing the same query multiple times on that server and it times out on some of the attempts. I guess it times out when the kernel distributes the query to one of the stuck threads. It seems that these 2 threads are stuck in an infinite loop. I can try to share the stack trace if needed. Reverting to version 4.9.3 brings us back to a stable state.
$ top -b -n 1 -H
32003 pdns-re+ 20 0 2006892 667084 19328 R 99.9 0.5 2985:51 rec/worker
31997 pdns-re+ 20 0 2006892 667084 19328 R 94.4 0.5 2879:25 rec/worker
31996 pdns-re+ 20 0 2006892 667084 19328 S 0.0 0.5 695:27.07 rec/worker
31998 pdns-re+ 20 0 2006892 667084 19328 S 0.0 0.5 698:59.04 rec/worker
31999 pdns-re+ 20 0 2006892 667084 19328 S 0.0 0.5 692:53.16 rec/worker
32000 pdns-re+ 20 0 2006892 667084 19328 S 0.0 0.5 696:57.50 rec/worker
32001 pdns-re+ 20 0 2006892 667084 19328 S 0.0 0.5 697:43.79 rec/worker
32002 pdns-re+ 20 0 2006892 667084 19328 S 0.0 0.5 699:00.64 rec/worker
$ pidstat -p 31997 -p 32003 1
04:55:52 PM UID PID %usr %system %guest %wait %CPU CPU Command
04:55:53 PM 844 31997 200.00 0.00 0.00 0.00 200.00 26 rec/worker
04:55:53 PM 844 32003 200.00 0.00 0.00 0.00 200.00 28 rec/worker
04:55:53 PM UID PID %usr %system %guest %wait %CPU CPU Command
04:55:54 PM 844 31997 200.00 0.00 0.00 0.00 200.00 26 rec/worker
04:55:54 PM 844 32003 200.00 0.00 0.00 0.00 200.00 28 rec/worker
04:55:54 PM UID PID %usr %system %guest %wait %CPU CPU Command
04:55:55 PM 844 31997 199.00 0.00 0.00 0.00 199.00 26 rec/worker
04:55:55 PM 844 32003 200.00 0.00 0.00 0.00 200.00 28 rec/worker
$ netstat -ulnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
udp 2147485440 0 0.0.0.0:53 0.0.0.0:* 31993/pdns_recursor
udp 2147484672 0 0.0.0.0:53 0.0.0.0:* 31993/pdns_recursor
udp 0 0 0.0.0.0:53 0.0.0.0:* 31993/pdns_recursor
udp 0 0 0.0.0.0:53 0.0.0.0:* 31993/pdns_recursor
udp 0 0 0.0.0.0:53 0.0.0.0:* 31993/pdns_recursor
udp 0 0 0.0.0.0:53 0.0.0.0:* 31993/pdns_recursor
udp 0 0 0.0.0.0:53 0.0.0.0:* 31993/pdns_recursor
udp 0 0 0.0.0.0:53 0.0.0.0:* 31993/pdns_recursor
$ netstat --udp --statistics
282618010 packet receive errors
282618010 receive buffer errors
$ dig +tries=1 +timeout=1 @127.0.0.1 example.com
;; connection timed out; no servers could be reached
$ netstat --udp --statistics
282618011 packet receive errors
282618011 receive buffer errors
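The pattern in the output above (a couple of rec/worker threads pinned near 100% CPU, and the same number of UDP sockets with a very large Recv-Q) can be checked for automatically. The Python sketch below parses text in the same layout as the top -b -n 1 -H and netstat -ulnp output shown here. The function names and the thresholds are assumptions chosen for this example, and the column positions are taken from this transcript, so other top or netstat versions may need adjusting.

```python
from typing import List, Tuple


def busy_threads(top_output: str, min_cpu: float = 90.0) -> List[Tuple[int, float, str]]:
    """Return (tid, cpu_percent, name) for threads at or above min_cpu."""
    busy = []
    for line in top_output.splitlines():
        fields = line.split()
        # Expected columns: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
        if len(fields) < 12 or not fields[0].isdigit():
            continue
        try:
            cpu = float(fields[8])
        except ValueError:
            continue
        if cpu >= min_cpu:
            busy.append((int(fields[0]), cpu, fields[11]))
    return busy


def backed_up_sockets(netstat_output: str, min_recv_q: int = 1) -> List[Tuple[str, int]]:
    """Return (local_address, recv_q) for UDP sockets with a non-empty Recv-Q."""
    backed_up = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected columns: Proto Recv-Q Send-Q Local-Address Foreign-Address ...
        if len(fields) < 4 or not fields[0].startswith("udp") or not fields[1].isdigit():
            continue
        recv_q = int(fields[1])
        if recv_q >= min_recv_q:
            backed_up.append((fields[3], recv_q))
    return backed_up
```

Fed the output from this comment, busy_threads reports threads 32003 and 31997, and backed_up_sockets reports two of the eight sockets on 0.0.0.0:53, which is consistent with the guess that queries landing on the stuck threads are the ones that time out.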
from pdns.
We have observed the same behavior (suddenly and unexpectedly 100% CPU utilization) also with version 4.8.7. The PowerDNS and hardware metrics do not show any abnormalities. perf top/gdb output is still pending. After a reboot, everything is back to normal. Two independent DNS recursors were crippled almost simultaneously. Once PowerDNS is in this state, it is more or less unable to answer DNS requests promptly.
from pdns.
"We have observed the same behavior (suddenly and unexpectedly 100% CPU utilization) also with version 4.8.7."
Can confirm this. We hit this bug on two instances (active / passive HA) on Debian with version 4.8.7.
This also affected the passive node, which was not resolving anything while in passive mode. Both are behind dnsdist.
from pdns.
Any info yet on the patch? We just had a problem with both our recursors simultaneously, which caused major panic in the company.
from pdns.
We can confirm this behavior on 5.0.3; we downgraded to 5.0.2 and the issues disappeared.
RHEL 8.9
from pdns.
If you didn't already see it... https://blog.powerdns.com/2024/04/24/powerdns-recursor-4-8-8-4-9-5-5-0-4-released
from pdns.