Comments (15)

omoerbeek avatar omoerbeek commented on June 7, 2024 4

In the meantime I managed to reproduce, we are working on a fix. Please have some patience.

from pdns.

omoerbeek avatar omoerbeek commented on June 7, 2024 3

Thanks, the first one seems incomplete, the second one confirms what I'm seeing locally. I have deleted the comment to avoid exposing potentially sensitive information.

omoerbeek avatar omoerbeek commented on June 7, 2024 2

all: thanks for your patience. Due to the nature of this bug, we could not tell everything we knew. But the fix is now public (see the link phonedph1 posted) and I'll close this issue.

omoerbeek avatar omoerbeek commented on June 7, 2024 1

Please try the following when this happens: look with perf top to see where the recursor is spending time, and/or attach gdb:
gdb --pid=processIdOfRecursor, and then execute thread apply all bt on the gdb command line.
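
The gdb step can also be run non-interactively, which is easier to hand to a colleague. This is only a sketch: the function name `dump_backtraces` and the default output file name are made up for illustration, and it assumes gdb is installed and you are allowed to attach to the process.

```shell
# Sketch: dump backtraces of all threads of a running process with gdb.
# dump_backtraces and the default file name are illustrative, not part of pdns.

dump_backtraces() {
    local pid="$1"
    local out="${2:-backtrace-$pid.txt}"
    # --batch makes gdb run the -ex commands and then exit, which also
    # detaches from the process again.
    gdb --batch --pid="$pid" -ex "thread apply all bt" > "$out" 2>&1
}

# Example (usually needs root):
#   dump_backtraces "$(pidof pdns_recursor)"
# For the perf side, `perf top -p <pid>` limits the view to one process.
```

The resulting file may contain sensitive data such as query names, so review it before posting.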

omoerbeek avatar omoerbeek commented on June 7, 2024

I tried to reproduce the issue on a CentOS 7 machine, but have failed so far. I am hammering the recursor instance with requests for metrics and regular DNS queries. I also decreased the statistics-interval so it prints every 5 seconds.

An observation: the metrics collection log lines stop while DNS queries themselves are still processed, which suggests the issue is with a particular thread internally called the handler (I know that is not a very good name). Its external name, as seen in ps and other tools that inspect processes or threads, is rec/web+stat.
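
To find that thread on a live system, the thread names can be read straight from /proc on Linux. A minimal sketch follows; the function name `list_thread_names` and its optional second argument (an alternative proc root, handy for trying it out) are assumptions for illustration.

```shell
# Sketch: list thread IDs and thread names of a process from /proc (Linux only).
# list_thread_names and the optional proc-root argument are illustrative.

list_thread_names() {
    local pid="$1"
    local proc_root="${2:-/proc}"
    local task
    for task in "$proc_root/$pid"/task/*; do
        # Skip entries that vanished or are not readable.
        [ -r "$task/comm" ] || continue
        printf '%s %s\n' "${task##*/}" "$(cat "$task/comm")"
    done
}

# Example: show only the handler thread of the recursor.
#   list_thread_names "$(pidof pdns_recursor)" | grep -F 'rec/web+stat'
```

The thread ID in the first column is what tools like top -H and pidstat show as PID for that thread.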

BBQigniter avatar BBQigniter commented on June 7, 2024

we still have to wait for the problem to occur so I can send you the requested stuff. The first time it happened was about 2 days after the update.

Looking at the metrics we have, we see about 50 req/s on average, with some short peaks of up to about 120 req/s, summed over the recursors. The "primary" recursor handles roughly 60% of the requests. The second recursor hit this strange problem about 1.5 days later.

BBQigniter avatar BBQigniter commented on June 7, 2024

Wow, thank you very much.

In the meantime we had the problem again on Saturday at about 00:30. A colleague executed the suggested commands, I'll post the output asap - I only have to check something first, as something is not clear to me and prolly would create some confusion.

SlickSmith189 avatar SlickSmith189 commented on June 7, 2024

Hi, we're also affected quite regularly. @omoerbeek, we're happy to help, so let me know if you need any info or assistance. Is it possible to get the reproduction steps (if there are any)?

omoerbeek avatar omoerbeek commented on June 7, 2024

We are still working on this. Please post the version and other relevant info, so we have more data points. You also could try to figure out if reverting to a previous version makes a difference.

abdrabo avatar abdrabo commented on June 7, 2024

We had a similar issue. Below is some information.

Last stable version we have tried: 4.9.3
Current unstable version for us: 5.0.3

We are running pdns-recursor with 8 threads. At the moment 2 of these threads are at 100% CPU usage in user space. It took more than 3 days for one of the servers to reach this state. I tried performing the same query multiple times on that server, and some of the attempts time out. I guess it times out when the kernel distributes a query to one of the stuck threads. It seems that these 2 threads are stuck in an infinite loop. I can try to share the stack trace if needed. Reverting to version 4.9.3 brings us back to a stable state.
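
The per-thread top output that follows can be narrowed down to just the spinning threads with a small filter. This is a sketch, not something from the thread: `busy_threads` is a made-up name, and it assumes top's default column layout (where %CPU is the 9th field) and a locale that prints decimals with a dot.

```shell
# Sketch: print threads at or above a %CPU threshold from `top -b -n 1 -H`
# output read on stdin. Assumes the default top columns, %CPU being field 9.

busy_threads() {
    local threshold="${1:-90}"
    # Only lines that start with a numeric thread ID are thread rows; the
    # summary and header lines of top are skipped.
    awk -v min="$threshold" \
        '$1 ~ /^[0-9]+$/ && $9 + 0 >= min { print $1, $9, $NF }'
}

# Example:
#   top -b -n 1 -H -p "$(pidof pdns_recursor)" | busy_threads 90
```

Each output line holds the thread ID, its %CPU and the last word of the command column, so the IDs can be matched against a gdb backtrace.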

$ top -b -n 1 -H
32003 pdns-re+  20   0 2006892 667084  19328 R  99.9   0.5   2985:51 rec/worker
31997 pdns-re+  20   0 2006892 667084  19328 R  94.4   0.5   2879:25 rec/worker
31996 pdns-re+  20   0 2006892 667084  19328 S   0.0   0.5 695:27.07 rec/worker
31998 pdns-re+  20   0 2006892 667084  19328 S   0.0   0.5 698:59.04 rec/worker
31999 pdns-re+  20   0 2006892 667084  19328 S   0.0   0.5 692:53.16 rec/worker
32000 pdns-re+  20   0 2006892 667084  19328 S   0.0   0.5 696:57.50 rec/worker
32001 pdns-re+  20   0 2006892 667084  19328 S   0.0   0.5 697:43.79 rec/worker
32002 pdns-re+  20   0 2006892 667084  19328 S   0.0   0.5 699:00.64 rec/worker
$ pidstat -p 31997 -p 32003 1

04:55:52 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
04:55:53 PM   844     31997  200.00    0.00    0.00    0.00  200.00    26  rec/worker
04:55:53 PM   844     32003  200.00    0.00    0.00    0.00  200.00    28  rec/worker

04:55:53 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
04:55:54 PM   844     31997  200.00    0.00    0.00    0.00  200.00    26  rec/worker
04:55:54 PM   844     32003  200.00    0.00    0.00    0.00  200.00    28  rec/worker

04:55:54 PM   UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
04:55:55 PM   844     31997  199.00    0.00    0.00    0.00  199.00    26  rec/worker
04:55:55 PM   844     32003  200.00    0.00    0.00    0.00  200.00    28  rec/worker
$ netstat -ulnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
udp   2147485440      0 0.0.0.0:53              0.0.0.0:*                           31993/pdns_recursor 
udp   2147484672      0 0.0.0.0:53              0.0.0.0:*                           31993/pdns_recursor 
udp        0      0 0.0.0.0:53              0.0.0.0:*                           31993/pdns_recursor 
udp        0      0 0.0.0.0:53              0.0.0.0:*                           31993/pdns_recursor 
udp        0      0 0.0.0.0:53              0.0.0.0:*                           31993/pdns_recursor 
udp        0      0 0.0.0.0:53              0.0.0.0:*                           31993/pdns_recursor 
udp        0      0 0.0.0.0:53              0.0.0.0:*                           31993/pdns_recursor 
udp        0      0 0.0.0.0:53              0.0.0.0:*                           31993/pdns_recursor 
$ netstat --udp --statistics
    282618010 packet receive errors
    282618010 receive buffer errors

$ dig +tries=1 +timeout=1 @127.0.0.1 example.com
;; connection timed out; no servers could be reached

$ netstat --udp --statistics
    282618011 packet receive errors
    282618011 receive buffer errors
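
In the output above, the receive buffer error counter goes from 282618010 to 282618011 across a single timed-out dig. That before/after comparison can be scripted as below. It is a rough sketch: both function names are invented for illustration, it assumes netstat prints a line ending in "receive buffer errors" as shown above, and the counter is system-wide, so other UDP traffic can inflate the delta.

```shell
# Sketch: measure how much the UDP "receive buffer errors" counter grows
# while a command runs. Function names are illustrative.

# Extract the counter from `netstat --udp --statistics` output on stdin.
rcvbuf_errors() {
    awk '/receive buffer errors/ { print $1; exit }'
}

# Run the given command and print the counter delta.
rcvbuf_errors_during() {
    local before after
    before=$(netstat --udp --statistics | rcvbuf_errors)
    "$@" > /dev/null 2>&1
    after=$(netstat --udp --statistics | rcvbuf_errors)
    echo $(( ${after:-0} - ${before:-0} ))
}

# Example:
#   rcvbuf_errors_during dig +tries=1 +timeout=1 @127.0.0.1 example.com
```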

danielchristianschroeter avatar danielchristianschroeter commented on June 7, 2024

We have observed the same behavior (suddenly and unexpectedly 100% CPU utilization) also with version 4.8.7. The PowerDNS and hardware metrics do not show any abnormalities. perf top/gdb is still pending. After a reboot, everything is back to normal. Two independent DNS recursors were crippled almost simultaneously. Once PowerDNS is in this state, it is more or less unable to handle DNS requests promptly.

mape2k avatar mape2k commented on June 7, 2024

> We have observed the same behavior (suddenly and unexpectedly 100% CPU utilization) also with version 4.8.7.

Can confirm this. We hit this bug on two instances (active / passive HA) on Debian with version 4.8.7.
This also affects the passive node which did not resolve anything in passive mode. Both are behind dnsdist.

BBQigniter avatar BBQigniter commented on June 7, 2024

any info yet on the patch? we just had a problem with both our recursors simultaneously which caused major panic in the company

damsy avatar damsy commented on June 7, 2024

We can confirm this behavior on 5.0.3; we downgraded to 5.0.2 and the issues disappeared.

RHEL 8.9

phonedph1 avatar phonedph1 commented on June 7, 2024

If you didn't already see it... https://blog.powerdns.com/2024/04/24/powerdns-recursor-4-8-8-4-9-5-5-0-4-released
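
The URL names three releases: 4.8.8, 4.9.5 and 5.0.4. A quick way to check a given version string against them is sketched below. The function name `has_listed_fix` is made up, it relies on `sort -V` (available in GNU coreutils), and it only compares against the versions in the URL; branches other than 4.8, 4.9 and 5.0 are reported as unknown because this thread says nothing about them.

```shell
# Sketch: compare a recursor version with the releases named in the
# announcement URL (4.8.8, 4.9.5, 5.0.4).
# Returns 0 = at or above the listed release, 1 = older, 2 = unknown branch.

has_listed_fix() {
    local ver="$1" min lowest
    case "$ver" in
        4.8.*) min=4.8.8 ;;
        4.9.*) min=4.9.5 ;;
        5.0.*) min=5.0.4 ;;
        *) return 2 ;;
    esac
    # sort -V orders version strings; if min sorts first, then ver >= min.
    lowest=$(printf '%s\n%s\n' "$min" "$ver" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}

# Example:
#   has_listed_fix 5.0.3 || echo "older than the listed release, consider upgrading"
```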
