
Comments (20)

wtarreau commented on September 7, 2024

@rdev5 I managed to implement a faster closefrom() call by abusing poll(). It skips over closed FDs and only closes the active ones. poll() still takes a bit of time to return the list, but in my case I'm seeing that it roughly doubles the performance, scanning about 12 million FDs per second at only 5% CPU on my machine. I think it could be backported reasonably easily if you want to try it; you need these 3 patches from mainline:

  • 2d7f81b
  • 2555ccf
  • 9188ac6
    With this said, you still need to be careful about your number of concurrent connections. Also, it's worth keeping in mind that some external programs may also close all FDs once started, and they will inherit the large limit. This is something that could still be improved here, though.

from haproxy.

rdev5 commented on September 7, 2024

Don't suppose it might have anything to do with this while loop...?

https://github.com/haproxy/haproxy/blob/master/src/checks.c#L1973-L1974

TimWolla commented on September 7, 2024

@rdev5 Can you check what type of load it is? Is HAProxy running at 100% or is the bulk of the load generated in the kernel?

If it's HAProxy: Try getting a flame graph (http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html). You might need to install debug symbols for libssl / other shared libraries to make sense of it.

If it's the Kernel: Maybe fork + exec is slow on that machine?

wtarreau commented on September 7, 2024

Tim, I think you're right about the loop, given the high maxconn value in the config, it certainly takes time.
@rdev5 do you really need such a high value? I doubt it, given that you have a single process. 2 million concurrent connections with SSL would eat tens of gigabytes of RAM... So if possible, please try setting this to a more suitable value and see if it improves things.

There's unfortunately no clean, portable way to know the highest in-use FD number, nor to get the list of open FDs. Some BSDs have a closefrom() syscall that replaces this loop more efficiently, but that's all.

TimWolla commented on September 7, 2024

Tim, I think you're right about the loop,

You want s/Tim/Matt/ here.

wtarreau commented on September 7, 2024

Ah yes, I claim ENOCOFFEE :-) I've read it as part of your response.

git001 commented on September 7, 2024

@rdev5 there is a blog post about a similar setup for ~2M SSL connections; maybe it helps.

https://medium.freecodecamp.org/how-we-fine-tuned-haproxy-to-achieve-2-000-000-concurrent-ssl-connections-d017e61a4d27

wtarreau commented on September 7, 2024

I've changed the type to "help-wanted" and the status to "works as designed" because the huge number of FDs, multiplied by the number of servers and the frequency of the forks, is what provokes this. It doesn't mean we shouldn't try to further improve the situation, as shown by the patch above implementing an equivalent of closefrom() on operating systems missing it, but the behavior is indeed expected given the configuration.

TimWolla commented on September 7, 2024

on operating systems missing it

Is that “missing it” correct? I believe you never actually use closefrom, even if available.

wtarreau commented on September 7, 2024

I have not yet done the native closefrom() because it will require another build option and was not needed for Matt; I really wanted to provide him with a quick thing he could try. I'll check with Olivier to get a native closefrom() implementation that works on *BSD, and I may try again on Solaris once I repair my old Sun.

But the purpose of the patch clearly was to have an equivalent to closefrom() which is faster than the close() loop for those who don't have closefrom().

mattborja commented on September 7, 2024

@wtarreau Sorry for the delayed response. I won't be in the office until at least Monday to test (barring inclement weather), but will certainly give it a shot.

I also know that our maxconn is set a bit high, which incidentally was based on the article linked by @git001; our OS was tuned accordingly, as this is sort of a "shared" load balancer serving a variety of backends and I didn't want it becoming a bottleneck unnecessarily.

Will look through the patches and try to get to this Monday.

Thanks!

wtarreau commented on September 7, 2024

OK, thanks.

Keep in mind that it won't solve everything, though. The patches just cut the close time in half, which might still be too much. But whatever the solution, the system still has to iterate over the whole fd table of the process before the exec. I really think this is a good example of why external checks can be a really bad thing: they don't mix well with high fd counts. Using an agent would be much better. We could even imagine removing this feature and having a separate, dedicated process forked by the master which would handle agent requests and turn them into fork+exec to emulate the current external checks. At the very least, I think we need to add some documentation explaining these side effects so that nobody gets trapped by this again.

robertomczak commented on September 7, 2024

Hi,
I'm currently looking into long seamless reload times of HAProxy with 0 active sessions.
Our setup uses external-check as well.
During the investigation I observed lots of EBADF (Bad file descriptor) errors in an strace of HAProxy (attachment): strace_17775.txt
For testing purposes, our external-check command was replaced with /bin/true.

It looks like the long reload times may be related to the issue covered here.

Any idea when improved FD handling might come to HAProxy?

wtarreau commented on September 7, 2024

It's not an issue; it's the only way to close all open FDs. The patch set above aims to improve the situation by detecting the descriptors which are already closed and skipping them, but they still have to be scanned by the kernel, and nothing is free. We've already merged the patches above to improve the situation, but really, the next step ought to be removing external check support if it causes so many issues.

rdev5 commented on September 7, 2024

@wtarreau Hi William, I'm looking to get back into this hopefully sometime this morning. In the meantime, as you consider potentially removing external check support, I wonder if it would be possible to support a form of remote-check: something akin to polling an external URL with the same set of environment variables (i.e. HAPROXY_SERVER_ADDR, HAPROXY_SERVER_PORT, etc.) passed as request headers, query string parameters, etc.

rdev5 commented on September 7, 2024

I suppose we could make it work with agent-check, but that makes it much less maintainable. We really don't want "too many cooks in the kitchen" writing socket programs and scheduling cron jobs on our production load balancers; we'd prefer to stick with editing a single haproxy.cfg file that we can keep under source control.

rdev5 commented on September 7, 2024

@wtarreau Sorry for the delay getting back to you on this; had to rebuild a new openSUSE image to do this.

I built, configured, and ran haproxy, watching the server load (via uptime) both before and after the patch, and there is definitely a huge difference!

Before the patch (at f131481):

$ while true; do uptime; sleep 1; done
...
up 0:28, 1 user, load average: 3.21, 2.92, 2.39

After the patch (at 2292edf):

$ while true; do uptime; sleep 1; done
...
up 0:28, 1 user, load average: 0.07, 0.08, 0.03

Reverting the patch reproduces the issue. Also, I didn't test the individual patches, just the two snapshots mentioned above.

Thanks for looking into this for us, and for the quick turnaround.

How should we proceed from here?

wtarreau commented on September 7, 2024

Hi,
thanks for the feedback. If you see that much of a difference, it means you're very close to the limit above which your checks are frequent enough to constantly have some close() calls running. I've just pushed a few more patches to the master branch which restore the initial limits before execing checks. This will make sure your external scripts (or equivalents) don't in turn waste their time closing all FDs.
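
The idea of restoring the limits before exec can be sketched like this (illustrative only; `restore_fd_limit_for_child` is a made-up name, and the real code lives in the master-branch patches mentioned above): the child resets RLIMIT_NOFILE to the modest value the process started with, so anything it execs doesn't inherit a multi-million fd ceiling and waste its own time closing it all.

```c
#include <sys/resource.h>

/* Shrink the fd limit back to 'initial' (both soft and hard) so an
 * exec'd external check doesn't inherit a huge RLIMIT_NOFILE.
 * Returns 0 on success, -1 on failure. */
static int restore_fd_limit_for_child(rlim_t initial)
{
    struct rlimit lim = { initial, initial };

    return setrlimit(RLIMIT_NOFILE, &lim);
}
```

Called between fork() and exec*(), this keeps the parent's limits untouched while the child and the check script see a sane ceiling.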

We might consider backporting the closefrom() calls to recent branches, but I'm not a big fan of backporting them to older, very stable branches, because any change presents a risk of regression, and people who don't have this problem won't value the improvement. Would you have any objection to upgrading to the latest 1.9 if we backport it there?

Regarding the possible removal of the feature in the long term, it's not planned yet but it definitely deserves some thought. One clean approach I'm considering would be to rely on the master process in master-worker mode: the workers would simply communicate with the master, and the master would fork+exec the commands itself. This would also solve the chroot and permission issue, as you could still keep your workers secure. That's just an idea, of course.

rdev5 commented on September 7, 2024

@wtarreau Alright, so after talking with our vendor, they're willing to move up to 1.9 on their new image slated for release at the end of this month. If it could be backported there by around that time, that would be ideal.

wtarreau commented on September 7, 2024

Given that this one was opened 7 months ago and hasn't required any updates, I guess it's safe to close now.
