Comments (8)
OK, it's fairly easy to repro this. This uses recursion (to a max depth) because that causes the stack size to keep growing, which causes `rb_profile_frames` and `stackprof_record_sample_for_stack` to keep doing more work on each iteration.
```ruby
def recurse(num = 1, depth = 1)
  if depth == 10000
    num
  else
    recurse(num * 2, depth + 1)
  end
end
```
At a slow sampling rate (a large interval), this is fine:

```ruby
StackProf.run(mode: :wall, interval: 1000) { recurse }
```

and it prints out some results. (I see the same with `interval: 100`.)

But down at 1 microsecond, this consistently hangs:

```ruby
StackProf.run(mode: :wall, interval: 1) { recurse }
```
I assume the exact cutoff where this hangs will depend on context-specific factors like the CPU. Using `printf` to observe the runtime, the amount of time required for `stackprof_record_sample_for_stack` is highly variable. But I think it only takes one, or a few, of the sample collections being longer than `interval` for the program to hang indefinitely.
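For a rough sense of scale, using the handler costs measured later in this thread: `stackprof_record_sample` takes on the order of 10-187 µs per sample on this workload, so at `interval: 1` (1 µs) a single sample costs 10x-187x the interval. Another SIGPROF is therefore already pending by the time each handler finishes, and the profiled code never gets a turn.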
---
@benbuckman So, looking at this per your request over in #122:

My initial thought after looking through your write-up and suggestions again was "why don't we just use a lock, so that when multiple job handlers are being acted on at the same time, the handler just ejects early" ... but it turns out that was already implemented:
`stackprof/ext/stackprof/stackprof.c`, lines 548 to 558 at `b886c3c`
... or at least, I think it is. Not sure if line 551 is causing it to basically reset the counter every time the handler method is fired, but that might just be my lack of experience with C talking here...
If that code is indeed not working as intended, do we know why? Is it because the code isn't working as I described, or is something else at play?
Would a more proper mutex help here?
I will also try reproducing using the method you described above.
Edit: My lack of C experience is showing here. The `static` extends the lifetime of that variable to the end of the program, so in this case `in_signal_handler` should be acting like a simple mutex. That said, I'm unsure whether that is enough when the interval gets so low.
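For anyone else following along, the pattern in question looks roughly like this (my paraphrase of the referenced lines, not a verbatim copy):

```c
/* Paraphrase of the guard around the sample-recording job handler.
 * The static flag survives across calls, so it acts like a simple
 * (non-atomic) mutex: if a sample is already being recorded, eject. */
static void
stackprof_job_record_sample(void *data)
{
    static int in_signal_handler = 0;
    if (in_signal_handler) return;   /* already mid-sample: bail out */

    in_signal_handler++;             /* "lock" */
    stackprof_record_sample();       /* the expensive part */
    in_signal_handler--;             /* "unlock" */
}
```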
Edit 2: After re-reading your "working theory", everything above is pointless to mention, as the lock doesn't prevent work from being piled up right after the handler code and outside of the lock... as you had already described. That said, I think I am able to provide some evidence supporting your theory, so I will do so in my next comment.
---
Alright, so I think I have some evidence to support @benbuckman's claim that we are creating a situation where the "queue" piles up when the `interval` gets too small, which probably checks off your second box.
Most of the info for this is in a gist I created:
https://gist.github.com/NickLaMuro/3b54c57e1ffeb21316f38529fa93ea67
But effectively it is a modified form of the reproducer that @benbuckman posted earlier. What I did was basically wrap his `recurse` method in an iterator and tweak some of the values when running it. The iterator was there to make sure I found consistent results between runs that I could go off of, and it also allowed me to kill off the "hanging" process at consistent intervals.
In addition, when the process was killed, I was also keeping track of how large `caller` got at that point, which can help determine how much was left to "work" on.
I was surprised at how quickly I got consistent results: once I started seeing somewhat consistent results when the process timed out, I used a general average to set the `depth` so that, ideally, sometimes it would fail and sometimes it would pass. The number I ended up at was 900, and it did exactly what I expected.
I think what this shows, indirectly, with respect to your second "TODO" item:
- Confirm that the bug is in fact caused by a pile-on of signal handlers / postponed jobs.
That there is a pileup, and that work is getting completed, though it is hindered specifically by the extra work that `stackprof` ends up injecting as part of its profiling. I didn't add timestamps to the log output statements (I should have, and probably will shortly), but basically "Test 1" happens instantly (as Ben probably observed as well), whereas when I dialed it in to a `depth` of 900, it would take nearly the entire 30 seconds to complete.
So after going through this exercise, I am curious whether we should go about this in a similar (though opposite) fashion to what I mentioned in a comment on Ben's previous PR:

> So with that said, I wonder if it makes more sense to instead drop the multiplication code altogether, and instead just error when the interval size is too large. Maybe at 50 times a second (so 20000) or lower? At that point, there is really no value in using a sampling profiler if your samples are that far apart.
Obviously, in this case we are in the complete opposite direction, but at some point the `interval` we are working with becomes so low that we are better off with an event-based profiler like `ruby-prof` instead of this. So I guess I am curious whether we should provide a warning (or even the "error" suggestion you provided in the description), or just some documentation around the limits of too small an interval.
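To make the "error" option concrete, a naive guard could look something like the sketch below. To be clear, this is hypothetical code, not something in stackprof today, and the 50 µs floor is an invented number (and, as discussed further down, a fixed floor may not even be the right shape for the fix):

```c
/* Hypothetical: reject intervals the sampler plausibly can't sustain. */
#define MIN_INTERVAL_USEC 50

if (NUM2INT(interval) < MIN_INTERVAL_USEC) {
    rb_raise(rb_eArgError,
             "interval of %d usec is too small to sample reliably",
             NUM2INT(interval));
}
```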
That said, you mentioned that you had this issue in production. Are you using a more reasonable `interval` value there, or are you trying to push the `interval` to something really small? My question would then be: is there something else at play that is causing Ruby to get basically no work done in between samples, or causing the stack to be incredibly large in your case?
---
Thank you @NickLaMuro!

My C knowledge is very weak, but I was able to add a bunch of timestamp diffs to the code – specifically, in `stackprof_record_sample` – and print them out. So, adding another angle:
- With `mode: :wall, interval: 200` (microseconds), the flow (in a test, profiling my earlier `recurse` code) looks like this:
```
sampling 311 usec since last, 311 since start, with interval 200
duration of stackprof_record_sample: 79 usec with interval 200
sampling 36 usec since last, 426 since start, with interval 200
duration of stackprof_record_sample: 67 usec with interval 200
sampling 136 usec since last, 629 since start, with interval 200
duration of stackprof_record_sample: 51 usec with interval 200
... [a few dozen more lines] ...
sampling 107 usec since last, 17221 since start, with interval 200
duration of stackprof_record_sample: 83 usec with interval 200
```
(and then it prints out results and exits nicely.)
- However, with `interval: 50`, it's like this:
```
sampling 117 usec since last, 117 since start, with interval 50
duration of stackprof_record_sample: 187 usec with interval 50
sampling 35 usec since last, 339 since start, with interval 50
duration of stackprof_record_sample: 32 usec with interval 50
sampling 6 usec since last, 377 since start, with interval 50
duration of stackprof_record_sample: 13 usec with interval 50
sampling 27 usec since last, 417 since start, with interval 50
duration of stackprof_record_sample: 19 usec with interval 50
sampling 26 usec since last, 462 since start, with interval 50
duration of stackprof_record_sample: 19 usec with interval 50
sampling 34 usec since last, 515 since start, with interval 50
duration of stackprof_record_sample: 20 usec with interval 50
sampling 48 usec since last, 583 since start, with interval 50
duration of stackprof_record_sample: 29 usec with interval 50
sampling 3 usec since last, 615 since start, with interval 50
duration of stackprof_record_sample: 16 usec with interval 50
... [50k more lines] ...
sampling 7 usec since last, 3657101 since start, with interval 50
duration of stackprof_record_sample: 77 usec with interval 50
sampling 9 usec since last, 3657187 since start, with interval 50
duration of stackprof_record_sample: 69 usec with interval 50
Interrupted. Exiting...
```
(It was going on forever, I had to sigterm the process.)
My interpretation:
- On this sample code, it takes ~10-187 microseconds to run `stackprof_record_sample`.
- The `static int in_signal_handler` mutex is sort of working, insofar as samples don't seem to overlap.
- However, samples seem to be queuing up faster than they are being collected, to the point that the sampler hogs the whole process ~forever.
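As a sanity check on those logs: the `interval: 50` run was at 3,657,187 µs since start when I killed it, which at one tick per 50 µs corresponds to roughly 73,000 expected samples; with each sample costing ~13-187 µs and often only single-digit microsecond gaps between them, sampling alone accounts for essentially all of the process's time.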
---
One possible solution which seems like it might work is simply moving the `in_signal_handler` lock from static variables in the job handlers (which work on the already-enqueued jobs) to the `_stackprof` global struct, checked in both the job handlers and the signal handler, so that it doesn't enqueue jobs if it's still working on one.
I'm not sure what this would mean for the final results in terms of lost frames; I'll have to think about that tomorrow. I'll put up a POC PR with what I've got so far shortly.
Edit: WIP-PR is at #124
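As a rough sketch of the idea (the real change is in #124; the function and field names here are loose approximations of stackprof's source, not exact):

```c
/* Sketch only: bail out of the SIGPROF handler while a previous sample
 * is still being recorded, instead of enqueueing yet another job. */
static void
stackprof_signal_handler(int sig, siginfo_t *sinfo, void *ucontext)
{
    _stackprof.overall_signals++;
    if (_stackprof.in_signal_handler)
        return;  /* previous sample still in flight: drop this tick */
    rb_postponed_job_register_one(0, stackprof_job_record_sample, (void*)0);
}
```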
---
One other note, re:

> So I guess I am curious if we should provide a warning (or even the "error" suggestion you provided in the description), or even just some documentation around the limits of too small of an interval.
>
> That said, you mentioned that you had this issue in production. Are you using a more reasonable interval value in that case, or are you trying to push the interval value to something really small?
One challenge here is that the safe interval is not constant. It depends on the size of the stack at any given moment, and on whether the same frame is repeating (because `stackprof_record_sample` in `raw`+`aggregate` modes does a bunch of analysis/aggregation on the frames). So I don't think it's as simple as setting a minimum number that the code always rejects; it has to be context-aware (e.g. know how long each sample is actually taking).
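As a thought experiment, a context-aware guard might time each sample and shed ticks whenever recording falls behind. Everything below is invented for illustration (the `timestamp_usec` helper and the field names are assumptions, and this is not stackprof code):

```c
/* Illustrative only: if the previous sample cost more than the
 * configured interval, skip this tick so the sampler can't outrun itself. */
static int64_t last_sample_usec = 0;

static void
job_record_sample(void *data)
{
    int64_t start = timestamp_usec();    /* assumed monotonic-clock helper */
    stackprof_record_sample();
    last_sample_usec = timestamp_usec() - start;
}

/* ...and in the SIGPROF handler: */
if (last_sample_usec > _stackprof.interval) {
    _stackprof.overall_signals++;   /* count the miss, but enqueue nothing */
    return;
}
```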
---
Hi, I'll take a look at this too. Of course, any solution to this will result in dropped samples. Is there a reason you need to make the sampling frequency so high, or is that just to demonstrate the bug? I'm curious because I've never run into this, and I use this tool very frequently.
---
Hey all, we're still running into this with Ruby 3 despite #150. In our case we're profiling in a separate thread using something like this, which never returns since the thread is never given the opportunity to wake up:
```ruby
profile = StackProf.run(mode: :cpu, raw: true, interval: 1000) do
  sleep(duration)
end
```
I'm not sure exactly why the sampling takes so long in our situation, but I've managed to repro this by forcing `stackprof_record_sample()` to take longer than the sampling interval with some dummy work:
```diff
index 98d4857..5f0f462 100644
--- i/ext/stackprof/stackprof.c
+++ w/ext/stackprof/stackprof.c
@@ -578,6 +578,9 @@ stackprof_record_sample_for_stack(int num, int64_t timestamp_delta)
 void
 stackprof_record_sample()
 {
+    volatile unsigned long long i;
+    for (i = 0; i < 1000000000ULL; ++i);
+
     int64_t timestamp_delta = 0;
     int num;
     if (_stackprof.raw) {
```
My hypothesis is that when the handler takes longer than the SIGPROF timer, by the time a SIGPROF handler is done, there is already another pending SIGPROF that preempts everything else. It's really the kernel/signal system that's doing the "queuing", since signals that happen during handling don't interrupt immediately, but rather are blocked and delivered at the next opportunity.
Unfortunately when a process enters this state, it's impossible to recover, since the signal handling code always preempts other work (and seems to be holding the GVL).
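That hypothesis is easy to demonstrate outside Ruby entirely. Below is a minimal, self-contained C sketch (my own code, not stackprof's) in which the SIGPROF handler deliberately costs more than the timer period; until the handler disarms the timer, the main loop is starved exactly as described:

```c
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t ticks = 0;

static void on_prof(int sig)
{
    (void)sig;
    /* Simulate a sample that costs more than the 1 ms timer period below. */
    for (volatile unsigned long i = 0; i < 10000000UL; i++) ;
    if (++ticks == 100) {
        /* Disarm the timer so this demo eventually terminates. */
        struct itimerval off = { { 0, 0 }, { 0, 0 } };
        setitimer(ITIMER_PROF, &off, NULL);
    }
}

int main(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_prof;
    sigaction(SIGPROF, &sa, NULL);

    /* Fire SIGPROF every 1 ms of CPU time, like interval: 1000. */
    struct itimerval timer = { { 0, 1000 }, { 0, 1000 } };
    setitimer(ITIMER_PROF, &timer, NULL);

    /* Until the handler disarms the timer, this loop barely advances:
     * each handler return is immediately followed by delivery of the
     * SIGPROF that became pending while the handler was running. */
    for (volatile unsigned long n = 0; n < 100000000UL; n++) ;
    printf("main loop finished after %d ticks\n", (int)ticks);
    return 0;
}
```

Note that, by default, the signal being handled is blocked for the duration of its own handler, so at most one SIGPROF is ever pending; but with a handler slower than the period, one is always pending, which is all the starvation needs.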