
Comments (8)

benbuckman commented on July 23, 2024

OK, it's fairly easy to repro this. The snippet below uses recursion (to a max depth) because that causes the stack to keep growing, which means rb_profile_frames and stackprof_record_sample_for_stack do more and more work as the recursion deepens.

  def recurse(num = 1, depth = 1)
    if depth == 10000
      num
    else
      recurse(num * 2, depth + 1)
    end
  end

At a large interval (a slow sampling rate), this is fine:

StackProf.run(mode: :wall, interval: 1000) {recurse}

and it prints out some results. (And I see the same with interval: 100.)

But down at a 1 microsecond interval, this consistently hangs:

StackProf.run(mode: :wall, interval: 1) {recurse}

I assume the exact cutoff where this hangs will depend on context-specific factors like CPU. Using printf to observe the runtime, the amount of time required for stackprof_record_sample_for_stack is highly variable. But I think it only takes one, or a few, of the sample collections to be longer than the interval for the program to hang indefinitely.
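
A minimal sketch of that kind of printf timing (not the exact instrumentation used here; now_usec is a helper assumed for illustration, added to ext/stackprof/stackprof.c next to the existing call):

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* hypothetical helper: current monotonic time in microseconds */
static int64_t
now_usec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* ... then, inside stackprof_record_sample(), wrap the existing call: */
int64_t before = now_usec();
stackprof_record_sample_for_stack(num, timestamp_delta);
fprintf(stderr, "stackprof_record_sample_for_stack took %lld usec\n",
        (long long)(now_usec() - before));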


NickLaMuro commented on July 23, 2024

@benbuckman So looking at this per your request over in #122

So my initial thought after looking through your write-up and suggestions again was "why don't we just use a lock so that when multiple job handlers are being acted on at the same time, the handler just ejects early", but it turns out that was already implemented:

static void
stackprof_job_handler(void *data)
{
    static int in_signal_handler = 0;
    if (in_signal_handler) return;
    if (!_stackprof.running) return;

    in_signal_handler++;
    stackprof_record_sample();
    in_signal_handler--;
}

... or at least, I think it is. Not sure if line 551 is causing it to basically reset the counter every time the handler method is fired, but that might just be my lack of experience with C talking here...

If that code is indeed not working as intended, do we know why? Is it because the code isn't working as I described, or is something else at play?

Would a more proper mutex help here?

I will also try reproducing using the method you described above.

Edit: My lack of C experience is showing here. The static extends the lifetime of that variable to the end of the program, so in this case in_signal_handler should be acting like a simple mutex. That said, I'm unsure whether that is enough when the interval gets so low.
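
(As a standalone toy, unrelated to stackprof itself, this is the property being relied on: a static local is initialized once and keeps its value across calls, so the guard is not reset every time the handler fires.)

#include <stdio.h>

static void
handler_like(void)
{
    static int in_handler = 0;   /* initialized once; keeps its value between calls */
    if (in_handler) { puts("re-entered, bailing out"); return; }
    in_handler++;
    puts("recording a sample");
    in_handler--;                /* back to 0 by the time this call returns */
}

int
main(void)
{
    handler_like();
    handler_like();   /* the guard is 0 again here, so the work runs both times */
    return 0;
}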

Edit 2: After re-reading your "working theory", everything above is not worth mentioning, as the lock doesn't prevent work from piling up right after the handler code and outside of the lock... as you had already described. That said, I think I am able to provide some evidence supporting your theory, so I will do so in my next comment.


NickLaMuro commented on July 23, 2024

Alright, so I think I have some evidence to support @benbuckman's claim that we are creating a situation where the "queue" piles up when the interval gets too small, which probably checks off your second box.

Most of the info for this is in a gist I created:

https://gist.github.com/NickLaMuro/3b54c57e1ffeb21316f38529fa93ea67

But it's effectively a modified form of the reproducer that @benbuckman posted earlier. What I did was basically wrap his recurse method in an iterator and tweak some of the values when running it. The iterator was there just to make sure I found consistent results between runs that I could go off of, and it also allowed me to kill off the "hanging" process at consistent intervals.

In addition, when the process was killed, I also kept track of how large caller had grown at that point, which can help determine how much was left to "work" on.

I was surprised at how quickly I got consistent results, but once I started seeing somewhat consistent results when the process timed out, I used a rough average to set the depth so that, ideally, sometimes it would fail and sometimes it would pass. The number I ended up on was 900, and that behaved exactly as I expected.

I think what this shows, indirectly, with respect to your second "TODO" item:

  • Confirm that the bug is in fact caused by a pile-on of signal handlers / postponed jobs.

That there is a pileup, and that work is getting completed, though it is hindered specifically by the extra work that stackprof ends up injecting as part of its profiling. I didn't add timestamps to the log output statements (I should have, and probably will shortly), but basically "Test 1" happens instantly (as Ben probably observed as well), whereas when I dialed it in to a depth of 900, it would take nearly the entire 30 seconds to complete.


So after going through this exercise, I am curious if we should possibly go about this in a similar (though opposite) fashion to what I mentioned in a comment on Ben's previous PR:

So with that said, I wonder if it makes more sense to instead drop the multiplication code all together, and instead just error when the interval size is too large. Maybe at 50 times a second (so 20000) or lower? At that point, there is really no value in using a sampling profiler if your samples are that far apart.

Obviously, in this case, we are going in the complete opposite direction, but at some point the interval we are working with becomes so low that we are better off with an event-based profiler like ruby-prof instead of this. So I guess I am curious if we should provide a warning (or even the "error" suggestion you provided in the description), or even just some documentation around the limits of too small an interval.

That said, you mentioned that you had this issue in production. Are you using a more reasonable interval value in that case, or are you trying to push the interval value to something really small? My question would then be: is there something else at play here that is causing the Ruby process to basically not get any work done in between profiles? Or something causing the stack to be incredibly large in your case?


benbuckman commented on July 23, 2024

Thank you @NickLaMuro !

My C knowledge is very weak, but I was able to add a bunch of timestamp diffs to the code – specifically, in stackprof_record_sample – and print them out. So, adding another angle –

  1. With mode: :wall, interval: 200 (microseconds), the flow (in a test, profiling my earlier recurse code) looks like this:
sampling 311 usec since last, 311 since start, with interval 200
duration of stackprof_record_sample: 79 usec with interval 200
sampling 36 usec since last, 426 since start, with interval 200
duration of stackprof_record_sample: 67 usec with interval 200
sampling 136 usec since last, 629 since start, with interval 200
duration of stackprof_record_sample: 51 usec with interval 200

... [a few dozen more lines] ...

sampling 107 usec since last, 17221 since start, with interval 200
duration of stackprof_record_sample: 83 usec with interval 200

(and then it prints out results and exits nicely.)

  2. However, with interval: 50, it's like this:
sampling 117 usec since last, 117 since start, with interval 50
duration of stackprof_record_sample: 187 usec with interval 50
sampling 35 usec since last, 339 since start, with interval 50
duration of stackprof_record_sample: 32 usec with interval 50
sampling 6 usec since last, 377 since start, with interval 50
duration of stackprof_record_sample: 13 usec with interval 50
sampling 27 usec since last, 417 since start, with interval 50
duration of stackprof_record_sample: 19 usec with interval 50
sampling 26 usec since last, 462 since start, with interval 50
duration of stackprof_record_sample: 19 usec with interval 50
sampling 34 usec since last, 515 since start, with interval 50
duration of stackprof_record_sample: 20 usec with interval 50
sampling 48 usec since last, 583 since start, with interval 50
duration of stackprof_record_sample: 29 usec with interval 50
sampling 3 usec since last, 615 since start, with interval 50
duration of stackprof_record_sample: 16 usec with interval 50

... [50k more lines] ...

sampling 7 usec since last, 3657101 since start, with interval 50
duration of stackprof_record_sample: 77 usec with interval 50
sampling 9 usec since last, 3657187 since start, with interval 50
duration of stackprof_record_sample: 69 usec with interval 50
Interrupted. Exiting...

(It was going on forever, I had to sigterm the process.)

My interpretation:

  1. On this sample code, it takes ~10–187 microseconds to run stackprof_record_sample.
  2. The static int in_signal_handler mutex is sort of working, insofar as samples don't seem to overlap.
  3. However, samples seem to be queued up faster than they can be collected, so the sampler hogs the whole process ~forever.


benbuckman commented on July 23, 2024

One possible solution which seems like it might work is to move the in_signal_handler lock from a static variable in the job handlers (which work on the already-enqueued jobs) to the _stackprof global struct, checked in both the job handlers and the signal handler ... so that it doesn't enqueue new jobs while it's still working on one.
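
Roughly, the shape of that change might be something like the following. This is just a sketch of the idea, not the actual patch; the in_sample field is hypothetical, and the signal handler is simplified:

static void
stackprof_job_handler(void *data)
{
    if (!_stackprof.running) return;

    stackprof_record_sample();
    _stackprof.in_sample = 0;   /* hypothetical flag: sampling finished, allow new jobs */
}

static void
stackprof_signal_handler(int sig, siginfo_t *sinfo, void *ucontext)
{
    if (!_stackprof.running) return;
    if (_stackprof.in_sample) return;   /* a sample is still being recorded; drop this tick */

    _stackprof.in_sample = 1;
    rb_postponed_job_register_one(0, stackprof_job_handler, (void *)0);
}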

I'm not sure what this would mean for the final results in terms of lost frames; I'll have to think about that tomorrow. I'll put up a POC PR with what I've got so far, shortly.

Edit: WIP-PR is at #124


benbuckman commented on July 23, 2024

One other note, re

So I guess I am curious if we should provide a warning (or even the "error" suggestion you provided in the description), or even just some documentation around the limits of too small an interval.
That said, you mentioned that you had this issue in production. Are you using a more reasonable interval value in that case, or are you trying to push the interval value to something really small?

One challenge here is that the safe interval is not constant. It depends on the size of the stack at any given moment and on whether the same frames are repeating (because stackprof_record_sample in raw+aggregate modes does a bunch of analysis/aggregation on the frames). So I don't think it's as simple as setting a minimum number that the code always rejects; it has to be context-aware (e.g. know how long each sample is actually taking).
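
To illustrate what "context-aware" could mean in practice (purely a sketch, not something stackprof does today; last_sample_usec and dropped_ticks are hypothetical fields the job handler would have to maintain): skip a tick whenever the previous sample cost more than one interval, instead of rejecting small intervals up front.

static void
stackprof_signal_handler(int sig, siginfo_t *sinfo, void *ucontext)
{
    if (!_stackprof.running) return;

    /* If the previous sample took longer than the configured interval,
       drop this tick so the program itself gets a chance to run. */
    if (_stackprof.last_sample_usec > _stackprof.interval) {
        _stackprof.dropped_ticks++;
        _stackprof.last_sample_usec = 0;   /* give the next tick a chance */
        return;
    }

    rb_postponed_job_register_one(0, stackprof_job_handler, (void *)0);
}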


tenderlove commented on July 23, 2024

Hi, I'll take a look at this too. Of course, any solution to this will result in dropped samples. Is there a reason you need to make the sampling frequency so high? Or is that just to demonstrate the bug? I'm curious because I've never run into this, and I use this tool very frequently.


isobit commented on July 23, 2024

Hey all, we're still running into this with Ruby 3 despite #150. In our case we're profiling in a separate thread using something like this, which never returns since the thread is never given the opportunity to wake up:

profile = StackProf.run(mode: :cpu, raw: true, interval: 1000) do
  sleep(duration)
end

I'm not sure exactly why the sampling takes so long in our situation, but I've managed to repro this by forcing stackprof_record_sample() to take longer than the sampling interval with some dummy work:

index 98d4857..5f0f462 100644
--- i/ext/stackprof/stackprof.c
+++ w/ext/stackprof/stackprof.c
@@ -578,6 +578,9 @@ stackprof_record_sample_for_stack(int num, int64_t timestamp_delta)
 void
 stackprof_record_sample()
 {
+    volatile unsigned long long i;
+    for (i = 0; i < 1000000000ULL; ++i);
+
     int64_t timestamp_delta = 0;
     int num;
     if (_stackprof.raw) {

My hypothesis is that when the handler takes longer than the SIGPROF timer interval, by the time a SIGPROF handler is done, there is already another pending SIGPROF that preempts everything else. It's really the kernel/signal system that's doing the "queuing", since signals that arrive during handling don't interrupt immediately, but rather are blocked and delivered at the next opportunity.

Unfortunately when a process enters this state, it's impossible to recover, since the signal handling code always preempts other work (and seems to be holding the GVL).
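
For what it's worth, the pile-on mechanism can be reproduced outside of Ruby entirely. A small standalone sketch (assuming Linux/macOS; the loop counts are arbitrary and need tuning so the handler costs more CPU than one timer period):

#include <signal.h>
#include <stdio.h>
#include <sys/time.h>

static volatile sig_atomic_t samples = 0;

static void
on_sigprof(int sig)
{
    (void)sig;
    /* stand-in for a slow stackprof_record_sample(): burn more CPU than one timer period */
    for (volatile unsigned long i = 0; i < 5000000UL; i++);
    samples++;
}

int
main(void)
{
    struct sigaction sa;
    sa.sa_handler = on_sigprof;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGPROF, &sa, NULL);

    /* fire SIGPROF every 1ms of CPU time, i.e. faster than the handler can finish */
    struct itimerval timer;
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = 1000;
    timer.it_value = timer.it_interval;
    setitimer(ITIMER_PROF, &timer, NULL);

    /* The "real" work barely progresses: by the time each handler returns,
       another SIGPROF is already pending, so expect to have to kill this. */
    for (volatile unsigned long i = 0; i < 100000000UL; i++);

    printf("handled %d SIGPROFs\n", (int)samples);
    return 0;
}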

