Describe the bug reproducer below After upgradin

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Just as an FYI, <a class="issue-link js-issue-link" data-error-text="Failed to load ti

3.11.0 causes carrier thread pinning regression (compared to 3.10.2) while loading classes,about quarkusio/quarkus

Comments (75)

Sanne commented on August 15, 2024 3

@geoand yes, I like your idea. Worth exploring for sure.

FWIW when I was an early JBoss user I always deployed my code like that, over rsync. Much faster than packing things in WARs & EARs and uploading them 😉
That was JBoss AS 4.2 in ~2007. It supported "exploded archives" already.

from quarkus.

dmlloyd commented on August 15, 2024 2

The pinned thread printer could easily be enhanced to also print the pinning reason as well. I wonder if it is worth opening a bug and a PR?

from quarkus.

imperatorx commented on August 15, 2024 2

When I run it with the java 23 loom early acces JDK, the JFR event has an additional message:

pinnedReason = "Native frame or <clinit> on stack"

from quarkus.

dmlloyd commented on August 15, 2024 2

The issue was always there, the commit just changes the timing of it for some reason.

Maybe, we made load jar file + defineClass not atomic anymore (as a single atomic op) because the changes in the cl doesn't make load class a "fully locked op" from my understanding

Yeah, this makes sense: if the lock granularity of class loading is exactly the same as the lock granularity of the shared JAR lock, then the shared JAR lock would only ever be taken when the class loading lock is also taken (during class loading anyway). In this case we never actually needed the shared lock, as we could have just used the class loading lock instead for an exactly identical result.

The bug to fix remains unchanged though: we should not have a lock inside the class loading code path. It should be atomic in some other (non-blocking) way.

from quarkus.

dmlloyd commented on August 15, 2024 1

It might be simpler to patch the JDK PinnedThreadPrinter with more stack walker options as a first try using --patch-module.

from quarkus.

imperatorx commented on August 15, 2024 1

Yes, it prints the trace. here is a complete reproducer project. I've reproduced it on a 16 core windows machine: https://github.com/imperatorx/quarkus-reproducer-pin/blob/master/src/main/java/org/acme/GreetingResource.java

I've also added JFR frame logs to it.

from quarkus.

franz1981 commented on August 15, 2024 1

From what I can see reading the existing implementation, defineClass (which is native) can subsequently trigger new class loads (making it effectively recursive) - and if the runner class loader perform any new attempt to compete (against other concurrent load attempts) against the rw lock on the Jar resource, it can make it to "wait" while having a native method on the stack...does it makes sense @dmlloyd @mariofusco @geoand ?

A similar thing can happen if within a <clinit> a new class load happen and it hit contention on any of the existing JarResource's lock(s).

from quarkus.

mariofusco commented on August 15, 2024 1

The bug to fix remains unchanged though: we should not have a lock inside the class loading code path. It should be atomic in some other (non-blocking) way.

I agree and tried to implement this same idea here. Please give it a look.

from quarkus.

geoand commented on August 15, 2024 1

Taking a step back for a bit, here is a basic example of the layout of the fast-jar (which uses RunnerClassLoader and JarResource):

The reason we have the locks on JarResource is to ensure that we don't keep too many jars open which increases memory footprint.

One thing I have thought of in the past is to get rid of the jars entirely, and just use directories which correspond to the jar and would be "exploded" when the jar is built. This intuitively seems to me to overcome a lot of the issues we are facing (including this one), however I would like to invike folks to challenge this idea for potential drawbacks.
Furthermore, even if we do decide it makes sense, it's not something we want to implement in a micro release as it requires various non trivial changes.

from quarkus.

geoand commented on August 15, 2024 1

Just as an FYI, #41023 reverts the change in 3.11

from quarkus.

Sanne commented on August 15, 2024 1

@Sanne might remember as he is the one that added the ability to close jars and keep a "live set" of them open

Essentially what I had noticed is that keeping a Jar open implies keeping a ZipStream open (obviously), and to allow random access within such stream, the whole header section of the zip stream is kept in memory. Add to this file handles.. for a system in which we can assume that after bootstrap+warmup the classloading events become rare sightings I'd favour releasing memory over classloading-latency.

I don't remember the exact data amounts - it certainly did scale with the number of jars. And even if I did remember, I wouldn't trust such details as they tend to change with JVM improvements. I'd wager it might be better nowadays, however the principle still applies IMO .. we don't need to keep them open.

I'm also not convinced that we need parallel classloading capabilities as these are "just" the application classes, and mostly loaded by a single bootstrap thread. Did that change actually manifest in tangible benefits?

from quarkus.

franz1981 commented on August 15, 2024 1

@Sanne

I'm also not convinced that we need parallel classloading capabilities as these are "just" the application classes, and mostly loaded by a single bootstrap thread

We don't need it, indeed: we had no reports that it was required: the only reason why has happened IIRC was to "fix" the issue mentioned at smallrye/smallrye-context-propagation#443 (comment) but already fixed by #39988

in short: not necessary in the current form.

from quarkus.

franz1981 commented on August 15, 2024 1

Yep, but that means too that using reentrant locks while on a native call (i.e. defineClass) "maybe" won't be a problem in subsequent releases - as per synchronized blocks.
That means that I won't be worried for the pinning event but only to fix the deadlocking issue.

from quarkus.

dmlloyd commented on August 15, 2024 1

Any late linkage will cause it, for example calling a method that returns a Foo will cause Foo to be loaded if it already isn't loaded, plus the corresponding hidden native frame.

from quarkus.

vsevel commented on August 15, 2024 1

can you please double check?

short answer (before we re-test) is: I am definitely sure. because they always stayed with the LTS, and the behavior they we were saying is the stack with the runner that we searched across the quarkus issues. and that is how we found this current issue.

from quarkus.

dmlloyd commented on August 15, 2024

I can only hypothesize, but my hypothesis is that perhaps the JVM has an internal lock (or more likely, an invisible native call frame) which is being acquired in between the org.acme.GreetingResource.lambda$init$0(GreetingResource.java:32) frame and the io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:71) frame.

Maybe @franz1981 has an idea?

from quarkus.

dmlloyd commented on August 15, 2024

I'll try and think of some way to prove this hypothesis...

from quarkus.

dmlloyd commented on August 15, 2024

Looking into the JDK source, it appears that the stack walker used to print the pinned stack trace does not include the SHOW_HIDDEN_FRAMES flag, which I think strengthens my hypothesis considerably.

from quarkus.

franz1981 commented on August 15, 2024

we do have a thread unsafe declared class loader so maybe is the synchronization performed by the JVM itself that is not well-behaving here...?

from quarkus.

quarkus-bot commented on August 15, 2024

/cc @cescoffier (virtual-threads), @ozangunalp (virtual-threads)

from quarkus.

franz1981 commented on August 15, 2024

@dmlloyd @Sanne @geoand these seems the only 2 commits which separate the 2 versions mentioned

I was wrong about being thread-unsafe because of registerAsParallelCapable(); added in such (with the relevant related changes to make it possible - which I have to learn/look at)

from quarkus.

dmlloyd commented on August 15, 2024

I don't think that does much more than register the classloader's class into a map, which in turn determines which object is being synchronized on (in this case, none).

I think that we may want to revisit the idea of blocking inside the class loader (in this case, we are using a shared lock on the resource map). I believe we should change our assumptions and assume that because class loading may happen in a virtual thread, we cannot block (wait) inside of a class loading operation. (Yes I realize that loading a resource may be a blocking operation, but it is generally not a waiting operation.)

from quarkus.

franz1981 commented on August 15, 2024

I don't think that does much more than register the classloader's class into a map, which in turn determines which object is being synchronized on (in this case, none).

yeah I meant that my previous statement about not being parallel capable was very wrong :P

from quarkus.

franz1981 commented on August 15, 2024

I think that we may want to revisit the idea of blocking inside the class loader

I need first to understand what means "pinning" here, looking at https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/java.base/share/classes/jdk/internal/vm/Continuation.java#L385 to match what the user is doing.

from quarkus.

franz1981 commented on August 15, 2024

@dmlloyd I think that what the JDK does to understand if pinning is happening is to check on (a failed) preemption the context of the frame, see https://github.com/openjdk/jdk/blob/jdk-21%2B35/src/java.base/share/classes/jdk/internal/vm/Continuation.java#L69-L71

So would be "nice" to have in the output of the pinned reason, which detailed reason is...

from quarkus.

dmlloyd commented on August 15, 2024

Right, I can only infer the possibility of a hidden native frame based on what's happening at the time, but we don't really know without either the pinned reason or full stack trace.

from quarkus.

franz1981 commented on August 15, 2024

Exactly, and I've found this -> https://github.com/openjdk/jdk/blob/5abc02927b480a85fadecf8d03850604510276e4/src/hotspot/share/runtime/continuationFreezeThaw.cpp#L205-L207

I could debug it via gdb, in case (poor me .-. very rusty on this)

from quarkus.

geoand commented on August 15, 2024

Is there a JFR event that we could use to find the reason for the pinning?

from quarkus.

franz1981 commented on August 15, 2024

@geoand https://sap.github.io/SapMachine/jfrevents/21.html#virtualthreadpinned

The interesting bit is that you can configure threshold to check if is a temporary pinning or not

eg read/write reentrant locks in the past used to spin a bit, sometime, causing to actually "pin" the thread, IDK if this is still valid.

from quarkus.

franz1981 commented on August 15, 2024

@imperatorx what is printed with jdk.tracePinnedThreads=full?

from quarkus.

imperatorx commented on August 15, 2024

@imperatorx what is printed with jdk.tracePinnedThreads=full?

full pin trace:

Thread[#74,ForkJoinPool-1-worker-1,5,CarrierThreads]
    java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:185)
    java.base/jdk.internal.vm.Continuation.onPinned0(Continuation.java:393)
    java.base/java.lang.VirtualThread.park(VirtualThread.java:592)
    java.base/java.lang.System$2.parkVirtualThread(System.java:2639)
    java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
    java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1079)
    java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
    io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference(JarResource.java:156)
    io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:72)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:105)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:71)
    org.acme.GreetingResource.lambda$init$3(GreetingResource.java:50)
    java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:572)
    java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
    java.base/java.lang.VirtualThread.run(VirtualThread.java:311)

JFR stack frames printed:

{
  method = java.lang.VirtualThread.parkOnCarrierThread(boolean, long)
  lineNumber = 687
  bytecodeIndex = 116
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread.park()
  lineNumber = 603
  bytecodeIndex = 200
  type = "Interpreted"
}

{
  method = java.lang.System$2.parkVirtualThread()
  lineNumber = 2639
  bytecodeIndex = 17
  type = "Interpreted"
}

{
  method = jdk.internal.misc.VirtualThreads.park()
  lineNumber = 54
  bytecodeIndex = 3
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.LockSupport.park(Object)
  lineNumber = 219
  bytecodeIndex = 16
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer$Node, int, boolean, boolean, boolean, long)
  lineNumber = 754
  bytecodeIndex = 361
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(int)
  lineNumber = 1079
  bytecodeIndex = 15
  type = "Interpreted"
}

{
  method = java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock()
  lineNumber = 738
  bytecodeIndex = 5
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference()
  lineNumber = 156
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.JarResource.getResourceData(String)
  lineNumber = 72
  bytecodeIndex = 1
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(String, boolean)
  lineNumber = 105
  bytecodeIndex = 159
  type = "Interpreted"
}

{
  method = io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(String)
  lineNumber = 71
  bytecodeIndex = 3
  type = "Interpreted"
}

{
  method = org.acme.GreetingResource.lambda$init$3(ClassLoader)
  lineNumber = 50
  bytecodeIndex = 1
  type = "Interpreted"
}

{
  method = org.acme.GreetingResource$$Lambda+0x0000024c211e1248.904209509.run()
  lineNumber = -1
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = java.util.concurrent.Executors$RunnableAdapter.call()
  lineNumber = 572
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = java.util.concurrent.FutureTask.run()
  lineNumber = 317
  bytecodeIndex = 39
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread.runWith(Object, Runnable)
  lineNumber = 341
  bytecodeIndex = 5
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread.run(Runnable)
  lineNumber = 311
  bytecodeIndex = 63
  type = "Interpreted"
}

{
  method = java.lang.VirtualThread$VThreadContinuation$1.run()
  lineNumber = 192
  bytecodeIndex = 8
  type = "Interpreted"
}

{
  method = jdk.internal.vm.Continuation.enter0()
  lineNumber = 320
  bytecodeIndex = 4
  type = "Interpreted"
}

{
  method = jdk.internal.vm.Continuation.enter(Continuation, boolean)
  lineNumber = 312
  bytecodeIndex = 1
  type = "Interpreted"
}

{
  method = jdk.internal.vm.Continuation.enterSpecial(Continuation, boolean, boolean)
  lineNumber = -1
  bytecodeIndex = 0
  type = "Native"
}

from quarkus.

franz1981 commented on August 15, 2024

It looks to me as a reentrancy case (because I expect no other threads are there holding the write lock) likely due to a write lock already held and a read lock unable to be acquired (hence leading to park).

from quarkus.

franz1981 commented on August 15, 2024

I've tried to reproduce with this on 3.11.0 on an hello world endpoint:

@Path("/hello")
public class GreetingResource {

    private static class Fun {

    }


    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello() {
        var cl = Thread.currentThread()
                .getContextClassLoader();
        System.out.println(cl.getClass());
        try(var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 2; i++) {
                exec.submit(() -> {
                    try {
                        cl.loadClass("profiling.workshop.greeting.GreetingResource$Fun");
                        System.out.println("Loaded");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        return "hello";
    }
}

Java(TM) SE Runtime Environment (build 21+35-LTS-2513)
Java HotSpot(TM) 64-Bit Server VM (build 21+35-LTS-2513, mixed mode, sharing)

without much luck.
I've used

curl http://localhost:8080/hello

few times, both running the quarkus application with -Djdk.tracePinnedThreads=short and without it, not getting the same outcome of this issue (deadlock or any log re pinned threads).

@imperatorx Can you share the full reproducer? To check if is any different from what I've done...?
Did you produce an high load of requests instead?

from quarkus.

franz1981 commented on August 15, 2024

aha, found why: the classloading with inner classes is not the same as any top level application classes .-. dumb me :D

Said that, by switching to

@Path("/hello")
public class GreetingResource {

    @GET
    @Produces(MediaType.TEXT_PLAIN)
    public String hello() {
        var cl = Thread.currentThread()
                .getContextClassLoader();
        System.out.println(cl.getClass());
        try(var exec = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < 2; i++) {
                exec.submit(() -> {
                    try {
                        cl.loadClass("profiling.workshop.greeting.Fun");
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                });
            }
        }
        return "hello";
    }

}

or increasing the iteration count to 10_000 still didn't worked to cause the deadlock.

from quarkus.

imperatorx commented on August 15, 2024

The deadlock happens in the actual application with more complex stuff going on, the reproducer is just to reproduce that the io.quarkus.bootstrap.runner.RunnerClassLoader pins the carrier thread in 3.11 where it did not in 3.10. The thread.park is not a problem on itself, but there is a native frame or a static class initializer on the stack somewhere (not visible in pin trace though) that is causing the pinning of the carrier thread.

from quarkus.

franz1981 commented on August 15, 2024

the reproducer is just to reproduce that the io.quarkus.bootstrap.runner.RunnerClassLoader pins the carrier thread

Got it: so @dmlloyd @Sanne @geoand from what I can see here:

if multiple (virtual threads) call RunnerClassLoader::loadClass and findLoadedClass return null, they all compete against the same JarResource::readLock (which internally can spin due to the contention)
Right after releasing the read lock at JarResource::readLock they all acquire the write lock at JarResource::ensureJarFileIsOpen - which again means parking any other threads AND making them to spin

I'm not quite sure about the deadlock, here, but that's why I would like a reproducer that at least make the deadlock to happen, because, under heavy contention lock/unlock can cause some form of monopolization, for sure, due to spinning.

@imperatorx The reproducer you placed there is able to produce the stack trace due to pinning you posted? Because that's what I cannot reproduce...

from quarkus.

mariofusco commented on August 15, 2024

@franz1981 @dmlloyd @geoand I'm available to further investigate this and I will start giving a look at it immediately. If I understand correctly the pinning is caused (or at least put in evidence) by something that changed on our classloader between version 3.10.2 and 3.11.0. Do you already have any idea of which could be the problematic commit or change?

from quarkus.

geoand commented on August 15, 2024

@franz1981 yeah that does sound reasonable to me.

@mariofusco cool, thanks! I am assuming that the commit that introduced the issue is 2710238

from quarkus.

dmlloyd commented on August 15, 2024

The issue was always there, the commit just changes the timing of it for some reason.

from quarkus.

geoand commented on August 15, 2024

Right, there should be nothing wrong with the ClassLoading algorithm itself

from quarkus.

franz1981 commented on August 15, 2024

The issue was always there, the commit just changes the timing of it for some reason.

Maybe, we made load jar file + defineClass not atomic anymore (as a single atomic op) because the changes in the cl doesn't make load class a "fully locked op" from my understanding

from quarkus.

mariofusco commented on August 15, 2024

Not sure if I'm looking at a red herring, but for now the only thing that I can confirm is that avoiding to configure the RunnerClassLoader with registerAsParallelCapable() prevents the pinning. Of course I'm still digging into this trying to completely understand the root cause of the problem, but wouldn't it be safer to send a PR that, at least temporarily, simply removes that statement?

from quarkus.

geoand commented on August 15, 2024

We can certainly have that as a backup option

from quarkus.

dmlloyd commented on August 15, 2024

How much memory footprint are we actually talking about? Does it matter in practice?

from quarkus.

geoand commented on August 15, 2024

@Sanne might remember as he is the one that added the ability to close jars and keep a "live set" of them open

from quarkus.

dmlloyd commented on August 15, 2024

I seem to recall that we did notice a performance difference. Also there's the ever-present risk of ABA deadlocks in the event that there is any kind of circularity happening when locking is in use.

I think that we could definitely look at a different release strategy (time based for example). I don't think the tradeoff is between latency and size though. If properly designed, most class loading operations should have low latency and high throughput, while still being able to release unused resources. But it does need to be designed in such a way that deadlocks and virtual thread stalls (like this one) are not possible.

from quarkus.

geoand commented on August 15, 2024

Coming back to what I said a few days ago, wouldn't using directories instead of jars overcome all these issues?

from quarkus.

mariofusco commented on August 15, 2024

I will try to recap the current situation:

At the moment there is no evidence that we could have any relevant advantage adding parallel capabilities to this classloader.
Even though we could probably make it works, the proposed lock-free implementation necessary for virtual threads compatibility, is quite hard to understand and validate ( ... and I'm the one who wrote it 😄 ) and could cause other maintainance problems in future.
We agree that there is a more straightforward solution consisting of getting rid of the jars and exploding them into directories.

If my premises are all correct what I suggest is:

Closing my pull request ... 3 wasted working days, but at least I learnt something and I hope that @Sanne will offer me a beer to recover from this when we will meet in London :)
Reverting also on main the commit that flagged the classloader as parallel capable like we already did on branch 3.11.
Closing this issue accordingly.
Open a feature request to implement the directories based solution suggested by @geoand (I'm available to work on it).

from quarkus.

geoand commented on August 15, 2024

We agree that there is a more straightforward solution consisting of getting rid of the jars and exploding them into directories.

I would still like to people to try and poke holes in this idea... I personally don't see why it would not work better, but others might :).

Also, this solution would definitely be more involved (in terms of how many things need to be changed) than fixes such as this one.

from quarkus.

mariofusco commented on August 15, 2024

We agree that there is a more straightforward solution consisting of getting rid of the jars and exploding them into directories.

I would still like to people to try and poke holes in this idea... I personally don't see why it would not work better, but others might :).

Also, this solution would definitely be more involved (in terms of how many things need to be changed) than fixes such as this one.

Yes, we need to further validate this idea and check the implications, but it is only the (optional) point 4. of the actions that I suggested and I think that 1. to 3. could be taken regardless of it.

I'm also not convinced that we need parallel classloading capabilities as these are "just" the application classes, and mostly loaded by a single bootstrap thread. Did that change actually manifest in tangible benefits?

@Sanne In reality this is for sure the main point of this whole discussion. The first thing that we need to address is figuring out if we need a parallel classloader or not. See premise 1. of my former comment, I had a quick chat with @franz1981 this morning about it and he thinks that, regardless the smallrye fix for which the parallel classloader has been introduced, there could be other relevant use cases where it could be necessary or at least a nice to have. Can you please help us to double check this? If @franz1981 is right and since @geoand's suggestion is a more long-term solution, the only alternative would be finalizing my pull request, making it as good as possible, and merging it.

from quarkus.

Sanne commented on August 15, 2024

Also there's the ever-present risk of ABA deadlocks in the event that there is any kind of circularity happening when locking is in use.

@dmlloyd this concerns me and I believe you mentioned this in the past as well - I'm not aware of us getting in potential deadlocks with the existing code but that's more likely due to not having understood the scenario you have in mind.

You think you could describe a potential scenario in such a way that someone here might try to create some tests for it?

FWIW when I last reviewed the design (and it's been a while - I have not been able to look carefully at the current state) we only had reentrant locks protecting resource state changes, locks would not be held for longer than such state transitions and (I believe) not held during operations that could trigger loading of other classes - at least not from a different thread, which would have been necessary to get in trouble with the reentrant locks.

from quarkus.

vsevel commented on August 15, 2024

the app gets blocked after we see the Thread[#66,ForkJoinPool-1-worker-2,5,CarrierThreads] java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:183)
we have 53 threads saying: waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
the liveness probe does not answer.
we see timeouts on kafka consumer poll.
some examples of thread being blocked:

"micrometer-kafka-metrics@16921" tid=0x64 nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
              blocks micrometer-kafka-metrics@16901
              blocks micrometer-kafka-metrics@16912
              blocks micrometer-kafka-metrics@16918
              blocks micrometer-kafka-metrics@16924
              blocks micrometer-kafka-metrics@16927
              blocks micrometer-kafka-metrics@16930
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
                at io.micrometer.core.instrument.Tags$ArrayIterator.<init>(Tags.java:131)
                at io.micrometer.core.instrument.Tags.iterator(Tags.java:128)
                at io.micrometer.core.instrument.Tags.of(Tags.java:224)
...

"smallrye-kafka-consumer-thread-2@16897" tid=0x4b nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
                at io.smallrye.mutiny.groups.UniOnItem.delayIt(UniOnItem.java:188)
                at io.smallrye.mutiny.helpers.ExponentialBackoff.lambda$randomExponentialBackoffFunction$0(ExponentialBackoff.java:48)
                at io.smallrye.mutiny.helpers.ExponentialBackoff$$Lambda/0x00007fd03090fa88.apply(Unknown Source:-1)
                at io.smallrye.context.impl.wrappers.SlowContextualFunction.apply(SlowContextualFunction.java:21)
                at io.smallrye.mutiny.groups.MultiOnItem.lambda$transformToUni$6(MultiOnItem.java:268)

"executor-thread-1@16867" tid=0x1f nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
                at com.x.HealthCheckService_Bean.proxy(Unknown Source:-1)
                at com.x.HealthCheckService_Bean.get(Unknown Source:-1)
                at com.x.HealthCheckService_Bean.get(Unknown Source:-1)
                at io.quarkus.arc.impl.ArcContainerImpl.beanInstanceHandle(ArcContainerImpl.java:554)
                at io.quarkus.arc.impl.ArcContainerImpl.beanInstanceHandle(ArcContainerImpl.java:534)
                at io.quarkus.arc.impl.ArcContainerImpl.beanInstanceHandle(ArcContainerImpl.java:567)
                at io.quarkus.arc.impl.ArcContainerImpl.instance(ArcContainerImpl.java:339)
                at x.HealthCheckService_ScheduledInvoker_autoCheckAll_01fef9f6dc57a56c94b1934cb946de439403b427.invokeBean(Unknown Source:-1)

we have to wait for the liveness to kill the pod. and processing can restart. that is harsh.

from quarkus.

mariofusco commented on August 15, 2024

the app gets blocked after we see the Thread[#66,ForkJoinPool-1-worker-2,5,CarrierThreads] java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:183) we have 53 threads saying: waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader) the liveness probe does not answer. we see timeouts on kafka consumer poll.

This seems to be definitively a bug of the parallel RunnerClassLoader implementation and indeed we (temporaraly?) reverted it to the original sequential version. What at the moment isn't clear to me is why this is happening only with virtual threads, while if there is a potential deadlock condition this should probably happen regardless of the kind of threads. The only (admittedly partial and not strong enough) explanation that I can see at the moment is that with virtual threads we have a higher degree of parallelism thus making the problem more evident and easier to reproduce.

About the reproducer it would be great if you could provide one, or at least give us some more hints on how to create one on our own, for instance providing the full stack traces of the Quarkus application when the deadlock is detected. I will try to write a proper unit test for the parallel RunnerClassLoader anyway, otherwise any potential fix would be a blind one.

As a final note, these days I'm working on an alternative lock-free implementation of the parallel RunnerClassLoader. @vsevel it would be great if you could have a chance to try that pull request in your environment and tell us if this solves the problem, making the deadlocks to disappear or not.

from quarkus.

vsevel commented on August 15, 2024

it would be great if you could provide one

not likely. when it happens it always is close to startup (10 seconds after the app has started). the app consumes messages at a decent rate (25 to 50 messages per second) with virtual thread max-concurrency: 5.

full stack traces of the Quarkus application when the deadlock is detected.

a thread dump you mean?

it would be great if you could have a chance to try that pull request in your environment and tell us if this solves the problem, making the deadlocks to disappear or not.

sure. no problem.

from quarkus.

mariofusco commented on August 15, 2024

full stack traces of the Quarkus application when the deadlock is detected.

a thread dump you mean?

Yes sorry, I meant a complete thread dump of the application when the deadlock happens.

from quarkus.

vsevel commented on August 15, 2024

threads.txt

from quarkus.

mariofusco commented on August 15, 2024

Thanks a lot for that thread dump. I must admit that it looks quite weird though and I'm still trying to make sense of it. What I mean is that I see 53 threads saying

  java.lang.Thread.State: BLOCKED
              waiting for quarkus-virtual-thread-17@16977 to release lock on <0x4265> (a io.quarkus.bootstrap.runner.RunnerClassLoader)

but at the same time neither I see any track of that virtual thread quarkus-virtual-thread-17@16977 in the whole dump, nor there is any other thread, virtual or not, with a stack saying something like locked <0x4265>, so I still have no clue of what is holding that lock and why it is not releasing it.

@vsevel for now I will continue my investigations, mostly trying to reproduce the same problem with a more isolated test case. Please let me know when you'll have a chance to give a try to my pull request and how it works for you.

from quarkus.

vsevel commented on August 15, 2024

but at the same time neither I see any track of that virtual thread quarkus-virtual-thread-17@16977 in the whole dump, nor there is any other thread, virtual or not, with a stack saying something like locked <0x4265>, so I still have no clue of what is holding that lock and why it is not releasing it.

same for me. I could not find the VT holding the lock. we created the dump by connecting with the remote debugger from intellij into our ocp cluster using port forward, and executing Run > Debubgging Actions > Get Thread Dump. I do not know if this dump has limited visibility.

Please let me know when you'll have a chance to give a try to my pull request and how it works for you.

where is it?

FYI I also created #41313
this is not related to class loading, but I need help investigating this

from quarkus.

mariofusco commented on August 15, 2024

Please let me know when you'll have a chance to give a try to my pull request and how it works for you.

where is it?

#40942

FYI I also created #41313 this is not related to class loading, but I need help investigating this

I will give a look asap, but I would like to make sense of this issue first.

from quarkus.

franz1981 commented on August 15, 2024

Sadly the thread dump analysis could be unfruitful due to how RW locks work, see https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/java/util/concurrent/locks/ReentrantReadWriteLock.java#L336

It stores the information of first uncontended read locker Thread, but just for performance reasons, and there is no concept of "read ownership" because a read lock has no exclusive ownership, while the write one, yep.
This means that a thread dump cannot print read lock owners which prevent writers or others to be unblocked.
(Let's remember that no write lock can be acquired while a read one is already held).
I believe that's the deadlock happening here:

a read lock acquired
a reentrant class loading on recursive define class
a subsequent write lock, which cause a deadlock with itself, but invisible to thread dumps

from quarkus.

mariofusco commented on August 15, 2024

Please let me know when you'll have a chance to give a try to my pull request and how it works for you.

where is it?

#40942

@vsevel Did you have a chance to give a try to my pull request? I'm still trying to reproduce the problem in isolation, but still no luck so far :(

from quarkus.

vsevel commented on August 15, 2024

Did you have a chance to give a try to my pull request? I'm still trying to reproduce the problem in isolation, but still no luck so far :(

I will spend some time these next 2 days

from quarkus.

franz1981 commented on August 15, 2024

FYI https://mail.openjdk.org/pipermail/loom-dev/2024-June/006811.html

from quarkus.

geoand commented on August 15, 2024

Thanks for that link!

but class loading and initialization is something that needs to be re-examined next

I assume Alan means from the JDK folks?

from quarkus.

franz1981 commented on August 15, 2024

After a call with @mariofusco we found how to trigger it reliably: there's no need (i.e. it is not seen as pinning by the JVM) to call <ClassLoader>::loadClass but the real reason why pinning happen is if we have native stack frames in the carrier i.e. Class::forName or just NameOfTheClass.class.getName()

from quarkus.

geoand commented on August 15, 2024

That's a very interesting find...

from quarkus.

franz1981 commented on August 15, 2024

Now the funny part is the deadlock, but actually If waiting on a lock pinn the carrier - be just need to make some recursive class definition to enter in the locks in the right other and exhausting the number of carriers and it's done :P

from quarkus.

vsevel commented on August 15, 2024

we were able to spend some time on this issue.
we confirm @mariofusco that with your branch we have not been able to reproduce the deadlock across 8 runs. but we still see the pinning issue when going through RunnerClassLoader.
I understand it is more of an issue inside the open jdk. correct?
just to make sure, we reverted the image to the one built with 3.8, and on the first run we got some pinning + deadlock.

from quarkus.

mariofusco commented on August 15, 2024

we were able to spend some time on this issue. we confirm @mariofusco that with your branch we have not been able to reproduce the deadlock across 8 runs.

This is a great news. Thanks for giving it a try.

but we still see the pinning issue when going through RunnerClassLoader. I understand it is more of an issue inside the open jdk. correct?

Yes, this is not completely related even though still unexpected. Can you please paste the stack trace of the pinning using my classloader?

just to make sure, we reverted the image to the one built with 3.8, and on the first run we got some pinning + deadlock.

This is also unexpected to be honest, or probably I'm missing something. If you reverted to 3.8 it means that you're using the RunnerClassLoader when it still was sequential, or in other words far before it was registered as parallel capable in version 3.11.0. Do you mean that also the sequential version of the old classloader can suffer of deadlock? If so the revert that we did recently to that sequential version is practically useless, can you please double check?

from quarkus.

vsevel commented on August 15, 2024

Can you please paste the stack trace of the pinning using my classloader?

I believe this is this issue #41314

can you please double check?

let me check with the dev team.

from quarkus.

mariofusco commented on August 15, 2024

Can you please paste the stack trace of the pinning using my classloader?

I believe this is this issue #41314

Ok, this makes sense now: it is a totally different pinning problem and completely unrelated with the classloader.

from quarkus.

mariofusco commented on August 15, 2024

I finally recreated in isolation the virtual thread pinning condition originally reported here.

Given all that have been discussed so far and the feedback provided by @vsevel I don't see any reason to further hold this pull request removing the read/write lock from the RunnerClassLoader.

from quarkus.

imperatorx commented on August 15, 2024

If so the revert that we did recently to that sequential version is practically useless

I can confirm it is not useless, it fixed the original issue.

from quarkus.

vsevel commented on August 15, 2024

we tested it again and were able to reproduce it:

2024-06-26 09:28:18,231 INFO [io.quarkus] (main) {} x 1.1.6-SNAPSHOT on JVM (powered by Quarkus 3.8.4) started in 6.142s. Listening on: http://0.0.0.0:8080 ...
...
Thread[#80,ForkJoinPool-1-worker-2,5,CarrierThreads]
    java.base/java.lang.VirtualThread$VThreadContinuation.onPinned(VirtualThread.java:183)
    java.base/jdk.internal.vm.Continuation.onPinned0(Continuation.java:393)
    java.base/java.lang.VirtualThread.park(VirtualThread.java:582)
    java.base/java.lang.System$2.parkVirtualThread(System.java:2643)
    java.base/jdk.internal.misc.VirtualThreads.park(VirtualThreads.java:54)
    java.base/java.util.concurrent.locks.LockSupport.park(LockSupport.java:219)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:754)
    java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1079)
    java.base/java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:738)
    io.quarkus.bootstrap.runner.JarResource.readLockAcquireAndGetJarReference(JarResource.java:143)
    io.quarkus.bootstrap.runner.JarResource.getResourceData(JarResource.java:72)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:99)
    io.quarkus.bootstrap.runner.RunnerClassLoader.loadClass(RunnerClassLoader.java:65)
    io.quarkus.cache.runtime.CacheInterceptor.getArcCacheInterceptionContext(CacheInterceptor.java:82)
    io.quarkus.cache.runtime.CacheInterceptor.getInterceptionContext(CacheInterceptor.java:61)
    io.quarkus.cache.runtime.CacheResultInterceptor.intercept(CacheResultInterceptor.java:43)
    io.quarkus.cache.runtime.CacheResultInterceptor_Bean.intercept(Unknown Source)
    io.quarkus.arc.impl.InterceptorInvocation.invoke(InterceptorInvocation.java:42)
    io.quarkus.arc.impl.AroundInvokeInvocationContext.perform(AroundInvokeInvocationContext.java:30)
...

thread dump shows multiple stacks (22 threads blocked) such as:

"vert.x-eventloop-thread-0@16949" tid=0x1a nid=NA waiting for monitor entry
  java.lang.Thread.State: BLOCKED
	 waiting for quarkus-virtual-thread-16@17044 to release lock on <0x4293> (a io.quarkus.bootstrap.runner.RunnerClassLoader)
	  at io.vertx.core.net.impl.pool.SimpleConnectionPool$Evict.execute(SimpleConnectionPool.java:537)
	  at io.vertx.core.net.impl.pool.SimpleConnectionPool$Evict.execute(SimpleConnectionPool.java:500)
	  at io.vertx.core.net.impl.pool.CombinerExecutor.submit(CombinerExecutor.java:67)
	  at io.vertx.core.net.impl.pool.SimpleConnectionPool.execute(SimpleConnectionPool.java:244)
	  at io.vertx.core.net.impl.pool.SimpleConnectionPool.evict(SimpleConnectionPool.java:549)
	  at io.vertx.core.http.impl.SharedClientHttpStreamEndpoint.checkExpired(SharedClientHttpStreamEndpoint.java:115)
	  at io.vertx.core.http.impl.HttpClientImpl.lambda$static$1(HttpClientImpl.java:106)
	  at io.vertx.core.http.impl.HttpClientImpl$$Lambda/0x00007fd358434700.accept(Unknown Source:-1)
	  at java.util.concurrent.ConcurrentHashMap$ValuesView.forEach(ConcurrentHashMap.java:4783)
	  at io.vertx.core.net.impl.pool.ConnectionManager.forEach(ConnectionManager.java:33)
	  at io.vertx.core.http.impl.HttpClientImpl.checkExpired(HttpClientImpl.java:150)
	  at io.vertx.core.http.impl.HttpClientImpl.access$100(HttpClientImpl.java:48)
	  at io.vertx.core.http.impl.HttpClientImpl$PoolChecker.handle(HttpClientImpl.java:144)
	  at io.vertx.core.http.impl.HttpClientImpl$PoolChecker.handle(HttpClientImpl.java:132)
	  at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:1056)
	  at io.vertx.core.impl.VertxImpl$InternalTimerHandler.handle(VertxImpl.java:1027)
	  at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:328)
	  at io.vertx.core.impl.ContextImpl.emit(ContextImpl.java:321)
	  at io.vertx.core.impl.ContextInternal.emit(ContextInternal.java:209)
	  at io.vertx.core.impl.VertxImpl$InternalTimerHandler.run(VertxImpl.java:1045)
	  at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	  at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:153)
	  at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
	  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
	  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:470)
	  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:569)
	  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
	  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	  at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	  at java.lang.Thread.runWith(Thread.java:1596)
	  at java.lang.Thread.run(Thread.java:1583)

but I can't see quarkus-virtual-thread-16@17044 anywhere.

from quarkus.

vsevel commented on August 15, 2024

we did some additional tests with your branch, hoping to capture a full thread dump when detecting the pinning. unfortunately (for this test) we did not reproduce the pinning with your fix.

from quarkus.

3.11.0 causes carrier thread pinning regression (compared to 3.10.2) while loading classes about quarkus HOT 75 OPEN

Comments (75)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent