
Comments (17)

rrnewton avatar rrnewton commented on May 28, 2024

Actually, I have a related question on this same code that I might as well ask here. The thread that puts the MVar after GPU results are available is forked with forkIO here:

https://github.com/AccelerateHS/accelerate-cuda/blob/2320705ee4e1bda2565117810ef6b4e5e702ca48/Data/Array/Accelerate/CUDA.hs#L224

But doesn't that need to be a forkOS so that the GHC capability isn't stalled by the blocking CUDA call, interfering with other IO threads?

I think Accelerate is not passing cuCtxCreate any flags right now, correct? Which means we would get the default blocking/spinning behavior, CU_CTX_SCHED_AUTO:

http://developer.download.nvidia.com/compute/cuda/4_2/rel/toolkit/docs/online/group__CUDA__CTX_g65dc0012348bc84810e2103a40d8e2cf.html#g65dc0012348bc84810e2103a40d8e2cf

Which in practice means spinning in most cases. However, I think that spinning in a foreign function is just as bad as blocking wrt the GHC RTS, right?

Personally, I was hoping we could make CU_CTX_SCHED_BLOCKING_SYNC the Accelerate default so as to be gentle on CPU resources. It's a bit solipsistic for NVidia to make spinning the default -- waste power and screw up whatever the CPU is trying to do because, obviously, the GPU computation is the only thing you should care about ;-).

from accelerate.

simonmar avatar simonmar commented on May 28, 2024

forkIO vs. forkOS makes no difference to blocking, it only affects which OS thread the foreign call is made in. As long as the foreign call is marked "safe", it won't block other Haskell threads (provided you use -threaded).
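For concreteness, the distinction Simon describes is only the safe/unsafe annotation on the foreign import declaration. A sketch (cuCtxSynchronize is a real driver entry point, but these bindings are illustrative, not the cuda package's actual declarations):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CInt)

-- "safe": the RTS releases the capability before entering C, so other
-- Haskell threads keep running even if the call blocks for a long time.
foreign import ccall safe "cuCtxSynchronize"
  cuCtxSynchronizeSafe :: IO CInt

-- "unsafe": cheaper to make, but the capability is held for the whole
-- call; a long-blocking call here stalls the Haskell threads scheduled
-- on that capability.
foreign import ccall unsafe "cuCtxSynchronize"
  cuCtxSynchronizeUnsafe :: IO CInt
```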

rrnewton avatar rrnewton commented on May 28, 2024

Ok, just to make sure I'm following --

If we compile with -threaded, and run with +RTS -N1, and then forkIO ten threads, one of which does a blocking CUDA call which blocks the hosting pthread for, say, a week, the other nine IO threads will have a chance to run in the intervening week, right?
A helpful passage appears here:

"although during the course of running the program more OS threads might be 
 created in order to continue running Haskell code while foreign calls execute"

Ah, so extra OS threads are forked on demand! Very nice. But from that Wiki I don't yet understand when these are forked. It would seem every "safe" foreign call from an IO thread can result in an OS thread being created? For example if we forkIO 10K threads, and do 10K foreign calls, we can end up with 10K OS threads, irrespective of +RTS -N, right?

Is it fair to say that blocking foreign calls should be marked as "safe"? This wiki makes it sound as if safe/unsafe is just a question of whether the foreign function calls back into Haskell. But if, again, a foreign call blocks for a week on CUDA, even if it doesn't call back into Haskell, we want it on its own OS thread...

It looks like "waitForReturnCapability" is how returning safe FFI calls get back in:
https://github.com/ghc/ghc/blob/1dbe6d59b621ab9bd836241d633b3a8d99812cb3/rts/Capability.c#L579
I didn't see where OS threads get forked. Is that in the compiler-generated code sequence for a safe foreign call?
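To make the behaviour concrete, here is a small, self-contained experiment that uses POSIX sleep as a stand-in for a blocking CUDA call (compile with ghc -threaded, run with +RTS -N1):

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.MVar
  (modifyMVar_, newEmptyMVar, newMVar, putMVar, readMVar, takeMVar)
import Foreign.C.Types (CUInt (..))

-- A real POSIX function, imported "safe" so the capability is released
-- while the calling OS thread blocks inside C.
foreign import ccall safe "unistd.h sleep"
  c_sleep :: CUInt -> IO CUInt

main :: IO ()
main = do
  done  <- newEmptyMVar
  count <- newMVar (0 :: Int)
  -- This thread blocks its hosting OS thread in C for two seconds...
  _ <- forkIO (c_sleep 2 >> putMVar done ())
  -- ...but with a "safe" import the other threads still get to run in
  -- the meantime, even on a single capability.
  mapM_ (\_ -> forkIO (modifyMVar_ count (return . (+ 1)))) [1 .. 9 :: Int]
  threadDelay 100000   -- give the workers a chance to run
  n <- readMVar count
  putStrLn ("workers that ran while the C call blocked: " ++ show n)
  takeMVar done
```

Switching the import to "unsafe" in the same program (on one capability) can stall the workers behind the sleep, which is exactly the behaviour being worried about here.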

rrnewton avatar rrnewton commented on May 28, 2024

P.S. It looks like 100% of the foreign decls in the cuda package are marked as "unsafe".

This would need to be changed in a couple places to get the behavior Simon describes, right?

tmcdonell avatar tmcdonell commented on May 28, 2024

@simonmar okay, thanks! I have added some notes to the documentation as to why this happens. 'seq' is definitely one way to avoid it.

tmcdonell avatar tmcdonell commented on May 28, 2024

@rrnewton the default context does not pass any context creation flags, so yes, it would just pick up the CU_CTX_SCHED_AUTO spinning behaviour. Changing to CU_CTX_SCHED_BLOCKING_SYNC would be good, although I think some explicit synchronisation points will need to be added in the execute phase --- I'm pretty sure I know how to do that now, however.
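As a sketch of what the flag change might look like with the cuda package (the function and flag names here are from my reading of Foreign.CUDA.Driver and may differ between versions):

```haskell
import Foreign.CUDA.Driver         (device, initialise)
import Foreign.CUDA.Driver.Context (Context, ContextFlag (SchedBlockingSync), create)

-- Create the context with CU_CTX_SCHED_BLOCKING_SYNC instead of the
-- CU_CTX_SCHED_AUTO default, so a host thread waiting on the GPU
-- sleeps rather than spins.
makeBlockingContext :: Int -> IO Context
makeBlockingContext n = do
  initialise []
  dev <- device n
  create dev [SchedBlockingSync]
```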

From memory, 100% of the foreign calls in the cuda package are marked as "unsafe". The CUDA documentation is a bit confusing; I may have been under the impression that functions such as launchKernel and (more obviously) memcpyAsync always return immediately, so then it wouldn't matter if unsafe foreign calls block. But maybe that is only valid for the runtime and not driver API, since we can give these different behaviour flags to context creation. I'm no longer sure.

I've no objection to changing the foreign calls to "safe".

simonmar avatar simonmar commented on May 28, 2024

On 16/05/2012 03:28, Ryan Newton wrote:

Ok, just to make sure I'm following --

If we compile with -threaded, and run with +RTS -N1, and then forkIO
ten threads, one of which does a blocking CUDA call which blocks the
hosting pthread for, say, a week, the other nine IO threads will
have a chance to run in the intervening week, right?

Correct. The best docs for this are in the GHC users guide:

http://www.haskell.org/ghc/docs/latest/html/users_guide/ffi-ghc.html#ffi-threads

Ah, so extra OS threads are forked on demand! Very nice. But from
that Wiki I don't yet understand when these are forked. It would
seem every "safe" foreign call from an IO thread can result in an OS
thread being created? For example if we forkIO 10K threads, and do
10K foreign calls, we can end up with 10K OS threads, irrespective of
+RTS -N, right?

Also correct. Every blocked FFI call needs a separate OS thread, and we
also need at least one OS thread per capability in the RTS.

More background here:
http://community.haskell.org/~simonmar/papers/conc-ffi.pdf

This is why we have the IO manager: it handles the majority of blocking
I/O with a single OS thread.

Is it fair to say that blocking foreign calls should be marked as
"safe"?

Absolutely.

It looks like "waitForReturnCapability" is for returning safe FFI
calls to get back in:
https://github.com/ghc/ghc/blob/1dbe6d59b621ab9bd836241d633b3a8d99812cb3/rts/Capability.c#L579

I didn't see where OS threads get forked.. is that in the compiler
generated code sequence for a safe foreign call?

rts/Schedule.c:suspendThread() is called by the Haskell code right
before a safe foreign call, it calls releaseCapability_() which spawns a
new OS thread if necessary.

Cheers,
Simon

simonmar avatar simonmar commented on May 28, 2024

On 16/05/2012 04:23, Trevor L. McDonell wrote:

I've no objection to changing the foreign calls to "safe".

Best would be to just mark the long-running ones as safe, leave the
short-running ones as unsafe, since there is quite a high overhead for a
safe call.

Cheers,
Simon

rrnewton avatar rrnewton commented on May 28, 2024

I'm finding it difficult to dig up information on exactly how the cuda driver interacts with the OS (e.g. How does it block/wake pthreads?).

http://forums.nvidia.com/index.php?showtopic=177417

tmcdonell avatar tmcdonell commented on May 28, 2024

@rrnewton notes:

For reference, here's a link to the new bit of doc:

https://github.com/AccelerateHS/accelerate-cuda/blob/4575f474fa85d15bf48a88a4abb59ec62992d1cf/Data/Array/Accelerate/CUDA.hs#L85

Would it be a solution to walk the AST (during or before convertAcc) and force all the input arrays, before doing anything with the context?

Wouldn't that be a built-in equivalent of the seq Simon uses above? The Accelerate computation is strict in all the arrays that get used, so I don't think this approach would force anything that isn't due to be forced anyway.

Shall we keep this issue open until a fix for the underlying issue is found? As Simon pointed out, behavior should not differ from the interpreter.

If we make the right things strict in the library, the user shouldn't have to deal with adding seq themselves.
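A minimal sketch of the seq workaround at the user level (run and use are the Accelerate/accelerate-cuda entry points; acc1 and f are hypothetical stand-ins for the two dependent computations):

```haskell
-- Force the first result before the dependent computation tries to
-- take the MVar-protected context, avoiding the deadlock described above.
let arr1 = run acc1            -- lazy: nothing has touched the context yet
    arr2 = run (f (use arr1))  -- depends on arr1
in  arr1 `seq` arr2            -- evaluate arr1 first
```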

rrnewton avatar rrnewton commented on May 28, 2024

This was an issue that cropped up in spite of single threaded use of the API.

What are the issues with multi-threaded use of the API? Won't two threads calling "run" both grab the default context (the same context), leading to the same issue? Or is there something I'm forgetting?

tmcdonell avatar tmcdonell commented on May 28, 2024

If two threads call run with independent computations, then yes, they will be serialised when they try to grab the default context. But that would have happened anyway because the GPU can only process work from a single thread at a time (although it looks like the GK110, released yesterday, lifts this restriction).

So if you have a multi-threaded application, you probably want to use the runIn form, and have each thread supply its own context that refers to a distinct device (possible lurking bug: do driver contexts hold thread-local state?).

The issue here was that the computations were dependent, and the second had already taken the context before realising the first was still to be evaluated --- deadlock.

It might be enough to make use in the front-end library strict in its argument. But then again I'm not certain the context needs an MVar around it anyway.

Aside: it always calls forkIO on the computation, otherwise I found the finaliser threads, which free device memory, didn't get a chance to run.
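A hedged sketch of that multi-threaded pattern, one context per thread via runIn. The context-creation names here are assumptions from my reading of the cuda and accelerate-cuda packages and may not match the real API exactly:

```haskell
import Control.Exception           (evaluate)
import Data.Array.Accelerate       (Acc, Vector)
import Data.Array.Accelerate.CUDA  (runIn)
import Foreign.CUDA.Driver         (device)
import Foreign.CUDA.Driver.Context (create)

-- Each host thread gets its own context on a distinct device, so the
-- threads never contend for the default context.
worker :: Int -> Acc (Vector Float) -> IO (Vector Float)
worker devNo acc = do
  dev <- device devNo
  ctx <- create dev []        -- per-thread context on device devNo
  evaluate (runIn ctx acc)    -- force the result on this context
```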

mchakravarty avatar mchakravarty commented on May 28, 2024

@tmcdonell Yes, why have an MVar around the context? Without the MVar, all this wouldn't be an issue, or would it?

tmcdonell avatar tmcdonell commented on May 28, 2024

I mangled the commit tag, but here it is: AccelerateHS/accelerate-cuda@9f019ce

Thinking about why I did this, it occurs to me that what I really wanted was exclusive access to the device to avoid context switching (I recall trying this once and it was much slower than executing the two programs sequentially), but actually using a single context from different points is fine.

mchakravarty avatar mchakravarty commented on May 28, 2024

Ok, great, but can't we close the issue in this case? Or is there an outstanding problem?

tmcdonell avatar tmcdonell commented on May 28, 2024

The 'thread blocked' issue is resolved. I've created a new issue w.r.t. what Ryan mentions above regarding context synchronisation behaviour.

mchakravarty avatar mchakravarty commented on May 28, 2024

Thanks.
