Comments (17)
Actually, I have a related question about this same code that I might as well ask here. The thread that puts the MVar after GPU results are available is forked with forkIO here:
But doesn't that need to be a forkOS so that the GHC capability isn't stalled by the blocking CUDA call, interfering with other IO threads?
I think Accelerate is not passing cuCtxCreate any flags right now, correct? Which means we would get the default blocking/spinning behavior, CU_CTX_SCHED_AUTO:
Which in practice means spinning in most cases. However, I think that spinning in a foreign function is just as bad as blocking wrt the GHC RTS, right?
Personally, I was hoping we could make CU_CTX_SCHED_BLOCKING_SYNC the Accelerate default so as to be gentle on CPU resources. It's a bit solipsistic for NVidia to make spinning the default -- waste power and screw up whatever the CPU is trying to do because, obviously, the GPU computation is the only thing you should care about ;-).
from accelerate.
forkIO vs. forkOS makes no difference to blocking, it only affects which OS thread the foreign call is made in. As long as the foreign call is marked "safe", it won't block other Haskell threads (provided you use -threaded).
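To make the safe/unsafe distinction concrete, here is a minimal sketch. It binds two libc functions purely for illustration (neither is part of Accelerate or the cuda package): the annotation on the `foreign import` is the only thing that decides whether the RTS releases the capability around the call.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.C.Types (CDouble, CUInt)

-- "unsafe": cheap, assumed never to block or call back into Haskell.
-- The RTS does NOT release the capability, so a long block here would
-- stall every Haskell thread on this capability.
foreign import ccall unsafe "math.h sin"
  c_sin :: CDouble -> CDouble

-- "safe": the RTS releases the capability around the call, so a long
-- block here (think cuCtxSynchronize) lets other Haskell threads keep
-- running, provided the program is linked with -threaded.
foreign import ccall safe "unistd.h sleep"
  c_sleep :: CUInt -> IO CUInt

main :: IO ()
main = do
  print (c_sin 0)   -- prints 0.0
  _ <- c_sleep 0    -- a potentially blocking call, marked safe
  putStrLn "done"
```

The trade-off is that a safe call has noticeably higher overhead than an unsafe one, which is why you would not want to mark every call safe indiscriminately.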
Ok, just to make sure I'm following --
If we compile with -threaded, and run with +RTS -N1, and then forkIO ten threads, one of which does a blocking CUDA call which blocks the hosting pthread for, say, a week, the other nine IO threads will have a chance to run in the intervening week, right?
A helpful passage appears here:
"although during the course of running the program more OS threads might be created in order to continue running Haskell code while foreign calls execute"
Ah, so extra OS threads are forked on demand! Very nice. But from that Wiki I don't yet understand when these are forked. It would seem every "safe" foreign call from an IO thread can result in an OS thread being created? For example if we forkIO 10K threads, and do 10K foreign calls, we can end up with 10K OS threads, irrespective of +RTS -N, right?
Is it fair to say that blocking foreign calls should be marked as "safe"? This wiki makes it sound as if safe/unsafe is just a question of whether the foreign function calls back into Haskell. But if, again, a foreign call blocks for a week on CUDA, even if it doesn't call back into Haskell, we want it on its own OS thread...
It looks like "waitForReturnCapability" is for returning safe FFI calls to get back in:
https://github.com/ghc/ghc/blob/1dbe6d59b621ab9bd836241d633b3a8d99812cb3/rts/Capability.c#L579
I didn't see where OS threads get forked.. is that in the compiler generated code sequence for a safe foreign call?
P.S. It looks like 100% of the foreign decls in the cuda package are marked as "unsafe". This would need to be changed in a couple of places to get the behavior Simon describes, right?
@simonmar okay, thanks! I have added some notes to the documentation as to why this happens. 'seq' is definitely one way to avoid it.
@rrnewton the default context does not pass any context creation flags, so yes, it would just pick up the CU_CTX_SCHED_AUTO spinning behaviour. Changing to CU_CTX_SCHED_BLOCKING_SYNC would be good, although I think some explicit synchronisation points will need to be added in the execute phase --- I'm pretty sure I know how to do that now, however.
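For reference, the scheduling flags in question are small bit flags from the driver's cuda.h. A pure sketch of what the switch would look like (the name `defaultContextFlags` is hypothetical, just standing in for whatever Accelerate would eventually pass to cuCtxCreate; the numeric values are the CU_CTX_SCHED_* constants from cuda.h):

```haskell
import Data.Word (Word32)

-- CUDA driver context scheduling flags, values as defined in cuda.h.
cuCtxSchedAuto, cuCtxSchedSpin, cuCtxSchedYield, cuCtxSchedBlockingSync :: Word32
cuCtxSchedAuto         = 0x00  -- driver picks spin vs yield heuristically
cuCtxSchedSpin         = 0x01  -- busy-wait on sync
cuCtxSchedYield        = 0x02  -- yield the CPU while waiting
cuCtxSchedBlockingSync = 0x04  -- park the calling thread on a sync primitive

-- Hypothetical: the flags Accelerate would pass at context creation if it
-- defaulted to blocking synchronisation instead of CU_CTX_SCHED_AUTO.
defaultContextFlags :: Word32
defaultContextFlags = cuCtxSchedBlockingSync

main :: IO ()
main = print defaultContextFlags   -- prints 4
```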
From memory, 100% of the foreign calls in the cuda package are marked as "unsafe". The CUDA documentation is a bit confusing; I may have been under the impression that functions such as launchKernel and (more obviously) memcpyAsync always return immediately, so then it wouldn't matter if unsafe foreign calls block. But maybe that is only valid for the runtime and not the driver API, since we can give these different behaviour flags to context creation. I'm no longer sure.
I've no objection to changing the foreign calls to "safe".
On 16/05/2012 03:28, Ryan Newton wrote:
Ok, just to make sure I'm following --
If we compile with -threaded, and run with +RTS -N1, and then forkIO ten threads, one of which does a blocking CUDA call which blocks the hosting pthread for, say, a week, the other nine IO threads will have a chance to run in the intervening week, right?
Correct. The best docs for this are in the GHC users guide:
http://www.haskell.org/ghc/docs/latest/html/users_guide/ffi-ghc.html#ffi-threads
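The scenario can be sketched in a few lines of pure Haskell. Note that threadDelay here is only a stand-in for the blocking safe foreign call (threadDelay itself never ties up an OS thread), so this illustrates the scheduling behaviour, not the FFI mechanics:

```haskell
-- Run with: ghc -threaded Demo.hs && ./Demo +RTS -N1
import Control.Concurrent (forkIO, threadDelay, newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM)

main :: IO ()
main = do
  dones <- replicateM 9 newEmptyMVar
  -- One thread "blocks" (stand-in for the week-long CUDA call)...
  _ <- forkIO (threadDelay 200000)
  -- ...while the other nine still get scheduled and complete.
  forM_ (zip [1 .. 9 :: Int] dones) $ \(i, d) ->
    forkIO (putMVar d i)
  results <- mapM takeMVar dones
  print (sum results)   -- prints 45: all nine ran
```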
Ah, so extra OS threads are forked on demand! Very nice. But from that Wiki I don't yet understand when these are forked. It would seem every "safe" foreign call from an IO thread can result in an OS thread being created? For example if we forkIO 10K threads, and do 10K foreign calls, we can end up with 10K OS threads, irrespective of +RTS -N, right?
Also correct. Every blocked FFI call needs a separate OS thread, and we also need at least one OS thread per capability in the RTS.
More background here:
http://community.haskell.org/~simonmar/papers/conc-ffi.pdf
This is why we have the IO manager: it handles the majority of blocking I/O with a single OS thread.
Is it fair to say that blocking foreign calls should be marked as "safe"?
Absolutely.
It looks like "waitForReturnCapability" is for returning safe FFI calls to get back in:
https://github.com/ghc/ghc/blob/1dbe6d59b621ab9bd836241d633b3a8d99812cb3/rts/Capability.c#L579
I didn't see where OS threads get forked.. is that in the compiler-generated code sequence for a safe foreign call?
rts/Schedule.c:suspendThread() is called by the Haskell code right before a safe foreign call; it calls releaseCapability_(), which spawns a new OS thread if necessary.
Cheers,
Simon
On 16/05/2012 04:23, Trevor L. McDonell wrote:
I've no objection to changing the foreign calls to "safe".
Best would be to just mark the long-running ones as safe and leave the short-running ones as unsafe, since there is quite a high overhead for a safe call.
Cheers,
Simon
I'm finding it difficult to dig up information on exactly how the cuda driver interacts with the OS (e.g. How does it block/wake pthreads?).
http://forums.nvidia.com/index.php?showtopic=177417
@rrnewton notes:
For reference, here's a link to the new bit of doc:
Would it be a solution to walk the AST (during or before convertAcc) and force all the input arrays, before doing anything with the context?
Wouldn't that be a built-in equivalent of the seq Simon uses above? The Accelerate computation is strict in all the arrays that get used, so I don't think this approach would force anything that isn't due to be forced anyway.
Shall we keep this issue open until a fix for the underlying issue is found? As Simon pointed out, behaviour should not differ from the interpreter. If we make the right things strict in the library, the user shouldn't have to deal with adding seq themselves.
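The seq-style fix can be sketched in miniature. Everything here is a stand-in (the MVar plays the role of the locked CUDA context; `runStrict` is a hypothetical strict `run`): the point is simply that the argument is forced before the critical section is entered, so evaluating it can never need to re-enter the lock.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, takeMVar, putMVar)
import Control.Exception (evaluate)

-- The seq idiom in miniature: force x before producing y.
forceFirst :: a -> b -> b
forceFirst x y = x `seq` y

-- Hypothetical strict "run": force the input to WHNF *before* taking the
-- single-slot context lock, so forcing it cannot deadlock on that lock.
runStrict :: MVar () -> a -> IO a
runStrict ctx x = do
  x' <- evaluate x   -- force outside the critical section
  takeMVar ctx       -- enter critical section (the "context")
  putMVar ctx ()     -- leave critical section
  return x'

main :: IO ()
main = do
  ctx <- newMVar ()
  r <- runStrict ctx (2 + 2 :: Int)
  print r   -- prints 4
```

If the lazy argument itself contained another `runStrict` on the same context, forcing it after takeMVar would be exactly the deadlock described above; forcing it first makes that impossible.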
This was an issue that cropped up in spite of single-threaded use of the API.
What are the issues with multi-threaded use of the API? Won't two threads calling "run" both grab the default context (the same context), leading to the same issue? Or is there something I'm forgetting?
If two threads call run with independent computations, then yes, they will be serialised when they try to grab the default context. But that would have happened anyway, because the GPU can only process work from a single thread at a time (although it looks like the GK110, released yesterday, lifts this restriction).
So if you have a multi-threaded application, you probably want to use the runIn form, and have each thread supply its own context that refers to a distinct device (possible lurking bug: driver contexts hold thread-local state?).
The issue here was that the computations were dependent, and the second had already taken the context before realising the first was still to be evaluated --- deadlock.
It might be enough to make use in the front-end library strict in its argument. But then again, I'm not certain the context needs an MVar around it anyway.
Aside: it always calls forkIO on the computation; otherwise I found the finaliser threads, which free device memory, didn't get a chance to run.
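A pure stand-in for the runIn pattern, to show why per-thread contexts avoid the serialisation: each thread holds its own lock, so neither waits on the other. (The contexts here are just MVars; in accelerate-cuda each would be a real driver context bound to a distinct device, and `runIn` below is only a hypothetical sketch of that shape.)

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (MVar, newMVar, withMVar, newEmptyMVar, putMVar, takeMVar)

-- Hypothetical per-context run: execute an action while holding one context.
runIn :: MVar () -> IO a -> IO a
runIn ctx act = withMVar ctx (const act)

main :: IO ()
main = do
  ctxA <- newMVar ()           -- thread A's private "context"
  ctxB <- newMVar ()           -- thread B's private "context"
  done <- newEmptyMVar
  _ <- forkIO (runIn ctxA (return (1 + 1 :: Int)) >>= putMVar done)
  b <- runIn ctxB (return (2 + 2 :: Int))   -- no contention with ctxA
  a <- takeMVar done
  print (a + b)   -- prints 6
```

Had both threads shared one MVar-wrapped default context, the two runs would serialise on it, which is exactly the behaviour described above.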
@tmcdonell Yes, why have an MVar around the context? Without the MVar, all this wouldn't be an issue, or would it?
I mangled the commit tag, but here it is: AccelerateHS/accelerate-cuda@9f019ce
Thinking about why I did this, it occurs to me that what I really wanted was exclusive access to the device to avoid context switching (I recall trying this once and it was much slower than executing the two programs sequentially), but actually using a single context from different points is fine.
Ok, great, but can't we close the issue in this case? Or is there an outstanding problem?
The 'thread blocked' issue is resolved. I've created a new issue w.r.t. what Ryan mentions above regarding context synchronisation behaviour.
Thanks.