acceleratehs / accelerate
Embedded language for high-performance array computations
Home Page: https://www.acceleratehs.org
License: Other
The following program:
import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA
main = print $ run $ A.fold A.max (constant 0) (use (fromList (Z:.10) [1..]) :: Acc (Vector Int))
produces a series of errors like this, on x86_64/Linux:
/tmp/accelerate-cuda-3450/dragon3450.cu(87): error: more than one instance of overloaded function "max" matches the argument list:
function "max(int, int)"
function "max(unsigned int, unsigned int)"
function "max(int, unsigned int)"
function "max(unsigned int, int)"
function "max(long long, long long)"
function "max(unsigned long long, unsigned long long)"
function "max(long long, unsigned long long)"
function "max(unsigned long long, long long)"
function "max(float, float)"
function "max(double, double)"
function "max(float, double)"
function "max(double, float)"
argument types are: (int64_t, int64_t)
12 errors detected in the compilation of "/tmp/tmpxft_00000d7c_00000000-4_dragon3450.cpp4.ii".
Keeping a list here of small things to fix in the docs:
When adding new Accelerate backends it would be nice to have the option of running standalone benchmarks that are completely separate from any Haskell process.
We can already do this with generated data (Generate). But it would be nice to also have another array-inlet, in addition to "Use" that reads an array from disk.
This might be related to #35, although it is a different error message.
import Data.Array.Accelerate as Acc
import qualified Data.Array.Accelerate.Interpreter as I
nobug :: Exp Int
nobug = the $ unit 1
-------------------------
bug :: Acc (Vector Int)
bug = Acc.map f def
  where
    def = use $ fromList (Z :. 0) [] :: Acc (Vector Int)
    f i = the $ unit i
Running the method 'bug' produces:
*Main> bug
let a0 =*** Exception:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/Smart.hs:121 (prjIdx): inconsistent valuation
I get a segfault when I use run1 instead of run with the CUDA backend. Here's my program:
{-# LANGUAGE FlexibleContexts #-}
import Data.Array.Accelerate as A
--import Data.Array.Accelerate.Interpreter
import Data.Array.Accelerate.CUDA
import PortablePixmap
import Prelude as P
import Data.List (foldl')
import System.Environment
import System.IO
import Debug.Trace
main = do
[a,b,c,d,x,y,z] <- getArgs
hSetBinaryMode stdout True
print (mandelset (read a) (read b) (read c) (read d) (read x) (read y) (read z))
-- -----------------------------------------------------------------------------
type F = Double
type Complex = (F,F)
type ComplexPlane = Array DIM2 Complex
mandelset :: F -> F -> F -> F -> Int -> Int -> Int -> PixMap
mandelset x y x' y' screenX screenY lIMIT
= createPixmap (P.fromIntegral screenX) (P.fromIntegral screenY) lIMIT $
P.map prettyRGB result
where
result = [ x | (_,_,x) <- toList (mandel x y x' y' screenX screenY lIMIT) ]
prettyRGB::Int -> (Int,Int,Int)
prettyRGB s = (r, g, b)
where t = ((lIMIT - s) * 255) `quot` lIMIT
r = t `mod` 128 + 64
g = t * 2 `mod` 128 + 64
b = t * 3 `mod` 256
mandel :: F -> F -> F -> F -> Int -> Int -> Int -> Array DIM2 (F,F,Int)
mandel x y x' y' screenX screenY depth
= foldl (flip ($)) zs0 (P.take depth (repeat go))
where
(cs,zs0) = trace "genPlane" $ run $ lift (cs, zs0)
where cs = genPlane x y x' y' screenX screenY
zs0 = mkinit cs
go :: Array DIM2 (F,F,Int) -> Array DIM2 (F,F,Int)
go zs = zs `seq` (trace "go" $ run1 (A.zipWith iter (use cs)) zs)
genPlane :: F -> F
-> F -> F
-> Int
-> Int
-> Acc ComplexPlane
genPlane lowx lowy highx highy viewx viewy
= generate (constant (Z:.viewy:.viewx))
(\ix -> let pr = unindex2 ix
x = A.fromIntegral (A.fst pr)
y = A.fromIntegral (A.snd pr)
in
lift ( elowx + (x * exsize) / eviewx
, elowy + (y * eysize) / eviewy))
where
elowx, elowy, exsize, eysize, eviewx, eviewy :: Exp F
elowx = constant lowx
elowy = constant lowy
exsize = constant (highx - lowx)
eysize = constant (highy - lowy)
eviewx = constant (P.fromIntegral viewx)
eviewy = constant (P.fromIntegral viewy)
next :: Exp Complex -> Exp Complex -> Exp Complex
next c z = c `plus` (z `times` z)
plus :: Exp Complex -> Exp Complex -> Exp Complex
plus = lift2 f
where f :: (Exp F, Exp F) -> (Exp F, Exp F) -> (Exp F, Exp F)
f (x1,y1) (x2,y2) = (x1+x2,y1+y2)
times :: Exp Complex -> Exp Complex -> Exp Complex
times = lift2 f
where f :: (Exp F, Exp F) -> (Exp F, Exp F) -> (Exp F, Exp F)
f (x,y) (x',y') = (x*x'-y*y', x*y'+y*x')
dot :: Exp Complex -> Exp F
dot = lift1 f
where f :: (Exp F, Exp F) -> Exp F
f (x,y) = x*x + y*y
iter :: Exp Complex -> Exp (F,F,Int) -> Exp (F,F,Int)
iter c z = f (unlift z)
where
f :: (Exp F, Exp F, Exp Int) -> Exp (F,F,Int)
f (x,y,i) =
(dot z' >* 4.0) ? ( lift (x,y,i)
, lift (A.fst z', A.snd z', i+1) )
where z' = A.curry (next c) x y
mkinit :: Acc ComplexPlane -> Acc (Array DIM2 (F,F,Int))
mkinit cs = A.map (lift1 f) cs
where f :: (Exp F, Exp F) -> (Exp F, Exp F, Exp Int)
f (x,y) = (x,y,0)
Run it like this:
./mandel -0.25 -1.0 0.0 -0.75 512 512 255 >out
A gist of the code I'm running, and the result is here.
I can duplicate this with GHCs 7.2.1 and 7.4.1. My device info is as follows:
Prelude Foreign.CUDA.Driver> initialise []
Prelude Foreign.CUDA.Driver> props =<< device 0
DeviceProperties {deviceName = "Quadro 5000", computeCapability = 2.0, totalGlobalMem = 2683502592, totalConstMem = 65536, sharedMemPerBlock = 49152, regsPerBlock = 32768, warpSize = 32, maxThreadsPerBlock = 1024, maxThreadsPerMultiProcessor = 1536, maxBlockSize = (1024,1024,64), maxGridSize = (65535,65535,65535), maxTextureDim1D = 65536, maxTextureDim2D = (65536,65535), maxTextureDim3D = (2048,2048,2048), clockRate = 1026000, multiProcessorCount = 11, memPitch = 2147483647, memBusWidth = 320, memClockRate = 1500000, textureAlignment = 512, computeMode = Default, deviceOverlap = True, concurrentKernels = True, eccEnabled = False, asyncEngineCount = 2, cacheMemL2 = 655360, tccDriverEnabled = False, pciInfo = PCI {busID = 2, deviceID = 0, domainID = 0}, kernelExecTimeoutEnabled = False, integrated = False, canMapHostMemory = True, unifiedAddressing = True}
@rrnewton notes in #48 that the current (driver default) behaviour is to spin when waiting for GPU operations to complete, which is not friendly towards other Haskell threads that want to do useful work. We should change this to something that is gentler with CPU resources (CU_CTX_SCHED_BLOCKING_SYNC).
Tangentially related to #13.
Originally suggested by @blever
Currently, multiple arrays can only be combined into a single one using zip/zipWith. These operators combine only two arrays at a time, so in general it would be useful to have operators for combining more than two arrays in more elaborate, but structured, ways.
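Until such operators exist, wider combinations can be built by nesting the binary ones. The following is a minimal sketch on plain Haskell lists rather than Accelerate arrays; the names zip3From2 and zipWith3From2 are made up for illustration, and it is only an assumption that the same nesting carries over to the Acc versions of zip/zipWith with the appropriate lifting.

```haskell
-- Plain-list model: build three-way combinators from the binary ones.
-- These names are hypothetical and not part of Accelerate.

-- Pair the first two lists, then fold the third in with zipWith.
zip3From2 :: [a] -> [b] -> [c] -> [(a, b, c)]
zip3From2 xs ys zs = zipWith (\(x, y) z -> (x, y, z)) (zip xs ys) zs

-- Same idea for an arbitrary three-argument function.
zipWith3From2 :: (a -> b -> c -> d) -> [a] -> [b] -> [c] -> [d]
zipWith3From2 f xs ys zs = zipWith (\(x, y) z -> f x y z) (zip xs ys) zs
```

The intermediate list of pairs is exactly the kind of temporary that a dedicated multi-array operator could avoid.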
Some requirements could be:
In addition to fulfilling the need of a common pattern, combining operators would allow for further backend optimisations:
When I run cabal install accelerate-cuda-0.12.1.0 with ghc-7.4.2 (as it comes from ghc-7.4.2-x86_64-unknown-linux.tar.bz2), I get a bunch of errors like this one:
[ 4 of 25] Compiling Data.Array.Accelerate.CUDA.CodeGen.Base ( Data/Array/Accelerate/CUDA/CodeGen/Base.hs, dist/build/Data/Array/Accelerate/CUDA/CodeGen/Base.o )
Data/Array/Accelerate/CUDA/CodeGen/Base.hs:74:55:
Not in scope: `noSrcLoc'
It seems to work with
cabal install accelerate-cuda --constraint='srcloc<0.2' --constraint='language-c-quote<0.4'
Feature request: fast, safe conversion to and from some instance of Data.Vector.Generic
I whipped up a prototype of conversions between Array DIM1 e and the Storable vector variant, as that exports a raw pointer interface.
I'm unsure about the safety of the "unsafe" versions, not being too familiar with the Accelerate array data model, and would like feedback on whether they're actually safe to use in pure code. However, both of the QC roundtrip properties Work On My Machine™.
If the Accelerate.IO modules are not the preferred way to add such functionality, I'd be open to suggestions. Also, I imagine there is a way to do row-major conversions with arbitrary shapes other than DIM1, but I figured I'd get the simple case working first.
Hello, with the current 0.9 code, what's the desired way to run the accelerate examples binary to do a test run?
I seem to be getting stack overflow errors when I try to run it, and it seems to need command-line arguments that aren't clear how to supply.
Thanks,
-carter
As the test works fine on the CUDA backend, it is unlikely to be a frontend issue.
I have written some Accelerate programs in which I end up using multiple "run"s, often just for debugging or diagnostic purposes, for example, to print intermediate results in the computation.
The frustrating thing is that because it includes array level tuples and Let's, Accelerate compiles arbitrary graphs of array operations. So of course it's already possible to extract many arrays of varying types from a single Accelerate run.
However, putting all the results into a big tuple of arrays can be inconvenient, especially with all the lifting and unlifting business.
Can anyone think of a way to make it easier to construct entire Accelerate computation graphs with multiple outputs? Perhaps the analogy is to the State monad vs. ST. It would be nice to be able to virtually "run" Accelerate multiple times within a monad, but really only call a single run.
@tmcdonell suggested that the frontend lift array computations embedded in scalar expressions using let bindings, to simplify backend code (especially with respect to reference counting).
the full log:
$ /Users/<...>/Library/Haskell/ghc-7.4.1/lib/accelerate-examples-0.12.1.0/bin/accelerate-mandelbrot --cuda
accelerate-mandelbrot:
*** Internal error in package accelerate-cuda ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:139 (unhandled): CUDA Exception: driver not initialised
executing on MacOS X 10.6 and 10.8, 64bit
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Sat_Apr__7_14:56:41_PDT_2012
Cuda compilation tools, release 4.2, V0.2.1221
The only executable that runs is accelerate-examples.
The failing ones are: accelerate-crystal, accelerate-fluid, accelerate-mandelbrot, accelerate-quickcheck.
The backend kit needs to follow 298a3f2.
Here's the error:
/tmp/accelerate-cuda-11123/dragon11123.cu(12): error: no instance of function template "mod" matches the argument list
argument types are: (const int, Int64)
1 error detected in the compilation of "/tmp/tmpxft_00002b75_00000000-4_dragon11123.cpp4.ii".
accelerate_anova1_v1.gpu.exe: nvcc terminated abnormally (2)
Here's a reproducer: https://gist.github.com/2949100
(Sorry, I removed some cruft, but it's still far from minimal.)
It seems to assume that the shape is correct. While editing a test case I accidentally provided the wrong shape to "fromList":
xs = Sugar.fromList sh [(1,1),(2,2)]
sh = Z :. (5::Int)
I said it was length 5 when it was really only length 2. It would be great to get a more precise error message for this case.
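The kind of check being asked for can be sketched on plain lists. This is only a model under stated assumptions: checkedFromList is a hypothetical name, the shape is reduced to its total element count, and nothing here is the actual Accelerate fromList.

```haskell
-- Plain-list model of a length-checked 'fromList'; not the Accelerate API.
-- 'checkedFromList' is a hypothetical name.

-- Take exactly n elements, or report how many were actually available.
-- Only the first n elements are inspected, so infinite lists such as [0..] work.
checkedFromList :: Int -> [a] -> Either String [a]
checkedFromList n xs
  | n < 0 = Left ("negative size: " ++ show n)
  | got < n = Left ("shape needs " ++ show n ++ " elements, but the list has only " ++ show got)
  | otherwise = Right prefix
  where
    prefix = take n xs
    got = length prefix
```

With the values from the report, checkedFromList 5 [(1,1),(2,2)] yields a Left naming both the requested and the available length, instead of failing later with an unrelated error.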
When we introduce nested parallelism, we need to handle lambda-bound variables properly. What needs to be improved is the following:
Each use of an array from within a scalar expression will generate code to a new array variable (texture reference). If the same array is used multiple times, these references should be shared.
Is this a backend or frontend problem?
Originally suggested by @tmcdonell
I get the following error at "cabal configure":
checking size of Int... target `(undefined::Int)' is not a module name or a source file
checking size of Char... target `(undefined::Char)' is not a module name or a source file
I just tried the latest accelerate from github, and now I'm getting this error:
mandel: /tmp/accelerate-cuda-538/dragon538.cubin: rename: unsupported operation (Invalid cross-device link)
This happens with all the examples I've tried, including the mandel example from earlier.
{-# LANGUAGE FlexibleContexts #-}
import Data.Array.Accelerate as A
--import Data.Array.Accelerate.Interpreter
import Data.Array.Accelerate.CUDA
import PortablePixmap
import Prelude as P
import Data.List (foldl')
import System.Environment
import System.IO
import Debug.Trace
main = do
(a:b:c:d:x:y:z:_) <- getArgs
hSetBinaryMode stdout True
print (mandelset (read a) (read b) (read c) (read d) (read x) (read y) (read z))
-- -----------------------------------------------------------------------------
type F = Double
type Complex = (F,F)
type ComplexPlane = Array DIM2 Complex
mandelset :: F -> F -> F -> F -> Int -> Int -> Int -> PixMap
mandelset x y x' y' screenX screenY lIMIT
= createPixmap (P.fromIntegral screenX) (P.fromIntegral screenY) lIMIT $
P.map prettyRGB result
where
result = [ x | (_,_,x) <- toList (mandel x y x' y' screenX screenY lIMIT) ]
prettyRGB::Int -> (Int,Int,Int)
prettyRGB s = (r, g, b)
where t = ((lIMIT - s) * 255) `quot` lIMIT
r = t `mod` 128 + 64
g = t * 2 `mod` 128 + 64
b = t * 3 `mod` 256
mandel :: F -> F -> F -> F -> Int -> Int -> Int -> Array DIM2 (F,F,Int)
mandel x y x' y' screenX screenY depth
= foldl (flip ($)) zs0 (P.take depth (repeat go))
where
(cs,zs0) = trace "genPlane" $ run $ lift (cs, zs0)
where cs = genPlane x y x' y' screenX screenY
zs0 = mkinit cs
go :: Array DIM2 (F,F,Int) -> Array DIM2 (F,F,Int)
go zs = zs `seq` (trace "go" $ run (A.zipWith iter (use cs) (use zs)))
genPlane :: F -> F
-> F -> F
-> Int
-> Int
-> Acc ComplexPlane
genPlane lowx lowy highx highy viewx viewy
= generate (constant (Z:.viewy:.viewx))
(\ix -> let pr = unindex2 ix
x = A.fromIntegral (A.fst pr)
y = A.fromIntegral (A.snd pr)
in
lift ( elowx + (x * exsize) / eviewx
, elowy + (y * eysize) / eviewy))
where
elowx, elowy, exsize, eysize, eviewx, eviewy :: Exp F
elowx = constant lowx
elowy = constant lowy
exsize = constant (highx - lowx)
eysize = constant (highy - lowy)
eviewx = constant (P.fromIntegral viewx)
eviewy = constant (P.fromIntegral viewy)
next :: Exp Complex -> Exp Complex -> Exp Complex
next c z = c `plus` (z `times` z)
plus :: Exp Complex -> Exp Complex -> Exp Complex
plus = lift2 f
where f :: (Exp F, Exp F) -> (Exp F, Exp F) -> (Exp F, Exp F)
f (x1,y1) (x2,y2) = (x1+x2,y1+y2)
times :: Exp Complex -> Exp Complex -> Exp Complex
times = lift2 f
where f :: (Exp F, Exp F) -> (Exp F, Exp F) -> (Exp F, Exp F)
f (x,y) (x',y') = (x*x'-y*y', x*y'+y*x')
dot :: Exp Complex -> Exp F
dot = lift1 f
where f :: (Exp F, Exp F) -> Exp F
f (x,y) = x*x + y*y
iter :: Exp Complex -> Exp (F,F,Int) -> Exp (F,F,Int)
iter c z = f (unlift z)
where
f :: (Exp F, Exp F, Exp Int) -> Exp (F,F,Int)
f (x,y,i) =
(dot z' >* 4.0) ? ( lift (x,y,i)
, lift (A.fst z', A.snd z', i+1) )
where z' = A.curry (next c) x y
mkinit :: Acc ComplexPlane -> Acc (Array DIM2 (F,F,Int))
mkinit cs = A.map (lift1 f) cs
where f :: (Exp F, Exp F) -> (Exp F, Exp F, Exp Int)
f (x,y) = (x,y,0)
When using the CUDA backend, but not the interpreter, it is easy to get a "thread blocked indefinitely on MVar" exception by having one GPU computation depend on another. I presume this is due to the use of withMVar in run, so I worked around it with some seqs. Is that right?
This seems like a bug, since if something works with the interpreter we would expect it to work with the CUDA backend. I can imagine it would be difficult to fix though. Could it be documented somewhere?
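If the guess about a non-reentrant lock is right, the failure mode and the workaround can be modelled without any GPU. The sketch below is an assumption-laden stand-in, not the backend's code: runLocked, nested and sequenced are hypothetical names, and IO is used only to make the ordering explicit, whereas the real run is pure.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)

-- Hypothetical stand-in for a backend 'run' that serialises access to the
-- device with a global lock.
runLocked :: MVar () -> IO a -> IO a
runLocked lock action = withMVar lock (\_ -> action)

-- Blocks forever: the inner call waits for a lock that the outer call still
-- holds. (Defined for illustration only; do not call it.)
nested :: MVar () -> IO Int
nested lock = runLocked lock (fmap (* 10) (runLocked lock (pure 4)))

-- Safe: finish the inner computation first, then start the dependent one.
sequenced :: MVar () -> IO Int
sequenced lock = do
  inner <- runLocked lock (pure 4)
  runLocked lock (pure (inner * 10))
```

Forcing the inner result with seq before the outer run starts plays the same role as the explicit sequencing in sequenced.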
Indexing an array of 8- or 16-bit type within scalar code yields spurious results:
ghci> let xs = use $ fromList (Z:.10) [0..] :: Acc (Vector Int8)
ghci> CUDA.run $ A.map (\i -> xs ! index1 (A.fromIntegral i)) xs
Array (Z :. 10) [0,4,4,6,4,-123,6,-121,4,69]
D.A.A.Smart.convertAccFun1 will throw a type mismatch error in the following example:
import Data.Array.Accelerate as Acc
import Data.Array.Accelerate.AST ( Afun )
import Data.Array.Accelerate.Smart
z :: Acc (Scalar Int)
z = unit 0
xs :: Acc (Vector Int)
xs = use $ fromList (Z:.10) [0..]
acc1 :: Acc (Vector Int) -> Acc (Vector Int)
acc1 = Acc.map (\_ -> the z)
afun1 :: Afun (Vector Int -> Vector Int)
afun1 = convertAccFun1 acc1
and the session in ghci:
ghci> acc1 xs
let a0 = unit 0
in map
(\x0 -> a0!(index Z)) (use (Array Z :. 10 [0,1,2,3,4,5,6,7,8,9]))
ghci> :force afun1
*** Exception:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/Smart.hs:114 (prjIdx): type mismatch
couldn't match expected type `Array (:. Z Int) Int' with actual type `Array Z Int'
The CUDA backend currently cannot handle arrays that contain elements of type Bool or Char. In D.A.A.Array.Data, see the instance declarations for ArrayElem Bool and ArrayElem Char for details.
I just got a CUDA-capable macbook pro and am trying to run accelerate-cuda on my laptop for the first time. This is a fresh 10.8 machine with Apple's "command line development" utilities. I seem to get the same error (below) with either a cabal install of the stable 0.12 hackage version, or HEAD.
[~/accelerate/accelerate-cuda] (master)$ ./configure
checking for nvcc... nvcc
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables...
checking whether we are cross compiling... configure: error: in `/Users/rrnewton/accelerate/accelerate-cuda':
configure: error: cannot run C++ compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details
[~/accelerate/accelerate-cuda] (master)$ gcc --version
i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I can't see this ever being an issue other than during debugging, since one would just use stencil, but anyway, something fishy is going on...
import Data.Array.Accelerate as A
is :: Array DIM2 Int
is = fromList (Z:.10:.10) [0..]
fs :: Array DIM2 Float
fs = fromList (Z:.10:.10) [0..]
-- Ignoring the first parameter is OK
--
ok1 = stencil2 centre Clamp (use fs) Clamp (use is)
  where
    centre :: Stencil3x3 Float -> Stencil3x3 Int -> Exp Int
    centre _ (_,(_,y,_),_) = y
-- Using both is also OK
--
ok2 = stencil2 centre Clamp (use fs) Clamp (use is)
  where
    centre :: Stencil3x3 Float -> Stencil3x3 Int -> Exp Float
    centre (_,(_,x,_),_) (_,(_,y,_),_) = x + A.fromIntegral y
-- Not using the second parameter to the stencil function ('y') somehow results
-- in 'x' having the wrong indices, and environment projection fails:
--
-- stencil2*** Exception: Fatal error in Smart.prjIdx:
-- Couldn't match expected type `((Float,Float,Float),(Float,Float,Float),(Float,Float,Float))' with actual type `((Int,Int,Int),(Int,Int,Int),(Int,Int,Int))'
-- Type mismatch at shared 'Exp' tree with stable name 42; i = 0
-- Possible reason: nested data parallelism - array computation that depends on a
-- scalar variable of type 'Exp a'
--
bad1 = stencil2 centre Clamp (use fs) Clamp (use is)
  where
    centre :: Stencil3x3 Float -> Stencil3x3 Int -> Exp Float
    centre (_,(_,x,_),_) _ = x
Hi,
when running accelerate-examples on my machine, the first tests pass, but then I get the following error message:
scanseg-sum: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:58 (unhandled): CUDA Exception: unspecified launch failure
All the later tests fail with the same error message. The test "slices" gives, in addition, the error
./Data/Array/Accelerate/CUDA/Compile.hs:415 (prepareAcc): IndexAny: not implemented yet
If I run the program through cuda-memcheck, it reports
========= Out-of-range Shared or Local Address
========= at 0x000001a8 in inclusive_scan
========= by thread (544,0,0) in block (0,0,0)
My specifications are
DeviceProperties {deviceName = "GeForce GT 540M", computeCapability = 2.1, totalGlobalMem = 1073414144, totalConstMem = 65536, sharedMemPerBlock = 49152, regsPerBlock = 32768, warpSize = 32, maxThreadsPerBlock = 1024, maxThreadsPerMultiProcessor = 1536, maxBlockSize = (1024,1024,64), maxGridSize = (65535,65535,65535), maxTextureDim1D = 65536, maxTextureDim2D = (65536,65535), maxTextureDim3D = (2048,2048,2048), clockRate = 1344000, multiProcessorCount = 2, memPitch = 2147483647, memBusWidth = 128, memClockRate = 800000, textureAlignment = 512, computeMode = Default, deviceOverlap = True, concurrentKernels = True, eccEnabled = False, asyncEngineCount = 1, cacheMemL2 = 131072, tccDriverEnabled = False, pciInfo = PCI {busID = 1, deviceID = 0, domainID = 0}, kernelExecTimeoutEnabled = False, integrated = False, canMapHostMemory = True, unifiedAddressing = True}
I'm using ghc 7.4.1, cuda-0.4.1.0, and accelerate-0.9.0.1.
If I can provide any additional helpful information, please let me know.
Kind regards,
Philipp
I have accelerate-cuda installed globally (.cabal/config: user-install: False)
I cannot run any of the accelerate-examples. For instance,
waldmann@octopus:~$ accelerate-mandelbrot
accelerate-mandelbrot: /usr/local/share/accelerate-cuda-0.12.1.0/cache: createDirectory: permission denied (Permission denied)
Of course I can
sudo chmod go+w /usr/local/share/accelerate-cuda-0.12.1.0
but that does not seem right.
Originally reported by @tmcdonell
When multiple threads try to write to the same location, the hardware write-combining mechanism accepts one transaction and rejects all others. The permute operation does not currently take this into account.
main :: IO ()
main = do
  putStr "Interpreter : " ; print (Interp.run accumulate)
  putStr "CUDA : " ; print =<< (CUDA.run accumulate)
accumulate :: Acc (Vector Int)
accumulate = Acc.permute (+) dst (idx Acc.!) src
  where
    src = Acc.use $ Acc.fromList 16 (repeat 1)
    idx = Acc.use $ Acc.fromList 16 [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] :: Acc (Vector Int)
    dst = Acc.use $ Acc.fromList 4 (repeat 0)
Which results in:
*Test> :main
Interpreter : Array 4 [4,6,3,3]
CUDA : Array 4 [1,1,1,1]
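The interpreter's answer can be cross-checked against a sequential reference model of permute. The following is a minimal sketch on plain lists, not Accelerate code; permuteRef and adjustAt are hypothetical names, and indices are assumed to be in range.

```haskell
-- Sequential reference model of 'permute' on lists; not the Accelerate API.
-- 'permuteRef' and 'adjustAt' are hypothetical names.

-- Combine each source element into the destination slot chosen by 'ixOf',
-- one element at a time, so colliding writes are never lost.
permuteRef :: (a -> a -> a) -> [a] -> (Int -> Int) -> [a] -> [a]
permuteRef combine defaults ixOf src = foldl step defaults (zip [0 ..] src)
  where
    step acc (i, x) = adjustAt (ixOf i) (`combine` x) acc

-- Apply f to the element at position k, leaving the others untouched.
adjustAt :: Int -> (a -> a) -> [a] -> [a]
adjustAt k f xs = [if j == k then f y else y | (j, y) <- zip [0 ..] xs]
```

Applied to the index vector from the report, with sixteen ones as the source and four zeros as the defaults, this model yields [4,6,3,3], which agrees with the interpreter output above.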
Compute 1.0 devices do not support any atomic primitives. At least for integral types, we can work around this by tagging each transaction with a thread ID (or similar). This requires many additional memory transactions and wastes the upper bits.
For devices of compute 1.1 and greater, we can use atomic compare-and-swap. This is limited to 32-bit and 64-bit [unsigned] integers, but doesn't require any additional transactions (assuming the internals are intelligent). To coerce between integer and floating-point representations:
#define INT_AS_FLOAT(x) (*((float*)&(x)))
#define FLOAT_AS_INT(x) (*((int*)&(x)))
or
// Actually, pointer casting breaks strict aliasing rules, so a union would be better.
float __int_as_float(int a)
{
union {int a; float b;} u;
u.a = a;
return u.b;
}
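For comparison, the same bit-level reinterpretation can be done on the host side in Haskell. This is a small sketch, assuming base 4.10 or later (where GHC.Float exports the cast functions); floatBits and bitsToFloat are illustrative wrapper names only.

```haskell
import Data.Word (Word32)
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- Reinterpret the bits of a Float as a 32-bit word, without any numeric
-- conversion. This mirrors the union trick, on the host side.
floatBits :: Float -> Word32
floatBits = castFloatToWord32

-- The inverse: view a 32-bit pattern as a Float.
bitsToFloat :: Word32 -> Float
bitsToFloat = castWord32ToFloat
```

A compare-and-swap loop over floating-point data would apply such a cast to the old and new values around each integer CAS.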
@tmcdonell's code for the tagging approach is https://gist.github.com/924007 and for atomic compare-and-swap is https://gist.github.com/924009
If you are using (^) on Exps you get the error:
**** Exception: Prelude.Ord.compare applied to EDSL types
See https://gist.github.com/2768391 for an example.
My guess is that you are inheriting the default implementation of (^), which uses Ord functions.
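If that guess is right, one possible workaround is to keep the exponent on the host, so that all comparisons happen on an ordinary Int and only multiplication is asked of the base type. A minimal sketch, with powHost as a hypothetical name that is not part of Accelerate:

```haskell
-- Exponentiation by a host-side Int exponent. Only Num is required of the
-- base, so no Eq or Ord method is ever called on the base type.
-- 'powHost' is a hypothetical name, not part of Accelerate.
powHost :: Num a => a -> Int -> a
powHost x n
  | n < 0 = error "powHost: negative exponent"
  | n == 0 = 1
  | even n = let h = powHost x (n `div` 2) in h * h
  | otherwise = x * powHost x (n - 1)
```

This only helps when the exponent is known before the computation is run; an exponent that is itself an embedded expression would need a different approach.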
There is already a "size" function:
size :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp Int
But it would be good to also have a different version that operates directly on shapes, i.e.:
size' :: (Shape ix) => Exp ix -> Exp Int
Easy enough to implement if there are no objections.
What name would people propose it be called given that size is already taken?
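The intended semantics, the product of all extents of a shape, can be sketched with a small host-side model. The Z and (:.) types below are local re-definitions that merely mirror Accelerate's shape constructors, and shapeSize is just one candidate name, used here for illustration.

```haskell
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE TypeOperators #-}

-- A small host-side model of snoc-list shapes, mirroring Z and (:.).
data Z = Z
data tail :. head = tail :. head
infixl 3 :.

-- Number of elements described by a shape: the product of its extents.
class Shape sh where
  shapeSize :: sh -> Int

-- The empty shape describes a single scalar element.
instance Shape Z where
  shapeSize Z = 1

-- Each further dimension multiplies the element count by its extent.
instance Shape sh => Shape (sh :. Int) where
  shapeSize (sh :. n) = shapeSize sh * n
```

For example, shapeSize (Z :. (10 :: Int) :. (10 :: Int)) evaluates to 100; an embedded version would do the same arithmetic on Exp values.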
Currently we have 11 failures:
Find the cause of these failures and decide which ones we want to fix for 0.12 and which ones we might delay until 0.12.1.
via @arj
import Data.Array.Accelerate as Acc
type AccVector a = Acc (Vector a)
type AccSparseVector a = ((AccVector Int), (AccVector a))
getEntry :: (Elt a) => Exp Int -> a -> AccSparseVector a -> Exp a
getEntry i d (idx,val) = Acc.snd $ the $ Acc.foldAll f def xs
  where
    xs = Acc.zip idx val
    def = constant (0 :: Int,d)
    f ack v = (i ==* Acc.fst v) ? (v, ack)
    -- f ack v = (i ==* i) ? (v, ack) -- This fails!
    -- f ack v = ack -- This works!
    -- f ack v = v -- This works!
vectorFromSparseVector :: (Elt a) => AccSparseVector a -> Int -> a -> AccVector a
vectorFromSparseVector sv@(idx,val) size d = Acc.map m def
  where
    def = use $ fromList (Z :. size) $ take size $ ([1,2..] :: [Int]) :: AccVector Int
    m i = getEntry i d sv
------------
nobug = getEntry 1 (0.0 :: Float) (idx, val)
  where
    idx = use $ fromList (Z :. (1 :: Int)) ([1] :: [Int])
    val = use $ fromList (Z :. (1 :: Int)) ([1] :: [Float])
-------------------------
bug = vectorFromSparseVector (idx, val) 1 $ (0.0 :: Float)
  where
    idx = use $ fromList (Z :. (1 :: Int)) ([1] :: [Int])
    val = use $ fromList (Z :. (1 :: Int)) ([1] :: [Float])
and the ghci session:
ghci> bug
let a0 =*** Exception:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/Smart.hs:114 (prjIdx): type mismatch
couldn't match expected type `((),Int)' with actual type `(((),Int),Float)'
I installed accelerate 0.9.0 from github, with nVidia's CUDA 3.2 and the hackage cuda-0.3.2 package. My platform is a 2009 iMac with NVIDIA GeForce GT 120 (256MB video RAM) running OS X 10.6.8. Installation of everything went smoothly, once I set my environment variables LDFLAGS, CPPFLAGS, LD_LIBRARY_PATH appropriately to point at the nvidia files in /usr/local/cuda/...
When I run accelerate-examples, it works (though with some stack overflows which are apparently expected).
However, when I run accelerate-examples --cuda, all tests fail with the same error:
map-abs: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:58 (unhandled): CUDA Exception: invalid handle
So I am dutifully reporting the error. Any suggestions about how to gather more info or what I've done wrong?
Originally reported by @tmcdonell
CUDA devices do not coalesce memory transfers to global memory of 8- and 16-bit types. Without providing alternate skeletons that process multiple elements per thread (vec4 and vec2 types respectively), we may be able to promote these to 32-bit transactions, and mask off the irrelevant data. Similar issues exist for shared memory bank conflicts.
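The "promote and mask" idea can be illustrated on the host with ordinary bit operations. This is only a sketch of the arithmetic, not device code: packBytes and extractByte are hypothetical names, and the little-endian lane numbering (lane 0 in the low byte) is an assumption.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8, Word32)

-- Pack four 8-bit elements into one 32-bit word, lane 0 in the low byte.
packBytes :: Word8 -> Word8 -> Word8 -> Word8 -> Word32
packBytes b0 b1 b2 b3 =
      fromIntegral b0
  .|. (fromIntegral b1 `shiftL` 8)
  .|. (fromIntegral b2 `shiftL` 16)
  .|. (fromIntegral b3 `shiftL` 24)

-- Read one lane (0 to 3) back out of a 32-bit word by shifting it down
-- and masking off the other three bytes.
extractByte :: Word32 -> Int -> Word8
extractByte w lane = fromIntegral ((w `shiftR` (8 * lane)) .&. 0xFF)
```

A thread that wants element i of an 8-bit array would, under this scheme, load word i `div` 4 and extract lane i `mod` 4.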
Hi,
I get the error in the title when I try to run the following small example (it converts a matrix from a sparse list representation to a dense array representation):
import Data.Array.Accelerate as Acc
import Data.Array.Accelerate.CUDA
import Prelude hiding (replicate, zip, unzip, map, scanl, scanl1, scanr, scanr1, zipWith,
filter, max, min, not, fst, snd, curry, uncurry)
import qualified Prelude as P
type SRow = [(Int,Int)]
toDense :: Int -> [SRow] -> Acc (Array DIM2 Int)
toDense nCols rows =
  rows' where
    sh = (Z :. nCols :. (length rows) :: DIM2)
    nElts = sum (P.map length rows)
    emptyArr = use $ fromList sh (repeat 0)
    arrElts = use $ fromList (Z:.nElts :: DIM1)
                (concatMap (P.map P.snd) rows)
    rows' = permute (+) emptyArr indMap arrElts
    newInds = use $ fromList (Z:.nElts :: DIM1) $
                P.concat $ P.zipWith (\x ys -> P.map (\y -> Z:.x:.(P.fst y)::DIM2) ys) [0..] rows
    indMap i = newInds!i
main = print $ run $ toDense 7 [[(1,2),(2,3)],[(0,1),(3,2),(6,7)]]
When I load this into ghci and try to run main, I get
/tmp/accelerate-cuda/dragon5565.cu(27): error: no instance of overloaded function "atomicCAS" matches the argument list
argument types are: (uint64_t *, uint64_t, uint64_t)
1 error detected in the compilation of "/tmp/tmpxft_0000162c_00000000-4_dragon5565.cpp4.ii".
*** Exception: nvcc terminated abnormally (2)
The contents of /tmp/accelerate-cuda/dragon5565.cu are:
#include <accelerate_cuda_extras.h>
static __constant__ DIM1 sh0;
static TexInt64 arr0_a0;
static TexInt64 arr0_a1;
typedef DIM2 DimOut;
typedef DIM1 DimIn0;
extern "C" __global__ void permute(int64_t* d_out_a0, const int64_t* d_in0_a0, const DimOut shOut, const DimIn0 shIn0)
{
const int shapeSize = size(shIn0);
const int gridSize = __umul24(blockDim.x, gridDim.x);
int ix;
for (ix = __umul24(blockDim.x, blockIdx.x) + threadIdx.x; ix < shapeSize; ix += gridSize) {
DimOut dst;
const int x0_a0 = ix;
const int v0 = toIndex(sh0, shape(x0_a0));
dst.a1 = indexArray(arr0_a1, v0);
dst.a0 = indexArray(arr0_a0, v0);
if (!ignore(dst)) {
const int jx = toIndex(shOut, dst);
const int64_t x0_a0 = d_in0_a0[ix];
uint64_t x1_a0, _x1_a0 = reinterpret64(d_out_a0[jx]);
do {
x1_a0 = _x1_a0;
_x1_a0 = atomicCAS((uint64_t*) &d_out_a0[jx], x1_a0, reinterpret64(x1_a0 + x0_a0));
} while(x1_a0 != _x1_a0);
}
}
}
I would be grateful if you could look into this. If you need any more info, just let me know.
Cheers,
Philipp
When attempting to multiply a matrix by a vector on the left with the following code:
(*^) :: (Elt e,IsNum e) => Acc (Array DIM1 e) -> Acc (Array DIM2 e) -> Acc (Array DIM1 e)
(*^) v m = generate (index1 b)
             (\ ix -> let (Z:.cols) = unlift ix :: (Z:.Exp Int)
                      in the $ (.^) v
                             $ takeCol cols m )
  where (Z:.a:.b) = unlift $ shape m :: (Z:.Exp Int:.Exp Int)
(.^) :: (Elt e,IsNum e) => Acc (Array DIM1 e) -> Acc (Array DIM1 e) -> Acc (Scalar e)
(.^) u v = fold1 (+) (zipWith (*) u v)
takeCol :: Elt e => Exp Int -> Acc (Array DIM2 e) -> Acc (Array DIM1 e)
takeCol n xs = let (Z:.rows:.cols) = (unlift $ shape xs :: (Z:.Exp Int:.Exp Int))
               in flatten $ backpermute
                    ( index2 (1::Exp Int) cols)
                    ( \ix -> let (Z:.j:.i) = unlift ix :: (Z:.Exp Int:.Exp Int)
                             in index2 n i )
                    xs
and the following input:
run $ (use v) *^ (use m)
(where (v :: Array (Z:.5) Float) & (m :: Array (Z:.5:.3) Float)),
I get the following error in GHCi:
*** Exception:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/Smart.hs:561 (convertSharingExp): inconsistent valuation @ shared 'Exp' tree with stable name 54;
env = [56]
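The program above runs an array computation (fold1 over takeCol) inside the scalar function passed to generate, which may be what triggers the error. Independently of the cause, the intended result can be pinned down with a list-based reference, useful for checking any reformulation. This is a sketch on plain lists, with vecMatRef as a hypothetical name and the matrix given as a list of rows.

```haskell
import Data.List (transpose)

-- List-based reference for a row vector times a matrix; not Accelerate code.
-- The matrix is a list of rows, so each output element is the dot product of
-- the vector with one column. 'vecMatRef' is a hypothetical name.
vecMatRef :: Num e => [e] -> [[e]] -> [e]
vecMatRef v m = [sum (zipWith (*) v col) | col <- transpose m]
```

For a vector of length 5 and a 5x3 matrix, as in the report, the result has length 3.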
Clean up the mess in Smart.hs left by switching to the faster sharing recovery algorithm.
test2 - when reshaping the array and finding size of array at the same time, fails with "lost device memory reference" error.
test2 is successful when using Data.Array.Accelerate.Interpreter instead.
module Test where
import Data.Array.Accelerate as Acc
import Data.Array.Accelerate.CUDA
import Debug.Trace
-- Exp a -> a
fromExp :: (Elt a) => Exp a -> a
fromExp x = (toList(run $ unit $ x))!!0
arr = use$fromList(fromExp (index1 4)) ([1, 3, 5, 7]::[Int])
first len = fromExp $ (reshape (index1 len) arr) ! (index1 0)
test0 = 4 == (fromExp $ size arr) -- Success, returns True
test1 = first (4::Exp Int) -- Success, returns 1
test2 = first (size arr::Exp Int) -- Failure, "lost device memory reference", expected return 1
test3 = first (lift $ fromExp $ size arr::Exp Int) -- Success, returns 1
Sure, that's not an Accelerate error, but it is annoying, e.g.,
It seems haddock is assuming that Data-Array-Accelerate-Analysis should exist.
I'll try to submit this as a haddock issue as well.
Extend the current code dealing with sharing of array computations to handle scalar computations as well.
Hello,
I'm trying to use Accelerate for hydrodynamic simulations.
As a training, I'm writing a Lattice-Boltzmann solver with Accelerate. The program, under construction, is
I have also expressed what I want to write in C++ and CUDA. They are
main-omp.cpp and main-cuda.cu in the same folder.
To begin with, I wrote a function to initialize the array in Accelerate,
(it corresponds to the function 'initialize()' in fluid.h)
but it fails with 'submit a bug report' error.
It says 'too many resources requested,' so I looked at the printout of Accelerate's kernel,
but for me it looks normal.
Am I doing something wrong, so that I'm wasting resources?
Or shall I decrease e.g. the resolution?
./MainAcc.hs 0
... some warnings omitted ...
map
(\x0 -> (+) ((+) ((+) ((+) ((+) ((+) ((+) ((+) (2 (3 x0),
1 (3 x0)),
0 (3 x0)),
2 (2 x0)),
1 (2 x0)),
0 (2 x0)),
2 (1 x0)),
1 (1 x0)),
0 (1 x0)))
(generate
(Z :. 1024) :. 768
(\x0 -> ((0.0,0.0,0.0),
(0.1,
0.7,
(+) (0.2,
(*) (1.0e-3,
(/) ((*) (12.0, fromIntegral (indexHead x0)), 768.0)))),
(0.0,0.0,0.0),
((<) ((+) ((*) (64.0,
(*) ((-) (fromIntegral (indexHead (indexTail x0)),
(/) (768.0, 6.0)),
(-) (fromIntegral (indexHead (indexTail x0)),
(/) (768.0, 6.0)))),
(*) ((-) (fromIntegral (indexHead x0), (/) (768.0, 2.0)),
(-) (fromIntegral (indexHead x0), (/) (768.0, 2.0)))),
(*) ((/) (768.0, 24.0), (/) (768.0, 24.0)))) ?
(1.0, 0.0))))
MainAcc.hs:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:59 (unhandled): CUDA Exception: too many resources requested for launch