acceleratehs / accelerate

Embedded language for high-performance array computations

Home Page: https://www.acceleratehs.org

License: Other

Haskell 94.12% Shell 0.07% C 5.26% C++ 0.55%

haskell accelerate llvm cuda parallel-computing gpu-computing gpu hacktoberfest

accelerate's Issues

error: more than one instance of overloaded function "max" matches the argument list:

The following program:

import Data.Array.Accelerate as A
import Data.Array.Accelerate.CUDA

main = print $ run $ A.fold A.max (constant 0) (use (fromList (Z:.10) [1..]) :: Acc (Vector Int))

produces a series of errors like this, on x86_64/Linux:

/tmp/accelerate-cuda-3450/dragon3450.cu(87): error: more than one instance of overloaded function "max" matches the argument list:
            function "max(int, int)"
            function "max(unsigned int, unsigned int)"
            function "max(int, unsigned int)"
            function "max(unsigned int, int)"
            function "max(long long, long long)"
            function "max(unsigned long long, unsigned long long)"
            function "max(long long, unsigned long long)"
            function "max(unsigned long long, long long)"
            function "max(float, float)"
            function "max(double, double)"
            function "max(float, double)"
            function "max(double, float)"
            argument types are: (int64_t, int64_t)

12 errors detected in the compilation of "/tmp/tmpxft_00000d7c_00000000-4_dragon3450.cpp4.ii".

Miscellaneous haddock nitpicks

Keeping a list here of small things to fix in the docs:

"Segment" should explain whether it is a list of lengths or a list of start positions.
Is the argument documentation for permute off? The "permuted array" that is taken as an argument, is that actually the input array, with the result being the "permuted array"?
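
To make the "Segment" question concrete, here is a minimal sketch on plain Haskell lists (not the Accelerate API) of the two candidate readings and how they relate. The names `lengthsToOffsets`, `offsetsToLengths` and `foldSeg` are hypothetical and exist only for this illustration; which reading the library uses is what the documentation should spell out.

```haskell
-- Two ways to describe the segments of a flat vector (illustrative only).

-- | Convert per-segment lengths into start offsets.
lengthsToOffsets :: [Int] -> [Int]
lengthsToOffsets lens = init (scanl (+) 0 lens)

-- | Convert start offsets back into lengths, given the total element count.
offsetsToLengths :: Int -> [Int] -> [Int]
offsetsToLengths total offs = zipWith (-) (drop 1 offs ++ [total]) offs

-- | Segmented sum where the descriptor is read as a list of lengths.
foldSeg :: [Int] -> [Int] -> [Int]
foldSeg []     _  = []
foldSeg (n:ns) xs =
  let (seg, rest) = splitAt n xs
  in  sum seg : foldSeg ns rest
```

With lengths `[2,3,1]` over the elements `[1..6]`, the offsets reading is `[0,2,5]` and the segmented sum is `[3,12,6]`. The same descriptor read as start positions would describe different segments, which is why the docs need to say which one is meant.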

Direct loading of arrays in files, analogous to Use

When adding new Accelerate backends it would be nice to have the option of running standalone benchmarks that are completely separate from any Haskell process.

We can already do this with generated data (Generate). But it would be nice to also have another array-inlet, in addition to "Use" that reads an array from disk.

Inconsistent valuation

This might be related to #35, although it is a different error message.

import Data.Array.Accelerate as Acc
import qualified Data.Array.Accelerate.Interpreter as I

nobug :: Exp Int
nobug = the $ unit 1

-------------------------

bug :: Acc (Vector Int)
bug = Acc.map f def
  where
    def = use $ fromList (Z :. 0) [] :: Acc (Vector Int)
    f i = the $ unit i

Running the method 'bug' produces:

*Main> bug
let a0 =*** Exception: 
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/Smart.hs:121 (prjIdx): inconsistent valuation

Segmentation fault

I get a segfault when I use run1 instead of run with the CUDA backend. Here's my program:

{-# LANGUAGE FlexibleContexts #-}

import Data.Array.Accelerate as A
--import Data.Array.Accelerate.Interpreter
import Data.Array.Accelerate.CUDA

import PortablePixmap
import Prelude as P
import Data.List (foldl')

import System.Environment
import System.IO

import Debug.Trace

main  = do
  [a,b,c,d,x,y,z] <- getArgs
  hSetBinaryMode stdout True
  print (mandelset (read a) (read b) (read c) (read d) (read x) (read y) (read z))

-- -----------------------------------------------------------------------------

type F = Double

type Complex = (F,F)
type ComplexPlane = Array DIM2 Complex

mandelset :: F -> F -> F -> F -> Int -> Int -> Int -> PixMap
mandelset x y x' y' screenX screenY lIMIT
   = createPixmap (P.fromIntegral screenX) (P.fromIntegral screenY) lIMIT $
        P.map prettyRGB result
  where
   result = [ x | (_,_,x) <- toList (mandel x y x' y' screenX screenY lIMIT) ]

   prettyRGB::Int -> (Int,Int,Int)
   prettyRGB s = (r, g, b)
     where t = ((lIMIT - s) * 255) `quot` lIMIT
           r = t `mod` 128 + 64
           g = t * 2 `mod` 128 + 64
           b = t * 3 `mod` 256

mandel :: F -> F -> F -> F -> Int -> Int -> Int -> Array DIM2 (F,F,Int)
mandel x y x' y' screenX screenY depth
  = foldl (flip ($)) zs0 (P.take depth (repeat go))
  where
   (cs,zs0) = trace "genPlane" $ run $ lift (cs, zs0)
        where cs  = genPlane x y x' y' screenX screenY
              zs0 = mkinit cs

   go :: Array DIM2 (F,F,Int) -> Array DIM2 (F,F,Int)
   go zs = zs `seq` (trace "go" $ run1 (A.zipWith iter (use cs)) zs)


genPlane :: F -> F
         -> F -> F
         -> Int
         -> Int
         -> Acc ComplexPlane
genPlane lowx lowy highx highy viewx viewy
   = generate (constant (Z:.viewy:.viewx))
              (\ix -> let pr = unindex2 ix
                          x = A.fromIntegral (A.fst pr)
                          y = A.fromIntegral (A.snd pr)
                      in
                        lift ( elowx + (x * exsize) / eviewx
                             , elowy + (y * eysize) / eviewy))
   where
      elowx, elowy, exsize, eysize, eviewx, eviewy :: Exp F

      elowx  = constant lowx
      elowy  = constant lowy

      exsize = constant (highx - lowx)
      eysize = constant (highy - lowy)

      eviewx = constant (P.fromIntegral viewx)
      eviewy = constant (P.fromIntegral viewy)


next :: Exp Complex -> Exp Complex -> Exp Complex
next c z = c `plus` (z `times` z)


plus :: Exp Complex -> Exp Complex -> Exp Complex
plus = lift2 f
  where f :: (Exp F, Exp F) -> (Exp F, Exp F) -> (Exp F, Exp F)
        f (x1,y1) (x2,y2) = (x1+x2,y1+y2)

times :: Exp Complex -> Exp Complex -> Exp Complex
times = lift2 f
  where f :: (Exp F, Exp F) -> (Exp F, Exp F) -> (Exp F, Exp F)
        f (x,y) (x',y')   =  (x*x'-y*y', x*y'+y*x')

dot :: Exp Complex -> Exp F
dot = lift1 f
  where f :: (Exp F, Exp F) -> Exp F
        f (x,y) = x*x + y*y


iter :: Exp Complex -> Exp (F,F,Int) -> Exp (F,F,Int)
iter c z = f (unlift z)
 where
  f :: (Exp F, Exp F, Exp Int) -> Exp (F,F,Int)
  f (x,y,i) =
     (dot z' >* 4.0) ? ( lift (x,y,i)
                       , lift (A.fst z', A.snd z', i+1) )
     where z' = A.curry (next c) x y


mkinit :: Acc ComplexPlane -> Acc (Array DIM2 (F,F,Int))
mkinit cs = A.map (lift1 f) cs
  where f :: (Exp F, Exp F) -> (Exp F, Exp F, Exp Int)
        f (x,y) = (x,y,0)

Run it like this:

./mandel -0.25 -1.0 0.0 -0.75 512 512 255 >out

Internal error with radix sort example

A gist of the code I'm running, and the result is here.

I can duplicate this with GHCs 7.2.1 and 7.4.1. My device info is as follows:

Prelude Foreign.CUDA.Driver> initialise []
Prelude Foreign.CUDA.Driver> props =<< device 0
DeviceProperties {deviceName = "Quadro 5000", computeCapability = 2.0, totalGlobalMem = 2683502592, totalConstMem = 65536, sharedMemPerBlock = 49152, regsPerBlock = 32768, warpSize = 32, maxThreadsPerBlock = 1024, maxThreadsPerMultiProcessor = 1536, maxBlockSize = (1024,1024,64), maxGridSize = (65535,65535,65535), maxTextureDim1D = 65536, maxTextureDim2D = (65536,65535), maxTextureDim3D = (2048,2048,2048), clockRate = 1026000, multiProcessorCount = 11, memPitch = 2147483647, memBusWidth = 320, memClockRate = 1500000, textureAlignment = 512, computeMode = Default, deviceOverlap = True, concurrentKernels = True, eccEnabled = False, asyncEngineCount = 2, cacheMemL2 = 655360, tccDriverEnabled = False, pciInfo = PCI {busID = 2, deviceID = 0, domainID = 0}, kernelExecTimeoutEnabled = False, integrated = False, canMapHostMemory = True, unifiedAddressing = True}

@rrnewton notes in #48 that the current (driver default) behaviour is to spin when waiting for GPU operations to complete, which is not friendly towards other Haskell threads that want to do useful work. We should change this to something that is gentler with CPU resources (CU_CTX_SCHED_BLOCKING_SYNC).

Tangentially related to #13.

Support for "combining" operators

Originally suggested by @blever

The ability to combine multiple arrays into a single one can currently only be achieved using zip/zipWith. Of course, these operators can only combine two arrays at a time, so in general it would be useful to have operators for combining more than two arrays in more elaborate, but structured, ways.

Some requirements could be:

'combine' operator:
- a generalised array combining operator
- sum of input arrays sizes is equal to output array size - i.e. no elements are lost or duplicated
- no permutation is performed on elements within an input array - i.e. output array elements are contiguous with respect to their source input array
- input array elements must all be of the same type and shape (shape would need to be a run-time check)
- the combination does not need to preserve dimensionality - e.g. multiple 1D arrays could be combined to produce another 1D array (concatenation), or a 2D array (stacked) or even "maybe" a 3D array (stack-stacking?)
'append' operator:
- a specialised array combining operator
- two input arrays - place one array at the "end" of the other
- input array elements of the same type
- the higher dimensions of the input arrays must have the same extent
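
The requirements above can be sketched on plain Haskell lists, purely as a model of the intended semantics. The names `appendV`, `combine1D`, `combine2D` and `sizePreserved` are hypothetical and are not part of Accelerate; a real operator would work on `Acc` arrays and check shapes at run time.

```haskell
-- | 'append' modelled on lists: place the second vector at the end of the first.
appendV :: [a] -> [a] -> [a]
appendV = (++)

-- | 'combine' into a 1D result: concatenation, elements stay contiguous per input.
combine1D :: [[a]] -> [a]
combine1D = concat

-- | 'combine' into a 2D result (stacking): all inputs must have the same
-- length, which is the run-time shape check mentioned in the requirements.
combine2D :: [[a]] -> Either String [[a]]
combine2D []          = Right []
combine2D rows@(r:rs)
  | all ((== length r) . length) rs = Right rows
  | otherwise                       = Left "combine2D: input arrays differ in shape"

-- | The size invariant: no elements are lost or duplicated.
sizePreserved :: [[a]] -> Bool
sizePreserved xss = length (combine1D xss) == sum (map length xss)
```

In this model `combine2D [[1],[2,3]]` is rejected with a `Left`, while `combine1D` accepts the same inputs because concatenation does not need equal extents.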

In addition to fulfilling the need of a common pattern, combining operators would allow for further backend optimisations:

- input arrays to a combine operator would not require intermediate writing to memory - they can be written directly (by their producer) to their location within the combined output array
- on architectures such as Fermi, the generation of the input arrays can be done in parallel by using separate streams - 'combine' in this case acts as a synchronisation barrier until all computations are complete

accelerate-cuda build error (Not in scope: 'noSrcLoc')

When I run cabal install accelerate-cuda-0.12.1.0
with ghc-7.4.2 (as it comes from ghc-7.4.2-x86_64-unknown-linux.tar.bz2)
I get a bunch of errors like this one:

[ 4 of 25] Compiling Data.Array.Accelerate.CUDA.CodeGen.Base ( Data/Array/Accelerate/CUDA/CodeGen/Base.hs, dist/build/Data/Array/Accelerate/CUDA/CodeGen/Base.o )

Data/Array/Accelerate/CUDA/CodeGen/Base.hs:74:55:
Not in scope: `noSrcLoc'

It seems to work with

cabal install accelerate-cuda --constraint='srcloc<0.2' --constraint='language-c-quote<0.4'

Conversion to/from Data.Vector of some flavor

Feature request: fast, safe conversion to and from some instance of Data.Vector.Generic

I whipped up a prototype of conversions between Array DIM1 e and the Storable vector variant, as that exports a raw pointer interface.

I'm unsure about the safety of the "unsafe" versions, not being too familiar with the Accelerate array data model, and would like feedback on whether they're actually safe to use in pure code. However, both of the QC roundtrip properties Work On My Machine™.

If the Accelerate.IO modules are not the preferred way to add such functionality, I'd be open to suggestions. Also, I imagine there is a way to do row-major conversions with arbitrary shapes other than DIM1, but I figured I'd get the simple case working first.

Not clear how to use the examples code to test that it's working

Hello, with the current 0.9 code, what's the desired way to run the accelerate examples binary to do a test run?

I seem to be getting stack overflow errors when I try to run it, and it seems to need command-line args, but it isn't clear how to supply them.

thanks
-carter

Interpreter doesn't terminate on backpermute-reverse

As the test works fine on the CUDA backend, it is unlikely to be a frontend issue.

Make it easier to retrieve multiple results from an Acc run.

I have written some Accelerate programs in which I end up using multiple "run"s, often just for debugging or diagnostic purposes, for example, to print intermediate results in the computation.

The frustrating thing is that because it includes array level tuples and Let's, Accelerate compiles arbitrary graphs of array operations. So of course it's already possible to extract many arrays of varying types from a single Accelerate run.

However, putting all the results into a big tuple of arrays can be inconvenient, especially with all the lifting and unlifting business.

Can anyone think of a way to make it easier to construct entire Accelerate computation graphs with multiple outputs? Perhaps the analogy is to the State monad vs. ST. It would be nice to be able to virtually "run" accelerate multiple times within a monad, but really only call a single run.

Lift array computations out of scalar expressions

@tmcdonell suggested that the frontend lift array computations embedded in scalar expressions using let bindings, to simplify backend code (especially with respect to reference counting).

cannot run examples (CUDA Exception: driver not initialised)

the full log:

$ /Users/<...>/Library/Haskell/ghc-7.4.1/lib/accelerate-examples-0.12.1.0/bin/accelerate-mandelbrot --cuda
accelerate-mandelbrot:
*** Internal error in package accelerate-cuda ***
*** Please submit a bug report at https://github.com/AccelerateHS/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:139 (unhandled): CUDA Exception: driver not initialised

executing on MacOS X 10.6 and 10.8, 64bit
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Sat_Apr__7_14:56:41_PDT_2012
Cuda compilation tools, release 4.2, V0.2.1221

the only executable which is running is accelerate-examples,
failing ones: accelerate-crystal accelerate-fluid accelerate-mandelbrot accelerate-quickcheck.

Backend kit needs to be adapted to new array tuples

The backend kit needs to follow 298a3f2.

NVCC error regarding missing instance of function template

Here's the error:

/tmp/accelerate-cuda-11123/dragon11123.cu(12): error: no instance of function template "mod" matches the argument list
    argument types are: (const int, Int64)

1 error detected in the compilation of "/tmp/tmpxft_00002b75_00000000-4_dragon11123.cpp4.ii".
accelerate_anova1_v1.gpu.exe: nvcc terminated abnormally (2)

Here's a reproducer: https://gist.github.com/2949100

(Sorry I removed some cruft but it's far from minimal.)

show instance for Arrays can throw Prelude.(!!) exception

It seems to assume that the shape is correct. While editing a test case I accidentally provided the wrong shape to "fromList":

xs = Sugar.fromList sh [(1,1),(2,2)]
sh = Z :. (5::Int)

I said it was length 5 when it was really only length 2. It would be great to get a more precise error message for this case.
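
A minimal sketch of the kind of check that would give a more precise message, written against plain lists rather than Accelerate's `fromList`. The name `fromListChecked` and the wording of the error are assumptions made for illustration only.

```haskell
-- | Take exactly n elements for a one-dimensional shape of extent n,
-- reporting a descriptive error when the list is too short.
-- Only the first n elements are inspected, so infinite lists still work.
fromListChecked :: Int -> [a] -> Either String [a]
fromListChecked n xs
  | n < 0     = Left ("fromList: negative extent " ++ show n)
  | got < n   = Left ("fromList: shape needs " ++ show n
                      ++ " elements, but the list has only " ++ show got)
  | otherwise = Right taken
  where
    taken = take n xs
    got   = length taken
```

For the case described here, `fromListChecked 5 [(1,1),(2,2)]` yields `Left "fromList: shape needs 5 elements, but the list has only 2"` instead of a later `Prelude.(!!)` failure in `show`.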

Lambda-bound variables during sharing recovery

When we introduce nested parallelism, we need to handle lambda-bound variables properly. What needs to be improved is the following:

Compute levels for lambda-bound scalar variables (in Phase One of sharing recovery) and use the levels to compute the 'Tag' values.
We may want to have one level for the combination of array-valued and scalar expressions.

share references to arrays lifted from scalar expressions

Each use of an array from within a scalar expression will generate code to a new array variable (texture reference). If the same array is used multiple times, these references should be shared.

Interpreter and CUDA backend intermittently fail on scan-segsum

Is this a backend or frontend problem?

Concurrent execution

Originally suggested by @tmcdonell

AccelerateHS 0.12.1 seems not to build with GHC 7.4.1

I get the following error at "cabal configure":

checking size of Int... target `(undefined::Int)' is not a module name or a source file
checking size of Char... target `(undefined::Char)' is not a module name or a source file

FFI support to interface with existing CUDA code

rename: unsupported operation (Invalid cross-device link)

I just tried the latest accelerate from github, and now I'm getting this error:

mandel: /tmp/accelerate-cuda-538/dragon538.cubin: rename: unsupported operation (Invalid cross-device link)

This happens with all the examples I've tried, including the mandel example from earlier.

{-# LANGUAGE FlexibleContexts #-}

import Data.Array.Accelerate as A
--import Data.Array.Accelerate.Interpreter
import Data.Array.Accelerate.CUDA

import PortablePixmap
import Prelude as P
import Data.List (foldl')

import System.Environment
import System.IO

import Debug.Trace

main  = do
  (a:b:c:d:x:y:z:_) <- getArgs
  hSetBinaryMode stdout True
  print (mandelset (read a) (read b) (read c) (read d) (read x) (read y) (read z))

-- -----------------------------------------------------------------------------

type F = Double

type Complex = (F,F)
type ComplexPlane = Array DIM2 Complex

mandelset :: F -> F -> F -> F -> Int -> Int -> Int -> PixMap
mandelset x y x' y' screenX screenY lIMIT
   = createPixmap (P.fromIntegral screenX) (P.fromIntegral screenY) lIMIT $
        P.map prettyRGB result
  where
   result = [ x | (_,_,x) <- toList (mandel x y x' y' screenX screenY lIMIT) ]

   prettyRGB::Int -> (Int,Int,Int)
   prettyRGB s = (r, g, b)
     where t = ((lIMIT - s) * 255) `quot` lIMIT
           r = t `mod` 128 + 64
           g = t * 2 `mod` 128 + 64
           b = t * 3 `mod` 256

mandel :: F -> F -> F -> F -> Int -> Int -> Int -> Array DIM2 (F,F,Int)
mandel x y x' y' screenX screenY depth
  = foldl (flip ($)) zs0 (P.take depth (repeat go))
  where
   (cs,zs0) = trace "genPlane" $ run $ lift (cs, zs0)
        where cs  = genPlane x y x' y' screenX screenY
              zs0 = mkinit cs

   go :: Array DIM2 (F,F,Int) -> Array DIM2 (F,F,Int)
   go zs = zs `seq` (trace "go" $ run (A.zipWith iter (use cs) (use zs)))


genPlane :: F -> F
         -> F -> F
         -> Int
         -> Int
         -> Acc ComplexPlane
genPlane lowx lowy highx highy viewx viewy
   = generate (constant (Z:.viewy:.viewx))
              (\ix -> let pr = unindex2 ix
                          x = A.fromIntegral (A.fst pr)
                          y = A.fromIntegral (A.snd pr)
                      in
                        lift ( elowx + (x * exsize) / eviewx
                             , elowy + (y * eysize) / eviewy))
   where
      elowx, elowy, exsize, eysize, eviewx, eviewy :: Exp F

      elowx  = constant lowx
      elowy  = constant lowy

      exsize = constant (highx - lowx)
      eysize = constant (highy - lowy)

      eviewx = constant (P.fromIntegral viewx)
      eviewy = constant (P.fromIntegral viewy)


next :: Exp Complex -> Exp Complex -> Exp Complex
next c z = c `plus` (z `times` z)


plus :: Exp Complex -> Exp Complex -> Exp Complex
plus = lift2 f
  where f :: (Exp F, Exp F) -> (Exp F, Exp F) -> (Exp F, Exp F)
        f (x1,y1) (x2,y2) = (x1+x2,y1+y2)

times :: Exp Complex -> Exp Complex -> Exp Complex
times = lift2 f
  where f :: (Exp F, Exp F) -> (Exp F, Exp F) -> (Exp F, Exp F)
        f (x,y) (x',y')   =  (x*x'-y*y', x*y'+y*x')

dot :: Exp Complex -> Exp F
dot = lift1 f
  where f :: (Exp F, Exp F) -> Exp F
        f (x,y) = x*x + y*y


iter :: Exp Complex -> Exp (F,F,Int) -> Exp (F,F,Int)
iter c z = f (unlift z)
 where
  f :: (Exp F, Exp F, Exp Int) -> Exp (F,F,Int)
  f (x,y,i) =
     (dot z' >* 4.0) ? ( lift (x,y,i)
                       , lift (A.fst z', A.snd z', i+1) )
     where z' = A.curry (next c) x y


mkinit :: Acc ComplexPlane -> Acc (Array DIM2 (F,F,Int))
mkinit cs = A.map (lift1 f) cs
  where f :: (Exp F, Exp F) -> (Exp F, Exp F, Exp Int)
        f (x,y) = (x,y,0)

Thread blocked indefinitely

When using the CUDA backend, but not the interpreter, it is easy to get a "thread blocked indefinitely on MVar" exception by having one GPU computation depend on another. I presume this is due to the use of withMVar in run, so I worked around it with some seqs. Is that right?

This seems like a bug, since if something works with the interpreter we would expect it to work with the CUDA backend. I can imagine it would be difficult to fix though. Could it be documented somewhere?

array indexing of 8- and 16-bit types

Indexing an array of 8- or 16-bit type within scalar code yields spurious results:

ghci> let xs = use $ fromList (Z:.10) [0..] :: Acc (Vector Int8)
ghci> CUDA.run $ A.map (\i -> xs ! index1 (A.fromIntegral i)) xs
Array (Z :. 10) [0,4,4,6,4,-123,6,-121,4,69]

type mismatch in convertAccFun1

D.A.A.Smart.convertAccFun1 will throw a type mismatch error in the following example:

import Data.Array.Accelerate            as Acc
import Data.Array.Accelerate.AST        ( Afun )
import Data.Array.Accelerate.Smart

z :: Acc (Scalar Int)
z = unit 0

xs :: Acc (Vector Int)
xs = use $ fromList (Z:.10) [0..]

acc1 :: Acc (Vector Int) -> Acc (Vector Int)
acc1 = Acc.map (\_ -> the z)

afun1 :: Afun (Vector Int -> Vector Int)
afun1 = convertAccFun1 acc1

and the session in ghci:

ghci> acc1 xs
let a0 = unit 0
in map
     (\x0 -> a0!(index Z)) (use (Array Z :. 10 [0,1,2,3,4,5,6,7,8,9]))

ghci> :force afun1
*** Exception: 
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/Smart.hs:114 (prjIdx): type mismatch
couldn't match expected type `Array (:. Z Int) Int' with actual type `Array Z Int'

No Bool and Char arrays with the CUDA backend

The CUDA backend cannot currently handle arrays that contain elements of type Bool or Char. In D.A.A.Array.Data, see the instance declarations for ArrayElem Bool and ArrayElem Char for details.

accelerate-cuda: Trouble with gcc + Mac OS + cross compilation

I just got a CUDA-capable macbook pro and am trying to run accelerate-cuda on my laptop for the first time. This is a fresh 10.8 machine with Apple's "command line development" utilities. I seem to get the same error (below) with either a cabal install of the stable 0.12 hackage version, or HEAD.

[~/accelerate/accelerate-cuda] (master)$ ./configure 
checking for nvcc... nvcc
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.out
checking for suffix of executables... 
checking whether we are cross compiling... configure: error: in `/Users/rrnewton/accelerate/accelerate-cuda':
configure: error: cannot run C++ compiled programs.
If you meant to cross compile, use `--host'.
See `config.log' for more details

[~/accelerate/accelerate-cuda] (master)$ gcc --version
i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

stencil2 function can yield invalid indices

I can't see this ever being an issue other than during debugging, since one would just use stencil, but anyway, something fishy is going on...

import Data.Array.Accelerate as A

is :: Array DIM2 Int
is = fromList (Z:.10:.10) [0..]

fs :: Array DIM2 Float
fs = fromList (Z:.10:.10) [0..]

-- Ignoring the first parameter is OK
--
ok1 = stencil2 centre Clamp (use fs) Clamp (use is)
  where
    centre :: Stencil3x3 Float -> Stencil3x3 Int -> Exp Int
    centre _ (_,(_,y,_),_)  = y


-- Using both is also OK
--
ok2 = stencil2 centre Clamp (use fs) Clamp (use is)
  where
    centre :: Stencil3x3 Float -> Stencil3x3 Int -> Exp Float
    centre (_,(_,x,_),_) (_,(_,y,_),_)  = x + A.fromIntegral y


-- Not using the second parameter to the stencil function ('y') somehow results
-- in 'x' having the wrong indices, and environment projection fails:
--
-- stencil2*** Exception: Fatal error in Smart.prjIdx:
--   Couldn't match expected type `((Float,Float,Float),(Float,Float,Float),(Float,Float,Float))' with actual type `((Int,Int,Int),(Int,Int,Int),(Int,Int,Int))'
--   Type mismatch at shared 'Exp' tree with stable name 42; i = 0
--   Possible reason: nested data parallelism — array computation that depends on a
--     scalar variable of type 'Exp a'
--
bad1 = stencil2 centre Clamp (use fs) Clamp (use is)
  where
    centre :: Stencil3x3 Float -> Stencil3x3 Int -> Exp Float
    centre (_,(_,x,_),_) _  = x

CUDA Exception: unspecified launch failure on accelerate-examples

Hi,

when running accelerate-examples on my machine, the first tests pass, but then I get the following error message:

scanseg-sum: Failed: 
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:58 (unhandled): CUDA Exception: unspecified launch failure

All the later tests fail with the same error message. The test "slices" gives, in addition, the error

./Data/Array/Accelerate/CUDA/Compile.hs:415 (prepareAcc): IndexAny: not implemented yet

If I run the program through cuda-memcheck, it reports

========= Out-of-range Shared or Local Address
=========     at 0x000001a8 in inclusive_scan
=========     by thread (544,0,0) in block (0,0,0)

My specifications are

DeviceProperties {deviceName = "GeForce GT 540M", computeCapability = 2.1, totalGlobalMem = 1073414144, totalConstMem = 65536, sharedMemPerBlock = 49152, regsPerBlock = 32768, warpSize = 32, maxThreadsPerBlock = 1024, maxThreadsPerMultiProcessor = 1536, maxBlockSize = (1024,1024,64), maxGridSize = (65535,65535,65535), maxTextureDim1D = 65536, maxTextureDim2D = (65536,65535), maxTextureDim3D = (2048,2048,2048), clockRate = 1344000, multiProcessorCount = 2, memPitch = 2147483647, memBusWidth = 128, memClockRate = 800000, textureAlignment = 512, computeMode = Default, deviceOverlap = True, concurrentKernels = True, eccEnabled = False, asyncEngineCount = 1, cacheMemL2 = 131072, tccDriverEnabled = False, pciInfo = PCI {busID = 1, deviceID = 0, domainID = 0}, kernelExecTimeoutEnabled = False, integrated = False, canMapHostMemory = True, unifiedAddressing = True}

I'm using ghc 7.4.1, cuda-0.4.1.0, and accelerate-0.9.0.1.

If I can provide any additional helpful information, please let me know.

Kind regards,
Philipp

accelerate-cuda tries to 'mkdir cache' inside the install dir

I have accelerate-cuda installed globally (.cabal/config: user-install: False)
I cannot run any of the accelerate-examples. For instance,

waldmann@octopus:~$ accelerate-mandelbrot
accelerate-mandelbrot: /usr/local/share/accelerate-cuda-0.12.1.0/cache: createDirectory: permission denied (Permission denied)

Of course I can
sudo chmod go+w /usr/local/share/accelerate-cuda-0.12.1.0
but that does not seem right.

Permute does not properly write-combine results

Originally reported by @tmcdonell

When two or more threads try to write to the same location, the hardware write-combining mechanism accepts one transaction and rejects all others. The permute operation does not currently take this into account.

main :: IO ()
main = do
  putStr "Interpreter : " ; print     (Interp.run accumulate)
  putStr "CUDA        : " ; print =<< (CUDA.run   accumulate)

accumulate :: Acc (Vector Int)
accumulate = Acc.permute (+) dst (idx Acc.!) src
  where
    src = Acc.use $ Acc.fromList 16 (repeat 1)
    idx = Acc.use $ Acc.fromList 16 [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] :: Acc (Vector Int)
    dst = Acc.use $ Acc.fromList 4  (repeat 0)

Which results in:

*Test> :main
Interpreter : Array 4 [4,6,3,3]
CUDA        : Array 4 [1,1,1,1]
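
For reference, the interpreter result above is what a sequential model of a combining permute produces. The following is a sketch on plain lists, not the Accelerate implementation; `permuteRef` is a hypothetical name, and out-of-range indices are simply ignored in this model.

```haskell
import Data.List (foldl')

-- | Reference semantics of a combining permute on plain lists: every source
-- element is combined into the destination slot named by its index, so
-- colliding writes accumulate instead of overwriting one another.
permuteRef :: (a -> a -> a) -> [a] -> [Int] -> [a] -> [a]
permuteRef f dflt idx src = foldl' step dflt (zip idx src)
  where
    -- Rebuild the destination with slot i updated; indices outside the
    -- destination match no slot and are therefore dropped.
    step acc (i, x) =
      [ if j == i then f x old else old | (j, old) <- zip [0 ..] acc ]
```

Evaluating `permuteRef (+) (replicate 4 0) [0,0,3,2,1,1,2,1,3,3,1,0,0,2,1,1] (replicate 16 1)` gives `[4,6,3,3]`, matching the interpreter. The CUDA result `[1,1,1,1]` is what one would expect if only a single colliding write per slot survives.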

Compute 1.0 devices do not support any atomic primitives. At least for integral types, we can work around this by tagging each transaction with a thread ID (or similar). This requires many additional memory transactions and wastes the upper bits.

For devices of compute 1.1 and greater, we can use atomic compare-and-swap. This is limited to 32-bit and 64-bit [unsigned] integers, but doesn't require any additional transactions (assuming the internals are intelligent). To coerce between integer and floating-point representations:

#define INT_AS_FLOAT(x) (*((float*)&(x)))
#define FLOAT_AS_INT(x) (*((int*)&(x)))

// Actually, pointer casting breaks strict aliasing rules, so a union would be better.
float __int_as_float(int a)
{
   union {int a; float b;} u;

   u.a = a;
   return u.b;
}
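
The compare-and-swap approach for floating-point values can be modelled on the host in Haskell. This is only a sketch of the retry loop, not device code: `casWord` imitates an atomic CAS that returns the previous value by using `atomicModifyIORef'`, and the names `casWord`, `atomicAddFloat`, `newCell` and `readCell` are hypothetical. The bit casts come from `GHC.Float` (base 4.10 or later).

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import Data.Word  (Word32)
import GHC.Float  (castFloatToWord32, castWord32ToFloat)

-- | Model of a 32-bit compare-and-swap: store 'new' only if the cell still
-- holds 'expected', and always return the value that was seen.
casWord :: IORef Word32 -> Word32 -> Word32 -> IO Word32
casWord ref expected new =
  atomicModifyIORef' ref $ \cur ->
    (if cur == expected then new else cur, cur)

-- | Add a Float to a cell that stores its bit pattern, retrying until the
-- compare-and-swap succeeds.
atomicAddFloat :: IORef Word32 -> Float -> IO ()
atomicAddFloat ref x = readIORef ref >>= loop
  where
    loop old = do
      let new = castFloatToWord32 (castWord32ToFloat old + x)
      seen <- casWord ref old new
      if seen == old then return () else loop seen

-- | Helpers to create and read a cell holding a Float as raw bits.
newCell :: Float -> IO (IORef Word32)
newCell = newIORef . castFloatToWord32

readCell :: IORef Word32 -> IO Float
readCell ref = castWord32ToFloat <$> readIORef ref
```

The comparison is done on the integer bit pattern rather than on the float value, which is why the coercion between the two representations is needed at all.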

@tmcdonell's code for the tagging approach is https://gist.github.com/924007 and for atomic compare-and-swap is https://gist.github.com/924009

Getting an error when using exponentiation (^)

If you are using (^) on Exps you get the error:

**** Exception: Prelude.Ord.compare applied to EDSL types

See https://gist.github.com/2768391 for an example.

My guess is that you are inheriting the default implementation of (^), which uses Ord functions.
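
The Prelude's default (^) compares the exponent against zero, so it needs a working Ord instance on the exponent type. One possible workaround, assuming the exponent is known on the Haskell side, is to unroll the power into multiplications. The sketch below uses a toy expression type; `Expr`, `powE` and `eval` are hypothetical stand-ins and not Accelerate's `Exp`.

```haskell
-- | A toy embedded expression type (stand-in for an EDSL scalar).
data Expr
  = Lit Integer
  | Var String
  | Mul Expr Expr
  deriving (Eq, Show)

-- | Exponentiation by squaring. All comparisons happen on the host-side
-- Int exponent, so no Ord instance is ever needed on 'Expr'.
powE :: Expr -> Int -> Expr
powE _ 0 = Lit 1
powE x n
  | n < 0     = error "powE: negative exponent"
  | even n    = let h = powE x (n `div` 2) in Mul h h
  | otherwise = Mul x (powE x (n - 1))

-- | Evaluate an expression under a variable environment.
eval :: [(String, Integer)] -> Expr -> Integer
eval _   (Lit k)   = k
eval env (Var v)   = maybe (error ("unbound variable " ++ v)) id (lookup v env)
eval env (Mul a b) = eval env a * eval env b
```

Here `eval [("x", 3)] (powE (Var "x") 5)` evaluates to `243`. An exponent that is itself an embedded expression would need a different approach, since its value is not available when the expression tree is built.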

"size" function at Language level

The "size" function already exists:

size :: (Shape ix, Elt e) => Acc (Array ix e) -> Exp Int

But it would be good to also have a different version that operates directly on shapes, i.e.:

size' :: (Shape ix) => Exp ix -> Exp Int

Easy enough to implement if there are no objections.

What name would people propose it be called given that size is already taken?
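
As a host-side model of what a shape-level size would compute, here is a minimal sketch with its own snoc-list shape types. It mirrors the `Z` and `:.` notation used in this tracker but is self-contained and is not the library's definition; the name `shapeSize` is only a placeholder, since the naming question is still open.

```haskell
{-# LANGUAGE FlexibleInstances #-}
{-# LANGUAGE TypeOperators     #-}

-- | Rank-0 shape and the snoc constructor for adding an inner dimension.
data Z = Z
data tail :. head = tail :. head
infixl 3 :.

-- | Number of elements described by a shape: the product of its extents.
class Shape sh where
  shapeSize :: sh -> Int

instance Shape Z where
  shapeSize Z = 1

instance Shape sh => Shape (sh :. Int) where
  shapeSize (sh :. n) = shapeSize sh * n
```

For example, `shapeSize (Z :. (3 :: Int) :. (4 :: Int))` is `12`. The array-level `size` could then be expressed as this function applied to the array's shape.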

CUDA backend failures

Currently we have 11 failures:

zip
scan-segsum (see also Issue #44)
stencil-1D
stencil2-2D
backpermute-transpose
gather-if (see also Issue #46)
scatter
scatter-if
saxpy
filter
radixsort (see also Issue #33)

Find the cause of these failures and decide which ones we want to fix for 0.12 and which ones we might delay until 0.12.1.

scatter/gather ops fail in the CUDA backend

Failing tests:

gather-if
scatter
scatter-if

Fixing this depends on fixing Issue #4 (see Issue #45).

type mismatch

via @arj

import Data.Array.Accelerate as Acc

type AccVector a = Acc (Vector a)
type AccSparseVector a = ((AccVector Int), (AccVector a))

getEntry :: (Elt a) => Exp Int -> a -> AccSparseVector a -> Exp a
getEntry i d (idx,val) = Acc.snd $ the $ Acc.foldAll f def xs
  where
    xs  = Acc.zip idx val
    def = constant (0 :: Int,d)
    f ack v = (i ==* Acc.fst v) ? (v, ack)
--    f ack v = (i ==* i) ? (v, ack) -- This fails!
--    f ack v = ack -- This works!
--    f ack v = v   -- This works!

vectorFromSparseVector :: (Elt a) => AccSparseVector a -> Int -> a -> AccVector a
vectorFromSparseVector sv@(idx,val) size d = Acc.map m def
  where
    def = use $ fromList (Z :. size) $ take size $ ([1,2..] :: [Int]) :: AccVector Int
    m i = getEntry i d sv 

------------

nobug = getEntry 1 (0.0 :: Float) (idx, val) 
  where
    idx = use $ fromList (Z :. (1 :: Int)) ([1] :: [Int])
    val = use $ fromList (Z :. (1 :: Int)) ([1] :: [Float])

-------------------------

bug = vectorFromSparseVector (idx, val) 1 $ (0.0 :: Float)
  where
    idx = use $ fromList (Z :. (1 :: Int)) ([1] :: [Int])
    val = use $ fromList (Z :. (1 :: Int)) ([1] :: [Float])

and the ghci session:

ghci> bug
let a0 =*** Exception: 
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/Smart.hs:114 (prjIdx): type mismatch
couldn't match expected type `((),Int)' with actual type `(((),Int),Float)'

accelerate-examples fails: "CUDA Exception: invalid handle"

I installed accelerate 0.9.0 from github, with nVidia's CUDA 3.2 and the hackage cuda-0.3.2 package. My platform is a 2009 iMac with NVIDIA GeForce GT 120 (256MB video RAM) running OS X 10.6.8. Installation of everything went smoothly, once I set my environment variables LDFLAGS, CPPFLAGS, LD_LIBRARY_PATH appropriately to point at the nvidia files in /usr/local/cuda/...

When I run accelerate-examples, it works (though with some stack overflows which are apparently expected).

However, when I run accelerate-examples --cuda, all tests fail with the same error:

map-abs: Failed:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:58 (unhandled): CUDA Exception: invalid handle

So I am dutifully reporting the error. Any suggestions about how to gather more info or what I've done wrong?

Comprehensive support for standard Haskell classes and numeric conversions

reduced performance of small types

Originally reported by @tmcdonell

CUDA devices do not coalesce global memory transfers of 8- and 16-bit types. Without having to provide alternate skeletons that process multiple elements per thread (vec4 and vec2 types respectively), we may be able to promote these accesses to 32-bit transactions and mask off the irrelevant data. Similar issues exist for shared memory bank conflicts.
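
The promote-and-mask idea can be illustrated on the host side in plain Haskell. This is a sketch of the arithmetic only, not device code and not how the CUDA backend is implemented: the names `packWord`, `byteAt` and `readByte` are hypothetical, and the layout (element 0 in the lowest 8 bits of each word) is an assumption.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word8, Word32)

-- Pack four 8-bit elements into one 32-bit word (element 0 in the low bits).
packWord :: Word8 -> Word8 -> Word8 -> Word8 -> Word32
packWord b0 b1 b2 b3 =
  fromIntegral b0
    .|. (fromIntegral b1 `shiftL` 8)
    .|. (fromIntegral b2 `shiftL` 16)
    .|. (fromIntegral b3 `shiftL` 24)

-- Extract element k (0..3) from a word: shift it down, mask off the rest.
byteAt :: Word32 -> Int -> Word8
byteAt w k = fromIntegral ((w `shiftR` (8 * k)) .&. 0xFF)

-- Read element i of an 8-bit array that is stored as 32-bit words:
-- one 32-bit load, then a shift and a mask.
readByte :: [Word32] -> Int -> Word8
readByte ws i = byteAt (ws !! (i `div` 4)) (i `mod` 4)

main :: IO ()
main = do
  let w = packWord 0x11 0x22 0x33 0x44
  print (map (byteAt w) [0 .. 3]) -- prints [17,34,51,68]
```

A 16-bit variant would follow the same pattern with two elements per word and a mask of 0xFFFF.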

Random number generation

error: no instance of overloaded function "atomicCAS" matches the argument list

Hi,

I get the error in the title when I try to run the following small example, which converts a matrix from a sparse list representation to a dense array representation:

import Data.Array.Accelerate as Acc
import Data.Array.Accelerate.CUDA
import Prelude   hiding (replicate, zip, unzip, map, scanl, scanl1, scanr, scanr1, zipWith,
                         filter, max, min, not, fst, snd, curry, uncurry)
import qualified Prelude as P


type SRow = [(Int,Int)]

toDense :: Int -> [SRow] -> Acc (Array DIM2 Int)
toDense nCols rows =
  rows' where
    sh = (Z :. length rows :. nCols :: DIM2)  -- row index is the outer dimension, as in newInds
    nElts = sum (P.map length rows)
    emptyArr = use $ fromList sh (repeat 0)
    arrElts = use $ fromList (Z:.nElts :: DIM1)
      (concatMap (P.map P.snd) rows)
    rows' = permute (+) emptyArr indMap arrElts
    newInds = use $ fromList (Z:.nElts :: DIM1) $
      P.concat $ P.zipWith (\x ys -> P.map (\y -> Z:.x:.(P.fst y)::DIM2) ys) [0..] rows
    indMap i = newInds!i

main = print $ run $ toDense 7 [[(1,2),(2,3)],[(0,1),(3,2),(6,7)]]

When I load this into ghci and try to run main, I get

/tmp/accelerate-cuda/dragon5565.cu(27): error: no instance of overloaded function "atomicCAS" matches the argument list
            argument types are: (uint64_t *, uint64_t, uint64_t)

1 error detected in the compilation of "/tmp/tmpxft_0000162c_00000000-4_dragon5565.cpp4.ii".
*** Exception: nvcc terminated abnormally (2)

The contents of /tmp/accelerate-cuda/dragon5565.cu are:

#include <accelerate_cuda_extras.h>
static __constant__ DIM1 sh0;
static TexInt64 arr0_a0;
static TexInt64 arr0_a1;
typedef DIM2 DimOut;
typedef DIM1 DimIn0;
extern "C" __global__ void permute(int64_t* d_out_a0, const int64_t* d_in0_a0, const DimOut shOut, const DimIn0 shIn0)
{
const int shapeSize = size(shIn0);
const int gridSize = __umul24(blockDim.x, gridDim.x);
int ix;

for (ix = __umul24(blockDim.x, blockIdx.x) + threadIdx.x; ix < shapeSize; ix += gridSize) {
DimOut dst;
const int x0_a0 = ix;
const int v0 = toIndex(sh0, shape(x0_a0));

dst.a1 = indexArray(arr0_a1, v0);
dst.a0 = indexArray(arr0_a0, v0);
if (!ignore(dst)) {
const int jx = toIndex(shOut, dst);
const int64_t x0_a0 = d_in0_a0[ix];
uint64_t x1_a0,  _x1_a0 = reinterpret64(d_out_a0[jx]);

do {
x1_a0 = _x1_a0;
_x1_a0 = atomicCAS((uint64_t*) &d_out_a0[jx], x1_a0, reinterpret64(x1_a0 + x0_a0));
} while(x1_a0 != _x1_a0);
}
}
}

I would be grateful if you could look into this. If you need any more info, just let me know.

Cheers,
Philipp
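
For checking the output once the kernel compiles, the dense matrix that the example is meant to build can be computed with plain lists. This is a reference sketch only, not Accelerate code: `toDenseRef` and `denseRow` are hypothetical names, the result is assumed to be one list per row, and entries that land on the same cell are summed, mirroring the `(+)` combination function passed to `permute`.

```haskell
type SRow = [(Int, Int)] -- (column index, value) pairs, as in the report

-- One dense row: for each column, sum every value stored at that column.
denseRow :: Int -> SRow -> [Int]
denseRow nCols row =
  [sum [v | (c', v) <- row, c' == c] | c <- [0 .. nCols - 1]]

-- Dense matrix as a list of rows.
toDenseRef :: Int -> [SRow] -> [[Int]]
toDenseRef nCols = map (denseRow nCols)

main :: IO ()
main = mapM_ print (toDenseRef 7 [[(1, 2), (2, 3)], [(0, 1), (3, 2), (6, 7)]])
-- prints [0,2,3,0,0,0,0]
--        [1,0,0,2,0,0,7]
```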

Internal error: "inconsistent valuation @ shared 'Exp' tree with stable name 54"

When attempting to multiply a matrix by a vector on the left with the following code:

(*^) :: (Elt e,IsNum e) => Acc (Array DIM1 e) -> Acc (Array DIM2 e) -> Acc (Array DIM1 e)
(*^) v m = generate (index1 b)
        (\ ix -> let (Z:.cols) = unlift ix :: (Z:.Exp Int)
                  in the $ (.^) v
                   $ takeCol cols m )
    where (Z:.a:.b) = unlift $ shape m :: (Z:.Exp Int:.Exp Int)

(.^) :: (Elt e,IsNum e) => Acc (Array DIM1 e) -> Acc (Array DIM1 e) -> Acc (Scalar e)
(.^) u v = fold1 (+) (zipWith (*) u v)

takeCol :: Elt e => Exp Int -> Acc (Array DIM2 e) -> Acc (Array DIM1 e)
takeCol n xs = let (Z:.rows:.cols) = (unlift $ shape xs :: (Z:.Exp Int:.Exp Int))
                in flatten $ backpermute
                 ( index2 (1::Exp Int) cols)
                 ( \ix -> let (Z:.j:.i) = unlift ix :: (Z:.Exp Int:.Exp Int)
                           in index2 n i )
                   xs

and the following input:

run $ (use v) *^ (use m)

(where (v :: Array (Z:.5) Float) & (m :: Array (Z:.5:.3) Float)),

I get the following error in GHCi:

*** Exception:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/Smart.hs:561 (convertSharingExp): inconsistent valuation @ shared 'Exp' tree with stable name 54;
env = [56]

Cleanup sharing

Clean up the mess in Smart.hs left by switching to the faster sharing recovery algorithm.

"lost device memory reference" when accessing the size of an array while reshaping

test2 fails with a "lost device memory reference" error when it reshapes the array and queries the size of the array at the same time.

test2 succeeds when run with Data.Array.Accelerate.Interpreter instead.

module Test where
import Data.Array.Accelerate as Acc
import Data.Array.Accelerate.CUDA
import Debug.Trace

-- Exp a -> a
fromExp :: (Elt a) => Exp a -> a
fromExp x = (toList(run $ unit $ x))!!0

arr = use$fromList(fromExp (index1 4)) ([1, 3, 5, 7]::[Int])

first len = fromExp $ (reshape (index1 len) arr) ! (index1 0)

test0 = 4 == (fromExp $ size arr) -- Success, returns True
test1 = first (4::Exp Int) -- Success, returns 1
test2 = first (size arr::Exp Int) -- Failure, "lost device memory reference", expected return 1
test3 = first (lift $ fromExp $ size arr::Exp Int) -- Success, returns 1

broken URLs for accelerate on hackage

Sure, this is not an Accelerate error, but it is annoying. For example:

http://hackage.haskell.org/packages/archive/accelerate/0.12.1.0/doc/html/Data-Array-Accelerate-Analysis-Shape.html

It seems Haddock is assuming that Data-Array-Accelerate-Analysis should exist.

I'll try to submit this as a haddock issue as well.

Sharing recovery in scalar expressions

Extend the current code dealing with sharing of array computations to handle scalar computations as well.

CUDA Exception: too many resources requested for launch

Hello,
I'm trying to use Accelerate for hydrodynamic simulations.
As practice, I'm writing a Lattice-Boltzmann solver with Accelerate. The program, still under construction, is

https://github.com/nushio3/accelerate-test/blob/7a8248fa30c0e728cea0fe03ccd21bf5bed8a5ef/step05/MainAcc.hs

I have also written what I want to express in C++ and CUDA. They are
main-omp.cpp and main-cuda.cu in the same folder.

To begin with, I wrote a function to initialize the array in Accelerate
(it corresponds to the function 'initialize()' in fluid.h),
but it fails with a 'submit a bug report' error.

The error says 'too many resources requested', so I looked at the printout of Accelerate's kernel,
but it looks normal to me.
Am I doing something wrong that wastes resources?
Or should I decrease, e.g., the resolution?

./MainAcc.hs 0
... some warnings omitted ...
map
(\x0 -> (+) ((+) ((+) ((+) ((+) ((+) ((+) ((+) (2 (3 x0),
1 (3 x0)),
0 (3 x0)),
2 (2 x0)),
1 (2 x0)),
0 (2 x0)),
2 (1 x0)),
1 (1 x0)),
0 (1 x0)))
(generate
(Z :. 1024) :. 768
(\x0 -> ((0.0,0.0,0.0),
(0.1,
0.7,
(+) (0.2,
(*) (1.0e-3,
(/) ((*) (12.0, fromIntegral (indexHead x0)), 768.0)))),
(0.0,0.0,0.0),
((<) ((+) ((*) (64.0,
(*) ((-) (fromIntegral (indexHead (indexTail x0)),
(/) (768.0, 6.0)),
(-) (fromIntegral (indexHead (indexTail x0)),
(/) (768.0, 6.0)))),
(*) ((-) (fromIntegral (indexHead x0), (/) (768.0, 2.0)),
(-) (fromIntegral (indexHead x0), (/) (768.0, 2.0)))),
(*) ((/) (768.0, 24.0), (/) (768.0, 24.0)))) ?
(1.0, 0.0))))
MainAcc.hs:
*** Internal error in package accelerate ***
*** Please submit a bug report at https://github.com/mchakravarty/accelerate/issues
./Data/Array/Accelerate/CUDA.hs:59 (unhandled): CUDA Exception: too many resources requested for launch