aparapi's Issues

Add Kernel and JNI build support for Applets and JNLP JWS

I have successfully completed work implementing Aparapi in an Applet + JNLP 
environment. This required a few small changes to the Aparapi source code and 
build scripts.

In summary:

- Added support for sun.jnlp.applet.launcher in Kernel
- Added support for org.jdesktop.applet.util.JNLPAppletLauncher in Kernel
- Changed build.xml to output .jnilib files in appropriate /dist directories 
instead of platform specific binaries (.so, .dll, .dylib) dropped in the root 
folder


Once I complete preparation of the Applet+JNLP Eclipse projects I will work on 
checking those in under separate issue requests.

Original issue reported on code.google.com by [email protected] on 11 Feb 2012 at 12:15

Attachments:

Doubles generate compiler errors

What steps will reproduce the problem?
1. Any kernel with double values used

What is the expected output? What do you see instead?
I am testing with the kernel found in the users guide, the one that takes an 
array of floats and squares each element. The kernel works fine for floats, but 
for doubles, I get:

************************************************
:1:26: warning: unknown '#pragma OPENCL EXTENSION' - ignored
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
                         ^
:4:13: error: must specify '#pragma OPENCL EXTENSION cl_khr_fp64: enable' 
before using 'double'
   __global double *val$out;
            ^

************************************************



What version of the product are you using? On what operating system?
- Ubuntu 11.10 64-bit
- Aparapi 2012-02-15 (latest version in Downloads at the time I write this)
- NVidia GTX480 with drivers v295.20 (latest at the time I write this)

Please provide any additional information below.
I assume the problem is that I am using an NVIDIA card? I am available for any 
testing required.
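
For reference, a minimal sketch of the kind of kernel being tested, adapted 
from the users guide float example with double substituted (class and field 
names here are illustrative):

import com.amd.aparapi.Kernel;

// Minimal sketch: squares each element of a double array. Running this on
// a GPU requires cl_khr_fp64 (or cl_amd_fp64) support on the device.
public class SquareKernel extends Kernel {
   final double[] values;

   public SquareKernel(double[] values) {
      this.values = values;
   }

   @Override public void run() {
      int i = getGlobalId();
      values[i] = values[i] * values[i];
   }
}

It would be executed with new SquareKernel(values).execute(values.length).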

Original issue reported on code.google.com by [email protected] on 19 Feb 2012 at 8:57

Support for char

Add support for Java's char-type by mapping it to an unsigned short. 

Java doesn't support unsigned numeric types, but char happens to map precisely 
to an unsigned short. It would therefore be convenient to be able to use it as 
a numeric value.
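
Until such support lands, one possible workaround is to stage char data in an 
int array on the Java side before execution; the helper below is an 
illustrative sketch, not part of Aparapi:

// Widen chars to ints before handing data to a kernel, and narrow them
// back afterwards; char widens losslessly since it is exactly an
// unsigned 16-bit value.
public final class CharStaging {
   static int[] toCodes(char[] text) {
      int[] codes = new int[text.length];
      for (int i = 0; i < text.length; i++) {
         codes[i] = text[i];
      }
      return codes;
   }

   static char[] fromCodes(int[] codes) {
      char[] text = new char[codes.length];
      for (int i = 0; i < codes.length; i++) {
         text[i] = (char) codes[i];
      }
      return text;
   }
}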

Original issue reported on code.google.com by [email protected] on 10 Oct 2011 at 9:10

Failed to load aparapi native library

What steps will reproduce the problem?
1. While trying to run the sample code squares
2.
3.

What is the expected output? What do you see instead?
It runs in JTP mode instead of GPU mode.

WARNING: Check your environment. Failed to load aparapi native library 
aparapi_x86 or possibly failed to locate opencl native library 
(opencl.dll/opencl.so). Ensure that both are in your PATH (windows) or in 
LD_LIBRARY_PATH (linux).
Execution mode=JTP


What version of the product are you using? On what operating system?
I am using CentOS 5 and an AMD card with the AMD-APP-SDK-v2.5-RC2-lnx64 driver.

Please provide any additional information below.
I have already set LD_LIBRARY_PATH.

Original issue reported on code.google.com by [email protected] on 1 Dec 2011 at 11:24

Allow aparapi to execute on any OpenCL 1.1 compatible runtime

What steps will reproduce the problem?
1. Execute an aparapi enabled application on platform supporting OpenCL 1.1


What is the expected output? What do you see instead?
Expect application to execute.
Instead, a 'fall back' message appears and the application runs in a thread 
pool instead of on the GPU.



Original issue reported on code.google.com by [email protected] on 12 Oct 2011 at 4:27

Add support for Mac OS.

What steps will reproduce the problem?
1. Try to build/run Aparapi on Mac OS

What is the expected output? What do you see instead?
Expected to build and execute.
Current build does not support Mac OS.  
Current runtime component does not support Apple's OpenCL.



Original issue reported on code.google.com by [email protected] on 12 Oct 2011 at 9:47

Unable to compile and run nbody example

What steps will reproduce the problem?
1. While trying to compile or run example nbody
2.
3.

What is the expected output? What do you see instead?
Buildfile: C:\Aparapi\Aparapi\examples\nbody\build.xml

clean:
   [delete] Deleting directory C:\Aparapi\Aparapi\examples\nbody\classes

clean:

check:

build:
    [mkdir] Created dir: C:\Aparapi\Aparapi\examples\nbody\classes
    [javac] Compiling 1 source file to C:\Aparapi\Aparapi\examples\nbody\classes
    [javac] C:\Aparapi\Aparapi\examples\nbody\src\com\amd\aparapi\examples\nbody\Main.java:290: error: method enable in class Texture cannot be applied to given types;
    [javac]                texture.enable();
    [javac]                       ^
    [javac]   required: GL
    [javac]   found: no arguments
    [javac]   reason: actual and formal argument lists differ in length
    [javac] 1 error

BUILD FAILED
C:\Aparapi\Aparapi\examples\nbody\build.xml:59: Compile failed; see the 
compiler error output for details.


What version of the product are you using? On what operating system?
I am using the current trunk on Windows 7 x64 with an AMD GPU supporting 
OpenCL.

Please provide any additional information below.
I am also facing a problem when running the examples:
Error: Could not find or load main class com.amd.aparapi.examples.nbody.Main




Original issue reported on code.google.com by [email protected] on 7 Oct 2011 at 4:07

Add Kernel support for returning all available OpenCL hardware information

OpenCL allows the developer to query the underlying hardware for available 
information which can then be used at runtime to determine appropriate kernel 
parameters. We are specifically interested in this information in order to 
properly partition our data based on the available GPU memory constraints on 
the deployed hardware platform.

Ideally, this would be returned in a Map<String,String> or Map<Enum,String>.

For example:

CL_DEVICE_ADDRESS_BITS
CL_DEVICE_AVAILABLE
CL_DEVICE_COMPILER_AVAILABLE
CL_DEVICE_ENDIAN_LITTLE
CL_DEVICE_ERROR_CORRECTION_SUPPORT
CL_DEVICE_EXECUTION_CAPABILITIES
CL_DEVICE_EXTENSIONS
CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE
CL_DEVICE_GLOBAL_MEM_CACHE_SIZE
CL_DEVICE_GLOBAL_MEM_CACHE_TYPE
CL_DEVICE_GLOBAL_MEM_SIZE
CL_DEVICE_HOST_UNIFIED_MEMORY
CL_DEVICE_IMAGE2D_MAX_HEIGHT
CL_DEVICE_IMAGE2D_MAX_WIDTH
CL_DEVICE_IMAGE3D_MAX_DEPTH
CL_DEVICE_IMAGE3D_MAX_HEIGHT
CL_DEVICE_IMAGE3D_MAX_WIDTH
CL_DEVICE_IMAGE_SUPPORT
CL_DEVICE_LOCAL_MEM_SIZE
CL_DEVICE_LOCAL_MEM_TYPE
CL_DEVICE_MAX_CLOCK_FREQUENCY
CL_DEVICE_MAX_COMPUTE_UNITS
CL_DEVICE_MAX_CONSTANT_ARGS
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE
CL_DEVICE_MAX_MEM_ALLOC_SIZE
CL_DEVICE_MAX_PARAMETER_SIZE
CL_DEVICE_MAX_READ_IMAGE_ARGS
CL_DEVICE_MAX_SAMPLERS
CL_DEVICE_MAX_WORK_GROUP_SIZE
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS
CL_DEVICE_MAX_WORK_ITEM_SIZES
CL_DEVICE_MAX_WRITE_IMAGE_ARGS
CL_DEVICE_MEM_BASE_ADDR_ALIGN
CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE
CL_DEVICE_NAME
CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR
CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE
CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT
CL_DEVICE_NATIVE_VECTOR_WIDTH_INT
CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG
CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT
CL_DEVICE_OPENCL_C_VERSION
CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR
CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE
CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT
CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT
CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG
CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT
CL_DEVICE_PROFILE
CL_DEVICE_PROFILING_TIMER_RESOLUTION
CL_DEVICE_QUEUE_PROPERTIES
CL_DEVICE_SINGLE_FP_CONFIG
CL_DEVICE_TYPE
CL_DEVICE_VENDOR
CL_DEVICE_VENDOR_ID
CL_DEVICE_VERSION
CL_DRIVER_VERSION
CL_PLATFORM_EXTENSIONS
CL_PLATFORM_NAME
CL_PLATFORM_PROFILE
CL_PLATFORM_VENDOR
CL_PLATFORM_VERSION
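
To illustrate why this matters: with a value such as 
CL_DEVICE_MAX_MEM_ALLOC_SIZE in hand, partitioning the data reduces to a 
ceiling division. The sketch below hard-codes a stand-in value for what we 
would instead query through the requested API:

// Stand-in computation: the 128 MB figure is an assumed placeholder for
// a queried CL_DEVICE_MAX_MEM_ALLOC_SIZE value.
public class PartitionPlan {
   public static void main(String[] args) {
      long maxAllocBytes = 128L * 1024 * 1024;
      long dataBytes = 1500L * 1024 * 1024;
      long chunks = (dataBytes + maxAllocBytes - 1) / maxAllocBytes; // ceiling division
      System.out.println("Partitions needed: " + chunks); // prints 12
   }
}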

Original issue reported on code.google.com by [email protected] on 14 Feb 2012 at 8:50

How to control locking

I am trying to implement a basic reduction algorithm, like "max".
Using the basic approach, each kernel will handle two elements and gradually 
compose the result. This approach requires that the kernels run in perfect 
lockstep otherwise the combined result will not be correct.

Sample implementation idea:
http://developer.apple.com/library/mac/#samplecode/OpenCL_Parallel_Reduction_Example/Listings/reduce_float_kernel_cl.html#//apple_ref/doc/uid/DTS40008188-reduce_float_kernel_cl-DontLinkElementID_7

Pseudo-code:

int id = getGlobalId();
for (int scan = 0; scan < scans; scan++) {
  int other = (1 << scan) + id;
  if (other < length)
    shared[id] = Math.max(shared[id], shared[other]);
}

Is it possible to explicitly emit lock values?

One solution is to use the "passes" to issue the "scan" number, but this seems 
fairly wasteful as it will re-execute the kernels rather than keep them 
running.

In my case the reduction happens after a larger number of operations, so I need 
to produce a separate kernel to do the reduction as I cannot just run multiple 
passes on the entire kernel. Using the extra reducer kernel means that I need 
to replicate both support code and copy large arrays between the two.

Are there any thoughts on how this should be solved with Aparapi?
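
For comparison, here is a rough sketch of what a barrier-based group-local max 
reduction might look like, assuming @Local arrays and localBarrier() behave as 
documented; it is untested and only illustrates the shape of the question:

import com.amd.aparapi.Kernel;

// Rough sketch of a group-local max reduction: each group reduces its
// slice in local memory and thread 0 writes one result per group.
public class MaxKernel extends Kernel {
   final float[] in;
   final float[] groupMax;   // one slot per work group
   @Local final float[] shared;

   public MaxKernel(float[] in, int groupSize, int groups) {
      this.in = in;
      this.groupMax = new float[groups];
      this.shared = new float[groupSize];
   }

   @Override public void run() {
      int lid = getLocalId();
      shared[lid] = in[getGlobalId()];
      localBarrier();
      // Tree reduction: each step halves the number of active threads.
      for (int stride = getLocalSize() / 2; stride > 0; stride /= 2) {
         if (lid < stride) {
            shared[lid] = max(shared[lid], shared[lid + stride]);
         }
         localBarrier();
      }
      if (lid == 0) {
         groupMax[getGroupId()] = shared[0];
      }
   }
}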

Original issue reported on code.google.com by [email protected] on 28 Oct 2011 at 12:04

clEnqueueNDRangeKernel() failed

What steps will reproduce the problem?
1. Make simple kernel
2. Run on machine with more than 1 GPU card
3. Fails with "clEnqueueNDRangeKernel() failed invalid work group size"

What is the expected output? What do you see instead?
Error message:
!!!!!!! clEnqueueNDRangeKernel() failed invalid work group size
after clEnqueueNDRangeKernel, globalSize=16 localSize=32 usingNull=0
Nov 15, 2011 4:07:37 PM com.amd.aparapi.KernelRunner executeOpenCL
WARNING: ### CL exec seems to have failed. Trying to revert to Java ###

What version of the product are you using? On what operating system?
2011-10-13 Ubuntu

Please provide any additional information below.
There is a check in KernelRunner.java:1081 that ensures that localSize <= 
globalSize, but in aparapi.c:1073 it does this:
size_t globalSizeAsSizeT = (globalSize /jniContext->deviceIdc);

This is done to work on multiple devices, and the following loop enqueues the 
work on multiple devices, but calls clEnqueueNDRangeKernel() with these 
numbers. According to the OpenCL docs, the error code means:

"CL_INVALID_WORK_GROUP_SIZE if local is specified and number of workitems 
specified by global is not evenly divisable by size of work-given by 
local_work_size or ..."

I am not sure how it is supposed to work, but according to the error 
description "global should be evenly divisible by local", but since we have 
global=16 and local=32 they are not, hence the error.

Original issue reported on code.google.com by [email protected] on 15 Nov 2011 at 4:01

Primitives with @Local annotation

What steps will reproduce the problem?
1. Have a primitive annotated with @Local. 

What is the expected output? What do you see instead?
Try to use the variable with the assumption that each block/group has a copy. 
Notice that each thread has its own copy.

Workaround:
Declare the variable as an array of size 1.

What version of the product are you using? On what operating system?
aparapi R288, Ubuntu 11.10 amd64, Java 7

Please provide any additional information below.
I need 2-3 single variables to be available to all threads in a block/group. I 
declare local memory variables and have the thread with localId() 0 read the 
value from the global memory and write it there (I am not sure this is a good 
idea, so any comments on that are welcome). After the assignment, I have a 
localBarrier(). 
-If you declare the variable as a primitive, only thread 0 sees the correct 
value (all others see the default value 0.0)
-If you declare the variable as an array of size [0], then the behaviour is the 
expected from @Local

Original issue reported on code.google.com by [email protected] on 27 Feb 2012 at 10:26

com_amd_aparapi_KernelRunner.h missing?

What steps will reproduce the problem?

Try to do a fresh checkout from svn and compile code.

What is the expected output? What do you see instead?

Missing file during compilation.

I guess com.amd.aparapi.jni/include directory is missing in trunk.

Original issue reported on code.google.com by [email protected] on 28 Sep 2011 at 7:28

Does not work, device AMD GPU

What steps will reproduce the problem?
1. Run Aparapi program on Windows 7 64bit
2. Set device to GPU.

What is the expected output? What do you see instead?

com.amd.aparapi.KernelRunner warnFallBackAndExecute
WARNING: Reverting to Java Thread Pool (JTP) for class AparapiSample$1: initJNI 
failed to return a valid handle
.....
platform name    0 Advanced Micro Devices, Inc.
platform version 0 OpenCL 1.2 AMD-APP (923.1)
platform Advanced Micro Devices, Inc. version OpenCL 1.2 AMD-APP (923.1) is not 
OpenCL 1.1 skipping!


What version of the product are you using? On what operating system?
Aparapi 2012-05-06


Please provide any additional information below.

The output from clinfo:

Number of platforms:                             1
  Platform Profile:                              FULL_PROFILE
  Platform Version:                              OpenCL 1.2 AMD-APP (923.1)
  Platform Name:                                 AMD Accelerated Parallel Processing
  Platform Vendor:                               Advanced Micro Devices, Inc.
  Platform Extensions:                           cl_khr_icd cl_amd_event_callback cl_amd_offline_devices cl_khr_d3d10_sharing


  Platform Name:                                 AMD Accelerated Parallel Processing
Number of devices:                               3
  Device Type:                                   CL_DEVICE_TYPE_GPU
  Device ID:                                     4098
  Board name:                                    AMD Radeon HD 6480G
  Max compute units:                             3
  Max work items dimensions:                     3
    Max work items[0]:                           256
    Max work items[1]:                           256
    Max work items[2]:                           256
  Max work group size:                           256
  Preferred vector width char:                   16
  Preferred vector width short:                  8
  Preferred vector width int:                    4
  Preferred vector width long:                   2
  Preferred vector width float:                  4
  Preferred vector width double:                 0
  Native vector width char:                      16
  Native vector width short:                     8
  Native vector width int:                       4
  Native vector width long:                      2
  Native vector width float:                     4
  Native vector width double:                    0
  Max clock frequency:                           444Mhz
  Address bits:                                  32
  Max memory allocation:                         199753728
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            8192
  Max image 2D height:                           8192
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    16
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              2048
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     No
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    None
  Cache line size:                               0
  Cache size:                                    0
  Global memory size:                            536870912
  Constant buffer size:                          65536
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             32768
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            1
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Platform ID:                                   000007FEE71B2A08
  Name:                                          BeaverCreek
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                CAL 1.4.1720 (VM)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2 AMD-APP (923.1)
  Extensions:                                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_lo
cal_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store 
cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd
_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing


  Device Type:                                   CL_DEVICE_TYPE_GPU
  Device ID:                                     4098
  Board name:                                    AMD Radeon 6600M and 6700M Series
  Max compute units:                             6
  Max work items dimensions:                     3
    Max work items[0]:                           256
    Max work items[1]:                           256
    Max work items[2]:                           256
  Max work group size:                           256
  Preferred vector width char:                   16
  Preferred vector width short:                  8
  Preferred vector width int:                    4
  Preferred vector width long:                   2
  Preferred vector width float:                  4
  Preferred vector width double:                 0
  Native vector width char:                      16
  Native vector width short:                     8
  Native vector width int:                       4
  Native vector width long:                      2
  Native vector width float:                     4
  Native vector width double:                    0
  Max clock frequency:                           444Mhz
  Address bits:                                  32
  Max memory allocation:                         536870912
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            8192
  Max image 2D height:                           8192
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    16
  Max size of kernel argument:                   1024
  Alignment (bits) of base address:              2048
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     No
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    None
  Cache line size:                               0
  Cache size:                                    0
  Global memory size:                            2147483648
  Constant buffer size:                          65536
  Max number of constant args:                   8
  Local memory type:                             Scratchpad
  Local memory size:                             32768
  Kernel Preferred work group size multiple:     64
  Error correction support:                      0
  Unified memory for Host and Device:            0
  Profiling timer resolution:                    1
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     No
  Queue properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Platform ID:                                   000007FEE71B2A08
  Name:                                          Turks
  Vendor:                                        Advanced Micro Devices, Inc.
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                CAL 1.4.1720 (VM)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2 AMD-APP (923.1)
  Extensions:                                    cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_lo
cal_int32_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store 
cl_khr_gl_sharing cl_ext_atomic_counters_32 cl_amd_device_attribute_query cl_amd
_vec3 cl_amd_printf cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing


  Device Type:                                   CL_DEVICE_TYPE_CPU
  Device ID:                                     4098
  Board name:
  Max compute units:                             2
  Max work items dimensions:                     3
    Max work items[0]:                           1024
    Max work items[1]:                           1024
    Max work items[2]:                           1024
  Max work group size:                           1024
  Preferred vector width char:                   16
  Preferred vector width short:                  8
  Preferred vector width int:                    4
  Preferred vector width long:                   2
  Preferred vector width float:                  4
  Preferred vector width double:                 0
  Native vector width char:                      16
  Native vector width short:                     8
  Native vector width int:                       4
  Native vector width long:                      2
  Native vector width float:                     4
  Native vector width double:                    0
  Max clock frequency:                           1896Mhz
  Address bits:                                  64
  Max memory allocation:                         2147483648
  Image support:                                 Yes
  Max number of images read arguments:           128
  Max number of images write arguments:          8
  Max image 2D width:                            8192
  Max image 2D height:                           8192
  Max image 3D width:                            2048
  Max image 3D height:                           2048
  Max image 3D depth:                            2048
  Max samplers within kernel:                    16
  Max size of kernel argument:                   4096
  Alignment (bits) of base address:              1024
  Minimum alignment (bytes) for any datatype:    128
  Single precision floating point capability
    Denorms:                                     Yes
    Quiet NaNs:                                  Yes
    Round to nearest even:                       Yes
    Round to zero:                               Yes
    Round to +ve and infinity:                   Yes
    IEEE754-2008 fused multiply-add:             Yes
  Cache type:                                    Read/Write
  Cache line size:                               64
  Cache size:                                    65536
  Global memory size:                            3735633920
  Constant buffer size:                          65536
  Max number of constant args:                   8
  Local memory type:                             Global
  Local memory size:                             32768
  Kernel Preferred work group size multiple:     1
  Error correction support:                      0
  Unified memory for Host and Device:            1
  Profiling timer resolution:                    539
  Device endianess:                              Little
  Available:                                     Yes
  Compiler available:                            Yes
  Execution capabilities:
    Execute OpenCL kernels:                      Yes
    Execute native function:                     Yes
  Queue properties:
    Out-of-Order:                                No
    Profiling :                                  Yes
  Platform ID:                                   000007FEE71B2A08
  Name:                                          AMD A4-3300M APU with Radeon(tm) HD Graphics
  Vendor:                                        AuthenticAMD
  Device OpenCL C version:                       OpenCL C 1.2
  Driver version:                                2.0 (sse2)
  Profile:                                       FULL_PROFILE
  Version:                                       OpenCL 1.2 AMD-APP (923.1)
  Extensions:                                    cl_khr_fp64 cl_amd_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int3
2_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics 
cl_khr_int64_extended_atomics cl_khr_byte_addressable_store cl_khr_gl_sharing 
cl_ex
t_device_fission cl_amd_device_attribute_query cl_amd_vec3 cl_amd_printf 
cl_amd_media_ops cl_amd_popcnt cl_khr_d3d10_sharing


Original issue reported on code.google.com by [email protected] on 7 May 2012 at 10:22

OpenCL exception does not automatically failover to JTP mode

When OpenCL execution throws an exception due to an uninitialized method local 
variable, Aparapi does not automatically failover to JTP but instead appears to 
hang.

This is a continution of http://code.google.com/p/aparapi/issues/detail?id=34 
and original attached code.


Original issue reported on code.google.com by [email protected] on 10 Feb 2012 at 11:04

Invoking Math.log(float)

The Java Math library only has Math.log(double), which means that floats must 
handled with a cast to float.

For a simple kernel, this looks like:
public class LogKernel extends Kernel {
    private float[] data;
    private int offset;
    @Override public void run(){
       int i= getGlobalId();
       data[offset+i]=(float)Math.log(data[offset+i]);
    }
}

This works fine if the hardware supports double precision, otherwise it fails. 
Is there any way to rewrite the kernel to instruct Aparapi to use the single 
precision version of log when executed on a GPU ?

Original issue reported on code.google.com by [email protected] on 25 Oct 2011 at 9:19

Failed to load aparapi native library

What steps will reproduce the problem?
1.check the LD_LIBRARY_PATH
2.check all .so file whether they r cupperted or not 
3.

What is the expected output? What do you see instead?
it suppose to run on GPU instead it run on jtp mode 

What version of the product are you using? On what operating system?
aparapi_2011_09_13 and centoS 5.5 

Please provide any additional information below.
thiswarning is coming 
WARNING: Check your environment. Failed to load aparapi native library 
aparapi_x86 or possibly failed to locate opencl native library 
(opencl.dll/opencl.so). Ensure that both are in your PATH (windows) or in 
LD_LIBRARY_PATH (linux).

Original issue reported on code.google.com by [email protected] on 1 Dec 2011 at 11:16

Special construction can cause Aparapi to fail parsing valid Java bytecode

What steps will reproduce the problem?
Create a simple kernel like this:
class MyKernel extends Kernel {
  int[] a = new int[1];
  int[] b = new int[1];

  public void run() {
      a[b[0]++] = 1;
  }
}

What is the expected output? What do you see instead?
As this is valid Java (and bytecode) I expected it to be parsed.
Instead I get:
 com.amd.aparapi.ClassParseException: @16 IASTORE Detected an non-reducable operand consumer/producer mismatch
    at com.amd.aparapi.MethodModel.applyTransformations(MethodModel.java:1320)
    at com.amd.aparapi.MethodModel.foldExpressions(MethodModel.java:606)
    at com.amd.aparapi.MethodModel.init(MethodModel.java:1493)
    at com.amd.aparapi.MethodModel.<init>(MethodModel.java:1454)
    at com.amd.aparapi.ClassModel.getMethodModel(ClassModel.java:2344)
    at com.amd.aparapi.ClassModel.getEntrypoint(ClassModel.java:2377)


What version of the product are you using? On what operating system?
OSX, Eclipse

Please provide any additional information below.
The bytecode looks like this (annotated with stack state after instruction 
execution):

public void run();
  Code:
   0:   aload_0     -- this
   1:   getfield        -- a
   4:   aload_0     -- a, this
   5:   getfield        -- a, b
   8:   iconst_0        -- a, b, 0
   9:   dup2            -- a, b, 0, b, 0
   10:  iaload      -- a, b, 0, int
   11:  dup_x2      -- a, int, b, 0, int
   12:  iconst_1        -- a, int, b, 0, int, 1
   13:  iadd            -- a, int, b, 0, int+1
   14:  iastore     -- a, int
   15:  iconst_1        -- a, int, 1
   16:  iastore     --> Error, "int" is produced @11, a comes from @1
   17:  return

The parser detects that @14 is not a producer, causing it to activate the 
transform, but no suitable transform is found, hence the exception.

I do not understand the code parser design well enough to suggest a fix, but 
the above example is the smallest I can think of that looks like a real version 
in a test application.

Original issue reported on code.google.com by [email protected] on 9 Feb 2012 at 3:19

"wide" opcode leads to NPE in MethodModel.foldExpressions()

What steps will reproduce the problem?
1. Create a kernel with the following run() body:

int x = 0;
x += 128;

2. Execute the kernel

What is the expected output? What do you see instead?

Expected is that the kernel runs on the GPU, without side effects in this 
minimum example. What happens is the following exception, and JTP execution:

com.amd.aparapi.ClassParseException: java.lang.NullPointerException
    at com.amd.aparapi.MethodModel.init(MethodModel.java:1542)
    at com.amd.aparapi.MethodModel.<init>(MethodModel.java:1452)
    at com.amd.aparapi.ClassModel.getMethodModel(ClassModel.java:2344)
    at com.amd.aparapi.ClassModel.getEntrypoint(ClassModel.java:2377)
    at com.amd.aparapi.ClassModel.getEntrypoint(ClassModel.java:2386)
    at com.amd.aparapi.KernelRunner.execute(KernelRunner.java:1335)
    at com.amd.aparapi.Kernel.execute(Kernel.java:1682)
    at com.amd.aparapi.Kernel.execute(Kernel.java:1613)
    at com.amd.aparapi.Kernel.execute(Kernel.java:1583)
    at mypackage.MyClass.main(MyClass.java)
Caused by: java.lang.NullPointerException
    at com.amd.aparapi.MethodModel.foldExpressions(MethodModel.java:587)
    at com.amd.aparapi.MethodModel.init(MethodModel.java:1491)
    ... 11 more

What version of the product are you using? On what operating system?

aparapi-2012-02-15 on Linux 2.6.35-32-generic #67-Ubuntu SMP Mon Mar 5 19:39:49 
UTC 2012 x86_64 GNU/Linux, using a somewhat unorthodox setup using Maven and 
Eclipse

Please provide any additional information below.

The bytecode provided by the compiler is the following:

0  iconst_0
1  istore_1 [x]
2  wide
3  iinc 1 128 [x]
8  return

Alternative, working kernels:

int x = 0;
x += 127;

0  iconst_0
1  istore_1 [x]
2  iinc 1 127 [x]
5  return

and

int x = 0, y = 128;
x -= y;

 0  iconst_0
 1  istore_1 [x]
 2  sipush 128
 5  istore_2 [y]
 6  iload_1 [x]
 7  iload_2 [y]
 8  isub
 9  istore_1 [x]
10  return

It seems clear to me that Aparapi is choking on the "wide" opcode. What 
Wikipedia has to say about the "wide" opcode: "execute opcode, where opcode is 
either iload, fload, aload, lload, dload, istore, fstore, astore, lstore, 
dstore, or ret, but assume the index is 16 bit; or execute iinc, where the 
index is 16 bits and the constant to increment by is a signed 16 bit short".

Original issue reported on code.google.com by [email protected] on 28 Mar 2012 at 1:26

Single kernel needs to support multiple entry points

We have a use case which requires us to execute a single kernel and enter the 
kernel from multiple different entry points.

For example, we need to perform an initial calculation on a dataset and store 
the results on the GPU with one kernel call. Then we need to calculate a 
secondary result based on the initial results with a second kernel call.

Currently, the following methods do not appear to be completely implemented:

Kernel.execute(Entry _entry, int _globalSize)
Kernel.execute(String _entryPoint, int _globalSize)
Kernel.execute(String _entryPoint, int _globalSize, int _passes) 

Original issue reported on code.google.com by [email protected] on 22 Nov 2011 at 10:47

I built all this yesterday

Everything worked,I ran examples,everyone of them. today I got 12.4 ,so I 
figured i would rebuild it all.I also updates my headers 1.2, from the debian 
repo
I redownloaded svn checkout http://aparapi.googlecode.com/svn/trunk aparapi
I ran ant to see if i got the same error ,I did yesterday to make sure it was 
the same code. I did not get the error, so I knew something had been changed on 
your end. I went to edit the build file, like I did ,the first time.
It appears to no longer be reading that build file, no matter what I put in 
there it will not even throw and error now.It just keeps telling me it needs a 
path to the SDK.

Someone broke the build files, please put them back the way they were. I can 
not understand why people shoot themselves in the foot making changes like that
when you have working code and and pretty good directions how to build.Changing 
build files just pisses people off and makes your directions not work,which 
pisses them off more, they then go download CUDA.AMD started this GPU code long 
before the other people even got started, I used folding at home,but all the 
shooting themselves in the foot have now put them way behind,please stop the 
foot shooting and move forward,it should be a given at this point the software 
will build.

While it was working it was super fast! Thanks for your your time !

Original issue reported on code.google.com by [email protected] on 10 May 2012 at 6:22

Working with float[][]

What steps will reproduce the problem?
1. Create a small kernel that uses float[][]
2. Execute it

What is the expected output? What do you see instead?
Either an error message like "not supported" or a "FallBackWarning".
I see this message instead:

Failed to execute: null
java.lang.NullPointerException
    at com.amd.aparapi.KernelWriter.convertType(KernelWriter.java:111)
    at com.amd.aparapi.KernelWriter.write(KernelWriter.java:260)
    at com.amd.aparapi.KernelRunner.execute(KernelRunner.java:1197)
    at com.amd.aparapi.Kernel.execute(Kernel.java:1523)
    at com.amd.aparapi.Kernel.execute(Kernel.java:1469)

What version of the product are you using? On what operating system?
r89, Windows x86, Oracle JDK 1.6.0_20

Please provide any additional information below.
The problem is the field type parser. Because the field is float[][], the 
typename is "[[F".
The code extracts the first '[' and then assumes that the next char is the 
typename, but it is another '['.

It would be very nice if multidimensional arrays were supported. I do not need 
jagged arrays, so my solution is to rewrite the array as a single float[] and 
then just do index calculations manually. However this gives a lot of memory 
copy, because I copy outside Aparapi, then Aparapi copies to device, and the 
same on the way back.

I would be interested in having a go a implementing this for non-jagged arrays 
inside Aparapi, could you give me some hints as to where I should look in the 
code?

Original issue reported on code.google.com by [email protected] on 28 Oct 2011 at 8:20

Allow Aparapi to load existing OpenCL when available

For some of our use cases, we've been trying to find ways to avoid the 
expensive initialization costs of using Aparapi.

For example, one of our tests is taking ~8ms to complete the kernel execution, 
but the initial Aparapi execution and OpenCL generation is taking ~250-300ms. 
This cost really adds up over multiple different kernels or even re-executions 
of the same kernel outside of a loop (different execution scopes).

One solution would be the following:

- Allow the user to specify that Aparapi should serialize the generated OpenCL 
code to a local .cl file during regular execution

- Allow the user to specify that Aparapi should deserialize a user-defined .cl 
file instead of generating OpenCL from Java code

- Allow Aparapi to follow all of its existing auto-fallback options if the .cl 
file cannot be found, is invalid, etc.
  - Log an error
  - Revert to existing behavior

Original issue reported on code.google.com by [email protected] on 27 Nov 2011 at 7:32

Suggestion to support structs (classes)

I noticed a comment in the Wiki describing the lack of struct support due to a 
mismatch in the C and Java memory models.

On a different Java/JNI open-source project I am working with structs are 
supported using custom annotations.

Here is a link to the documentation: 
https://www.alljoyn.org/sites/default/files/alljoyn-development-guide-java-sdk.p
df

5.4.2 Complex data types using the @Position annotation

Here is some example code from the documentation:

public class ImageInfo{
    @Position(0)
    public String fileName;
    @Position(1)
    public int isoValue;
    @Position(2)
    public int dateTaken;
}

Is this something we would like to consider for Aparapi and OpenCL structs?

Original issue reported on code.google.com by [email protected] on 29 Dec 2011 at 4:21

OpenCL compile fails

I'm trying to leverage OpenCL to calculate the Levenstein difference between 
two strings. I've altered the algorithm so only primitives and 1D arrays are 
used, and no char primitive is used, but I always get:

Feb 06, 2012 8:32:45 PM com.amd.aparapi.KernelRunner warnFallBackAndExecute
WARNING: Reverting to Java Thread Pool (JTP) for class 
org.quelea.SongDuplicateChecker$1: OpenCL compile failed

I've attached the code - I'm not sure if this is a bug or something I'm doing 
wrongly?

Original issue reported on code.google.com by [email protected] on 6 Feb 2012 at 8:35

Attachments:

Aparapi should default to CPU mode before JTP mode if OpenCL successfully generated

During our testing of Aparapi, we've encountered a number of instances where 
the OpenCL code has been successfully created, but fails to execute on the 
targeted GPU.

But we have also noticed that if we force Aparapi to use CPU mode instead of 
JTP mode, the OpenCL code has generally executed 2x or greater performance on 
CPU than JTP.

Since OpenCL is intended to support GPU/CPU/FPGA with the same code, we would 
like to request the following change:

(pseudo code)

if opencl_generated_successfully
{
  try GPU
if GPU fails
  try CPU
if CPU fails
  try JTP
}
else
{
 try JTP
}

Original issue reported on code.google.com by [email protected] on 10 Nov 2011 at 6:13

Support for reading static fields

What steps will reproduce the problem?
1. set -Dcom.amd.aparapi.enable.GETSTATIC=true
2. Run a kernel with a "final static int" field


What is the expected output? What do you see instead?
Should run or fallback, instead it throws "Field not found".


What version of the product are you using? On what operating system?
r89, Win7 x86

Please provide any additional information below.
This is caused by incomplete native code that does not read the static fields 
correctly.

Attached is a patch that fixes this. 

I assume that static fields are not supported because they may go cross 
threads, which means that they should use an array instead. But in the simple 
case where the "final" keyword is applied, it is essentially a const. Removing 
the "static" keyword will in theory allocate storage on a pr. object basis.


Original issue reported on code.google.com by [email protected] on 28 Oct 2011 at 11:25

Attachments:

Allow kernel to make static calls to kernel methods

What steps will reproduce the problem?
1.Code a static method in kernel class
2.Attempt to call this method from kernel
3.

What is the expected output? What do you see instead?
Expect kernel to execute
Please use labels and text to provide additional information.
static call is trapped during bytecode parsing

Here is the email from Witold Bolt
Hi,

Finally I had some time last evening to hack on Aparapi.


Original issue reported on code.google.com by [email protected] on 20 Nov 2011 at 5:22

Attachments:

Aparapi will only run in JTP mode

What steps will reproduce the problem?
I'm trying to use aparapi in one of my projects. When running the example apps 
I get the error:

Apr 04, 2012 5:41:12 PM com.amd.aparapi.KernelRunner warnFallBackAndExecute
WARNING: Reverting to Java Thread Pool (JTP) for class com.amd.aparapi.sample.sq
uares.Main$1: Range workgroup size 256 > device 128
com.amd.aparapi.RangeException: Range workgroup size 256 > device 128
        at com.amd.aparapi.KernelRunner.executeOpenCL(KernelRunner.java:1239)
        at com.amd.aparapi.KernelRunner.execute(KernelRunner.java:1513)
        at com.amd.aparapi.Kernel.execute(Kernel.java:1682)
        at com.amd.aparapi.Kernel.execute(Kernel.java:1613)
        at com.amd.aparapi.Kernel.execute(Kernel.java:1583)
        at com.amd.aparapi.sample.squares.Main.main(Main.java:82)
Execution mode=JTP

After creating a project in Eclipse I added the aparapi.jar to the build path, 
added the aparapi_x86.dll and opencl.dll to the PATH, installed the newest 
version of OpenCL 1.1 still when trying to recreate the square example I get:

com.amd.aparapi.Kernel$EXECUTION_MODE <clinit>
WARNING: Check your environment. Failed to load aparapi native library 
aparapi_x86 or possibly failed to locate opencl native library 
(opencl.dll/opencl.so). Ensure that both are in your PATH (windows) or in 
LD_LIBRARY_PATH (linux).

Where am I going wrong? Both java and system are 32-bit


Original issue reported on code.google.com by [email protected] on 4 Apr 2012 at 4:45

JTP execution not executing passes

What steps will reproduce the problem?
1. Make a simple kernel
2. Execute with execute(n, 2)

What is the expected output? What do you see instead?
Kernel should run two passes, it only runs one.

What version of the product are you using? On what operating system?
Latest, r258, on OSX


Please provide any additional information below.
I think the new Range calculations have somehow made the "passes" go away.
I can see that the value is passed along, but there is no loop using the value.
Searching for "passid" inside KernelRunner.java shows that it is only used with 
SEQ mode (line 682).

Original issue reported on code.google.com by [email protected] on 15 Feb 2012 at 12:30

Will only run in JTP mode

Install and run Mandel sample.  Runs but with JNI fallback.

18-Oct-2011 11:17:03 com.amd.aparapi.KernelRunner warnFallBackAndExecute
WARNING: Reverting to Java Thread Pool (JTP) for class 
com.amd.aparapi.sample.mandel.Main$MandelKernel: initJNI failed to return a 
valid handle
Execution mode=JTP;

Vista 32 bit
java version "1.6.0_26"
Intel CPU, GeForce GTX 460.
I have the latest AMD APP SDK.

Our own custom Java/openCL code using Jocl runs on either GPU or CPU. Device
information from JOCL below in case that helps.

Number of platforms: 2
Number of devices in platform NVIDIA CUDA: 1
Number of devices in platform AMD Accelerated Parallel Processing: 1
--- Info for device GeForce GTX 460: ---
CL_DEVICE_NAME:     GeForce GTX 460
CL_DEVICE_VENDOR:   NVIDIA Corporation
CL_DRIVER_VERSION:  275.33
CL_DEVICE_TYPE:         CL_DEVICE_TYPE_GPU
...
--- Info for device Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
CL_DEVICE_NAME: Intel(R) Core(TM)2 CPU          6600  @ 2.40GHz
CL_DRIVER_VERSION:  2.0
CL_DEVICE_TYPE:     CL_DEVICE_TYPE_CPU

Original issue reported on code.google.com by [email protected] on 18 Oct 2011 at 10:28

Submit Aparapi to Khronos Group

If you take a look at http://www.khronos.org/opencl/resources under the section 
"Java Bindings to OpenCL" you will see that only JOCL is listed.

Aparapi should be submitted to Khronos as a first-class Java OpenCL binding to 
be listed under the above link.

Original issue reported on code.google.com by [email protected] on 2 Jan 2012 at 4:02

kernel.execute and getGlobalId should accept an array of ints

I believe that according to 
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/get_global_id.html 
the method getGlobalId() and correspondingly kernel.execute() should support an 
array of ints. This would allow us to navigate a table structure using 
different index values from the same global id.

The current work-around for this is to encode the table values in the single 
global id similar to:

final int i = this.getGlobalId() / some_variable;
final int j = this.getGlobalId() % some_variable;

Ideally we could do something similar to:

final int i = this.getGlobalId()[0]
final int j = this.getGlobalId()[1]

Original issue reported on code.google.com by [email protected] on 8 Nov 2011 at 5:15

Configure Aparapi to bundle and load JNI libraries from Aparapi JAR

As we begin to integrate Aparapi into more generalized and production-ready 
projects, it is becoming obvious that we need a way for Aparapi to bundle all 
of its JNI libraries into the Aparapi JAR and load them automatically, instead 
of relying on the java.library.path to be set by the calling code.

This is entirely possible and has been done by a number of other projects, but 
will require both changes to how we load native libraries and how we build and 
deploy Aparapi.

Original issue reported on code.google.com by [email protected] on 3 Apr 2012 at 5:16

Aparapi won't work with Intel OpenCL SDK 1.5

Aparapi won't work with "Intel OpenCL SDK 1.5" despite the latter provides 
support for OpenCL 1.1.

The error message in verbose mode is:

platform name    1 Intel(R) Corporation
platform version 1 OpenCL 1.1 LINUX
platform Intel(R) Corporation does not support requested device type skipping!

Wondering if this just an overly strick check or if the device type information 
is actually fundamental for aparapi to work?

Original issue reported on code.google.com by [email protected] on 27 Oct 2011 at 1:22

  • Merged into: #16

Parser produces NaN for floating point numerals smaller then 1 (f.i. 1.0e-10) if the execution mode is GPU

What steps will reproduce the problem?
Here is a small example of the problem:
Java-code:
213      gauss[ zz ] = 1.0e-10 * rint( 1.0e10 *  v0 * q );

The resulting bytecode from the Eclipse Class File Editor:
    918  aload_0 [this]
    919  getfield de.nsa_gmbh.hs.sixin.SxKernel.gauss : double[] [40]
    922  iload 39 [zz]
    924  ldc2_w <Double 1.0E-10> [142]
    927  aload_0 [this]
    928  ldc2_w <Double 1.0E10> [68]
    931  dload 40 [v0]
    933  dmul
    934  dload 44 [q]
    936  dmul
    937  invokevirtual de.nsa_gmbh.hs.sixin.SxKernel.rint(double) : double [64]
    940  dmul
    941  dastore

The resulting bytecode from my favorite Bytecode-Plugin for Eclipse:
    LINENUMBER 213 L69
    ALOAD 0
    GETFIELD de/nsa_gmbh/hs/sixin/SxKernel.gauss : [D
    ILOAD 39
    LDC 1.0E-10
    ALOAD 0
    LDC 1.0E10
    DLOAD 40
    DMUL
    DLOAD 44
    DMUL
    INVOKEVIRTUAL de/nsa_gmbh/hs/sixin/SxKernel.rint(D)D
    DMUL
    DASTORE

The resulting openCL code for the GPU from Aparapi:
      this->gauss[zz]  = NAN * rint(((1.0E10 * v0) * q));

Another example (without the bytecode):
Java:
      tryAgain = ( min( q, 1D - q ) < 1.0e-8 );

The resulting openCL code for the GPU from Aparapi:
      tryAgain = (fmin(q, (1.0 - q))<NAN)?1:0)

or even much more simple:
Java:
         double th =  1.0e-8;
The resulting openCL code for the GPU from Aparapi:
      double th = NAN;

What is the expected output? What do you see instead?
expected: 1.0E-10 instead: NAN

What version of the product are you using? On what operating system?
It occurs on a Dell Latitude E6520 under MS Windows 7 SP1 with
the Eclipse SDK (Version: 3.7.2, Build id: M20120208-0800).

The graphics card is an NVIDIA Quadro NVS 4200M with the most up-to-date 
driver version and CUDA 4.2 installed.



Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 16 May 2012 at 10:00

Does not detect AMD GPU

What steps will reproduce the problem?
1. Run Aparapi program on AMD Linux 64bit
2. Set device to GPU.

What is the expected output? What do you see instead?
Expected to execute on GPU, runs JTP.

verboseJNI gives:
platform name    0 Advanced Micro Devices, Inc.
platform version 0 OpenCL 1.1 AMD-APP-SDK-v2.5 (684.213)
platform Advanced Micro Devices, Inc. does not support requested device type 
skipping!

What version of the product are you using? On what operating system?
Aparapi 2011-10-13
Linux 3.0.0-12-generic #20-Ubuntu SMP Fri Oct 7 14:56:25 UTC 2011 x86_64 x86_64 
x86_64 GNU/Linux

Please provide any additional information below.
The machine has two Radeon cards, and lshw reports them as this:
*-display
    description: VGA compatible controller
    product: Antilles [AMD Radeon HD 6990]
    vendor: ATI Technologies Inc
    physical id: 0
    bus info: pci@0000:0c:00.0
    version: 00
    width: 64 bits
    clock: 33MHz
    capabilities: pm pciexpress msi vga_controller bus_master cap_list rom
    configuration: driver=fglrx_pci latency=0
    resources: irq:95 memory:d0000000-dfffffff memory:fe9e0000-fe9fffff ioport:e000(size=256) memory:fe9c0000-fe9dffff


Any special tricks I can use to figure out why it fails to recognize the cards?

Original issue reported on code.google.com by [email protected] on 11 Nov 2011 at 11:28

Allow access to clGetDeviceInfo

We are at a point where we need access to a number of items returned by 
clGetDeviceInfo.

Operating without knowledge of the following parameter's return value in 
particular is really giving us grief:


CL_DEVICE_MAX_MEM_ALLOC_SIZE    

Return type: cl_ulong

Max size of memory object allocation in bytes. The minimum value is max (1/4th 
of CL_DEVICE_GLOBAL_MEM_SIZE, 128*1024*1024)


Original issue reported on code.google.com by [email protected] on 7 May 2012 at 11:34

Add OS X build artifacts to svn:ignore and fix build.xml "build" and "clean" targets

While building the com.amd.aparapi.jni project in OS X the following folder is 
created locally and needs to be added to svn:ignore

libaparapi_x86_64.dylib.dSYM

Additionally, the build.xml "clean" target should be configured to delete the 
following:

libaparapi_x86_64.dylib.dSYM
libaparapi_${x86_or_x86_64}.dylib

It also appears that the "build" target was incorrectly calling the "check" 
target instead of the "clean" target (which already depends on check). I have 
included that change as well so now "build" cleans up beforehand correctly.

Please find attached the patch with all of these changes.

Original issue reported on code.google.com by [email protected] on 4 Jan 2012 at 5:11

  • Merged into: #37

Attachments:

build.xml does not build the shared library (aparapi_x86.so or aparapi_x86_64.so)

What steps will reproduce the problem?
Follow the build instruction here: 
https://code.google.com/p/aparapi/wiki/DevelopersGuideLinux

What is the expected output? What do you see instead?
According to the wiki the expected output includes 4 things:
    aparapi.jar containing Aparapi classes for all platforms.
    the shared library for your platform (aparapi_x86.so or aparapi_x86_64.so).
    an /api subdirectory containing the 'public' javadoc for Aparapi.
    a samples directory containing the source and binaries for the mandel and squares sample projects. 

I get everything except the shared library.


What version of the product are you using? On what operating system?
aparapi r388, Ubuntu 11.10 64-bit, Java 7
g++ v4.6.1, ant v1.8.2

Original issue reported on code.google.com by [email protected] on 2 Apr 2012 at 10:12

Issues accessing a single Kernel instance from multiple threads.

What steps will reproduce the problem?
1.Execute a single Kernel instance from many threads
2.
3.

What is the expected output? What do you see instead?
Expect each thread to execute correctly (although we can't expect data 
integrity).

Please use labels and text to provide additional information.
Instead we get a JVM crash.

This is on the latest branch supporting multi-dim kernel access, but I suspect 
the same issue will be in the main branch.

My guess is that we are inadvertently sharing JNIEnv* data across threads. I 
just checked in a potential fix (in the multi-dim branch) but can't really 
test until Monday.

Original issue reported on code.google.com by [email protected] on 15 Jan 2012 at 9:36

Need to support local shared memory

In our Kernel code, we need to have a way to access local shared memory buffers 
for thread access within an individual group.

This would allow us to perform calculations on data stored in memory that is 
local to each thread group, for example in reduction phases of map/reduce and 
would also allow us to use the available Kernel.localBarrier() effectively.

It appears that right now, all variables exist solely in global memory.

Original issue reported on code.google.com by [email protected] on 17 Nov 2011 at 10:29

Deadlock with JTP in executeJava

What steps will reproduce the problem?
1. Create a simple kernel like "data[getGlobalId()]=0"
2. Execute it with mode=JTP and an array of 5 elements

What is the expected output? What do you see instead?
Expected kernel to complete, but instead it hangs forever.

What version of the product are you using? On what operating system?
2011-10-13, or r89 on Win7 x64, but 32bit Java

Please provide any additional information below.
It looks as if the barrier is placed inside the loop that creates the threads. 
If numThreadSets is more than 1, it will need to do an extra loop with 
thrSetId (line 690), but the synchronization happens inside the loop, so it 
will block waiting for the threads to finish, but they have not been created, 
hence a deadlock.

I have attached a patch where I have moved the synchronization outside the 
loop, and that seems to work for me. It should not cause problems because it 
is apparently just waiting for all threads to complete before starting the 
next pass.

Original issue reported on code.google.com by [email protected] on 18 Oct 2011 at 1:53

Attachments:

Integrate FindBugs into the Ant build scripts

I ran the latest Aparapi trunk code (as of today) through FindBugs and it 
exposed 83 areas for investigation and possible improvement, including a number 
of high priority bugs. I have attached the XML output.

Two suggestions for this ticket:

- Include FindBugs as an integral component of the Ant build scripts
- Either fix potential bugs or comment in the code reasons why changes are not 
needed (possibly use FindBugs annotations to avoid Ant output)

Original issue reported on code.google.com by [email protected] on 23 Nov 2011 at 7:47

Attachments:

Doubles generate compiler errors: cl_amd_fp64 should be cl_khr_fp64, NaNs, etc.

What steps will reproduce the problem?
1. Open BlackScholes example code
2. Change all floats to double
3. Compile and run (nvidia quadro 600)

What is the expected output? What do you see instead?

Q600 has native fp64 support, but the compiler generates some errors:

clBuildProgram failed
************************************************
:1:26: warning: unknown '#pragma OPENCL EXTENSION' - ignored
#pragma OPENCL EXTENSION cl_amd_fp64 : enable
                         ^
:4:13: error: must specify '#pragma OPENCL EXTENSION cl_khr_fp64: enable' 
before using 'double'
   __global double *randArray;
            ^
:14:16: error: use of undeclared identifier 'NaN'
   double c2 = NaN;
               ^
:17:16: error: use of undeclared identifier 'NaN'
   double c5 = NaN;
               ^
:25:103: error: use of undeclared identifier 'NaN'
   double y = 1.0 - (((0.3989422917366028 * exp(((-X * X) / 2.0))) * t) * (0.3193815350532532 + (t * (NaN + (t * (1.781477928161621 + (t * (-1.8212559223175049 + (t * NaN)))))))));
                                                                                                      ^
:48:53: error: use of undeclared identifier 'NaN'
      double R = (0.009999999776482582 * inRand) + (NaN * (1.0 - inRand));
                                                    ^
:49:60: error: use of undeclared identifier 'NaN'
      double sigmaVal = (0.009999999776482582 * inRand) + (NaN * (1.0 - inRand));


What version of the product are you using? On what operating system?
aparapi-2011-10-13, windows XP

Please provide any additional information below.
FWIW floats work fine.

Original issue reported on code.google.com by [email protected] on 6 Dec 2011 at 3:48

Create a Kernel library of pre-configured kernels

One thing that I think would be extremely useful and valuable would be if 
Aparapi supplied a library of pre-configured and optimized kernels for end-user 
use. For example, it would be nice to have a library of kernels with 
functionality similar to the library of example code available for CUDA, except 
for OpenCL via Aparapi. This could also help to augment any documentation in 
the Wiki needed to explain each use case.

My main motivation behind this request is the fact that even though all of the 
Aparapi examples appear to use single class files with kernels defined as inner 
classes, in my experience most production use of Aparapi will define kernels in 
separate classes which are then instantiated and executed from somewhere in the 
application code. There have been a number of times it would have been nice if 
there was a pre-configured XYZ kernel to use instead of writing one from 
scratch (after investigating the necessary logic in OpenCL or CUDA).

Original issue reported on code.google.com by [email protected] on 23 Jan 2012 at 2:10

CPU still beats GPU by 4x in bitonic sort; what could be the reason?



What steps will reproduce the problem?
  Well, I went through the forum page http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=141035
so I coded bitonic sort and tried to achieve the best performance.
The code is attached and is somewhat able to do better than the posted one.



What is the expected output? What do you see instead?
  I was expecting the GPU to do better than the CPU, but in the end the CPU still beats the GPU by 4x.

What version of the product are you using? On what operating system?
I am using Windows 7 x64. Java JDK 7 and latest Aparapi.
Hardware used are:-
Intel(R) Core(TM) i3 CPU M 370 @ 2.40GHz ( or Intel Core i3 370M)
ATI Mobility Radeon HD 5400 Series(1 GB Memory) GPU 
4GB DDR3 RAM


Please provide any additional information below.
The same code runs on both CPU and GPU, with array size = 4194304 and each 
element less than 1000000.

Got results in 2.2 seconds with the CPU, while the GPU takes 11 seconds.

Vivek Kumar Chaubey


Original issue reported on code.google.com by [email protected] on 17 Dec 2011 at 2:31

Attachments:

Aparapi does not support OpenCL 1.2

When we try to use a new AMD Radeon HD 7970 GPU, which ships with OpenCL 1.2, 
we receive the following exception:


May 2, 2012 12:04:29 PM com.amd.aparapi.KernelRunner warnFallBackAndExecute
WARNING: Reverting to Java Thread Pool (JTP) for class abc.def.Kernel: initJNI 
failed to return a valid handle

Original issue reported on code.google.com by [email protected] on 3 May 2012 at 3:03

Add support for multiple GPUs

We need Aparapi to support multiple GPUs instead of just the first GPU it 
finds available.

We have two use cases for this:

- A single workstation with multiple GPUs located on separate cards
- A cluster of computers with 2+ GPUs per node

It would be ideal if we did not have to specify specific information about our 
environment and Aparapi/OpenCL would automatically partition the work and 
distribute it out as required.

I believe both CUDA 4.0+ and OpenCL 1.1+ support multi-threaded multi-GPU 
environments.

I've attached a small presentation related to OpenCL 1.1 multi-GPU enhancements 
given at 2011 SigGraph.

Of course, it would be nice to see an AMD presentation of this same material 
highlighting Aparapi :)

Original issue reported on code.google.com by [email protected] on 25 Nov 2011 at 1:20

Attachments:
