compute's Introduction

Boost.Compute

Boost.Compute is a GPU/parallel-computing library for C++ based on OpenCL.

The core library is a thin C++ wrapper over the OpenCL API and provides access to compute devices, contexts, command queues and memory buffers.

On top of the core library is a generic, STL-like interface providing common algorithms (e.g. transform(), accumulate(), sort()) along with common containers (e.g. vector<T>, flat_set<T>). It also features a number of extensions including parallel-computing algorithms (e.g. exclusive_scan(), scatter(), reduce()) and a number of fancy iterators (e.g. transform_iterator<>, permutation_iterator<>, zip_iterator<>).
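
To make the semantics of two of the parallel primitives named above concrete, here is a sequential host-side sketch of what exclusive_scan() and scatter() compute. The names exclusive_scan_ref and scatter_ref are made up for illustration and are not part of Boost.Compute; the device versions may differ in signature and defaults.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference for an exclusive scan: out[i] is `init` plus the sum
// of in[0..i-1], so out[0] == init.
std::vector<int> exclusive_scan_ref(const std::vector<int>& in, int init)
{
    std::vector<int> out(in.size());
    int running = init;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}

// Sequential reference for a scatter: out[map[i]] = in[i].
std::vector<int> scatter_ref(const std::vector<int>& in,
                             const std::vector<std::size_t>& map,
                             std::size_t out_size)
{
    assert(in.size() == map.size());
    std::vector<int> out(out_size, 0);
    for (std::size_t i = 0; i < in.size(); ++i) {
        assert(map[i] < out_size);
        out[map[i]] = in[i];
    }
    return out;
}
```

With input {1, 2, 3, 4} and init 0 the scan yields {0, 1, 3, 6}; scattering {10, 20, 30} through the map {2, 0, 1} yields {20, 30, 10}.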

The full documentation is available at http://boostorg.github.io/compute/.

Example

The following example shows how to sort a vector of floats on the GPU:

#include <vector>
#include <algorithm>
#include <boost/compute.hpp>

namespace compute = boost::compute;

int main()
{
    // get the default compute device
    compute::device gpu = compute::system::default_device();

    // create a compute context and command queue
    compute::context ctx(gpu);
    compute::command_queue queue(ctx, gpu);

    // generate random numbers on the host
    std::vector<float> host_vector(1000000);
    std::generate(host_vector.begin(), host_vector.end(), rand);

    // create vector on the device
    compute::vector<float> device_vector(1000000, ctx);

    // copy data to the device
    compute::copy(
        host_vector.begin(), host_vector.end(), device_vector.begin(), queue
    );

    // sort data on the device
    compute::sort(
        device_vector.begin(), device_vector.end(), queue
    );

    // copy data back to the host
    compute::copy(
        device_vector.begin(), device_vector.end(), host_vector.begin(), queue
    );

    return 0;
}

Boost.Compute is a header-only library, so no linking is required. The example above can be compiled with:

g++ -I/path/to/compute/include sort.cpp -lOpenCL

More examples can be found in the tutorial and under the examples directory.

Support

Questions about the library (both usage and development) can be posted to the mailing list.

Bugs and feature requests can be reported through the issue tracker.

Also feel free to send me an email with any problems, questions, or feedback.

Help Wanted

The Boost.Compute project is currently looking for additional developers with interest in parallel computing.

Please send an email to Kyle Lutz ([email protected]) for more information.

compute's People

Contributors

9prady9, amirshavit, aneeshaidertp, banche, bkchr, chesterkuo, ddemidov, dpaveldev, edbaunton, flast, henryiii, huanzhang12, jagerman, jmr1, jpola, jszuppe, junmuz, keryell, kristian-popov, kylelutz, lakshayg, mageswaran1989, mwyborski, okdshin, rosenrodt, roshanr95, shehzan10, thtrummer, vaa-msu, zialus

compute's Issues

Add .x, .y, .z accessors to vector types

Make it possible to access components of vector types (e.g. float4_, int8_) by name (e.g. .x, .y).

For example, this should be possible:

boost::compute::float4_ v;
v.x = 1;
v.y = 2;
v.z = 3;
v.w = 4;
assert(v == boost::compute::float4_(1, 2, 3, 4));

Note that names should only be available if the type actually has enough values (e.g. float2_ should not have a .z field).
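
One way to picture the request is with plain structs whose members are the component names, plus a compile-time check that a two-component type has no .z. This is only a sketch: float2_sketch, float4_sketch and has_z are hypothetical names, and the real float2_/float4_ types may be laid out quite differently.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins for float2_ and float4_ with named components.
struct float2_sketch
{
    float x, y;
    float2_sketch() : x(0), y(0) {}
    float2_sketch(float x_, float y_) : x(x_), y(y_) {}
};

struct float4_sketch
{
    float x, y, z, w;
    float4_sketch() : x(0), y(0), z(0), w(0) {}
    float4_sketch(float x_, float y_, float z_, float w_)
        : x(x_), y(y_), z(z_), w(w_) {}
};

inline bool operator==(const float4_sketch& a, const float4_sketch& b)
{
    return a.x == b.x && a.y == b.y && a.z == b.z && a.w == b.w;
}

// Detects at compile time whether T has a member named z.
template <class T, class = void>
struct has_z : std::false_type {};

template <class T>
struct has_z<T, decltype(void(std::declval<T&>().z))> : std::true_type {};

static_assert(has_z<float4_sketch>::value, "four components expose .z");
static_assert(!has_z<float2_sketch>::value, "two components must not expose .z");

// The snippet from the issue, written against the sketch type.
inline void check_accessors()
{
    float4_sketch v;
    v.x = 1;
    v.y = 2;
    v.z = 3;
    v.w = 4;
    assert(v == float4_sketch(1, 2, 3, 4));
}
```

Because the names are ordinary members, using .z on the two-component type is a compile error rather than a runtime failure.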

Add bind() function

Add support for a bind() function which can compose functions like std/boost::bind().

For example, the user should be able to write code like the following to compute the cube of each value:

transform(vec.begin(), vec.end(), vec.begin(), bind(pow<float>(), _1, 3.0f), queue);
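
For comparison, this is how the same composition looks on the host with std::bind. It is a sketch of the intended semantics only: pow_float and cube_all are hypothetical helpers, and nothing here runs on a compute device.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <functional>
#include <vector>

// Host-side functor standing in for a device pow<float>() function object.
struct pow_float
{
    float operator()(float base, float exponent) const
    {
        return std::pow(base, exponent);
    }
};

// Cubes every element by binding the exponent to 3 and leaving the base
// as the placeholder _1.
std::vector<float> cube_all(std::vector<float> values)
{
    using namespace std::placeholders;
    std::transform(values.begin(), values.end(), values.begin(),
                   std::bind(pow_float(), _1, 3.0f));
    return values;
}

// Self-check: cubing 2 and 3 gives 8 and 27 (within float rounding).
inline void check_cube_all()
{
    const std::vector<float> r = cube_all({2.0f, 3.0f});
    assert(std::fabs(r[0] - 8.0f) < 1e-3f);
    assert(std::fabs(r[1] - 27.0f) < 1e-3f);
}
```

A device-side bind() would need to turn the same expression into kernel source instead of calling the functor directly.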

Some examples fail to build with visual studio 2010

Hey Kyle,

I started to look into building the compute library under Windows. Some of the examples don't build. I assume that they should? I haven't looked too closely into this yet and wanted to find out whether anything has been done under Windows so far. Have you tried building under Windows, or do you know if other people have? I can't build the tests yet because of a CMake problem that I haven't been able to solve. I hope to sort that out soon.

Cheers,
Dominic

"clRetainDevice" maybe NULL

I have an Intel OpenCL implementation, which supports OpenCL 1.2, and an NVIDIA one, which supports OpenCL 1.0, on my computer. I always get an error in clRetainDevice when the "hello world" example runs.
I linked against the Intel OpenCL library.

Fix crash when build() fails with binary program

When compiling with BOOST_COMPUTE_DEBUG_KERNEL_COMPILATION, calls to build() which fail will cause the program source and build log to be printed to stderr.

However, when the program was loaded from a binary no source exists and thus the call to source() made by the debugging output will result in an exception being thrown. This case should be checked for and reported instead of crashing with the exception.
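
The check described above could look roughly like the following. The sketch uses a simplified, hypothetical program_sketch struct rather than the real program class, and assumes an empty source string marks a program created from a binary.

```cpp
#include <cassert>
#include <string>

// Hypothetical, simplified stand-in for a program object. `source` is empty
// when the program was created from a binary instead of from source text.
struct program_sketch
{
    std::string source;
    std::string build_log;
};

// Formats the debug report for a failed build without assuming that the
// program source is available.
std::string describe_build_failure(const program_sketch& p, int error_code)
{
    assert(error_code != 0); // 0 is CL_SUCCESS, so there is nothing to report

    std::string msg =
        "kernel compilation failed (" + std::to_string(error_code) + ")\n";
    msg += "--- source ---\n";
    if (p.source.empty()) {
        msg += "<not available: program was created from a binary>\n";
    } else {
        msg += p.source + "\n";
    }
    msg += "--- build log ---\n" + p.build_log + "\n";
    return msg;
}
```

The point is only that the missing source is reported as text in the debug output instead of being requested unconditionally.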

Fatal Error: boost/move/move.hpp

Hi, when I compile test_device.hpp, I am getting this error. The location of move.hpp in my system is at /usr/include/boost/variant/detail/move.hpp

compilation issue

Having the following compilation error:

///////////////////////////////////////////

g++ -O2 -g -o main.o -c main.cpp -I/usr/local/cuda-5.5/include -I/home/sajjad/Downloads/OpenCL/library/clMath/clFFT/clFFT-develop/src/include -I/home/sajjad/Downloads/OpenCL/library/BoostCompute/compute-master/include
In file included from /home/sajjad/Downloads/OpenCL/library/BoostCompute/compute-master/include/boost/compute/system.hpp:21:0,
from /home/sajjad/Downloads/OpenCL/library/BoostCompute/compute-master/include/boost/compute/algorithm/accumulate.hpp:16,
from /home/sajjad/Downloads/OpenCL/library/BoostCompute/compute-master/include/boost/compute/algorithm.hpp:18,
from /home/sajjad/Downloads/OpenCL/library/BoostCompute/compute-master/include/boost/compute.hpp:14,
from main.cpp:3:
/home/sajjad/Downloads/OpenCL/library/BoostCompute/compute-master/include/boost/compute/device.hpp:17:31: fatal error: boost/move/move.hpp: No such file or directory
compilation terminated.
make: *** [main.o] Error 1

///////////////////////////////////////////////////////

I checked the directory structure and neither the directory nor the file exists.

Any hint to get around this issue?

Thanks

A Bug inside the acquire.hpp

Hi ,

I am getting an error at the function "void opengl_enqueue_release_gl_objects" even though no reference has been made to the function.

The issue has been solved by putting an inline before the function definition in the source.

I am not sure, though, whether that is a proper fix or just a hack.

I look forward to this being addressed by the authors.

Thanks

Support calling custom functions from lambda expressions

It should be possible to call custom functions from lambda expressions.

For example, this should be possible:

BOOST_COMPUTE_FUNCTION(float, square, (float x),
{
    return x * x;
});

transform(vec.begin(), vec.end(), vec.begin(), square(2 * _1), queue);

compute::sort fails with as little as 33 items

Using https://github.com/kylelutz/compute/blob/master/example/sort_vector.cpp
we can make it fail by initializing host_vector with more than 32 items.
Tested on Mac OS X with an HD4000 and on Linux with an NVIDIA card.

On mac osx:

input: [ 7, 49, 73, 58, 30, 72, 44, 78, 23, 9, 40, 65, 92, 42, 87, 3, 27, 29, 40, 12, 3, 69, 9, 57, 60, 33, 99, 78, 16, 35, 97, 26, 12 ]
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::compute::context_error> >'
  what():  [CL_DEVICE_NOT_AVAILABLE] : OpenCL Error : Error: Build Program driver returned (10015)
[1]    92979 abort      ./sort_vector.osx

On linux/nvidia:

input: [ 83, 86, 77, 15, 93, 35, 86, 92, 49, 21, 62, 27, 90, 59, 63, 26, 40, 26, 72, 36, 11, 68, 67, 29, 82, 30, 62, 23, 67, 35, 29, 2, 22 ]
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::compute::runtime_exception> >'
  what():  Build Program Failure
Stack dump:
[1]    19134 segmentation fault (core dumped)  ./sort_vector.linux

Easy c&p reproduce:
https://gist.github.com/hansbogert/10975461

Linking Error

Hi,
When I ran g++ -I/usr/local/cuda-5.5/include -I/home/frost/Documents/compute/include test_system.cpp -lOpenCL, I got the error shown in the screenshot.

screenshot from 2014-01-05 21 43 12

Tests fail on NVIDIA hardware when compute mode != default

Hello Kyle,

I am not sure about severity of this issue: some tests fail on NVIDIA GPUs when the GPUs are not in DEFAULT compute mode (that is, either EXCLUSIVE_THREAD or EXCLUSIVE_PROCESS).

$ sudo nvidia-smi -c 3
Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:04:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:05:00.0.
Set compute mode to EXCLUSIVE_PROCESS for GPU 0000:08:00.0.
All done.

$ BOOST_COMPUTE_DEFAULT_DEVICE="Tesla C2075" make test
...
The following tests FAILED:
      8 - core.kernel (Failed)
     13 - algorithm.accumulate (Failed)
     18 - algorithm.copy (Failed)
     20 - algorithm.count (Failed)
     31 - algorithm.inplace_reduce (Failed)
     38 - algorithm.partition (Failed)
     41 - algorithm.reduce (Failed)
     45 - algorithm.scan (Failed)
     47 - algorithm.sort (Failed)
     49 - algorithm.transform (Failed)
     52 - container.array (Failed)
     54 - container.flat_set (Failed)
     66 - blas.gemm (Failed)
     70 - ext.complex (Failed)
     74 - ext.tuple (Failed)
...
$ ./test_kernel 
Running 3 test cases...
/home/demidov/work/opencl/compute/include/boost/compute/context.hpp(45): fatal error in "boost::compute::context::context(const boost::compute::device&, const cl_context_properties*)": std::exception: Invalid Device
/home/demidov/work/opencl/compute/test/test_kernel.cpp(49): last checkpoint

*** 1 failure detected in test suite "TestKernel"

All of these tests have several cases inside them, and each case initializes its own OpenCL context. But some tests with multiple cases do pass (e.g. core.buffer). Is it possible that some resources are not freed in the failing test cases, so that the OpenCL context is never destroyed there?

The issue could in principle be solved by introducing a global fixture that holds a single OpenCL context.

This would of course not be a problem in a typical application, since an OpenCL context would only be created once there. But it looks like there is a potential for memory leaks.
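
The global-fixture idea can be sketched without any OpenCL at all: every test case asks for one shared object instead of constructing its own. context_sketch and shared_context are hypothetical names; in a real test suite the same role could be played by a Boost.Test global fixture holding the actual context.

```cpp
#include <cassert>

// Hypothetical stand-in for a resource that may exist only once at a time,
// such as a context on a device in an exclusive compute mode.
struct context_sketch
{
    static int& live_count()
    {
        static int n = 0;
        return n;
    }

    context_sketch()
    {
        assert(live_count() == 0); // a second live context would be an error
        ++live_count();
    }

    ~context_sketch() { --live_count(); }

    context_sketch(const context_sketch&) = delete;
    context_sketch& operator=(const context_sketch&) = delete;
};

// Every test case calls this instead of constructing its own context.
// The object is created on first use and destroyed at program exit.
context_sketch& shared_context()
{
    static context_sketch ctx;
    return ctx;
}
```

With this arrangement the number of live contexts stays at one no matter how many test cases run, which sidesteps the question of whether each case releases its context in time.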

Allow custom argument names with BOOST_COMPUTE_FUNCTION()

Right now the BOOST_COMPUTE_FUNCTION() macro automatically generates argument names of the form _1, _2, ..., _N. It should be possible to assign meaningful names to function arguments instead of always requiring the use of the auto-generated names.
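
One possible mechanism is to let the macro take the parenthesized argument list as a single parameter and stringify it together with the body. The macro below is a hypothetical sketch named SKETCH_FUNCTION_SOURCE; it only builds the source text and says nothing about how BOOST_COMPUTE_FUNCTION() is or will be implemented.

```cpp
#include <cassert>
#include <string>

// Hypothetical macro: stringifies the return type, name, parenthesized
// argument list and body into source text. The variadic tail keeps bodies
// that contain commas in one piece.
#define SKETCH_FUNCTION_SOURCE(return_type, name, args, ...) \
    (std::string(#return_type " " #name #args "\n") + #__VA_ARGS__)

// A function whose arguments carry meaningful names instead of _1, _2.
inline std::string add_scaled_source()
{
    return SKETCH_FUNCTION_SOURCE(float, add_scaled, (float value, float scale),
    {
        return value + value * scale;
    });
}

// Self-check: the user-chosen argument names survive into the source text.
inline void check_named_arguments()
{
    const std::string src = add_scaled_source();
    assert(src.find("float add_scaled(float value, float scale)") != std::string::npos);
    assert(src.find("return value + value * scale;") != std::string::npos);
}
```

Because the argument list is wrapped in parentheses, its commas do not split it into several macro arguments.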

Add offline program cache support

Add support for caching program binaries offline. NVIDIA's OpenCL implementation already does this automatically (it stores binaries in ~/.nv/ComputeCache) but this would be a big win for other implementations (e.g. AMD, Intel).
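
A file-based cache along these lines could key each binary on the device name, build options and source text. Everything in this sketch is an assumption made for illustration: the function names, the key layout, the ".bin" suffix and the use of FNV-1a as the hash. The compile step is passed in as a callback so the sketch stays independent of OpenCL.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <functional>
#include <iterator>
#include <sstream>
#include <string>

// FNV-1a hash; chosen because its result is the same on every run and
// platform, which std::hash does not promise.
std::uint64_t fnv1a(const std::string& text)
{
    std::uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned char c : text) {
        h ^= c;
        h *= 0x100000001b3ULL;
    }
    return h;
}

// Cache file for one (device, options, source) combination.
std::string cache_path(const std::string& cache_dir, const std::string& device_name,
                       const std::string& options, const std::string& source)
{
    std::ostringstream name;
    name << cache_dir << '/' << std::hex
         << fnv1a(device_name + '\n' + options + '\n' + source) << ".bin";
    return name.str();
}

// Returns the cached binary if the file exists; otherwise calls `compile`,
// stores its result at `path` and returns it.
std::string load_or_build(const std::string& path,
                          const std::function<std::string()>& compile)
{
    assert(!path.empty());
    std::ifstream in(path, std::ios::binary);
    if (in) {
        return std::string(std::istreambuf_iterator<char>(in),
                           std::istreambuf_iterator<char>());
    }
    const std::string binary = compile();
    std::ofstream out(path, std::ios::binary);
    out.write(binary.data(), static_cast<std::streamsize>(binary.size()));
    return binary;
}

// Drops a cached entry, e.g. after a driver update.
void invalidate(const std::string& path)
{
    std::remove(path.c_str());
}
```

A real implementation would also have to decide where the cache lives and when entries become stale; the sketch leaves both to the caller.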

segmentation fault when calling clRetainDevice in boost::compute::device::device(cl_device_id id, bool retain)

On another machine it works, but on my workstation I get a segmentation fault (SIGSEGV) when calling clRetainDevice in boost::compute::device::device(cl_device_id id, bool retain). Other OpenCL applications work on this machine, but they might not call clRetainDevice().

If I comment out the calls to clRetainDevice() and clReleaseDevice(), I can run some examples, while others still segfault:
$ example/hello_world
hello from GeForce 9800 GT
$ example/list_devices
Platform 'NVIDIA CUDA'
GPU Device: GeForce 9800 GT
$ examples/vector_addition
SIGSEGV

That's what other OpenCL apps report about the device:
NVIDIA CUDA
NVIDIA Corporation
cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll
GeForce 9800 GT
NVIDIA Corporation
cl_khr_byte_addressable_store cl_khr_icd cl_khr_gl_sharing cl_nv_compiler_options cl_nv_device_attribute_query cl_nv_pragma_unroll cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics

That's on ubuntu 13.04

User defined data type support in Compute.

Hi,
I created a simple test case for sorting user-defined data types, but it produces a compilation error with the code shown below.
Do you plan to support user-defined data types, or are all the algorithms in Compute restricted to the basic OpenCL data types?
Am I missing something in the code below?
Regards,
Banger.

/****** Code Snippet ******/

// standard headers needed by the code below (names assumed)
#include <vector>
#include <algorithm>
#include <cstdlib>
#include <boost/compute.hpp>

class UDD
{
public:
    int a;
    int b;
    float c;
    bool operator() (const UDD &lhs, const UDD &rhs)
    {
        return lhs.a < rhs.a;
    }
};

UDD rand_UDD()
{
    UDD temp;
    temp.a = rand() % 100;
    temp.b = rand() % 100;
    temp.c = (float)(rand() % 100) / 1.3;
    return temp;
}

// this example attempts to sort a vector of UDD values on the GPU
int main()
{
    // create vector of random values on the host
    std::vector<UDD> host_vector(10);
    std::generate(host_vector.begin(), host_vector.end(), rand_UDD);

    // transfer the values to the device
    boost::compute::vector<UDD> device_vector = host_vector;

    // sort the values on the device
    boost::compute::sort(device_vector.begin(), device_vector.end());
}

Implement set algorithms

Implement the various set algorithms from the STL. These algorithms take advantage of sorted inputs and can be accelerated on the GPU.

The algorithms are:

  • set_difference()
  • set_intersection()
  • set_union()
  • set_symmetric_difference()
  • includes()
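
As a reference for the expected results, the host-side STL versions of the five algorithms can be exercised as follows. The set_results struct and reference_set_ops function are illustrative names only; a device implementation would presumably be checked against output like this.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Results of all five STL set algorithms for one pair of sorted ranges.
struct set_results
{
    std::vector<int> difference;
    std::vector<int> intersection;
    std::vector<int> set_union;
    std::vector<int> symmetric_difference;
    bool includes;
};

set_results reference_set_ops(const std::vector<int>& a, const std::vector<int>& b)
{
    // All of these algorithms require sorted input ranges.
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));

    set_results r;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(r.difference));
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(r.intersection));
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(r.set_union));
    std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(),
                                  std::back_inserter(r.symmetric_difference));
    // True only if every element of b also occurs in a.
    r.includes = std::includes(a.begin(), a.end(), b.begin(), b.end());
    return r;
}
```

For a = {1, 2, 3, 4, 5} and b = {2, 4, 6} this gives difference {1, 3, 5}, intersection {2, 4}, union {1, 2, 3, 4, 5, 6}, symmetric difference {1, 3, 5, 6}, and includes is false because 6 is not in a.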

test_copy_if build failure

I get the following error message:

[ 32%] Building CXX object test/CMakeFiles/test_copy_if.dir/test_copy_if.cpp.o
cd /home/scratch/boost.compute/build/test && /usr/bin/c++ -DBOOST_COMPUTE_DEBUG_KERNEL_COMPILATION -DBOOST_TEST_DYN_LINK -g -I/usr/local/cuda/include -I/home/dmeiser/software/boost-1_53_0-ser/include -I/home/scratch/boost.compute/include -I/home/scratch/boost.compute/test/../include -o CMakeFiles/test_copy_if.dir/test_copy_if.cpp.o -c /home/scratch/boost.compute/test/test_copy_if.cpp
/home/scratch/boost.compute/include/boost/compute/functional/geometry.hpp: In member function ‘void clip_points_below_plane::test_method()’:
/home/scratch/boost.compute/include/boost/compute/functional/geometry.hpp:21: error: ‘template class boost::compute::dot’ is not a function,
/home/scratch/boost.compute/include/boost/compute/lambda/functional.hpp:222: error: conflict with ‘template<class Arg1, class Arg2> const typename boost::proto::result_of::make_expr<boost::proto::tagns_::tag::function, boost::compute::lambda::detail::dot_func, const Arg1&, const Arg2&, void, void, void, void, void, void, void, void, void>::type boost::compute::lambda::dot(const Arg1&, const Arg2&)’
/home/scratch/boost.compute/test/test_copy_if.cpp:106: error: in call to ‘dot’

This could be a gcc issue (I'm trying to build with gcc 4.4.6). Will try to build with newer gcc and clang

Improve inner_product performance

inner_product needs significant improvement before it is competitive with the STL. Currently it takes about 20x as long as the STL version.

PR #69 makes the implementation more native, and with it I measured a performance gain of about 10x.

However, it is still 2x slower than the STL version. I am trying out a vector-based implementation to make use of SIMD while multiplying/adding. Any other ideas?
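
For readers unfamiliar with how an inner product maps onto parallel primitives, the usual decomposition is an element-wise multiply followed by a pairwise (tree) reduction. The host-side sketch below illustrates that structure only; it is not the implementation from PR #69, and inner_product_tree is a made-up name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

float inner_product_tree(const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());

    // Step 1 (a transform on a device): element-wise products.
    std::vector<float> partial(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        partial[i] = a[i] * b[i];
    }

    // Step 2 (a reduce on a device): each round adds the upper half of the
    // active range onto the lower half, so the range shrinks by about half.
    for (std::size_t n = partial.size(); n > 1; n = (n + 1) / 2) {
        const std::size_t half = (n + 1) / 2;
        for (std::size_t i = 0; i + half < n; ++i) {
            partial[i] += partial[i + half];
        }
    }
    return partial.empty() ? 0.0f : partial[0];
}
```

On a device, the additions within one round are independent of each other and can run in parallel. Note that the summation order differs from std::inner_product, so floating-point results can differ in the last bits.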

Externally initialized opencl buffers

Hello Kyle!

Is it possible to use externally initialized OpenCL contexts, queues, and buffers with Boost.Compute algorithms?

It seems that it should be possible with boost::compute::make_buffer_iterator(), but boost::compute::buffer only provides protected constructor from cl_mem.

Best regards,
Denis

Bug in reduce_on_gpu.hpp

gid (reduce_on_gpu.hpp:38) must be commented out to avoid "unused variable" warning by OpenCL compiler.

What compilers, operating systems, and video cards(devices) are supported?

Greetings.
What compilers, operating systems, and video cards (devices) are supported by Boost.Compute?

It would be nice if the main page was something like:
Supported:

  • Compilers: GCC >= 4.7.2, MSVC >= 11, Clang >= 3.3
  • OS: Win, Linux, Unix, XNU
  • Devices: GeForce / Tesla (GF1xx, GK1xx), Radeon XXX, Xeon Phi, All MultiCores x86_64 CPUs

Also, which of these are only supposed to be supported, and which have actually been tested?

Regards, Alexey

Boost.compute does not compile with clang++ -std=c++11

Hi,

The compilation succeeds with default settings, but fails if c++11 standard is used:

$ cd compute/build
$ CC=clang CXX=clang++ cmake -DCMAKE_CXX_FLAGS=-std=c++11 -DBOOST_COMPUTE_BUILD_BENCHMARKS=ON -DBOOST_COMPUTE_BUILD_EXAMPLES=ON -DBOOST_COMPUTE_BUILD_TESTS=ON ..
-- The C compiler identification is Clang 3.2.0
-- The CXX compiler identification is Clang 3.2.0
-- Check for working C compiler: /usr/bin/clang
-- Check for working C compiler: /usr/bin/clang -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/clang++
-- Check for working CXX compiler: /usr/bin/clang++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Found OpenCL: /usr/lib64/libOpenCL.so  
-- Boost version: 1.52.0
-- Boost version: 1.52.0
-- Found the following Boost libraries:
--   unit_test_framework
-- Configuring done
-- Generating done
-- Build files have been written to: /home/demidov/work/opencl/compute/build

$ make 
Scanning dependencies of target copy_data
[  1%] Building CXX object example/CMakeFiles/copy_data.dir/copy_data.cpp.o
In file included from /home/demidov/work/opencl/compute/example/copy_data.cpp:14:
In file included from /home/demidov/work/opencl/compute/include/boost/compute.hpp:14:
In file included from /home/demidov/work/opencl/compute/include/boost/compute/algorithm.hpp:16:
In file included from /home/demidov/work/opencl/compute/include/boost/compute/algorithm/adjacent_find.hpp:17:
In file included from /home/demidov/work/opencl/compute/include/boost/compute/algorithm/find.hpp:14:
In file included from /home/demidov/work/opencl/compute/include/boost/compute/lambda.hpp:14:
/home/demidov/work/opencl/compute/include/boost/compute/lambda/context.hpp:360:58: error: no type named 'type' in
      'boost::result_of<boost::compute::lambda::expression<boost::proto::exprns_::expr<boost::proto::tagns_::tag::terminal,
      boost::proto::argsns_::term<boost::compute::lambda::placeholder<0> >, 0> > ()>'
        typename ::boost::result_of<expression<Expr>()>::type
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
/home/demidov/work/opencl/compute/include/boost/compute/lambda/placeholders.hpp:27:58: note: in instantiation of template class
      'boost::compute::lambda::expression<boost::proto::exprns_::expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<boost::compute::lambda::placeholder<0> >, 0>
      >' requested here
expression<proto::terminal<placeholder<0> >::type> const _1;
                                                         ^
In file included from /home/demidov/work/opencl/compute/example/copy_data.cpp:14:
In file included from /home/demidov/work/opencl/compute/include/boost/compute.hpp:14:
In file included from /home/demidov/work/opencl/compute/include/boost/compute/algorithm.hpp:16:
In file included from /home/demidov/work/opencl/compute/include/boost/compute/algorithm/adjacent_find.hpp:17:
In file included from /home/demidov/work/opencl/compute/include/boost/compute/algorithm/find.hpp:14:
In file included from /home/demidov/work/opencl/compute/include/boost/compute/lambda.hpp:14:
/home/demidov/work/opencl/compute/include/boost/compute/lambda/context.hpp:360:58: error: no type named 'type' in
      'boost::result_of<boost::compute::lambda::expression<boost::proto::exprns_::expr<boost::proto::tagns_::tag::terminal,
      boost::proto::argsns_::term<boost::compute::lambda::placeholder<1> >, 0> > ()>'
        typename ::boost::result_of<expression<Expr>()>::type
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~
/home/demidov/work/opencl/compute/include/boost/compute/lambda/placeholders.hpp:28:58: note: in instantiation of template class
      'boost::compute::lambda::expression<boost::proto::exprns_::expr<boost::proto::tagns_::tag::terminal, boost::proto::argsns_::term<boost::compute::lambda::placeholder<1> >, 0>
      >' requested here
expression<proto::terminal<placeholder<1> >::type> const _2;
                                                         ^
2 errors generated.
make[2]: *** [example/CMakeFiles/copy_data.dir/copy_data.cpp.o] Error 1
make[1]: *** [example/CMakeFiles/copy_data.dir/all] Error 2
make: *** [all] Error 2

Radix Sort not compiling

I ran the example on the main GitHub page, and the radix sort fails to build. I'm getting a return value of -11, which is CL_BUILD_PROGRAM_FAILURE.
Debug shows:
Boost.Compute: kernel compilation failed (-11)
--- source ---

#define K2 (1 << K)
#define RADIX_MASK ((((T)(1)) << K) - 1)
#define SIGN_BIT ((sizeof(T) * CHAR_BIT) - 1)

inline uint radix(const T x, const uint low_bit)
{
#if defined(IS_FLOATING_POINT)
const T mask = -(x >> SIGN_BIT) | (((T)(1)) << SIGN_BIT);
return ((x ^ mask) >> low_bit) & RADIX_MASK;
#elif defined(IS_SIGNED)
return ((x ^ (((T)(1)) << SIGN_BIT)) >> low_bit) & RADIX_MASK;
#else
return (x >> low_bit) & RADIX_MASK;
#endif
}
__kernel void count(__global const T *input,
const uint input_size,
__global uint *global_counts,
__global uint *global_offsets,
__local uint *local_counts,
const uint low_bit)
{
const uint gid = get_global_id(0);
const uint lid = get_local_id(0);
if(lid < K2){
local_counts[lid] = 0;
}
barrier(CLK_LOCAL_MEM_FENCE);
if(gid < input_size){
T value = input[gid];
uint bucket = radix(value, low_bit);
atomic_inc(local_counts + bucket);
}
barrier(CLK_LOCAL_MEM_FENCE);
if(lid < K2){
global_counts[K2*get_group_id(0) + lid] = local_counts[lid];
if(get_group_id(0) == (get_num_groups(0) - 1)){
global_offsets[lid] = local_counts[lid];
}
}
}
__kernel void scan(__global const uint *block_offsets,
__global uint *global_offsets,
const uint block_count)
{
__global const uint *last_block_offsets =
block_offsets + K2 * (block_count - 1);
uint sum = 0;
for(uint i = 0; i < K2; i++){
uint x = global_offsets[i] + last_block_offsets[i];
global_offsets[i] = sum;
sum += x;
}
}
__kernel void scatter(__global const T *input,
const uint input_size,
const uint low_bit,
__global const uint *counts,
__global const uint *global_offsets,
__global T *output)
{
const uint gid = get_global_id(0);
const uint lid = get_local_id(0);
T value;
uint bucket;
__local uint local_input[BLOCK_SIZE];
if(gid < input_size){
value = input[gid];
bucket = radix(value, low_bit);
local_input[lid] = bucket;
}
__local uint local_counts[(1 << K)];
if(lid < K2){
local_counts[lid] = counts[get_group_id(0) * K2 + lid];
}
barrier(CLK_LOCAL_MEM_FENCE);
if(gid >= input_size){
return;
}
uint offset = global_offsets[bucket] + local_counts[bucket];
uint local_offset = 0;
for(uint i = 0; i < lid; i++){
if(local_input[i] == bucket)
local_offset++;
}
output[offset + local_offset] = value;
}

--- build log ---
error: definition of macro 'K' conflicts with an identifier used in the precompiled header
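
For orientation, the three kernels in the dump (count, scan, scatter) correspond to the three steps of one least-significant-digit radix sort pass. The sequential host-side sketch below follows the same steps for unsigned 32-bit keys, treating the whole input as a single block. It is an illustration, not the library's code; RADIX_BITS and BUCKETS take the place of K and K2, and whether renaming the macro would resolve the reported build failure is not something the log establishes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

const unsigned RADIX_BITS = 4;             // plays the role of K
const unsigned BUCKETS = 1u << RADIX_BITS; // plays the role of K2
static_assert(32 % RADIX_BITS == 0, "passes must cover all 32 bits exactly");

// Unsigned branch of radix(): the digit of x that starts at low_bit.
inline std::uint32_t radix_digit(std::uint32_t x, unsigned low_bit)
{
    return (x >> low_bit) & (BUCKETS - 1);
}

// One stable pass: count digits, exclusive-scan the counts, scatter.
std::vector<std::uint32_t> radix_pass(const std::vector<std::uint32_t>& input,
                                      unsigned low_bit)
{
    assert(low_bit < 32);

    // count: how many keys fall into each bucket
    std::vector<std::size_t> offsets(BUCKETS, 0);
    for (std::uint32_t v : input) {
        ++offsets[radix_digit(v, low_bit)];
    }

    // scan: turn counts into the starting offset of each bucket
    std::size_t sum = 0;
    for (std::size_t& o : offsets) {
        const std::size_t count = o;
        o = sum;
        sum += count;
    }

    // scatter: place each key at the next free slot of its bucket
    std::vector<std::uint32_t> output(input.size());
    for (std::uint32_t v : input) {
        output[offsets[radix_digit(v, low_bit)]++] = v;
    }
    return output;
}

// Full sort: one pass per digit, least significant digit first.
std::vector<std::uint32_t> radix_sort(std::vector<std::uint32_t> values)
{
    for (unsigned low_bit = 0; low_bit < 32; low_bit += RADIX_BITS) {
        values = radix_pass(values, low_bit);
    }
    return values;
}
```

The kernels additionally split the input into blocks and combine per-block counts, which is where global_counts and global_offsets come in.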

Add get_info<Info>() specializations

Right now, the get_info<T>(enum) function allows users to retrieve values from the corresponding clGet*Info() functions.

However, the current API requires users to specify both the enum name (e.g. CL_DEVICE_LOCAL_MEM_SIZE) and the value type (e.g. cl_ulong).

We should add specializations for the known enum types with hard-coded return types. For example, the following test should pass:

device.get_info<CL_DEVICE_LOCAL_MEM_SIZE>() == device.get_info<cl_ulong>(CL_DEVICE_LOCAL_MEM_SIZE);

Care should be taken to ensure that only enum types available from the OpenCL version reported by cl.h are used. For example, don't try to use info enums from OpenCL 1.2 like CL_DEVICE_PRINTF_BUFFER_SIZE if CL_VERSION_1_2 is not defined.
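
A traits-based mapping from info enum to return type is one way this could be arranged. The sketch below uses made-up constants (SKETCH_DEVICE_*), a made-up version macro (SKETCH_VERSION_1_2) and a dummy device_sketch class instead of the real OpenCL headers, so it shows the overloading technique only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical info enums standing in for CL_DEVICE_* constants.
const unsigned SKETCH_DEVICE_LOCAL_MEM_SIZE = 1;
const unsigned SKETCH_DEVICE_MAX_COMPUTE_UNITS = 2;

// Maps an info enum to its value type. There is no primary definition, so
// an unknown enum is a compile-time error.
template <unsigned Info> struct info_type;
template <> struct info_type<SKETCH_DEVICE_LOCAL_MEM_SIZE> { typedef std::uint64_t type; };
template <> struct info_type<SKETCH_DEVICE_MAX_COMPUTE_UNITS> { typedef std::uint32_t type; };

#ifdef SKETCH_VERSION_1_2 // mirrors guarding 1.2-only enums with CL_VERSION_1_2
const unsigned SKETCH_DEVICE_PRINTF_BUFFER_SIZE = 3;
template <> struct info_type<SKETCH_DEVICE_PRINTF_BUFFER_SIZE> { typedef std::size_t type; };
#endif

struct device_sketch
{
    // Existing style: the caller names both the value type and the enum.
    // The returned numbers are dummy values for the sketch.
    template <class T>
    T get_info(unsigned info) const
    {
        return static_cast<T>(info == SKETCH_DEVICE_LOCAL_MEM_SIZE ? 32768 : 8);
    }

    // Proposed style: the return type is looked up from the enum.
    template <unsigned Info>
    typename info_type<Info>::type get_info() const
    {
        return get_info<typename info_type<Info>::type>(Info);
    }
};

// The test from the issue, written against the sketch.
inline void check_get_info()
{
    device_sketch device;
    assert(device.get_info<SKETCH_DEVICE_LOCAL_MEM_SIZE>() ==
           device.get_info<std::uint64_t>(SKETCH_DEVICE_LOCAL_MEM_SIZE));
}
```

The two get_info overloads can coexist because one takes a type as its template argument and the other a value, so each call site matches exactly one of them.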

sort_by_key - new algorithm request

I have a bunch of code based on Thrust which I would like to convert over to Boost.Compute. Since my code uses thrust::sort_by_key a lot (as well as some other xxx_key functions) I was wondering if it would be possible to add this functionality to Boost.Compute as well?
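
For reference, the behavior being requested can be written on the host as follows: sort the keys and apply the same permutation to the values. sort_by_key_ref is a hypothetical name, and the sketch uses a stable sort so that the result is deterministic for equal keys, which a device implementation may or may not guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sorts `keys` in ascending order and reorders `values` so that each value
// stays paired with its original key.
template <class Key, class Value>
void sort_by_key_ref(std::vector<Key>& keys, std::vector<Value>& values)
{
    assert(keys.size() == values.size());

    // Sort a list of indices by the key each index refers to.
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t(0));
    std::stable_sort(order.begin(), order.end(),
                     [&keys](std::size_t a, std::size_t b) {
                         return keys[a] < keys[b];
                     });

    // Apply the resulting permutation to both sequences.
    std::vector<Key> sorted_keys;
    std::vector<Value> sorted_values;
    sorted_keys.reserve(keys.size());
    sorted_values.reserve(values.size());
    for (std::size_t i : order) {
        sorted_keys.push_back(keys[i]);
        sorted_values.push_back(values[i]);
    }
    keys.swap(sorted_keys);
    values.swap(sorted_values);
}
```

With keys {3, 1, 2} and values {'c', 'a', 'b'}, the call leaves keys as {1, 2, 3} and values as {'a', 'b', 'c'}.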

Tests failing

Hello Kyle,

Here is output of make test on NVIDIA and AMD cards:

Nvidia Tesla K20:

Test project /home/demidov/work/opencl/compute/build
      Start  1: core.buffer
 1/69 Test  #1: core.buffer ............................   Passed    0.53 sec
      Start  2: core.command_queue
 2/69 Test  #2: core.command_queue .....................   Passed    0.88 sec
      Start  3: core.device
 3/69 Test  #3: core.device ............................   Passed    0.12 sec
      Start  4: core.image2d
 4/69 Test  #4: core.image2d ...........................   Passed    0.51 sec
      Start  5: core.image3d
 5/69 Test  #5: core.image3d ...........................   Passed    0.51 sec
      Start  6: core.image_sampler
 6/69 Test  #6: core.image_sampler .....................   Passed    0.51 sec
      Start  7: core.kernel
 7/69 Test  #7: core.kernel ............................***Failed    0.50 sec
      Start  8: core.program
 8/69 Test  #8: core.program ...........................   Passed    0.49 sec
      Start  9: core.system
 9/69 Test  #9: core.system ............................   Passed    0.12 sec
      Start 10: core.type_traits
10/69 Test #10: core.type_traits .......................   Passed    0.01 sec
      Start 11: core.types
11/69 Test #11: core.types .............................   Passed    0.00 sec
      Start 12: algorithm.accumulate
12/69 Test #12: algorithm.accumulate ...................***Failed    0.51 sec
      Start 13: algorithm.adjacent_difference
13/69 Test #13: algorithm.adjacent_difference ..........   Passed    0.50 sec
      Start 14: algorithm.adjacent_find
14/69 Test #14: algorithm.adjacent_find ................   Passed    0.50 sec
      Start 15: algorithm.any_all_none_of
15/69 Test #15: algorithm.any_all_none_of ..............   Passed    0.51 sec
      Start 16: algorithm.binary_search
16/69 Test #16: algorithm.binary_search ................   Passed    0.51 sec
      Start 17: algorithm.copy
17/69 Test #17: algorithm.copy .........................***Failed    0.51 sec
      Start 18: algorithm.copy_if
18/69 Test #18: algorithm.copy_if ......................   Passed    0.50 sec
      Start 19: algorithm.count
19/69 Test #19: algorithm.count ........................***Failed    0.51 sec
      Start 20: algorithm.equal
20/69 Test #20: algorithm.equal ........................   Passed    0.50 sec
      Start 21: algorithm.equal_range
21/69 Test #21: algorithm.equal_range ..................   Passed    0.53 sec
      Start 22: algorithm.extrema
22/69 Test #22: algorithm.extrema ......................   Passed    0.49 sec
      Start 23: algorithm.fill
23/69 Test #23: algorithm.fill .........................   Passed    0.51 sec
      Start 24: algorithm.find
24/69 Test #24: algorithm.find .........................   Passed    0.50 sec
      Start 25: algorithm.for_each
25/69 Test #25: algorithm.for_each .....................   Passed    0.50 sec
      Start 26: algorithm.gather
26/69 Test #26: algorithm.gather .......................   Passed    0.50 sec
      Start 27: algorithm.generate
27/69 Test #27: algorithm.generate .....................   Passed    0.49 sec
      Start 28: algorithm.histogram
28/69 Test #28: algorithm.histogram ....................   Passed    0.51 sec
      Start 29: algorithm.inner_product
29/69 Test #29: algorithm.inner_product ................   Passed    0.51 sec
      Start 30: algorithm.inplace_reduce
30/69 Test #30: algorithm.inplace_reduce ...............***Failed    2.26 sec
      Start 31: algorithm.insertion_sort
31/69 Test #31: algorithm.insertion_sort ...............***Failed    1.69 sec
      Start 32: algorithm.iota
32/69 Test #32: algorithm.iota .........................   Passed    2.89 sec
      Start 33: algorithm.is_sorted
33/69 Test #33: algorithm.is_sorted ....................   Passed    2.76 sec
      Start 34: algorithm.merge
34/69 Test #34: algorithm.merge ........................   Passed    1.71 sec
      Start 35: algorithm.mismatch
35/69 Test #35: algorithm.mismatch .....................   Passed    0.51 sec
      Start 36: algorithm.partial_sum
36/69 Test #36: algorithm.partial_sum ..................   Passed    0.51 sec
      Start 37: algorithm.partition
37/69 Test #37: algorithm.partition ....................***Failed    6.62 sec
      Start 38: algorithm.radix_sort
38/69 Test #38: algorithm.radix_sort ...................***Failed    2.57 sec
      Start 39: algorithm.random_shuffle
39/69 Test #39: algorithm.random_shuffle ...............   Passed    1.04 sec
      Start 40: algorithm.reduce
40/69 Test #40: algorithm.reduce .......................***Failed    1.68 sec
      Start 41: algorithm.remove
41/69 Test #41: algorithm.remove .......................   Passed    6.25 sec
      Start 42: algorithm.replace
42/69 Test #42: algorithm.replace ......................   Passed    1.08 sec
      Start 43: algorithm.reverse
43/69 Test #43: algorithm.reverse ......................   Passed    3.36 sec
      Start 44: algorithm.scan
44/69 Test #44: algorithm.scan .........................***Failed    0.51 sec
      Start 45: algorithm.scatter
45/69 Test #45: algorithm.scatter ......................***Failed    1.11 sec
      Start 46: algorithm.sort
46/69 Test #46: algorithm.sort .........................***Failed   11.38 sec
      Start 47: algorithm.stable_sort
47/69 Test #47: algorithm.stable_sort ..................   Passed    0.50 sec
      Start 48: algorithm.transform
48/69 Test #48: algorithm.transform ....................***Failed    4.70 sec
      Start 49: algorithm.transform_reduce
49/69 Test #49: algorithm.transform_reduce .............***Failed    1.13 sec
      Start 50: container.allocator
50/69 Test #50: container.allocator ....................   Passed    0.50 sec
      Start 51: container.array
51/69 Test #51: container.array ........................***Failed    0.51 sec
      Start 52: container.flat_map
52/69 Test #52: container.flat_map .....................   Passed   19.66 sec
      Start 53: container.flat_set
53/69 Test #53: container.flat_set .....................***Failed    1.70 sec
      Start 54: container.stack
54/69 Test #54: container.stack ........................   Passed    1.63 sec
      Start 55: container.string
55/69 Test #55: container.string .......................   Passed    0.50 sec
      Start 56: container.valarray
56/69 Test #56: container.valarray .....................   Passed    0.51 sec
      Start 57: container.vector
57/69 Test #57: container.vector .......................   Passed    7.94 sec
      Start 58: iterator.adjacent_transform_iterator
58/69 Test #58: iterator.adjacent_transform_iterator ...   Passed    1.07 sec
      Start 59: iterator.zip_iterator
59/69 Test #59: iterator.zip_iterator ..................   Passed    1.46 sec
      Start 60: random.mersenne_twister
60/69 Test #60: random.mersenne_twister ................   Passed    1.15 sec
      Start 61: blas.gemm
61/69 Test #61: blas.gemm ..............................***Failed    1.67 sec
      Start 62: blas.gemv
62/69 Test #62: blas.gemv ..............................   Passed    1.07 sec
      Start 63: blas.iamax
63/69 Test #63: blas.iamax .............................   Passed    1.08 sec
      Start 64: blas.norm2
64/69 Test #64: blas.norm2 .............................   Passed    1.62 sec
      Start 65: ext.complex
65/69 Test #65: ext.complex ............................   Passed    3.91 sec
      Start 66: ext.lambda
66/69 Test #66: ext.lambda .............................   Passed    1.65 sec
      Start 67: ext.malloc
67/69 Test #67: ext.malloc .............................   Passed    0.51 sec
      Start 68: ext.pair
68/69 Test #68: ext.pair ...............................   Passed    2.82 sec
      Start 69: ext.tuple
69/69 Test #69: ext.tuple ..............................***Failed    0.51 sec

74% tests passed, 18 tests failed out of 69

Total Test time (real) = 119.04 sec

The following tests FAILED:
      7 - core.kernel (Failed)
     12 - algorithm.accumulate (Failed)
     17 - algorithm.copy (Failed)
     19 - algorithm.count (Failed)
     30 - algorithm.inplace_reduce (Failed)
     31 - algorithm.insertion_sort (Failed)
     37 - algorithm.partition (Failed)
     38 - algorithm.radix_sort (Failed)
     40 - algorithm.reduce (Failed)
     44 - algorithm.scan (Failed)
     45 - algorithm.scatter (Failed)
     46 - algorithm.sort (Failed)
     48 - algorithm.transform (Failed)
     49 - algorithm.transform_reduce (Failed)
     51 - container.array (Failed)
     53 - container.flat_set (Failed)
     61 - blas.gemm (Failed)
     69 - ext.tuple (Failed)

AMD Capeverde:

Test project /home/demidov/work/opencl/compute/build/test
      Start  1: core.buffer
 1/69 Test  #1: core.buffer ............................   Passed    2.12 sec
      Start  2: core.command_queue
 2/69 Test  #2: core.command_queue .....................   Passed    0.89 sec
      Start  3: core.device
 3/69 Test  #3: core.device ............................   Passed    0.29 sec
      Start  4: core.image2d
 4/69 Test  #4: core.image2d ...........................***Failed    0.88 sec
      Start  5: core.image3d
 5/69 Test  #5: core.image3d ...........................   Passed    0.29 sec
      Start  6: core.image_sampler
 6/69 Test  #6: core.image_sampler .....................   Passed    0.29 sec
      Start  7: core.kernel
 7/69 Test  #7: core.kernel ............................***Failed    0.29 sec
      Start  8: core.program
 8/69 Test  #8: core.program ...........................***Failed    0.29 sec
      Start  9: core.system
 9/69 Test  #9: core.system ............................   Passed    0.29 sec
      Start 10: core.type_traits
10/69 Test #10: core.type_traits .......................   Passed    0.01 sec
      Start 11: core.types
11/69 Test #11: core.types .............................   Passed    0.00 sec
      Start 12: algorithm.accumulate
12/69 Test #12: algorithm.accumulate ...................***Failed    2.64 sec
      Start 13: algorithm.adjacent_difference
13/69 Test #13: algorithm.adjacent_difference ..........***Failed    0.60 sec
      Start 14: algorithm.adjacent_find
14/69 Test #14: algorithm.adjacent_find ................***Failed    0.90 sec
      Start 15: algorithm.any_all_none_of
15/69 Test #15: algorithm.any_all_none_of ..............***Failed    0.90 sec
      Start 16: algorithm.binary_search
16/69 Test #16: algorithm.binary_search ................***Failed    1.48 sec
      Start 17: algorithm.copy
17/69 Test #17: algorithm.copy .........................***Failed   13.78 sec
      Start 18: algorithm.copy_if
18/69 Test #18: algorithm.copy_if ......................***Failed    2.04 sec
      Start 19: algorithm.count
19/69 Test #19: algorithm.count ........................***Failed    2.34 sec
      Start 20: algorithm.equal
20/69 Test #20: algorithm.equal ........................***Failed    2.52 sec
      Start 21: algorithm.equal_range
21/69 Test #21: algorithm.equal_range ..................***Failed    0.88 sec
      Start 22: algorithm.extrema
22/69 Test #22: algorithm.extrema ......................***Failed    2.36 sec
      Start 23: algorithm.fill
23/69 Test #23: algorithm.fill .........................***Failed    1.18 sec
      Start 24: algorithm.find
24/69 Test #24: algorithm.find .........................***Failed    2.95 sec
      Start 25: algorithm.for_each
25/69 Test #25: algorithm.for_each .....................***Failed    0.91 sec
      Start 26: algorithm.gather
26/69 Test #26: algorithm.gather .......................***Failed    1.49 sec
      Start 27: algorithm.generate
27/69 Test #27: algorithm.generate .....................***Failed    0.60 sec
      Start 28: algorithm.histogram
28/69 Test #28: algorithm.histogram ....................***Failed    0.89 sec
      Start 29: algorithm.inner_product
29/69 Test #29: algorithm.inner_product ................***Failed    1.18 sec
      Start 30: algorithm.inplace_reduce
30/69 Test #30: algorithm.inplace_reduce ...............***Failed    1.77 sec
      Start 31: algorithm.insertion_sort
31/69 Test #31: algorithm.insertion_sort ...............***Failed    6.14 sec
      Start 32: algorithm.iota
32/69 Test #32: algorithm.iota .........................***Failed    0.88 sec
      Start 33: algorithm.is_sorted
33/69 Test #33: algorithm.is_sorted ....................***Failed    2.04 sec
      Start 34: algorithm.merge
34/69 Test #34: algorithm.merge ........................***Failed    1.18 sec
      Start 35: algorithm.mismatch
35/69 Test #35: algorithm.mismatch .....................***Failed    1.19 sec
      Start 36: algorithm.partial_sum
36/69 Test #36: algorithm.partial_sum ..................***Failed    0.88 sec
      Start 37: algorithm.partition
37/69 Test #37: algorithm.partition ....................***Failed    2.66 sec
      Start 38: algorithm.radix_sort
38/69 Test #38: algorithm.radix_sort ...................***Failed    6.13 sec
      Start 39: algorithm.random_shuffle
39/69 Test #39: algorithm.random_shuffle ...............***Failed    1.79 sec
      Start 40: algorithm.reduce
40/69 Test #40: algorithm.reduce .......................***Failed    3.51 sec
      Start 41: algorithm.remove
41/69 Test #41: algorithm.remove .......................***Failed    0.90 sec
      Start 42: algorithm.replace
42/69 Test #42: algorithm.replace ......................***Failed    0.89 sec
      Start 43: algorithm.reverse
43/69 Test #43: algorithm.reverse ......................***Failed    1.18 sec
      Start 44: algorithm.scan
44/69 Test #44: algorithm.scan .........................***Failed    3.23 sec
      Start 45: algorithm.scatter
45/69 Test #45: algorithm.scatter ......................***Failed    1.76 sec
      Start 46: algorithm.sort
46/69 Test #46: algorithm.sort .........................***Failed    9.35 sec
      Start 47: algorithm.stable_sort
47/69 Test #47: algorithm.stable_sort ..................***Failed    0.90 sec
      Start 48: algorithm.transform
48/69 Test #48: algorithm.transform ....................***Failed   12.02 sec
      Start 49: algorithm.transform_reduce
49/69 Test #49: algorithm.transform_reduce .............***Failed    1.52 sec
      Start 50: container.allocator
50/69 Test #50: container.allocator ....................   Passed    0.59 sec
      Start 51: container.array
51/69 Test #51: container.array ........................***Failed    6.25 sec
      Start 52: container.flat_map
52/69 Test #52: container.flat_map .....................***Failed    5.27 sec
      Start 53: container.flat_set
53/69 Test #53: container.flat_set .....................***Failed    3.85 sec
      Start 54: container.stack
54/69 Test #54: container.stack ........................***Failed    2.64 sec
      Start 55: container.string
55/69 Test #55: container.string .......................   Passed    0.60 sec
      Start 56: container.valarray
56/69 Test #56: container.valarray .....................***Failed    3.52 sec
      Start 57: container.vector
57/69 Test #57: container.vector .......................***Failed   54.50 sec
      Start 58: iterator.adjacent_transform_iterator
58/69 Test #58: iterator.adjacent_transform_iterator ...***Failed    1.48 sec
      Start 59: iterator.zip_iterator
59/69 Test #59: iterator.zip_iterator ..................***Failed    1.49 sec
      Start 60: random.mersenne_twister
60/69 Test #60: random.mersenne_twister ................***Failed    0.60 sec
      Start 61: blas.gemm
61/69 Test #61: blas.gemm ..............................***Failed    0.90 sec
      Start 62: blas.gemv
62/69 Test #62: blas.gemv ..............................***Failed    0.61 sec
      Start 63: blas.iamax
63/69 Test #63: blas.iamax .............................***Failed    0.61 sec
      Start 64: blas.norm2
64/69 Test #64: blas.norm2 .............................***Failed    0.60 sec
      Start 65: ext.complex
65/69 Test #65: ext.complex ............................***Failed    4.40 sec
      Start 66: ext.lambda
66/69 Test #66: ext.lambda .............................***Failed    3.24 sec
      Start 67: ext.malloc
67/69 Test #67: ext.malloc .............................   Passed    0.88 sec
      Start 68: ext.pair
68/69 Test #68: ext.pair ...............................***Failed    5.90 sec
      Start 69: ext.tuple
69/69 Test #69: ext.tuple ..............................   Passed    2.37 sec

17% tests passed, 57 tests failed out of 69

Total Test time (real) = 206.29 sec

The following tests FAILED:
      4 - core.image2d (Failed)
      7 - core.kernel (Failed)
      8 - core.program (Failed)
     12 - algorithm.accumulate (Failed)
     13 - algorithm.adjacent_difference (Failed)
     14 - algorithm.adjacent_find (Failed)
     15 - algorithm.any_all_none_of (Failed)
     16 - algorithm.binary_search (Failed)
     17 - algorithm.copy (Failed)
     18 - algorithm.copy_if (Failed)
     19 - algorithm.count (Failed)
     20 - algorithm.equal (Failed)
     21 - algorithm.equal_range (Failed)
     22 - algorithm.extrema (Failed)
     23 - algorithm.fill (Failed)
     24 - algorithm.find (Failed)
     25 - algorithm.for_each (Failed)
     26 - algorithm.gather (Failed)
     27 - algorithm.generate (Failed)
     28 - algorithm.histogram (Failed)
     29 - algorithm.inner_product (Failed)
     30 - algorithm.inplace_reduce (Failed)
     31 - algorithm.insertion_sort (Failed)
     32 - algorithm.iota (Failed)
     33 - algorithm.is_sorted (Failed)
     34 - algorithm.merge (Failed)
     35 - algorithm.mismatch (Failed)
     36 - algorithm.partial_sum (Failed)
     37 - algorithm.partition (Failed)
     38 - algorithm.radix_sort (Failed)
     39 - algorithm.random_shuffle (Failed)
     40 - algorithm.reduce (Failed)
     41 - algorithm.remove (Failed)
     42 - algorithm.replace (Failed)
     43 - algorithm.reverse (Failed)
     44 - algorithm.scan (Failed)
     45 - algorithm.scatter (Failed)
     46 - algorithm.sort (Failed)
     47 - algorithm.stable_sort (Failed)
     48 - algorithm.transform (Failed)
     49 - algorithm.transform_reduce (Failed)
     51 - container.array (Failed)
     52 - container.flat_map (Failed)
     53 - container.flat_set (Failed)
     54 - container.stack (Failed)
     56 - container.valarray (Failed)
     57 - container.vector (Failed)
     58 - iterator.adjacent_transform_iterator (Failed)
     59 - iterator.zip_iterator (Failed)
     60 - random.mersenne_twister (Failed)
     61 - blas.gemm (Failed)
     62 - blas.gemv (Failed)
     63 - blas.iamax (Failed)
     64 - blas.norm2 (Failed)
     65 - ext.complex (Failed)
     66 - ext.lambda (Failed)
     68 - ext.pair (Failed)

compute::transform over zip iterators

Is it possible to apply transform() to a zip iterator range in a general way? As an example, consider rotating 2D points whose components are stored in separate vectors:

y0 = cos(a) * x0 - sin(a) * x1;
y1 = sin(a) * x0 + cos(a) * x1;

I want to implement this along the lines of the following (bc being an alias for boost::compute):

bc::vector<double> x0(n), x1(n), y0(n), y1(n);

bc::transform(
  bc::make_zip_iterator( boost::make_tuple( x0.begin(), x1.begin() ) ),
  bc::make_zip_iterator( boost::make_tuple( x0.end(), x1.end() ) ),
  bc::make_zip_iterator( boost::make_tuple( y0.begin(), y1.begin() ) ),
  ??? // what to write here?
);

If the output were a single vector, one could use the bc::get<I>() function objects, but what should be used when the output is itself a zip iterator?
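For reference, the rotation above can be written out on the host with plain standard C++. This is only a sketch of the arithmetic a device-side version would need to reproduce, not a Boost.Compute solution to the question; the helper name rotate_points and its signature are hypothetical and do not exist in the library.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host-side helper (not part of Boost.Compute): rotates the
// points (x0[i], x1[i]) by the angle a (in radians) and returns the
// rotated components as the pair (y0, y1).
std::pair<std::vector<double>, std::vector<double>>
rotate_points(const std::vector<double>& x0,
              const std::vector<double>& x1,
              double a)
{
    // both component vectors must describe the same number of points
    assert(x0.size() == x1.size());

    // evaluate the trigonometric factors once, outside the loop
    const double c = std::cos(a);
    const double s = std::sin(a);

    std::vector<double> y0(x0.size()), y1(x0.size());
    for (std::size_t i = 0; i < x0.size(); ++i) {
        y0[i] = c * x0[i] - s * x1[i];
        y1[i] = s * x0[i] + c * x1[i];
    }
    return {std::move(y0), std::move(y1)};
}
```

A host-side reference like this can be used to check the results of whatever device-side formulation is eventually chosen, for example by copying y0 and y1 back from the device and comparing element-wise within a small tolerance.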
