
cub's Introduction

⚠️ The CUB repository has been archived and is now part of the unified nvidia/cccl repository. See the announcement here for more information. Please visit the new repository for the latest updates. ⚠️


About CUB

CUB provides state-of-the-art, reusable software components for every layer of the CUDA programming model:

(Figure: Orientation of collective primitives within the CUDA software stack)

CUB is included in the NVIDIA HPC SDK and the CUDA Toolkit.

We recommend the CUB Project Website for further information and examples.



A Simple Example

#include <cub/cub.cuh>

// Block-sorting CUDA kernel
__global__ void BlockSortKernel(int *d_in, int *d_out)
{
     using namespace cub;

     // Specialize BlockRadixSort, BlockLoad, and BlockStore for 128 threads
     // owning 16 integer items each
     typedef BlockRadixSort<int, 128, 16>                     BlockRadixSort;
     typedef BlockLoad<int, 128, 16, BLOCK_LOAD_TRANSPOSE>   BlockLoad;
     typedef BlockStore<int, 128, 16, BLOCK_STORE_TRANSPOSE> BlockStore;

     // Allocate shared memory
     __shared__ union {
         typename BlockRadixSort::TempStorage  sort;
         typename BlockLoad::TempStorage       load;
         typename BlockStore::TempStorage      store;
     } temp_storage;

     int block_offset = blockIdx.x * (128 * 16);      // OffsetT for this block's segment

     // Obtain a segment of 2048 consecutive keys that are blocked across threads
     int thread_keys[16];
     BlockLoad(temp_storage.load).Load(d_in + block_offset, thread_keys);
     __syncthreads();

     // Collectively sort the keys
     BlockRadixSort(temp_storage.sort).Sort(thread_keys);
     __syncthreads();

     // Store the sorted segment
     BlockStore(temp_storage.store).Store(d_out + block_offset, thread_keys);
}

Each thread block uses cub::BlockRadixSort to collectively sort its own input segment. The class is specialized by the data type being sorted, by the number of threads per block, by the number of keys per thread, and implicitly by the targeted compilation architecture.

The cub::BlockLoad and cub::BlockStore classes are similarly specialized. Furthermore, to provide coalesced accesses to device memory, these primitives are configured to access memory using a striped access pattern (where consecutive threads simultaneously access consecutive items) and then transpose the keys into a blocked arrangement of elements across threads.

Once specialized, these classes expose opaque TempStorage member types. The thread block uses these storage types to statically allocate the union of shared memory needed by the thread block. (Alternatively, these storage types could be aliased to global memory allocations.)
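For reference, a kernel like this might be driven from the host as follows. This is a minimal sketch, not part of the original example; the grid size and the way the input buffer is filled are assumptions:

```cuda
#include <cub/cub.cuh>

// BlockSortKernel as defined above: 128 threads x 16 items = 2048 keys per block
__global__ void BlockSortKernel(int *d_in, int *d_out);

int main()
{
    const int keys_per_block = 2048;   // 128 threads * 16 items per thread
    const int num_blocks     = 4;      // hypothetical grid size
    const int n              = keys_per_block * num_blocks;

    int *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    // ... fill d_in with keys, e.g. via cudaMemcpy from a host array ...

    // One 128-thread block sorts each 2048-key segment independently
    BlockSortKernel<<<num_blocks, 128>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```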



Supported Compilers

CUB is regularly tested using the specified versions of the following compilers. Unsupported versions may emit deprecation warnings, which can be silenced by defining CUB_IGNORE_DEPRECATED_COMPILER during compilation.

  • NVCC 11.0+
  • GCC 5+
  • Clang 7+
  • MSVC 2019+ (19.20/16.0/14.20)



Releases

CUB is distributed with the NVIDIA HPC SDK and the CUDA Toolkit in addition to GitHub.

See the changelog for details about specific releases.

CUB Release              | Included In
2.0.1                    | CUDA Toolkit 12.0
2.0.0                    | TBD
1.17.2                   | TBD
1.17.1                   | TBD
1.17.0                   | TBD
1.16.0                   | TBD
1.15.0                   | NVIDIA HPC SDK 22.1 & CUDA Toolkit 11.6
1.14.0                   | NVIDIA HPC SDK 21.9
1.13.1                   | CUDA Toolkit 11.5
1.13.0                   | NVIDIA HPC SDK 21.7
1.12.1                   | CUDA Toolkit 11.4
1.12.0                   | NVIDIA HPC SDK 21.3
1.11.0                   | CUDA Toolkit 11.3
1.10.0                   | NVIDIA HPC SDK 20.9 & CUDA Toolkit 11.2
1.9.10-1                 | NVIDIA HPC SDK 20.7 & CUDA Toolkit 11.1
1.9.10                   | NVIDIA HPC SDK 20.5
1.9.9                    | CUDA Toolkit 11.0
1.9.8-1                  | NVIDIA HPC SDK 20.3
1.9.8                    | CUDA Toolkit 11.0 Early Access
1.8.0                    |
1.7.5                    | Thrust 1.9.2
1.7.4                    | Thrust 1.9.1-2
1.7.3                    |
1.7.2                    |
1.7.1                    |
1.7.0                    | Thrust 1.9.0-5
1.6.4                    |
1.6.3                    |
1.6.2 (previously 1.5.5) |
1.6.1 (previously 1.5.4) |
1.6.0 (previously 1.5.3) |
1.5.2                    |
1.5.1                    |
1.5.0                    |
1.4.1                    |
1.4.0                    |
1.3.2                    |
1.3.1                    |
1.3.0                    |
1.2.3                    |
1.2.2                    |
1.2.0                    |
1.1.1                    |
1.0.2                    |
1.0.1                    |
0.9.4                    |
0.9.2                    |
0.9.1                    |
0.9.0                    |



Development Process

CUB and Thrust depend on each other. It is recommended to clone Thrust and build CUB as a component of Thrust.

CUB uses the CMake build system to build unit tests, examples, and header tests. To build CUB as a developer, the following recipe should be followed:

# Clone Thrust and CUB from Github. CUB is located in Thrust's
# `dependencies/cub` submodule.
git clone --recursive https://github.com/NVIDIA/thrust.git
cd thrust

# Create build directory:
mkdir build
cd build

# Configure -- use one of the following:
cmake -DTHRUST_INCLUDE_CUB_CMAKE=ON ..   # Command line interface.
ccmake -DTHRUST_INCLUDE_CUB_CMAKE=ON ..  # ncurses GUI (Linux only)
cmake-gui  # Graphical UI, set source/build directories and options in the app

# Build:
cmake --build . -j <num jobs>   # invokes make (or ninja, etc)

# Run tests and examples:
ctest

By default, the C++14 standard is targeted, but this can be changed in CMake. More information on configuring your CUB build and creating a pull request is found in CONTRIBUTING.md.



Open Source License

CUB is available under the "New BSD" open-source license:

Copyright (c) 2010-2011, Duane Merrill.  All rights reserved.
Copyright (c) 2011-2018, NVIDIA CORPORATION.  All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
   *  Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
   *  Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
   *  Neither the name of the NVIDIA CORPORATION nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

cub's People

Contributors

alliepiper, andrewcorrigan, artem-b, balnian, brycelelbach, bytehamster, canonizer, codereport, dkolsen-pgi, dumerrill, elehcim, elstehle, fkallen, gevtushenko, griwes, himanshu007-creator, hwinkler, jrhemstad, kaatish, kshitij12345, matthiaskohl, miscco, mnicely, nv-dlasalle, peter9606, psvvsp, robertmaynard, seunghwak, sjfeng1999, zasdfgbnm


cub's Issues

Warpreduce executes reduction operator for out-of-bounds items

A 1.5.5 change in the signedness of lane_id allows lanes past the "last lane" to execute their reduction operators. Although the partial sums produced by these lanes are not incorporated into the result returned to lane0, this presents a problem if operating on uninitialized data throws an error (memory exception, divide by zero, etc.).

cub::NumericTraits has no methods "Lowest", "Max"

When trying to compile the reduction example, the build produces a flood of error messages. They all look like:

../../test/test_util.h(987): error: class "cub::NumericTraits" has no member "Lowest"

[Question] How to use external memory for temp storage?

When calling

BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values);

the examples demand memory that can be seen by all threads e.g.

__shared__ typename BlockRadixSort::TempStorage temp_storage;

However, how do I use external memory if temp_storage needs more than 48 KiB? How do I allocate this memory from the host?

Thanks in advance,

Christian
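One possible approach (a hedged sketch, not an answer from the original thread): opt in to large dynamic shared memory with cudaFuncSetAttribute and alias the launch-time shared memory buffer to the TempStorage type. The kernel name and tile sizes here are hypothetical:

```cuda
#include <cub/cub.cuh>

// Hypothetical kernel: temp storage comes from dynamically allocated shared memory
__global__ void SortWithDynamicSmem(int *d_keys, int *d_values)
{
    using BlockRadixSort = cub::BlockRadixSort<int, 256, 32, int>;

    // Alias the dynamic shared memory buffer to the opaque TempStorage type
    extern __shared__ char smem[];
    BlockRadixSort::TempStorage &temp_storage =
        *reinterpret_cast<BlockRadixSort::TempStorage *>(smem);

    int thread_keys[32], thread_values[32];
    // ... load thread_keys / thread_values ...
    BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values);
    // ... store results ...
}

void Launch(int *d_keys, int *d_values)
{
    size_t smem_bytes = sizeof(cub::BlockRadixSort<int, 256, 32, int>::TempStorage);

    // Opt in to more than 48 KiB of dynamic shared memory (Volta and newer)
    cudaFuncSetAttribute(SortWithDynamicSmem,
                         cudaFuncAttributeMaxDynamicSharedMemorySize,
                         (int)smem_bytes);
    SortWithDynamicSmem<<<1, 256, smem_bytes>>>(d_keys, d_values);
}
```

Note that shared memory is still capped by the device's per-block limit; truly arbitrary sizes would require restructuring around global memory instead.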

Vectorized Binary Search

Are there any plans to implement vectorized binary search? Is it possible to improve upon the performance of thrust binary search using CUB?

Numerical stability issues with block load & store using ITEMS_PER_THREAD = 1

There appear to be numerical stability issues when I use cub::BlockLoad/cub::BlockStore with ITEMS_PER_THREAD = 1. Using ITEMS_PER_THREAD = 2 or more is okay.

Steps to reproduce:

  1. Install Boost. (I am on Mac, and I installed Boost via the Homebrew bottle: brew install boost)

  2. Save the attached test.cu and Makefile to a new directory. This test case computes a Hadamard product C <- alpha * A (*) B + beta * C via a "basic" kernel (basicHadKernel) and one using CUB (cubHadKernel).

  3. Edit Makefile to change the include path to CUB.

  4. Run make && ./test. You should see:

    Running 1 test case...
    
    *** No errors detected
    
  5. Change line 38 of test.cu to:

    static constexpr int ITEMS_PER_THREAD = 1;
  6. Run make && ./test. You should now see something like:

    Running 1 test case...
    test.cu:279: error: in "test_had": difference{1.43188e-05} between C2[i]{-0.00731719984} and expected[i]{-0.00731730461} exceeds 0.001%
    test.cu:279: error: in "test_had": difference{0.000314992} between C2[i]{0.000154208246} and expected[i]{0.000154256821} exceeds 0.001%
    test.cu:279: error: in "test_had": difference{1.46731e-05} between C2[i]{0.0272928048} and expected[i]{0.0272932053} exceeds 0.001%
    [... output clipped ...]
    
    *** 333 failures are detected in the test module "tests"
    

    As you can see, many values now exceed the tolerance threshold.

CUB code doesn't compile with CUDA 5.0 on Ubuntu Linux 11.10

I ran into this in my own code, and then tried the CUB tests, and they fail to compile too.

harrism@snafubuntu:~/src/cub/test$ make test_block_radix_sort
mkdir -p bin
"/usr/local/cuda/bin/nvcc"  -gencode=arch=compute_20,code=\"sm_20,compute_20\"  -o bin/test_block_radix_sort_nvvm_5.0_noabi_i386 test_block_radix_sort.cu -Xptxas -v -Xcudafe -# -Xptxas -abi=no -m32 -I. -I../cub  -O3
../cub/block/../warp/warp_reduce.cuh(134): warning: declaration of "T" hides template parameter

../cub/block/../warp/warp_reduce.cuh(456): error: a class or namespace qualified name is required

../cub/block/../warp/warp_reduce.cuh(459): error: name followed by "::" must be a class or namespace name

Front end time                      17.78 (CPU)      18.00 (elapsed)
2 errors detected in the compilation of "/tmp/tmpxft_000012b7_00000000-6_test_block_radix_sort.cpp1.ii".
Total compilation time              18.01 (CPU)      19.00 (elapsed)
make: *** [bin/test_block_radix_sort_nvvm_5.0_noabi_i386] Error 2

harrism@snafubuntu:~/src/cub/test$ uname -a
Linux snafubuntu 3.0.0-32-generic #51-Ubuntu SMP Thu Mar 21 15:50:59 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

harrism@snafubuntu:~/src/cub/test$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

harrism@snafubuntu:~/src/cub/test$ g++ --version
g++ (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The errors for other tests start the same, but are much worse (many more error messages).

RFEs for cub::DeviceScan::ExclusiveScan

  1. Add an API that also accepts an init value and constructs an exclusive scan with the following semantics: output[0] = init, output[i>0] = init ⊕ input[0] ⊕ ... ⊕ input[i-1].
  2. Add an API that doesn't require an identity value for a user-provided reduction operator. If the user doesn't know, or doesn't want to provide, an identity, the exclusive scan should just work, similar to the inclusive scan. If this is less efficient, that's okay; it is the user's choice to get a faster implementation by providing the identity value.

DeviceRadixSort::SortKeys<bool> yields compiler errors

The following program:

#include <cub/cub.cuh>

int main()
{
  cub::DeviceRadixSort::SortKeys<bool>;
  return 0;
}

does not compile:

$nvcc -I. sort_bool.cu
sort_bool.cu(5): warning: expression has no effect

./cub/device/dispatch/../../block_sweep/block_radix_sort_upsweep.cuh(92): error: class "cub::Traits<__nv_bool>" has no member "UnsignedBits"
          detected during:
            instantiation of class "cub::BlockRadixSortUpsweep<BlockRadixSortUpsweepPolicy, Key, Offset> [with BlockRadixSortUpsweepPolicy=cub::DeviceRadixSortDispatch<false, __nv_bool, cub::NullType, int>::PtxUpsweepPolicy, Key=__nv_bool, Offset=int]" 
./cub/device/dispatch/device_radix_sort_dispatch.cuh(80): here
            instantiation of "void cub::DeviceRadixSortUpsweepKernel<BlockRadixSortUpsweepPolicy,DESCENDING,Key,Offset>(Key *, Offset *, Offset, int, int, __nv_bool, cub::GridEvenShare<Offset>) [with BlockRadixSortUpsweepPolicy=cub::DeviceRadixSortDispatch<false, __nv_bool, cub::NullType, int>::PtxUpsweepPolicy, DESCENDING=false, Key=__nv_bool, Offset=int]" 
./cub/device/dispatch/device_radix_sort_dispatch.cuh(931): here
            instantiation of "cudaError_t cub::DeviceRadixSortDispatch<DESCENDING, Key, Value, Offset>::Dispatch(void *, size_t &, cub::DoubleBuffer<Key> &, cub::DoubleBuffer<Value> &, Offset, int, int, cudaStream_t, __nv_bool) [with DESCENDING=false, Key=__nv_bool, Value=cub::NullType, Offset=int]" 
./cub/device/device_radix_sort.cuh(331): here
            instantiation of "cudaError_t cub::DeviceRadixSort::SortKeys(void *, size_t &, cub::DoubleBuffer<Key> &, int, int, int, cudaStream_t, __nv_bool) [with Key=__nv_bool]" 

...

I tried to find out if this was intended to be supported or not. The documentation implies it is:

DeviceRadixSort can sort all of the built-in C++ numeric primitive types, e.g.: unsigned char, int,
double, etc. Although the direct radix sorting method can only be applied to unsigned integral types,
BlockRadixSort is able to sort signed and floating-point types via simple bit-wise transformations
that ensure lexicographic key ordering.

bool is indeed an integral type, so I think it's worth supporting for completeness.
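Until bool keys are supported directly, one possible workaround (a hedged sketch, not from the original thread) is to view the bool keys as unsigned char, which DeviceRadixSort handles; this assumes the usual two-phase temp-storage pattern:

```cuda
#include <cub/cub.cuh>

// Hypothetical workaround: sort bool keys by reinterpreting them as unsigned char
cudaError_t SortBoolKeys(bool *d_in, bool *d_out, int num_items)
{
    unsigned char *in  = reinterpret_cast<unsigned char *>(d_in);
    unsigned char *out = reinterpret_cast<unsigned char *>(d_out);

    // First call: query required temporary storage size
    void  *d_temp     = nullptr;
    size_t temp_bytes = 0;
    cudaError_t err = cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, in, out, num_items);
    if (err != cudaSuccess) return err;

    err = cudaMalloc(&d_temp, temp_bytes);
    if (err != cudaSuccess) return err;

    // Second call: run the sort
    err = cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, in, out, num_items);
    cudaFree(d_temp);
    return err;
}
```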

CUDA error: invalid device function when DeviceRadixSort::SortKeys called from __host__ __device__ function

The following program:

#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__>= 350 && defined(__CUDACC_RDC__))
#  define CUB_CDP 1
#endif

#include <thrust/device_vector.h>
#include <thrust/tabulate.h>
#include <thrust/functional.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <cub/device/device_radix_sort.cuh>
#include <cassert>

template<typename T>
__host__ __device__
cudaError_t cub_sort_n(T* first, size_t n)
{
  cudaError_t result = cudaSuccess;

#if !defined(__CUDA_ARCH__) || CUB_CDP
  // XXX the additional temporary storage bytes
  //     must be allocated on a 16b aligned address
  struct __align__(16) aligned_type {};

  if(n > 1)
  {
    cub::DoubleBuffer<T> double_buffer;

    // measure the number of additional temporary storage bytes required
    size_t num_additional_temp_storage_bytes = 0;

    cudaError_t error = cub::DeviceRadixSort::SortKeys(0, num_additional_temp_storage_bytes, double_buffer, static_cast<int>(n));
    if(error != cudaSuccess) return error;

    // allocate temporary storage for double buffer as well as additional bytes
    size_t num_double_buffer_bytes = n * sizeof(T);
    size_t num_aligned_double_buffer_bytes = thrust::detail::util::round_i(num_double_buffer_bytes, sizeof(aligned_type));
    size_t num_aligned_total_temporary_storage_bytes = num_aligned_double_buffer_bytes + num_additional_temp_storage_bytes;

    thrust::device_vector<char> temporary_storage(num_aligned_total_temporary_storage_bytes);

    void* additional_temp_storage_ptr = thrust::raw_pointer_cast(temporary_storage.data() + num_aligned_double_buffer_bytes);

    double_buffer.d_buffers[0] = thrust::raw_pointer_cast(&*first);
    double_buffer.d_buffers[1] = reinterpret_cast<T*>(thrust::raw_pointer_cast(temporary_storage.data()));

    result = cub::DeviceRadixSort::SortKeys(additional_temp_storage_ptr,
                                            num_additional_temp_storage_bytes,
                                            double_buffer,
                                            static_cast<int>(n));
    if(result != cudaSuccess) return result;

    if(double_buffer.Current() != 0)
    {
      T* temp_ptr = reinterpret_cast<T*>(double_buffer.d_buffers[1]);
      thrust::copy(thrust::device, temp_ptr, temp_ptr + n, first);
    }
  }
#else
  result = cudaErrorNotSupported;
#endif

  return result;
}


int main()
{
  size_t n = 2;

  thrust::host_vector<int> ref(n);

  // generate unsorted values
  thrust::tabulate(ref.begin(), ref.end(), thrust::placeholders::_1 % 1024);

  thrust::device_vector<int> data = ref;

  cudaError_t error = cub_sort_n(thrust::raw_pointer_cast(data.data()), data.size());
  if(error)
  {
    std::cerr << "CUDA error: " << cudaGetErrorString(error) << std::endl;
    std::exit(-1);
  }

  thrust::sort(ref.begin(), ref.end());

  assert(ref == data);

  assert(thrust::is_sorted(data.begin(), data.end()));

  return 0;
}

Yields the following output at runtime:

$ nvcc -I. repro.cu -run
CUDA error: invalid device function 

I believe the way that CUB uses the CUB_CDP macro is causing the host compiler to elide the CUDA kernels because it thinks they are unused. It requires quite a few workarounds to get this right -- CUB should probably perform these on behalf of the client.

Support for custom types in cub::DeviceRadixSort?

I couldn't find a better place to ask this question, so pardon me for asking here:

Is there a way to use keys of non-standard types for cub::DeviceRadixSort?

So far I managed to sort arrays of ints, floats, doubles, longs, bools, etc.

But what I would like is to sort an array where each item is a custom type:

typedef struct {
  char data[32];
} LongLongLong;

Unfortunately I get the following error when I try to do that:

../cub/device/dispatch/dispatch_radix_sort.cuh(254): error: class "cub::Traits<LongLongLong>" has no member "UnsignedBits"

Should I maybe add a NumericTraits for my custom type? How would I go about that?

Thank you.

Caching Device Allocator fails with debug output enabled

Bryan F

Jun 1

Visual Studio 2013, CUDA 7.5, CUB 1.5.2

Error in util_allocator.cuh@Line 414: map/set iterator not dereferencable

This issue is caused by erase being called on the block iterator before the debug output is printed.

cached_blocks.erase(block_itr);
cached_bytes[device].free -= search_key.bytes;
cached_bytes[device].live += search_key.bytes;

if (debug) _CubLog("\tDevice %d reused cached block at %p (%lld bytes) for stream %lld (previously associated with stream %lld).\n",
device, search_key.d_ptr, (long long)search_key.bytes, (long long)search_key.associated_stream, (long long)block_itr->associated_stream);

Fixed:

cached_bytes[device].free -= search_key.bytes;
cached_bytes[device].live += search_key.bytes;

if (debug) _CubLog("\tDevice %d reused cached block at %p (%lld bytes) for stream %lld (previously associated with stream %lld).\n",
device, search_key.d_ptr, (long long)search_key.bytes, (long long)search_key.associated_stream, (long long)block_itr->associated_stream);

cached_blocks.erase(block_itr);

The same error should occur at several locations where block_itr is used in the debug output.

argsort

Hi, I was wondering if your block-local sorts could return both the sorted values and their indices in the original array?
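One common pattern for this (a hedged sketch, not an official answer from the thread): pair each key with its original index and use the key-value overload of cub::BlockRadixSort, so the sorted indices come out alongside the sorted keys. The kernel name and tile sizes are hypothetical:

```cuda
#include <cub/cub.cuh>

// Hypothetical argsort kernel: 128 threads x 4 items per thread
__global__ void BlockArgSortKernel(const float *d_in, float *d_keys_out, int *d_idx_out)
{
    using BlockRadixSort = cub::BlockRadixSort<float, 128, 4, int>;
    __shared__ BlockRadixSort::TempStorage temp_storage;

    float thread_keys[4];
    int   thread_idx[4];
    int   base = blockIdx.x * 128 * 4 + threadIdx.x * 4;

    for (int i = 0; i < 4; ++i) {
        thread_keys[i] = d_in[base + i];
        thread_idx[i]  = base + i;        // original position carried as the "value"
    }

    // Sorting (key, index) pairs yields the argsort permutation
    BlockRadixSort(temp_storage).Sort(thread_keys, thread_idx);

    for (int i = 0; i < 4; ++i) {
        d_keys_out[base + i] = thread_keys[i];
        d_idx_out[base + i]  = thread_idx[i];
    }
}
```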

cudaErrorIllegalAddress error in DeviceSelect::If()

Being still new to CUDA and CUB, I'm not quite sure it's a CUB error and not mine, but this time I did my very best to exclude all other sources.

The Setup:
GeForce GTX 950, CUDA 8.0, CUB 1.6.4

I allocated two arrays of uint3 (15 million elements each), and DeviceSelect::If() requested 313087 bytes of temp_storage. The compare functor is adapted from the documentation snippet to the uint3 type:

struct Check {
	int compare;
	CUB_RUNTIME_FUNCTION __forceinline__
	Check( int compare ) : compare( compare ) {}

	CUB_RUNTIME_FUNCTION __forceinline__
	bool operator()( const uint3 &a ) const {
		return ( a.x > compare );
	}
};

This doesn't compile with following error:

/home/daktfi/cub/cub/device/dispatch/../../agent/agent_select_if.cuh(279): error: calling a host function("Check::operator ()") from a device function("cub::AgentSelectIf< ::cub::DispatchSelectIf< ::uint3 *, ::cub::NullType *, ::uint3 *, unsigned int *, ::Check, ::cub::NullType, int, (bool)0> ::PtxSelectIfPolicyT, ::uint3 *, ::cub::NullType *, ::uint3 *, ::Check, ::cub::NullType, int, (bool)0> ::InitializeSelections<(bool)1, (bool)0> ") is not allowed

So, I changed the declaration to

struct Check {
	int compare;
	__host__ __device__ __forceinline__
	Check( int compare ) : compare( compare ) {}

	__host__ __device__ __forceinline__
	bool operator()( const uint3 &a ) const {
		return ( a.x > compare );
	}
};

specifying both host and device for the methods. This way the code compiles, but after the launch of DeviceSelect::If() and cudaDeviceSynchronize(), cudaPeekAtLastError() returns 77 (cudaErrorIllegalAddress). I'm fairly sure this isn't carried over from previous asynchronous calls (after every device-related call I invoke cudaDeviceSynchronize() and check cudaPeekAtLastError()), but I am in doubt where else it could come from. All the device code is executed in a non-default cudaStream s, created at the very beginning.
The code snippet is:

Check ch0( 0 );
// Memory is allocated at d_ptr, size equals mem_req, counted ahead of time
uint3 *ptr_a = ( uint3 * )d_ptr, *ptr_b = ptr_a + row_count;
void *d_tmp = ptr_b + row_count;

fprintf( stderr, "DevSelect( %p, %lu, %p, %p, %lu ) %p of %p\n", d_tmp, tmp_select, ptr_a,
	 ptr_b, row_count, ( char * ) d_tmp + tmp_select, ( char * )d_ptr + mem_req );

size_t tmp_sel_ask;
cub::DeviceSelect::If( nullptr, tmp_sel_ask, ptr_a, ptr_b, &length, row_count, ch0, s, true );

cudaDeviceSynchronize();
rc = cudaPeekAtLastError();

if( rc ) {
	std::cerr << "Some cuda error " << cudaGetLastError() << std::endl;
	cudaFree( d_ptr );
	return rc;
} else
	std::cerr << "DevSelect ready, needs " << tmp_sel_ask << " of " << tmp_select << std::endl;

if( tmp_sel_ask > tmp_select ) {
	cudaFree( d_ptr );
	return rc;
} else
	std::cerr << "DevSelect clear to go" << std::endl;

cub::DeviceSelect::If( d_tmp, tmp_select, ptr_a, ptr_b, &length, row_count, ch0, s, true );

cudaDeviceSynchronize();
rc = cudaPeekAtLastError();

if( rc ) {
	std::cerr << "DevSelect error " << cudaGetLastError() << std::endl;
	cudaFree( d_ptr );
	return rc;
} else
	std::cerr << "DevSelect ok" << std::endl;

The output is:

DevSelect( 0xb188d2a00, 313087, 0xb03180000, 0xb0dd29500, 15000000 ) 0xb1891f0ff of 0xb1891f0ff
DevSelect ready, needs 313087 of 313087
DevSelect clear to go
Invoking scan_init_kernel<<<306, 128, 0, 140458720343936>>>()
DevSelect error 77

Can you please look into the issue and send me the right way to fix it?
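One detail worth checking (an observation, not a confirmed diagnosis from the thread): the d_num_selected_out argument of DeviceSelect::If must point to device-accessible memory, whereas &length above appears to be a host stack variable, which could plausibly produce an illegal address. A hedged sketch of that correction, reusing the Check functor from the issue:

```cuda
#include <cub/cub.cuh>

struct Check {
    int compare;
    __host__ __device__ __forceinline__ Check(int compare) : compare(compare) {}
    __host__ __device__ __forceinline__ bool operator()(const uint3 &a) const {
        return (a.x > (unsigned int)compare);   // cast avoids a signed/unsigned mismatch
    }
};

// Hypothetical fix sketch: keep the selected-count output in device memory
void RunSelect(uint3 *ptr_a, uint3 *ptr_b, void *d_tmp, size_t tmp_select,
               int row_count, Check ch0, cudaStream_t s)
{
    int *d_num_selected = nullptr;
    cudaMalloc(&d_num_selected, sizeof(int));   // device-accessible output slot

    cub::DeviceSelect::If(d_tmp, tmp_select, ptr_a, ptr_b,
                          d_num_selected, row_count, ch0, s, true);

    int length = 0;
    cudaMemcpyAsync(&length, d_num_selected, sizeof(int),
                    cudaMemcpyDeviceToHost, s);
    cudaStreamSynchronize(s);
    cudaFree(d_num_selected);
}
```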

DeviceScan::ExclusiveSum blocks execution thread

Found this out with the NVIDIA Visual Profiler:

void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;

// First call: query the required temporary storage size
CubDebugExit(cub::DeviceScan::ExclusiveSum(d_temp_storage,
                                           temp_storage_bytes,
                                           d_in,
                                           d_out,
                                           count,
                                           s));

CubDebugExit(g_allocator.DeviceAllocate(&d_temp_storage, temp_storage_bytes, s));

// Second call: run the scan
CubDebugExit(cub::DeviceScan::ExclusiveSum(d_temp_storage,
                                           temp_storage_bytes,
                                           d_in,
                                           d_out,
                                           count,
                                           s));

asynchronous cuda 

Compiling

error: no instance of function template "gunrock::oprtr::advance::LaunchKernel" matches the argument list
argument types are: (int *, gunrock::app::EnactorStats, gunrock::app::FrontierAttribute, gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>::DataSlice *, VertexId *, __nv_bool *, __nv_bool *, unsigned int *, int *, int *, VertexId *, int *, int *, int *, SizeT *, VertexId *, int, int, gunrock::util::CtaWorkProgressLifetime, mgpu::CudaContext, gunrock::oprtr::advance::TYPE)
detected during:
instantiation of "cudaError_t gunrock::app::bfs::BFSEnactor::EnactBFS<AdvanceKernelPolicy,FilterKernelPolicy,BFSProblem>(mgpu::CudaContext &, BFSProblem *, BFSProblem::VertexId, int) [with INSTRUMENT=true, AdvanceKernelPolicy=gunrock::oprtr::advance::KernelPolicy<gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>, 300, true, 8, 10, 8, 4096, 1, 0, 5, 32, 512, 7, LB>, FilterKernelPolicy=gunrock::oprtr::filter::KernelPolicy<gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>, 300, true, 0, true, 8, 8, 1, 0, 5, 5, 8>, BFSProblem=gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>]"
(495): here
instantiation of "cudaError_t gunrock::app::bfs::BFSEnactor::Enact(mgpu::CudaContext &, BFSProblem *, BFSProblem::VertexId, int, int) [with INSTRUMENT=true, BFSProblem=gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>]"

I am using:
CUDA 6.5
GCC 4.4.7
CentOS 6.4
Boost 1.57.0
CUB 1.3.2
ModernGPU

Check for C++11 should be changed so that Visual Studio 2013 is also recognized as C++11 capable

In file '<cub_root>/cub/host/mutex.cuh', in lines 37 and 70, replace the line
#if __cplusplus > 199711L
with the following line
#if __cplusplus > 199711L || (defined(_MSC_VER) && _MSC_VER >= 1800)

Visual Studio 2013 (and later) also has the C++11 class 'mutex'.

The check using only the value of __cplusplus does not work for VS 2013 because __cplusplus is set to the value 199711L there (unknown for VS 2015).
See http://stackoverflow.com/questions/14131454/visual-studio-2012-cplusplus-and-c-11

allocator.cuh lots of "format '%d'" warnings

I created a simplified test code that reproduces this problem. I get lots of compiler warnings with CUDA 5 on Ubuntu 12.04 64-bit with this code:

#include <cub.cuh>

template <int BlockSize>
__global__ void foo()
{
    typedef cub::BlockReduce<int, BlockSize> BlockMax;
    __shared__ typename BlockMax::SmemStorage smem_storage;
    int myMax = BlockMax::Reduce(smem_storage, threadIdx.x, max);
}

int main(void)
{
    return 1;
}

Compile output and OS/compiler versions in below output.

harrism@snafubuntu:/src/cubtest$ nvcc -I ../cub/cub test.cu
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::SetMaxCachedBytes(size_t)’:
../cub/cub/grid/../allocator.cuh:329:67: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::DeviceAllocate(void*, size_t, cub::GpuOrdinal)’:
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool (*)(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool (*)(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool (*)(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::DeviceFree(void*, cub::GpuOrdinal)’:
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool (*)(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool (*)(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool (*)(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool (*)(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::FreeAllCached()’:
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool (*)(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool (*)(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
harrism@snafubuntu:~/src/cubtest$ uname -a
Linux snafubuntu 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
harrism@snafubuntu:~/src/cubtest$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
harrism@snafubuntu:~/src/cubtest$ g++ --version
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

caching allocator needs to clean up cuda error upon successful retry

When CUB fails to allocate memory, it cleans its cache and retries.
If the retry succeeds, everything works fine until some poor guy checks the last error in CUDA,
and that OOM error pops up. The cure is simple: if a failure was successfully
resolved, please call cudaGetLastError to clean up the sticky CUDA error state. Thank you.

I did what you suggested and I discovered a very high amount of needed memory from the cub::DeviceHistogram::HistogramEven routine. It seems to want more memory than a size_t can represent (assuming size_t is 64-bit).

The histogram routine says it needs about 18446744073709552000 bytes, which is about 16 exabytes of memory. Since this number is larger than a 64-bit size_t can hold, I assume there was an overflow somewhere in the calculation of the needed memory space.

I attached a full program this time and hope you can recreate the problem.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

struct floatExp {
    float a, b, c;
};

int main() {
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1024 * 1024 * 1024);

    unsigned int const sz = 65536 * 2;

    unsigned int *idx_d = NULL;
    unsigned int *newIdx_d = NULL;
    void *tmpSort_d = NULL, *tmpHist_d = NULL;
    size_t tmpSortSz = 0, tmpHistSz = 0;


    cudaMalloc(reinterpret_cast<void **>(&idx_d), sizeof(unsigned int) * sz);

    cudaMalloc(reinterpret_cast<void **>(&newIdx_d), sizeof(unsigned int) * sz);

    float *data_d = NULL;
    float *newData_d = NULL;
    cudaMalloc(reinterpret_cast<void **>(&data_d), sizeof(float) * sz * 3);
    cudaMalloc(reinterpret_cast<void **>(&newData_d), sizeof(float) * sz * 3);

    unsigned int totalBlockSz = 801u*801u*401u+2u;


    unsigned int *idx_h = new unsigned int[sz];
    for (unsigned int i = 0; i < sz; ++i) {
        idx_h[i] = sz - i;
    }
    cudaMemcpy(idx_d, idx_h, sizeof(unsigned int) * sz, cudaMemcpyHostToDevice);


    unsigned int *hist_d = NULL;
    cudaMalloc(reinterpret_cast<void **>(&hist_d), sizeof(unsigned int) * (sz + 1));



    cub::DeviceRadixSort::SortPairs(tmpSort_d, tmpSortSz, idx_d, newIdx_d,
        reinterpret_cast<floatExp *>(const_cast<float *>(data_d)), reinterpret_cast<floatExp *>(newData_d), sz);
    cudaDeviceSynchronize();

    printf("TotalBlockSz %u\n", totalBlockSz);
    printf("sz %u \n", sz);

    cub::DeviceHistogram::HistogramEven(tmpHist_d, tmpHistSz, newIdx_d, hist_d,
        static_cast<int>(totalBlockSz + 2), 0u, totalBlockSz + 1u, static_cast<int>(sz));

    printf("tmpHistSz %f\n", static_cast<float>(tmpHistSz));
    printf("tmpHistSz %lu\n", tmpHistSz);
    cudaDeviceSynchronize();


    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
    printf("Last Cuda error was: %i\n", err);

    system("pause");

    cudaFree(idx_d);
    cudaFree(newIdx_d);
    cudaFree(data_d);
    cudaFree(newData_d);
    cudaFree(hist_d);

    return 0;

}

DeviceRadixSort::SortPairs fails to sort array

When I try to sort an array of roughly 40M pairs or longer, it simply does not sort them, without reporting any errors.
Device is: Device 0: GeForce GTX 950 (PTX version 520, SM520, 6 SMs, 904 free / 1995 total MB physmem, 105.760 GB/s @ 3305000 kHz mem clock, ECC off)
cub version 1.5.5 (latest at the moment).

Sample project to reproduce the problem is attached
check_dev_radix.zip

When run with an increasingly larger array to sort, it eventually fails to sort it.
As I understand it, the critical size depends on the amount of free RAM. The problem is that no error is reported.

Strange segfault compiling with cub, but not calling the cub function

I've been experiencing crashes using libcub in a project I'm working on. I'm not even sure whether the problem I'm seeing is in cub itself, a linking misconfiguration, or a CUDA bug. Anyway, I've narrowed it down to a two-file MCVE:

file cub.cu:

#include <cub/device/device_partition.cuh>

struct an_op {
    typedef bool result_type;
    typedef int argument_type;
    __host__ __device__ bool operator()(const int &a) { return true; }
};

size_t get_scratch_size_outer()
{
    unsigned num_items = 1234;
    size_t scratch_size = 0;
    ::cub::DevicePartition::If<const bool*, bool*, unsigned*, an_op>(
        nullptr,               // d_temp_storage
        scratch_size,          // temp_storage_bytes,
        nullptr,               // d_in,
        nullptr,               // d_out,
        nullptr,               // d_num_selected_out,
        num_items,             // num_items,
        an_op(),               // select_op,
        0,                     // stream             = 0  
        false                  // debug_synchronous  = false
    );
    // ignoring errors here
    return scratch_size;
}

file main.cpp:

#include <cuda_runtime_api.h>

size_t get_scratch_size_outer();

void foo() { get_scratch_size_outer(); }

int main(int argc, char** argv) {
    cudaSetDevice(0);
    enum cudaFuncCache cc;
    cudaDeviceGetCacheConfig(&cc);
    return 0;
}

The build happens as follows:

/usr/local/cuda/bin/nvcc -M -D__CUDACC__ /ghome/joeuser/src/kernel_tester/src/cub.cu -o /home/joeuser/src/kernel_tester/CMakeFiles/ktkernels.dir/src/ktkernels_generated_cub.cu.o.NVCC-depend -ccbin /path/to/local/gcc/bin/gcc-4.9.3 -m64 --std c++11 -D__STRICT_ANSI__ -Xcompiler ,\"-Wall\",\"-g\",\"-g\",\"-DDEBUG\",\"-O0\" -gencode arch=compute_50,code=compute_50 -g -G --generate-line-info -DNVCC -I/usr/local/cuda/include -I/home/joeuser/src/kernel_tester/src -I/path/to/local/gcc/include -I/home/joeuser/opt/cub -I/usr/local/cuda/include
/usr/local/cuda/bin/nvcc /ghome/joeuser/src/kernel_tester/src/cub.cu -c -o /home/joeuser/src/kernel_tester/CMakeFiles/ktkernels.dir/src/./ktkernels_generated_cub.cu.o -ccbin /path/to/local/gcc/bin/gcc-4.9.3 -m64 --std c++11 -D__STRICT_ANSI__ -Xcompiler ,\"-Wall\",\"-g\",\"-g\",\"-DDEBUG\",\"-O0\" -gencode arch=compute_50,code=compute_50 -g -G --generate-line-info -DNVCC -I/usr/local/cuda/include -I/home/joeuser/src/kernel_tester/src -I/path/to/local/gcc/include -I/home/joeuser/opt/cub -I/usr/local/cuda/include
/usr/bin/ar cq lib/libktkernels.a  CMakeFiles/ktkernels.dir/src/ktkernels_generated_cub.cu.o
/usr/bin/ranlib lib/libktkernels.a
/path/to/local/gcc/bin/g++-4.9.3   -D__STRICT_ANSI__ -Wall -std=c++11 -g -g -DDEBUG -O0 -I/ghome/joeuser/src/kernel_tester/src -I/path/to/local/gcc/include -isystem /home/joeuser/opt/cub -I/usr/local/cuda/include -I/home/joeuser/src/kernel_tester/lib    -o CMakeFiles/tester.dir/src/main.cpp.o -c /home/joeuser/src/kernel_tester/src/main.cpp
/path/to/local/gcc/bin/g++-4.9.3   -Wall -std=c++11 -g -g -DDEBUG -O0   CMakeFiles/tester.dir/src/main.cpp.o  -o bin/tester -rdynamic -Wl,-Bstatic -lcudart_static -Wl,-Bdynamic -lpthread -lrt -ldl lib/libktkernels.a -Wl,-Bstatic -lcudart_static -Wl,-Bdynamic -lpthread -lrt -ldl 

Stack trace of segfaulting thread:

        Stack trace of thread 31006:
        #0  0x00007ff067e78edb n/a (libcuda.so.1)
        #1  0x00007ff067e7a457 n/a (libcuda.so.1)
        #2  0x00007ff067e9454e n/a (libcuda.so.1)
        #3  0x00007ff067e389cd n/a (libcuda.so.1)
        #4  0x00007ff067e7f183 n/a (libcuda.so.1)
        #5  0x00007ff067e82cac n/a (libcuda.so.1)
        #6  0x00007ff067e389cd n/a (libcuda.so.1)
        #7  0x00007ff067e94fda n/a (libcuda.so.1)
        #8  0x00007ff067e53615 n/a (libcuda.so.1)
        #9  0x00007ff067e389cd n/a (libcuda.so.1)
        #10 0x00007ff067e569c3 n/a (libcuda.so.1)
        #11 0x00007ff067e592ee n/a (libcuda.so.1)
        #12 0x00007ff067e5049f n/a (libcuda.so.1)
        #13 0x00007ff067e2d9e2 n/a (libcuda.so.1)
        #14 0x00007ff067c89123 n/a (libcuda.so.1)
        #15 0x00007ff067c89b7e n/a (libcuda.so.1)
        #16 0x00007ff067d76f80 n/a (libcuda.so.1)
        #17 0x00007ff067d77290 n/a (libcuda.so.1)
        #18 0x0000000000428e3d cudart::contextState::loadCubin(bool*, void**) (tester)
        #19 0x000000000041bbd0 cudart::globalModule::loadIntoContext(cudart::contextState*) (tester)
        #20 0x0000000000427fa6 cudart::contextState::applyChanges() (tester)
        #21 0x000000000042c711 cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) (tester)
        #22 0x000000000042031c cudart::doLazyInitContextState() (tester)
        #23 0x0000000000412c3e cudart::cudaApiDeviceGetCacheConfig(cudaFuncCache*) (tester)
        #24 0x00000000004355b4 cudaDeviceGetCacheConfig (tester)
        #25 0x0000000000402f06 main (tester)
        #26 0x00007ff068ab2700 __libc_start_main (libc.so.6)
        #27 0x0000000000402e09 _start (tester)

Notes:

  • I am not actually calling any cub code.
  • This crashes on a machine with a Maxwell Titan X and CUDA 7.5, and doesn't crash on a machine with a GTX 650 and CUDA 8.0 RC.
  • Both machines are Intel x86_64 machines running Linux 4.x.

CUDA error: misaligned address after cub::DeviceRadixSort::SortKeys()

The following program:

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cub/device/device_radix_sort.cuh>


template<typename T>
void cub_sort(T* first, size_t n)
{
  cudaError_t error = cudaSuccess;

  cub::DoubleBuffer<T> double_buffer;

  // measure the number of additional temporary storage bytes required
  size_t num_additional_temp_storage_bytes = 0;
  error = cub::DeviceRadixSort::SortKeys(0, num_additional_temp_storage_bytes, double_buffer, n);
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(0)");
  }

  // allocate temporary storage for double buffer as well as additional bytes
  // XXX seems like we should align up the additional bytes
  size_t num_double_buffer_bytes = n * sizeof(T);
  thrust::device_vector<char> temporary_storage(num_double_buffer_bytes + num_additional_temp_storage_bytes);

  void* additional_temp_storage_ptr = thrust::raw_pointer_cast(temporary_storage.data() + num_double_buffer_bytes);

  double_buffer.d_buffers[0] = thrust::raw_pointer_cast(&*first);
  double_buffer.d_buffers[1] = reinterpret_cast<T*>(thrust::raw_pointer_cast(temporary_storage.data()));

  error = cub::DeviceRadixSort::SortKeys(additional_temp_storage_ptr,
                                         num_additional_temp_storage_bytes,
                                         double_buffer,
                                         static_cast<int>(n));

  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(1)");
  }

  error = cudaDeviceSynchronize();
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(1)");
  }

  if(double_buffer.Current() != 0)
  {
    T* temp_ptr = reinterpret_cast<T*>(double_buffer.d_buffers[1]);
    thrust::copy(thrust::cuda::par, temp_ptr, temp_ptr + n, first);
  }
}


int main()
{
   thrust::device_vector<int> data(1);

   cub_sort(thrust::raw_pointer_cast(data.data()), data.size());

  return 0;
}

Produces the following error at runtime:

$ nvcc -I. -arch=sm_35 misaligned_address.cu -run
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  after cub::DeviceRadixSort::SortKeys(1): misaligned address
Aborted (core dumped)

cub::DeviceScan::ExclusiveSum can't prefix-sum float into double

To increase accuracy (in my case the array has 100M elements and the loss is noticeable) I'd like to store the result in a double*, but I'm getting the following compilation error:

code https://github.com/sh1ng/arboretum/blob/master/src/core/garden.cu#L314

/Users/sh1ng/Projects/cub/cub/agent/agent_scan.cuh:351: error: no instance of overloaded function "cub::BlockStore<OutputIteratorT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Store [with OutputIteratorT=double *, BLOCK_DIM_X=128, ITEMS_PER_THREAD=12, ALGORITHM=cub::BLOCK_STORE_WARP_TRANSPOSE, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=610]" matches the argument list
argument types are: (double *, float [12])
object type is: cub::BlockStore<double *, 128, 12, cub::BLOCK_STORE_WARP_TRANSPOSE, 1, 1, 610>
detected during:
instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ConsumeTile<IS_FULL_TILE>(OffsetT, OffsetT, int, OffsetT, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ScanTileStateT &) [with AgentScanPolicyT=cub::DispatchScan<float *, double *, cub::Sum, float, int>::PtxAgentScanPolicy, InputIteratorT=float *, OutputIteratorT=double *, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int, IS_FULL_TILE=true]"
(372): here
instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ConsumeRange(int, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ScanTileStateT &) [with AgentScanPolicyT=cub::DispatchScan<float *, double *, cub::Sum, float, int>::PtxAgentScanPolicy, InputIteratorT=float *, OutputIteratorT=double *, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int]"
../cub/cub/device/dispatch/dispatch_scan.cuh(127): here
instantiation of "void cub::DeviceScanSweepKernel<ScanPolicyT,InputIteratorT,OutputIteratorT,ScanTileStateT,ScanOpT,IdentityT,OffsetT>(InputIteratorT, OutputIteratorT, ScanTileStateT, ScanOpT, IdentityT, OffsetT) [with ScanPolicyT=cub::DispatchScan<float *, double *, cub::Sum, float, int>::PtxAgentScanPolicy, InputIteratorT=float *, OutputIteratorT=double *, ScanTileStateT=cub::ScanTileState<float, true>, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int]"
../cub/cub/device/dispatch/dispatch_scan.cuh(543): here
instantiation of "cudaError_t cub::DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::Dispatch(void *, size_t &, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT, cudaStream_t, __nv_bool) [with InputIteratorT=float *, OutputIteratorT=double *, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int]"
../cub/cub/device/device_scan.cuh(156): here
instantiation of "cudaError_t cub::DeviceScan::ExclusiveSum(void *, size_t &, InputIteratorT, OutputIteratorT, int, cudaStream_t, __nv_bool) [with InputIteratorT=float *, OutputIteratorT=double *]"
src/core/garden.cu(314): here
instantiation of "void arboretum::core::GardenBuilder<node_type>::FindBestSplits(int, const arboretum::io::DataMatrix *, const thrust::host_vector<float, thrust::system::cuda::experimental::pinned_allocator> &) [with node_type=unsigned short]"
src/core/garden.cu(135): here
instantiation of "void arboretum::core::GardenBuilder<node_type>::GrowTree(arboretum::core::RegTree *, const arboretum::io::DataMatrix *, const thrust::host_vector<float, thrust::system::cuda::experimental::pinned_allocator> &) [with node_type=unsigned short]"
src/core/garden.cu(105): here
instantiation of "arboretum::core::GardenBuilder<node_type>::GardenBuilder(const arboretum::core::TreeParam &, const arboretum::io::DataMatrix *) [with node_type=unsigned short]"
src/core/garden.cu(509): here

Retaining zips of previous branches/versions

I noticed that the 1.5.1 branch was removed, along with the cub.1.5.1.zip file when 1.5.2 was released. Would it be possible to retain archives of these older versions so that builds can be reproducible? In my case, I had a 1.5.1 dependency in a Dockerfile. This isn't a train smash for me by any means, but it'd be nice to have some future-proofing.

DeviceRadixSort::SortKeys returns all 0

Hi, I'm using DeviceRadixSort::SortKeys to sort an array of integers. If the num_items passed to SortKeys is <= 1792 the output vector is all zeros. If I increase the num_items to 1793 the output is correct (without changing any of the other arguments). For size <=1792 it seems to use the single tile path. I'm compiling for sm_30. Any ideas how to further track this down?

cub::DeviceRadixSort::SortPairs for multiple values

I have three vectors

thrust::device_vector<unsigned long> keys(N);
thrust::device_vector<unsigned> value1(N);
thrust::device_vector<unsigned> value2(N);

I am trying to achieve the equivalent of thrust::sort_by_key using cub::DeviceRadixSort::SortPairs.

thrust::sort_by_key(keys.begin(), keys.end(),
    thrust::make_zip_iterator(thrust::make_tuple(value1.begin(), value2.begin())));

So essentially, my question is, how do I sort by key where each key has multiple values associated to it?

How can I prefix sum array of floats into double array

In my case, to reduce computation error, I'd like to use double for the result.

But I'm getting a compilation error when I use the method
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::ExclusiveSum
If I provide both vectors of the same type (float or double), it compiles fine.
But I don't want to use double for the original array (to save memory), nor float for the result array (to preserve accuracy).

How can I sum floats into doubles?

DeviceSegmentedRadixSort should allow d_keys_in, d_values_in, d_begin_offsets, and d_end_offsets to have pointer-to-const type

The DeviceSegmentedRadixSort member functions should allow input-only parameters to have pointer-to-const type.

For example, the documentation for the SortKeys() variant not taking a DoubleBuffer specifies:

The contents of the input data are not altered by the sorting operation

... and it also documents that d_keys_in, d_begin_offsets, and d_end_offsets are input-only parameters. However, attempting to pass a pointer-to-const results in a compilation error:

#include <cstdlib>
#include <iostream>

#include <cuda_runtime.h>

#include <cub/device/device_segmented_radix_sort.cuh>

#define CUDA_CALL(call) \
  do { \
    const cudaError_t err = (call); \
    if (cudaSuccess != err) { \
      std::cerr << __FILE__ << ':' << __LINE__ << ": `" #call "' failed because of: " << cudaGetErrorString(err) << '\n' << std::flush; \
      return EXIT_FAILURE; \
    } \
  } while (0)

int main(int argc, char *argv[]) {

  // Declare, allocate, and initialize device-accessible pointers for sorting data
  int  num_items;          // e.g., 7
  int  num_segments;       // e.g., 3
  int  *d_offsets;         // e.g., [0, 3, 3, 7]
  int  *d_keys_in;         // e.g., [8, 6, 7, 5, 3, 0, 9]
  int  *d_keys_out;        // e.g., [-, -, -, -, -, -, -]

  num_items = 7;
  num_segments = 3;
  int  h_offsets[] = { 0, 3, 3, 7 };
  CUDA_CALL(cudaMalloc((void**)&d_offsets, (num_segments + 1) * sizeof(int)));
  CUDA_CALL(cudaMemcpy((void*)d_offsets, (const void*)h_offsets, (num_segments + 1) * sizeof(int), cudaMemcpyHostToDevice));
  int  h_keys_in[] = { 8, 6, 7, 5, 3, 0, 9 };
  CUDA_CALL(cudaMalloc((void**)&d_keys_in, num_items * sizeof(int)));
  CUDA_CALL(cudaMemcpy((void*)d_keys_in, (const void*)h_keys_in, num_items * sizeof(int), cudaMemcpyHostToDevice));
  int  *h_keys_out = new int[num_items];
  CUDA_CALL(cudaMalloc((void**)&d_keys_out, num_items * sizeof(int)));

  // Determine temporary device storage requirements
  void     *d_temp_storage = NULL;
  size_t   temp_storage_bytes = 0;
  CUDA_CALL(cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage,
                                                    temp_storage_bytes,
                                                    (const int*)d_keys_in,
                                                    d_keys_out,
                                                    num_items,
                                                    num_segments,
                                                    d_offsets,
                                                    d_offsets + 1));
  // Allocate temporary storage
  CUDA_CALL(cudaMalloc(&d_temp_storage, temp_storage_bytes));
  // Run sorting operation
  CUDA_CALL(cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage,
                                                    temp_storage_bytes,
                                                    (const int*)d_keys_in,
                                                    d_keys_out,
                                                    num_items,
                                                    num_segments,
                                                    d_offsets,
                                                    d_offsets + 1));

  CUDA_CALL(cudaMemcpy((void*)h_keys_out, (const void*)d_keys_out, num_items * sizeof(int), cudaMemcpyDeviceToHost));

  std::cout << '[';
  for (int i = 0; i < num_items; ++i) {
    if (i != 0) std::cout << ", ";
    std::cout << h_keys_out[i];
  }
  std::cout << "]\n";

  return EXIT_SUCCESS;
}

Removing the casts to const int* in the two invocations of cub::DeviceSegmentedRadixSort::SortKeys() allows the sample to compile and produce the expected output:

[6, 7, 8, 0, 3, 5, 9]

Floating point exception (core dumped) during cub::DeviceRadixSort::SortKeys

The following program:

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cub/device/device_radix_sort.cuh>


template<typename T>
void cub_sort(T* first, size_t n)
{
  cudaError_t error = cudaSuccess;

  cub::DoubleBuffer<T> double_buffer;

  // measure the number of additional temporary storage bytes required
  size_t num_additional_temp_storage_bytes = 0;
  error = cub::DeviceRadixSort::SortKeys(0, num_additional_temp_storage_bytes, double_buffer, n);
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(0)");
  }

  // allocate temporary storage for double buffer as well as additional bytes
  // XXX seems like we should align up the additional bytes
  size_t num_double_buffer_bytes = n * sizeof(T);
  thrust::device_vector<char> temporary_storage(num_double_buffer_bytes + num_additional_temp_storage_bytes);

  void* additional_temp_storage_ptr = thrust::raw_pointer_cast(temporary_storage.data() + num_double_buffer_bytes);

  double_buffer.d_buffers[0] = thrust::raw_pointer_cast(&*first);
  double_buffer.d_buffers[1] = reinterpret_cast<T*>(thrust::raw_pointer_cast(temporary_storage.data()));

  error = cub::DeviceRadixSort::SortKeys(additional_temp_storage_ptr,
                                         num_additional_temp_storage_bytes,
                                         double_buffer,
                                         static_cast<int>(n));

  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(1)");
  }

  if(double_buffer.Current() != 0)
  {
    T* temp_ptr = reinterpret_cast<T*>(double_buffer.d_buffers[1]);
    thrust::copy(thrust::cuda::par, temp_ptr, temp_ptr + n, first);
  }
}


int main()
{
   thrust::device_vector<int> data(0);

   cub_sort(thrust::raw_pointer_cast(data.data()), data.size());

  return 0;
}

Generates the following output at runtime:

$ nvcc -arch=sm_35 -I. cub_repro.cu -run
Floating point exception (core dumped)

Presumably a divide by zero occurs due to n == 0.

Compilation error when doing ExclusiveSum with a custom operation

cub 1.5.5

error: no instance of function template "cub::DeviceScan::ExclusiveSum" matches the argument list
argument types are: (void *, size_t, float2 *, float2 *, arboretum::core::CustomSum, float2, const size_t, cudaStream_t)

or

error: no instance of function template "cub::DeviceScan::ExclusiveSum" matches the argument list
argument types are: (long, size_t, float *, float *, arboretum::core::CustomSum, float, const size_t, cudaStream_t)

    struct CustomSum
    {
        template <typename T>
        CUB_RUNTIME_FUNCTION __forceinline__
        T operator()(const T &a, const T &b) const {
            return a + b;
        }
    };

RFE: cub::BlockScan : different type for input and output data, as well as scan operator

  1. Allow both inclusive and exclusive sums to have different types for input and output. For example, the input type is float and the output type is int: the sum is done as float, but the result is stored as int.
  2. Allow the scan operator to have a different type than the input or output types [1]. For example, if the input type is "float", the output type is "int", and the scan operator has the signature "double operator()(double, double)", the final result of the scan is of type "double" but stored as "int". This will help avoid precision loss during the scan operation in some cases.

http://en.cppreference.com/w/cpp/algorithm/exclusive_scan

Compilation error with GCC 4.8.4 nvcc 7.0.27

When I attempt to build cub on Ubuntu 14.04 with gcc 4.8.4 and nvcc 7.0.27, I get this error:

thread_load.cuh(423): error: reinterpret_cast cannot cast away const or other type qualifiers
          detected during:
            instantiation of "T cub::ThreadLoad(T *, cub::Int2Type<MODIFIER>, cub::Int2Type<1>) [with T=const pyramid::weight_type, MODIFIER=3]"
(442): here
            instantiation of "std::iterator_traits<InputIteratorT>::value_type cub::ThreadLoad<MODIFIER,InputIteratorT>(InputIteratorT) [with MODIFIER=cub::LOAD_CS, InputIteratorT=const pyramid::weight_type *]"
kernels.cu(56): here

It looks like this is related.
