nvidia / cub Goto Github PK

View Code? Open in Web Editor NEW

1.7K 88.0 444.0 17.94 MB

[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl

License: BSD 3-Clause "New" or "Revised" License

Cuda 93.28% C++ 2.11% CMake 2.01% Python 1.12% Jupyter Notebook 1.48%

cuda nvidia-hpc-sdk cub gpu cpp cpp11 cpp14 cpp17 cpp20 cxx

cub's Issues

Infinite loop in cub::CachingDeviceAllocator::NearestPowerOf

it happens for values between 2^63 and 2^64-1

cub::NumericTraits has no methods "Lowest", "Max"

When trying to compile the reduction example, the following error message overflows the stack. They all look like:

../../test/test_util.h(987): error: class "cub::NumericTraits" has no member "Lowest"

[Question] How to use external memory for temp storage?

When calling

BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values);

the examples demand memory that can be seen by all threads e.g.

shared typename BlockRadixSort::TempStorage temp_storage;

However, how do I use external memory if temp_storage needs more than 48KiB? How do I allocate this memory from the host?

Thanks in advance,

Christian

Compiling

error: no instance of function template "gunrock::oprtr::advance::LaunchKernel" matches the argument list
argument types are: (int *, gunrock::app::EnactorStats, gunrock::app::FrontierAttribute, gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>::DataSlice *, VertexId *, __nv_bool *, __nv_bool *, unsigned int *, int *, int *, VertexId *, int *, int *, int *, SizeT *, VertexId *, int, int, gunrock::util::CtaWorkProgressLifetime, mgpu::CudaContext, gunrock::oprtr::advance::TYPE)
detected during:
instantiation of "cudaError_t gunrock::app::bfs::BFSEnactor::EnactBFS<AdvanceKernelPolicy,FilterKernelPolicy,BFSProblem>(mgpu::CudaContext &, BFSProblem *, BFSProblem::VertexId, int) [with INSTRUMENT=true, AdvanceKernelPolicy=gunrock::oprtr::advance::KernelPolicy<gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>, 300, true, 8, 10, 8, 4096, 1, 0, 5, 32, 512, 7, LB>, FilterKernelPolicy=gunrock::oprtr::filter::KernelPolicy<gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>, 300, true, 0, true, 8, 8, 1, 0, 5, 5, 8>, BFSProblem=gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>]"
(495): here
instantiation of "cudaError_t gunrock::app::bfs::BFSEnactor::Enact(mgpu::CudaContext &, BFSProblem *, BFSProblem::VertexId, int, int) [with INSTRUMENT=true, BFSProblem=gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>]"

I am using:
CUDA 6.5,
GCC4.4.7
CENTOS 6.4
Boost version: 1.57.0
cub 1.3.2
modernGPU

Any plans to support Pascal generation(compute_61, code=sm_61)?

cudaErrorIllegalAddress error in DeviceSelect::If()

Being still new to CUDA and CUB, I'm not quite sure it's a CUB error and not mine, but this time I did my very best to exclude all other sources.

The Setup:
the GeForce GTX 950, CUDA 8.0, CUB 1.6.4

I allocated two arrays of uint3 (15 millions each) and requested by DeviceSelect::If() 313087 bytes for temp_storage. The compare functor is modified from snippet accordingly to type:

struct Check {
	int compare;
	CUB_RUNTIME_FUNCTION __forceinline__
	Check( int compare ) : compare( compare ) {}

	CUB_RUNTIME_FUNCTION __forceinline__
	bool operator()( const uint3 &a ) const {
		return ( a.x > compare );
	}
};

This doesn't compile with following error:

/home/daktfi/cub/cub/device/dispatch/../../agent/agent_select_if.cuh(279): error: calling a host function("Check::operator ()") from a device function("cub::AgentSelectIf< ::cub::DispatchSelectIf< ::uint3 *, ::cub::NullType *, ::uint3 *, unsigned int *, ::Check, ::cub::NullType, int, (bool)0> ::PtxSelectIfPolicyT, ::uint3 *, ::cub::NullType *, ::uint3 *, ::Check, ::cub::NullType, int, (bool)0> ::InitializeSelections<(bool)1, (bool)0> ") is not allowed

So, I changed the declaration to

struct Check {
	int compare;
	__host__ __device__ __forceinline__
	Check( int compare ) : compare( compare ) {}

	__host__ __device__ __forceinline__
	bool operator()( const uint3 &a ) const {
		return ( a.x > compare );
	}
};

specifying both host and device for methods. This way the code compiles, but after launch of DeviceSelect::If() and cudaDeviceSynchronise() cudaPeekAtLastError() returns 77 (cudaErrorIllegalAddress). I'm pretty much sure this isn't carried from previous asynchronous calls (after every device-related call I invoke cudaDeviceSynchronise() and check cudaPeekAtLastError()), but quite in doubt where else it can come from... All the device code is executed in non-default cudaStream s, created at the very beginning.
The code snippet is:

Check ch0( 0 );
// Memory is allocated at d_ptr, size equals mem_req, counted ahead of time
uint3 *ptr_a = ( uint3 * )d_ptr, *ptr_b = ptr_a + row_count;
void *d_tmp = ptr_b + row_count;

fprintf( stderr, "DevSelect( %p, %lu, %p, %p, %lu ) %p of %p\n", d_tmp, tmp_select, ptr_a,
	 ptr_b, row_count, ( char * ) d_tmp + tmp_select, ( char * )d_ptr + mem_req );

size_t tmp_sel_ask;
cub::DeviceSelect::If( nullptr, tmp_sel_ask, ptr_a, ptr_b, &length, row_count, ch0, s, true );

cudaDeviceSynchronize();
rc = cudaPeekAtLastError();

if( rc ) {
	std::cerr << "Some cuda error " << cudaGetLastError() << std::endl;
	cudaFree( d_ptr );
	return rc;
} else
	std::cerr << "DevSelect ready, needs " << tmp_sel_ask << " of " << tmp_select << std::endl;

if( tmp_sel_ask > tmp_select ) {
	cudaFree( d_ptr );
	return rc;
} else
	std::cerr << "DevSelect clear to go" << std::endl;

cub::DeviceSelect::If( d_tmp, tmp_select, ptr_a, ptr_b, &length, row_count, ch0, s, true );

cudaDeviceSynchronize();
rc = cudaPeekAtLastError();

if( rc ) {
	std::cerr << "DevSelect error " << cudaGetLastError() << std::endl;
	cudaFree( d_ptr );
	return rc;
} else
	std::cerr << "DevSelect ok" << std::endl;

The output is:

DevSelect( 0xb188d2a00, 313087, 0xb03180000, 0xb0dd29500, 15000000 ) 0xb1891f0ff of 0xb1891f0ff
DevSelect ready, needs 313087 of 313087
DevSelect clear to go
Invoking scan_init_kernel<<<306, 128, 0, 140458720343936>>>()
DevSelect error 77

Can you please look into the issue and send me the right way to fix it?

RFE: cub::BlockScan::ScanByKey

Need thurst::{exclusive,inclusive}_scan_by_key like functionality in CUB,which is currently absent.

CUDA error: misaligned address after cub::DeviceRadixSort::SortKeys()

The following program:

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cub/device/device_radix_sort.cuh>


template<typename T>
void cub_sort(T* first, size_t n)
{
  cudaError_t error = cudaSuccess;

  cub::DoubleBuffer<T> double_buffer;

  // measure the number of additional temporary storage bytes required
  size_t num_additional_temp_storage_bytes = 0;
  error = cub::DeviceRadixSort::SortKeys(0, num_additional_temp_storage_bytes, double_buffer, n);
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(0)");
  }

  // allocate temporary storage for double buffer as well as additional bytes
  // XXX seems like we should align up the additional bytes
  size_t num_double_buffer_bytes = n * sizeof(T);
  thrust::device_vector<char> temporary_storage(num_double_buffer_bytes + num_additional_temp_storage_bytes);

  void* additional_temp_storage_ptr = thrust::raw_pointer_cast(temporary_storage.data() + num_double_buffer_bytes);

  double_buffer.d_buffers[0] = thrust::raw_pointer_cast(&*first);
  double_buffer.d_buffers[1] = reinterpret_cast<T*>(thrust::raw_pointer_cast(temporary_storage.data()));

  error = cub::DeviceRadixSort::SortKeys(additional_temp_storage_ptr,
                                         num_additional_temp_storage_bytes,
                                         double_buffer,
                                         static_cast<int>(n));

  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(1)");
  }

  error = cudaDeviceSynchronize();
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(1)");
  }

  if(double_buffer.Current() != 0)
  {
    T* temp_ptr = reinterpret_cast<T*>(double_buffer.d_buffers[1]);
    thrust::copy(thrust::cuda::par, temp_ptr, temp_ptr + n, first);
  }
}


int main()
{
   thrust::device_vector<int> data(1);

   cub_sort(thrust::raw_pointer_cast(data.data()), data.size());

  return 0;
}

Produces the following error at runtime:

$ nvcc -I. -arch=sm_35 misaligned_address.cu -run
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  after cub::DeviceRadixSort::SortKeys(1): misaligned address
Aborted (core dumped)

1.5.5 breaks BlockScan API

allocator.cuh lots of "format '%d'" warnings

I created a simplified test code that reproduces this problem. I get lots of compiler warnings with CUDA 5 on Ubuntu 12.04 64-bit with this code:

#include <cub.cuh>

template <int BlockSize>
__global__ void foo()
{
    typedef cub::BlockReduce<int, BlockSize> BlockMax;
    __shared__ typename BlockMax::SmemStorage smem_storage;
    int myMax = BlockMax::Reduce(smem_storage, threadIdx.x, max);
}

int main(void)
{
    return 1;
}

Compile output and OS/compiler versions in below output.

harrism@snafubuntu:/src/cubtest$ nvcc -I ../cub/cub test.cu
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::SetMaxCachedBytes(size_t)’:
../cub/cub/grid/../allocator.cuh:329:67: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::DeviceAllocate(void*, size_t, cub::GpuOrdinal)’:
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::DeviceFree(void, cub::GpuOrdinal)’:
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::FreeAllCached()’:
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
harrism@snafubuntu:/src/cubtest$ uname -a
Linux snafubuntu 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
harrism@snafubuntu:/src/cubtest$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
harrism@snafubuntu:/src/cubtest$ g++ --version
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Support for custom types in cub::DeviceRadixSort?

I couldn't find a better place to ask this question, so pardon me for asking here:

Is there a way to use keys of non-standard types for cub::DeviceRadixSort?

So far I managed to sort arrays of ints, floats, doubles, longs, bools, etc.

But what I would like to sort an array where each item is a custom type:

typedef struct {
  char data[32]
} LongLongLong;

Unfortunately I get the following error when I try to do that:

../cub/device/dispatch/dispatch_radix_sort.cuh(254): error: class "cub::Traits<LongLongLong>" has no member "UnsignedBits"

Should I maybe add a NumericTraits for my custom type? How would I go about that?

Thank you.

Check for C++ 11 should be changed that Visual Studio 2013 is also recognized as C++ 11 capable

In file '<cub_root>/cub/host/mutex.cuh', in lines 37 and 70, replace the line
#if __cplusplus > 199711L
with the following line
#if __cplusplus > 199711L || (defined(_MSC_VER) && _MSC_VER >= 1800)

Visual Studio 2013 (and later) has also the C++ 11 class 'mutex'.

The check only using the value of "__cplusplus" does not work for VS 2013 because it is set to v alue 199711L for VS 2013 (don't know for VS 2015).
See http://stackoverflow.com/questions/14131454/visual-studio-2012-cplusplus-and-c-11

cub::TilePrefixCallbackOp::WarpReduce doesn't permit ptx arch specialization

cub::TilePrefixCallbackOp should also accept ptx arch as its template argument, to specialise WarpReduce<T> (https://github.com/NVlabs/cub/blob/1.5.5/cub/agent/single_pass_scan_operators.cuh#L658)

RFEs for cub::DeviceScan::ExclusiveScan

Add API that also acceptsinit value and constructs an exclusive scan with the following semantics:output[0] = init, output[i>0] = init ⊕ input[0] ⊕ ... ⊕ input[i-1]
Add API that doesn't require identity value for an user-provided reduction operator. If user doesn't know, or doesn't want to provide one, exclusive scan, similar to inclusive scan, should just work. If this is less efficient, that's okay and its user's choice to have faster implementation by providing the identityvalue.

CUDA error: invalid device function when DeviceRadixSort::SortKeys called from host device function

The following program:

#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__>= 350 && defined(__CUDACC_RDC__))
#  define CUB_CDP 1
#endif

#include <thrust/device_vector.h>
#include <thrust/tabulate.h>
#include <thrust/functional.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <cub/device/device_radix_sort.cuh>
#include <cassert>

template<typename T>
__host__ __device__
cudaError_t cub_sort_n(T* first, size_t n)
{
  cudaError_t result = cudaSuccess;

#if !defined(__CUDA_ARCH__) || CUB_CDP
  // XXX the additional temporary storage bytes
  //     must be allocated on a 16b aligned address
  struct __align__(16) aligned_type {};

  if(n > 1)
  {
    cub::DoubleBuffer<T> double_buffer;

    // measure the number of additional temporary storage bytes required
    size_t num_additional_temp_storage_bytes = 0;

    cudaError_t error = cub::DeviceRadixSort::SortKeys(0, num_additional_temp_storage_bytes, double_buffer, static_cast<int>(n));
    if(error != cudaSuccess) return error;

    // allocate temporary storage for double buffer as well as additional bytes
    size_t num_double_buffer_bytes = n * sizeof(T);
    size_t num_aligned_double_buffer_bytes = thrust::detail::util::round_i(num_double_buffer_bytes, sizeof(aligned_type));
    size_t num_aligned_total_temporary_storage_bytes = num_aligned_double_buffer_bytes + num_additional_temp_storage_bytes;

    thrust::device_vector<char> temporary_storage(num_aligned_total_temporary_storage_bytes);

    void* additional_temp_storage_ptr = thrust::raw_pointer_cast(temporary_storage.data() + num_aligned_double_buffer_bytes);

    double_buffer.d_buffers[0] = thrust::raw_pointer_cast(&*first);
    double_buffer.d_buffers[1] = reinterpret_cast<T*>(thrust::raw_pointer_cast(temporary_storage.data()));

    result = cub::DeviceRadixSort::SortKeys(additional_temp_storage_ptr,
                                            num_additional_temp_storage_bytes,
                                            double_buffer,
                                            static_cast<int>(n));
    if(result != cudaSuccess) return result;

    if(double_buffer.Current() != 0)
    {
      T* temp_ptr = reinterpret_cast<T*>(double_buffer.d_buffers[1]);
      thrust::copy(thrust::device, temp_ptr, temp_ptr + n, first);
    }
  }
#else
  result = cudaErrorNotSupported;
#endif

  return result;
}


int main()
{
  size_t n = 2;

  thrust::host_vector<int> ref(n);

  // generate unsorted values
  thrust::tabulate(ref.begin(), ref.end(), thrust::placeholders::_1 % 1024);

  thrust::device_vector<int> data = ref;

  cudaError_t error = cub_sort_n(thrust::raw_pointer_cast(data.data()), data.size());
  if(error)
  {
    std::cerr << "CUDA error: " << cudaGetErrorString(error) << std::endl;
    std::exit(-1);
  }

  thrust::sort(ref.begin(), ref.end());

  assert(ref == data);

  assert(thrust::is_sorted(data.begin(), data.end()));

  return 0;
}

Yields the following output at runtime:

$ nvcc -I. repro.cu -run
CUDA error: invalid device function

I believe the way that CUB uses the CUB_CDP macro is causing the host compiler to elide the CUDA kernels because it thinks they are unused. It requires quite a few workarounds to get this right -- CUB should probably perform these on behalf of the client.

Numerical stability issues with block load & store using ITEMS_PER_THREAD = 1

There appear to be numerical stability issues when I use cub::BlockLoad/cub::BlockStore with ITEMS_PER_THREAD = 1. Using ITEMS_PER_THREAD = 2 or more is okay.

Steps to reproduce:

Install Boost. (I am on Mac, and I installed Boost via the Homebrew bottle: brew install boost)
Save the attached test.cu and Makefile to a new directory. This test case computes a Hadamard product C <- alpha * A (*) B + beta * C via a "basic" kernel (basicHadKernel) and one using CUB (cubHadKernel).
Edit Makefile to change the include path to CUB.

Run make && ./test. You should see:

Running 1 test case...

*** No errors detected

Change line 38 of test.cu to:

static constexpr int ITEMS_PER_THREAD = 1;

Run make && ./test. You should now see something like:

Running 1 test case...
test.cu:279: error: in "test_had": difference{1.43188e-05} between C2[i]{-0.00731719984} and expected[i]{-0.00731730461} exceeds 0.001%
test.cu:279: error: in "test_had": difference{0.000314992} between C2[i]{0.000154208246} and expected[i]{0.000154256821} exceeds 0.001%
test.cu:279: error: in "test_had": difference{1.46731e-05} between C2[i]{0.0272928048} and expected[i]{0.0272932053} exceeds 0.001%
[... output clipped ...]

*** 333 failures are detected in the test module "tests"

As you can see, many values now exceed the tolerance threshold.

Can't sum float2 into double2

Hi,

thanks you for fixing https://github.com/NVlabs/cub/issues/59

Now I'm trying to sum float2 into double2 and getting error


_cub/cub/block/block_load.cuh(85): error: no operator "=" matches these operands
            operand types are: double2 = float2

Compilation error with GCC 4.8.4 nvcc 7.0.27

When I attempt to build cub on Ubuntu 14.04 with gcc 4.8.4 and nvcc 7.0.27, I get this error:

thread_load.cuh(423): error: reinterpret_cast cannot cast away const or other type qualifiers
          detected during:
            instantiation of "T cub::ThreadLoad(T *, cub::Int2Type<MODIFIER>, cub::Int2Type<1>) [with T=const pyramid::weight_type, MODIFIER=3]"
(442): here
            instantiation of "std::iterator_traits<InputIteratorT>::value_type cub::ThreadLoad<MODIFIER,InputIteratorT>(InputIteratorT) [with MODIFIER=cub::LOAD_CS, InputIteratorT=const pyramid::weight_type *]"
kernels.cu(56): here

It looks like this is related.

caching allocator needs to clean up cuda error upon successful retry

when CUB fails to allocate memory it cleans cache and retries.
If it succeeds then everything works fine until some poor guy checks last error in CUDA.
And that OOM error pops up. The cure is simple: if there was a failure which was successfully
resolved then please make cudaGetLastError call to clean up CUDA state. Thank you.

Warpreduce executes reduction operator for out-of-bounds items

A 1.5.5 change in the signedness of lane_id is allowing lanes that are past "last lane" to execute their reduction operators. Although the partial sums produced by these lanes are not incoroporated into result returned to lane0, this presents a problem if the operation on uniniitalized data throws an error (memory exception, divide by zero, etc.)

Infinite loop in cub::CachingDeviceAllocator::NearestPowerOf

It happens for sizes between 2^63 and 2^64-1

Strange segfault compiling with cub, but calling the cub function

I've been experiencing crashes using libcub in a project I'm working on. I'm not even sure the problem I'm seeing is with cub itself, or whether it's a linking misconfiguration issue. Or a CUDA bug. Anyway, I've narrowed it down to a two-file MCVE:

file cub.cu:

#include <cub/device/device_partition.cuh>

struct an_op {
    typedef bool result_type;
    typedef int argument_type;
    __host__ __device__ bool operator()(const int &a) { return true; }
};

size_t get_scratch_size_outer()
{
    unsigned num_items = 1234;
    size_t scratch_size = 0;
    ::cub::DevicePartition::If<const bool*, bool*, unsigned*, an_op>(
        nullptr,               // d_temp_storage
        scratch_size,          // temp_storage_bytes,
        nullptr,               // d_in,
        nullptr,               // d_out,
        nullptr,               // d_num_selected_out,
        num_items,             // num_items,
        an_op(),               // select_op,
        0,                     // stream             = 0  
        false                  // debug_synchronous  = false
    );
    // ignoring errors here
    return scratch_size;
}

file main.cpp:

#include <cuda_runtime_api.h>

size_t get_scratch_size_outer();

void foo() { get_scratch_size_outer(); }

int main(int argc, char** argv) {
    cudaSetDevice(0);
    enum cudaFuncCache cc;
    cudaDeviceGetCacheConfig(&cc);
    return 0;
}

The build happens as follows:

/usr/local/cuda/bin/nvcc -M -D__CUDACC__ /ghome/joeuser/src/kernel_tester/src/cub.cu -o /home/joeuser/src/kernel_tester/CMakeFiles/ktkernels.dir/src/ktkernels_generated_cub.cu.o.NVCC-depend -ccbin /path/to/local/gcc/bin/gcc-4.9.3 -m64 --std c++11 -D__STRICT_ANSI__ -Xcompiler ,\"-Wall\",\"-g\",\"-g\",\"-DDEBUG\",\"-O0\" -gencode arch=compute_50,code=compute_50 -g -G --generate-line-info -DNVCC -I/usr/local/cuda/include -I/home/joeuser/src/kernel_tester/src -I/path/to/local/gcc/include -I/home/joeuser/opt/cub -I/usr/local/cuda/include
/usr/local/cuda/bin/nvcc /ghome/joeuser/src/kernel_tester/src/cub.cu -c -o /home/joeuser/src/kernel_tester/CMakeFiles/ktkernels.dir/src/./ktkernels_generated_cub.cu.o -ccbin /path/to/local/gcc/bin/gcc-4.9.3 -m64 --std c++11 -D__STRICT_ANSI__ -Xcompiler ,\"-Wall\",\"-g\",\"-g\",\"-DDEBUG\",\"-O0\" -gencode arch=compute_50,code=compute_50 -g -G --generate-line-info -DNVCC -I/usr/local/cuda/include -I/home/joeuser/src/kernel_tester/src -I/path/to/local/gcc/include -I/home/joeuser/opt/cub -I/usr/local/cuda/include
/usr/bin/ar cq lib/libktkernels.a  CMakeFiles/ktkernels.dir/src/ktkernels_generated_cub.cu.o
/usr/bin/ranlib lib/libktkernels.a
/path/to/local/gcc/bin/g++-4.9.3   -D__STRICT_ANSI__ -Wall -std=c++11 -g -g -DDEBUG -O0 -I/ghome/joeuser/src/kernel_tester/src -I/path/to/local/gcc/include -isystem /home/joeuser/opt/cub -I/usr/local/cuda/include -I/home/joeuser/src/kernel_tester/lib    -o CMakeFiles/tester.dir/src/main.cpp.o -c /home/joeuser/src/kernel_tester/src/main.cpp
/path/to/local/gcc/bin/g++-4.9.3   -Wall -std=c++11 -g -g -DDEBUG -O0   CMakeFiles/tester.dir/src/main.cpp.o  -o bin/tester -rdynamic -Wl,-Bstatic -lcudart_static -Wl,-Bdynamic -lpthread -lrt -ldl lib/libktkernels.a -Wl,-Bstatic -lcudart_static -Wl,-Bdynamic -lpthread -lrt -ldl

Stack trace of segfaulting thread:

        Stack trace of thread 31006:
        #0  0x00007ff067e78edb n/a (libcuda.so.1)
        #1  0x00007ff067e7a457 n/a (libcuda.so.1)
        #2  0x00007ff067e9454e n/a (libcuda.so.1)
        #3  0x00007ff067e389cd n/a (libcuda.so.1)
        #4  0x00007ff067e7f183 n/a (libcuda.so.1)
        #5  0x00007ff067e82cac n/a (libcuda.so.1)
        #6  0x00007ff067e389cd n/a (libcuda.so.1)
        #7  0x00007ff067e94fda n/a (libcuda.so.1)
        #8  0x00007ff067e53615 n/a (libcuda.so.1)
        #9  0x00007ff067e389cd n/a (libcuda.so.1)
        #10 0x00007ff067e569c3 n/a (libcuda.so.1)
        #11 0x00007ff067e592ee n/a (libcuda.so.1)
        #12 0x00007ff067e5049f n/a (libcuda.so.1)
        #13 0x00007ff067e2d9e2 n/a (libcuda.so.1)
        #14 0x00007ff067c89123 n/a (libcuda.so.1)
        #15 0x00007ff067c89b7e n/a (libcuda.so.1)
        #16 0x00007ff067d76f80 n/a (libcuda.so.1)
        #17 0x00007ff067d77290 n/a (libcuda.so.1)
        #18 0x0000000000428e3d cudart::contextState::loadCubin(bool*, void**) (tester)
        #19 0x000000000041bbd0 cudart::globalModule::loadIntoContext(cudart::contextState*) (tester)
        #20 0x0000000000427fa6 cudart::contextState::applyChanges() (tester)
        #21 0x000000000042c711 cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) (tester)
        #22 0x000000000042031c cudart::doLazyInitContextState() (tester)
        #23 0x0000000000412c3e cudart::cudaApiDeviceGetCacheConfig(cudaFuncCache*) (tester)
        #24 0x00000000004355b4 cudaDeviceGetCacheConfig (tester)
        #25 0x0000000000402f06 main (tester)
        #26 0x00007ff068ab2700 __libc_start_main (libc.so.6)
        #27 0x0000000000402e09 _start (tester)

Notes:

I am not actually calling any cub code.
This crashes on a machine with a Maxwell Titan X and CUDA 7.5, and doesn't crash on a machine with a GTX 650 and CUDA 8.0 RC.
Both machines are Intel X86_64 machine with Linux 4.x.

Floating point exception (core dumped) during cub::DeviceRadixSort::SortKeys

The following program:

#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cub/device/device_radix_sort.cuh>


template<typename T>
void cub_sort(T* first, size_t n)
{
  cudaError_t error = cudaSuccess;

  cub::DoubleBuffer<T> double_buffer;

  // measure the number of additional temporary storage bytes required
  size_t num_additional_temp_storage_bytes = 0;
  error = cub::DeviceRadixSort::SortKeys(0, num_additional_temp_storage_bytes, double_buffer, n);
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(0)");
  }

  // allocate temporary storage for double buffer as well as additional bytes
  // XXX seems like we should align up the additional bytes
  size_t num_double_buffer_bytes = n * sizeof(T);
  thrust::device_vector<char> temporary_storage(num_double_buffer_bytes + num_additional_temp_storage_bytes);

  void* additional_temp_storage_ptr = thrust::raw_pointer_cast(temporary_storage.data() + num_double_buffer_bytes);

  double_buffer.d_buffers[0] = thrust::raw_pointer_cast(&*first);
  double_buffer.d_buffers[1] = reinterpret_cast<T*>(thrust::raw_pointer_cast(temporary_storage.data()));

  error = cub::DeviceRadixSort::SortKeys(additional_temp_storage_ptr,
                                         num_additional_temp_storage_bytes,
                                         double_buffer,
                                         static_cast<int>(n));

  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(1)");
  }

  if(double_buffer.Current() != 0)
  {
    T* temp_ptr = reinterpret_cast<T*>(double_buffer.d_buffers[1]);
    thrust::copy(thrust::cuda::par, temp_ptr, temp_ptr + n, first);
  }
}


int main()
{
   thrust::device_vector<int> data(0);

   cub_sort(thrust::raw_pointer_cast(data.data()), data.size());

  return 0;
}

Generates the following output at runtime:

$ nvcc -arch=sm_35 -I. cub_repro.cu -run
Floating point exception (core dumped)

Presumably a divide by zero occurs due to n == 0.

cub::BlockStore::Store doesn't compile if OutputIteratorT::value_type != T

OutputIteartorT must be templated in
https://github.com/NVlabs/cub/blob/1.5.5/cub/block/block_store.cuh#L947
e.g. if it is discard_iterator

cub:BlockStore must be templated on type T, rather than iterator

Do you have any kind of Null iterator/discard iterator

Suppose I need to sort pairs but I don't care about sorted keys.
Is it possible to discard sorted key result?
It will reduce memory footprint that might be very useful sometimes.

RFE: cub::BlockScan : different type for input and output data, as well as scan operator

Allow both inclusive and exclusive sum have different types for input and output. For example, input type is float, output type is integer. Sum is done as float, but result is stored as int.
Allow scan operator to have different type than that of input or output types [1]. For example, if input type is "float", output type is "int", and scan operator has signature "double operator()(double,double)", the final result of the scan is of type "double" but stored as "int". This will help to avoid precision loss during scan operation in some cases.

http://en.cppreference.com/w/cpp/algorithm/exclusive_scan

CUB versioning

I have introduced CUB 1.5 as a dependency to my project here:
https://github.com/RAMitchell/xgboost/tree/master/plugin/updater_gpu

I have noticed that API breaking changes have been introduced in 1.6. Download links from the "stable releases" section however at http://nvlabs.github.io/cub/index.html all link to 1.6.

This makes it very hard to distribute my algorithm if my users can only get the newest version with a different API.

DeviceRadixSort::SortPairs fails to sort array

When I try to sort array of 40m (roughly) pairs or longer it simply does not sort them without reporting any errors.
Device is: Device 0: GeForce GTX 950 (PTX version 520, SM520, 6 SMs, 904 free / 1995 total MB physmem, 105.760 GB/s @ 3305000 kHz mem clock, ECC off)
cub version 1.5.5 (latest at the moment).

Sample project to reproduce the problem is attached
check_dev_radix.zip

When run with increasingly larger size of array to sort it eventually fails to sort it.
As I understand, the critical size depends on amount of free RAM. The problem is - no error reported.

DeviceSegmentedRadixSort should allow d_keys_in, d_values_in, d_begin_offsets, and d_end_offsets to have pointer-to-const type

The DeviceSegmentedRadixSort member functions should allow input-only parameters to have pointer-to-const type.

For example, the documentation for the SortKeys() variant not taking a DoubleBuffer specifies:

The contents of the input data are not altered by the sorting operation

.. and also documents that d_keys_in, d_begin_offsets, and d_end_offsets are input-only parameters. However, attempting to pass a pointer-to-const results in a compilation error:

#include <cstdlib>
#include <iostream>

#include <cuda_runtime.h>

#include <cub/device/device_segmented_radix_sort.cuh>

#define CUDA_CALL(call) \
  do { \
    const cudaError_t err = (call); \
    if (cudaSuccess != err) { \
      std::cerr << __FILE__ << ':' << __LINE__ << ": `" #call "' failed because of: " << cudaGetErrorString(err) << '\n' << std::flush; \
      return EXIT_FAILURE; \
    } \
  } while (0)

int main(int argc, char *argv[]) {

  // Declare, allocate, and initialize device-accessible pointers for sorting data
  int  num_items;          // e.g., 7
  int  num_segments;       // e.g., 3
  int  *d_offsets;         // e.g., [0, 3, 3, 7]
  int  *d_keys_in;         // e.g., [8, 6, 7, 5, 3, 0, 9]
  int  *d_keys_out;        // e.g., [-, -, -, -, -, -, -]

  num_items = 7;
  num_segments = 3;
  int  h_offsets[] = { 0, 3, 3, 7 };
  CUDA_CALL(cudaMalloc((void**)&d_offsets, (num_segments + 1) * sizeof(int)));
  CUDA_CALL(cudaMemcpy((void*)d_offsets, (const void*)h_offsets, (num_segments + 1) * sizeof(int), cudaMemcpyHostToDevice));
  int  h_keys_in[] = { 8, 6, 7, 5, 3, 0, 9 };
  CUDA_CALL(cudaMalloc((void**)&d_keys_in, num_items * sizeof(int)));
  CUDA_CALL(cudaMemcpy((void*)d_keys_in, (const void*)h_keys_in, num_items * sizeof(int), cudaMemcpyHostToDevice));
  int  *h_keys_out = new int[num_items];
  CUDA_CALL(cudaMalloc((void**)&d_keys_out, num_items * sizeof(int)));

  // Determine temporary device storage requirements
  void     *d_temp_storage = NULL;
  size_t   temp_storage_bytes = 0;
  CUDA_CALL(cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage,
                                                    temp_storage_bytes,
                                                    (const int*)d_keys_in,
                                                    d_keys_out,
                                                    num_items,
                                                    num_segments,
                                                    d_offsets,
                                                    d_offsets + 1));
  // Allocate temporary storage
  CUDA_CALL(cudaMalloc(&d_temp_storage, temp_storage_bytes));
  // Run sorting operation
  CUDA_CALL(cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage,
                                                    temp_storage_bytes,
                                                    (const int*)d_keys_in,
                                                    d_keys_out,
                                                    num_items,
                                                    num_segments,
                                                    d_offsets,
                                                    d_offsets + 1));

  CUDA_CALL(cudaMemcpy((void*)h_keys_out, (const void*)d_keys_out, num_items * sizeof(int), cudaMemcpyDeviceToHost));

  std::cout << '[';
  for (int i = 0; i < num_items; ++i) {
    if (i != 0) std::cout << ", ";
    std::cout << h_keys_out[i];
  }
  std::cout << "]\n";

  return EXIT_SUCCESS;
}

Removing the casts to const int* in the two invocations of cub::DeviceSegmentedRadixSort::SortKeys() allows the sample to compile and produce the expected output:

[6, 7, 8, 0, 3, 5, 9]

DeviceScan::ExclusiveSum blocks execution thread

Found it out with nvidia visual profiler

void *d_temp_storage = NULL;
                              size_t temp_storage_bytes = 0;
                              CubDebugExit(cub::DeviceScan::ExclusiveSum(d_temp_storage,
                                                                    temp_storage_bytes,
                                                                    d_in,
                                                                    d_out,
                                                                    count,
                                                                    s));

                              CubDebugExit(g_allocator.DeviceAllocate(&d_temp_storage, temp_storage_bytes, s));

                              CubDebugExit(cub::DeviceScan::ExclusiveSum(d_temp_storage,
                                                                    temp_storage_bytes,
                                                                    d_in,
                                                                    d_out,
                                                                    count,
                                                                    s));

asynchronous cuda

How can I prefix sum array of floats into double array

In my case to reduce computation error I'm going to use double as result.

But I'm getting compilation error when use method
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::ExclusiveSum
if I provide both vector of the same time(float or double) it compiles ok.
But I don't want to use double for original array(to save memory) and float for result array(to increase accuracy).

How can I sum floats into doubles?

Vectorized Binary Search

Are there any plans to implement vectorized binary search? Is it possible to improve upon the performance of thrust binary search using CUB?

argsort

Hi, I was wondering if you block-local sorts could return both the sorted values and their index in the array?

Caching Device Allocator fails with debug output enabled

Bryan F

Jun 1

Visual Studio 2013, CUDA 7.5, CUB 1.5.2

Error in util_allocator.cuh@Line 414: map/set iterator not dereferencable

This issue is caused by erase being called on the block iterator before the debug output is printed.

cached_blocks.erase(block_itr);
cached_bytes[device].free -= search_key.bytes;
cached_bytes[device].live += search_key.bytes;

if (debug) _CubLog("\tDevice %d reused cached block at %p (%lld bytes) for stream %lld (previously associated with stream %lld).\n",
device, search_key.d_ptr, (long long)search_key.bytes, (long long)search_key.associated_stream, (long long)block_itr->associated_stream);

Fixed:

cached_bytes[device].free -= search_key.bytes;
cached_bytes[device].live += search_key.bytes;

cached_blocks.erase(block_itr);

The same error should occur at several locations where block_itr is used in the debug output.

KeyVairPair won't work if Key has non-trivial ctor

https://github.com/NVlabs/cub/blob/1.5.5/cub/util_type.cuh#L651

Is union still necessary, if so is there other way to enforce alignment ?

I did what you suggested and I discovered a very high amount of needed memory from the cub::DeviceHistogram::HistogramEven routine.

I did what you suggested and I discovered a very high amount of needed memory from the cub::DeviceHistogram::HistogramEven routine. It seems to want more memory than a size_t (assuming it is not larger than 64bit can handle).

The histgram routine says, it needs about 18446744073709552000 bytes which are about 16 exabyte of memory. Since this number is larger than size_t, I assume there was a memory overflow somewhere in the calculation of the needed memory space.

I attached a full program this time and hope you can recreate the problem.

struct floatExp {
    float a, b, c;
};

int main() {
    cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1024 * 1024 * 1024);

    unsigned int const sz = 65536 * 2;

    unsigned int *idx_d = NULL;
    unsigned int *newIdx_d = NULL;
    void *tmpSort_d = NULL, *tmpHist_d = NULL;
    size_t tmpSortSz = 0, tmpHistSz = 0;


    cudaMalloc(reinterpret_cast<void **>(&idx_d), sizeof(unsigned int) * sz);

    cudaMalloc(reinterpret_cast<void **>(&newIdx_d), sizeof(unsigned int) * sz);

    float *data_d = NULL;
    float *newData_d = NULL;
    cudaMalloc(reinterpret_cast<void **>(&data_d), sizeof(float) * sz * 3);
    cudaMalloc(reinterpret_cast<void **>(&newData_d), sizeof(float) * sz * 3);

    unsigned int totalBlockSz = 801u*801u*401u+2u;


    unsigned int *idx_h = new unsigned int[sz];
    for (unsigned int i = 0; i < sz; ++i) {
        idx_h[i] = sz - i;
    }
    cudaMemcpy(idx_d, idx_h, sizeof(unsigned int) * sz, cudaMemcpyHostToDevice);


    unsigned int *hist_d = NULL;
    cudaMalloc(reinterpret_cast<void **>(&hist_d), sizeof(unsigned int) * (sz + 1));



    cub::DeviceRadixSort::SortPairs(tmpSort_d, tmpSortSz, idx_d, newIdx_d,
        reinterpret_cast<floatExp *>(const_cast<float *>(data_d)), reinterpret_cast<floatExp *>(newData_d), sz);
    cudaDeviceSynchronize();

    printf("TotalBlockSz %i\n", totalBlockSz);
    printf("sz %u \n", sz);

    cub::DeviceHistogram::HistogramEven(tmpHist_d, tmpHistSz, newIdx_d, hist_d,
        static_cast<int>(totalBlockSz + 2), 0u, totalBlockSz + 1u, static_cast<int>(sz));

    printf("tmpHistSz %f\n", static_cast<float>(tmpHistSz));
    printf("tmpHistSz %lu\n", tmpHistSz);
    cudaDeviceSynchronize();


    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
    printf("Last Cuda error was: %i\n", err);

    system("pause");

    cudaFree(idx_d);
    cudaFree(newIdx_d);
    cudaFree(data_d);
    cudaFree(newData_d);
    cudaFree(hist_d);

    return 0;

}

Compilation fails : 'cub::DispatchRadixSort<false,KeyT,ValueT,int>' : assignment operator could not be generated

While building CNTK, warnings are treated as error and because of this warning compilation fails

'cub::DispatchRadixSort<false,KeyT,ValueT,int>' : assignment operator could not be generated

Now a temporary workaround is to disable the "warnings as error switch in CNTK", but a better solution would be to eliminate this warning in CUB

Retaining zips of previous branches/versions

I noticed that the 1.5.1 branch was removed, along with the cub.1.5.1.zip file when 1.5.2 was released. Would it be possible to retain archives of these older versions so that builds can be reproducible? In my case, I had a 1.5.1 dependency in a Dockerfile. This isn't a train smash for me by any means, but it'd be nice to have some future-proofing.

In kernel float cub::BlockRadixSort crashes

I'm having issues with using radix sort in kernel for float sorting.

Attached example code with output and cuda-memcheck summary. The cub version is 1.5.1.

source.txt

output_with_sort_called.txt

output_with_sort_commented_out.txt

cuda-memcheck.txt

warning in cub/device/dispatch/dispatch_reduce_by_key.cuh

cub/cub/device/dispatch/dispatch_reduce_by_key.cuh:432:32: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
dim3 scan_grid_size(
^

it looks like a correct warning, probably harmless.

cub::DeviceScan::ExclusiveSum can't prefix sum of float into double

To increase accuracy(in my case array has 100M elements and it's noticeable) I'd like to store result of into double*, but I'm getting following complication error

code https://github.com/sh1ng/arboretum/blob/master/src/core/garden.cu#L314

/Users/sh1ng/Projects/cub/cub/agent/agent_scan.cuh:351: error: no instance of overloaded function "cub::BlockStore<OutputIteratorT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Store [with OutputIteratorT=double *, BLOCK_DIM_X=128, ITEMS_PER_THREAD=12, ALGORITHM=cub::BLOCK_STORE_WARP_TRANSPOSE, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=610]" matches the argument list
argument types are: (double *, float [12])
object type is: cub::BlockStore<double *, 128, 12, cub::BLOCK_STORE_WARP_TRANSPOSE, 1, 1, 610>
detected during:
instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ConsumeTile<IS_FULL_TILE>(OffsetT, OffsetT, int, OffsetT, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ScanTileStateT &) [with AgentScanPolicyT=cub::DispatchScan<float *, double *, cub::Sum, float, int>::PtxAgentScanPolicy, InputIteratorT=float *, OutputIteratorT=double *, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int, IS_FULL_TILE=true]"
(372): here
instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ConsumeRange(int, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ScanTileStateT &) [with AgentScanPolicyT=cub::DispatchScan<float *, double *, cub::Sum, float, int>::PtxAgentScanPolicy, InputIteratorT=float *, OutputIteratorT=double *, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int]"
../cub/cub/device/dispatch/dispatch_scan.cuh(127): here
instantiation of "void cub::DeviceScanSweepKernel<ScanPolicyT,InputIteratorT,OutputIteratorT,ScanTileStateT,ScanOpT,IdentityT,OffsetT>(InputIteratorT, OutputIteratorT, ScanTileStateT, ScanOpT, IdentityT, OffsetT) [with ScanPolicyT=cub::DispatchScan<float *, double *, cub::Sum, float, int>::PtxAgentScanPolicy, InputIteratorT=float *, OutputIteratorT=double *, ScanTileStateT=cub::ScanTileState<float, true>, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int]"
../cub/cub/device/dispatch/dispatch_scan.cuh(543): here
instantiation of "cudaError_t cub::DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::Dispatch(void *, size_t &, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT, cudaStream_t, __nv_bool) [with InputIteratorT=float *, OutputIteratorT=double *, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int]"
../cub/cub/device/device_scan.cuh(156): here
instantiation of "cudaError_t cub::DeviceScan::ExclusiveSum(void *, size_t &, InputIteratorT, OutputIteratorT, int, cudaStream_t, __nv_bool) [with InputIteratorT=float *, OutputIteratorT=double *]"
src/core/garden.cu(314): here
instantiation of "void arboretum::core::GardenBuilder<node_type>::FindBestSplits(int, const arboretum::io::DataMatrix *, const thrust::host_vector<float, thrust::system::cuda::experimental::pinned_allocator> &) [with node_type=unsigned short]"
src/core/garden.cu(135): here
instantiation of "void arboretum::core::GardenBuilder<node_type>::GrowTree(arboretum::core::RegTree *, const arboretum::io::DataMatrix *, const thrust::host_vector<float, thrust::system::cuda::experimental::pinned_allocator> &) [with node_type=unsigned short]"
src/core/garden.cu(105): here
instantiation of "arboretum::core::GardenBuilder<node_type>::GardenBuilder(const arboretum::core::TreeParam &, const arboretum::io::DataMatrix *) [with node_type=unsigned short]"
src/core/garden.cu(509): here

DeviceRadixSort::SortKeys<bool> yields compiler errors

The following program:

#include <cub/cub.cuh>

int main()
{
  cub::DeviceRadixSort::SortKeys<bool>;
  return 0;
}

does not compile:

$nvcc -I. sort_bool.cu
sort_bool.cu(5): warning: expression has no effect

./cub/device/dispatch/../../block_sweep/block_radix_sort_upsweep.cuh(92): error: class "cub::Traits<__nv_bool>" has no member "UnsignedBits"
          detected during:
            instantiation of class "cub::BlockRadixSortUpsweep<BlockRadixSortUpsweepPolicy, Key, Offset> [with BlockRadixSortUpsweepPolicy=cub::DeviceRadixSortDispatch<false, __nv_bool, cub::NullType, int>::PtxUpsweepPolicy, Key=__nv_bool, Offset=int]" 
./cub/device/dispatch/device_radix_sort_dispatch.cuh(80): here
            instantiation of "void cub::DeviceRadixSortUpsweepKernel<BlockRadixSortUpsweepPolicy,DESCENDING,Key,Offset>(Key *, Offset *, Offset, int, int, __nv_bool, cub::GridEvenShare<Offset>) [with BlockRadixSortUpsweepPolicy=cub::DeviceRadixSortDispatch<false, __nv_bool, cub::NullType, int>::PtxUpsweepPolicy, DESCENDING=false, Key=__nv_bool, Offset=int]" 
./cub/device/dispatch/device_radix_sort_dispatch.cuh(931): here
            instantiation of "cudaError_t cub::DeviceRadixSortDispatch<DESCENDING, Key, Value, Offset>::Dispatch(void *, size_t &, cub::DoubleBuffer<Key> &, cub::DoubleBuffer<Value> &, Offset, int, int, cudaStream_t, __nv_bool) [with DESCENDING=false, Key=__nv_bool, Value=cub::NullType, Offset=int]" 
./cub/device/device_radix_sort.cuh(331): here
            instantiation of "cudaError_t cub::DeviceRadixSort::SortKeys(void *, size_t &, cub::DoubleBuffer<Key> &, int, int, int, cudaStream_t, __nv_bool) [with Key=__nv_bool]" 

...

I tried to find out if this was intended to be supported or not. The documentation implies it is:

DeviceRadixSort can sort all of the built-in C++ numeric primitive types, e.g.: unsigned char, int,
double, etc. Although the direct radix sorting method can only be applied to unsigned integral types,
BlockRadixSort is able to sort signed and floating-point types via simple bit-wise transformations
that ensure lexicographic key ordering.

bool is indeed an integral type, so I think it's worth supporting for completeness.

cub::DeviceRadixSort::SortPairs for multiple values

I have three vectors

thrust::device_vector<unsigned long> keys(N);
thrust::device_vector<unsigned> value1(N);
thrust::device_vector<unsigned> value2(N);

I am trying to achieve the equivalent of thrust::sort_by_key using cub::DeviceRadixSort::SortPairs.

thrust::sort_by_key(keys.begin(), keys.end(),
    thrust::make_zip_iterator(thrust::tuple(value1.begin(), value2.begin())));

So essentially, my question is, how do I sort by key where each key has multiple values associated to it?

CUB code doesn't compile with CUDA 5.0 on Ubuntu Linux 11.10

I ran into this in my own code, and then tried the CUB tests, and they fail to compile too.

harrism@snafubuntu:~/src/cub/test$ make test_block_radix_sort
mkdir -p bin
"/usr/local/cuda/bin/nvcc"  -gencode=arch=compute_20,code=\"sm_20,compute_20\"  -o bin/test_block_radix_sort_nvvm_5.0_noabi_i386 test_block_radix_sort.cu -Xptxas -v -Xcudafe -# -Xptxas -abi=no -m32 -I. -I../cub  -O3
../cub/block/../warp/warp_reduce.cuh(134): warning: declaration of "T" hides template parameter

../cub/block/../warp/warp_reduce.cuh(456): error: a class or namespace qualified name is required

../cub/block/../warp/warp_reduce.cuh(459): error: name followed by "::" must be a class or namespace name

Front end time                      17.78 (CPU)      18.00 (elapsed)
2 errors detected in the compilation of "/tmp/tmpxft_000012b7_00000000-6_test_block_radix_sort.cpp1.ii".
Total compilation time              18.01 (CPU)      19.00 (elapsed)
make: *** [bin/test_block_radix_sort_nvvm_5.0_noabi_i386] Error 2

harrism@snafubuntu:~/src/cub/test$ uname -a
Linux snafubuntu 3.0.0-32-generic #51-Ubuntu SMP Thu Mar 21 15:50:59 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

harrism@snafubuntu:~/src/cub/test$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221

harrism@snafubuntu:~/src/cub/test$ g++ --version
g++ (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

The errors for other tests start the same, but are much worse (many more error messages).

cub:InequalityWrapper::operator() should be non-const

https://github.com/NVlabs/cub/blob/1.5.5/cub/thread/thread_operators.cuh#L99

If EqualityOp is a non-const callable, InequalityWrapper::operator() produces in compilation error. non-const will work with both non-const and const EqualityOp::operator()

Does cub::DeviceRadixSort support that the input and the output is the same pointer?

Hi,
I was wondering whether it is ok if I set the input and the output to the same pointer? It is usual to sort an array and output the sorted results to the original location.

Thanks.

Compilation error when do ExclusiveSum with custom oprtation

cub 1.5.5

error: no instance of function template "cub::DeviceScan::ExclusiveSum" matches the argument list
argument types are: (void *, size_t, float2 *, float2 *, arboretum::core::CustomSum, float2, const size_t, cudaStream_t)

error: no instance of function template "cub::DeviceScan::ExclusiveSum" matches the argument list
argument types are: (long, size_t, float *, float *, arboretum::core::CustomSum, float, const size_t, cudaStream_t)

    struct CustomSum
    {
        template <typename T>
        CUB_RUNTIME_FUNCTION __forceinline__
        T operator()(const T &a, const T &b) const {
            return a + b;
        }
    };

cub::DeviceRadixSort::SortPairs for multiple values

Sorry for a duplicate(https://github.com/NVlabs/cub/issues/40) but it's been closed without fix or workaround.

Suppose I've three arrays/vectors

keys
values1
values2

and I want to sort values1 and values2 by keys.
It can be done by calling SortPairs twice.

Can the running time be improved by putting them in a single sort?
Can you provide a code snippet to do that or implementation require?

DeviceRadixSort::SortKeys returns all 0

Hi, I'm using DeviceRadixSort::SortKeys to sort an array of integers. If the num_items passed to SortKeys is <= 1792 the output vector is all zeros. If I increase the num_items to 1793 the output is correct (without changing any of the other arguments). For size <=1792 it seems to use the single tile path. I'm compiling for sm_30. Any ideas how to further track this down?

Use tags for releases

Hi,

Would it be possible to use Git tags for releases rather than just branches? This is the expected way to handle this on GitHub (https://github.com/NVlabs/cub/releases). Thanks!

nvidia / cub Goto Github PK

cub's Issues

Recommend Projects

Recommend Topics

Recommend Org