nvidia / cub
[ARCHIVED] Cooperative primitives for CUDA C++. See https://github.com/NVIDIA/cccl
License: BSD 3-Clause "New" or "Revised" License
It happens for values between 2^63 and 2^64-1
When trying to compile the reduction example, the following error message floods the output. The errors all look like:
../../test/test_util.h(987): error: class "cub::NumericTraits" has no member "Lowest"
When calling
BlockRadixSort(temp_storage).SortBlockedToStriped(thread_keys, thread_values);
the examples require memory visible to all threads in the block, e.g.
__shared__ typename BlockRadixSort::TempStorage temp_storage;
However, how do I use external memory if temp_storage needs more than 48 KiB? How do I allocate this memory from the host?
Thanks in advance,
Christian
error: no instance of function template "gunrock::oprtr::advance::LaunchKernel" matches the argument list
argument types are: (int *, gunrock::app::EnactorStats, gunrock::app::FrontierAttribute, gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>::DataSlice *, VertexId *, __nv_bool *, __nv_bool *, unsigned int *, int *, int *, VertexId *, int *, int *, int *, SizeT *, VertexId *, int, int, gunrock::util::CtaWorkProgressLifetime, mgpu::CudaContext, gunrock::oprtr::advance::TYPE)
detected during:
instantiation of "cudaError_t gunrock::app::bfs::BFSEnactor::EnactBFS<AdvanceKernelPolicy,FilterKernelPolicy,BFSProblem>(mgpu::CudaContext &, BFSProblem *, BFSProblem::VertexId, int) [with INSTRUMENT=true, AdvanceKernelPolicy=gunrock::oprtr::advance::KernelPolicy<gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>, 300, true, 8, 10, 8, 4096, 1, 0, 5, 32, 512, 7, LB>, FilterKernelPolicy=gunrock::oprtr::filter::KernelPolicy<gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>, 300, true, 0, true, 8, 8, 1, 0, 5, 5, 8>, BFSProblem=gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>]"
(495): here
instantiation of "cudaError_t gunrock::app::bfs::BFSEnactor::Enact(mgpu::CudaContext &, BFSProblem *, BFSProblem::VertexId, int, int) [with INSTRUMENT=true, BFSProblem=gunrock::app::bfs::BFSProblem<int, int, float, true, false, false>]"
I am using:
CUDA 6.5
GCC 4.4.7
CentOS 6.4
Boost 1.57.0
CUB 1.3.2
ModernGPU
Still being new to CUDA and CUB, I'm not quite sure this is a CUB error rather than mine, but this time I did my best to rule out all other sources.
The setup: GeForce GTX 950, CUDA 8.0, CUB 1.6.4.
I allocated two arrays of uint3 (15 million elements each), and DeviceSelect::If() requested 313087 bytes of temp_storage. The comparison functor is adapted from the documentation snippet for the element type:
struct Check {
    int compare;
    CUB_RUNTIME_FUNCTION __forceinline__
    Check( int compare ) : compare( compare ) {}
    CUB_RUNTIME_FUNCTION __forceinline__
    bool operator()( const uint3 &a ) const {
        return ( a.x > compare );
    }
};
This doesn't compile, failing with the following error:
/home/daktfi/cub/cub/device/dispatch/../../agent/agent_select_if.cuh(279): error: calling a host function("Check::operator ()") from a device function("cub::AgentSelectIf< ::cub::DispatchSelectIf< ::uint3 *, ::cub::NullType *, ::uint3 *, unsigned int *, ::Check, ::cub::NullType, int, (bool)0> ::PtxSelectIfPolicyT, ::uint3 *, ::cub::NullType *, ::uint3 *, ::Check, ::cub::NullType, int, (bool)0> ::InitializeSelections<(bool)1, (bool)0> ") is not allowed
So, I changed the declaration to
struct Check {
    int compare;
    __host__ __device__ __forceinline__
    Check( int compare ) : compare( compare ) {}
    __host__ __device__ __forceinline__
    bool operator()( const uint3 &a ) const {
        return ( a.x > compare );
    }
};
specifying both host and device for the methods. This way the code compiles, but after the launch of DeviceSelect::If() and cudaDeviceSynchronize(), cudaPeekAtLastError() returns 77 (cudaErrorIllegalAddress). I'm fairly sure this isn't carried over from previous asynchronous calls (after every device-related call I invoke cudaDeviceSynchronize() and check cudaPeekAtLastError()), but I'm quite in doubt where else it could come from. All the device code executes in a non-default cudaStream s, created at the very beginning.
The code snippet is:
Check ch0( 0 );
// Memory is allocated at d_ptr, size equals mem_req, counted ahead of time
uint3 *ptr_a = ( uint3 * )d_ptr, *ptr_b = ptr_a + row_count;
void *d_tmp = ptr_b + row_count;
fprintf( stderr, "DevSelect( %p, %lu, %p, %p, %lu ) %p of %p\n", d_tmp, tmp_select, ptr_a,
         ptr_b, row_count, ( char * ) d_tmp + tmp_select, ( char * )d_ptr + mem_req );
size_t tmp_sel_ask;
cub::DeviceSelect::If( nullptr, tmp_sel_ask, ptr_a, ptr_b, &length, row_count, ch0, s, true );
cudaDeviceSynchronize();
rc = cudaPeekAtLastError();
if( rc ) {
    std::cerr << "Some cuda error " << cudaGetLastError() << std::endl;
    cudaFree( d_ptr );
    return rc;
} else
    std::cerr << "DevSelect ready, needs " << tmp_sel_ask << " of " << tmp_select << std::endl;
if( tmp_sel_ask > tmp_select ) {
    cudaFree( d_ptr );
    return rc;
} else
    std::cerr << "DevSelect clear to go" << std::endl;
cub::DeviceSelect::If( d_tmp, tmp_select, ptr_a, ptr_b, &length, row_count, ch0, s, true );
cudaDeviceSynchronize();
rc = cudaPeekAtLastError();
if( rc ) {
    std::cerr << "DevSelect error " << cudaGetLastError() << std::endl;
    cudaFree( d_ptr );
    return rc;
} else
    std::cerr << "DevSelect ok" << std::endl;
The output is:
DevSelect( 0xb188d2a00, 313087, 0xb03180000, 0xb0dd29500, 15000000 ) 0xb1891f0ff of 0xb1891f0ff
DevSelect ready, needs 313087 of 313087
DevSelect clear to go
Invoking scan_init_kernel<<<306, 128, 0, 140458720343936>>>()
DevSelect error 77
Can you please look into this issue and point me toward the right fix?
CUB needs thrust::{exclusive,inclusive}_scan_by_key-like functionality, which is currently absent.
The following program:
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cub/device/device_radix_sort.cuh>

template<typename T>
void cub_sort(T* first, size_t n)
{
  cudaError_t error = cudaSuccess;
  cub::DoubleBuffer<T> double_buffer;

  // measure the number of additional temporary storage bytes required
  size_t num_additional_temp_storage_bytes = 0;
  error = cub::DeviceRadixSort::SortKeys(0, num_additional_temp_storage_bytes, double_buffer, n);
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(0)");
  }

  // allocate temporary storage for double buffer as well as additional bytes
  // XXX seems like we should align up the additional bytes
  size_t num_double_buffer_bytes = n * sizeof(T);
  thrust::device_vector<char> temporary_storage(num_double_buffer_bytes + num_additional_temp_storage_bytes);
  void* additional_temp_storage_ptr = thrust::raw_pointer_cast(temporary_storage.data() + num_double_buffer_bytes);

  double_buffer.d_buffers[0] = thrust::raw_pointer_cast(&*first);
  double_buffer.d_buffers[1] = reinterpret_cast<T*>(thrust::raw_pointer_cast(temporary_storage.data()));

  error = cub::DeviceRadixSort::SortKeys(additional_temp_storage_ptr,
                                         num_additional_temp_storage_bytes,
                                         double_buffer,
                                         static_cast<int>(n));
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(1)");
  }

  error = cudaDeviceSynchronize();
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(1)");
  }

  if(double_buffer.Current() != 0)
  {
    T* temp_ptr = reinterpret_cast<T*>(double_buffer.d_buffers[1]);
    thrust::copy(thrust::cuda::par, temp_ptr, temp_ptr + n, first);
  }
}

int main()
{
  thrust::device_vector<int> data(1);
  cub_sort(thrust::raw_pointer_cast(data.data()), data.size());
  return 0;
}
Produces the following error at runtime:
$ nvcc -I. -arch=sm_35 misaligned_address.cu -run
terminate called after throwing an instance of 'thrust::system::system_error'
what(): after cub::DeviceRadixSort::SortKeys(1): misaligned address
Aborted (core dumped)
I created a simplified test code that reproduces this problem. I get lots of compiler warnings with CUDA 5 on Ubuntu 12.04 64-bit with this code:
#include <cub.cuh>

template <int BlockSize>
__global__ void foo()
{
  typedef cub::BlockReduce<int, BlockSize> BlockMax;
  __shared__ typename BlockMax::SmemStorage smem_storage;
  int myMax = BlockMax::Reduce(smem_storage, threadIdx.x, max);
}

int main(void)
{
  return 1;
}
Compile output and OS/compiler versions are shown below.
harrism@snafubuntu:/src/cubtest$ nvcc -I ../cub/cub test.cu
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::SetMaxCachedBytes(size_t)’:
../cub/cub/grid/../allocator.cuh:329:67: warning: format ‘%d’ expects argument of type ‘int’, but argument 2 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::DeviceAllocate(void*, size_t, cub::GpuOrdinal)’:
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:388:222: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:415:228: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::DeviceFree(void, cub::GpuOrdinal)’:
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:495:209: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:513:207: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh: In member function ‘virtual cudaError_t cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::FreeAllCached()’:
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 3 has type ‘size_t {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 5 has type ‘std::map<int, long unsigned int>::mapped_type {aka long unsigned int}’ [-Wformat]
../cub/cub/grid/../allocator.cuh:594:219: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘std::multiset<cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor, bool ()(const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&, const cub::_GLOBAL__N__39_tmpxft_00000f70_00000000_6_test_cpp1_ii_82694a5d::CachingDeviceAllocator::BlockDescriptor&)>::size_type {aka long unsigned int}’ [-Wformat]
harrism@snafubuntu:/src/cubtest$ uname -a
Linux snafubuntu 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
harrism@snafubuntu:/src/cubtest$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
harrism@snafubuntu:/src/cubtest$ g++ --version
g++ (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I couldn't find a better place to ask this question, so pardon me for asking here:
Is there a way to use keys of non-standard types for cub::DeviceRadixSort?
So far I managed to sort arrays of ints, floats, doubles, longs, bools, etc.
But what I would like is to sort an array where each item is a custom type:
typedef struct {
    char data[32];
} LongLongLong;
Unfortunately I get the following error when I try to do that:
../cub/device/dispatch/dispatch_radix_sort.cuh(254): error: class "cub::Traits<LongLongLong>" has no member "UnsignedBits"
Should I maybe add a NumericTraits for my custom type? How would I go about that?
Thank you.
In file '<cub_root>/cub/host/mutex.cuh', in lines 37 and 70, replace the line
#if __cplusplus > 199711L
with the following line
#if __cplusplus > 199711L || (defined(_MSC_VER) && _MSC_VER >= 1800)
Visual Studio 2013 (and later) also has the C++11 class std::mutex.
A check using only the value of __cplusplus does not work for VS 2013, because it is set to 199711L there (I don't know about VS 2015).
See http://stackoverflow.com/questions/14131454/visual-studio-2012-cplusplus-and-c-11
cub::TilePrefixCallbackOp should also accept the PTX arch as a template argument, to specialise WarpReduce<T> (https://github.com/NVlabs/cub/blob/1.5.5/cub/agent/single_pass_scan_operators.cuh#L658).
Exclusive scan should accept an init value and construct a scan with the following semantics: output[0] = init, output[i>0] = init ⊕ input[0] ⊕ ... ⊕ input[i-1].
Currently an identity value is required for a user-provided reduction operator. If the user doesn't know one, or doesn't want to provide one, exclusive scan, like inclusive scan, should just work. If this is less efficient, that's okay; it is the user's choice to obtain a faster implementation by providing the identity value.
The following program:
#if !defined(__CUDA_ARCH__) || (__CUDA_ARCH__>= 350 && defined(__CUDACC_RDC__))
#  define CUB_CDP 1
#endif

#include <thrust/device_vector.h>
#include <thrust/tabulate.h>
#include <thrust/functional.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <thrust/execution_policy.h>
#include <cub/device/device_radix_sort.cuh>
#include <cassert>
#include <cstdlib>
#include <iostream>

template<typename T>
__host__ __device__
cudaError_t cub_sort_n(T* first, size_t n)
{
  cudaError_t result = cudaSuccess;
#if !defined(__CUDA_ARCH__) || CUB_CDP
  // XXX the additional temporary storage bytes
  //     must be allocated on a 16b aligned address
  struct __align__(16) aligned_type {};

  if(n > 1)
  {
    cub::DoubleBuffer<T> double_buffer;

    // measure the number of additional temporary storage bytes required
    size_t num_additional_temp_storage_bytes = 0;
    cudaError_t error = cub::DeviceRadixSort::SortKeys(0, num_additional_temp_storage_bytes, double_buffer, static_cast<int>(n));
    if(error != cudaSuccess) return error;

    // allocate temporary storage for double buffer as well as additional bytes
    size_t num_double_buffer_bytes = n * sizeof(T);
    size_t num_aligned_double_buffer_bytes = thrust::detail::util::round_i(num_double_buffer_bytes, sizeof(aligned_type));
    size_t num_aligned_total_temporary_storage_bytes = num_aligned_double_buffer_bytes + num_additional_temp_storage_bytes;
    thrust::device_vector<char> temporary_storage(num_aligned_total_temporary_storage_bytes);
    void* additional_temp_storage_ptr = thrust::raw_pointer_cast(temporary_storage.data() + num_aligned_double_buffer_bytes);

    double_buffer.d_buffers[0] = thrust::raw_pointer_cast(&*first);
    double_buffer.d_buffers[1] = reinterpret_cast<T*>(thrust::raw_pointer_cast(temporary_storage.data()));

    result = cub::DeviceRadixSort::SortKeys(additional_temp_storage_ptr,
                                            num_additional_temp_storage_bytes,
                                            double_buffer,
                                            static_cast<int>(n));
    if(result != cudaSuccess) return result;

    if(double_buffer.Current() != 0)
    {
      T* temp_ptr = reinterpret_cast<T*>(double_buffer.d_buffers[1]);
      thrust::copy(thrust::device, temp_ptr, temp_ptr + n, first);
    }
  }
#else
  result = cudaErrorNotSupported;
#endif
  return result;
}

int main()
{
  size_t n = 2;
  thrust::host_vector<int> ref(n);

  // generate unsorted values
  thrust::tabulate(ref.begin(), ref.end(), thrust::placeholders::_1 % 1024);

  thrust::device_vector<int> data = ref;

  cudaError_t error = cub_sort_n(thrust::raw_pointer_cast(data.data()), data.size());
  if(error)
  {
    std::cerr << "CUDA error: " << cudaGetErrorString(error) << std::endl;
    std::exit(-1);
  }

  thrust::sort(ref.begin(), ref.end());

  assert(ref == data);
  assert(thrust::is_sorted(data.begin(), data.end()));

  return 0;
}
Yields the following output at runtime:
$ nvcc -I. repro.cu -run
CUDA error: invalid device function
I believe the way that CUB uses the CUB_CDP macro is causing the host compiler to elide the CUDA kernels because it thinks they are unused. It requires quite a few workarounds to get this right; CUB should probably perform these on behalf of the client.
There appear to be numerical stability issues when I use cub::BlockLoad/cub::BlockStore with ITEMS_PER_THREAD = 1. Using ITEMS_PER_THREAD = 2 or more is okay.
Steps to reproduce:
1. Install Boost. (I am on Mac, and I installed Boost via the Homebrew bottle: brew install boost)
2. Save the attached test.cu and Makefile to a new directory. This test case computes a Hadamard product C <- alpha * A (*) B + beta * C via a "basic" kernel (basicHadKernel) and one using CUB (cubHadKernel).
3. Edit Makefile to change the include path to CUB.
4. Run make && ./test. You should see:
Running 1 test case...
*** No errors detected
5. Change line 38 of test.cu to:
static constexpr int ITEMS_PER_THREAD = 1;
6. Run make && ./test. You should now see something like:
Running 1 test case...
test.cu:279: error: in "test_had": difference{1.43188e-05} between C2[i]{-0.00731719984} and expected[i]{-0.00731730461} exceeds 0.001%
test.cu:279: error: in "test_had": difference{0.000314992} between C2[i]{0.000154208246} and expected[i]{0.000154256821} exceeds 0.001%
test.cu:279: error: in "test_had": difference{1.46731e-05} between C2[i]{0.0272928048} and expected[i]{0.0272932053} exceeds 0.001%
[... output clipped ...]
*** 333 failures are detected in the test module "tests"
As you can see, many values now exceed the tolerance threshold.
Hi,
thank you for fixing https://github.com/NVlabs/cub/issues/59
Now I'm trying to sum float2 into double2 and getting the error:
cub/cub/block/block_load.cuh(85): error: no operator "=" matches these operands
operand types are: double2 = float2
When I attempt to build cub on Ubuntu 14.04 with gcc 4.8.4 and nvcc 7.0.27, I get this error:
thread_load.cuh(423): error: reinterpret_cast cannot cast away const or other type qualifiers
detected during:
instantiation of "T cub::ThreadLoad(T *, cub::Int2Type<MODIFIER>, cub::Int2Type<1>) [with T=const pyramid::weight_type, MODIFIER=3]"
(442): here
instantiation of "std::iterator_traits<InputIteratorT>::value_type cub::ThreadLoad<MODIFIER,InputIteratorT>(InputIteratorT) [with MODIFIER=cub::LOAD_CS, InputIteratorT=const pyramid::weight_type *]"
kernels.cu(56): here
It looks like this is related.
When CUB fails to allocate memory, it frees its cache and retries.
If the retry succeeds, everything works fine until some poor soul checks the last error in CUDA,
and that OOM error pops up. The cure is simple: if there was a failure that was successfully
resolved, please make a cudaGetLastError call to clear the CUDA error state. Thank you.
A 1.5.5 change in the signedness of lane_id is allowing lanes past the "last lane" to execute their reduction operators. Although the partial sums produced by these lanes are not incorporated into the result returned to lane0, this presents a problem if the operation on uninitialized data throws an error (memory exception, divide by zero, etc.).
It happens for sizes between 2^63 and 2^64-1
I've been experiencing crashes using libcub in a project I'm working on. I'm not even sure the problem I'm seeing is with cub itself, or whether it's a linking misconfiguration issue. Or a CUDA bug. Anyway, I've narrowed it down to a two-file MCVE:
file cub.cu:
#include <cub/device/device_partition.cuh>

struct an_op {
    typedef bool result_type;
    typedef int argument_type;
    __host__ __device__ bool operator()(const int &a) { return true; }
};

size_t get_scratch_size_outer()
{
    unsigned num_items = 1234;
    size_t scratch_size = 0;
    ::cub::DevicePartition::If<const bool*, bool*, unsigned*, an_op>(
        nullptr,      // d_temp_storage
        scratch_size, // temp_storage_bytes
        nullptr,      // d_in
        nullptr,      // d_out
        nullptr,      // d_num_selected_out
        num_items,    // num_items
        an_op(),      // select_op
        0,            // stream = 0
        false         // debug_synchronous = false
    );
    // ignoring errors here
    return scratch_size;
}
file main.cpp:
#include <cuda_runtime_api.h>

size_t get_scratch_size_outer();
void foo() { get_scratch_size_outer(); }

int main(int argc, char** argv) {
    cudaSetDevice(0);
    enum cudaFuncCache cc;
    cudaDeviceGetCacheConfig(&cc);
    return 0;
}
The build happens as follows:
/usr/local/cuda/bin/nvcc -M -D__CUDACC__ /ghome/joeuser/src/kernel_tester/src/cub.cu -o /home/joeuser/src/kernel_tester/CMakeFiles/ktkernels.dir/src/ktkernels_generated_cub.cu.o.NVCC-depend -ccbin /path/to/local/gcc/bin/gcc-4.9.3 -m64 --std c++11 -D__STRICT_ANSI__ -Xcompiler ,\"-Wall\",\"-g\",\"-g\",\"-DDEBUG\",\"-O0\" -gencode arch=compute_50,code=compute_50 -g -G --generate-line-info -DNVCC -I/usr/local/cuda/include -I/home/joeuser/src/kernel_tester/src -I/path/to/local/gcc/include -I/home/joeuser/opt/cub -I/usr/local/cuda/include
/usr/local/cuda/bin/nvcc /ghome/joeuser/src/kernel_tester/src/cub.cu -c -o /home/joeuser/src/kernel_tester/CMakeFiles/ktkernels.dir/src/./ktkernels_generated_cub.cu.o -ccbin /path/to/local/gcc/bin/gcc-4.9.3 -m64 --std c++11 -D__STRICT_ANSI__ -Xcompiler ,\"-Wall\",\"-g\",\"-g\",\"-DDEBUG\",\"-O0\" -gencode arch=compute_50,code=compute_50 -g -G --generate-line-info -DNVCC -I/usr/local/cuda/include -I/home/joeuser/src/kernel_tester/src -I/path/to/local/gcc/include -I/home/joeuser/opt/cub -I/usr/local/cuda/include
/usr/bin/ar cq lib/libktkernels.a CMakeFiles/ktkernels.dir/src/ktkernels_generated_cub.cu.o
/usr/bin/ranlib lib/libktkernels.a
/path/to/local/gcc/bin/g++-4.9.3 -D__STRICT_ANSI__ -Wall -std=c++11 -g -g -DDEBUG -O0 -I/ghome/joeuser/src/kernel_tester/src -I/path/to/local/gcc/include -isystem /home/joeuser/opt/cub -I/usr/local/cuda/include -I/home/joeuser/src/kernel_tester/lib -o CMakeFiles/tester.dir/src/main.cpp.o -c /home/joeuser/src/kernel_tester/src/main.cpp
/path/to/local/gcc/bin/g++-4.9.3 -Wall -std=c++11 -g -g -DDEBUG -O0 CMakeFiles/tester.dir/src/main.cpp.o -o bin/tester -rdynamic -Wl,-Bstatic -lcudart_static -Wl,-Bdynamic -lpthread -lrt -ldl lib/libktkernels.a -Wl,-Bstatic -lcudart_static -Wl,-Bdynamic -lpthread -lrt -ldl
Stack trace of segfaulting thread:
Stack trace of thread 31006:
#0 0x00007ff067e78edb n/a (libcuda.so.1)
#1 0x00007ff067e7a457 n/a (libcuda.so.1)
#2 0x00007ff067e9454e n/a (libcuda.so.1)
#3 0x00007ff067e389cd n/a (libcuda.so.1)
#4 0x00007ff067e7f183 n/a (libcuda.so.1)
#5 0x00007ff067e82cac n/a (libcuda.so.1)
#6 0x00007ff067e389cd n/a (libcuda.so.1)
#7 0x00007ff067e94fda n/a (libcuda.so.1)
#8 0x00007ff067e53615 n/a (libcuda.so.1)
#9 0x00007ff067e389cd n/a (libcuda.so.1)
#10 0x00007ff067e569c3 n/a (libcuda.so.1)
#11 0x00007ff067e592ee n/a (libcuda.so.1)
#12 0x00007ff067e5049f n/a (libcuda.so.1)
#13 0x00007ff067e2d9e2 n/a (libcuda.so.1)
#14 0x00007ff067c89123 n/a (libcuda.so.1)
#15 0x00007ff067c89b7e n/a (libcuda.so.1)
#16 0x00007ff067d76f80 n/a (libcuda.so.1)
#17 0x00007ff067d77290 n/a (libcuda.so.1)
#18 0x0000000000428e3d cudart::contextState::loadCubin(bool*, void**) (tester)
#19 0x000000000041bbd0 cudart::globalModule::loadIntoContext(cudart::contextState*) (tester)
#20 0x0000000000427fa6 cudart::contextState::applyChanges() (tester)
#21 0x000000000042c711 cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) (tester)
#22 0x000000000042031c cudart::doLazyInitContextState() (tester)
#23 0x0000000000412c3e cudart::cudaApiDeviceGetCacheConfig(cudaFuncCache*) (tester)
#24 0x00000000004355b4 cudaDeviceGetCacheConfig (tester)
#25 0x0000000000402f06 main (tester)
#26 0x00007ff068ab2700 __libc_start_main (libc.so.6)
#27 0x0000000000402e09 _start (tester)
Notes:
The following program:
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/system/cuda/execution_policy.h>
#include <cub/device/device_radix_sort.cuh>

template<typename T>
void cub_sort(T* first, size_t n)
{
  cudaError_t error = cudaSuccess;
  cub::DoubleBuffer<T> double_buffer;

  // measure the number of additional temporary storage bytes required
  size_t num_additional_temp_storage_bytes = 0;
  error = cub::DeviceRadixSort::SortKeys(0, num_additional_temp_storage_bytes, double_buffer, n);
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(0)");
  }

  // allocate temporary storage for double buffer as well as additional bytes
  // XXX seems like we should align up the additional bytes
  size_t num_double_buffer_bytes = n * sizeof(T);
  thrust::device_vector<char> temporary_storage(num_double_buffer_bytes + num_additional_temp_storage_bytes);
  void* additional_temp_storage_ptr = thrust::raw_pointer_cast(temporary_storage.data() + num_double_buffer_bytes);

  double_buffer.d_buffers[0] = thrust::raw_pointer_cast(&*first);
  double_buffer.d_buffers[1] = reinterpret_cast<T*>(thrust::raw_pointer_cast(temporary_storage.data()));

  error = cub::DeviceRadixSort::SortKeys(additional_temp_storage_ptr,
                                         num_additional_temp_storage_bytes,
                                         double_buffer,
                                         static_cast<int>(n));
  if(error)
  {
    throw thrust::system_error(error, thrust::cuda_category(), "after cub::DeviceRadixSort::SortKeys(1)");
  }

  if(double_buffer.Current() != 0)
  {
    T* temp_ptr = reinterpret_cast<T*>(double_buffer.d_buffers[1]);
    thrust::copy(thrust::cuda::par, temp_ptr, temp_ptr + n, first);
  }
}

int main()
{
  thrust::device_vector<int> data(0);
  cub_sort(thrust::raw_pointer_cast(data.data()), data.size());
  return 0;
}
Generates the following output at runtime:
$ nvcc -arch=sm_35 -I. cub_repro.cu -run
Floating point exception (core dumped)
Presumably a divide by zero occurs due to n == 0.
OutputIteratorT must be templated in
https://github.com/NVlabs/cub/blob/1.5.5/cub/block/block_store.cuh#L947
e.g. so that it works when it is a discard_iterator; currently cub::BlockStore is effectively templated on type T rather than on the iterator.
Suppose I need to sort pairs, but I don't care about the sorted keys.
Is it possible to discard the sorted-key output?
It would reduce the memory footprint, which might be very useful sometimes.
I have introduced CUB 1.5 as a dependency to my project here:
https://github.com/RAMitchell/xgboost/tree/master/plugin/updater_gpu
I have noticed that API-breaking changes were introduced in 1.6. However, the download links in the "stable releases" section at http://nvlabs.github.io/cub/index.html all point to 1.6.
This makes it very hard to distribute my algorithm if my users can only get the newest version, which has a different API.
When I try to sort an array of roughly 40M pairs or more, it simply fails to sort them without reporting any error.
Device is: Device 0: GeForce GTX 950 (PTX version 520, SM520, 6 SMs, 904 free / 1995 total MB physmem, 105.760 GB/s @ 3305000 kHz mem clock, ECC off)
cub version 1.5.5 (latest at the moment).
Sample project to reproduce the problem is attached
check_dev_radix.zip
When run with increasingly larger arrays, it eventually fails to sort.
As I understand it, the critical size depends on the amount of free device memory. The problem is that no error is reported.
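A hedged way to surface such silent failures is to check every return code, including the deferred one from the kernel launch. A sketch (names and types here are illustrative, assuming key/value DoubleBuffers already set up):

```cuda
#include <cub/cub.cuh>

// Sketch: check both the immediate return codes and the deferred launch
// status. A failed temp-storage cudaMalloc or an aborted launch otherwise
// passes silently. d_keys, d_values, and n are assumed set up elsewhere.
cudaError_t sort_checked(cub::DoubleBuffer<unsigned long long>& d_keys,
                         cub::DoubleBuffer<unsigned int>& d_values, int n)
{
    void*  d_temp     = NULL;
    size_t temp_bytes = 0;
    cudaError_t err = cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                                      d_keys, d_values, n);
    if (err != cudaSuccess) return err;   // size query failed
    err = cudaMalloc(&d_temp, temp_bytes);
    if (err != cudaSuccess) return err;   // out of device memory
    err = cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                          d_keys, d_values, n);
    if (err != cudaSuccess) { cudaFree(d_temp); return err; }
    err = cudaDeviceSynchronize();        // surfaces asynchronous launch errors
    cudaFree(d_temp);
    return err;
}
```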
The DeviceSegmentedRadixSort member functions should allow input-only parameters to have pointer-to-const type. For example, the documentation for the SortKeys() variant not taking a DoubleBuffer specifies:
The contents of the input data are not altered by the sorting operation
... and also documents that d_keys_in, d_begin_offsets, and d_end_offsets are input-only parameters. However, attempting to pass a pointer-to-const results in a compilation error:
#include <cstdlib>
#include <iostream>
#include <cuda_runtime.h>
#include <cub/device/device_segmented_radix_sort.cuh>
#define CUDA_CALL(call) \
do { \
const cudaError_t err = (call); \
if (cudaSuccess != err) { \
std::cerr << __FILE__ << ':' << __LINE__ << ": `" #call "' failed because of: " << cudaGetErrorString(err) << '\n' << std::flush; \
return EXIT_FAILURE; \
} \
} while (0)
int main(int argc, char *argv[]) {
// Declare, allocate, and initialize device-accessible pointers for sorting data
int num_items; // e.g., 7
int num_segments; // e.g., 3
int *d_offsets; // e.g., [0, 3, 3, 7]
int *d_keys_in; // e.g., [8, 6, 7, 5, 3, 0, 9]
int *d_keys_out; // e.g., [-, -, -, -, -, -, -]
num_items = 7;
num_segments = 3;
int h_offsets[] = { 0, 3, 3, 7 };
CUDA_CALL(cudaMalloc((void**)&d_offsets, (num_segments + 1) * sizeof(int)));
CUDA_CALL(cudaMemcpy((void*)d_offsets, (const void*)h_offsets, (num_segments + 1) * sizeof(int), cudaMemcpyHostToDevice));
int h_keys_in[] = { 8, 6, 7, 5, 3, 0, 9 };
CUDA_CALL(cudaMalloc((void**)&d_keys_in, num_items * sizeof(int)));
CUDA_CALL(cudaMemcpy((void*)d_keys_in, (const void*)h_keys_in, num_items * sizeof(int), cudaMemcpyHostToDevice));
int *h_keys_out = new int[num_items];
CUDA_CALL(cudaMalloc((void**)&d_keys_out, num_items * sizeof(int)));
// Determine temporary device storage requirements
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
CUDA_CALL(cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage,
temp_storage_bytes,
(const int*)d_keys_in,
d_keys_out,
num_items,
num_segments,
d_offsets,
d_offsets + 1));
// Allocate temporary storage
CUDA_CALL(cudaMalloc(&d_temp_storage, temp_storage_bytes));
// Run sorting operation
CUDA_CALL(cub::DeviceSegmentedRadixSort::SortKeys(d_temp_storage,
temp_storage_bytes,
(const int*)d_keys_in,
d_keys_out,
num_items,
num_segments,
d_offsets,
d_offsets + 1));
CUDA_CALL(cudaMemcpy((void*)h_keys_out, (const void*)d_keys_out, num_items * sizeof(int), cudaMemcpyDeviceToHost));
std::cout << '[';
for (int i = 0; i < num_items; ++i) {
if (i != 0) std::cout << ", ";
std::cout << h_keys_out[i];
}
std::cout << "]\n";
return EXIT_SUCCESS;
}
Removing the casts to const int* in the two invocations of cub::DeviceSegmentedRadixSort::SortKeys() allows the sample to compile and produce the expected output:
[6, 7, 8, 0, 3, 5, 9]
I found this with the NVIDIA Visual Profiler.
void *d_temp_storage = NULL;
size_t temp_storage_bytes = 0;
CubDebugExit(cub::DeviceScan::ExclusiveSum(d_temp_storage,
temp_storage_bytes,
d_in,
d_out,
count,
s));
CubDebugExit(g_allocator.DeviceAllocate(&d_temp_storage, temp_storage_bytes, s));
CubDebugExit(cub::DeviceScan::ExclusiveSum(d_temp_storage,
temp_storage_bytes,
d_in,
d_out,
count,
s));
asynchronous cuda
In my case, to reduce accumulated rounding error, I want the result to be double.
But I get a compilation error when I use the method
static CUB_RUNTIME_FUNCTION cudaError_t cub::DeviceScan::ExclusiveSum
If I provide both vectors of the same type (float or double), it compiles fine.
But I want to keep the original array as float (to save memory) while accumulating into double (to increase accuracy).
How can I sum floats into doubles?
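One workaround that should fit this case (a hedged sketch; the function name is illustrative) is to widen the input on the fly with cub::TransformInputIterator, so the scan's value type becomes double while the stored array stays float:

```cuda
#include <cub/cub.cuh>

// Cast functor: widens each float to double as it is read.
struct FloatToDouble
{
    __host__ __device__ __forceinline__
    double operator()(const float& x) const { return static_cast<double>(x); }
};

// Sketch: exclusive-sum floats into doubles without storing a double input
// array. d_in (float*), d_out (double*), and count are assumed allocated.
cudaError_t exclusive_sum_widened(const float* d_in, double* d_out, int count)
{
    cub::TransformInputIterator<double, FloatToDouble, const float*>
        itr(d_in, FloatToDouble());
    void*  d_temp     = NULL;
    size_t temp_bytes = 0;
    cudaError_t err = cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes,
                                                    itr, d_out, count);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_temp, temp_bytes);
    if (err != cudaSuccess) return err;
    err = cub::DeviceScan::ExclusiveSum(d_temp, temp_bytes, itr, d_out, count);
    cudaFree(d_temp);
    return err;
}
```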
Are there any plans to implement vectorized binary search? Is it possible to improve upon the performance of thrust binary search using CUB?
Hi, I was wondering if your block-local sorts could return both the sorted values and their indices in the original array?
Bryan F
Jun 1
Visual Studio 2013, CUDA 7.5, CUB 1.5.2
Error in util_allocator.cuh@Line 414: map/set iterator not dereferencable
This issue is caused by erase being called on the block iterator before the debug output is printed.
cached_blocks.erase(block_itr);
cached_bytes[device].free -= search_key.bytes;
cached_bytes[device].live += search_key.bytes;
if (debug) _CubLog("\tDevice %d reused cached block at %p (%lld bytes) for stream %lld (previously associated with stream %lld).\n",
device, search_key.d_ptr, (long long)search_key.bytes, (long long)search_key.associated_stream, (long long)block_itr->associated_stream);
Fixed:
cached_bytes[device].free -= search_key.bytes;
cached_bytes[device].live += search_key.bytes;
if (debug) _CubLog("\tDevice %d reused cached block at %p (%lld bytes) for stream %lld (previously associated with stream %lld).\n",
device, search_key.d_ptr, (long long)search_key.bytes, (long long)search_key.associated_stream, (long long)block_itr->associated_stream);
cached_blocks.erase(block_itr);
The same error should occur at several locations where block_itr is used in the debug output.
https://github.com/NVlabs/cub/blob/1.5.5/cub/util_type.cuh#L651
Is the union still necessary? If so, is there another way to enforce alignment?
I did what you suggested and discovered that cub::DeviceHistogram::HistogramEven reports an enormous temporary-storage requirement.
The histogram routine says it needs about 18446744073709552000 bytes, roughly 16 exabytes. Since this number is close to the maximum value of a 64-bit size_t, I assume an arithmetic overflow occurred somewhere in the calculation of the required storage.
I attached a full program this time and hope you can reproduce the problem.
struct floatExp {
float a, b, c;
};
int main() {
cudaDeviceSetLimit(cudaLimitPrintfFifoSize, 1024 * 1024 * 1024);
unsigned int const sz = 65536 * 2;
unsigned int *idx_d = NULL;
unsigned int *newIdx_d = NULL;
void *tmpSort_d = NULL, *tmpHist_d = NULL;
size_t tmpSortSz = 0, tmpHistSz = 0;
cudaMalloc(reinterpret_cast<void **>(&idx_d), sizeof(unsigned int) * sz);
cudaMalloc(reinterpret_cast<void **>(&newIdx_d), sizeof(unsigned int) * sz);
float *data_d = NULL;
float *newData_d = NULL;
cudaMalloc(reinterpret_cast<void **>(&data_d), sizeof(float) * sz * 3);
cudaMalloc(reinterpret_cast<void **>(&newData_d), sizeof(float) * sz * 3);
unsigned int totalBlockSz = 801u*801u*401u+2u;
unsigned int *idx_h = new unsigned int[sz];
for (unsigned int i = 0; i < sz; ++i) {
idx_h[i] = sz - i;
}
cudaMemcpy(idx_d, idx_h, sizeof(unsigned int) * sz, cudaMemcpyHostToDevice);
unsigned int *hist_d = NULL;
cudaMalloc(reinterpret_cast<void **>(&hist_d), sizeof(unsigned int) * (sz + 1));
cub::DeviceRadixSort::SortPairs(tmpSort_d, tmpSortSz, idx_d, newIdx_d,
reinterpret_cast<floatExp *>(const_cast<float *>(data_d)), reinterpret_cast<floatExp *>(newData_d), sz);
cudaDeviceSynchronize();
printf("TotalBlockSz %i\n", totalBlockSz);
printf("sz %u \n", sz);
cub::DeviceHistogram::HistogramEven(tmpHist_d, tmpHistSz, newIdx_d, hist_d,
static_cast<int>(totalBlockSz + 2), 0u, totalBlockSz + 1u, static_cast<int>(sz));
printf("tmpHistSz %f\n", static_cast<float>(tmpHistSz));
printf("tmpHistSz %lu\n", tmpHistSz);
cudaDeviceSynchronize();
cudaDeviceSynchronize();
cudaError_t err = cudaGetLastError();
printf("Last Cuda error was: %i\n", err);
system("pause");
cudaFree(idx_d);
cudaFree(newIdx_d);
cudaFree(data_d);
cudaFree(newData_d);
cudaFree(hist_d);
return 0;
}
While building CNTK, warnings are treated as errors, and compilation fails because of this warning:
'cub::DispatchRadixSort<false,KeyT,ValueT,int>' : assignment operator could not be generated
A temporary workaround is to disable the warnings-as-errors switch in CNTK, but a better solution would be to eliminate the warning in CUB.
I noticed that the 1.5.1 branch was removed, along with the cub.1.5.1.zip file when 1.5.2 was released. Would it be possible to retain archives of these older versions so that builds can be reproducible? In my case, I had a 1.5.1 dependency in a Dockerfile. This isn't a train smash for me by any means, but it'd be nice to have some future-proofing.
I'm having issues with using radix sort in kernel for float sorting.
Attached example code with output and cuda-memcheck summary. The cub version is 1.5.1.
cub/cub/device/dispatch/dispatch_reduce_by_key.cuh:432:32: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
dim3 scan_grid_size(
^
It looks like a valid warning, and is probably harmless.
To increase accuracy (in my case the array has 100M elements and the difference is noticeable), I'd like to store the result into a double*, but I'm getting the following compilation error:
code https://github.com/sh1ng/arboretum/blob/master/src/core/garden.cu#L314
/Users/sh1ng/Projects/cub/cub/agent/agent_scan.cuh:351: error: no instance of overloaded function "cub::BlockStore<OutputIteratorT, BLOCK_DIM_X, ITEMS_PER_THREAD, ALGORITHM, BLOCK_DIM_Y, BLOCK_DIM_Z, PTX_ARCH>::Store [with OutputIteratorT=double *, BLOCK_DIM_X=128, ITEMS_PER_THREAD=12, ALGORITHM=cub::BLOCK_STORE_WARP_TRANSPOSE, BLOCK_DIM_Y=1, BLOCK_DIM_Z=1, PTX_ARCH=610]" matches the argument list
argument types are: (double *, float [12])
object type is: cub::BlockStore<double *, 128, 12, cub::BLOCK_STORE_WARP_TRANSPOSE, 1, 1, 610>
detected during:
instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ConsumeTile<IS_FULL_TILE>(OffsetT, OffsetT, int, OffsetT, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ScanTileStateT &) [with AgentScanPolicyT=cub::DispatchScan<float *, double *, cub::Sum, float, int>::PtxAgentScanPolicy, InputIteratorT=float *, OutputIteratorT=double *, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int, IS_FULL_TILE=true]"
(372): here
instantiation of "void cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ConsumeRange(int, cub::AgentScan<AgentScanPolicyT, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::ScanTileStateT &) [with AgentScanPolicyT=cub::DispatchScan<float *, double *, cub::Sum, float, int>::PtxAgentScanPolicy, InputIteratorT=float *, OutputIteratorT=double *, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int]"
../cub/cub/device/dispatch/dispatch_scan.cuh(127): here
instantiation of "void cub::DeviceScanSweepKernel<ScanPolicyT,InputIteratorT,OutputIteratorT,ScanTileStateT,ScanOpT,IdentityT,OffsetT>(InputIteratorT, OutputIteratorT, ScanTileStateT, ScanOpT, IdentityT, OffsetT) [with ScanPolicyT=cub::DispatchScan<float *, double *, cub::Sum, float, int>::PtxAgentScanPolicy, InputIteratorT=float *, OutputIteratorT=double *, ScanTileStateT=cub::ScanTileState<float, true>, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int]"
../cub/cub/device/dispatch/dispatch_scan.cuh(543): here
instantiation of "cudaError_t cub::DispatchScan<InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT>::Dispatch(void *, size_t &, InputIteratorT, OutputIteratorT, ScanOpT, IdentityT, OffsetT, cudaStream_t, __nv_bool) [with InputIteratorT=float *, OutputIteratorT=double *, ScanOpT=cub::Sum, IdentityT=float, OffsetT=int]"
../cub/cub/device/device_scan.cuh(156): here
instantiation of "cudaError_t cub::DeviceScan::ExclusiveSum(void *, size_t &, InputIteratorT, OutputIteratorT, int, cudaStream_t, __nv_bool) [with InputIteratorT=float *, OutputIteratorT=double *]"
src/core/garden.cu(314): here
instantiation of "void arboretum::core::GardenBuilder<node_type>::FindBestSplits(int, const arboretum::io::DataMatrix *, const thrust::host_vector<float, thrust::system::cuda::experimental::pinned_allocator> &) [with node_type=unsigned short]"
src/core/garden.cu(135): here
instantiation of "void arboretum::core::GardenBuilder<node_type>::GrowTree(arboretum::core::RegTree *, const arboretum::io::DataMatrix *, const thrust::host_vector<float, thrust::system::cuda::experimental::pinned_allocator> &) [with node_type=unsigned short]"
src/core/garden.cu(105): here
instantiation of "arboretum::core::GardenBuilder<node_type>::GardenBuilder(const arboretum::core::TreeParam &, const arboretum::io::DataMatrix *) [with node_type=unsigned short]"
src/core/garden.cu(509): here
The following program:
#include <cub/cub.cuh>
int main()
{
cub::DeviceRadixSort::SortKeys<bool>;
return 0;
}
does not compile:
$nvcc -I. sort_bool.cu
sort_bool.cu(5): warning: expression has no effect
./cub/device/dispatch/../../block_sweep/block_radix_sort_upsweep.cuh(92): error: class "cub::Traits<__nv_bool>" has no member "UnsignedBits"
detected during:
instantiation of class "cub::BlockRadixSortUpsweep<BlockRadixSortUpsweepPolicy, Key, Offset> [with BlockRadixSortUpsweepPolicy=cub::DeviceRadixSortDispatch<false, __nv_bool, cub::NullType, int>::PtxUpsweepPolicy, Key=__nv_bool, Offset=int]"
./cub/device/dispatch/device_radix_sort_dispatch.cuh(80): here
instantiation of "void cub::DeviceRadixSortUpsweepKernel<BlockRadixSortUpsweepPolicy,DESCENDING,Key,Offset>(Key *, Offset *, Offset, int, int, __nv_bool, cub::GridEvenShare<Offset>) [with BlockRadixSortUpsweepPolicy=cub::DeviceRadixSortDispatch<false, __nv_bool, cub::NullType, int>::PtxUpsweepPolicy, DESCENDING=false, Key=__nv_bool, Offset=int]"
./cub/device/dispatch/device_radix_sort_dispatch.cuh(931): here
instantiation of "cudaError_t cub::DeviceRadixSortDispatch<DESCENDING, Key, Value, Offset>::Dispatch(void *, size_t &, cub::DoubleBuffer<Key> &, cub::DoubleBuffer<Value> &, Offset, int, int, cudaStream_t, __nv_bool) [with DESCENDING=false, Key=__nv_bool, Value=cub::NullType, Offset=int]"
./cub/device/device_radix_sort.cuh(331): here
instantiation of "cudaError_t cub::DeviceRadixSort::SortKeys(void *, size_t &, cub::DoubleBuffer<Key> &, int, int, int, cudaStream_t, __nv_bool) [with Key=__nv_bool]"
...
I tried to find out if this was intended to be supported or not. The documentation implies it is:
DeviceRadixSort can sort all of the built-in C++ numeric primitive types, e.g.: unsigned char, int,
double, etc. Although the direct radix sorting method can only be applied to unsigned integral types,
BlockRadixSort is able to sort signed and floating-point types via simple bit-wise transformations
that ensure lexicographic key ordering.
bool is indeed an integral type, so I think it's worth supporting for completeness.
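Until bool keys are supported, one hedged workaround is to sort the same bytes as unsigned char: sizeof(bool) is 1 on the relevant platforms, and the 0/1 byte values preserve the false < true ordering. A sketch (the function name is illustrative):

```cuda
#include <cub/cub.cuh>

// Sketch: reinterpret the bool arrays as unsigned char, which
// DeviceRadixSort supports directly.
cudaError_t sort_bools(bool* d_in, bool* d_out, int n)
{
    unsigned char* in  = reinterpret_cast<unsigned char*>(d_in);
    unsigned char* out = reinterpret_cast<unsigned char*>(d_out);
    void*  d_temp     = NULL;
    size_t temp_bytes = 0;
    cudaError_t err = cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes,
                                                     in, out, n);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_temp, temp_bytes);
    if (err != cudaSuccess) return err;
    err = cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, in, out, n);
    cudaFree(d_temp);
    return err;
}
```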
I have three vectors
thrust::device_vector<unsigned long> keys(N);
thrust::device_vector<unsigned> value1(N);
thrust::device_vector<unsigned> value2(N);
I am trying to achieve the equivalent of thrust::sort_by_key using cub::DeviceRadixSort::SortPairs.
thrust::sort_by_key(keys.begin(), keys.end(),
thrust::make_zip_iterator(thrust::make_tuple(value1.begin(), value2.begin())));
So essentially, my question is, how do I sort by key where each key has multiple values associated to it?
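One common pattern (a hedged sketch; the function name is illustrative) is to sort an index permutation once with SortPairs, then gather each value array through the sorted indices:

```cuda
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/gather.h>
#include <cub/device/device_radix_sort.cuh>

// Sketch: sort (keys, indices) once, then permute each value array.
void sort_two_values_by_key(thrust::device_vector<unsigned long>& keys,
                            thrust::device_vector<unsigned>& value1,
                            thrust::device_vector<unsigned>& value2)
{
    const int n = static_cast<int>(keys.size());
    thrust::device_vector<unsigned long> keys_out(n);
    thrust::device_vector<int> idx(n), idx_out(n);
    thrust::sequence(idx.begin(), idx.end());  // 0, 1, ..., n-1

    // Size query, then the actual sort of the (key, index) pairs.
    void*  d_temp     = NULL;
    size_t temp_bytes = 0;
    cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
        thrust::raw_pointer_cast(keys.data()),
        thrust::raw_pointer_cast(keys_out.data()),
        thrust::raw_pointer_cast(idx.data()),
        thrust::raw_pointer_cast(idx_out.data()), n);
    thrust::device_vector<char> temp(temp_bytes);
    cub::DeviceRadixSort::SortPairs(thrust::raw_pointer_cast(temp.data()),
        temp_bytes,
        thrust::raw_pointer_cast(keys.data()),
        thrust::raw_pointer_cast(keys_out.data()),
        thrust::raw_pointer_cast(idx.data()),
        thrust::raw_pointer_cast(idx_out.data()), n);

    // Apply the permutation to each value array.
    thrust::device_vector<unsigned> v1_out(n), v2_out(n);
    thrust::gather(idx_out.begin(), idx_out.end(), value1.begin(), v1_out.begin());
    thrust::gather(idx_out.begin(), idx_out.end(), value2.begin(), v2_out.begin());
    keys.swap(keys_out);
    value1.swap(v1_out);
    value2.swap(v2_out);
}
```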
I ran into this in my own code, and then tried the CUB tests, and they fail to compile too.
harrism@snafubuntu:~/src/cub/test$ make test_block_radix_sort
mkdir -p bin
"/usr/local/cuda/bin/nvcc" -gencode=arch=compute_20,code=\"sm_20,compute_20\" -o bin/test_block_radix_sort_nvvm_5.0_noabi_i386 test_block_radix_sort.cu -Xptxas -v -Xcudafe -# -Xptxas -abi=no -m32 -I. -I../cub -O3
../cub/block/../warp/warp_reduce.cuh(134): warning: declaration of "T" hides template parameter
../cub/block/../warp/warp_reduce.cuh(456): error: a class or namespace qualified name is required
../cub/block/../warp/warp_reduce.cuh(459): error: name followed by "::" must be a class or namespace name
Front end time 17.78 (CPU) 18.00 (elapsed)
2 errors detected in the compilation of "/tmp/tmpxft_000012b7_00000000-6_test_block_radix_sort.cpp1.ii".
Total compilation time 18.01 (CPU) 19.00 (elapsed)
make: *** [bin/test_block_radix_sort_nvvm_5.0_noabi_i386] Error 2
harrism@snafubuntu:~/src/cub/test$ uname -a
Linux snafubuntu 3.0.0-32-generic #51-Ubuntu SMP Thu Mar 21 15:50:59 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
harrism@snafubuntu:~/src/cub/test$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
harrism@snafubuntu:~/src/cub/test$ g++ --version
g++ (Ubuntu/Linaro 4.6.1-9ubuntu3) 4.6.1
Copyright (C) 2011 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The errors for other tests start the same, but are much worse (many more error messages).
https://github.com/NVlabs/cub/blob/1.5.5/cub/thread/thread_operators.cuh#L99
If EqualityOp is a callable with a non-const operator(), InequalityWrapper::operator() produces a compilation error. Making InequalityWrapper::operator() non-const would work with both non-const and const EqualityOp::operator().
Hi,
I was wondering whether it is OK to set the input and the output to the same pointer? It is common to want to sort an array and write the sorted result back to the original location.
Thanks.
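As I understand it, DeviceRadixSort does not sort strictly in place: the radix passes ping-pong between buffers. A hedged sketch of the usual pattern (names are illustrative) wraps the array in a DoubleBuffer with a same-size scratch buffer, then copies back if the result ends up in the alternate buffer:

```cuda
#include <cub/cub.cuh>

// Sketch: "in-place" sort via DoubleBuffer. d_data holds the input and should
// receive the output; d_alt is scratch of the same size.
cudaError_t sort_inplace(int* d_data, int* d_alt, int n)
{
    cub::DoubleBuffer<int> d_keys(d_data, d_alt);
    void*  d_temp     = NULL;
    size_t temp_bytes = 0;
    cudaError_t err = cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_keys, n);
    if (err != cudaSuccess) return err;
    err = cudaMalloc(&d_temp, temp_bytes);
    if (err != cudaSuccess) return err;
    err = cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, d_keys, n);
    cudaFree(d_temp);
    if (err != cudaSuccess) return err;
    // The sorted run may have landed in d_alt; copy it home if so.
    if (d_keys.Current() != d_data)
        err = cudaMemcpy(d_data, d_keys.Current(), n * sizeof(int),
                         cudaMemcpyDeviceToDevice);
    return err;
}
```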
cub 1.5.5
error: no instance of function template "cub::DeviceScan::ExclusiveSum" matches the argument list
argument types are: (void *, size_t, float2 *, float2 *, arboretum::core::CustomSum, float2, const size_t, cudaStream_t)
or
error: no instance of function template "cub::DeviceScan::ExclusiveSum" matches the argument list
argument types are: (long, size_t, float *, float *, arboretum::core::CustomSum, float, const size_t, cudaStream_t)
struct CustomSum
{
template <typename T>
CUB_RUNTIME_FUNCTION __forceinline__
T operator()(const T &a, const T &b) const {
return a + b;
}
};
Sorry for the duplicate (https://github.com/NVlabs/cub/issues/40), but it was closed without a fix or workaround.
Suppose I have three arrays/vectors,
and I want to sort values1 and values2 by keys.
It can be done by calling SortPairs twice.
Can the running time be improved by doing it in a single sort?
Could you provide a code snippet for that, or would it require changes to the implementation?
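One approach that may avoid the second sort (a hedged sketch; kernel names are illustrative) is to pack the two 32-bit values into a single 64-bit payload, so one SortPairs call moves both, unpacking afterwards:

```cuda
#include <cub/cub.cuh>

// Pack the two 32-bit value arrays into one 64-bit payload per element.
__global__ void pack(const unsigned* v1, const unsigned* v2,
                     unsigned long long* packed, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        packed[i] = (static_cast<unsigned long long>(v1[i]) << 32) | v2[i];
}

// Split each sorted 64-bit payload back into the two value arrays.
__global__ void unpack(const unsigned long long* packed,
                       unsigned* v1, unsigned* v2, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        v1[i] = static_cast<unsigned>(packed[i] >> 32);
        v2[i] = static_cast<unsigned>(packed[i] & 0xffffffffu);
    }
}

// A single cub::DeviceRadixSort::SortPairs over (keys, packed) then replaces
// the two separate sorts, at the cost of the pack/unpack passes.
```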
Hi, I'm using DeviceRadixSort::SortKeys to sort an array of integers. If the num_items passed to SortKeys is <= 1792 the output vector is all zeros. If I increase the num_items to 1793 the output is correct (without changing any of the other arguments). For size <=1792 it seems to use the single tile path. I'm compiling for sm_30. Any ideas how to further track this down?
Hi,
Would it be possible to use Git tags for releases rather than just branches? This is the expected way to handle this on GitHub (https://github.com/NVlabs/cub/releases). Thanks!