Comments (5)
> why does the sorting fail with an error like "Can't allocate device memory"?

The sorting won't fail with a memory-allocation error. If that's the error you're getting from CUB, then the program had already failed and was simply returning a latent error left over from an earlier failed attempt to allocate memory, one that was never cleared.
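To make the latent-error point concrete, here is a minimal sketch (buffer names are assumed, not from the attached project) of how to tell an old, uncleared error apart from one raised by the sort itself:

```cuda
// Sketch: an error from an earlier failed allocation lingers until
// something reads it. Reading-and-clearing it with cudaGetLastError()
// *before* the sort distinguishes old failures from new ones.
cudaError_t pre = cudaGetLastError();       // read and clear any latent error
if (pre != cudaSuccess)
    fprintf(stderr, "latent error before sort: %s\n", cudaGetErrorString(pre));

cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                d_keys_in, d_keys_out,
                                d_vals_in, d_vals_out, num_items);

cudaError_t post = cudaDeviceSynchronize(); // surfaces errors from the sort itself
if (post != cudaSuccess)
    fprintf(stderr, "error during sort: %s\n", cudaGetErrorString(post));
```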
> I think those extra allocations must be covered under temp_storage.

CUB does no allocation whatsoever. Everything its sorting needs is bundled into the temp storage, which you can allocate well in advance (even conservatively, using an upper bound on problem size, if one is available). In general, CUDA device memory allocation is a stream-blocking, host-synchronizing event, and CUB doesn't want to impose that on an application in the middle of what the application presumes is an asynchronous stream computation.
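This is the standard two-call pattern for CUB's device-wide algorithms: the first call with a NULL temp-storage pointer only writes the required size, and the caller owns the allocation. A sketch (pointer names are placeholders):

```cuda
#include <cub/cub.cuh>

// Query-then-allocate pattern: CUB itself allocates nothing.
void sort_pairs(unsigned int *d_keys_in, unsigned int *d_keys_out,
                unsigned int *d_vals_in, unsigned int *d_vals_out,
                int num_items)
{
    void  *d_temp_storage     = NULL;
    size_t temp_storage_bytes = 0;

    // 1st call: size query only (d_temp_storage == NULL), no work is done
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, num_items);

    // Allocate once, possibly far in advance, sized for the largest problem
    cudaMalloc(&d_temp_storage, temp_storage_bytes);

    // 2nd call: the actual sort
    cub::DeviceRadixSort::SortPairs(d_temp_storage, temp_storage_bytes,
                                    d_keys_in, d_keys_out,
                                    d_vals_in, d_vals_out, num_items);

    cudaFree(d_temp_storage);
}
```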
from cub.
Hrm. Seems to work just fine for me. What OS, host, and CUDA compilers are you using? (You might consider checking the status result from the allocator.)
Compiling:
[dumerrill@dt06 removeme]$ nvcc -arch=sm_52 -std=c++11 -O3 main.cpp sort_cub.cu -I../.. -I.
For 100M items:
[dumerrill@dt06 removeme]$ ./a.out 100000000
Found 4 CUDA devices:
Device 0: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 1: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 2: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 3: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Largest mem available 22.2GiB @0
Smallest mem available 22.2GiB @0
Device 0 selected
set length 100000000
Array length 100001408 (381MiB)
Data:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort0:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort1:
7 91538659
37 4880726
51 72684560
95 9505224
95 83691578
181 85858716
224 77198143
227 52701079
315 30544587
367 77156907
Sort ok
With 40M:
[dumerrill@dt06 removeme]$ ./a.out 40000000
Found 4 CUDA devices:
Device 0: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 1: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 2: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 3: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free / 22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Largest mem available 22.2GiB @0
Smallest mem available 22.2GiB @0
Device 0 selected
set length 40000000
Array length 40000896 (152MiB)
Data:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort0:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort1:
37 4880726
95 9505224
315 30544587
448 12936177
452 28788976
490 27337079
614 3012490
657 32356371
657 34183315
677 28274133
Sort ok
On Tue, Nov 15, 2016 at 4:05 AM, daktfi [email protected] wrote:

> When I try to sort an array of 40M (roughly) pairs or longer, it simply does not sort them, without reporting any errors.
> Device is: Device 0: GeForce GTX 950 (PTX version 520, SM520, 6 SMs, 904 free / 1995 total MB physmem, 105.760 GB/s @ 3305000 kHz mem clock, ECC off)
> cub version 1.5.5 (latest at the moment). Sample project to reproduce the problem is attached:
> check_dev_radix.zip
> https://github.com/NVlabs/cub/files/591469/check_dev_radix.zip
> When run with increasingly larger arrays to sort, it eventually fails to sort them.
I found the problem: it is necessary to check cudaPeekAtLastError()/cudaGetLastError() after the sort. It seems sorting requires an additional amount of video memory beyond the allocated buffers and temp_storage (roughly as much again as twice the size of the keys). Mind the row with the device specs in the original post: there were only 904 MB of free memory.
I don't think it's a major bug, but it's still quite inconvenient. I think those extra allocations should be covered under temp_storage.
The setup is:
Kubuntu 16.04 fully updated, gcc 5.4, CUDA 8.0;
Core i7 (don't think that matters, though), 32 GB RAM, GTX 950 with 2 GB of memory.
The implementation does peek at errors after each kernel launch, e.g.,
https://github.com/NVlabs/cub/blob/1.5.5/cub/device/dispatch/dispatch_radix_sort.cuh#L900.
However, as you mention, this doesn't capture all runtime errors: others only show up when the stream is synchronized with the host (e.g., during a malloc or memcpy). If you want improved CUB debugging, you can set the last, optional debug_synchronous parameter to true; the implementation will then synchronize the stream after each kernel invocation to catch CUDART errors that would not otherwise be reported. (Of course, this incurs the added runtime overhead of synchronizing the device with the host.)
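As a sketch of where that parameter sits (the pointer names are placeholders; the trailing arguments of SortPairs in CUB 1.5.x are begin_bit, end_bit, stream, debug_synchronous):

```cuda
// Enabling CUB's debug mode: with debug_synchronous = true, the dispatch
// layer synchronizes and checks the CUDART status after every kernel launch.
cub::DeviceRadixSort::SortPairs(
    d_temp_storage, temp_storage_bytes,
    d_keys_in, d_keys_out, d_vals_in, d_vals_out,
    num_items,
    0, sizeof(unsigned int) * 8,  // begin_bit, end_bit (full key width)
    0,                            // stream 0 (default)
    true);                        // debug_synchronous
```

This is for debugging only; the extra host synchronization after every launch defeats the point of an asynchronous stream in production.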
Thanks for the advice on debugging; this will be quite useful.
However, my point here and now is not about error reporting (that's my fault; I'm quite new to CUDA and missed a few points in the manual), but about memory consumption. If I have already allocated TWO buffers for keys, TWO buffers for values, and even some extra temporary storage, why does the sorting fail with an error like "Can't allocate device memory"?
I don't care how much memory it needs to sort the data, I just want to be able to allocate that amount. And to do that, I have to know how big it is. Not a big deal, I repeat, but still a little annoying...
To be specific, I'm sorting rather large arrays (over 1B key-value pairs, hopefully both 64-bit). To do that, I split 'em into device-sortable blocks (and then merge 'em later, but that is completely irrelevant here)... To calculate the proper length for such smaller blocks I need to know the exact memory consumption, and, oops, I can't. :-)
After some runs of the attached test I figured the proper ratio to be about three: to sort 1M pairs of 32-bit key + 32-bit value I need about 24 MB of memory, while the total combined size of the allocated buffers and temporary storage is noticeably smaller (I don't have a PC with a working CUDA setup right now to check exactly). Hope this number helps someone. :-)
from cub.