
Comments (5)

dumerrill commented on May 18, 2024

why does the sorting fail with an error like "Can't allocate device memory"?

The sort itself won't fail with a memory allocation error. If that's the error you're getting back from CUB, then the program had already failed earlier, and CUB is simply returning a latent error from a previous failed allocation attempt that was never cleared.
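
To illustrate what a "latent" error looks like, here is a minimal sketch using only the CUDA runtime API. It assumes that requesting twice the device's total memory makes cudaMalloc fail; the variable names are illustrative and not taken from the attached project.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    // Deliberately request more memory than the device has, so the
    // allocation is expected to fail (assumption of this sketch).
    void* d_ptr = nullptr;
    cudaError_t alloc_status = cudaMalloc(&d_ptr, 2 * total_bytes);
    std::printf("cudaMalloc:          %s\n", cudaGetErrorString(alloc_status));

    // Peeking reports the recorded error but leaves it in place, so any
    // later code that peeks (CUB included) sees the same stale error.
    std::printf("cudaPeekAtLastError: %s\n",
                cudaGetErrorString(cudaPeekAtLastError()));

    // cudaGetLastError returns the error and resets the state to cudaSuccess.
    std::printf("cudaGetLastError:    %s\n",
                cudaGetErrorString(cudaGetLastError()));
    std::printf("after reset:         %s\n",
                cudaGetErrorString(cudaPeekAtLastError()));

    if (alloc_status == cudaSuccess) cudaFree(d_ptr);
    return 0;
}
```

Checking the status of every allocation, and clearing or handling a failure before calling into CUB, keeps an old error from being misattributed to the sort.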

I think those extra allocations must be covered under temp_storage.

CUB does no allocation whatsoever. Everything its sorting needs is bundled up in the temp storage, which you can allocate way in advance (conservatively, even, using an upper bound on the problem size, if one is available). In general, CUDA device memory allocation is a stream-blocking, host-synchronizing event, and CUB doesn't want to impose that on an application right in the middle of what the application presumes to be an asynchronous stream computation.
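
A minimal sketch of that pattern follows. It assumes 32-bit keys and values and an arbitrary upper bound of 1M items; the CHECK macro and all variable names are hypothetical helpers for this example, not part of CUB. The first SortPairs call, made with a null temp storage pointer, only reports the required size; the second call does the work.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

int main() {
    const int max_items = 1 << 20;  // assumed upper bound on problem size
    const size_t bytes = max_items * sizeof(unsigned int);

    unsigned int *d_keys_in = nullptr, *d_keys_out = nullptr;
    unsigned int *d_vals_in = nullptr, *d_vals_out = nullptr;
    CHECK(cudaMalloc(&d_keys_in, bytes));
    CHECK(cudaMalloc(&d_keys_out, bytes));
    CHECK(cudaMalloc(&d_vals_in, bytes));
    CHECK(cudaMalloc(&d_vals_out, bytes));
    CHECK(cudaMemset(d_keys_in, 0, bytes));  // placeholder input data
    CHECK(cudaMemset(d_vals_in, 0, bytes));

    // Phase 1: with a null temp storage pointer, CUB only writes the
    // number of bytes it needs into temp_bytes and does no sorting.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    CHECK(cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                          d_keys_in, d_keys_out,
                                          d_vals_in, d_vals_out, max_items));
    CHECK(cudaMalloc(&d_temp, temp_bytes));
    std::printf("temp storage: %zu bytes\n", temp_bytes);

    // Phase 2: the actual sort, using only memory allocated above.
    CHECK(cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes,
                                          d_keys_in, d_keys_out,
                                          d_vals_in, d_vals_out, max_items));
    CHECK(cudaDeviceSynchronize());

    cudaFree(d_temp);
    cudaFree(d_keys_in);  cudaFree(d_keys_out);
    cudaFree(d_vals_in);  cudaFree(d_vals_out);
    return 0;
}
```

Because every allocation is checked here, a shortage of device memory shows up at the cudaMalloc that caused it rather than later at the sort.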


dumerrill commented on May 18, 2024

Hrm. Seems to work just fine for me. What OS, host, and CUDA compilers are
you using? (You might consider checking the status result from the
allocator.)

Compiling:

[dumerrill@dt06 removeme]$ nvcc -arch=sm_52 -std=c++11 -O3 main.cpp
sort_cub.cu -I../.. -I.

For 100M items:

[dumerrill@dt06 removeme]$ ./a.out 100000000
Found 4 CUDA devices:
Device 0: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free /
22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 1: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free /
22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 2: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free /
22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 3: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free /
22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Largest mem available 22.2GiB @0
Smallest mem available 22.2GiB @0
Device 0 selected
set length 100000000

Array length 100001408 (381MiB)
Data:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort0:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort1:
7 91538659
37 4880726
51 72684560
95 9505224
95 83691578
181 85858716
224 77198143
227 52701079
315 30544587
367 77156907
Sort ok

With 40M:

[dumerrill@dt06 removeme]$ ./a.out 40000000
Found 4 CUDA devices:
Device 0: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free /
22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 1: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free /
22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 2: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free /
22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Device 3: Graphics Device (PTX version 520, SM610, 30 SMs, 22749 free /
22912 total MB physmem, 347.040 GB/s @ 3615000 kHz mem clock, ECC on)
Largest mem available 22.2GiB @0
Smallest mem available 22.2GiB @0
Device 0 selected
set length 40000000

Array length 40000896 (152MiB)
Data:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort0:
1804289383 0
846930886 1
1681692777 2
1714636915 3
1957747793 4
424238335 5
719885386 6
1649760492 7
596516649 8
1189641421 9
Sort1:
37 4880726
95 9505224
315 30544587
448 12936177
452 28788976
490 27337079
614 3012490
657 32356371
657 34183315
677 28274133
Sort ok

On Tue, Nov 15, 2016 at 4:05 AM, daktfi [email protected] wrote:

When I try to sort array of 40m (roughly) pairs or longer it simply does
not sort them without reporting any errors.
Device is: Device 0: GeForce GTX 950 (PTX version 520, SM520, 6 SMs, 904
free / 1995 total MB physmem, 105.760 GB/s @ 3305000 kHz mem clock, ECC off)
cub version 1.5.5 (latest at the moment).

Sample project to reproduce the problem is attached
check_dev_radix.zip
https://github.com/NVlabs/cub/files/591469/check_dev_radix.zip

When run with increasingly larger size of array to sort it eventually
fails to sort it.


View it on GitHub: https://github.com/NVlabs/cub/issues/64


daktfi commented on May 18, 2024

I found the problem: it is necessary to check cudaPeekAtLastError()/cudaGetLastError() after the sort. It seems that sorting requires an additional amount of video memory besides the allocated buffers and temp_storage (roughly as much again as twice the size of the keys). Note the line with the device specs in the original post: there were only 904 MB of free memory.
I don't think it's a major bug, but it is still quite inconvenient. I think those extra allocations must be covered under temp_storage.

The setup is:
Kubuntu 16.04 fully updated, gcc 5.4, CUDA 8.0;
Core i7 (don't think that matters, though), 32 GB RAM, GTX 950 with 2 GB of memory;


dumerrill commented on May 18, 2024

The implementation does peek at errors after each kernel launch, e.g.,
https://github.com/NVlabs/cub/blob/1.5.5/cub/device/dispatch/dispatch_radix_sort.cuh#L900.

However, as you mention, this doesn't capture all runtime errors: others
only show up when the stream is synchronized with the host (e.g., during a
malloc or memcpy). If you want improved CUB debugging, you can set the
last, optional debug_synchronous parameter to true; the implementation will
then synchronize the stream after each kernel invocation to catch CUDART
errors that won't otherwise be reported. (Of course, this incurs the added
runtime overhead of synchronizing the device with the host.)
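
As a rough sketch of what that looks like at the call site, assuming the CUB 1.5.5 SortPairs signature in which debug_synchronous is the final argument after begin_bit, end_bit, and stream (newer CUB releases may no longer accept this parameter). The buffer names, item count, and the `ok` helper are illustrative only.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Hypothetical helper: report a failed step and tell the caller to stop.
static bool ok(cudaError_t status, const char* what) {
    if (status == cudaSuccess) return true;
    std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(status));
    return false;
}

int main() {
    const int n = 1 << 16;  // arbitrary example size
    const size_t bytes = n * sizeof(unsigned int);
    const int end_bit = static_cast<int>(sizeof(unsigned int) * 8);
    const bool debug_synchronous = true;  // sync after every kernel launch

    unsigned int *d_keys_in = nullptr, *d_keys_out = nullptr;
    unsigned int *d_vals_in = nullptr, *d_vals_out = nullptr;
    if (!ok(cudaMalloc(&d_keys_in, bytes), "alloc keys_in") ||
        !ok(cudaMalloc(&d_keys_out, bytes), "alloc keys_out") ||
        !ok(cudaMalloc(&d_vals_in, bytes), "alloc vals_in") ||
        !ok(cudaMalloc(&d_vals_out, bytes), "alloc vals_out")) return 1;
    cudaMemset(d_keys_in, 0, bytes);  // placeholder input data
    cudaMemset(d_vals_in, 0, bytes);

    cudaStream_t stream;
    if (!ok(cudaStreamCreate(&stream), "stream create")) return 1;

    // Size query: null temp storage pointer, same arguments otherwise.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    if (!ok(cub::DeviceRadixSort::SortPairs(
                d_temp, temp_bytes, d_keys_in, d_keys_out, d_vals_in,
                d_vals_out, n, 0, end_bit, stream, debug_synchronous),
            "size query")) return 1;
    if (!ok(cudaMalloc(&d_temp, temp_bytes), "alloc temp")) return 1;

    // Actual sort; with debug_synchronous set, errors from each kernel
    // should surface in the returned status.
    bool sorted = ok(cub::DeviceRadixSort::SortPairs(
                         d_temp, temp_bytes, d_keys_in, d_keys_out, d_vals_in,
                         d_vals_out, n, 0, end_bit, stream, debug_synchronous),
                     "sort");

    // Without the debug flag, an explicit synchronize plays the same role.
    sorted = ok(cudaStreamSynchronize(stream), "stream sync") && sorted;

    cudaStreamDestroy(stream);
    cudaFree(d_temp);
    cudaFree(d_keys_in);  cudaFree(d_keys_out);
    cudaFree(d_vals_in);  cudaFree(d_vals_out);
    return sorted ? 0 : 1;
}
```

In a production build the flag would normally stay false, with one synchronize and status check placed wherever the application already waits for the results.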


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
https://github.com/NVlabs/cub/issues/64#issuecomment-260709450, or mute
the thread
https://github.com/notifications/unsubscribe-auth/ABaFwJ9p7hzkw7LHPZkDOZcH1zfsHIwWks5q-eyTgaJpZM4KySCh
.


daktfi commented on May 18, 2024

Thanks for the advice on debugging, this will be quite useful.
However, my point here and now is not about error reporting (that's my fault here: I'm quite new to CUDA and missed a few points in the manual), but about memory consumption. If I have already allocated TWO buffers for keys, TWO buffers for values, and even some extra temporary storage, why does the sorting fail with an error like "Can't allocate device memory"?
I don't care how much memory it needs to sort the data, I just want to be able to allocate that amount. And to do that I have to know how big it is. Not a big deal, I repeat, but still a little annoying...
To be specific, I'm sorting rather large arrays (over 1B key-value pairs, hopefully both 64-bit). To do that, I split them into device-sortable blocks (and then merge them later, but this is completely irrelevant)... To calculate the proper length for such smaller blocks I need to know the exact memory consumption, and - oops! - I can't. :-)
After some runs of the attached test I figured out the proper ratio to be about three (to sort 1M 32-bit key + 32-bit value pairs I need about 24 MB of memory, while the total combined size of the allocated buffers and temporary storage is noticeably smaller; I don't have a PC with a working CUDA setup right now to check exactly). I hope this number will help someone. :-)
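
For anyone who wants to apply that rule of thumb, here is a small sketch of the block-size arithmetic. The factor of three is the empirical figure from the paragraph above, measured there for 32-bit pairs and only assumed here to carry over to 64-bit pairs. The 10% headroom, the function name max_pairs_per_chunk, and the 1B total are illustrative assumptions.

```cuda
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

// Largest chunk (in key-value pairs) that fits into budget_bytes, assuming
// the sort needs about `ratio` times the raw size of the pairs.
std::uint64_t max_pairs_per_chunk(std::uint64_t budget_bytes,
                                  std::uint64_t bytes_per_pair,
                                  double ratio) {
    return static_cast<std::uint64_t>(budget_bytes / (ratio * bytes_per_pair));
}

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    if (cudaMemGetInfo(&free_bytes, &total_bytes) != cudaSuccess) {
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }

    const std::uint64_t total_pairs = 1000000000ULL;  // "over 1B" pairs
    const std::uint64_t bytes_per_pair = 8 + 8;       // 64-bit key + 64-bit value
    const double ratio = 3.0;     // empirical estimate from the thread
    const double headroom = 0.9;  // assumption: leave 10% of free memory unused

    const std::uint64_t budget =
        static_cast<std::uint64_t>(free_bytes * headroom);
    const std::uint64_t chunk =
        max_pairs_per_chunk(budget, bytes_per_pair, ratio);
    if (chunk == 0) {
        std::fprintf(stderr, "not enough free device memory\n");
        return 1;
    }

    // Round up so the final, possibly shorter, chunk is counted.
    const std::uint64_t chunks = (total_pairs + chunk - 1) / chunk;
    std::printf("free: %zu MB, chunk: %llu pairs, chunks: %llu\n",
                free_bytes >> 20,
                static_cast<unsigned long long>(chunk),
                static_cast<unsigned long long>(chunks));
    return 0;
}
```

The ratio is a measurement from one setup, so it is worth re-checking against the temp storage size CUB reports for the actual key and value types.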

