CUDA Stream Compaction

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2

Name: Bowen Yang
- LinkedIn
- GitHub
- Facebook
- Steam
Tested on: Windows 10 x64, i7-6800K @ 3.40GHz 32GB, GTX 1080 8GB (Personal computer at home)

Description

Stream compaction from scratch. In our case, we're to filter out all the elements that equals to 0. Analogly it's the dead paths in the list of all the rays.

Part 1: CPU implementation Part 2: Brute force naive reduction on GPU Part 3: Efficient implementation that's actually not so efficient Part 4: Thrust

Issues

Just as always, I modified the sm_20 option to sm_60 to make it compile on nvcc 9.2.

Performance Test

Scan

When element number is exactly 2-power

When element number is not exactly 2-powered

Compaction

When element number is exactly 2-powered

When element number is not exactly 2-powered

Thrust implementation

When element number is exactly 2-powered

When element number is not exactly 2-powered

**The results

****************
** SCAN TESTS **
****************
    [  34  23  45  34  24  43  35  44  26  22  13  28  37 ...  47   0 ]
==== cpu scan, power-of-two ====
   elapsed time: 0.001204ms    (std::chrono Measured)
    [   0  34  57 102 136 160 203 238 282 308 330 343 371 ... 6116 6163 ]
==== cpu scan, non-power-of-two ====
   elapsed time: 0.001205ms    (std::chrono Measured)
    [   0  34  57 102 136 160 203 238 282 308 330 343 371 ... 6064 6085 ]
    passed
==== naive scan, power-of-two ====
   elapsed time: 0.033792ms    (CUDA Measured)
    passed
==== naive scan, non-power-of-two ====
   elapsed time: 0.032896ms    (CUDA Measured)
    passed
==== work-efficient scan, power-of-two ====
   elapsed time: 0.123904ms    (CUDA Measured)
    passed
==== work-efficient scan, non-power-of-two ====
   elapsed time: 0.126976ms    (CUDA Measured)
    passed
==== thrust scan, power-of-two ====
   elapsed time: 13.1891ms    (CUDA Measured)
    passed
==== thrust scan, non-power-of-two ====
   elapsed time: 0.96256ms    (CUDA Measured)
    passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
    [   0   3   3   0   1   2   0   2   2   2   1   2   2 ...   3   0 ]
==== cpu compact without scan, power-of-two ====
   elapsed time: 0.001506ms    (std::chrono Measured)
    [   3   3   1   2   2   2   2   1   2   2   3   2   1 ...   3   3 ]
    passed
==== cpu compact without scan, non-power-of-two ====
   elapsed time: 0.002108ms    (std::chrono Measured)
    [   3   3   1   2   2   2   2   1   2   2   3   2   1 ...   2   3 ]
    passed
==== cpu compact with scan ====
   elapsed time: 0.003313ms    (std::chrono Measured)
    [   3   3   1   2   2   2   2   1   2   2   3   2   1 ...   3   3 ]
    passed
==== work-efficient compact, power-of-two ====
   elapsed time: 0.23984ms    (CUDA Measured)
    passed
==== work-efficient compact, non-power-of-two ====
   elapsed time: 0.238592ms    (CUDA Measured)
    passed

The explanation for the efficiency of the efficient implementation

From the charts we find the surprising fact that, efficient implementation is actually not that efficient. 2 main reasons for this to happen:

I/O intensive In our efficient implementation, we have to write the initial element to the GPU buffer or read the last element of the buffer back. This causes lots of system interrupts and therefore is harmful to performance.

The reduce algorithm itself With the layer going even deeper, stride becomes larger and larger, which, is a behavior that all caches hate. The spatial locality is horrible when the layer goes deep.

Something wrong with the thread scheduling With the layer going deeper, more and more threads become idle and does nothing useful, and even worse, with the stride growing larger, branch divergence inside warps is getting unacceptable.

uchanka / project2-stream-compaction Goto Github PK

project2-stream-compaction's Introduction

CUDA Stream Compaction

project2-stream-compaction's People

Contributors

Watchers

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent