
CUDA Stream Compaction

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2

  • Name: Bowen Yang
  • Tested on: Windows 10 x64, i7-6800K @ 3.40GHz 32GB, GTX 1080 8GB (Personal computer at home)

Description

Stream compaction implemented from scratch. In this project we filter out all elements equal to 0; analogously, in a path tracer this corresponds to removing the dead paths from the list of rays. A rough sketch of the overall map → scan → scatter pipeline is given after the part list below.

  • Part 1: CPU implementation
  • Part 2: Brute-force naive reduction on the GPU
  • Part 3: Efficient implementation that's actually not so efficient
  • Part 4: Thrust
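
For orientation, here is a minimal sketch of the map and scatter kernels that bracket the scan; the names and structure are illustrative, not the project's actual identifiers. The exclusive scan in between (step 2) is exactly what Parts 1–4 implement in different ways.

```cuda
#include <cuda_runtime.h>

// Step 1 (map): mark each element with 1 if it survives (non-zero), 0 otherwise.
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        bools[i] = (idata[i] != 0) ? 1 : 0;
    }
}

// Step 3 (scatter): each surviving element is written to the slot given by the
// exclusive scan of the boolean array.
__global__ void kernScatter(int n, int *odata, const int *idata,
                            const int *bools, const int *indices) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && bools[i]) {
        odata[indices[i]] = idata[i];
    }
}
```

The number of surviving elements is `indices[n - 1] + bools[n - 1]`, which is also why the last elements of those buffers have to be copied back to the host (see the discussion of the efficient implementation below).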

Issues

Just as always, I changed the sm_20 option to sm_60 so that the project compiles under nvcc 9.2 (sm_20 is no longer supported there).

Performance Test

Scan

When the element count is a power of two

When the element count is not a power of two

Compaction

When the element count is a power of two

When the element count is not a power of two

Thrust implementation

When the element count is a power of two

When the element count is not a power of two

The results

****************
** SCAN TESTS **
****************
    [  34  23  45  34  24  43  35  44  26  22  13  28  37 ...  47   0 ]
==== cpu scan, power-of-two ====
   elapsed time: 0.001204ms    (std::chrono Measured)
    [   0  34  57 102 136 160 203 238 282 308 330 343 371 ... 6116 6163 ]
==== cpu scan, non-power-of-two ====
   elapsed time: 0.001205ms    (std::chrono Measured)
    [   0  34  57 102 136 160 203 238 282 308 330 343 371 ... 6064 6085 ]
    passed
==== naive scan, power-of-two ====
   elapsed time: 0.033792ms    (CUDA Measured)
    passed
==== naive scan, non-power-of-two ====
   elapsed time: 0.032896ms    (CUDA Measured)
    passed
==== work-efficient scan, power-of-two ====
   elapsed time: 0.123904ms    (CUDA Measured)
    passed
==== work-efficient scan, non-power-of-two ====
   elapsed time: 0.126976ms    (CUDA Measured)
    passed
==== thrust scan, power-of-two ====
   elapsed time: 13.1891ms    (CUDA Measured)
    passed
==== thrust scan, non-power-of-two ====
   elapsed time: 0.96256ms    (CUDA Measured)
    passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
    [   0   3   3   0   1   2   0   2   2   2   1   2   2 ...   3   0 ]
==== cpu compact without scan, power-of-two ====
   elapsed time: 0.001506ms    (std::chrono Measured)
    [   3   3   1   2   2   2   2   1   2   2   3   2   1 ...   3   3 ]
    passed
==== cpu compact without scan, non-power-of-two ====
   elapsed time: 0.002108ms    (std::chrono Measured)
    [   3   3   1   2   2   2   2   1   2   2   3   2   1 ...   2   3 ]
    passed
==== cpu compact with scan ====
   elapsed time: 0.003313ms    (std::chrono Measured)
    [   3   3   1   2   2   2   2   1   2   2   3   2   1 ...   3   3 ]
    passed
==== work-efficient compact, power-of-two ====
   elapsed time: 0.23984ms    (CUDA Measured)
    passed
==== work-efficient compact, non-power-of-two ====
   elapsed time: 0.238592ms    (CUDA Measured)
    passed

Why the efficient implementation is not so efficient

From the charts we find the surprising fact that the work-efficient implementation is actually not that efficient. There are three main reasons for this (a kernel sketch illustrating them follows the list):

  • I/O intensive: in the work-efficient implementation we have to write the identity element into the GPU buffer and read the last elements of the buffers back to the host. These small host-device transfers force synchronization and are harmful to performance.

  • The reduce algorithm itself: as the sweep goes to deeper levels, the stride keeps doubling, which is exactly the kind of access pattern caches hate. Spatial locality becomes terrible at the deep levels.

  • Thread scheduling: as the levels get deeper, more and more threads become idle and do nothing useful; even worse, with the stride growing larger, branch divergence inside warps becomes unacceptable.
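
To make these points concrete, here is a minimal sketch of one up-sweep (reduce) level and the host loop that drives it, written in the naive one-thread-per-element style; the names and structure are assumptions for illustration, not the project's actual kernels.

```cuda
#include <cuda_runtime.h>

// One up-sweep level of the work-efficient scan. Assumes n is a power of two
// (the input is padded when it is not).
__global__ void kernUpSweep(int n, int stride, int *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Only one thread in every 2*stride does useful work; the rest of its warp
    // idles through the branch, and the divergence gets worse as stride grows.
    if ((i % (2 * stride)) == (2 * stride - 1)) {
        // The partner element is 'stride' positions away, so the access
        // pattern loses spatial locality at the deeper levels.
        data[i] += data[i - stride];
    }
}

// Host-side driver (sketch). Note the tiny host<->device copy after the sweep:
// writing the identity into the root before the down-sweep is one of the
// "I/O" costs mentioned above.
void upSweep(int n, int *dev_data, int blockSize) {
    for (int stride = 1; stride < n; stride *= 2) {
        int blocks = (n + blockSize - 1) / blockSize;
        kernUpSweep<<<blocks, blockSize>>>(n, stride, dev_data);
    }
    int zero = 0;  // identity element for the exclusive scan
    cudaMemcpy(dev_data + n - 1, &zero, sizeof(int), cudaMemcpyHostToDevice);
}
```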
