
gpuacceleratedtracking's Issues

Untested kernel algorithms

Kernels 2 and upwards remain untested. A robust way to test the correctness of these new algorithms is needed for the benchmarks to have any validity.

CPU benchmarks allow for `algorithm` parameter

Currently the CPU benchmarks also accept the `algorithm` parameter, which means the same CPU code is benchmarked once per algorithm even though it doesn't change.

The KernelAlgorithm distinction should only influence GPU benchmarks.
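One way to address this (a hypothetical sketch; the function and symbol names are mine, not the repository's) is to expand the `algorithm` axis only for GPU runs, so the CPU implementation is benchmarked exactly once:

```julia
# Hypothetical sketch: build the benchmark case list so that the
# `algorithm` parameter only varies for GPU benchmarks.
function benchmark_cases(processors, algorithms)
    cases = Tuple{Symbol,Union{Int,Nothing}}[]
    for proc in processors
        if proc == :GPU
            for alg in algorithms
                push!(cases, (proc, alg))   # one case per kernel algorithm
            end
        else
            push!(cases, (proc, nothing))   # CPU: algorithm is irrelevant
        end
    end
    return cases
end

benchmark_cases([:CPU, :GPU], [1, 2, 3])
# 4 cases: (:CPU, nothing) plus (:GPU, 1), (:GPU, 2), (:GPU, 3)
```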

Review

Hi @coezmaden! I was asked to review your JuliaCon proceedings submission in JuliaCon/proceedings-review#128, so here goes 🙂

In general, your paper was nicely written and a pleasant read, so kudos for that! I have a few comments on the paper itself, but also on the code and the repository. I hope this is helpful!

(Also a caveat: I'm going to focus on the things I'm familiar with, which is Julia and GPUs. I hope somebody else can take a look at the SDR-related pieces.)

Reproducibility

The README should mention a Julia version to use, as 1.8 seemed incompatible (I took 1.7 from the Manifest, which works)

It would also be good if the README contained some information on how to reproduce the measurements. I gather I first need to run the scripts/run_benchmark scripts before scripts/plot_benchmarks? But that doesn't actually run on the GPU, so I have to customize the params? Doing so I still don't get the actual plots from the paper, so it'd be good to add some instructions on that.

Are there plans to upstream this work into Tracking.jl? I'm missing a bit how this work is to be used with the rest of the ecosystem, as the GPUAcceleratedTracking repository is focussed on the benchmarks related to the paper. The template in JuliaCon/proceedings-review#128 specifically asks about example usage / functionality documentation / tests etc, and that does seem to be missing a bit.

Performance

Generally I was surprised by the desktop GPU being outperformed by the CPU, and the explanation is fairly short. I see in the repository that you did actually profile the code using NVTX/nsys/ncu, so what were the conclusions from that work? What makes the GPU implementation slow (memory copy, kernels, etc)? Launch overhead shouldn't be it when we're doing >1ms of processing.

The charts in the paper also show some suspicious processing time drops at the highest sampling frequency, as well as some generally very noisy measurements for certain GPU implementations. What's going on there?

It also wasn't clear to me whether your measurements are fully end-to-end. I notice you're using BenchmarkTools, so you're not 'cheating' by only measuring kernel execution time, but do your measurements include the time to upload memory to the GPU and download results back (if that matters)?
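For reference, a minimal sketch of what such an end-to-end measurement could look like with BenchmarkTools and CUDA.jl (this is not the paper's actual harness; the `abs2.` broadcast is just a stand-in for the tracking workload, and a CUDA-capable device is required):

```julia
using BenchmarkTools, CUDA

signal = rand(ComplexF32, 2_500_000)   # host-side input

# CUDA.@sync blocks until the GPU has finished, so the timing covers
# the upload, the kernel(s), and the download -- not just the
# asynchronous kernel launches.
@benchmark CUDA.@sync begin
    d = CuArray($signal)   # host -> device upload
    r = abs2.(d)           # placeholder for the tracking kernels
    Array(r)               # device -> host download
end
```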

Finally, the template asks for a comparison to approaches with similar goals. Do you know of any?

Paper

In general, the paper is nicely written 👍 There are a couple of typos (s/tieing/tying/, s/hadrware/hardware/), so I'd recommend going through it with a LaTeX-aware spell checker.

The template does ask for a more explicit 'statement of need' though, so you should extract some of the introduction into such a section. As part of that, I would elaborate some more on the need for GPU power in the context of SDR/GNSS processing, as that's only mentioned in passing.

I would also suggest using \texttt for code-like names instead of quoting them as you currently do (e.g. an evaluation of the "cplx_multi" reduction).
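Concretely, the change would be along these lines (note that the underscore needs escaping in text mode):

```latex
% current: quoting the identifier
an evaluation of the ``cplx_multi'' reduction
% suggested: monospace via \texttt
an evaluation of the \texttt{cplx\_multi} reduction
```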

Finally, there's some missing DOIs that the bot spotted in the other thread.

Now for some more detailed comments:

  1. Methodology

I'd mention the CUDA.jl (3.8) and Julia (1.7) versions.

You mention 'this ensures a coalesced memory access ... discussed in more detail later', but I don't actually find those details later in the paper.

  2. Algorithm

Wrt. device limits, I would mention that each dimension has its own limit, and then elaborate that thread blocks are special because there's an additional limit on x*y*z.

The grid limit is currently shown in the overview table, and isn't terribly interesting, especially because it's fairly stable: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability
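As an aside, these limits can be queried at runtime via CUDA.jl's device-attribute API (a CUDA-capable device is required; typical values shown in the comments):

```julia
using CUDA

dev = device()
# Each block dimension has its own limit...
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X)        # typically 1024
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Y)        # typically 1024
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Z)        # typically 64
# ...but the product x*y*z is additionally capped:
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)  # typically 1024
```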

s/first-grade support/first-class support/

About the parallel reduction, it's unclear to me if you're describing CUDA's existing mapreducedim here or whether this is a new implementation or extension. I think it would be good to elaborate on that.

You also mention cooperative groups, but I don't see those used in the code? Note that, despite what your explanation suggests, they don't fully obviate a multi-pass reduction: you need to be able to fit all thread blocks on the device at once when doing a grid-wide sync.

  3. Experiment Evaluation

Generally, see the comments on performance above. It would be good to put some of those clarifications in the paper.
