coezmaden / gpuacceleratedtracking
Repository containing source code for the "Tracking.jl: Accelerating multi-antenna GNSS receivers with CUDA" paper.
License: MIT License
Kernels 2 and upwards remain untested. A robust way to test the correctness of these new algorithms is needed for the benchmarks to have any validity. Currently the CPU benchmarks accept the algorithm parameter, which means the same code is benchmarked multiple times. The KernelAlgorithm distinction should only influence the GPU benchmarks.
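One lightweight way to get such a correctness baseline (a sketch only; `cpu_algorithm` and `gpu_algorithm` are hypothetical stand-ins for whichever kernel variants are under test, not names from this repository) is to compare each GPU kernel's output against the CPU reference on random inputs:

```julia
using CUDA, Test

# Sketch: validate a GPU kernel variant against the CPU reference
# implementation. `cpu_algorithm` and `gpu_algorithm` are placeholders.
function validate(cpu_algorithm, gpu_algorithm; num_samples = 2_500_000)
    signal = randn(ComplexF32, num_samples)
    expected = cpu_algorithm(signal)
    actual = Array(gpu_algorithm(CuArray(signal)))  # copy result back to host
    # Float32 accumulation order differs between CPU and GPU, so compare
    # with a relaxed tolerance rather than exact equality.
    @test actual ≈ expected rtol = 1e-5
end
```

Running something like this over every KernelAlgorithm variant before benchmarking would give the timing numbers a correctness baseline.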
Hi @coezmaden! I was asked to review your JuliaCon proceedings submission in JuliaCon/proceedings-review#128, so here goes!
In general, your paper was nicely written and a pleasant read, so kudos for that! I have a few comments on the paper itself, but also on the code and the repository. I hope this is helpful!
(Also a caveat: I'm going to focus on the things I'm familiar with, which is Julia and GPUs. I hope somebody else can take a look at the SDR-related pieces.)
The README should mention a Julia version to use, as 1.8 seemed incompatible (I took 1.7 from the Manifest, which works)
It would also be good if the README contained some information on how to reproduce the measurements. I gather I first need to run the scripts/run_benchmark scripts before scripts/plot_benchmarks? But that doesn't actually run on the GPU, so I have to customize the params? Doing so, I still don't get the actual plots from the paper, so it would be good to add some instructions on that.
Are there plans to upstream this work into Tracking.jl? I'm missing a bit how this work is to be used with the rest of the ecosystem, as the GPUAcceleratedTracking repository is focused on the benchmarks related to the paper. The template in JuliaCon/proceedings-review#128 specifically asks about example usage, functionality documentation, tests, etc., and that does seem to be missing a bit.
Generally I was surprised by the desktop GPU being outperformed by the CPU, and the explanation is fairly short. I see in the repository that you did actually profile the code using NVTX/nsys/ncu, so what were the conclusions from that work? What makes the GPU implementation slow (memory copy, kernels, etc)? Launch overhead shouldn't be it when we're doing >1ms of processing.
The charts in the paper also show some suspicious processing time drops at the highest sampling frequency, as well as some generally very noisy measurements for certain GPU implementations. What's going on there?
It also wasn't clear to me if your measurements are fully end-to-end? I notice you're using BenchmarkTools, so you're not 'cheating' by only measuring kernel execution time, but do your measurements include the time to upload memory to the GPU and download results back (if that matters)?
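For completeness, an end-to-end measurement would wrap the transfers and the kernel in one synchronized region, along these lines (a sketch; `process` is a hypothetical stand-in for the tracking step being measured, not a function from this repository):

```julia
using CUDA, BenchmarkTools

signal = randn(ComplexF32, 2_500_000)

# Time upload + kernel(s) + download as one unit; CUDA.@sync ensures the
# asynchronous GPU work has finished before the timer stops.
@benchmark CUDA.@sync begin
    d_signal = CuArray($signal)   # host -> device upload
    d_result = process(d_signal)  # hypothetical GPU processing step
    Array(d_result)               # device -> host download
end
```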
Finally, the template asks for a comparison to approaches with similar goals. Do you know of any?
In general, the paper is nicely written. There are a couple of typos (s/tieing/tying/, s/hadrware/hardware/), so I'd recommend going through it with a LaTeX-aware spell checker.
The template does ask for a more explicit 'statement of need' though, so you should extract some of the introduction into such a section. As part of that, I would elaborate some more on the need for GPU power in the context of SDR/GNSS processing, as that's only mentioned in passing.
I would also suggest using \texttt for code-like names instead of quoting them as you currently do (e.g. an evaluation of the "cplx_multi" reduction).
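For instance (note that underscores need escaping inside \texttt):

```latex
... an evaluation of the \texttt{cplx\_multi} reduction ...
```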
Finally, there's some missing DOIs that the bot spotted in the other thread.
Now for some more detailed comments:
I'd mention the CUDA.jl (3.8) and Julia (1.7) version.
You mention 'this ensures a coalesced memory access ... discussed in more detail later', but I don't actually find more details there.
Wrt. device limits, I would mention that each dimension has its own limit, and then elaborate that blocks are special because there is an additional limit on the product x*y*z.
The grid limit is currently shown in the overview table, and isn't terribly interesting, especially because it's fairly stable: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications-technical-specifications-per-compute-capability
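For reference, these limits can be queried at runtime via CUDA.jl's device attribute API (a sketch; the exact values depend on the device):

```julia
using CUDA

dev = CUDA.device()

# Per-dimension block limits (commonly 1024, 1024, and 64):
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_X)
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Y)
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_BLOCK_DIM_Z)

# The additional combined limit on threads per block (x*y*z, commonly 1024):
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK)
```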
s/first-grade support/first-class support/
About the parallel reduction, it's unclear to me whether you're describing CUDA.jl's existing mapreducedim here, or whether this is a new implementation or extension. I think it would be good to elaborate on that.
You also mention cooperative groups, but I don't see those used in the code? This matters because, despite what your explanation suggests, they don't fully obviate a multi-pass reduction: all thread blocks need to be resident on the device at once when doing a grid-wide sync.
Generally, see the comments on performance above. It would be good to put some of those clarifications in the paper.