cise_bempp's Introduction

Welcome 👋

I am a computational scientist at the interface of mathematics, engineering, the physical sciences, and computer science. Basically, I love everything that combines great maths, lots of computing, and cool applications. My particular focus area is boundary integral equations, where for a long time I have worked on the Bempp software package. Below you will find links to papers and other official information.

cise_bempp's Issues

Reviewer 1

  • I suggest including references to previous work on BEM with GPUs, e.g., the paper "GCA-H² matrix compression for electrostatic simulations" by S. Börm and S. Christophersen, or the papers "Algorithmic patterns for H-matrices on many-core processors" and "A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters" by P. Zaspel.

  • In the left column on page 3, I would expect "sets" instead of "set" in line 26.

  • In the left column on page 3, the integral in line 50 should probably be defined on \tau_i and \tau_j instead of \Gamma.
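
For clarity, the correction the reviewer appears to have in mind would read, schematically (a sketch assuming the paper's Galerkin notation: a kernel g, a trial function φ_j associated with triangle τ_j, and a test function ψ_i associated with triangle τ_i):

```latex
A_{ij} = \int_{\tau_i} \int_{\tau_j} g(\mathbf{x}, \mathbf{y})\,
         \phi_j(\mathbf{y})\, \psi_i(\mathbf{x})
         \,\mathrm{d}\mathbf{y}\,\mathrm{d}\mathbf{x}
```

i.e., integration over the triangle pair rather than over the whole boundary Γ.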

  • In the right column on page 3, may I suggest "the grid, the spaces, and the operator(s)" in line 10?

  • In the right column on page 3, lines 17 to 20 appear to work only if the support of the basis functions consists only of \tau_i and \tau_j, since only quadrature points from these triangles are used. For more general basis functions, A_{ij} would only be an entry of the element stiffness matrix.

  • In the right column on page 3, I suggest including references for the "singularity-removing coordinate transformations" mentioned in line 27.

  • In the left column on page 4, I suggest "dive deep" instead of "deep dive" in line 30.

  • In the right column on page 4, I suggest "perform as efficiently as possible" instead of "perform as highly as possible" in line 27.

  • In the left column on page 5, it should be "relationships" instead of "relataionships" in line 11.

  • In the left column on page 5, I suggest clarifying that OpenCL kernels cannot be written in C99 or C++, since important language features such as recursion and the standard libraries are missing, while features such as vector types have been added. It would probably be better to speak of OpenCL C.

  • In the right column on page 5, may I suggest "from the host" instead of "from host" in line 21?

  • In the left column on page 6, I did not understand what "test and trial connectivity information that stores the indices of each node for each triangle" means in lines 25-27. Do the nodes correspond to the basis functions? Is each triangle assigned a global index for each local index? I suggest explaining this aspect in greater detail.
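
A minimal sketch of one plausible reading of "connectivity information" (names and data layout are illustrative only, not Bempp's actual data structures): each triangle stores the global index of each of its three local nodes, and local element contributions are scattered into the global matrix through that map.

```python
# Hypothetical connectivity: for each triangle, the global index of each
# of its three local nodes. Two triangles share the edge (1, 2).
triangles = [
    (0, 1, 2),  # triangle 0: local nodes 0,1,2 -> global nodes 0,1,2
    (1, 3, 2),  # triangle 1: local nodes 0,1,2 -> global nodes 1,3,2
]

n_nodes = 4
A = [[0.0] * n_nodes for _ in range(n_nodes)]

for tri in triangles:
    # Dummy 3x3 local element matrix; in a real BEM/FEM code this would
    # come from quadrature over the element (pair).
    local = [[1.0] * 3 for _ in range(3)]
    for a, i in enumerate(tri):       # local test index -> global row
        for b, j in enumerate(tri):   # local trial index -> global column
            A[i][j] += local[a][b]

print(A[2][1])  # nodes 2 and 1 appear together in both triangles -> 2.0
```

Under this reading, "nodes" do correspond to (nodal) basis functions, and the connectivity array is exactly the local-to-global index map.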

  • In the left column on page 6, what are "basis function multipliers" in lines 30-31?

  • In the left column on page 6, what happens if the global assembled matrix does not fit into, e.g., graphics memory when using a GPU in line 33? Is it possible to split the global matrix into submatrices that fit into graphics memory, and then transfer them to main memory? Given the high computational complexity, would it be possible to overlap memory transfers and computation?
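
The splitting the reviewer asks about can be sketched as follows (a hypothetical host-side driver, not Bempp's API: `assemble_block` stands in for a device kernel producing one row block small enough to fit in graphics memory):

```python
n = 10          # global matrix is n x n
block_rows = 4  # rows per block, chosen so one block fits in device memory

host_matrix = [[0.0] * n for _ in range(n)]

def assemble_block(row_start, row_end):
    # Stand-in for a GPU kernel launch producing rows [row_start, row_end).
    return [[float(i + j) for j in range(n)] for i in range(row_start, row_end)]

for start in range(0, n, block_rows):
    end = min(start + block_rows, n)
    block = assemble_block(start, end)   # compute one block on the "device"
    host_matrix[start:end] = block       # transfer the block back to the host

print(host_matrix[9][9])  # 18.0
```

Overlapping the copy of one block with the computation of the next (e.g., via multiple queues or streams) would be the natural follow-up optimization the reviewer mentions.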

  • In the left column on page 8, may I suggest "coloring techniques" instead of "color coding techniques" in line 6?
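
The coloring techniques the reviewer refers to can be illustrated with a greedy sketch (mesh data is illustrative only): two triangles that share a node write to the same global degrees of freedom, so they must receive different colors; all elements of one color can then be assembled in parallel without atomics.

```python
triangles = [
    (0, 1, 2),
    (1, 3, 2),
    (3, 4, 2),
    (4, 5, 6),  # shares no node with triangle 0
]

colors = []
for i, tri in enumerate(triangles):
    # Colors already taken by earlier triangles that share a node with tri.
    used = {colors[j] for j in range(i) if set(triangles[j]) & set(tri)}
    c = 0
    while c in used:  # smallest color not used by any conflicting neighbor
        c += 1
    colors.append(c)

print(colors)  # [0, 1, 2, 0]
```

Triangles 0 and 3 end up with the same color because they touch disjoint node sets, so their global updates cannot conflict.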

  • In the left column on page 8, maybe an example of how "classic loop parallelism" works in line 33 would be useful. I was wondering how assembling the combination of one test element with all trial elements could still lead to linear complexity.

  • In the right column on page 8, why is only AVX used in line 31? As far as I can see, the processor used for the benchmarks supports AVX512, as well.

  • In the left column on page 11, I was surprised at the words "stack based functions" in lines 29-30. As far as I know, OpenCL assigns registers directly to variables and does not use a stack. The point still stands, of course, since registers are even more efficient than stack-based local variables, as long as there are enough of them available.

Reviewer 2

  • "We could tune our code for better auto-vectorization in Numba, but this would give little benefit as we have highly optimized hand-tuned OpenCL kernels already." To me this statement casts a shadow on the interpretation of the benchmarking results, since it may not be "fair" to compare highly optimized OpenCL kernels to less well-optimized Numba kernels. I do understand that Numba is meant to be a backup implementation in Bempp, but then the authors should be careful to put these results in context. I ask that they provide some additional discussion on the "fairness" of this comparison and how the results should be interpreted.

  • Since Numba provides support for both CUDA and ROCm, the authors should explicitly state that they are using Numba CUDA in the abstract and introduction. It's clear from the context, since they are using an NVIDIA GPU, but they need to say this to avoid confusion.

  • Having briefly experimented with PyOpenCL/OpenCL myself, I was surprised at the authors' characterization that using it was easy. I struggled to get it configured properly with my NVIDIA vendor driver, for example. Maybe the authors did not have this experience but since OpenCL has a general reputation for being harder to use than CUDA, it would be useful for the authors to acknowledge this and state whether their experience was similar or different.

  • This is a more minor comment but since OpenCL was chosen for portability, I would have been very interested to see benchmarking results on AMD GPUs too. If this is future work it is worth mentioning.

  • Finally, the authors should cite Numba.

Review Comments - Associate Editor

  • The technique described by the authors seems to entail a substantial duplication of code that might be avoidable using certain techniques. What trade-offs led the authors to their approach?

  • Modern FEM codes tend to employ atomic operations instead of graph coloring for improved performance. (Atomic FP operations, also in 64 bit, can be realized via the classical compare-and-swap technique.)

  • The authors state that SIMD vectorization is not used for the evaluation of singular integrals, on account of the far field representing the lion's share of the work. Under FMM acceleration, near-field and far-field should take approximately equal time. What were the obstacles to applying the optimization in this setting?

  • The authors describe repeated host-device transfers as a particular performance bottleneck. These are likely avoidable if the data is kept on the device throughout. Why was this route not chosen? This question is particularly salient because it directly influences the authors' disqualification of GPUs for direct evaluation tasks.

  • The comment on lower FP64 throughput on Nvidia hardware should be removed or rephrased as it only applies to gaming-targeted consumer hardware.

  • Numba is described as a fallback technology, but no particular scenario is given in which the fallback would be required. The installation instructions for PyOpenCL (https://documen.tician.de/pyopencl/misc.html) appear to contain instructions to deploy a full OpenCL stack on nearly any machine.

  • When the authors characterize OpenCL as a "second language", to what extent is this second language necessitated by the GPU as a "second environment"?

  • Roofline plots would be helpful for assessing cost/performance.

Reviewer 3

  • The article needs to introduce the PDEs and the range of applications better at the beginning, and to use more welcoming notation for the general reader. I was also hoping for more insight into the results (fraction of peak performance, fraction of time spent on host-to-device transfers, etc.), and take-home messages about mistakes made and lessons learned.
