cise_bempp's Introduction

Welcome 👋

I am a computational scientist at the interface of mathematics, engineering, the physical sciences, and computer science. Basically, I love everything that combines great maths, lots of computing, and cool applications. My particular focus area is boundary integral equations, where for a long time I have worked on the Bempp software package. Below you will find links to papers and other official information.

cise_bempp's Issues

Reviewer 1

  • I suggest including references to previous work on BEM with GPUs, e.g., the paper "GCA-H² matrix compression for electrostatic simulations" by S. Börm and S. Christophersen, or the papers "Algorithmic patterns for H-matrices on many-core processors" and "A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters" by P. Zaspel.

  • In the left column on page 3, I would expect "sets" instead of "set" in line 26.

  • In the left column on page 3, the integral in line 50 should probably be defined on \tau_i and \tau_j instead of \Gamma.
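
For clarity, the correction the reviewer appears to have in mind would read, schematically (a sketch assuming the paper's Galerkin notation: a kernel g, a trial function φ_j associated with triangle τ_j, and a test function ψ_i associated with triangle τ_i):

```latex
A_{ij} = \int_{\tau_i} \int_{\tau_j} g(\mathbf{x}, \mathbf{y})\,
         \phi_j(\mathbf{y})\, \psi_i(\mathbf{x})
         \,\mathrm{d}\mathbf{y}\,\mathrm{d}\mathbf{x}
```

i.e., integration over the triangle pair rather than over the whole boundary Γ.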

  • In the right column on page 3, may I suggest "the grid, the spaces, and the operator(s)" in line 10?

  • In the right column on page 3, lines 17 to 20 appear to work only if the support of the basis functions consists only of \tau_i and \tau_j, since only quadrature points from these triangles are used. For more general basis functions, A_{ij} would only be an entry of the element stiffness matrix.

  • In the right column on page 3, I suggest including references for the "singularity-removing coordinate transformations" mentioned in line 27.

  • In the left column on page 4, I suggest "dive deep" instead of "deep dive" in line 30.

  • In the right column on page 4, I suggest "perform as efficiently as possible" instead of "perform as highly as possible" in line 27.

  • In the left column on page 5, it should be "relationships" instead of "relataionships" in line 11.

  • In the left column on page 5, I suggest clarifying that OpenCL kernels cannot be written in C99 or C++, since important language features such as recursion and the standard libraries are missing, while features such as vector types have been added. It would probably be better to speak of OpenCL C.

  • In the right column on page 5, may I suggest "from the host" instead of "from host" in line 21?

  • In the left column on page 6, I did not understand what "test and trial connectivity information that stores the indices of each node for each triangle" means in lines 25-27. Do the nodes correspond to the basis functions? Is each triangle assigned a global index for each local index? I suggest explaining this aspect in greater detail.
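
A minimal sketch of one plausible reading of "connectivity information" (names and data layout are illustrative only, not Bempp's actual data structures): each triangle stores the global index of each of its three local nodes, and local element contributions are scattered into the global matrix through that map.

```python
# Hypothetical connectivity: for each triangle, the global index of each
# of its three local nodes. Two triangles share the edge (1, 2).
triangles = [
    (0, 1, 2),  # triangle 0: local nodes 0,1,2 -> global nodes 0,1,2
    (1, 3, 2),  # triangle 1: local nodes 0,1,2 -> global nodes 1,3,2
]

n_nodes = 4
A = [[0.0] * n_nodes for _ in range(n_nodes)]

for tri in triangles:
    # Dummy 3x3 local element matrix; in a real BEM/FEM code this would
    # come from quadrature over the element (pair).
    local = [[1.0] * 3 for _ in range(3)]
    for a, i in enumerate(tri):       # local test index -> global row
        for b, j in enumerate(tri):   # local trial index -> global column
            A[i][j] += local[a][b]

print(A[2][1])  # nodes 2 and 1 appear together in both triangles -> 2.0
```

Under this reading, "nodes" do correspond to (nodal) basis functions, and the connectivity array is exactly the local-to-global index map.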

  • In the left column on page 6, what are "basis function multipliers" in lines 30-31?

  • In the left column on page 6, what happens if the global assembled matrix does not fit into, e.g., graphics memory when using a GPU in line 33? Is it possible to split the global matrix into submatrices that fit into graphics memory, and then transfer them to main memory? Given the high computational complexity, would it be possible to overlap memory transfers and computation?
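
The splitting the reviewer asks about can be sketched as follows (a hypothetical host-side driver, not Bempp's API: `assemble_block` stands in for a device kernel producing one row block small enough to fit in graphics memory):

```python
n = 10          # global matrix is n x n
block_rows = 4  # rows per block, chosen so one block fits in device memory

host_matrix = [[0.0] * n for _ in range(n)]

def assemble_block(row_start, row_end):
    # Stand-in for a GPU kernel launch producing rows [row_start, row_end).
    return [[float(i + j) for j in range(n)] for i in range(row_start, row_end)]

for start in range(0, n, block_rows):
    end = min(start + block_rows, n)
    block = assemble_block(start, end)   # compute one block on the "device"
    host_matrix[start:end] = block       # transfer the block back to the host

print(host_matrix[9][9])  # 18.0
```

Overlapping the copy of one block with the computation of the next (e.g., via multiple queues or streams) would be the natural follow-up optimization the reviewer mentions.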

  • In the left column on page 8, may I suggest "coloring techniques" instead of "color coding techniques" in line 6?
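
The coloring techniques the reviewer refers to can be illustrated with a greedy sketch (mesh data is illustrative only): two triangles that share a node write to the same global degrees of freedom, so they must receive different colors; all elements of one color can then be assembled in parallel without atomics.

```python
triangles = [
    (0, 1, 2),
    (1, 3, 2),
    (3, 4, 2),
    (4, 5, 6),  # shares no node with triangle 0
]

colors = []
for i, tri in enumerate(triangles):
    # Colors already taken by earlier triangles that share a node with tri.
    used = {colors[j] for j in range(i) if set(triangles[j]) & set(tri)}
    c = 0
    while c in used:  # smallest color not used by any conflicting neighbor
        c += 1
    colors.append(c)

print(colors)  # [0, 1, 2, 0]
```

Triangles 0 and 3 end up with the same color because they touch disjoint node sets, so their global updates cannot conflict.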

  • In the left column on page 8, maybe an example of how "classic loop parallelism" works in line 33 would be useful. I was wondering how assembling the combination of one test element with all trial elements could still lead to linear complexity.

  • In the right column on page 8, why is only AVX used in line 31? As far as I can see, the processor used for the benchmarks supports AVX512, as well.

  • In the left column on page 11, I was surprised at the words "stack based functions" in lines 29-30. As far as I know, OpenCL assigns registers directly to variables and does not use a stack. The point still stands, of course, since registers are even more efficient than stack-based local variables, as long as there are enough of them available.

Reviewer 2

  • "We could tune our code for better auto-vectorization in Numba, but this would give little benefit as we have highly optimized hand-tuned OpenCL kernels already." To me this statement casts a shadow on the interpretation of the benchmarking results, since it may not be "fair" to compare highly optimized OpenCL kernels to less well-optimized Numba kernels. I do understand that Numba is meant to be a backup implementation in Bempp, but then the authors should be careful to put these results in context. I ask that they provide some additional discussion on the "fairness" of this comparison and how the results should be interpreted.

  • Since Numba provides support for both CUDA and ROCm, the authors should explicitly state that they are using Numba CUDA in the abstract and introduction. It's clear from the context, since they are using an NVIDIA GPU, but they need to say this to avoid confusion.

  • Having briefly experimented with PyOpenCL/OpenCL myself, I was surprised at the authors' characterization that using it was easy. I struggled to get it configured properly with my NVIDIA vendor driver, for example. Maybe the authors did not have this experience but since OpenCL has a general reputation for being harder to use than CUDA, it would be useful for the authors to acknowledge this and state whether their experience was similar or different.

  • This is a more minor comment but since OpenCL was chosen for portability, I would have been very interested to see benchmarking results on AMD GPUs too. If this is future work it is worth mentioning.

  • Finally, the authors should cite Numba.

Review Comments - Associate Editor

  • The technique described by the authors seems to entail a substantial duplication of code that might be avoidable using certain techniques. What trade-offs led the authors to their approach?

  • Modern FEM codes tend to employ atomic operations instead of graph coloring for improved performance. (Atomic FP operations, also in 64 bit, can be realized via the classical compare-and-swap technique.)

  • The authors state that SIMD vectorization is not used for the evaluation of singular integrals, on account of the far field representing the lion's share of the work. Under FMM acceleration, near-field and far-field should take approximately equal time. What were the obstacles to applying the optimization in this setting?

  • The authors describe repeated host-device transfers as a particular performance bottleneck. These are likely avoidable if the data is kept on the device throughout. Why was this route not chosen? This question is particularly salient because it directly influences the authors' disqualification of GPUs for direct evaluation tasks.

  • The comment on lower FP64 throughput on Nvidia hardware should be removed or rephrased as it only applies to gaming-targeted consumer hardware.

  • Numba is described as a fallback technology, but no particular scenario is given in which the fallback would be required. The installation instructions for PyOpenCL (https://documen.tician.de/pyopencl/misc.html) appear to contain instructions to deploy a full OpenCL stack on nearly any machine.

  • When the authors characterize OpenCL as a "second language", to what extent is this second language necessitated by the GPU as a "second environment"?

  • Roofline plots would be helpful for assessing cost/performance.

Reviewer 3

  • The article needs to introduce the PDEs and the range of applications better at the beginning, and to use more welcoming notation for the general reader. I was also hoping for more insight into the results (fraction of peak performance, fraction of time spent on host-to-device transfers, etc.), and take-home messages about mistakes made and lessons learned.
