CUDA/pybind project structure adapted from https://github.com/pkestene/pybind11-cuda
This project demonstrates various algorithms for computing the Mandelbrot set. The Mandelbrot set is a set of points in the complex plane that produces a fractal shape when plotted. It is defined by iterating the function:

$$z_{n+1} = z_n^2 + c$$

where $z_0 = 0$ and $c$ is the point being tested. As $n \to \infty$, the point $c$ belongs to the set if and only if $|z_n|$ remains bounded.
Figure 1: Mandelbrot set in the complex plane, in black.
Since each pixel has to be computed individually, this problem is embarrassingly parallel.
The Mandelbrot set is a fractal, and in order to explore its bounds, it is common to zoom in on certain areas, and create videos of the zooming process. This is a computationally intensive task, and the goal of this project is to explore various algorithms to compute the Mandelbrot set efficiently.
We have implemented various algorithms to compute the Mandelbrot set through a pure-Python package called `mandelbrot_lib` and a C++/CUDA-based Python package called `cuda_mandelbrot_lib`. Some are common algorithms, and a few are our own.

```python
from mandelbrot_lib import NaiveSequential, NumpyGrid, NumbaCuda
from cuda_mandelbrot_lib import MultithreadCPP, BaseCUDA, ManualUnroll, PolynomialUnroll, PragmaUnroll
```
### `NaiveSequential`

This is a straightforward implementation of the Mandelbrot set computation in pure Python. It iterates over each point in the grid and computes the number of iterations required for that point to escape.
All our implementations are subclasses of a common base class `_BaseAlgorithm`, and offer a similar interface:

```python
naive = NaiveSequential(escape_radius=escape_radius)
output = naive.compute_grid(xmin, ymin, xmax, ymax, width, height, n_iterations)
```
The output is a matrix giving, for each point, the number of iterations performed before $|z_n|$ exceeded the escape radius.
### `NumpyGrid`

This implementation uses NumPy to vectorize the computation. By leveraging NumPy's vectorized operations, the algorithm updates every point of the grid simultaneously, resulting in much faster computation than the pure-Python loop.

```python
numpy = NumpyGrid(escape_radius=escape_radius)
```
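The vectorized idea can be sketched as follows. This is a minimal illustration of the technique, not the actual `NumpyGrid` source; the function name and signature are ours.

```python
import numpy as np

def mandelbrot_numpy(xmin, ymin, xmax, ymax, width, height,
                     n_iterations, escape_radius=2.0):
    """Vectorized escape-time computation: every pixel advances in lockstep."""
    xs = np.linspace(xmin, xmax, width)
    ys = np.linspace(ymin, ymax, height)
    c = xs[np.newaxis, :] + 1j * ys[:, np.newaxis]   # (height, width) grid
    z = np.zeros_like(c)
    counts = np.full(c.shape, n_iterations, dtype=np.int64)
    alive = np.ones(c.shape, dtype=bool)             # points not yet escaped
    for i in range(n_iterations):
        z[alive] = z[alive] * z[alive] + c[alive]    # update only live points
        escaped = alive & (np.abs(z) > escape_radius)
        counts[escaped] = i                           # record escape iteration
        alive &= ~escaped
    return counts
```

Masking out escaped points keeps the per-iteration work shrinking, but every iteration still pays Python-level overhead once per grid update rather than once per point, which is where the speed-up over the naive loop comes from.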
### `NumbaCuda`

The Numba implementation utilizes the GPU to accelerate the computation. Numba is a just-in-time compiler for Python that translates Python functions into optimized machine code at runtime. This implementation uses the following configuration:

- Number of threads per block: 16x16 (256 threads).
- Number of blocks: determined by the grid size divided by the number of threads per block.

```python
numba = NumbaCuda(escape_radius=escape_radius)
```
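The block count follows the usual ceiling-division rule, so the whole image is covered even when its dimensions are not multiples of 16. The helper below is an illustrative sketch of that computation, not code from the package:

```python
import math

def launch_config(width, height, block_dim=(16, 16)):
    """Compute a CUDA launch configuration for a width x height image.

    One thread per pixel; block counts are rounded up so that the last
    partial block still covers the image edge.
    """
    blocks_x = math.ceil(width / block_dim[0])
    blocks_y = math.ceil(height / block_dim[1])
    return (blocks_x, blocks_y), block_dim
```

For example, `launch_config(1000, 500)` returns `((63, 32), (16, 16))`: 63 blocks of 16 threads cover the 1000-pixel width with a few idle threads in the last block.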
### `MultithreadCPP`

This is a C++ implementation of the Mandelbrot set computation. It provides a performance improvement over Python by using lower-level operations and optimizations available in C++. This implementation uses multithreading to parallelize the computation across multiple CPU cores.

```python
cpp = MultithreadCPP(escape_radius=escape_radius)
```
### `BaseCUDA`

The CUDA implementation leverages NVIDIA's CUDA platform to perform parallel computation on the GPU. CUDA allows for massive parallelism by running thousands of lightweight threads, each computing one pixel. This implementation uses:

- Number of threads per block: 16x16 (256 threads).
- Number of blocks: determined by the grid size divided by the number of threads per block.

```python
cuda = BaseCUDA(escape_radius=escape_radius)
```
The next three algorithms are our attempts at making a faster algorithm than the regular `BaseCUDA`, using general techniques rather than Mandelbrot-specific micro-optimisations. For example, we could have manually marked areas of the complex plane that we know are or aren't in the set, and skipped computation there, but that is not the purpose of this analysis. We focused our attention on a technique that is more transferable to other problems: unrolling.
## Unrolling
Let's consider a simple loop function:

```python
def to_optimise():
    for i in range(3):
        do_something(i)
```
Under the hood, in most languages, it does the following:

```python
i = 0
do_something(i)
i += 1
if i >= 3:  # evaluates to False
    break
do_something(i)
i += 1
if i >= 3:  # evaluates to False
    break
do_something(i)
i += 1
if i >= 3:  # evaluates to True
    break
```
If `do_something` is fast enough, all these checks and `i` increments can take a noticeable toll on performance. Now, consider the "unrolled" version:
```python
def unrolled():
    do_something(0)
    do_something(1)
    do_something(2)
```
This does significantly less work, for the same result.
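In CPython the difference can be measured directly. The micro-benchmark below is a sketch (the `do_something` body is a hypothetical cheap function; absolute timings vary by machine):

```python
import timeit

def do_something(i):
    return i * i                      # a deliberately cheap body

def to_optimise():
    acc = 0
    for i in range(3):                # loop bookkeeping on every pass
        acc += do_something(i)
    return acc

def unrolled():
    acc = 0                           # no counter, no comparison, no jump
    acc += do_something(0)
    acc += do_something(1)
    acc += do_something(2)
    return acc

looped_t = timeit.timeit(to_optimise, number=200_000)
unrolled_t = timeit.timeit(unrolled, number=200_000)
print(f"loop: {looped_t:.3f}s  unrolled: {unrolled_t:.3f}s")
```

Note that optimizing compilers for languages like C++ often perform this transformation automatically when the trip count is known; an interpreter does not, which is why the gap is visible here.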
## Unrolling Mandelbrot
How can this be applied to the Mandelbrot set computation? Let's write the basic algorithm:

```python
def escape_iteration(c, max_iter, escape_radius):
    z = 0 + 0j
    for iter_ in range(max_iter):
        z = z * z + c
        if abs(z) > escape_radius:
            return iter_
    return max_iter
```
A first idea that comes to mind is taking two steps at each iteration. This "semi-unrolls" the loop:

```python
def escape_iteration_double(c, max_iter, escape_radius):
    z = 0 + 0j
    for iter_ in range(max_iter // 2):
        z = z * z + c
        z = z * z + c
        if abs(z) > escape_radius:
            return iter_ * 2
    return max_iter
```
This eliminates half of the escape-radius and iteration-count checks, at the cost of sometimes overshooting the escape radius by one iteration: the loop runs iterations in batches of 2, but a point may cross the escape radius after an odd number of iterations.
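To make the overshoot concrete, the self-contained sketch below (using `abs` as the norm) compares the two counters on a few sample points; for escaping points they agree to within one iteration, and for points inside the set both return `max_iter`.

```python
def escape_iteration(c, max_iter, escape_radius):
    """Single-step escape-time loop: checks the radius every iteration."""
    z = 0 + 0j
    for iter_ in range(max_iter):
        z = z * z + c
        if abs(z) > escape_radius:
            return iter_
    return max_iter

def escape_iteration_double(c, max_iter, escape_radius):
    """Semi-unrolled loop: two updates per radius check."""
    z = 0 + 0j
    for iter_ in range(max_iter // 2):
        z = z * z + c
        z = z * z + c                 # second step, radius checked once
        if abs(z) > escape_radius:
            return iter_ * 2
    return max_iter

# The two counts never differ by more than one iteration:
for c in (2 + 0j, 0.3 + 0.5j, -0.75 + 0.2j):
    single = escape_iteration(c, 100, 2.0)
    double = escape_iteration_double(c, 100, 2.0)
    assert abs(single - double) <= 1
```

Since successive escape bands differ by a single colour step in a typical rendering, this off-by-one error is visually negligible.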
The algorithm above is the basis for the `ManualUnroll` algorithm, even though it works in the $(x, y)$ plane rather than with complex numbers, as detailed below.

A single iteration of the Mandelbrot function is:

$$z_{n+1} = z_n^2 + c$$

Expanding this using vector notation, with $z_n = x_n + i y_n$ and $c = a + i b$, we get:

$$x_{n+1} + i y_{n+1} = (x_n + i y_n)^2 + (a + i b)$$

Using multiplication rules for complex numbers,

$$x_{n+1} = x_n^2 - y_n^2 + a, \qquad y_{n+1} = 2 x_n y_n + b$$

Combining two iterations into one, we get:

$$x_{n+2} = x_{n+1}^2 - y_{n+1}^2 + a, \qquad y_{n+2} = 2 x_{n+1} y_{n+1} + b$$

Let

$$u = x_n^2 - y_n^2 + a, \qquad v = 2 x_n y_n + b$$

then:

$$x_{n+2} = u^2 - v^2 + a, \qquad y_{n+2} = 2 u v + b$$
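As a sanity check, the two-step update can be written directly from these formulas. This is an illustrative Python sketch, not the package's CUDA kernel:

```python
def double_step(x, y, a, b):
    """Apply two Mandelbrot iterations at once in the (x, y) plane."""
    u = x * x - y * y + a   # real part after one iteration
    v = 2.0 * x * y + b     # imaginary part after one iteration
    return u * u - v * v + a, 2.0 * u * v + b
```

Applying `z = z * z + c` twice with `z = x + iy` and `c = a + ib` gives exactly the same point, up to floating-point rounding.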
Another idea is to expand this formula down to the most granular polynomial. Expanding the terms, we get:

For the real part:

$$x_{n+2} = (x_n^2 - y_n^2 + a)^2 - (2 x_n y_n + b)^2 + a$$

For the imaginary part:

$$y_{n+2} = 2 (x_n^2 - y_n^2 + a)(2 x_n y_n + b) + b$$

Therefore:

$$x_{n+2} = x_n^4 - 6 x_n^2 y_n^2 + y_n^4 + 2a (x_n^2 - y_n^2) - 4 b x_n y_n + a^2 - b^2 + a$$

$$y_{n+2} = 4 x_n^3 y_n - 4 x_n y_n^3 + 4 a x_n y_n + 2 b (x_n^2 - y_n^2) + 2 a b + b$$
When computing this polynomial, we can re-use a lot of the factors:

```python
xy = x * y
x2 = x * x
y2 = y * y
x4 = x2 * x2
x2y2 = x2 * y2
y4 = y2 * y2
ax2 = a * x2
x3y = xy * x2
...
```
We named this algorithm `PolynomialUnroll`.
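A minimal Python sketch of the expanded polynomial with factor re-use (illustrative, not the CUDA source; the function name and grouping of terms are ours) should agree with two plain iterations:

```python
def polynomial_double_step(x, y, a, b):
    """Two Mandelbrot iterations at once, via the fully expanded polynomial."""
    xy = x * y
    x2 = x * x
    y2 = y * y
    x4 = x2 * x2
    y4 = y2 * y2
    x2y2 = x2 * y2
    # x^4 - 6x^2y^2 + y^4 + 2a(x^2 - y^2) - 4bxy + a^2 - b^2 + a
    new_x = (x4 - 6.0 * x2y2 + y4
             + 2.0 * a * (x2 - y2) - 4.0 * b * xy
             + a * a - b * b + a)
    # 4x^3y - 4xy^3 + 4axy + 2b(x^2 - y^2) + 2ab + b
    new_y = (4.0 * xy * (x2 - y2) + 4.0 * a * xy
             + 2.0 * b * (x2 - y2) + 2.0 * a * b + b)
    return new_x, new_y
```

The intermediate products (`xy`, `x2`, `x2y2`, ...) each appear in several terms, which is the re-use the expansion is meant to exploit.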
Finally, we suspected that unrolling more than 2 iterations could be beneficial in areas of the plane where many iterations are needed to escape, and worth the occasional overhead of doing more iterations than necessary. This is what is implemented in the `n_unroll.cu` script, which uses the `#pragma unroll` directive to unroll the loops at compile time, for fixed unroll lengths, in `PragmaUnroll`, which implements the `compute_grid_2`, `compute_grid_3`, `compute_grid_5` and `compute_grid_10` methods.
Below is a graph of the average execution time of each algorithm per grid size, in log-log scale.
Figure 2: Time vs. grid size.
As expected, all algorithms are asymptotically linear in the number of cells, but with different intercepts, and different behaviours for smaller grids.
Noticeably, all custom CUDA implementations beat the Numpy and Numba implementations by orders of magnitude, but are hard to tell apart from one another.
Roughly, for large grids, the speed ordering is: NaiveSequential < NumpyGrid < MultithreadCPP < NumbaCuda < the CUDA implementations. A linear regression yields:
| Model | Cells per µs |
|---|---|
| NaiveSequential | 0.01 |
| NumpyGrid | 0.05 |
| MultithreadCPP | 10.3 |
| NumbaCuda | 132 |
| PolynomialUnroll | 2580 |
| PragmaUnroll (2) | 3340 |
| BaseCUDA | 3520 |
| PragmaUnroll (5) | 3770 |
| PragmaUnroll (10) | 3770 |
| PragmaUnroll (3) | 3890 |
| ManualUnroll | 4230 |
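As a rough sanity check on these figures, a fitted throughput can be converted into an expected wall time for a (10k, 10k) grid. This is simple extrapolation from the rates in the table, assuming the linear regime holds:

```python
def expected_time_ms(cells_per_us, width=10_000, height=10_000):
    """Convert a throughput in cells/µs into milliseconds per grid."""
    total_cells = width * height               # 10^8 cells
    return total_cells / cells_per_us / 1_000  # µs -> ms

# Fitted throughputs taken from the regression table:
for name, rate in [("BaseCUDA", 3520), ("ManualUnroll", 4230)]:
    print(f"{name}: ~{expected_time_ms(rate):.1f} ms per (10k, 10k) grid")
```

At these rates the fastest kernels would clear a hundred-megapixel grid in a few tens of milliseconds, which is why a per-grid tie-breaker measurement is needed to separate them.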
We'll run a separate analysis for the CUDA implementations below as a tie-breaker:
| Model | Time (ms) for (10k, 10k) grid |
|---|---|
| PolynomialUnroll | |
| BaseCUDA | |
| PragmaUnroll (2) | |
| PragmaUnroll (5) | |
| PragmaUnroll (3) | |
| PragmaUnroll (10) | |
| ManualUnroll | |
In the end, our `ManualUnroll` is statistically significantly faster than the CUDA benchmark `BaseCUDA` (Z-statistic = 2.95, p-value = 0.0016), and faster than every other algorithm we tried.
All `PragmaUnroll` variants slightly beat the benchmark, but not very significantly. `PolynomialUnroll` is the worst of all CUDA implementations, which is not surprising given that many intermediate terms have to be computed and stored.
All of this code can be executed from the Google Colab notebook linked at the top of this README. We chose Colab to avoid hardware incompatibility issues, and to access free GPUs.
Finally, a zoom generated by our algorithm is shown in the following YouTube video: (click picture)
Settings can be found in the description. The video took roughly 20 seconds to generate, for 300 frames at 720p resolution with the `BaseCUDA` implementation.