This page discusses how ExaHyPE supports SIMT and SIMD compute kernels. ExaHyPE uses GPUs in a classic offloading fashion, i.e. it treats GPUs as accelerators: it takes particular tasks, ships them to the GPU, and then ships the results back. This happens under the hood; data neither resides on the accelerator permanently, nor is the accelerator used all the time.
The whole approach assumes that there are computations (cell/patch updates) which can be done without any effect on global variables. It works if and only if we can ship a task to the GPU, get the solution data for this task back, and no other global variable changes. We furthermore assume that the computations that fit onto the GPU have no state: they can use some global, static variables, but they cannot access the solver's state, which can change over time. We rely on code parts which have no side effects and do not depend on the solver state (minus global variables).
The same argument holds for aggressive vectorisation: vectorisation works best if the computations on a patch do not have any side effects. Basically, ExaHyPE treats a GPU's SIMT and a CPU's SIMD as two realisations of the same pattern.
Working without side effects might not be possible for all patches: there are always patches which evaluate and analyse some global data, or build up global data structures. In this case, ExaHyPE offloads only the remaining, simple patches to the GPU. That is, having a solver that supports patches/cells without side effects does not mean that all cells have to be side-effect-free.
Before you dive into this technical description of how GPU offloading (and very aggressive vectorisation) work, it makes sense to read through the tutorial on GPU offloading.
Configure ExaHyPE with the argument --with-gpu=xxx and select an appropriate GPU programming model. Rebuild the whole Peano core; this includes the ExaHyPE libraries.
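A minimal sketch of the rebuild cycle (the concrete value for xxx depends on the GPU backends your installation supports; consult the configure help):

```sh
# Replace xxx with the GPU programming model of your choice; the set of
# supported backends is listed by ./configure --help.
./configure --with-gpu=xxx
make clean
make
```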
- More details on the maturity and workings of the code
- A discussion of how to introduce your own GPU-specific kernels
The Finite Volume enclave solvers, for example, all support GPUs, but the versions without enclave tasking do not. The solver documentation clarifies which one to select. For most of these GPU-enabled solvers, it is sufficient to pass an additional flag (for the FV variants, e.g.) that tells the solver that we have PDE terms which are free of side effects and do not need the solver object; see the sketch below. Consult the documentation of the respective solvers for examples.
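A hypothetical construction through the Python API (the class name and the flag pde_terms_without_state are assumptions and may differ per solver and version; your solver's documentation is authoritative):

```python
import exahype2

# Hypothetical: an enclave FV solver with stateless PDE terms enabled.
# Class and argument names are illustrative; check your solver's docs.
my_solver = exahype2.solvers.fv.rusanov.GlobalFixedTimeStepWithEnclaveTasking(
    name="MySolver",
    patch_size=16,
    unknowns=5,
    auxiliary_variables=0,
    min_volume_h=0.001,
    max_volume_h=0.01,
    time_step_size=1e-4,
    pde_terms_without_state=True,  # assumption: flags the PDE terms as stateless
)
```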
To facilitate the offloading, we have to create alternative versions of our PDE term functions that work independently of the solver object (and in return cannot modify it). Depending on which terms we have, we need stateless versions of the flux, the non-conservative product and the source term. We always need the eigenvalue function. Per function, keep the original one and add a second, static one which takes an additional argument of type Offloadable. This static version has exactly the same arguments as the non-static, default variant, plus this one extra argument in last position. The last argument is solely there to distinguish the static version from the normal one, as C++ cannot overload w.r.t. static vs. non-static. Offloadable
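A minimal sketch of such a pair of declarations, assuming a flux signature along the lines of the Finite Volume solvers (the exact parameter list is dictated by your generated abstract solver class, which also defines Offloadable):

```cpp
// Normal, virtual variant: may read and modify the solver state.
virtual void flux(
  const double* __restrict__ Q,
  const tarch::la::Vector<Dimensions, double>& x,
  double t,
  int normal,
  double* __restrict__ F
) override;

// Stateless twin: identical argument list plus the trailing Offloadable
// marker. Being static, it cannot touch the solver object.
static void flux(
  const double* __restrict__ Q,
  const tarch::la::Vector<Dimensions, double>& x,
  double t,
  int normal,
  double* __restrict__ F,
  Offloadable
);
```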
is automatically defined in the abstract base class which ExaHyPE's Python API generates. Most ExaHyPE solvers allow you to insert the realisation of a routine directly from the Python code. In this case, the call ends up in the AbstractMySolver class. Most classes offer a corresponding routine set_implementation(). If you use this one, the code generator usually does not distinguish the two callback flavours and uses the same code snippet for the normal as well as the vectorised version. However, the implementations will likely end up in the cpp file of the abstract superclass, i.e. the compiler will not be able to inline anything. If your compiler struggles with the inlining, you might be forced to intervene manually here.
ExaHyPE solvers which support GPUs/stateless operators typically host three different compute kernels. On the Python level, these kernels are plain C++ function calls, i.e. strings:
- self._compute_kernel_call is a C++ function call for the normal task. It is invoked whenever we hit an octant, i.e. a Finite Volume patch or a DG cell, and it calls the virtual flux, eigenvalue, ... functions on the solver object.
- self._fused_compute_kernel_call_cpu is a C++ function call that ExaHyPE uses whenever it encounters a set of octants in the tree which can be processed embarrassingly parallel, as they all invoke the same PDE and none of them alters the state of the underlying solver object. The compute kernels identified by this string can therefore call the static flux and eigenvalue functions instead of the virtual ones.
- self._fused_compute_kernel_call_gpu is the equivalent of _fused_compute_kernel_call_cpu which the code uses when it decides to offload a bunch of cells to an accelerator.

By default, these attributes point to reasonable variants once you switch on stateless operators. The variants are typically constructed such that they distinguish the various backends we support, i.e. they yield a call which automatically picks a working default backend depending on your configuration.
While this gives you a working version quickly, you might want to switch to a tailored variant for your particular solver. This can be done by changing the actual solver kernel:
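A hypothetical example (the kernel routine name below is an illustrative stand-in for a function you provide yourself; the default strings are generated by the solver):

```python
# Hypothetical: point the GPU variant to a hand-written fused kernel.
# The function name and call-string layout are illustrative only.
my_solver._fused_compute_kernel_call_gpu = (
    "::myproject::myTailoredFusedKernelGPU( patchData );"
)
```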
If only some patches/cells can be offloaded to the GPU, then you can redefine the routine patchCanUseStatelessPDETerms() in your solver. By default, this routine always returns true, so all compute kernels end up on the GPU if they are embedded into enclave tasks. Here's the clue: this is a normal member function, i.e. you can use the solver's state and make the result depend on it. You might want to alter it and keep some kernels on the host.
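A minimal sketch of such an override (the argument list is an assumption; copy the exact signature from your generated AbstractMySolver.h):

```cpp
// Hypothetical policy: patches in the left half of the domain analyse
// global data on the host; all others may use the stateless kernels.
bool MySolver::patchCanUseStatelessPDETerms(
  const tarch::la::Vector<Dimensions, double>& patchCentre,
  const tarch::la::Vector<Dimensions, double>& patchSize,
  double t,
  double dt
) const {
  return patchCentre(0) >= 0.5;
}
```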
The sections below discuss what you should see in the code.
Once you have a GPU-ready solver, it makes sense to double-check quickly whether the solver really dispatches the correct compute kernels. For this, we first study the generated task. Stateless routines are also very useful on the CPU, as we can aggressively vectorise over them. Therefore, the task should look similar to the sketch below.
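Schematically, and with illustrative names rather than the literal generated code:

```cpp
// Sketch of the generated task body (names are stand-ins):
void EnclaveTask::computePatch() {
  if ( _solver.patchCanUseStatelessPDETerms(_patchCentre, _patchSize, _t, _dt) ) {
    // fused kernel built around the static, stateless PDE terms; this is
    // the variant the compiler can vectorise aggressively
    fusedComputeKernelCPU( _patchData );
  }
  else {
    // default kernel invoking the virtual flux/eigenvalue/... on the solver
    defaultComputeKernel( _solver, _patchData );
  }
}
```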
The kernels might look completely different, but the principle is clear: whenever we hit a cell, we check if patchCanUseStatelessPDETerms() holds. If it does not hold, we use the default version. If it holds, we invoke the specialised version which assumes that static (stateless) variants of the PDE term functions exist.
Scroll further down until you see the fuse() routine. This one is again very simple to digest:
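In schematic form (the backend guard and function names are assumptions; the generated code differs per solver and configuration):

```cpp
// Sketch of the dispatch within fuse(): the OpenMP GPU kernel is chosen if
// and only if a valid device is available, while the SYCL variant can serve
// host and device alike.
static void fuse( const std::list<EnclaveTask*>& tasks, int targetDevice ) {
  #if defined(GPUOffloadingOMP)
  if ( targetDevice >= 0 ) {
    fusedComputeKernelGPU( tasks, targetDevice );
    return;
  }
  #endif
  // fallback: host variant (a SYCL build may route this to host or device)
  fusedComputeKernelCPU( tasks );
}
```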
We see here that SYCL can run on both the host and the device, whereas we pick the OpenMP version if and only if we have selected a valid device.
If this call logic does not fit your needs, most solvers have attributes such as self._fused_compute_kernel_call_cpu and self._fused_compute_kernel_call_gpu to switch to a modified kernel realisation variant. Typically, there are factory mechanisms to alter these:
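A hypothetical example (the factory function and module path are illustrative stand-ins; check which factory routines your solver module actually provides):

```python
# Hypothetical: rewire both fused kernel variants through a factory routine.
# Module path and function name are stand-ins, not real ExaHyPE API.
from exahype2.solvers.fv.kernels import create_fused_compute_kernel

my_solver._fused_compute_kernel_call_cpu = create_fused_compute_kernel(target="cpu")
my_solver._fused_compute_kernel_call_gpu = create_fused_compute_kernel(target="gpu")
```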