Adding GPU support

ExaHyPE uses GPUs in a classic offloading mode: it takes particular tasks, ships them to the GPU, computes them there, and ships the results back. This happens under the hood; data does not reside permanently on the accelerator, nor is the accelerator in use all the time.

The whole approach assumes that there are computations (cell/patch updates) which can be done without any effect on global variables. It works if and only if we can ship a task to the GPU, get the solution data for this task back, and no other global variable changes. We furthermore assume that the computations that fit onto the GPU have no state: they can use some global, static variables, but they cannot access the solver's state, which can change over time. We rely on code parts which have no side effects and do not depend on the solver state (apart from global variables).

Working without side effects might not be possible for all patches (Finite Volumes and Finite Differences) or octants (RKDG and ADER-DG) in your mesh: there are pieces of the mesh which evaluate and analyse some global data, or build up global data structures. In this case, ExaHyPE only offloads the remaining, simple patches/octants to the GPU. That is, having a solver that supports patches/cells without side effects does not mean that all cells have to be side-effect-free.

The second thing to keep in mind is that ExaHyPE never offloads a single patch to the GPU. It can do so, but this is then a degenerate variant of something far more generic: the code takes a set of octants from the mesh and moves them en bloc to the accelerator. There, one compute kernel updates all of them in one go. We refer to this as fused updates.

The description above gives us a clear roadmap for offloading ExaHyPE code to the accelerator:

Compile with GPU support

Configure ExaHyPE with the argument --with-gpu=xxx and select an appropriate GPU programming model. Rebuild the whole Peano core, including the ExaHyPE libraries.

If you want to profile or optimise your code further, you might want to pick a particular vendor toolchain. See configure's help on --with-toolchain. The logging section discusses variations of the toolchain support.

Pick a task-based solver

The particular tasking concept that we implement in ExaHyPE is called enclave tasking. The original paper on this idea is

@article{doi:10.1137/19M1276194,
author = {Charrier, Dominic Etienne and Hazelwood, Benjamin and Weinzierl, Tobias},
title = {Enclave Tasking for DG Methods on Dynamically Adaptive Meshes},
journal = {SIAM Journal on Scientific Computing},
volume = {42},
number = {3},
pages = {C69-C96},
year = {2020},
doi = {10.1137/19M1276194}
}

Alternatively, each lowering into Peano yields a README-xxxxx.md file. In this markdown file, you also find references to all papers that have an influence on your chosen solver configuration. Solvers that support enclave tasking typically have a distinct name: exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking, for example, is a solver which yields tasks.

Once you have chosen such a solver and lowered your code to C++, there should be a tasks subdirectory after compilation. It makes sense to look into this generated C++ code from time to time to check that it does make sense.
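As a rough sketch, this is what picking such a solver looks like in the Python script. The concrete values (solver name, patch size, mesh sizes) and the unknowns argument are illustrative assumptions, and further constructor arguments may be required; my_project stands for your exahype2.Project instance:

import exahype2

my_solver = exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking(
    name="GPUFVSolver",      # illustrative values only
    patch_size=16,
    unknowns=5,              # assumption: five unknowns per volume
    min_volume_h=0.001,
    max_volume_h=0.01,
    # further arguments (e.g. a time step relaxation factor) may be required;
    # consult the solver's documentation
)
my_project.add_solver(my_solver)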

Assess suitability

Before we look into generated code or even start to implement something for GPUs, we should assess if there are any tasks that could, in theory, go to the accelerator.

Many developers jump straight into accelerator programming, are then frustrated if ExaHyPE does not use the GPU, and blame the code. Most of the time, they haven't done their homework: Peano always tries to maximise CPU usage first before it offloads to the GPU. For tiny problems or unfortunate load balancing, the code hence might not use the GPU even though it is, in principle, GPU-ready. This is a natural consequence of our aim to minimise data movements: if we don't have to ship data to the GPU, we don't do it!

Peano provides an ecosystem to assess to what degree a code might be suitable for an accelerator even before we start porting:

  1. Compile in the statistics mode. This requires you to switch the compile mode in your Python script (see the sketch after this list):
    project.set_Peano4_installation(..., peano4.output.CompileMode.Stats)
  2. Run the code. I recommend that you use a fairly short run, i.e. only a few time steps. The code will write a statistics file and report this in its log output:
    tarch::logging::Statistics::writeToCSV(string) write statistics to file statistics-rank-0.csv (total no of entries=117)
  3. Peano's postprocessing comes along with several scripts to inspect and to postprocess the outcome. Inspect the statistics file via
    python3 ~/git/Peano/python/peano4/postprocessing/inspect-statistics.py statistics-rank-0.csv
    to find out which column gives you information about the spawned tasks and the fused tasks. The output resembles
    Column | Counter
    --------|----------------
    0 | time (always used to plot)
    1 | exahype2::EnclaveBookkeeping::memory-allocations
    2 | mpi wait times
    3 | tarch::multicore::bsp-concurrency-level
    4 | tarch::multicore::spawned-tasks
    5 | tarch::multicore::taskfusion::process-fused-tasks
    6 | tarch::multicore::taskfusion::submit-fused-tasks
    7 | grid-control-events
  4. Visualise the number of tasks over time:
    python3 ~/git/Peano/python/peano4/postprocessing/plot-statistics.py -c 4,5,6 -o pdf statistics-rank-0.csv
    python3 ~/git/Peano/python/peano4/postprocessing/plot-statistics.py -c 5 -pt step -o pdf statistics-rank-0.csv
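The compile-mode switch from step 1, as it might appear in the Python script; the installation path is a placeholder for wherever your Peano checkout lives, and project denotes your project object:

import peano4.output

# Placeholder path: point this at your own Peano installation.
project.set_Peano4_installation("~/git/Peano", peano4.output.CompileMode.Stats)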

If the resulting plots show that the code spawns a lot of tasks which could, in theory, be fused, but the fusion only ever takes one task at a time and deploys it, we face issues. Consult the @tarch_logging "statistics page" for some more information on the plots.

In such a case, we clearly don't produce tasks quickly enough. There are always some idling cores which snap away fusable tasks before they can accumulate on the host. Further to that, the scheduler may change its behaviour after some time such that no tasks are produced at all anymore. These effects have to be studied and sorted out first, before we continue to try to offload to the GPU.

Enable stateless compute kernels

You next have to instruct your solver that you plan to have patch or cell updates without (write) access to the solver's internal state and without altering the global state. Obviously, you can weaken these assumptions later on, but this then requires further manual work.

The Finite Volume enclave solvers, for example, all support stateless kernels, but the versions without enclave tasking do not. The solver documentation should clarify which one to select. For most of these GPU-enabled solvers, it is sufficient to pass an additional flag

pde_terms_without_state=True

into the constructor. This tells the solver that we have PDE terms which have no side effects and do not need the solver object. Consult the documentation of the respective solvers. Again, exahype2.solvers.fv.rusanov.GlobalAdaptiveTimeStepWithEnclaveTasking is a good prototype to study:

name= "GPUFVSolver",
patch_size=18,
min_volume_h=0.0001,
max_volume_h=0.01,
pde_terms_without_state=True, # This part is the relevant one here
)

Provide the additional PDE terms without state

To facilitate the offloading, we have to create alternative versions of our PDE term functions that can work independently of the solver object (and, in return, cannot modify it). Depending on which terms we have, we'll need stateless versions of the flux, the non-conservative product and the source term. We always need the eigenvalue function. Per function, keep the original one and add a second one which

  • is static;
  • accepts an additional flag of type Offloadable as its last argument, i.e. the static version of a function has exactly the same arguments as the non-static, default variant plus one more. This last argument is solely there to distinguish the static version from the normal one, as C++ cannot overload w.r.t. static vs. non-static. Offloadable is automatically defined in the abstract base class which the Python API generates;
  • is embedded into
    #if defined(OpenMPGPUOffloading)
    #pragma omp declare target
    #endif
    my new static function variants
    #if defined(OpenMPGPUOffloading)
    #pragma omp end declare target
    #endif
    You can leave the declare annotations away if you use SYCL only.

Very often, the standard flux and eigenvalue routines can invoke the static variants and you can thus eliminate code redundancies. For example, you might have the normal flux function and a static flux function with the additional Offloadable parameter, but the normal function just invokes the static cousin.

In many of our codes, we take a function like

void examples::exahype2::SSInfall::SSInfall::flux( ... ) {
logTraceInWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
compute something;
logTraceOutWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
}

and simply split it into two variants:

void examples::exahype2::SSInfall::SSInfall::flux( ... ) {
logTraceInWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
flux(Q,faceCentre,volumeH,t,normal,F,Offloadable::Yes);
logTraceOutWith4Arguments( "flux(...)", faceCentre, volumeH, t, normal );
}
void examples::exahype2::SSInfall::SSInfall::flux( ..., Offloadable ) {
compute something;
}

The static version is used on the GPU. The normal one is used on the CPU, yet defers immediately to the static version. Please note that GPUs are quite restrictive w.r.t. terminal outputs and assertions: you will have to remove both from your static routine versions. In the example above, the non-static version wraps the core functionality into trace (logging) calls. Therefore, we have the logging on the CPU, but the core part that goes to the GPU is free of it. The same holds for assertions.

Most ExaHyPE solvers allow you to insert the realisation of a routine directly from the Python code. In this case, both variants end up in the AbstractMySolver class: the default one as well as the stateless one.
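If you do provide the implementation from Python, the call typically resembles the following sketch. The set_implementation() routine and its flux keyword are assumptions about your solver's interface (check its documentation for the exact names), and the C++ body is a toy example:

my_solver.set_implementation(
    flux="""
      // toy flux body, pasted into both the default and the stateless variant
      F[0] = Q[1];
      F[1] = Q[1]*Q[1]/Q[0];
    """
)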

Compile, run and check

In principle, the code is now GPU ready. Compile and run.

Tailor kernel choice

Most ExaHyPE solvers employ at least three different kernel variants: a normal one, one that vectorises aggressively on the host, and one that offloads to the GPU. If you use the default configuration, all of them should be set to reasonable defaults. If they do not suit your setup, you can manually alter

    self._fused_compute_kernel_call_cpu
    self._fused_compute_kernel_call_gpu

These values hold C++ strings, which are typically invocations of the corresponding compute kernels. If you play around with these values, you might want to study further GPU realisation details and dive into the code.
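A sketch of such a manual override; the kernel invocation below is a made-up placeholder rather than an existing ExaHyPE kernel, so copy the structure of the generated default instead of this literal string:

# Inspect the defaults first ...
print(my_solver._fused_compute_kernel_call_cpu)
print(my_solver._fused_compute_kernel_call_gpu)

# ... then, if needed, swap in your own C++ invocation.
# Both the function name and its argument here are purely hypothetical.
my_solver._fused_compute_kernel_call_gpu = """
  ::mynamespace::myFusedGPUKernel( patchData );
"""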

Fuse tasks

Finally, we might want to tailor the fusion of tasks. For example, you might want to offload to the GPU if and only if there are enough tasks that can be bundled (fused) into one meta task. Or you might want to use different GPUs depending on the task context. These decisions are made by the multithreading orchestration (among other things).

You can switch the orchestration manually in your C++ code through

void setOrchestration(tarch::multicore::orchestration::Strategy *realisation)

which ExaHyPE's initialisation invokes with a default strategy anyway.

You can alter this call. A more elegant variant is to use exahype2.Project.set_multicore_orchestration() to let the lowering into Peano automatically insert an appropriate setOrchestration() call. The namespace tarch::multicore::orchestration holds a bundle of different orchestration strategies that you can use to tweak your code. A good starting point is

my_project.set_multicore_orchestration( "new otherOrchestration(...)" )

Advanced techniques

Select which octants (cells or patches) to ship to the GPU and back

If only some patches/cells can be offloaded to the GPU, then you can redefine the routine

virtual bool patchCanUseStatelessPDETerms(
double t,
double dt
) const;

in your solver. By default, this routine returns true; the default implementation is generated into the AbstractXXX solver, but nothing stops you from redefining the function in your particular solver subclass.

Write your own orchestration

Very advanced codes write their own orchestration which ships particular tasks specifically to the GPU. Through this, you can tailor the task execution pattern towards your specific needs. Notably, the orchestration is asked per task type whether and where to ship the task.

Further reading