Benchmark which solves the single black hole described in a CCZ4 simulation. While the output is not particularly exciting (the simulation should yield a stationary solution after a while), it is numerically challenging: we have to handle sophisticated boundary condition effects and a singularity in the centre which can trigger oscillations, and we need a scheme with low numerical dissipation. Therefore, this benchmark combines two different solvers - a dissipative Finite Volume scheme and a higher order scheme - to obtain a compromise between accuracy and stability.
Peano has to be configured with (at least) the following options:

```
--enable-blockstructured
--enable-exahype2
```
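A minimal configure call using these flags could look as follows. This is only a sketch: compilers, MPI and the tasking backend are system-specific and have to be added on top.

```bash
# Minimal sketch only: real setups add compiler, MPI and multithreading options.
./configure --enable-blockstructured --enable-exahype2
make -j
```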
As always, the --help argument makes the Python script yield further information, while the sections below list characteristic parameter choices. There are different scripts in the directory. They serve different purposes and support slightly different parameters.
Depending on your system, you might have to load the GSL module separately before you build the application:
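On a module-based system this typically boils down to something along the lines below; the exact module name is an assumption and differs between systems.

```bash
# The module name is an assumption; check "module avail" for the GSL module on your system.
module load gsl
```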
The following statements remove all the glue code and autogenerated C++ files. There is no risk in deleting them - if you rerun the Python script, they will all be reinstantiated.
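A sketch of such a cleanup, assuming the directory layout of a typical generated ExaHyPE 2 project; the exact set of generated files depends on your configuration.

```bash
# Everything below is regenerated the next time you run the Python script.
# The directory names are assumptions based on a typical generated project.
rm -rf celldata facedata vertexdata globaldata observers repositories
rm -f *.o Makefile
```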
To get rid of the output files, type in
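the commands below. This is a sketch which assumes the default patch-file output plus converted VTK data; adjust the patterns if you enabled further writers.

```bash
# Remove Peano's patch files and any converted visualisation data (assumed file endings).
rm -f *.peano-patch-file *.pvd *.vtu
```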
If you have set up your experiment via the Python script with the -plot command, you can use the standard Peano scripts to convert Peano's patch files:
The picture below is taken from ExaHyPE's coupling discussion and illustrates that the higher order solver covers the whole domain, while the FV is only active in a tiny cross-shaped region within.
Benchmarking the single black hole is not easy, as it is a rather complex setup with a long initialisation and grid construction phase. However, it is possible to benchmark it once you disable the plots and the tracers and use an unreasonably coarse mesh.
For all the upscaling studies, we use the same kind of benchmarks. They can be generated in one go through
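a small loop over the Python script. The sketch below is assembled from the calls listed in the tables further down; add -amr for the adaptive-mesh variant.

```bash
# Loops over the scheduler/tree combinations used in the measurements below.
for scheduler in native-no-priorities native tailored parallel-for subtasks; do
  for trees in 1 6 8 12; do
    python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 \
      --scheduler $scheduler --trees $trees
  done
done
```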
I have played around with a lot of different parameter settings. Some insights are summarised below for a normal setup as we find it on Hamilton:
Description | Cell size | End time | Patch size | Remarks |
---|---|---|---|---|
Minimal setup | 1.8 | 1e-2 | 3 | You cannot see AMR here, and the FV patch size means that the CFL conditions do not match. This is a toy setup for a local laptop. It scales only up to 4-8 cores, as it is simply too small. Despite the small patch size (6 is the production-mode level), we see the impact of the lack of scaling. |
Proper minimal setup | 1.8 | 1e-2 | 6 | You cannot see AMR here either, but the FV patch size is a proper one, i.e. the black hole is resolved such that the CFL conditions of the FV and FD scheme match. |
One node setup | 0.6 | 1e-3 | 6 | This setup is already too big to be solved without AMR on a single node. We see that normal, native tasking performs best. |
One node setup | 0.1 | 1e-4 | 6 | Use xxx nodes for regular grid |
Below are some performance data which I collected on my local workstation. They provide a first look-and-feel of how the code behaves. Please note that these results are definitely not representative, as they are acquired on an eight-core machine; modern workstations feature more cores and hence yield qualitatively different outcomes. The first table is obtained on a regular grid, i.e. without -amr.
Name | Trees | Scheduler | Python script call | TBB | OpenMP | SYCL |
---|---|---|---|---|---|---|
(Almost) No parallel task producers | ||||||
Native tasking w/o priorities | x | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native-no-priorities --trees 1 | 527 | 532 | 560 |
Native tasking with priorities | x | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native --trees 1 | 518 | 529 | 566 |
Tailored task scheduling | x | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler tailored --trees 1 | 516 | 534 | 562 |
Parallel for | x | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler parallel-for --trees 1 | 307 | 637 | 560 |
Subtasks | x | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler subtasks --trees 1 | 302 | 534 | 565 |
Fewer BSP sections than threads ||||||
Native tasking w/o priorities | 6 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native-no-priorities --trees 6 | 428 | 457 | 698 |
Native tasking with priorities | 6 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native --trees 6 | 430 | 456 | 597 |
Tailored task scheduling | 6 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler tailored --trees 6 | 427 | 453 | 605 |
Parallel for | 6 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler parallel-for --trees 6 | 225 | 454 | 600 |
Subtasks | 6 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler subtasks --trees 6 | 223 | 453 | 605 |
One BSP section per thread ||||||
Native tasking w/o priorities | 8 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native-no-priorities --trees 8 | x | 436 | 610 |
Native tasking with priorities | 8 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native --trees 8 | x | 433 | 611 |
Tailored task scheduling | 8 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler tailored --trees 8 | x | 434 | 617 |
Parallel for | 8 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler parallel-for --trees 8 | x | x | 615 |
Subtasks | 8 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler subtasks --trees 8 | 207 | x | 618 |
Threads overbooked with BSP sections ||||||
Native tasking w/o priorities | 12 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native-no-priorities --trees 12 | x | 432 | 606 |
Native tasking with priorities | 12 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native --trees 12 | x | 435 | 607 |
Tailored task scheduling | 12 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler tailored --trees 12 | x | 432 | 618 |
Parallel for | 12 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler parallel-for --trees 12 | x | 435 | 611 |
Subtasks | 12 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler subtasks --trees 12 | 212 | 482 | 616 |
The second table repeats the measurements with adaptive mesh refinement switched on, i.e. with the additional -amr argument:

Name | Trees | Scheduler | Python script call | TBB | OpenMP | SYCL |
---|---|---|---|---|---|---|
(Almost) No parallel task producers | ||||||
Native tasking w/o priorities | x | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native-no-priorities --trees 1 | 408 | 435 | 343 |
Native tasking with priorities | x | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native --trees 1 | 406 | 433 | 340 |
Tailored task scheduling | x | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler tailored --trees 1 | 415 | 434 | 341 |
Parallel for | x | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler parallel-for --trees 1 | 194 | 432 | 341 |
Subtasks | x | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler subtasks --trees 1 | 193 | 434 | 341 |
Fewer BSP sections than threads ||||||
Native tasking w/o priorities | 6 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native-no-priorities --trees 6 | 398 | 420 | 365 |
Native tasking with priorities | 6 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native --trees 6 | 396 | 419 | 363 |
Tailored task scheduling | 6 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler tailored --trees 6 | 398 | 418 | 359 |
Parallel for | 6 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler parallel-for --trees 6 | 186 | 419 | 357 |
Subtasks | 6 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler subtasks --trees 6 | 183 | 420 | 361 |
One BSP section per thread ||||||
Native tasking w/o priorities | 8 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native-no-priorities --trees 8 | 390 | 413 | 374 |
Native tasking with priorities | 8 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native --trees 8 | 386 | 414 | 364 |
Tailored task scheduling | 8 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler tailored --trees 8 | 390 | 413 | 373 |
Parallel for | 8 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler parallel-for --trees 8 | 178 | 412 | 374 |
Subtasks | 8 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler subtasks --trees 8 | 181 | 414 | 375 |
Threads overbooked with BSP sections ||||||
Native tasking w/o priorities | 12 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native-no-priorities --trees 12 | 391 | 415 | 375 |
Native tasking with priorities | 12 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native --trees 12 | 392 | 413 | 365 |
Tailored task scheduling | 12 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler tailored --trees 12 | 390 | 446 | 366 |
Parallel for | 12 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler parallel-for --trees 12 | 182 | 412 | 375 |
Subtasks | 12 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler subtasks --trees 12 | 178 | 506 | 377 |
The above measurements are done on my eight core laptop with the Intel oneAPI toolchain.
Below is a script that creates all data points in one go. You will have to amend the prefix accordingly.
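The sketch below assumes that each Python call builds an executable and that each run writes its output into one file per configuration. PREFIX and the executable name peano_sbh are placeholders and have to be adapted to whatever your Python script actually produces.

```bash
#!/bin/bash
# Placeholder names: PREFIX and peano_sbh are assumptions.
PREFIX=$HOME/sbh-benchmarks
mkdir -p "$PREFIX"
for amr in "" "-amr"; do
  for scheduler in native-no-priorities native tailored parallel-for subtasks; do
    for trees in 1 6 8 12; do
      python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 $amr \
        --scheduler $scheduler --trees $trees
      ./peano_sbh > "$PREFIX/run$amr-$scheduler-$trees.out" 2>&1
    done
  done
done
```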
If you create the files as per the script above, you can extract the quantities of interest as follows:
The upscaling studies yield graphs such as the ones below, which highlight various characteristics of the code base:
The upscaling studies are triggered by the script below, which you might want to embed into a Slurm script to really acquire the data quickly. This one obviously works for OpenMP (omp) only. For TBB and SYCL, you will have to alter the Python script, as you cannot set the number of cores used through OMP_NUM_THREADS.
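A sketch of such a core-count sweep for the OpenMP build; the executable name is a placeholder, and the core counts have to match your machine.

```bash
#!/bin/bash
# Placeholder executable name; embed the loop into a Slurm job script if needed.
for threads in 1 2 4 8 16 32 64 128; do
  export OMP_NUM_THREADS=$threads
  ./peano_sbh > "upscaling-omp-$threads.out" 2>&1
done
```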
I use another script to acquire multi-node data:
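A sketch of the multi-node runs, assuming a Slurm system with one rank per node; the executable name and the node counts are placeholders.

```bash
#!/bin/bash
# Placeholder executable name and node counts.
for nodes in 1 2 4 8 16; do
  srun --nodes=$nodes --ntasks-per-node=1 ./peano_sbh > "scaling-$nodes-nodes.out" 2>&1
done
```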
The script distributes the data over several subdirectories. We have to collate them into one archive per measurement before we can apply our out-of-the-box scripts:
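A sketch of this collation step, assuming that each measurement's output ends up in one subdirectory:

```bash
# One archive per measurement; the one-directory-per-measurement layout is an assumption.
for d in */; do
  tar czf "${d%/}.tar.gz" "$d"
done
```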
Fetch the tar archives. After that, we can use the out-of-the-box Peano scripts to generate the plots, which clearly showcase that the kernel parallelisation pays off even though the overall scalability is poor with the present, rather coarse mesh:
Han's original setup uses the full-blown CCZ4 setup with
and aims for a final time of at least 46. We mirror these properties in the following configurations:
Name | Cell size | Limiter patch size | End time | Plot | AMR levels/FV patches | Required nodes | Call |
---|---|---|---|---|---|---|---|
Small test with vis | 1.0 | 6 | 60.0 | yes | 2 (h_min=0.111111) / 7 | 1 (around 5.8% of mem) | python3 convergence-study.py -plot -cs 1.0 -s fd4-rk1-limited-ps-6 -et 60.0 |
Medium test with vis | 0.5 | 6 | 60.0 | yes | 3 (h_min=0.037037) / 7 | 1 (around 40% of mem) | python3 convergence-study.py -plot -cs 0.5 -s fd4-rk1-limited-ps-6 -et 60.0 |
Large test with vis | 0.1 | 6 | 60.0 | yes | 4 (h_min=0.0123457) / 33 | 1 (around 30% of mem)* | python3 convergence-study.py -plot -cs 0.1 -s fd4-rk1-limited-ps-6 -et 60.0 |
Production test 0 | 0.05 | 6 | 60.0 | yes | 4 | x | python3 convergence-study.py -cs 0.05 -s fd4-rk1-limited-ps-6 -et 60.0 |
It is clear that the excessive use of AMR here ensures that no enclave tasks are left anymore.
I found that OpenMP is comparatively slow, while TBB does a way better job. I therefore recommend using TBB for the benchmarking in this case.
Even simple initial conditions can make our higher-order Finite Differences solver fail: the higher order solution representation is not capable of resolving shocks or steep gradients properly, and the solution therefore starts to oscillate. At the same time, experiments with Finite Volumes yield solutions which are subject to very strong numerical diffusion. The solution quality is not good; the black hole disappears.
There are two fundamental approaches to tackle this dilemma: on the one hand, we can introduce artificial damping terms, i.e. numerical diffusion, into the FD4 scheme. In the context of Finite Differences, this is usually done via Kreiss-Oliger (KO) dissipation, and ExaHyPE's FD4 scheme indeed supports KO terms. In the present case study, we however switch them off! Instead, we accept that both FD4 and FV have their pros and cons and try to use each solver where it unfolds its full potential.
The image above illustrates our concept: there is an FD4 solver which computes the solution in the whole domain. This solution is illustrated via the horizontal cut. In an area around the black hole (cross-shaped here; we visualise a slice through this domain), an additional Finite Volume solver is active. In each time step, the FD4 solver sets the boundary conditions of the Finite Volume solver. Those are the dark blue cells around the Finite Volume patches. With these coupling boundary conditions and global Sommerfeld conditions for the FD4 scheme, both solvers advance in time by one time step. After that, the Finite Volume solution overwrites the FD4 solution in points which are covered by both numerical schemes (restriction).
This coupling strategy follows the default coupling schema of ExaHyPE. Different to the generic coupling recipes there, we let the higher-order solver cover the whole domain. As FV will only be used in a tiny region, this overhead is acceptable. Furthermore, we do not use any fancy colouring to decide which solver is active where. We hard-code this information, as we know where our black hole is.
Mathematically, the Finite Volume solver serves as a priori limiter to the FD4 solution. It is a priori, as we do not wait for the FD4 solution to determine whether we need Finite Volumes: we always overwrite. For the realisation, we try to keep as much coupling logic as possible within the FD4 scheme: the limiter knows where to compute, but all the remaining logic is realised within the FD4 solver, as this one knows what the data layout of FD4 is. Furthermore, this separation of responsibilities allows us, in theory, to use the same limiter with RKDG as well.
As we combine solvers, we have to ensure that their time step sizes match. This means that the Finite Volume patch size has to be calibrated with \( p^{2}=16 \) relative to the higher order scheme's patch size. If they had the same mesh size, the higher order scheme would take significantly smaller time step sizes. However, we want to have the same spatial and temporal accuracy, so we need a smaller mesh size for the Finite Volume scheme. The file benchmarks/exahype2/ccz4/single-black-hole/convergence-study.py and the solvers within applications/exahype2/ccz/CCZ4Solver.py hold further details.
As the FD4 scheme works with subpatches per octree cell (octant) of size \( 3^d \) or \( 6^d \) respectively, we get Finite Volume patches around the centre which are of the size \( (3\cdot 16)^d = 48^d \) or \( (6\cdot 16)^d = 96^d \). They host up to 884736 volumes. They are large.
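A back-of-the-envelope reading of this calibration (a sketch; the precise CFL constants used by ExaHyPE may differ) is

\[
\Delta t_{\mathrm{FD4}} \sim C\,\frac{h_{\mathrm{FD4}}}{p^{2}},
\qquad
\Delta t_{\mathrm{FV}} \sim C\,h_{\mathrm{FV}}
\qquad\Longrightarrow\qquad
h_{\mathrm{FV}} = \frac{h_{\mathrm{FD4}}}{p^{2}} = \frac{h_{\mathrm{FD4}}}{16}
\quad\text{for } p=4,
\]

which is exactly the factor behind the \( (6\cdot 16)^3 = 96^3 = 884736 \) volumes per limiter patch quoted above.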
Although we recalibrate the patch sizes to take the different spatial discretisation orders into account, this does not yet ensure that the Finite Differences and the Finite Volume solver run in sync. The Finite Volume scheme will be subject to a higher numerical dissipation, while the Finite Differences scheme might run into numerical oscillations or instabilities. At the same time, the limiter will always run at a significantly higher resolution than its higher-order counterpart and capture the singularity. It might therefore happen that it actually yields a smaller admissible time step size. Therefore, it remains important that both solvers agree on a common time step size after each time step, even though both solvers should come up with admissible time step sizes of roughly the same order.
We try to keep all coupling and other logic within the higher order solver. Our intention is to have a generic limiter which can work with multiple solvers once we have multiple higher order schemes. As a consequence, I introduced a new routine benchmarks::exahype2::ccz4::CCZ4SBH_FV::reduceAdmissibleTimeStepSize() which allows us to reset the admissible time step size. Each higher order solver then overwrites startTimeStep(), computes the minimum over the available time step sizes, and sets this minimum as its own admissible time step size as well as the admissible time step size of the limiter.
Our FV solver is, through various inheritance levels, a subclass of peano.exahype2.solvers.fv.EnclaveTasking, which in turn inherits from peano.exahype2.solvers.fv.FV. The solver is a text-book FV scheme with one major added functionality: we add the implementation of the routines benchmarks::exahype2::ccz4::CCZ4SBH_FV::isCellOverlappingWithBHImpactArea(), benchmarks::exahype2::ccz4::CCZ4SBH_FV::areBothAdjacentCellsOverlappingWithBHImpactArea(), and their cousins. That is, we implement the actual logic of where to apply the limiter within the C++ class.
Within the Python code, we follow localisation recommendations for solver coupling and overwrite the routines
such that only data within a certain radius around the black hole are actually held. An analogous scheme holds for the faces. For the implementation of these guards, we rely on the previously introduced benchmarks::exahype2::ccz4::CCZ4SBH_FV::isCellOverlappingWithBHImpactArea() and benchmarks::exahype2::ccz4::CCZ4SBH_FV::areBothAdjacentCellsOverlappingWithBHImpactArea().
Their implementation is straightforward: we store those cells which overlap with the black hole influence region. Consequently, we only store faces persistently which have a cell on the left and on the right that actually holds data. Finally, we have to create faces temporarily - similar to hanging faces - around this Finite Volume area of interest. We need those as the Finite Volume solver will, for example, project its solution onto a patch's boundaries, even though we can throw them away afterwards, i.e. we do not store them.
It is the routine peano.exahype2.solvers.fv.FV.create_action_sets() which configures all the dynamic behaviour. After calling the superclass implementation, we modify the two guards _action_set_update_cell.guard and _action_set_merge_enclave_task_outcome.guard to ensure that the solution is only computed on those cells which hold meaningful data. Without these guards, the solver would invoke the compute kernels everywhere, even though the majority of the mesh holds only garbage data which are neither stored nor loaded persistently in-between two mesh traversals.
Finally, we raise the priorities of the Finite Volume solver's tasks. By default, every enclave task has the same (default) priority. We know however that the Finite Volume tasks are extremely expensive - notably compared to their higher order cell counterparts. In an ideal world, all enclave tasks would be postponed into a queue throughout the grid traversal except for the Finite Volume tasks, which should go straight onto a core or the GPU. To re-prioritise, we add
to the constructor.
To realise the coupling, we have to augment the FD4 solver by some additional pre- and post-processing steps:
The primary solver takes its reconstructed solution, i.e. the Q values within the patch plus its halo (of size 3 for FD) and projects this information down into the FV solver. This should happen
I initially thought that the projection could be triggered by the FD4 solver after each time step. However, this is not a good idea, as we do not have consistent face data at this point. We need the faces to hold proper halo information to be able to interpolate correctly, so the projection has to be done before the solver kicks off.
Associating it with the FD4 solver's preprocess_reconstructed_solution would be convenient, as these data hold the halo information already, i.e. they provide access to the patch data plus its halo. However, plugging into this routine makes assumptions on the evaluation order: We have to know that the FV code has not started any calculation yet. It is more convenient to let the coupling happen within either solver's _action_set_preprocess_solution action set as a preamble of its actual time step. This way, it will definitely be kicked off before we actually run into the timestepping.
Once the fifth step has terminated, the Finite Volume solver's patches have a properly filled QNew on the faces. This QNew usually holds the halo copy from neighbouring patches. Where these neighbouring patches do not exist in the FV scheme, we now have interpolated data from the FD4 solution.
Our scheme couples the FD4 solver to the Finite Volume scheme over the faces. However, we realise the coupling volumetrically, i.e. we project volumetric data from FD4 onto FV data (temporarily) and from there project the volumetric Finite Volume representation again onto the faces. This is not particularly elegant, and there might be faster realisations which omit the temporary volumetric step, but it is a realisation which can be implemented with the on-board routines, and its logic is rather clear.
The coupling from the Finite Volume scheme to FD4, in contrast, is a real volumetric coupling: we overwrite the FD4 solver's self._action_set_postprocess_solution with an instance of exahype2.solvers.rkfd.actionsets.PatchWisePostprocessSolution. This postprocessing is executed after both the Finite Volume and the Finite Differences scheme have updated their patch. This is ensured by the fact that postprocessing steps always plug into touchCellLastTime (cf. Peano's generic discussion of the order of events).
For the realisation, we employ the built-in routines. For example, we call toolbox::blockstructured::restrictCellIntoOverlappingCell_inject() for every cell which holds Finite Volume data, i.e. overlaps with the black hole. This is slightly different to ExaHyPE's generic coupling sketch, where I recommend to have at least one whole cell of overlap between any two solvers: I do not restrict the injection to Finite Volume patches which are completely surrounded by further FV patches. It seemed to me a waste of compute power to do so; rather make the black hole influence area smaller.
We have played around with different restriction variants:
I found that using more than 8 trees per rank slows down the calculations massively and introduces an enormous overhead during the domain partitioning. Therefore, I constrain the maximum number of initial trees to 8 per rank, and I also introduce a minimal tree size of 100. As the documentation of exahype2::LoadBalancingConfiguration::LoadBalancingConfiguration() clarifies, this minimal tree size is not used for the initial decomposition (this is where the 8 steps in), but it is used afterwards.
The lion's share of the compute time is burnt within the Finite Volume patches. I followed a series of optimisation steps to make this part of the code faster:
When we visualise the CPU occupation, we get an image similar to the one below.
One time step starts at 115s. We have four threads here. Their partitions are not perfectly balanced, which leads to an oversubscribed thread 3. The primary sweep is relatively cheap, as most patch updates are deployed into tasks and hence not computed directly. Around 118s, all four threads start the second grid sweep, which again is very short except on the fourth thread. It also remains a mystery why the heavy Finite Volume patch updates, which are all clearly visible (there are 7 of them), are all executed sequentially.
Things start to make sense once we study the result of tarch::multicore::orchestration::createDefaultStrategy() and run VTune once more with native threading. The default implementation always tries to hold back 16 tasks to be able to fuse them on the GPU. Unfortunately, our expensive Finite Volume patches are among these 16. Therefore, they are really only processed when a thread requests their outcome. The native task realisation maps all enclave tasks onto proper (OpenMP) tasks
and hence does a significantly better job. Finally, some of the patch updates of the limited regions are also computed in parallel. Still, we see massive imbalances.
Tracing with Otter confirms that the Finite Volume and FD4 patches are spawned as enclave tasks. However, they drop in in a totally random order, and the expensive Finite Volume tasks run the risk of not being executed in parallel but of being serialised. I hence introduce a tailored task orchestration:
Consult the documentation of the MulticoreOrchestration for information on its behaviour.
We see that the majority of the runtime is now spent in the interpolation of the FD4 solver. The higher order FD scheme is too expensive and notably serialises things. Studying the call stack reveals that this interpolation does not occur at the AMR boundaries, but it is the interpolation within the limited region which slows down the calculation. Therefore, I started to tune the routine toolbox::blockstructured::interpolateCellDataAssociatedToVolumesIntoOverlappingCell_linear(). All the (OpenMP) parallelisation in there is a result of this benchmark. These optimisations are to be enabled explicitly by the scheduler.
I next try to do the same thing with the FV kernel. Here, we follow the recipes of the page describing how to make our compute kernels use all cores. In line with the description in there, we exploit the fact that we know that all patches can be updated in an embarrassingly parallel way. Furthermore, we know that literally all patches will be stateless.