Single black hole test

Benchmark which solves for a single black hole described by a CCZ4 simulation. While the output is not particularly exciting (the simulation should settle into a stationary solution after a while), it is numerically challenging: we have sophisticated boundary condition effects, a singularity in the centre which can lead to oscillations, the need for a scheme with low numerical dissipation, and so forth. Therefore, this benchmark combines two different solvers - a dissipative Finite Volume scheme and a higher-order scheme - to obtain a compromise between accuracy and stability.

For this benchmark, the following extensions are mandatory:
  • blockstructured (--enable-blockstructured)
  • exahype2 (--enable-exahype2)
Furthermore, you will need the GNU Scientific Library (GSL) installed on your system.

Build executable

export PYTHONPATH=../../../../python:../../../../applications/exahype2/ccz4
python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native-no-priorities --trees 6

As always, the parameter --help makes the Python script yield further information, while the sections below list characteristic parameter choices. There are different scripts in the directory. They serve different purposes and support slightly different parameters.

Depending on your system, you might have to load the GSL module separately before you build the application:

module load gsl

Clean up old and temporary files

The following statements remove all the glue code and autogenerated C++ files. There is no risk in deleting them - if you rerun the Python script, they will all be reinstantiated.

rm *.o *.cmake
rm README-*.md
rm Makefile CMakeLists.txt
rm Abstract*
rm -rf celldata facedata globaldata observers tasks vertexdata repositories

To get rid of the output files, type in

rm *patch-file *vtu *.pvd *.bak *.csv

Visualise outcome

If you have set up your experiment via the Python script with the -plot flag, you can use the standard Peano scripts to convert Peano's patch files:

export PYTHONPATH=../../../../python
/opt/paraview/bin/pvpython ~/git/Peano/python/peano4/visualisation/render.py --filter-fine-grid solution-CCZ4SBH_FV.peano-patch-file
/opt/paraview/bin/pvpython ~/git/Peano/python/peano4/visualisation/render.py --filter-fine-grid solution-CCZ4SBH_FD4.peano-patch-file

The picture below is taken from ExaHyPE's coupling discussion and illustrates that the higher-order solver covers the whole domain, while the FV solver is only active in a tiny cross-shaped region within it.

Benchmark runs

Benchmarking the single black hole is not easy, as it is a rather complex setup with a rather long initialisation and grid construction phase. However, it becomes feasible once you disable the plots and the tracers and use an unreasonably coarse mesh.

For all the upscaling studies, we use the same kind of benchmarks. They can be generated in one go through

export PREFIX=tbb   # or omp or sycl
rm $PREFIX-*
function compile() {
  for SCHEDULER in native-no-priorities native tailored parallel-for subtasks subtasks-and-kernel-parallelisation
  do
    for TREES in 1 6 8 12
    do
      OUTPUT=$PREFIX-without-amr-$SCHEDULER-$TREES-trees-$CELL_SIZE
      python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-6 --cell-size $CELL_SIZE --scheduler $SCHEDULER --trees $TREES -et $END_TIME --output $OUTPUT
      OUTPUT=$PREFIX-with-amr-$SCHEDULER-$TREES-trees-$CELL_SIZE
      python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-6 --cell-size $CELL_SIZE --amr --scheduler $SCHEDULER --trees $TREES -et $END_TIME --output $OUTPUT
    done
  done
}
export CELL_SIZE=1.8
export END_TIME=1e-2
compile
export CELL_SIZE=0.6
export END_TIME=1e-3
compile
export CELL_SIZE=0.1
export END_TIME=1e-4
compile

I have played around with a lot of different parameter settings. Some insights are summarised below for a normal setup as we find it on Hamilton:

| Description | Cell size | End time | Patch size | Remarks |
| --- | --- | --- | --- | --- |
| Minimal setup | 1.8 | 1e-2 | 3 | You cannot see AMR here, and the FV patch size means that the CFL conditions do not match. This is a toy setup for a local laptop. It scales only up to 4-8 cores, as it is too small. Despite the small patch size (6 is production-mode level), we see the impact of the lack of scaling. |
| Proper minimal setup | 1.8 | 1e-2 | 6 | You cannot see AMR here either, but the FV patch size is a proper one, i.e. the black hole is resolved such that the CFL conditions of the FV and FD scheme match. |
| One node setup | 0.6 | 1e-3 | 6 | This setup is already too big to be solved without AMR on a single node. We see that normal, native tasking performs best. |
| One node setup | 0.1 | 1e-4 | 6 | Use xxx nodes for regular grid. |

Single node performance data studying task orchestration and kernel variants

Below are some performance data which I collected on my local workstation. They provide a first look and feel of how the code behaves. Please note that these results are definitely not representative, as they were acquired on an eight-core machine. Modern workstations feature more cores and hence yield qualitatively different outcomes.

No AMR

| Name | Trees | Scheduler | Python script call | TBB | OpenMP | SYCL |
| --- | --- | --- | --- | --- | --- | --- |
| (Almost) no parallel task producers | | | | | | |
| Native tasking w/o priorities | x | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native-no-priorities --trees 1 | 527 | 532 | 560 |
| Native tasking with priorities | x | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native --trees 1 | 518 | 529 | 566 |
| Tailored task scheduling | x | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler tailored --trees 1 | 516 | 534 | 562 |
| Parallel for | x | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler parallel-for --trees 1 | 307 | 637 | 560 |
| Subtasks | x | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler subtasks --trees 1 | 302 | 534 | 565 |
| Fewer BSP sections than threads | | | | | | |
| Native tasking w/o priorities | 6 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native-no-priorities --trees 6 | 428 | 457 | 698 |
| Native tasking with priorities | 6 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native --trees 6 | 430 | 456 | 597 |
| Tailored task scheduling | 6 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler tailored --trees 6 | 427 | 453 | 605 |
| Parallel for | 6 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler parallel-for --trees 6 | 225 | 454 | 600 |
| Subtasks | 6 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler subtasks --trees 6 | 223 | 453 | 605 |
| One BSP section per thread | | | | | | |
| Native tasking w/o priorities | 8 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native-no-priorities --trees 8 | x | 436 | 610 |
| Native tasking with priorities | 8 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native --trees 8 | x | 433 | 611 |
| Tailored task scheduling | 8 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler tailored --trees 8 | x | 434 | 617 |
| Parallel for | 8 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler parallel-for --trees 8 | x | x | 615 |
| Subtasks | 8 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler subtasks --trees 8 | 207 | x | 618 |
| Threads overbooked with BSP sections | | | | | | |
| Native tasking w/o priorities | 12 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native-no-priorities --trees 12 | x | 432 | 606 |
| Native tasking with priorities | 12 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler native --trees 12 | x | 435 | 607 |
| Tailored task scheduling | 12 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler tailored --trees 12 | x | 432 | 618 |
| Parallel for | 12 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler parallel-for --trees 12 | x | 435 | 611 |
| Subtasks | 12 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 --scheduler subtasks --trees 12 | 212 | 482 | 616 |

With adaptive mesh refinement

| Name | Trees | Scheduler | Python script call | TBB | OpenMP | SYCL |
| --- | --- | --- | --- | --- | --- | --- |
| (Almost) no parallel task producers | | | | | | |
| Native tasking w/o priorities | x | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native-no-priorities --trees 1 | 408 | 435 | 343 |
| Native tasking with priorities | x | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native --trees 1 | 406 | 433 | 340 |
| Tailored task scheduling | x | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler tailored --trees 1 | 415 | 434 | 341 |
| Parallel for | x | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler parallel-for --trees 1 | 194 | 432 | 341 |
| Subtasks | x | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler subtasks --trees 1 | 193 | 434 | 341 |
| Fewer BSP sections than threads | | | | | | |
| Native tasking w/o priorities | 6 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native-no-priorities --trees 6 | 398 | 420 | 365 |
| Native tasking with priorities | 6 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native --trees 6 | 396 | 419 | 363 |
| Tailored task scheduling | 6 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler tailored --trees 6 | 398 | 418 | 359 |
| Parallel for | 6 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler parallel-for --trees 6 | 186 | 419 | 357 |
| Subtasks | 6 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler subtasks --trees 6 | 183 | 420 | 361 |
| One BSP section per thread | | | | | | |
| Native tasking w/o priorities | 8 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native-no-priorities --trees 8 | 390 | 413 | 374 |
| Native tasking with priorities | 8 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native --trees 8 | 386 | 414 | 364 |
| Tailored task scheduling | 8 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler tailored --trees 8 | 390 | 413 | 373 |
| Parallel for | 8 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler parallel-for --trees 8 | 178 | 412 | 374 |
| Subtasks | 8 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler subtasks --trees 8 | 181 | 414 | 375 |
| Threads overbooked with BSP sections | | | | | | |
| Native tasking w/o priorities | 12 | native-no-priorities | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native-no-priorities --trees 12 | 391 | 415 | 375 |
| Native tasking with priorities | 12 | native | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler native --trees 12 | 392 | 413 | 365 |
| Tailored task scheduling | 12 | tailored | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler tailored --trees 12 | 390 | 446 | 366 |
| Parallel for | 12 | parallel-for | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler parallel-for --trees 12 | 182 | 412 | 375 |
| Subtasks | 12 | subtasks | python3 performance-studies.py -lbm tailored -s fd4-rk1-limited-ps-3 -amr --scheduler subtasks --trees 12 | 178 | 506 | 377 |

The above measurements were taken on my eight-core laptop with the Intel oneAPI toolchain.

Script to create these data

Below is a script that creates all data points in one go. You will have to amend the prefix accordingly.

export PREFIX=tbb   # or omp or sycl
export CELL_SIZE=1.8
export RESULTS_FOLDER=results-single-node-orchestration-benchmarks
mkdir $RESULTS_FOLDER
for SCHEDULER in native-no-priorities native tailored parallel-for subtasks subtasks-and-kernel-parallelisation
do
  for TREES in 1 6 8 12
  do
    OUTPUT=$PREFIX-without-amr-$SCHEDULER-$TREES-trees-$CELL_SIZE
    ./$OUTPUT > $RESULTS_FOLDER/$OUTPUT.txt
    OUTPUT=$PREFIX-with-amr-$SCHEDULER-$TREES-trees-$CELL_SIZE
    ./$OUTPUT > $RESULTS_FOLDER/$OUTPUT.txt
  done
done

If you create the files as per the script above, you can extract the quantities of interest as follows:

export PREFIX=tbb   # or omp or sycl
export CELL_SIZE=1.8
export RESULTS_FOLDER=results-single-node-orchestration-benchmarks
OUTPUT=$RESULTS_FOLDER/$PREFIX.txt
echo $PREFIX > $OUTPUT
for AMR in without-amr with-amr
do
  for TREES in 1 6 8 12
  do
    for SCHEDULER in native-no-priorities native tailored parallel-for subtasks subtasks-and-kernel-parallelisation
    do
      FILE=$RESULTS_FOLDER/$PREFIX-$AMR-$SCHEDULER-$TREES-trees-$CELL_SIZE.txt
      echo $FILE >> $OUTPUT
      grep "time stepping:" $FILE >> $OUTPUT
    done
  done
done
echo wrote all data to $OUTPUT

Upscaling studies

The upscaling studies yield graphs such as the ones below, which highlight various characteristics of the code base:

SLURM scripts

The upscaling studies are triggered by the script below, which you might want to embed into a Slurm script to really acquire the data quickly. This one obviously works for OpenMP (omp) only. For TBB and SYCL, you will have to alter the Python script, as you cannot set the number of cores used through OMP_NUM_THREADS.

#!/bin/bash
#SBATCH --job-name=upscaling
#SBATCH -p multi
#SBATCH --nodes=1
#SBATCH -n 1
#SBATCH -c 128
#SBATCH --time=48:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=tobias.weinzierl@durham.ac.uk
#SBATCH --array=1,2,4,8,16,24,32,40,48,56,64,80,96,128
source /etc/profile.d/modules.sh
# adapt to your build/system
module purge
module load oneapi/2022.2
export FLAVOUR_NOCONFLICT=1
module load gcc
module load intelmpi
export I_MPI_CXX=icpx
# switch to tbb or sycl
export PREFIX=omp
export TREES=8
export THREADS=$SLURM_ARRAY_TASK_ID
# this is how things work with OpenMP; it will be different for TBB and
# SYCL, where you might have to adapt the underlying Python script
export OMP_NUM_THREADS=$THREADS
export OMP_PROC_BIND=close
export RESULTS_FOLDER=results-$THREADS-threads
mkdir $RESULTS_FOLDER
for CELL_SIZE in 1.8 0.6 0.1
do
  for SCHEDULER in native subtasks subtasks-and-kernel-parallelisation
  do
    OUTPUT=$PREFIX-without-amr-$SCHEDULER-$TREES-trees-$CELL_SIZE
    echo run $OUTPUT
    ./$OUTPUT > $RESULTS_FOLDER/$OUTPUT.txt
    OUTPUT=$PREFIX-with-amr-$SCHEDULER-$TREES-trees-$CELL_SIZE
    echo run $OUTPUT
    ./$OUTPUT > $RESULTS_FOLDER/$OUTPUT.txt
  done
done

I use another script to acquire multi-node data:

#!/bin/bash
#SBATCH --job-name=upscaling
#SBATCH -p multi
#SBATCH --nodes=1
#SBATCH -n 1
#SBATCH -c 128
#SBATCH --time=48:00:00
#SBATCH --mail-type=END
#SBATCH --mail-user=tobias.weinzierl@durham.ac.uk
#SBATCH --array=1,2,4,8,16,24,32,40,48,56,64,80,96,128
source /etc/profile.d/modules.sh
# adapt to your build/system
module purge
module load oneapi/2022.2
export FLAVOUR_NOCONFLICT=1
module load gcc
module load intelmpi
export I_MPI_CXX=icpx
# switch to tbb or sycl
export PREFIX=omp
export TREES=8
export RESULTS_FOLDER=results-$SLURM_NTASKS-nodes
mkdir $RESULTS_FOLDER
for CELL_SIZE in 1.8 0.6 0.1
do
  for SCHEDULER in native subtasks subtasks-and-kernel-parallelisation
  do
    OUTPUT=$PREFIX-without-amr-$SCHEDULER-$TREES-trees-$CELL_SIZE
    echo run $OUTPUT
    ./$OUTPUT > $RESULTS_FOLDER/$OUTPUT.txt
    OUTPUT=$PREFIX-with-amr-$SCHEDULER-$TREES-trees-$CELL_SIZE
    echo run $OUTPUT
    ./$OUTPUT > $RESULTS_FOLDER/$OUTPUT.txt
  done
done

Data gathering

The script distributes the data over several subdirectories. We have to collate them into one archive per measurement before we can apply our out-of-the-box scripts:

export PREFIX=omp
export TREES=8
export CELL_SIZE=1.8   # has to match the cell size of the runs you want to gather
for SCHEDULER in native subtasks subtasks-and-kernel-parallelisation
do
  rm $PREFIX-$SCHEDULER-without-amr-$TREES-trees.tar.gz
  rm $PREFIX-$SCHEDULER-with-amr-$TREES-trees.tar.gz
  tar -cf $PREFIX-$SCHEDULER-without-amr-$TREES-trees.tar --files-from /dev/null
  tar -cf $PREFIX-$SCHEDULER-with-amr-$TREES-trees.tar --files-from /dev/null
  for THREADS in 1 2 4 8 16 24 32 40 48 56 64 80 96 128
  do
    cp results-$THREADS-threads/$PREFIX-without-amr-$SCHEDULER-$TREES-trees-$CELL_SIZE.txt $THREADS-threads.txt
    tar -rf $PREFIX-$SCHEDULER-without-amr-$TREES-trees.tar $THREADS-threads.txt
    cp results-$THREADS-threads/$PREFIX-with-amr-$SCHEDULER-$TREES-trees-$CELL_SIZE.txt $THREADS-threads.txt
    tar -rf $PREFIX-$SCHEDULER-with-amr-$TREES-trees.tar $THREADS-threads.txt
  done
  gzip $PREFIX-$SCHEDULER-without-amr-$TREES-trees.tar
  gzip $PREFIX-$SCHEDULER-with-amr-$TREES-trees.tar
done

Visualisation

Get the tar archives. After that, we can use the out-of-the-box Peano scripts to generate the plots

export PREFIX=omp
export TREES=8
OUTPUT=$PREFIX-$TREES-trees-without-amr
python3 ../../../../python/exahype2/postprocessing/plot-scaling.py --max-cores-per-rank=-1 --log-x --log-y --output $OUTPUT \
  --label "native tasking, subtasks, subtasks and parallel kernels" --title "No AMR" \
  $PREFIX-native-without-amr-$TREES-trees.tar.gz \
  $PREFIX-subtasks-without-amr-$TREES-trees.tar.gz \
  $PREFIX-subtasks-and-kernel-parallelisation-without-amr-$TREES-trees.tar.gz
export OUTPUT=$PREFIX-$TREES-trees-with-amr
python3 ../../../../python/exahype2/postprocessing/plot-scaling.py --max-cores-per-rank=-1 --log-x --log-y --output $OUTPUT \
  --label "native tasking, subtasks, subtasks and parallel kernels" --title "AMR, patch size 3, cell size 1.8" \
  $PREFIX-native-with-amr-$TREES-trees.tar.gz \
  $PREFIX-subtasks-with-amr-$TREES-trees.tar.gz \
  $PREFIX-subtasks-and-kernel-parallelisation-with-amr-$TREES-trees.tar.gz

The resulting plots clearly showcase that the kernel parallelisation pays off, even though the overall scalability is poor with the present, rather coarse mesh:

Production runs

Han's original setup uses the full-blown CCZ4 setup with

python3 ccz4.py -ext adm -impl fd4-rk1-adaptive -s single-puncture -maxh 0.4 -minh 0.04 -ps 6 -plt 0.5 -et 100 -exn sbh27 --domain_r 9.0 --ReSwi 7 -cfl 0.1 -outdir ./ --KOSigma 8.0 -sommerfeld

and aims for a final time of at least 46. We mirror these properties in the following configurations:

| Name | Cell size | Limiter patch size | End time | Plot | AMR levels/FV patches | Required nodes | Call |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Small test with vis | 1.0 | 6 | 60.0 | yes | 2 (h_min=0.111111) / 7 | 1 (around 5.8% of mem) | python3 convergence-study.py -plot -cs 1.0 -s fd4-rk1-limited-ps-6 -et 60.0 |
| Medium test with vis | 0.5 | 6 | 60.0 | yes | 3 (h_min=0.037037) / 7 | 1 (around 40% of mem) | python3 convergence-study.py -plot -cs 0.5 -s fd4-rk1-limited-ps-6 -et 60.0 |
| Large test with vis | 0.1 | 6 | 60.0 | yes | 4 (h_min=0.0123457) / 33 | 1 (around 30% of mem)* | python3 convergence-study.py -plot -cs 0.1 -s fd4-rk1-limited-ps-6 -et 60.0 |
| Production test 0 | 0.05 | 6 | 60.0 | yes | 4 | x | python3 convergence-study.py -cs 0.05 -s fd4-rk1-limited-ps-6 -et 60.0 |
Todo
The calls above are not valid anymore. See the benchmark documentation further down, which works.

It is clear that the excessive use of AMR here ensures that no enclave tasks are left anymore.

I found that OpenMP is comparatively slow, while TBB does a way nicer job. I therefore recommend using TBB for this benchmark.

Code remarks (FD4-FV scheme)

Even simple initial conditions can make our higher-order Finite Differences solver fail: the higher-order solution representation is not capable of resolving shocks (or steep gradients) properly, and the solution therefore starts to oscillate. At the same time, experiments with Finite Volumes yield solutions which are subject to very strong numerical diffusion. The solution quality is not good: the black hole disappears.

There are two fundamental approaches to tackle this dilemma: On the one hand, we can introduce artificial damping terms, i.e. numerical diffusion, into the FD4 scheme. In the context of Finite Differences, this is usually done via Kreiss-Oliger (KO) dissipation. ExaHyPE's FD4 scheme indeed supports KO terms. In the present case study, however, we switch them off! Instead, we accept that both FD4 and FV have their pros and cons and try to use each solver where it unfolds its full potential.

Todo
Remove the KO terms here

The image above illustrates our concept: There is an FD4 solver which computes the solution in the whole domain. This solution is illustrated via the horizontal cut. In an area around the black hole (cross-shaped here; we visualise a slice through this domain), an additional Finite Volume solver is active. In each time step, the FD4 solver sets the boundary conditions of the Finite Volume solver. Those are the dark blue cells around the Finite Volume patches. With these coupling boundary conditions and global Sommerfeld conditions for the FD4 scheme, both solvers advance in time by one time step. After that, the Finite Volume scheme overwrites the FD4 solution in points which are covered by both numerical schemes (restriction).

This coupling strategy follows the default coupling schema of ExaHyPE. Different from the generic coupling recipes there, we let the higher-order solver cover the whole domain. As FV will only be used in a tiny region, this overhead is acceptable. Furthermore, we do not use any fancy colouring to decide which solver is active where. We hard-code this information, as we know where our black hole is.

Mathematically, the Finite Volume solver serves as a priori limiter to the FD4 solution. It is a priori, as we do not wait for the FD4 solution to determine if we need Finite Volumes. We overwrite always. For the realisation, we try to keep as much coupling logic as possible within the FD4 scheme: The limiter knows where to compute, but all the remaining logic is realised with the FD4 solver, as this one knows what the data layout of FD4 is. Furthermore, this separation of responsibilities allows us, in theory, to use the same limiter with RKDG as well.

Time step size

As we combine solvers, we have to ensure that their time step sizes match. This means that the Finite Volume patch size has to be calibrated with a factor of \( p^{2}=16 \) relative to the higher-order solver's patch size. If they had the same mesh size, the higher-order scheme would take significantly smaller time steps. However, we want to have the same spatial and temporal accuracy, so we need a smaller mesh size for the Finite Volume scheme. The file benchmarks/exahype2/ccz4/single-black-hole/convergence-study.py and the solvers within applications/exahype2/ccz4/CCZ4Solver.py hold further details.
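As a back-of-the-envelope sanity check (a sketch only, assuming the commonly used \( \Delta t \propto h/p^{2} \) stability scaling for a \( p \)-th order scheme, which the factor above appeals to), the two admissible time step sizes match if the FV mesh is chosen \( p^{2} \) times finer:

\[
  \Delta t_{\mathrm{FD4}} \approx C\,\frac{h_{\mathrm{FD4}}}{p^{2}},
  \qquad
  \Delta t_{\mathrm{FV}} \approx C\,h_{\mathrm{FV}}
  \qquad\Longrightarrow\qquad
  \Delta t_{\mathrm{FD4}} = \Delta t_{\mathrm{FV}}
  \;\Leftrightarrow\;
  h_{\mathrm{FV}} = \frac{h_{\mathrm{FD4}}}{p^{2}} = \frac{h_{\mathrm{FD4}}}{16}.
\]

Since both solvers live on the same octree cells, this factor of 16 per dimension translates directly into the patch-size recalibration below, e.g. a \( 6^d \) FD4 patch paired with a \( 96^d \) FV patch.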

As the FD4 scheme works with subpatches per octree cell (octant) of size \( 3^d \) or \( 6^d \) respectively, we get Finite Volume patches around the centre which are of the size \( (3\cdot 16)^d = 48^d \) or \( (6\cdot 16)^d = 96^d \). They host up to 884736 volumes. They are large.

Although we recalibrate the patch sizes to take the different spatial discretisation orders into account, this does not yet ensure that the Finite Differences and Finite Volume solvers run in sync. The Finite Volume scheme will be subject to a higher numerical dissipation, while the Finite Differences scheme might run into numerical oscillations/instabilities. At the same time, the limiter will always run at a significantly higher resolution than its higher-order counterpart and capture the singularity. It might therefore happen that it actually yields a smaller admissible time step size. Therefore, it remains important that both solvers agree on a common time step size after each time step, even though both solvers should come up with admissible time step sizes of roughly the same order.

We try to keep all coupling and other logic within the higher-order solver. Our intention is to have a generic limiter which can work with multiple solvers, once we have multiple higher-order schemes. As a consequence, I introduced a new routine benchmarks::exahype2::ccz4::CCZ4SBH_FV::reduceAdmissibleTimeStepSize() which allows us to reset the admissible time step size. Each higher-order solver then overwrites startTimeStep(), computes the minimum over the available time step sizes, and sets this minimum as its own admissible step size as well as the admissible time step size of the limiter.
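A minimal sketch of this reduction step is given below. The member and accessor names as well as the limiter handle are assumptions for illustration; the generated solver classes differ in detail, and only reduceAdmissibleTimeStepSize() is taken from the text above.

#include <algorithm>

// Hypothetical stand-in for benchmarks::exahype2::ccz4::CCZ4SBH_FV.
struct Limiter {
  double admissibleTimeStepSize;
  // Mirrors reduceAdmissibleTimeStepSize(): never enlarges the current value.
  void reduceAdmissibleTimeStepSize(double dt) {
    admissibleTimeStepSize = std::min(admissibleTimeStepSize, dt);
  }
};

// Sketch of the higher-order solver's startTimeStep() extension (names are assumptions).
struct FD4Solver {
  double   admissibleTimeStepSize;
  Limiter* limiter;

  void startTimeStep() {
    // Both solvers have to march with the same time step size, so take the minimum
    // of what FD4 and the FV limiter individually consider admissible ...
    const double commonTimeStepSize =
      std::min(admissibleTimeStepSize, limiter->admissibleTimeStepSize);
    // ... and impose it on both of them before the actual time step starts.
    admissibleTimeStepSize = commonTimeStepSize;
    limiter->reduceAdmissibleTimeStepSize(commonTimeStepSize);
  }
};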

The Finite Volume solver

Our FV solver is, through various inheritance levels, a subclass of peano.exahype2.solvers.fv.EnclaveTasking, which in turn inherits from peano.exahype2.solvers.fv.FV. The solver is a text-book FV scheme with one major piece of added functionality: we add the implementation of the routines benchmarks::exahype2::ccz4::CCZ4SBH_FV::isCellOverlappingWithBHImpactArea(), benchmarks::exahype2::ccz4::CCZ4SBH_FV::areBothAdjacentCellsOverlappingWithBHImpactArea(), and their cousins. That is, we implement the actual logic of where to apply the limiter within the C++ class.
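Geometrically, such a predicate boils down to a box-sphere intersection test with the puncture sitting at the origin. The sketch below is an illustration only; the influence radius value and the argument types are assumptions, not the actual CCZ4SBH_FV implementation.

#include <algorithm>
#include <array>

namespace {
  // Assumed influence radius around the puncture at the origin (illustrative value only).
  constexpr double BlackHoleInfluenceRadius = 1.0;
}

// A cell given by its centre and its extents overlaps with the black hole impact
// area iff the closest point of the cell's bounding box to the origin lies within
// the influence radius.
bool isCellOverlappingWithBHImpactArea(
  const std::array<double,3>& cellCentre,
  const std::array<double,3>& cellSize
) {
  double distanceSquared = 0.0;
  for (int d=0; d<3; d++) {
    const double lower = cellCentre[d] - 0.5*cellSize[d];
    const double upper = cellCentre[d] + 0.5*cellSize[d];
    // Clamp the origin's coordinate onto the cell's extent in this direction.
    const double closest = std::clamp(0.0, lower, upper);
    distanceSquared += closest*closest;
  }
  return distanceSquared <= BlackHoleInfluenceRadius*BlackHoleInfluenceRadius;
}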

Within the Python code, we follow localisation recommendations for solver coupling and overwrite the routines

  • _store_cell_data_default_guard()
  • _load_cell_data_default_guard()
  • _provide_cell_data_to_compute_kernels_default_guard()

such that only data within a certain radius around the black hole are actually held. An analogous scheme holds for the faces. For the implementation of these guards, we rely on the previously introduced benchmarks::exahype2::ccz4::CCZ4SBH_FV::isCellOverlappingWithBHImpactArea() and benchmarks::exahype2::ccz4::CCZ4SBH_FV::areBothAdjacentCellsOverlappingWithBHImpactArea().

Their implementation is straightforward: We store those cells which overlap with the black hole influence region. Consequently, we only store faces persistently which have a cell on either side that actually holds data. Finally, we have to create faces temporarily - similar to hanging faces - around this Finite Volume area of interest. We need those as the Finite Volume solver will, for example, project its solution onto a patch's boundaries, even though we can throw them away afterwards, i.e. we do not store them.

It is the routine peano.exahype2.solvers.fv.FV.create_action_sets() which configures all the dynamic behaviour. After calling the superclass implementation, we modify the two guards _action_set_update_cell.guard and _action_set_merge_enclave_task_outcome.guard to ensure that the solution is also only computed on those cells which hold meaningful data. Without these guards, the solver would invoke the compute kernels everywhere, even though the majority of the mesh holds only garbage data which is neither stored nor loaded persistently in-between two mesh traversals.

Finally, we raise the priorities of the Finite Volume solver's tasks. By default, every enclave task has the same (default) priority. We know, however, that the Finite Volume tasks are extremely expensive - notably compared to their higher-order cell counterparts. In an ideal world, all enclave tasks would be postponed into a queue throughout the grid traversal, except for the Finite Volume tasks, which should go straight onto a core or the GPU. To re-prioritise, we add

self.enclave_task_priority = "tarch::multicore::Task::DefaultPriority+1"

to the constructor.

The FD4 solver

FD4-FV coupling (projection)

To realise the coupling, we have to augment the FD4 solver by some additional pre- and post-processing steps:

  1. We add Sommerfeld boundary conditions. This is important, as the FD4 solver will be responsible to simulate the solution close to the domain boundaries.
  2. The primary solver takes its reconstructed solution, i.e. the Q values within the patch plus its halo (of size 3 for FD) and projects this information down into the FV solver. This should happen

    • before the FV solver kicks off
    • using consistent FD4 data within the cell and on the faces

    I initially thought that the projection could be triggered by the FD4 solver after each time step. However, this is not a good idea, as we do not have consistent face data at this point. We need the faces to have proper halo information to be able to interpolate correctly. So it has to be done before the solver kicks off.

    Associating it with the FD4 solver's preprocess_reconstructed_solution would be convenient, as these data hold the halo information already, i.e. they provide access to the patch data plus its halo. However, plugging into this routine makes assumptions on the evaluation order: We have to know that the FV code has not started any calculation yet. It is more convenient to let the coupling happen within either solver's _action_set_preprocess_solution action set as a preamble of its actual time step. This way, it will definitely be kicked off before we actually run into the timestepping.

  3. The preprocessing has to be of type exahype2.solvers.rkfd.actionsets.PreprocessReconstructedSolutionWithHalo as we need the area around the FD4 patch to interpolate into all the "edge" volumes of the Finite Volume patch. This PreprocessReconstructedSolutionWithHalo is itself a subclass of peano4.toolbox.blockstructured.ReconstructPatchAndApplyFunctor and reconstructs the FD4 solver's halo (temporarily) into a field called oldQWithHalo before it invokes a functor on the reconstructed data.
  4. We use oldQWithHalo to project the solution down into a temporary FV patch. The temporary patch is called interpolatedFVDataWithHalo, and the actual interpolation is realised through toolbox::blockstructured::interpolateCellDataAssociatedToVolumesIntoOverlappingCell_linear(). At the end of the functor, we free the interpolatedFVDataWithHalo again.
  5. Before we free it, we take interpolatedFVDataWithHalo and map it onto the FV faces for those faces which are not in-between two FV cells. For this, we utilise toolbox::blockstructured::projectPatchHaloOntoFaces. This routine expects 2d face data always. For the faces in-between two FV patches which we don't want to overwrite, we pass in some temporary arrays just to make the function work, but we throw these temporary face data away immediately afterwards.

Once the fifth step has terminated, the Finite Volume solver's patches have a properly filled QNew on the faces. This QNew usually holds the halo copy from neighbouring patches. Where these neighbouring patches do not exist in the FV scheme, we now have interpolated data from the FD4 solution.

Our scheme couples the FD4 solver to the Finite Volume scheme over the faces. However, we realise the coupling volumetrically, i.e. we project volumetric data from FD4 onto FV data (temporarily), and from there we project the volumetric Finite Volume representation again onto the faces. This is not particularly elegant, and there might be way faster realisations which omit the temporary volumetric step, but it is a realisation which can be implemented with the on-board routines, and its logic is rather clear.
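To make the volumetric part of this projection concrete, the sketch below interpolates a cell-centred coarse field (think of the reconstructed FD4 data plus halo) trilinearly into a finer, overlapping patch of volume-centred values. It is a simplified stand-in for toolbox::blockstructured::interpolateCellDataAssociatedToVolumesIntoOverlappingCell_linear() with assumed layouts (one scalar unknown, lexicographic ordering over the unit cell) and is meant to convey the idea only.

#include <cmath>
#include <vector>

// Interpolate a scalar, cell-centred coarse field with halo into a finer patch
// covering the same unit cell [0,1]^3. Lexicographic (x fastest) ordering assumed.
//   coarse   : (n+2*halo)^3 values, halo layers included
//   n, halo  : coarse cells per dimension inside the cell, halo width (>=1)
//   fine     : N^3 values to be filled
//   N        : fine volumes per dimension
void interpolateIntoOverlappingPatch_trilinear(
  const std::vector<double>& coarse, int n, int halo,
  std::vector<double>&       fine,   int N
) {
  const double H      = 1.0/n;        // coarse spacing
  const double h      = 1.0/N;        // fine spacing
  const int    stride = n + 2*halo;

  auto coarseAt = [&](int i, int j, int k) {
    return coarse[ (i+halo) + stride*((j+halo) + stride*(k+halo)) ];
  };

  for (int k=0; k<N; k++)
  for (int j=0; j<N; j++)
  for (int i=0; i<N; i++) {
    // Position of the fine volume centre expressed in coarse cell-centre coordinates.
    const double u[3] = { (i+0.5)*h/H - 0.5, (j+0.5)*h/H - 0.5, (k+0.5)*h/H - 0.5 };
    int    base[3];
    double w[3];
    for (int d=0; d<3; d++) {
      base[d] = static_cast<int>(std::floor(u[d]));
      w[d]    = u[d] - base[d];
    }
    // Trilinear combination of the eight surrounding coarse cell centres;
    // the halo (halo>=1) guarantees that all eight values exist.
    double value = 0.0;
    for (int c=0; c<8; c++) {
      const int di = c & 1, dj = (c>>1) & 1, dk = (c>>2) & 1;
      const double weight = (di ? w[0] : 1.0-w[0])
                          * (dj ? w[1] : 1.0-w[1])
                          * (dk ? w[2] : 1.0-w[2]);
      value += weight * coarseAt(base[0]+di, base[1]+dj, base[2]+dk);
    }
    fine[ i + N*(j + N*k) ] = value;
  }
}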

FV-FD4 coupling (restriction)

The coupling from the Finite Volume scheme to FD4, in contrast, is a real volumetric coupling: We overwrite the FD4 solver's self._action_set_postprocess_solution with an instance of exahype2.solvers.rkfd.actionsets.PatchWisePostprocessSolution. This postprocessing is executed after both the Finite Volume and the Finite Differences scheme have updated their patch. This is ensured by the fact that postprocessing steps always plug into touchCellLastTime (cmp. Peano's generic discussion of the order of events).

For the realisation, we employ the built-in routines. For example, we call toolbox::blockstructured::restrictCellIntoOverlappingCell_inject() for every cell which holds Finite Volume data, i.e. overlaps with the black hole. This is slightly different from ExaHyPE's generic coupling sketch, where I recommend having at least one whole cell of overlap between any two solvers: I do not only inject data back from Finite Volume patches which are completely surrounded by further FV patches. It seemed to me a waste of compute power to do so; rather make the black hole influence area smaller.

We have played around with different restriction variants:

  • toolbox::blockstructured::restrictCellIntoOverlappingCell_inject(): If we use a plain injection, we get a reasonably stable coupling, but the solution starts to show a kink after a while where the FD4 domain meets the limiter area. This effect can be mitigated by making the FV domain larger, but this leads to stronger numerical dissipation and therefore harms the solution.
  • toolbox::blockstructured::restrictCellIntoOverlappingCell_inject_and_average(): This variant accepts that FD4 delivers a reasonable numerical answer (albeit likely unstable) and uses the FV solution to drag it towards a different solution, i.e. to correct it, as we always take the average of both solvers and continue to work with this one.
  • Finally, we can use a hybrid of the two worlds, where the very centre of the domain is determined by FV only, but the area around it results from averaging (see the sketch after this list).
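A minimal sketch of the hybrid variant is given below, per grid point and for one scalar; the radius thresholds and the signature are assumptions for illustration, not the toolbox API. Inside an inner radius, the restricted FV value replaces the FD4 value; in a transition shell, the two are averaged; outside, the FD4 value is kept.

#include <cmath>

// Hybrid restriction of one (already injected/restricted) FV value into the FD4
// field at a grid point x. The radii are illustrative assumptions.
double hybridRestriction(
  double fd4Value,        // current FD4 value at this grid point
  double injectedFVValue, // FV value restricted to this grid point
  const double x[3]       // position of the grid point; puncture sits at the origin
) {
  constexpr double innerRadius      = 0.5;  // pure FV region around the puncture
  constexpr double transitionRadius = 1.0;  // averaging shell around it

  const double r = std::sqrt(x[0]*x[0] + x[1]*x[1] + x[2]*x[2]);

  if (r <= innerRadius) {
    // Very centre: trust the Finite Volume limiter exclusively (injection).
    return injectedFVValue;
  }
  else if (r <= transitionRadius) {
    // Transition shell: drag the FD4 solution towards the FV solution (average).
    return 0.5 * (fd4Value + injectedFVValue);
  }
  // Outside the influence area: keep the higher-order FD4 solution.
  return fd4Value;
}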

Optimisation

Domain decomposition

I found that using more than 8 trees per rank slows down the calculations massively and introduces an enormous overhead when we do the domain partitioning. Therefore, I constrain the maximum number of initial trees to 8 per rank, and I also introduce a minimal tree size of 100. As the documentation of exahype2::LoadBalancingConfiguration::LoadBalancingConfiguration() clarifies, this minimal tree size is not used for the initial decomposition (this is where the 8 kicks in), but it is used afterwards.

High level code tuning

The lion's share of the compute time is burnt within the Finite Volume patches. I followed a series of optimisation steps to make this part of the code faster:

  1. I switched to a heap-based data storage.
  2. I notified the solver that all the ncp, eigenvalue and source terms are side-effect free, i.e. they do not alter any global state and write solely the patch outcome.
  3. I alter the generated Makefile manually. That is, I replace the solver target with
    solver: $(CXX_OBJS) $(FORTRAN_OBJS)
        rm tasks/*.o
        $(CXX) $(CXXFLAGS) -qopt-report=3 -c tasks/CCZ4SBH_FD4EnclaveTask.cpp -o tasks/CCZ4SBH_FD4EnclaveTask.o
        $(CXX) $(CXXFLAGS) -qopt-report=3 -c tasks/CCZ4SBH_FVEnclaveTask.cpp -o tasks/CCZ4SBH_FVEnclaveTask.o
        $(CXX) $(FORTRAN_MODULE_OBJS) $(FORTRAN_OBJS) $(CXX_OBJS) $(LDFLAGS) $(CU_OBJS) $(LIBS) -o peano_sbh_2.0
    This follows Peano's recommendations on how to obtain meaningful optimisation reports. However, it will result in many warnings, as the compiler will be unable to inline and consequently vectorise the required static functions.

Concurrency analysis with default task orchestration

When we visualise the CPU occupation, we get an image similar to the one below.

One time step starts at 115s. We have four threads here. Their partitions are not perfectly balanced, which leads to an oversubscribed thread 3. The primary sweep is relatively cheap, as most patch updates are deployed into tasks and hence not computed directly. Around 118s, all four threads start the second grid sweep, which again is very short except on the fourth thread. It also remains a mystery why the heavy Finite Volume patch updates, which are all clearly visible (there are 7 of them), are all executed sequentially.

Things start to make sense once we study the result of tarch::multicore::orchestration::createDefaultStrategy() and run VTune once more with native threading. The default implementation always tries to hold back 16 tasks to be able to fuse them on the GPU. Unfortunately, our expensive Finite Volume patches are among these 16. Therefore, they are really only processed when a thread requests their outcome. The native task realisation maps all enclave tasks onto proper (OpenMP) tasks and hence does a significantly better job. Finally, some of the patch updates of the limited regions are also computed in parallel. Still, we see those massive imbalances.

Tracing with Otter confirms that the Finite Volume and FD4 patches are spawned as enclave tasks. However, they drop in in a totally random order, and the expensive Finite Volume tasks run the risk of not being run in parallel but being serialised. I hence introduce a tailored task orchestration:

Consult the documentation of the MulticoreOrchestration for information on its behaviour.
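The core idea can be sketched independently of the tarch::multicore API (everything below is a conceptual illustration, not Peano code): release the expensive, high-priority limiter tasks to the cores first, so they cannot end up serialised behind a backlog of cheap FD4 enclave tasks.

#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Conceptual enclave task: FV tasks carry a higher priority than FD4 ones,
// mirroring the DefaultPriority+1 choice from the Finite Volume solver section.
struct EnclaveTask {
  int                   priority;
  std::function<void()> compute;
};

void orchestrate(std::vector<EnclaveTask>& spawnedTasks) {
  // Highest priority first, so the expensive FV patch updates hit the cores early
  // instead of waiting behind a queue of cheap FD4 tasks.
  std::stable_sort(spawnedTasks.begin(), spawnedTasks.end(),
    [](const EnclaveTask& a, const EnclaveTask& b) { return a.priority > b.priority; });

  // Dynamic scheduling lets idle threads grab the next (lower-priority) task while
  // others are still busy with the long-running limiter tasks.
  #pragma omp parallel for schedule(dynamic)
  for (std::size_t i=0; i<spawnedTasks.size(); i++) {
    spawnedTasks[i].compute();
  }
}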

Optimise the actual compute kernels

We see that the majority of the runtime is now spent in the interpolation of the FD4 solver. The higher-order FD scheme is too expensive and notably serialises the execution. Studying the call stack reveals that this interpolation does not occur at the AMR boundaries; it is the interpolation within the limited region which slows down the calculation. Therefore, I started to tune the routine toolbox::blockstructured::interpolateCellDataAssociatedToVolumesIntoOverlappingCell_linear(). All the (OpenMP) parallelisation in there is a result of this benchmark. These optimisations have to be enabled explicitly by the scheduler.
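The essence of that tuning can be sketched as follows (an illustration under the assumption of one scalar unknown and a lexicographic layout; the real toolbox routine and its scheduler switch differ): every target volume is written exactly once and the per-volume work only reads shared, constant input, so the loop nest over the fine volumes is embarrassingly parallel and can be collapsed across threads.

#include <cstddef>
#include <functional>
#include <vector>

// Apply an arbitrary per-volume computation (e.g. the trilinear interpolation from
// the sketch further up) to every volume of an N^3 patch, using all cores.
void forEachFineVolume(
  std::vector<double>&                              fine,
  int                                               N,
  const std::function<double(int i, int j, int k)>& computeVolume
) {
  // No two iterations write the same entry, hence no synchronisation is needed.
  #pragma omp parallel for collapse(2) schedule(static)
  for (int k=0; k<N; k++)
  for (int j=0; j<N; j++)
  for (int i=0; i<N; i++) {
    fine[ i + N*(j + N*k) ] = computeVolume(i, j, k);
  }
}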

I next tried to do the same thing with the FV kernel. Here, we follow the recipes of the page describing how to make our compute kernels use all cores. In line with the description there, we exploit the fact that all patches can be updated in an embarrassingly parallel way. Furthermore, we know that literally all patch updates are stateless.
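The sketch below illustrates what this assumption buys us, with a hypothetical Patch container and a placeholder update function rather than the generated ExaHyPE kernels: a stateless, pure per-patch update can be dispatched over all patches of a batch with a plain parallel loop (or offloaded to a GPU).

#include <cstddef>
#include <vector>

struct Patch {
  std::vector<double> Q;     // current solution of this patch
  std::vector<double> QNew;  // updated solution to be written
};

// Stateless per-patch update: reads only its input, writes only its output and
// touches no global state, so it is safe to run concurrently.
void updatePatch(const std::vector<double>& Q, std::vector<double>& QNew) {
  for (std::size_t v=0; v<Q.size(); v++) {
    QNew[v] = Q[v];   // placeholder for the actual FV update of one patch
  }
}

// All patches of one batch can hence be processed in an embarrassingly parallel way.
void updateAllPatches(std::vector<Patch>& patches) {
  #pragma omp parallel for schedule(dynamic)
  for (std::size_t p=0; p<patches.size(); p++) {
    updatePatch(patches[p].Q, patches[p].QNew);
  }
}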