This page describes how you introduce a new solver. A new solver here means a new particle species, likely with some bespoke numerics, runtime behaviour or physics. Introducing a new solver therefore comes in three flavours: you define the particle's attributes, its behaviour per time step, and its initialisation.
Higher-level performance optimisation is discussed on its dedicated page. This page focuses on the particle view.
Particles are characterised by their attributes (position, velocity, density, ...), their behaviour in the simulation and their initialisation. The latter two are formalised by the particle's algorithmic steps. You can think of them as a sequence of operations which have to be performed per particle per time step, or until we have a properly initialised particle. You can visualise a particle's algorithm via a simple graph:
It illustrates that we have some attributes on a particle (top left). Each particle runs through a number of steps (right): We start with an initialisation (red) in which we throw the particles into the domain. After that, each particle runs through two initialisation steps, the latter of which might repeat. Per time step, each particle runs through six compute steps, while, once again, one of them might have to be repeated. Some computational steps might only act on one particle at a time, while other steps might require the particles around a particle to compute an outcome (darker boxes).
This input sequence is taken by our graph compiler, which maps it onto traversals of the data and generates the C++ application. Details are discussed in the section on the Swift Graph Compiler. The basic rationale behind this design is that a particle-based algorithm is best phrased as a flow chart over a single particle. It is then the job of our software to bring this information together with the mesh in which the particles are organised, to map the updates onto data traversals, and to decide which particle to update at which point. By abstracting the mesh traversal away from the particle lifecycle, we can also easily handle multiple particle species. The particle lifecycle however encodes the required computational steps. Therefore, it comprises both the physics and the time stepping scheme. Both are intertwined in classic SPH.
Any new particle type should be a subtype of swift2.particle.Particle.
You can instantiate your particle of interest and add further attributes to it in your main script. Alternatively, you might want to inherit from the particle type of interest and add additional attributes in the inherited type. Once you have called the superclass constructor, you can add further attributes to your species.
The second essential ingredient to build a particle life cycle is to specify the sequence of algorithmic steps. Each particle has to have a routine algorithm_steps() which returns the sequence of algorithmic steps to be run per time step. The individual algorithm steps are instances of AlgorithmStep (python/swift2/particle/AlgorithmStep.py). As an example, we discuss the SPH density loop:
- dependencies can distinguish between two options: NEIGHBOURS and SELF. These refer to the two types of operations that are done in the SPH context: either the computation depends only on the particle itself, or it requires contributions from its neighbours.
- effect has different options depending on the output type of the calculation. ALTER_LOCAL_STATE, for example, means that only the current particle is updated. Other options notably include changes of the position or cut-off radius, or the possibility that the code asks Peano to rerun this step.
- cell_kernel invokes a compute kernel which depends on the localParticles and activeParticles sets of the PIDT scheme.
- touch_vertex_first_time_kernel invokes a compute kernel which depends exclusively on the localParticles set of the PIDT scheme and is run when the particles are first "touched" during the mesh traversal.
- touch_vertex_last_time_kernel is analogous to the previous one, but refers to the last time the particles are touched.
- includes allows us to ensure that the kernels that we invoke are well-known, i.e. that the generated code includes the required headers.

The full life cycle is just a list of such algorithmic steps. A very simple case of such a class can be found in swift2.particle.ExplicitEulerDynamicSearchRadius. Here, the two key routines simply create a new sequence of swift2.particle.AlgorithmStep instances. You might want to study swift2.particle.SPHLeapfrogFixedSearchRadius, which is more sophisticated. It first creates a dictionary of elementary steps and lets algorithm_steps() and initialisation_steps() then "assemble" the particle flow from this dictionary.
This is very similar to the algorithmic steps. Each particle offers a routine initialisation_steps() which returns a sequence of steps that have to be called before we start the actual simulation. The sequence can be empty. The graph compiler nevertheless will ensure that the initial sorting of the particles is correct.
Movements and changes to the particle cut-offs should, in line with the discussion in Peano's toolbox, only be done in touchVertexLastTime(). That is, usually the movement delta is accumulated over the mesh traversal and then applied in this last access step. From here on, it is Peano's responsibility to get the particle to the right position in space in time for the next mesh sweep.
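A hedged sketch of this accumulate-then-apply pattern; all type and function names below are made up for illustration and do not match the generated code:

```cpp
// Sketch only: accumulate a movement delta during the traversal and apply it
// when the particle is touched for the last time. All names are made up.
struct DemoParticle {
  double x[2]         = {0.0, 0.0};  // position; only changed at the very end
  double moveDelta[2] = {0.0, 0.0};  // displacement accumulated over this sweep
};

// Called from the compute steps throughout the mesh traversal: we only
// accumulate the displacement, we do not move the particle yet.
void addMovement(DemoParticle& particle, double dx, double dy) {
  particle.moveDelta[0] += dx;
  particle.moveDelta[1] += dy;
}

// Invoked from touchVertexLastTime(): apply the accumulated delta exactly
// once. From here on, Peano re-sorts the particle into the correct vertex
// before the next mesh sweep.
void applyMovement(DemoParticle& particle) {
  particle.x[0] += particle.moveDelta[0];
  particle.x[1] += particle.moveDelta[1];
  particle.moveDelta[0] = 0.0;
  particle.moveDelta[1] = 0.0;
}
```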
There are multiple flavours of the computation of the particle interaction radius stemming from the density reconstruction:
The first variant is simple: We have to ensure in the particle initialisation that the cut-off radius is greater than or equal to the fixed smoothing length. As the cut-off determines the distance within which Peano searches for neighbouring particles, we will then get a set of active particles per local particle, iterate through them, and neglect all particles that are not within the smoothing length, as in the sketch below.
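A minimal sketch of such a density accumulation, using a hypothetical, simplified particle type and a placeholder SPH kernel (the real particle type is generated from your Python model and exposes getters/setters instead of raw members):

```cpp
#include <cmath>
#include <list>

// Hypothetical, simplified particle type (2d for brevity).
struct Particle {
  double x[2];
  double smoothingLength;   // fixed smoothing length <= cut-off radius
  double density;
  double mass;
};

// Illustrative placeholder kernel; use your SPH kernel of choice.
double W(double r, double h) {
  const double q = r / h;
  return q < 1.0 ? (1.0 - q) / (h * h) : 0.0;
}

// Density contribution of all active particles to one local particle.
// Particles outside the smoothing length are simply skipped, even though
// they may well sit inside the (larger) cut-off/search radius.
void computeDensity(Particle& localParticle, const std::list<Particle*>& activeParticles) {
  localParticle.density = 0.0;
  for (const Particle* activeParticle : activeParticles) {
    const double dx = localParticle.x[0] - activeParticle->x[0];
    const double dy = localParticle.x[1] - activeParticle->x[1];
    const double r  = std::sqrt(dx * dx + dy * dy);
    if (r <= localParticle.smoothingLength) {
      localParticle.density += activeParticle->mass * W(r, localParticle.smoothingLength);
    }
  }
}
```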
The second variant can additionally change the local state. Consult the discussion of multiscale interactions below to understand why it might happen that this algorithm step has to be rerun. The third, most flexible, variant is configured through the corresponding effect value of the algorithm step.
If particles interact with particles, we have to keep in mind that these are realised through computations between the local set and the active set. The recursive definition of the active set implies that the active set exclusively hosts particles of the same mesh resolution plus the ones stemming from coarser levels. We miss out on particles which are stored on finer levels.
This is not only a constraint for setups with largely varying smoothing lengths. It already becomes a serious issue if you employ adaptive grids. The action set peano4.toolbox.particles.api.AbstractUpdateParticleGridAssociation discusses this phenomenon in detail.
The solution to this "problem" is to accumulate all data (including the density) and to make (local) particles add contributions to active particles on coarser levels. Usually, i.e. on a single level, active particles act on local particles but are not updated themselves. Here, we have to deviate from this convention.
If the calculation of a particle state is itself an iterative scheme, we have to map the iterations onto mesh sweep reruns: In Peano, we never have the whole set of neighbouring particles available! Rerunning a step is triggered through the particle's species: The code has to invoke its setRerunPreviousGridSweep().
If you want to specify an algorithm step that might have to be rerun (because you alter the search radius, for example, or implement an iterative scheme), you can trigger such a rerun through the species. The idea is as follows:
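A sketch of this idea, assuming the generated particle class exposes its species object through a static getSpecies() accessor (check the generated code for the exact name); the smoothing-length iteration is a made-up stand-in for whatever iterative scheme you implement:

```cpp
#include <cmath>

// Sketch only: ParticleSpecies is modelled after swift2::ParticleSpecies,
// MyParticle stands in for your generated particle type.
struct ParticleSpecies {
  void setRerunPreviousGridSweep() { rerun = true; }  // flag queried by the framework
  bool rerun = false;
};

struct MyParticle {
  double smoothingLength       = 1.0;
  double targetSmoothingLength = 1.2;     // assumed outcome of the density iteration
  static ParticleSpecies& getSpecies() {  // accessor name is an assumption
    static ParticleSpecies species;
    return species;
  }
};

// One iteration of a (hypothetical) smoothing-length update. If the value has
// not converged yet, we ask Peano to run the whole algorithm step once more.
void updateSmoothingLength(MyParticle& particle) {
  const double tolerance = 1e-6;
  particle.smoothingLength = 0.5 * (particle.smoothingLength + particle.targetSmoothingLength);
  if (std::abs(particle.smoothingLength - particle.targetSmoothingLength) > tolerance) {
    MyParticle::getSpecies().setRerunPreviousGridSweep();
  }
}
```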
The class swift2::ParticleSpecies provides further details. However, the user usually only handles these two core operations. All global reductions, i.e. the synchronisation between ranks, and the actual reruns are then handled by the Swift/Peano framework.
Swift phrases its physics in particle-to-particle interactions (see remarks below). You map these onto sets of particles via functions such as swift2::forAllParticles(). These functions are written down as templates. We refer to them as iterators.
Most particle kernels or particle-particle kernels are templates, too. The combination/nesting of two templates becomes tricky. While sophisticated template meta programming can resolve this, it makes sense to specialise the kernels explicitly. We do so through the format() call above.
Swift's core philosophy is that users should take care of only three things (most of the time):
In line with this philosophy, new physics is primarily introduced by adding new particle-particle interaction routines: They take a local particle (to be modified) and an active particle (acting on the local one) and update the former. This is a binary compute kernel. We also have unary compute kernels which take one single particle and update it. Our goal is to come up with an SPH formalism where physical aspects are decoupled from data management, HPC, orchestration, ... challenges where possible.
To make this work, all particle-update kernels and particle-particle kernels have to commit to a few basic conventions:
Most kernels are templates parameterised over the particle type. However, there might be cases where you have bespoke kernels that apply only to one single species. The punchline to take away is that it is the kernel's responsibility to ensure that any particle update is rightfully performed, as in the sketch below.
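As an illustration, here is a hedged sketch of a binary (particle-particle) kernel that follows this convention: it is a template over the particle type and checks itself whether the update is admissible before touching the local particle. The DemoParticle type, the isLocal() accessor and the drag-like physics are invented for this example:

```cpp
#include <cmath>

// Hypothetical particle type; the real one is generated from the Python model.
struct DemoParticle {
  double v[2];   // velocity
  double x[2];   // position
  double h;      // interaction radius
  bool   local;  // true if this rank/mesh region owns the particle
  bool   isLocal() const { return local; }
  double getInteractionRadius() const { return h; }
};

// Binary compute kernel: the local particle is modified, the active particle
// only acts on it. The kernel itself decides whether the update is legal,
// so the surrounding iterator can stay completely generic.
template <typename Particle>
void addDragContribution(Particle& localParticle, const Particle& activeParticle) {
  // Convention 1: never update particles we do not own.
  if (not localParticle.isLocal()) return;

  // Convention 2: only interact within the interaction radius.
  const double dx = localParticle.x[0] - activeParticle.x[0];
  const double dy = localParticle.x[1] - activeParticle.x[1];
  const double r  = std::sqrt(dx * dx + dy * dy);
  if (r > localParticle.getInteractionRadius()) return;

  // Made-up physics: nudge the local velocity towards the active particle's one.
  const double coupling = 1e-3;
  localParticle.v[0] += coupling * (activeParticle.v[0] - localParticle.v[0]);
  localParticle.v[1] += coupling * (activeParticle.v[1] - localParticle.v[1]);
}
```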
To make both the tuning and the validation of the kernel easier, it is very important that the kernel's documentation explicitly lists its constraints and also the checks used internally. Such information can later on be used within the algorithm steps (see remark above on invariants) and by performance engineers (see remarks below).
Swift 2 builds up its interaction lists (who has to be checked against whom) on-the-fly while it runs through the mesh. Hereby, it employs the concept of a search radius: Only particles within the search radius of a particle \( p_i \) are taken into account when we search for other particles \( p_j \) it interacts with.
In Swift, the search radius \( r_{\text{search}} \) is a technical quantity (See also Terminology). The physics are not directly impacted by it. They are determined by the smoothing length \( r_{\text{smooth}} \) which changes (more) frequently. The smoothing length directly determines the interaction radius \(r_{\text{iact}} \):
\(r_{\text{iact}} = \gamma_{\text{kernel}} r_{\text{smooth}}\)
where \(\gamma_{\text{kernel}}\) is a constant which depends on the spatial dimensionality and the SPH kernel used (see Terminology and Dehnen & Aly 2012 ). Its numerical value is \(\sim 2\).
Naturally, we must ensure that the search radius is always greater than or equal to the interaction radius.
Our code runs through the mesh and looks at all particles around a particle within the search radius. It is important to note that we have different types of interactions in our loops. For the density calculation, all particles within the interaction radius enter the equations. We interact a particle \( p_i \) with a particle \( p_j \) if:
\( |x(p_i) - x(p_j)| \leq r_{\text{iact}}(p_i). \)
For the force calculation however, we have a symmetric criterion: Two particles interact if:
\( |x(p_i) - x(p_j)| \leq r_{\text{iact}}(p_i) \vee |x(p_j) - x(p_i)| \leq r_{\text{iact}}(p_j). \)
Without the or, we could end up with non-symmetric forces.
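Spelled out as code, the two criteria differ only in the logical connective. A small, self-contained sketch that works on raw positions and radii, i.e. independently of the generated particle type:

```cpp
#include <cmath>

// Euclidean distance between two particle positions in Dim dimensions.
template <int Dim>
double distance(const double a[Dim], const double b[Dim]) {
  double result = 0.0;
  for (int d = 0; d < Dim; d++) result += (a[d] - b[d]) * (a[d] - b[d]);
  return std::sqrt(result);
}

// Density-type interaction: only the interaction radius of the particle that
// is updated (p_i) matters.
template <int Dim>
bool interactsDensity(const double xi[Dim], const double xj[Dim], double rIactI) {
  return distance<Dim>(xi, xj) <= rIactI;
}

// Force-type interaction: symmetric criterion, the pair interacts as soon as
// either particle "sees" the other one.
template <int Dim>
bool interactsForce(const double xi[Dim], const double xj[Dim], double rIactI, double rIactJ) {
  const double r = distance<Dim>(xi, xj);
  return r <= rIactI or r <= rIactJ;
}
```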
The illustration above shows what can go wrong if the constraint
\( \gamma_{\text{kernel}} r_{\text{smooth}}(p_i) = r_{\text{iact}}(p_i) \leq r_{\text{search}}(p_i) \)
is violated: The red particle has to be updated within the middle cell. Let its search radius be depicted with the dotted circle, and the actual interaction radius with a solid one. As we store particles next to their closest vertex, we effectively can compare the red particle against all particles that are stored within the dashed square: We test all particles within the cell plus a h/2 halo. This is in line with Peano's search radius and sorting definition.
Consequently, we compare red with green and find out that (a) their search radii overlap, and (b) their interaction radii don't. So we ignore the interaction. Red and blue are never even compared.
Now assume the smoothing length grows, increasing the interaction radius, such that it now almost matches the search radius (dotted circle). Green is still within the interaction radius of red, so nothing has changed here. Blue however now ends up within the search radius of red, yet is not taken into account as it is stored within a vertex that is not adjacent to the current cell. We miss out on this interaction.
Swift 2 codes can be tuned in many ways. First, users can tune their code by picking an optimising graph compiler. They can also tweak the memory layout as discussed on Performance optimisation. On this page, we discuss how we can optimise the particle's compute kernels themselves. Before you continue to read, please consult Swift's page on runtime analysis, too, so all follow-up steps can be supported with appropriate measurements.
Users typically start from a generic usage of the iterators in ParticleSetIterators.h. That is, they simply write down swift2::kernels::forAllParticlePairs() and hand in their compute kernels of choice. Most kernels check internally whether they are supposed to run or not (consult how to introduce a new solver). All these iterators therefore do is invoke the actual kernel per particle or particle pair, respectively.
However, it might make sense to evaluate a "shall I run" predicate ahead of the actual kernel invocation: If a particle is outside of a cell in a cell kernel, it might be reasonable not even to start to run over all the (active) particles around it. This additional check therefore mainly makes sense for particle-particle interactions, i.e. forAllParticlePairs(), but Swift 2 realises the same guard mechanism for both forAllParticles() and forAllParticlePairs(). Both routines accept additional guards which allow you to mask out certain calculations a priori.
If kernels consume a lot of compute time, it might be reasonable to check a priori whether it makes sense to invoke a kernel at all. This again can be done by passing another guard into the iterator: instead of relying on the default, always-true guard, you hand in a predicate that masks out those particles or particle pairs for which the kernel would not do anything anyway. The sketch below illustrates the idea.
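The following sketch illustrates the idea with a strongly simplified stand-in for the iterator; the real signatures in ParticleSetIterators.h additionally take the mesh traversal marker and the generated particle containers, so consult the header for the exact arguments:

```cpp
#include <functional>
#include <list>

struct Particle { /* stand-in for the generated particle type */ double x[2]; bool inside; };

// Strongly simplified stand-in for swift2::kernels::forAllParticlePairs():
// invoke the kernel for every (local, active) pair whose local particle
// passes the guard.
void forAllParticlePairs(
  std::list<Particle*>& localParticles,
  std::list<Particle*>& activeParticles,
  std::function<void(Particle&, const Particle&)> kernel,
  std::function<bool(const Particle&)> localParticleGuard = [](const Particle&) { return true; }
) {
  for (auto* local : localParticles) {
    if (not localParticleGuard(*local)) continue;  // mask out whole rows of pair interactions
    for (auto* active : activeParticles) {
      kernel(*local, *active);
    }
  }
}

// Usage: instead of relying on the default always-true guard ...
//   forAllParticlePairs(local, active, myKernel);
// ... hand in a predicate that skips particles outside the current cell:
//   forAllParticlePairs(local, active, myKernel,
//     [](const Particle& p) { return p.inside; });
```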
So far, this actually decreases the performance, as you now evaluate the predicates twice: once in the guard and once inside the kernel. To remove the redundancy, you have to provide a second variant of the compute kernel without these internal checks.
If the predicate logic becomes rather complicated, I typically write a bespoke predicate only for this particular case. swift2::kernels::legacy::densityKernelPairEvaluationPredicate() is a nice example of such a kernel-specific predicate.
The standard iterators swift2::kernels::forAllParticles() and swift2::kernels::forAllParticlePairs() do not exploit any information about the data layout. Once you have tweaked your data layout to ensure that you have coalesced (contiguous) particle data, it is therefore a natural next step to switch to an iterator that exploits this knowledge.
The clean way to do this is to replace your swift2::kernels::forAllParticlePairs() calls with swift2::kernels::coalesced::forAllParticlePairs() or other variants. Several of them are available. Each one might be particularly useful for one kernel-particle combination, yet make the performance break down for the next one.
If you don't want to go through your particle's steps one by one, you can use the particle's routine swift2.particle.SPHParticle.switch_namespace_of_all_particle_iterators() to switch the namespace of all iterators. This is the brute-force variant which overwrites all particle-specific tuning (and might break down for particle types that do not stick exactly to all default naming conventions).
Vectorisation requires very aggressive inlining. To facilitate such inlining, all the setters and getters of particle attributes have to be available in the header. Study the DaStGen documentation for remarks on how to move getters and setters into the header; notably study the semantics of the function which moves all attribute setters and getters into the header and hence facilitates very aggressive inlining.
LLVM-based compilers provide ample feedback on your vectorisation once you add the corresponding optimisation-remark compile flags (for Clang-based compilers these are typically -Rpass=loop-vectorize, -Rpass-missed=loop-vectorize and -Rpass-analysis=loop-vectorize).
Every peano4.Project holds an attribute called makefile of the type peano4.output.Makefile. You can add additional compiler flags, such as these three, there. Alternatively, you can reconfigure/rerun CMake on the whole project, but that means that you get vectorisation reports for every single file in Peano. As any Peano extension picks up the makefile settings from the core library, you will get vectorisation reports for your particular SPH code, too.
For fully vectorised kernels, the reported width will always match the machine width (4 for AVX2 and 8 for AVX512-capable machines with double precision kernels). Sometimes, however, the compiler decides against vectorisation.
In this case, it might make sense to overwrite the compiler heuristics. Consult your compiler handbook. Again, it might be inconvenient to do this for the whole code. Instead, you might want to do it only for one object file. Follow the same pattern as introduced above for the optimisation report.
Unfortunately, Swift/Peano currently has no feature to pick compiler arguments manually per file. Worse, the Python API overwrites the Makefile. It might make sense to design a proper makefile manually and then to disable the overwriting in the Peano 4 project that is produced by Swift's Python API.
We've started to collect a list of todos and no-goes for the vectorisation below. You might want to tweak this in one way or the other, and it will be incomplete. But it is a reasonable starting point:
Annotating functions with [[clang::always_inline]] led to multiple compiler crashes (again with Intel). To some degree this makes sense: The function has to be in the header! I played around with ipo, but that did not resolve the fundamental issues. So you have to put everything used in the for loops into headers.
It seemed originally that one has to replace the std::pow function with a specialised version that takes into account that the power is an integer. The C++ standard https://en.cppreference.com/w/cpp/numeric/math/pow is not clear on whether there is still a specialised version of pow for integers. However, everything seems to vectorise nicely even if we use std::pow with the Intel 2023.2 compiler.
Along the same lines, we sometimes encounter issues with the max and min functions. While replacing them with a manual case distinction might help for some compilers, it is usually a bad idea and leads to reports about (alleged) dependencies. We found it more convenient to work with the actual C++ max function. In this case, however, the vectorisation works if and only if we translate with -Ofast or -ffast-math! The sketch below contrasts the two variants.
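A hedged illustration of the two variants (which one vectorises depends on compiler and flags; with std::max you typically need -Ofast or -ffast-math as stated above):

```cpp
#include <algorithm>

// Variant 1: manual case distinction. Some compilers vectorise this, others
// report (alleged) loop-carried dependencies and give up.
void clampManually(double* u, int n, double floorValue) {
  for (int i = 0; i < n; i++) {
    if (u[i] < floorValue) u[i] = floorValue;
  }
}

// Variant 2: std::max. In our experience this vectorises reliably, but only
// when translated with -Ofast or -ffast-math.
void clampWithStdMax(double* u, int n, double floorValue) {
  for (int i = 0; i < n; i++) {
    u[i] = std::max(u[i], floorValue);
  }
}
```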
Complex compound statements can break the vectorisation. But once you replace them with a sequence of simpler, equivalent statements, vectorisation seems to work.
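Since the original code snippets are not reproduced here, the following is only a hypothetical illustration of the pattern: one long compound expression is split into a sequence of simple assignments to temporaries, which tends to be easier for the vectoriser to analyse:

```cpp
// Monolithic expression: a made-up kernel gradient, pressure term and update
// fused into one statement. Compilers sometimes refuse to vectorise such loops.
void updateFused(double* a, const double* r, const double* p, int n, double h) {
  for (int i = 0; i < n; i++) {
    a[i] += p[i] * (1.0 - r[i] / h) * (1.0 - r[i] / h) / (h * h * h) * (r[i] > 0.0 ? 1.0 / r[i] : 0.0);
  }
}

// The same computation, split into named temporaries per loop iteration.
void updateSplit(double* a, const double* r, const double* p, int n, double h) {
  for (int i = 0; i < n; i++) {
    const double q      = 1.0 - r[i] / h;
    const double gradW  = q * q / (h * h * h);
    const double invR   = r[i] > 0.0 ? 1.0 / r[i] : 0.0;
    const double update = p[i] * gradW * invR;
    a[i] += update;
  }
}
```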