Peano
This page describes Peano 4's multithreading namespace. More...
Namespaces | |
| namespace | native |
| namespace | omp |
| namespace | orchestration |
| namespace | taskfusion |
| Task fusion means that a set of tasks is grabbed and mapped onto one large physical task instead of being processed one by one. | |
| namespace | tbb |
Data Structures | |
| class | BooleanSemaphore |
| class | Core |
| Core. More... | |
| class | EmptyTask |
| An empty task. More... | |
| class | Lock |
| Create a lock around a boolean semaphore region. More... | |
| class | MultiReadSingleWriteLock |
| Create a lock around a multi-read/single-write semaphore region. More... | |
| class | MultiReadSingleWriteSemaphore |
| Read/write Semaphore. More... | |
| class | RecursiveLock |
| Create a lock around a recursive semaphore region. More... | |
| class | RecursiveSemaphore |
| Recursive Semaphore. More... | |
| class | Task |
| Abstract super class for a job. More... | |
| class | TaskComparison |
| Helper class if you want to administer tasks within a queue. More... | |
| class | TaskEnumerator |
| Rank-global enumerator for tasks. More... | |
| class | TaskWithCopyOfFunctor |
| Frequently used implementation for a job with a functor. More... | |
| class | TaskWithoutCopyOfFunctor |
| Frequently used implementation for a job with a functor. More... | |
Typedefs | |
| using | TaskNumber = int |
Enumerations | |
| enum class | TaskWaitType { Mixed , ExclusivelyFusedTasks , OnlyNativeTasks } |
| This is a hint type which can be used to accelerate any taskwaits. More... | |
Functions | |
| int | getNumberOfUnmaskedThreads () |
| This routine runs through the Unix thread mask and counts how many threads SLURM allows a code to use. | |
| std::string | printUnmaskedThreads () |
| Creates a string representation of those threads which are available to the processes. | |
| std::string | toString (const std::set< TaskNumber > &taskNumbers) |
| Construct string representation. | |
| void | initSmartMPI () |
| Switch on SmartMPI. | |
| void | shutdownSmartMPI () |
| void | setOrchestration (tarch::multicore::orchestration::Strategy *realisation) |
| tarch::multicore::orchestration::Strategy * | swapOrchestration (tarch::multicore::orchestration::Strategy *realisation) |
| Swap the active orchestration. | |
| tarch::multicore::orchestration::Strategy & | getOrchestration () |
| void | spawnTask (Task *task, const std::set< TaskNumber > &inDependencies=tarch::multicore::NoInDependencies, const TaskNumber &taskNumber=tarch::multicore::NoOutDependencies) |
| Spawns a single task in a non-blocking fashion. | |
| void | waitForTasks (const std::set< TaskNumber > &inDependencies, TaskWaitType taskWaitType=TaskWaitType::Mixed) |
| Wait for set of tasks. | |
| void | waitForTask (const int taskNumber, TaskWaitType taskWaitType=TaskWaitType::Mixed) |
| Wrapper around waitForTasks() with a single-element set. | |
| void | spawnAndWait (const std::vector< Task * > &tasks) |
| Fork-join task submission pattern. | |
| void | waitForAllTasks () |
| Waiting for all tasks notably has to take fused tasks into account. | |
Variables | |
| constexpr TaskNumber | NoOutDependencies = -1 |
| const std::set< TaskNumber > | NoInDependencies = std::set<TaskNumber>() |
| const std::string | SpawnedTasksStatisticsIdentifier |
| Statistics identifier for spawned tasks. | |
| const std::string | SpawnedFusableTasksStatisticsIdentifier |
| Number of fusable tasks over time. Please consult SpawnedTasksStatisticsIdentifier for a description. | |
| const std::string | BSPConcurrencyLevelStatisticsIdentifier |
| Statistics identifier for active BSP sections. | |
This page describes Peano 4's multithreading namespace.
A high-level overview is provided through Multicore programming. This high-level page describes the multicore component from a user perspective, whereas the present namespace documentation focuses on the implementation of all multithreading within Peano.
The multithreading environment is realised through a small set of classes and functions within the namespace tarch::multicore. User code programs against these classes and functions. Each type/function has a vanilla implementation within the directory src/multicore. This implementation is a dummy that ensures that all code works properly without any multithreading support.
Subdirectories hold alternative implementations (backends) which are enabled once the user selects a certain multithreading implementation variant, i.e. depending on the ifdefs set, one of the subdirectories is used. Some implementations introduce further headers, but user code is never supposed to work against functions or classes held within subdirectories.
As most backend-specific subdirectories solely contain implementation files, i.e. no headers, all backend-specific documentation resides directly within the actual routines. That is, many multicore routines have different sections within their documentation (in the header) which discuss how the routine semantics are mapped onto specific backends.
This implies that documentation about a specific backend (e.g. "how is the multithreading done with OpenMP") is not consolidated within a specific file. It is scattered among various routines.
Peano works with arbitrary dynamic task graphs where individual tasks are identified by a unique number (cf. the discussion of task graphs on the Multicore programming page, which provides details on how to use this feature from a programmer's point of view). Within the task graph, the code makes no assumptions on the structure of the dependencies.
This is particularly important: Peano makes no assumption at all about a certain well-structuredness (tree structure) of the task graph. Many task graphs will be constructed along a tree pattern, where tasks fork into more and more child tasks. However, it is absolutely not necessary that a task waits for all children that it has created before it terminates itself, i.e. we do not require the task graph to unfold and collapse along a tree pattern.
Notably, each mesh traversal creates one task per local subdomain on a rank. These mesh traversals might spawn tasks with complex dependencies, while we wait for them to terminate in a subsequent mesh traversal. We allow for producer-consumer patterns without strong synchronisation.
In the example above, the mesh traversal is a task (bright red) which in turn creates four traversal tasks, as four local subdomains reside on this rank (dark green). These traversals assemble a rather complex task graph (blue) and then terminate. But they do not wait for the nodes of this task graph to complete, too. Later on, we issue another mesh traversal (dark red) spawning four children in a fork-join manner (bright green). These child tasks now might wait for tasks from the task graph (bright blue).
Peano supports various task execution patterns which are guided through the task orchestration. Users can switch the orchestration for their code or even write bespoke strategies for how to arrange the tasks.
The orchestration determines, for example, which tasks end up on which GPU, or if a code wants to fuse multiple tasks into big meta tasks.
The information on how tasks can be orchestrated, but also on how the orchestration is programmed, is all consolidated within the namespace tarch::multicore::orchestration. You can swap the orchestration through calls to tarch::multicore::swapOrchestration().
Bespoke orchestrations have to inherit from tarch::multicore::orchestration::Strategy. The signature of this class showcases how much control the user eventually has over the task orchestration, as all orchestration is guided via callbacks to the strategy in control.
The canonical starting point to learn more about Peano's task fusion is the page Task fusion strategies.
An overview of the most important strategy decisions - in particular when it comes to fusion - is available through tarch::multicore::orchestration::FusionImplementation.
The backend of choice is set when we compile Peano, i.e. through the autotools or CMake. Details are provided in the Multicore programming documentation. In principle, all programming should be completely backend-agnostic. However, there are some high-level issues which have to be considered. For some backends, we also have to provide workarounds and some additional utility features.

If you want to use the OpenMP backend, you have to embed your whole main loop within an OpenMP parallel environment. Furthermore, you will have to enable nested parallelism explicitly on some systems, as we rely heavily on nested parallelism.
At the moment, OpenMP does not allow you to set the number of used threads within the application, i.e. through tarch::multicore::Core. Instead, you have to set OMP_NUM_THREADS accordingly.
The realisation of task patterns as described above requires transparent task dependencies as they become available with OpenMP 6. As many production systems do not support OpenMP 6 yet, Peano has a workaround to mirror transparent task dependencies. Consequently, the documentation of individual routines and their implementation often distinguishes which OpenMP version is actually supported and how that maps onto the implementation. There is a routine tarch::multicore::omp::majorOpenMPVersion() which allows code to switch depending on the version. This routine is for internal use within the implementation only, so there is no Doxygen documentation for it.
The documentation of the OpenMP 5 workarounds, but also some details on OpenMP 6 can be found in the documentation of the namespace tarch::multicore::omp::internal.

Peano's usage pattern of oneTBB
| using tarch::multicore::TaskNumber = int |
Definition at line 209 of file multicore.h.
| enum class tarch::multicore::TaskWaitType | strong |
This is a hint type which can be used to accelerate any taskwaits.
| Enumerator | |
|---|---|
| Mixed | |
| ExclusivelyFusedTasks | |
| OnlyNativeTasks | |
Definition at line 434 of file multicore.h.
| int tarch::multicore::getNumberOfUnmaskedThreads | ( | ) |
This routine runs through the Unix thread mask and counts how many threads SLURM allows a code to use.
It returns this count. If you use multiple MPI ranks per node, each rank usually gets permission to access the same number of cores exclusively.
| tarch::multicore::orchestration::Strategy & tarch::multicore::getOrchestration | ( | ) |
| void tarch::multicore::initSmartMPI | ( | ) |
Switch on SmartMPI.
If you use SmartMPI, then the bookkeeping registers the local scheduling. If you don't use SmartMPI, this operation becomes a nop, i.e. you can always call it, and configure will decide whether it does something useful.
| std::string tarch::multicore::printUnmaskedThreads | ( | ) |
Creates a string representation of those threads which are available to the processes.
You get a string similar to
0000xxxx0000xxxx00000000000000
The example above means that cores 4-7 and 12-15 are available to the process, the other cores are not.
| void tarch::multicore::setOrchestration | ( | tarch::multicore::orchestration::Strategy * | realisation | ) |
| void tarch::multicore::shutdownSmartMPI | ( | ) |
| void tarch::multicore::spawnAndWait | ( | const std::vector< Task * > & | tasks | ) |
Fork-join task submission pattern.
The realisation is relatively straightforward: the precise behaviour of the implementation is controlled through the orchestration. At the moment, we support three different variants.
I would appreciate if we could distinguish busy polling from task scheduling in the taskwait, but such a feature is not available within OpenMP, and we haven't studied TBB in this context yet.
In OpenMP, the taskwait pragma allows the scheduler to process other tasks, as it is a scheduling point. This way, it should keep cores busy all the time as long as there are enough tasks in the system. If a fork-join task spawns a lot of additional subtasks, and if the orchestration does not tell Peano to hold them back, the OpenMP runtime might switch to the free tasks rather than continue with the actual fork-join tasks, which is not what we want and introduces runtime flaws further down the line. This phenomenon is described in our 2021 IWOMP paper by H. Schulz et al.
A more severe problem arises the other way round: Several groups have reported that the taskwait does not continue with other tasks. See in particular
Jones, Christopher Duncan (Fermilab): Using OpenMP for HEP Framework Algorithm Scheduling. http://cds.cern.ch/record/2712271
Their presentation slides can be found at https://zenodo.org/record/3598796#.X6eVv8fgqV4.
This paper clarifies that some OpenMP runtimes do (busy) waits within the taskwait construct to be able to continue immediately. They do not process other tasks meanwhile. Our own ExaHyPE 2 POP review came to the same conclusion.
This can lead to a deadlock in applications such as ExaHyPE which spawn bursts of enclave tasks and then later on wait for their results to drop in. The consuming tasks will issue a taskyield(), but this will not help if the taskyield() merely permutes through all the other traversal tasks.
If you suffer from that, you have to ensure that all enclave tasks have finished prior to the next traversal.
It is important to know how many BSP sections are active at any point. I therefore use the stats interface to maintain the BSP counters. However, I disable any statistics sampling, so I get a spot-on overview of the number of forked subtasks at any point.
| void tarch::multicore::spawnTask | ( | Task * | task, |
| const std::set< TaskNumber > & | inDependencies = tarch::multicore::NoInDependencies, | ||
| const TaskNumber & | taskNumber = tarch::multicore::NoOutDependencies ) |
Spawns a single task in a non-blocking fashion.
Ownership goes over to Peano's job namespace, i.e. you don't have to delete the pointer.
Spawn a task that depends on other tasks. Alternatively, pass in NoInDependencies; in this case, the task can kick off immediately. You have to specify a task number. This number allows other, follow-up tasks to become dependent on this very task. Please note that the tasks have to be spawned in order, i.e. if B depends on A, then A has to be spawned before B. Otherwise, you introduce a so-called anti-dependency. This is OpenMP jargon which we adopted ruthlessly.
You may pass NoOutDependencies as taskNumber. In this case, you have a fire-and-forget task which is just pushed out there without anybody ever waiting for it later on (at least not via task dependencies).
The very moment you hand over a task to spawnTask(), Peano's tasking backend has ownership of the pointer. That is, you are not allowed to use the task object anymore. It might have been destroyed right after the spawnTask() call.
If you have a task object, you can directly call it. If you have only a functor, use tarch::multicore::TaskWithCopyOfFunctor.
The OpenMP variant has to jump through a number of hoops.
| task | Pointer to a task. The responsibility for this task is handed over to the tasking system, i.e. you are not allowed to delete it. |
| inDependencies | Set of incoming tasks that have to finish before the present task is allowed to run. You can pass the alias tarch::multicore::NoInDependencies to make clear what's going on. |
| taskNumber | Allow the runtime to track out dependencies. Only numbers handed in here may be in inDependencies in an upcoming call. If you do not expect to construct any follow-up in-dependencies, you can pass in the default, i.e. NoOutDependencies. |
| tarch::multicore::orchestration::Strategy * tarch::multicore::swapOrchestration | ( | tarch::multicore::orchestration::Strategy * | realisation | ) |
Swap the active orchestration.
Different to setOrchestration(), this operation does not delete the current orchestration. It swaps them, so you can use setOrchestration() with the result afterwards and re-obtain the original strategy.
| std::string tarch::multicore::toString | ( | const std::set< TaskNumber > & | taskNumbers | ) |
Construct string representation.
Returns a string representation of the task set.
We use this routine mainly in logging statements internally. The routine has to be within the namespace, as we overload a general C++ standard library routine, and I want to keep this definition of toString() specific to task numbers. It is also fine to have it in the namespace, as it is only used internally anyway.
| void tarch::multicore::waitForAllTasks | ( | ) |
Waiting for all tasks notably has to take fused tasks into account.
| void tarch::multicore::waitForTask | ( | const int | taskNumber, |
| TaskWaitType | taskWaitType = TaskWaitType::Mixed ) |
Wrapper around waitForTasks() with a single-element set.
| void tarch::multicore::waitForTasks | ( | const std::set< TaskNumber > & | inDependencies, |
| TaskWaitType | taskWaitType = TaskWaitType::Mixed ) |
Wait for set of tasks.
Entries in inDependencies can be NoOutDependencies. This is a trivial implementation: we basically run through each task in inDependencies and invoke waitForTask() for it. We don't have to rely on some backend-specific implementation.
You can obviously construct a task set explicitly. If you know the task numbers, you can however directly use the bracket notation to invoke this function.
Without any multithreading backend, this routine degenerates to a nop, as no task can be pending: spawnTask() always executes the task straightaway.
If you know that you have exclusively native tasks, then we can skip the check if there are any fused tasks still pending that we have to map onto native tasks first.
If you have exclusively fused tasks, then we can hand the baton on to the fusion subpackage, but in return ask it if there's a chance that some native tasks arise from the fusion. This is not the case for all fusion strategies. If so, we can skip the subsequent test for native tasks. See tarch::multicore::taskfusion::ensureHeldBackTasksAreMappedOntoNativeTasks().
| taskWaitType | This is an optimisation hint. See description above. |
| const std::string tarch::multicore::BSPConcurrencyLevelStatisticsIdentifier | extern |
Statistics identifier for active BSP sections.
If you translate in statistics mode, Peano's tarch will yield a statistics file where it tracks the number of fork-join (BSP) partitions over time. If you invoke the postprocessing script over the output file (with the .csv extension), the script tells you in which column it dumps data of this flag.
If you visualise the respective column, you get a plot of the BSP concurrency level over time.
In this example, we have a warm-up (e.g. grid construction) phase up to t=340. After that, the actual computation starts and the BSP level, i.e. number of forked tasks in a fork-join sense, is either 0 or 8. For most of the time, we seem to have 8 threads busy processing 8 rank-local subpartitions.
| const std::set<TaskNumber> tarch::multicore::NoInDependencies = std::set<TaskNumber>() |
Definition at line 213 of file multicore.h.
| constexpr TaskNumber tarch::multicore::NoOutDependencies = -1 |
Definition at line 211 of file multicore.h.
| const std::string tarch::multicore::SpawnedFusableTasksStatisticsIdentifier | extern |
Number of fusable tasks over time. Please consult SpawnedTasksStatisticsIdentifier for a description.
This identifier does not probe all tasks, but only those for which canFuse() holds.
| const std::string tarch::multicore::SpawnedTasksStatisticsIdentifier | extern |
Statistics identifier for spawned tasks.
If you translate in statistics mode, Peano's tarch will yield a statistics file, where it tracks the number of tasks that we spawn over time.
If you invoke the postprocessing script over the output file (with the .csv extension), the script tells you in which column it dumps data of this flag.
If you visualise the respective column, you get a scatter plot of the spawned tasks over time.
This scatter plot shows us how many tasks have been spawned over a certain time frame. The scatter plot is not that easy to read, but the idea is simple: the statistics interface regularly probes the system for the accumulated number of spawned tasks and dumps it. After that, it resets the accumulated value to zero.
In the example above, this does not mean that we get only up to 1,000 tasks in total. To make statements on the total number of tasks in the system at any point, we would have to take the density (x-spacing) of the points into account. What it does tell us, however, is that we have a relatively high task spawn frequency. Only in very rare cases do we encounter just a few tasks.
Many applications do not plot only this quantity, but plot SpawnedFusableTasksStatisticsIdentifier at the same time, so users can compare the number of spawned tasks against the number of fusable tasks at any point.