Peano 4
tarch::multicore Namespace Reference

This page describes Peano 4's multithreading layer. More...


namespace  internal
namespace  native
namespace  orchestration

Data Structures

class  BooleanSemaphore
class  Core
 Central instance representing the landscape of cores on a system. More...
class  Lock
 Create a lock around a boolean semaphore region. More...
class  MultiReadSingleWriteLock
 Create a lock around a multi-read/single-write semaphore region. More...
class  MultiReadSingleWriteSemaphore
 Read/write Semaphore. More...
class  RecursiveLock
 Create a lock around a recursive semaphore region. More...
class  RecursiveSemaphore
 Recursive Semaphore. More...
class  Task
 Abstract super class for a job. More...
class  TaskComparison
 Helper class if you want to administer tasks within a queue. More...
class  TaskWithCopyOfFunctor
 Frequently used implementation for a job with a functor. More...
class  TaskWithoutCopyOfFunctor
 Frequently used implementation for a job with a functor. More...


using TaskNumber = int


int getNumberOfUnmaskedThreads ()
 This routine runs through the Unix thread mask and counts how many threads SLURM allows a code to use.
std::string printUnmaskedThreads ()
 Creates a string representation of those threads which are available to the process.
void initSmartMPI ()
 Switch on SmartMPI.
void shutdownSmartMPI ()
void setOrchestration (tarch::multicore::orchestration::Strategy *realisation)
tarch::multicore::orchestration::Strategy * swapOrchestration (tarch::multicore::orchestration::Strategy *realisation)
 Swap the active orchestration.
bool processPendingTasks (int maxTasks=std::numeric_limits< int >::max(), bool fifo=true)
 Process a few tasks from my backlog of tasks.
void spawnTask (Task *task, const std::set< TaskNumber > &inDependencies=std::set< TaskNumber >(), const std::set< TaskNumber > &conflicts=std::set< TaskNumber >(), const TaskNumber &taskNumber=NoDependencies)
 Spawns a single task in a non-blocking fashion.
void waitForAllTasks ()
 Wait for all tasks which have been spawned by spawnTask.
void waitForTasks (const std::set< TaskNumber > &inDependencies)
 Wait for set of tasks.
void waitForTask (const int taskNumber)
 Wrapper around waitForTasks() with a single-element set.
void spawnAndWait (const std::vector< Task * > &tasks)
 Fork-join task submission pattern.


constexpr TaskNumber NoDependencies = -1

Detailed Description

This page describes Peano 4's multithreading layer.

To compile with multicore support, you have to invoke the configure script with the option --with-multithreading=value where value is

  • cpp. This adds support through C++14 threads.
  • tbb. This adds support through Intel's Threading Building Blocks. If you use this option, you first have to ensure that your CXXFLAGS and LDFLAGS point to the right include and library directories, respectively. LDFLAGS also has to comprise either -ltbb or -ltbb_debug.
  • openmp. This adds OpenMP support. We currently develop against OpenMP 4.x, though some of our routines use OpenMP target and thus are developed against OpenMP 5.
  • sycl. We have SYCL support for the multithreading, though, within the Intel toolchain, it might be more appropriate to combine SYCL on the GPU with the tbb backend for multithreading.

Writing your own code with multithreading features

If you want to distinguish in your code between multicore and no-multicore variants, please use


#if defined(SharedMemoryParallelisation)

With the symbol SharedMemoryParallelisation, you make your code independent of OpenMP, TBB or C++ threading.

Our vision is that each code should be totally independent of the multithreading implementation chosen. Indeed, Peano 4 itself does not contain any direct multithreading library calls. It solely relies on the classes and functions from tarch::multicore.

Multicore architecture

The multithreading environment is realised through a small set of classes. User codes work with these classes. Each type/function has an implementation within src/multicore. This implementation is a dummy that ensures that all code works properly without any multithreading support. Subdirectories hold alternative implementations (backends) which are enabled once the user selects a certain multithreading implementation variant, i.e. depending on the ifdefs set, one of the subdirectories is used. Some implementations introduce further headers, but user code is never supposed to work against functions or classes held within subdirectories.

The central instance managing the threads on a system is tarch::multicore::Core. This is a singleton, and the name thus is slightly wrong: it does not really represent one core but rather represents the landscape of cores. You can set up the multithreading environment through Core's configure() routine, but this is optional. Indeed, multithreading should work without calling configure() at all. Each multithreading backend offers its own realisation of the Core class.

For multithreaded code, it is important that the code can lock (protect) code regions and free them. For this, the multithreading layer offers different semaphores. Each multithreading backend maps these logical concepts onto its internal synchronisation mechanism. Usually, I use the semaphores through lock objects. As they rely on the semaphore implementations, they are generic and work for any backend.

Task model

Peano models all of its internals as tasks. Each Peano 4 task is a subclass of tarch::multicore::Task. However, these classes might not be mapped 1:1 onto native tasks. Indeed, we distinguish different task types:

  • Tasks. The most generic type of tasks is submitted via spawnTask(). Each task can be assigned a unique number and incoming dependencies.
  • Fork-join tasks. These are created via tarch::multicore::spawnAndWait() and form a subset of the generic tasks. Here, we know the dependency structure and the waits explicitly, so there's no need to work with task numbers.
  • Free floating tasks (task sinks). Tasks without any outgoing dependencies are free floating in the sense that we never wait for them. That's obviously almost never literally true - few tasks have no follow-up dependencies at all - but the user might decide to model the outgoing dependencies without task dependencies: typically, such tasks set some output flag or dump their result into a database, which then in turn unlocks follow-up tasks.
  • Pending tasks: Pending tasks are ready tasks which Peano's tasking API holds back on purpose. They are not (yet) submitted to the tasking system, as we might want to fuse them into larger task assemblies (and move them to other ranks/devices, e.g.).

Vanilla Peano task graph

The two routines tarch::multicore::spawnAndWait() and tarch::multicore::spawnTask() allow us to model the typical Peano 4 task graph:

Peano's main core realises a classic fork-join parallelism (left) where the fork-join segments are realised via tasks. The routine tarch::multicore::spawnAndWait() is used to construct this task graph part on-the-fly. Formally, this part of the code is very similar to classic BSP (cmp. OpenMP's parallel for), recursively nested into each other.

Each task within the core fork-join DAG might spawn additional tasks. When we wait for the BSP part of the graph (left) to terminate, these tasks might still linger around. They should backfill any empty task queue when it is appropriate or, in general, be executed with low priority. Later on throughout the execution, some further tasks from the core code section will need the outcome of tasks that have been spawned before. At this point, the tasks should already be completed (and therefore have dumped their outcomes into a database). Otherwise, we have to invoke the scheduler manually via tarch::multicore::processPendingTasks().
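In C++-flavoured pseudocode, the fork-join part sketched above reads roughly as follows (MyTraversalTask and subtrees are hypothetical user-side names; the snippet relies on the Peano headers and is a sketch, not a compilable unit):

```cpp
// One task per subtree; spawnAndWait() forks them and joins afterwards.
std::vector<tarch::multicore::Task*> traversalTasks;
for (auto& subtree : subtrees) {
  traversalTasks.push_back(new MyTraversalTask(subtree));
}
tarch::multicore::spawnAndWait(traversalTasks);
// Each traversal task may in turn have spawned low-priority tasks via
// spawnTask(); those may still linger around at this point.
```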

The tasks to the right are typically called enclave tasks, and they are typically added on top of the core fork-join task structure by Peano extensions. The prime example is ExaHyPE 2. Enclave tasking yields real bursts of low-priority tasks, and it would be unacceptably complicated to model the dependencies on their outcomes via a task graph. Therefore, I went down the route that these tasks dump their outcomes into a table, and consumer tasks (or receivers in C++ terminology) then take the outcomes from there.

There is a further reason to work with the dedicated queues: As we found out and describe in the 2021 IWOMP paper by H. Schulz et al., tasking frameworks such as OpenMP might decide to switch to any ready task throughout the execution. This means that they might postpone the processing of the DAG to the left and instead process the free tasks on the right. This is not what we want: we want them to have lower priority. An important paper to read in the context of the present solution is also the one cited below; note in particular the example in Figure 3.

GPUs and an additional queue (pending tasks)

On some platforms and code generations, it has proven valuable to add an additional runtime (layer) on top of the native tasking backend. These layers exist for TBB and OpenMP, e.g., and all follow the same general pattern. Basically, they introduce an additional, user-managed task queue on top of the native tasking system. Enclave tasks are not spawned into the native runtime but instead held back within this additional queue. There are two advantages of this:

  1. Some runtimes see the task spawning as a scheduling point (OpenMP, e.g.) and might decide to run the enclave tasks immediately, as there are so many of them. However, we really want to have them at low priority. By holding them back in our own queue, they are invisible to the tasking runtime and therefore cannot be scheduled.
  2. We can search the additional queue for tasks of the same type which fit onto the GPU. If we find several of these, we pack them together into one large meta task (or task assembly) and throw them onto an accelerator.

Performance analysis shows that using one global queue on top of the actual tasking backend quickly introduces performance issues, as too many threads might try to commit their tasks concurrently into this queue. We suffer from congestion. Therefore, the implementation in Tasks.cpp employs one global queue plus a queue per thread.

Tasks with dependencies

In Peano, task DAGs are built up along the task workflow. That is, each task that is not used within a fork-join region or is totally free is assigned a unique number when we spawn it.

Whenever we define a task, we can also define its dependencies. This is a sole completion dependency: you tell the task system which task has to be completed before the currently submitted one is allowed to start. A DAG thus can be built up layer by layer. We start with the first task. This task might be immediately executed - we do not care - and then we continue to work our way down through the graph, adding node by node.

Different to OpenMP, outgoing dependencies do not have to be declared. We solely model incoming dependencies. This however implies that all predecessors of a task have to be submitted before we add this very task. This is not always possible. When you walk top-down through a tree and then bottom-up again, you might have thrown away finer levels by the time you get back to the original one, and thus you are unable to introduce dependencies from fine to coarse unless you introduce very fancy bookkeeping.

We avoid such bookkeeping (or deploy it into the tasking API), as we allow codes to register a task. registerTask() tells the system "there will be a task eventually, but I haven't yet constructed it". You can also add dependencies from an existing task to such a registered task. Once you finally submit the registered task, you are however not allowed to add any dependencies anymore.
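The layer-by-layer construction can be sketched in C++-flavoured pseudocode (TaskA and TaskB are hypothetical user task types; the snippet relies on the Peano headers and the spawnTask() signature documented below, and is not a compilable unit):

```cpp
using tarch::multicore::spawnTask;
using tarch::multicore::TaskNumber;

const TaskNumber a = 1;
const TaskNumber b = 2;

// First task: no incoming dependencies, but it gets a number so that
// follow-up tasks can depend on it. It may already run while we continue.
spawnTask(new TaskA(), std::set<TaskNumber>(), std::set<TaskNumber>(), a);

// Second task: may only start once task a has completed. a had to be
// spawned first; otherwise we would introduce an anti-dependency.
spawnTask(new TaskB(), std::set<TaskNumber>{a}, std::set<TaskNumber>(), b);

// Eventually, wait for the sink of this little DAG.
tarch::multicore::waitForTasks(std::set<TaskNumber>{b});
```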

Orchestration and (auto-)tuning

The actual orchestration is controlled via an implementation of tarch::multicore::orchestration::Strategy that you set via tarch::multicore::setOrchestration(). The strategy basically controls:

  • How many enclave tasks should be held back in the user-defined queue. If you exceed this threshold, the backend maps each enclave task 1:1 onto a native task.
  • How many tasks the system should try to fuse into one GPU call.
  • Whether the code should try to fuse these tasks immediately when they are spawned, or search for fusion candidates every time a fork-join section has terminated.
  • What an appropriate scheduling strategy for a fork-join section is, depending on the level of nested parallelisation calls.

The class tarch::multicore::orchestration::StrategyFactory allows you to pick various common strategies, and it also provides the routine which determines the default variant that is chosen if the user does not manually pick one. As we outsource the orchestration into strategy objects, users can implement online and offline autotuning through an implementation of tarch::multicore::orchestration::Strategy.
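A usage sketch (MyAutotuningStrategy is a hypothetical user implementation of the Strategy interface; the snippet relies on the Peano headers and is not a compilable unit):

```cpp
// Install a custom orchestration; the multicore layer works with the
// passed pointer from now on.
tarch::multicore::setOrchestration(new MyAutotuningStrategy());

// Alternatively, swap the strategy in temporarily and restore the old
// one afterwards, e.g. around a measurement phase:
auto* previous = tarch::multicore::swapOrchestration(new MyAutotuningStrategy());
// ... run a few traversals and take measurements ...
tarch::multicore::setOrchestration(previous);
```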

Scheduling flavours for fork-join sections

Whenever we hit a fork-join section, i.e. encounter tarch::multicore::spawnAndWait(), there are different scheduling variants on the table:

  1. Execute the tasks straightaway. Do not exploit any concurrency.
  2. Run a straightforward, native task loop.
  3. Run through the forked tasks in parallel and check eventually if we should map some subtasks onto native tasks.

The orchestration of choice can control this behaviour via tarch::multicore::orchestration::Strategy::paralleliseForkJoinSection().

The documentation of tarch::multicore::spawnAndWait() provides the rationale why I think that variant (2) minimises the algorithmic latency, whereas variant (3) maximises occupation.



If you want to use the OpenMP backend, you have to embed your whole main loop within an

#pragma omp parallel
#pragma omp single

environment. Furthermore, you will have to use

export OMP_NESTED=true

on some systems, as we rely heavily on nested parallelism.


If the Peano statistics are enabled, the tasking backend will sample some quantities:

  • "tarch::multicore::bsp-concurrency-level" Typically corresponds to the number of fork-join traversal tasks.
  • "tarch::multicore::global-pending-tasks" Global pending tasks.
  • "tarch::multicore::thread-local-pending-tasks" Pending tasks per thread which are not yet committed to the global queue.
  • "tarch::multicore::fuse-tasks" Number of tasks which have been fused.

Depending on the chosen backend, you might get additional counters on top.

Typedef Documentation

◆ TaskNumber

Definition at line 19 of file Tasks.h.

Function Documentation

◆ getNumberOfUnmaskedThreads()

int tarch::multicore::getNumberOfUnmaskedThreads ( )

This routine runs through the Unix thread mask and counts how many threads SLURM allows a code to use.

It returns this count. If you use multiple MPI ranks per node, each rank usually gets permission to access the same number of cores exclusively.

Definition at line 32 of file Core.cpp.

◆ initSmartMPI()

void tarch::multicore::initSmartMPI ( )

Switch on SmartMPI.

If you use SmartMPI, then the bookkeeping registers the local scheduling. If you don't use SmartMPI, this operation becomes a nop, i.e. you can always call it and configure will decide whether it does something useful.

Definition at line 11 of file multicore.cpp.

References tarch::mpi::Rank::getInstance(), and tarch::mpi::Rank::setCommunicator().

Referenced by main().


◆ printUnmaskedThreads()

std::string tarch::multicore::printUnmaskedThreads ( )

Creates a string representation of those threads which are available to the process.

You get a string similar to


The example above means that cores 4-7 and 12-15 are available to the process; the other cores are not.

Definition at line 11 of file Core.cpp.

◆ processPendingTasks()

bool tarch::multicore::processPendingTasks ( int maxTasks = std::numeric_limits<int>::max(),
bool fifo = true )

Process a few tasks from my backlog of tasks.

This routine tries to complete maxTasks. It is important that this routine makes progress, i.e. processes tasks, if there are any tasks left in the system. ExaHyPE's enclave tasking, for example, uses this processing within its polling for enclave task results; if it did not make progress, that polling would starve. As such, it is absolutely fine if the implementation of processPendingTasks() suspends the actual thread, as long as this thread is not permanently suspended.

This routine invokes internal::copyInternalTaskQueuesOverIntoGlobalQueue() first of all to maximise the number of tasks in the local queue.

Parameters
    maxTasks Specify how many tasks to process at most. By constraining this number, you can realise some polling where you check for a condition. If the condition is not met, you ask the task system to complete a few tasks, but you don't want the task system to complete all tasks, as you don't want to wait for ages before you check again.
    fifo Shall the system try to complete the tasks in FIFO order? This is a recommendation; not all task processing strategies support such a clue mechanism.
Returns
    There have been tasks.
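The polling pattern described for maxTasks can be sketched in C++-flavoured pseudocode (resultIsAvailable() is a hypothetical predicate; the snippet relies on the Peano headers and is not a compilable unit):

```cpp
// Poll for an outcome; in between checks, ask the runtime to complete a
// few pending tasks, but not all of them, so we re-check reasonably soon.
while (not resultIsAvailable()) {
  tarch::multicore::processPendingTasks(8, /* fifo */ true);
}
```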

Definition at line 240 of file Tasks.cpp.

References assertion, and tarch::multicore::internal::copyInternalTaskQueuesOverIntoGlobalQueue().

Referenced by examples::regulargridupscaling::MyObserver::endTraversal(), and exahype2::EnclaveBookkeeping::waitForTaskToTerminateAndReturnResult().


◆ setOrchestration()

◆ shutdownSmartMPI()

void tarch::multicore::shutdownSmartMPI ( )

Definition at line 27 of file multicore.cpp.

Referenced by main().


◆ spawnAndWait()

void tarch::multicore::spawnAndWait ( const std::vector< Task * > & tasks)

Fork-join task submission pattern.

The realisation is relatively straightforward:

  • Increment nestedSpawnAndWaits for every fork-join section that we enter.
  • Tell the orchestration that a BSP section starts.
  • Ask the orchestration which realisation to pick.
  • Either run through the task set sequentially or invoke the native parallel implementation.
  • If there are tasks pending and the orchestration instructs us to do so, map them onto native tasks.
  • Tell the orchestration that the BSP section has ended.
  • Decrement nestedSpawnAndWaits whenever we leave a fork-join section.

Scheduling variants

The precise behaviour of the implementation is controlled through the orchestration. At the moment, we support three different variants:

  1. The serial variant tarch::multicore::orchestration::Strategy::ExecutionPolicy::RunSerially runs through all the tasks one by one. Our rationale is that a good orchestration picks this variant for very small task sets where the overhead of the join-fork makes a parallelisation counterproductive.
  2. The parallel variant tarch::multicore::orchestration::Strategy::ExecutionPolicy::RunParallel runs through all the tasks in parallel. Once all tasks are completed, the code commits all the further tasks that have been spawned into a global queue and then decides whether to fuse them further or to map them onto native tasks. This behaviour has to be studied in the context of tarch::multicore::spawnTask(), which might already have mapped tasks onto native tasks or GPU tasks, i.e. at this point no free subtasks might be left over in the local queues even though there had been some. It is important to be careful with this "commit all tasks after the traversal" approach: In OpenMP, it can lead to deadlocks if the taskwait is realised via busy polling. See the bug description below.
  3. The parallel variant tarch::multicore::orchestration::Strategy::ExecutionPolicy::RunParallelAndIgnoreWithholdSubtasks runs through all the tasks in parallel. Different to tarch::multicore::orchestration::Strategy::ExecutionPolicy::RunParallel, it does not try to commit any further subtasks or to fuse them. This variant allows the scheduler to run task sets in parallel but to avoid the overhead introduced by the postprocessing.

I would appreciate if we could distinguish busy polling from task scheduling in the taskwait, but such a feature is not available within OpenMP, and we haven't studied TBB in this context yet.

Implementation flaws in OpenMP and bugs buried within the sketch

In OpenMP, the taskwait pragma allows the scheduler to process other tasks as it is a scheduling point. This way, it should keep cores busy all the time as long as there are enough tasks in the system. If a fork-join task spawns a lot of additional subtasks, and if the orchestration does not tell Peano to hold them back, the OpenMP runtime might switch to the free tasks rather than continue with the actual fork-join tasks, which is not what we want and introduces runtime flaws later down the line. This phenomenon is described in our 2021 IWOMP paper by H. Schulz et al.

A more severe problem arises the other way round: Several groups have reported that the taskwait does not continue with other tasks. See in particular

Jones, Christopher Duncan (Fermilab): Using OpenMP for HEP Framework Algorithm Scheduling.

Their presentation slides can be found at

This paper clarifies that some OpenMP runtimes do (busy) waits within the taskwait construct to be able to continue immediately. They do not process other tasks meanwhile. Our own ExaHyPE 2 POP review came to the same conclusion.

This can lead to a deadlock in applications such as ExaHyPE which spawn bursts of enclave tasks and then later on wait for their results to drop in. The consuming tasks will issue a taskyield(), but this will not help if the taskyield() merely permutes through all the other traversal tasks.

If you suffer from that, you have to ensure that all enclave tasks have finished prior to the next traversal.


It is important to know how many BSP sections are active at a point. I therefore use the stats interface to maintain the BSP counters. However, I disable any statistics sampling, so I get a spot-on overview of the number of forked subtasks at any point.

Speak to OpenMP: it would be totally great if we could say that the taskwait shall not(!) issue a new scheduling point. We would like to distinguish taskwaits which prioritise throughput vs. algorithmic latency.
Speak to OpenMP: we would like a taskyield() which does not(!) continue with a sibling. This is important for producer-consumer patterns.

Definition at line 302 of file Tasks.cpp.

References _log, tarch::multicore::internal::copyInternalTaskQueuesOverIntoGlobalQueue(), tarch::multicore::orchestration::Strategy::endBSPSection(), tarch::multicore::orchestration::Strategy::EndOfBSPSection, tarch::multicore::Lock::free(), tarch::multicore::internal::fusePendingTasks(), tarch::logging::Statistics::getInstance(), tarch::multicore::Core::getInstance(), tarch::multicore::orchestration::Strategy::getNumberOfTasksToFuseAndTargetDevice(), tarch::multicore::orchestration::Strategy::getNumberOfTasksToHoldBack(), tarch::multicore::internal::getNumberOfWithholdPendingTasks(), tarch::logging::Statistics::inc(), tarch::multicore::Lock::lock(), tarch::multicore::internal::mapPendingTasksOntoNativeTasks(), tarch::multicore::orchestration::Strategy::paralleliseForkJoinSection(), tarch::multicore::orchestration::Strategy::RunParallel, tarch::multicore::orchestration::Strategy::RunParallelAndIgnoreWithholdSubtasks, tarch::multicore::orchestration::Strategy::RunSerially, tarch::multicore::native::spawnAndWaitAsTaskLoop(), tarch::multicore::orchestration::Strategy::startBSPSection(), and tarch::multicore::Core::yield().

Referenced by peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingSends(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingSendsAndReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingSends(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingSendsAndReceives(), and peano4::parallel::SpacetreeSet::traverse().


◆ spawnTask()

void tarch::multicore::spawnTask ( Task * task,
const std::set< TaskNumber > & inDependencies = std::set<TaskNumber>(),
const std::set< TaskNumber > & conflicts = std::set<TaskNumber>(),
const TaskNumber & taskNumber = NoDependencies )

Spawns a single task in a non-blocking fashion.

Ownership goes over to Peano's job namespace, i.e. you don't have to delete the pointer.

Handling tasks without outgoing dependencies

If taskNumber equals NoDependencies, we know that no one is (directly) waiting for this task, i.e. we won't add dependencies to the task graph afterwards. In this case, the realisation is straightforward:

  1. If SmartMPI is enabled and the task should be sent away, do so.
  2. If the current orchestration strategy (an implementation of tarch::multicore::orchestration::Strategy) says that we should hold back tasks, but the current number of tasks in the thread-local queue already exceeds this threshold, invoke the native tarch::multicore::native::spawnTask(task).
  3. If none of these ifs apply, enqueue the task in the thread-local queue.
  4. If we came through route (3), double-check if we should fuse tasks and offload them onto GPUs.

spawnTask() will never commit a task to the global task queue and therefore is inherently thread-safe.

Tasks with a task number and incoming dependencies

Spawn a task that depends on one or more other tasks. Alternatively, pass in NoDependencies; in this case, the task can kick off immediately. You have to specify a task number. This number allows other, follow-up tasks to become dependent on this very task. Please note that the tasks have to be spawned in order, i.e. if B depends on A, then A has to be spawned before B. Otherwise, you introduce a so-called anti-dependency. This is OpenMP jargon which we adopted ruthlessly.

You may pass NoDependencies as taskNumber. In this case, you have a fire-and-forget task which is just pushed out there without anybody ever waiting for it later on (at least not via task dependencies).

See also
tarch::multicore and the section "Tasks with dependencies" therein for further documentation.
tarch::multicore::spawnAndWait() for details what happens with tasks that have no outgoing dependencies.
processPendingTasks(int) describing how we handle pending tasks.

Definition at line 262 of file Tasks.cpp.

References assertion, assertion1, assertion2, tarch::multicore::internal::copyInternalTaskQueueOverIntoGlobalQueue(), tarch::multicore::Lock::free(), tarch::multicore::internal::fusePendingTasks(), tarch::multicore::orchestration::Strategy::fuseTasksImmediatelyWhenSpawned(), tarch::logging::Statistics::getInstance(), tarch::multicore::Core::getInstance(), tarch::multicore::orchestration::Strategy::getNumberOfTasksToFuseAndTargetDevice(), tarch::multicore::orchestration::Strategy::getNumberOfTasksToHoldBack(), tarch::multicore::Task::getTaskType(), tarch::multicore::Core::getThreadNumber(), tarch::logging::Statistics::log(), logDebug, NoDependencies, and tarch::multicore::native::spawnTask().


◆ swapOrchestration()

◆ waitForAllTasks()

void tarch::multicore::waitForAllTasks ( )

Wait for all tasks which have been spawned by spawnTask.

This routine might return and still miss out on a few pending tasks. It basically runs only over those tasks with in/out dependencies and ensures that they are either done or pending.

Definition at line 394 of file Tasks.cpp.

◆ waitForTask()

void tarch::multicore::waitForTask ( const int taskNumber)

Wrapper around waitForTasks() with a single-element set.

Definition at line 400 of file Tasks.cpp.

References waitForTasks().


◆ waitForTasks()

void tarch::multicore::waitForTasks ( const std::set< TaskNumber > & inDependencies)

Wait for set of tasks.

Entries in inDependencies can be NoDependencies. This is a trivial implementation, as we basically run through each task in inDependencies and invoke waitForTask() for it. We don't have to rely on some backend-specific implementation.

Serial code

This routine degenerates to a nop, as no task can be pending: spawnTask() always executes the task straightaway.

Definition at line 397 of file Tasks.cpp.

References tarch::multicore::native::waitForTasks().

Referenced by waitForTask().


Variable Documentation

◆ NoDependencies