This page describes Peano 4's multithreading namespace. More...
Namespaces | |
namespace | native |
namespace | omp |
namespace | orchestration |
namespace | taskfusion |
Task fusion means that a set of tasks is grabbed and mapped onto one large physical task instead of being processed one by one. | |
namespace | tbb |
Data Structures | |
class | BooleanSemaphore |
class | Core |
Core. More... | |
class | EmptyTask |
An empty task. More... | |
class | Lock |
Create a lock around a boolean semaphore region. More... | |
class | MultiReadSingleWriteLock |
Create a lock around a boolean semaphore region. More... | |
class | MultiReadSingleWriteSemaphore |
Read/write Semaphore. More... | |
class | RecursiveLock |
Create a lock around a boolean semaphore region. More... | |
class | RecursiveSemaphore |
Recursive Semaphore. More... | |
class | Task |
Abstract super class for a job. More... | |
class | TaskComparison |
Helper class if you want to administer tasks within a queue. More... | |
class | TaskWithCopyOfFunctor |
Frequently used implementation for a job with a functor. More... | |
class | TaskWithoutCopyOfFunctor |
Frequently used implementation for a job with a functor. More... | |
Typedefs | |
using | TaskNumber = int |
Functions | |
int | getNumberOfUnmaskedThreads () |
This routine runs through the Unix thread mask and counts how many threads SLURM allows a code to use. | |
std::string | printUnmaskedThreads () |
Creates a string representation of those threads which are available to the process.
void | initSmartMPI () |
Switch on SmartMPI. | |
void | shutdownSmartMPI () |
void | setOrchestration (tarch::multicore::orchestration::Strategy *realisation) |
tarch::multicore::orchestration::Strategy * | swapOrchestration (tarch::multicore::orchestration::Strategy *realisation) |
Swap the active orchestration. | |
tarch::multicore::orchestration::Strategy & | getOrchestration () |
void | spawnTask (Task *task, const std::set< TaskNumber > &inDependencies=tarch::multicore::NoInDependencies, const TaskNumber &taskNumber=tarch::multicore::NoOutDependencies) |
Spawns a single task in a non-blocking fashion. | |
void | waitForTasks (const std::set< TaskNumber > &inDependencies) |
Wait for set of tasks. | |
void | waitForTask (const int taskNumber) |
Wrapper around waitForTasks() with a single-element set. | |
void | spawnAndWait (const std::vector< Task * > &tasks) |
Fork-join task submission pattern. | |
void | waitForAllTasks () |
Wait for all tasks; notably, this has to take fused tasks into account.
void | processFusedTask (Task *myTask, const std::list< tarch::multicore::Task * > &tasksOfSameType, int device) |
Process a fused task. | |
Variables | |
constexpr TaskNumber | NoOutDependencies = -1 |
const std::set< TaskNumber > | NoInDependencies = std::set<TaskNumber>() |
This page describes Peano 4's multithreading namespace.
A more high-level overview is provided through Multicore programming.
If you want to distinguish between multicore and non-multicore variants in your code, guard the respective code paths with the SharedMemoryParallelisation preprocessor symbol (see the sketch below). With this symbol, you make your code independent of OpenMP, TBB or C++ threading.
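A minimal sketch of such a guard, assuming the usual preprocessor pattern; the guarded bodies are placeholders:

```cpp
#if defined(SharedMemoryParallelisation)
  // multicore variant: may rely on tarch::multicore tasks, semaphores and locks
#else
  // serial fallback: plain sequential code
#endif
```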
The multithreading environment is realised through a small set of classes. User codes work with these classes. Each type/function has an implementation within src/multicore. This implementation is a dummy that ensures that all code works properly without any multithreading support. Subdirectories hold alternative implementations (backends) which are enabled once the user selects a certain multithreading implementation variant, i.e. depending on the ifdefs set, one of the subdirectories is used. Some implementations introduce further headers, but user code is never supposed to work against functions or classes held within subdirectories.
If you want to use the OpenMP backend, you have to embed your whole main loop within an OpenMP parallel environment. Furthermore, you will have to enable nested parallelism explicitly on some systems, as we rely heavily on nested parallelism.
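A hedged sketch of what this typically looks like, assuming the common OpenMP parallel/master pattern and nested parallelism enabled via omp_set_max_active_levels(); the exact pragmas and settings required by the backend may differ:

```cpp
#include <omp.h>

int main() {
  // Assumption: allow (at least) two levels of nested parallel regions.
  omp_set_max_active_levels(2);

  #pragma omp parallel
  {
    #pragma omp master
    {
      // ... the actual main loop runs here on the master thread, while the
      //     remaining threads of the parallel region serve as task workers ...
    }
  }
  return 0;
}
```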
The core documentation of Peano's multicore layer is part of the generic Multicore programming documentation.
using tarch::multicore::TaskNumber = int
Definition at line 99 of file multicore.h.
int tarch::multicore::getNumberOfUnmaskedThreads ( )
This routine runs through the Unix thread mask and counts how many threads SLURM allows a code to use.
It returns this count. If you use multiple MPI ranks per node, each rank is usually granted exclusive access to the same number of cores.
Definition at line 33 of file Core.cpp.
References u.
tarch::multicore::orchestration::Strategy & tarch::multicore::getOrchestration ( )
Definition at line 75 of file multicore.cpp.
Referenced by tarch::multicore::taskfusion::ProcessReadyTask::processTasks(), tarch::multicore::taskfusion::ProcessReadyTask::run(), and tarch::multicore::taskfusion::translateFusableTaskIntoTaskSequence().
void tarch::multicore::initSmartMPI ( )
Switch on SmartMPI.
If you use SmartMPI, then the bookkeeping registers the local scheduling. If you don't use SmartMPI, this operation becomes a nop, i.e. you can always call it and configure will decide whether it does something useful.
Definition at line 33 of file multicore.cpp.
References tarch::mpi::Rank::getInstance(), and tarch::mpi::Rank::setCommunicator().
Referenced by main().
std::string tarch::multicore::printUnmaskedThreads ( )
Creates a string representation of those threads which are available to the process.
You get a string similar to
0000xxxx0000xxxx00000000000000
The example above means that cores 4-7 and 12-15 are available to the process, the other cores are not.
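A hedged snippet illustrating how such a mask string could be post-processed, assuming the 'x'/'0' encoding described above; the include path is an assumption and not part of the documented API:

```cpp
#include <algorithm>
#include <iostream>
#include <string>

#include "tarch/multicore/Core.h"   // assumed header for printUnmaskedThreads()

int main() {
  const std::string mask = tarch::multicore::printUnmaskedThreads();
  // Every 'x' marks a core that is available to this process.
  const auto availableCores = std::count(mask.begin(), mask.end(), 'x');
  std::cout << "available cores: " << availableCores << std::endl;
  return 0;
}
```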
Definition at line 12 of file Core.cpp.
References u.
void tarch::multicore::processFusedTask ( Task * myTask, const std::list< tarch::multicore::Task * > & tasksOfSameType, int device )
Process a fused task.
A fused task is a task (which can be fused) and a list of further tasks which are of the same type and, hence, can be fused, too. This routine has to process all tasks, delete all the instances, i.e. both myTask and all pointers stored in tasksOfSameType, and return. It is blocking in the sense that we are guaranteed that tasks have completed once this function returns. For the execution, we rely on myTask's fuse() operation, i.e. this routine is solely there for the memory management.
It is generally not clear if it is better to spawn all tasks again as a big parallel for, or to process them sequentially without any further tasking. After all, the task fusion has been introduced to reduce task management overhead. If we now take a set of tasks and again map them onto a set of physical tasks, it is really not clear what we gain in the end, as we again introduce overhead.
myTask | First task. This is a task which can be fused. The pointer is valid. The ownership of myTask is handed over to the called routine, i.e. processFusedTask() has to ensure that it is deleted. |
tasksOfSameType | List of tasks of the same type. The list can be empty. processFusedTask() has to ensure that all tasks stored within the list are executed and subsequently destroyed. |
device | Target device on which the fused tasks should be executed. Can be host if the tasks should end up on the host. |
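A hedged sketch of how a backend could realise the memory-management contract described above, assuming Task::fuse() accepts the peer-task list and the target device (the real signature may differ):

```cpp
#include <list>

#include "tarch/multicore/Task.h"   // assumed header for tarch::multicore::Task

// Illustrative only; not the actual backend implementation.
void processFusedTaskSketch(
  tarch::multicore::Task*                    myTask,
  const std::list<tarch::multicore::Task*>&  tasksOfSameType,
  int                                        device
) {
  // The batched execution itself is delegated to the task's fuse() operation.
  myTask->fuse(tasksOfSameType, device);

  // This routine owns all instances, so it has to clean them up before returning.
  for (auto* task : tasksOfSameType) {
    delete task;
  }
  delete myTask;
}
```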
Definition at line 153 of file multicore.cpp.
References tarch::multicore::Task::fuse().
Referenced by tarch::multicore::taskfusion::ProcessReadyTask::processTasks().
void tarch::multicore::setOrchestration ( tarch::multicore::orchestration::Strategy * realisation )
Definition at line 56 of file multicore.cpp.
References assertion.
Referenced by peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingSends(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingSendsAndReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingSends(), and peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingSendsAndReceives().
void tarch::multicore::shutdownSmartMPI ( )
Definition at line 49 of file multicore.cpp.
Referenced by main().
void tarch::multicore::spawnAndWait ( const std::vector< Task * > & tasks )
Fork-join task submission pattern.
The realisation is relatively straightforward. The precise behaviour of the implementation is controlled through the orchestration; at the moment, we support three different variants.
I would appreciate it if we could distinguish busy polling from task scheduling in the taskwait, but such a feature is not available within OpenMP, and we haven't studied TBB in this context yet.
In OpenMP, the taskwait pragma allows the scheduler to process other tasks as it is a scheduling point. This way, it should keep cores busy all the time as long as there are enough tasks in the system. If a fork-join task spawns a lot of additional subtasks, and if the orchestration does not tell Peano to hold them back, the OpenMP runtime might switch to the free tasks rather than continue with the actual fork-join tasks, which is not what we want and introduces runtime flaws further down the line. This phenomenon is described in our 2021 IWOMP paper by H. Schulz et al.
A more severe problem arises the other way round: Several groups have reported that the taskwait does not continue with other tasks. See in particular
Jones, Christopher Duncan (Fermilab): Using OpenMP for HEP Framework Algorithm Scheduling. http://cds.cern.ch/record/2712271
Their presentation slides can be found at https://zenodo.org/record/3598796#.X6eVv8fgqV4.
This paper clarifies that some OpenMP runtimes do (busy) waits within the taskwait construct to be able to continue immediately. They do not process other tasks meanwhile. Our own ExaHyPE 2 POP review came to the same conclusion.
This can lead to a deadlock in applications such as ExaHyPE which spawn bursts of enclave tasks and then later on wait for their results to drop in. The consuming tasks will issue a taskyield(), but this will not help if the taskyield() merely cycles through all the other traversal tasks.
If you suffer from that, you have to ensure that all enclave tasks have finished prior to the next traversal.
It is important to know how many BSP sections are active at any given point. I therefore use the stats interface to maintain the BSP counters. However, I disable any statistics sampling, so I get a spot-on overview of the number of forked subtasks at any point in time.
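A hedged usage sketch of the fork-join pattern; MyTask stands in for a user-defined subclass of tarch::multicore::Task and the include path is an assumption:

```cpp
#include <vector>

#include "tarch/multicore/multicore.h"   // assumed header for spawnAndWait()

void runForkJoinStep() {
  std::vector<tarch::multicore::Task*> tasks;
  tasks.push_back(new MyTask(/* ... */));   // hypothetical Task subclass
  tasks.push_back(new MyTask(/* ... */));

  // All tasks are forked; the call returns once every task has completed.
  // Ownership of the task pointers goes over to the runtime.
  tarch::multicore::spawnAndWait(tasks);
}
```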
Definition at line 91 of file multicore.cpp.
References _log, tarch::multicore::Lock::free(), tarch::logging::Statistics::getInstance(), tarch::logging::Statistics::inc(), tarch::multicore::Lock::lock(), tarch::multicore::orchestration::Strategy::RunParallel, and tarch::multicore::orchestration::Strategy::RunSerially.
Referenced by peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingSends(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingSendsAndReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingSends(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingSendsAndReceives(), and peano4::parallel::SpacetreeSet::traverse().
void tarch::multicore::spawnTask ( Task * task, const std::set< TaskNumber > & inDependencies = tarch::multicore::NoInDependencies, const TaskNumber & taskNumber = tarch::multicore::NoOutDependencies )
Spawns a single task in a non-blocking fashion.
Ownership goes over to Peano's job namespace, i.e. you don't have to delete the pointer.
Spawn a task that depends on other tasks through inDependencies. Alternatively, pass in NoInDependencies; in this case, the task can kick off immediately. You have to specify a task number. This number allows other, follow-up tasks to become dependent on this very task. Please note that the tasks have to be spawned in order, i.e. if B depends on A, then A has to be spawned before B. Otherwise, you introduce a so-called anti-dependency. This is OpenMP jargon which we adopted ruthlessly.
You may pass NoOutDependencies as taskNumber. In this case, you have a fire-and-forget task which is just pushed out there without anybody ever waiting for it later on (at least not via task dependencies).
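A hedged sketch of the spawning order described above, assuming MyTask is a user-defined subclass of tarch::multicore::Task and that the task numbers are picked by the caller:

```cpp
#include "tarch/multicore/multicore.h"   // assumed header for spawnTask()/waitForTask()

void spawnDependentTasks() {
  const tarch::multicore::TaskNumber taskA = 1;
  const tarch::multicore::TaskNumber taskB = 2;

  // A has no incoming dependencies; its number exposes an out dependency.
  tarch::multicore::spawnTask(new MyTask(/* ... */), tarch::multicore::NoInDependencies, taskA);

  // B may only start once A has completed. A was spawned first, as required.
  tarch::multicore::spawnTask(new MyTask(/* ... */), {taskA}, taskB);

  // Block until B (and therefore A) has terminated.
  tarch::multicore::waitForTask(taskB);
}
```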
The OpenMP variant has to jump through a number of hoops.
task | Pointer to a task. The responsibility for this task is handed over to the tasking system, i.e. you are not allowed to delete it. |
inDependencies | Set of incoming tasks that have to finish before the present task is allowed to run. You can pass the alias tarch::multicore::NoInDependencies to make clear what's going on. |
taskNumber | Allow the runtime to track out dependencies. Only numbers handed in here may be in inDependencies in an upcoming call. If you do not expect to construct any follow-up in-dependencies, you can pass in the default, i.e. NoOutDependencies. |
Definition at line 135 of file multicore.cpp.
References assertion, tarch::multicore::Task::canFuse(), tarch::logging::Statistics::getInstance(), and tarch::logging::Statistics::inc().
tarch::multicore::orchestration::Strategy * tarch::multicore::swapOrchestration ( tarch::multicore::orchestration::Strategy * realisation )
Swap the active orchestration.
Different to setOrchestration(), this operation does not delete the current orchestration. It swaps them, so you can use setOrchestration() with the result afterwards and re-obtain the original strategy.
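A hedged sketch of temporarily installing a different orchestration and restoring the previous one afterwards; MyStrategy is a placeholder for some tarch::multicore::orchestration::Strategy subclass and the include path is an assumption:

```cpp
#include "tarch/multicore/multicore.h"   // assumed header for swapOrchestration()

void runWithTemporaryOrchestration() {
  auto* temporary = new MyStrategy();   // hypothetical Strategy subclass

  // Install the temporary strategy; the previously active one stays alive.
  tarch::multicore::orchestration::Strategy* previous =
    tarch::multicore::swapOrchestration(temporary);

  // ... run some task-heavy code under the temporary orchestration ...

  // Swap back: re-install the original strategy and reclaim the temporary one.
  temporary = tarch::multicore::swapOrchestration(previous);
  delete temporary;
}
```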
Definition at line 65 of file multicore.cpp.
References assertion.
Referenced by peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingSends(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithBlockingSendsAndReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingReceives(), peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingSends(), and peano4::parallel::tests::PingPongTest::testMultithreadedPingPongWithNonblockingSendsAndReceives().
void tarch::multicore::waitForAllTasks ( )
Wait for all tasks; notably, this has to take fused tasks into account.
Definition at line 129 of file multicore.cpp.
References tarch::multicore::taskfusion::processAllReadyTasks().
void tarch::multicore::waitForTask ( const int taskNumber )
Wrapper around waitForTasks() with a single-element set.
Definition at line 85 of file multicore.cpp.
References waitForTasks().
Referenced by exahype2::EnclaveBookkeeping::waitForTaskToTerminateAndReturnResult().
void tarch::multicore::waitForTasks ( const std::set< TaskNumber > & inDependencies )
Wait for set of tasks.
Entries in inDependencies can be NoOutDependencies. This is a trivial implementation, as we basically run through each task in inDependencies and invoke waitForTask() for it. We don't have to rely on some backend-specific implementation.
You can obviously construct a task set explicitly. If you know the number of tasks, you can however directly use the bracket notation to invoke this function:
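For instance (a hedged sketch; the task numbers are placeholders obtained from earlier spawnTask() calls, and the include path is an assumption):

```cpp
#include "tarch/multicore/multicore.h"   // assumed header for waitForTasks()

void waitForThree(tarch::multicore::TaskNumber taskA,
                  tarch::multicore::TaskNumber taskB,
                  tarch::multicore::TaskNumber taskC) {
  // The brace notation constructs the std::set<TaskNumber> argument in place.
  tarch::multicore::waitForTasks( {taskA, taskB, taskC} );
}
```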
This routine degenerates to a nop, as no task can be pending: spawnTask() always executes the task straightaway.
Definition at line 80 of file multicore.cpp.
Referenced by waitForTask().
const std::set<TaskNumber> tarch::multicore::NoInDependencies = std::set<TaskNumber>()
Definition at line 103 of file multicore.h.
constexpr TaskNumber tarch::multicore::NoOutDependencies = -1
Definition at line 101 of file multicore.h.
Referenced by swift2::TaskNumber::flatten(), tarch::multicore::taskfusion::ProcessReadyTask::run(), swift2::TaskNumber::TaskNumber(), swift2::TaskNumber::toString(), and tarch::multicore::taskfusion::translateFusableTaskIntoTaskSequence().