Peano 4
Vendor software stacks and system- and compiler-specific settings

Peano handles system- and compiler-specific settings and properties in three ways:

  • Most of the system-specific information should be specified through CXXFLAGS or LDFLAGS, i.e. when you configure the original setup. These flags as well as all the information about enabled toolboxes, extensions, and so forth are dumped into the config.h and the Makefiles. C++ codes will read config.h to take the settings into account, and the Python API will parse the generated Makefile to extract the relevant info and use it for all codes built on top of Peano, too.
  • At compile time, the system will furthermore read out the compiler version, and it will include the compiler-specific settings for this particular compiler. This allows Peano to use certain defines for certain compiler flavours.
  • Peano offers special toolchains for special vendors.

The page below discusses these toolchains and provides information on settings for some vendors. There is a dedicated subpage for some machines that we use quite a lot for Peano. Before we dive into particular toolchains, a general remark on how the compiler dependencies are managed:

Realisation of Peano's internal compiler-specific switches

Peano relies on a header tarch/compiler/CompilerSpecificSettings.h. This header reads out the compiler version and includes a particular flavour of the header for this compiler, i.e. the header evaluates some compiler preprocessor directives and then includes the one it finds most appropriate. You may always include your own file derived from one of the other headers in the directory.
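
The dispatch described above can be pictured with a minimal sketch. It does not reproduce the content of CompilerSpecificSettings.h: the macro SKETCH_COMPILER_FLAVOUR and the function compilerFlavour() are hypothetical names introduced for illustration, and only the predefined vendor macros (__INTEL_LLVM_COMPILER, __NVCOMPILER, __clang__, __GNUC__) are real. Where the sketch defines a string, a real dispatch header would presumably include a flavour-specific header instead.

```cpp
#include <cassert>
#include <string>

// Test the most specific vendor macro first: icpx and AMD's Clang also define
// __clang__, and Clang-based compilers define __GNUC__ as well.
#if defined(__INTEL_LLVM_COMPILER)
  #define SKETCH_COMPILER_FLAVOUR "intel-oneapi"
#elif defined(__NVCOMPILER)
  #define SKETCH_COMPILER_FLAVOUR "nvidia-hpc"
#elif defined(__clang__)
  #define SKETCH_COMPILER_FLAVOUR "clang"
#elif defined(__GNUC__)
  #define SKETCH_COMPILER_FLAVOUR "gnu"
#else
  #define SKETCH_COMPILER_FLAVOUR "unknown"
#endif

// Hypothetical helper: reports which branch the preprocessor has picked.
inline std::string compilerFlavour() {
  const std::string flavour = SKETCH_COMPILER_FLAVOUR;
  assert(!flavour.empty());  // every branch above defines a non-empty name
  return flavour;
}
```

The order of the tests matters more than the names: a generic check such as __GNUC__ has to come last, otherwise vendor compilers that mimic GCC would end up in the wrong branch.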

Whenever we find incompatibilities between different compilers, we try to resolve them through defines within the compiler-specific settings. This way, we avoid spreading "fixes" over the whole code. The settings are also used to configure machine specifics such as the default alignment. Most expressions within the compiler-specific settings header can be overwritten manually via defines. Consult the file implementation for details.
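
A define that can be overwritten manually usually follows the #ifndef pattern. The sketch below illustrates this for a default alignment. The macro SKETCH_DEFAULT_ALIGNMENT, the struct AlignedBlock and the helper isAligned() are hypothetical and not taken from Peano's headers; the actual names and fallback values are in the file implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical macro name. A build could overwrite it, e.g. through
// CXXFLAGS="... -DSKETCH_DEFAULT_ALIGNMENT=64". Otherwise we fall back to a
// value derived from the instruction set the compiler targets.
#ifndef SKETCH_DEFAULT_ALIGNMENT
  #if defined(__AVX512F__)
    #define SKETCH_DEFAULT_ALIGNMENT 64
  #elif defined(__AVX__)
    #define SKETCH_DEFAULT_ALIGNMENT 32
  #else
    #define SKETCH_DEFAULT_ALIGNMENT 16
  #endif
#endif

// Data structure that picks up whatever alignment has been configured.
struct alignas(SKETCH_DEFAULT_ALIGNMENT) AlignedBlock {
  double values[8];
};

// Returns true if the pointer is a multiple of the given alignment.
inline bool isAligned(const void* pointer, std::size_t alignment) {
  assert(alignment > 0);
  return reinterpret_cast<std::uintptr_t>(pointer) % alignment == 0;
}
```

Because the fallback sits inside the #ifndef guard, a value passed on the command line always wins over the compiler-derived default.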

The config.h as generated by the build system also feeds into the compiler-specific settings. First of all, it defines a few generic constants such as SharedTBB or Parallel. These are classic macro symbols. Whenever certain macro combinations have knock-on effects on other features, they should be covered within CompilerSpecificSettings and in turn be mapped onto further macros.
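
The mapping of macro combinations onto further macros might look like the sketch below. SharedTBB and Parallel are the symbols named above; here they are defined by hand to keep the example self-contained, and the sketch assumes that Parallel signals an MPI build and SharedTBB a TBB build. The derived macro SketchMultithreadedMPI and both functions are hypothetical.

```cpp
#include <cassert>

// Stand-ins for what a generated config.h might contain. In a real build
// these symbols come from the build system, not from the source file.
#define SharedTBB
#define Parallel

// Hypothetical derived macro: the knock-on effect of combining MPI with
// multithreading is recorded once, so other code tests only this symbol.
#if defined(Parallel) && defined(SharedTBB)
  #define SketchMultithreadedMPI 1
#else
  #define SketchMultithreadedMPI 0
#endif

inline bool needsThreadSafeMPI() {
  return SketchMultithreadedMPI == 1;
}

// Hypothetical start-up check: a hybrid build is only valid if the MPI
// library grants thread support.
inline void checkThreadSupport(bool mpiGrantsThreadSupport) {
  assert(!needsThreadSafeMPI() || mpiGrantsThreadSupport);
}
```

Code outside the settings header then never has to repeat the combination test itself.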

There are exceptions to this rule: If you have a certain GPU backend, you might have to annotate functions in a certain way. Such information is not covered within CompilerSpecificSettings (it has nothing to do with a particular compiler choice), but is found directly within the headers of the respective namespace in the technical architecture.
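
As an illustration of such backend-specific annotations, the sketch below wraps a function in OpenMP's declare target directives. All SKETCH_ macros are hypothetical and do not mirror Peano's actual headers. SKETCH_GPU_OFFLOADING_OMP stands in for whatever symbol a build defines when OpenMP offloading is enabled. It is left undefined here, so the annotations expand to nothing and the code runs on the host.

```cpp
#include <cassert>

// Hypothetical annotation macros which depend on the GPU backend only, not
// on the compiler vendor.
#if defined(SKETCH_GPU_OFFLOADING_OMP)
  #define SKETCH_GPU_CALLABLE_BEGIN _Pragma("omp declare target")
  #define SKETCH_GPU_CALLABLE_END   _Pragma("omp end declare target")
#else
  #define SKETCH_GPU_CALLABLE_BEGIN
  #define SKETCH_GPU_CALLABLE_END
#endif

// A function that should be callable from within a device kernel.
SKETCH_GPU_CALLABLE_BEGIN
inline double flux(double q) {
  return 0.5 * q * q;
}
SKETCH_GPU_CALLABLE_END

// Host-side loop which applies the annotated function to each entry.
inline void applyFlux(const double* in, double* out, int n) {
  assert(n >= 0);
  for (int i = 0; i < n; i++) {
    out[i] = flux(in[i]);
  }
}
```

Other backends would replace the two pragmas with their own markers, while the annotated function itself stays untouched.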

Vendor toolchains and compiler settings

Intel (oneAPI)

Intel toolchain

Peano provides support for Intel's (oneAPI) toolchain through two routes: The ITAC interface and the Instrumentation and Tracing Technology (ITT). To activate either of these toolchains, you have to reconfigure with one of the two options below:

./configure ... --with-toolchain=itac
./configure ... --with-toolchain=itt

Broadly speaking, the toolchain alters the build in three ways:

  • It enables ITAC or ITT.
  • It switches the line logger to the ITAC or ITT logger. That is, trace data are not piped onto the terminal anymore, but are passed into the ITAC or ITT API.
  • It adds some Intel-specific compile flags.

The enumeration shows that the name Intel toolchain is misleading. We are actually not tailoring the build to the Intel toolchain, but we tailor the setup to the Intel analysis tools. Most production runs will not use the Intel toolchain, but rather set all Intel-specific flags for compiler and linker manually.

Intel-specific compiler flags

At the moment, we add the following flags to the compilation when the Intel toolchain is activated:

  • We add
    CXXFLAGS="... -DTBB_USE_THREADING_TOOLS -parallel-source-info=2"
    to all compiles.
  • We add
    CXXFLAGS="... -DTBB_USE_ASSERT"
    to all debug and assert builds.

The loggers

Even though you have switched to the Intel loggers, you will not get any trace information if you build in release mode. You have to switch to the trace mode (or assertions or debug) to get traces or annotation info; compare the general remarks on Peano's build modes.

Once you try to trace your code with the Intel tool, the size of the traces quickly becomes unmanageable or the performance might go down. Therefore, we disable the trace by default. The first trace command will then enable it.
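
The "disabled by default, enabled by the first trace command" behaviour can be pictured with the sketch below. The class LazyTraceLogger and all of its members are hypothetical; Peano's ITAC and ITT loggers are not reproduced here. In an ITT build, the switch might map onto __itt_resume() from ittnotify.h, but that mapping is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical logger which keeps the trace collection off until the first
// trace call arrives.
class LazyTraceLogger {
  public:
    bool isEnabled() const { return _enabled; }

    std::size_t depth() const { return _openRegions.size(); }

    // The first trace call switches the (initially disabled) collection on.
    void traceIn(const std::string& routine) {
      if (!_enabled) {
        _enabled = true;  // assumption: an ITT build might call __itt_resume() here
      }
      _openRegions.push_back(routine);
    }

    // Regions have to be closed in the reverse order they were opened.
    void traceOut(const std::string& routine) {
      assert(!_openRegions.empty() && _openRegions.back() == routine);
      _openRegions.pop_back();
    }

  private:
    bool                     _enabled = false;
    std::vector<std::string> _openRegions;
};
```

With such a scheme, start-up and initialisation phases stay out of the trace as long as they do not issue trace commands themselves.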

For further discussion on logging, please see the generic logging description for Peano.

Threading Building Blocks (TBB)

We found that the newer Intel compilers provide a flag

CXXFLAGS="... -tbb" LDFLAGS="... -tbb"

which means you don't have to manually link against TBB anymore. It should also provide all the includes. It is not clear which flags it sets (it might be more than a few include paths), so we found this route more reliable than adding TBB paths and libraries manually.

SYCL

SYCL is directly supported via icpx. There are only a few things to do:

  • Configure your code with --with-multithreading=sycl and/or --with-gpu=sycl.
  • Add CXXFLAGS="... -fsycl" to your compiler flags. This way, all the headers of SYCL are known to icpx.
  • Add LDFLAGS="... -fsycl" to your linker flags. This way, the linker will automatically add all SYCL libraries. Furthermore, it will know when it builds the final applications that it should embed both the device kernels and the CPU kernels into the executable. If you use LIBS="... -lsycl", the linking will succeed, but the very first time you invoke a SYCL kernel, you'll get a message similar to
    terminate called after throwing an instance of 'sycl::_V1::runtime_error'
    what(): No kernel named _ZTSZZN7toolbox15blockstructured64interpolateCellDataAssociatedToVolumesIntoOverlappingCell_linearEiiiiiPKdPdN6peano45utils15LoopParallelismEENKUlRT_E_clIN4sycl3_V17handlerEEEDaS8_EUlNSC_2idILi3EEEE_ was found -46 (PI_ERROR_INVALID_KERNEL_NAME)
    Aborted (core dumped)
  • Ensure CXX=icpx points to the icpx compiler.
  • Ensure CC=icpx points to the icpx compiler if you work with the autotools. This is not intuitive and actually quite a hack, but you need it to work: configure will test if your C compiler is compatible with your linker flags. However, the -fsycl option is unknown to the default C compiler (usually gcc), so your configure will fail. By letting the C compiler variable point to the Intel C++ compiler, you ensure that the sanity check within the configuration phase passes.
  • Reset the MPI compiler (see below; if you want to use MPI).

MPI

Intel's MPI wrapper is mpiicpc even though they now want you to use icpx instead of icc/icpc. To tell the MPI wrapper that you want to use icpx, you have to set some environment variables:

export I_MPI_CXX=icpx
./configure --with-mpi=mpiicpc

NVIDIA NVC++

The NVIDIA toolchain requires us to make CXX and CC point to the C++ compiler, as both are passed the same arguments by configure, which are actually only understood by the C++ version.

Some compute kernels of some Peano extensions (such as ExaHyPE) are not available with the NVIDIA tools, as the compiler is pretty picky. It refuses, for example, to place temporary array variables on the call stack. So you might have to play around with the software configuration.
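
One way to play around with such a kernel is to move the temporary array out of the kernel and let the caller own the scratch memory. The sketch below contrasts both variants. It is a generic illustration, not the workaround used in Peano or ExaHyPE, and all names (UnknownsPerCell, sumOfSquaresStack, sumOfSquaresScratch) are hypothetical.

```cpp
#include <cassert>
#include <vector>

constexpr int UnknownsPerCell = 4;  // hypothetical kernel size

// Variant with a temporary array on the call stack, i.e. the pattern the
// text reports as problematic.
inline double sumOfSquaresStack(const double* in) {
  double tmp[UnknownsPerCell];
  for (int i = 0; i < UnknownsPerCell; i++) {
    tmp[i] = in[i] * in[i];
  }
  double result = 0.0;
  for (int i = 0; i < UnknownsPerCell; i++) {
    result += tmp[i];
  }
  return result;
}

// Variant where the caller provides the scratch memory, so the kernel
// itself needs no temporary array.
inline double sumOfSquaresScratch(const double* in, double* scratch, int n) {
  assert(n >= 0);
  for (int i = 0; i < n; i++) {
    scratch[i] = in[i] * in[i];
  }
  double result = 0.0;
  for (int i = 0; i < n; i++) {
    result += scratch[i];
  }
  return result;
}
```

Both variants compute the same value. The second one trades a slightly longer signature for a kernel body without local arrays.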

NVIDIA toolchain

Yet to be written. There is an NVIDIA logger offered.

OpenMP and OpenMP offloading

NVIDIA's compiler does not support all of OpenMP, and its OpenMP implementation is rather picky when it comes to copy constructors and other C++ features. Peano has internal workarounds for all of these items, i.e. the build should in principle succeed, yet the code might be slightly slower than its Intel or native Clang counterpart.

The OpenMP offloading requires us to specify both the GPU OpenMP target and a corresponding CUDA version. It is important to recognise that --with-gpu=omp activates the offloading from Peano's point of view. However, as long as the NVIDIA compiler is not told which device to use, it will give you

Found 0 devices

if you run the code.

We require -cuda to be passed to the compiler and linker flags to enable Unified Shared Memory (USM).

Source code annotation

NVIDIA's NVPTX logging is supported by picking the

--with-toolchain=nvptx

toolchain.

AMD

For the AMD toolchain, we need to load either ROCm or AOMP. Both modules should ship AMD's modified version of Clang, which is the compiler we use. We again need to point CXX and CC to the Clang compiler.

OpenMP

The Clang compiler does not support omp_get_mapped_ptr() at the moment. We have implemented a workaround mk_omp_get_mapped_ptr() in src/tarch/multicore/omp/MkUtils.h which does the same trick.

For the configuration with OpenMP GPU offloading, we again specify --with-gpu=omp to make Peano GPU aware. Furthermore, we need to add AMD-specific offloading instructions:

./configure --enable-exahype --enable-loadbalancing --enable-blockstructured --with-gpu=omp --with-multithreading=omp CC=clang CXX=clang++ CXXFLAGS="-O3 -std=c++20 -fopenmp --offload-arch=<gpu-arch>" LDFLAGS="-fopenmp --offload-arch=<gpu-arch> -lstdc++fs"

where <gpu-arch> is either gfx906 (MI 50), gfx908 (MI 100), or gfx90a (MI 200 series).

Clang (native)

On Ubuntu and many other systems, clang is not shipped automatically with OpenMP. Instead, you have to install the package libomp-dev.

./configure CC=clang CXX=clang++ CXXFLAGS="-O3 -std=c++20 -mtune=native -march=native -fopenmp" --enable-exahype --enable-loadbalancing --enable-blockstructured --with-multithreading=omp

Should you require OpenMP offloading, i.e. GPU support, you have to add the OpenMP target instructions:

./configure CC=clang CXX=clang++ CXXFLAGS="-O3 -std=c++20 -mtune=native -march=native -fopenmp -fopenmp-targets=nvptx64" --enable-exahype --enable-loadbalancing --enable-blockstructured --with-multithreading=omp