Peano has three different approaches to handling system- and compiler-specific settings and properties:
- Most system-specific information should be specified through CXXFLAGS or LDFLAGS, i.e. when you configure the original setup. These flags, as well as all information about enabled toolboxes, extensions, and so forth, are dumped into config.h and the Makefiles. C++ codes read config.h to take the settings into account, and the Python API parses the generated Makefile to extract the relevant information and uses it for all codes built on top of Peano, too.
- At compile time, the system furthermore reads out the compiler version and includes the compiler-specific settings for this particular compiler. This allows Peano to use certain defines for certain compiler flavours.
- Peano offers special toolchains for special vendors.
This page discusses these particular toolchains and provides some information on settings for individual vendors. There are dedicated subpages for some machines that we use a lot for Peano. Before we dive into the particular toolchains, some general remarks on how the compiler dependencies are managed:
Realisation of Peano's internal compiler-specific switches
Peano relies on a header tarch/compiler/CompilerSpecificSettings.h. This header reads out the compiler version and includes a particular flavour of the header for this compiler, i.e. it evaluates some compiler preprocessor directives and then includes the variant it finds most appropriate. You may always include your own file derived from one of the other headers in the directory.
Whenever we find incompatibilities between different compilers, we try to resolve them through defines within the compiler-specific settings. This way, we avoid scattering "fixes" over the whole code. The settings are also used to configure machine-specific properties such as the default alignment. Most expressions within the compiler-specific settings header can be overwritten manually via defines. Consult the file implementation for details.
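The dispatch itself is plain preprocessor logic. The following is a minimal sketch of the pattern, assuming hypothetical sub-header and macro names; the real ones are found in tarch/compiler and its file documentation:

#if defined(__INTEL_LLVM_COMPILER)
#include "tarch/compiler/IntelLLVM.h"      // hypothetical sub-header name
#elif defined(__clang__)
#include "tarch/compiler/LLVM.h"           // hypothetical sub-header name
#elif defined(__GNUC__)
#include "tarch/compiler/GCC.h"            // hypothetical sub-header name
#endif

// Within such a sub-header, settings are guarded so that you can
// overwrite them manually via -D on the command line:
#ifndef AlignmentOnHeap                    // hypothetical macro name
#define AlignmentOnHeap 64
#endif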
The config.h generated by the build system also feeds into the compiler-specific settings. First of all, it defines a few generic constants such as SharedTBB or Parallel. These are classic macro symbols. Whenever certain macro combinations have knock-on effects on other features, they should be covered within CompilerSpecificSettings and in turn be mapped onto further macros.
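As an illustration of such a knock-on effect, a derived macro might be set as follows; SharedTBB and Parallel are the generic symbols from config.h, while the derived symbol is made up for this sketch:

// CompilerSpecificSettings.h (schematic): derive further macros from
// the generic symbols provided by config.h.
#if defined(SharedTBB) && defined(Parallel)
#define UseConcurrentMPIRequests   // hypothetical derived feature toggle
#endif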
There are exceptions to this rule: if you have a certain GPU backend, you might have to annotate functions in a certain way. Such information is not covered within CompilerSpecificSettings (it has nothing to do with a particular compiler choice) but is found directly within the headers of the respective namespace in the technical architecture.
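A typical shape of such an annotation is a macro that expands to a backend-specific qualifier for one backend and to nothing otherwise. All names below are made up for illustration; the real annotations are defined in the headers of the respective namespace:

// Schematic only: map a portable annotation onto backend-specific syntax.
#if defined(SomeCUDABackend)    // hypothetical backend symbol
#define GPU_CALLABLE __host__ __device__
#else
#define GPU_CALLABLE
#endif

GPU_CALLABLE double square(double x) { return x * x; }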
Vendor toolchains and compiler settings
GNU
OpenMP and OpenMP offloading
CC=gcc CXX=g++ ./configure --with-multithreading=omp --with-gpu=omp CFLAGS="-O3 -W -Wall -Wno-attributes -fopenmp -foffload=nvptx-none -march=native" CXXFLAGS="-std=c++20 -O3 -W -Wall -Wno-attributes -fopenmp -foffload=nvptx-none -march=native" LDFLAGS="-fopenmp -foffload=nvptx-none -march=native -mtune=native"
Intel (oneAPI)
Intel toolchain

Peano provides support for Intel's oneAPI toolchain through two routes: the ITAC interface and the Instrumentation and Tracing Technology (ITT). To activate either of these toolchains, reconfigure with one of the two options below:
./configure ... --with-toolchain=itac
./configure ... --with-toolchain=itt
Broadly speaking, the toolchain alters the code in three ways:
- It enables ITAC or ITT.
- It switches the line logger to the ITAC or ITT logger. That is, trace data are not piped to the terminal anymore but are passed into the ITAC or ITT API.
- It adds some Intel-specific compile flags.
The enumeration shows that the name Intel toolchain is misleading: we are not tailoring the build to the Intel toolchain, but tailoring the setup to the Intel analysis tools. Most production runs will not use the Intel toolchain but set all Intel-specific compiler and linker flags manually.
Intel-specific compiler flags
At the moment, we add the following flags when the Intel toolchain is activated:
- We add
CXXFLAGS="... -DTBB_USE_THREADING_TOOLS -parallel-source-info=2"
to all compiles.
- We add
CXXFLAGS="... -DTBB_USE_ASSERT"
to all debug and assert builds.
The loggers
Even though you have switched to the Intel loggers, you will not get any trace information if you build in release mode. You have to switch to the trace mode (or assertions or debug) - compare the general remarks on Peano's build modes - to get traces or annotation info.
Once you trace your code with the Intel tools, the size of the traces quickly becomes unmanageable, or the performance might degrade. Therefore, we disable the trace by default; the first trace command will then enable it.
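Under the hood, this toggling maps onto the ITT API. The following standalone sketch shows the plain API (not Peano's logger wrapper, which performs these calls internally); it requires linking against the ITT notify library:

#include <ittnotify.h>

int main() {
  __itt_pause();    // collection disabled while the code sets itself up
  // ... grid construction, initialisation, ...
  __itt_resume();   // the first trace command: start collecting
  // ... region of interest ...
  __itt_pause();    // stop collecting again
  return 0;
}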
For further discussion on logging, please see the generic logging description for Peano.
Threading Building Blocks (TBB)
We found that the newer Intel compilers provide a flag
CXXFLAGS="... -tbb" LDFLAGS="... -tbb"
which means you don't have to link against TBB manually anymore. It should also provide all the includes. It is not clear exactly which flags are set (it might be more than only a few include paths), so we found this route more reliable than adding TBB paths and libraries manually.
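A quick way to check that -tbb indeed pulls in both the headers and the library is to build a minimal TBB program with nothing but icpx -tbb, i.e. without any explicit -I, -L, or -ltbb:

#include <tbb/parallel_for.h>
#include <vector>

int main() {
  std::vector<double> data(1024, 1.0);
  // If this compiles and links with plain "icpx -tbb", the flag works.
  tbb::parallel_for(std::size_t(0), data.size(), [&](std::size_t i) {
    data[i] *= 2.0;
  });
  return 0;
}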
OpenMP and OpenMP offloading
CC=icx CXX=icpx ./configure --with-multithreading=omp --with-gpu=omp CFLAGS="-O3 -W -Wall -Wno-attributes -fiopenmp -fopenmp-targets=spir64" CXXFLAGS="-std=c++20 -O3 -W -Wall -Wno-attributes -fiopenmp -fopenmp-targets=spir64" LDFLAGS="-march=native -mtune=native -fiopenmp -fopenmp-targets=spir64"
Intel suggests using -fiopenmp for performance and features: https://www.intel.com/content/www/us/en/developer/articles/guide/porting-guide-for-icc-users-to-dpcpp-or-icx.html
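To verify that OpenMP offloading works independently of Peano, a minimal target-region test (compiled with the flags above) is often helpful:

#include <cstdio>
#include <omp.h>

int main() {
  int onDevice = 0;
  // Runs on the GPU if offloading works; otherwise falls back to the host.
  #pragma omp target map(from: onDevice)
  {
    onDevice = !omp_is_initial_device();
  }
  std::printf("offloaded to device: %s\n", onDevice ? "yes" : "no");
  return 0;
}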
SYCL and SYCL offloading
SYCL is directly supported via icpx. There are only a few things to do (a minimal standalone test to verify the setup follows at the end of this subsection):
- Translate your code with --with-multithreading=sycl and/or --with-gpu=sycl.
- Add CXXFLAGS="... -fsycl" to your compiler flags. This way, all the headers of SYCL are known to icpx.
- Add LDFLAGS="... -fsycl" to your linker flags. This way, the linker automatically adds all SYCL libraries. Furthermore, it knows when building the final application that it should embed both the device kernels and the CPU kernels into the executable. If you use LIBS="... -lsycl" instead, the linking will succeed, but the very first time you invoke a SYCL kernel, you'll get a message similar to
terminate called after throwing an instance of 'sycl::_V1::runtime_error'
what(): No kernel named _ZTSZZN7toolbox15blockstructured64interpolateCellDataAssociatedToVolumesIntoOverlappingCell_linearEiiiiiPKdPdN6peano45utils15LoopParallelismEENKUlRT_E_clIN4sycl3_V17handlerEEEDaS8_EUlNSC_2idILi3EEEE_ was found -46 (PI_ERROR_INVALID_KERNEL_NAME)
Aborted (core dumped)
- Ensure CXX=icpx points to the icpx compiler.
- Ensure CC=icx points to the Intel oneAPI C compiler if you work with the autotools. This is not intuitive and actually quite a hack, but you need it to make things work: configure tests whether your C compiler is compatible with your linker flags. However, the -fsycl flag is unknown to the default C compiler (usually gcc), and your configure will fail miserably. By letting the C compiler variable point to the Intel oneAPI C compiler, you ensure that this sanity check within the configuration phase passes.
- Reset the MPI compiler if you want to use MPI (see below).
For Intel GPUs:
CC=icx CXX=icpx ./configure --with-multithreading=sycl --with-gpu=sycl CFLAGS="-O3 -W -Wall -Wno-attributes -fsycl -fsycl-targets=spir64 -fsycl-device-code-split -fp-model=consistent" CXXFLAGS="-std=c++20 -O3 -W -Wall -Wno-attributes -fsycl -fsycl-targets=spir64 -fsycl-device-code-split -fp-model=consistent" LDFLAGS="-march=native -mtune=native -fsycl -fsycl-targets=spir64 -fsycl-device-code-split -fp-model=consistent"
For NVIDIA GPUs (Intel LLVM, Codeplay):
CC=icx CXX=icpx ./configure --with-multithreading=sycl --with-gpu=sycl CFLAGS="-O3 -W -Wall -Wno-attributes -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-code-split -fp-model=consistent -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_XX" CXXFLAGS="-std=c++20 -O3 -W -Wall -Wno-attributes -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-code-split -fp-model=consistent -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_XX" LDFLAGS="-march=native -mtune=native -fsycl -fsycl-targets=nvptx64-nvidia-cuda -fsycl-device-code-split -fp-model=consistent -Xsycl-target-backend=nvptx64-nvidia-cuda --cuda-gpu-arch=sm_XX"
where XX has to be replaced by the actual compute capability of the device.
For AMD GPUs (Intel LLVM, Codeplay):
CC=icx CXX=icpx ./configure --with-multithreading=sycl --with-gpu=sycl CFLAGS="-O3 -W -Wall -Wno-attributes -fsycl -fsycl-targets=amdgcn-amd-amdhsa -fsycl-device-code-split -fp-model=consistent -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfxXXX" CXXFLAGS="-std=c++20 -O3 -W -Wall -Wno-attributes -fsycl -fsycl-targets=amdgcn-amd-amdhsa -fsycl-device-code-split -fp-model=consistent -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfxXXX" LDFLAGS="-march=native -mtune=native -fsycl -fsycl-targets=amdgcn-amd-amdhsa -fsycl-device-code-split -fp-model=consistent -Xsycl-target-backend=amdgcn-amd-amdhsa --offload-arch=gfxXXX"
where XXX has to be replaced by the actual GFX architecture of the device (e.g. gfx90a).
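Before building Peano itself, a minimal standalone SYCL program (compiled with icpx -fsycl plus the target flags above) can verify the setup:

#include <sycl/sycl.hpp>
#include <iostream>
#include <vector>

int main() {
  sycl::queue q;   // picks the default device for the chosen target
  std::cout << "device: "
            << q.get_device().get_info<sycl::info::device::name>() << "\n";
  std::vector<double> data(16, 1.0);
  {
    sycl::buffer<double, 1> buf(data.data(), sycl::range<1>(data.size()));
    q.submit([&](sycl::handler& h) {
      sycl::accessor acc(buf, h, sycl::read_write);
      // Trivial device kernel: doubles each entry of the buffer.
      h.parallel_for(sycl::range<1>(data.size()),
                     [=](sycl::id<1> i) { acc[i] *= 2.0; });
    });
  }  // buffer destructor copies the data back to the host
  std::cout << "data[0] = " << data[0] << "\n";
  return 0;
}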
MPI
Intel's MPI wrapper is mpiicpc, even though Intel now wants you to use icpx instead of icc/icpc. To tell the MPI wrapper that it should use icpx, you have to set an environment variable:
export I_MPI_CXX=icpx
./configure --with-mpi=mpiicpc
LLVM
OpenMP
For NVIDIA GPUs:
export MARCH=sm_XX
CC=clang CXX=clang++ ./configure --with-multithreading=omp --with-gpu=omp CXXFLAGS="-std=c++17 -O3 -W -Wall -Wno-attributes -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=$MARCH" LDFLAGS="-march=native -mtune=native -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -Xopenmp-target=nvptx64-nvidia-cuda -march=$MARCH"
where XX has to be replaced by the actual compute capability of the device.
Note: We set the C++ compiler standard to 17. This is due to an issue with LLVM Clang itself. See https://github.com/llvm/llvm-project/issues/61327.
NVIDIA NVC++ (NVHPC)

The NVIDIA toolchain requires us to make CXX and CC point to the C++ compiler, as both are passed the same arguments by configure, and these arguments are only understood by the C++ compiler.
NVIDIA toolchain
Yet to be written. An NVIDIA logger is offered.
OpenMP and OpenMP offloading
NVIDIA's compiler does not support all of OpenMP. Its OpenMP implementation is therefore rather picky when it comes to taskloops and other C++ features. Peano has internal workarounds for all of these items, i.e. the build should in principle succeed, yet the code might be slightly slower than the Intel or native Clang counterpart.
The OpenMP offloading requires us to specify both the GPU OpenMP target and a corresponding compute capability:
CC=nvc++ CXX=nvc++ ./configure --with-multithreading=omp --with-gpu=omp CFLAGS="-O3 -W -Wall --diag_suppress=unrecognized_attribute -fopenmp -Munroll=c:100 -mp=gpu -gpu=ccXX" CXXFLAGS="-std=c++20 -O3 -W -Wall --diag_suppress=unrecognized_attribute -fopenmp -Munroll=c:100 -mp=gpu -gpu=ccXX" LDFLAGS="-march=native -mtune=native -fopenmp -Munroll=c:100 -mp=gpu -gpu=ccXX"
where XX has to be replaced by the actual compute capability of the device.
OpenMP and std::par offloading
CC=nvc++ CXX=nvc++ ./configure --with-multithreading=omp --with-gpu=omp CFLAGS="-O3 -W -Wall --diag_suppress=unrecognized_attribute -fopenmp -Munroll=c:100 -mp=gpu -gpu=ccXX" CXXFLAGS="-std=c++20 -O3 -W -Wall --diag_suppress=unrecognized_attribute -fopenmp -Munroll=c:100 -mp=gpu -stdpar=gpu -gpu=ccXX" LDFLAGS="-march=native -mtune=native -fopenmp -Munroll=c:100 -mp=gpu -stdpar=gpu -gpu=ccXX"
where XX has to be replaced by the actual compute capability of the device.
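With -stdpar=gpu, nvc++ offloads ISO C++ parallel algorithms automatically. A minimal sketch of the programming model (plain standard C++, nothing Peano-specific):

#include <algorithm>
#include <execution>
#include <vector>

int main() {
  std::vector<double> x(1 << 20, 1.0);
  // With nvc++ -stdpar=gpu, this parallel algorithm runs on the GPU.
  std::transform(std::execution::par_unseq, x.begin(), x.end(), x.begin(),
                 [](double v) { return 2.0 * v; });
  return 0;
}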
Source code annotation
NVIDIA's NVTX logging is supported by picking the corresponding toolchain.
AMD
For the AMD toolchain, we need to load either ROCm or AOMP; both modules ship AMD's modified version of Clang. We again need to point CXX and CC to the AMD Clang compiler.
OpenMP and OpenMP offloading
For the configuration with OpenMP GPU offloading, we again specify --with-gpu=omp to make Peano GPU-aware. Furthermore, we need to add AMD-specific offloading instructions:
CC=amdclang CXX=amdclang++ ./configure --with-multithreading=omp --with-gpu=omp CFLAGS="-O3 -W -Wall -Wno-attributes -fopenmp --offload-arch=gfxXXX -D__AMDGPU__" CXXFLAGS="-std=c++17 -O3 -W -Wall -Wno-attributes -fopenmp -lstdc++fs --offload-arch=gfxXXX -D__AMDGPU__" LDFLAGS="-march=native -mtune=native -lstdc++fs -fopenmp --offload-arch=gfxXXX"
where XXX has to be replaced by the actual GFX architecture of the device (e.g. gfx90a).
Note: We define the preprocessor flag __AMDGPU__ to allow Peano to compile patches that are needed only for the AMD toolchain.
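Within the source code, such AMD-only patches are then guarded by this symbol:

#ifdef __AMDGPU__
// Workaround required only for the AMD toolchain.
#else
// Default code path for all other toolchains.
#endif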