Compilers, MPI, and GPU Offloading

Known Issues

Please see the Known Issues page for known software problems and incompatibilities on the Cori GPU nodes. If you encounter an issue that is not documented there, please file a ticket at the NERSC Help Desk, selecting 'Cori GPU' as the 'Resource' in the ticket.

Note about cross-compiling

Nearly all of the software provided by Cray (cray-petsc, cray-fftw, cray-hdf5, etc.) is not usable on the Cori GPU nodes, because the GPU nodes have different hardware and run a different OS. Only a select subset of the modules available on Cori is designed to work on the GPU nodes.

This means you will likely need to compile your own software directly on the GPU nodes themselves, rather than cross-compiling for the GPU nodes on a login node.

The best way to access a GPU node with modules that work there is to purge your default modules, load esslurm and the other GPU modules you need, and then request the node, e.g.,

user@cori02:~> module purge && module load esslurm cuda gcc mvapich2
user@cori02:~> salloc -C gpu -t 60 -N 1 -c 10 --gres=gpu:1 -A <account>
salloc: Granted job allocation 12345
salloc: Waiting for resource configuration
salloc: Nodes cgpu12 are ready for job
user@cgpu12:~>

Base compilers

There are several base compilers available on Cori GPU, with varying levels of support for GPU code generation:

  • GCC
  • NVIDIA HPC SDK (formerly PGI)
  • CCE (Cray compiler)
  • Intel
  • clang

These compilers and their capabilities are described in more detail below.

The PGI compiler has been replaced by the NVIDIA HPC SDK

In May 2020, NVIDIA incorporated the PGI compiler into its new HPC SDK. The PGI brand will eventually be retired, and all future versions of the PGI compiler will be released as part of the HPC SDK. This new SDK is available on Cori GPU as the module hpcsdk. NVIDIA has given the C, C++, and Fortran compilers new names, but the old names will be retained for the near future. The compiler names are now:

  • pgcc -> nvc
  • pgc++ -> nvc++
  • pgf77/pgf90/pgf95/pgfortran -> nvfortran

The final release of the PGI compiler suite was version 20.4, and the first release of the HPC SDK was 20.5.

Since this is a new product from NVIDIA, NERSC encourages users to use the HPC SDK and report any feedback by filing a ticket at the NERSC help desk, selecting "Cori GPU" as the "Resource."

MPI

Both OpenMPI and MVAPICH2 are provided on the GPU nodes. Details about each are given below. NERSC generally recommends OpenMPI, because it supports a wider range of compilers and compiler versions and is generally more stable than MVAPICH2.

OpenMPI

OpenMPI is provided for the GCC, HPC SDK (formerly PGI), Intel, and CCE compilers via the openmpi/4.0.3 module. Users must use this particular version of the openmpi module; the other versions are not configured for Cori GPU.

One must first load a compiler module and a CUDA module before loading the openmpi/4.0.3 module, e.g.,

module load hpcsdk
module load cuda
module load openmpi/4.0.3

After the openmpi/4.0.3 module is loaded, the MPI compiler wrappers will be available as mpicc, mpic++, and mpif90.
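
As a quick sanity check of the wrappers, a minimal MPI program along the lines of the hedged sketch below (a hypothetical hello.cpp) can be compiled with mpic++ hello.cpp -o hello.ex and launched with srun -n 2 ./hello.ex inside a GPU allocation:

#include <mpi.h>
#include <cstdio>

// Minimal MPI sanity check (hypothetical hello.cpp); build with the mpic++
// wrapper from the openmpi/4.0.3 module and launch with srun.
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    std::printf("Hello from rank %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}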

MVAPICH2

MVAPICH2 is available via the mvapich2 module. It supports three compilers:

  • GCC (via the gcc module)
  • PGI (via the pgi module)
  • Intel (via the intel module)

PGI support in MVAPICH2 is limited to PGI versions <= 19

MVAPICH2 is not compatible with PGI version 20, or with PGI's successor, the HPC SDK. Users who wish to use the PGI or HPC SDK compilers should use OpenMPI instead of MVAPICH2.

The mvapich2 module must be loaded after a compiler module and a cuda module. Thus, to load MVAPICH2 with GCC:

module load gcc
module load cuda
module load mvapich2

To load MVAPICH2 with PGI or Intel support, replace gcc in this example with pgi or intel.

Cross-compiling with the mvapich2 module from the Cori login nodes does not work

Attempting to cross-compile a code on the Cori login nodes using the mvapich2 compiler wrappers will result in an error like the following:

/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_reg_xrc_rcv_qp@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_close_xrc_domain@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_unreg_xrc_rcv_qp@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_open_xrc_domain@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_modify_xrc_rcv_qp@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_create_xrc_rcv_qp@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_create_xrc_srq@IBVERBS_1.1'

The error occurs because the InfiniBand libraries that MVAPICH2 relies on are installed only on the GPU nodes, not on the Cori login or compute nodes.

To avoid this error, one must invoke the mvapich2 compiler wrappers directly on a Cori GPU node.

GPU Software Support

There are many different ways to offload code to GPUs. We provide software support for several of these methods on the GPU nodes.

CUDA

The CUDA SDK is available via the cuda modules. The SDK includes the nvcc CUDA C/C++ compiler, the Nsight and nvprof profiling tools, the cuda-gdb debugger, and others.

The LLVM/clang compiler is also a valid CUDA compiler. One can replace the nvcc command from the CUDA SDK with clang --cuda-gpu-arch=<arch>, where <arch> on the Cori GPU nodes is sm_70. When using clang as a CUDA compiler, one usually also needs to add the -I/path/to/cuda/include and -L/path/to/cuda/lib64 flags manually, since nvcc includes them implicitly.
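
With either compiler, a host-only device query along the lines of the hedged sketch below (a hypothetical devicequery.cpp) is a quick way to confirm that the CUDA include and library paths are set correctly; on the Cori GPU nodes it should report compute capability 7.0 devices. Link with -lcudart.

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical devicequery.cpp: host-only CUDA runtime calls that exercise
// the -I.../include and -L.../lib64 -lcudart flags described above.
int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        std::printf("Device %d: %s (compute capability %d.%d)\n",
                    i, prop.name, prop.major, prop.minor);
    }
    return 0;
}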

OpenMP

Several compilers have some support for OpenMP offloading to GPUs via the omp target directive.
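
For reference, an offloaded loop using the target directive might look like the hedged sketch below (hypothetical code, not specific to any one compiler); the compiler-specific flags needed to enable offloading are described in the subsections that follow.

#include <cstdio>
#include <vector>

// Hypothetical example: offload a summation loop to the GPU with OpenMP target.
int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double *pa = a.data(), *pb = b.data();
    double sum = 0.0;

    // Map the arrays to the device, distribute the loop across teams and
    // threads, and reduce the partial sums back to the host.
    #pragma omp target teams distribute parallel for \
        map(to: pa[0:n], pb[0:n]) map(tofrom: sum) reduction(+: sum)
    for (int i = 0; i < n; ++i) {
        sum += pa[i] + pb[i];
    }

    std::printf("sum = %f (expected %f)\n", sum, 3.0 * n);
    return 0;
}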

LLVM/clang

The clang/clang++ LLVM compilers support GPU offloading with OpenMP. The 'raw' compilers are available via the following modules:

  • llvm/11.0.0-git_20200409
  • llvm/10.0.0-git_20190828
  • llvm/9.0.0-git_20190220

or you can load the corresponding PrgEnv-llvm modules:

  • PrgEnv-llvm/11.0.0-git_20200409
  • PrgEnv-llvm/10.0.0-git_20190828
  • PrgEnv-llvm/9.0.0-git_20190220

each of which loads the appropriate LLVM, CUDA, and MVAPICH2 modules.

Enabling GPU offloading with OpenMP in the clang compiler looks like:

clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda base.c -c

Using the clang++ compiler

The clang++ compiler will fail unless you add a compiler option selecting an official C++ standard, e.g. -std=c++11. The issue appears to be related to GPU-offload support for GCC extensions such as the __float128 type.

Intrinsic math functions in GPU offloaded regions

The clang/clang++ compilers belonging to the llvm/9.0.0-git_20190220 module are unable to compile OpenMP target regions which call <math.h> functions, e.g. log() and exp(). These compilers also handle OpenMP target regions inside static libraries incorrectly: your application will fail at runtime when it encounters an OpenMP target region in a static library. If you need either of these capabilities, please use the PrgEnv-llvm/10.0.0-git_20190828 module.

CCE

The Cray compilers ('CCE') currently have the most mature OpenMP offloading capabilities of any compiler on the Cori GPU nodes, especially among Fortran compilers. Cray does not officially support CCE on the Cori GPU nodes, but it can be made to work by careful loading and unloading of modules:

# load the appropriate modules
module load cdt/20.03
module swap PrgEnv-{intel,cray}
module swap craype-{${CRAY_CPU_TARGET},x86-skylake}
module load cuda
module load openmpi/4.0.3
export CRAY_ACCEL_TARGET=nvidia70

# compile the code
mpicc  -fopenmp -o my_openmp_code.ex my_openmp_code.c     # C code
mpic++ -fopenmp -o my_openmp_code.ex my_openmp_code.cpp   # C++ code
mpif90 -h omp   -o my_openmp_code.ex my_openmp_code.f90   # Fortran code

Do not module purge if using CCE

Unlike most other compilers and modules used on the Cori GPU nodes, which should be preceded by module purge, the CCE compilers depend on the default Cray module environment. Do not run module purge if you intend to use the CCE compilers.

You can add the flag -fsave-loopmark to the Cray C/C++ compilers, or -h list=a to the Cray Fortran compiler, to produce an optimization report (named <source_file>.lst) which indicates which regions of the code were successfully offloaded to the GPU. For example, in the OpenMP offload version of the SOLLVE OpenMP V&V suite, CCE outputs a diagnostic report for each source file which includes sections such as:

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
                          S o u r c e   L i s t i n g
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

   11.               
   12.               #define N 1024
   13.               
   14.               PROGRAM test_target_teams_distribute_device
   15.                 USE iso_fortran_env
   16.                 USE ompvv_lib
   17.                 USE omp_lib
   18.                 implicit none
   19.                 INTEGER :: errors
   20.                 errors = 0
   21.               
   22.  +              OMPVV_TEST_OFFLOADING
   23.               
   24.  +              OMPVV_TEST_VERBOSE(test_add() .ne. 0)
   25.               
   26.  +              OMPVV_REPORT_AND_RETURN()
   27.               CONTAINS
   28.                 INTEGER FUNCTION test_add()
   29.                   INTEGER,DIMENSION(N):: a, b
   30.                   INTEGER:: x, dev_sum, host_sum, errors
   31.                   errors = 0
   32.                   host_sum = 0
   33.                   dev_sum = 0
   34.               
   35.    fVr2-----<     DO x = 1, N
   36.    fVr2              a(x) = 1
   37.    fVr2              b(x) = x
   38.    fVr2----->     END DO
   39.               
   40.    f--------<     DO x = 1, N
   41.    f                 host_sum = host_sum + a(x) + b(x)
   42.    f-------->     END DO
   43.               
   44.  + MG-------<     !$omp target teams distribute defaultmap(tofrom:scalar) &
   45.    MG             !$omp& reduction(+:dev_sum)
   46.    MG gr6---<     DO x = 1, N
   47.    MG gr6            dev_sum = a(x) + b(x) + dev_sum
   48.    MG gr6-->>     END DO
   49.               
   50.  +                OMPVV_TEST_AND_SET_VERBOSE(errors, dev_sum .ne. host_sum)
   51.                   test_add = errors
   52.                 END FUNCTION test_add
   53.               END PROGRAM test_target_teams_distribute_device
ftn-5001 ftn: NOTE TEST_TARGET_TEAMS_DISTRIBUTE_DEVICE, File = test_target_teams_distribute_reduction_add.F90, Line = 53 
  Local variable "ERRORS" is assigned a value but never used.


ftn-3118 ftn: IPA TEST_TARGET_TEAMS_DISTRIBUTE_DEVICE, File = test_target_teams_distribute_reduction_add.F90, Line = 22, Column = 3 
  "test_offloading"(/global/u2/f/friesen/tests/OpenMP/openmp45/sollve_vv/tests/4.5/target_teams_distribute/test_target_teams_distrib
  te_reduction_add.F90:130) was not inlined because the call site will not flatten.  "omp_is_initial_device_" is missing.

ftn-3171 ftn: IPA TEST_TARGET_TEAMS_DISTRIBUTE_DEVICE, File = test_target_teams_distribute_reduction_add.F90, Line = 24, Column = 3 
  "test_add"(/global/u2/f/friesen/tests/OpenMP/openmp45/sollve_vv/tests/4.5/target_teams_distribute/test_target_teams_distribute_red
  ction_add.F90:28) was not inlined because it is not in the body of a loop.

ftn-3163 ftn: IPA TEST_TARGET_TEAMS_DISTRIBUTE_DEVICE, File = test_target_teams_distribute_reduction_add.F90, Line = 24, Column = 3 
  "test_error_verbose"(/global/u2/f/friesen/tests/OpenMP/openmp45/sollve_vv/tests/4.5/target_teams_distribute/test_target_teams_dist
  ibute_reduction_add.F90:176) was not inlined because the routine contains initialized data with the SAVE attribute.

ftn-3171 ftn: IPA TEST_TARGET_TEAMS_DISTRIBUTE_DEVICE, File = test_target_teams_distribute_reduction_add.F90, Line = 26, Column = 3 
  "report_and_set_errors"(/global/u2/f/friesen/tests/OpenMP/openmp45/sollve_vv/tests/4.5/target_teams_distribute/test_target_teams_d
  stribute_reduction_add.F90:261) was not inlined because it is not in the body of a loop.

ftn-6005 ftn: SCALAR TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 35 
  A loop starting at line 35 was unrolled 2 times.

ftn-6204 ftn: VECTOR TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 35 
  A loop starting at line 35 was vectorized.

ftn-6004 ftn: SCALAR TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 40 
  A loop starting at line 40 was fused with the loop starting at line 35.

ftn-6405 ftn: ACCEL TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 44 
  A region starting at line 44 and ending at line 48 was placed on the accelerator.

ftn-6823 ftn: THREAD TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 44 
  A region starting at line 44 and ending at line 48 was multi-threaded.

ftn-6418 ftn: ACCEL TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 44 
  If not already present: allocate memory and copy whole array "b" to accelerator, free at line 48 (acc_copyin).

ftn-6418 ftn: ACCEL TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 44 
  If not already present: allocate memory and copy whole array "a" to accelerator, free at line 48 (acc_copyin).

ftn-6415 ftn: ACCEL TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 44 
  Allocate memory and copy variable "dev_sum" to accelerator, copy back at line 48 (acc_copy).

ftn-6823 ftn: THREAD TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 44 
  A region starting at line 44 and ending at line 48 was multi-threaded.

ftn-6005 ftn: SCALAR TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 46 
  A loop starting at line 46 was unrolled 6 times.

ftn-6430 ftn: ACCEL TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 46 
  A loop starting at line 46 was partitioned across the threadblocks and the 128 threads within a threadblock.

ftn-3171 ftn: IPA TEST_ADD, File = test_target_teams_distribute_reduction_add.F90, Line = 50, Column = 5 
  "test_and_set_verbose"(/global/u2/f/friesen/tests/OpenMP/openmp45/sollve_vv/tests/4.5/target_teams_distribute/test_target_teams_di
  tribute_reduction_add.F90:219) was not inlined because it is not in the body of a loop.

GCC

GCC 8.1.1 has some support for OpenMP offloading. This compiler is available via the gcc/8.1.1-openacc-gcc-8-branch-20190215 module, which depends on the cuda/9.2.148 module.

OpenMP offloading with gcc looks something like

gcc -fopenmp -foffload=nvptx-none="-Ofast -lm -misa=sm_35" base.c -c

OpenMP GPU offload support in GCC is limited

The GCC compiler's OpenMP offload capabilities for GPU code generation are very limited, in terms of both functionality and performance. Users are strongly advised to use LLVM/clang for C/C++ codes, or CCE, which also includes a Fortran compiler with OpenMP offload capability.

OpenACC

Several compilers on the GPU nodes also support GPU offloading with OpenACC directives.
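
For reference, an OpenACC version of a simple offloaded loop might look like the hedged sketch below (hypothetical code); the compiler-specific invocations are given in the subsections that follow.

#include <cstdio>
#include <vector>

// Hypothetical example: offload a summation loop to the GPU with OpenACC.
int main() {
    const int n = 1 << 20;
    std::vector<double> a(n, 1.0), b(n, 2.0);
    double *pa = a.data(), *pb = b.data();
    double sum = 0.0;

    // Copy the arrays to the device, run the loop in parallel, and combine
    // the reduction result with the host value at the end of the region.
    #pragma acc parallel loop copyin(pa[0:n], pb[0:n]) reduction(+:sum)
    for (int i = 0; i < n; ++i) {
        sum += pa[i] + pb[i];
    }

    std::printf("sum = %f (expected %f)\n", sum, 3.0 * n);
    return 0;
}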

GCC

The GCC module available via gcc/8.1.1-openacc-gcc-8-branch-20190215 also supports OpenACC offloading for GPUs. Invoking OpenACC looks like:

gcc -fopenacc -foffload=nvptx-none="-Ofast -lm -misa=sm_35" base.c -c

NVIDIA HPC SDK (formerly PGI)

The NVIDIA HPC SDK (formerly PGI) compilers support OpenACC offloading and are available via the hpcsdk (or pgi) modules.

Invoking OpenACC in the HPC SDK compilers looks like:

nvc++ -acc -ta=tesla:cc70 base.c -c

Documentation for the HPC SDK compiler is provided here.

CUDA Fortran

The NVIDIA HPC SDK (formerly PGI) Fortran compiler supports CUDA Fortran.

Compiler bugs

If you find bugs in the compilers (wrong answers, compiler crashes, etc.), PLEASE REPORT THEM TO NERSC! Any OpenMP target issues can be sent directly to Chris Daley: csdaley@lbl.gov. Many compilers are still in the early phases of GPU enablement, and bug reports are essential for getting these problems fixed quickly.

OpenCL

OpenCL is supported natively by NVIDIA's CUDA toolkit. In addition, there's a module for the Portable OpenCL (POCL) implementation which is based on LLVM and uses its NVPTX backend. It's recommended that you try the NVIDIA solution first, and then try the POCL implementation as it may provide better performance.

Module load order may affect which driver you get

If your workflow also requires a CUDA module, the POCL module must be loaded after the CUDA module; otherwise the NVIDIA driver will be used instead of POCL.

NVIDIA OpenCL

A compilation using the NVIDIA driver requires specifying the path of the OpenCL CL/cl.h include file:

module load gcc cuda
g++ $CFLAGS -I$CUDA_ROOT/include <myapplication.c> -lOpenCL
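
A hypothetical myapplication.c along the lines of the sketch below, which simply lists the visible OpenCL platforms and devices, is enough to confirm that the compile and link steps work (the sketch is written as C++ using the OpenCL C API):

#include <CL/cl.h>
#include <cstdio>

// Hypothetical example: list the OpenCL platforms and devices exposed by
// whichever driver the loaded modules select.
int main() {
    cl_platform_id platforms[8];
    cl_uint num_platforms = 0;
    clGetPlatformIDs(8, platforms, &num_platforms);

    for (cl_uint p = 0; p < num_platforms && p < 8; ++p) {
        char pname[256];
        clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, NULL);
        std::printf("Platform %u: %s\n", p, pname);

        cl_device_id devices[8];
        cl_uint num_devices = 0;
        clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);
        for (cl_uint d = 0; d < num_devices && d < 8; ++d) {
            char dname[256];
            clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, NULL);
            std::printf("  Device %u: %s\n", d, dname);
        }
    }
    return 0;
}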

Portable OpenCL (experimental)

In order to use the POCL implementation, you must first load the POCL module; it sets the necessary include and library paths.

module use /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/modulefiles
module load opencl
g++ $CFLAGS <myapplication.c> -lOpenCL

You can check that you are using POCL with the clinfo utility:

cgpu$ module load clinfo
cgpu$ srun clinfo -l
Platform #0: Portable Computing Language
 +-- Device #0: pthread-Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
 `-- Device #1: Tesla V100-SXM2-16GB

SYCL

There are two options for SYCL compilers, both of which are experimental to some degree.

ComputeCpp (experimental)

ComputeCpp is a production compiler developed by CodePlay, but the NERSC configuration relies on the open source Portable OpenCL (POCL) implementation, in addition to an open source SPIR-V to LLVM Translator, in order to target the NVIDIA V100 GPU. The NERSC instantiation is not an officially supported configuration, but the combination happens to work.

A prerequisite is to load the following module path:

module use /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/modulefiles
module load computecpp

Usage is best demonstrated with the following example:

cgpu$ cp -R /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/llvm-pocl/example .
cgpu$ cd example
cgpu$ make
compute++ -std=c++14 -O2 -sycl-driver -sycl-target spirv64 -no-serial-memop -o simple-vector-add.x simple-vector-add.cpp -lComputeCpp -lOpenCL
cgpu$ srun simple-vector-add.x
Using Platform Portable Computing Language: Device Tesla V100-SXM2-16GB
Using Platform Portable Computing Language: Device Tesla V100-SXM2-16GB
The results are correct!

Intel DPC++/SYCL (experimental)

The second option is to use the Intel Data Parallel C++ (DPC++) compiler. This is based on the LLVM/Clang compiler with SYCL extensions added by Intel (the basis of their oneAPI DPC++ solution), and an experimental NVPTX backend provided by CodePlay to target NVIDIA GPUs.

Intel SYCL requires a custom device selector

The Intel SYCL compiler targets NVPTX directly, i.e. bypassing the OpenCL driver, and hence a default SYCL GPU selector may not find the NVIDIA GPU. A custom selector along the lines of the sketch below can be used; the bundled example shown further down also contains code demonstrating how to select the NVIDIA GPU as a device.
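
The following is a hedged, hypothetical sketch using the SYCL 1.2.1 device_selector interface; it scores GPU devices exposed by the "NVIDIA CUDA" platform (as seen in the example output below) highest and rejects everything else. The bundled example remains the authoritative reference.

#include <CL/sycl.hpp>
#include <iostream>
#include <string>

// Hypothetical custom selector: prefer GPU devices whose platform name
// contains "CUDA", i.e. the NVIDIA GPU exposed by the CUDA backend.
class cuda_gpu_selector : public cl::sycl::device_selector {
public:
    int operator()(const cl::sycl::device &dev) const override {
        const std::string platform =
            dev.get_platform().get_info<cl::sycl::info::platform::name>();
        if (dev.is_gpu() && platform.find("CUDA") != std::string::npos)
            return 100;   // strongly prefer the V100
        return -1;        // reject all other devices
    }
};

int main() {
    cl::sycl::queue q{cuda_gpu_selector{}};
    std::cout << "Running on: "
              << q.get_device().get_info<cl::sycl::info::device::name>()
              << std::endl;
    return 0;
}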

A prerequisite is to load the following module path:

module use /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/modulefiles
module load dpc++

Example usage of the Intel DPC++ compiler:

cgpu$ cp -R /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/llvm-sycl/example .
cgpu$ cd example
cgpu$ make
clang++ -std=c++14 -O2 -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -Xsycl-target-backend '--cuda-gpu-arch=sm_70' -o simple-vector-add.x simple-vector-add.cpp
cgpu$ srun simple-vector-add.x
Using Platform NVIDIA CUDA: Device Tesla V100-SXM2-16GB
Using Platform NVIDIA CUDA: Device Tesla V100-SXM2-16GB
The results are correct!

HIP

The HIP compiler and associated hipBLAS library are available.

A prerequisite is to load the following module path:

module use /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/modulefiles
module load hip
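
As a hedged sketch of what HIP code looks like, the hypothetical hipquery.cpp below queries the devices through the HIP runtime; on the Cori GPU nodes HIP sits on top of the CUDA backend, so it should report the V100s. Building it with the hipcc driver provided by the hip module is assumed here.

#include <hip/hip_runtime.h>
#include <cstdio>

// Hypothetical hipquery.cpp: host-only HIP runtime calls.
int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess) {
        std::printf("hipGetDeviceCount failed\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, i);
        std::printf("Device %d: %s\n", i, prop.name);
    }
    return 0;
}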

C++17 parallel algorithms

C++17 introduced parallel STL algorithms ("pSTL"), which allow standard C++ code to express parallelism through many of the STL algorithms. The NVIDIA HPC SDK supports GPU-accelerated pSTL algorithms, which can be activated by invoking nvc++ with the flag -stdpar=gpu. Documentation regarding pSTL for the HPC SDK can be found here.
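
As a hedged, hypothetical example, the following is the kind of standard C++17 code that nvc++ -stdpar=gpu can offload: a std::transform invoked with a parallel execution policy.

#include <algorithm>
#include <execution>
#include <vector>
#include <cstdio>

// Hypothetical example: a C++17 parallel algorithm that nvc++ -stdpar=gpu
// can offload to the GPU (it also runs on the CPU with other compilers).
int main() {
    const int n = 1 << 20;
    std::vector<double> x(n, 1.0), y(n, 2.0);

    // y[i] = 2.0 * x[i] + y[i], executed under a parallel, unsequenced policy.
    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                   y.begin(),
                   [](double xi, double yi) { return 2.0 * xi + yi; });

    std::printf("y[0] = %f (expected 4.0)\n", y[0]);
    return 0;
}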

HPC SDK v20.5 does not generate GPU code if a GPU is not visible at compile time

nvc++ version 20.5 attempts to detect a GPU when it is invoked with -stdpar=gpu. If a GPU is visible, it will generate GPU-accelerated pSTL code; if a GPU is not visible at compile time, nvc++ will generate only CPU code, even when -stdpar=gpu is specified. This means that cross-compilation of pSTL code from a non-GPU node (like a Cori login node) is not possible. It also means that GPU-accelerated pSTL code generation does not work in interactive jobs on Cori GPU unless nvc++ is invoked from within an srun command. (See this page for details.)

cuTENSOR extensions for Fortran intrinsic math functions

The NVIDIA HPC SDK provides cuTENSOR extensions so that some Fortran intrinsic math functions can be accelerated on GPUs. Accelerated functions include MATMUL, TRANSPOSE, and several others. The nvfortran compile provides access to these GPU-accelerated functions via the module cutensorEx. Documentation about the cutensorEx module in nvfortran is provided here.