Software

Known Issues

Please see the Known Issues section at the bottom of this page regarding known software problems and incompatibilities on the Cori GPU nodes. If you encounter an issue which is not documented here, please file a ticket at the NERSC Help Desk, selecting 'Cori GPU' as the 'Resource' in the ticket.

Note about cross-compiling

Nearly all of the software provided by Cray (cray-petsc, cray-fftw, cray-hdf5, etc.) is unusable on the Cori GPU nodes, because the GPU nodes have different hardware and run a different OS. Only a select subset of the modules available on Cori is designed to work on the GPU nodes.

This means you will likely need to compile your own software directly on the GPU nodes themselves, rather than cross-compiling for the GPU nodes on a login node.

The best way to access a GPU node with the correct modules loaded is to purge your default modules first, then load esslurm and the other GPU modules you need, and then request the nodes, e.g.,

user@cori02:~> module purge && module load esslurm cuda pgi mvapich2
user@cori02:~> salloc -C gpu -t 60 -N 1 -c 10 --gres=gpu:1 -A <account>
salloc: Granted job allocation 12345
salloc: Waiting for resource configuration
salloc: Nodes cgpu12 are ready for job
user@cgpu12:~>

MPI

Both MVAPICH2 and OpenMPI with UCX support are provided on the GPU nodes. Details about each are provided below.

MVAPICH2

MVAPICH2 is available via the mvapich2 module. It supports three compilers:

  • GCC (via the gcc module)
  • PGI (via the pgi module)
  • Intel (via the intel module)

The mvapich2 module must be loaded after a compiler module and a cuda module. Thus, to load MVAPICH2 with PGI:

module load pgi
module load cuda
module load mvapich2

or, more succinctly,

module load pgi cuda mvapich2

To load MVAPICH2 with GCC or Intel support, replace pgi in this example with gcc or intel.

After the mvapich2 module is loaded, the MPI compiler wrappers will be available as mpicc, mpic++, and mpif90.

Cross-compiling with the mvapich2 module from Cori login nodes does not work

Attempting to cross-compile a code on the Cori login nodes using the mvapich2 compiler wrappers will result in an error like the following:

/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_reg_xrc_rcv_qp@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_close_xrc_domain@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_unreg_xrc_rcv_qp@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_open_xrc_domain@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_modify_xrc_rcv_qp@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_create_xrc_rcv_qp@IBVERBS_1.1'
/global/common/cori/software/mvapich2/2.3/pgi/18.10/lib/libmpi.so: undefined reference to `ibv_create_xrc_srq@IBVERBS_1.1'

The error occurs because the InfiniBand library files on which MVAPICH2 relies are installed only on the GPU nodes, not on the Cori login or compute nodes.

To avoid this error, invoke the mvapich2 compiler wrappers directly on a Cori GPU node.

MPI_THREAD_MULTIPLE with MVAPICH2

By default, a code compiled with MVAPICH2 will support only MPI_THREAD_SINGLE, even if a higher threading model is requested in MPI_Init_thread(). This is by design; see this page for more information. If one requests a higher level of threading support, one will encounter the following runtime warning:

user@cgpu04:~/> mpicc -o main.ex main.c
user@cgpu04:~/> srun -n 2 -c 2 ./main.ex
Requested MPI_THREAD_MULTIPLE, got MPI_THREAD_SINGLE
Hello world from processor cgpu04, rank 0 out of 2 processors
Hello world from processor cgpu04, rank 1 out of 2 processors
Error in system call pthread_mutex_destroy: Device or resource busy
    src/mpi/init/initthread.c:241
Error in system call pthread_mutex_destroy: Device or resource busy
    src/mpi/init/initthread.c:241

To enable higher levels of threading support, e.g., MPI_THREAD_MULTIPLE, one must disable MVAPICH2's default task binding behavior by setting the environment variable MV2_ENABLE_AFFINITY=0 during execution:

user@cgpu04:~> MV2_ENABLE_AFFINITY=0 srun -n 2 -c 2 ./main.ex
Requested MPI_THREAD_MULTIPLE, got MPI_THREAD_MULTIPLE
Hello world from processor cgpu04, rank 0 out of 2 processors
Hello world from processor cgpu04, rank 1 out of 2 processors
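
For reference, a minimal test program in the spirit of the main.c used above might look like the following (a sketch; the exact source used in these transcripts is not shown here):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* Request full thread support; the level actually granted is returned in 'provided'. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);

    if (rank == 0)
        printf("Requested MPI_THREAD_MULTIPLE, got %s\n",
               provided == MPI_THREAD_MULTIPLE ? "MPI_THREAD_MULTIPLE"
                                               : "a lower threading level");
    printf("Hello world from processor %s, rank %d out of %d processors\n",
           name, rank, size);

    MPI_Finalize();
    return 0;
}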

OpenMPI

OpenMPI is available for the GCC, PGI, and Intel compilers via the openmpi/4.0.3 module. As with the mvapich2 modules, one must first load a compiler module and a CUDA module before loading the openmpi/4.0.3 module, e.g.,

module load pgi
module load cuda
module load openmpi/4.0.3

GPU Software Support

There are many different ways to offload code to GPUs. We provide software support for several of these methods on the GPU nodes.

CUDA

The CUDA SDK is available via the cuda modules. The SDK includes the nvcc CUDA C/C++ compiler, the Nsight and nvprof profiling tools, the cuda-gdb debugger, and others.

The LLVM/clang compiler is also a valid CUDA compiler. One can replace the nvcc command from the CUDA SDK with clang --cuda-gpu-arch=<arch>, where <arch> on the Cori GPU nodes is sm_70. When using clang as a CUDA compiler, one usually also needs to add the -I/path/to/cuda/include and -L/path/to/cuda/lib64 flags manually, since nvcc includes them implicitly.
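
For example (a sketch, assuming the cuda module sets CUDA_ROOT and saxpy.cu is a placeholder for your own source file):

clang++ --cuda-gpu-arch=sm_70 -I$CUDA_ROOT/include -L$CUDA_ROOT/lib64 \
    -o saxpy.ex saxpy.cu -lcudart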

OpenMP

Several compilers have some support for OpenMP offloading to GPUs via the omp target directive.

LLVM/clang

The clang/clang++ LLVM compilers support GPU offloading with OpenMP. The 'raw' compilers are available via the following modules:

  • llvm/11.0.0-git_20200409
  • llvm/10.0.0-git_20190828
  • llvm/9.0.0-git_20190220

or you can load the corresponding PrgEnv-llvm modules:

  • PrgEnv-llvm/11.0.0-git_20200409
  • PrgEnv-llvm/10.0.0-git_20190828
  • PrgEnv-llvm/9.0.0-git_20190220

each of which loads the appropriate LLVM, CUDA, and MVAPICH2 modules.

Enabling GPU offloading with OpenMP in the clang compiler looks like:

clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda base.c -c
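
A minimal example of the kind of OpenMP target region such a compile line offloads (a sketch; base.c above is just a placeholder name):

#include <stdio.h>

int main(void)
{
    const int n = 1 << 20;
    static double x[1 << 20];

    /* Offload the loop to the GPU and copy the results back to the host. */
    #pragma omp target teams distribute parallel for map(from:x[0:n])
    for (int i = 0; i < n; i++)
        x[i] = 2.0 * i;

    printf("x[n-1] = %f\n", x[n - 1]);
    return 0;
}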

Using the clang++ compiler

The clang++ compiler will fail unless you add a compiler option selecting an official C++ standard, e.g. -std=c++11. The issue appears to be related to GPU-offload support for GCC extensions such as the __float128 type.

Intrinsic math functions in GPU offloaded regions

The clang/clang++ compilers in the llvm/9.0.0-git_20190220 module are unable to compile OpenMP target regions which call <math.h> functions, e.g. log() and exp(). These compilers also handle OpenMP target regions inside static libraries incorrectly: your application will fail at runtime when it reaches an OpenMP target region in a static library. If you need either of these capabilities, please use the PrgEnv-llvm/10.0.0-git_20190828 module.

CCE

The Cray compilers ('CCE') currently have the most mature OpenMP offloading capabilities of any compiler on the Cori GPU nodes, especially among Fortran compilers. Cray does not officially support CCE on the Cori GPU nodes, but it can be made to work by carefully loading and unloading modules.

# load the appropriate modules
module load cdt/20.03
module swap PrgEnv-{intel,cray}
module swap craype-{${CRAY_CPU_TARGET},x86-skylake}
module load cuda
module load openmpi/4.0.3
export CRAY_ACCEL_TARGET=nvidia70

# compile the code
mpicc  -fopenmp -o my_openmp_code.ex my_openmp_code.c     # C code
mpic++ -fopenmp -o my_openmp_code.ex my_openmp_code.cpp   # C++ code
mpif90 -h omp   -o my_openmp_code.ex my_openmp_code.f90   # Fortran code

Do not module purge if using CCE

Unlike most other compilers and modules used on the Cori GPU nodes, whose use should be preceded by module purge, the CCE compilers depend on the default Cray module environment. Therefore, do not execute module purge if you intend to use the CCE compilers.

You can add the flag -fsave-loopmark to the Cray C/C++ compilers, or -h list=a to the Cray Fortran compiler to produce an optimization report (named <source_file>.lst) which indicates which regions of the code were successfully offloaded to the GPU. For example, in the OpenMP offload version of the CloverLeaf benchmark code, the Cray compiler outputs a diagnostic report for each source file which includes sections such as:

     %%%    L o o p m a r k   L e g e n d    %%%

     Primary Loop Type        Modifiers
     ------- ---- ----        ---------
     A - Pattern matched      a - atomic memory operation
                              b - blocked
     C - Collapsed            c - conditional and/or computed
     D - Deleted
     E - Cloned
     F - Flat - No calls      f - fused
     G - Accelerated          g - partitioned
     I - Inlined              i - interchanged
     M - Multithreaded        m - partitioned
                              n - non-blocking remote transfer
                              p - partial
     R - Rerolling            r - unrolled
                              s - shortloop
     V - Vectorized           w - unwound

...

  105.             #pragma omp target teams distribute if(halo_offload)
  106.             //#pragma omp parallel for
  107.    gG-----<             for (int j = x_min-depth; j <= x_max+depth; j++)
  108.    gG                   {
  109.    gG       #pragma ivdep
  110.  + gG r2--<                 for (int k = 1; k <= depth; k++)
  111.    gG r2                    {
  112.    gG r2                        density0[FTNREF2D(j  ,y_max+k,x_max+4,x_min-2,y_min-2)] =
  113.    gG r2                            density0[FTNREF2D(j  ,y_max+1-k,x_max+4,x_min-2,y_min-2)];
  114.    gG r2-->                 }
  115.    gG----->             }

...

CC-6405 CC: ACCEL update_halo_kernel_c_, File = ext_update_halo.c, Line = 107
  A region starting at line 107 and ending at line 115 was placed on the accelerator.

CC-6430 CC: ACCEL update_halo_kernel_c_, File = ext_update_halo.c, Line = 107
  A loop was partitioned across the threadblocks and the 128 threads within a threadblock.

GCC

GCC 8.1.1 has some support for OpenMP offloading. This compiler is available via the gcc/8.1.1-openacc-gcc-8-branch-20190215 module, which depends on the cuda/9.2.148 module.

OpenMP offloading with gcc looks something like

gcc -fopenmp -foffload=nvptx-none="-Ofast -lm -misa=sm_35" base.c -c

OpenMP GPU offload support in GCC is limited

The GCC compiler's OpenMP offload capabilities for GPU code generation are very limited, in terms of both functionality and performance. Users are strongly advised to use LLVM/clang for C/C++ codes, or CCE, which also includes a Fortran compiler with OpenMP offload capability.

OpenACC

Several compilers on the GPU nodes also support GPU offloading with OpenACC directives.

GCC

The GCC module available via gcc/8.1.1-openacc-gcc-8-branch-20190215 also supports OpenACC offloading for GPUs. Invoking OpenACC looks like:

gcc -fopenacc -foffload=nvptx-none="-Ofast -lm -misa=sm_35" base.c -c

PGI

The PGI compilers support OpenACC offloading and are available via the pgi modules.

Invoking OpenACC in the PGI compilers looks like:

pgc++ -acc -ta=tesla:cc70 base.c -c
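
A minimal OpenACC kernel of the sort such a command compiles might look like this (a sketch; base.c is a placeholder name):

#include <stdio.h>

#define N (1 << 20)

int main(void)
{
    static float x[N], y[N];

    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    /* The compiler generates a Tesla (GPU) kernel for this loop. */
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; i++)
        y[i] += 2.0f * x[i];

    printf("y[0] = %f\n", y[0]);
    return 0;
}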

Documentation for the PGI compiler is provided here.

Python

There are many options for using Python on GPUs, each with their own set of pros/cons. We have tried to provide a brief overview of several frameworks here. The Python GPU landscape is changing quickly so please check back periodically for more information.

On corigpu we recommend that users build a custom conda environment for the Python GPU framework they would like to use. You can find instructions for building a custom conda environment here. Make sure that you are on corigpu when you build your environment and install the packages you need.

In all cases you'll need to:

  1. Make sure you have sourced your conda environment via source activate mypythonenv
  2. Run your code with the general format srun -n 1 python yourscript.py args...

CuPy

  • module load python cuda
  • Build a custom conda environment
  • pip install cupy into your environment following the directions here
  • As of May 2020, our default CUDA module is 10.2. Your CuPy and CUDA versions must match, so you'll need to pip install cupy-cuda102 (a quick test is sketched below)
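
Once the packages are installed, a quick CuPy check might look like this (a sketch, assuming a conda environment with the cupy-cuda102 wheel):

import cupy as cp

# Allocate an array on the GPU and compute on the device.
x = cp.arange(10_000_000, dtype=cp.float32)
y = 2.0 * x + 1.0

# Reduce on the GPU, then bring the scalar result back to the host.
print(float(y.sum()))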

Numba CUDA

  • module load python cuda
  • Build a custom conda environment
  • conda install numba and cudatoolkit into your environment following the directions here
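
A small Numba CUDA kernel to verify the setup might look like this (a sketch, assuming numba and cudatoolkit are installed in the active environment):

import numpy as np
from numba import cuda

@cuda.jit
def add_one(x):
    i = cuda.grid(1)
    if i < x.shape[0]:
        x[i] += 1.0

x = np.zeros(1024, dtype=np.float64)
d_x = cuda.to_device(x)                      # copy the array to the GPU
add_one[(x.size + 127) // 128, 128](d_x)     # launch: (blocks, threads per block)
print(d_x.copy_to_host()[:4])                # expect [1. 1. 1. 1.]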

PyOpenCL

  • module load python
  • Build a custom conda environment with conda install -n pyopencl-env -c conda-forge pyopencl cudatoolkit
  • And then create a link (only once) to the CUDA OpenCL vendor driver via
    ln -s /etc/OpenCL/vendors/nvidia.icd ~/.conda/envs/pyopencl-env/etc/OpenCL/vendors/nvidia.icd
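
After the ICD link is in place, a quick way to check that PyOpenCL can see the NVIDIA device (a sketch, assuming the pyopencl-env environment is active):

import pyopencl as cl

# List the platforms and devices visible through the linked vendor ICD.
for platform in cl.get_platforms():
    for device in platform.get_devices():
        print(platform.name, "->", device.name)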
    

PyCUDA

  • module load python cuda
  • build a custom conda environment
  • pip install pycuda
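
A minimal PyCUDA check might look like this (a sketch, assuming pycuda built successfully against the loaded cuda module):

import numpy as np
import pycuda.autoinit            # creates a CUDA context on the first visible GPU
import pycuda.gpuarray as gpuarray

a = gpuarray.to_gpu(np.arange(16, dtype=np.float32))
print((2 * a).get())              # computed on the GPU, copied back to the host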

JAX

  • module load python cuda
  • Build a custom conda environment
  • JAX installation is somewhat complex. You can use this script on corigpu:
    #!/usr/bin/env bash
    
    # install jaxlib
    PYTHON_VERSION=cp37  # alternatives: cp36, cp37, cp38
    CUDA_VERSION=cuda102  # alternatives: cuda92, cuda100, cuda101, cuda102
    PLATFORM=linux_x86_64  # alternatives: linux_x86_64
    BASE_URL='https://storage.googleapis.com/jax-releases'
    pip install --upgrade $BASE_URL/$CUDA_VERSION/jaxlib-0.1.46-$PYTHON_VERSION-none-$PLATFORM.whl
    
    pip install --upgrade jax  # install jax
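
A quick way to confirm that JAX sees the GPU after installation (a sketch):

import jax
import jax.numpy as jnp

print(jax.devices())              # should list a GPU device
x = jnp.arange(1_000_000.0)
print(jnp.dot(x, x))              # executed on the GPU when one is visible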
    

RAPIDS

We provide a RAPIDS kernel which you will find at jupyter.nersc.gov.

If you would like your own custom conda environment and/or Jupyter kernel that contains RAPIDS, you can follow the directions below.

1) Make sure you are on a Cori GPU node

2) module load python cuda

3) conda create -n rapids_env python=3.7

4) source activate rapids_env

5) Using the tool at https://rapids.ai/start.html, we generated the following command:

conda install -c rapidsai -c nvidia -c conda-forge \
    -c defaults rapids=0.13 python=3.7 cudatoolkit=10.2

to conda install RAPIDS into your rapids_env.

6) If you intend to use RAPIDS via scripts/command line, you're ready to go. If you would like to create your own RAPIDS kernel to use in Jupyter, you'll need to conda install ipykernel and python -m ipykernel install --user --name rapids_env --display-name rapids

7) You'll need to restart your Jupyter server. When you log in, you should now see your rapids kernel as an option.

8) If you need other libraries like matplotlib, you may want to install them during your original conda install command (see step 5), or you may want to install them later via pip with --user. Either approach will help you avoid dependency problems.

For more information about how to use NVIDIA RAPIDS, please see our Examples page.

MPI4py

You can build mpi4py and install it into a conda environment on Cori to be used with one of the MPI implementations available for use with the GPU nodes. First, request an interactive session on a GPU node:

module purge; module load python esslurm
salloc -C gpu -A <account> -t 30 -G 1 -c 10

Then, on the GPU node, create or activate a conda environment, load your MPI implementation of choice (including relevant compiler), download mpi4py, and build/install the software using the mpicc wrapper:

user@cgpu12:~> conda create -n mpi4pygpu python=2.7
user@cgpu12:~> source activate mpi4pygpu
(mpi4pygpu) user@cgpu12:~> module load gcc/7.3.0 cuda mvapich2 (or pgi/intel instead of gcc)
(mpi4pygpu) user@cgpu12:~> wget https://bitbucket.org/mpi4py/mpi4py/downloads/mpi4py-3.0.0.tar.gz
(mpi4pygpu) user@cgpu12:~> tar zxvf mpi4py-3.0.0.tar.gz
(mpi4pygpu) user@cgpu12:~> cd mpi4py-3.0.0
(mpi4pygpu) user@cgpu12:~> python setup.py build --mpicc=mpicc
(mpi4pygpu) user@cgpu12:~> python setup.py install
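
A quick way to verify the installation (a sketch; hello_mpi.py is a hypothetical file name, to be run with something like srun -n 2 python hello_mpi.py on a GPU node):

from mpi4py import MPI

comm = MPI.COMM_WORLD
print("Hello from rank %d of %d" % (comm.Get_rank(), comm.Get_size()))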

Deep Learning Software

TensorFlow:

module load tensorflow/gpu-1.13.1-py36

PyTorch:

module load pytorch/v1.1.0-gpu
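
After loading either module, a quick check that the framework can see the GPU might look like this (a sketch, shown for PyTorch):

import torch

# Confirm that PyTorch sees the V100 allocated to the job.
print(torch.cuda.is_available(), torch.cuda.get_device_name(0))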

CUDA Fortran

The PGI Fortran compiler supports CUDA Fortran.

Compiler bugs

If you find bugs in the compilers (wrong answers, compiler crashes, etc.), PLEASE REPORT THEM TO NERSC! Any OpenMP target issues can be sent directly to Chris Daley: csdaley@lbl.gov. Many compilers are still in the early phases of GPU enablement, and their developers depend on bug reports to fix problems quickly.

Intel MKL

To use routines provided by the Intel MKL, load one of the available intel modules before compiling and running your code:

module load intel

Module names

Be sure to load an intel compiler module and not the Intel programming environment module (PrgEnv-intel).

To determine the appropriate link lines for your code, use the Intel MKL Link Line Advisor.
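
For example, for dynamic linking of the sequential, LP64 MKL interface with the Intel compilers, the advisor typically suggests something along these lines (a sketch, assuming the intel module sets MKLROOT; verify against the advisor for your configuration):

icpc myapp.cpp -L${MKLROOT}/lib/intel64 -lmkl_intel_lp64 -lmkl_sequential -lmkl_core -lpthread -lm -ldl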

OpenCL

OpenCL is supported natively by NVIDIA's CUDA toolkit. In addition, there is a module for the Portable OpenCL (POCL) implementation, which is based on LLVM and uses its NVPTX backend. It's recommended that you try the NVIDIA solution first, and then the POCL implementation, as POCL may provide better performance.

Module load order may affect which driver you get

If you need to load a CUDA module for your workflow, POCL must be loaded after the CUDA module to avoid using the NVIDIA driver.

NVIDIA OpenCL

A compilation using the NVIDIA driver requires specifying the path of the OpenCL CL/cl.h include file:

module load gcc cuda
g++ $CFLAGS -I$CUDA_ROOT/include <myapplication.c> -lOpenCL

Portable OpenCL (experimental)

In order to use the POCL implementation, you must first load the POCL module. Note that the POCL module includes the necessary paths for include files and libraries.

module use /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/modulefiles
module load opencl
g++ $CFLAGS <myapplication.c> -lOpenCL

You can check to make sure you're using POCL using the clinfo utility.

cgpu$ module load clinfo
cgpu$ srun clinfo -l
Platform #0: Portable Computing Language
 +-- Device #0: pthread-Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz
 `-- Device #1: Tesla V100-SXM2-16GB

SYCL

There are a few options for SYCL compilers. Note that both of these are experimental to some degree.

ComputeCpp (experimental)

ComputeCpp is a production compiler developed by CodePlay, but the NERSC configuration relies on the open source Portable OpenCL (POCL) implementation, together with an open source SPIR-V to LLVM translator, in order to target the NVIDIA V100 GPU. The NERSC instantiation is not an officially supported configuration, but the combination happens to work.

A prerequisite is to load the following module path:

module use /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/modulefiles
module load computecpp

Usage is best demonstrated with the following example:

cgpu$ cp -R /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/llvm-pocl/example .
cgpu$ cd example
cgpu$ make
compute++ -std=c++14 -O2 -sycl-driver -sycl-target spirv64 -no-serial-memop -o simple-vector-add.x simple-vector-add.cpp -lComputeCpp -lOpenCL
cgpu$ srun simple-vector-add.x
Using Platform Portable Computing Language: Device Tesla V100-SXM2-16GB
Using Platform Portable Computing Language: Device Tesla V100-SXM2-16GB
The results are correct!

Intel DPC++/SYCL (experimental)

The second option is to use the Intel Data Parallel C++ (DPC++) compiler. This is based on the LLVM/Clang compiler with SYCL extensions added by Intel (the basis of their oneAPI DPC++ solution), and an experimental NVPTX backend provided by CodePlay to target NVIDIA GPUs.

Intel SYCL requires a custom device selector

The Intel SYCL compiler targets NVPTX directly, i.e. it bypasses the OpenCL driver, so the default SYCL GPU selector may not find the NVIDIA GPU. The example below contains code demonstrating how to select the NVIDIA GPU as a device.

A prerequisite is to load the following module path:

module use /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/modulefiles
module load dpc++

Example usage of the Intel DPC++ compiler:

cgpu$ cp -R /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/llvm-sycl/example .
cgpu$ cd example
cgpu$ make
clang++ -std=c++14 -O2 -fsycl -fsycl-targets=nvptx64-nvidia-cuda-sycldevice -Xsycl-target-backend '--cuda-gpu-arch=sm_70' -o simple-vector-add.x simple-vector-add.cpp
cgpu$ srun simple-vector-add.x
Using Platform NVIDIA CUDA: Device Tesla V100-SXM2-16GB
Using Platform NVIDIA CUDA: Device Tesla V100-SXM2-16GB
The results are correct!

HIP

The HIP compiler and associated hipBLAS library are available.

A prerequisite is to load the following module path:

module use /global/cfs/cdirs/mpccc/dwdoerf/cori-gpu/modulefiles
module load hip
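
Compiling a simple HIP source file then looks something like the following (a sketch; saxpy_hip.cpp is a placeholder for your own source file):

hipcc -o saxpy_hip.ex saxpy_hip.cpp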

Profiling

The Cori GPU nodes provide a few tools for profiling GPU code.

nvprof

nvprof has been CUDA's standard profiling tool for several years. It is easy to use: simply insert nvprof in front of your application in the srun command, and it will profile the code and generate a report:

user@cgpu17:~/samples/bin/x86_64/linux/release> srun -n 1 ./nvgraph_SpectralClustering
GPU Device 0: "Tesla V100-SXM2-16GB" with compute capability 7.0

Modularity_score: 0.371466
Hit rate : 100.000000% (34 hits)
Done!
user@cgpu17:~/samples/bin/x86_64/linux/release> srun -n 1 nvprof ./nvgraph_SpectralClustering
==152717== NVPROF is profiling process 152717, command: ./nvgraph_SpectralClustering
GPU Device 0: "Tesla V100-SXM2-16GB" with compute capability 7.0

Modularity_score: 0.371466
Hit rate : 100.000000% (34 hits)
Done!
==152717== Profiling application: ./nvgraph_SpectralClustering
==152717== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   26.12%  22.309ms      6162  3.6200us  3.3270us  5.4400us  void nrm2_kernel<float, float, float, int=1, int=0, int=128, int=0>(cublasNrm2Params<float, float>)
                   16.07%  13.722ms      8850  1.5500us  1.5040us  2.1120us  [CUDA memcpy DtoH]
                   13.07%  11.161ms      8620  1.2940us  1.2160us  1.9520us  void axpy_kernel_val<float, float, int=0>(cublasAxpyParamsVal<float, float, float>)
                   11.51%  9.8280ms      5752  1.7080us  1.6310us  13.600us  void dot_kernel<float, float, float, int=128, int=0, int=0>(cublasDotParams<float, float>)
                   10.79%  9.2173ms      2877  3.2030us  3.1040us  13.344us  void csrMv_kernel<float, float, float, int=128, int=1, int=2>(cusparseCsrMvParams<float, float, float>)
                   10.28%  8.7775ms      5752  1.5250us  1.4710us  2.4960us  void reduce_1Block_kernel<float, float, float, int=128, int=7>(float*, int, float*)
                    5.54%  4.7290ms      3295  1.4350us  1.3760us  1.8240us  [CUDA memcpy HtoD]
                    4.23%  3.6102ms      3079  1.1720us  1.1200us  1.4400us  void scal_kernel_val<float, float, int=0>(cublasScalParamsVal<float, float>)
                    1.06%  908.79us       206  4.4110us  3.5190us  8.1920us  volta_sgemm_128x32_nn
                    0.57%  487.16us       413  1.1790us  1.1520us  1.6640us  [CUDA memcpy DtoD]
...

nvprof is part of the CUDA SDK (available via the cuda modules on the Cori GPU nodes), but after CUDA 10 it will be replaced by Nsight Compute, NVIDIA's new profiling tool.

Documentation about nvprof is here.

Using SLURM_TASK_PID with nvprof and multiple MPI ranks

nvprof profiles only one task at a time; if one profiles a GPU code which has multiple tasks (e.g., multiple MPI ranks), nvprof will save one profile per task when used with the -o flag. However, each file must have a unique filename, or else all nvprof tasks will attempt to write to the same file, typically resulting in an unusable profiling result.

One can use the SLURM_TASK_PID environment variable to save each nvprof profiling result to a unique file; the files can then be combined into a single result in the nvvp GUI. To do this, one must invoke a bash shell inside the srun statement so that each task returns a unique value of SLURM_TASK_PID:

user@cgpu02:~> srun -n 2 -c 2 bash -c 'echo $SLURM_TASK_PID'
36032
36031

Note that, without the bash -c statement, each task will return the same value of SLURM_TASK_PID, because the variable is expanded by the shell that invokes srun rather than by each task's own shell:

user@cgpu02:~> srun -n 2 -c 2 echo $SLURM_TASK_PID
35765
35765

The easiest way to use SLURM_TASK_PID to produce a unique profiling database per MPI rank is to use it inside a shell script which is then invoked by srun, e.g.,:

user@cgpu02:~> cat profile.sh
#!/bin/bash

file_prefix=$(bash -c 'echo $SLURM_TASK_PID')
nvprof -o result_${file_prefix}.nvvp ./main.ex
user@cgpu02:~> srun -n 2 -c 2 --cpu-bind=cores ./profile.sh
==44490== NVPROF is profiling process 44490, command: ./main.exe
==44492== NVPROF is profiling process 44492, command: ./main.exe
...
(code runs)
...
==44490== Generated result file: /path/to/result/result_44468.nvvp
==44492== Generated result file: /path/to/result/result_44467.nvvp

One can then merge the individual .nvvp files in the nvvp GUI in order to produce a unified profiling result of the entire application.

nvvp

nvvp is the profiling GUI which accompanies nvprof. It is used to display, in graphical form, the profiling information collected by nvprof. Since X11 window forwarding via SSH is typically slow, one will enjoy a much better nvvp experience by running it on a Cori login node using NoMachine. Example output of nvvp is shown below:

nvvp screenshot

Nsight

Nsight is NVIDIA's new profiling suite which will replace nvprof after CUDA 10. It measures much of the same information as nvprof, but organizes information in different ways. Nsight is divided into two separate tools: Nsight Compute, and Nsight Systems.

Nsight Systems

The way to achieve approximately the same output as nvprof ./my_code.exe is to use nsys profile --stats=true ./my_code.exe, which is part of Nsight Systems. The output resembles the following:

Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   89.0       520526780        4002        130066.7            2119         1275014  cuStreamSynchronize
    4.9        28856541           1      28856541.0        28856541        28856541  cuMemHostAlloc
    2.8        16433738        3000          5477.9            4403           35531  cuLaunchKernel
    0.9         5499511        1010          5445.1            1410         1306269  cuEventSynchronize
    0.8         4816205        1005          4792.2            4170           22094  cuMemcpyDtoHAsync_v2
    0.6         3420392        1000          3420.4            2884           20080  cuMemsetD32Async
    0.4         2409796           5        481959.2            4670         1297188  cuMemAlloc_v2
    0.3         1763291        1012          1742.4            1510           14416  cuEventRecord
    0.2         1076739           1       1076739.0         1076739         1076739  cuMemAllocHost_v2
    0.0          272120           1        272120.0          272120          272120  cuModuleLoadDataEx
    0.0           75190           4         18797.5           15333           27138  cuMemcpyHtoDAsync_v2
    0.0           12881           1         12881.0           12881           12881  cuStreamCreate
    0.0            6902           4          1725.5             475            2795  cuEventCreate




Generating CUDA Kernel Statistics...
CUDA Kernel Statistics (nanoseconds)

Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   56.3       284400677        1000        284400.7          274750          296989  laplace_85_gpu
   43.2       218012722        1000        218012.7          215646          223102  laplace_102_gpu
    0.5         2382122        1000          2382.1            2272            9280  laplace_85_gpu__red



Generating CUDA Memory Operation Statistics...
CUDA Memory Operation Statistics (nanoseconds)

Time(%)      Total Time  Operations         Average         Minimum         Maximum  Name
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   49.2         6552324        1005          6519.7            1376         1295316  [CUDA memcpy DtoH]
   40.9         5445551           4       1361387.7         1354995         1365236  [CUDA memcpy HtoD]
    9.8         1308501        1000          1308.5            1279            1952  [CUDA memset]


CUDA Memory Operation Statistics (KiB)

              Total      Operations              Average            Minimum              Maximum  Name
-------------------  --------------  -------------------  -----------------  -------------------  --------------------------------------------------------------------------------
          65539.906            1005               65.214              0.004            16383.937  [CUDA memcpy DtoH]
          65536.000               4            16384.000          16384.000            16384.000  [CUDA memcpy HtoD]
              3.906            1000                0.004              0.004                0.004  [CUDA memset]

Nsight Systems will save the profiling output to a .qdrep file in the present working directory. One can then view the .qdrep profiling database via the Nsight Systems GUI. Adding the --stats=true flag to nsys profile causes Nsight Systems to automatically convert the .qdrep file into a .sqlite file. As with nvvp, the Nsight Systems GUI is best viewed from a Cori login node using NoMachine. Inside a NoMachine session, one can launch the Nsight Systems GUI with the nsight-sys command. From the GUI, one can import the .qdrep file to view the profiling output. An example of this output is shown below:

nsight-sys screenshot

NVIDIA documentation about Nsight Systems is here.

Nsight Compute

The Nsight Compute tool enables deep dives into GPU code performance. The command line interface to Nsight Compute is nv-nsight-cu-cli. More documentation about this tool is forthcoming.
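
A basic invocation mirrors the nvprof usage shown above (a sketch):

srun -n 1 nv-nsight-cu-cli ./my_gpu_code.ex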

NVIDIA documentation about Nsight Compute is here.

Debugging

Debugging applications which launch GPU kernels is more challenging than debugging CPU applications. Typical debuggers like gdb cannot debug GPU kernels or CUDA C application code. Nevertheless, several tools are available on Cori GPU which can aid in debugging GPU code. Most interactive command-line debuggers like cuda-gdb require the additional srun flag --pty in order to function properly (please see the Cori GPU usage page for more details).

cuda-gdb

cuda-gdb is NVIDIA's official debugger for GPU code. It behaves much like GNU gdb, but has many enhancements for debugging GPU kernels. It is designed to debug non-MPI applications; for MPI-parallelized applications, NVIDIA recommends using debuggers from other vendors, which are summarized below. On Cori GPU nodes, cuda-gdb is provided by any of the cuda modules.
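
A typical interactive session therefore starts with something like the following (a sketch; my_gpu_code.ex is a placeholder for your own executable):

srun -n 1 --pty cuda-gdb ./my_gpu_code.ex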

Arm DDT

Arm DDT (formerly "Allinea DDT") is a debugger for parallel codes which run on both CPUs and GPUs. Documentation for using Arm DDT to debug parallel CPU codes at NERSC is provided here. The same debugger can also be used to debug parallel GPU codes on Cori GPU nodes. Documentation regarding how to use Arm DDT to debug GPU codes is provided here.

TotalView

TotalView is another debugger for parallel GPU applications which is available at NERSC. Documentation regarding how to debug parallel CPU-only applications at NERSC is provided here. Documentation regarding how to debug parallel GPU codes with TotalView is provided here.

How do I know if my code ran on the GPU?

While it is usually clear that a code has run at all, it is sometimes less clear whether the code ran on the CPU or the GPU. One way this ambiguity can arise is if one includes GPU offloading directives in the code, but does not use the appropriate compiler flag to enable those directives.

There are several ways to determine if your code actually ran on the GPU. It may be more useful to know when a code does not run on the GPU (especially when one expects that it should):

  • Run the code through an NVIDIA profiler such as Nsight Compute or nvprof. If a code runs on the GPU, both profilers will print a summary following code execution:

    ==39359== Profiling application: ./laplace2d_acc
    ==39359== Profiling result:
                Type  Time(%)      Time     Calls       Avg       Min       Max  Name
     GPU activities:   53.75%  264.48ms      1000  264.48us  254.33us  292.35us  main_96_gpu
                       43.03%  211.73ms      1000  211.73us  210.27us  214.01us  main_109_gpu
                        1.37%  6.7473ms      1004  6.7200us  1.2790us  1.3656ms  [CUDA memcpy HtoD]
                        1.33%  6.5500ms      1005  6.5170us  1.4070us  1.2849ms  [CUDA memcpy DtoH]
    

    An Nsight Compute profile of a code which does not run on the GPU at all will print the following message:

    user@cgpu12:~/tests> srun -n 1 nv-nsight-cu-cli ./a.out
    ==PROF== ERROR: Target application terminated before first instrumented API call.
    srun: error: cgpu12: task 0: Exited with exit code 255
    srun: Terminating job step 123456.5
    user@cgpu12:~/tests>
    

    and an nvprof profile would print:

    user@cgpu12:~/tests> srun -n 1 nvprof ./a.out
    ======== Warning: No profile data collected.
    user@cgpu12:~/tests>
    
  • Check compiler reports. If one writes a code with OpenACC directives and compiles it with the PGI compiler but does not include the flags needed to inform the compiler to use the directives, there will be no output:

    user@cori02:~> pgcc -I../common -Minfo=accel -o laplace2d_acc laplace2d.c
    user@cori02:~>
    

    Passing the appropriate OpenACC flags to the compiler (in this case -acc -ta=nvidia) results in more output, including a note that the compiler generated Tesla code.

    user@cori02:~> pgcc -I../common -acc -ta=nvidia -Minfo=accel -o laplace2d_acc laplace2d.c
    main:
         86, Generating copy(A[:][:])
             Generating create(Anew[:][:])
         93, Loop is parallelizable
         96, Loop is parallelizable
             Generating Tesla code
             93, #pragma acc loop gang(32), vector(16) /* blockIdx.y threadIdx.y */
             96, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
            100, Generating implicit reduction(max:error)
        106, Loop is parallelizable
        109, Loop is parallelizable
             Generating Tesla code
            106, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
            109, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
    user@cori02:~>
    

Shifter with CUDA

Shifter works in a slightly different way on the Cori GPU nodes than it does on the Haswell or KNL nodes. First, one should add the following to ENTRYPOINT in the Shifter container:

export PATH=/opt/shifter/bin:${PATH}
export LD_LIBRARY_PATH=/opt/shifter/lib:${LD_LIBRARY_PATH}

Next, one should load the cuda/shifter module; this will define the SHIFTER_CUDA_ROOT environment variable and point it to the version of the CUDA SDK installation which works in Shifter images.

cuda modules and Shifter images

Note that the normal cuda modules will not work inside Shifter containers.

Then one should invoke shifter from the job script as follows:

srun -n <num_task> -c <num_cpu> \
    shifter \
    --image=<your_image> \
    --entrypoint \
    --volume=${SHIFTER_CUDA_ROOT}:/opt/shifter:ro \
    ./your_gpu_code.ex args

A complete example is shown below; this example is also provided in $SHIFTER_CUDA_ROOT/example.

#!/bin/bash -e

#SBATCH -A nstaff
#SBATCH -N 1
#SBATCH -C gpu
#SBATCH --gres=gpu:1
#SBATCH -t 00:10:00
#SBATCH --job-name=nvidia-shifter

# provides SHIFTER_CUDA_ROOT
module load cuda/shifter

#
# jrmadsen/tomopy:shifter container has the following in entrypoint:
#
#    export PATH=/opt/shifter/bin:${PATH}
#    export LD_LIBRARY_PATH=/opt/shifter/lib:${LD_LIBRARY_PATH}
#

export TOMOPY_NUM_THREADS=32
export NUMEXPR_MAX_THREADS=80
export OMP_NUM_THREADS=1

srun -n 1 -c 1 \
    shifter \
    --image=jrmadsen/tomopy:shifter \
    --entrypoint \
    --volume=${SHIFTER_CUDA_ROOT}:/opt/shifter:ro \
    python ${SHIFTER_CUDA_ROOT}/example/phantom.py -a mlem -i 100 -n 1 -p shepp2d -f png -s 256

Known Issues

MPS disabled indefinitely

NVIDIA's Multi-Process Service (MPS) enables multiple processes (typically MPI ranks) to execute kernels on a single GPU simultaneously. MPS can enable an application to achieve higher performance when a single process is unable to saturate the GPU's resources.

Unfortunately, a security vulnerability in the V100 GPUs was disclosed in February 2019 (see here, here, and here for more information) which exposes data in GPU memory to a side-channel attack if the GPU is accessed by multiple processes simultaneously. The vulnerability is not present if a GPU is allocated exclusively to a single process.

As a result of this security risk, NERSC has disabled MPS on Cori GPU until a mitigation for this vulnerability is implemented.

PGI 19.5 requires CUDA <= 10.1.105

When using OpenACC directives, the PGI 19.5 compiler is compatible only with CUDA modules up to cuda/10.1.105; it is not compatible with cuda/10.1.168. If one compiles OpenACC code with the cuda/10.1.168 module loaded, one encounters an error at runtime:

user@cgpu01:~/tests/OpenACC/vector_add> module list -l
- Package -----------------------------+- Versions -+- Last mod. ------
Currently Loaded Modulefiles:
esslurm                                              2019/02/08 22:01:04
pgi/19.5                                             2019/07/19 22:04:32
modules/3.2.10.6                                     2017/04/27 21:50:33
cuda/10.1.168                                        2019/07/19 22:01:27
user@cgpu01:~/tests/OpenACC/vector_add> pgf90 -acc -ta=tesla -o vector_add.ex vector_add.f90
user@cgpu01:~/tests/OpenACC/vector_add> srun -n 1 ./a.out
Failing in Thread:0
call to cuInit returned error -1: Other

srun: error: cgpu01: task 0: Exited with exit code 1
srun: Terminating job step 182246.6
user@cgpu01:~/tests/OpenACC/vector_add>

MVAPICH2 ptmalloc warnings with Python MPI codes

When running some MPI-enabled Python codes compiled with MVAPICH2, one may encounter the following warning:

WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without
InfiniBand registration cache support.

This is due to a bad interaction between MVAPICH2's ptmalloc library and the memory allocator used in Python. Details about this warning are provided here. As described on that page, one workaround is to set the LD_PRELOAD environment variable:

export LD_PRELOAD=$MVAPICH2_DIR/lib/libmpi.so