Known Issues

MPS disabled indefinitely

NVIDIA's Multi-Process Service (MPS) enables multiple processes (typically MPI ranks) to execute kernels on a single GPU simultaneously. MPS can enable an application to achieve higher performance when a single process is unable to saturate the GPU's resources.

Unfortunately, a security vulnerability in the V100 GPUs was disclosed in February 2019 (see here, here, and here for more information) which exposes data in GPU memory to a side-channel attack if the GPU is accessed by multiple processes simultaneously. The vulnerability is not present if a GPU is allocated exclusively to a single process.

As a result of this security risk, NERSC has disabled MPS on Cori GPU until a mitigation for this vulnerability is implemented.

nvcc from HPC SDK v20.5 fails if a GPU is not detected

The NVIDIA CUDA C compiler nvcc is able to cross-compile CUDA code, i.e., it does not require the presence of a GPU in order to compile GPU code. However, a bug in the version of nvcc bundled with the HPC SDK breaks this cross-compilation behavior: the compiler attempts to detect a GPU each time it is invoked, and fails with the following error if one is not found:

user@cori07:~> nvcc my_code.cu
nvcc-Error-Version /global/common/cori_cle7/software/hpcsdk/20.5/Linux_x86_64/cuda//bin is not available in this installation
user@cori07:~>

This will be fixed in a future release of the HPC SDK. To work around this problem, users should compile CUDA code using the nvcc included with the cuda modules.
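
For example, the compile could look like the following. This is a minimal sketch: the cuda module version is omitted (load whichever cuda module is installed), and my_code.cu and the output name are placeholders.

user@cori07:~> module load cuda
user@cori07:~> nvcc -o my_code.ex my_code.cu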

nvc++ from HPC SDK v20.5 requires detection of a GPU to generate GPU-accelerated pSTL code

The HPC SDK includes a new capability for generating GPU-accelerated code from C++17 parallel algorithms (see official documentation of this feature here). However, it requires a GPU to be visible when the compiler is invoked in order to generate GPU code; if a GPU is not detected at compile time, nvc++ will silently fall back to generating CPU code, even if the -stdpar=gpu compiler switch is provided. This behavior will be addressed in a future release of the SDK. In the meantime, a workaround is to invoke nvc++ with srun in an interactive Cori GPU job, so that the compiler can see a GPU and generate GPU code appropriately.
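
As a sketch of what this feature looks like, the following C++17 code uses a standard parallel algorithm that nvc++ can offload to the GPU when compiled with -stdpar=gpu. The file and executable names are placeholders, and the compile must be launched with srun inside an interactive Cori GPU job allocation so that a GPU is visible to the compiler:

// stdpar_test.cpp: minimal C++17 parallel-algorithm example (placeholder name)
#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<double> v(1 << 20, 1.0);
    // The parallel execution policy is what -stdpar=gpu can offload to the GPU.
    std::for_each(std::execution::par_unseq, v.begin(), v.end(),
                  [](double &x) { x *= 2.0; });
    return 0;
}

user@cgpu01:~> srun -n 1 nvc++ -stdpar=gpu -o stdpar_test.ex stdpar_test.cpp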

PGI 19.5 requires CUDA <= 10.1.105

When compiling code that uses OpenACC directives, the PGI v19.5 compiler is compatible with CUDA modules only up to cuda/10.1.105; it is not compatible with cuda/10.1.168. If one compiles OpenACC code with the cuda/10.1.168 module loaded, the resulting executable fails at runtime:

user@cgpu01:~/tests/OpenACC/vector_add> module list -l
- Package -----------------------------+- Versions -+- Last mod. ------
Currently Loaded Modulefiles:
esslurm                                              2019/02/08 22:01:04
pgi/19.5                                             2019/07/19 22:04:32
modules/3.2.10.6                                     2017/04/27 21:50:33
cuda/10.1.168                                        2019/07/19 22:01:27
user@cgpu01:~/tests/OpenACC/vector_add> pgf90 -acc -ta=tesla -o vector_add.ex vector_add.f90
user@cgpu01:~/tests/OpenACC/vector_add> srun -n 1 ./vector_add.ex
Failing in Thread:0
call to cuInit returned error -1: Other

srun: error: cgpu01: task 0: Exited with exit code 1
srun: Terminating job step 182246.6
user@cgpu01:~/tests/OpenACC/vector_add>
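
The workaround is to swap to a compatible CUDA module before compiling and running. A sketch, assuming the cuda/10.1.105 modulefile is installed:

user@cgpu01:~/tests/OpenACC/vector_add> module swap cuda/10.1.168 cuda/10.1.105
user@cgpu01:~/tests/OpenACC/vector_add> pgf90 -acc -ta=tesla -o vector_add.ex vector_add.f90
user@cgpu01:~/tests/OpenACC/vector_add> srun -n 1 ./vector_add.ex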

MVAPICH2 ptmalloc warnings with Python MPI codes

When running some MPI-enabled Python codes whose MPI bindings were built against MVAPICH2, one may encounter the following warning:

WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without
InfiniBand registration cache support.

This is due to a bad interaction between MVAPICH2's ptmalloc library and the memory allocator used in Python. Details about this warning are provided here. As described on that page, one workaround is to set the LD_PRELOAD environment variable:

export LD_PRELOAD=$MVAPICH2_DIR/lib/libmpi.so
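
For example, for an MPI-enabled Python script, the variable can be set just for the job launch. This is a sketch: my_mpi_script.py is a placeholder, and $MVAPICH2_DIR is assumed to be set by the loaded mvapich2 module.

export LD_PRELOAD=$MVAPICH2_DIR/lib/libmpi.so
srun -n 2 python my_mpi_script.py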

MPI_THREAD_MULTIPLE with MVAPICH2

By default, code compiled with MVAPICH2 will support only MPI_THREAD_SINGLE, even if a higher threading level is requested in MPI_Init_thread(). This is by design; see this page for more information. If one requests a higher level of threading support, one will encounter the following runtime warning:

user@cgpu04:~/> mpicc -o main.ex main.c
user@cgpu04:~/> srun -n 2 -c 2 ./main.ex
Requested MPI_THREAD_MULTIPLE, got MPI_THREAD_SINGLE
Hello world from processor cgpu04, rank 0 out of 2 processors
Hello world from processor cgpu04, rank 1 out of 2 processors
Error in system call pthread_mutex_destroy: Device or resource busy
    src/mpi/init/initthread.c:241
Error in system call pthread_mutex_destroy: Device or resource busy
    src/mpi/init/initthread.c:241

To enable higher levels of threading support, e.g., MPI_THREAD_MULTIPLE, one must disable MVAPICH2's default task binding behavior by setting the environment variable MV2_ENABLE_AFFINITY=0 during execution:

user@cgpu04:~> MV2_ENABLE_AFFINITY=0 srun -n 2 -c 2 ./main.ex
Requested MPI_THREAD_MULTIPLE, got MPI_THREAD_MULTIPLE
Hello world from processor cgpu04, rank 0 out of 2 processors
Hello world from processor cgpu04, rank 1 out of 2 processors
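
For reference, a minimal main.c in the spirit of the transcripts above might look like the following sketch (not the exact code used above):

/* main.c: request MPI_THREAD_MULTIPLE and report what was actually provided */
#include <mpi.h>
#include <stdio.h>

static const char *thread_level_name(int level)
{
    return level == MPI_THREAD_MULTIPLE   ? "MPI_THREAD_MULTIPLE"
         : level == MPI_THREAD_SERIALIZED ? "MPI_THREAD_SERIALIZED"
         : level == MPI_THREAD_FUNNELED   ? "MPI_THREAD_FUNNELED"
         :                                  "MPI_THREAD_SINGLE";
}

int main(int argc, char **argv)
{
    int provided, rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* MVAPICH2 downgrades the provided level to MPI_THREAD_SINGLE
       unless MV2_ENABLE_AFFINITY=0 is set at run time. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);

    if (rank == 0)
        printf("Requested MPI_THREAD_MULTIPLE, got %s\n",
               thread_level_name(provided));

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           name, rank, size);

    MPI_Finalize();
    return 0;
}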