Known Issues¶
MPS disabled indefinitely¶
NVIDIA's Multi-Process Service (MPS) enables multiple processes (typically MPI ranks) to execute kernels on a single GPU simultaneously. MPS can enable an application to achieve higher performance when a single process is unable to saturate the GPU's resources.
Unfortunately, a security vulnerability in the V100 GPUs was disclosed in February 2019 (see here, here, and here for more information) which leaves data in GPU memory exposed to a side-channel attack if the GPU is accessed by multiple processes simultaneously. The vulnerability is not present if a GPU is allocated exclusively to a single process.
As a result of this security risk, NERSC has disabled MPS on Cori GPU until a mitigation for this vulnerability is implemented.
OpenACC codes using CUDA awareness need UCX_MEMTYPE_CACHE=n¶
Programs compiled with OpenMPI and OpenACC that rely on CUDA-awareness in OpenMPI will crash with an error like the following:
[cgpu02:725 :0:725] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2aab07afa000)
==== backtrace (tid: 725) ====
0 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x2aaac3233ac4]
1 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucs.so.0(+0x21cc4) [0x2aaac3233cc4]
2 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucs.so.0(+0x21e7a) [0x2aaac3233e7a]
3 /lib64/libc.so.6(+0x15a8c4) [0x2aaaae18c8c4]
4 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucp.so.0(+0x5760b) [0x2aaac2da760b]
5 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x13a) [0x2aaac2ff07ba]
6 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucp.so.0(+0x57697) [0x2aaac2da7697]
7 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucp.so.0(+0x394eb) [0x2aaac2d894eb]
8 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucp.so.0(ucp_tag_send_nb+0x5c8) [0x2aaac2d9b0f8]
9 /usr/common/software/sles15_cgpu/openmpi/4.0.3/hpcsdk/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0x211) [0x2aaac29460d1]
=================================
[cgpu02:00725] *** Process received signal ***
[cgpu02:00725] Signal: Segmentation fault (11)
[cgpu02:00725] Signal code: (-6)
[cgpu02:00725] Failing at address: 0xb523000002d5
[cgpu02:00725] *** End of error message ***
This error can be avoided by setting the environment variable UCX_MEMTYPE_CACHE=n before the program is executed.
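For example, one might export the variable before launching the application (a minimal sketch; the executable name below is only a placeholder):
export UCX_MEMTYPE_CACHE=n
srun -n 2 ./my_openacc_app.ex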
nvcc from HPC SDK v20.5 fails if a GPU is not detected¶
The NVIDIA CUDA C compiler nvcc is able to cross-compile CUDA code, i.e., it does not require the presence of a GPU in order to compile GPU code. However, a bug in the version of nvcc bundled with the HPC SDK breaks this cross-compilation behavior: it attempts to detect the presence of a GPU each time it is invoked, and fails if a GPU is not found. The result is that this version of nvcc fails with the following error:
user@cori07:~> nvcc my_code.cu
nvcc-Error-Version /global/common/cori_cle7/software/hpcsdk/20.5/Linux_x86_64/cuda//bin is not available in this installation
user@cori07:~>
This will be fixed in a future release of the HPC SDK. To work around this problem, users should compile CUDA code using the nvcc included with the cuda modules.
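For example (a sketch; the exact cuda module version to load will depend on what is currently installed):
module load cuda                  # use nvcc from a standalone cuda module rather than the HPC SDK
nvcc -o my_code.ex my_code.cu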
nvc++ from HPC SDK v20.5 requires detection of a GPU to generate GPU-accelerated pSTL code¶
The HPC SDK includes a new capability for generating GPU-accelerated code from C++17 parallel algorithms (see the official documentation of this feature here). However, it requires a GPU to be visible when the compiler is invoked in order to generate GPU code; if a GPU is not detected at compile time, nvc++ will silently fall back to generating CPU code, even if the -stdpar=gpu compiler switch is provided. This behavior will be addressed in a future release of the SDK. In the meantime, a workaround is to invoke nvc++ with srun in an interactive Cori GPU job, so that the compiler can see a GPU and generate GPU code appropriately.
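For example, from within an interactive Cori GPU allocation one might run the compiler under srun (a sketch; the source and output file names are placeholders):
srun -n 1 nvc++ -stdpar=gpu -o my_pstl_app.ex my_pstl_app.cpp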
PGI 19.5 requires CUDA <= 10.1.105¶
The PGI v19.5 compiler using OpenACC directives is compatible with CUDA modules only up to cuda/10.1.105; it is not compatible with cuda/10.1.168. If one attempts to compile OpenACC code with the cuda/10.1.168 module loaded, one encounters an error at runtime:
user@cgpu01:~/tests/OpenACC/vector_add> module list -l
- Package -----------------------------+- Versions -+- Last mod. ------
Currently Loaded Modulefiles:
cgpu 2019/02/08 22:01:04
pgi/19.5 2019/07/19 22:04:32
modules/3.2.10.6 2017/04/27 21:50:33
cuda/10.1.168 2019/07/19 22:01:27
user@cgpu01:~/tests/OpenACC/vector_add> pgf90 -acc -ta=tesla -o vector_add.ex vector_add.f90
user@cgpu01:~/tests/OpenACC/vector_add> srun -n 1 ./vector_add.ex
Failing in Thread:0
call to cuInit returned error -1: Other
srun: error: cgpu01: task 0: Exited with exit code 1
srun: Terminating job step 182246.6
user@cgpu01:~/tests/OpenACC/vector_add>
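A workaround is to swap in the compatible CUDA module before compiling, e.g. (a sketch based on the module versions shown above):
module swap cuda/10.1.168 cuda/10.1.105
pgf90 -acc -ta=tesla -o vector_add.ex vector_add.f90
srun -n 1 ./vector_add.ex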
MVAPICH2 ptmalloc warnings with Python MPI codes¶
When running some MPI-enabled Python codes compiled with MVAPICH2, one may encounter the following warning:
WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without
InfiniBand registration cache support.
This is due to a bad interaction between MVAPICH2's ptmalloc library and the memory allocator used in Python. Details about this warning are provided here. As described on that page, one workaround is to set the LD_PRELOAD environment variable:
export LD_PRELOAD=$MVAPICH2_DIR/lib/libmpi.so
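For example, one might combine this setting with a job launch (a sketch; the Python script name is only a placeholder):
export LD_PRELOAD=$MVAPICH2_DIR/lib/libmpi.so
srun -n 2 python my_mpi_script.py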
MPI_THREAD_MULTIPLE with MVAPICH2¶
By default, a code compiled with MVAPICH2 will support only MPI_THREAD_SINGLE, even if a higher threading model is requested in MPI_Init_thread(). This is by design; see this page for more information. If one requests a higher level of threading support, one will encounter the following runtime warning:
user@cgpu04:~/> mpicc -o main.ex main.c
user@cgpu04:~/> srun -n 2 -c 2 ./main.ex
Requested MPI_THREAD_MULTIPLE, got MPI_THREAD_SINGLE
Hello world from processor cgpu04, rank 0 out of 2 processors
Hello world from processor cgpu04, rank 1 out of 2 processors
Error in system call pthread_mutex_destroy: Device or resource busy
src/mpi/init/initthread.c:241
Error in system call pthread_mutex_destroy: Device or resource busy
src/mpi/init/initthread.c:241
To enable higher levels of threading support, e.g., MPI_THREAD_MULTIPLE, one must disable MVAPICH2's default task binding behavior by setting the environment variable MV2_ENABLE_AFFINITY=0 during execution:
user@cgpu04:~> MV2_ENABLE_AFFINITY=0 srun -n 2 -c 2 ./main.ex
Requested MPI_THREAD_MULTIPLE, got MPI_THREAD_MULTIPLE
Hello world from processor cgpu04, rank 0 out of 2 processors
Hello world from processor cgpu04, rank 1 out of 2 processors