MPS disabled indefinitely
NVIDIA's Multi-Process Service (MPS) enables multiple processes (typically MPI ranks) to execute kernels on a single GPU simultaneously. MPS can enable an application to achieve higher performance when a single process is unable to saturate the GPU's resources.
Unfortunately, a security vulnerability in V100 GPUs was disclosed in February 2019 (see here, here, and here for more information). It exposes data in GPU memory to a side-channel attack when multiple processes access the GPU simultaneously. The vulnerability is not present when a GPU is allocated exclusively to a single process.
As a result of this security risk, NERSC has disabled MPS on Cori GPU until a mitigation for this vulnerability is implemented.
OpenACC codes using CUDA awareness need UCX_MEMTYPE_CACHE=n
Programs that are compiled with OpenMPI and OpenACC and that rely on CUDA-awareness in OpenMPI will crash with an error like the following:
[cgpu02:725 :0:725] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x2aab07afa000)
==== backtrace (tid: 725) ====
 0 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x2aaac3233ac4]
 1 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucs.so.0(+0x21cc4) [0x2aaac3233cc4]
 2 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucs.so.0(+0x21e7a) [0x2aaac3233e7a]
 3 /lib64/libc.so.6(+0x15a8c4) [0x2aaaae18c8c4]
 4 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucp.so.0(+0x5760b) [0x2aaac2da760b]
 5 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libuct.so.0(uct_mm_ep_am_bcopy+0x13a) [0x2aaac2ff07ba]
 6 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucp.so.0(+0x57697) [0x2aaac2da7697]
 7 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucp.so.0(+0x394eb) [0x2aaac2d894eb]
 8 /usr/common/software/sles15_cgpu/ucx/1.8.1/lib/libucp.so.0(ucp_tag_send_nb+0x5c8) [0x2aaac2d9b0f8]
 9 /usr/common/software/sles15_cgpu/openmpi/4.0.3/hpcsdk/lib/openmpi/mca_pml_ucx.so(mca_pml_ucx_isend+0x211) [0x2aaac29460d1]
=================================
[cgpu02:00725] *** Process received signal ***
[cgpu02:00725] Signal: Segmentation fault (11)
[cgpu02:00725] Signal code: (-6)
[cgpu02:00725] Failing at address: 0xb523000002d5
[cgpu02:00725] *** End of error message ***
This error can be avoided by setting the environment variable UCX_MEMTYPE_CACHE=n before the program is executed.
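As a minimal sketch of this workaround in a job script, the variable can be exported before the launch line. The executable name my_openacc_app.ex and the task count are hypothetical placeholders, and the guard around srun is there only so that the snippet is harmless when run outside a Cori GPU job.

```shell
# Disable the UCX memory-type cache for every process launched from this shell.
export UCX_MEMTYPE_CACHE=n

# Hypothetical launch line: substitute your own executable and task count.
launch_cmd="srun -n 2 ./my_openacc_app.ex"

if command -v srun >/dev/null 2>&1 && [ -x ./my_openacc_app.ex ]; then
    # By default srun propagates the caller's environment to the launched tasks.
    $launch_cmd
else
    # Outside a job (or without the executable), just report what would run.
    echo "UCX_MEMTYPE_CACHE=${UCX_MEMTYPE_CACHE}; would run: ${launch_cmd}"
fi
```

The variable can also be set for a single command by prefixing the launch line with UCX_MEMTYPE_CACHE=n instead of exporting it.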
nvcc from HPC SDK v20.5 fails if a GPU is not detected
The NVIDIA CUDA C compiler nvcc is able to cross-compile CUDA code, i.e., it does not require the presence of a GPU in order to compile GPU code. However, a bug in the version of nvcc bundled with the HPC SDK breaks this cross-compilation behavior: it attempts to detect a GPU each time it is invoked, and fails if none is found. As a result, this version of nvcc fails with the following error:
user@cori07:~> nvcc my_code.cu
nvcc-Error-Version /global/common/cori_cle7/software/hpcsdk/20.5/Linux_x86_64/cuda//bin is not available in this installation
user@cori07:~>
This will be fixed in a future release of the HPC SDK. To work around this problem, users should compile CUDA code using the
nvcc included with the
nvc++ from HPC SDK v20.5 requires detection of a GPU to generate GPU-accelerated pSTL code
The HPC SDK includes a new capability for generating GPU-accelerated code from C++17 parallel algorithms (see official documentation of this feature here). However, it requires a GPU to be visible when the compiler is invoked in order to generate GPU code; if a GPU is not detected at compile time, nvc++ will silently fall back to generating CPU code, even if the -stdpar=gpu compiler switch is provided. This behavior will be addressed in a future release of the SDK. In the meantime, a workaround is to invoke nvc++ with srun in an interactive Cori GPU job, so that the compiler can see the GPU and generate GPU code appropriately.
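A minimal sketch of that workaround, intended to be run from inside an interactive Cori GPU job, might look as follows. The file names saxpy.cpp and saxpy.ex and the single-task srun options are hypothetical choices for illustration; the only compiler switch taken from the text above is -stdpar=gpu.

```shell
# Hypothetical source and output names; replace with your own.
src=saxpy.cpp
out=saxpy.ex

# Wrap the compiler in srun so that it runs on the GPU node, where the GPU is visible.
compile_cmd="srun -n 1 nvc++ -stdpar=gpu -o ${out} ${src}"

if command -v srun >/dev/null 2>&1 && command -v nvc++ >/dev/null 2>&1 && [ -f "${src}" ]; then
    $compile_cmd
else
    # Without srun, nvc++, or the source file, only print the command.
    echo "would run: ${compile_cmd}"
fi
```

Because the fallback to CPU code is silent, it is worth confirming on a GPU node that the resulting executable actually uses the GPU.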
PGI 19.5 requires CUDA <= 10.1.105
The PGI v19.5 compiler using OpenACC directives is compatible with CUDA modules only up to cuda/10.1.105. It is not compatible with cuda/10.1.168. If one attempts to compile OpenACC code with the cuda/10.1.168 module loaded, one encounters an error at runtime:
user@cgpu01:~/tests/OpenACC/vector_add> module list -l
- Package -----------------------------+- Versions -+- Last mod. ------
Currently Loaded Modulefiles:
cgpu                                                  2019/02/08 22:01:04
pgi/19.5                                              2019/07/19 22:04:32
modules/126.96.36.199                                 2017/04/27 21:50:33
cuda/10.1.168                                         2019/07/19 22:01:27
user@cgpu01:~/tests/OpenACC/vector_add> pgf90 -acc -ta=tesla -o vector_add.ex vector_add.f90
user@cgpu01:~/tests/OpenACC/vector_add> srun -n 1 ./a.out
Failing in Thread:0
call to cuInit returned error -1: Other
srun: error: cgpu01: task 0: Exited with exit code 1
srun: Terminating job step 182246.6
user@cgpu01:~/tests/OpenACC/vector_add>
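One way to catch this before compiling is to check the loaded modules for the incompatible CUDA version. The sketch below assumes that the module system exports the colon-separated LOADEDMODULES variable, as Environment Modules does; the function name pgi195_cuda_ok is hypothetical, and it tests only for the one module version named above.

```shell
# Return 0 if the given colon-separated module list does not contain
# cuda/10.1.168, and 1 if it does. (Hypothetical helper for illustration.)
pgi195_cuda_ok() {
    case ":$1:" in
        *:cuda/10.1.168:*) return 1 ;;
        *) return 0 ;;
    esac
}

# Check the modules currently loaded in this shell (empty if none are set).
if pgi195_cuda_ok "${LOADEDMODULES:-}"; then
    echo "No incompatible cuda module detected"
else
    echo "cuda/10.1.168 is loaded; switch to cuda/10.1.105 before using pgi/19.5" >&2
fi
```

If the check reports the incompatible module, swapping it for cuda/10.1.105 (for example with module swap) before compiling should avoid the runtime failure shown above.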
ptmalloc warnings with Python MPI codes
When running some MPI-enabled Python codes compiled with MVAPICH2, one may encounter the following warning:
WARNING: Error in initializing MVAPICH2 ptmalloc library.Continuing without InfiniBand registration cache support.
This is due to a bad interaction between MVAPICH2's ptmalloc library and the memory allocator used in Python. Details about this warning are provided here. As described on that page, one workaround is to set the LD_PRELOAD environment variable:
MPI_THREAD_MULTIPLE with MVAPICH2
By default, a code compiled with MVAPICH2 will support only MPI_THREAD_SINGLE, even if a higher threading model is requested in MPI_Init_thread(). This is by design; see this page for more information. If one requests a higher level of threading support, one will encounter the following runtime warning:
user@cgpu04:~/> mpicc -o main.ex main.c
user@cgpu04:~/> srun -n 2 -c 2 ./main.ex
Requested MPI_THREAD_MULTIPLE, got MPI_THREAD_SINGLE
Hello world from processor cgpu04, rank 0 out of 2 processors
Hello world from processor cgpu04, rank 1 out of 2 processors
Error in system call pthread_mutex_destroy: Device or resource busy
src/mpi/init/initthread.c:241
Error in system call pthread_mutex_destroy: Device or resource busy
src/mpi/init/initthread.c:241
To enable higher levels of threading support, e.g., MPI_THREAD_MULTIPLE, one must disable MVAPICH2's default task binding behavior by setting the environment variable MV2_ENABLE_AFFINITY=0 during execution:
user@cgpu04:~> MV2_ENABLE_AFFINITY=0 srun -n 2 -c 2 ./main.ex
Requested MPI_THREAD_MULTIPLE, got MPI_THREAD_MULTIPLE
Hello world from processor cgpu04, rank 0 out of 2 processors
Hello world from processor cgpu04, rank 1 out of 2 processors