
Usage

CPUs

Using the CPUs on the GPU nodes is similar to using the 'normal' compute nodes on Cori; CPU binding via the -c and --cpu-bind flags works the same way.
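
For instance, a hypothetical srun invocation (the executable ./my_app, the task count, and the cores-per-task value are placeholders and should be chosen to match the node layout) might bind each task to its cores like this:

user@cgpu02:~> srun -n 8 -c 10 --cpu-bind=cores ./my_app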

GPUs

In a batch job submitted with sbatch, GPUs can be accessed with or without srun (a batch script is sketched further below). However, in an interactive salloc job, the GPUs are accessible only via srun; they are not visible to commands run directly at the shell prompt. For example:

user@cori02:~> module load esslurm
user@cori02:~> salloc -C gpu -N 1 -t 30 -G 2 -A <account>
salloc: Granted job allocation 12345
salloc: Waiting for resource configuration
salloc: Nodes cgpu02 are ready for job
user@cgpu02:~> nvidia-smi
No devices were found
user@cgpu02:~>

Even though the job allocation includes 2 GPUs via the -G 2 flag, the GPUs are still not visible unless one invokes srun:

user@cgpu02:~> srun nvidia-smi
Thu Mar 14 18:14:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   30C    P0    52W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   34C    P0    53W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
user@cgpu02:~>
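
As noted above, within a batch job submitted via sbatch the GPUs can also be reached without srun. A minimal job script sketch (the script contents are illustrative; the flags mirror the salloc example above) might look like:

#!/bin/bash
#SBATCH -C gpu
#SBATCH -N 1
#SBATCH -t 30
#SBATCH -G 2
#SBATCH -A <account>

# In a batch job the GPUs are visible even without srun:
nvidia-smi

# They can also be accessed via srun in the usual way:
srun nvidia-smi

As in the salloc example above, loading the esslurm module before submitting with sbatch is assumed.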

If one requires interactivity with the GPUs within a given srun command (e.g., when debugging a GPU code with cuda-gdb), one can add the --pty flag to the srun command:

user@cgpu12:~> srun --pty cuda-gdb
NVIDIA (R) CUDA Debugger
10.0 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
(cuda-gdb)

If the --pty flag is omitted, the srun command will hang upon reaching the first interactive prompt and will never return.
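
Similarly, one can obtain an interactive shell in which the GPUs are visible by launching the shell itself through srun with --pty (a sketch; nvidia-smi run inside that shell should then report the GPUs allocated to the job):

user@cgpu02:~> srun --pty /bin/bash
user@cgpu02:~> nvidia-smi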

SSD

Each Cori GPU node has a ~1 TB NVMe SSD, which is exposed to users as the /tmp directory. Although the SSD is shared by all users on the same GPU node, the /tmp directory is presented to each user as an isolated file system:

user@cgpu02:~> ls -ld /tmp
drwx------ 2 user root 6 Dec 13 10:08 /tmp
user@cgpu02:~> ls -l /tmp
total 0
user@cgpu02:~>

Note that all contents of the /tmp file system are deleted when the job allocation ends, so the SSDs are useful for high-performance I/O during a job but cannot be used as persistent storage across multiple jobs.
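
A typical pattern (sketched here with placeholder paths and a hypothetical executable ./my_app) is to stage input data into /tmp at the start of the job, perform I/O against the SSD while running, and copy any results back to permanent storage before the allocation ends:

user@cgpu02:~> cp $SCRATCH/inputs/data.bin /tmp/
user@cgpu02:~> srun ./my_app --input /tmp/data.bin --output /tmp/results.bin
user@cgpu02:~> cp /tmp/results.bin $SCRATCH/outputs/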