
Usage

CPUs

Using the CPUs on the GPU nodes is similar to using the Haswell or KNL compute nodes on Cori. Task binding to CPUs via the -c and --cpu-bind flags works the same way on Cori GPU as on the Haswell and KNL nodes, and is described in the official Slurm documentation.
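
For example, a minimal sketch of a CPU-only launch with core binding (the executable name ./cpu_code.exe is a placeholder, and account and GPU flags are omitted for brevity):

srun -C gpu -n 4 -c 10 --cpu-bind=cores ./cpu_code.exe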

GPUs

In a batch job submitted with sbatch, GPUs can be accessed with or without srun. However, in an interactive salloc job, the GPUs are accessible only via srun. They are not visible through normal shell commands. For example:

user@cori02:~> module load cgpu
user@cori02:~> salloc -C gpu -q interactive -t 30 -c 20 -G 2 -A <account>
salloc: Granted job allocation 12345
salloc: Waiting for resource configuration
salloc: Nodes cgpu02 are ready for job
user@cgpu02:~> nvidia-smi
No devices were found
user@cgpu02:~>

Even though the job allocates 2 GPUs via the -G 2 flag, the GPUs are not visible unless one invokes srun:

user@cgpu02:~> srun nvidia-smi
Thu Mar 14 18:14:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.79       Driver Version: 410.79       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:1A:00.0 Off |                    0 |
| N/A   30C    P0    52W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:1B:00.0 Off |                    0 |
| N/A   34C    P0    53W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
user@cgpu02:~>

If one requires interactivity with the GPUs within a given srun command (e.g., if debugging a GPU code with cuda-gdb), one can accomplish this by adding the --pty flag to the srun command:

user@cgpu12:~> srun --pty cuda-gdb
NVIDIA (R) CUDA Debugger
10.0 release
Portions Copyright (C) 2007-2018 NVIDIA Corporation
GNU gdb (GDB) 7.12
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word".
(cuda-gdb)

If the --pty flag is omitted, the srun command will hang upon reaching the first interactive prompt, and will never return.
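
The --pty flag can also be used to obtain an interactive shell inside a job step, for example:

srun --pty /bin/bash

Within the resulting shell, commands such as nvidia-smi run inside the job step and can therefore see the allocated GPUs directly.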

Controlling task and GPU binding

When allocating CPUs and GPUs to a job in Slurm, the default behavior is that all GPUs on a particular node allocated to the job can be accessed by all tasks on that same node:

cori04:~> srun -C gpu -n 2 -c 10 --cpu-bind=cores --gpus-per-task=1 ./src/mpi_cuda_hello_world 
srun: job 1234567 queued and waiting for resources
srun: job 1234567 has been allocated resources
Hello world from processor cgpu01, rank 1 out of 2 processors. I see 2 GPUs! Their PCI IDs are:
0: 0000:07:00.0
1: 0000:0F:00.0
Hello world from processor cgpu01, rank 0 out of 2 processors. I see 2 GPUs! Their PCI IDs are:
0: 0000:07:00.0
1: 0000:0F:00.0
cori04:~> 

For some applications, it is desirable that only certain GPUs can be accessed by certain tasks. For example, a common programming model for MPI + GPU applications is such that each GPU on a node is accessed by only a single task on that node.

Such behavior can be controlled in different ways. One way is to manipulate the environment variable CUDA_VISIBLE_DEVICES, as described in NVIDIA's CUDA documentation. This approach works on any system with NVIDIA GPUs. The variable must be configured per process, and may have different values in different processes, depending on the user's desired GPU affinity settings.
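
One common pattern for the per-process case (a sketch, not an official NERSC-provided script) is a small wrapper that derives the device number from Slurm's per-node task index, SLURM_LOCALID, so that each task sees exactly one GPU:

#!/bin/bash
# select_gpu.sh (hypothetical wrapper script): expose to each task only
# the GPU whose index matches the task's local rank on the node.
export CUDA_VISIBLE_DEVICES=$SLURM_LOCALID
exec "$@"

Invoked as srun -n 2 ./select_gpu.sh ./code.exe, task 0 would see only GPU 0 and task 1 only GPU 1.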

Using CUDA_VISIBLE_DEVICES in interactive jobs

To set CUDA_VISIBLE_DEVICES and have the subsequent command use your specified configuration while in an interactive job, wrap both the environment variable assignment and the executable invocation in a single shell command passed to srun.

For example, if you are using two GPUs and wish to reverse the device order, the command would look like: srun bash -c 'export CUDA_VISIBLE_DEVICES=1,0 && ./code.exe'

Another way to achieve a similar result is to use Slurm's GPU affinity flags. In particular, the --gpu-bind flag may be supplied to salloc, sbatch, or srun in order to control which tasks can access which GPUs. The --gpu-bind flag is described in the Slurm documentation and via man srun. For example, adding --gpu-bind=map_gpu:0,1 to the previous example results in:

cori04:~> srun -C gpu -n 2 -c 10 --cpu-bind=cores --gpus-per-task=1 --gpu-bind=map_gpu:0,1 ./src/mpi_cuda_hello_world 
srun: job 1234567 queued and waiting for resources
srun: job 1234567 has been allocated resources
Hello world from processor cgpu20, rank 0 out of 2 processors. I see 1 GPUs! Their PCI IDs are:
0: 0000:07:00.0
Hello world from processor cgpu20, rank 1 out of 2 processors. I see 1 GPUs! Their PCI IDs are:
0: 0000:0F:00.0

such that each task on the node may access a single, unique GPU.

To run a job across all 18 Cori GPU nodes using 144 tasks total, with each task bound to one of the 144 total GPUs on the cluster, one could run the following:

srun -C gpu -n 144 -c 10 --cpu-bind=cores --gpus-per-task=1 --gpu-bind=map_gpu:0,1,2,3,4,5,6,7 ./my-code.ex
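
The equivalent request as a batch script might look like the following sketch (the account name and executable are placeholders):

#!/bin/bash
#SBATCH -A <account>
#SBATCH -C gpu
#SBATCH -N 18
#SBATCH --ntasks-per-node=8
#SBATCH -c 10
#SBATCH --gpus-per-task=1
#SBATCH -t 5

srun --cpu-bind=cores --gpu-bind=map_gpu:0,1,2,3,4,5,6,7 ./my-code.ex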

Running Single-GPU Tasks in Parallel

Users who have many independent single-GPU tasks may wish to pack these into one job which runs the tasks in parallel on different GPUs. There are multiple ways to accomplish this; here we present examples of two different approaches. The first example uses the srun command, while the second example uses GNU parallel.

srun

The Slurm srun command can be used to launch individual tasks, each allocated a portion of the resources requested by the job script. An example of this is:

#!/bin/bash
#SBATCH -A <account>
#SBATCH -C gpu
#SBATCH -G 2
#SBATCH -c 20
#SBATCH -t 5

srun --gres=craynetwork:0 -n 1 -G 1 ./a.out &
srun --gres=craynetwork:0 -n 1 -G 1 ./b.out &
wait

Each srun invocation requests one task and one GPU, and requesting zero craynetwork resources per task is required to allow the tasks to run in parallel. The & at the end of each line puts the tasks in the background, and the final wait command is needed to allow all of the tasks to run to completion.
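
For more than a couple of executables, the same pattern can be written as a loop; a sketch, assuming binaries named task_0.out through task_3.out and a job that has requested 4 GPUs:

for i in 0 1 2 3; do
    srun --gres=craynetwork:0 -n 1 -G 1 ./task_${i}.out &
done
wait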

Do not use srun for large numbers of tasks

This approach is feasible for relatively small numbers of tasks (i.e., tens), but should not be used for hundreds or thousands of tasks. For larger numbers of tasks, GNU parallel is preferred.

GNU parallel

GNU parallel is an alternative which allows users to run many tasks at once without invoking srun for each task. An example of this is:

#!/bin/bash
#SBATCH -A <account>
#SBATCH -C gpu
#SBATCH -G 4
#SBATCH -c 40
#SBATCH -t 5

cat inputs.txt | parallel -j4 'CUDA_VISIBLE_DEVICES=$(( {%} - 1 )) ./a.out {}'

which requests 4 GPUs and uses parallel to place one task on each requested GPU.

Here, inputs.txt is an input file with one entry per line; each line provides the argument(s) for one invocation of the executable. For something like a scaling study, an example inputs.txt might contain:

1000000
5000000
10000000
50000000

It is possible to have many more input values than requested GPUs and parallel job slots; values from inputs.txt will be processed as tasks complete. Note, however, that the number of GPUs requested and the number of concurrent jobs specified (the -j argument) should be equal.

Each task is assigned to an individual GPU by setting CUDA_VISIBLE_DEVICES to a single device number for that task. The replacement string {%} is the current "job slot" in GNU parallel. Job slots are numbered from 1 up to the number of concurrent jobs specified (1 to 4 in the example above), while CUDA_VISIBLE_DEVICES ranges from 0 to one fewer than the number of GPUs requested (0 to 3 above), so the job slot value is decremented by one to match the available device numbers.
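
To see this mapping without running any GPU code, a quick sketch that simply echoes the slot-to-device assignment for four job slots:

seq 4 | parallel -j4 'echo "job slot {%} -> CUDA_VISIBLE_DEVICES=$(( {%} - 1 ))"'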

SSD

Each Cori GPU node has a ~1 TB SSD mounted over NVMe. This SSD is exposed to the user under the /tmp directory. Although the SSD is shared among all users accessing the same GPU node, the /tmp directory is presented to each user as an isolated file system:

user@cgpu02:~> ls -ld /tmp
drwx------ 2 user root 6 Dec 13 10:08 /tmp
user@cgpu02:~> ls -l /tmp
total 0
user@cgpu02:~>

Note that all contents of the /tmp file system are deleted when the job allocation ends, so the SSDs are useful for high-performance I/O during a job but cannot be used as permanent storage across multiple jobs.
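
A typical pattern is therefore to stage input data into /tmp at the start of the job and copy results back to a persistent file system before the allocation ends; a sketch with placeholder file names:

#!/bin/bash
#SBATCH -A <account>
#SBATCH -C gpu
#SBATCH -G 1
#SBATCH -c 10
#SBATCH -t 30

# Stage input onto the node-local SSD (file names are placeholders).
cp $SCRATCH/my_input.dat /tmp/

# Run against the fast local copy.
srun -n 1 ./a.out /tmp/my_input.dat /tmp/my_output.dat

# Copy results back before /tmp is wiped when the allocation ends.
cp /tmp/my_output.dat $SCRATCH/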