Slurm Access to the Cori GPU nodes

The GPU nodes are accessible via Slurm from the Cori login nodes. They are exposed as a hardware 'constraint', in the same way as the Haswell and KNL compute nodes. You must load the esslurm module before running your Slurm commands, or your jobs will fail.

Each node has 8 GPUs, 40 CPU cores spread across 2 sockets with 2 hyper-threads per core, and 384 GB DRAM. To access approximately 1/8 of a single node's resources (generally sufficient for single-GPU code development), one can execute

user@cori02> module load esslurm
user@cori02> salloc -C gpu -N 1 -t 60 -c 10 --gres=gpu:1 -A <account>
salloc: Granted job allocation 12345
salloc: Waiting for resource configuration
salloc: Nodes cgpu02 are ready for job

which will provide the user with 1 GPU, 5 physical cores (10 hyper-threads), and approximately 30 GB of DRAM. Note that Slurm allocates memory to your job in proportion to the number of CPUs you request. E.g., if you request -c 40 (half of the available CPUs), you will be allocated roughly half of the memory on the node - approximately 192 GB.
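To see how much memory Slurm actually granted, you can inspect the job record from inside the allocation. This is a sketch; the grep pattern is just one way to pick the memory field out of the AllocTRES list, and the exact field format may vary with the Slurm version.

```shell
# Run inside a job allocation (after module load esslurm).
# AllocTRES in the job record lists the granted resources, including memory.
scontrol show job $SLURM_JOB_ID | grep -oi 'mem=[^,]*'
```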

The one flag in the above example that is not used elsewhere on Cori is --gres, which reserves a particular number of GPUs on the node.
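For example, to request half of a node's GPUs along with a proportional share of CPUs, one could run something like the following (a sketch; substitute your own account and time limit):

```shell
module load esslurm
# 4 of the 8 GPUs, 20 of the 40 physical cores (40 hyper-threads),
# and - per the proportional-memory rule above - roughly half the DRAM
salloc -C gpu -N 1 -t 60 -c 40 --gres=gpu:4 -A <account>
```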

You must make sure to specify the Slurm account which is associated with the GPU QOS for your user account. To see which Slurm accounts your user account is associated with, and which QOSes are available for each account, use the command:

user@cori02> sacctmgr show assoc user=$USER -p

which will print to screen any accounts you can submit jobs with and the allowed QOSes for the jobs. If your user account is associated with several job accounts, you'll probably want to use something like sacctmgr show assoc user=$USER -p | grep gpu to search the output.
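Putting the two steps together (the output of sacctmgr with -p is pipe-delimited; the exact column order may vary with your Slurm version):

```shell
module load esslurm
# keep only the associations that mention a gpu account or QOS
sacctmgr show assoc user=$USER -p | grep gpu
```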

GPU nodes are 'shared' by default

Slurm's default behavior on the 'normal' compute nodes on Cori and Edison is to reserve each compute node entirely for yourself; every node in your job allocation is exclusively yours. However, on the GPU nodes, the default behavior is the opposite - the default behavior is to share the nodes in your job allocation with other users. If you need to reserve all CPU resources on a node for yourself, you can specify the --exclusive option in your Slurm script invocation.
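For example, an interactive allocation that reserves an entire node for yourself might look like the following (a sketch; substitute your own account):

```shell
module load esslurm
# --exclusive prevents other users' jobs from sharing the node
salloc -C gpu -N 1 -t 60 --exclusive --gres=gpu:8 -A <account>
```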

Although sharing nodes reduces the likelihood that you will need to wait for a node to become available, users of shared nodes may encounter significant performance variability due to other concurrent activity on the node, particularly if PCI traffic (CPU <-> GPU memory bandwidth and network bandwidth) comprises a significant portion of an application's performance. This is because the GPU nodes do not have enough PCI bandwidth to service all PCI connections at full speed.

Use only what you need

There are only 18 GPU nodes to satisfy the development needs of many NERSC users. If you need all CPUs and GPUs on a given number of GPU nodes for your work, you should use them. But if you only need a single GPU and a single physical core, please be mindful of others and do not reserve the entire node for yourself.

Job constraints

Job constraints are as follows:

  1. Jobs requesting <= 2 nodes must request <= 4 hours.
  2. Jobs requesting > 2 nodes must request > 4 and <= 8 hours.
  3. Batch jobs (but not interactive jobs) may violate the above constraints by submitting directly to the gpu_preempt QoS. (See details below.)

Jobs in the second category, requesting > 2 nodes and > 4 hours, will be placed in a "preemptable" queue. A preemptable job is a special type of job in Slurm that can be stopped while it is running in order to allow a higher-priority job in the queue to start. In the case of Cori GPU, jobs requesting <= 2 nodes and <= 4 hours of run time have higher priority than jobs requesting > 2 nodes and > 4 hours. A job which is preempted will print a message to STDERR similar to the message printed when a job is canceled due to exceeding a time limit, except that the reason for the cancellation will be PREEMPTED.

One can mitigate the disruption due to job preemption by using two strategies:

  1. Ensure that the long-running code checkpoints frequently.
  2. Add the --requeue flag to the job's submission script, so that it is automatically resubmitted to the queue if it is preempted.
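A minimal batch-script sketch combining both strategies is shown below. The --signal flag asks Slurm to send the batch shell a warning signal (here USR1, 120 seconds before the job is stopped), and the trap gives the application a chance to write a final checkpoint. The application name and the checkpoint-file convention are hypothetical; your code must implement its own checkpointing.

```shell
#!/bin/bash
#SBATCH -C gpu
#SBATCH -q gpu_preempt
#SBATCH -t 300
#SBATCH -c 10
#SBATCH --gres=gpu:1
#SBATCH -A <account>
#SBATCH --requeue
#SBATCH --signal=B:USR1@120   # warn the batch shell 120 s before the job is stopped

# Hypothetical checkpoint hook: signal the application to write its state.
trap 'touch checkpoint.requested' USR1

# my.exe is assumed to watch for checkpoint.requested and checkpoint itself.
# Running srun in the background lets the batch shell receive the signal
# while the application is still executing.
srun -n 1 ./my.exe &
wait
```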

If the user wishes to run a batch job which requires > 4 hours but 2 nodes or fewer (perhaps even just a single GPU on one shared node), this can be achieved by submitting the job directly to the gpu_preempt QoS, e.g.,

#SBATCH -C gpu
#SBATCH -q gpu_preempt
#SBATCH -t 300
#SBATCH -c 10
#SBATCH --gres=gpu:1
#SBATCH -A <account>
#SBATCH --requeue

srun -n 1 ./my.exe

Note that this works only for batch jobs - interactive jobs via salloc are not able to submit directly to the gpu_preempt QoS.

Slurm commands with esslurm

While the esslurm module is loaded, commands such as sinfo, squeue, sbatch, etc. will not show information or submit jobs to 'normal' Cori compute nodes. To query the 'normal' compute nodes, unload the esslurm module with module unload esslurm and then enter your desired Slurm commands.
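For example:

```shell
module unload esslurm
squeue -u $USER   # now queries the 'normal' Cori compute-node queues again
```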