
Slurm Access to the Cori GPU nodes

The GPU nodes are accessible via Slurm on the Cori login nodes. Slurm treats the Cori GPU nodes as a separate cluster from the KNL and Haswell nodes. You can direct Slurm commands to the GPU nodes by loading the cgpu module:

module load cgpu

Afterwards, you can return to using the KNL and Haswell nodes with:

module unload cgpu

Each Cori GPU node has 8 GPUs, 40 physical cores spread across 2 sockets with 2 hardware threads per core, and 384 GB DRAM.
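Once a job allocation is granted (see below), this layout can be verified from within the job; a quick sketch using standard tools:

nvidia-smi                          # lists the 8 GPUs
numactl --hardware                  # shows the 2 NUMA domains and their DRAM
grep -c processor /proc/cpuinfo     # 80 = 40 physical cores x 2 hardware threads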

GPU constraint

To submit a job to the GPU nodes you will need to specify a "constraint" of gpu (in the same way you would specify a constraint of knl for KNL nodes):

#SBATCH -C gpu

Number of GPUs

You will also need to specify the number of GPUs you wish to use, for example to request 2 GPUs:

#SBATCH -G 2

Note that the argument to -G is the total number of GPUs allocated to the job, not the number of GPUs allocated per node. Slurm provides other GPU allocation flags which can ensure a fixed ratio of GPUs to other allocatable resources, e.g., --gpus-per-task=<N>, --gpus-per-node=<N>, etc. The behavior of these flags is described in the salloc manual page, which can be accessed via man salloc, or by reading the SchedMD documentation.
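For example, both of the following (hypothetical) requests allocate 8 GPUs across 2 nodes, but only the second guarantees a fixed ratio of 4 GPUs per node:

#SBATCH -N 2
#SBATCH -G 8

#SBATCH -N 2
#SBATCH --gpus-per-node=4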

Allocate GPUs with -G <N> or --gpus=<N> instead of --gres=gpu:<N>

In older versions of Slurm, allocating GPUs required the flag --gres=gpu:<N>, which allocated <N> GPUs per node to the job. However, Slurm now supports the GPU allocation flag --gpus=<N> (or -G <N> for short). These flags provide similar basic functionality to --gres=gpu:<N>, but are easier to type and offer more flexible resource allocation. It is recommended that scripts replace --gres=gpu:<M> with -G <M> or --gpus=<M>. Note that <M> and <N> are likely different, as --gres=gpu:<N> allocates <N> GPUs per node to the job, while -G <M> or --gpus=<M> allocates <M> total GPUs to the job.
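As a sketch of such a conversion, a 2-node job that previously requested 4 GPUs on each node:

#SBATCH -N 2
#SBATCH --gres=gpu:4

would become:

#SBATCH -N 2
#SBATCH -G 8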

Exclusive or shared GPU node use

GPU nodes are 'shared' by default

Slurm's default behavior on the Haswell and KNL compute nodes on Cori is to reserve each compute node in the user's job allocation entirely to the user's job; the node cannot be accessed by any other job until the existing user's job reserving that node has ended. However, on the GPU nodes, the default behavior is the opposite - nodes allocated to a user's job are shared with other users, unless all resources on the node are explicitly allocated to that job. If needed, the user can reserve all CPU and memory resources on a GPU node by adding the --exclusive option to one's Slurm job allocation. Note that this flag does not allocate GPUs to the job - GPUs must still be allocated with -G or --gpus.
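For example, a sketch of an interactive request that reserves an entire node (with <account> as a placeholder):

module load cgpu
salloc -C gpu -N 1 -t 60 -G 8 --exclusive -A <account>

Here --exclusive reserves all CPU and memory resources on the node, while -G 8 allocates all 8 of its GPUs.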

Although sharing nodes reduces the likelihood that you will need to wait for a node to become available, users of shared nodes may encounter significant performance variability due to other concurrent activity on the node, particularly if PCI traffic (CPU <-> GPU memory bandwidth and network bandwidth) accounts for a significant portion of an application's run time. This is because the GPU nodes do not have enough PCI bandwidth to service all PCI connections at full speed.

Your GPU-access-enabled project

You must specify a Slurm account which is associated with the GPU QOS for your user account. To see which of your Slurm accounts can submit to the GPU nodes, use the command:

sacctmgr show assoc user=$USER qos=gpu_regular format=account|uniq

which will print to screen any accounts from which you can submit jobs to the Cori GPU nodes.

How to choose what to request

Use only what you need

There are only 18 GPU nodes, and 144 GPUs total, to satisfy the development needs of many NERSC users. If you need all CPUs and GPUs on a given number of GPU nodes for your work, you should use them. But if you only need a single GPU and a single physical core, please be mindful of others and do not reserve the entire node for yourself.

Cori GPU does not charge NERSC-hours

Because Cori GPU is a development resource targeting R&D efforts for Perlmutter readiness, and is not a production resource, NERSC does not charge against project allocations for usage of these nodes.

Submitting batch jobs

You can submit a GPU batch job from the Cori login node in the same way you would on Haswell or KNL nodes, remembering to first load the cgpu module.
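For example, a minimal batch script, saved here under the placeholder name my_job.sh (the executable my_gpu_app is also a placeholder):

#!/bin/bash
#SBATCH -C gpu
#SBATCH -G 1
#SBATCH -c 10
#SBATCH -t 60
#SBATCH -A <account>

srun ./my_gpu_app

It can then be submitted from a Cori login node:

module load cgpu
sbatch my_job.sh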

Working interactively

To access approximately 1/8 of a single node's resources (generally sufficient for single-GPU code development), one can execute

user@cori02> module load cgpu
user@cori02> salloc -C gpu -t 60 -c 10 -G 1 -q interactive -A <account>
salloc: Granted job allocation 12345
salloc: Waiting for resource configuration
salloc: Nodes cgpu02 are ready for job
user@cgpu02:~>

which will provide the user with 1 GPU, 5 physical cores (10 hardware threads), and approximately 48 GB of DRAM. Note that Slurm allocates memory to your job proportional to the number of CPUs you request. E.g., if you request -c 40 (half of the available CPUs), you will be allocated roughly half of the memory on the node - approximately 192 GB.
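For example, a sketch of a request for roughly a quarter of a node (2 GPUs, 10 physical cores, and approximately 96 GB of DRAM), which also stays within the 2-GPU interactive limit described below:

salloc -C gpu -t 60 -c 20 -G 2 -q interactive -A <account>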

Jupyter access

Cori GPU nodes may also be accessed using NERSC's JupyterHub service, which is documented here. After logging into JupyterHub using Iris credentials, one may start a Jupyter notebook which will have access to a GPU by clicking the "Start" button under the column labeled "Shared GPU Node."

JupyterHub sessions may time out at startup if no GPUs are available

When all of the GPUs on the Cori GPU system are allocated to other jobs, Jupyter sessions will be unable to start, as they always request a GPU when the notebook is allocated to the cluster. The "Spawning server" window will not progress, and after a few minutes will time out. When this occurs, one must simply wait until GPUs are available on the cluster before starting a Jupyter notebook.

Job constraints

Cori GPU prioritizes interactive code development during business hours in Pacific Time (UTC-7), and allows large and/or long-running jobs to run on nights and weekends. Cori GPU also prioritizes jobs submitted by NESAP application teams over non-NESAP teams.

Cori GPU imposes a few job constraints, described below, to prioritize the most important workloads for this cluster. Users should be aware that jobs of nearly any size and length can be submitted to Cori GPU at any time, and that the constraints described below affect only the time jobs are eligible to start, not the time that jobs can be submitted to the queue.

Job constraints are as follows:

  1. Jobs running between 5:00 AM Pacific Time (12:00 PM UTC) and 8:00 PM Pacific Time (3:00 AM UTC) from Monday through Friday are limited to 4 hours of run time.
  2. Jobs running before 5:00 AM Pacific Time (12:00 PM UTC) or after 8:00 PM Pacific Time (3:00 AM UTC), or on weekends, can run until 5:00 AM Pacific Time on the next weekday.
  3. Members of the NESAP ERCAP project (m1759) may add the flag -q special to their batch and interactive jobs to be placed in a higher-priority queue (see the example after this list). Note that the project m1759 must be requested explicitly when allocating a job in the special QOS, or else job allocation will fail; m1759 may not be the default project for many users.
  4. Interactive jobs, allocated via salloc -C gpu -q interactive, are now limited to 2 GPUs and 2 hours of walltime. Jobs requiring more than 2 GPUs and/or more than 2 hours can be submitted via sbatch instead of salloc.
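As a sketch, a member of m1759 could submit a batch script (such as the placeholder my_job.sh above) to the higher-priority queue with:

module load cgpu
sbatch -q special -A m1759 my_job.sh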

Slurm commands with cgpu

While the cgpu module is loaded, commands such as sinfo, squeue, sbatch, etc. apply to the GPU cluster, and will not show information about, or submit jobs to, the 'normal' Cori compute nodes. To query the 'normal' compute nodes, unload the cgpu module with module unload cgpu and then enter your desired Slurm commands.
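For example, a typical sequence:

module load cgpu
sinfo                 # reports nodes and partitions of the GPU cluster
squeue -u $USER       # shows your jobs on the GPU cluster only
module unload cgpu
sinfo                 # reports the Haswell/KNL cluster again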

The (ReqNodeNotAvail, May be reserved for other job) reason code

Cori GPU uses a handful of specially configured Slurm reservations to enforce the job constraints described above. In some cases, when a job is being prevented from starting due to one of those constraints (e.g., a user submits a job requesting 12 hours of runtime at 9:00 AM Pacific Time), the Slurm reason code provided in the output of squeue will be (ReqNodeNotAvail, May be reserved for other job). This reason code does not mean that anything is wrong with the job, only that it must wait to be eligible to start until at least one of the above constraints no longer applies (e.g., the current time has reached 8:00 PM Pacific Time and the 4-hour job runtime restriction is lifted).