Examples¶
Slurm job configurations¶
All examples in this section assume the user desires 10 CPUs (5 cores) per task. This can be changed by adjusting the argument to -c in the script header. We choose 10 CPUs (5 cores) for these examples because each Cori GPU node has 8 GPUs and 80 CPUs (40 cores), so dividing resources evenly across jobs on shared nodes yields 10 CPUs (5 cores) per GPU. Recall that -c means "CPUs per task", not "total CPUs", although the two are equivalent when running only 1 task in a job allocation.
--gpus-per-task does not enforce GPU affinity or binding
Despite what its name suggests, --gpus-per-task in the examples below only sets the number of GPUs allocated to the job; it does not enforce any binding or affinity of GPUs to CPUs or tasks.
1 node, 1 task, 1 GPU¶
#!/bin/bash
#SBATCH -A <account>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 1
#SBATCH --ntasks-per-node=1
#SBATCH -c 10
#SBATCH --gpus-per-task=1
export SLURM_CPU_BIND="cores"
srun ./my_code.ex <args>
1 node, 4 tasks, 4 GPUs, all GPUs visible to all tasks¶
#!/bin/bash
#SBATCH -A <account>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 4
#SBATCH --ntasks-per-node=4
#SBATCH -c 10
#SBATCH --gpus-per-task=1
export SLURM_CPU_BIND="cores"
srun ./my_code.ex <args>
1 node, 4 tasks, 4 GPUs, 1 GPU visible to each task¶
#!/bin/bash
#SBATCH -A <account>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 4
#SBATCH --ntasks-per-node=4
#SBATCH -c 10
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=map_gpu:0,1,2,3
export SLURM_CPU_BIND="cores"
srun ./my_code.ex <args>
4 nodes, 32 tasks, 32 GPUs, all GPUs visible to all tasks¶
#!/bin/bash
#SBATCH -A <account>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 32
#SBATCH --ntasks-per-node=8
#SBATCH -c 10
#SBATCH --gpus-per-task=1
export SLURM_CPU_BIND="cores"
srun ./my_code.ex <args>
4 nodes, 32 tasks, 32 GPUs, 1 GPU visible to each task¶
#!/bin/bash
#SBATCH -A <account>
#SBATCH -C gpu
#SBATCH -q regular
#SBATCH -t 1:00:00
#SBATCH -n 32
#SBATCH --ntasks-per-node=8
#SBATCH -c 10
#SBATCH --gpus-per-task=1
#SBATCH --gpu-bind=map_gpu:0,1,2,3,4,5,6,7
export SLURM_CPU_BIND="cores"
srun ./my_code.ex <args>
Programming models¶
OpenMP (C)¶
The following code demonstrates basic usage of OpenMP 4.5 offloading in C:
#include <stdio.h>

#define MAX 32768

int main()
{
    int x[MAX], i;
    for (i = 0; i < MAX; ++i) {
        x[i] = 42;
    }

    #pragma omp target teams distribute parallel for simd map(tofrom: x[0:MAX])
    for (i = 0; i < MAX; ++i) {
        x[i] = x[i] + 1;
    }

    printf("After the target region is executed, x[0] = %d\n", x[0]);
    return 0;
}
It can be compiled and run as:
user@cgpu01:~> module purge
user@cgpu01:~> module load cgpu PrgEnv-llvm
user@cgpu01:~> clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o main.ex main.c
user@cgpu01:~> srun -n 1 nvprof ./main.ex
==105106== NVPROF is profiling process 105106, command: ./main.ex
After the target region is executed, x[0] = 43
==105106== Profiling application: ./main.ex
==105106== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 74.35% 79.871us 1 79.871us 79.871us 79.871us __omp_offloading_33_1f79c13_main_l12
13.94% 14.976us 2 7.4880us 1.4080us 13.568us [CUDA memcpy HtoD]
11.71% 12.576us 2 6.2880us 1.7280us 10.848us [CUDA memcpy DtoH]
API calls: 69.47% 261.44ms 1 261.44ms 261.44ms 261.44ms cuCtxCreate
27.21% 102.39ms 1 102.39ms 102.39ms 102.39ms cuCtxDestroy
1.76% 6.6232ms 1 6.6232ms 6.6232ms 6.6232ms cuModuleLoadDataEx
1.25% 4.7129ms 1 4.7129ms 4.7129ms 4.7129ms cuModuleUnload
0.11% 423.38us 1 423.38us 423.38us 423.38us cuMemAlloc
0.11% 418.26us 1 418.26us 418.26us 418.26us cuMemFree
0.04% 161.25us 1 161.25us 161.25us 161.25us cuCtxSynchronize
0.02% 84.679us 2 42.339us 34.314us 50.365us cuMemcpyDtoH
0.01% 44.885us 2 22.442us 5.3950us 39.490us cuMemcpyHtoD
0.00% 14.476us 1 14.476us 14.476us 14.476us cuLaunchKernel
0.00% 3.9210us 3 1.3070us 367ns 2.4200us cuDeviceGetCount
0.00% 2.8000us 2 1.4000us 1.2200us 1.5800us cuDeviceGet
0.00% 2.5640us 6 427ns 107ns 780ns cuDeviceGetAttribute
0.00% 2.1960us 1 2.1960us 2.1960us 2.1960us cuDeviceGetPCIBusId
0.00% 1.5640us 2 782ns 544ns 1.0200us cuModuleGetGlobal
0.00% 1.1280us 6 188ns 129ns 284ns cuCtxSetCurrent
0.00% 924ns 1 924ns 924ns 924ns cuFuncGetAttribute
0.00% 924ns 1 924ns 924ns 924ns cuModuleGetFunction
where the name of the offloaded loop generated by the compiler is derived from the line number of the loop in the original source code (the last part of the kernel name, main_l12, indicates that the kernel starts on line 12 of main.c).
The ECP SOLLVE project also provides a verification and validation suite for OpenMP 4.5 compilers. Many OpenMP offloading examples in both C and Fortran can be found in their public Bitbucket repository.
OpenMP (Fortran)¶
The following Fortran program executes the same loop as the C example above.
program main

  implicit none

  integer, parameter :: sz = 32768
  integer, dimension(sz) :: arr
  integer :: i

  do i = 1, sz
    arr(i) = 42
  end do

  !$omp target teams distribute parallel do map(tofrom: arr(1:sz))
  do i = 1, sz
    arr(i) = arr(i) + 1
  end do

  print *, "After the target region is executed, arr(1) = ", arr(1)

end program main
It can be compiled with the GCC Fortran compiler gfortran from the gcc/8.1.1-openacc-gcc-8-branch-20190215 module:
user@cgpu01:~> gfortran -fopenmp -foffload=nvptx-none="-Ofast -lm -misa=sm_35" -o main.ex main.f90
user@cgpu01:~> srun -n 1 nvprof ./main.ex
==108662== NVPROF is profiling process 108662, command: ./main.ex
libgomp: ignoring acc_prof_register request for TODOinvalid acc_event_t 26
libgomp: ignoring acc_prof_register request for TODOinvalid acc_event_t 27
After the target region is executed, arr(1) = 43
==108662== Profiling application: ./main.ex
==108662== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 95.07% 590.33us 1 590.33us 590.33us 590.33us MAIN__$_omp_fn$0
2.72% 16.863us 3 5.6210us 1.3760us 14.047us [CUDA memcpy HtoD]
2.21% 13.727us 1 13.727us 13.727us 13.727us [CUDA memcpy DtoH]
API calls: 67.58% 263.93ms 1 263.93ms 263.93ms 263.93ms cuCtxCreate
27.13% 105.95ms 1 105.95ms 105.95ms 105.95ms cuCtxDestroy
2.14% 8.3529ms 1 8.3529ms 8.3529ms 8.3529ms cuModuleLoadData
1.78% 6.9468ms 24 289.45us 140.04us 1.5775ms cuLinkAddData
0.35% 1.3679ms 2 683.95us 671.05us 696.85us cuMemAlloc
0.34% 1.3225ms 1 1.3225ms 1.3225ms 1.3225ms cuLinkComplete
0.24% 926.79us 2 463.39us 399.72us 527.06us cuMemFree
0.16% 607.99us 19 31.999us 140ns 603.26us cuDeviceGetAttribute
0.15% 596.67us 1 596.67us 596.67us 596.67us cuCtxSynchronize
0.09% 349.22us 1 349.22us 349.22us 349.22us cuLaunchKernel
0.03% 109.33us 3 36.444us 9.0220us 68.314us cuMemcpyHtoD
0.01% 47.820us 1 47.820us 47.820us 47.820us cuMemcpyDtoH
0.01% 22.020us 1 22.020us 22.020us 22.020us cuLinkCreate
0.00% 5.5850us 1 5.5850us 5.5850us 5.5850us cuModuleGetGlobal
0.00% 3.2570us 1 3.2570us 3.2570us 3.2570us cuLinkDestroy
0.00% 2.8270us 1 2.8270us 2.8270us 2.8270us cuDeviceGetPCIBusId
0.00% 2.1540us 9 239ns 166ns 341ns cuCtxGetDevice
0.00% 1.8790us 1 1.8790us 1.8790us 1.8790us cuModuleGetFunction
0.00% 1.7310us 4 432ns 276ns 633ns cuMemGetAddressRange
0.00% 1.6530us 3 551ns 177ns 841ns cuFuncGetAttribute
0.00% 1.4540us 4 363ns 133ns 804ns cuDeviceGetCount
0.00% 804ns 2 402ns 209ns 595ns cuDeviceGet
0.00% 390ns 1 390ns 390ns 390ns cuCtxGetCurrent
0.00% 156ns 1 156ns 156ns 156ns cuInit
OpenACC (C)¶
A 2-D Jacobi solver written in C and accelerated with OpenACC is provided here and is discussed in detail on the NVIDIA Developer Blog.
It can be compiled as follows (it also requires the timer.h file provided here), using the NVIDIA HPC SDK compiler available via the nvhpc (formerly hpcsdk) module:
user@cgpu01:~> module load cgpu nvhpc
user@cgpu01:~> nvc -acc -Minfo -o laplace_2d.ex laplace_2d.c
GetTimer:
72, FMA (fused multiply-add) instruction(s) generated
main:
86, Generating copy(A[:][:])
Generating create(Anew[:][:])
93, Loop is parallelizable
96, Loop is parallelizable
Generating Tesla code
93, #pragma acc loop gang(32), vector(16) /* blockIdx.y threadIdx.y */
96, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
100, Generating implicit reduction(max:error)
106, Loop is parallelizable
109, Loop is parallelizable
Generating Tesla code
106, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
109, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
OpenACC (Fortran)¶
The same solver described above is also provided in a Fortran version here and can be compiled as follows:
user@cgpu01:~> module load cgpu nvhpc
user@cgpu01:~> nvfortran -acc -Minfo -o laplace_2d.ex laplace_2d.F90
laplace:
43, Memory zero idiom, array assignment replaced by call to pgf90_mzero4
46, Memory copy idiom, loop replaced by call to __c_mcopy4
50, Memory copy idiom, loop replaced by call to __c_mcopy4
77, Generating create(anew(:,:))
Generating copy(a(:,:))
83, Loop is parallelizable
85, Loop is parallelizable
Generating Tesla code
83, !$acc loop gang(32), vector(16) ! blockidx%y threadidx%y
85, !$acc loop gang(16), vector(32) ! blockidx%x threadidx%x
88, Generating implicit reduction(max:error)
100, Loop is parallelizable
102, Loop is parallelizable
Generating Tesla code
100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
102, !$acc loop gang(16), vector(32) ! blockidx%x threadidx%x
CUDA C¶
A vector addition example written in CUDA C is provided in this NVIDIA blog and can be compiled with the nvcc compiler provided by any cuda module on Cori GPU:
module load cuda
nvcc -arch compute_70 -code compute_70 saxpy.cu
CUDA Fortran¶
A vector addition example written in CUDA Fortran is provided in this NVIDIA blog and can be compiled with the nvfortran compiler provided by any nvhpc module on Cori GPU:
module load cgpu nvhpc
nvfortran -o saxpy.ex saxpy.cuf
More CUDA examples¶
The CUDA SDK, provided by the cuda modules, also includes a large number of example codes. They are located in $CUDA_HOME/samples and are sorted by category into various subdirectories, e.g., 0_Simple, 1_Utilities, etc. One can copy these sample codes into a write-accessible location and compile them:
cd $HOME
mkdir CUDA_samples
cp -r $CUDA_HOME/samples/{common,7_CUDALibraries} CUDA_samples
cd CUDA_samples/7_CUDALibraries
make
which will create the executables in the directory $HOME/CUDA_samples/bin/x86_64/linux/release.
GPU-aware MPI¶
Both OpenMPI and MVAPICH2 are GPU-aware MPI implementations. This means that the programmer can use pointers to GPU device memory in MPI buffers, and the MPI implementation will correctly move the data between GPU device memory and the network interface card (NIC): either by implicitly copying the data first to host memory and then from host memory to the NIC, or, on hardware which supports GPUDirect RDMA, by copying the data directly from the GPU to the NIC, bypassing host memory altogether.
GPU-aware MPI does not necessarily mean higher performance
A GPU-aware MPI library is not necessarily a higher-performing one; it means only that the library provides the programmer with certain conveniences that require less code to accomplish certain tasks (chiefly, copying data between GPU memory and a NIC).
An example of this is shown below:
/* A demonstration of a GPU-aware MPI program. The value of val_host is set on
* the CPU, copied to val_device which is allocated only on the GPU, and then
* val_device is passed directly to MPI_Send()/MPI_Recv() buffers, which
* automatically copy the data from the GPU to the NIC, without requiring the
* programmer to explicitly copy the data to/from CPU memory.
*
* Note that this program does *not* require GPUDirect RDMA hardware support in
* the NIC. It relies on software capabilities in GPU-aware MPI implementations
* like MVAPICH2 and OpenMPI. The MPI library is able to either execute an
* implicit cudaMemcpy() before each MPI_* function in order to copy the data
* to/from GPU memory; or if executing on a NIC which does support GPUDirect
* RDMA, it may copy the data directly between GPU memory and the NIC,
* bypassing CPU memory altogether.
*
* Note also that not all MPI functions are implemented with this kind of GPU
* memory awareness. Please check the documentation of your GPU-aware MPI
* library for details regarding which MPI functions can take advantage of this
* capability.
*/
#include <iostream>
#include <memory>
#include <mpi.h>
#include <cuda_runtime.h>
int main(int argc, char *argv[])
{
    int myrank, tag=99;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    float *val_device, *val_host = new float;
    cudaMalloc((void **)&val_device, sizeof(float));

    if (myrank == 0) {
        *val_host = 42.0;
        cudaMemcpy(val_device, val_host, sizeof(float), cudaMemcpyHostToDevice);
        MPI_Send(val_device, 1, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
        std::cout << "rank 0 sent " << *val_host << std::endl;
    } else {
        MPI_Recv(val_device, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &status);
        cudaMemcpy(val_host, val_device, sizeof(float), cudaMemcpyDeviceToHost);
        std::cout << "rank 1 received " << *val_host << std::endl;
    }

    cudaFree(val_device);
    MPI_Finalize();
    return 0;
}
The above C++ code can be compiled with OpenMPI on the Cori GPU nodes using:
module purge
module load cgpu
module load gcc
module load cuda
module load openmpi
mpic++ -o gpu_mpi.ex -I${CUDA_ROOT}/include gpu_mpi.cpp -L${CUDA_ROOT}/lib64 -lcudart
It can also be compiled with MVAPICH2:
module purge
module load cgpu
module load gcc
module load cuda
module load mvapich2
mpic++ -o gpu_mpi.ex -I${CUDA_ROOT}/include gpu_mpi.cpp -L${CUDA_ROOT}/lib64 -lcudart
The code must be executed with two MPI processes:
srun -n 2 -c 2 ./gpu_mpi.ex
Please note that not all MPI functions are GPU-aware. The list of GPU-aware MPI functions in OpenMPI is provided here.
More information about OpenMPI's GPU capabilities is provided here. More information about MVAPICH2's GPU support is provided here.
Python examples¶
These examples demonstrate 1) how to use CuPy as a drop-in replacement for NumPy and 2) how to use Numba CUDA to write custom kernels for functionality that has not been implemented in CuPy. Please keep in mind that this is just one paradigm for running Python on GPUs.
CuPy¶
The Chainer team develops CuPy, which is intended to be a drop-in replacement for NumPy on GPUs. Many commonly used NumPy and SciPy functions have already been implemented in CuPy, and the list is growing quickly.
Here is an example of a simple use case on the CPU (with NumPy) and on the GPU (with CuPy):
CPU version
import numpy as np
cpu_array = np.array([1, 2, 3])
cpu_mean = np.mean(cpu_array)
GPU version
import cupy as cp
#cupy automatically creates this array on the GPU for you
gpu_array = cp.array([1, 2, 3])
#and then finds the mean on the GPU
gpu_mean = cp.mean(gpu_array)
#if you would like the data back on the cpu, you can move it by
cpu_mean1 = cp.asnumpy(gpu_mean)
#or also
cpu_mean2 = gpu_mean.get()
#these two methods are equivalent
And that's it! CuPy is relatively easy to use.
But what if I need a function that isn't implemented in CuPy?
It is easy to interface CuPy with Numba, so we'll demonstrate how to use Numba to create custom GPU kernels.
Numba CUDA¶
If you have used Numba for accelerating Python on the CPU, you'll know that it provides a nice solution for speeding up Python code without having to rewrite kernels in another language. Numba CUDA provides this same capability, although it is not nearly as friendly as its CPU-based cousin.
Numba for GPUs is far more limited. Here is a short summary of Numba GPU rules:
- No operations that create objects or otherwise allocate memory are allowed
- GPU kernels cannot return any objects
- Most NumPy functionality that was permitted on the CPU is not permitted on the GPU
- Here is the short list of functionality that is permitted
Here is an example of a CPU Numba kernel and a GPU Numba kernel. These examples are based on this notebook.
CPU version
import numba

@numba.jit()
def smooth_cpu(x, out):
    n, m = x.shape
    for i in range(1, n - 1):      # just normal loops over i
        for j in range(1, m - 1):  # and j instead of cuda.grid
            out[i, j] = (x[i-1, j-1] + x[i-1, j+0] + x[i-1, j+1] +
                         x[i+0, j-1] + x[i+0, j+0] + x[i+0, j+1] +
                         x[i+1, j-1] + x[i+1, j+0] + x[i+1, j+1]) // 9
    return out
GPU version
import numba
# you must explicitly import numba.cuda
from numba import cuda

@cuda.jit()
def smooth_gpu(x, out):
    i, j = cuda.grid(2)  # this is the key step! makes a 2d grid of threads
    n, m = x.shape
    # no loop, just a boundary check,
    # since the code is already marching over all threads
    if (1 <= i < n - 1) and (1 <= j < m - 1):
        out[i, j] = (x[i-1, j-1] + x[i-1, j+0] + x[i-1, j+1] +
                     x[i+0, j-1] + x[i+0, j+0] + x[i+0, j+1] +
                     x[i+1, j-1] + x[i+1, j+0] + x[i+1, j+1]) // 9
    # no returns allowed in gpu land
What is happening in this example?
We have the same stencil calculation for both CPU and GPU, but there are a few key differences in the GPU kernel. First, the array out must be preallocated, since no memory creation is allowed on the GPU. Second, the key step is the cuda.grid(2) function, which creates a 2D grid of GPU threads. For more information about the concept of CUDA threadblocks, see this page. Finally, there is no loop over i and j like there is in the CPU version; this is because the Numba kernel executes on many threads at once, similar to what happens in an MPI code, so only a boundary check is necessary.
The last key part of the GPU kernel is how it is invoked. The invocation will look something like:
import numba
import cupy as cp
import math

x = cp.ones((10000, 10000))
out = cp.zeros((10000, 10000))

threadsperblock = (16, 16)  # this is a 2d kernel
blockspergrid_x = math.ceil(x.shape[0] / threadsperblock[0])
blockspergrid_y = math.ceil(x.shape[1] / threadsperblock[1])
blockspergrid = (blockspergrid_x, blockspergrid_y)

smooth_gpu[blockspergrid, threadsperblock](x, out)
where the values blockspergrid and threadsperblock may depend on both the GPU hardware you are using and the algorithm you are running. The best advice we have received is to try several values for each to determine what is best for your application.
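A minimal sketch of such a sweep is shown below. It reuses the smooth_gpu kernel and the CuPy arrays x and out defined above; the block shapes listed are purely illustrative, not recommendations:
import math
import time
from numba import cuda

# try a few candidate block shapes and time a kernel launch for each
for tpb in [(8, 8), (16, 16), (32, 8)]:
    bpg = (math.ceil(x.shape[0] / tpb[0]), math.ceil(x.shape[1] / tpb[1]))
    smooth_gpu[bpg, tpb](x, out)   # warm-up launch (triggers JIT compilation)
    cuda.synchronize()             # kernel launches are asynchronous
    t0 = time.perf_counter()
    smooth_gpu[bpg, tpb](x, out)
    cuda.synchronize()
    print(tpb, time.perf_counter() - t0)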
This has been a short summary of a complex subject. For further reading on Numba for GPUs, we recommend these resources:
NVIDIA RAPIDS¶
What if you are a pandas and/or scikit-learn user?
NVIDIA RAPIDS provides GPU-accelerated counterparts to both: cuDF for pandas and cuML for scikit-learn.
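As a minimal sketch (not taken from the notebooks linked below, and assuming a working RAPIDS installation), cuDF follows the pandas API and cuML follows the scikit-learn fit/predict pattern:
import cudf
from cuml.linear_model import LinearRegression

# a cuDF DataFrame lives in GPU memory but exposes the familiar pandas API
df = cudf.DataFrame({"x": [1.0, 2.0, 3.0, 4.0], "y": [2.0, 4.0, 6.0, 8.0]})
print(df["y"].mean())

# cuML estimators mirror scikit-learn: fit and predict directly on GPU data
model = LinearRegression()
model.fit(df[["x"]], df["y"])
print(model.predict(df[["x"]]))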
For self-guided examples from our NERSC/NVIDIA RAPIDS workshop on 4/14/2020, please see these notebooks.
Here you will find official RAPIDS notebooks, organized by library (cuDF, cuML, cuGraph, XGBoost).
Here you will find a much larger suite of RAPIDS notebooks. The notebooks are created by both NVIDIA and members of the community.