Skip to content


This page contains examples of codes using different GPU programming models.

OpenMP (C)

The following code demonstrates basic usage of OpenMP 4.5 offloading in C:

#include <stdio.h>

#define MAX 32768

int main()
  int x[MAX], i;
  for (i = 0; i < MAX; ++i) {
    x[i] = 42;

  #pragma omp target teams distribute parallel for simd map(tofrom: x[0:MAX])
  for (i = 0; i < MAX; ++i) {
    x[i] = x[i] + 1;
  printf("After the target region is executed, x[0] = %d\n", x[0]);
  return 0;

It can be compiled and run as:

user@cgpu01:~/tests/OpenMP/openmp45> module load PrgEnv-llvm/9.0.0-git-patched-upstream_20190305
user@cgpu01:~/tests/OpenMP/openmp45> clang -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda -o main.ex main.c
user@cgpu01:~/tests/OpenMP/openmp45> srun -n 1 nvprof ./main.ex
==105106== NVPROF is profiling process 105106, command: ./main.ex
After the target region is executed, x[0] = 43
==105106== Profiling application: ./main.ex
==105106== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   74.35%  79.871us         1  79.871us  79.871us  79.871us  __omp_offloading_33_1f79c13_main_l12
                   13.94%  14.976us         2  7.4880us  1.4080us  13.568us  [CUDA memcpy HtoD]
                   11.71%  12.576us         2  6.2880us  1.7280us  10.848us  [CUDA memcpy DtoH]
      API calls:   69.47%  261.44ms         1  261.44ms  261.44ms  261.44ms  cuCtxCreate
                   27.21%  102.39ms         1  102.39ms  102.39ms  102.39ms  cuCtxDestroy
                    1.76%  6.6232ms         1  6.6232ms  6.6232ms  6.6232ms  cuModuleLoadDataEx
                    1.25%  4.7129ms         1  4.7129ms  4.7129ms  4.7129ms  cuModuleUnload
                    0.11%  423.38us         1  423.38us  423.38us  423.38us  cuMemAlloc
                    0.11%  418.26us         1  418.26us  418.26us  418.26us  cuMemFree
                    0.04%  161.25us         1  161.25us  161.25us  161.25us  cuCtxSynchronize
                    0.02%  84.679us         2  42.339us  34.314us  50.365us  cuMemcpyDtoH
                    0.01%  44.885us         2  22.442us  5.3950us  39.490us  cuMemcpyHtoD
                    0.00%  14.476us         1  14.476us  14.476us  14.476us  cuLaunchKernel
                    0.00%  3.9210us         3  1.3070us     367ns  2.4200us  cuDeviceGetCount
                    0.00%  2.8000us         2  1.4000us  1.2200us  1.5800us  cuDeviceGet
                    0.00%  2.5640us         6     427ns     107ns     780ns  cuDeviceGetAttribute
                    0.00%  2.1960us         1  2.1960us  2.1960us  2.1960us  cuDeviceGetPCIBusId
                    0.00%  1.5640us         2     782ns     544ns  1.0200us  cuModuleGetGlobal
                    0.00%  1.1280us         6     188ns     129ns     284ns  cuCtxSetCurrent
                    0.00%     924ns         1     924ns     924ns     924ns  cuFuncGetAttribute
                    0.00%     924ns         1     924ns     924ns     924ns  cuModuleGetFunction

where the name of the offloaded loop generated by the compiler is derived from the line number of the loop in the original source code (the last part of the kernel name, main_l12 indicates that the kernel starts on line 12 of main.c).

The ECP SOLLVE project also provides a verification and validation suite for OpenMP 4.5 compilers. Many OpenMP offloading examples in both C and Fortran can be found in their public Bitbucket repository.

OpenMP (Fortran)

The following Fortran program executes the same loop as the C example above.

program main
  implicit none

  integer, parameter :: sz = 32768
  integer, dimension(sz) :: arr
  integer :: i

  do i = 1, sz
    arr(i) = 42
  end do

  !$omp target teams distribute parallel do map(tofrom: arr(1:sz))
  do i = 1, sz
    arr(i) = arr(i) + 1
  end do

  print *, "After the target region is executed, arr(1) = ", arr(1)

end program main

It can be compiled with the GCC Fortran compiler gfortran from the gcc/8.1.1-openacc-gcc-8-branch-20190215 module:

user@cgpu01:~/tests/OpenMP/openmp45> gfortran -fopenmp -foffload=nvptx-none="-Ofast -lm -misa=sm_35" -o main.ex main.f90
user@cgpu01:~/tests/OpenMP/openmp45> srun -n 1 nvprof ./main.ex
==108662== NVPROF is profiling process 108662, command: ./main.ex

libgomp: ignoring acc_prof_register request for TODOinvalid acc_event_t 26

libgomp: ignoring acc_prof_register request for TODOinvalid acc_event_t 27
 After the target region is executed, arr(1) =           43
==108662== Profiling application: ./main.ex
==108662== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   95.07%  590.33us         1  590.33us  590.33us  590.33us  MAIN__$_omp_fn$0
                    2.72%  16.863us         3  5.6210us  1.3760us  14.047us  [CUDA memcpy HtoD]
                    2.21%  13.727us         1  13.727us  13.727us  13.727us  [CUDA memcpy DtoH]
      API calls:   67.58%  263.93ms         1  263.93ms  263.93ms  263.93ms  cuCtxCreate
                   27.13%  105.95ms         1  105.95ms  105.95ms  105.95ms  cuCtxDestroy
                    2.14%  8.3529ms         1  8.3529ms  8.3529ms  8.3529ms  cuModuleLoadData
                    1.78%  6.9468ms        24  289.45us  140.04us  1.5775ms  cuLinkAddData
                    0.35%  1.3679ms         2  683.95us  671.05us  696.85us  cuMemAlloc
                    0.34%  1.3225ms         1  1.3225ms  1.3225ms  1.3225ms  cuLinkComplete
                    0.24%  926.79us         2  463.39us  399.72us  527.06us  cuMemFree
                    0.16%  607.99us        19  31.999us     140ns  603.26us  cuDeviceGetAttribute
                    0.15%  596.67us         1  596.67us  596.67us  596.67us  cuCtxSynchronize
                    0.09%  349.22us         1  349.22us  349.22us  349.22us  cuLaunchKernel
                    0.03%  109.33us         3  36.444us  9.0220us  68.314us  cuMemcpyHtoD
                    0.01%  47.820us         1  47.820us  47.820us  47.820us  cuMemcpyDtoH
                    0.01%  22.020us         1  22.020us  22.020us  22.020us  cuLinkCreate
                    0.00%  5.5850us         1  5.5850us  5.5850us  5.5850us  cuModuleGetGlobal
                    0.00%  3.2570us         1  3.2570us  3.2570us  3.2570us  cuLinkDestroy
                    0.00%  2.8270us         1  2.8270us  2.8270us  2.8270us  cuDeviceGetPCIBusId
                    0.00%  2.1540us         9     239ns     166ns     341ns  cuCtxGetDevice
                    0.00%  1.8790us         1  1.8790us  1.8790us  1.8790us  cuModuleGetFunction
                    0.00%  1.7310us         4     432ns     276ns     633ns  cuMemGetAddressRange
                    0.00%  1.6530us         3     551ns     177ns     841ns  cuFuncGetAttribute
                    0.00%  1.4540us         4     363ns     133ns     804ns  cuDeviceGetCount
                    0.00%     804ns         2     402ns     209ns     595ns  cuDeviceGet
                    0.00%     390ns         1     390ns     390ns     390ns  cuCtxGetCurrent
                    0.00%     156ns         1     156ns     156ns     156ns  cuInit

OpenACC (C)

A 2-D Jacobi solver written in C and accelerated with OpenACC is provided here and is discussed in detail on the NVIDIA Developer Blog.

It can be compiled as follows (also requires the timer.h file provided here), using the PGI v19.5 compiler available via the pgi/19.5 module:

user@cgpu01:~> module load pgi/19.5
user@cgpu01:~> pgcc -acc -Minfo -ta=tesla:cc70 -o laplace_2d.ex laplace_2d.c
     72, FMA (fused multiply-add) instruction(s) generated
     86, Generating copy(A[:][:])
         Generating create(Anew[:][:])
     93, Loop is parallelizable
     96, Loop is parallelizable
         Generating Tesla code
         93, #pragma acc loop gang(32), vector(16) /* blockIdx.y threadIdx.y */
         96, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */
        100, Generating implicit reduction(max:error)
    106, Loop is parallelizable
    109, Loop is parallelizable
         Generating Tesla code
        106, #pragma acc loop gang, vector(4) /* blockIdx.y threadIdx.y */
        109, #pragma acc loop gang(16), vector(32) /* blockIdx.x threadIdx.x */

OpenACC (Fortran)

The same solver described above is also provided in a Fortran version here and can be compiled as follows:

user@cgpu01:~> module load pgi/19.5
user@cgpu01:~> pgf90 -acc -Minfo -ta=tesla:cc70 -o laplace_2d.ex laplace_2d.F90
     43, Memory zero idiom, array assignment replaced by call to pgf90_mzero4
     46, Memory copy idiom, loop replaced by call to __c_mcopy4
     50, Memory copy idiom, loop replaced by call to __c_mcopy4
     77, Generating create(anew(:,:))
         Generating copy(a(:,:))
     83, Loop is parallelizable
     85, Loop is parallelizable
         Generating Tesla code
         83, !$acc loop gang(32), vector(16) ! blockidx%y threadidx%y
         85, !$acc loop gang(16), vector(32) ! blockidx%x threadidx%x
         88, Generating implicit reduction(max:error)
    100, Loop is parallelizable
    102, Loop is parallelizable
         Generating Tesla code
        100, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
        102, !$acc loop gang(16), vector(32) ! blockidx%x threadidx%x


A vector addition example written in CUDA C is provided in this NVIDIA blog and can be compiled with the nvcc compiler provided by any cuda module on Cori GPU:

user@cgpu01:~> module load cuda/10.1.168
user@cgpu01:~> nvcc -arch compute_75 -code compute_75

CUDA Fortran

A vector addition example written in CUDA Fortran is provided in this NVIDIA blog and can be compiled with the pgf90 compiler provided by any pgi module on Cori GPU:

user@cgpu01:~> module load pgi/19.5
user@cgpu01:~> pgf90 -o saxpy.ex saxpy.cuf

More CUDA examples

The CUDA SDK, provided by the cuda modules, also provides a large number of example codes. They are located in $CUDA_HOME/samples and are sorted by category into various subdirectories, e.g., 0_Simple, 1_Utilities, etc. One can copy these sample codes into a write-accessible location and compile the examples:

user@cgpu01:~> cd $HOME
user@cgpu01:~> mkdir CUDA_samples
user@cgpu01:~> cp -r $CUDA_HOME/samples/{common,7_CUDALibraries} CUDA_samples
user@cgpu01:~> cd CUDA_samples/7_CUDALibraries
user@cgpu01:~> make
which will create the executable in the directory $HOME/CUDA_samples/bin/x86_64/linux/release.

GPU-aware MPI

Both OpenMPI and MVAPICH2 are GPU-aware MPI implementations. This means that the programmer can use pointers to GPU device memory in MPI buffers, and the MPI implementation will correctly copy the data in GPU device memory to/from the network interface card's (NIC's) memory, either by implicitly copying the data first to host memory and then copying the data from host memory to the NIC; or, in hardware which supports GPUDirect RDMA, the data will be copied directly from the GPU to the NIC, bypassing host memory altogether.

GPU-aware MPI does not necessarily mean higher performance

A GPU-aware MPI implementation does not necessarily correlate to a higher-performing MPI library; it means only that the library provides the programmer with certain conveniences which require less code to accomplish certain tasks (chiefly, copying data between GPU memory and a NIC).

An example of this is shown below:

/* A demonstration of a GPU-aware MPI program. The value of val_host is set on
 * the CPU, copied to val_device which is allocated only on the GPU, and then
 * val_device is passed directly to MPI_Send()/MPI_Recv() buffers, which
 * automatically copy the data from the GPU to the NIC, without requiring that
 * the programmer to explicitly copy the data to/from CPU memory.
 * Note that this program does *not* require GPUDirect RDMA hardware support in
 * the NIC. It relies on software capabilities in GPU-aware MPI implementations
 * like MVAPICH2 and OpenMPI. The MPI library is able to either execute an
 * implicit cudaMemcpy() before each MPI_* function in order to copy the data
 * to/from GPU memory; or if executing on a NIC which does support GPUDirect
 * RDMA, it may copy the data directly between GPU memory and the NIC,
 * bypassing CPU memory altogether.
 * Note also that not all MPI functions are implemented with this kind of GPU
 * memory awareness. Please check the documentation of your GPU-aware MPI
 * library for details regarding which MPI functions can take advantage of this
 * capability.

#include <iostream>
#include <memory>
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char *argv[])
    int myrank, tag=99;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    float *val_device, *val_host = new float;

    cudaMalloc((void **)&val_device, sizeof(float));

    if (myrank == 0) {
        *val_host = 42.0;
        cudaMemcpy(val_device, val_host, sizeof(float), cudaMemcpyHostToDevice);
        MPI_Send(val_device, 1, MPI_FLOAT, 1, tag, MPI_COMM_WORLD);
        std::cout << "rank 0 sent " << *val_host << std::endl;
    } else {
        MPI_Recv(val_device, 1, MPI_FLOAT, 0, tag, MPI_COMM_WORLD, &status);
        cudaMemcpy(val_host, val_device, sizeof(float), cudaMemcpyDeviceToHost);
        std::cout << "rank 1 received " << *val_host << std::endl;


    return 0;

The above C++ code can be compiled with OpenMPI on the Cori GPU nodes using:

module load gcc/7.3.0
module load cuda/10.1.168
module load openmpi/4.0.1-ucx-1.6
mpic++ -o gpu_mpi.ex -I${CUDA_ROOT}/include gpu_mpi.cpp -L${CUDA_ROOT}/lib64 -lcudart

It can also be compiled with MVAPICH2:

module load gcc/7.3.0
module load cuda/10.1.168
module load mvapich2/2.3.1
mpic++ -o gpu_mpi.ex -I${CUDA_ROOT}/include gpu_mpi.cpp -L${CUDA_ROOT}/lib64 -lcudart

The code must be executed with two MPI processes:

srun -n 2 -c 2 ./gpu_mpi.ex

Please note that not all MPI functions are GPU-aware. The list of GPU-aware MPI functions in OpenMPI is provided here.

More information about OpenMPI's GPU capabilities is provided here. More information about MVAPICH2's GPU support is provided here.

Python examples

These examples assume that you have already created a custom conda environment in which both CuPy and Numba have been installed. For directions please see here.


Chainer is developing CuPy, which is intended to be a drop-in replacement for NumPy on GPUs. Many commonly used NumPy and SciPy functions have already been implemented in CuPy and the list is growing quickly.

Here is an example of a simple use case on the CPU (with NumPy) and on the GPU (with CuPy):

CPU version

import numpy as np

cpu_array = np.array([1, 2, 3])
cpu_mean = np.mean(cpu_array)
GPU version
import cupy as cp

#cupy automatically creates this array on the GPU for you
gpu_array = cp.array([1, 2, 3])
#and then finds the mean on the GPU
gpu_mean = cp.mean(gpu_array)
#if you would like the data back on the cpu, you can move it by
cpu_mean1 = cp.asnumpy(gpu_mean)
#or also
cpu_mean2 = gpu_mean.get()
#these two methods are equivalent
And that's it! CuPy is relatively easy to use.

But what if I need a function that isn't implemented in CuPy? Or what if several CuPy functions strung together are slow and I want them to be faster?

In this situation we recommend Numba, which allows us to write custom Python kernels that call CUDA under the hood.

Numba (using CUDA)

If you have used Numba for accelerating Python on the CPU, you'll know that it provides a nice solution for speeding up Python code without having to rewrite kernels in another language. Numba for GPUs provides this capability, although it is not nearly as friendly as its CPU-based cousin.

Numba for GPUs is far more limited. Here is a short summary of Numba GPU rules:

  • No operations that create objects or otherwise allocate memory are allowed
  • GPU kernels cannot return any objects
  • Most NumPy functionality that was permitted on the CPU is not permitted on the GPU
  • Here is the short list of functionality that is permitted

Here is an example of a CPU Numba kernel and a GPU Numba kernel. These examples are based on this notebook.

CPU version

import numba

def smooth_cpu(x, out):    

    n, m = x.shape 

    for i in range(1, n - 1):     #just normal loops for i
        for j in range(1, m - 1): #and j instead of cuda.grid

            out[i,j] = (x[i-1, j-1] + x[i-1, j+0] + x[i-1, j+1] +
                        x[i+0, j-1] + x[i+0, j+0] + x[i+0, j+1] +
                        x[i+1, j-1] + x[i+1, j+0] + x[i+1, j+1]) // 9    
    return out

GPU version

import numba
#you must explictly import numba.cuda
from numba import cuda

def smooth_gpu(x, out):

    i, j = cuda.grid(2) #this is the key step! makes a 2d grid of threads

    n, m = x.shape

    #no loop, just a boundary check
    #since the code is already marching over all threads
    if (1 <= i < n - 1) and (1 <= j < m - 1):

        out[i, j] = (x[i-1, j-1] + x[i-1, j+0] + x[i-1, j+1] +
                     x[i+0, j-1] + x[i+0, j+0] + x[i+0, j+1] +
                     x[i+1, j-1] + x[i+1, j+0] + x[i+1, j+1]) // 9            

    #no returns allowed in gpu land 
So what is happening in this example? We have the same stencil calculation for both CPU and GPU, but there are a few key differences in the GPU kernel. First, the array out must be preallocated since no memory creation is allowed on the GPU. Second, the key step is the cuda.grid(2) function, which creates a 2D grid of GPU threads. This is a convenience function; for more information about what it is actually doing, see this notebook. For more information about the concept of CUDA threadblocks, see this page. Finally, there is no loop over i and j like there is in the CPU version; this is because the Numba kernel is executing on many threads at once, similar to what happens in an MPI code. Therefore only a boundary check is necessary.

The last key part of the GPU kernel is in how it is invoked. The invocation will look something like:

import numba
import cupy as cp

x = cp.ones((10000, 10000))
out = cp.zeros((10000, 10000))

threads_per_block = 128
blocks_per_grid = 30

smooth_gpu[blocks_per_grid, threads_per_block](x, out)
where the values blocks_per_grid and threads_per_block may depend on both the GPU hardware you are using and the algorithm you are running. The best advice we have received is to try several values for each to determine what is best for your application.

And yes, you read correctly: Numba can accept CuPy objects!

This has been short summary of a complex subject. For further reading on Numba for GPUs we recommend these resources:

Notebooks from GTC 2019

The Numba docs


What if you are a Pandas and/or scikit-learn user?

NVIDIA RAPIDS has implemented Pandas as cuDF and scikit-learn as cuML.

You can read their getting started guide here.

Here you will find official RAPIDS notebooks, organized by library (cuDF, cuML, cuGraph, XGBoost).

Here you will find a much larger suite of RAPIDS notebooks. The notebooks are created by both NVIDIA and members of the community.

More information about using RAPIDS at NERSC coming soon!