
Math Libraries

Intel MKL

To use routines provided by the Intel MKL, load one of the available intel modules before compiling and running your code:

module load intel

Module names

Be sure to load an intel compiler module and not the Intel programming environment module (PrgEnv-intel).

To determine the appropriate link lines for your code, use the Intel MKL Link Line Advisor.

Thrust

Thrust is an open-source C++ library that implements much of the functionality of the C++ STL on GPUs. The CUDA SDK installed on Cori GPU includes a recent release of Thrust, so loading a cuda module is sufficient to compile and run Thrust code. An example code, saved as main.cu, can be compiled and run on Cori GPU as follows:

user@cori11:~> module load cuda
user@cori11:~> nvcc -o main.ex main.cu
user@cori11:~> srun -C gpu -c 2 -G 1 -t 1 nvprof ./main.ex
H has size 4
H[0] = 14
H[1] = 20
H[2] = 38
H[3] = 46
H now has size 2
==42627== NVPROF is profiling process 42627, command: ./main.ex
D[0] = 99
D[1] = 88
==42627== Profiling application: ./main.ex
==42627== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   58.02%  4.5120us         3  1.5040us  1.3760us  1.7600us  [CUDA memcpy HtoD]
                   41.98%  3.2640us         2  1.6320us  1.5040us  1.7600us  [CUDA memcpy DtoH]
      API calls:   99.44%  288.27ms         1  288.27ms  288.27ms  288.27ms  cudaMalloc
                    0.29%  843.06us        97  8.6910us     140ns  338.33us  cuDeviceGetAttribute
                    0.12%  338.00us         1  338.00us  338.00us  338.00us  cuDeviceTotalMem
                    0.10%  294.85us         1  294.85us  294.85us  294.85us  cudaFree
                    0.03%  73.565us         1  73.565us  73.565us  73.565us  cuDeviceGetName
                    0.02%  54.643us         5  10.928us  3.3030us  19.945us  cudaMemcpyAsync
                    0.01%  25.451us         5  5.0900us     964ns  7.8780us  cudaStreamSynchronize
                    0.00%  2.2820us         1  2.2820us  2.2820us  2.2820us  cuDeviceGetPCIBusId
                    0.00%  1.3490us         3     449ns     179ns     747ns  cuDeviceGetCount
                    0.00%     941ns         2     470ns     183ns     758ns  cuDeviceGet
                    0.00%     255ns         1     255ns     255ns     255ns  cuDeviceGetUuid

Thrust source code files must use the .cu extension

Although Thrust code is standard C++, source files must use the .cu extension so that the nvcc compiler recognizes them as Thrust code and compiles them in the required way. If a standard C++ file extension such as .cpp or .cc is used for Thrust code, nvcc compilation will fail.

cuFFT

cuFFT is the NVIDIA® CUDA™ Fast Fourier Transform (FFT) product; it is provided with CUDA installations. It contains two libraries, cuFFT and cuFFTW.

cuFFTW

The cuFFTW library is provided as a porting tool to help users of FFTW to start using NVIDIA GPUs. This is done via the FFTW3 API provided by the cuFFT library.

Consider the following FFTW code example, fftw_example.c, adapted from this GitHub repository:

#include <fftw3.h>
#include <stdio.h>
#include <math.h>

#define NUM_POINTS 64
#define REAL 0
#define IMAG 1

void acquire_from_somewhere(fftw_complex* signal) {
    /* Generate two sine waves of different frequencies and
     * amplitudes.
     */

    int i;
    for (i = 0; i < NUM_POINTS; ++i) {
        double theta = (double)i / (double)NUM_POINTS * M_PI;

        signal[i][REAL] = 1.0 * cos(10.0 * theta) +
                          0.5 * cos(25.0 * theta);

        signal[i][IMAG] = 1.0 * sin(10.0 * theta) +
                          0.5 * sin(25.0 * theta);
    }
}

void do_something_with(fftw_complex* result) {
    int i;
    for (i = 0; i < NUM_POINTS; ++i) {
        double mag = sqrt(result[i][REAL] * result[i][REAL] +
                          result[i][IMAG] * result[i][IMAG]);

        printf("%g\n", mag);
    }
}

int main() {
    fftw_complex signal[NUM_POINTS];
    fftw_complex result[NUM_POINTS];

    fftw_plan plan = fftw_plan_dft_1d(NUM_POINTS,
                                      signal,
                                      result,
                                      FFTW_FORWARD,
                                      FFTW_ESTIMATE);

    acquire_from_somewhere(signal);
    fftw_execute(plan);
    do_something_with(result);
    fftw_destroy_plan(plan);

    return 0;
}

For use on the Cori Haswell or KNL nodes, this would be compiled with:

module load cray-fftw
CC -o fftw_example.o fftw_example.c

To use the cuFFTW library with this example on the Cori GPU nodes, replace the include statement #include <fftw3.h> with #include <cufftw.h>. Then compile the code with:

module load cuda
nvcc -lcufftw -o cufftw_example.o cufftw_example.cu

where the source file has been renamed to cufftw_example.cu.

Not all FFTW3 capability supported

cuFFTW does not support all of the components and functions of FFTW3. For a description of what is and is not supported, please see this section of the cuFFT documentation.

cuFFT

The above example can also be replicated using the cuFFT library. The equivalent source code, cufft_example.cu, is:

#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NUM_POINTS 64

void acquire_from_somewhere(cufftComplex* signal) {
    /* Generate two sine waves of different frequencies and
     * amplitudes.
     */

    int i;
    for (i = 0; i < NUM_POINTS; ++i) {
        double theta = (double)i / (double)NUM_POINTS * M_PI;

        signal[i].x = 1.0 * cos(10.0 * theta) +
                      0.5 * cos(25.0 * theta);

        signal[i].y = 1.0 * sin(10.0 * theta) +
                      0.5 * sin(25.0 * theta);
    }   
}

void do_something_with(cufftComplex* result) {
    int i;
    for (i = 0; i < NUM_POINTS; ++i) {
        double mag = sqrt(result[i].x * result[i].x +
                          result[i].y * result[i].y);

        printf("%g\n", mag);
    }   
}

int main() {
    cufftComplex* signal = (cufftComplex*)malloc(sizeof(cufftComplex)*NUM_POINTS);
    cufftComplex* result = (cufftComplex*)malloc(sizeof(cufftComplex)*NUM_POINTS);

    acquire_from_somewhere(signal);

    cufftComplex *d_signal;
    int mem_size = sizeof(cufftComplex) * NUM_POINTS;
    cudaMalloc((void**)&d_signal, mem_size);
    cudaMemcpy(d_signal, signal, mem_size, cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan,
                NUM_POINTS,
                CUFFT_C2C,
                1); 
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    cudaMemcpy(result, d_signal, mem_size, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);

    do_something_with(result);

    free(signal);
    free(result);
    cudaFree(d_signal);

    return 0;
}

and would be compiled on a Cori GPU node with:

module load cuda
nvcc -o cufft_example.o cufft_example.cu -lcufft