Math Libraries¶
Intel MKL¶
To use routines provided by the Intel MKL, load one of the available intel
modules before compiling and running your code:
module load intel
Module names
Be sure to load an intel compiler module and not the Intel programming environment module (PrgEnv-intel).
To determine the appropriate link lines for your code, use the Intel MKL Link Line Advisor.
Thrust¶
Thrust is an open-source C++ library that implements much of the functionality of the C++ STL on GPUs. The CUDA SDK installed on Cori GPU includes a recent release of Thrust, so loading a cuda module on Cori GPU is sufficient to compile and run Thrust code. The example code shown here can be compiled and run on Cori GPU as follows:
user@cori11:~> module load cuda
user@cori11:~> nvcc -o main.ex main.cu
user@cori11:~> srun -C gpu -c 2 -G 1 -t 1 nvprof ./main.ex
H has size 4
H[0] = 14
H[1] = 20
H[2] = 38
H[3] = 46
H now has size 2
==42627== NVPROF is profiling process 42627, command: ./main.ex
D[0] = 99
D[1] = 88
==42627== Profiling application: ./main.ex
==42627== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 58.02% 4.5120us 3 1.5040us 1.3760us 1.7600us [CUDA memcpy HtoD]
41.98% 3.2640us 2 1.6320us 1.5040us 1.7600us [CUDA memcpy DtoH]
API calls: 99.44% 288.27ms 1 288.27ms 288.27ms 288.27ms cudaMalloc
0.29% 843.06us 97 8.6910us 140ns 338.33us cuDeviceGetAttribute
0.12% 338.00us 1 338.00us 338.00us 338.00us cuDeviceTotalMem
0.10% 294.85us 1 294.85us 294.85us 294.85us cudaFree
0.03% 73.565us 1 73.565us 73.565us 73.565us cuDeviceGetName
0.02% 54.643us 5 10.928us 3.3030us 19.945us cudaMemcpyAsync
0.01% 25.451us 5 5.0900us 964ns 7.8780us cudaStreamSynchronize
0.00% 2.2820us 1 2.2820us 2.2820us 2.2820us cuDeviceGetPCIBusId
0.00% 1.3490us 3 449ns 179ns 747ns cuDeviceGetCount
0.00% 941ns 2 470ns 183ns 758ns cuDeviceGet
0.00% 255ns 1 255ns 255ns 255ns cuDeviceGetUuid
Thrust source code files must use the .cu extension
Although Thrust code is standard C++, it must use the .cu extension in order for the nvcc compiler to recognize it as Thrust code and compile it in the required way. If one uses a standard C++ file extension for Thrust code, like .cpp or .cc, nvcc compilation will fail.
cuFFT¶
cuFFT is the NVIDIA® CUDA™ Fast Fourier Transform (FFT) product; it is provided with CUDA installations. It contains two libraries, cuFFT and cuFFTW.
cuFFTW¶
The cuFFTW library is provided as a porting tool to help users of FFTW to start using NVIDIA GPUs. This is done via the FFTW3 API provided by the cuFFT library.
Consider the following FFTW code example, fftw_example.c, adapted from this GitHub repository:
#include <fftw3.h>
#include <stdio.h>
#include <math.h>

#define NUM_POINTS 64
#define REAL 0
#define IMAG 1

void acquire_from_somewhere(fftw_complex* signal) {
    /* Generate two sine waves of different frequencies and
     * amplitudes.
     */
    int i;
    for (i = 0; i < NUM_POINTS; ++i) {
        double theta = (double)i / (double)NUM_POINTS * M_PI;
        signal[i][REAL] = 1.0 * cos(10.0 * theta) +
                          0.5 * cos(25.0 * theta);
        signal[i][IMAG] = 1.0 * sin(10.0 * theta) +
                          0.5 * sin(25.0 * theta);
    }
}

void do_something_with(fftw_complex* result) {
    int i;
    for (i = 0; i < NUM_POINTS; ++i) {
        double mag = sqrt(result[i][REAL] * result[i][REAL] +
                          result[i][IMAG] * result[i][IMAG]);
        printf("%g\n", mag);
    }
}

int main() {
    fftw_complex signal[NUM_POINTS];
    fftw_complex result[NUM_POINTS];
    fftw_plan plan = fftw_plan_dft_1d(NUM_POINTS,
                                      signal,
                                      result,
                                      FFTW_FORWARD,
                                      FFTW_ESTIMATE);
    acquire_from_somewhere(signal);
    fftw_execute(plan);
    do_something_with(result);
    fftw_destroy_plan(plan);
    return 0;
}
For use on the Cori Haswell or KNL nodes, this would be compiled with:
module load cray-fftw
CC -o fftw_example.o fftw_example.c
To use the cuFFTW library with this example on the Cori GPU nodes, simply replace the include statement #include <fftw3.h> with #include <cufftw.h>. Then, compile the code with:
module load cuda
nvcc -lcufftw -o cufftw_example.o cufftw_example.cu
where we have changed the source file name to cufftw_example.cu.
Not all FFTW3 capability supported
cuFFT does not support all of the components and functions of FFTW3. For a description of what is and is not supported, please see this section of the cuFFT documentation.
cuFFT¶
The above example can also be replicated using the cuFFT library. The equivalent source code, cufft_example.cu, is:
#include <cuda_runtime.h>
#include <cufft.h>
#include <stdio.h>
#include <stdlib.h>  /* malloc, free */
#include <math.h>

#define NUM_POINTS 64

void acquire_from_somewhere(cufftComplex* signal) {
    /* Generate two sine waves of different frequencies and
     * amplitudes.
     */
    int i;
    for (i = 0; i < NUM_POINTS; ++i) {
        double theta = (double)i / (double)NUM_POINTS * M_PI;
        signal[i].x = 1.0 * cos(10.0 * theta) +
                      0.5 * cos(25.0 * theta);
        signal[i].y = 1.0 * sin(10.0 * theta) +
                      0.5 * sin(25.0 * theta);
    }
}

void do_something_with(cufftComplex* result) {
    int i;
    for (i = 0; i < NUM_POINTS; ++i) {
        double mag = sqrt(result[i].x * result[i].x +
                          result[i].y * result[i].y);
        printf("%g\n", mag);
    }
}

int main() {
    /* Host buffers for the input signal and the transformed result. */
    cufftComplex* signal = (cufftComplex*)malloc(sizeof(cufftComplex)*NUM_POINTS);
    cufftComplex* result = (cufftComplex*)malloc(sizeof(cufftComplex)*NUM_POINTS);
    acquire_from_somewhere(signal);

    /* Copy the input signal to the device. */
    cufftComplex *d_signal;
    int mem_size = sizeof(cufftComplex) * NUM_POINTS;
    cudaMalloc((void**)&d_signal, mem_size);
    cudaMemcpy(d_signal, signal, mem_size, cudaMemcpyHostToDevice);

    /* Plan and run a single 1D complex-to-complex transform, in place. */
    cufftHandle plan;
    cufftPlan1d(&plan,
                NUM_POINTS,
                CUFFT_C2C,
                1);
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);

    /* Copy the result back to the host and clean up. */
    cudaMemcpy(result, d_signal, mem_size, cudaMemcpyDeviceToHost);
    cufftDestroy(plan);
    do_something_with(result);
    free(signal);
    free(result);
    cudaFree(d_signal);
    return 0;
}
and would be compiled on a Cori GPU node with:
module load cuda
nvcc -o cufft_example.o cufft_example.cu -lcufft