How did it work?
The kernel is the heart of our CUDA code. When a kernel is launched, the number of threads per
block (blockDim) and the number of blocks per grid (gridDim) are specified. The total number of
threads is blockDim * gridDim. Each thread evaluates one copy of the kernel.
// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}
The __global__ decorator specifies that this is a CUDA kernel; otherwise normal C function syntax is used. The kernel must have return type void.

We calculate the thread's global ID using CUDA-supplied structs. blockIdx contains the block's position in the grid, ranging from 0 to gridDim-1; threadIdx is the thread's index inside its associated block, ranging from 0 to blockDim-1. For convenience, blocks and grids can be multidimensional, so the associated structs contain x, y, and z members.

Unless our array length is divisible by our blockDim, the number of threads launched will not equal the number of valid array components. To avoid overstepping our array, we simply test that our global thread ID is less than the length of our array.