Course


CUDA vector addition

Learn how to do CUDA vector addition.

Vector addition

We want to sum two one-dimensional vectors, A and B, containing 512 cells (numbers) each, with the result contained in a third vector, named C.

CUDA (4.0)
  • Show Input  

How did it work?

The kernel is the heart of our CUDA code. When a kernel is launched the number of threads per block(blockDim) and number of blocks per grid(gridDim) are specified. The total number of threads is (blockDim) * (gridDim). Each thread evaluates one copy of the kernel.

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x+threadIdx.x;

    // Make sure we do not go out of bounds
    if (id  <  n)
        c[id] = a[id] + b[id];
}

The __global__ decorator specifies this is a CUDA kernel, otherwise normal C function syntax is used. The kernel must have return type void. We calculate the threads global id using CUDA supplied structs. blockIdx contains the blocks position in the grid, ranging from 0 to gridDim-1. threadIdx is the threads index inside of it’ s associated block, ranging from 0 to blockDim-1. For convenience blocks and grids can be multi dimensional, the associated structs contain x, y, and z members. Unless we have an array length divisible by our blockDim we will not have the same number of threads launched as valid array components. To avoid overstepping our array we simply test to make sure our global thread ID is less than the length of our array.