CUDA matrix multiplication

How to perform CUDA matrix multiplication?

CUDA Matrix multiplication

In this tutorial, we will tackle a well-suited problem for Parallel Programming and quite a useful one. We will do Matrix Multiplication. Problem: Given two N*N matrix X and Y of floating point numbers, perform matrix multiplication and store the result on a N*N matrix answer.

Below is a code for matrix multiplication using C++. It is the standard O(N³) procedure. Here x, y, and ans are three size matrices( sized 1D array. We are using 1D arrays as of it were 2D).

void cpuMatMul(int N,double *x, double *y, double *ans)
{for(int i=0;i < N;i++)
        for(int j=0;j < N;j++)
            for(int k=0;k < N;k++)

For a matrix of size 1024 X 1024, it takes about 10 seconds to complete the task. For a matrix of size 4096 X 4096, it takes about 2500 seconds(larger matrices add extra overhead. So it does not scale up exactly in O(N³) after 1024) to complete the task. Above (4096 X 4096), it just gets infeasible to run the experiment.

Parallel Matrix Multiplication

Following is the parallel version of the same code we have seen above. Now let's execute the whole code:

CUDA (3.2.16)
  • Input  

Source code credit Zhengchun Liu. The code has been simplified.