# CUDA Matrix multiplication

In this tutorial, we will tackle a well-suited problem for Parallel Programming and quite a useful one. We will do Matrix Multiplication. Problem: Given two `N*N` matrix `X` and `Y` of floating point numbers, perform matrix multiplication and store the result on a `N*N` matrix answer.

Below is a code for matrix multiplication using C++. It is the standard `O(N³)` procedure. Here `x`, `y`, and ans are three `N²` size matrices(`N²` sized `1D` array. We are using `1D` arrays as of it were `2D`).

``````void cpuMatMul(int N,double *x, double *y, double *ans)
{for(int i=0;i < N;i++)
{
for(int j=0;j < N;j++)
{
for(int k=0;k < N;k++)
{
ans[i*N+j]+=(x[i*N+k]*y[k*N+j]);
}
}
}}``````

For a matrix of size `1024 X 1024`, it takes about `10` seconds to complete the task. For a matrix of size `4096 X 4096`, it takes about `2500` seconds(larger matrices add extra overhead. So it does not scale up exactly in `O(N³)` after 1024) to complete the task. Above (`4096 X 4096`), it just gets infeasible to run the experiment.

## Parallel Matrix Multiplication

Following is the parallel version of the same code we have seen above. Now let's execute the whole code:

CUDA (3.2.16)
• Input

Source code credit Zhengchun Liu. The code has been simplified.