Sunday, 12 July 2015

Addition of two numbers in CUDA: A Simple Approach

Addition is one of the most basic arithmetic operations, and performing it in C is a simple task. In this post, we will convert the C code into CUDA code. The steps to remember when writing CUDA code for any program are as follows:

  1. Declare Device variables.
  2. Allocate Memory for Device variables.
  3. Copy Host Memory to Device Memory.
  4. Launch Kernel with appropriate Number of Blocks and Threads Per Block.
  5. Copy Device Memory contents to Host Memory.
  6. Display the result.
  7. Free the allocated memory for Device variables as well as Host variables.

Please keep in mind that the Device is the CUDA-capable GPU card together with its memory, and the Host is the CPU of the laptop/desktop PC together with its memory. A Kernel Launch is the call to the function (the kernel) that you want to execute on the Device. The launch takes two required configuration parameters, which are crucial for running your code in parallel and efficiently: the number of Blocks and the number of Threads per Block. Overall performance depends on this configuration, so changing it can change the performance. Because this example only adds two numbers, I have set both the number of Blocks and the number of Threads per Block to 1. You may change these values, but note that with more than one thread every thread would update the same memory location, so the result of this particular kernel would no longer be well defined.
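To make the effect of the launch configuration concrete, here is a minimal sketch that adds two arrays instead of two single numbers, so that more than one Block and Thread is actually useful. The kernel name AddVectorsCUDA, the helper blocksFor, the array size of 1000 and the choice of 256 Threads per Block are all assumptions made for illustration; none of them come from the original code.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel (not from the original post): each thread adds one pair of elements.
__global__ void AddVectorsCUDA(const int *a, const int *b, int *c, int n)
{
    // Global index of this thread across all blocks.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // The last block may contain threads past the end of the data.
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: number of blocks needed to cover n elements.
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;  // Ceiling division.
}

int main()
{
    const int n = 1000;               // Assumed problem size.
    const int threadsPerBlock = 256;  // Assumed value; multiples of 32 are a common choice.
    const int blocks = blocksFor(n, threadsPerBlock);

    int h_a[n], h_b[n], h_c[n];
    for (int i = 0; i < n; ++i) {
        h_a[i] = i;
        h_b[i] = 2 * i;
    }

    int *d_a, *d_b, *d_c;
    size_t bytes = n * sizeof(int);
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // First launch parameter: number of Blocks. Second: Threads per Block.
    AddVectorsCUDA<<<blocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("blocks = %d, last element = %d\n", blocks, h_c[n - 1]);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
```

With these assumed values, 1000 elements need 4 Blocks of 256 Threads, which gives 1024 threads in total; the `if (i < n)` guard keeps the 24 surplus threads from touching memory outside the arrays.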

Refer to the following code to get a basic idea of the above discussion. You may also refer to the GitHub repository for the same code.


#include <stdio.h>

__global__ void AddIntsCUDA(int *a, int *b) //Kernel Definition
{
    *a = *a + *b; //Store the sum back into the first operand
}

int main()
{
    int a = 5, b = 9;
    int *d_a, *d_b; //Device variable Declaration

    //Allocation of Device Variables
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));

    //Copy Host Memory to Device Memory
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    //Launch Kernel with 1 Block and 1 Thread per Block
    AddIntsCUDA<<<1, 1>>>(d_a, d_b);

    //Copy Device Memory to Host Memory
    cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost);

    printf("The answer is %d\n", a);

    //Free Device Memory (the Host variables live on the stack, so they need no explicit free)
    cudaFree(d_a);
    cudaFree(d_b);

    return 0;
}
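The CUDA runtime calls used in this program return a status code, and a failed kernel launch does not stop the program on its own. As a hedged sketch only, the same steps could be wrapped with basic error checking as shown below. The macro CUDA_CHECK and the function addOnDevice are hypothetical names introduced here for illustration; cudaGetLastError and cudaGetErrorString are standard CUDA runtime calls.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA API): stop with a message if a call fails.
#define CUDA_CHECK(call)                                           \
    do {                                                           \
        cudaError_t err = (call);                                  \
        if (err != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",           \
                    __FILE__, __LINE__, cudaGetErrorString(err));  \
            exit(EXIT_FAILURE);                                    \
        }                                                          \
    } while (0)

__global__ void AddIntsCUDA(int *a, int *b)
{
    *a = *a + *b;
}

// Hypothetical wrapper: runs the allocate/copy/launch/copy/free steps for one pair of integers.
int addOnDevice(int a, int b)
{
    int *d_a, *d_b;
    CUDA_CHECK(cudaMalloc((void **)&d_a, sizeof(int)));
    CUDA_CHECK(cudaMalloc((void **)&d_b, sizeof(int)));

    CUDA_CHECK(cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice));

    AddIntsCUDA<<<1, 1>>>(d_a, d_b);
    CUDA_CHECK(cudaGetLastError());  // Reports launch errors such as an invalid configuration.

    // This copy waits for the kernel to finish, so errors raised during execution surface here.
    CUDA_CHECK(cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost));

    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    return a;
}

int main()
{
    printf("The answer is %d\n", addOnDevice(5, 9));
    return 0;
}
```

Exiting on the first error is a design choice that keeps the sketch short; a larger program might instead return the error code to the caller and release any Device memory that was already allocated.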

==>Posted By Yogesh B. Desai

