Addition is one of the most basic arithmetic operations, and performing it in the C language is a simple task. In this post, we will convert C code for addition into CUDA code. The steps to remember when writing a CUDA version of any program are as follows:
- Declare Device variables.
- Allocate memory for Device variables.
- Copy Host memory to Device memory.
- Launch the kernel with an appropriate number of blocks and threads per block.
- Copy Device memory contents back to Host memory.
- Display the result.
- Free the allocated memory for Device variables as well as Host variables.
Please keep in mind that the Device is the CUDA-capable GPU card, and the Host is your laptop or desktop PC. A kernel launch is the call to the function/procedure that you want to execute on the Device (GPU card). It accepts two launch parameters that are crucial for running your code in parallel efficiently: the number of blocks and the number of threads per block. Overall performance depends on this configuration, so changing it may change the performance. Since this example just adds two numbers, I have set both the number of blocks and the number of threads per block to 1. You may change them.
Refer to the following code for a basic idea of the discussion above. You may also refer to the GitHub repository for the same.
#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>

// Kernel definition: runs on the Device (GPU)
__global__ void AddIntsCUDA(int *a, int *b)
{
    *a = *a + *b;
}

int main()
{
    int a = 5, b = 9;
    int *d_a, *d_b; // Device variable declaration

    // Allocate Device memory
    cudaMalloc((void **)&d_a, sizeof(int));
    cudaMalloc((void **)&d_b, sizeof(int));

    // Copy Host memory to Device memory
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    // Launch the kernel with 1 block of 1 thread
    AddIntsCUDA<<<1, 1>>>(d_a, d_b);

    // Copy Device memory back to Host memory
    cudaMemcpy(&a, d_a, sizeof(int), cudaMemcpyDeviceToHost);
    printf("The answer is %d\n", a);

    // Free Device memory
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
Posted by Yogesh B. Desai