Friday 17 July 2015

Vector Arithmetic Operations in CUDA

After learning how to add two numbers, the next step is to add and subtract two vectors. The only differences between the two programs are the amount of memory required to store the vectors and the kernel functions that perform the work.

The steps are listed below:
  1. Declare device variables for the vectors.
  2. Allocate memory for the device variables.
  3. Copy host memory to device memory.
  4. Launch the kernel with an appropriate number of blocks and threads per block.
  5. Copy the device memory contents back to host memory.
  6. Display the result.
  7. Free the allocated device memory (host arrays declared on the stack need no explicit free).


For this example I have taken the number of blocks as 2 and the number of threads per block as 3; you can change these as needed. Note that 2 × 3 = 6 threads are launched for a 5-element vector, so each kernel thread must check that its computed index is within bounds. For very large vectors, the same code can be tuned for maximum performance using shared memory and constant memory. I have used simple one-dimensional vectors to demonstrate the example.

You can refer to the sample code given below, or download it from my GitHub repository.


/*Title: Vector addition and subtraction in CUDA.
A simple way to understand how CUDA can be used to perform arithmetic operations.
*/
#include<iostream>
#include<stdio.h>
#include<cuda.h>
#include<cuda_runtime_api.h>
using namespace std;
#define size 5

//Global functions
__global__ void AddIntsCUDA(int *a, int *b)
{
 int tid = blockIdx.x * blockDim.x + threadIdx.x;
 if (tid < size)              //Guard: 2 blocks x 3 threads = 6 threads, but only 5 elements
  a[tid] = a[tid] + b[tid];
}

__global__ void SubIntsCUDA(int *a, int *b)
{
 int tid = blockIdx.x * blockDim.x + threadIdx.x;
 if (tid < size)
  b[tid] = a[tid] - b[tid];
}
//********************************************************
int main()
{
 int a[size] = {1, 2, 3, 4, 5}, b[size] = {1, 2, 3, 4, 5}; //Vector declaration and definition
 int sum[size], diff[size];                                //Host buffers for the results
 int *d_a, *d_b;

 //Allocation of Device variables
 cudaMalloc((void **)&d_a, sizeof(int) * size);
 cudaMalloc((void **)&d_b, sizeof(int) * size);

 //Copy Host Memory to Device Memory
 cudaMemcpy(d_a, a, sizeof(int) * size, cudaMemcpyHostToDevice);
 cudaMemcpy(d_b, b, sizeof(int) * size, cudaMemcpyHostToDevice);

 //Launch addition kernel: d_a becomes a + b
 AddIntsCUDA<<<2, 3>>>(d_a, d_b);

 //Copy Device Memory to Host Memory
 cudaMemcpy(sum, d_a, sizeof(int) * size, cudaMemcpyDeviceToHost);

 //Restore d_a, then launch the subtraction kernel: d_b becomes a - b
 cudaMemcpy(d_a, a, sizeof(int) * size, cudaMemcpyHostToDevice);
 SubIntsCUDA<<<2, 3>>>(d_a, d_b);
 cudaMemcpy(diff, d_b, sizeof(int) * size, cudaMemcpyDeviceToHost);

 cout << "The answers are" << endl;
 for (int i = 0; i < size; i++)
 {
  printf("sum[%d]=%d  diff[%d]=%d\n", i, sum[i], i, diff[i]);
 }

 //Deallocate the Device Memory (a and b live on the stack, so no free() is needed)
 cudaFree(d_a);
 cudaFree(d_b);

 return 0;
}




Posted By Yogesh B. Desai
