Wednesday 27 May 2015

One Dimensional (1D) Image Convolution in CUDA by using TILES

          Tiled algorithms are a special case in CUDA because tiling lets us optimize an implementation for the GPU hardware that is actually present in the system. The strategy is very useful when we want to make maximum use of the available GPU. It has several advantages over a naive CUDA implementation, such as improved effective memory bandwidth and a reduced number of memory read/write operations. A tiled implementation uses the shared memory available on the GPU, which is much faster than global memory. A naive CUDA implementation uses only global memory for all reads and writes, so when these memory operations are very numerous, most of the time is spent just moving data, which results in poor performance.
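          To make the contrast concrete, here is a minimal sketch of what such a naive kernel looks like (my own illustration, reusing the names and defines of the tiled code later in this post; the actual naive implementation is in the previous post). Every mask tap is a separate read from global memory:

__global__ void convolution_1D_naive(float *N,float *M,float *P)
{
 int i=blockIdx.x*blockDim.x+threadIdx.x;
 if(i<width)
 {
  float Pvalue=0.0f;
  for(int j=0;j<Mask_width;j++)
  {
   int k=i-Mask_radius+j;     //input pixel for this mask tap
   if(k>=0 && k<width)        //zero padding at the image borders
   {
    Pvalue+=M[j]*N[k];        //one global-memory read per tap, no reuse
   }
  }
  P[i]=Pvalue;
 }
}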

          The steps to convert a naive CUDA implementation into its TILED CUDA implementation are as follows:
  • Identify how global memory is accessed in the naive CUDA implementation through the grid, block and thread organisation.
  • Choose a suitable TILE to hold the data. The tile lives in shared memory and is filled by copying from global memory (see the note on tile size after this list).
  • Modify the naive CUDA implementation accordingly so that it works on the tile.
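          One point worth spelling out before the example: to produce O_Tile_Width output pixels, a block must also load Mask_radius extra neighbouring pixels on each side of the tile, i.e. (Mask_width-1) extra pixels in total, the so-called halo. That is why the code below sizes the tile and the thread block as

#define Block_width (O_Tile_Width+(Mask_width-1))

and why only the first O_Tile_Width threads of each block go on to write an output pixel; the remaining threads exist only to help fill the halo.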
          You may check the naive CUDA implementation of 1D Image convolution here. Now we will discuss the implementation of 1D Image Convolution using TILES. Let's assume the Mask is 1D with size 3, and the Image is also a 1D array of size 5. Each thread reads the complete Mask and the related Image pixels to calculate one output pixel. So in this case each thread accesses as many Image pixels as the Mask size, i.e. 3 Image pixels per output pixel, and neighbouring threads re-read the same Image pixels from global memory. The total number of memory reads is therefore the product of the number of threads created and the memory locations accessed by one thread, while the total number of memory writes equals the number of threads created in the program. For example, with 5 threads in execution, the total number of memory reads will be 5 x 3 = 15, and the total number of memory writes will be 5.
          But that is the naive convolution, where we do not worry about the memory reads/writes. If we use TILES to improve the same algorithm, we first store the Image pixels in a TILE in shared memory and then perform the calculations required to generate the output pixels. Each thread now performs at most one global-memory read (its tile load); all of the mask-window reads in the compute phase are served from fast shared memory instead. Comparing with the same example as above with TILE size 3: the grid has ceil(5 / 3) = 2 blocks of Block_width = 5 threads, so there are at most 2 x 5 = 10 global loads (and only 7 of them actually fall inside the image) instead of 15. As you can see, tiling reduces global-memory accesses significantly, and the saving grows with the mask size.
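          As a worked trace of the tile-load phase in the code below (out-of-range positions load 0):

Block 0: threads 0..4 -> index_in_x = -1..3 -> N_shared = { 0, N[0], N[1], N[2], N[3] }
Block 1: threads 0..4 -> index_in_x =  2..6 -> N_shared = { N[2], N[3], N[4], 0, 0 }

Only 7 of these 10 tile loads fall inside the image and touch global memory; every read in the compute phase is then served from shared memory.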
          Please refer to the following code to understand the implementation of TILES in 1D Image Convolution. You can also get it from my GitHub repository.




#include <stdio.h>
#include <cuda.h>
#include <cuda_runtime_api.h>
#include <stdlib.h>


#define O_Tile_Width 3                              //output pixels computed per block
#define Mask_width 3                                //convolution mask size
#define width 5                                     //image size
#define Block_width (O_Tile_Width+(Mask_width-1))   //tile size = outputs + halo
#define Mask_radius (Mask_width/2)


__global__ void convolution_1D_tiled(float *N,float *M,float *P)
{
 //Index of the output pixel this thread produces, and the corresponding
 //input index shifted back by the mask radius (start of the halo)
 int index_out_x=blockIdx.x*O_Tile_Width+threadIdx.x;
 int index_in_x=index_out_x-Mask_radius;
 __shared__ float N_shared[Block_width];
 float Pvalue=0.0f;

 //Load Data into shared Memory (into TILE); positions outside the
 //image are padded with zeros
 if((index_in_x>=0)&&(index_in_x<width))
 {
  N_shared[threadIdx.x]=N[index_in_x];
 }
 else
 {
  N_shared[threadIdx.x]=0.0f;
 }
 __syncthreads();

 //Calculate Convolution (Multiply TILE and Mask Arrays). Only the first
 //O_Tile_Width threads of a block produce an output pixel, and the
 //index_out_x<width check stops the last block from writing past P
 if(threadIdx.x<O_Tile_Width && index_out_x<width)
 {
  for(int j=0;j<Mask_width;j++)
  {
   Pvalue+=M[j]*N_shared[j+threadIdx.x];
  }
  P[index_out_x]=Pvalue;
 }
}

int main()
{
 float * input;
 float * Mask;
 float * output;

 float * device_input;
 float * device_Mask;
 float * device_output;

 input=(float *)malloc(sizeof(float)*width);
 Mask=(float *)malloc(sizeof(float)*Mask_width);
 output=(float *)malloc(sizeof(float)*width);

 for(int i=0;i<width;i++)
 {
  input[i]=1.0;
 }

 for(int i=0;i<Mask_width;i++)
 {
  Mask[i]=1.0;
 }
  printf("\nInput:\n");
  for(int i=0;i<width;i++)
  {
   printf(" %0.2f\t",*(input+i));
  }
  printf("\nMask:\n");
   for(int i=0;i<Mask_width;i++)
   {
    printf(" %0.2f\t",*(Mask+i));
   }

 cudaMalloc((void **)&device_input,sizeof(float)*width);
 cudaMalloc((void **)&device_Mask,sizeof(float)*Mask_width);
 cudaMalloc((void **)&device_output,sizeof(float)*width);

 cudaMemcpy(device_input,input,sizeof(float)*width,cudaMemcpyHostToDevice);
 cudaMemcpy(device_Mask,Mask,sizeof(float)*Mask_width,cudaMemcpyHostToDevice);

 //One thread per tile element (outputs + halo); ceil(width/O_Tile_Width)
 //blocks are enough to cover all output pixels
 dim3 dimBlock(Block_width,1,1);
 dim3 dimGrid((((width-1)/O_Tile_Width)+1),1,1);
 convolution_1D_tiled<<<dimGrid,dimBlock>>>(device_input,device_Mask,device_output);

 cudaMemcpy(output,device_output,sizeof(float)*width,cudaMemcpyDeviceToHost);

 printf("\nOutput:\n");
 for(int i=0;i<width;i++)
 {
  printf(" %0.2f\t",*(output+i));
 }

 cudaFree(device_input);
 cudaFree(device_Mask);
 cudaFree(device_output);
 free(input);
 free(Mask);
 free(output);

printf("\n\nNumber of Blocks: %d ",dimGrid.x);
printf("\n\nNumber of Threads Per Block: %d ",dimBlock.x);

return 0;
}




The expected output of the above code is shown below.
          You may change the values of the Input Image and Mask. I have taken all values as 1 for the simplicity of the example.

SAMPLE OUTPUT:


(Parameters: O_Tile_Width = 3, Mask_width = 3, width = 5)

Input:
 1.00  1.00  1.00  1.00  1.00 
Mask:
 1.00  1.00  1.00 
Output:
 2.00  3.00  3.00  3.00  2.00 

Number of Blocks: 2 

Number of Threads Per Block: 5
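          You can verify these numbers by hand: with a mask of all 1s, every output pixel is just the sum of three inputs, with zeros padded beyond the borders:

P[0] = 1*0 + 1*1 + 1*1 = 2.00
P[1] = 1*1 + 1*1 + 1*1 = 3.00
P[2] = 1*1 + 1*1 + 1*1 = 3.00
P[3] = 1*1 + 1*1 + 1*1 = 3.00
P[4] = 1*1 + 1*1 + 1*0 = 2.00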
 



Posted By ==> Yogesh Desai

Previous Post: One Dimensional Image Convolution in CUDA

Next Post: Two Dimensional Image Convolution in CUDA: A Basic Approach



