Image Processing using CUDA
Anders Eklund, PhD
Virginia Tech Carilion Research Institute
Outline
• Storing an image in memory
• 2D Convolution
• Interpolation
• Calculating a similarity measure between two images
• Image registration
Storing an image
• How is an image stored in memory?
• There are at least two possibilities
• Row major order, column major order
Storing an image
• Row major order (C programming)
• A = [1 2 3] [4 5 6]
• Values are stored in memory as 1, 2, 3, 4, 5, 6
• Pixel at location (x,y) is accessed as x + y * WIDTH where WIDTH is the number of columns
Storing an image
• Column major order (Matlab)
• A = [1 2 3] [4 5 6]
• Values are stored in memory as 1, 4, 2, 5, 3, 6
• Pixel at location (x,y) is accessed as y + x * HEIGHT where HEIGHT is the number of rows
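The two index formulas above can be written as small helper functions. This is a minimal sketch; the names rowMajorIndex and colMajorIndex are chosen here for illustration and are not part of the lab code.

```cpp
#include <cstddef>

// Linear index of pixel (x, y) in a row-major image (C convention).
// width is the number of columns.
inline std::size_t rowMajorIndex(std::size_t x, std::size_t y, std::size_t width) {
    return x + y * width;
}

// Linear index of pixel (x, y) in a column-major image (Matlab convention).
// height is the number of rows.
inline std::size_t colMajorIndex(std::size_t x, std::size_t y, std::size_t height) {
    return y + x * height;
}
```

For the 2 x 3 matrix A above, the element 4 sits at (x,y) = (0,1): rowMajorIndex(0, 1, 3) gives 3 and colMajorIndex(0, 1, 2) gives 1, which are the positions of the value 4 in the two storage orders.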
Storing an image
• Why is this important?
• When reading/writing data from/to global memory it is important to use coalesced reads and writes, for optimal performance
• Coalesced operation = the threads of a warp read/write consecutive memory locations
• Use the Nvidia profiler to check for uncoalesced memory operations
Storing an image
• Assume the image is stored in row major order
• We use 2D thread blocks, 64 along x, 2 along y
• int x = blockIdx.x * blockDim.x + threadIdx.x;
• int y = blockIdx.y * blockDim.y + threadIdx.y;
• int idx = y + x * HEIGHT (wrong)
• Image[idx] = 3.0f; Uncoalesced writes
• int idx = x + y * WIDTH (correct)
• Image[idx] = 3.0f; Coalesced writes
Storing an image
• int idx = y + x * HEIGHT (wrong)
• Image[idx] = 3.0f; Uncoalesced writes
• Indices are not consecutive
• threadIdx.y = 0
• idx = 0, HEIGHT, 2*HEIGHT, 3*HEIGHT, 4*HEIGHT, …
• threadIdx.y = 1
• idx = 1, 1+HEIGHT, 1+2*HEIGHT, 1+3*HEIGHT, 1+4*HEIGHT, …
Storing an image
• int idx = x + y * WIDTH (correct)
• Image[idx] = 3.0f; Coalesced writes
• Indices are consecutive
• threadIdx.y = 0
• idx = 0, 1, 2, 3, 4, …
• threadIdx.y = 1
• idx = WIDTH, 1+WIDTH, 2+WIDTH, 3+WIDTH, …
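The index sequences above can be reproduced on the CPU. The sketch below lists the indices that the first 32 threads of block (0,0) would compute (threadIdx.y = 0, threadIdx.x = 0..31) and checks whether they are consecutive. It is a simplified model for illustration only: the function names are hypothetical, and the real hardware rules for coalescing are stated in terms of memory segments rather than strictly consecutive indices.

```cpp
#include <cstddef>
#include <vector>

// Indices computed by the first 32 threads of block (0,0) in a 64 x 2 block,
// i.e. threadIdx.y == 0 and threadIdx.x == 0..31.
std::vector<int> firstWarpIndices(bool rowMajorIndexing, int width, int height) {
    std::vector<int> indices;
    const int y = 0;
    for (int x = 0; x < 32; ++x) {
        indices.push_back(rowMajorIndexing ? x + y * width : y + x * height);
    }
    return indices;
}

// True if every index is exactly one larger than the previous one.
bool isConsecutive(const std::vector<int>& indices) {
    for (std::size_t i = 1; i < indices.size(); ++i) {
        if (indices[i] != indices[i - 1] + 1) return false;
    }
    return true;
}
```

With a 512 x 512 image, the row major formula gives 0, 1, 2, … while the column major formula gives 0, 512, 1024, …, so only the first passes isConsecutive.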
Multiplying two images
• __global__ void Multiply(float* Result, const float* Image1, const float* Image2, int DATA_W, int DATA_H)
{
    // Pixel coordinates handled by this thread
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads outside the image do nothing
    if ( (x >= DATA_W) || (y >= DATA_H) )
        return;

    // Row major index, gives coalesced reads and writes
    int idx = x + y * DATA_W;
    Result[idx] = Image1[idx] * Image2[idx];
}
Multiplying two images
• The kernel is completely bound by the memory bandwidth: each thread performs two reads and one write for a single multiplication
• Uncoalesced memory operations make a big difference
• (In this specific kernel we could have used 1D thread blocks)
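The bounds check in the kernel is needed because the grid usually covers slightly more than the image. A minimal sketch of the host-side arithmetic, assuming the number of blocks is rounded up in each dimension (blocksNeeded is a hypothetical helper, not a CUDA function):

```cpp
// Number of thread blocks needed to cover dataSize elements
// with blockSize threads per block (integer division, rounded up).
inline int blocksNeeded(int dataSize, int blockSize) {
    return (dataSize + blockSize - 1) / blockSize;
}
```

For a 500 x 500 image and 64 x 2 thread blocks this gives 8 blocks along x and 250 along y. The 8 blocks along x launch 512 threads per row, so the last 12 threads in each row fall outside the image and must return early.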
Image processing with C/C++
• We will use the CImg library to read and write images using C++ objects
• The CImg library is open source and consists of a single header file (CImg.h)
• Works on Windows, Linux, Mac
• cimg.sourceforge.net
Convolution
• Convolution = scalar product between filter values and pixel values in each neighbourhood
• Slide the filter over all pixels, save each result in the center pixel
• Note the minus signs (means that the filter is rotated 180 degrees)!
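For reference, the standard definition of 2D convolution is written out below. The symbols are assumptions made here: s is the image, f is the filter, and u and v run over the filter support, centred on zero.

```latex
(s * f)(x, y) = \sum_{u} \sum_{v} f(u, v)\, s(x - u,\; y - v)
```

The minus signs in s(x - u, y - v) are what rotate the filter 180 degrees relative to a plain correlation, which would use s(x + u, y + v).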
Task 1
• Open imageprocessing_convolution.cu and imageprocessing_kernel.cu
• Complete the code for the kernel Convolution_2D_Texture
• The code reads an image from file, sends it to the GPU, copies back the filter response, writes the filter response to a new image
• Finally, the code compares your result to a convolution performed with CImg
Constant memory
• Constant memory is normally 64 KB
• Each multiprocessor on the GPU has a constant memory cache (8 KB)
• Put the filter kernel in constant memory
• __device__ __constant__ float c_Filter_2D[11][11];
• The filter will be in the cache during the whole execution, saves reads from global memory
Texture memory
• Texture memory is cached for spatially local reads
• Hardware support for reading outside the image
• Read the value at position (x,y), value = tex2D(tex_Image, x + 0.5f, y + 0.5f);
• Note the addition of 0.5f !
Compiling the code
• See the top of each file for how to compile the code
Checking results
• Compare the images filteredImageCUDA.bmp and filteredImageCImg.bmp, difference is given in difference.bmp
• Copy images from your account to your own computer, to be able to see the images
• scp [email protected]:/home/aeklund/GPULab/*.bmp .
• display filteredImageCUDA.bmp &
Checking results
• For the convolution, the maximum error compared to CImg should be something like 0.000015
• The total error should be something like 0.09
Convolution – First part
• __global__ void Convolution_2D_Texture(float* Result, int DATA_W, int DATA_H, int FILTER_W, int FILTER_H)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if ( (x >= DATA_W) || (y >= DATA_H) )
        return;

    float sum = 0.0f;
Convolution – Second part
•   // The extra 0.5f moves the read to the center of each texel
    float yoffset = -((float)FILTER_H - 1.0f)/2.0f + 0.5f;
    for (int fy = FILTER_H-1; fy >= 0; fy--)
    {
        // Must be float: an int would truncate the half-pixel offset
        float xoffset = -((float)FILTER_W - 1.0f)/2.0f + 0.5f;
        for (int fx = FILTER_W-1; fx >= 0; fx--)
        {
            sum += tex2D(tex_Image, x + xoffset, y + yoffset) * c_Filter_2D[fy][fx];
            xoffset += 1.0f;
        }
        yoffset += 1.0f;
    }

    int idx = x + y * DATA_W;
    Result[idx] = sum;
}
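A CPU reference is handy for checking the kernel output on small inputs. The sketch below performs the same convolution in plain C++ on a row major image. Two assumptions are made here: the filter has odd width and height, and pixels outside the image count as zero (on the GPU this depends on the texture addressing mode). The name convolve2D is hypothetical and is not the CImg routine used in the lab.

```cpp
#include <vector>

// CPU reference for 2D convolution of a row-major image with an odd-sized,
// row-major filter. Pixels outside the image are treated as zero.
std::vector<float> convolve2D(const std::vector<float>& image, int width, int height,
                              const std::vector<float>& filter, int filterW, int filterH) {
    std::vector<float> result(image.size(), 0.0f);
    const int halfW = (filterW - 1) / 2;
    const int halfH = (filterH - 1) / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int fy = 0; fy < filterH; ++fy) {
                for (int fx = 0; fx < filterW; ++fx) {
                    // Minus signs: the filter is rotated 180 degrees.
                    const int px = x - (fx - halfW);
                    const int py = y - (fy - halfH);
                    if (px < 0 || px >= width || py < 0 || py >= height) continue;
                    sum += image[px + py * width] * filter[fx + fy * filterW];
                }
            }
            result[x + y * width] = sum;
        }
    }
    return result;
}
```

A quick sanity check is to convolve an impulse image (a single 1 in the center, zeros elsewhere) with the filter: the result should reproduce the filter values, not a flipped copy of them.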
Texture memory
• The texture memory has hardware support for linear interpolation
• So far we have only used the texture memory for fast reading from global memory (using the texture cache)
• Let's use the texture memory for interpolation
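What linear interpolation computes can be sketched on the CPU. The code below uses the same coordinate convention as the texture reads in these slides: pixel (i, j) has its center at (i + 0.5, j + 0.5). It assumes coordinates are clamped at the image border, and the function names are hypothetical. The GPU computes its interpolation weights with reduced precision, so small differences from this float version are to be expected.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Read a pixel from a row-major image, clamping coordinates to the border.
static float readClamped(const std::vector<float>& image, int width, int height, int x, int y) {
    x = std::min(std::max(x, 0), width - 1);
    y = std::min(std::max(y, 0), height - 1);
    return image[x + y * width];
}

// Bilinear interpolation at (xf, yf); pixel (i, j) is centered at (i + 0.5, j + 0.5).
float sampleBilinear(const std::vector<float>& image, int width, int height, float xf, float yf) {
    const float xb = xf - 0.5f;
    const float yb = yf - 0.5f;
    const int x0 = static_cast<int>(std::floor(xb));
    const int y0 = static_cast<int>(std::floor(yb));
    const float ax = xb - static_cast<float>(x0);  // weight of the right neighbour
    const float ay = yb - static_cast<float>(y0);  // weight of the lower neighbour
    const float top = (1.0f - ax) * readClamped(image, width, height, x0, y0)
                    + ax * readClamped(image, width, height, x0 + 1, y0);
    const float bottom = (1.0f - ax) * readClamped(image, width, height, x0, y0 + 1)
                       + ax * readClamped(image, width, height, x0 + 1, y0 + 1);
    return (1.0f - ay) * top + ay * bottom;
}
```

Sampling at (x + 0.5, y + 0.5) returns the pixel value unchanged, which is why the 0.5f is added in the tex2D calls. Sampling halfway between two pixel centers returns their average.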
Rotating an image
• Use the rotation matrix
$$R = \begin{pmatrix} \cos(\mathrm{angle}) & -\sin(\mathrm{angle}) \\ \sin(\mathrm{angle}) & \cos(\mathrm{angle}) \end{pmatrix}$$
to transform each (x,y) coordinate, then read from the new coordinate using texture memory
Rotating an image
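$$\begin{pmatrix} x_{new} \\ y_{new} \end{pmatrix} = \begin{pmatrix} \cos(\mathrm{angle}) & -\sin(\mathrm{angle}) \\ \sin(\mathrm{angle}) & \cos(\mathrm{angle}) \end{pmatrix} \begin{pmatrix} x_{old} \\ y_{old} \end{pmatrix}$$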
Rotating an image
• cos and sin in CUDA use double precision
• cosf and sinf use single precision
• All functions use radians and not degrees
Task 2
• Open imageprocessing_transformation.cu and imageprocessing_kernel.cu
• Complete the code for the kernel RotateImage, which rotates an image using texture memory for interpolation
• Extra task, rotate the image around the center of the image, instead of around the corner
Checking results
• Compare the images transformedImageCUDA.bmp and transformedImageCImg.bmp, difference is given in difference.bmp
Checking results
• For the rotation, the maximum error compared to CImg should be something like 5.23
• The total error should be something like 12331.2
• Interpolation in CImg is probably performed slightly differently
Rotating an image
• First part of the transformation kernel is the same as for the convolution kernel
• float xf = (float)x;
• float yf = (float)y;
• float angler = angled/180.0f*pi;
• float xnew = cosf(angler) * xf - sinf(angler) * yf;
• float ynew = sinf(angler) * xf + cosf(angler) * yf;
• value = tex2D(tex_Image, xnew + 0.5f, ynew + 0.5f);
• TransformedImage[idx] = value;
Rotating an image around its center
• float xf = (float)x - (float)IMAGE_WIDTH/2.0f;
• float yf = (float)y - (float)IMAGE_HEIGHT/2.0f;
• float angler = angled/180.0f*pi;
• float xnew = cosf(angler) * xf - sinf(angler) * yf;
• float ynew = sinf(angler) * xf + cosf(angler) * yf;
• xnew += (float)IMAGE_WIDTH/2.0f;
• ynew += (float)IMAGE_HEIGHT/2.0f;
• value = tex2D(tex_Image, xnew + 0.5f, ynew + 0.5f);
• TransformedImage[idx] = value;
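The coordinate arithmetic can be tested on the CPU before it goes into the kernel. This sketch performs the same three steps (shift the center to the origin, rotate, shift back) for a single point. Point2f and rotateAroundCenter are names made up for this example.

```cpp
#include <cmath>

struct Point2f {
    float x;
    float y;
};

// Rotate (x, y) by angleDegrees around the center of a width x height image.
Point2f rotateAroundCenter(float x, float y, float angleDegrees, float width, float height) {
    const float pi = 3.14159265358979f;
    const float angle = angleDegrees / 180.0f * pi;  // degrees to radians
    const float xc = x - width / 2.0f;               // move the center to the origin
    const float yc = y - height / 2.0f;
    Point2f p;
    p.x = std::cos(angle) * xc - std::sin(angle) * yc + width / 2.0f;
    p.y = std::sin(angle) * xc + std::cos(angle) * yc + height / 2.0f;
    return p;
}
```

Two easy checks: the center of the image maps to itself for any angle, and with a 100 x 100 image a rotation of 90 degrees sends (60, 50) to approximately (50, 60).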
Image registration
• Image registration relies on the concept of optimizing a similarity measure
• Find the translations and rotations that maximize the similarity between two images
• Normalized cross correlation (NCC) is one of the most common similarity measures
Normalized cross correlation (NCC)
• Correlation between variables x and y
Normalized cross correlation (NCC)
• NCC can be calculated using vectors x and y as
• Only three scalar products are needed, between x and y, between x and x and between y and y (remove the mean of x and y first)
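The formulas on these two slides were images and are not reproduced in the text. The standard definitions, which match the three scalar products described above, are given here as a reconstruction. The notation is an assumption: x_i and y_i are pixel values, and the bars denote mean values.

```latex
r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}
\qquad\text{and, after removing the means,}\qquad
\mathrm{NCC} = \frac{x^T y}{\sqrt{(x^T x)\,(y^T y)}}
```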
Representing an image as a long vector
CUBLAS
• CUBLAS has many functions for matrix algebra
• Matrix-matrix multiplications, matrix-vector multiplications, vector-vector multiplications
• The function cublasSdot can be used to calculate the scalar product between two vectors with float values
• Look in the CUDA documentation to see how it works
Task 3
• Open imageprocessing_similarity.cu
• Complete the code for how to calculate the correlation between two images, using the CUBLAS library
• Each image is treated as a vector of length IMAGE_WIDTH * IMAGE_HEIGHT
• The mean values have already been removed
• You have to allocate memory on the GPU and copy data to the GPU
• Your correlation value is compared to a correlation calculated using regular C code
Calculating correlation using CUBLAS
• #include <cublas_v2.h>
• First create a handle to CUBLAS, already provided in the code
• cublasStatus_t status;
• cublasHandle_t handle;
• status = cublasCreate(&handle);
• cublasStatus_t cublasSdot (cublasHandle_t handle, int n, const float *x, int incx, const float *y, int incy, float *result)
• n is the length of the vectors, in our case IMAGE_WIDTH * IMAGE_HEIGHT
• x is the pointer to the first image, d_Image1
• y is the pointer to the second image, d_Image2
• incx and incy are the strides between consecutive elements in x and y, in our case 1
• result is the pointer to the calculated scalar product
Calculating correlation using CUBLAS
• float productAB, productAA, productBB;
• status = cublasSdot(handle, IMAGE_WIDTH * IMAGE_HEIGHT, d_Image1, 1, d_Image2, 1, &productAB);
• status = cublasSdot(handle, IMAGE_WIDTH * IMAGE_HEIGHT, d_Image1, 1, d_Image1, 1, &productAA);
• status = cublasSdot(handle, IMAGE_WIDTH * IMAGE_HEIGHT, d_Image2, 1, d_Image2, 1, &productBB);
• float correlationGPU = productAB / sqrtf(productAA * productBB);
Image registration
• We now have the two most important building blocks for image registration
• Calculation of a similarity measure
• Applying a transformation to an image
Image registration
• Let's combine these two functions to perform a very simple form of image registration
• Register two images by finding the optimal rotation to apply to one of the images
Task 4
• Open imageprocessing_registration.cu and complete the code to perform a registration between the two images
• Make a for loop that in each iteration applies a rotation and calculates the correlation between the fixed image and the rotated image
• Print the rotation that gives the highest correlation
• Do not copy data to/from the GPU in each iteration!
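The search loop itself is simple and can be sketched independently of CUDA. In the sketch below the callback correlationForAngle stands in for "rotate the image on the GPU, then calculate the correlation with three cublasSdot calls". The struct, the function name and the callback are all hypothetical, and the angle range and step are left to the caller. In the real code both images stay in GPU memory for the whole loop, so only the scalar correlation comes back in each iteration.

```cpp
#include <functional>

struct RotationResult {
    float angleDegrees;
    float correlation;
};

// Try every angle from minAngle to maxAngle (inclusive) in increments of step
// and return the angle with the highest correlation.
RotationResult findBestRotation(float minAngle, float maxAngle, float step,
                                const std::function<float(float)>& correlationForAngle) {
    RotationResult best = {minAngle, correlationForAngle(minAngle)};
    const int numSteps = static_cast<int>((maxAngle - minAngle) / step);
    for (int i = 1; i <= numSteps; ++i) {
        // Computing the angle from i avoids accumulating rounding errors.
        const float angle = minAngle + static_cast<float>(i) * step;
        const float c = correlationForAngle(angle);
        if (c > best.correlation) {
            best.angleDegrees = angle;
            best.correlation = c;
        }
    }
    return best;
}
```

This is an exhaustive search over one parameter. It is fine for a single rotation angle, but it does not scale to several transformation parameters.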
Checking results
• The best rotation should be -30 degrees, giving a correlation of 0.861115