
Comparison of computation of the Cahn-Hilliard equation using a single CPU core and a single GPU

    Wenjia Wang


    I. ABSTRACT

The goal is to compare the performance and accuracy of simulations of the Cahn-Hilliard equation using a single CPU core and a single GPU. The Cahn-Hilliard equation describes the phase separation of a binary fluid system in which diffusion is the major transport phenomenon. It is also possible to use this model to describe the phase separation of a solid-state binary alloy. The linearized Cahn-Hilliard equation is given as follows:

\frac{\partial c}{\partial t} = M \nabla^2 \left[ \frac{W}{2}\, c(1-c)(1-2c) - \epsilon^2 \nabla^2 c \right]    (1)

where c is the concentration profile, M is the diffusion coefficient, W is a scalar constant and ε is the length of the transition region between the domains of the two phases.

The finite difference method was used to evaluate the Laplacians. In particular, the central difference scheme was used to approximate the second-order derivatives. Eq. (2) shows the 1D case; the multi-variable cases are analogous to Eq. (2):

\left( \frac{\partial^2 c}{\partial x^2} \right)_i = \frac{c_{i+1} - 2c_i + c_{i-1}}{(\Delta x)^2}    (2)

The explicit method was used to discretize the differential equation with respect to time:

\frac{\partial c}{\partial t} = \frac{c(t + \Delta t) - c(t)}{\Delta t}    (3)

    Additionally, periodic boundary conditions are applied to the computation domains.

The following parameters in Eq. (1) and Eq. (2) are constant for all three cases: the size of the unit cell is Δx = 0.1, the total evolving time is 50 s, and the initial profiles are randomly generated values centered about 0.5 with a fluctuation of 0.1.
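To make the discretization concrete, a minimal CPU sketch of one explicit time step of the 1D scheme is shown below. It follows the same two-stage structure as the kernels in the appendices, but it is only an illustration: the function name ch1d_step, its arguments and the work array mu are not taken from the report's code.

    /* One explicit Euler step of the discretized 1D Cahn-Hilliard equation.
       mu[i] holds (W/2) c(1-c)(1-2c) - eps^2 * lap(c), the bracketed term of
       Eq. (1); periodic boundaries are handled by wrapping the index. */
    void ch1d_step(float *c, float *mu, int nx,
                   float W, float M, float eps, float dx, float dt)
    {
        int i;
        for (i = 0; i < nx; i++) {
            int im = (i - 1 + nx) % nx, ip = (i + 1) % nx;        /* periodic neighbors */
            float lap_c = (c[ip] - 2.0f*c[i] + c[im]) / (dx*dx);  /* Eq. (2) */
            mu[i] = W/2 * c[i]*(1.0f - c[i])*(1.0f - 2.0f*c[i]) - eps*eps*lap_c;
        }
        for (i = 0; i < nx; i++) {
            int im = (i - 1 + nx) % nx, ip = (i + 1) % nx;
            float lap_mu = (mu[ip] - 2.0f*mu[i] + mu[im]) / (dx*dx);
            c[i] += dt * M * lap_mu;                              /* Eqs. (1) and (3) */
        }
    }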

An NVidia GeForce 8600M GT and an NVidia Tesla C1060 are the GPUs used to run the CUDA code. They don't support double precision, so all computations on the CPUs and the GPUs are done with single-precision floating-point numbers. The 2D and 3D cases run significantly faster on the GPUs than on the CPUs. However, since the aforementioned two classes of GPUs don't have full-precision FMAD, they always round down when multiplying floating-point numbers, so considerable errors arise when the time steps are fine.

    II. 1D CAHN-HILLIARD EQUATION

For the 1D case, a 200-unit stencil was used. The parameters in Eq. (1) and Eq. (3) are defined as follows: W = 1.0, M = 1.0, ε = 0.1 and Δt = 1×10⁻³. The CUDA kernel


code utilizes registers: each thread fetches the value of a unit cell and its two neighbors from global memory into registers for reuse, thus reducing the latency of accessing the global memory. The memory access pattern could be further optimized by using shared memory for each thread block, but since shared memory is only accessible to threads within the block it is assigned to, synchronization problems would require cumbersome manual adjustment of the code for different classes of GPUs.
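The 1D kernel itself is not listed in the appendices; the following is a minimal sketch of the register-based scheme just described, written in the style of the two-stage kernels of Appendix 1. The kernel name stage1_1d and the passing of W, eps and dx as arguments (rather than as #defines) are assumptions.

    /* Stage 1 of a 1D step: each thread loads its own cell and its two
       neighbors from global memory into registers and evaluates
       dfdc = (W/2) c(1-c)(1-2c) - eps^2 (c[i+1] - 2c[i] + c[i-1]) / dx^2.
       The array is assumed to carry one ghost cell on each side for the
       periodic boundary, as in the 2D and 3D kernels. */
    __global__ void stage1_1d(const float* c0D, float* dfdc, int n,
                              float W, float eps, float dx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  /* skip the left ghost cell */
        if (i > n) return;                                  /* extra threads are idle */

        float cl = c0D[i - 1];   /* neighbor values held in registers for reuse */
        float cc = c0D[i];
        float cr = c0D[i + 1];

        dfdc[i] = W/2 * cc * (1.0f - cc) * (1.0f - 2.0f*cc)
                - eps*eps * (cr - 2.0f*cc + cl) / (dx*dx);
    }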

The C code running on one core of a 2.5 GHz Intel Core 2 Duo CPU used less than 1 s. The resolution of the difftime() function in time.h is on the order of a second, which isn't enough to return a more precise time used to evolve the stencil. The CUDA code running on an NVidia GeForce 8600M GT used 481 ms. There are errors between the results obtained using the CPU and the GPU, which are tabulated as follows:

Table 1. Distribution of errors between 1D results obtained from CPU & GPU

    range of error            percentage
    error ≥ 10⁻⁶              38.5%
    10⁻⁷ ≤ error < 10⁻⁶       39.5%
    10⁻⁸ ≤ error < 10⁻⁷       16.5%
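The code used to compare the CPU and GPU result files is not listed in the report. A simple way to tally such a distribution is sketched below; the absolute per-cell difference is assumed as the error measure, and the helper name error_histogram is hypothetical.

    #include <math.h>
    #include <stdio.h>

    /* Bin the per-cell absolute differences between the CPU result cCPU and
       the GPU result cGPU (both of length n) into the bands of Table 1. */
    void error_histogram(const float* cCPU, const float* cGPU, int n)
    {
        int ge_1e6 = 0, ge_1e7 = 0, ge_1e8 = 0;
        for (int i = 0; i < n; i++) {
            float err = fabsf(cCPU[i] - cGPU[i]);
            if      (err >= 1e-6f) ge_1e6++;
            else if (err >= 1e-7f) ge_1e7++;
            else if (err >= 1e-8f) ge_1e8++;
        }
        printf("error >= 1e-6        : %5.1f%%\n", 100.0f * ge_1e6 / n);
        printf("1e-7 <= error < 1e-6 : %5.1f%%\n", 100.0f * ge_1e7 / n);
        printf("1e-8 <= error < 1e-7 : %5.1f%%\n", 100.0f * ge_1e8 / n);
    }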

    III. 2D CAHN-HILLIARD EQUATION

FIG. 1. An (8+2)×(8+2) array in shared memory. The light green part maps to the values that the threads are processing and the dark green parts (halos) map to the values of the neighbors outside the block.

The 2D case is a 200×200-unit-cell concentration profile, with W = 4.0, M = 1.0, ε = 0.4 and Δt = 1×10⁻⁵.

The kernel of the CUDA code for the 2D case is written to perform better on a GeForce 8600M GT GPU, which has 32 streaming processors (cores). The warp size of the GPU is 32, so the number of threads within a block should be a multiple of 32. Taking the above into consideration, the block size was chosen to be 8×16 and the grid size was chosen to be 25×7. Thus the 200×200 profile can be covered by invoking the grid twice (8×25 = 200 rows; 16×7 = 112 and 112×2 = 224 columns, so the unused 24 columns are padding waste). A sketch of the launch configuration is given below.
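The kernel signatures follow Appendix 1; the wrapper function evolve_2d and its variable names are assumptions.

    /* Launch configuration for the 2D case on the GeForce 8600M GT:
       8 x 16 = 128 threads per block (a multiple of the warp size 32) and
       25 x 7 = 175 blocks per launch.  Each stage-1/stage-2 pair advances
       the 200 x 200 profile by one time step; the kernels themselves sweep
       an upper and a lower half of the domain (see Appendix 1). */
    void evolve_2d(float* c0D, float* dfdc, int steps)
    {
        int cell_x = 200 + 2;   /* one ghost cell per side for the periodic boundary */
        int cell_y = 200 + 2;
        dim3 dB(8, 16);         /* block size */
        dim3 dG(25, 7);         /* grid size  */

        for (int n = 0; n < steps; n++) {
            stage1<<<dG, dB>>>(c0D, dfdc, cell_x, cell_y);
            stage2<<<dG, dB>>>(c0D, dfdc, cell_x, cell_y);
        }
    }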


    FIG. 2. The initial 2D profile

In the kernel, each block is assigned an (8+2)×(16+2) array of floating-point numbers in shared memory, in which each of the central 8×16 units stores the value of the concentration profile that the corresponding thread maps to, so that the latency caused by access to the global memory can be greatly reduced. The outer padding of the array is used to store the neighboring values of the central 8×16 units that are necessary to perform differentiation using the central difference scheme. The results are shown in Fig. 3.

a. C code after 5×10⁴ steps        a′. CUDA code after 5×10⁴ steps


b. C code after 2×10⁵ steps        b′. CUDA code after 2×10⁵ steps

c. C code after 7×10⁵ steps        c′. CUDA code after 7×10⁵ steps

d. C code after 5×10⁶ steps        d′. CUDA code after 5×10⁶ steps

FIG. 3. Results of CPU computation (a, b, c, d) and GPU computation (a′, b′, c′, d′)


In FIG. 3, results a, b and c were obtained with one core of a 2.5 GHz Intel Core 2 Duo and result d was obtained with one core of an AMD 1.4 GHz Opteron 240 cluster; results a′, b′ and c′ were obtained with a GeForce 8600M GT and result d′ was obtained with a Tesla C1060. Times and errors are shown as follows:

Table 2. Time used to evolve the 2D profile using CPU & GPU

    steps     time for CPU (seconds)               time for GPU (seconds)
    5×10⁴     147 (2.5 GHz Intel Core 2 Duo)       29 (GeForce 8600M GT)
    2×10⁵     594 (2.5 GHz Intel Core 2 Duo)       120 (GeForce 8600M GT)
    7×10⁵     2063 (2.5 GHz Intel Core 2 Duo)      398 (GeForce 8600M GT)
    5×10⁶     17980 (AMD 1.4 GHz Opteron 240)      435 (Tesla C1060)

Table 3. Distribution of errors between 2D results obtained from CPU & GPU

    range of error      5×10⁴ steps   2×10⁵ steps   7×10⁵ steps   5×10⁶ steps (Opteron & Tesla)
    error ≥ 5×10⁻⁴      0%            0%            2.31%         7.10%
    error ≥ 1×10⁻⁴      0%            1.12%         45.66%        29.14%
    error ≥ 5×10⁻⁵      0%            15.69%        64.84%        44.46%
    error ≥ 1×10⁻⁵      0%            78.09%        91.68%        70.62%

It can be seen that the Tesla C1060 has a higher computation speed (it has 240 CUDA cores, whereas the GeForce 8600M GT has only 32) and better precision.

    IV. 3D CAHN-HILLIARD EQUATION

The 3D case is an extension of the 2D case. The concentration profile is defined on a 128 by 128 by 128 domain. The parameters are defined as follows: W = 4.0, M = 1.0, ε = 0.04 and Δt = 5×10⁻⁶. The kernel was written to perform better on a Tesla C1060 GPU. The block size was chosen to be 32×16×1 and the grid size was chosen to be 4×8×1, so a single time step requires 128 passes of the grid to sweep through the complete profile. The C code was run on a single core of an Intel Xeon E3-1230 quad-core processor, which has a clock frequency of 3.2 GHz.
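For reference, the indexing convention shared by the host code and the kernels in Appendix 2 can be written as a small helper (the function idx3d is hypothetical; the actual code inlines this arithmetic).

    /* The 128^3 profile is stored in a (128+2)^3 array with one ghost cell on
       each side for the periodic boundary; a cell (x, y, z) with
       1 <= x, y, z <= 128 sits at the flat index below (idm in Appendix 2).
       Each pass of the 4 x 8 grid of 32 x 16 x 1 blocks covers one full
       128 x 128 xy-slice (32*4 = 128, 16*8 = 128), so 128 passes in z are
       needed to sweep the whole volume each time step. */
    __host__ __device__ int idx3d(int x, int y, int z)
    {
        const int cell = 128 + 2;          /* padded edge length */
        return z*cell*cell + y*cell + x;
    }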

In the kernel code, shared memory was extensively used to reduce the access traffic to the global memory. The sample code is attached in the Appendix. Figures 4 and 5 show the concentration profiles of the evolving structures obtained using the C and CUDA codes. The CPU took 125337 s (34.8 hours) to run 1 million steps, while the GPU took 5257 s (1.46 hours) to run 1 million steps and 52575 s (14.6 hours) to run 10 million steps.

FIG. 4. The initial 3D profile

a. C code after 5×10³ steps        a′. CUDA code after 5×10³ steps


b. C code after 2×10⁴ steps        b′. CUDA code after 2×10⁴ steps

c. C code after 2×10⁵ steps        c′. CUDA code after 2×10⁵ steps

d. C code after 1×10⁶ steps        d′. CUDA code after 1×10⁶ steps


e. CUDA code after 1×10⁷ steps

FIG. 5. The 3D result

Results were obtained from 300 steps to 1 million steps for the C code and from 300 steps to 10 million steps for the CUDA code. The time used is linear in the number of steps for both the CPU and the GPU, so the time needed to run a given number of steps is predictable for a known type of device. As shown in FIG. 6, the GPU was 23.9 times faster than the CPU in evolving the 3D profile (125337 s versus 5257 s for 1 million steps).

FIG. 7 shows the error between the results obtained using a CPU and a GPU. At 1 million steps, the errors that are greater than 0.0005 make up about 5% of the entire error profile, and the errors that are greater than 0.0001 make up about 15% of the entire error profile.


FIG. 6. Semi-log plot of the time (in seconds) used to evolve the profile vs. the number of steps (logarithmic axis) for the Tesla C1060 GPU and the Xeon E3-1230 CPU

About 50% of the errors are greater than 1×10⁻⁵. As shown in FIG. 7, the percentage of errors above each threshold is also roughly linear in the number of steps, so the error can be expected to grow linearly with the step count. Hence a device with higher precision is called for when the time step is very fine.


FIG. 7. Percentage of errors between the results obtained using CPU and GPU vs. the number of steps, for error thresholds of 5×10⁻⁴, 1×10⁻⁴, 5×10⁻⁵ and 1×10⁻⁵


    APPENDIX 1: 2D CUDA KERNEL CODE

#define W 4.0
#define DX 0.1
#define M 1.0
#define TF 50
#define EPS 0.4
#define N 200
#define TI 0
#define DT 0.00001

#define DBX 8    // 8 * 25 = 200 rows
#define DBY 16   // 16 * 13 = 200 + 8 (padding) columns
#define DGX 25
#define DGY 7    /* 25*7 = 175 blocks per kernel invoke */

__global__ void stage1 (float* c0D, float* dfdc, int cell_x, int cell_y) {

    __shared__ float temp[DBX+2][DBY+2];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;

    int ixm, iym, idm;

    /* upper half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x' = x + 1; y' = y + 1; */

    ixm = ixt + 1;
    iym = iyt + 1;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = c0D[idm];

    if (tx == 0)
        temp[tx][ty+1] = c0D[idm - cell_y];
    if (tx == DBX - 1)
        temp[tx+2][ty+1] = c0D[idm + cell_y];
    if (ty == 0)
        temp[tx+1][ty] = c0D[idm - 1];
    if (ty == DBY - 1)
        temp[tx+1][ty+2] = c0D[idm + 1];

    __syncthreads();

    dfdc[idm] = W/2 * temp[tx+1][ty+1] * (1-temp[tx+1][ty+1]) * (1-2*temp[tx+1][ty+1])
              - EPS * EPS * (temp[tx][ty+1] + temp[tx+2][ty+1]
                           + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX;

    __syncthreads();

    /* upper half of the halos */
    if (ixt == 0)
        dfdc[idm + N * cell_y] = dfdc[idm];
    if (ixt == N - 1)
        dfdc[idm - N * cell_y] = dfdc[idm];
    if (iyt == 0)
        dfdc[idm + N] = dfdc[idm];

    __syncthreads();

    /* lower half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x' = x + 1; y' = y + 1 + DGY*DBY */

    ixm = ixt + 1;
    iym = iyt + 1 + DGY * DBY;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = c0D[idm];

    if (iym < N + 1) {  /* extra threads are idle */
        if (tx == 0)
            temp[tx][ty+1] = c0D[idm - cell_y];
        if (tx == DBX - 1)
            temp[tx+2][ty+1] = c0D[idm + cell_y];
        if (ty == 0)
            temp[tx+1][ty] = c0D[idm - 1];
        if (ty == DBY - 1)
            temp[tx+1][ty+2] = c0D[idm + 1];

        __syncthreads();

        dfdc[idm] = W/2 * temp[tx+1][ty+1] * (1-temp[tx+1][ty+1]) * (1-2*temp[tx+1][ty+1])
                  - EPS * EPS * (temp[tx][ty+1] + temp[tx+2][ty+1]
                               + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX;

        __syncthreads();

        /* lower half of the halos */
        if (ixt == 0)
            dfdc[idm + N * cell_y] = dfdc[idm];
        if (ixt == N - 1)
            dfdc[idm - N * cell_y] = dfdc[idm];
        if (iym == N)
            dfdc[idm - N] = dfdc[idm];

        __syncthreads();
    }
}

__global__ void stage2 (float* c0D, float* dfdc, int cell_x, int cell_y) {

    __shared__ float temp[DBX+2][DBY+2];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;

    int ixm, iym, idm;
    float base;

    /* upper half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x' = x + 1; y' = y + 1; */

    ixm = ixt + 1;
    iym = iyt + 1;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = dfdc[idm];
    base = c0D[idm];

    if (tx == 0)
        temp[tx][ty+1] = dfdc[idm - cell_y];
    if (tx == DBX - 1)
        temp[tx+2][ty+1] = dfdc[idm + cell_y];
    if (ty == 0)
        temp[tx+1][ty] = dfdc[idm - 1];
    if (ty == DBY - 1)
        temp[tx+1][ty+2] = dfdc[idm + 1];

    __syncthreads();

    c0D[idm] = DT * M * (temp[tx][ty+1] + temp[tx+2][ty+1]
                       + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX
             + base;

    __syncthreads();

    /* upper half of the halos */
    if (ixt == 0)
        c0D[idm + N * cell_y] = c0D[idm];
    if (ixt == N - 1)
        c0D[idm - N * cell_y] = c0D[idm];
    if (iyt == 0)
        c0D[idm + N] = c0D[idm];

    __syncthreads();

    /* lower half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x' = x + 1; y' = y + 1 + DGY*DBY */

    ixm = ixt + 1;
    iym = iyt + 1 + DGY * DBY;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = dfdc[idm];
    base = c0D[idm];

    if (iym < N + 1) {  /* extra threads are idle */
        if (tx == 0)
            temp[tx][ty+1] = dfdc[idm - cell_y];
        if (tx == DBX - 1)
            temp[tx+2][ty+1] = dfdc[idm + cell_y];
        if (ty == 0)
            temp[tx+1][ty] = dfdc[idm - 1];
        if (ty == DBY - 1)
            temp[tx+1][ty+2] = dfdc[idm + 1];

        __syncthreads();

        c0D[idm] = DT * M * (temp[tx][ty+1] + temp[tx+2][ty+1]
                           + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX
                 + base;

        __syncthreads();

        /* lower half of the halos */
        if (ixt == 0)
            c0D[idm + N * cell_y] = c0D[idm];
        if (ixt == N - 1)
            c0D[idm - N * cell_y] = c0D[idm];
        if (iym == N)
            c0D[idm - N] = c0D[idm];

        __syncthreads();
    }
}


    APPENDIX 2: 3D CUDA CODE

    Thornton_3.cu

#include <cstdio>    /* fopen, fscanf, fprintf, printf, sprintf */
#include <cstdlib>   /* calloc, malloc, free */
#include <cstring>   /* strcat */

using namespace std;

#define TF 50
#define N 128
#define TI 0
#define DT 0.000005

#define DBX 32
#define DBY 16
#define DBZ 1
#define DGX 4
#define DGY 8
#define DGZ 1
/* 128 grids */

#include "Thornton_3kernel.cu"
#include "T3period.h"

int main() {

    int cell_x = N + 2;
    int cell_y = N + 2;
    int cell_z = N + 2;
    int sizeG = cell_x * cell_y * cell_z * sizeof(float);
    int step = (int)((TF - TI)/DT);
    char R[] = "/home/wenwan/CH3D/ch_3rand128.txt";

    float* c0 = (float*) calloc(cell_x * cell_y * cell_z, sizeof(float));
    float* c1 = (float*) malloc(sizeG);
    float *c0D = 0, *dfdc = 0;

    cudaMalloc((void**) &c0D, sizeG);
    cudaMalloc((void**) &dfdc, sizeG);

    dim3 dB(DBX, DBY, DBZ);
    dim3 dG(DGX, DGY, DGZ);

    FILE* read;
    FILE* ch_3cu;
    read = fopen(R, "r");

    float input, elapsedTime;
    float t_total = 0;

    int i, j, k, n, iox, ioy, ioz;

    for (k = 1; k < cell_z - 1; k++) {
        for (j = 1; j < cell_y - 1; j++) {
            for (i = 1; i < cell_x - 1; i++) {
                fscanf(read, "%f", &input);
                c0[k*cell_y*cell_x + j*cell_x + i] = input;
            }
            c0[k*cell_y*cell_x + j*cell_x] = c0[k*cell_y*cell_x + (j + 1)*cell_x - 2];
            c0[k*cell_y*cell_x + (j + 1)*cell_x - 1] = c0[k*cell_y*cell_x + j*cell_x + 1];
        }
        for (i = 1; i < cell_x - 1; i++) {
            c0[k*cell_y*cell_x + i] = c0[k*cell_y*cell_x + N*cell_x + i];
            c0[k*cell_y*cell_x + (cell_y - 1)*cell_x + i] = c0[k*cell_y*cell_x + cell_x + i];
        }
    }
    for (j = 1; j < cell_y - 1; j++) {
        for (i = 1; i < cell_x - 1; i++) {
            c0[j*cell_x + i] = c0[N*cell_y*cell_x + j*cell_x + i];
            c0[(cell_z - 1)*cell_y*cell_x + j*cell_x + i] = c0[cell_y*cell_x + j*cell_x + i];
        }
    }

    fclose(read);

    cudaMemcpy(c0D, c0, sizeG, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    /* growth */
    for (n = 0; n < step; n++) {
        if ( (n == 300) || (n == 600) || (n == 1000) || (n == 2000)
          || (n == 3000) || (n == 4000) || (n == 5000) || (n == 6000)
          || (n == 10000) || (n == 20000) || (n == 30000) || (n == 60000)
          || (n == 100000) || (n == 200000) || (n == 300000) || (n == 400000)
          || (n == 500000) || (n == 600000) || (n == 700000) || (n == 800000)
          || (n == 900000) || (n == 1000000) || (n == 2000000) || (n == 3000000)
          || (n == 4000000) || (n == 5000000) || (n == 6000000)
          || (n == 7000000) || (n == 8000000) || (n == 9000000) ) {

            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
            cudaEventElapsedTime(&elapsedTime, start, stop);
            t_total += elapsedTime;
            cudaMemcpy(c1, c0D, sizeG, cudaMemcpyDeviceToHost);

            char strnum[64];
            sprintf(strnum, "%d", n);
            strcat(strnum, ".txt");
            period(n, strnum, t_total, c1);
            cudaEventRecord(start, 0);
        }
        stage1<<<dG, dB>>>(c0D, dfdc, cell_x, cell_y);
        stage2<<<dG, dB>>>(c0D, dfdc, cell_x, cell_y);
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    t_total += elapsedTime;
    printf("\n----- Elapsed time after 10,000,000 steps: %12.10f ms -----\n\n", t_total);

    cudaMemcpy(c1, c0D, sizeG, cudaMemcpyDeviceToHost);
    ch_3cu = fopen("/home/wenwan/CH3D/ch3cu128004complete.txt", "wb");

    for (ioz = 1; ioz < cell_z - 1; ioz++)
        for (ioy = 1; ioy < cell_y - 1; ioy++)
            for (iox = 1; iox < cell_x - 1; iox++)
                fprintf(ch_3cu, "%12.10f\n", c1[ioz*cell_y*cell_x + ioy*cell_x + iox]);

    fclose(ch_3cu);

    cudaFree(c0D); cudaFree(dfdc);
    free(c0); free(c1);

    return 0;
}

    Thornton_3kernel.cu

#ifndef _THORNTON_3KERNEL_H_
#define _THORNTON_3KERNEL_H_

#define W 4.0
#define DX 0.1
#define M 1.0
#define TF 50
#define EPS 0.04
#define N 128
#define TI 0
#define DT 0.000005

#define DBX 32
#define DBY 16
#define DBZ 1
#define DGX 4
#define DGY 8
#define DGZ 1

__global__ void stage1 (float* c0D, float* dfdc, int cell_x, int cell_y) {

    __shared__ float temp[DBX+2][DBY+2];
    __shared__ float top[DBX][DBY];
    __shared__ float bottom[DBX][DBY];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int tz = threadIdx.z;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;
    const int izt = blockIdx.z * blockDim.z + tz;

    int ixm, iym, izm, idm, count;

    for (count = 0; count < 128; count++) {

        ixm = ixt + 1;
        iym = iyt + 1;
        izm = izt + 1 + count*DGZ*DBZ;
        idm = izm*cell_y*cell_x + iym*cell_x + ixm;

        temp[tx+1][ty+1] = c0D[idm];
        top[tx][ty] = c0D[idm - cell_y*cell_x];
        bottom[tx][ty] = c0D[idm + cell_y*cell_x];

        if (tx == 0) {
            temp[tx][ty+1] = c0D[idm - 1];
        }
        if (tx == DBX - 1) {
            temp[tx+2][ty+1] = c0D[idm + 1];
        }
        if (ty == 0) {
            temp[tx+1][ty] = c0D[idm - cell_x];
        }
        if (ty == DBY - 1) {
            temp[tx+1][ty+2] = c0D[idm + cell_x];
        }

        __syncthreads();

        dfdc[idm] = W/2 * temp[tx+1][ty+1] * (1-temp[tx+1][ty+1]) * (1-2*temp[tx+1][ty+1])
                  - EPS * EPS * (temp[tx][ty+1] + temp[tx+2][ty+1]
                               + temp[tx+1][ty] + temp[tx+1][ty+2] + top[tx][ty]
                               + bottom[tx][ty] - 6*temp[tx+1][ty+1])/DX/DX;

        __syncthreads();

        /* halos */
        if (ixt == 0)
            dfdc[idm + N] = dfdc[idm];
        if (ixt == N - 1)
            dfdc[idm - N] = dfdc[idm];
        if (iyt == 0)
            dfdc[idm + N * cell_x] = dfdc[idm];
        if (iyt == N - 1)
            dfdc[idm - N * cell_x] = dfdc[idm];
        if (izm == 1)
            dfdc[idm + N * cell_y * cell_x] = dfdc[idm];
        if (izm == N)
            dfdc[idm - N * cell_y * cell_x] = dfdc[idm];

        __syncthreads();
    }
}

__global__ void stage2 (float* c0D, float* dfdc, int cell_x, int cell_y) {

    __shared__ float temp[DBX+2][DBY+2];
    __shared__ float top[DBX][DBY];
    __shared__ float bottom[DBX][DBY];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int tz = threadIdx.z;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;
    const int izt = blockIdx.z * blockDim.z + tz;

    int ixm, iym, izm, idm, count;
    float base;

    for (count = 0; count < 128; count++) {

        ixm = ixt + 1;
        iym = iyt + 1;
        izm = izt + 1 + count*DGZ*DBZ;
        idm = izm*cell_y*cell_x + iym*cell_x + ixm;

        temp[tx+1][ty+1] = dfdc[idm];
        top[tx][ty] = dfdc[idm - cell_y*cell_x];
        bottom[tx][ty] = dfdc[idm + cell_y*cell_x];
        base = c0D[idm];

        if (tx == 0) {
            temp[tx][ty+1] = dfdc[idm - 1];
        }
        if (tx == DBX - 1) {
            temp[tx+2][ty+1] = dfdc[idm + 1];
        }
        if (ty == 0) {
            temp[tx+1][ty] = dfdc[idm - cell_x];
        }
        if (ty == DBY - 1) {
            temp[tx+1][ty+2] = dfdc[idm + cell_x];
        }

        __syncthreads();

        c0D[idm] = base + DT * M * (temp[tx][ty+1]
                 + temp[tx+2][ty+1] + temp[tx+1][ty] + temp[tx+1][ty+2]
                 + top[tx][ty] + bottom[tx][ty] - 6*temp[tx+1][ty+1])/DX/DX;

        __syncthreads();

        /* halos */
        if (ixt == 0)
            c0D[idm + N] = c0D[idm];
        if (ixt == N - 1)
            c0D[idm - N] = c0D[idm];
        if (iyt == 0)
            c0D[idm + N * cell_x] = c0D[idm];
        if (iyt == N - 1)
            c0D[idm - N * cell_x] = c0D[idm];
        if (izm == 1)
            c0D[idm + N * cell_y * cell_x] = c0D[idm];
        if (izm == N)
            c0D[idm - N * cell_y * cell_x] = c0D[idm];

        __syncthreads();
    }
}

#endif

    T3period.h

#ifndef _T3PERIOD_H_
#define _T3PERIOD_H_

#include <cstdio>    /* FILE, fopen, fprintf */
#include <cstdlib>
#include <cstring>   /* strcpy, strcat */

#define N 128

void period (int step, char* strstep, float elapsedTime, float* c) {

    int x, y, z;
    FILE* periodicOut;
    char filename[1024];
    strcpy(filename, "/home/wenwan/CH3D/ch3cu128004_");
    strcat(filename, strstep);

    periodicOut = fopen(filename, "wb");

    /* write out the interior of the padded profile, mirroring the output loop in main() */
    for (z = 1; z < N + 1; z++)
        for (y = 1; y < N + 1; y++)
            for (x = 1; x < N + 1; x++)
                fprintf(periodicOut, "%12.10f\n", c[z*(N+2)*(N+2) + y*(N+2) + x]);

    fclose(periodicOut);
}

#endif