Date post: | 06-Apr-2018 |
Upload: | wenjia-wang |
8/3/2019 MSE490F11WenjiaWang
Comparison of computation of Cahn-Hilliard equation using
single CPU core and single GPU
Wenjia Wang
I. ABSTRACT
The goal is to compare the performance and accuracy of simulating the Cahn-Hilliard equation on a single CPU core and on a single GPU. The Cahn-Hilliard equation describes the phase separation of a binary fluid system in which diffusion is the major transport phenomenon. The model can also describe phase separation in a solid-state binary alloy. The linearized Cahn-Hilliard equation is given as follows:
$$\frac{\partial c}{\partial t} = M \nabla^2 \left[ \frac{W}{2}\, c(1-c)(1-2c) - \epsilon^2 \nabla^2 c \right], \tag{1}$$
where $c$ is the concentration profile, $M$ is the diffusion coefficient, $W$ is a scalar constant and $\epsilon$ is the length of the transition region between the domains of the two phases.
The finite difference method was used to evaluate the Laplacians. In particular, the central difference scheme was used to approximate the second-order derivatives. Eq. (2) shows the 1D case. The multi-variable cases are analogous to Eq. (2):
$$\frac{\partial^2 c}{\partial x_i^2} = \frac{c_{i+1} - 2c_i + c_{i-1}}{(\Delta x)^2} \tag{2}$$
An explicit method was used to discretize the differential equation with respect to time:
$$\frac{\partial c}{\partial t} = \frac{c(t + \Delta t) - c(t)}{\Delta t} \tag{3}$$
Additionally, periodic boundary conditions are applied to the computation domains.
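Eqs. (1) to (3) together with the periodic boundary conditions can be sketched on the CPU as follows. This is a minimal illustration rather than the code used in this report: the function name `ch1d_step`, its parameter list and the scratch array `mu` are assumptions, and the two-stage split (first the bracketed term of Eq. (1), then its Laplacian) simply mirrors the structure of the kernels in the Appendix.

```c
#include <assert.h>
#include <stddef.h>

/* One explicit Euler step of the 1D equation on a periodic grid of n cells.
   c is updated in place; mu is caller-provided scratch of length n.
   All names here are illustrative assumptions, not taken from the report. */
void ch1d_step(float *c, float *mu, size_t n,
               float W, float M, float eps, float dx, float dt)
{
    assert(n >= 3);
    const float inv_dx2 = 1.0f / (dx * dx);

    /* Stage 1: mu = W/2 * c(1-c)(1-2c) - eps^2 * Laplacian(c), Eq. (2) */
    for (size_t i = 0; i < n; ++i) {
        size_t l = (i + n - 1) % n;   /* periodic left neighbour  */
        size_t r = (i + 1) % n;       /* periodic right neighbour */
        float lap = (c[r] - 2.0f * c[i] + c[l]) * inv_dx2;
        mu[i] = 0.5f * W * c[i] * (1.0f - c[i]) * (1.0f - 2.0f * c[i])
                - eps * eps * lap;
    }

    /* Stage 2: c(t + dt) = c(t) + dt * M * Laplacian(mu), Eq. (3) */
    for (size_t i = 0; i < n; ++i) {
        size_t l = (i + n - 1) % n;
        size_t r = (i + 1) % n;
        c[i] += dt * M * (mu[r] - 2.0f * mu[i] + mu[l]) * inv_dx2;
    }
}
```

Because the update is the discrete Laplacian of `mu` on a periodic grid, the sum of `c` over all cells is conserved up to rounding, which gives a quick sanity check for any implementation.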
The following parameters in Eq. (1) and Eq. (2) are constant for all three cases: the size of the unit cell is $\Delta x = 0.1$, the total evolution time is 50 s, and the initial profiles are randomly generated values centered about 0.5 with a fluctuation of 0.1.
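An initial profile of this kind could be generated as sketched below. The report does not state the distribution, so the sketch assumes a uniform distribution over [0.4, 0.6], that is, 0.5 plus or minus 0.1; the function name `init_profile` and the use of `rand()` are likewise assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill c[0..n-1] with uniform random values in [centre - amp, centre + amp].
   The uniform distribution is an assumption; the report only gives the
   centre (0.5) and the fluctuation (0.1). */
void init_profile(float *c, size_t n, float centre, float amp, unsigned seed)
{
    assert(c != NULL && amp >= 0.0f);
    srand(seed);
    for (size_t i = 0; i < n; ++i) {
        float u = (float)rand() / (float)RAND_MAX;   /* u in [0, 1] */
        c[i] = centre + amp * (2.0f * u - 1.0f);
    }
}
```

Calling `init_profile(c, 200, 0.5f, 0.1f, seed)` would fill a 200-cell 1D stencil. Fixing the seed lets the CPU and GPU runs start from the same profile.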
An NVidia GeForce 8600M GT and an NVidia Tesla C1060 are the GPUs used to run the CUDA code. The GeForce 8600M GT does not support double precision, so all computations on the CPUs and the GPUs are done with single-precision floating-point numbers. The 2D and 3D cases run significantly faster on the GPUs than on the CPUs. However, since neither of these two classes of GPUs has a full-precision single-precision multiply-add (FMAD), the intermediate product is truncated rather than rounded, so considerable errors accumulate when the time steps are fine.
II. 1D CAHN-HILLIARD EQUATION
For the 1D case a 200-unit stencil was used. The parameters in Eq. (1) and Eq. (3) are defined as follows: $W = 1.0$, $M = 1.0$, $\epsilon = 0.1$ and $\Delta t = 1 \times 10^{-3}$. In the CUDA kernel code, each thread fetches the value of a unit cell and its two neighbors from global memory into registers for reuse, thus reducing the latency of accessing global memory. The memory access pattern can be further optimized by using the shared memory of each thread block, but since shared memory is only accessible to threads within the block it is assigned to, synchronization issues would require cumbersome manual adjustment of the code for different classes of GPUs.
The C code running on one core of a 2.5 GHz Intel Core 2 Duo CPU took less than 1 s. The resolution of the difftime() function in time.h is on the order of a second, which isn't enough to give a more precise time for evolving the stencil. The CUDA code running on an NVidia GeForce 8600M GT took 481 ms. There are errors between the results obtained using the CPU and the GPU, which are tabulated as follows:
Table 1. Distribution of errors between 1D results obtained from CPU & GPU
range of the error                     percentage
error > $10^{-6}$                      38.5%
$10^{-7}$ < error <= $10^{-6}$         39.5%
$10^{-8}$ < error <= $10^{-7}$         16.5%
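The one-second resolution of difftime() noted for the 1D CPU run can be sidestepped with the standard `clock()` function, which counts in units of `1/CLOCKS_PER_SEC` seconds. The sketch below is one possible way to do this, not what was used in this report; the helper names `elapsed_ms` and `time_call_ms` are assumptions. Note that `clock()` reports processor time rather than wall-clock time, which is acceptable for a single-threaded CPU run.

```c
#include <assert.h>
#include <time.h>

/* Processor time in milliseconds between two clock() readings. */
double elapsed_ms(clock_t start, clock_t stop)
{
    assert(stop >= start);
    return 1000.0 * (double)(stop - start) / (double)CLOCKS_PER_SEC;
}

/* Time one call of fn (for example, a function that evolves the stencil)
   and return the processor time it consumed, in milliseconds. */
double time_call_ms(void (*fn)(void))
{
    clock_t start = clock();
    fn();
    return elapsed_ms(start, clock());
}
```

With this, a run shorter than one second is reported in milliseconds, which is directly comparable to the 481 ms measured for the GPU.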
III. 2D CAHN-HILLIARD EQUATION
FIG. 1. A (8+2)×(8+2) array in shared memory. The light green part maps to the values that the threads are processing and the dark green parts (halos) map to the values of the neighbors outside the block.
The 2D case is a 200×200-unit-cell concentration profile, with $W = 4.0$, $M = 1.0$, $\epsilon = 0.4$ and $\Delta t = 1 \times 10^{-5}$.
The kernel of the CUDA code for the 2D case is written to perform better on a GeForce 8600M GT GPU, which has 32 streaming processors (cores). The warp size of the GPU is 32, so the number of threads within a block should be a multiple of 32. Taking the above into consideration, the block size was chosen to be 8×16 and the grid size was chosen to be 25×7. Thus the 200×200 profile can be covered by invoking the grid twice (8×25 = 200 and 16×7 = 112, 112×2 = 224; the 24 unused thread columns are padding waste).
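The arithmetic behind this launch geometry can be checked with a few lines of C. The helpers `blocks_needed` and `padding_threads` are hypothetical names introduced only for this illustration. Covering 200 cells with 16-thread blocks needs 13 blocks, so splitting the domain into two equal passes rounds this up to 2×7 = 14 blocks, which is where the 24 idle threads along that axis come from.

```c
#include <assert.h>

/* Number of blocks of `block` threads needed to cover `cells` cells. */
int blocks_needed(int cells, int block)
{
    assert(cells > 0 && block > 0);
    return (cells + block - 1) / block;   /* ceiling division */
}

/* Threads left idle along one axis when `blocks` blocks of `block`
   threads are launched to cover `cells` cells. */
int padding_threads(int cells, int block, int blocks)
{
    assert(blocks * block >= cells);
    return blocks * block - cells;
}
```

For example, `blocks_needed(200, 8)` gives the 25 blocks along the 8-thread axis, and `padding_threads(200, 16, 2 * 7)` gives the 24 idle threads quoted above.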
FIG. 2. The initial 2D profile
In the kernel, each block is assigned an (8+2)×(16+2) array of floating-point numbers in shared memory, in which each of the central 8×16 units stores the value of the concentration profile that the corresponding thread maps to, so that the latency caused by access to global memory is greatly reduced. The outer padding of the array stores the neighboring values of the central 8×16 units that are needed to differentiate with the central difference scheme. The results are shown in Fig. 3.
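The tile-plus-halo layout can be emulated on the CPU to make the index mapping explicit. The sketch below stands in for one thread block; it is not the kernel from the Appendix, and the function name `tile_laplacian` and its parameters are assumptions. Unlike the kernel, it also copies the four corner cells of the tile, which the 5-point stencil never reads.

```c
#include <assert.h>

#define BX 8    /* block rows, as chosen in the report    */
#define BY 16   /* block columns, as chosen in the report */

/* CPU emulation of one thread block: stage a (BX+2) x (BY+2) tile (centre
   plus halo) and evaluate the 5-point Laplacian of the BX x BY centre.
   g is the padded global array with row stride `stride`, (bx, by) is the
   block index, and out receives BX*BY values in row-major order. */
void tile_laplacian(const float *g, int stride, int bx, int by,
                    float dx, float *out)
{
    float tile[BX + 2][BY + 2];
    assert(stride >= BY + 2);

    /* Tile cell (i, j) maps to padded global cell (bx*BX + i, by*BY + j). */
    for (int i = 0; i < BX + 2; ++i)
        for (int j = 0; j < BY + 2; ++j)
            tile[i][j] = g[(bx * BX + i) * stride + (by * BY + j)];

    /* Every centre cell now finds all four neighbours inside the tile. */
    for (int i = 1; i <= BX; ++i)
        for (int j = 1; j <= BY; ++j)
            out[(i - 1) * BY + (j - 1)] =
                (tile[i - 1][j] + tile[i + 1][j] + tile[i][j - 1] +
                 tile[i][j + 1] - 4.0f * tile[i][j]) / (dx * dx);
}
```

Each global value is read once into the tile and then reused by up to five stencil evaluations, which is the source of the reduced global-memory traffic described above.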
FIG. 3. Results of the CPU computation (C code; panels a, b, c, d), each paired with the GPU computation (CUDA code) after the same number of steps: a. $5 \times 10^4$ steps; b. $2 \times 10^5$ steps; c. $7 \times 10^5$ steps; d. $5 \times 10^6$ steps.
In FIG. 3, CPU results a, b and c were obtained with one core of a 2.5 GHz Intel Core 2 Duo and CPU result d was obtained with one core of an AMD 1.4 GHz Opteron 240 cluster; the GPU results paired with a, b and c were obtained with a GeForce 8600M GT and the GPU result paired with d was obtained with a Tesla C1060. Times and errors are shown as follows:
Table 2. Time used to evolve the 2D profile using CPU & GPU
steps              time for CPU (seconds)               time for GPU (seconds)
$5 \times 10^4$    147 (2.5 GHz Intel Core 2 Duo)       29 (GeForce 8600M GT)
$2 \times 10^5$    594 (2.5 GHz Intel Core 2 Duo)       120 (GeForce 8600M GT)
$7 \times 10^5$    2063 (2.5 GHz Intel Core 2 Duo)      398 (GeForce 8600M GT)
$5 \times 10^6$    17980 (AMD 1.4 GHz Opteron 240)      435 (Tesla C1060)
Table 3. Distribution of errors between 2D results obtained from CPU & GPU
range of error               $5 \times 10^4$ steps   $2 \times 10^5$ steps   $7 \times 10^5$ steps   $5 \times 10^6$ steps (Opteron & Tesla)
error > $5 \times 10^{-4}$   0%                      0%                      2.31%                   7.10%
error > $1 \times 10^{-4}$   0%                      1.12%                   45.66%                  29.14%
error > $5 \times 10^{-5}$   0%                      15.69%                  64.84%                  44.46%
error > $1 \times 10^{-5}$   0%                      78.09%                  91.68%                  70.62%
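A cumulative error distribution like the one in Table 3 can be produced by counting the cells whose CPU and GPU values differ by more than a threshold. The sketch below assumes the error is the absolute difference between the two results, which the report does not state explicitly; the function name `percent_above` is also an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Percentage of cells whose absolute CPU/GPU difference exceeds threshold.
   Treating "error" as an absolute difference is an assumption. */
double percent_above(const float *cpu, const float *gpu, size_t n,
                     double threshold)
{
    assert(n > 0);
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        double diff = (double)cpu[i] - (double)gpu[i];
        if (diff < 0.0)
            diff = -diff;
        if (diff > threshold)
            ++count;
    }
    return 100.0 * (double)count / (double)n;
}
```

Calling it once per threshold (5E-4, 1E-4, 5E-5, 1E-5) on the two output profiles gives one column of the table.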
It can be seen that the Tesla C1060 computes faster (it has a total of 240 CUDA cores whereas the GeForce 8600M GT has only 32) and gives better precision.
IV. 3D CAHN-HILLIARD EQUATION
The 3D case is an extension of the 2D case. The concentration profile is defined on a 128 by 128 by 128 domain. The parameters are defined as follows: $W = 4.0$, $M = 1.0$, $\epsilon = 0.04$ and $\Delta t = 5 \times 10^{-6}$. The kernel was written to perform better on a Tesla C1060 GPU. The block size was chosen to be 32×16×1 and the grid size was chosen to be 4×8×1. A single time step requires 128 invocations of the grid to sweep through the complete profile. The C code was run on a single core of an Intel Xeon E3-1230 quad-core processor, which has a clock frequency of 3.2 GHz.
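The 3D kernels in the Appendix address a padded (128+2)³ array with a single linear index, x varying fastest, and add the cells one slab above and below to the 2D stencil. A CPU sketch of that indexing and of the resulting 7-point Laplacian is given below; the helper names `idx3` and `laplacian7` are assumptions introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Linear index of padded cell (x, y, z), x varying fastest, in an array
   with nx cells per row and ny rows per slab. */
size_t idx3(size_t x, size_t y, size_t z, size_t nx, size_t ny)
{
    return (z * ny + y) * nx + x;
}

/* 7-point central-difference Laplacian at an interior cell of c. */
float laplacian7(const float *c, size_t x, size_t y, size_t z,
                 size_t nx, size_t ny, float dx)
{
    assert(x >= 1 && y >= 1 && z >= 1);
    size_t id = idx3(x, y, z, nx, ny);
    size_t sy = nx;        /* stride to the next row  */
    size_t sz = nx * ny;   /* stride to the next slab */
    return (c[id - 1] + c[id + 1] + c[id - sy] + c[id + sy]
            + c[id - sz] + c[id + sz] - 6.0f * c[id]) / (dx * dx);
}
```

With the padded dimensions used here (130 cells per axis), the first interior cell (1, 1, 1) sits at linear index 17031, and stepping one slab in z moves the index by 130×130 = 16900.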
In the kernel code, shared memory was used extensively to reduce the access traffic to global memory. The sample code is attached in the Appendix. Figures 4 and 5 show the concentration profile of the evolving structures obtained using the C and CUDA codes. The CPU took 125337 s (34.8 hours) to run 1 million steps, while the GPU took 5257 s (1.46 hours) to run 1 million steps and 52575 s (14.6 hours) to run 10 million steps.
FIG. 4. The initial 3D profile
FIG. 5. The 3D result. Panels a to d pair the C code with the CUDA code after $5 \times 10^3$ (a), $2 \times 10^4$ (b), $2 \times 10^5$ (c) and $1 \times 10^6$ (d) steps; panel e shows the CUDA code after $1 \times 10^7$ steps.
Results were obtained from 300 steps to 1 million steps for the C code and from 300 steps to 10 million steps for the CUDA code. The time used is linear in the number of steps for both the CPU and the GPU, so the time needed to run a given number of steps is predictable for a known type of device. As shown in FIG. 6, the GPU was 23.9 times faster than the CPU in evolving the 3D profile.
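The linear relation between run time and step count means a single measurement is enough to estimate any other run on the same device. A minimal sketch, with the hypothetical helper `predict_seconds`, is shown below. Using the GPU timing reported here (5257 s for 1 million steps), it predicts 52570 s for 10 million steps, against the measured 52575 s.

```c
#include <assert.h>

/* Estimate the run time for `steps` steps from one measured pair
   (ref_steps, ref_seconds), assuming time is proportional to steps. */
double predict_seconds(double ref_seconds, double ref_steps, double steps)
{
    assert(ref_steps > 0.0 && steps >= 0.0);
    return ref_seconds * (steps / ref_steps);
}
```

The estimate ignores fixed costs such as the initial host-to-device copy, which is a reasonable simplification only when the run is long enough for the stepping loop to dominate.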
FIG. 7 shows the error between the results obtained using a CPU and a GPU. At 1 million steps, the errors greater than 0.0005 make up about 5% of the entire error profile, and the errors greater than 0.0001 make up about 15% of the entire error profile.
FIG. 6. Semi-log plot of time used to evolve the profile using CPU and GPU vs. steps (vertical axis: time in seconds; horizontal axis: steps, logarithmic scale; series: Tesla C1060 and Xeon E3-1230)
About 50% of the errors are greater than $1 \times 10^{-5}$. As shown in FIG. 7, the percentage of errors above each threshold is also roughly linear in the number of steps, so one can expect the error to grow linearly with the number of steps. Hence a device with higher precision is called for when the time step is very fine.
FIG. 7. Percentage of error between the result obtained using CPU and GPU vs. steps (vertical axis: percentage; horizontal axis: steps; series: error greater than 5E-4, 1E-4, 5E-5 and 1E-5)
APPENDIX 1: 2D CUDA KERNEL CODE
#define W 4.0
#define DX 0.1
#define M 1.0
#define TF 50
#define EPS 0.4
#define N 200
#define TI 0
#define DT 0.00001

#define DBX 8   // 8 * 25 = 200 rows
#define DBY 16  // 16 * 13 = 200 + 8(padding) columns
#define DGX 25
#define DGY 7   /* 25*7 = 175 blocks per kernel invoke */

__global__ void stage1 (float* c0D, float* dfdc, int cell_x, int cell_y) {
    __shared__ float temp[DBX+2][DBY+2];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;
    int ixm, iym, idm;

    /* upper half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x'= x + 1; y'= y + 1; */
    ixm = ixt + 1;
    iym = iyt + 1;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = c0D[idm];
    if (tx == 0)
        temp[tx][ty+1] = c0D[idm - cell_y];
    if (tx == DBX - 1)
        temp[tx+2][ty+1] = c0D[idm + cell_y];
    if (ty == 0)
        temp[tx+1][ty] = c0D[idm - 1];
    if (ty == DBY - 1)
        temp[tx+1][ty+2] = c0D[idm + 1];
    __syncthreads();

    dfdc[idm] = W/2 * temp[tx+1][ty+1] * (1-temp[tx+1][ty+1]) * (1-2*temp[tx+1][ty+1])
              - EPS * EPS * (temp[tx][ty+1] + temp[tx+2][ty+1]
              + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX;
    __syncthreads();

    /* upper half of the halos */
    if (ixt == 0)
        dfdc[idm + N * cell_y] = dfdc[idm];
    if (ixt == N - 1)
        dfdc[idm - N * cell_y] = dfdc[idm];
    if (iyt == 0)
        dfdc[idm + N] = dfdc[idm];
    __syncthreads();

    /* lower half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x'= x + 1; y'= y + 1 + DGY*DBY */
    ixm = ixt + 1;
    iym = iyt + 1 + DGY * DBY;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = c0D[idm];
    if (iym < N + 1) { /* extra threads are idle */
        if (tx == 0)
            temp[tx][ty+1] = c0D[idm - cell_y];
        if (tx == DBX - 1)
            temp[tx+2][ty+1] = c0D[idm + cell_y];
        if (ty == 0)
            temp[tx+1][ty] = c0D[idm - 1];
        if (ty == DBY - 1)
            temp[tx+1][ty+2] = c0D[idm + 1];
        __syncthreads();

        dfdc[idm] = W/2 * temp[tx+1][ty+1] * (1-temp[tx+1][ty+1]) * (1-2*temp[tx+1][ty+1])
                  - EPS * EPS * (temp[tx][ty+1] + temp[tx+2][ty+1]
                  + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX;
        __syncthreads();

        /* lower half of the halos */
        if (ixt == 0)
            dfdc[idm + N * cell_y] = dfdc[idm];
        if (ixt == N - 1)
            dfdc[idm - N * cell_y] = dfdc[idm];
        if (iym == N)
            dfdc[idm - N] = dfdc[idm];
        __syncthreads();
    }
}

__global__ void stage2 (float* c0D, float* dfdc, int cell_x, int cell_y) {
    __shared__ float temp[DBX+2][DBY+2];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;
    int ixm, iym, idm;
    float base;

    /* upper half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x'= x + 1; y'= y + 1; */
    ixm = ixt + 1;
    iym = iyt + 1;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = dfdc[idm];
    base = c0D[idm];
    if (tx == 0)
        temp[tx][ty+1] = dfdc[idm - cell_y];
    if (tx == DBX - 1)
        temp[tx+2][ty+1] = dfdc[idm + cell_y];
    if (ty == 0)
        temp[tx+1][ty] = dfdc[idm - 1];
    if (ty == DBY - 1)
        temp[tx+1][ty+2] = dfdc[idm + 1];
    __syncthreads();

    c0D[idm] = DT * M * (temp[tx][ty+1] + temp[tx+2][ty+1]
             + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX
             + base;
    __syncthreads();

    /* upper half of the halos */
    if (ixt == 0)
        c0D[idm + N * cell_y] = c0D[idm];
    if (ixt == N - 1)
        c0D[idm - N * cell_y] = c0D[idm];
    if (iyt == 0)
        c0D[idm + N] = c0D[idm];
    __syncthreads();

    /* lower half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x'= x + 1; y'= y + 1 + DGY*DBY */
    ixm = ixt + 1;
    iym = iyt + 1 + DGY * DBY;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = dfdc[idm];
    base = c0D[idm];
    if (iym < N + 1) { /* extra threads are idle */
        if (tx == 0)
            temp[tx][ty+1] = dfdc[idm - cell_y];
        if (tx == DBX - 1)
            temp[tx+2][ty+1] = dfdc[idm + cell_y];
        if (ty == 0)
            temp[tx+1][ty] = dfdc[idm - 1];
        if (ty == DBY - 1)
            temp[tx+1][ty+2] = dfdc[idm + 1];
        __syncthreads();

        c0D[idm] = DT * M * (temp[tx][ty+1] + temp[tx+2][ty+1]
                 + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX
                 + base;
        __syncthreads();

        /* lower half of the halos */
        if (ixt == 0)
            c0D[idm + N * cell_y] = c0D[idm];
        if (ixt == N - 1)
            c0D[idm - N * cell_y] = c0D[idm];
        if (iym == N)
            c0D[idm - N] = c0D[idm];
        __syncthreads();
    }
}
APPENDIX 2: 3D CUDA CODE
Thornton_3.cu
/* The three header names were lost in extraction; the headers below are
   the ones the stdio, memory and string calls in this file require. */
#include <cstdio>
#include <cstdlib>
#include <cstring>
using namespace std;

#define TF 50
#define N 128
#define TI 0
#define DT 0.000005

#define DBX 32
#define DBY 16
#define DBZ 1
#define DGX 4
#define DGY 8
#define DGZ 1
/* 128 grids */

#include "Thornton_3kernel.cu"
#include "T3period.h"

int main() {
    int cell_x = N + 2;
    int cell_y = N + 2;
    int cell_z = N + 2;
    int sizeG = cell_x * cell_y * cell_z * sizeof(float);
    int step = (int)((TF - TI)/DT);
    char R[] = "/home/wenwan/CH3D/ch_3rand128.txt";

    float* c0 = (float*) calloc(cell_x * cell_y * cell_z, sizeof(float));
    float* c1 = (float*) malloc(sizeG);
    float *c0D = 0, *dfdc = 0;
    cudaMalloc((void**) &c0D, sizeG);
    cudaMalloc((void**) &dfdc, sizeG);

    dim3 dB(DBX, DBY, DBZ);
    dim3 dG(DGX, DGY, DGZ);

    FILE* read;
    FILE* ch_3cu;
    read = fopen(R, "r");

    float input, elapsedTime;
    float t_total = 0;
    int i, j, k, n, iox, ioy, ioz;

    for (k = 1; k < cell_z - 1; k++) {
        for (j = 1; j < cell_y - 1; j++) {
            for (i = 1; i < cell_x - 1; i++) {
                fscanf(read, "%f", &input);
                c0[k*cell_y*cell_x + j*cell_x + i] = input;
            }
            c0[k*cell_y*cell_x + j*cell_x] = c0[k*cell_y*cell_x + (j + 1)*cell_x - 2];
            c0[k*cell_y*cell_x + (j + 1)*cell_x - 1] = c0[k*cell_y*cell_x + j*cell_x + 1];
        }
        for (i = 1; i < cell_x - 1; i++) {
            c0[k*cell_y*cell_x + i] = c0[k*cell_y*cell_x + N*cell_x + i];
            c0[k*cell_y*cell_x + (cell_y - 1)*cell_x + i] = c0[k*cell_y*cell_x + cell_x + i];
        }
    }
    for (j = 1; j < cell_y - 1; j++) {
        for (i = 1; i < cell_x - 1; i++) {
            c0[j*cell_x + i] = c0[N*cell_y*cell_x + j*cell_x + i];
            c0[(cell_z - 1)*cell_y*cell_x + j*cell_x + i] = c0[cell_y*cell_x + j*cell_x + i];
        }
    }
    fclose(read);

    cudaMemcpy(c0D, c0, sizeG, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    /* growth */
    for (n = 0; n < step; n++) {
        if ( (n == 300) || (n == 600) || (n == 1000) || (n == 2000)
          || (n == 3000) || (n == 4000) || (n == 5000) || (n == 6000)
          || (n == 10000) || (n == 20000) || (n == 30000) || (n == 60000)
          || (n == 100000) || (n == 200000) || (n == 300000) || (n == 400000)
          || (n == 500000) || (n == 600000) || (n == 700000) || (n == 800000)
          || (n == 900000) || (n == 1000000) || (n == 2000000) || (n == 3000000)
          || (n == 4000000) || (n == 5000000) || (n == 6000000)
          || (n == 7000000) || (n == 8000000) || (n == 9000000) ) {
            /* pause the timer, dump a snapshot, then restart the timer */
            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
            cudaEventElapsedTime(&elapsedTime, start, stop);
            t_total += elapsedTime;
            cudaMemcpy(c1, c0D, sizeG, cudaMemcpyDeviceToHost);
            char strnum[64];
            sprintf(strnum, "%d", n);
            strcat(strnum, ".txt");
            period(n, strnum, t_total, c1);
            cudaEventRecord(start, 0);
        }
        /* The <<<grid, block>>> launch configuration was lost in extraction;
           dG and dB are the dimensions declared above. */
        stage1<<<dG, dB>>>(c0D, dfdc, cell_x, cell_y);
        stage2<<<dG, dB>>>(c0D, dfdc, cell_x, cell_y);
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    t_total += elapsedTime;
    printf("\n----- Elapsed time after 10,000,000 steps: %12.10f ms -----\n\n", t_total);

    cudaMemcpy(c1, c0D, sizeG, cudaMemcpyDeviceToHost);
    ch_3cu = fopen("/home/wenwan/CH3D/ch3cu128004complete.txt", "wb");
    for (ioz = 1; ioz < cell_z - 1; ioz++)
        for (ioy = 1; ioy < cell_y - 1; ioy++)
            for (iox = 1; iox < cell_x - 1; iox++)
                fprintf(ch_3cu, "%12.10f\n", c1[ioz*cell_y*cell_x + ioy*cell_x + iox]);
    fclose(ch_3cu);

    cudaFree(c0D);
    cudaFree(dfdc);
    free(c0);
    free(c1);
    return 0;
}

Thornton_3kernel.cu
#ifndef _THORNTON_3KERNEL_H_
#define _THORNTON_3KERNEL_H_

#define W 4.0
#define DX 0.1
#define M 1.0
#define TF 50
#define EPS 0.04
#define N 128
#define TI 0
#define DT 0.000005

#define DBX 32
#define DBY 16
#define DBZ 1
#define DGX 4
#define DGY 8
#define DGZ 1

__global__ void stage1 (float* c0D, float* dfdc, int cell_x, int cell_y) {
    __shared__ float temp[DBX+2][DBY+2];
    __shared__ float top[DBX][DBY];
    __shared__ float bottom[DBX][DBY];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int tz = threadIdx.z;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;
    const int izt = blockIdx.z * blockDim.z + tz;
    int ixm, iym, izm, idm, count;

    for (count = 0; count < 128; count++) {
        ixm = ixt + 1;
        iym = iyt + 1;
        izm = izt + 1 + count*DGZ*DBZ;
        idm = izm*cell_y*cell_x + iym*cell_x + ixm;

        temp[tx+1][ty+1] = c0D[idm];
        top[tx][ty] = c0D[idm - cell_y*cell_x];
        bottom[tx][ty] = c0D[idm + cell_y*cell_x];
        if (tx == 0) {
            temp[tx][ty+1] = c0D[idm - 1];
        }
        if (tx == DBX - 1) {
            temp[tx+2][ty+1] = c0D[idm + 1];
        }
        if (ty == 0) {
            temp[tx+1][ty] = c0D[idm - cell_x];
        }
        if (ty == DBY - 1) {
            temp[tx+1][ty+2] = c0D[idm + cell_x];
        }
        __syncthreads();

        dfdc[idm] = W/2 * temp[tx+1][ty+1] * (1-temp[tx+1][ty+1]) * (1-2*temp[tx+1][ty+1])
                  - EPS * EPS * (temp[tx][ty+1] + temp[tx+2][ty+1]
                  + temp[tx+1][ty] + temp[tx+1][ty+2] + top[tx][ty]
                  + bottom[tx][ty] - 6*temp[tx+1][ty+1])/DX/DX;
        __syncthreads();

        /* halos */
        if (ixt == 0)
            dfdc[idm + N] = dfdc[idm];
        if (ixt == N - 1)
            dfdc[idm - N] = dfdc[idm];
        if (iyt == 0)
            dfdc[idm + N * cell_x] = dfdc[idm];
        if (iyt == N - 1)
            dfdc[idm - N * cell_x] = dfdc[idm];
        if (izm == 1)
            dfdc[idm + N * cell_y * cell_x] = dfdc[idm];
        if (izm == N)
            dfdc[idm - N * cell_y * cell_x] = dfdc[idm];
        __syncthreads();
    }
}

__global__ void stage2 (float* c0D, float* dfdc, int cell_x, int cell_y) {
    __shared__ float temp[DBX+2][DBY+2];
    __shared__ float top[DBX][DBY];
    __shared__ float bottom[DBX][DBY];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int tz = threadIdx.z;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;
    const int izt = blockIdx.z * blockDim.z + tz;
    int ixm, iym, izm, idm, count;
    float base;

    for (count = 0; count < 128; count++) {
        ixm = ixt + 1;
        iym = iyt + 1;
        izm = izt + 1 + count*DGZ*DBZ;
        idm = izm*cell_y*cell_x + iym*cell_x + ixm;

        temp[tx+1][ty+1] = dfdc[idm];
        top[tx][ty] = dfdc[idm - cell_y*cell_x];
        bottom[tx][ty] = dfdc[idm + cell_y*cell_x];
        base = c0D[idm];
        if (tx == 0) {
            temp[tx][ty+1] = dfdc[idm - 1];
        }
        if (tx == DBX - 1) {
            temp[tx+2][ty+1] = dfdc[idm + 1];
        }
        if (ty == 0) {
            temp[tx+1][ty] = dfdc[idm - cell_x];
        }
        if (ty == DBY - 1) {
            temp[tx+1][ty+2] = dfdc[idm + cell_x];
        }
        __syncthreads();

        c0D[idm] = base + DT * M * (temp[tx][ty+1]
                 + temp[tx+2][ty+1] + temp[tx+1][ty] + temp[tx+1][ty+2]
                 + top[tx][ty] + bottom[tx][ty] - 6*temp[tx+1][ty+1])/DX/DX;
        __syncthreads();

        /* halos */
        if (ixt == 0)
            c0D[idm + N] = c0D[idm];
        if (ixt == N - 1)
            c0D[idm - N] = c0D[idm];
        if (iyt == 0)
            c0D[idm + N * cell_x] = c0D[idm];
        if (iyt == N - 1)
            c0D[idm - N * cell_x] = c0D[idm];
        if (izm == 1)
            c0D[idm + N * cell_y * cell_x] = c0D[idm];
        if (izm == N)
            c0D[idm - N * cell_y * cell_x] = c0D[idm];
        __syncthreads();
    }
}

#endif
T3period.h
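#ifndef _T3PERIOD_H_
#define _T3PERIOD_H_

/* The three header names were lost in extraction; the headers below are
   the ones the file and string calls in this header require. */
#include <cstdio>
#include <cstdlib>
#include <cstring>

#define N 128

void period (int step, char* strstep, float elapsedTime, float* c) {
    int x, y, z;
    FILE* periodicOut;
    char filename[1024];
    strcpy(filename, "/home/wenwan/CH3D/ch3cu128004_");
    strcat(filename, strstep);
    periodicOut = fopen(filename, "wb");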
for (z = 1; z
for (y = 1; y