
Comparison of computation of the Cahn-Hilliard equation using a single CPU core and a single GPU

    Wenjia Wang


    I. ABSTRACT

The goal is to compare the performance and accuracy of simulations of the Cahn-Hilliard equation using a single CPU core and a single GPU. The Cahn-Hilliard equation describes the phase separation of a binary fluid system in which diffusion is the major transport phenomenon. It is also possible to use this model to describe the phase separation of a solid-state binary alloy. The linearized Cahn-Hilliard equation is given as follows:

\frac{\partial c}{\partial t} = M \nabla^2 \left[ \frac{W}{2}\, c(1-c)(1-2c) - \epsilon^2 \nabla^2 c \right]    (1)

where c is the concentration profile, M is the diffusion coefficient, W is a scalar constant and ε is the length of the transition region between the domains of the two phases.

The finite difference method was used to evaluate the Laplacians. In particular, the central difference scheme was used to approximate the second-order derivatives. Eq. (2) shows the 1D case; the multi-variable cases are analogous to Eq. (2):

\left( \frac{\partial^2 c}{\partial x^2} \right)_i = \frac{c_{i+1} - 2c_i + c_{i-1}}{(\Delta x)^2}    (2)

The explicit method was used to discretize the differential equation with respect to time:

\frac{\partial c}{\partial t} = \frac{c(t + \Delta t) - c(t)}{\Delta t}    (3)

    Additionally, periodic boundary conditions are applied to the computation domains.

The following parameters in Eq. (1) and Eq. (2) are constant for all three cases: the size of the unit cell is Δx = 0.1, the total evolving time is 50 s, and the initial profiles are randomly generated values centered about 0.5 with a fluctuation of 0.1.
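To make the discretization concrete, a minimal CPU sketch of one explicit time step of the 1D scheme is shown below. It follows the same two-stage structure as the kernels in the appendices, but it is only an illustration: the function name ch1d_step, its arguments and the work array mu are not taken from the report's code.

    /* One explicit Euler step of the discretized 1D Cahn-Hilliard equation.
       mu[i] holds (W/2) c(1-c)(1-2c) - eps^2 * lap(c), the bracketed term of
       Eq. (1); periodic boundaries are handled by wrapping the index. */
    void ch1d_step(float *c, float *mu, int nx,
                   float W, float M, float eps, float dx, float dt)
    {
        int i;
        for (i = 0; i < nx; i++) {
            int im = (i - 1 + nx) % nx, ip = (i + 1) % nx;        /* periodic neighbors */
            float lap_c = (c[ip] - 2.0f*c[i] + c[im]) / (dx*dx);  /* Eq. (2) */
            mu[i] = W/2 * c[i]*(1.0f - c[i])*(1.0f - 2.0f*c[i]) - eps*eps*lap_c;
        }
        for (i = 0; i < nx; i++) {
            int im = (i - 1 + nx) % nx, ip = (i + 1) % nx;
            float lap_mu = (mu[ip] - 2.0f*mu[i] + mu[im]) / (dx*dx);
            c[i] += dt * M * lap_mu;                              /* Eqs. (1) and (3) */
        }
    }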

An NVidia GeForce 8600M GT and an NVidia Tesla C1060 are the GPUs used to run the CUDA code. They don't support double precision, so all computations on the CPUs and the GPUs are done with single-precision floating-point numbers. The 2D and 3D cases run significantly faster on the GPUs than on the CPUs. However, since the aforementioned two classes of GPUs don't have full-precision FMAD, they always round down when multiplying floating-point numbers, so considerable errors arise when the time steps are fine.

    II. 1D CAHN-HILLIARD EQUATION

For the 1D case, a 200-unit stencil was used. The parameters in Eq. (1) and Eq. (3) are defined as follows: W = 1.0, M = 1.0, ε = 0.1 and Δt = 1×10⁻³. The CUDA kernel


code utilizes registers: each thread fetches the value of a unit cell and its two neighbors from global memory into registers for reuse, thus reducing the latency of accessing the global memory. The memory access pattern could be further optimized by using shared memory for each thread block, but since shared memory is only accessible to threads within the block it is assigned to, synchronization problems would require cumbersome manual adjustment of the code for different classes of GPUs.
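The 1D kernel itself is not listed in the appendices; the following is a minimal sketch of the register-based scheme just described, written in the style of the two-stage kernels of Appendix 1. The kernel name stage1_1d and the passing of W, eps and dx as arguments (rather than as #defines) are assumptions.

    /* Stage 1 of a 1D step: each thread loads its own cell and its two
       neighbors from global memory into registers and evaluates
       dfdc = (W/2) c(1-c)(1-2c) - eps^2 (c[i+1] - 2c[i] + c[i-1]) / dx^2.
       The array is assumed to carry one ghost cell on each side for the
       periodic boundary, as in the 2D and 3D kernels. */
    __global__ void stage1_1d(const float* c0D, float* dfdc, int n,
                              float W, float eps, float dx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  /* skip the left ghost cell */
        if (i > n) return;                                  /* extra threads are idle */

        float cl = c0D[i - 1];   /* neighbor values held in registers for reuse */
        float cc = c0D[i];
        float cr = c0D[i + 1];

        dfdc[i] = W/2 * cc * (1.0f - cc) * (1.0f - 2.0f*cc)
                - eps*eps * (cr - 2.0f*cc + cl) / (dx*dx);
    }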

The C code running on one core of a 2.5 GHz Intel Core 2 Duo CPU used less than 1 s. The resolution of the difftime() function in time.h is on the order of a second, which isn't enough to return a more precise time used to evolve the stencil. The CUDA code running on an NVidia GeForce 8600M GT used 481 ms. There are errors between the results obtained using the CPU and the GPU, which are tabulated as follows:

Table 1. Distribution of errors between 1D results obtained from CPU & GPU

    range of error            percentage
    error ≥ 10⁻⁶              38.5%
    10⁻⁷ ≤ error < 10⁻⁶       39.5%
    10⁻⁸ ≤ error < 10⁻⁷       16.5%
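The code used to compare the CPU and GPU result files is not listed in the report. A simple way to tally such a distribution is sketched below; the absolute per-cell difference is assumed as the error measure, and the helper name error_histogram is hypothetical.

    #include <math.h>
    #include <stdio.h>

    /* Bin the per-cell absolute differences between the CPU result cCPU and
       the GPU result cGPU (both of length n) into the bands of Table 1. */
    void error_histogram(const float* cCPU, const float* cGPU, int n)
    {
        int ge_1e6 = 0, ge_1e7 = 0, ge_1e8 = 0;
        for (int i = 0; i < n; i++) {
            float err = fabsf(cCPU[i] - cGPU[i]);
            if      (err >= 1e-6f) ge_1e6++;
            else if (err >= 1e-7f) ge_1e7++;
            else if (err >= 1e-8f) ge_1e8++;
        }
        printf("error >= 1e-6        : %5.1f%%\n", 100.0f * ge_1e6 / n);
        printf("1e-7 <= error < 1e-6 : %5.1f%%\n", 100.0f * ge_1e7 / n);
        printf("1e-8 <= error < 1e-7 : %5.1f%%\n", 100.0f * ge_1e8 / n);
    }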

    III. 2D CAHN-HILLIARD EQUATION

FIG. 1. An (8+2)×(8+2) array in shared memory. The light green part maps to the values that the threads are processing and the dark green parts (halos) map to the values of the neighbors outside the block.

The 2D case is a 200×200-unit-cell concentration profile, with W = 4.0, M = 1.0, ε = 0.4 and Δt = 1×10⁻⁵.

The kernel of the CUDA code for the 2D case is written to perform better on a GeForce 8600M GT GPU, which has 32 streaming processors (cores). The warp size of the GPU is 32, so the number of threads within a block should be a multiple of 32. Taking the above into consideration, the block size was chosen to be 8×16 and the grid size was chosen to be 25×7. Thus the 200×200 profile can be covered by invoking the grid twice (8×25 = 200 rows; 16×7 = 112 and 112×2 = 224 columns, so the unused 24 columns are padding waste). A sketch of the launch configuration is given below.
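The kernel signatures follow Appendix 1; the wrapper function evolve_2d and its variable names are assumptions.

    /* Launch configuration for the 2D case on the GeForce 8600M GT:
       8 x 16 = 128 threads per block (a multiple of the warp size 32) and
       25 x 7 = 175 blocks per launch.  Each stage-1/stage-2 pair advances
       the 200 x 200 profile by one time step; the kernels themselves sweep
       an upper and a lower half of the domain (see Appendix 1). */
    void evolve_2d(float* c0D, float* dfdc, int steps)
    {
        int cell_x = 200 + 2;   /* one ghost cell per side for the periodic boundary */
        int cell_y = 200 + 2;
        dim3 dB(8, 16);         /* block size */
        dim3 dG(25, 7);         /* grid size  */

        for (int n = 0; n < steps; n++) {
            stage1<<<dG, dB>>>(c0D, dfdc, cell_x, cell_y);
            stage2<<<dG, dB>>>(c0D, dfdc, cell_x, cell_y);
        }
    }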


    FIG. 2. The initial 2D profile

In the kernel, each block is assigned an (8+2)×(16+2) array of floating-point numbers in shared memory, in which each of the central 8×16 units stores the value of the concentration profile that the corresponding thread maps to, so that the latency caused by access to the global memory can be greatly reduced. The outer padding of the array is used to store the neighboring values of the central 8×16 units that are necessary to perform differentiation using the central difference scheme. The results are shown in Fig. 3.

a. C code after 5×10⁴ steps        a′. CUDA code after 5×10⁴ steps


b. C code after 2×10⁵ steps        b′. CUDA code after 2×10⁵ steps

c. C code after 7×10⁵ steps        c′. CUDA code after 7×10⁵ steps

d. C code after 5×10⁶ steps        d′. CUDA code after 5×10⁶ steps

FIG. 3. Results of CPU computation (a, b, c, d) and GPU computation (a′, b′, c′, d′)


In FIG. 3, results a, b and c were obtained with one core of a 2.5 GHz Intel Core 2 Duo and result d was obtained with one core of an AMD 1.4 GHz Opteron 240 cluster; results a′, b′ and c′ were obtained with a GeForce 8600M GT and result d′ was obtained with a Tesla C1060. Times and errors are shown as follows:

Table 2. Time used to evolve the 2D profile using CPU & GPU

    steps     time for CPU (seconds)               time for GPU (seconds)
    5×10⁴     147 (2.5 GHz Intel Core 2 Duo)       29 (GeForce 8600M GT)
    2×10⁵     594 (2.5 GHz Intel Core 2 Duo)       120 (GeForce 8600M GT)
    7×10⁵     2063 (2.5 GHz Intel Core 2 Duo)      398 (GeForce 8600M GT)
    5×10⁶     17980 (AMD 1.4 GHz Opteron 240)      435 (Tesla C1060)

Table 3. Distribution of errors between 2D results obtained from CPU & GPU

    range of error      5×10⁴ steps   2×10⁵ steps   7×10⁵ steps   5×10⁶ steps (Opteron & Tesla)
    error ≥ 5×10⁻⁴      0%            0%            2.31%         7.10%
    error ≥ 1×10⁻⁴      0%            1.12%         45.66%        29.14%
    error ≥ 5×10⁻⁵      0%            15.69%        64.84%        44.46%
    error ≥ 1×10⁻⁵      0%            78.09%        91.68%        70.62%

It can be seen that the Tesla C1060 has a higher computation speed (it has 240 CUDA cores, whereas the GeForce 8600M GT has only 32) and better precision.

    IV. 3D CAHN-HILLIARD EQUATION

The 3D case is an extension of the 2D case. The concentration profile is defined on a 128 by 128 by 128 domain. The parameters are defined as follows: W = 4.0, M = 1.0, ε = 0.04 and Δt = 5×10⁻⁶. The kernel was written to perform better on a Tesla C1060 GPU. The block size was chosen to be 32×16×1 and the grid size was chosen to be 4×8×1, so a single time step requires 128 passes of the grid to sweep through the complete profile. The C code was run on a single core of an Intel Xeon E3-1230 quad-core processor, which has a clock frequency of 3.2 GHz.
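For reference, the indexing convention shared by the host code and the kernels in Appendix 2 can be written as a small helper (the function idx3d is hypothetical; the actual code inlines this arithmetic).

    /* The 128^3 profile is stored in a (128+2)^3 array with one ghost cell on
       each side for the periodic boundary; a cell (x, y, z) with
       1 <= x, y, z <= 128 sits at the flat index below (idm in Appendix 2).
       Each pass of the 4 x 8 grid of 32 x 16 x 1 blocks covers one full
       128 x 128 xy-slice (32*4 = 128, 16*8 = 128), so 128 passes in z are
       needed to sweep the whole volume each time step. */
    __host__ __device__ int idx3d(int x, int y, int z)
    {
        const int cell = 128 + 2;          /* padded edge length */
        return z*cell*cell + y*cell + x;
    }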

In the kernel code, shared memory was extensively used to reduce the access traffic to the global memory. The sample code is attached in the Appendix. Figures 4 and 5 show the concentration profiles of the evolving structures obtained using the C and CUDA codes. The CPU took 125337 s (34.8 hours) to run 1 million steps, while the GPU took 5257 s (1.46 hours) to run 1 million steps and 52575 s (14.6 hours) to run 10 million steps.

FIG. 4. The initial 3D profile

a. C code after 5×10³ steps        a′. CUDA code after 5×10³ steps


b. C code after 2×10⁴ steps        b′. CUDA code after 2×10⁴ steps

c. C code after 2×10⁵ steps        c′. CUDA code after 2×10⁵ steps

d. C code after 1×10⁶ steps        d′. CUDA code after 1×10⁶ steps


e. CUDA code after 1×10⁷ steps

FIG. 5. The 3D result

Results were obtained from 300 steps to 1 million steps for the C code and from 300 steps to 10 million steps for the CUDA code. The time used is linear in the number of steps for both the CPU and the GPU, so the time needed to run a given number of steps is predictable for a known type of device. As shown in FIG. 6, the GPU was 23.9 times faster than the CPU in evolving the 3D profile (125337 s versus 5257 s for 1 million steps).

FIG. 7 shows the error between the results obtained using a CPU and a GPU. At 1 million steps, the errors that are greater than 0.0005 make up about 5% of the entire error profile, and the errors that are greater than 0.0001 make up about 15% of the entire error profile.


FIG. 6. Semi-log plot of the time (in seconds) used to evolve the profile vs. the number of steps (logarithmic axis) for the Tesla C1060 GPU and the Xeon E3-1230 CPU

About 50% of the errors are greater than 1×10⁻⁵. As shown in FIG. 7, the percentage of errors above each threshold is also roughly linear in the number of steps, so the error can be expected to grow linearly with the step count. Hence a device with higher precision is called for when the time step is very fine.


FIG. 7. Percentage of errors between the results obtained using CPU and GPU vs. the number of steps, for error thresholds of 5×10⁻⁴, 1×10⁻⁴, 5×10⁻⁵ and 1×10⁻⁵


    APPENDIX 1: 2D CUDA KERNEL CODE

#define W 4.0
#define DX 0.1
#define M 1.0
#define TF 50
#define EPS 0.4
#define N 200
#define TI 0
#define DT 0.00001

#define DBX 8    // 8 * 25 = 200 rows
#define DBY 16   // 16 * 13 = 200 + 8 (padding) columns
#define DGX 25
#define DGY 7    /* 25*7 = 175 blocks per kernel invoke */

__global__ void stage1 (float* c0D, float* dfdc, int cell_x, int cell_y) {

    __shared__ float temp[DBX+2][DBY+2];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;

    int ixm, iym, idm;

    /* upper half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x' = x + 1; y' = y + 1; */

    ixm = ixt + 1;
    iym = iyt + 1;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = c0D[idm];

    if (tx == 0)
        temp[tx][ty+1] = c0D[idm - cell_y];
    if (tx == DBX - 1)
        temp[tx+2][ty+1] = c0D[idm + cell_y];
    if (ty == 0)
        temp[tx+1][ty] = c0D[idm - 1];
    if (ty == DBY - 1)
        temp[tx+1][ty+2] = c0D[idm + 1];

    __syncthreads();

    dfdc[idm] = W/2 * temp[tx+1][ty+1] * (1-temp[tx+1][ty+1]) * (1-2*temp[tx+1][ty+1])
              - EPS * EPS * (temp[tx][ty+1] + temp[tx+2][ty+1]
                           + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX;

    __syncthreads();

    /* upper half of the halos */
    if (ixt == 0)
        dfdc[idm + N * cell_y] = dfdc[idm];
    if (ixt == N - 1)
        dfdc[idm - N * cell_y] = dfdc[idm];
    if (iyt == 0)
        dfdc[idm + N] = dfdc[idm];

    __syncthreads();

    /* lower half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x' = x + 1; y' = y + 1 + DGY*DBY */

    ixm = ixt + 1;
    iym = iyt + 1 + DGY * DBY;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = c0D[idm];

    if (iym < N + 1) {  /* extra threads are idle */
        if (tx == 0)
            temp[tx][ty+1] = c0D[idm - cell_y];
        if (tx == DBX - 1)
            temp[tx+2][ty+1] = c0D[idm + cell_y];
        if (ty == 0)
            temp[tx+1][ty] = c0D[idm - 1];
        if (ty == DBY - 1)
            temp[tx+1][ty+2] = c0D[idm + 1];

        __syncthreads();

        dfdc[idm] = W/2 * temp[tx+1][ty+1] * (1-temp[tx+1][ty+1]) * (1-2*temp[tx+1][ty+1])
                  - EPS * EPS * (temp[tx][ty+1] + temp[tx+2][ty+1]
                               + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX;

        __syncthreads();

        /* lower half of the halos */
        if (ixt == 0)
            dfdc[idm + N * cell_y] = dfdc[idm];
        if (ixt == N - 1)
            dfdc[idm - N * cell_y] = dfdc[idm];
        if (iym == N)
            dfdc[idm - N] = dfdc[idm];

        __syncthreads();
    }
}

__global__ void stage2 (float* c0D, float* dfdc, int cell_x, int cell_y) {

    __shared__ float temp[DBX+2][DBY+2];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;

    int ixm, iym, idm;
    float base;

    /* upper half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x' = x + 1; y' = y + 1; */

    ixm = ixt + 1;
    iym = iyt + 1;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = dfdc[idm];
    base = c0D[idm];

    if (tx == 0)
        temp[tx][ty+1] = dfdc[idm - cell_y];
    if (tx == DBX - 1)
        temp[tx+2][ty+1] = dfdc[idm + cell_y];
    if (ty == 0)
        temp[tx+1][ty] = dfdc[idm - 1];
    if (ty == DBY - 1)
        temp[tx+1][ty+2] = dfdc[idm + 1];

    __syncthreads();

    c0D[idm] = DT * M * (temp[tx][ty+1] + temp[tx+2][ty+1]
                       + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX
             + base;

    __syncthreads();

    /* upper half of the halos */
    if (ixt == 0)
        c0D[idm + N * cell_y] = c0D[idm];
    if (ixt == N - 1)
        c0D[idm - N * cell_y] = c0D[idm];
    if (iyt == 0)
        c0D[idm + N] = c0D[idm];

    __syncthreads();

    /* lower half of the global memory */
    /* generic thread index: x = bx*DBX+tx; y = by*DBY+ty;
       mapped into the global memory: x' = x + 1; y' = y + 1 + DGY*DBY */

    ixm = ixt + 1;
    iym = iyt + 1 + DGY * DBY;
    idm = iym + ixm*cell_y;

    temp[tx+1][ty+1] = dfdc[idm];
    base = c0D[idm];

    if (iym < N + 1) {  /* extra threads are idle */
        if (tx == 0)
            temp[tx][ty+1] = dfdc[idm - cell_y];
        if (tx == DBX - 1)
            temp[tx+2][ty+1] = dfdc[idm + cell_y];
        if (ty == 0)
            temp[tx+1][ty] = dfdc[idm - 1];
        if (ty == DBY - 1)
            temp[tx+1][ty+2] = dfdc[idm + 1];

        __syncthreads();

        c0D[idm] = DT * M * (temp[tx][ty+1] + temp[tx+2][ty+1]
                           + temp[tx+1][ty] + temp[tx+1][ty+2] - 4*temp[tx+1][ty+1])/DX/DX
                 + base;

        __syncthreads();

        /* lower half of the halos */
        if (ixt == 0)
            c0D[idm + N * cell_y] = c0D[idm];
        if (ixt == N - 1)
            c0D[idm - N * cell_y] = c0D[idm];
        if (iym == N)
            c0D[idm - N] = c0D[idm];

        __syncthreads();
    }
}


    APPENDIX 2: 3D CUDA CODE

    Thornton_3.cu

#include <cstdio>    /* fopen, fscanf, fprintf, printf, sprintf */
#include <cstdlib>   /* calloc, malloc, free */
#include <cstring>   /* strcat */

using namespace std;

#define TF 50
#define N 128
#define TI 0
#define DT 0.000005

#define DBX 32
#define DBY 16
#define DBZ 1
#define DGX 4
#define DGY 8
#define DGZ 1
/* 128 grids */

#include "Thornton_3kernel.cu"
#include "T3period.h"

int main() {

    int cell_x = N + 2;
    int cell_y = N + 2;
    int cell_z = N + 2;
    int sizeG = cell_x * cell_y * cell_z * sizeof(float);
    int step = (int)((TF - TI)/DT);
    char R[] = "/home/wenwan/CH3D/ch_3rand128.txt";

    float* c0 = (float*) calloc(cell_x * cell_y * cell_z, sizeof(float));
    float* c1 = (float*) malloc(sizeG);
    float *c0D = 0, *dfdc = 0;

    cudaMalloc((void**) &c0D, sizeG);
    cudaMalloc((void**) &dfdc, sizeG);

    dim3 dB(DBX, DBY, DBZ);
    dim3 dG(DGX, DGY, DGZ);

    FILE* read;
    FILE* ch_3cu;
    read = fopen(R, "r");

    float input, elapsedTime;
    float t_total = 0;

    int i, j, k, n, iox, ioy, ioz;

    for (k = 1; k < cell_z - 1; k++) {
        for (j = 1; j < cell_y - 1; j++) {
            for (i = 1; i < cell_x - 1; i++) {
                fscanf(read, "%f", &input);
                c0[k*cell_y*cell_x + j*cell_x + i] = input;
            }
            c0[k*cell_y*cell_x + j*cell_x] = c0[k*cell_y*cell_x + (j + 1)*cell_x - 2];
            c0[k*cell_y*cell_x + (j + 1)*cell_x - 1] = c0[k*cell_y*cell_x + j*cell_x + 1];
        }
        for (i = 1; i < cell_x - 1; i++) {
            c0[k*cell_y*cell_x + i] = c0[k*cell_y*cell_x + N*cell_x + i];
            c0[k*cell_y*cell_x + (cell_y - 1)*cell_x + i] = c0[k*cell_y*cell_x + cell_x + i];
        }
    }
    for (j = 1; j < cell_y - 1; j++) {
        for (i = 1; i < cell_x - 1; i++) {
            c0[j*cell_x + i] = c0[N*cell_y*cell_x + j*cell_x + i];
            c0[(cell_z - 1)*cell_y*cell_x + j*cell_x + i] = c0[cell_y*cell_x + j*cell_x + i];
        }
    }

    fclose(read);

    cudaMemcpy(c0D, c0, sizeG, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    /* growth */
    for (n = 0; n < step; n++) {
        if ( (n == 300) || (n == 600) || (n == 1000) || (n == 2000)
          || (n == 3000) || (n == 4000) || (n == 5000) || (n == 6000)
          || (n == 10000) || (n == 20000) || (n == 30000) || (n == 60000)
          || (n == 100000) || (n == 200000) || (n == 300000) || (n == 400000)
          || (n == 500000) || (n == 600000) || (n == 700000) || (n == 800000)
          || (n == 900000) || (n == 1000000) || (n == 2000000) || (n == 3000000)
          || (n == 4000000) || (n == 5000000) || (n == 6000000)
          || (n == 7000000) || (n == 8000000) || (n == 9000000) ) {

            cudaEventRecord(stop, 0);
            cudaEventSynchronize(stop);
            cudaEventElapsedTime(&elapsedTime, start, stop);
            t_total += elapsedTime;
            cudaMemcpy(c1, c0D, sizeG, cudaMemcpyDeviceToHost);

            char strnum[64];
            sprintf(strnum, "%d", n);
            strcat(strnum, ".txt");
            period(n, strnum, t_total, c1);
            cudaEventRecord(start, 0);
        }
        stage1<<<dG, dB>>>(c0D, dfdc, cell_x, cell_y);
        stage2<<<dG, dB>>>(c0D, dfdc, cell_x, cell_y);
    }

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedTime, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    t_total += elapsedTime;
    printf("\n----- Elapsed time after 10,000,000 steps: %12.10f ms -----\n\n", t_total);

    cudaMemcpy(c1, c0D, sizeG, cudaMemcpyDeviceToHost);
    ch_3cu = fopen("/home/wenwan/CH3D/ch3cu128004complete.txt", "wb");

    for (ioz = 1; ioz < cell_z - 1; ioz++)
        for (ioy = 1; ioy < cell_y - 1; ioy++)
            for (iox = 1; iox < cell_x - 1; iox++)
                fprintf(ch_3cu, "%12.10f\n", c1[ioz*cell_y*cell_x + ioy*cell_x + iox]);

    fclose(ch_3cu);

    cudaFree(c0D); cudaFree(dfdc);
    free(c0); free(c1);

    return 0;
}

    Thornton_3kernel.cu

#ifndef _THORNTON_3KERNEL_H_
#define _THORNTON_3KERNEL_H_

#define W 4.0
#define DX 0.1
#define M 1.0
#define TF 50
#define EPS 0.04
#define N 128
#define TI 0
#define DT 0.000005

#define DBX 32
#define DBY 16
#define DBZ 1
#define DGX 4
#define DGY 8
#define DGZ 1

__global__ void stage1 (float* c0D, float* dfdc, int cell_x, int cell_y) {

    __shared__ float temp[DBX+2][DBY+2];
    __shared__ float top[DBX][DBY];
    __shared__ float bottom[DBX][DBY];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int tz = threadIdx.z;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;
    const int izt = blockIdx.z * blockDim.z + tz;

    int ixm, iym, izm, idm, count;

    for (count = 0; count < 128; count++) {

        ixm = ixt + 1;
        iym = iyt + 1;
        izm = izt + 1 + count*DGZ*DBZ;
        idm = izm*cell_y*cell_x + iym*cell_x + ixm;

        temp[tx+1][ty+1] = c0D[idm];
        top[tx][ty] = c0D[idm - cell_y*cell_x];
        bottom[tx][ty] = c0D[idm + cell_y*cell_x];

        if (tx == 0) {
            temp[tx][ty+1] = c0D[idm - 1];
        }
        if (tx == DBX - 1) {
            temp[tx+2][ty+1] = c0D[idm + 1];
        }
        if (ty == 0) {
            temp[tx+1][ty] = c0D[idm - cell_x];
        }
        if (ty == DBY - 1) {
            temp[tx+1][ty+2] = c0D[idm + cell_x];
        }

        __syncthreads();

        dfdc[idm] = W/2 * temp[tx+1][ty+1] * (1-temp[tx+1][ty+1]) * (1-2*temp[tx+1][ty+1])
                  - EPS * EPS * (temp[tx][ty+1] + temp[tx+2][ty+1]
                               + temp[tx+1][ty] + temp[tx+1][ty+2] + top[tx][ty]
                               + bottom[tx][ty] - 6*temp[tx+1][ty+1])/DX/DX;

        __syncthreads();

        /* halos */
        if (ixt == 0)
            dfdc[idm + N] = dfdc[idm];
        if (ixt == N - 1)
            dfdc[idm - N] = dfdc[idm];
        if (iyt == 0)
            dfdc[idm + N * cell_x] = dfdc[idm];
        if (iyt == N - 1)
            dfdc[idm - N * cell_x] = dfdc[idm];
        if (izm == 1)
            dfdc[idm + N * cell_y * cell_x] = dfdc[idm];
        if (izm == N)
            dfdc[idm - N * cell_y * cell_x] = dfdc[idm];

        __syncthreads();
    }
}

__global__ void stage2 (float* c0D, float* dfdc, int cell_x, int cell_y) {

    __shared__ float temp[DBX+2][DBY+2];
    __shared__ float top[DBX][DBY];
    __shared__ float bottom[DBX][DBY];

    const int tx = threadIdx.x;
    const int ty = threadIdx.y;
    const int tz = threadIdx.z;
    const int ixt = blockIdx.x * blockDim.x + tx;
    const int iyt = blockIdx.y * blockDim.y + ty;
    const int izt = blockIdx.z * blockDim.z + tz;

    int ixm, iym, izm, idm, count;
    float base;

    for (count = 0; count < 128; count++) {

        ixm = ixt + 1;
        iym = iyt + 1;
        izm = izt + 1 + count*DGZ*DBZ;
        idm = izm*cell_y*cell_x + iym*cell_x + ixm;

        temp[tx+1][ty+1] = dfdc[idm];
        top[tx][ty] = dfdc[idm - cell_y*cell_x];
        bottom[tx][ty] = dfdc[idm + cell_y*cell_x];
        base = c0D[idm];

        if (tx == 0) {
            temp[tx][ty+1] = dfdc[idm - 1];
        }
        if (tx == DBX - 1) {
            temp[tx+2][ty+1] = dfdc[idm + 1];
        }
        if (ty == 0) {
            temp[tx+1][ty] = dfdc[idm - cell_x];
        }
        if (ty == DBY - 1) {
            temp[tx+1][ty+2] = dfdc[idm + cell_x];
        }

        __syncthreads();

        c0D[idm] = base + DT * M * (temp[tx][ty+1]
                 + temp[tx+2][ty+1] + temp[tx+1][ty] + temp[tx+1][ty+2]
                 + top[tx][ty] + bottom[tx][ty] - 6*temp[tx+1][ty+1])/DX/DX;

        __syncthreads();

        /* halos */
        if (ixt == 0)
            c0D[idm + N] = c0D[idm];
        if (ixt == N - 1)
            c0D[idm - N] = c0D[idm];
        if (iyt == 0)
            c0D[idm + N * cell_x] = c0D[idm];
        if (iyt == N - 1)
            c0D[idm - N * cell_x] = c0D[idm];
        if (izm == 1)
            c0D[idm + N * cell_y * cell_x] = c0D[idm];
        if (izm == N)
            c0D[idm - N * cell_y * cell_x] = c0D[idm];

        __syncthreads();
    }
}

#endif

    T3period.h

#ifndef _T3PERIOD_H_
#define _T3PERIOD_H_

#include <cstdio>    /* FILE, fopen, fprintf */
#include <cstdlib>
#include <cstring>   /* strcpy, strcat */

#define N 128

void period (int step, char* strstep, float elapsedTime, float* c) {

    int x, y, z;
    FILE* periodicOut;
    char filename[1024];
    strcpy(filename, "/home/wenwan/CH3D/ch3cu128004_");
    strcat(filename, strstep);

    periodicOut = fopen(filename, "wb");

    /* write out the interior of the padded profile, mirroring the output loop in main() */
    for (z = 1; z < N + 1; z++)
        for (y = 1; y < N + 1; y++)
            for (x = 1; x < N + 1; x++)
                fprintf(periodicOut, "%12.10f\n", c[z*(N+2)*(N+2) + y*(N+2) + x]);

    fclose(periodicOut);
}

#endif