[IEEE 2012 International Conference on Microwave and Millimeter Wave Technology (ICMMT) - Shenzhen,...

UPML-FDTD Parallel Computing on GPU

MA Weiwei1, SUN Dong2, WU Xianliang1

1 Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230039, China

2 School of Electrical Engineering and Automation, Anhui University, Hefei 230039, China

Abstract- The Finite-Difference Time-Domain (FDTD) method, known for its simplicity and flexibility, is widely used to calculate electromagnetic fields. However, simulating electrically large objects takes a long time. To solve this problem, this paper puts forward a parallel UPML-FDTD algorithm that runs on a GPU using the Compute Unified Device Architecture (CUDA). The algorithm is further optimized by using texture memory. Compared with a traditional CPU calculation, the simulation results show that the algorithm is sufficiently precise and markedly more efficient across different numbers of Yee cells.

I. INTRODUCTION

The Finite-Difference Time-Domain (FDTD) method, proposed by K. S. Yee in 1966 [1], is widely used in the calculation of electromagnetic fields. Through continuous development over recent decades, FDTD has attracted more and more applications and attention because of its unique properties and advantages. However, for long-duration simulations and electrically large objects it produces a huge number of grid cells and takes a long time to run, because of numerical dispersion and time-domain iteration.

For these reasons, this paper studies parallel FDTD. Program parallelization is an inevitable trend in numerical calculation [2]. Parallel FDTD techniques have grown considerably over the years [3], but most of them use MPI (Message Passing Interface). By contrast, GPU-based parallel computing has lower hardware cost and fewer restrictions, and since the release of the Compute Unified Device Architecture (CUDA) it has become a hotspot of parallel computing research. This paper analyzes the basic principles of FDTD and the CUDA architecture, implements a parallel FDTD program on the latest Fermi architecture, adopts the uniaxial perfectly matched layer (UPML) boundary condition, and optimizes the algorithm by using the texture cache.

II. FUNDAMENTAL THEORY OF UPML-FDTD

A. Two-Dimensional Cartesian Coordinate System FDTD Principle [4]

The finite-difference time-domain (FDTD) method takes central difference approximations for both the temporal and spatial derivatives on the Yee grid. The sampling cell proposed by Yee is called the Yee cell. This paper deals with the two-dimensional TM wave, so only the two-dimensional TM form of Maxwell's equations is given. The equations are normalized so that the electric and magnetic fields have the same order of magnitude. The normalized TM wave equations in source-free free space are:

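$$
\begin{aligned}
\frac{\partial E_z}{\partial t} &= \frac{1}{\sqrt{\varepsilon_0\mu_0}}\left(\frac{\partial H_y}{\partial x} - \frac{\partial H_x}{\partial y}\right) \\
\frac{\partial H_x}{\partial t} &= -\frac{1}{\sqrt{\varepsilon_0\mu_0}}\,\frac{\partial E_z}{\partial y} \\
\frac{\partial H_y}{\partial t} &= \frac{1}{\sqrt{\varepsilon_0\mu_0}}\,\frac{\partial E_z}{\partial x}
\end{aligned}
\tag{1}
$$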

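We adopt a square Yee cell whose space step is $\Delta s = \Delta x = \Delta y$. To satisfy the Courant stability condition, the time step is chosen as

$$\Delta t = \frac{\Delta s}{2v} \tag{2}$$

where $v$ is the propagation speed of the wave.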
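As a quick numerical check of Eq. (2), the sketch below computes the time step for a square cell and compares the resulting Courant number with the two-dimensional stability limit $1/\sqrt{2}$. The cell size of 1 cm and the function name `time_step` are illustrative assumptions, not values from this paper, and $v$ is taken to be the free-space speed of light.

```python
import math

C0 = 299_792_458.0  # free-space speed of light in m/s

def time_step(cell_size, speed=C0):
    """Time step of Eq. (2): dt = ds / (2 v)."""
    return cell_size / (2.0 * speed)

# Assumed cell size, for illustration only (not given in the paper).
ds = 0.01  # metres
dt = time_step(ds)

# Courant number v*dt/ds; a square 2-D grid is stable when it is <= 1/sqrt(2).
courant_number = C0 * dt / ds
stability_limit = 1.0 / math.sqrt(2.0)
is_stable = courant_number <= stability_limit
```

Under these assumptions the Courant number is 0.5 regardless of the cell size, which is the constant that multiplies the field differences in the update equations.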

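Discretizing with the central difference scheme, and noting that with $v = 1/\sqrt{\varepsilon_0\mu_0}$ the coefficient $\Delta t/(\Delta x\sqrt{\varepsilon_0\mu_0})$ equals 0.5, we get the equations: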

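$$H_x^{n+\frac12}\left(i,j+\tfrac12\right) = H_x^{n-\frac12}\left(i,j+\tfrac12\right) - 0.5\left[E_z^{n}(i,j+1) - E_z^{n}(i,j)\right] \tag{3}$$

$$H_y^{n+\frac12}\left(i+\tfrac12,j\right) = H_y^{n-\frac12}\left(i+\tfrac12,j\right) + 0.5\left[E_z^{n}(i+1,j) - E_z^{n}(i,j)\right] \tag{4}$$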

Figure 1. TM mode: interleaving of the E and H fields for the two-dimensional TM formulation.

978-1-4673-2185-3/12/$31.00 ©2012 IEEE

Figure 1 indicates that the calculations are interleaved in both space and time. In iterative formula (3), for example, the new value of Hx is calculated from the previous value of Hx and the most recent values of Ez. Every cell is updated from its nearest neighbors only, so the FDTD algorithm has natural 2-D parallelism and is especially suitable for parallel computation in space.
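The interleaved update can be sketched serially as follows. This is a minimal free-space illustration, not the authors' program: the H updates follow Eqs. (3) and (4), while the Ez update is an assumed discretization of the first line of Eq. (1) with the same 0.5 coefficient, which the text does not write out. The grid size, the unit initial value at the centre, and all function names are hypothetical.

```python
def make_grid(size):
    """Square field array filled with zeros."""
    return [[0.0] * size for _ in range(size)]

def update_h(ez, hx, hy):
    """Eqs. (3) and (4): advance Hx and Hy from the current Ez."""
    size = len(ez)
    for i in range(size - 1):
        for j in range(size - 1):
            # hx[i][j] is stored for position (i, j+1/2), hy[i][j] for (i+1/2, j).
            hx[i][j] -= 0.5 * (ez[i][j + 1] - ez[i][j])
            hy[i][j] += 0.5 * (ez[i + 1][j] - ez[i][j])

def update_e(ez, hx, hy):
    """Ez update from the first line of Eq. (1), discretized the same way."""
    size = len(ez)
    for i in range(1, size - 1):
        for j in range(1, size - 1):
            ez[i][j] += 0.5 * (hy[i][j] - hy[i - 1][j] - hx[i][j] + hx[i][j - 1])

def run(size, steps):
    """Leapfrog loop started from a unit Ez value at the grid centre."""
    ez, hx, hy = make_grid(size), make_grid(size), make_grid(size)
    centre = size // 2
    ez[centre][centre] = 1.0
    for _ in range(steps):
        update_h(ez, hx, hy)  # each (i, j) in this sweep is independent
        update_e(ez, hx, hy)  # likewise, so one thread per cell is possible
    return ez

ez = run(size=21, steps=8)
```

Within each sweep no cell reads a value that another cell writes in the same sweep, which is the property that lets the loops over `i` and `j` be replaced by one GPU thread per cell.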

B. The Uniaxial Perfectly Matched Layer (UPML) Boundary Condition

The size of the region that can be simulated with FDTD is limited by computer resources, so the grid must be truncated by an absorbing boundary. One of the most flexible and efficient absorbing boundary conditions (ABCs) is the perfectly matched layer (PML) developed by Berenger [5], used here in its uniaxial form (UPML). The basic idea is this: if a wave propagating in medium A impinges upon medium B, the amount of reflection is dictated by the intrinsic impedances of the two media

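$$\Gamma = \frac{\eta_A - \eta_B}{\eta_A + \eta_B} \tag{5}$$

which are determined by the dielectric constants $\varepsilon$ and permeabilities $\mu$ of the two media

$$\eta = \sqrt{\frac{\mu}{\varepsilon}} \tag{6}$$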

If $\mu$ changed together with $\varepsilon$ so that $\eta$ remained constant, then $\Gamma = 0$ and no reflection would occur. However, the pulse would continue propagating in the new medium. We want the pulse to die out before it hits the boundary. This is accomplished by making both $\varepsilon$ and $\mu$ of Eq. (6) complex, because the imaginary part represents the part that causes decay.
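The matching argument can be checked numerically with a small sketch of Eqs. (5) and (6). The material values below are arbitrary relative quantities chosen for illustration, and the function names are hypothetical.

```python
import math

def intrinsic_impedance(mu, eps):
    """Eq. (6): eta = sqrt(mu / eps)."""
    return math.sqrt(mu / eps)

def reflection_coefficient(eta_a, eta_b):
    """Eq. (5): Gamma = (eta_A - eta_B) / (eta_A + eta_B)."""
    return (eta_a - eta_b) / (eta_a + eta_b)

# Relative units: medium A is free space (mu = eps = 1).
eta_a = intrinsic_impedance(1.0, 1.0)

# Medium B with only eps changed: the impedances differ, so part of the wave reflects.
gamma_mismatched = reflection_coefficient(eta_a, intrinsic_impedance(1.0, 4.0))

# Medium B with mu and eps scaled by the same factor: eta is unchanged, so Gamma = 0.
gamma_matched = reflection_coefficient(eta_a, intrinsic_impedance(4.0, 4.0))
```

Scaling $\mu$ and $\varepsilon$ by the same factor leaves the impedance unchanged, which is why the layer can be made lossy without reflecting the incoming wave.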

Following this idea, the UPML-FDTD update formulas are obtained after rigorous mathematical analysis and derivation. Taking Hx with the UPML in the x direction as an example, we get

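$$
\begin{aligned}
\mathrm{curl\_e} &= E_z^{n}(i,j) - E_z^{n}(i,j+1) \\
I_{Hx}^{n+\frac12}\left(i,j+\tfrac12\right) &= I_{Hx}^{n-\frac12}\left(i,j+\tfrac12\right) + \mathrm{curl\_e} \\
H_x^{n+\frac12}\left(i,j+\tfrac12\right) &= H_x^{n-\frac12}\left(i,j+\tfrac12\right) + 0.5\,\mathrm{curl\_e} + f_{i1}(i)\, I_{Hx}^{n+\frac12}\left(i,j+\tfrac12\right)
\end{aligned}
\tag{7}
$$

with

$$f_{i1}(i) = \frac{\sigma(i)\,\Delta t}{2\varepsilon_0} \tag{8}$$

where $\sigma(i)$ is the conductivity of the UPML at position $i$.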
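A single-cell sketch of Eqs. (7) and (8) is given below. The conductivity, the time step, and the field values are hypothetical numbers chosen only to exercise the formulas, and the function names are not taken from the authors' program.

```python
EPS0 = 8.854e-12  # vacuum permittivity in F/m

def fi1(sigma, dt):
    """Eq. (8): UPML coefficient for a cell with conductivity sigma."""
    return sigma * dt / (2.0 * EPS0)

def update_hx_upml(hx, i_hx, ez_ij, ez_ij1, coeff):
    """Eq. (7) for one cell; returns the new (hx, i_hx) pair.

    hx, i_hx : Hx and its running curl sum at (i, j+1/2)
    ez_ij    : Ez at (i, j);  ez_ij1 : Ez at (i, j+1)
    coeff    : fi1(i) from Eq. (8)
    """
    curl_e = ez_ij - ez_ij1
    i_hx = i_hx + curl_e
    hx = hx + 0.5 * curl_e + coeff * i_hx
    return hx, i_hx

# Outside the UPML sigma = 0, so the update reduces to Eq. (3).
hx_free, _ = update_hx_upml(0.2, 0.0, 1.0, 0.6, fi1(0.0, 1e-11))

# Inside the UPML (hypothetical sigma and dt) the accumulated curl adds an extra term.
hx_pml, i_pml = update_hx_upml(0.2, 0.0, 1.0, 0.6, fi1(0.05, 1e-11))
```

Because the coefficient vanishes where $\sigma = 0$, the same routine can in principle be applied to every cell, with the UPML behavior confined to the cells where the conductivity is nonzero.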

III. PARALLELIZATION ALGORITHMS

A. CUDA Programming Model

Compute Unified Device Architecture (CUDA) is NVIDIA's parallel computing architecture, designed to work in a massively parallel (i.e., highly threaded) environment. It enables dramatic increases in computing performance by harnessing the power of the GPU, and it is programmed in a C-like language. In this architecture, a program is divided into a host (CPU) side and a device (GPU) side during execution. The code that runs on the device is called the kernel, and it is executed simultaneously by many threads in parallel. The host issues a succession of kernel invocations to the device. Each kernel is executed as a batch of threads organized as a grid of thread blocks [6]. The program is thus co-processed by the CPU and the GPU, with the sequential parts executed on the CPU. Figure 2 shows that only one thread runs on the CPU, while many threads process data concurrently on the GPU.

Figure 2. CUDA programming model.

Following the characteristics of CUDA, the parallel FDTD algorithm divides the calculation area into multiple blocks, and each block into multiple threads. Each thread calculates one Yee cell. Because each thread needs to exchange data with the neighboring Yee cells, fast data access is important. The main steps of the implementation are shown in Figure 3.
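The block and thread decomposition can be illustrated with a host-side sketch that emulates the usual CUDA index arithmetic (block index times block dimension plus thread index). The 64 x 64 grid and 16 x 16 block correspond to the smallest case reported in Section IV; the function name `cell_of_thread` is hypothetical.

```python
def cell_of_thread(block_idx, thread_idx, block_dim):
    """Global Yee-cell index (i, j) handled by one thread.

    Mirrors the usual CUDA expression blockIdx * blockDim + threadIdx
    in each of the two grid directions.
    """
    bx, by = block_idx
    tx, ty = thread_idx
    dx, dy = block_dim
    return bx * dx + tx, by * dy + ty

# A 64 x 64 cell grid split into blocks of 16 x 16 threads.
GRID_CELLS = (64, 64)
BLOCK_DIM = (16, 16)
blocks = (GRID_CELLS[0] // BLOCK_DIM[0], GRID_CELLS[1] // BLOCK_DIM[1])

# Enumerate every (block, thread) pair and record the cell it owns.
owned = [
    cell_of_thread((bx, by), (tx, ty), BLOCK_DIM)
    for bx in range(blocks[0]) for by in range(blocks[1])
    for tx in range(BLOCK_DIM[0]) for ty in range(BLOCK_DIM[1])
]
```

Every cell appears exactly once in `owned`, so no two threads write the same cell and no cell is left without a thread.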

B. CUDA Hardware Architecture

CUDA, released in 2007, was the world's first C-language development environment for the GPU. In just a few years, CUDA-enabled hardware has changed substantially: from the original G80 architecture, to the GT200 core, to the current Fermi architecture [7]. In general, Fermi has many advantages over its predecessors. For example, an SM of the G80/GT200 series can hold 768/1024 resident threads, whereas a Fermi SM supports 1536 threads. Fermi also allows multiple kernels to share the computing resources of the GPU simultaneously (on the previous series only one kernel function could run at a time). In addition, Fermi has a second-level (L2) cache of about 768 KB, which speeds up reads from global memory and texture memory.

Figure 3. Process flow diagram of the GPU implementation:

1) Host side: initialize and allocate host memory.
2) Host side: allocate graphics memory and copy the data to it.
3) Host side: call the device kernel to compute the field components in parallel.
4) Device side: add the source; calculate Ez from Hx and Hy; impose the absorbing boundary conditions; calculate Hx from Ez; calculate Hy from Ez.
5) Host side: copy the results back to host memory and draw them using the Matlab engine.
6) Host side: release the memory on both the host side and the device side.

IV. TWO-DIMENSIONAL UPML-FDTD SIMULATION WITH CUDA

The development language is C++ and the development environment is Visual Studio 2008 (VS2008). The test computer has an Intel(R) Pentium(R) 4 CPU and an NVIDIA GeForce GT 440 GPU. The GT 440 is based on the Fermi architecture, with 96 stream processors and 512 MB of video memory.

A. Algorithm Implementation

Taking the two-dimensional TM wave as an example, we simulate a plane wave impinging on a dielectric cylinder 20 cm in diameter. The incident wave is a Gaussian pulse, exp(-0.5*((40-n)/12)^2), where n is the time step. The UPML boundary has a thickness of 8 cells. The parameters are $\varepsilon_r = 4$, $\sigma = 0.5$, $\mu_r = 1$. The Matlab engine is added to the main program to display the values of the electric field dynamically. Figure 5 shows the resulting plots produced with the Matlab engine.
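The incident pulse can be evaluated directly from its formula. The sketch below is only an illustration of that expression; the function name, the parameter names `t0` and `spread`, and the choice of 100 sampled steps are assumptions.

```python
import math

def gaussian_pulse(n, t0=40.0, spread=12.0):
    """Incident pulse exp(-0.5 * ((t0 - n) / spread)**2) at time step n."""
    return math.exp(-0.5 * ((t0 - n) / spread) ** 2)

# Sample the pulse over the first 100 time steps.
samples = [gaussian_pulse(n) for n in range(100)]
peak_step = max(range(100), key=lambda n: samples[n])
```

The pulse peaks at n = 40 with unit amplitude and is symmetric about that step, so the source is already small when the simulation starts and decays smoothly afterwards.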

Figure 4. Diagram of the simulation of a plane wave striking a dielectric cylinder [4].

To simulate a plane wave interacting with a dielectric cylinder with the parallel algorithm, we use each thread to calculate the electric field in one Yee cell, following the parallel property of FDTD. CUDA schedules threads in units of warps. A warp contains 32 threads, and a half-warp is either the first or second half of a warp. The half-warp is an important concept for memory accesses: to hide latency effectively, at least half a warp should be executed at a time. We therefore give each block 16 * 16 threads, so that every block holds 256 threads (8 full warps), all executing the same kernel.

B. Texture Memory Optimization

Optimizing a CUDA algorithm most often involves optimizing data accesses, which includes the use of the various CUDA memory spaces. Texture memory offers useful capabilities, including the ability to cache global memory. To speed up the parallel algorithm, a texture-based optimization is applied [8]. The texture cache is on-chip memory with high access speed. Texture memory was originally designed for graphics workloads, whose memory accesses exhibit a great deal of spatial locality. Because texture memory is read-only, we place the UPML boundary parameters and the dielectric cylinder parameters in texture memory to realize the optimization.

[Figure 5, panel 1): change of the value of Ez, CPU-based. Panel 2): change of the value of Ez, GPU-based.]

Figure 5. Changes in the electric field value.

Figure 5 shows that the CPU results are virtually the same as the GPU results, which demonstrates the correctness and effectiveness of this parallel algorithm design. Table 1 shows the computation times and acceleration ratios for different numbers of Yee cells.

C. Analysis of Results

Table 1 and Figure 6 show that the GPU acceleration is not obvious when the grid is small, because the GPU cannot make full use of its SMs. As the number of cells grows, the acceleration ratio rises, from about 1 at 64*64 to 88.70 at 768*768, and it remains high (74.90) at 1024*1024. In practice the number of cells may far exceed one million, so the GPU parallel algorithm will save a lot of time.
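The acceleration ratio is the CPU runtime divided by the GPU runtime. The sketch below recomputes it from the runtimes listed in Table 1, so that the recomputed values can be compared with the last column of the table; the variable and function names are hypothetical.

```python
# (cells per side, CPU runtime in ms, GPU runtime in ms), copied from Table 1.
RUNTIMES_MS = [
    (64, 297, 296),
    (128, 1313, 421),
    (256, 7156, 906),
    (384, 20063, 1547),
    (512, 89993, 2859),
    (640, 213953, 3828),
    (768, 467094, 5266),
    (1024, 875438, 11688),
]

def acceleration_ratio(cpu_ms, gpu_ms):
    """Speedup of the GPU run over the CPU run."""
    return cpu_ms / gpu_ms

ratios = {side: acceleration_ratio(cpu, gpu) for side, cpu, gpu in RUNTIMES_MS}
best_side = max(ratios, key=ratios.get)

for side, ratio in ratios.items():
    print(f"{side} x {side}: {ratio:.2f}")
```

With these runtimes the largest ratio occurs at the 768*768 grid rather than at the largest grid.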

V. CONCLUSION

In this paper a parallel 2-D FDTD algorithm based on the graphics processing unit is implemented; it simulates a plane wave impinging on a dielectric cylinder and adopts the UPML boundary. The simulation results show that the GPU-based parallel FDTD method saves a lot of time and enhances computational efficiency. These techniques will contribute to long-duration simulations of electromagnetic field problems and have considerable practical value.

VI. ACKNOWLEDGMENT

The authors gratefully acknowledge the support of the National Natural Science Foundation of China (No. 60931002, 61101064), the Distinguished Natural Science Foundation (No. 1108085J01), the Universities Natural Science Foundation of Anhui Province (No. KJ2011A002, KJ2011A242), and the 211 Project of Anhui University.

[Figure 6 plot: horizontal axis, the number of cells (x 10^5); vertical axis, the calculation time in ms (x 10^5); curves, the CPU runtime and the GPU runtime.]

Figure 6. Comparison of the CPU and GPU calculation time for different numbers of Yee cells.

TABLE I

Comparison of time spent with GPU and CPU (t = 1000 $\Delta t$)

| Number of grid cells | Number of blocks | CPU runtime (ms) | GPU runtime (ms) | Acceleration ratio |
|---|---|---|---|---|
| 64*64 | 4*4 | 297 | 296 | 1.0033 |
| 128*128 | 8*8 | 1313 | 421 | 3.1187 |
| 256*256 | 16*16 | 7156 | 906 | 7.898 |
| 384*384 | 24*24 | 20063 | 1547 | 12.97 |
| 512*512 | 32*32 | 89993 | 2859 | 31.47 |
| 640*640 | 40*40 | 213953 | 3828 | 56.41 |
| 768*768 | 48*48 | 467094 | 5266 | 88.70 |
| 1024*1024 | 64*64 | 875438 | 11688 | 74.90 |

REFERENCES

[1] Ge Debiao, Yan Yubo. Finite-difference time-domain electromagnetic, 2nd ed. Xi'an: Xidian University Press, 2005.

[2] Lv Yinghua. Numerical Methods of Computational Electromagnetics. Tsinghua University Press, 2004, pp. 6-14.

[3] A. Grama. Introduction to Parallel Computing. New York: Addison-Wesley, 2003.

[4] D. M. Sullivan. Electromagnetic Simulation Using the FDTD Method. New York: IEEE Press, 2000, pp. 49-62.

[5] J. P. Berenger. A perfectly matched layer for the absorption of electromagnetic waves. J. Comput. Phys., vol. 114, 1994, pp. 185-200.

[6] NVIDIA CUDA Programming Guide: unified computing device architecture [Z]. NVIDIA, 2007.

[7] Lab of Evaluation of modern computer. "Fermi's general-purpose computing architecture optimization strategy",

[8] Jason Sanders, Edward Kandrot. CUDA by Example: An Introduction to General-Purpose GPU Programming. 2010, pp. 84-101.
