
Linear Solvers for Stable Fluids: GPU vs CPU

Gonçalo Amador, Abel Gomes
Instituto de Telecomunicações, Departamento de Informática, Universidade da Beira Interior
Covilhã, Portugal

a14722@ubi.pt, agomes@di.ubi.pt

Abstract
Fluid simulation has been an active research field in computer graphics for the last 30 years. Stam's stable fluids method, among others, is used for solving equations that govern fluids. This method solves a sparse linear system during the diffusion and move steps, using either relaxation methods (Jacobi, Gauss-Seidel, etc.), Conjugate Gradient (and its variants), or others (not subject of study in this paper). A comparative performance analysis between a parallel GPU-based (using CUDA) algorithm and a serial CPU-based algorithm, in both 2D and 3D, is given with the corresponding implementation of Jacobi (J), Gauss-Seidel (GS) and Conjugate Gradient (CG) solvers.

Keywords: Stable Fluids, CUDA, GPU, Sparse Linear Systems.

1. INTRODUCTION

The stable fluids method was introduced by Stam [Stam 99] to the field of computer graphics. It allows for unconditionally stable physics-based fluid simulations. During the study of this method, it became clear that during its diffusion and move steps a sparse linear system had to be solved. The performance and scalability (maximum grid size that can be used in real-time) of this method are directly related to the efficiency of the solver. Solvers that converge to better values give better visual results. However, the solver must converge quickly, but not at the cost of more computation, to allow real-time simulations. Interestingly, and in spite of there being more than one alternative to solve sparse linear systems (J, GS, CG, etc.), an implementation (for the specific problem of stable fluids) and a comparative analysis of the various solvers on different architectures (GPU using CUDA and CPU) is hard to find, not to say that they do not exist.

Solvers have to iterate repeatedly and update the elements of a grid. For each solver, the elements of the grid can be accessed and updated in an asynchronous way (with some minor changes in the GS solver). Therefore, this is a kind of problem where we clearly get performance gains from using parallel computing resources such as GPUs. Since the release of CUDA, i.e. an API for GPU processing, the subject of parallel computation on the GPU has become more and more attractive.

This paper addresses the following questions: how to code these solvers on the GPU, what are the gains and drawbacks of GPU implementations, and whether it is possible to improve the scalability of the stable fluids method using CUDA.

So, the main contributions of this paper are:

• A CUDA-based implementation of stable fluids solvers in 3D. There is already a CPU-based implementation of stable fluids in 3D using a Gauss-Seidel solver, which is due to Ash [Ash 05]. There is also a Cg shading-based implementation using a Jacobi solver for 3D stable fluids [Keenan 07]. However, to the best of our knowledge, there is no CUDA-based implementation of 3D stable fluids.

• A comparative study of CUDA-based implementations of 3D stable fluids using different solvers, namely: Gauss-Seidel, Jacobi, and Conjugate Gradient.

This paper is organised as follows. Section 2 reviews the previous work. Section 3 briefly describes the NVIDIA Compute Unified Device Architecture (CUDA). Section 4 describes the stable fluids method, including the Navier-Stokes equations. Section 5 deals with the implementation of the three sparse linear solvers mentioned above. Section 6 carries out a performance analysis of both CPU- and GPU-based implementations of the solvers. Finally, Section 7 draws relevant conclusions and points out new directions for future work.

2. PREVIOUS WORK

Since the appearance of the stable fluids method due to Stam [Stam 99], much research work has been done with this method. To solve the sparse linear system for a 2D simulation in the diffusion and move steps, both the Fast Fourier Transform (FFT) in Stam's [Stam 01] and the Gauss-Seidel relaxation in Stam's [Stam 03] were used (both implementations run on the CPU). The Gauss-Seidel version was the only one that could support internal and moving boundaries (tips on how to implement internal and moving boundaries are given in Stam's [Stam 03]). Later on, in 2005, Stam's Gauss-Seidel version was extended to 3D, also for the CPU, by Ash [Ash 05]. In 2007, Ash's 3D version of stable fluids was implemented for C'Nedra (an open source virtual reality framework) by Bongart [Bongart 07]; this version also runs on the CPU. Recently, in 2008, Kim presented in [Kim 08] a full complexity and bandwidth analysis of Stam's stable fluids version [Stam 03].

In 2005, Stam's stable fluids version was implemented on the GPU by Harris [Harris 05] using the Cg language. This version supported internal and moving boundaries, but used Jacobi relaxation instead of the Gauss-Seidel relaxation. In 2007, an extension to 3D of Harris' work was carried out by Keenan et al. [Keenan 07]. Still in 2007, when CUDA (see [Nickolls 08] for a quick introduction to CUDA programming) was released, Goodnight's OpenGL-CUFFT version of Stam's [Stam 01] became available [Goodnight 07]. The code of this implementation is still part of the CUDA SDK examples.

The stable fluids method addressed the stability problem of previous solvers, namely the Kass and Miller method in [Kass 90]. The Kass and Miller method could not be used with large time steps; consequently, it was not possible to use it in real-time computer graphics applications. The reason behind this was its usage of explicit integration done with the Jacobi solver, instead of the implicit integration of stable fluids. In spite of the limitations of the Kass and Miller method, in 2004 there was a GPU Cg-based version of this method implemented by Noe [Noe 04]. This GPU version runs in real-time thanks to the gains obtained from using the GPU.

As mentioned earlier, it is hard to find, not to say that it does not exist, a comparative analysis of the various solvers on different architectures (GPU and CPU). However, one might find performance analyses comparing the CPU and GPU versions of individual methods. In 2007, the CG method performance on the CPU and GPU was analysed by Wiggers et al. in [Wiggers 07]. In 2008, Amorim et al. [Amorim 08] implemented the Jacobi solver on both CPU and GPU, having then carried out a performance analysis and comparison of both implementations.

Implementations of solvers for the GPU using shading language APIs have already been addressed, namely Gauss-Seidel and Conjugate Gradient (in [Kruger 03]) and Conjugate Gradient and multigrid solvers (in [Bolz 03]).

To understand the mathematics and the algorithm of the CG method (and its variants), the reader is referred to Shewchuk [Shewchuk 94]. Carlson addressed the problem of applying the CG method to fluid simulations [Carlson 04]. The Preconditioned CG (PCG) was also overviewed during the SIGGRAPH 2007 fluid simulation course [Bridson 07], where the source code for a C++ implementation of a sparse linear system solver was made available. The SIGGRAPH version builds up and stores the non-null elements of the sparse matrix. When the sparse matrix is stored in memory, access efficiency is crucial for a good GPU implementation, a problem that was addressed by Bell and Garland [Bell 08].

An overview of GPU programming, including the GPU architecture, shading languages and recent APIs such as CUDA, intended to expose the parallel processing capabilities of GPUs, was given in [Owens 08].

In spite of the previous work described above, a CUDA implementation of Gauss-Seidel and CG solvers for the specific case of the stable fluids method seems to be non-existent. This paper proposes just such CUDA-based implementations. Besides, this paper also carries out a comparison between the Jacobi, Gauss-Seidel and CG solvers in their CPU and CUDA implementations.

3. NVIDIA COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)

With the advent of the first graphics systems, it became clear that the parallel work required by graphics computations should be delegated to a component other than the CPU. Thus, the first graphics cards arrived to alleviate the graphics processing load of the CPU. However, graphics programming was basically done using a kind of assembly language. With the arrival of the graphics programming APIs (such as OpenGL or DirectX) and, later, the high-level shading languages (such as Cg or GLSL), programming graphics became easier.

When, in 2007, NVIDIA CUDA (Programming Guide [NVIDIA 08a] and Reference Manual [NVIDIA 08b]) was released, it became possible to specify how and what work should be done on NVIDIA graphics cards. It became possible to program the GPU directly using the C/C++ programming language. CUDA is available for the Linux and Windows operating systems and includes the BLAS and FFT libraries.

The CUDA programming logic separates the work that should be performed by the CPU (host) from the work that should be performed by the GPU (device) (see Fig. 1). The host runs its own code and launches kernels to be executed by the GPU. After a kernel call, control returns immediately to the host, unless a lock is activated with cudaThreadSynchronize. The GPU and the CPU work simultaneously after a kernel call, each running its own code (unless a lock is activated in the host). The GPU runs the kernels in order of arrival (kernel 1, then kernel 2, as shown in Fig. 1), unless they are running on different streams, i.e. kernel execution is asynchronous. If a lock is activated, the next kernel will only be called when the previously called kernels finish their jobs.

Figure 1: CUDA work flow model.

A kernel has a set of parameters, aside from the pointers to device memory variables or copies of CPU data. The parameters of a kernel specify the number of blocks in a grid (in 2D only), the number of threads (in the x, y, z directions) of each grid block, the size of shared memory per block (0 by default), and the stream to use (0 by default). The maximum number of threads allowed per block is 512 (x × y × z threads per block). All blocks of the same grid have the same number of threads. Blocks work in parallel, either asynchronously or synchronously. With this information, the kernel call specifies how the work will be divided over a grid.
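As an illustration of these parameters, a kernel launch might look as follows. This is a minimal sketch; BLOCK_DIM_X, BLOCK_DIM_Y, NX, NY, some_kernel and the device pointers are hypothetical names, not taken from the paper:

dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y);          // threads per block in x and y
dim3 grid(NX / BLOCK_DIM_X, NY / BLOCK_DIM_Y);   // number of blocks in the (2D) grid
size_t shared_bytes = 0;                         // shared memory per block (0 by default)
cudaStream_t stream = 0;                         // stream to use (0 by default)
some_kernel<<<grid, threads, shared_bytes, stream>>>(d_x, d_x0);
cudaThreadSynchronize();                         // optional lock: wait until the kernel finishes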

When talking about CUDA, four kinds of memory are considered (see Fig. 2). Host memory refers to the CPU-associated RAM, and can only be accessed by the host. The device has three kinds of memory: constant memory, global memory and shared memory. Constant and global memory are accessible by all threads in a grid. Global memory has read/write permissions from each thread of a grid. Constant memory only allows read permissions from each thread of a grid. The host may transfer data from RAM to the device's global or constant memory, or vice-versa. Shared memory is the memory shared by all threads of a block. All threads within the same block have read/write permissions to use the block's shared memory.

Figure 2: CUDA memory model.
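A minimal host-side sketch of the transfers just described, using only standard CUDA runtime calls; the array names and sizes (h_x, NX, NY, NZ) are our assumptions:

size_t bytes = NX * NY * NZ * sizeof(float);          // grid stored as a flat 1D array
float *d_x = NULL;
cudaMalloc((void**)&d_x, bytes);                      // allocate device global memory
cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);  // host RAM -> device global memory
/* ... kernels read and write d_x here ... */
cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);  // device global memory -> host RAM
cudaFree(d_x);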

4. STABLE FLUIDS

The motion of a viscous fluid can be described by the Navier-Stokes (NS) equations. They are differential equations that establish a relation between pressure, velocity and forces during a given time step. Most physically-based fluid simulations use the NS equations to describe the motion of fluids. These simulations are based on three NS equations. One equation just ensures mass conservation, and states that the divergence of the velocity field is zero (∇ · u = 0). The other two equations describe the evolution of velocity (Eq. 1) and density (Eq. 2) over time, as follows:

∂u/∂t = −(u · ∇)u + v∇²u + f    (1)

∂ρ/∂t = −(u · ∇)ρ + k∇²ρ + S    (2)

where u represents the velocity field, v is a scalar describing the viscosity of the fluid, f is the external force added to the velocity field, ρ is the density of the field, k is a scalar that describes the rate at which density diffuses, S is the external source added to the density field, and ∇ = (∂/∂x, ∂/∂y, ∂/∂z) is the gradient operator.

NS-based fluid simulators usually come with some sort of control user interface (CUI) to allow users to interact with the simulation (see steps 2 and 3 in Algorithm 1). In order to solve the previous equations, NS-based fluid simulators work as follows.

Algorithm 1: NS fluid simulator
Output: Updated fluid at each time-step

1: while simulating do
2:   Get forces from UI
3:   Get density source from UI
4:   Update velocity (Add force, Diffuse, Move)
5:   Update density (Add force, Advect, Diffuse)
6:   Display density
7: end while

To better understand the velocity and density updates, let us detail steps 4 and 5 in Algorithm 1.

4.1. Add force (f term in Eq. 1 and S term in Eq. 2)

In this step, the influence of external forces on the field is considered. It consists in adding a force f to the velocity field, or a source S to the density field. For each grid cell, the new velocity u is equal to its previous value u0 plus the result of multiplying the simulation time step Δt by the force f to add, i.e. u = u0 + Δt × f. The same applies to the density, i.e. ρ = ρ0 + Δt × S, where ρ0 is the density's previous value, ρ is the new density value, Δt is the simulation time step and S is the source to add to the density.
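As a minimal sketch, this step is one explicit update per cell; the array names (vx, vx0, fx, d, d0, S) follow the 1D-array convention used later in Section 5, but their exact form is our assumption:

// Add a force to one velocity component and a source to the density.
for (int c = 0; c < NX * NY * NZ; c++) {
    vx[c] = vx0[c] + dt * fx[c];   // u = u0 + Δt × f
    d[c]  = d0[c]  + dt * S[c];    // ρ = ρ0 + Δt × S
}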

4.2. Advect (−(u · ∇)u term in Eq. 1 and −(u · ∇)ρ term in Eq. 2)

The fluid moves according to the system velocity. When moving, the fluid transports objects, densities, itself (self-advection) and other quantities. This is referred to as advection. Note that advection of the velocity also exists during the move step.

4.3. Diffuse (v∇²u term in Eq. 1 and k∇²ρ term in Eq. 2)

Viscosity describes a fluid's internal resistance to flow. This resistance results in diffusion of the momentum (and therefore of the velocity). To diffuse, we need to solve, for the 3D case, the following equation for each grid cell:

D^{n+1}_{i,j,k} − (k·dt/h³) · (D^{n+1}_{i−1,j,k} + D^{n+1}_{i,j−1,k} + D^{n+1}_{i,j,k−1} + D^{n+1}_{i+1,j,k} + D^{n+1}_{i,j+1,k} + D^{n+1}_{i,j,k+1} − 6·D^{n+1}_{i,j,k}) = D^n_{i,j,k}    (3)

In both cases, we will have to solve a sparse linear system of the form Ax = b.

4.4. Move (−(u · ∇)u term in Eq. 1, and ∇ · u = 0)

When the fluid moves, mass conservation must be ensured. This means that the flow leaving each cell (of the grid where the fluid is being simulated) must equal the flow coming in. But the previous steps (add force and diffuse, for velocity) violate the principle of mass conservation. Stam uses a Hodge decomposition of a vector field (the velocity vector field, specifically) to address this issue. Hodge decomposition states that every vector field is the sum of a mass conserving field and a gradient field. To ensure mass conservation, we simply subtract the gradient field from the vector field. In order to do this, we must find the scalar function that defines the gradient field. Computing the gradient field is therefore a matter of solving, for the 3D case, the following Poisson equation for each grid cell:

P_{i−1,j,k} + P_{i,j−1,k} + P_{i,j,k−1} + P_{i+1,j,k} + P_{i,j+1,k} + P_{i,j,k+1} − 6·P_{i,j,k} = (U_{i+1,j,k} − U_{i−1,j,k} + V_{i,j+1,k} − V_{i,j−1,k} + W_{i,j,k+1} − W_{i,j,k−1}) · h    (4)

Solving this Poisson equation for the 3D case, for each grid cell, is the same as solving a sparse symmetric linear system. This system can be solved with the solver used in the diffuse step (J, GS or CG), as described in the next section.
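Once P has been obtained, its gradient is subtracted from the velocity field. The paper does not spell this step out, so the following is only a typical sketch (central differences with Stam-style scaling; the exact factor may differ from the scaling implied by Eq. 4, and all names are our assumptions):

// Subtract the gradient of P from the velocity field (interior cells only).
for (int k = 1; k < NZ - 1; k++)
    for (int j = 1; j < NY - 1; j++)
        for (int i = 1; i < NX - 1; i++) {
            int c = i + NX * (j + NY * k);                    // flat 1D index of cell (i,j,k)
            vx[c] -= 0.5f * (P[c + 1]       - P[c - 1])       / h;
            vy[c] -= 0.5f * (P[c + NX]      - P[c - NX])      / h;
            vz[c] -= 0.5f * (P[c + NX * NY] - P[c - NX * NY]) / h;
        }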

5. SOLVER ALGORITHMS

As previously mentioned, the density diffusion and the velocity diffusion and move steps require a sparse linear system to be solved. To best understand the kind of problem at hand, let us assume we are going to simulate our fluid in a 2² grid domain (blue cells in Fig. 3, on the left). This means that our grid will actually be a 4² grid, where the fluid is inside a container. So the extra cells are the external boundaries of the simulation (red cells in Fig. 3, on the left). To allow a better memory usage, we represent the grid as a 1D array with 4² elements (as shown in Fig. 3, on the left). For a 3D simulation, eight 1D arrays are required. Velocity requires six 1D arrays, two for each of its components, i.e. current and previous values of velocity (vx, vx0, vy, vy0, vz, vz0). Density requires the remaining two 1D arrays (of the eight) for its current and previous values (d, d0).

Figure 3: 2D grid represented by a 1D array (left), and a grid cell interacting with its neighbours (right).
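A common way to address such a flattened grid is through an indexing macro; the exact form below is our assumption (NX, NY, NZ are the total grid dimensions, boundary layer included), not code from the paper:

// Flatten cell (i,j,k) into the 1D arrays; i = 0 and i = NX-1 (and similarly
// for j and k) are the boundary cells of the container.
#define IX(i, j, k) ((i) + NX * ((j) + NY * (k)))

With this convention, vx[IX(i,j,k)] is the current x velocity of cell (i,j,k).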

During the density diffusion and the velocity diffusion and move steps, each cell in the grid interacts with its direct neighbours (see Fig. 3, on the right). In a 4² grid there would be a total of 4² interactions between a cell and its neighbours. Let us consider one of the 1D array pairs, for example the velocity y component's previous (vy0) and current values (vy). If we took the interactions for each fluid cell, we would obtain a linear system of the form Ax = b (see Fig. 4).

In this system, A is a Laplacian matrix of size 16², and its empty cells are zero. For the diffusion and move steps, a system in this form has to be solved. To do so, one can either build and store A in memory, using a 1D array or a data structure of some kind, or use its values directly. In the second option, this means that the central cell value is multiplied by −4 in 2D, or by −6 in 3D, and its direct neighbours are multiplied by 1.
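A minimal sketch of this second, matrix-free option in 3D; the function name apply_A is ours and IX is the hypothetical macro above. The diffusion and move systems then scale these coefficients with a and the central divisor, as in the algorithms below:

// y = A·x without storing A: each interior cell couples with its six neighbours.
void apply_A(const float *x, float *y) {
    for (int c = 0; c < NX * NY * NZ; c++) y[c] = 0.0f;   // boundary rows stay zero
    for (int k = 1; k < NZ - 1; k++)
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++)
                y[IX(i, j, k)] = -6.0f * x[IX(i, j, k)]
                               + x[IX(i - 1, j, k)] + x[IX(i + 1, j, k)]
                               + x[IX(i, j - 1, k)] + x[IX(i, j + 1, k)]
                               + x[IX(i, j, k - 1)] + x[IX(i, j, k + 1)];
}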

Figure 4: The sparse linear system to solve (for a 2² fluid simulation grid).

5.1. Jacobi and Gauss-Seidel Solvers

The Jacobi and Gauss-Seidel solvers, for a given number of iterations (line 1 in Algorithms 2 and 3), and for each cell of the grid (line 2 in Algorithms 2 and 3), calculate the cell value (line 3 in Algorithms 2 and 3). What distinguishes the two solvers is that Gauss-Seidel uses the previously calculated values, while Jacobi does not. Therefore, the Jacobi convergence rate will be slower when compared to the Gauss-Seidel solver. Since the Jacobi solver does not use already-updated cell values, it requires the storage of the new values in a temporary auxiliary 1D array (aux). When all new cell values have been determined, the old cell values are replaced with the values stored in the auxiliary 1D array (lines 5 to 7 in Algorithm 2). After some maths (not addressed in this paper), the diffusion and move equations to solve can be made generic, where only iter and a differ (line 3 in Algorithms 2 and 3).

Algorithm 2: CPU based Jacobi
Input:
x: 1D array with the grid current values
x0: 1D array with the grid previous values
aux: auxiliary 1D array
a: k·dt/h³ (see Eq. 3)
iter: 1 + k·dt/h³ (see Eq. 3)
max_iter: number of times to iterate
Output:
x: 1D array with the grid new interpolated values

1: for iteration = 0 to max_iter do
2:   for all grid cells do
3:     aux_{i,j,k} = (x0_{i,j,k} + a × (x_{i−1,j,k} + x_{i,j−1,k} + x_{i,j,k−1} + x_{i+1,j,k} + x_{i,j+1,k} + x_{i,j,k+1})) / iter
4:   end for
5:   for all grid cells do
6:     x_{i,j,k} = aux_{i,j,k}
7:   end for
8:   Enforce Boundary Conditions
9: end for

Algorithm 3: CPU based Gauss-Seidel
Input:
x: 1D array with the grid current values
x0: 1D array with the grid previous values
a: k·dt/h³ (see Eq. 3)
iter: 1 + k·dt/h³ (see Eq. 3)
max_iter: number of times to iterate
Output:
x: 1D array with the grid new interpolated values

1: for iteration = 0 to max_iter do
2:   for all grid cells do
3:     x_{i,j,k} = (x0_{i,j,k} + a × (x_{i−1,j,k} + x_{i,j−1,k} + x_{i,j,k−1} + x_{i+1,j,k} + x_{i,j+1,k} + x_{i,j,k+1})) / iter
4:   end for
5:   Enforce Boundary Conditions
6: end for

In the GPU, for Jacobi and Gauss-Seidel, we will have a call to a kernel (the kernels are Algorithms 4 and 5). A kernel call has two configuration parameters: grid stands for the number of blocks along the X and Y axes, and threads stands for the number of threads per block. The dimensions of the block are given by BLOCK_DIM_X and BLOCK_DIM_Y. Each block treats all grid slices in the z direction for its threads in x and y.

dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y);
dim3 grid(NX / BLOCK_DIM_X, NY / BLOCK_DIM_Y);

// Jacobi kernel call
_jcb<<<grid, threads>>>(x, x0, a, iter, max_iter);
CUT_CHECK_ERROR("Kernel execution failed");
cudaThreadSynchronize();

or

// Gauss-Seidel red black kernel call
_gs_rb<<<grid, threads>>>(x, x0, a, iter, max_iter);
CUT_CHECK_ERROR("Kernel execution failed");
cudaThreadSynchronize();

In the GPU implementations of the Jacobi and Gauss-Seidel red-black algorithms, the values of i and j (cell coordinates) are obtained from the block, thread and grid information (lines 1 to 2 in Algorithms 4 and 5). The Gauss-Seidel solver is a sequential algorithm, since it requires previously calculated values. The GPU-based version of Gauss-Seidel therefore has two interleaved passes: first it updates the red cells (line 7 in Algorithm 5), and then the black cells (line 11 in Algorithm 5), according to the pattern shown in Fig. 5.

Figure 5: Gauss-Seidel red-black pattern for a 2D grid.

Thus, previous values are used, as in the CPU-based version. The GPU-based implementation of Gauss-Seidel allows more iterations than the CPU-based implementation. Nevertheless, it also takes twice as many iterations to converge to the same values as the CPU-based implementation.

The Jacobi GPU-based version requires temporarily storing each grid cell's new value in a device global memory 1D array (aux). After each iteration, the values stored in x are replaced by the new values temporarily stored in aux (line 8 in Algorithm 4).

The GPU-based versions of all solvers (J, GS, CG) suffer from global memory latency, which appears during successive runs of the solvers (an issue for real-time purposes). However, only the Conjugate Gradient is affected to the point of noticeably degrading the solver's performance.

Algorithm 4: Jacobi GPU kernel
Input:
x: 1D device global memory array with the grid current values
x0: 1D device global memory array with the grid previous values
aux: auxiliary 1D device global memory array
a: k·dt/h³ for diffusion (see Eq. 3); 1 for move (see Eq. 4)
iter: 1 + k·dt/h³ (see Eq. 3); 6 for move (see Eq. 4)
max_iter: number of times to iterate
Output:
x: new interpolated values of x

1: i = threadIdx.x + blockIdx.x × blockDim.x
2: j = threadIdx.y + blockIdx.y × blockDim.y
3: for iteration = 0 to max_iter do
4:   for k = 0 to NZ do
5:     if (i ≠ 0) && (i ≠ NX − 1) && (j ≠ 0) && (j ≠ NY − 1) && (k ≠ 0) && (k ≠ NZ − 1) then
6:       aux_{i,j,k} = (x0_{i,j,k} + a × (x_{i−1,j,k} + x_{i+1,j,k} + x_{i,j−1,k} + x_{i,j+1,k} + x_{i,j,k−1} + x_{i,j,k+1})) / iter
7:       __syncthreads()
8:       x_{i,j,k} = aux_{i,j,k}
9:     end if
10:    Enforce Boundary Conditions
11:  end for
12: end for
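Note that __syncthreads() only synchronizes the threads of a single block. A common alternative to the in-kernel iteration loop of Algorithm 4 (a sketch of our own, not the paper's code, assuming NX, NY, NZ are compile-time constants and IX is the macro above) is to perform one Jacobi sweep per kernel launch and let the host loop over the iterations:

__global__ void jacobi_sweep(float *aux, const float *x, const float *x0, float a, float c) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if (i == 0 || i >= NX - 1 || j == 0 || j >= NY - 1) return;  // skip boundary columns
    for (int k = 1; k < NZ - 1; k++)
        aux[IX(i, j, k)] = (x0[IX(i, j, k)]
            + a * (x[IX(i - 1, j, k)] + x[IX(i + 1, j, k)]
                 + x[IX(i, j - 1, k)] + x[IX(i, j + 1, k)]
                 + x[IX(i, j, k - 1)] + x[IX(i, j, k + 1)])) / c;
}
// Host side: launch jacobi_sweep max_iter times, swapping the roles of x and aux
// and enforcing the boundary conditions between launches.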

5.2. Conjugate Gradient Solver

The Conjugate Gradient algorithm (see Algorithm 6) consists in a series of calls to functions, in the CPU-based version, or in a single kernel call, in the GPU-based implementation:

_cg<<<1, NX>>>(r, p, q, x, b, alpha, beta, rho, rho0, rho_old, a, iter, max_iter);
CUT_CHECK_ERROR("Kernel execution failed");

Before iterating, it is first required (lines 1 to 4 of Algorithm 6) to set the initial values of r and p, and of ρ0 and ρ. After the initial values are set up, we are ready to iterate.

Algorithm 5: Gauss-Seidel red black GPU kernel
Input:
x: 1D device global memory array with the grid current values
x0: 1D device global memory array with the grid previous values
a: k·dt/h³ for diffusion (see Eq. 3); 1 for move (see Eq. 4)
iter: 1 + k·dt/h³ (see Eq. 3); 6 for move (see Eq. 4)
max_iter: number of times to iterate
Output:
x: new interpolated values of x

1: i = threadIdx.x + blockIdx.x × blockDim.x
2: j = threadIdx.y + blockIdx.y × blockDim.y
3: for iteration = 0 to max_iter do
4:   for k = 0 to NZ do
5:     if (i ≠ 0) && (i ≠ NX − 1) && (j ≠ 0) && (j ≠ NY − 1) && (k ≠ 0) && (k ≠ NZ − 1) then
6:       if (i + j) % 2 == 0 then
7:         x_{i,j,k} = (x0_{i,j,k} + a × (x_{i−1,j,k} + x_{i+1,j,k} + x_{i,j−1,k} + x_{i,j+1,k} + x_{i,j,k−1} + x_{i,j,k+1})) / iter
8:       end if
9:       __syncthreads()
10:      if (i + j) % 2 ≠ 0 then
11:        x_{i,j,k} = (x0_{i,j,k} + a × (x_{i−1,j,k} + x_{i+1,j,k} + x_{i,j−1,k} + x_{i,j+1,k} + x_{i,j,k−1} + x_{i,j,k+1})) / iter
12:      end if
13:    end if
14:    Enforce Boundary Conditions
15:  end for
16: end for

We iterate until all iterations are done or the stop criterion is reached (lines 5 and 6 of Algorithm 6). For each iteration, the first step (line 7 of Algorithm 6) is to update q. After updating q, the next step (lines 8 to 12 of Algorithm 6) is to determine α, the new distance to travel along p. During the update of α, the dot product of p by q must be determined. After updating α, we need to determine the iterated values of x and the new residuals r (lines 9 and 10 of Algorithm 6). Before updating each grid cell's previous optimal search vector (gradient), which is orthogonal (conjugate) to all the previous search vectors p (line 14 of Algorithm 6), ρ_old, ρ and β must be updated (lines 11 to 13 of Algorithm 6). After updating β, the new search directions (p values) are set.
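For reference, a serial sketch of Algorithm 6 (our own, not the paper's code): apply_A is the matrix-free product sketched earlier, dot is an assumed dot-product helper, and boundary handling is simplified:

void conjugate_gradient(float *x, const float *b, float *r, float *p, float *q,
                        int max_iter, float tol) {
    int n = NX * NY * NZ;
    apply_A(x, q);                                          // q = A·x
    for (int c = 0; c < n; c++) { r[c] = b[c] - q[c]; p[c] = r[c]; }
    float rho = dot(r, r, n);
    float rho0 = rho;
    for (int iteration = 0; iteration < max_iter; iteration++) {
        if (rho == 0.0f || rho <= tol * tol * rho0) break;  // stop criterion (line 6)
        apply_A(p, q);                                      // q = A·p        (line 7)
        float alpha = rho / dot(p, q, n);                   // α              (line 8)
        for (int c = 0; c < n; c++) { x[c] += alpha * p[c]; r[c] -= alpha * q[c]; }
        float rho_old = rho;
        rho = dot(r, r, n);
        float beta = rho / rho_old;
        for (int c = 0; c < n; c++) p[c] = r[c] + beta * p[c];
        // boundary conditions would be enforced here       (line 15)
    }
}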

The most intuitive way to migrate the Conjugate Gradient from a sequential to a parallel algorithm is to perform its steps (i.e. dot products, updates of grid positions, etc.) with individual kernels, or with the CUDA BLAS library kernels. However, most of these kernels must be called for a certain number of iterations, and the successive invocation of kernels results in timeouts in the simulation. The best solution found was to build one massive kernel. However, this results in losing much of the CUDA performance gains. The reason is related to the parallel version of the dot product, which forces the use of one block with NX threads in x.

Algorithm 6: Conjugate Gradient method
Input:
x: 1D array with the grid current values
x0: 1D array with the grid previous values
r, p, q: auxiliary 1D arrays
a: k·dt/h³ (see Eq. 3)
iter: 1 + k·dt/h³ (see Eq. 3)
max_iter: number of times to iterate
tol: tolerance after which it is safe to state that the values of x converged optimally
Output:
x: 1D array with the grid new interpolated values

1: r = b − A·x
2: p = r
3: ρ = rᵀ · r
4: ρ0 = ρ
5: for iteration = 0 to max_iter do
6:   if (ρ ≠ 0) and (ρ > tol² × ρ0) then
7:     q = A·p
8:     α = ρ / (pᵀ · q)
9:     x += α × p
10:    r −= α × q
11:    ρ_old = ρ
12:    ρ = rᵀ · r
13:    β = ρ / ρ_old
14:    p = r + β × p
15:    Enforce Boundary Conditions
16:  end if
17: end for

Much of the Conjugate Gradient's performance degrades with this restriction. Even worse, this version has worse performance than the CPU-based version.
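The restriction comes from the reductions: on the GPU, a dot product such as rᵀ·r is typically computed with a tree reduction in shared memory, and writing the final scalar from a single kernel is simplest with a single block. A minimal single-block sketch of our own (assuming blockDim.x is a power of two, at most 512):

__global__ void dot_kernel(const float *a, const float *b, float *result, int n) {
    __shared__ float cache[512];                  // one partial sum per thread
    int tid = threadIdx.x;
    float sum = 0.0f;
    for (int c = tid; c < n; c += blockDim.x)     // each thread accumulates a strided slice
        sum += a[c] * b[c];
    cache[tid] = sum;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {  // tree reduction
        if (tid < stride) cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0) *result = cache[0];             // thread 0 writes the final dot product
}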

6. SOLVERS PERFORMANCE ANALYSIS

After implementing the solvers, tests of their overall performance were made (see Tables 1 and 2). The solvers, both for GPU and CPU, were tested on an Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz with 4096 MBytes of DDR2 RAM and an NVIDIA GeForce 8800 GT graphics card. The CPU-based version is purely sequential, i.e. it runs on a single core and is not multi-threaded. The following tables show the total time ('CPU time' and 'GPU time') that each solver takes for a certain number of iterations ('Iterations') and a specific grid size ('Grid Size'). Each solver is invoked a number of times in a single step of the stable fluids method (5 for 2D and 6 for 3D). For 2D, we tested each solver using 10 iterations for all grid sizes. In 3D, we used 4 iterations instead for each solver. In 2D, 10 iterations suffice, while we need a minimum of 4 iterations in 3D to ensure some convergence of the results. More accurate converging values result in better visual quality. A total time above 33 ms does not guarantee real-time performance, i.e. no frame rate greater than 30 frames per second is achieved. The time values were obtained from the average time of 10 tests for each solver implementation, independently of the grid size.

Grid Size | Solver | Iterations | CPU Time (ms) | GPU Time (ms)
32²       | J      | 10         | 0.00          | 2.90219
32²       | GS     | 10         | 0.00          | 2.90032
32²       | CG     | 10         | 0.00          | 5.67346
64²       | J      | 10         | 2.00          | 2.90900
64²       | GS     | 10         | 3.00          | 2.95285
64²       | CG     | 10         | 3.00          | 5.71604
128²      | J      | 10         | 8.00          | 2.93459
128²      | GS     | 10         | 15.00         | 2.89818
128²      | CG     | 10         | 13.00         | 5.73089
256²      | J      | 10         | 35.00         | 2.96003
256²      | GS     | 10         | 60.00         | 2.89882
256²      | CG     | 10         | 56.00         | 5.79169
512²      | J      | 10         | 298.00        | 3.06887
512²      | GS     | 10         | 245.00        | 3.08024
512²      | CG     | 10         | 350.00        | 5.80205

Table 1: Performance of 2D solvers for CPU and GPU, for distinct grid sizes.

Grid Size | Solver | Iterations | CPU Time (ms) | GPU Time (ms)
32³       | J      | 4          | 10.00         | 4.08663
32³       | GS     | 4          | 15.00         | 3.43595
32³       | CG     | 4          | 15.00         | 6.65508
64³       | J      | 4          | 170.00        | 4.16022
64³       | GS     | 4          | 141.00        | 3.48235
64³       | CG     | 4          | 239.00        | 6.73512
128³      | J      | 4          | 1482.00       | 4.24283
128³      | GS     | 4          | 1208.00       | 3.60813
128³      | CG     | 4          | 2017.00       | 6.8272

Table 2: Performance of 3D solvers for CPU and GPU, for distinct grid sizes.

From the previous tables we can draw some conclusions. When comparing the 'CPU time' and 'GPU time' columns, it is clear that, in both the 2D and 3D versions, the GPU-based implementation surpasses the CPU-based implementation. However, the previous tables only present the processing times, not including timeouts. When running the simulation, timeouts appear, caused by device global memory latency in the GPU-based version. These timeouts are hidden (i.e. they exist, but they do not significantly degrade the solver's overall performance) in the Jacobi and Gauss-Seidel GPU implementations. Unfortunately, the Conjugate Gradient timeouts are so severe that the losses overcome the gains in time.

The Conjugate Gradient method converges faster than Jacobi and Gauss-Seidel, in spite of involving more computations during each iteration, but this is not visible for a small number of iterations. Therefore, the CPU-based implementation of stable fluids using the Conjugate Gradient solver is inadequate for real-time purposes when the grid size is over 128².

Except for the 32² grid, the GS and J solvers have significant gains on the GPU. Comparing 'CPU time' and 'GPU time' for the J and GS solvers, it becomes clear that the GPU-based versions are faster. Besides, we can fit more solver iterations per second using the GPU-based implementation. Unlike the CPU-based version, the GPU-based versions of the J and GS solvers enable the usage of a 128³ grid in real-time. Thus, the observation of the 'Iterations' and 'GPU time' columns leads us to conclude that GS is the best choice for 2D and 3D grid sizes. In the CPU-based versions, the best choice in 2D is the J solver, except for the 512² grid, where GS is the best choice. In 3D, the CPU-based version of GS is a better choice for grid sizes larger than 32³.

Another important consideration has to do with the time complexity of both the CPU- and GPU-based implementations of stable fluids. Looking at Tables 1 and 2, we easily observe that the GPU-based solvers have approximately constant time complexity. On the other hand, the CPU-based solvers have quadratic complexity for small grids, but tend to cubic complexity (i.e. the worst case) for larger grids. However, computing the time complexity more accurately would require more exhaustive experiments, as well as a theoretical analysis.

Figs. 6 to 8 show a 128² fluid simulation with internal and moving boundaries (red dots). Rendering was done using OpenGL Vertex Buffer Objects. The CPU version is the one shown here. The frame rate includes the rendering time.

7. CONCLUSIONS AND FUTURE WORK

This paper has described CUDA-based implementations of the Jacobi, Gauss-Seidel and Conjugate Gradient solvers for 3D stable fluids on the GPU. These solvers have then been compared to each other, including their CPU-based implementations. The most important result from this comparative study is that the GPU-based implementations have constant time complexity, which allows for more accurate control in real-time applications.

The 3D stable fluids method has significant memory requirements and time restrictions to solve the Navier-Stokes equations at each time step. It remains to prove that other alternatives (not addressed in this paper) to 3D fluid simulations, such as the Shallow Water Equations [Miklos 09], the Lattice Boltzmann Method [TJ08], Smoothed Particle Hydrodynamics [Schlatter 99], or procedural methods [Jeschke 03], are better choices. We hope to explore other emerging solvers for sparse linear systems in the near future. In particular, we need a solver with a better convergence rate than relaxation techniques (J and GS), but with no significant extra computational effort, such as the CG.

Figure 6: A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the J solver.

References

[Amorim 08] Ronan Amorim, Gundolf Haase, Manfred Liebmann and Rodrigo Weber. Comparing CUDA and OpenGL implementations for a Jacobi iteration. Technical Report 025, University of Graz, SFB, Dec 2008.

[Ash 05] Michael Ash. Simulation and visualization of a 3d fluid. Master's thesis, Université d'Orléans, France, Sep 2005.

[Bell 08] Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec 2008.

[Bolz 03] Jeff Bolz, Ian Farmer, Eitan Grinspun and Peter Schröder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph., 22(3):917–924, 2003.

[Bongart 07] Robert Bongart. Efficient simulation of fluid dynamics in a 3d game engine. Master's thesis, KTH Computer Science and Communication, Stockholm, Sweden, 2007.

Figure 7: A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the GS solver.

[Bridson 07] Robert Bridson and Matthias F. Müller. Fluid simulation: SIGGRAPH 2007 course notes. In ACM SIGGRAPH 2007 Course Notes (SIGGRAPH '07), pages 1–81, New York, NY, USA, 2007. ACM Press.

[Carlson 04] Mark Thomas Carlson. Rigid, melting and flowing fluid. PhD thesis, Atlanta, GA, USA, Jul 2004.

[Goodnight 07] Nolan Goodnight. CUDA/OpenGL fluid simulation. http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html#postProcessGL, 2007.

[Harris 05] Mark J. Harris. Fast fluid dynamics simulation on the GPU. In ACM SIGGRAPH 2005 Course Notes (SIGGRAPH '05), number 220, New York, NY, USA, 2005. ACM Press.

Figure 8: A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the CG solver.

[Jeschke 03] Stefan Jeschke, Hermann Birkholz and Heidrun Schumann. A procedural model for interactive animation of breaking ocean waves. In The 11th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2003 (WSCG '2003), 2003.

[Kass 90] Michael Kass and Gavin Miller. Rapid, stable fluid dynamics for computer graphics. In Proceedings of the 17th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '90), pages 49–57. ACM Press, 1990.

[Keenan 07] Crane Keenan, Llamas Ignacio and Tariq Sarah. Real-time simulation and rendering of 3d fluids. In Nguyen Hubert, editor, GPU Gems 3, chapter 30, pages 633–675. Addison Wesley Professional, Aug 2007.

[Kim 08] Theodore Kim. Hardware-aware analysis and optimization of stable fluids. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games (SI3D '08), pages 99–106, 2008.

[Kruger 03] Jens Krüger and Rüdiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In ACM SIGGRAPH 2003 Papers (SIGGRAPH '03), pages 908–916, New York, NY, USA, 2003. ACM Press.

[Miklos 09] Balint Miklos. Real time fluid simulation using height fields. Semester thesis, http://www.balintmiklos.com/layered_water.pdf, 2009.

[Nickolls 08] John Nickolls, Ian Buck, Michael Garland and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008.

[Noe 04] Karsten Noe. Implementing rapid, stable fluid dynamics on the GPU. http://projects.n-o-e.dk/?page=show&name=GPU%20water%20simulation, 2004.

[NVIDIA 08a] NVIDIA. CUDA programming guide 2.0. http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf, Jul 2008.

[NVIDIA 08b] NVIDIA. CUDA reference manual 2.0. http://developer.download.nvidia.com/compute/cuda/2_0/docs/CudaReferenceManual_2.0.pdf, Jun 2008.

[Owens 08] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone and James C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.

[Schlatter 99] Brian Schlatter. A pedagogical tool using smoothed particle hydrodynamics to model fluid flow past a system of cylinders. Technical report, Oregon State University, 1999.

[Shewchuk 94] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA, Aug 1994.

[Stam 99] Jos Stam. Stable fluids. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99), pages 121–128, New York, NY, USA, Aug 1999. ACM Press.

[Stam 01] Jos Stam. A simple fluid solver based on the FFT. J. Graph. Tools, 6(2):43–52, 2001.

[Stam 03] Jos Stam. Real-time fluid dynamics for games. In Proceedings of the Game Developer Conference, Mar 2003.

[TJ08] Jonas Tölke. Implementation of a Lattice-Boltzmann kernel using the compute unified device architecture developed by NVIDIA. Computing and Visualization in Science, Feb 2008.

[Wiggers 07] W. A. Wiggers, V. Bakker, A. B. J. Kokkeler and G. J. M. Smit. Implementing the conjugate gradient algorithm on multi-core systems. Page 14, Nov 2007.

  • INTRODUCTION
  • PREVIOUS WORK
  • NVIDIA COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
  • STABLE FLUIDS
    • Add force (f term in Eq 1 and S term in Eq 2)
    • Advect (- (u)u term in Eq 1 and - (u) term in Eq 2)
    • Diffuse (v 2 u term in Eq 1 and k 2 term in Eq 2)
    • Move (- (u)u term in Eq 1) and v=0
      • SOLVER ALGORITHMS
        • Jacobi and Gauss-Seidel Solvers
        • Conjugate Gradient Solver
          • SOLVERS PERFORMANCE ANALYSIS
          • CONCLUSIONS AND FUTURE WORK
Page 2: Linear Solvers for Stable Fluids: GPU vs CPUzxu2/acms60212-40212-S12/final_project/Linear... · Linear Solvers for Stable Fluids: GPU vs CPU ... Fluid simulation has been an active

this method To solve the sparse linear system for an2D simulation in the diffusion and move steps both theFast Fourier Transform (FFT) in Stamrsquos [Stam 01] andthe Gauss-Seidel relaxation in Stamrsquos [Stam 03] were used(both implementations run on the CPU) The Gauss-Seidelversion was the only that could support internal and mov-ing boundaries (tips in how to implement internal and mov-ing boundaries are given in Stamrsquos [Stam 03]) Later on in2005 Stamrsquos Gauss-Seidel version was extended to 3Dalso for the CPU by Ash [Ash 05] In 2007 Ashrsquos 3D ver-sion of stable fluids was implemented for CrsquoNedra (opensource virtual reality framework) by Bongart [Bongart 07]this version also runs on the CPU Recently in 2008 Kimpresented in [Kim 08] a full complexity and bandwidthanalysis of Stamrsquos stable fluids version [Stam 03]

In 2005 Stamrsquos stable fluids version was implementedon the GPU by Harris [Harris 05] using the Cg languageThis version supported internal and moving boundariesbut used Jacobi relaxation instead of the Gauss-Seidel re-laxation In 2007 an extension to 3D of Harrisrsquoworkwas carried out by Keenan et al [Keenan 07] Still in2007 when CUDA (see [Nickolls 08] for a quick in-troduction to CUDA programming) was released Good-nightrsquos OpenGL-CUFFT version of Stamrsquos [Stam 01] be-came available [Goodnight 07] The code of this imple-mentation is still part of the CUDA SDK examples

The stable fluids method addressed the stability problemof previous solvers namely Kass and Miller method in[Kass 90] The Kass and Miller method could not be usedfor large time steps consequently it was not possible to useit in real-time computer graphics applications The reasonbehind this was related to the usage of explicit integrationdone with the Jacobi solver instead of stable fluids implicitintegration In spite of the limitations of Kass and Millermethod in 2004 there was a GPU Cg-based version of thismethod implemented by Noe [Noe 04] This GPU versionis used in real-time thanks to the gains obtained from us-ing the GPU

As earlier mentioned it is hard to find not to say thatit does not exist an comparative analysis of the varioussolvers on different architectures (GPU and CPU) How-ever one might find performance analysis comparing theCPU and GPU versions of individual methods In 2007the CG method performance for the CPU and GPU wasanalysed by Wiggers et al in [Wiggers 07] In 2008Amorim et al [Amorim 08]) implemented the Jacobisolver on both CPU and GPU having then carried out aperformance analysis and comparison of both implemen-tations

Implementations of solvers for the GPU using shad-ing language APIs have already been addressednamely Gauss-Seidel and Conjugate Gradient(in [Kruger 03]) and Conjugate Gradient and multi-grid solvers (in [Bolz 03])

To understand the mathematics and algorithm ofCG method (and variants) the reader is referred toShewchuck [Shewchuk 94] Carlson addressed the

problem of applying the CG method to fluids simula-tions [Carlson 04] The Preconditioned CG (PCG) wasalso overviewed during SIGGRAPH 2007 fluid simulationcourse [Bridson 07] where the source code for a C++implementation of a sparse linear system solver was madeavailable SIGGRAPH version builds up and stores thenon-null elements of the sparse matrix When the sparsematrix is stored in memory access efficiency is crucial fora good GPU implementation which is a problem that wasaddressed by Bell and Garland [Bell 08]

An overview on GPU programming including the GPUarchitecture shading languages and recent APIs such asCUDA intended to allow the usage of the capabilities ofGPUs parallel processing was addressed in [Owens 08]

In spite of the previous work described above a CUDAimplementation of Gauss-Seidel and CG solvers for thespecific case of the stable fluids method seems to be non-existent This paper just proposes such CUDA-based im-plementations Besides this paper also carries out a com-parison between Jacobi Gauss-Seidel and CG solvers intheir CPU and CUDA implementations

3 NVIDIA COMPUTE UNIFIED DEVICE ARCHI-TECTURE (CUDA)

With the advent of the first graphics systems it becameclear that the parallel work required by graphics compu-tations should be delegated to another component otherthan the CPU Thus the first graphics cards arrived to al-leviate graphics processing load of the CPU However thegraphics programming was basically done using a kind ofassembly language With the arrival of the graphics pro-gramming APIs (such as OpenGL or DirectX) and latterthe high-level shading languages (such as Cg or GLSL)programming graphics became easier

When in 2007 NVIDIA CUDA (Program-ming Guide [NVIDIA 08a] and Reference Man-ual [NVIDIA 08b]) was released it was made possible tospecify how and what work should be done in the NVIDIAgraphics cards It became possible to program directly theGPU using the CC++ programming language CUDA isavailable for Linux and Windows operative systems andincludes the BLAS and the FFT libraries

The CUDA programming logic separates the work (seeFig 1) that should be performed by the CPU (host) fromthe work that should be performed by the GPU (device)The host runs its own code and launches kernels to be ex-ecuted by the GPU After a kernel call the control returnsimmediately to the host unless a lock is activated withcudaThreadSynchronize The GPU and the CPUwork simultaneously after a kernel call each running itsown code (unless a lock is activated in the host) The GPUruns the kernels by order of arrival (kernel 1 then kernel 2as shown in Fig 1) unless they are running on differentstreams ie kernel execution is asynchronous If a lockis activated the next kernel will be only called when thepreviously called kernels finish their jobs

A kernel has a set of parameters aside from the pointers

Figure 1 CUDA work flow model

to device memory variables or copies of CPU data Theparameters of a kernel specify the number of blocks in agrid (in 2D only) the number of threads (in x y z direc-tions) of each grid block the size of shared memory perblock (0 by default) and the stream to use (0 by default)The maximum number of threads allowed by block is 512(xtimes y times z threads per block) All blocks of the same gridhave the same number of threads Blocks work in paralleleither asynchronously or synchronously With this infor-mation the kernel specifies how the work will be dividedover a grid

When talking about CUDA four kinds of memory are con-sidered (see Fig 2) Host memory refers to the CPU-associated RAM and can only be accessed by the hostThe device has three kinds of memory constant memoryglobal memory and shared memory Constant and globalmemory are accessible by all threads in a grid Globalmemory has readwrite permissions from each thread of agrid Constant memory only allows read permission fromeach thread of a grid The host may transfer data fromRAM to the device global or constant memory or vice-versa Shared memory is the memory shared by all threadsof a block All threads within the same block have read-write permissions to use the block shared memory

4 STABLE FLUIDS

The motion of a viscous fluid can be described by theNavier-Stokes (NS) equations They are differential equa-tions that establish a relation between pressure velocityand forces during a given time step Most physically basedfluid simulations use NS equations to describe the mo-tion of fluids These simulations are based on three NSequations One equation just ensures mass conservationand states that variation of the velocity field equals zero(5v = 0) The other two equations describe the evolu-tion of velocity (Eq 1) and density (Eq 2) over time as

Figure 2 CUDA memory model

follows

partu

partt= minus (u middot nabla)u+ vnabla2u+ f (1)

partρ

partt= minus (u middot nabla) ρ+ knabla2ρ+ S (2)

where u represents the velocity field v is a scalar de-scribing the viscosity of the fluid f is the external forceadded to the velocity field ρ is the density of the fieldk is a scalar that describes the rate at which density dif-fuses S is the external source added to the density field

and nabla =(part

partxpart

partypart

partz

)is the gradient

NS-based fluid simulators usually come with some sort ofcontrol user interface (CUI) to allow for the distinct usersto interact with the simulation (see steps 2 and 3 in Al-gorithm 1) In order to solve the previous equations NS-based fluid simulators work as follows

Algorithm 1 NS fluid simulatorOutput Updated fluid at each time-step

1 while simulating do2 Get forces from UI3 Get density source from UI4 Update velocity (Add force Diffuse Move)5 Update density (Add force Advect Diffuse)6 Display density7 end while

To better understand velocity and density updates let usdetail steps 4 and 5 in Algorithm 1

41 Add force (f term in Eq 1 and S term in Eq2)

In this step the influence of external forces to the field isconsidered It consists in adding a force f to the velocityfield or a source S to the density field For each grid cellthe new velocity u is equal to its previous value u0 plus theresult of multiplying the simulation time step ∆t by the

force f to add ie u = u0 + ∆t times f The same appliesto the density ie ρ = ρ0 + ∆t times S where ρ0 is thedensity previous value ρ is the new density value ∆t isthe simulation time step and S is the source to add to thedensity

42 Advect (minus (unabla)u term in Eq 1 and minus (unabla) ρterm in Eq 2)

The fluid moves according to the system velocity Whenmoving the fluid transports objects densities itself (self-advection) and other quantities This is referred as advec-tion Note that advection of the velocity also exists duringthe move step

43 Diffuse (vnabla2u term in Eq 1 and knabla2 term inEq 2)

Viscosity describes a fluidrsquos internal resistance to flowThis resistance results in diffusion of the momentum (andtherefore velocity) To diffuse we need to solve for the3D case the following equation for each grid cell

Dn+1ijk minus

kdt

h3

(Dn+1

iminus1jk +Dn+1ijminus1k +Dn+1

ijkminus1+

Dn+1i+1jk +Dn+1

ij+1k +Dn+1ijk+1 minus 6Dn+1

ijk

)= Dn

ijk

(3)

In both cases we will have to solve a sparse linear systemin the form Ax = b

44 Move (minus (unabla)u term in Eq 1) and 5v = 0

When the fluid moves mass conservation must be ensuredThis means that the flow leaving each cell (of the gridwhere the fluid is being simulated) must equal the flowcoming in But the previous steps (Add force and dif-fuse for velocity) violate the principle of mass conserva-tion Stam uses a Hodge decomposition of a vector field(the velocity vector field specifically) to address this issueHodge decomposition states that every vector field is thesum of a mass conserving field and a gradient field Toensure mass conservation we simply subtract the gradientfield from the vector field In order to do this we must findthe scalar function that defines the gradient field Comput-ing the gradient field is therefore a matter of solving forthe 3D case the following Poisson equation for each gridcell

Piminus1jk + Pijminus1k + Pijkminus1+

+Pi+1jk + Pij+1k + Pijk+1 minus 6Pijk =

= (Ui+1jk minus Uiminus1jk + Vij+1kminus

minusVijminus1k +Wijk+1 minusWijkminus1)h

(4)

Solving this Poisson equation for the 3D case for eachgrid cell is the same as solving a sparse symmetrical linear

system This system can be solved with the solver usedin the diffuse step (J GS or CG) as described in the nextsection

5 SOLVER ALGORITHMS

As previously mentioned the density diffusion the veloc-ity diffusion and move steps require for a sparse linearsystem to be solved To best understand the kind of prob-lem at hand let us assume we are going to simulate ourfluid in a 22 grid domain (blue cells in Fig 3 on the left)This means that our grid will actually be a 42 grid wherethe fluid is inside a container So the extra cells are the ex-ternal boundaries of the simulation (red cells in Fig 3 onthe left) To allow a better memory usage we represent thegrid as a 1D array with 42 elements (as shown in Fig 3 onthe left) For a 3D simulation eight 1D arrays are requiredVelocity requires six 1D arrays two for each of its com-ponents ie current and previous values of velocity (vxvx0 vy vy0 vz vz0) Density will require the remainingtwo 1D arrays (from the eight) for its current and previousvalues (d d0)

Figure 3 2D Grid represented by a 1Darray(left) and grid cell interacting with its neigh-bours (right)

During the density diffusion and the velocity diffusion andmove steps each cell in the grid interacts with its directneighbours (see Fig 3 on the right) In a 42 grid therewould be a total of 42 interactions between a cell and itsneighbours Let us consider one of the 1D array pairsfor example for the velocity y component previous (vy0)and current values (vy) If we took the interactions foreach fluid cell we would obtain a linear system in the formAx = b (see Fig 4)

In this system A is a Laplacian matrix of size 162 andits empty cells are zero For diffusion and move steps asystem in this form has to be solved To do so one caneither build and store A in memory using a 1D array ora data structure of some kind or to use its values directlyIn the second option this means that the central cell valueis multiplied by minus4 in 2D or by minus6 in 3D and its directneighbours are multiplied by 1

Figure 4 The sparse linear system to solve(for a 22 fluid simulation grid)

51 Jacobi and Gauss-Seidel Solvers

The Jacobi and Gauss-Seidel solvers for a given numberof iterations (line 1 in Algorithms 2 and 3) for each cellof the grid (line 2 in Algorithms 2 and 3) will calculatethe cell value (line 3 in Algorithms 2 and 3) What dis-tinguishes both solvers is that Gauss-Seidel uses the pre-viously calculated values but Jacobi does not ThereforeJacobi convergence rate will be slower when compared tothe Gauss-Seidel solver Since the Jacobi solver does notuse already updated cell values it requires the storage ofthe new values in a temporary auxiliary 1D array (aux)When all new values of cells have been determined theold values of cells will be replaced with the values storedin the auxiliary 1D array (lines 5 to 7 in Algorithm 2) Af-ter some maths (not addressed in this paper) the diffusionand move equations to solve can be made generic whereonly iter and a will differ (line 3 in Algorithms 2 and 3)

Algorithm 2 CPU based JacobiInputx 1D array with the grid current valuesx0 1D array with the grid previous valuesaux auxiliary 1D array

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iterateOutputx 1D array with the grid new interpolated values

1 for iteration = 0 to max iter do2 for all grid cells do3 auxijk = (x0ijk + atimes (ximinus1jk + xijminus1k +xijkminus1 + xi+1jk + xij+1k + xijk+1))iter

4 end for5 for all grid cells do6 xijk = auxijk

7 end for8 Enforce Boundary Conditions9 end for

Algorithm 3 CPU based Gauss-SeidelInputx 1D array with the grid current valuesx0 1D array with the grid previous values

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iterateOutputx 1D array with the grid new interpolated values

1 for iteration = 0 to max iter do2 for all grid cells do3 xijk = (x0ijk + a times (ximinus1jk + xijminus1k +xijkminus1 + xi+1jk + xij+1k + xijk+1))iter

4 end for5 Enforce Boundary Conditions6 end for

In the GPU for Jacobi and the Gauss-Seidel we willhave a call to a kernel (the kernels are Algorithms 4 and5) A kernel will have two parameters grid stands forthe number of blocks in X and Y axis and threadsstands for the number of threads per block The dimen-sions of the block are given with BLOCK DIM X andBLOCK DIM Y Each block treats all grid slices in zdirection for the threads in x and y

dim3 threads(BLOCK_DIM_X, BLOCK_DIM_Y);
dim3 grid(NX / BLOCK_DIM_X, NY / BLOCK_DIM_Y);

/* Jacobi kernel call */
_jcb<<<grid, threads>>>(x, x0, a, iter, max_iter);
CUT_CHECK_ERROR("Kernel execution failed");
cudaThreadSynchronize();

or

/* Gauss-Seidel red-black kernel call */
_gs_rb<<<grid, threads>>>(x, x0, a, iter, max_iter);
CUT_CHECK_ERROR("Kernel execution failed");
cudaThreadSynchronize();

In the GPU implementations of the Jacobi and Gauss-Seidel red-black algorithms, the values of i and j (cell coordinates) are obtained from the block, thread, and grid information (lines 1 to 2 in Algorithms 4 and 5). The Gauss-Seidel solver is a sequential algorithm, since it requires previously calculated values. The GPU-based version of Gauss-Seidel therefore has two interleaved passes: first it updates the red cells (line 7 in Algorithm 5) and then the black cells (line 11 in Algorithm 5), according to the pattern shown in Fig. 5.

Figure 5. Gauss-Seidel red-black pattern for a 2D grid.

Thus, previous values are used as in the CPU-based version. The GPU-based implementation of Gauss-Seidel allows more iterations than the CPU-based implementation; nevertheless, it also takes twice as many iterations to converge to the same values as the CPU-based implementation.

The Jacobi GPU-based version requires temporarily storing each grid cell's new value in a device global memory 1D array (aux). After each iteration, the values stored in x are replaced by the new values temporarily stored in aux (line 8 in Algorithm 4).

The GPU-based versions of all solvers (J, GS, CG) suffer from global memory latency, which appears during successive runs of the solvers (an issue for real-time purposes). However, only the Conjugate Gradient is affected to the point of notoriously degrading the solver's performance.

Algorithm 4 Jacobi GPU kernel
Input:
x: 1D device global memory array with the grid current values
x0: 1D device global memory array with the grid previous values
aux: auxiliary 1D device global memory array
a: k·dt/h³ for diffusion (see Eq. 3); 1 for move (see Eq. 4)
iter: 1 + k·dt/h³ for diffusion (see Eq. 3); 6 for move (see Eq. 4)
max_iter: number of times to iterate
Output:
x: new interpolated values of x

1: i = threadIdx.x + blockIdx.x × blockDim.x
2: j = threadIdx.y + blockIdx.y × blockDim.y
3: for iteration = 0 to max_iter do
4:   for k = 0 to NZ do
5:     if (i ≠ 0) && (i ≠ NX − 1) && (j ≠ 0) && (j ≠ NY − 1) && (k ≠ 0) && (k ≠ NZ − 1) then
6:       aux_{i,j,k} = (x0_{i,j,k} + a × (x_{i−1,j,k} + x_{i+1,j,k} + x_{i,j−1,k} + x_{i,j+1,k} + x_{i,j,k−1} + x_{i,j,k+1})) / iter
7:       __syncthreads()
8:       x_{i,j,k} = aux_{i,j,k}
9:     end if
10:    Enforce Boundary Conditions
11:  end for
12: end for
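Written out in CUDA C, a kernel following Algorithm 4 might look as below (a hedged sketch with our own names, not the paper's source; note that __syncthreads() only synchronises the threads of one block, so, as in the algorithm listing, the sweep is not synchronised across blocks):

/* CUDA sketch of the Jacobi kernel of Algorithm 4. Each thread handles one
   (i,j) column of the grid and loops over all z slices. */
__global__ void jacobi_kernel(float *x, const float *x0, float *aux,
                              float a, float iter_den, int max_iter)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;

    for (int it = 0; it < max_iter; it++) {
        for (int k = 1; k < NZ - 1; k++) {
            if (i > 0 && i < NX - 1 && j > 0 && j < NY - 1) {
                aux[IX(i, j, k)] =
                    (x0[IX(i, j, k)] +
                     a * (x[IX(i - 1, j, k)] + x[IX(i + 1, j, k)] +
                          x[IX(i, j - 1, k)] + x[IX(i, j + 1, k)] +
                          x[IX(i, j, k - 1)] + x[IX(i, j, k + 1)])) / iter_den;
            }
            /* Block-local barrier only: threads of other blocks are not
               synchronised, as discussed in the text. */
            __syncthreads();
            if (i > 0 && i < NX - 1 && j > 0 && j < NY - 1)
                x[IX(i, j, k)] = aux[IX(i, j, k)];
        }
        /* Boundary conditions would be enforced here. */
    }
}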

5.2. Conjugate Gradient Solver

The Conjugate Gradient algorithm (see Algorithm 6) consists of a series of calls to functions in the CPU-based version, or of a single kernel call in the GPU-based implementation:

_cg<<<1, NX>>>(r, p, q, x, b, alpha, beta, rho, rho0, rho_old, a, iter, max_iter);
CUT_CHECK_ERROR("Kernel execution failed");

Before iterating, it is first required (lines 1 to 4 of Algorithm 6) to set the initial values of r and p, and of ρ0 and ρ. After the initial values are set up, we are ready to iterate.

Algorithm 5 Gauss-Seidel red-black GPU kernel
Input:
x: 1D device global memory array with the grid current values
x0: 1D device global memory array with the grid previous values
a: k·dt/h³ for diffusion (see Eq. 3); 1 for move (see Eq. 4)
iter: 1 + k·dt/h³ for diffusion (see Eq. 3); 6 for move (see Eq. 4)
max_iter: number of times to iterate
Output:
x: new interpolated values of x

1: i = threadIdx.x + blockIdx.x × blockDim.x
2: j = threadIdx.y + blockIdx.y × blockDim.y
3: for iteration = 0 to max_iter do
4:   for k = 0 to NZ do
5:     if (i ≠ 0) && (i ≠ NX − 1) && (j ≠ 0) && (j ≠ NY − 1) && (k ≠ 0) && (k ≠ NZ − 1) then
6:       if (i + j) % 2 == 0 then
7:         x_{i,j,k} = (x0_{i,j,k} + a × (x_{i−1,j,k} + x_{i+1,j,k} + x_{i,j−1,k} + x_{i,j+1,k} + x_{i,j,k−1} + x_{i,j,k+1})) / iter
8:       end if
9:       __syncthreads()
10:      if (i + j) % 2 ≠ 0 then
11:        x_{i,j,k} = (x0_{i,j,k} + a × (x_{i−1,j,k} + x_{i+1,j,k} + x_{i,j−1,k} + x_{i,j+1,k} + x_{i,j,k−1} + x_{i,j,k+1})) / iter
12:      end if
13:    end if
14:    Enforce Boundary Conditions
15:  end for
16: end for
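The core of the red-black update can be factored into a small device helper (again our own sketch, not the paper's code); the kernel body then calls it once per colour with a __syncthreads() in between, mirroring lines 6 to 12 of Algorithm 5:

/* One red-black update at cell (i,j,k) for the colour selected by 'red'
   (sketch of lines 6-12 of Algorithm 5). */
__device__ void gs_update(float *x, const float *x0, float a, float iter_den,
                          int i, int j, int k, int red)
{
    if (((i + j) % 2 == 0) == red)
        x[IX(i, j, k)] =
            (x0[IX(i, j, k)] +
             a * (x[IX(i - 1, j, k)] + x[IX(i + 1, j, k)] +
                  x[IX(i, j - 1, k)] + x[IX(i, j + 1, k)] +
                  x[IX(i, j, k - 1)] + x[IX(i, j, k + 1)])) / iter_den;
}

/* Inside the kernel, per z slice:
   gs_update(x, x0, a, iter_den, i, j, k, 1);   // red pass
   __syncthreads();
   gs_update(x, x0, a, iter_den, i, j, k, 0);   // black pass */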

We iterate until all iterations are done or until the stop criterion is achieved (lines 5 and 6 of Algorithm 6). For each iteration, the first step (line 7 of Algorithm 6) is to update q. After updating q, the next step (lines 8 to 12 of Algorithm 6) is to determine the new distance to travel along p, i.e., α. During the update of α, the dot product of p by q must be determined. After updating α, we need to determine the iterated values of x and the new residues r (lines 9 and 10 of Algorithm 6). Before updating each grid cell's previous optimal search vector (gradient), which is orthogonal (conjugate) to all the previous search vectors p (line 14 of Algorithm 6), ρ_old, ρ, and β must be updated (lines 11 to 13 of Algorithm 6). After updating β, the new search directions (p values) are set.

The most intuitive way to migrate the Conjugate Gradient from a sequential to a parallel algorithm is to perform its steps (i.e., dot products, updates of grid positions, etc.) in separate kernels, or by using the CUDA BLAS library kernels. However, most of these kernels must be called for a certain number of iterations, and the successive invocation of kernels results in timeouts in the simulation. The best solution found was to build one single, massive kernel. However, this results in losing much of the CUDA performance gains. The reason is related to the parallel version of the dot product, which forces the use of one block with NX threads in x. The performance of many of the Conjugate Gradient steps degrades with this restriction. Even worse, this version has worse performance than the CPU-based version.
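To see why the dot product imposes this restriction, consider a typical single-block shared-memory reduction (a hedged sketch; DOT_THREADS is our own constant and a power of two, unlike the paper's single block of NX threads): all partial sums must be combined before α or β can be computed, so the reduction cannot be split across blocks without extra kernel launches.

#define DOT_THREADS 256   /* one block with a power-of-two number of threads */

/* Single-block dot product via shared-memory tree reduction (sketch). */
__global__ void dot_kernel(const float *u, const float *v, float *result, int n)
{
    __shared__ float partial[DOT_THREADS];
    int tid = threadIdx.x;
    float sum = 0.0f;
    for (int idx = tid; idx < n; idx += blockDim.x)   /* stride over the data */
        sum += u[idx] * v[idx];
    partial[tid] = sum;
    __syncthreads();
    /* Tree reduction inside the single block. */
    for (int stride = DOT_THREADS / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        *result = partial[0];   /* only thread 0 holds the full dot product */
}

/* Launch example: dot_kernel<<<1, DOT_THREADS>>>(u, v, d_result, SIZE); */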

Algorithm 6 Conjugate Gradient method
Input:
x: 1D array with the grid current values
x0: 1D array with the grid previous values
r, p, q: auxiliary 1D arrays
a: k·dt/h³ (see Eq. 3)
iter: 1 + k·dt/h³ (see Eq. 3)
max_iter: number of times to iterate
tol: tolerance after which it is safe to state that the values of x have converged optimally
Output:
x: 1D array with the grid new interpolated values

1: r = b − Ax
2: p = r
3: ρ = r^T · r
4: ρ0 = ρ
5: for iteration = 0 to max_iter do
6:   if (ρ ≠ 0) and (ρ > tol² × ρ0) then
7:     q = Ap
8:     α = ρ / (p^T · q)
9:     x += α × p
10:    r −= α × q
11:    ρ_old = ρ
12:    ρ = r^T · r
13:    β = ρ / ρ_old
14:    p = r + β × p
15:    Enforce Boundary Conditions
16:  end if
17: end for
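For reference, a serial C sketch of the structure of Algorithm 6 (matrix-free, reusing apply_A, IX and set_bnd from the earlier sketches; dot is a plain CPU helper; r, p and q are assumed to be zero-initialised so their boundary entries do not affect the dot products):

static float dot(const float *u, const float *v, int n)
{
    float s = 0.0f;
    for (int idx = 0; idx < n; idx++)
        s += u[idx] * v[idx];
    return s;
}

/* CPU Conjugate Gradient (sketch of Algorithm 6). */
void conjugate_gradient(float *x, const float *b,
                        float *r, float *p, float *q,
                        int max_iter, float tol)
{
    /* Lines 1-4: r = b - A x, p = r, rho = r.r, rho0 = rho. */
    for (int k = 1; k < NZ - 1; k++)
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++) {
                int idx = IX(i, j, k);
                r[idx] = b[idx] - apply_A(x, i, j, k);
                p[idx] = r[idx];
            }
    float rho = dot(r, r, SIZE), rho0 = rho;

    for (int it = 0; it < max_iter; it++) {
        if (rho == 0.0f || rho <= tol * tol * rho0)   /* stop criterion, line 6 */
            break;
        /* Line 7: q = A p (matrix-free). */
        for (int k = 1; k < NZ - 1; k++)
            for (int j = 1; j < NY - 1; j++)
                for (int i = 1; i < NX - 1; i++)
                    q[IX(i, j, k)] = apply_A(p, i, j, k);
        /* Lines 8-10: step length alpha, then update x and the residues r. */
        float alpha = rho / dot(p, q, SIZE);
        for (int idx = 0; idx < SIZE; idx++) {
            x[idx] += alpha * p[idx];
            r[idx] -= alpha * q[idx];
        }
        /* Lines 11-14: rho_old, rho, beta, and the new search direction p. */
        float rho_old = rho;
        rho = dot(r, r, SIZE);
        float beta = rho / rho_old;
        for (int idx = 0; idx < SIZE; idx++)
            p[idx] = r[idx] + beta * p[idx];
        set_bnd(x);   /* line 15: enforce boundary conditions */
    }
}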


6. SOLVERS PERFORMANCE ANALYSIS

After implementing the solvers, tests of their overall performance were made (see Tables 1 and 2). The solvers, both for GPU and CPU, were tested on an Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz with 4096 MB of DDR2 RAM and an NVIDIA GeForce 8800 GT graphics card. The CPU-based version is purely sequential, i.e., it runs in a single core and is not multi-threaded. The following tables show the total time ('CPU time' and 'GPU time') that each solver takes for a certain number of iterations ('Iterations') and a specific grid size ('Grid Size'). Each solver is invoked a number of times in a single step of the stable fluids method (5 for 2D and 6 for 3D). For 2D we tested each solver using 10 iterations for all grid sizes; in 3D we used 4 iterations for each solver instead. In 2D, 10 iterations suffice, while we need a minimum of 4 iterations in 3D to ensure some convergence of the results. More accurate converging values result in better visual quality. A total time above 33 ms does not guarantee real-time performance, i.e., no frame rate greater than 30 frames per second is achieved. The time values were obtained from the average time of 10 tests for each solver implementation, independently of the grid size.

Grid Size   Solver   Iterations   CPU Time (ms)   GPU Time (ms)
32²         J        10           0.00            2.90219
32²         GS       10           0.00            2.90032
32²         CG       10           0.00            5.67346
64²         J        10           2.00            2.90900
64²         GS       10           3.00            2.95285
64²         CG       10           3.00            5.71604
128²        J        10           8.00            2.93459
128²        GS       10           15.00           2.89818
128²        CG       10           13.00           5.73089
256²        J        10           35.00           2.96003
256²        GS       10           60.00           2.89882
256²        CG       10           56.00           5.79169
512²        J        10           298.00          3.06887
512²        GS       10           245.00          3.08024
512²        CG       10           350.00          5.80205

Table 1. Performance of 2D solvers for CPU and GPU for distinct grid sizes.

Grid Size   Solver   Iterations   CPU Time (ms)   GPU Time (ms)
32³         J        4            10.00           4.08663
32³         GS       4            15.00           3.43595
32³         CG       4            15.00           6.65508
64³         J        4            170.00          4.16022
64³         GS       4            141.00          3.48235
64³         CG       4            239.00          6.73512
128³        J        4            1482.00         4.24283
128³        GS       4            1208.00         3.60813
128³        CG       4            2017.00         6.8272

Table 2. Performance of 3D solvers for CPU and GPU for distinct grid sizes.

From the previous tables we can draw some conclusions. When comparing the columns 'CPU time' and 'GPU time', it is clear that, in both the 2D and 3D versions, the GPU-based implementation surpasses the CPU-based implementation. However, the previous tables only present the processing times, not including timeouts. When running the simulation, timeouts appear from device global memory latency in the GPU-based version. These timeouts are hidden (i.e., they exist, but they do not significantly degrade the solver's overall performance) in the Jacobi and Gauss-Seidel GPU implementations. Unfortunately, the Conjugate Gradient timeouts are so severe that the losses overcome the gains in time.

The Conjugate Gradient method converges faster than Jacobi and Gauss-Seidel, in spite of involving more computations during each iteration, but this is not visible for a small number of iterations. Therefore, the CPU-based implementation of stable fluids using the Conjugate Gradient solver is inadequate for real-time purposes when the grid size is over 128².

Except for the 32² grid, the GS and J solvers have significant gains on the GPU. Comparing 'CPU time' and 'GPU time' for the J and GS solvers, it becomes clear that the GPU-based versions are faster. Besides, we can fit more solver iterations per second using the GPU-based implementation. Unlike the CPU-based version, the GPU-based versions of the J and GS solvers enable the usage of a 128³ grid in real-time. Thus, the observation of the 'Iterations' and 'GPU time' columns leads us to conclude that GS is the best choice for 2D and 3D grid sizes. In the CPU-based versions, the best choice in 2D is the J solver, except for the 512² grid, where GS is the best choice. In 3D, the CPU-based version of GS is a better choice for grid sizes above 32³.

Another important consideration has to do with the time complexity of both the CPU- and GPU-based implementations of stable fluids. Looking at Tables 1 and 2, we easily observe that the GPU-based solvers have approximately constant time complexity. On the other hand, the CPU-based solvers have quadratic complexity for small grids, but tend to cubic complexity (i.e., the worst case) for larger grids. However, computing the time complexity more accurately would require more exhaustive experiments, as well as a theoretical analysis.

Figs. 6 to 8 show a 128² fluid simulation with internal and moving boundaries (red dots). Rendering was done using OpenGL Vertex Buffer Objects. The CPU version is the one shown here. The frame rate includes the rendering time.

7. CONCLUSIONS AND FUTURE WORK

This paper has described CUDA-based implementations of the Jacobi, Gauss-Seidel, and Conjugate Gradient solvers for 3D stable fluids on the GPU. These solvers have then been compared to each other, including their CPU-based implementations. The most important result of this comparative study is that the GPU-based implementations have approximately constant time complexity, which allows for more accurate control in real-time applications.

The 3D stable fluids method has significant memory requirements and time restrictions to solve the Navier-Stokes equations at each time step. It remains to prove that other alternatives to 3D fluid simulation (not addressed in this paper), such as the Shallow Water Equations [Miklos 09], the Lattice Boltzmann Method [TJ08], Smoothed Particle Hydrodynamics [Schlatter 99], or procedural methods [Jeschke 03], are better choices. We hope to explore other emerging solvers for sparse linear systems in the near future. In particular, we need a solver with a better convergence rate than relaxation techniques (J and GS), but without the significant extra computational effort of the CG.

Figure 6. A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the J solver.

References

[Amorim 08] Ronan Amorim, Gundolf Haase, Manfred Liebmann, and Rodrigo Weber. Comparing CUDA and OpenGL implementations for a Jacobi iteration. Technical Report 025, University of Graz, SFB, Dec 2008.

[Ash 05] Michael Ash. Simulation and visualization of a 3D fluid. Master's thesis, Université d'Orléans, France, Sep 2005.

[Bell 08] Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec 2008.

[Bolz 03] Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph., 22(3):917–924, 2003.

[Bongart 07] Robert Bongart. Efficient simulation of fluid dynamics in a 3D game engine. Master's thesis, KTH Computer Science and Communication, Stockholm, Sweden, 2007.

Figure 7. A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the GS solver.

[Bridson 07] Robert Bridson and Matthias F. Müller. Fluid simulation: SIGGRAPH 2007 course notes. In ACM SIGGRAPH 2007 Course Notes (SIGGRAPH '07), pages 1–81, New York, NY, USA, 2007. ACM Press.

[Carlson 04] Mark Thomas Carlson. Rigid, melting and flowing fluid. PhD thesis, Atlanta, GA, USA, Jul 2004.

[Goodnight 07] Nolan Goodnight. CUDA/OpenGL fluid simulation. http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html#postProcessGL, 2007.

[Harris 05] Mark J. Harris. Fast fluid dynamics simulation on the GPU. In ACM SIGGRAPH 2005 Course Notes (SIGGRAPH '05), number 220, New York, NY, USA, 2005. ACM Press.

Figure 8. A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the CG solver.

[Jeschke 03] Stefan Jeschke, Hermann Birkholz, and Heidrun Schumann. A procedural model for interactive animation of breaking ocean waves. In The 11th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2003 (WSCG '2003), 2003.

[Kass 90] Michael Kass and Gavin Miller. Rapid, stable fluid dynamics for computer graphics. In Proceedings of the 17th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '90), pages 49–57. ACM Press, 1990.

[Keenan 07] Crane Keenan, Llamas Ignacio, and Tariq Sarah. Real-time simulation and rendering of 3D fluids. In Nguyen Hubert, editor, GPU Gems 3, chapter 30, pages 633–675. Addison Wesley Professional, Aug 2007.

[Kim 08] Theodore Kim. Hardware-aware analysis and optimization of stable fluids. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games (SI3D '08), pages 99–106, 2008.

[Kruger 03] Jens Krüger and Rüdiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In ACM SIGGRAPH 2003 Papers (SIGGRAPH '03), pages 908–916, New York, NY, USA, 2003. ACM Press.

[Miklos 09] Balint Miklos. Real time fluid simulation using height fields. Semester thesis, http://www.balintmiklos.com/layered_water.pdf, 2009.

[Nickolls 08] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008.

[Noe 04] Karsten Noe. Implementing rapid, stable fluid dynamics on the GPU. http://projects.n-o-e.dk/?page=show&name=GPU%20water%20simulation, 2004.

[NVIDIA 08a] NVIDIA. CUDA programming guide 2.0. http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf, Jul 2008.

[NVIDIA 08b] NVIDIA. CUDA reference manual 2.0. http://developer.download.nvidia.com/compute/cuda/2_0/docs/CudaReferenceManual_2.0.pdf, Jun 2008.

[Owens 08] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.

[Schlatter 99] Brian Schlatter. A pedagogical tool using smoothed particle hydrodynamics to model fluid flow past a system of cylinders. Technical report, Oregon State University, 1999.

[Shewchuk 94] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA, Aug 1994.

[Stam 99] Jos Stam. Stable fluids. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99), pages 121–128, New York, NY, USA, Aug 1999. ACM Press.

[Stam 01] Jos Stam. A simple fluid solver based on the FFT. J. Graph. Tools, 6(2):43–52, 2001.

[Stam 03] Jos Stam. Real-time fluid dynamics for games. In Proceedings of the Game Developer Conference, Mar 2003.

[TJ08] Jonas Tölke. Implementation of a Lattice-Boltzmann kernel using the Compute Unified Device Architecture developed by NVIDIA. Computing and Visualization in Science, Feb 2008.

[Wiggers 07] W. A. Wiggers, V. Bakker, A. B. J. Kokkeler, and G. J. M. Smit. Implementing the conjugate gradient algorithm on multi-core systems. Page 14, Nov 2007.


5 SOLVER ALGORITHMS

As previously mentioned the density diffusion the veloc-ity diffusion and move steps require for a sparse linearsystem to be solved To best understand the kind of prob-lem at hand let us assume we are going to simulate ourfluid in a 22 grid domain (blue cells in Fig 3 on the left)This means that our grid will actually be a 42 grid wherethe fluid is inside a container So the extra cells are the ex-ternal boundaries of the simulation (red cells in Fig 3 onthe left) To allow a better memory usage we represent thegrid as a 1D array with 42 elements (as shown in Fig 3 onthe left) For a 3D simulation eight 1D arrays are requiredVelocity requires six 1D arrays two for each of its com-ponents ie current and previous values of velocity (vxvx0 vy vy0 vz vz0) Density will require the remainingtwo 1D arrays (from the eight) for its current and previousvalues (d d0)

Figure 3 2D Grid represented by a 1Darray(left) and grid cell interacting with its neigh-bours (right)

During the density diffusion and the velocity diffusion andmove steps each cell in the grid interacts with its directneighbours (see Fig 3 on the right) In a 42 grid therewould be a total of 42 interactions between a cell and itsneighbours Let us consider one of the 1D array pairsfor example for the velocity y component previous (vy0)and current values (vy) If we took the interactions foreach fluid cell we would obtain a linear system in the formAx = b (see Fig 4)

In this system A is a Laplacian matrix of size 162 andits empty cells are zero For diffusion and move steps asystem in this form has to be solved To do so one caneither build and store A in memory using a 1D array ora data structure of some kind or to use its values directlyIn the second option this means that the central cell valueis multiplied by minus4 in 2D or by minus6 in 3D and its directneighbours are multiplied by 1

Figure 4 The sparse linear system to solve(for a 22 fluid simulation grid)

51 Jacobi and Gauss-Seidel Solvers

The Jacobi and Gauss-Seidel solvers for a given numberof iterations (line 1 in Algorithms 2 and 3) for each cellof the grid (line 2 in Algorithms 2 and 3) will calculatethe cell value (line 3 in Algorithms 2 and 3) What dis-tinguishes both solvers is that Gauss-Seidel uses the pre-viously calculated values but Jacobi does not ThereforeJacobi convergence rate will be slower when compared tothe Gauss-Seidel solver Since the Jacobi solver does notuse already updated cell values it requires the storage ofthe new values in a temporary auxiliary 1D array (aux)When all new values of cells have been determined theold values of cells will be replaced with the values storedin the auxiliary 1D array (lines 5 to 7 in Algorithm 2) Af-ter some maths (not addressed in this paper) the diffusionand move equations to solve can be made generic whereonly iter and a will differ (line 3 in Algorithms 2 and 3)

Algorithm 2 CPU based JacobiInputx 1D array with the grid current valuesx0 1D array with the grid previous valuesaux auxiliary 1D array

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iterateOutputx 1D array with the grid new interpolated values

1 for iteration = 0 to max iter do2 for all grid cells do3 auxijk = (x0ijk + atimes (ximinus1jk + xijminus1k +xijkminus1 + xi+1jk + xij+1k + xijk+1))iter

4 end for5 for all grid cells do6 xijk = auxijk

7 end for8 Enforce Boundary Conditions9 end for

Algorithm 3 CPU based Gauss-SeidelInputx 1D array with the grid current valuesx0 1D array with the grid previous values

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iterateOutputx 1D array with the grid new interpolated values

1 for iteration = 0 to max iter do2 for all grid cells do3 xijk = (x0ijk + a times (ximinus1jk + xijminus1k +xijkminus1 + xi+1jk + xij+1k + xijk+1))iter

4 end for5 Enforce Boundary Conditions6 end for

In the GPU for Jacobi and the Gauss-Seidel we willhave a call to a kernel (the kernels are Algorithms 4 and5) A kernel will have two parameters grid stands forthe number of blocks in X and Y axis and threadsstands for the number of threads per block The dimen-sions of the block are given with BLOCK DIM X andBLOCK DIM Y Each block treats all grid slices in zdirection for the threads in x and y

dim3 threads (BLOCK_DIM_X BLOCK_DIM_Y ) dim3 grid (NX BLOCK_DIM_Z NY BLOCK_DIM_Y )

J a c o b i k e r n e l c a l l_jcbltltltgrid threadsgtgtgt(x x0 a iter max_iter ) CUT_CHECK_ERROR (Kernel execution failed ) cudaThreadSynchronize ( )

o r

GaussminusS e i d e l r e d b l a c k k e r n e l c a l l_gs_rbltltltgrid threads gtgtgt(x x0 a iter max_iter ) CUT_CHECK_ERROR (Kernel execution failed ) cudaThreadSynchronize ( )

In the GPU implementations of Jacobi and Gauss-Seidelred black algorithms the values of i and j (cell coordi-nates) will be obtained with the blocks threads and gridinformation (lines 1 to 2 in Algorithms 4 and 5) TheGauss-Seidel solver is a sequential algorithm since it re-quires previous values to be calculated The GPU-basedversion of Gauss-Seidel has two interleaved passes first itupdates the red cells (line 7 in 5) and then the black cells(line 11 in Algorithm 5) according to the pattern shown inFig 5

Figure 5 Gauss-Seidel red black pattern fora 2D grid

Thus previous values are used as in the CPU-based ver-sion The GPU-based implementation of Gauss-Seidel al-lows more iterations than the CPU-based implementationNevertheless it also takes two times more iterations to con-verge to the same values as the CPU-based implementa-tion

The Jacobi GPU-based version requires to temporarilystore each grid cell new value in a device global memory1D array (aux) After each iteration the values stored in xare replaced by the new values temporarily stored in axu(line 8 in Algorithm 4)

The GPU-based version of all solvers (J GS CG) sufferfrom global memory latency which appears during suc-cessive runs of the solvers (an issue for real time purposes)However only the Conjugate Gradient is affected to a levelof degrading notoriously the solver performance

Algorithm 4 Jacobi GPU kernelInputx 1D device global memory array with the grid currentvaluesx0 1D device global memory array with the grid previousvaluesaux auxiliary 1D device global memory array

akdt

h3for diffusion (see Eq 3) 1 for move (see Eq 4)

iter 1 +kdt

h3(see Eq 3) 6 for Move (see Eq 4)

max iter number of times to iterateOutputx new interpolated values of x

1 i = threadIdxx+ blockIdxxtimes blockDimx2 j = threadIdxy + blockIdxy times blockDimy3 for iteration = 0 to max iter do4 for k = 0 to NZ do5 if (i = 0) ampamp (i = NX minus 1) ampamp (j = 0) ampamp

(j = NY minus 1)ampamp(k = 0)ampamp(k = NZ minus 1) then6 auxijk = (x0ijk +atimes(ximinus1jk +xi+1jk +xijminus1k + xij+1k + xijkminus1 + xijk+1))iter

7 syncthreads8 xijk = auxijk

9 end if10 Enforce Boundary Conditions11 end for12 end for

52 Conjugate Gradient Solver

The Conjugate Gradient algorithm (see Algorithm 6) con-sists in a series of calls to functions in the CPU-based ver-sion or to a kernel call in the GPU-based implementation

_cgltltlt1NXgtgtgt(r p q x b alpha beta rho rho0 rho_old alarriter max_iter )

CUT_CHECK_ERROR (Kernel execution failed )

Before iterating it is first required (lines 1 to 4 of Algo-rithm 6) to set the initial values of r and p and of ρ0 andρ After the initial values are set up we are ready to iterate

Algorithm 5 Gauss-Seidel red black GPU kernelInputx 1D device global memory array with the grid currentvaluesx0 1D device global memory array with the grid previousvaluesakdt

h3for diffusion (see Eq 3) 1 for move (see Eq 4)

iter 1 +kdt

h3(see Eq 3) 6 for Move (see Eq 4)

max iter number of times to iterateOutputx new interpolated values of x

1 i = threadIdxx+ blockIdxxtimes blockDimx2 j = threadIdxy + blockIdxy times blockDimy3 for iteration = 0 to max iter do4 for k = 0 to NZ do5 if (i = 0) ampamp (i = NX minus 1) ampamp (j = 0) ampamp

(j = NY minus 1)ampamp(k = 0)ampamp(k = NZ minus 1) then6 if (i+ j)2 == 0 then7 xijk = (x0ijk +atimes(ximinus1jk +xi+1jk +xijminus1k + xij+1k + xijkminus1 + xijk+1))iter

8 end if9 syncthreads

10 if (i+ j)2 = 0 then11 xijk = (x0ijk +atimes(ximinus1jk +xi+1jk +

xijminus1k + xij+1k + xijkminus1 + xijk+1))iter12 end if13 end if14 Enforce Boundary Conditions15 end for16 end for

We will iterate until all iterations are done or the stop cri-terion is achieved (lines 5 and 6 of Algorithm 6) For eachiteration the first step (line 7 of Algorithm 6) is to updateq After updating q the next step (lines 8 to 12 of Algo-rithm 6) is to determine the new distance to travel alongp α During the update of α the dot product of p by qmust be determined After updating α we need to deter-mine the iterated values of x and the new r residues (lines9 and 10 of Algorithm 6 Before updating each grid cellprevious optimal search vector (gradient) that is orthogo-nal (conjugate) to all the previous search vectors p (line13 of Algorithm 6) ρold ρ and β must be updated (lines11 to 13 of Algorithm 6) After updating β the new searchdirections (p values) must be set

The most intuitive way to migrate the Conjugate Gradientfrom a sequential to a parallel algorithm is to perform itssteps (ie dot products update of grid positions etc) bykernels or using the CUDA BLAS library kernels How-ever most of these kernels must be called for a certainnumber of iterations Therefore the successive invoca-tion of kernels will result in timeouts in the simulationThe best solution found was to build up a massive ker-nel However this results in losing much of the CUDAperformance gains The reason is related with the parallelversion of dot product which forces the use of one block

Algorithm 6 Conjugate Gradient methodInputx 1D array with the grid current valuesx0 1D array with the grid previous valuesr p q auxiliary 1D arrays

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iteratetol tolerance after which is safe to state that the values ofx converged optimally Outputx 1D array with the grid new interpolated values

1 r = bminusAx2 p = r3 ρ = rT middot r4 ρ0 = ρ5 for iteration = 0 to max iter do6 if (ρ = 0) and ρ gt tol2 times ρ0 then7 q = Ap8 α = ρ(pT middot q)9 x+ = αtimes p

10 rminus = αtimes q11 ρ old = ρ12 ρ = rT middot r13 β = ρρ old14 p = r + β times p15 Enforce Boundary Conditions16 end if17 end for

with NX threads in x Much of the steps of the ConjugateGradient performance degrades with this restriction Evenworse this version has worst performance than the CPU-based version

6 SOLVERS PERFORMANCE ANALYSIS

After implementing the solvers tests of their overallperformance were made (see Tables 1 and 2) Thesolvers both for GPU and CPU were tested on a In-tel(R) Core(TM)2 Quad CPU Q6600240GHz with4096MBytes of DDR2 RAM and an NVIDIA GeForce8800 GT graphics card The CPU-based version is purelysequential ie it runs in a single core and it is not multi-threaded The following tables show the total time (lsquoCPUtimersquo and lsquoGPU timersquo) that each solver takes for a certainnumber of iterations (lsquoIterationsrsquo) and a specific grid size(lsquoGrid Sizersquo) Each solver is invoked a number of times ina single step of the stable fluids method (5 for 2D and 6for 3D) For 2D we tested each solver using 10 iterationsfor all grid sizes In 3D we used 4 iterations instead foreach solver In 2D 10 iterations suffice while we needa minimum number of 4 iterations in 3D to ensure someconvergence of the results More accurate converging val-ues result in better visual quality A total time superior to33ms does not guarantee real-time performance ie noframe rate greater than 30 frames per second is achievedThe time values were obtained from the average time of 10

tests for each solver implementation independently of thegrid size

Grid CPU Time GPU TimeSize Iterations (ms) (ms)

J 10 0 0 0 290219GS 322 10 0 0 0 290032CG 10 0 0 0 567346

J 10 2 0 0 290900GS 642 10 3 0 0 295285CG 10 3 0 0 571604

J 10 8 0 0 293459GS 1282 10 15 0 0 289818CG 10 13 0 0 573089

J 10 35 0 0 296003GS 2562 10 60 0 0 289882CG 10 56 0 0 579169

J 10 298 0 0 306887GS 5122 10 245 0 0 308024CG 10 350 0 0 580205

Table 1 Performance of 2D solvers for CPUand GPU for distinct grid sizes

Grid CPU Time GPU TimeSize Iterations (ms) (ms)

J 4 10 0 0 408663GS 323 4 15 0 0 343595CG 4 15 0 0 665508

J 4 170 0 0 416022GS 643 4 141 0 0 348235CG 4 239 0 0 673512

J 4 1482 0 0 424283GS 1283 4 1208 0 0 360813CG 4 2017 0 0 68272

Table 2 Performance of 3D solvers for CPUand GPU for distinct grid sizes

From the previous tables we can draw some conclusionsWhen comparing the columns lsquoCPU timersquo and lsquoGPU timersquoit is clear that in the 2D and 3D versions the GPU-basedimplementation surpasses the CPU-based implementationHowever the previous tables only present the processingtimes not including timeouts When running the simula-tion timeouts appear from device global memory latencyin the GPU-based version These timeouts are hidden (iethey exist but they do not significantly degrade the solveroverall performance) in the Jacobi and Gauss-Seidel GPUimplementations Unfortunately the Conjugate Gradienttimeouts are so severe that the losses overcome the gainsin time

The Conjugate Gradient method converges faster than Ja-cobi and Gauss-Seidel in spite of involving more computa-

tions during each iteration but this is not visible for a smallnumber of iterations Therefore CPU-based implementa-tion of stable fluids using the Conjugate Gradient solveris inadequate for real-time purposes when the grid size isover 1282

Except for the 322 grid the GS and J solvers have sig-nificant gains on GPU Comparing lsquoCPU timersquo and lsquoGPUtimersquo for J and GS solvers it becomes clear that the GPU-based versions are faster Besides we can fit more solveriterations per second using the GPU-based implementa-tion Unlike the CPU-based version the GPU-based ver-sions of J and GS solvers enable the usage of a 1283 gridin real-time Thus the observation of the lsquoIterationsrsquo andlsquoGPU timersquo columns leads us to conclude that GS is thebest choice for 2D and 3D grid sizes In the CPU-basedversions the best choice in 2D is the J solver except forthe 5122 grid where GS is the best choice In 3D theCPU-based version of GS is a better choice for grid sizessuperior to 323

Another important consideration has to do with the timecomplexity of both CPU- and GPU-based implementationsof stable fluids Looking at Tables 1 and 2 we easily ob-serve that the GPU-based solvers have constant complex-ity approximately On the other hand CPU-based solvershave quadratic complexity for small grids but tend to cubiccomplexity (ie the worst case) for larger grids Howevercomputing the time complexity more accurately would re-quire more exhaustive experiments as well as a theoreticalanalysis

Figs 6 to 8 show a 1282 fluid simulation with internaland moving boundaries (red dots) Rendering was doneusing OpenGL Vertex Buffer Objects The CPU version isthe one here shown The frame rate includes the renderingtime

7 CONCLUSIONS AND FUTURE WORK

This paper has described CUDA-based implementations ofJacobi Gauss-Seidel and Conjugate Gradient solvers for3D stable fluids on GPU These solvers have been thencompared to each other including their CPU-based imple-mentations The most important result from this compar-ative study is that the GPU-based implementations haveconstant time complexity which allows to have a more ac-curate control in real-time applications

The 3D stable fluids method has significant memory re-quirements and time restrictions to solve the Navier-Stokesequations at each time step It remains to prove that otheralternatives (not addressed in this paper) to 3D fluid sim-ulations such as Shallow Water Equations [Miklos 09]the Lattice Boltzmann Method [TJ08] the Smoothed Par-ticle Hydrodynamics [Schlatter 99] or procedural meth-ods [Jeschke 03] are better choices We hope to exploreother emerging solvers for sparse linear systems in a nearfuture In particular we need a solver with a better conver-gence rate than relaxation techniques (J and GS) and withno significant extra computational effort such as the CG

Figure 6 A CPU version of [Stam 03] fluidsimulator with internal and moving bound-aries (red dots) using the J solver

References

[Amorim 08] Ronan Amorim Gundolf Haase ManfredLiebmann and Rodrigo Weber Compar-ing CUDA and OpenGL implementationsfor a Jacobi iteration Technical Report025 University of Graz SFB Dec 2008

[Ash 05] Michael Ash Simulation and visualiza-tion of a 3d fluid Masterrsquos thesis Univer-site drsquoOrleans France Sep 2005

[Bell 08] Nathan Bell and Michael Garland Ef-ficient sparse matrix-vector multiplica-tion on CUDA NVIDIA Technical Re-port NVR-2008-004 NVIDIA Corpora-tion Dec 2008

[Bolz 03] Jeff Bolz Ian Farmer Eitan Grinspun andPeter Schroder Sparse matrix solvers onthe GPU conjugate gradients and multi-grid ACM Trans Graph 22(3)917ndash9242003

[Bongart 07] Robert Bongart Efficient simulation offluid dynamics in a 3d game engine

Figure 7 A CPU version of [Stam 03] fluidsimulator with internal and moving bound-aries (red dots) using the GS solver

Masterrsquos thesis KTH Computer Scienceand Communication Stockholm Sweden2007

[Bridson 07] Robert Bridson and Matthias F MullerFluid simulation SIGGRAPH 2007course notes In ACM SIGGRAPH 2007Course Notes (SIGGRAPH rsquo07) pages 1ndash81 New York NY USA 2007 ACMPress

[Carlson 04] Mark Thomas Carlson Rigid meltingand flowing fluid PhD thesis AtlantaGA USA Jul 2004

[Goodnight 07] Nolan Goodnight CUDAOpenGL fluidsimulation httpdeveloperdownloadnvidiacomcomputecudasdkwebsitesampleshtmlpostProcessGL 2007

[Harris 05] Mark J Harris Fast fluid dynamicssimulation on the GPU In ACM SIG-GRAPH 2005 Course Notes (SIGGRAPHrsquo05) number 220 New York NY USA2005 ACM Press

Figure 8 A CPU version of [Stam 03] fluidsimulator with internal and moving bound-aries (red dots) using the CG solver

[Jeschke 03] Stefan Jeschke Hermann Birkholz andHeidrun Schumann A procedural modelfor interactive animation of breakingocean waves In The 11th InternationalConference in Central Europe on Com-puter Graphics Visualization and Com-puter Visionrsquo2003 (WSCG rsquo2003) 2003

[Kass 90] Michael Kass and Gavin Miller Rapidstable fluid dynamics for computer graph-ics In Proceedings of the 17th AnnualConference on Computer Graphics andInteractive Techniques (SIGGRAPHrsquo90)pages 49ndash57 ACM Press 1990

[Keenan 07] Crane Keenan Llamas Ignacio and TariqSarah Real-time simulation and render-ing of 3d fluids In Nguyen Hubert editorGPU Gems 3 chapter 30 pages 633ndash675Addison Wesley Professional Aug 2007

[Kim 08] Theodore Kim Hardware-aware analysisand optimization of stable fluids In Pro-ceedings of the 2008 Symposium on In-teractive 3D Graphics and Games (SI3Drsquo08) pages 99ndash106 2008

[Kruger 03] Jens Kruger and Rudiger WestermannLinear algebra operators for GPU imple-mentation of numerical algorithms InACM SIGGRAPH 2003 Papers (SIG-GRAPH rsquo03) pages 908ndash916 New YorkNY USA 2003 ACM Press

[Miklos 09] Balint Miklos Real time fluid simu-lation using height fields semester the-sis httpwwwbalintmikloscomlayered_waterpdf 2009

[Nickolls 08] John Nickolls Ian Buck Michael Gar-land and Kevin Skadron Scalable par-allel programming with CUDA Queue6(2)40ndash53 2008

[Noe 04] Karsten Noe Implementing rapidstable fluid dynamics on the GPUhttpprojectsn-o-edkpage=showampname=GPU20water20simulation 2004

[NVIDIA 08a] NVIDIA CUDA programmingguide 20 httpdeveloperdownloadnvidiacomcomputecuda2_0docsNVIDIA_CUDA_Programming_Guide_20pdfJul 2008

[NVIDIA 08b] NVIDIA CUDA reference manual 20httpdeveloperdownloadnvidiacomcomputecuda2_0docsCudaReferenceManual_20pdf Jun 2008

[Owens 08] John D Owens Mike Houston DavidLuebke Simon Green John E Stone andJames C Phillips GPU computing Pro-ceedings of the IEEE 96(5)879ndash89 2008

[Schlatter 99] Brian Schlatter A pedagogical tool us-ing smoothed particle hydrodynamics tomodel fluid flow past a system of cylin-ders Technical report Oregon State Uni-versity 1999

[Shewchuk 94] J R Shewchuk An introduction to theconjugate gradient method without the ag-onizing pain Technical report PittsburghPA USA Aug 1994

[Stam 99] Jos Stam Stable fluids In Proceedingsof the 26th Annual Conference on Com-puter Graphics and Interactive Techniques(SIGGRAPH rsquo99) pages 121ndash128 NewYork NY USA Aug 1999 ACM Press

[Stam 01] Jos Stam A simple fluid solver basedon the FFT J Graph Tools 6(2)43ndash522001

[Stam 03] Jos Stam Real-time fluid dynamics forgames In Proceedings of the Game De-veloper Conference Mar 2003

[TJ08] Tolke-Jonas Implementation of a Lattice-Boltzmann kernel using the compute uni-fied device architecture developed byNVIDIA Computing and Visualization inScience Feb 2008

[Wiggers 07] WA Wiggers V Bakker ABJKokkeler and GJM Smit Imple-menting the conjugate gradient algorithmon multi-core systems page 14 Nov2007

  • INTRODUCTION
  • PREVIOUS WORK
  • NVIDIA COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
  • STABLE FLUIDS
    • Add force (f term in Eq 1 and S term in Eq 2)
    • Advect (- (u)u term in Eq 1 and - (u) term in Eq 2)
    • Diffuse (v 2 u term in Eq 1 and k 2 term in Eq 2)
    • Move (- (u)u term in Eq 1) and v=0
      • SOLVER ALGORITHMS
        • Jacobi and Gauss-Seidel Solvers
        • Conjugate Gradient Solver
          • SOLVERS PERFORMANCE ANALYSIS
          • CONCLUSIONS AND FUTURE WORK
Page 4: Linear Solvers for Stable Fluids: GPU vs CPUzxu2/acms60212-40212-S12/final_project/Linear... · Linear Solvers for Stable Fluids: GPU vs CPU ... Fluid simulation has been an active

force f to add ie u = u0 + ∆t times f The same appliesto the density ie ρ = ρ0 + ∆t times S where ρ0 is thedensity previous value ρ is the new density value ∆t isthe simulation time step and S is the source to add to thedensity

42 Advect (minus (unabla)u term in Eq 1 and minus (unabla) ρterm in Eq 2)

The fluid moves according to the system velocity Whenmoving the fluid transports objects densities itself (self-advection) and other quantities This is referred as advec-tion Note that advection of the velocity also exists duringthe move step

43 Diffuse (vnabla2u term in Eq 1 and knabla2 term inEq 2)

Viscosity describes a fluidrsquos internal resistance to flowThis resistance results in diffusion of the momentum (andtherefore velocity) To diffuse we need to solve for the3D case the following equation for each grid cell

Dn+1ijk minus

kdt

h3

(Dn+1

iminus1jk +Dn+1ijminus1k +Dn+1

ijkminus1+

Dn+1i+1jk +Dn+1

ij+1k +Dn+1ijk+1 minus 6Dn+1

ijk

)= Dn

ijk

(3)

In both cases we will have to solve a sparse linear systemin the form Ax = b

44 Move (minus (unabla)u term in Eq 1) and 5v = 0

When the fluid moves mass conservation must be ensuredThis means that the flow leaving each cell (of the gridwhere the fluid is being simulated) must equal the flowcoming in But the previous steps (Add force and dif-fuse for velocity) violate the principle of mass conserva-tion Stam uses a Hodge decomposition of a vector field(the velocity vector field specifically) to address this issueHodge decomposition states that every vector field is thesum of a mass conserving field and a gradient field Toensure mass conservation we simply subtract the gradientfield from the vector field In order to do this we must findthe scalar function that defines the gradient field Comput-ing the gradient field is therefore a matter of solving forthe 3D case the following Poisson equation for each gridcell

Piminus1jk + Pijminus1k + Pijkminus1+

+Pi+1jk + Pij+1k + Pijk+1 minus 6Pijk =

= (Ui+1jk minus Uiminus1jk + Vij+1kminus

minusVijminus1k +Wijk+1 minusWijkminus1)h

(4)

Solving this Poisson equation for the 3D case for eachgrid cell is the same as solving a sparse symmetrical linear

system This system can be solved with the solver usedin the diffuse step (J GS or CG) as described in the nextsection

5 SOLVER ALGORITHMS

As previously mentioned the density diffusion the veloc-ity diffusion and move steps require for a sparse linearsystem to be solved To best understand the kind of prob-lem at hand let us assume we are going to simulate ourfluid in a 22 grid domain (blue cells in Fig 3 on the left)This means that our grid will actually be a 42 grid wherethe fluid is inside a container So the extra cells are the ex-ternal boundaries of the simulation (red cells in Fig 3 onthe left) To allow a better memory usage we represent thegrid as a 1D array with 42 elements (as shown in Fig 3 onthe left) For a 3D simulation eight 1D arrays are requiredVelocity requires six 1D arrays two for each of its com-ponents ie current and previous values of velocity (vxvx0 vy vy0 vz vz0) Density will require the remainingtwo 1D arrays (from the eight) for its current and previousvalues (d d0)

Figure 3 2D Grid represented by a 1Darray(left) and grid cell interacting with its neigh-bours (right)

During the density diffusion and the velocity diffusion andmove steps each cell in the grid interacts with its directneighbours (see Fig 3 on the right) In a 42 grid therewould be a total of 42 interactions between a cell and itsneighbours Let us consider one of the 1D array pairsfor example for the velocity y component previous (vy0)and current values (vy) If we took the interactions foreach fluid cell we would obtain a linear system in the formAx = b (see Fig 4)

In this system A is a Laplacian matrix of size 162 andits empty cells are zero For diffusion and move steps asystem in this form has to be solved To do so one caneither build and store A in memory using a 1D array ora data structure of some kind or to use its values directlyIn the second option this means that the central cell valueis multiplied by minus4 in 2D or by minus6 in 3D and its directneighbours are multiplied by 1

Figure 4 The sparse linear system to solve(for a 22 fluid simulation grid)

51 Jacobi and Gauss-Seidel Solvers

The Jacobi and Gauss-Seidel solvers for a given numberof iterations (line 1 in Algorithms 2 and 3) for each cellof the grid (line 2 in Algorithms 2 and 3) will calculatethe cell value (line 3 in Algorithms 2 and 3) What dis-tinguishes both solvers is that Gauss-Seidel uses the pre-viously calculated values but Jacobi does not ThereforeJacobi convergence rate will be slower when compared tothe Gauss-Seidel solver Since the Jacobi solver does notuse already updated cell values it requires the storage ofthe new values in a temporary auxiliary 1D array (aux)When all new values of cells have been determined theold values of cells will be replaced with the values storedin the auxiliary 1D array (lines 5 to 7 in Algorithm 2) Af-ter some maths (not addressed in this paper) the diffusionand move equations to solve can be made generic whereonly iter and a will differ (line 3 in Algorithms 2 and 3)

Algorithm 2 CPU based JacobiInputx 1D array with the grid current valuesx0 1D array with the grid previous valuesaux auxiliary 1D array

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iterateOutputx 1D array with the grid new interpolated values

1 for iteration = 0 to max iter do2 for all grid cells do3 auxijk = (x0ijk + atimes (ximinus1jk + xijminus1k +xijkminus1 + xi+1jk + xij+1k + xijk+1))iter

4 end for5 for all grid cells do6 xijk = auxijk

7 end for8 Enforce Boundary Conditions9 end for

Algorithm 3 CPU based Gauss-SeidelInputx 1D array with the grid current valuesx0 1D array with the grid previous values

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iterateOutputx 1D array with the grid new interpolated values

1 for iteration = 0 to max iter do2 for all grid cells do3 xijk = (x0ijk + a times (ximinus1jk + xijminus1k +xijkminus1 + xi+1jk + xij+1k + xijk+1))iter

4 end for5 Enforce Boundary Conditions6 end for

In the GPU for Jacobi and the Gauss-Seidel we willhave a call to a kernel (the kernels are Algorithms 4 and5) A kernel will have two parameters grid stands forthe number of blocks in X and Y axis and threadsstands for the number of threads per block The dimen-sions of the block are given with BLOCK DIM X andBLOCK DIM Y Each block treats all grid slices in zdirection for the threads in x and y

dim3 threads (BLOCK_DIM_X BLOCK_DIM_Y ) dim3 grid (NX BLOCK_DIM_Z NY BLOCK_DIM_Y )

J a c o b i k e r n e l c a l l_jcbltltltgrid threadsgtgtgt(x x0 a iter max_iter ) CUT_CHECK_ERROR (Kernel execution failed ) cudaThreadSynchronize ( )

o r

GaussminusS e i d e l r e d b l a c k k e r n e l c a l l_gs_rbltltltgrid threads gtgtgt(x x0 a iter max_iter ) CUT_CHECK_ERROR (Kernel execution failed ) cudaThreadSynchronize ( )

In the GPU implementations of Jacobi and Gauss-Seidelred black algorithms the values of i and j (cell coordi-nates) will be obtained with the blocks threads and gridinformation (lines 1 to 2 in Algorithms 4 and 5) TheGauss-Seidel solver is a sequential algorithm since it re-quires previous values to be calculated The GPU-basedversion of Gauss-Seidel has two interleaved passes first itupdates the red cells (line 7 in 5) and then the black cells(line 11 in Algorithm 5) according to the pattern shown inFig 5

Figure 5 Gauss-Seidel red black pattern fora 2D grid

Thus previous values are used as in the CPU-based ver-sion The GPU-based implementation of Gauss-Seidel al-lows more iterations than the CPU-based implementationNevertheless it also takes two times more iterations to con-verge to the same values as the CPU-based implementa-tion

The Jacobi GPU-based version requires to temporarilystore each grid cell new value in a device global memory1D array (aux) After each iteration the values stored in xare replaced by the new values temporarily stored in axu(line 8 in Algorithm 4)

The GPU-based version of all solvers (J GS CG) sufferfrom global memory latency which appears during suc-cessive runs of the solvers (an issue for real time purposes)However only the Conjugate Gradient is affected to a levelof degrading notoriously the solver performance

Algorithm 4 Jacobi GPU kernelInputx 1D device global memory array with the grid currentvaluesx0 1D device global memory array with the grid previousvaluesaux auxiliary 1D device global memory array

akdt

h3for diffusion (see Eq 3) 1 for move (see Eq 4)

iter 1 +kdt

h3(see Eq 3) 6 for Move (see Eq 4)

max iter number of times to iterateOutputx new interpolated values of x

1 i = threadIdxx+ blockIdxxtimes blockDimx2 j = threadIdxy + blockIdxy times blockDimy3 for iteration = 0 to max iter do4 for k = 0 to NZ do5 if (i = 0) ampamp (i = NX minus 1) ampamp (j = 0) ampamp

(j = NY minus 1)ampamp(k = 0)ampamp(k = NZ minus 1) then6 auxijk = (x0ijk +atimes(ximinus1jk +xi+1jk +xijminus1k + xij+1k + xijkminus1 + xijk+1))iter

7 syncthreads8 xijk = auxijk

9 end if10 Enforce Boundary Conditions11 end for12 end for

52 Conjugate Gradient Solver

The Conjugate Gradient algorithm (see Algorithm 6) con-sists in a series of calls to functions in the CPU-based ver-sion or to a kernel call in the GPU-based implementation

_cgltltlt1NXgtgtgt(r p q x b alpha beta rho rho0 rho_old alarriter max_iter )

CUT_CHECK_ERROR (Kernel execution failed )

Before iterating it is first required (lines 1 to 4 of Algo-rithm 6) to set the initial values of r and p and of ρ0 andρ After the initial values are set up we are ready to iterate

Algorithm 5 Gauss-Seidel red black GPU kernelInputx 1D device global memory array with the grid currentvaluesx0 1D device global memory array with the grid previousvaluesakdt

h3for diffusion (see Eq 3) 1 for move (see Eq 4)

iter 1 +kdt

h3(see Eq 3) 6 for Move (see Eq 4)

max iter number of times to iterateOutputx new interpolated values of x

1 i = threadIdxx+ blockIdxxtimes blockDimx2 j = threadIdxy + blockIdxy times blockDimy3 for iteration = 0 to max iter do4 for k = 0 to NZ do5 if (i = 0) ampamp (i = NX minus 1) ampamp (j = 0) ampamp

(j = NY minus 1)ampamp(k = 0)ampamp(k = NZ minus 1) then6 if (i+ j)2 == 0 then7 xijk = (x0ijk +atimes(ximinus1jk +xi+1jk +xijminus1k + xij+1k + xijkminus1 + xijk+1))iter

8 end if9 syncthreads

10 if (i+ j)2 = 0 then11 xijk = (x0ijk +atimes(ximinus1jk +xi+1jk +

xijminus1k + xij+1k + xijkminus1 + xijk+1))iter12 end if13 end if14 Enforce Boundary Conditions15 end for16 end for

We will iterate until all iterations are done or the stop cri-terion is achieved (lines 5 and 6 of Algorithm 6) For eachiteration the first step (line 7 of Algorithm 6) is to updateq After updating q the next step (lines 8 to 12 of Algo-rithm 6) is to determine the new distance to travel alongp α During the update of α the dot product of p by qmust be determined After updating α we need to deter-mine the iterated values of x and the new r residues (lines9 and 10 of Algorithm 6 Before updating each grid cellprevious optimal search vector (gradient) that is orthogo-nal (conjugate) to all the previous search vectors p (line13 of Algorithm 6) ρold ρ and β must be updated (lines11 to 13 of Algorithm 6) After updating β the new searchdirections (p values) must be set

The most intuitive way to migrate the Conjugate Gradientfrom a sequential to a parallel algorithm is to perform itssteps (ie dot products update of grid positions etc) bykernels or using the CUDA BLAS library kernels How-ever most of these kernels must be called for a certainnumber of iterations Therefore the successive invoca-tion of kernels will result in timeouts in the simulationThe best solution found was to build up a massive ker-nel However this results in losing much of the CUDAperformance gains The reason is related with the parallelversion of dot product which forces the use of one block

Algorithm 6 Conjugate Gradient method
Input:
x: 1D array with the grid current values
x0: 1D array with the grid previous values
r, p, q: auxiliary 1D arrays
a: k·dt/h³ (see Eq. 3)
iter: 1 + k·dt/h³ (see Eq. 3)
max_iter: number of times to iterate
tol: tolerance below which it is safe to state that the values of x have converged optimally
Output:
x: 1D array with the grid new interpolated values

1:  r = b − A·x
2:  p = r
3:  ρ = r^T · r
4:  ρ0 = ρ
5:  for iteration = 0 to max_iter do
6:    if (ρ ≠ 0) and (ρ > tol² × ρ0) then
7:      q = A·p
8:      α = ρ / (p^T · q)
9:      x += α × p
10:     r −= α × q
11:     ρ_old = ρ
12:     ρ = r^T · r
13:     β = ρ / ρ_old
14:     p = r + β × p
15:     Enforce Boundary Conditions
16:    end if
17:  end for

The performance of many of the Conjugate Gradient steps degrades under this restriction. Even worse, this version performs worse than the CPU-based version.
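To illustrate why, a block-level dot product in CUDA might be sketched as follows. This is our own minimal example, not the paper's kernel; it assumes the whole reduction runs inside a single block of NT threads so that __syncthreads() can coordinate it.

    /* Minimal sketch: dot product restricted to a single block (illustrative only). */
    #define NT 256                                   /* threads in the single block */

    __global__ void dot_single_block(const float *u, const float *v, int n, float *result)
    {
        __shared__ float partial[NT];
        float s = 0.0f;
        /* each thread strides over the arrays and accumulates its share */
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            s += u[i] * v[i];
        partial[threadIdx.x] = s;
        __syncthreads();
        /* tree reduction in shared memory; __syncthreads() only works within one block */
        for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
            if (threadIdx.x < stride)
                partial[threadIdx.x] += partial[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            *result = partial[0];
    }

Launched as dot_single_block<<<1, NT>>>(u, v, n, result), only one multiprocessor does useful work, so every dot product inside the CG loop leaves the rest of the GPU idle; this is consistent with the loss of the CUDA performance gains reported above.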

6. SOLVERS PERFORMANCE ANALYSIS

After implementing the solvers, tests of their overall performance were made (see Tables 1 and 2). The solvers, both for GPU and CPU, were tested on an Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40 GHz with 4096 MBytes of DDR2 RAM and an NVIDIA GeForce 8800 GT graphics card. The CPU-based version is purely sequential, i.e., it runs in a single core and is not multi-threaded. The following tables show the total time ('CPU time' and 'GPU time') that each solver takes for a certain number of iterations ('Iterations') and a specific grid size ('Grid Size'). Each solver is invoked a number of times in a single step of the stable fluids method (5 for 2D and 6 for 3D). For 2D we tested each solver using 10 iterations for all grid sizes; in 3D we used 4 iterations instead for each solver. In 2D, 10 iterations suffice, while we need a minimum of 4 iterations in 3D to ensure some convergence of the results. More accurate converging values result in better visual quality. A total time above 33 ms does not guarantee real-time performance, i.e., no frame rate greater than 30 frames per second is achieved. The time values were obtained from the average time of 10 tests for each solver implementation, independently of the grid size.

Grid Size   Solver   Iterations   CPU time (ms)   GPU time (ms)
32²         J        10           0.0             0.290219
32²         GS       10           0.0             0.290032
32²         CG       10           0.0             0.567346
64²         J        10           2.0             0.290900
64²         GS       10           3.0             0.295285
64²         CG       10           3.0             0.571604
128²        J        10           8.0             0.293459
128²        GS       10           15.0            0.289818
128²        CG       10           13.0            0.573089
256²        J        10           35.0            0.296003
256²        GS       10           60.0            0.289882
256²        CG       10           56.0            0.579169
512²        J        10           298.0           0.306887
512²        GS       10           245.0           0.308024
512²        CG       10           350.0           0.580205

Table 1. Performance of 2D solvers for CPU and GPU for distinct grid sizes.

Grid Size   Solver   Iterations   CPU time (ms)   GPU time (ms)
32³         J        4            10.0            0.408663
32³         GS       4            15.0            0.343595
32³         CG       4            15.0            0.665508
64³         J        4            170.0           0.416022
64³         GS       4            141.0           0.348235
64³         CG       4            239.0           0.673512
128³        J        4            1482.0          0.424283
128³        GS       4            1208.0          0.360813
128³        CG       4            2017.0          0.68272

Table 2. Performance of 3D solvers for CPU and GPU for distinct grid sizes.

From the previous tables we can draw some conclusions. When comparing the columns 'CPU time' and 'GPU time', it is clear that in both the 2D and 3D versions the GPU-based implementation surpasses the CPU-based implementation. However, the previous tables only present the processing times, not including timeouts. When running the simulation, timeouts appear due to device global memory latency in the GPU-based version. These timeouts are hidden (i.e., they exist, but they do not significantly degrade the solver's overall performance) in the Jacobi and Gauss-Seidel GPU implementations. Unfortunately, the Conjugate Gradient timeouts are so severe that the losses overcome the gains in time.

The Conjugate Gradient method converges faster than Jacobi and Gauss-Seidel, in spite of involving more computations during each iteration, but this is not visible for a small number of iterations. Therefore, the CPU-based implementation of stable fluids using the Conjugate Gradient solver is inadequate for real-time purposes when the grid size is over 128².

Except for the 32² grid, the GS and J solvers show significant gains on the GPU. Comparing 'CPU time' and 'GPU time' for the J and GS solvers, it becomes clear that the GPU-based versions are faster. Besides, we can fit more solver iterations per second using the GPU-based implementation. Unlike the CPU-based version, the GPU-based versions of the J and GS solvers enable the usage of a 128³ grid in real time. Thus, the observation of the 'Iterations' and 'GPU time' columns leads us to conclude that GS is the best choice for both 2D and 3D grid sizes. In the CPU-based versions, the best choice in 2D is the J solver, except for the 512² grid, where GS is the best choice. In 3D, the CPU-based version of GS is a better choice for grid sizes above 32³.

Another important consideration has to do with the time complexity of both the CPU- and GPU-based implementations of stable fluids. Looking at Tables 1 and 2, we easily observe that the GPU-based solvers have approximately constant complexity. On the other hand, the CPU-based solvers have quadratic complexity for small grids, but tend to cubic complexity (i.e., the worst case) for larger grids. However, computing the time complexity more accurately would require more exhaustive experiments, as well as a theoretical analysis.

Figs. 6 to 8 show a 128² fluid simulation with internal and moving boundaries (red dots). Rendering was done using OpenGL Vertex Buffer Objects. The CPU version is the one shown here. The frame rate includes the rendering time.

7. CONCLUSIONS AND FUTURE WORK

This paper has described CUDA-based implementations of the Jacobi, Gauss-Seidel and Conjugate Gradient solvers for 3D stable fluids on the GPU. These solvers have then been compared to each other, including their CPU-based implementations. The most important result of this comparative study is that the GPU-based implementations have constant time complexity, which allows for more accurate control in real-time applications.

The 3D stable fluids method has significant memory requirements and time restrictions to solve the Navier-Stokes equations at each time step. It remains to be proved whether other alternatives to 3D fluid simulation (not addressed in this paper), such as the Shallow Water Equations [Miklos 09], the Lattice Boltzmann Method [TJ08], Smoothed Particle Hydrodynamics [Schlatter 99], or procedural methods [Jeschke 03], are better choices. We hope to explore other emerging solvers for sparse linear systems in the near future. In particular, we need a solver with a better convergence rate than relaxation techniques (J and GS) but with no significant extra computational effort, such as the CG.

Figure 6. A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the J solver.

Figure 7. A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the GS solver.

Figure 8. A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the CG solver.

References

[Amorim 08] Ronan Amorim, Gundolf Haase, Manfred Liebmann, and Rodrigo Weber. Comparing CUDA and OpenGL implementations for a Jacobi iteration. Technical Report 025, University of Graz, SFB, Dec 2008.

[Ash 05] Michael Ash. Simulation and visualization of a 3D fluid. Master's thesis, Universite d'Orleans, France, Sep 2005.

[Bell 08] Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec 2008.

[Bolz 03] Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schroder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph., 22(3):917-924, 2003.

[Bongart 07] Robert Bongart. Efficient simulation of fluid dynamics in a 3D game engine. Master's thesis, KTH Computer Science and Communication, Stockholm, Sweden, 2007.

[Bridson 07] Robert Bridson and Matthias F. Muller. Fluid simulation: SIGGRAPH 2007 course notes. In ACM SIGGRAPH 2007 Course Notes (SIGGRAPH '07), pages 1-81, New York, NY, USA, 2007. ACM Press.

[Carlson 04] Mark Thomas Carlson. Rigid, melting and flowing fluid. PhD thesis, Atlanta, GA, USA, Jul 2004.

[Goodnight 07] Nolan Goodnight. CUDA/OpenGL fluid simulation. http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html#postProcessGL, 2007.

[Harris 05] Mark J. Harris. Fast fluid dynamics simulation on the GPU. In ACM SIGGRAPH 2005 Course Notes (SIGGRAPH '05), number 220, New York, NY, USA, 2005. ACM Press.

[Jeschke 03] Stefan Jeschke, Hermann Birkholz, and Heidrun Schumann. A procedural model for interactive animation of breaking ocean waves. In The 11th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2003 (WSCG '2003), 2003.

[Kass 90] Michael Kass and Gavin Miller. Rapid, stable fluid dynamics for computer graphics. In Proceedings of the 17th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '90), pages 49-57. ACM Press, 1990.

[Keenan 07] Crane Keenan, Llamas Ignacio, and Tariq Sarah. Real-time simulation and rendering of 3D fluids. In Nguyen Hubert, editor, GPU Gems 3, chapter 30, pages 633-675. Addison Wesley Professional, Aug 2007.

[Kim 08] Theodore Kim. Hardware-aware analysis and optimization of stable fluids. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games (SI3D '08), pages 99-106, 2008.

[Kruger 03] Jens Kruger and Rudiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In ACM SIGGRAPH 2003 Papers (SIGGRAPH '03), pages 908-916, New York, NY, USA, 2003. ACM Press.

[Miklos 09] Balint Miklos. Real time fluid simulation using height fields. Semester thesis, http://www.balintmiklos.com/layered_water.pdf, 2009.

[Nickolls 08] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40-53, 2008.

[Noe 04] Karsten Noe. Implementing rapid, stable fluid dynamics on the GPU. http://projects.n-o-e.dk/?page=show&name=GPU%20water%20simulation, 2004.

[NVIDIA 08a] NVIDIA. CUDA programming guide 2.0. http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf, Jul 2008.

[NVIDIA 08b] NVIDIA. CUDA reference manual 2.0. http://developer.download.nvidia.com/compute/cuda/2_0/docs/CudaReferenceManual_2.0.pdf, Jun 2008.

[Owens 08] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879-899, 2008.

[Schlatter 99] Brian Schlatter. A pedagogical tool using smoothed particle hydrodynamics to model fluid flow past a system of cylinders. Technical report, Oregon State University, 1999.

[Shewchuk 94] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA, Aug 1994.

[Stam 99] Jos Stam. Stable fluids. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99), pages 121-128, New York, NY, USA, Aug 1999. ACM Press.

[Stam 01] Jos Stam. A simple fluid solver based on the FFT. J. Graph. Tools, 6(2):43-52, 2001.

[Stam 03] Jos Stam. Real-time fluid dynamics for games. In Proceedings of the Game Developer Conference, Mar 2003.

[TJ08] Jonas Tolke. Implementation of a Lattice-Boltzmann kernel using the compute unified device architecture developed by NVIDIA. Computing and Visualization in Science, Feb 2008.

[Wiggers 07] W.A. Wiggers, V. Bakker, A.B.J. Kokkeler, and G.J.M. Smit. Implementing the conjugate gradient algorithm on multi-core systems. page 14, Nov 2007.

  • INTRODUCTION
  • PREVIOUS WORK
  • NVIDIA COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
  • STABLE FLUIDS
    • Add force (f term in Eq 1 and S term in Eq 2)
    • Advect (- (u)u term in Eq 1 and - (u) term in Eq 2)
    • Diffuse (v 2 u term in Eq 1 and k 2 term in Eq 2)
    • Move (- (u)u term in Eq 1) and v=0
      • SOLVER ALGORITHMS
        • Jacobi and Gauss-Seidel Solvers
        • Conjugate Gradient Solver
          • SOLVERS PERFORMANCE ANALYSIS
          • CONCLUSIONS AND FUTURE WORK
Page 5: Linear Solvers for Stable Fluids: GPU vs CPUzxu2/acms60212-40212-S12/final_project/Linear... · Linear Solvers for Stable Fluids: GPU vs CPU ... Fluid simulation has been an active

Figure 4 The sparse linear system to solve(for a 22 fluid simulation grid)

51 Jacobi and Gauss-Seidel Solvers

The Jacobi and Gauss-Seidel solvers for a given numberof iterations (line 1 in Algorithms 2 and 3) for each cellof the grid (line 2 in Algorithms 2 and 3) will calculatethe cell value (line 3 in Algorithms 2 and 3) What dis-tinguishes both solvers is that Gauss-Seidel uses the pre-viously calculated values but Jacobi does not ThereforeJacobi convergence rate will be slower when compared tothe Gauss-Seidel solver Since the Jacobi solver does notuse already updated cell values it requires the storage ofthe new values in a temporary auxiliary 1D array (aux)When all new values of cells have been determined theold values of cells will be replaced with the values storedin the auxiliary 1D array (lines 5 to 7 in Algorithm 2) Af-ter some maths (not addressed in this paper) the diffusionand move equations to solve can be made generic whereonly iter and a will differ (line 3 in Algorithms 2 and 3)

Algorithm 2 CPU based JacobiInputx 1D array with the grid current valuesx0 1D array with the grid previous valuesaux auxiliary 1D array

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iterateOutputx 1D array with the grid new interpolated values

1 for iteration = 0 to max iter do2 for all grid cells do3 auxijk = (x0ijk + atimes (ximinus1jk + xijminus1k +xijkminus1 + xi+1jk + xij+1k + xijk+1))iter

4 end for5 for all grid cells do6 xijk = auxijk

7 end for8 Enforce Boundary Conditions9 end for

Algorithm 3 CPU based Gauss-SeidelInputx 1D array with the grid current valuesx0 1D array with the grid previous values

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iterateOutputx 1D array with the grid new interpolated values

1 for iteration = 0 to max iter do2 for all grid cells do3 xijk = (x0ijk + a times (ximinus1jk + xijminus1k +xijkminus1 + xi+1jk + xij+1k + xijk+1))iter

4 end for5 Enforce Boundary Conditions6 end for

In the GPU for Jacobi and the Gauss-Seidel we willhave a call to a kernel (the kernels are Algorithms 4 and5) A kernel will have two parameters grid stands forthe number of blocks in X and Y axis and threadsstands for the number of threads per block The dimen-sions of the block are given with BLOCK DIM X andBLOCK DIM Y Each block treats all grid slices in zdirection for the threads in x and y

dim3 threads (BLOCK_DIM_X BLOCK_DIM_Y ) dim3 grid (NX BLOCK_DIM_Z NY BLOCK_DIM_Y )

J a c o b i k e r n e l c a l l_jcbltltltgrid threadsgtgtgt(x x0 a iter max_iter ) CUT_CHECK_ERROR (Kernel execution failed ) cudaThreadSynchronize ( )

o r

GaussminusS e i d e l r e d b l a c k k e r n e l c a l l_gs_rbltltltgrid threads gtgtgt(x x0 a iter max_iter ) CUT_CHECK_ERROR (Kernel execution failed ) cudaThreadSynchronize ( )

In the GPU implementations of Jacobi and Gauss-Seidelred black algorithms the values of i and j (cell coordi-nates) will be obtained with the blocks threads and gridinformation (lines 1 to 2 in Algorithms 4 and 5) TheGauss-Seidel solver is a sequential algorithm since it re-quires previous values to be calculated The GPU-basedversion of Gauss-Seidel has two interleaved passes first itupdates the red cells (line 7 in 5) and then the black cells(line 11 in Algorithm 5) according to the pattern shown inFig 5

Figure 5 Gauss-Seidel red black pattern fora 2D grid

Thus previous values are used as in the CPU-based ver-sion The GPU-based implementation of Gauss-Seidel al-lows more iterations than the CPU-based implementationNevertheless it also takes two times more iterations to con-verge to the same values as the CPU-based implementa-tion

The Jacobi GPU-based version requires to temporarilystore each grid cell new value in a device global memory1D array (aux) After each iteration the values stored in xare replaced by the new values temporarily stored in axu(line 8 in Algorithm 4)

The GPU-based version of all solvers (J GS CG) sufferfrom global memory latency which appears during suc-cessive runs of the solvers (an issue for real time purposes)However only the Conjugate Gradient is affected to a levelof degrading notoriously the solver performance

Algorithm 4 Jacobi GPU kernelInputx 1D device global memory array with the grid currentvaluesx0 1D device global memory array with the grid previousvaluesaux auxiliary 1D device global memory array

akdt

h3for diffusion (see Eq 3) 1 for move (see Eq 4)

iter 1 +kdt

h3(see Eq 3) 6 for Move (see Eq 4)

max iter number of times to iterateOutputx new interpolated values of x

1 i = threadIdxx+ blockIdxxtimes blockDimx2 j = threadIdxy + blockIdxy times blockDimy3 for iteration = 0 to max iter do4 for k = 0 to NZ do5 if (i = 0) ampamp (i = NX minus 1) ampamp (j = 0) ampamp

(j = NY minus 1)ampamp(k = 0)ampamp(k = NZ minus 1) then6 auxijk = (x0ijk +atimes(ximinus1jk +xi+1jk +xijminus1k + xij+1k + xijkminus1 + xijk+1))iter

7 syncthreads8 xijk = auxijk

9 end if10 Enforce Boundary Conditions11 end for12 end for

52 Conjugate Gradient Solver

The Conjugate Gradient algorithm (see Algorithm 6) con-sists in a series of calls to functions in the CPU-based ver-sion or to a kernel call in the GPU-based implementation

_cgltltlt1NXgtgtgt(r p q x b alpha beta rho rho0 rho_old alarriter max_iter )

CUT_CHECK_ERROR (Kernel execution failed )

Before iterating it is first required (lines 1 to 4 of Algo-rithm 6) to set the initial values of r and p and of ρ0 andρ After the initial values are set up we are ready to iterate

Algorithm 5 Gauss-Seidel red black GPU kernelInputx 1D device global memory array with the grid currentvaluesx0 1D device global memory array with the grid previousvaluesakdt

h3for diffusion (see Eq 3) 1 for move (see Eq 4)

iter 1 +kdt

h3(see Eq 3) 6 for Move (see Eq 4)

max iter number of times to iterateOutputx new interpolated values of x

1 i = threadIdxx+ blockIdxxtimes blockDimx2 j = threadIdxy + blockIdxy times blockDimy3 for iteration = 0 to max iter do4 for k = 0 to NZ do5 if (i = 0) ampamp (i = NX minus 1) ampamp (j = 0) ampamp

(j = NY minus 1)ampamp(k = 0)ampamp(k = NZ minus 1) then6 if (i+ j)2 == 0 then7 xijk = (x0ijk +atimes(ximinus1jk +xi+1jk +xijminus1k + xij+1k + xijkminus1 + xijk+1))iter

8 end if9 syncthreads

10 if (i+ j)2 = 0 then11 xijk = (x0ijk +atimes(ximinus1jk +xi+1jk +

xijminus1k + xij+1k + xijkminus1 + xijk+1))iter12 end if13 end if14 Enforce Boundary Conditions15 end for16 end for

We will iterate until all iterations are done or the stop cri-terion is achieved (lines 5 and 6 of Algorithm 6) For eachiteration the first step (line 7 of Algorithm 6) is to updateq After updating q the next step (lines 8 to 12 of Algo-rithm 6) is to determine the new distance to travel alongp α During the update of α the dot product of p by qmust be determined After updating α we need to deter-mine the iterated values of x and the new r residues (lines9 and 10 of Algorithm 6 Before updating each grid cellprevious optimal search vector (gradient) that is orthogo-nal (conjugate) to all the previous search vectors p (line13 of Algorithm 6) ρold ρ and β must be updated (lines11 to 13 of Algorithm 6) After updating β the new searchdirections (p values) must be set

The most intuitive way to migrate the Conjugate Gradientfrom a sequential to a parallel algorithm is to perform itssteps (ie dot products update of grid positions etc) bykernels or using the CUDA BLAS library kernels How-ever most of these kernels must be called for a certainnumber of iterations Therefore the successive invoca-tion of kernels will result in timeouts in the simulationThe best solution found was to build up a massive ker-nel However this results in losing much of the CUDAperformance gains The reason is related with the parallelversion of dot product which forces the use of one block

Algorithm 6 Conjugate Gradient methodInputx 1D array with the grid current valuesx0 1D array with the grid previous valuesr p q auxiliary 1D arrays

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iteratetol tolerance after which is safe to state that the values ofx converged optimally Outputx 1D array with the grid new interpolated values

1 r = bminusAx2 p = r3 ρ = rT middot r4 ρ0 = ρ5 for iteration = 0 to max iter do6 if (ρ = 0) and ρ gt tol2 times ρ0 then7 q = Ap8 α = ρ(pT middot q)9 x+ = αtimes p

10 rminus = αtimes q11 ρ old = ρ12 ρ = rT middot r13 β = ρρ old14 p = r + β times p15 Enforce Boundary Conditions16 end if17 end for

with NX threads in x Much of the steps of the ConjugateGradient performance degrades with this restriction Evenworse this version has worst performance than the CPU-based version

6 SOLVERS PERFORMANCE ANALYSIS

After implementing the solvers tests of their overallperformance were made (see Tables 1 and 2) Thesolvers both for GPU and CPU were tested on a In-tel(R) Core(TM)2 Quad CPU Q6600240GHz with4096MBytes of DDR2 RAM and an NVIDIA GeForce8800 GT graphics card The CPU-based version is purelysequential ie it runs in a single core and it is not multi-threaded The following tables show the total time (lsquoCPUtimersquo and lsquoGPU timersquo) that each solver takes for a certainnumber of iterations (lsquoIterationsrsquo) and a specific grid size(lsquoGrid Sizersquo) Each solver is invoked a number of times ina single step of the stable fluids method (5 for 2D and 6for 3D) For 2D we tested each solver using 10 iterationsfor all grid sizes In 3D we used 4 iterations instead foreach solver In 2D 10 iterations suffice while we needa minimum number of 4 iterations in 3D to ensure someconvergence of the results More accurate converging val-ues result in better visual quality A total time superior to33ms does not guarantee real-time performance ie noframe rate greater than 30 frames per second is achievedThe time values were obtained from the average time of 10

tests for each solver implementation independently of thegrid size

Grid CPU Time GPU TimeSize Iterations (ms) (ms)

J 10 0 0 0 290219GS 322 10 0 0 0 290032CG 10 0 0 0 567346

J 10 2 0 0 290900GS 642 10 3 0 0 295285CG 10 3 0 0 571604

J 10 8 0 0 293459GS 1282 10 15 0 0 289818CG 10 13 0 0 573089

J 10 35 0 0 296003GS 2562 10 60 0 0 289882CG 10 56 0 0 579169

J 10 298 0 0 306887GS 5122 10 245 0 0 308024CG 10 350 0 0 580205

Table 1 Performance of 2D solvers for CPUand GPU for distinct grid sizes

Grid CPU Time GPU TimeSize Iterations (ms) (ms)

J 4 10 0 0 408663GS 323 4 15 0 0 343595CG 4 15 0 0 665508

J 4 170 0 0 416022GS 643 4 141 0 0 348235CG 4 239 0 0 673512

J 4 1482 0 0 424283GS 1283 4 1208 0 0 360813CG 4 2017 0 0 68272

Table 2 Performance of 3D solvers for CPUand GPU for distinct grid sizes

From the previous tables we can draw some conclusionsWhen comparing the columns lsquoCPU timersquo and lsquoGPU timersquoit is clear that in the 2D and 3D versions the GPU-basedimplementation surpasses the CPU-based implementationHowever the previous tables only present the processingtimes not including timeouts When running the simula-tion timeouts appear from device global memory latencyin the GPU-based version These timeouts are hidden (iethey exist but they do not significantly degrade the solveroverall performance) in the Jacobi and Gauss-Seidel GPUimplementations Unfortunately the Conjugate Gradienttimeouts are so severe that the losses overcome the gainsin time

The Conjugate Gradient method converges faster than Ja-cobi and Gauss-Seidel in spite of involving more computa-

tions during each iteration but this is not visible for a smallnumber of iterations Therefore CPU-based implementa-tion of stable fluids using the Conjugate Gradient solveris inadequate for real-time purposes when the grid size isover 1282

Except for the 322 grid the GS and J solvers have sig-nificant gains on GPU Comparing lsquoCPU timersquo and lsquoGPUtimersquo for J and GS solvers it becomes clear that the GPU-based versions are faster Besides we can fit more solveriterations per second using the GPU-based implementa-tion Unlike the CPU-based version the GPU-based ver-sions of J and GS solvers enable the usage of a 1283 gridin real-time Thus the observation of the lsquoIterationsrsquo andlsquoGPU timersquo columns leads us to conclude that GS is thebest choice for 2D and 3D grid sizes In the CPU-basedversions the best choice in 2D is the J solver except forthe 5122 grid where GS is the best choice In 3D theCPU-based version of GS is a better choice for grid sizessuperior to 323

Another important consideration has to do with the timecomplexity of both CPU- and GPU-based implementationsof stable fluids Looking at Tables 1 and 2 we easily ob-serve that the GPU-based solvers have constant complex-ity approximately On the other hand CPU-based solvershave quadratic complexity for small grids but tend to cubiccomplexity (ie the worst case) for larger grids Howevercomputing the time complexity more accurately would re-quire more exhaustive experiments as well as a theoreticalanalysis

Figs 6 to 8 show a 1282 fluid simulation with internaland moving boundaries (red dots) Rendering was doneusing OpenGL Vertex Buffer Objects The CPU version isthe one here shown The frame rate includes the renderingtime

7 CONCLUSIONS AND FUTURE WORK

This paper has described CUDA-based implementations ofJacobi Gauss-Seidel and Conjugate Gradient solvers for3D stable fluids on GPU These solvers have been thencompared to each other including their CPU-based imple-mentations The most important result from this compar-ative study is that the GPU-based implementations haveconstant time complexity which allows to have a more ac-curate control in real-time applications

The 3D stable fluids method has significant memory re-quirements and time restrictions to solve the Navier-Stokesequations at each time step It remains to prove that otheralternatives (not addressed in this paper) to 3D fluid sim-ulations such as Shallow Water Equations [Miklos 09]the Lattice Boltzmann Method [TJ08] the Smoothed Par-ticle Hydrodynamics [Schlatter 99] or procedural meth-ods [Jeschke 03] are better choices We hope to exploreother emerging solvers for sparse linear systems in a nearfuture In particular we need a solver with a better conver-gence rate than relaxation techniques (J and GS) and withno significant extra computational effort such as the CG

Figure 6 A CPU version of [Stam 03] fluidsimulator with internal and moving bound-aries (red dots) using the J solver

References

[Amorim 08] Ronan Amorim Gundolf Haase ManfredLiebmann and Rodrigo Weber Compar-ing CUDA and OpenGL implementationsfor a Jacobi iteration Technical Report025 University of Graz SFB Dec 2008

[Ash 05] Michael Ash Simulation and visualiza-tion of a 3d fluid Masterrsquos thesis Univer-site drsquoOrleans France Sep 2005

[Bell 08] Nathan Bell and Michael Garland Ef-ficient sparse matrix-vector multiplica-tion on CUDA NVIDIA Technical Re-port NVR-2008-004 NVIDIA Corpora-tion Dec 2008

[Bolz 03] Jeff Bolz Ian Farmer Eitan Grinspun andPeter Schroder Sparse matrix solvers onthe GPU conjugate gradients and multi-grid ACM Trans Graph 22(3)917ndash9242003

[Bongart 07] Robert Bongart Efficient simulation offluid dynamics in a 3d game engine

Figure 7 A CPU version of [Stam 03] fluidsimulator with internal and moving bound-aries (red dots) using the GS solver

Masterrsquos thesis KTH Computer Scienceand Communication Stockholm Sweden2007

[Bridson 07] Robert Bridson and Matthias F MullerFluid simulation SIGGRAPH 2007course notes In ACM SIGGRAPH 2007Course Notes (SIGGRAPH rsquo07) pages 1ndash81 New York NY USA 2007 ACMPress

[Carlson 04] Mark Thomas Carlson Rigid meltingand flowing fluid PhD thesis AtlantaGA USA Jul 2004

[Goodnight 07] Nolan Goodnight CUDAOpenGL fluidsimulation httpdeveloperdownloadnvidiacomcomputecudasdkwebsitesampleshtmlpostProcessGL 2007

[Harris 05] Mark J Harris Fast fluid dynamicssimulation on the GPU In ACM SIG-GRAPH 2005 Course Notes (SIGGRAPHrsquo05) number 220 New York NY USA2005 ACM Press

Figure 8 A CPU version of [Stam 03] fluidsimulator with internal and moving bound-aries (red dots) using the CG solver

[Jeschke 03] Stefan Jeschke Hermann Birkholz andHeidrun Schumann A procedural modelfor interactive animation of breakingocean waves In The 11th InternationalConference in Central Europe on Com-puter Graphics Visualization and Com-puter Visionrsquo2003 (WSCG rsquo2003) 2003

[Kass 90] Michael Kass and Gavin Miller Rapidstable fluid dynamics for computer graph-ics In Proceedings of the 17th AnnualConference on Computer Graphics andInteractive Techniques (SIGGRAPHrsquo90)pages 49ndash57 ACM Press 1990

[Keenan 07] Crane Keenan Llamas Ignacio and TariqSarah Real-time simulation and render-ing of 3d fluids In Nguyen Hubert editorGPU Gems 3 chapter 30 pages 633ndash675Addison Wesley Professional Aug 2007

[Kim 08] Theodore Kim Hardware-aware analysisand optimization of stable fluids In Pro-ceedings of the 2008 Symposium on In-teractive 3D Graphics and Games (SI3Drsquo08) pages 99ndash106 2008

[Kruger 03] Jens Kruger and Rudiger WestermannLinear algebra operators for GPU imple-mentation of numerical algorithms InACM SIGGRAPH 2003 Papers (SIG-GRAPH rsquo03) pages 908ndash916 New YorkNY USA 2003 ACM Press

[Miklos 09] Balint Miklos Real time fluid simu-lation using height fields semester the-sis httpwwwbalintmikloscomlayered_waterpdf 2009

[Nickolls 08] John Nickolls Ian Buck Michael Gar-land and Kevin Skadron Scalable par-allel programming with CUDA Queue6(2)40ndash53 2008

[Noe 04] Karsten Noe Implementing rapidstable fluid dynamics on the GPUhttpprojectsn-o-edkpage=showampname=GPU20water20simulation 2004

[NVIDIA 08a] NVIDIA CUDA programmingguide 20 httpdeveloperdownloadnvidiacomcomputecuda2_0docsNVIDIA_CUDA_Programming_Guide_20pdfJul 2008

[NVIDIA 08b] NVIDIA CUDA reference manual 20httpdeveloperdownloadnvidiacomcomputecuda2_0docsCudaReferenceManual_20pdf Jun 2008

[Owens 08] John D Owens Mike Houston DavidLuebke Simon Green John E Stone andJames C Phillips GPU computing Pro-ceedings of the IEEE 96(5)879ndash89 2008

[Schlatter 99] Brian Schlatter A pedagogical tool us-ing smoothed particle hydrodynamics tomodel fluid flow past a system of cylin-ders Technical report Oregon State Uni-versity 1999

[Shewchuk 94] J R Shewchuk An introduction to theconjugate gradient method without the ag-onizing pain Technical report PittsburghPA USA Aug 1994

[Stam 99] Jos Stam Stable fluids In Proceedingsof the 26th Annual Conference on Com-puter Graphics and Interactive Techniques(SIGGRAPH rsquo99) pages 121ndash128 NewYork NY USA Aug 1999 ACM Press

[Stam 01] Jos Stam A simple fluid solver basedon the FFT J Graph Tools 6(2)43ndash522001

[Stam 03] Jos Stam Real-time fluid dynamics forgames In Proceedings of the Game De-veloper Conference Mar 2003

[TJ08] Tolke-Jonas Implementation of a Lattice-Boltzmann kernel using the compute uni-fied device architecture developed byNVIDIA Computing and Visualization inScience Feb 2008

[Wiggers 07] WA Wiggers V Bakker ABJKokkeler and GJM Smit Imple-menting the conjugate gradient algorithmon multi-core systems page 14 Nov2007

  • INTRODUCTION
  • PREVIOUS WORK
  • NVIDIA COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
  • STABLE FLUIDS
    • Add force (f term in Eq 1 and S term in Eq 2)
    • Advect (- (u)u term in Eq 1 and - (u) term in Eq 2)
    • Diffuse (v 2 u term in Eq 1 and k 2 term in Eq 2)
    • Move (- (u)u term in Eq 1) and v=0
      • SOLVER ALGORITHMS
        • Jacobi and Gauss-Seidel Solvers
        • Conjugate Gradient Solver
          • SOLVERS PERFORMANCE ANALYSIS
          • CONCLUSIONS AND FUTURE WORK
Page 6: Linear Solvers for Stable Fluids: GPU vs CPUzxu2/acms60212-40212-S12/final_project/Linear... · Linear Solvers for Stable Fluids: GPU vs CPU ... Fluid simulation has been an active

Thus previous values are used as in the CPU-based ver-sion The GPU-based implementation of Gauss-Seidel al-lows more iterations than the CPU-based implementationNevertheless it also takes two times more iterations to con-verge to the same values as the CPU-based implementa-tion

The Jacobi GPU-based version requires to temporarilystore each grid cell new value in a device global memory1D array (aux) After each iteration the values stored in xare replaced by the new values temporarily stored in axu(line 8 in Algorithm 4)

The GPU-based version of all solvers (J GS CG) sufferfrom global memory latency which appears during suc-cessive runs of the solvers (an issue for real time purposes)However only the Conjugate Gradient is affected to a levelof degrading notoriously the solver performance

Algorithm 4 Jacobi GPU kernelInputx 1D device global memory array with the grid currentvaluesx0 1D device global memory array with the grid previousvaluesaux auxiliary 1D device global memory array

akdt

h3for diffusion (see Eq 3) 1 for move (see Eq 4)

iter 1 +kdt

h3(see Eq 3) 6 for Move (see Eq 4)

max iter number of times to iterateOutputx new interpolated values of x

1 i = threadIdxx+ blockIdxxtimes blockDimx2 j = threadIdxy + blockIdxy times blockDimy3 for iteration = 0 to max iter do4 for k = 0 to NZ do5 if (i = 0) ampamp (i = NX minus 1) ampamp (j = 0) ampamp

(j = NY minus 1)ampamp(k = 0)ampamp(k = NZ minus 1) then6 auxijk = (x0ijk +atimes(ximinus1jk +xi+1jk +xijminus1k + xij+1k + xijkminus1 + xijk+1))iter

7 syncthreads8 xijk = auxijk

9 end if10 Enforce Boundary Conditions11 end for12 end for

52 Conjugate Gradient Solver

The Conjugate Gradient algorithm (see Algorithm 6) con-sists in a series of calls to functions in the CPU-based ver-sion or to a kernel call in the GPU-based implementation

_cgltltlt1NXgtgtgt(r p q x b alpha beta rho rho0 rho_old alarriter max_iter )

CUT_CHECK_ERROR (Kernel execution failed )

Before iterating it is first required (lines 1 to 4 of Algo-rithm 6) to set the initial values of r and p and of ρ0 andρ After the initial values are set up we are ready to iterate

Algorithm 5 Gauss-Seidel red black GPU kernelInputx 1D device global memory array with the grid currentvaluesx0 1D device global memory array with the grid previousvaluesakdt

h3for diffusion (see Eq 3) 1 for move (see Eq 4)

iter 1 +kdt

h3(see Eq 3) 6 for Move (see Eq 4)

max iter number of times to iterateOutputx new interpolated values of x

1 i = threadIdxx+ blockIdxxtimes blockDimx2 j = threadIdxy + blockIdxy times blockDimy3 for iteration = 0 to max iter do4 for k = 0 to NZ do5 if (i = 0) ampamp (i = NX minus 1) ampamp (j = 0) ampamp

(j = NY minus 1)ampamp(k = 0)ampamp(k = NZ minus 1) then6 if (i+ j)2 == 0 then7 xijk = (x0ijk +atimes(ximinus1jk +xi+1jk +xijminus1k + xij+1k + xijkminus1 + xijk+1))iter

8 end if9 syncthreads

10 if (i+ j)2 = 0 then11 xijk = (x0ijk +atimes(ximinus1jk +xi+1jk +

xijminus1k + xij+1k + xijkminus1 + xijk+1))iter12 end if13 end if14 Enforce Boundary Conditions15 end for16 end for

We will iterate until all iterations are done or the stop cri-terion is achieved (lines 5 and 6 of Algorithm 6) For eachiteration the first step (line 7 of Algorithm 6) is to updateq After updating q the next step (lines 8 to 12 of Algo-rithm 6) is to determine the new distance to travel alongp α During the update of α the dot product of p by qmust be determined After updating α we need to deter-mine the iterated values of x and the new r residues (lines9 and 10 of Algorithm 6 Before updating each grid cellprevious optimal search vector (gradient) that is orthogo-nal (conjugate) to all the previous search vectors p (line13 of Algorithm 6) ρold ρ and β must be updated (lines11 to 13 of Algorithm 6) After updating β the new searchdirections (p values) must be set

The most intuitive way to migrate the Conjugate Gradientfrom a sequential to a parallel algorithm is to perform itssteps (ie dot products update of grid positions etc) bykernels or using the CUDA BLAS library kernels How-ever most of these kernels must be called for a certainnumber of iterations Therefore the successive invoca-tion of kernels will result in timeouts in the simulationThe best solution found was to build up a massive ker-nel However this results in losing much of the CUDAperformance gains The reason is related with the parallelversion of dot product which forces the use of one block

Algorithm 6 Conjugate Gradient methodInputx 1D array with the grid current valuesx0 1D array with the grid previous valuesr p q auxiliary 1D arrays

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iteratetol tolerance after which is safe to state that the values ofx converged optimally Outputx 1D array with the grid new interpolated values

1 r = bminusAx2 p = r3 ρ = rT middot r4 ρ0 = ρ5 for iteration = 0 to max iter do6 if (ρ = 0) and ρ gt tol2 times ρ0 then7 q = Ap8 α = ρ(pT middot q)9 x+ = αtimes p

10 rminus = αtimes q11 ρ old = ρ12 ρ = rT middot r13 β = ρρ old14 p = r + β times p15 Enforce Boundary Conditions16 end if17 end for

with NX threads in x Much of the steps of the ConjugateGradient performance degrades with this restriction Evenworse this version has worst performance than the CPU-based version

6 SOLVERS PERFORMANCE ANALYSIS

After implementing the solvers tests of their overallperformance were made (see Tables 1 and 2) Thesolvers both for GPU and CPU were tested on a In-tel(R) Core(TM)2 Quad CPU Q6600240GHz with4096MBytes of DDR2 RAM and an NVIDIA GeForce8800 GT graphics card The CPU-based version is purelysequential ie it runs in a single core and it is not multi-threaded The following tables show the total time (lsquoCPUtimersquo and lsquoGPU timersquo) that each solver takes for a certainnumber of iterations (lsquoIterationsrsquo) and a specific grid size(lsquoGrid Sizersquo) Each solver is invoked a number of times ina single step of the stable fluids method (5 for 2D and 6for 3D) For 2D we tested each solver using 10 iterationsfor all grid sizes In 3D we used 4 iterations instead foreach solver In 2D 10 iterations suffice while we needa minimum number of 4 iterations in 3D to ensure someconvergence of the results More accurate converging val-ues result in better visual quality A total time superior to33ms does not guarantee real-time performance ie noframe rate greater than 30 frames per second is achievedThe time values were obtained from the average time of 10

tests for each solver implementation independently of thegrid size

Grid CPU Time GPU TimeSize Iterations (ms) (ms)

J 10 0 0 0 290219GS 322 10 0 0 0 290032CG 10 0 0 0 567346

J 10 2 0 0 290900GS 642 10 3 0 0 295285CG 10 3 0 0 571604

J 10 8 0 0 293459GS 1282 10 15 0 0 289818CG 10 13 0 0 573089

J 10 35 0 0 296003GS 2562 10 60 0 0 289882CG 10 56 0 0 579169

J 10 298 0 0 306887GS 5122 10 245 0 0 308024CG 10 350 0 0 580205

Table 1 Performance of 2D solvers for CPUand GPU for distinct grid sizes

Grid CPU Time GPU TimeSize Iterations (ms) (ms)

J 4 10 0 0 408663GS 323 4 15 0 0 343595CG 4 15 0 0 665508

J 4 170 0 0 416022GS 643 4 141 0 0 348235CG 4 239 0 0 673512

J 4 1482 0 0 424283GS 1283 4 1208 0 0 360813CG 4 2017 0 0 68272

Table 2 Performance of 3D solvers for CPUand GPU for distinct grid sizes

From the previous tables we can draw some conclusionsWhen comparing the columns lsquoCPU timersquo and lsquoGPU timersquoit is clear that in the 2D and 3D versions the GPU-basedimplementation surpasses the CPU-based implementationHowever the previous tables only present the processingtimes not including timeouts When running the simula-tion timeouts appear from device global memory latencyin the GPU-based version These timeouts are hidden (iethey exist but they do not significantly degrade the solveroverall performance) in the Jacobi and Gauss-Seidel GPUimplementations Unfortunately the Conjugate Gradienttimeouts are so severe that the losses overcome the gainsin time

The Conjugate Gradient method converges faster than Ja-cobi and Gauss-Seidel in spite of involving more computa-

tions during each iteration but this is not visible for a smallnumber of iterations Therefore CPU-based implementa-tion of stable fluids using the Conjugate Gradient solveris inadequate for real-time purposes when the grid size isover 1282

Except for the 322 grid the GS and J solvers have sig-nificant gains on GPU Comparing lsquoCPU timersquo and lsquoGPUtimersquo for J and GS solvers it becomes clear that the GPU-based versions are faster Besides we can fit more solveriterations per second using the GPU-based implementa-tion Unlike the CPU-based version the GPU-based ver-sions of J and GS solvers enable the usage of a 1283 gridin real-time Thus the observation of the lsquoIterationsrsquo andlsquoGPU timersquo columns leads us to conclude that GS is thebest choice for 2D and 3D grid sizes In the CPU-basedversions the best choice in 2D is the J solver except forthe 5122 grid where GS is the best choice In 3D theCPU-based version of GS is a better choice for grid sizessuperior to 323

Another important consideration has to do with the timecomplexity of both CPU- and GPU-based implementationsof stable fluids Looking at Tables 1 and 2 we easily ob-serve that the GPU-based solvers have constant complex-ity approximately On the other hand CPU-based solvershave quadratic complexity for small grids but tend to cubiccomplexity (ie the worst case) for larger grids Howevercomputing the time complexity more accurately would re-quire more exhaustive experiments as well as a theoreticalanalysis

Figs 6 to 8 show a 1282 fluid simulation with internaland moving boundaries (red dots) Rendering was doneusing OpenGL Vertex Buffer Objects The CPU version isthe one here shown The frame rate includes the renderingtime

7 CONCLUSIONS AND FUTURE WORK

This paper has described CUDA-based implementations ofJacobi Gauss-Seidel and Conjugate Gradient solvers for3D stable fluids on GPU These solvers have been thencompared to each other including their CPU-based imple-mentations The most important result from this compar-ative study is that the GPU-based implementations haveconstant time complexity which allows to have a more ac-curate control in real-time applications

The 3D stable fluids method has significant memory re-quirements and time restrictions to solve the Navier-Stokesequations at each time step It remains to prove that otheralternatives (not addressed in this paper) to 3D fluid sim-ulations such as Shallow Water Equations [Miklos 09]the Lattice Boltzmann Method [TJ08] the Smoothed Par-ticle Hydrodynamics [Schlatter 99] or procedural meth-ods [Jeschke 03] are better choices We hope to exploreother emerging solvers for sparse linear systems in a nearfuture In particular we need a solver with a better conver-gence rate than relaxation techniques (J and GS) and withno significant extra computational effort such as the CG

Figure 6 A CPU version of [Stam 03] fluidsimulator with internal and moving bound-aries (red dots) using the J solver

References

[Amorim 08] Ronan Amorim Gundolf Haase ManfredLiebmann and Rodrigo Weber Compar-ing CUDA and OpenGL implementationsfor a Jacobi iteration Technical Report025 University of Graz SFB Dec 2008

[Ash 05] Michael Ash Simulation and visualiza-tion of a 3d fluid Masterrsquos thesis Univer-site drsquoOrleans France Sep 2005

[Bell 08] Nathan Bell and Michael Garland Ef-ficient sparse matrix-vector multiplica-tion on CUDA NVIDIA Technical Re-port NVR-2008-004 NVIDIA Corpora-tion Dec 2008

[Bolz 03] Jeff Bolz Ian Farmer Eitan Grinspun andPeter Schroder Sparse matrix solvers onthe GPU conjugate gradients and multi-grid ACM Trans Graph 22(3)917ndash9242003

[Bongart 07] Robert Bongart Efficient simulation offluid dynamics in a 3d game engine

Figure 7 A CPU version of [Stam 03] fluidsimulator with internal and moving bound-aries (red dots) using the GS solver

Masterrsquos thesis KTH Computer Scienceand Communication Stockholm Sweden2007

[Bridson 07] Robert Bridson and Matthias F MullerFluid simulation SIGGRAPH 2007course notes In ACM SIGGRAPH 2007Course Notes (SIGGRAPH rsquo07) pages 1ndash81 New York NY USA 2007 ACMPress

[Carlson 04] Mark Thomas Carlson Rigid meltingand flowing fluid PhD thesis AtlantaGA USA Jul 2004

[Goodnight 07] Nolan Goodnight CUDAOpenGL fluidsimulation httpdeveloperdownloadnvidiacomcomputecudasdkwebsitesampleshtmlpostProcessGL 2007

[Harris 05] Mark J Harris Fast fluid dynamicssimulation on the GPU In ACM SIG-GRAPH 2005 Course Notes (SIGGRAPHrsquo05) number 220 New York NY USA2005 ACM Press

Figure 8 A CPU version of [Stam 03] fluidsimulator with internal and moving bound-aries (red dots) using the CG solver

[Jeschke 03] Stefan Jeschke Hermann Birkholz andHeidrun Schumann A procedural modelfor interactive animation of breakingocean waves In The 11th InternationalConference in Central Europe on Com-puter Graphics Visualization and Com-puter Visionrsquo2003 (WSCG rsquo2003) 2003

[Kass 90] Michael Kass and Gavin Miller Rapidstable fluid dynamics for computer graph-ics In Proceedings of the 17th AnnualConference on Computer Graphics andInteractive Techniques (SIGGRAPHrsquo90)pages 49ndash57 ACM Press 1990

[Keenan 07] Crane Keenan Llamas Ignacio and TariqSarah Real-time simulation and render-ing of 3d fluids In Nguyen Hubert editorGPU Gems 3 chapter 30 pages 633ndash675Addison Wesley Professional Aug 2007

[Kim 08] Theodore Kim Hardware-aware analysisand optimization of stable fluids In Pro-ceedings of the 2008 Symposium on In-teractive 3D Graphics and Games (SI3Drsquo08) pages 99ndash106 2008

[Kruger 03] Jens Kruger and Rudiger WestermannLinear algebra operators for GPU imple-mentation of numerical algorithms InACM SIGGRAPH 2003 Papers (SIG-GRAPH rsquo03) pages 908ndash916 New YorkNY USA 2003 ACM Press

[Miklos 09] Balint Miklos Real time fluid simu-lation using height fields semester the-sis httpwwwbalintmikloscomlayered_waterpdf 2009

[Nickolls 08] John Nickolls Ian Buck Michael Gar-land and Kevin Skadron Scalable par-allel programming with CUDA Queue6(2)40ndash53 2008

[Noe 04] Karsten Noe Implementing rapidstable fluid dynamics on the GPUhttpprojectsn-o-edkpage=showampname=GPU20water20simulation 2004

[NVIDIA 08a] NVIDIA CUDA programmingguide 20 httpdeveloperdownloadnvidiacomcomputecuda2_0docsNVIDIA_CUDA_Programming_Guide_20pdfJul 2008

[NVIDIA 08b] NVIDIA CUDA reference manual 20httpdeveloperdownloadnvidiacomcomputecuda2_0docsCudaReferenceManual_20pdf Jun 2008

[Owens 08] John D Owens Mike Houston DavidLuebke Simon Green John E Stone andJames C Phillips GPU computing Pro-ceedings of the IEEE 96(5)879ndash89 2008

[Schlatter 99] Brian Schlatter A pedagogical tool us-ing smoothed particle hydrodynamics tomodel fluid flow past a system of cylin-ders Technical report Oregon State Uni-versity 1999

[Shewchuk 94] J R Shewchuk An introduction to theconjugate gradient method without the ag-onizing pain Technical report PittsburghPA USA Aug 1994

[Stam 99] Jos Stam Stable fluids In Proceedingsof the 26th Annual Conference on Com-puter Graphics and Interactive Techniques(SIGGRAPH rsquo99) pages 121ndash128 NewYork NY USA Aug 1999 ACM Press

[Stam 01] Jos Stam A simple fluid solver basedon the FFT J Graph Tools 6(2)43ndash522001

[Stam 03] Jos Stam Real-time fluid dynamics forgames In Proceedings of the Game De-veloper Conference Mar 2003

[TJ08] Tolke-Jonas Implementation of a Lattice-Boltzmann kernel using the compute uni-fied device architecture developed byNVIDIA Computing and Visualization inScience Feb 2008

[Wiggers 07] WA Wiggers V Bakker ABJKokkeler and GJM Smit Imple-menting the conjugate gradient algorithmon multi-core systems page 14 Nov2007

  • INTRODUCTION
  • PREVIOUS WORK
  • NVIDIA COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)
  • STABLE FLUIDS
    • Add force (f term in Eq 1 and S term in Eq 2)
    • Advect (- (u)u term in Eq 1 and - (u) term in Eq 2)
    • Diffuse (v 2 u term in Eq 1 and k 2 term in Eq 2)
    • Move (- (u)u term in Eq 1) and v=0
      • SOLVER ALGORITHMS
        • Jacobi and Gauss-Seidel Solvers
        • Conjugate Gradient Solver
          • SOLVERS PERFORMANCE ANALYSIS
          • CONCLUSIONS AND FUTURE WORK
Page 7: Linear Solvers for Stable Fluids: GPU vs CPUzxu2/acms60212-40212-S12/final_project/Linear... · Linear Solvers for Stable Fluids: GPU vs CPU ... Fluid simulation has been an active

Algorithm 6 Conjugate Gradient methodInputx 1D array with the grid current valuesx0 1D array with the grid previous valuesr p q auxiliary 1D arrays

akdt

h3(see Eq 3)

iter 1 +kdt

h3(see Eq 3)

max iter number of times to iteratetol tolerance after which is safe to state that the values ofx converged optimally Outputx 1D array with the grid new interpolated values

1 r = bminusAx2 p = r3 ρ = rT middot r4 ρ0 = ρ5 for iteration = 0 to max iter do6 if (ρ = 0) and ρ gt tol2 times ρ0 then7 q = Ap8 α = ρ(pT middot q)9 x+ = αtimes p

10 rminus = αtimes q11 ρ old = ρ12 ρ = rT middot r13 β = ρρ old14 p = r + β times p15 Enforce Boundary Conditions16 end if17 end for

with NX threads in x Much of the steps of the ConjugateGradient performance degrades with this restriction Evenworse this version has worst performance than the CPU-based version

6 SOLVERS PERFORMANCE ANALYSIS

After implementing the solvers tests of their overallperformance were made (see Tables 1 and 2) Thesolvers both for GPU and CPU were tested on a In-tel(R) Core(TM)2 Quad CPU Q6600240GHz with4096MBytes of DDR2 RAM and an NVIDIA GeForce8800 GT graphics card The CPU-based version is purelysequential ie it runs in a single core and it is not multi-threaded The following tables show the total time (lsquoCPUtimersquo and lsquoGPU timersquo) that each solver takes for a certainnumber of iterations (lsquoIterationsrsquo) and a specific grid size(lsquoGrid Sizersquo) Each solver is invoked a number of times ina single step of the stable fluids method (5 for 2D and 6for 3D) For 2D we tested each solver using 10 iterationsfor all grid sizes In 3D we used 4 iterations instead foreach solver In 2D 10 iterations suffice while we needa minimum number of 4 iterations in 3D to ensure someconvergence of the results More accurate converging val-ues result in better visual quality A total time superior to33ms does not guarantee real-time performance ie noframe rate greater than 30 frames per second is achievedThe time values were obtained from the average time of 10

tests for each solver implementation independently of thegrid size

Grid CPU Time GPU TimeSize Iterations (ms) (ms)

J 10 0 0 0 290219GS 322 10 0 0 0 290032CG 10 0 0 0 567346

J 10 2 0 0 290900GS 642 10 3 0 0 295285CG 10 3 0 0 571604

J 10 8 0 0 293459GS 1282 10 15 0 0 289818CG 10 13 0 0 573089

J 10 35 0 0 296003GS 2562 10 60 0 0 289882CG 10 56 0 0 579169

J 10 298 0 0 306887GS 5122 10 245 0 0 308024CG 10 350 0 0 580205

Table 1 Performance of 2D solvers for CPUand GPU for distinct grid sizes

Grid CPU Time GPU TimeSize Iterations (ms) (ms)

J 4 10 0 0 408663GS 323 4 15 0 0 343595CG 4 15 0 0 665508

J 4 170 0 0 416022GS 643 4 141 0 0 348235CG 4 239 0 0 673512

J 4 1482 0 0 424283GS 1283 4 1208 0 0 360813CG 4 2017 0 0 68272

Table 2 Performance of 3D solvers for CPUand GPU for distinct grid sizes

From the previous tables we can draw some conclusionsWhen comparing the columns lsquoCPU timersquo and lsquoGPU timersquoit is clear that in the 2D and 3D versions the GPU-basedimplementation surpasses the CPU-based implementationHowever the previous tables only present the processingtimes not including timeouts When running the simula-tion timeouts appear from device global memory latencyin the GPU-based version These timeouts are hidden (iethey exist but they do not significantly degrade the solveroverall performance) in the Jacobi and Gauss-Seidel GPUimplementations Unfortunately the Conjugate Gradienttimeouts are so severe that the losses overcome the gainsin time

The Conjugate Gradient method converges faster than Ja-cobi and Gauss-Seidel in spite of involving more computa-

tions during each iteration but this is not visible for a smallnumber of iterations Therefore CPU-based implementa-tion of stable fluids using the Conjugate Gradient solveris inadequate for real-time purposes when the grid size isover 1282

Except for the 322 grid the GS and J solvers have sig-nificant gains on GPU Comparing lsquoCPU timersquo and lsquoGPUtimersquo for J and GS solvers it becomes clear that the GPU-based versions are faster Besides we can fit more solveriterations per second using the GPU-based implementa-tion Unlike the CPU-based version the GPU-based ver-sions of J and GS solvers enable the usage of a 1283 gridin real-time Thus the observation of the lsquoIterationsrsquo andlsquoGPU timersquo columns leads us to conclude that GS is thebest choice for 2D and 3D grid sizes In the CPU-basedversions the best choice in 2D is the J solver except forthe 5122 grid where GS is the best choice In 3D theCPU-based version of GS is a better choice for grid sizessuperior to 323

Another important consideration has to do with the timecomplexity of both CPU- and GPU-based implementationsof stable fluids Looking at Tables 1 and 2 we easily ob-serve that the GPU-based solvers have constant complex-ity approximately On the other hand CPU-based solvershave quadratic complexity for small grids but tend to cubiccomplexity (ie the worst case) for larger grids Howevercomputing the time complexity more accurately would re-quire more exhaustive experiments as well as a theoreticalanalysis

Figs 6 to 8 show a 1282 fluid simulation with internaland moving boundaries (red dots) Rendering was doneusing OpenGL Vertex Buffer Objects The CPU version isthe one here shown The frame rate includes the renderingtime

7 CONCLUSIONS AND FUTURE WORK


tions during each iteration, but this is not visible for a small number of iterations. Therefore, the CPU-based implementation of stable fluids using the Conjugate Gradient solver is inadequate for real-time purposes when the grid size is over 128².
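
The extra per-iteration cost of CG comes from the sparse matrix-vector product, the two dot products, and the three vector updates that every iteration performs, whereas a J or GS iteration is a single stencil sweep. The following is a minimal CPU sketch of one CG iteration for the 2D diffusion system (I - a∇²)x = b; the grid layout, the coefficient a, and all function names are our assumptions for illustration, not the paper's code.

    #define IX(i, j) ((i) + (N + 2) * (j))   /* Stam-style grid indexing (assumed) */

    /* y = A x for the diffusion matrix: (1 + 4a) on the diagonal, -a on the four neighbours. */
    static void apply_A(int N, float a, const float *x, float *y) {
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++)
                y[IX(i, j)] = (1.0f + 4.0f * a) * x[IX(i, j)]
                            - a * (x[IX(i - 1, j)] + x[IX(i + 1, j)]
                                 + x[IX(i, j - 1)] + x[IX(i, j + 1)]);
    }

    static float dot(int N, const float *u, const float *v) {
        float s = 0.0f;
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++)
                s += u[IX(i, j)] * v[IX(i, j)];
        return s;
    }

    /* One CG iteration: one apply_A, two dot products, three vector updates.
       rr is dot(r, r) from the previous iteration; the new value is returned. */
    static float cg_iteration(int N, float a, float *x, float *r, float *p,
                              float *Ap, float rr) {
        apply_A(N, a, p, Ap);
        float alpha = rr / dot(N, p, Ap);
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++) {
                x[IX(i, j)] += alpha * p[IX(i, j)];
                r[IX(i, j)] -= alpha * Ap[IX(i, j)];
            }
        float rr_new = dot(N, r, r);
        float beta = rr_new / rr;
        for (int j = 1; j <= N; j++)
            for (int i = 1; i <= N; i++)
                p[IX(i, j)] = r[IX(i, j)] + beta * p[IX(i, j)];
        return rr_new;   /* the caller stops when this falls below a tolerance */
    }

Since a J or GS iteration performs only the stencil sweep, CG pays off only when its faster convergence outweighs this extra per-iteration work.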

Except for the 32² grid, the GS and J solvers show significant gains on the GPU. Comparing 'CPU time' and 'GPU time' for the J and GS solvers, it becomes clear that the GPU-based versions are faster. Besides, we can fit more solver iterations per second using the GPU-based implementation. Unlike the CPU-based version, the GPU-based versions of the J and GS solvers enable the usage of a 128³ grid in real time. Thus, the observation of the 'Iterations' and 'GPU time' columns leads us to conclude that GS is the best choice for both 2D and 3D grid sizes. In the CPU-based versions, the best choice in 2D is the J solver, except for the 512² grid, where GS is the best choice. In 3D, the CPU-based version of GS is the better choice for grid sizes larger than 32³.
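
The paper does not detail how the 'CPU time' and 'GPU time' columns were collected; a common approach, sketched below under the assumption of a Jacobi kernel with a fixed iteration count (the kernel name, launch configuration, and buffer layout are ours), is to bracket the solver loop with CUDA events, so that asynchronous kernel launches are fully accounted for.

    #include <cuda_runtime.h>

    #define IX(i, j) ((i) + (N + 2) * (j))   /* same grid layout as above (assumed) */

    /* One Jacobi relaxation step per interior cell (minimal sketch), with c = 1 + 4a. */
    __global__ void jacobi_kernel(float *x_new, const float *x_old, const float *x0,
                                  float a, float c, int N) {
        int i = blockIdx.x * blockDim.x + threadIdx.x + 1;
        int j = blockIdx.y * blockDim.y + threadIdx.y + 1;
        if (i <= N && j <= N)
            x_new[IX(i, j)] = (x0[IX(i, j)]
                             + a * (x_old[IX(i - 1, j)] + x_old[IX(i + 1, j)]
                                  + x_old[IX(i, j - 1)] + x_old[IX(i, j + 1)])) / c;
    }

    /* Returns the elapsed GPU time, in milliseconds, for a full solve.
       After an odd number of iterations the result resides in d_tmp. */
    float time_gpu_jacobi(float *d_x, float *d_tmp, const float *d_x0,
                          float a, float c, int N, int iterations) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        dim3 block(16, 16);
        dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);

        cudaEventRecord(start, 0);
        for (int k = 0; k < iterations; k++) {
            jacobi_kernel<<<grid, block>>>(d_tmp, d_x, d_x0, a, c, N);
            float *swap = d_x; d_x = d_tmp; d_tmp = swap;   /* ping-pong buffers */
        }
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);          /* wait for all queued kernels to finish */

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }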

Another important consideration has to do with the time complexity of both the CPU- and GPU-based implementations of stable fluids. Looking at Tables 1 and 2, we easily observe that the GPU-based solvers have approximately constant time complexity. On the other hand, the CPU-based solvers have quadratic complexity for small grids, but tend towards cubic complexity (i.e., the worst case) for larger grids. However, computing the time complexity more accurately would require more exhaustive experiments, as well as a theoretical analysis.

Figs. 6 to 8 show a 128² fluid simulation with internal and moving boundaries (red dots). Rendering was done using OpenGL Vertex Buffer Objects. The CPU version is the one shown here. The frame rate includes the rendering time.
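
Since the density field in Figs. 6 to 8 is drawn through an OpenGL Vertex Buffer Object, a minimal sketch of how a per-cell colour VBO could be refreshed each frame is shown below; the buffer id, the dens array, and the 128² sizing are assumptions about the demo, not the paper's rendering code.

    #include <GL/glew.h>

    #define GRID 128                              /* demo grid resolution (assumed) */

    /* Upload one greyscale RGB colour per cell from the density field. The VBO is
       assumed to have been created and sized once with glBufferData elsewhere. */
    static void update_density_vbo(GLuint colour_vbo, const float *dens) {
        static float colours[3 * GRID * GRID];
        for (int j = 0; j < GRID; j++)
            for (int i = 0; i < GRID; i++) {
                float d = dens[(i + 1) + (GRID + 2) * (j + 1)];  /* Stam-style indexing */
                if (d > 1.0f) d = 1.0f;
                colours[3 * (i + GRID * j) + 0] = d;
                colours[3 * (i + GRID * j) + 1] = d;
                colours[3 * (i + GRID * j) + 2] = d;
            }
        glBindBuffer(GL_ARRAY_BUFFER, colour_vbo);
        glBufferSubData(GL_ARRAY_BUFFER, 0, sizeof(colours), colours);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
    }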

7. CONCLUSIONS AND FUTURE WORK

This paper has described CUDA-based implementations of the Jacobi, Gauss-Seidel, and Conjugate Gradient solvers for 3D stable fluids on the GPU. These solvers have then been compared to each other, as well as to their CPU-based implementations. The most important result of this comparative study is that the GPU-based implementations have approximately constant time complexity, which allows for more accurate control in real-time applications.

The 3D stable fluids method has significant memory requirements and time restrictions to solve the Navier-Stokes equations at each time step. It remains to be shown whether other alternatives to 3D fluid simulation (not addressed in this paper), such as the Shallow Water Equations [Miklos 09], the Lattice Boltzmann Method [TJ08], Smoothed Particle Hydrodynamics [Schlatter 99], or procedural methods [Jeschke 03], are better choices. We hope to explore other emerging solvers for sparse linear systems in the near future. In particular, we need a solver with a better convergence rate than relaxation techniques (J and GS), but without the significant extra computational effort per iteration of CG.

Figure 6. A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the J solver.

Figure 7. A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the GS solver.

Figure 8. A CPU version of the [Stam 03] fluid simulator with internal and moving boundaries (red dots), using the CG solver.

References

[Amorim 08] Ronan Amorim, Gundolf Haase, Manfred Liebmann, and Rodrigo Weber. Comparing CUDA and OpenGL implementations for a Jacobi iteration. Technical Report 025, University of Graz, SFB, Dec 2008.

[Ash 05] Michael Ash. Simulation and visualization of a 3D fluid. Master's thesis, Université d'Orléans, France, Sep 2005.

[Bell 08] Nathan Bell and Michael Garland. Efficient sparse matrix-vector multiplication on CUDA. NVIDIA Technical Report NVR-2008-004, NVIDIA Corporation, Dec 2008.

[Bolz 03] Jeff Bolz, Ian Farmer, Eitan Grinspun, and Peter Schröder. Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph., 22(3):917–924, 2003.

[Bongart 07] Robert Bongart. Efficient simulation of fluid dynamics in a 3D game engine. Master's thesis, KTH Computer Science and Communication, Stockholm, Sweden, 2007.

[Bridson 07] Robert Bridson and Matthias F. Müller. Fluid simulation: SIGGRAPH 2007 course notes. In ACM SIGGRAPH 2007 Course Notes (SIGGRAPH '07), pages 1–81, New York, NY, USA, 2007. ACM Press.

[Carlson 04] Mark Thomas Carlson. Rigid, melting and flowing fluid. PhD thesis, Atlanta, GA, USA, Jul 2004.

[Goodnight 07] Nolan Goodnight. CUDA/OpenGL fluid simulation. http://developer.download.nvidia.com/compute/cuda/sdk/website/samples.html#postProcessGL, 2007.

[Harris 05] Mark J. Harris. Fast fluid dynamics simulation on the GPU. In ACM SIGGRAPH 2005 Course Notes (SIGGRAPH '05), number 220, New York, NY, USA, 2005. ACM Press.

[Jeschke 03] Stefan Jeschke, Hermann Birkholz, and Heidrun Schumann. A procedural model for interactive animation of breaking ocean waves. In The 11th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision 2003 (WSCG '2003), 2003.

[Kass 90] Michael Kass and Gavin Miller. Rapid, stable fluid dynamics for computer graphics. In Proceedings of the 17th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '90), pages 49–57. ACM Press, 1990.

[Keenan 07] Crane Keenan, Llamas Ignacio, and Tariq Sarah. Real-time simulation and rendering of 3D fluids. In Nguyen Hubert, editor, GPU Gems 3, chapter 30, pages 633–675. Addison Wesley Professional, Aug 2007.

[Kim 08] Theodore Kim. Hardware-aware analysis and optimization of stable fluids. In Proceedings of the 2008 Symposium on Interactive 3D Graphics and Games (SI3D '08), pages 99–106, 2008.

[Kruger 03] Jens Krüger and Rüdiger Westermann. Linear algebra operators for GPU implementation of numerical algorithms. In ACM SIGGRAPH 2003 Papers (SIGGRAPH '03), pages 908–916, New York, NY, USA, 2003. ACM Press.

[Miklos 09] Balint Miklos. Real time fluid simulation using height fields. Semester thesis, http://www.balintmiklos.com/layered_water.pdf, 2009.

[Nickolls 08] John Nickolls, Ian Buck, Michael Garland, and Kevin Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008.

[Noe 04] Karsten Noe. Implementing rapid, stable fluid dynamics on the GPU. http://projects.n-o-e.dk/?page=show&name=GPU%20water%20simulation, 2004.

[NVIDIA 08a] NVIDIA. CUDA programming guide 2.0. http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf, Jul 2008.

[NVIDIA 08b] NVIDIA. CUDA reference manual 2.0. http://developer.download.nvidia.com/compute/cuda/2_0/docs/CudaReferenceManual_2.0.pdf, Jun 2008.

[Owens 08] John D. Owens, Mike Houston, David Luebke, Simon Green, John E. Stone, and James C. Phillips. GPU computing. Proceedings of the IEEE, 96(5):879–899, 2008.

[Schlatter 99] Brian Schlatter. A pedagogical tool using smoothed particle hydrodynamics to model fluid flow past a system of cylinders. Technical report, Oregon State University, 1999.

[Shewchuk 94] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, Pittsburgh, PA, USA, Aug 1994.

[Stam 99] Jos Stam. Stable fluids. In Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH '99), pages 121–128, New York, NY, USA, Aug 1999. ACM Press.

[Stam 01] Jos Stam. A simple fluid solver based on the FFT. J. Graph. Tools, 6(2):43–52, 2001.

[Stam 03] Jos Stam. Real-time fluid dynamics for games. In Proceedings of the Game Developer Conference, Mar 2003.

[TJ08] Jonas Tölke. Implementation of a Lattice-Boltzmann kernel using the compute unified device architecture developed by NVIDIA. Computing and Visualization in Science, Feb 2008.

[Wiggers 07] W.A. Wiggers, V. Bakker, A.B.J. Kokkeler, and G.J.M. Smit. Implementing the conjugate gradient algorithm on multi-core systems, page 14, Nov 2007.
