InteractiveWater_Lennartsson_dice


MASTER'S THESIS IN COMPUTER SCIENCE

Data Oriented Interactive Water

An Interactive Water Simulation For PlayStation 3

JOEL LENNARTSSON
June 25, 2012

Examiner: Ingemar Ragnemalm

Supervisor: Jens Ogniewski

Supervisor at DICE: Torbjörn Söderman



Abstract

In this report, a method for simulating interactive height-field based water on a parallel architecture is presented. This simulation is designed for faster than real time applications and is highly suitable for video games on current generation home computers. Specifically, the implementation proposed in this report is targeted at the Sony PlayStation 3. This platform requires code to be both highly parallelized and data oriented in order to take advantage of the available hardware, which makes it an ideal platform for evaluating parallel code. The simulation captures the dispersive property of water and is scalable from small collections of water to large lakes. It also uses dynamic Level Of Detail to achieve constant performance while at the same time presenting high fidelity animated water to the player. This report describes the simulation method and implementation in detail along with a performance analysis and discussion.



Contents

1 Introduction

2 Background And Related Works
    2.1 Navier-Stokes Equations
    2.2 Numerical Methods
        2.2.1 Lagrangian Particles
        2.2.2 Eulerian Grids
    2.3 Height Field Methods
        2.3.1 Shallow Water Equations
        2.3.2 Linear Wave Theory
        2.3.3 Wave Equation
    2.4 Related Works
        2.4.1 Gerstner Waves
        2.4.2 Fast Fourier Transforms
        2.4.3 Semi-Lagrangian Method
        2.4.4 Wave Particles
        2.4.5 Detailed Flow
        2.4.6 Choppy Waves

3 Architectures
    3.1 PC
        3.1.1 Hardware
        3.1.2 Performance
    3.2 Xbox 360
        3.2.1 Hardware
        3.2.2 Performance
    3.3 PlayStation 3
        3.3.1 Hardware
        3.3.2 Performance

4 Parallelization And Optimization
    4.1 Parallel Methods
        4.1.1 Task Parallelism
        4.1.2 Data Parallelism
    4.2 Data Oriented Design
        4.2.1 Object Oriented Programming
        4.2.2 Cache Inefficiency of OOP
        4.2.3 Structure Of Arrays
    4.3 Optimization Details
        4.3.1 SIMD Vectorization
        4.3.2 Software Pipelining
        4.3.3 Careful Branching
        4.3.4 Load-Hit-Stores

5 Algorithm
    5.1 Earlier work
        5.1.1 Surface Cells
        5.1.2 Dispersion By Convolution
        5.1.3 Kernel Approximation
        5.1.4 Laplacian Pyramids
        5.1.5 Grid Summation
        5.1.6 Level Of Detail
        5.1.7 Interaction
        5.1.8 Stitching
    5.2 Implementation Issues
        5.2.1 Parallelism
        5.2.2 Scaling
        5.2.3 Border Copying
        5.2.4 Data Locality
    5.3 Improved algorithm
        5.3.1 Homogeneous Grids
        5.3.2 Quad-trees
        5.3.3 Large cells
        5.3.4 Memory Layout

6 Implementation
    6.1 Engine
        6.1.1 Frame Overview
    6.2 Simulation
        6.2.1 Data Layout
        6.2.2 Grid Dimensions
    6.3 Interaction Setup
    6.4 Frame Setup
        6.4.1 Level Of Detail
        6.4.2 Fading
    6.5 Update Passes
        6.5.1 SPE Jobs
        6.5.2 Applying Disturbances
        6.5.3 Wave Propagation
        6.5.4 Border Copying
        6.5.5 Grid Summation
    6.6 Rendering
        6.6.1 Drawing
        6.6.2 Dispatch

7 Results and analysis
    7.1 Previous Implementation
    7.2 Parallel Implementation
    7.3 Analysis

8 Discussion
    8.1 Design
        8.1.1 Homogeneous Grids
        8.1.2 Improved Level Of Detail
        8.1.3 Fading
    8.2 Future Implementations
        8.2.1 Enhanced Effects
        8.2.2 Ambient Waves
        8.2.3 Customized Interaction
        8.2.4 Situational Level Of Detail
        8.2.5 Better Boundary Conditions
        8.2.6 Non-Linear Texture Mapping
        8.2.7 Performance Efficient Flow Simulation
        8.2.8 Minimized Vertices
        8.2.9 Mesh Reduction
        8.2.10 Unified Quad-tree
        8.2.11 Multiple Convolutions
    8.3 Conclusion

Bibliography


Chapter 1

Introduction

This is a thesis report for a bachelor's degree in computer science at Linköping University, Sweden, written in 2012. The work for this thesis was done at EA DICE (DICE) and implemented in the Frostbite 2 game engine. This thesis is the continued work of a master's thesis by Björn Ottosson [Ott11], also done at DICE, which presented a height-field based method for simulating real-time interactive water with the dispersion property.

Ottosson's work was prototyped inside the Frostbite 2 engine developed at DICE and managed to simulate and render high quality water waves at under 3 milliseconds per frame on a single Intel Xeon core. On a system where multiple processing cores are available, however, which has long been the norm, the simulation will not utilize more than one core. On the PC platform, running on only a single core is not a huge issue since the water simulation can be included as a graphics option for high-end PCs. However, this is not possible on consoles, where the hardware is identical for all users and resources are scarce. Since DICE produces multi-platform games, the simulation needs to work, with similar visual quality, on at least PC, Microsoft Xbox 360 and Sony PlayStation 3, preferably still under 3 milliseconds on each platform.

Because of the linear nature of programs and code, it is often difficult to properly utilize all available cores on a system. It is easy to just let most of the program run on the main processor. The assumption is that all three platforms would benefit from a simulation which is able to use multiple processing cores simultaneously, mainly because secondary processors might otherwise go unused but also to let the main thread perform other important tasks. This is especially true on the PlayStation 3 since it, along with six specialized vector processors, only has one general purpose core available.

The goal for this thesis work was therefore to produce an adaptation of the existing simulation better suited for parallel architectures. This includes redesigning large portions of the previous algorithm, mostly changing implementation details but also how the algorithm worked in general. For this assignment, the PlayStation 3 was chosen as the target platform for implementation since its architecture places such high demands on how well the code is parallelized. Another reason for this decision is that a simulation working on this platform is reasonably simple to port to other parallel architectures afterwards.

The result is a high quality interactive water simulation, running at good performance, utilizing all of the PlayStation 3's vector processors simultaneously.

This report contains some general background information on real time water simulation along with an overview of both the previous and the parallelized algorithm. Techniques and theory used in the simulation will be described in detail along with a description of the implementation. There is also a chapter dedicated to data oriented programming since this is of high importance when optimizing for performance in games. Results are presented in detail along with a performance analysis and comparisons with the previous water simulation. Finally, there is a discussion around possible ways to improve the simulation in the future and on what needs to be done to get it ready for production.


Chapter 2

Background And Related Works

Computer generated water in games and movies has always been a popular field of research since water occurs so naturally in many scenes and situations. Complex behavior and appearance in everything from small puddles to large windy oceans help to build an immersive experience for the viewer. Waves, splashes, refraction and reflections are all examples of important attributes that contribute to the illusion of water. These are all complex properties that, if not emulated in a convincing way, can quickly make the water look unrealistic. Because water surrounds us daily, we are very good at determining how a body of water should behave and look in different situations. Realistically simulating every aspect of water, however, is very computationally demanding.

Water simulation has been used to great extent in the movie industry. Movies like Titanic and Waterworld were groundbreaking with their rendition of ocean water at a high degree of realism [Tes04b] and the industry has since produced increasingly realistic computer generated water. The movie industry has the luxury of being able to simulate and render water off-line, performing the simulation and rendering of frames on clusters of computers over a period of time, much slower than the speed at which the movie is played back to the audience. This is why water in movies can look incredibly realistic, and it gives movies an enormous advantage in visual quality over games, which have to be able to render water in real-time. In fact, games usually incorporate other elements which share the same available performance, so the water animation needs to be much faster than real-time.

Another advantage of movie water is that the behavior of water is completely deterministic and can be customized to each event in a scene. While this is possible for games as well, the water animation cannot be precomputed if the player is able to interact with it in any way. To allow such interactivity, the water must be able to realistically simulate waves depending on the actions of the player. With the additional computational costs of an interactive simulation, it becomes increasingly difficult to produce a high quality water simulation within the time constraints that a game enforces.

Even if the water simulation is kept within the allotted time frame, it needs to be able to adjust the performance cost according to how much water is visible in a scene and in what detail. The process of determining at which fidelity to present elements to the player is referred to as Level Of Detail. Presumably, the player is also free to move around and look at things other than water, so the simulation should dynamically scale accordingly. Level Of Detail allows a highly detailed presentation of the water while also freeing up resources when possible.

2.1 Navier-Stokes Equations

For most fluid simulations, realistically calculating the flow of a fluid is a process of numerically solving the Navier-Stokes equations (2.1) and the volume conservation equation (2.2). These equations describe the motion of fluid substances and the physical relations between such attributes as viscosity, density and fluid pressure. While generally considered a good model of fluid dynamics, the equations are computationally costly to solve numerically. Therefore simplifications to the model are necessary for them to be usable in real-time applications. In a game, the purpose of a fluid simulation is often only to render visually convincing fluid animation. In order to reduce complexity, the physical correctness of the simulation can therefore be compromised without the player noticing [Sch07]. For many fluid simulations, the Navier-Stokes equations can be reduced to the Incompressible Navier-Stokes equations (2.3) and volume conservation equation (2.4).

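The following formulae describe the Navier-Stokes equations and the volume conservation equation:

$$\rho\left(\frac{\partial \mathbf{v}}{\partial t} + \mathbf{v}\cdot\nabla\mathbf{v}\right) = -\nabla p + \nabla\cdot\mathbf{T} + \mathbf{f} \tag{2.1}$$

$$\frac{\partial \rho}{\partial t} + \nabla\cdot(\rho\mathbf{v}) = 0 \tag{2.2}$$

Where $\mathbf{v}$ is the flow velocity, $\rho$ is the fluid density, $p$ is the pressure, $\mathbf{T}$ is the stress tensor and $\mathbf{f}$ represents body forces. $\nabla$ is the vector of all partial derivatives.

The following formulae describe the Incompressible Navier-Stokes equations and the volume conservation equation:

$$\rho\left(\frac{\partial \mathbf{v}}{\partial t} + \mathbf{v}\cdot\nabla\mathbf{v}\right) = -\nabla p + \mu\nabla^2\mathbf{v} + \mathbf{f} \tag{2.3}$$

$$\nabla\cdot\mathbf{v} = 0 \tag{2.4}$$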


Where $\mu$ is the constant viscosity term and $\nabla^2$ the vector Laplacian. These equations assume a constant, homogeneous density across the whole fluid body and replace the stress tensor term with the viscosity term. For the majority of water waves, viscosity is assumed to be zero and, as such, the viscosity term can be eliminated.
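As a worked step, setting the viscosity to zero in equation (2.3) removes the Laplacian term and leaves the inviscid form below. The symbols follow the definitions above, with $\rho$ denoting the density and $\mu$ the viscosity. The resulting form is not numbered in this report and is shown only to make the simplification explicit.

```latex
\mu = 0 \;\Longrightarrow\;
\rho\left(\frac{\partial \mathbf{v}}{\partial t}
      + \mathbf{v}\cdot\nabla\mathbf{v}\right)
  = -\nabla p + \mathbf{f},
\qquad
\nabla\cdot\mathbf{v} = 0
```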

2.2 Numerical Methods

To be able to numerically integrate the Navier-Stokes equations, the simulation must be able to model the movement of fluid. The two main categories of methods for modeling water in simulations are Eulerian Grids and Lagrangian Particles [DYQKEH10]. While Lagrangian methods are able to model water very realistically, Eulerian methods are popular in games because of their high performance benefits. However, Eulerian methods can easily become unstable if the duration of time between two simulated frames, the time step, is too great.

2.2.1 Lagrangian Particles

With Lagrangian Particles, the water is simulated as a system of discrete particles that interact with each other. Each particle represents a small body of water that collides with and is attracted to other particles depending on mass and speed. Because each particle of water is simulated explicitly, it is trivial to preserve mass and energy, which makes it a very stable model. Performance-wise for games, it is difficult to implement a real-time simulation solely based on Lagrangian Particles since a very large number of particles is required to render water of decent visual quality. The simulation can also easily become wasteful for sections of water that are not visible or not in motion. This model is used in the software application RealFlow [rea] to perform incredibly realistic off-line rendering of water.

2.2.2 Eulerian Grids

In this method, a uniform grid of cells is used as a fixed frame of reference. Water movement is simulated by keeping track of fluid properties such as velocity, density and pressure in each cell. Simulation is done by integrating values for each cell based on the time step, and large bodies of water can be simulated very efficiently depending on the resolution of the grid. When using Eulerian grids, the conservation of mass and energy must be given special attention since it is not handled implicitly as is the case with particle systems [Kal08].

The length of the time step is also an important factor in this model. The frame rate in games might vary and a large time step might lead to an overestimation of the values in a cell when integrating. In such a situation, multiple sequential integration errors can cause additional energy to be generated, which might result in the simulation exploding.
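The effect of the time step on energy can be illustrated with a minimal sketch. It assumes a single cell modeled as a harmonic oscillator integrated with explicit (forward) Euler, which is a stand-in chosen for illustration and not the integrator used in this thesis. The function name `simulate` and the value of `omega` are hypothetical.

```python
def simulate(dt, total_time, omega=10.0):
    """Integrate h'' = -omega^2 * h with explicit Euler; return the final energy.

    Assumption for illustration: one grid cell behaves like a harmonic
    oscillator with angular frequency omega.
    """
    height, velocity = 1.0, 0.0
    steps = round(total_time / dt)
    for _ in range(steps):
        accel = -omega * omega * height
        # Both updates use the values from the previous step (explicit Euler).
        height, velocity = height + dt * velocity, velocity + dt * accel
    return 0.5 * velocity ** 2 + 0.5 * (omega * height) ** 2


initial_energy = 0.5 * 10.0 ** 2  # energy of the start state (h = 1, v = 0)
energy_small_step = simulate(dt=1.0 / 600.0, total_time=1.0)
energy_large_step = simulate(dt=1.0 / 15.0, total_time=1.0)
print(initial_energy, energy_small_step, energy_large_step)
```

With this integrator each step multiplies the energy by (1 + omega^2 dt^2), so the same second of simulated time gains far more energy when it is covered in 15 large steps than in 600 small ones.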


The algorithm in this thesis uses an Eulerian Grid approach to simulate water surfaces. This numerical method was chosen for simplicity and performance.

2.3 Height Field Methods

Height field methods are a specialization of Eulerian grids in which a uniform, two-dimensional grid is used in conjunction with height-values to produce a topographic surface. It is assumed that simulating the motion of whole volumes of resting water is unnecessary since such motion remains largely invisible to the viewer [Lov02]. Instead, height-field methods simplify water simulation by only modeling the surface of the water as a height-function of spatial surface coordinates. By using 2D-grids to simulate 3D water volumes, the complexity of the simulation is effectively reduced by one dimension. Using this technique in games, high quality water animation can be achieved at low performance costs. The height-field constraint does, however, limit the simulation by not allowing breaking waves and spray since the water cannot have more than one height-level for a given surface coordinate.

2.3.1 Shallow Water Equations

A height-field model commonly used in oceanic modeling is the Shallow Water Equations, which assume that the length of the water waves is significantly larger than the mean water depth. The Shallow Water Equations are:

$$\frac{d\mathbf{v}}{dt} + g\nabla h + (\mathbf{v}\cdot\nabla)\mathbf{v} = 0 \tag{2.5}$$

$$\frac{dh}{dt} + (h + b)\,\nabla\cdot\mathbf{v} = 0 \tag{2.6}$$

Where v is the horizontal flow velocity, g is the acceleration of gravity, h the water height from the mean surface level and b the water depth from that level. These equations are derived from the incompressible Navier-Stokes equations (2.3), assuming zero viscosity, but ignore the flow perpendicular to the water surface [DYQKEH10]. Only horizontal flow is taken into account and, as such, rivers and water currents can be simulated. Because of the wave length assumption, however, the model is only suitable for simulating the movement of water on a macro-scale, such as tidal waves. Smaller waves will break that assumption and will be simulated incorrectly.

2.3.2 Linear Wave Theory

Linear Wave Theory is a different height-field model that assumes that the surface displacement is relatively insignificant compared to the mean water depth. It also assumes incompressibility and zero viscosity but, contrary to the Shallow Water Equations, horizontal flow is not modeled. It is therefore unable to simulate water moving across a surface, but it makes no assumption regarding the wave length in relation to water depth. This makes the model suitable for simulating a large range of resting water volumes, from oceans to swimming pools. Linear Wave Theory gives the following formula for the wave speed, c:

$$c = \sqrt{\frac{g}{k}\tanh(kh)} \tag{2.7}$$

Where $g$ is the acceleration by gravity, $k$ the angular wave number and $h$ the water depth. For shallow and deep water respectively, the formula can be split into equations 2.8 and 2.9 depending on the wave length $\lambda$.

$$c = \sqrt{gh}, \quad h \ll \lambda \tag{2.8}$$

$$c = \sqrt{\frac{g}{k}}, \quad h \gg \lambda \tag{2.9}$$
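The relation between equations 2.7, 2.8 and 2.9 can be checked numerically with a short sketch. The function names and the value used for gravity are assumptions made for this illustration.

```python
import math

GRAVITY = 9.81  # acceleration by gravity in m/s^2 (assumed value)


def phase_speed(wavelength, depth):
    """Wave speed from equation 2.7 for a given wave length and water depth."""
    k = 2.0 * math.pi / wavelength  # angular wave number
    return math.sqrt(GRAVITY / k * math.tanh(k * depth))


def shallow_speed(depth):
    """Shallow water limit, equation 2.8 (depth much smaller than wave length)."""
    return math.sqrt(GRAVITY * depth)


def deep_speed(wavelength):
    """Deep water limit, equation 2.9 (depth much larger than wave length)."""
    k = 2.0 * math.pi / wavelength
    return math.sqrt(GRAVITY / k)


# A 100 m wave over 0.5 m of water behaves like the shallow limit,
# while a 1 m wave over 10 m of water behaves like the deep limit.
print(phase_speed(100.0, 0.5), shallow_speed(0.5))
print(phase_speed(1.0, 10.0), deep_speed(1.0))
```

In the deep water case the speed depends on the wave length, so longer waves outrun shorter ones. In the shallow water case the speed depends only on the depth.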

Dispersion And Kelvin Wakes

An important property captured by Linear Wave Theory is the relation between wave speed, wave length and water depth. Unlike sound waves, water waves propagate at different speeds depending on the wave length and become slower as the wave frequency increases [Lov02]. This property is known as dispersion and becomes significant when the wave length is significantly smaller than the mean water depth, see equation 2.9. For simulating realistic ocean waves, this property is essential, and it is responsible for the characteristic V-shapes of waves in the wakes of moving ships. These are referred to as Kelvin Wakes.

2.3.3 Wave Equation

The Wave Equation is similar to the equations derived from Linear Wave Theory but assumes the wave speed is constant for all wave lengths:

$$\frac{\partial^2 u}{\partial t^2} = c^2\nabla^2 u \tag{2.10}$$

Where u is the water height and c is the constant wave speed. Without the dispersive property, a group of waves composed of multiple frequencies propagates in a medium with no deformation. This is suitable for modeling light and sound waves but does not handle water waves realistically [Lov02]. Despite being a technically incorrect simulation for water, the Wave Equation has been widely used in games for interactive water because of its low demand on resources. A visual comparison between Linear Wave Theory and the Wave Equation can be found in figure 2.1.
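A common way to apply the Wave Equation to a height-field is an explicit central-difference update. The following is a minimal one-dimensional sketch under that assumption. It is not the algorithm of this thesis, and the function name, the grid size and the Courant number (c multiplied by the time step and divided by the cell size) are chosen for illustration only.

```python
import math


def step_wave(prev, curr, courant):
    """Advance a 1D height-field one time step using equation 2.10.

    prev and curr are the heights at the two most recent time steps.
    courant = c * dt / dx must not exceed 1 for this scheme to stay stable.
    """
    n = len(curr)
    nxt = [0.0] * n  # the two boundary cells are held at zero height
    c2 = courant * courant
    for i in range(1, n - 1):
        laplacian = curr[i - 1] - 2.0 * curr[i] + curr[i + 1]
        nxt[i] = 2.0 * curr[i] - prev[i] + c2 * laplacian
    return nxt


# Start from a smooth bump at rest in the middle of the grid.
n = 65
curr = [math.exp(-((i - 32) / 3.0) ** 2) for i in range(n)]
prev = list(curr)
for _ in range(40):
    prev, curr = curr, step_wave(prev, curr, courant=0.5)
```

After the loop the bump has split into two smaller pulses that travel outwards at the same speed regardless of their frequency content, which is the non-dispersive behavior described above.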


Figure 2.1: A comparison between simulations using Linear Wave Theory (left image) and the Wave Equation (right image). Waves are generated by a point-shaped object moving from left to right at constant speed. Darker values represent negative wave amplitudes while brighter values represent positive. Image courtesy of Ottosson [Ott11].

2.4 Related Works

Much research has been focused on simulating water realistically, both in engineering for physics simulation purposes and for visual presentation in movies and games. While a full review of the field of water simulation is out of scope for this report, this section brings up a few novel methods presented in other articles. For a good historical perspective on the research done within fluid simulation, see Computer Graphics For Water Modeling And Rendering: A Survey [Igl04].

2.4.1 Gerstner Waves

In 1986, Fournier and Reeves presented what is possibly the first application of Gerstner Waves in the computer graphics field [FR86]. Gerstner Waves approximate the movement of points along a water surface as sinusoidal functions of the amplitude, direction and wavelength of traveling waves [Tes04b]. This results in surface points moving in elliptical motions when affected by waves and gives the surface a choppy look with sharp tops and flattened valleys. Since the motions are dependent on wave length, the dispersion property is also modeled along with easy detection of breaking waves or spray. Gerstner Waves also model wave refraction along shores, which means that the elliptical motions become smaller over shallow depths. This gives ocean waves the characteristic behavior where incoming wavefronts align with the shoreline.
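The motion of a single surface point under one Gerstner Wave can be sketched as follows. The sketch assumes deep water, where the orbit is a circle (a special case of the elliptical motion described above) and the angular frequency follows from equation 2.9. The function name and parameter values are hypothetical.

```python
import math

GRAVITY = 9.81  # acceleration by gravity in m/s^2 (assumed value)


def gerstner_point(x0, t, amplitude, wavelength):
    """Position of the surface point that rests at x0, at time t.

    Assumes a single wave traveling along the x axis in deep water.
    """
    k = 2.0 * math.pi / wavelength
    omega = math.sqrt(GRAVITY * k)  # deep water: c = omega / k = sqrt(g / k)
    phase = k * x0 - omega * t
    x = x0 - amplitude * math.sin(phase)  # horizontal displacement
    y = amplitude * math.cos(phase)       # vertical displacement
    return x, y


# The point stays on a circle of radius `amplitude` around its rest position.
for t in (0.0, 0.3, 1.7):
    x, y = gerstner_point(2.0, t, amplitude=0.5, wavelength=10.0)
    print(t, x, y)
```

The horizontal displacement pulls points towards the crests, which is what produces the sharp tops and flattened valleys. When the amplitude multiplied by the wave number exceeds 1, the surface starts to loop over itself.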


2.4.2 Fast Fourier Transforms

A common technique for height-field methods that employ uniform grids is to propagate waves using the Fourier domain. Ordinarily, a surface is represented by discrete spatial height-values, one for each cell. Assuming the surface is the result of superposed sinusoidal waves with different amplitudes and phases, the height-field can be transformed into the Fourier domain, or frequency space. In this domain the surface is instead represented by complex values holding the amplitude and phase for all present wave frequencies, sampled at the same resolution as the original height-field. Performing wave propagation in the Fourier domain is done at low performance cost simply by modifying the phase of each frequency. Insomniac Games uses this method to simulate dispersive waves [Day09], which was implemented in the game Resistance 2.

With the invention of the discrete Fast Fourier Transform, which is a highly optimized algorithm for transforming a sampled signal into the Fourier domain, Fourier based methods became viable for real-time applications. Jensen and Golias [JG01] present an implementation of this method for use in games.

Since the sine waves described in the Fourier domain are periodic, this method is suitable for rendering large oceans, where the waves for a surface patch can be calculated and then tiled over the whole surface. Using this method for interactive water can prove difficult, however, since dynamic local distortions are not periodic in nature.
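Propagation by phase modification can be sketched in one dimension. The sketch below uses a naive discrete Fourier transform from the standard library instead of a Fast Fourier Transform to stay self-contained, and it assumes that every wave component travels in the positive x direction. The function names are hypothetical, and the sketch is not the implementation of any of the cited works.

```python
import cmath
import math

GRAVITY = 9.81  # acceleration by gravity in m/s^2 (assumed value)


def dft(values):
    n = len(values)
    return [sum(values[j] * cmath.exp(-2j * math.pi * m * j / n) for j in range(n))
            for m in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[m] * cmath.exp(2j * math.pi * m * j / n) for m in range(n)) / n
            for j in range(n)]


def propagate(heights, dx, depth, dt):
    """Move a periodic 1D height-field forward by dt in the Fourier domain."""
    n = len(heights)
    spectrum = dft(heights)
    for m in range(n):
        signed = m if m <= n // 2 else m - n       # signed frequency index
        k = 2.0 * math.pi * abs(signed) / (n * dx)  # angular wave number
        omega = math.sqrt(GRAVITY * k * math.tanh(k * depth))  # from eq. 2.7
        direction = 1.0 if signed >= 0 else -1.0
        # Only the phase changes; the amplitude of each frequency is kept.
        spectrum[m] *= cmath.exp(-1j * direction * omega * dt)
    return [value.real for value in idft(spectrum)]


n, dx = 32, 1.0
start = [math.cos(2.0 * math.pi * 2 * j / n) for j in range(n)]
moved = propagate(start, dx, depth=10.0, dt=0.7)
```

Because the angular frequency is computed per wave number, each component moves at its own speed, so dispersion follows directly from the phase update. Since the transform is periodic, the field wraps around at the borders, which is the tiling property mentioned above.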

2.4.3 Semi-Lagrangian Method

A problem with Eulerian grid methods is that in each simulated step, data can only be transferred between immediate neighbors on the grid. When simulating water flow of high velocity, where mass might be transferred over more than one grid cell per step, the ordinary grid model is insufficient. A Semi-Lagrangian method also uses a uniform grid of cells, but calculates displaced mass, or advection, via tracer particles. For every advection step, a temporary tracer particle is simulated for each cell. The particle travels backwards along the flow vector field of the surface, and mass is subtracted from the calculated origin and added to the current cell. Kallin [Kal08] uses a modified version of the Shallow Water Equations together with a Semi-Lagrangian grid to simulate rivers flowing over arbitrary terrain.

All grid methods, including the Semi-Lagrangian, suffer from dissipation when performing advection, that is, loss of data when adding mass to non-cell-centered locations. Kim et al. [KLLR07] present an error correcting algorithm for compensating the data loss but such an algorithm might significantly lower performance.
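The backward trace and the dissipation it causes can be illustrated with a one-dimensional sketch. It uses the common textbook variant in which each cell samples the field at its traced origin with linear interpolation. This is an assumption made for illustration and not necessarily the exact scheme of Kallin [Kal08]. The function name is hypothetical.

```python
import math


def advect(field, velocity, dt, dx):
    """Semi-Lagrangian advection of a 1D field.

    For every cell a tracer is followed backwards along the velocity, and the
    field is sampled at that origin using linear interpolation.
    """
    n = len(field)
    result = [0.0] * n
    for i in range(n):
        origin = i - velocity[i] * dt / dx         # traced position in cell units
        origin = min(max(origin, 0.0), n - 1.0)    # clamp to the grid
        i0 = int(math.floor(origin))
        i1 = min(i0 + 1, n - 1)
        t = origin - i0
        result[i] = (1.0 - t) * field[i0] + t * field[i1]
    return result


pulse = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
# Moving three cells in one step is possible, unlike neighbor-only transfer.
shifted = advect(pulse, [3.0] * 8, dt=1.0, dx=1.0)
# Moving half a cell lands between cell centers and smears the pulse.
smeared = advect(pulse, [0.5] * 8, dt=1.0, dx=1.0)
```

In the second call the peak value drops because the traced origin falls between two cell centers, which is the dissipation described above.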

2.4.4 Wave Particles

A novel approach to the Lagrangian method, called Wave Particles, is proposed by Yuksel [YHK07]. This can be thought of as a sparse system of Lagrangian Particles where particles are simulated over a static height-field. It differs from other Lagrangian Particle systems by only simulating particles where there are surface waves. Since particles are only created when disturbing the water, a calm body of water means that no particles need to be simulated. Yuksel uses the Wave Equation for propagating Wave Particles that are splatted onto the height-field before rendering. This can be performed on top of an existing static height-field water simulation to add interactivity.

2.4.5 Detailed Flow

Even if there is not enough performance available to include a physics based water simulation in a game, water animation can still be achieved by other means. Early games, for example, used scrolling textures on top of static meshes to give the impression of flowing water. Vlachos [Vla10] describes a similar technique used in Portal 2 and Left 4 Dead 2 for visualizing currents in water by using a precomputed flow vector map. This method is based on work by Max and Becker [MB96] and uses image advection to distort the normal map of the water surface with great results. With this technique, static objects that intersect the water surface can be mirrored in the flow map, emulating flow around such objects.

2.4.6 Choppy Waves

Waves simulated using sinusoidal functions have a tendency to look very round and smooth. Large, or steep, ocean waves created during windy conditions will not look like this, but appear sharper and choppier. Both Tessendorf [Tes04b] and Jensen and Golias [JG01] suggest that the sinusoidal shapes of the waves can be altered to achieve such a choppy look. This is done, after the simulation step, by displacing the height-field vertices near steep waves before rendering the surface. For very steep waves, displaced vertices will start to overlap, causing inverted waves. This can be used for creating breaking waves and foam effects where overlaps occur.


Chapter 3

Architectures

With several competing gaming systems on the market, high quality games are often produced for multiple platforms to reach as big an audience as possible. For this reason the Frostbite 2 engine is designed to run on at least PC, Microsoft Xbox 360 and Sony PlayStation 3. Any water simulation in Frostbite 2 should therefore be capable of running on all those platforms at full speed, with comparable visual results. This chapter gives an overview of the architectures targeted by Frostbite 2 in general and by this thesis specifically. The primary positive and negative aspects of each platform will be discussed, along with a more detailed description of the PlayStation 3 as a reference for upcoming chapters.

3.1 PC

The PC is the most versatile of all platforms and at the same time the one that puts the largest demands on the hardware compatibility of a published game. The game has to be compatible with a wide range of different hardware solutions and software. Frostbite 2 solely supports the Microsoft Windows operating system since most PC players use this OS for gaming. While PCs today are available with both 32-bit and 64-bit architectures, consoles still only allow 32-bit applications. Because of this, the 32-bit architecture is often preferred on PCs for multi-platform games.

3.1.1 Hardware

Since the PC is not a closed platform like consoles, there are many companies that manufacture different systems, and consumers are often free to arrange and replace parts of their PC's hardware in any way they desire. This means that a PC game must be compatible with as many hardware configurations as possible. Usually a minimum requirement on processing power and memory is set to reduce the range of supported hardware. However, even with a fairly narrow range, there is still room for vendor-specific differences regarding graphics cards, sound cards, input devices and so on. Frostbite 2 uses the Microsoft DirectX library as a platform interface, which relieves the developer of many of these problems by abstracting the hardware.

Memory

Memory on a PC is often available in abundance compared to current generation consoles, and even if there is not enough free physical memory, a hard drive can be used, at the cost of latency, as a swap space. It is not uncommon for computer games today to use 1-2GB of memory and with 64-bit operating systems, the physical memory available on a system will most likely not be a limiting factor. Depending on the CPU, a PC often has a variety of techniques and hardware for automatically reducing the issues inherent with memory management. For example, when fetching memory that needs to be loaded into the cache, a CPU with Out-Of-Order Execution can execute independent instructions ahead of time. This reduces stalls that would otherwise have been the result of waiting for cache operations. The downside of such techniques is that the developer might be unaware of the performance impact certain code might have on other platforms.

3.1.2 Performance

With the variety of hardware available, performance on the PC platform differs greatly between consumers. For consumers who do not own the latest in PC gaming technology, it is important to have a minimum hardware requirement as low as possible. At the same time, the PC market is highly focused on the audiovisual quality of a game, so for consumers with more processing power and memory available, a game should try to take advantage of that performance. This puts high demands on customizability that does not exist on consoles. The user should be able to control the quality of the experience by adjusting settings for graphics and audio. Perhaps most importantly, there should be options available for configuring keyboard and mouse.

3.2 Xbox 360

With the Microsoft Xbox 360 controlling a substantial share of the console market, it is naturally one of the main platforms for the Frostbite 2 engine. The Xbox 360 hardware layout is largely similar to a PC which generally makes code developed for Windows and DirectX easy to port. The biggest difference compared to a PC is that, being a console, all systems for all consumers are alike performance-wise. The immediate benefit of this is of course that a game that works on one system will work on all, which removes a lot of compatibility issues. Working with the same hardware over a long period of time also tends to bring out more efficient ways of utilizing the hardware during the lifespan of the console.

3.2.1 Hardware

The Xbox 360 is based on a 64-bit Power-PC architecture with 3 general purpose in-order CPU cores (PPUs), 512MB system memory and a powerful GPU. Each core is running at 3.2GHz and has support for two simultaneous hardware threads (hyper threading) with dual pipelines and duplicated register sets. The result of this is that the CPU appears as if it has 6 individual cores from the point of view of the operating system. Along with 64-bit integer and float arithmetic units, all cores are fitted with a 128-bit SIMD vector unit which allows the CPU to perform multiple arithmetic operations per cycle. Together with a competent GPU, with unified shader processors, running at 500MHz, the Xbox 360 has a lot of performance to offer. Since it is so similar to a PC, experienced programmers can quickly start developing, which makes it a popular platform.

Memory

One of the key differences compared to a standard PC is that the 512MB system memory is shared by both the CPU and the GPU at the same time. On a PC, the GPU usually has a separate memory which means that data needs to be copied from main memory to GPU memory for rendering. Since the main memory is equally accessible for both processing units on the Xbox 360, copying such data can be avoided to increase performance. Another difference is the amount of memory available. Having a system memory of only 512MB means that a lot of data has to reside on the optical storage. While some systems have access to a hard drive, the game still needs to be designed for a console without one. This means that data constantly needs to be streamed from the DVD and replaced if currently not needed. Reading data from the DVD is very slow compared to a hard drive which puts large demands on streaming techniques.

3.2.2 Performance

While the 6 virtual cores can theoretically yield double the performance compared to a single thread per physical core, this is often not the case. They will help the developer to utilize more of the dual pipeline structure but can also lead to unpredictable results. Since 2 threads executing on the same core cannot use the same core component, for example the vector unit, at the same time, getting good performance from hyper threading relies on the threads using different parts of the processor. If the developer does not have full control of which threads execute on each core, two computationally intensive threads might run on the same core which would generate unnecessary stalls from waiting on processor components. Two threads running on one core will also share the same L1 cache, which can lead to unpredictable cache misses. While it is easy to get code running on this machine, the developers need to be proficient at optimizing to get the most out of the performance.

3.3 PlayStation 3

The third architecture that Frostbite 2 supports is Sony's PlayStation 3. It is the main console competitor to the Xbox 360 in terms of graphics and since it uses a heterogeneous CPU architecture it is a unique system on the console market today. The hardware is highly shaped around parallel computing and is therefore also by far the most difficult of the platforms mentioned in this thesis to develop for. Since single-threaded code alone would leave most of the processing resources completely unused, games on the PlayStation 3 have to do as many computations as possible in parallel in order to compete with other gaming platforms. This makes the system a good benchmark for testing parallel code designs and is the reason it was chosen as the target architecture for this thesis. If a simulation runs well on the PlayStation 3 it is highly likely to run well on any architecture that can use multiple processing units.

A more thorough description of the PlayStation 3 can be found in A Rough Guide To Scientific Computing On The PlayStation 3 [BLK+07].

Figure 3.1: Simplified schematic over the architecture of the PlayStation 3. Shown in detail is the layout of the CPU (CELL) and its connections to memory, input/output devices and GPU (RSX).

3.3.1 Hardware

Not counting the GPU (RSX), the PlayStation 3 has two different types of processing cores, 9 in total, and separate memory access controllers. The architecture is 64-bit but only allows 32-bit applications due to operating system constraints. It has 512MB physical memory available, but unlike the Xbox 360, the main memory is physically split between the CPU and GPU with 256MB each. Because of all its processing units the PlayStation 3 is the console with the highest theoretical computing power. However, since most of the processors require customized code, it is often difficult to utilize it to the full extent in games. In addition, the GPU is in most cases inferior to the one in Xbox 360. For example, the RSX does not have a unified shader architecture.

Cell Broadband Engine

The CPU of the PlayStation 3 is a Cell Broadband Engine (CELL), see figure 3.1, a system-on-chip developed jointly by Sony, Toshiba and IBM and designed for parallel computation-heavy tasks. The CELL is a 64-bit architecture which consists of one general purpose processor (PPE) and 8 secondary processors specialized for vector math operations (SPE) connected via the Element Interconnect Bus (EIB). The secondary processors have access to the main memory through memory controllers but do not have the ability to read and write directly to it. Instead they need to copy the data to a local storage before computation and then copy the resulting data back to the main memory. Because of these limitations, a common technique is to have the PPE run a main thread that controls and schedules batches of tasks executed on the SPEs.

PPE

The PPE is a 64-bit processor that supports the Power-PC instruction set along with VMX, which is the Power-PC SIMD extension for performing 128-bit vector operations. It operates at 3.2GHz with a 32KB L1 cache and supports hyper threading with separate registers for each hardware thread. With hyper threading, the performance per thread might be lower than if only a single thread is used, but the combined performance of the threads is often higher. The PPE is very similar to the processing cores of the Xbox 360, with the same SIMD capabilities and supporting the same instruction set. It is however the only processor on the Cell chip capable of executing Power-PC instructions. This means that for ordinary code, compiled for Power-PC, the PlayStation 3 has only about a third of the processing power of the Xbox 360.

SPE

What the Cell processor lacks in general processing power it makes up for with the so called Synergistic Processing Elements (SPEs). While the CELL contains 8 SPEs, only 6 of them are available for use since one is exclusively assigned to the operating system and one is a failsafe backup. The processing unit of each SPE is called an SPU. An SPU is a single-core in-order processor optimized for running computationally intensive code at high speeds. Similar to the PPE, it has a clock speed of 3.2GHz but with a limited instruction set customized for SIMD operations only. It is equipped with a dual pipeline to be able to issue two instructions each cycle. Each execution unit in the SPU is assigned to one of those pipelines which means that instructions of the same type are always scheduled to the same pipeline. To fully utilize both pipelines and thus maximize performance, the programmer needs to organize instructions according to which pipeline they use.

Local Store

The SPEs do not use a conventional cache. In order to still use cache-like functionality, each SPE contains a single high speed memory called the Local Store (LS). The LS is 256KB in size and is similar to an L1 cache with the exception that the programmer must load it manually from main memory via the Memory Flow Controller (MFC). The LS must be able to hold both the instructions for the currently executing program and the working data which puts high demands on memory management from the developer. A positive aspect of the Local Store is that all unpredictability of a normal cache is eliminated in favor of loading memory explicitly.

MFC

The MFC of each SPE handles all the data transfers between the Local Store and the system's main memory via DMA instructions. The latency of a DMA operation is comparable to that of transferring memory to the cache in an ordinary CPU, and the MFC is optimized for data transfers in multiples of 16B or 128B. To avoid stalls produced by reading from and writing to main memory, DMA instructions are handled asynchronously and let the SPU query the state of ongoing memory transfers via different channels. Since the DMA operations are asynchronous, it is a common technique to double-buffer data fetches by loading a segment of memory while working on another. By shuffling memory this way, the SPE can work on a larger total set of data than can fit into the Local Store.
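The double-buffering pattern can be outlined as follows. This is a minimal sketch, not code from the implementation: the names fetch and process_double_buffered are hypothetical, and fetch is an ordinary synchronous copy standing in for an asynchronous DMA transfer, so only the ordering of the operations is illustrated.

```python
def fetch(main_memory, start, size):
    """Hypothetical stand-in for a DMA transfer into the Local Store."""
    return main_memory[start:start + size]


def process_double_buffered(main_memory, chunk_size, kernel):
    """Process main_memory chunk by chunk using two alternating buffers."""
    results = []
    buffers = [None, None]
    # Prime the pipeline: request the first chunk before any work is done.
    buffers[0] = fetch(main_memory, 0, chunk_size)
    current = 0
    offset = 0
    while offset < len(main_memory):
        next_offset = offset + chunk_size
        # Request the next chunk into the idle buffer...
        if next_offset < len(main_memory):
            buffers[1 - current] = fetch(main_memory, next_offset, chunk_size)
        # ...and work on the chunk that has already arrived.
        results.extend(kernel(value) for value in buffers[current])
        current = 1 - current
        offset = next_offset
    return results
```

Only two chunks are resident at any time, so the total data set may be far larger than the space reserved for the buffers.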

3.3.2 Performance

As mentioned before, when developing games on the PlayStation 3, programmers should use the SPEs as much as possible. Because the SPEs only have vector registers, ordinary scalar operands need to be moved into vector registers before processing. The results are then written back to memory using several shuffle instructions. To avoid unnecessary instructions, programmers should rewrite their code to only use vector operations. Code that efficiently uses vector instructions can be more than 4 times as fast as ordinary code. Since the SPEs are so good at processing large amounts of vector data, Frostbite 2 also uses them to do vertex operations that would otherwise be done on the GPU.

The algorithm in this thesis is able to make use of an arbitrary number of processors. This enables the PlayStation 3 to easily out-perform the Xbox 360 if only considering the water simulation.

Chapter 4

Parallelization And Optimization

These days parallelism is an important aspect of any real time system. Processors are getting more and more cores added to them in order to increase performance, and to fully utilize modern computers it is increasingly vital to be able to run simulations on multiple cores simultaneously.

One of the largest obstacles when designing code for parallel computation is the manner in which data is accessed in memory. A parallel simulation is split up into several jobs, or threads, each working on either separate tasks or partial data for the same task. Often, performance does not scale linearly with the number of processing cores compared to the single-threaded simulation. Stalls arise from different tasks that require access to the same data, or from dependencies on previously executed tasks. By structuring the memory layout and the data flow of the simulation, these stalls can be greatly reduced. This chapter introduces some of the parallelization techniques referred to in this thesis along with a few general methods for optimizing jobs on vector processors.

4.1 Parallel Methods

According to Flynn's Taxonomy [Fly72], there exist at least two major parallel architectures for distributing work load over multiple processing cores: Multiple Instruction Multiple Data (MIMD) and Single Instruction Multiple Data (SIMD). While these classifications refer to hardware configurations, they can also be used to describe parallel design patterns. MIMD executes different instructions on multiple streams of data and will be referred to as Task Parallelism. SIMD, on the other hand, executes the same instruction, or function, on different parts of the same data stream. This is referred to as Data Parallelism.

4.1.1 Task Parallelism

Task Parallelism is often found in multi-threaded operating systems where parallelism is achieved by having multiple separate processes executing simultaneously. For this reason, it is also called Process Parallelism. For a simulation that consists of different tasks, or steps, code designed with this method lets each processing core take care of a whole step of the simulation for every frame. This is preferable if a task can operate on a section of data, independently of other tasks. If a task is dependent on one or several other tasks during the same simulated frame, a pipelining structure needs to be implemented to achieve parallelism over all cores. A good example of this is a multi-threaded game engine where rendering for a frame is done at the same time as the simulation for the next frame.

While being an intuitive way of parallelizing code, this method does not scale well with additional processing cores since the level of parallelism is limited to the number of tasks that can be run in parallel. When adding additional computational cores, it might be difficult to create more tasks for these processors. Furthermore, if the tasks for one frame are dependent on both tasks from the same frame and tasks from a previous frame, it might not be possible to use pipelining.

4.1.2 Data Parallelism

Data Parallelism is a good alternative to Task Parallelism for simulations that need to take advantage of an arbitrary number of processing cores. Instead of splitting the work load over the different tasks of a simulation, code can be designed in a way that lets multiple processors work on the same task. If a task consists of performing the same operation on a large set of data, each processed part of that data can be viewed as a partial result of the complete task. Full parallelism can be achieved by letting all threads each calculate a partial result. This means that the number of processors that can work on a single task is only limited by the amount of data processed.

By parallelizing with focus on data rather than individual tasks, dependencies between other tasks are trivial to resolve since all jobs in a task can be synchronized to the completion of the previous task. To ensure that the simulation time scales well with both the number of processors and size of the data, data should be distributed over jobs in a way that creates good load balancing.
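As an illustration of this pattern, the sketch below splits one task into evenly sized jobs and waits for all of them before returning. The function names split_evenly and run_task are hypothetical, and Python threads are used only to show the structure; they are not expected to give the speedup that SPE jobs would.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(count, jobs):
    """Return (start, end) ranges whose sizes differ by at most one element."""
    base, extra = divmod(count, jobs)
    ranges, start = [], 0
    for j in range(jobs):
        end = start + base + (1 if j < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def run_task(data, operation, jobs):
    """Apply the same operation to every element, one job per data range."""
    def job(bounds):
        start, end = bounds
        return [operation(value) for value in data[start:end]]

    with ThreadPoolExecutor(max_workers=jobs) as pool:
        partial_results = list(pool.map(job, split_evenly(len(data), jobs)))
    # Leaving the 'with' block waits for every job, so a following task
    # can safely depend on the complete result of this one.
    return [value for part in partial_results for value in part]
```

Because every job receives nearly the same number of elements, no single job dominates the time of the task as long as the cost per element is uniform.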

When dealing with simulations that have a lot of dependencies within the same task, parallelization will be harder to achieve. A solution to this might be to redesign the simulation with smaller tasks to only have task-to-task dependencies. [HS86] is a good resource on how to redesign algorithms with focus on data parallelization.

4.2 Data Oriented Design

The available performance of a system is dominated by the speed at which processors are able to execute instructions and the rate at which data can be read from memory. The rate of performance improvements in processors has, however, by far exceeded that in memory over the years [Car02]. Because of this, the processor-memory gap has grown larger and memory access speed has become the biggest bottleneck in performance today. To be able to fully utilize the speed of the CPU, developers should design code with memory efficiency in mind, so called Data Oriented Design (DOD).

4.2.1 Object Oriented Programming

Object Oriented Programming (OOP), which is used widely in software development, groups data by the objects to which it logically belongs. This often makes heavy use of classes and inheritance which might lead to class explosions and large executables [Fre11]. The purpose of the programming constructs associated with OOP is to make the code structured and easily manageable by abstracting and isolating data. They do not focus on organizing memory efficiently.

4.2.2 Cache Inefficiency of OOP

The main focus of DOD is to minimize the number of data transfers needed between the main memory and the cache. Using OOP features like encapsulation and polymorphism often means extended use of virtual function calls. Each virtual function call needs to do a virtual table look-up before knowing which function actually implements the called function [DH96]. Performing these look-ups means that the tables need to be fetched into the cache before the instructions of the implementing function can be loaded. If a large collection of encapsulated objects is being iterated over, like calling the update method on objects in a game, many cache misses are generated from the virtual table look-ups. Since the cache does all memory fetches in blocks the size of a cache line, even for single values, much of the memory loaded into the cache becomes wasted.

4.2.3 Structure Of Arrays

To maximize the use of every memory block loaded into the cache when iterating over objects, the data associated with each object needs to be organized in groups according to how it is used rather than which object it belongs to [Col10]. Consider an iteration over a collection of objects, each with many different attributes, where each iteration only reads a specific attribute. If each logical object is represented by a contiguous block of memory, the cache needs to be updated every time the specific attribute of each object is accessed. If, however, the data for all objects is stored as one contiguous memory block per attribute, the cache can contain the specific attribute for several objects at the same time. This is the difference between an array of game objects, see figure 4.1, compared to a single structure with arrays for all object attributes, see figure 4.2. By designing the code around collections of objects instead of the individual objects, the cache can be utilized optimally.

Figure 4.1: Example of OOP organization of memory. The layout of 7 objects with 4 attributes each is shown. Each object is stored sequentially, left to right.

Figure 4.2: Example of DOD organization of memory. The layout of 7 objects with 4 attributes each is shown. Each attribute of all objects is stored sequentially, left to right.
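The effect of the two layouts on the cache can be estimated with a small model that counts how many cache lines are touched when one attribute is read for every object. This is an illustrative sketch only. The function names, the object count, the attribute count and the 128-byte line size are assumptions chosen for the example and are not measurements from the engine.

```python
def cache_lines_touched(addresses, line_size):
    """Count the distinct cache lines covered by a list of byte addresses."""
    return len({address // line_size for address in addresses})


def aos_addresses(num_objects, num_attributes, attribute, value_size):
    """Addresses of one attribute when each object is stored contiguously."""
    stride = num_attributes * value_size
    return [i * stride + attribute * value_size for i in range(num_objects)]


def soa_addresses(num_objects, attribute, value_size):
    """Addresses of one attribute when each attribute has its own array."""
    base = attribute * num_objects * value_size
    return [base + i * value_size for i in range(num_objects)]


# Assumed example: 1024 objects with 32 four-byte attributes, 128-byte lines.
aos_lines = cache_lines_touched(aos_addresses(1024, 32, 3, 4), 128)
soa_lines = cache_lines_touched(soa_addresses(1024, 3, 4), 128)
```

Under these assumptions the object-by-object layout touches one cache line per object (1024 lines), while the attribute-by-attribute layout touches 32 lines, since each loaded line holds the attribute for 32 objects.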

4.3 Optimization Details

Often, most of the execution time in a simulation is spent on iterating over small sections of code, such as inner loops. It is therefore important to optimize those sections as much as possible for real-time applications. Presented here are some of the most important aspects of code optimization.

4.3.1 SIMD Vectorization

The SIMD architecture can also refer to that of a vector processor, which can perform the same operation on multiple scalar values. To fully utilize the performance that a SIMD processor core can deliver, data should be organized in a way that allows operations to be done in vector format. By also organizing data as Structures Of Arrays, great speedups can be achieved.

For example, a 32-bit dot product of two 3D vectors computed with scalar arithmetic requires 3 multiplications and 2 additions. By storing the 3D components of 4 vectors as 3 128-bit SIMD operands, 4 dot products can be performed with 3 vector multiply-and-add instructions, which is more than 6 times faster [Col11].
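The arithmetic can be sketched by modelling a 128-bit register as a list of four lanes. The helper vmadd is a hypothetical stand-in for a hardware multiply-and-add instruction, and the argument names are chosen for this example; each argument holds one component (x, y or z) of four different vectors.

```python
def vmadd(a, b, c):
    """Lane-wise multiply-and-add on 4-lane vectors: a * b + c."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def dot4(ax, ay, az, bx, by, bz):
    """Compute four 3D dot products with three multiply-and-add operations."""
    acc = vmadd(ax, bx, [0, 0, 0, 0])
    acc = vmadd(ay, by, acc)
    return vmadd(az, bz, acc)


# Four vector pairs stored component-wise (Structure Of Arrays):
# a = (1,2,3), (4,5,6), (7,8,9), (1,0,-1) and b = (1,1,1), (2,0,1), (0,1,0), (3,3,3)
result = dot4([1, 4, 7, 1], [2, 5, 8, 0], [3, 6, 9, -1],
              [1, 2, 0, 3], [1, 0, 1, 3], [1, 1, 0, 3])
```

The same four dot products computed one at a time would need 12 multiplications and 8 additions, which is where the stated speedup comes from.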

The value of vectorizing code becomes even greater on SPEs, since the SPU has no scalar arithmetic unit. This means that ordinary scalar operations are performed using the vector unit instead. Such an operation has a lot of costly overhead since the scalar operands need to be shifted inside the vector register before the operation and then shifted back before writing. See Preferred Slot in A Rough Guide To Scientific Computing On The PlayStation 3 [BLK+07].

4.3.2 Software Pipelining

Out-Of-Order processors have the ability to execute instructions in an order other than specified by the programmer. To reduce stalls from waiting instructions, instructions that have no active dependencies can be executed in the meantime. Consoles today use In-Order processors, which, in order to save chip space and production costs, must execute instructions in the order in which they are defined. To avoid stalling the processor pipeline, a method called Software Pipelining can be used. By rearranging instructions in a way that minimizes stalls, many processing cycles can be saved [Eng10, Cof11]. For example, given a loop that performs one read and one multiplication, the multiplication might need to wait 4 cycles for the read to finish, see figure 4.3. The programmer can unroll the loop by doing 4 loop iterations at a time, grouping the read instructions in front of all multiplications. This way, each read operation would be completed just in time for the corresponding multiplication as shown in figure 4.4.

Figure 4.3: Simplified cycle diagram of a simple two-instruction loop on an in-order processor. The first 4 iterations are shown. 20 cycles are needed to execute 8 instructions. Note that branching instructions are not shown.

Figure 4.4: Simplified cycle diagram of the same loop unrolled 4 times. The first 8 iterations are shown. 16 cycles are needed to execute 16 instructions. Note that branching instructions are not shown.
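The cycle counts in figures 4.3 and 4.4 can be reproduced with a very simplified model of an in-order processor. The model is an assumption made for this sketch: one instruction is issued per cycle, the value of a read becomes available 4 cycles after the read is issued, and branches are ignored. All function names are hypothetical.

```python
def count_cycles(program, load_latency=4):
    """Cycles needed to issue the program in order, stalling on pending reads."""
    ready = {}   # register -> first cycle at which its value can be used
    cycle = 0
    for op, register in program:
        if op == "mul":
            # An in-order core must wait here until the operand has arrived.
            cycle = max(cycle, ready[register])
        elif op == "read":
            ready[register] = cycle + load_latency
        cycle += 1
    return cycle


def rolled_loop(iterations):
    """One read immediately followed by the multiplication that uses it."""
    program = []
    for i in range(iterations):
        program += [("read", i), ("mul", i)]
    return program


def unrolled_loop(iterations, factor=4):
    """Reads for 'factor' iterations grouped in front of the multiplications."""
    program = []
    for base in range(0, iterations, factor):
        group = range(base, min(base + factor, iterations))
        program += [("read", i) for i in group]
        program += [("mul", i) for i in group]
    return program
```

With these assumptions, count_cycles(rolled_loop(4)) gives 20 cycles for 8 instructions and count_cycles(unrolled_loop(8)) gives 16 cycles for 16 instructions, matching the two figures.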

4.3.3 Careful Branching

Another drawback of In-Order processors is the inability to mitigate branching stalls. Since instructions cannot execute in advance, misprediction of a branch results in a flush of the whole execution pipeline [Col11]. The programmer should avoid branches in sections of code that run frequently, like small computation loops. If branching cannot be avoided, special hint instructions can be used to control prediction. If a branch is more likely to result in a specific path, the hint instruction can be executed a few cycles ahead to force the pipeline to start issuing instructions in the specified path before the outcome of the branch is calculated.

In special cases, branching can be completely avoided at the cost of executing both branches by using result masking. With this method, the results of both branches are calculated and then masked to yield the correct result. This is very useful on architectures with very high branching costs.
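Result masking can be illustrated on integers as below. The function names are hypothetical, and Python itself gains nothing from avoiding the branch; the sketch only shows the bit manipulation that a select instruction on a vector unit would perform.

```python
def select(condition, if_true, if_false):
    """Branch-free select using an all-ones or all-zeros bit mask."""
    mask = -int(condition)   # True -> -1 (all bits set), False -> 0
    return (if_true & mask) | (if_false & ~mask)


def clamp_to_zero(values):
    """Replace negative values by zero without branching on each value."""
    # Both candidate results (the value and 0) exist for every element;
    # the mask decides which one survives.
    return [select(value > 0, value, 0) for value in values]
```

For example, clamp_to_zero([-3, 5, 0, 7, -1]) yields [0, 5, 0, 7, 0].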

4.3.4 Load-Hit-Stores

One of the largest causes of performance loss is the Load-Hit-Store [Hei08], which occurs when a value is loaded from memory into a processor register shortly after it has been stored to the same address. If a variable is written to just before it is loaded into another register, any instruction that uses that register must wait for the writing and loading operations to finish. Since a write often needs to update the cache, it could take many cycles before the read instruction can be executed and on an In-Order processor, this results in a pipeline stall. On Power-PC processors, which have separate registers for integers, floats and vectors, Load-Hit-Store situations can easily arise from simple type conversions. Since there are no instructions on the Power-PC processor that can move values from one type of register to another, type-converting a variable means that the data needs to be stored in the cache between registers.

While these optimizations were too time consuming to implement fully in this thesis work, the design choices in the water simulation algorithm have been made with them in mind.

Chapter 5

Algorithm

The aim of this chapter is to give a detailed description of the water simulation presented in this report along with a summary of the earlier work this thesis is based on. The issues with the previous algorithm are brought up together with an explanation of the improved algorithm and how it aims to solve these problems. This chapter is focused on the theory of the algorithm and some of the methods involved rather than the actual implementation, which will be presented in the next chapter.

5.1 Earlier work

Ottosson's method for simulating interactive water [Ott11] uses a height-field model to effectively represent waves based on Linear Wave Theory. The simulation is able to render bodies of resting water which can be interacted with by physical objects moving through, or under, the surface. The method is capable of simulating both very small and very large waves simultaneously over water surfaces ranging in sizes from small puddles to big oceans.

5.1.1 Surface Cells

Each water surface is uniformly divided into a grid of smaller square sections, or cells, which are simulated individually, see figure 5.1. One such cell represents a fixed portion of the total height field for the whole surface, for example a 32x32 grid. To allow propagation of waves over the whole surface, the data along the borders of these cells, produced during a simulation step, is copied to neighboring cells in preparation for the next frame. To adjust the fidelity of the water simulation, the size of these cells can be set as desired, with the default size being 6x6 world units (measured in meters).

Figure 5.1: An illustration of a water surface divided into cells where each cell may or may not contain a grid. In this illustration, 6x6m cells are shown together with a 32x32 grid.

5.1.2 Dispersion By Convolution

The method is heavily based on the iWave article [Tes04a] which presents a convolution based method for simulating water as an alternative to Fourier based methods. The article introduces and derives a way of expressing a dispersive propagation, normally done in the Fourier domain, as a convolution kernel applied directly to the height-field (an alternative derivation of the same convolution kernel can be found in Ottosson's report [Ott11]). This method produces excellent results and is presented by Tessendorf as a viable alternative for real-time water simulation in games. However, simulating a wide range of water wavelengths in this way requires a very large convolution kernel.

5.1.3 Kernel Approximation

Convolution of a 2D-field is a costly operation whose performance is highly dependent on the size of the kernel. To reduce the cost of a convolution, it is desirable to have a kernel that is separable. A separable kernel reduces the number of multiply-and-add operations needed from $n^2 m^2$ to $2nm^2$, where the kernel is $n \times n$ and the data field is $m \times m$. Since the convolution kernel used in the iWave implementation is not separable, an approximation is used instead. The kernel is approximated by convolving the height-field data with a Gaussian kernel and subtracting the convolved data from the original height-field [Ott11]. This results in a convolution operation that reasonably approximates the iWave kernel while still being separable.
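The separable approximation can be sketched as follows. This is not the implementation from [Ott11]: the function names, the kernel radius, the value of sigma and the zero padding at the borders are all assumptions made for the example. The direct 2D convolution is included only as a reference to show that two 1D passes give the same result.

```python
import math


def gaussian_kernel(radius, sigma):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_rows(field, kernel):
    """Apply the 1D kernel along every row, treating outside values as zero."""
    radius = len(kernel) // 2
    width = len(field[0])
    return [[sum(kernel[k + radius] * row[x + k]
                 for k in range(-radius, radius + 1) if 0 <= x + k < width)
             for x in range(width)]
            for row in field]


def transpose(field):
    return [list(column) for column in zip(*field)]


def convolve_separable(field, kernel):
    """Horizontal pass followed by vertical pass: about 2n operations per point."""
    horizontal = convolve_rows(field, kernel)
    return transpose(convolve_rows(transpose(horizontal), kernel))


def convolve_full(field, kernel):
    """Reference: direct 2D convolution, about n*n operations per point."""
    radius = len(kernel) // 2
    height, width = len(field), len(field[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for j in range(-radius, radius + 1):
                for i in range(-radius, radius + 1):
                    if 0 <= y + j < height and 0 <= x + i < width:
                        out[y][x] += (kernel[j + radius] * kernel[i + radius]
                                      * field[y + j][x + i])
    return out


def approximate_kernel_response(field, kernel):
    """Original height-field minus its Gaussian-blurred copy."""
    blurred = convolve_separable(field, kernel)
    return [[f - b for f, b in zip(field_row, blurred_row)]
            for field_row, blurred_row in zip(field, blurred)]
```

Away from the borders, a completely flat height-field gives a response of zero, since blurring a constant leaves it unchanged.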

5.1.4 Laplacian Pyramids

Propagating water waves of different wavelengths requires a kernel that is proportional to the width of the largest wave simulated. If a single convolution kernel is used for waves of both centimeters and meters in length, the size of the kernel needs to be very large. Even if a separable kernel is used, this will severely degrade performance. To be able to simulate a high range of wavelengths, a multi-resolution height-field approach is applied. Burt and Adelson [BA83] introduced Laplacian Pyramids, which is a method of dividing an image or a height-field into several grids, each containing data within a specific frequency range. Each grid is also of proportional dimensions for the frequencies it holds; in this case, a grid is half the resolution of the next grid in the pyramid. Each surface cell contains a pyramid that holds decomposed waves in different grids down to a resolution of 4x4 as shown in figure 5.2 (it is important to note that, while of different resolutions, all grids in a pyramid take up the same area in world space). These grids can then be simulated separately with a convolution kernel much smaller than if all waves would have been propagated on the same grid.

Figure 5.2: An illustration of one surface cell with a 4-level Laplacian Pyramid containing grids of sizes 32x32, 16x16, 8x8 and 4x4.
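The idea of the decomposition can be sketched in one dimension. The sketch below is a simplification made for illustration: it uses pair averaging for down-scaling and sample repetition for up-scaling instead of the filters of [BA83], and all function names are hypothetical. Each level stores what the next coarser level cannot represent, and the coarsest level stores the remaining low-frequency data.

```python
def downsample(signal):
    """Halve the resolution by averaging neighboring pairs."""
    return [(signal[i] + signal[i + 1]) / 2.0 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Double the resolution by repeating every sample."""
    return [value for value in signal for _ in range(2)]


def build_pyramid(signal, coarsest=4):
    """Split a signal into detail levels plus one coarse level."""
    levels = []
    current = list(signal)
    while len(current) > coarsest:
        low = downsample(current)
        restored = upsample(low)
        # Detail that is lost when going to the coarser resolution.
        levels.append([c - r for c, r in zip(current, restored)])
        current = low
    levels.append(current)
    return levels


def collapse_pyramid(levels):
    """Rebuild the signal by accumulating from the coarsest level upwards."""
    current = levels[-1]
    for detail in reversed(levels[:-1]):
        current = [u + d for u, d in zip(upsample(current), detail)]
    return current
```

For a signal of 32 samples this produces levels of 32, 16, 8 and 4 samples, mirroring the grid sizes in figure 5.2, and collapsing the levels restores the original signal.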

5.1.5 Grid Summation

Waves are stored separately in the pyramid levels between frames, as described above, but in order to render the simulation, a pyramid needs to be merged into a single height-field for each surface cell. To generate the total surface elevation for a section of water, the height data of all grids in the same pyramid are accumulated into a height-field of the same resolution as the most detailed grid of that pyramid. This merging process is done cumulatively going from the lowest grid resolution to the highest. To merge two grids, where one grid is half the resolution of the other, an up-scaling algorithm is used to interpolate data between two pixels on the lower grid. Since each grid is assumed to represent a frequency range of sinusoidal waves, summing a linearly up-scaled version of the lower grid with the higher is insufficient. This method instead uses a bicubic up-scaling algorithm which provides an interpolation that is close enough to sinusoidal for good visual results.
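The difference between linear and cubic up-scaling can be shown in one dimension. The sketch below is an assumption-laden illustration rather than the method's actual filter: it uses Catmull-Rom weights for the cubic case, wraps around at the borders instead of reading neighboring cells, and all function names are hypothetical.

```python
import math


def upscale_linear(coarse):
    """Double the resolution; new samples are averages of their neighbors."""
    n = len(coarse)
    fine = []
    for i in range(n):
        p1, p2 = coarse[i], coarse[(i + 1) % n]
        fine += [p1, 0.5 * (p1 + p2)]
    return fine


def upscale_cubic(coarse):
    """Double the resolution using Catmull-Rom weights at the midpoints."""
    n = len(coarse)
    fine = []
    for i in range(n):
        p0, p1 = coarse[(i - 1) % n], coarse[i]
        p2, p3 = coarse[(i + 1) % n], coarse[(i + 2) % n]
        fine += [p1, (-p0 + 9.0 * p1 + 9.0 * p2 - p3) / 16.0]
    return fine


def merge_grids(coarse, fine, upscale=upscale_cubic):
    """Add an up-scaled coarse grid to the next finer grid."""
    return [u + f for u, f in zip(upscale(coarse), fine)]


def max_error(samples, reference):
    return max(abs(s - r) for s, r in zip(samples, reference))


def sine_wave(count):
    """One full period of a sine wave sampled at 'count' points."""
    return [math.sin(2.0 * math.pi * i / count) for i in range(count)]
```

Up-scaling sine_wave(8) and comparing against sine_wave(16) shows a clearly smaller maximum error for the cubic version than for the linear one, which is the property the merging step relies on.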

5.1.6 Level Of Detail

To gain additional performance over the original iWave algorithm [Tes04a], Level Of Detail is applied, taking advantage of the aforementioned pyramid grid structure. It is assumed that in video games, the detail of a water simulation can be adjusted depending on the distance from the viewer without losing great amounts of visual quality. A section of a water surface viewed from afar or from a narrow angle would not need the same attention to detail as a section close or perpendicular to the player's point of view. In other words, the frequency range of waves which can be visibly determined is decreased with distance from the viewer. Since the water is already divided into grids containing a discrete range of wavelengths it is trivial to disregard the high frequency pyramid grids when simulating and rendering surface sections at a distance.

Adjusting the level of detail by dynamically changing the number of grids used in a surface cell can create a popping effect if the visual changes are too great. In a game where the point of view is highly mobile, fading the amplitude of newly added or removed grids over time is important to reduce this effect.
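A possible shape of such a scheme is sketched below. Everything in it is hypothetical: the function names, the rule of dropping one grid each time the distance doubles, the 20 meter range for full detail and the half-second fade time are assumptions chosen for the example and are not taken from the implementation.

```python
def levels_for_distance(distance, total_levels=4, full_detail_range=20.0):
    """Number of pyramid grids to simulate for a cell at the given distance."""
    levels = total_levels
    limit = full_detail_range
    # Drop the most detailed remaining grid each time the distance doubles.
    while distance > limit and levels > 1:
        levels -= 1
        limit *= 2.0
    return levels


def step_fade(weight, active, dt, fade_time=0.5):
    """Move a grid's amplitude weight toward 1 (active) or 0 (inactive)."""
    step = dt / fade_time
    target = 1.0 if active else 0.0
    if weight < target:
        return min(target, weight + step)
    return max(target, weight - step)
```

Multiplying the height data of a grid by its weight before merging lets a grid appear or disappear gradually over fade_time seconds instead of popping in a single frame.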

5.1.7 Interaction

Another key feature of the iWave method is the ease with which interaction can be performed. Interaction is done by distorting the surface before propagating waves. Since waves are not propagated or stored in the Fourier domain, distortions can be applied directly to the height-field around interacting objects. Taking care to preserve total mass, the height-field is, in this implementation, raised in front of moving objects and lowered behind them. This, together with the dispersion property, creates realistic looking movement through the water with the characteristic Kelvin Wakes. Storing different lengths of waves in different grids, however, requires distortions to be decomposed into the corresponding frequency ranges of the grids they are added to.
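The mass-preserving distortion can be reduced to a one-dimensional sketch. The function name and its parameters are hypothetical, and a real object would affect a region of the 2D height-field rather than two samples; the point shown is only that the amount added in front equals the amount removed behind.

```python
def apply_moving_object(heights, position, direction, displacement):
    """Raise the water in front of an object and lower it behind by the same amount."""
    front = position + direction
    back = position - direction
    # Skip the distortion near the border so that nothing is added
    # without the matching amount being removed.
    if 0 <= front < len(heights) and 0 <= back < len(heights):
        heights[front] += displacement
        heights[back] -= displacement
    return heights
```

Since every raise is paired with an equal lowering, the sum of the height-field, and thereby the total mass, is unchanged by the interaction.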

5.1.8 Stitching

Each simulated section of the water surface is carefully fitted to its neighboring sections to produce the full surface mesh for the water. When two neighboring grids, sharing the same edge, are of different resolution, the in-between vertices of the grid with higher resolution need to be handled. This is done to avoid gaps in the combined rendered surface mesh. Fitting two such grids is done by fading the height data of the high resolution grid close to the seam to match the lower.

To achieve this, the high resolution features of one of the grids are faded out and linear interpolation is applied to the offending vertices to place them on the seam. This removes much of the need for an explicit stitching scheme, which can impact performance further.
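The interpolation step can be sketched for a single shared edge. The example assumes a 2:1 resolution ratio between the neighbors, so that every second vertex on the high resolution edge coincides with a vertex of the low resolution grid; the function name is hypothetical, and the fading of interior heights near the seam is not shown.

```python
def snap_edge_to_coarse(edge):
    # edge: heights along the shared edge of the high resolution grid.
    # Even indices coincide with coarse vertices and are kept; odd indices
    # are placed on the straight line between their two coarse neighbors.
    out = list(edge)
    for i in range(1, len(edge) - 1, 2):
        out[i] = 0.5 * (edge[i - 1] + edge[i + 1])
    return out
```

For example, the edge `[0.0, 5.0, 2.0, 7.0, 4.0]` becomes `[0.0, 1.0, 2.0, 3.0, 4.0]`, so the in-between vertices no longer stick out of the coarser neighbor's edge.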

However, two neighboring grids of different resolution will still leave the mesh with T-junctions. Rendering meshes with T-junctions can result in small artifacts, holes, between polygons. These artifacts have proven not to be a problem in the Frostbite 2 engine. If these potential glitches do present issues in the future, they can be completely removed with, for example, a frame of small vertical skirts around each surface section, as Zhao and Ma show [ZM09].

5.2 Implementation Issues

While a competent method for simulating interactive water, Ottosson's implementation does have a few issues preventing it from being useful in the Frostbite 2 engine, most of which the new implementation in this thesis seeks to eliminate.

5.2.1 Parallelism

Sony's PlayStation 3, which is the target architecture for this thesis, employs six special stream processors (SPEs) along with a normal CPU (PPE) that together make up the CELL chip. To be able to run any water simulation effectively on this platform, the simulation needs to take advantage of those secondary cores. Without parallelizing code specifically for these processors, they would go completely unused. Since the previous implementation runs on a single core only, this leaves a large area open for improvement.

5.2.2 Scaling

The method of uniformly dividing a water surface into small sections, or cells, is not optimal for scaling with larger water surfaces. If an ocean is divided into cells of the same size as a water surface the size of a puddle, memory is wasted for areas of the ocean that are not presently simulated. Ideally, a simulation should dynamically adjust the size and the number of cells needed for a given water surface.


5.2.3 Border Copying

Borders consist of data outside of the effective simulation data on a grid. Copying border data is done from one grid in a pyramid to surrounding pyramids containing a grid of the same resolution. Since the wave propagation is done with a convolution kernel of the same size regardless of grid resolution, the width of the borders is constant, in this case 4. To be able to simulate larger water surfaces, an increasing number of cells with pyramids is needed. Each pyramid has at least a lowest resolution grid of size 4x4. This means that a grid of that size is really 12x12 in size including borders, and thus has a worst case memory footprint that is 9 times larger than the actual simulated data; see the comparison in figure 5.3.

Figure 5.3: Comparison of the memory used for borders between a 4x4 grid and a 56x56 grid. Bright tiles represent simulated data and darkened tiles represent borders.
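The border arithmetic behind this comparison can be checked with a few lines. The sketch assumes a border of width 4 on all four sides of the simulated area, which is consistent with the 4x4 grid growing to 12x12; the function names are illustrative only.

```python
def border_overhead(effective, border=4):
    # Ratio of stored values to simulated values for a square grid.
    total = effective + 2 * border
    return (total * total) / (effective * effective)

def border_fraction(effective, border=4):
    # Share of the stored grid that is occupied by border values.
    total = effective + 2 * border
    return 1.0 - (effective * effective) / (total * total)
```

For the 4x4 grid, `border_overhead(4)` gives 9.0. For the 56x56 grid in figure 5.3, the stored size is 64x64 and `border_fraction(56)` gives 0.234375, so less than a quarter of the memory goes to borders.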

5.2.4 Data Locality

Since simulated data of the smallest resolution are copied 4 times per grid, a lot of memory accesses from different addresses are potentially performed. This usually generates many cache misses. To gain as much performance from an algorithm as possible, special attention must be given to the memory layout of the data used. It is important to localize data according to how and when it is accessed to avoid stalls caused by updating the cache. This is especially true for the SPEs, since the programmer has to do all the memory accesses by hand, and accessing memory in larger chunks at a time benefits performance. Grouping data in this manner is also very useful when parallelizing code in general, since it might remove the need for critical sections.


5.3 Improved algorithm

To combat the issues with the previous method, a new improved algorithm was designed. The goal of this algorithm is first and foremost to address the performance issues and to make it more scalable on parallel architectures. This means that the method presented in this report differs mostly in implementation details, and the description of the techniques presented earlier in this chapter is largely applicable to the new adaptation as well. However, there are some notable differences in design, which are brought up here.

5.3.1 Homogeneous Grids

The previous algorithm used many pyramids of predetermined resolutions to construct the surface mesh. Since a limited amount of memory is available, these pyramids were stored in separate pools, one for each resolution. These were distributed over cells according to the Level Of Detail. It was therefore possible to end up in a situation where one or more low resolution pyramids were needed while only high resolution pyramids were available. A solution to this is to use only one pool, populated with separate grids of the same resolution for constructing pyramids dynamically, as shown in figure 5.4. This way grids become all-purpose and can be used for both low and high resolution pyramids, which enables the Level Of Detail scheme to be more flexible.

Figure 5.4: An illustration of one surface cell with a complete 4-level quad-tree containing 85 grids. For simplicity reasons, the grid resolution is shown as 4x4.


5.3.2 Quad-trees

By only using grids of the same resolution, keeping in mind that resolution still scales by 2 between each level, it is no longer possible to construct straightforward pyramids for each cell. Instead a quad-tree structure is used, which means that instead of having a one-dimensional stack of grids with decreasing resolution, each grid has up to 4 other grids attached as children. In contrast to the previous pyramids, where all grids take up the same world space area, a grid child only takes up a quarter of the area of the underlying grid. This effectively doubles the world-space resolution of the children relative to their parent. By arranging grids in this manner, one can treat each quadrant of a grid as the basis of a virtual pyramid with the corresponding child as its next level. This means that there is no longer an upper resolution limit, since grid children can always be added on top of existing grids.
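A minimal sketch of such a quad-tree is given below, under the assumption that every node holds one grid of fixed resolution and covers a square world-space area. The class and function names are hypothetical and only serve to make the structure concrete.

```python
class GridNode:
    def __init__(self, size, level=0):
        self.size = size              # world-space edge length covered by this grid
        self.level = level            # depth in the quad-tree, root is 0
        self.children = [None] * 4    # one optional child per quadrant

    def add_child(self, quadrant):
        # A child covers one quadrant: half the edge length, a quarter of the area.
        child = GridNode(self.size / 2.0, self.level + 1)
        self.children[quadrant] = child
        return child

def sample_spacing(node, resolution=64):
    # World-space distance between samples; halves for every level.
    return node.size / resolution

def count_grids(node):
    return 1 + sum(count_grids(c) for c in node.children if c is not None)

def build_complete(size, levels):
    # Build a tree in which every node down to the last level has 4 children.
    root = GridNode(size)
    frontier = [root]
    for _ in range(levels - 1):
        frontier = [n.add_child(q) for n in frontier for q in range(4)]
    return root
```

A complete 4-level tree built this way contains 1 + 4 + 16 + 64 = 85 grids, matching the count given for figure 5.4, and each child samples the surface twice as densely as its parent.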

The quad-tree method is a common solution for adaptive grids and is used extensively in rendering terrain. Some good resources on how to implement such trees together with continuous Level Of Detail can be found in the works by Ulrich [Ulr00], Livny et al. [LKES09], and Pajarola [Paj98].

5.3.3 Large cells

Previously, the size of each surface cell was the size of the area that the highest-resolution pyramid occupied. For a high resolution water simulation, these cells had to be quite small. In the improved algorithm, a cell contains a complete quad-tree, which means that instead of cells being the size of the highest-frequency grid used, they are now the size of the lowest. In the simplest scenario, the whole water surface contains only one large cell. In order to limit the maximum length of simulated wavelengths, a surface can be divided into any number of cells.

5.3.4 Memory Layout

When using uniform grid sizes, the memory usage goes down drastically. Looking at a grid, 64x64 in size, the memory wasted for borders is always just below 25%, which is a large improvement over the worst case scenario of the previous implementation. Same-resolution grids also provide greater data locality, which is another important advantage over the older algorithm. Before, the propagation algorithm was performed on a large number of small grids, along with the grids of higher resolutions. Performing propagation on many small grids results in much overhead and many cache misses. With all grids being of equal size, there are more contiguous segments of memory that can be worked on for each propagation, which helps avoid cache misses. Large grids also benefit parallelization on the SPEs, since more memory can be fetched with a single read instruction.


Chapter 6

Implementation

The goal of the thesis presented in this report is to adapt an existing water simulation for parallel architectures. The implementation of the improved method relies heavily on the theory and techniques already employed by the previous work. Therefore, most of the new work done in the thesis is aimed at implementation details and how to design the code to run in parallel. This chapter aims to describe in detail how the improved algorithm presented in the previous chapter is implemented and what happens during one frame of simulation. The low-level implementation details are specific to the Frostbite 2 engine, but should be easy to adapt for any parallel game software. Prior to implementing the simulation in Frostbite 2, the algorithm was prototyped in a standalone application with limited mouse interaction, using GLFW/OpenGL as a framework.

6.1 Engine

The Frostbite 2 engine is a highly scalable engine designed to run on an arbitrary number of processors in parallel. To manage running the numerous thread instances, called jobs, it uses a central job manager. The developer can use the job manager to set up computation jobs in a deferred manner, with dependencies on other jobs if so desired. These jobs are then scheduled by the job manager to run when there are available time slots on one of the processors.

In an environment with such a high focus on parallelism, writing data oriented code is very important. With good data oriented code, each job should be able to work on an isolated block of data, minimizing the memory that is shared with other jobs.


Figure 6.1: Timing diagram showing an example of a job-schedule for the first couple of frames of a simulation. Indices indicate which frame data is being processed. Horizontal length is approximately proportional to wall-clock time.

6.1.1 Frame Overview

Since a game needs to present discrete frames to the viewer, the engine uses a main thread to synchronize jobs each graphical frame. The data containing the height field of the water simulation is updated once per such frame and then rendered to the screen during the subsequent frames. This yields a three-step process for the water simulation, see figure 6.1.

At the beginning of a frame, the grids of a water surface are simulated and the height field is updated. After a complete simulation step, the render data for that frame is submitted to a dispatch job where GPU buffers are filled with the mesh and texture data. Finally, the GPU renders meshes to the screen with the desired shaders applied, using the data in the buffers. To avoid having to wait for the GPU to fully render a frame before continuing with the next frame, which would leave processors idle, the simulation for the next frame is able to start right after the dispatch job has been set up. This means that a simulation update for frame $F_i$ is run at the same time as the dispatch job for frame $F_{i-1}$ and the screen rendering for frame $F_{i-2}$.
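The overlap between the three steps can be sketched as a simple schedule. The sketch assumes, purely for illustration, that each step takes exactly one time slot; in the engine the steps are jobs of varying length, and the function name is hypothetical.

```python
def pipeline_schedule(num_frames):
    # For each time slot, list which frame index each step is working on.
    # None means the step is idle while the pipeline fills or drains.
    slots = []
    for t in range(num_frames + 2):
        simulate = t if t < num_frames else None
        dispatch = t - 1 if 0 <= t - 1 < num_frames else None
        render = t - 2 if 0 <= t - 2 < num_frames else None
        slots.append({"simulate": simulate, "dispatch": dispatch, "render": render})
    return slots
```

In slot 2 of `pipeline_schedule(3)`, frame 2 is simulated while frame 1 is dispatched and frame 0 is rendered, which is the steady-state overlap described above.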

6.2 Simulation

Each frame, the height field for the whole water surface is updated with new distortion data made by interacting objects and then simulated before rendering. Since the simulation update for a frame is run at the same time as the rendering for the previous frame, the height field data needs to be duplicated over two buffers. This is achieved by a double-buffering mechanism where the buffer holding the height data for the current frame is being written to while the other buffer, containing the previous height data, is being read from by the dispatch job. To synchronize these buffers, a pointer is swapped on the main thread at the start of the update. After the buffers are swapped, the grids are rearranged according to the Level Of Detail and then simulated through a number of passes, as shown in figure 6.2.
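The double-buffering mechanism can be sketched with two arrays and an index that plays the role of the swapped pointer. The class below is an illustrative assumption and not the engine's API; thread synchronization is left out.

```python
class DoubleBuffer:
    def __init__(self, size):
        self._buffers = [[0.0] * size, [0.0] * size]
        self._write = 0  # index of the buffer the simulation writes to

    @property
    def write(self):
        # Buffer receiving the height data of the current frame.
        return self._buffers[self._write]

    @property
    def read(self):
        # Buffer holding the previous frame, read by the dispatch step.
        return self._buffers[1 - self._write]

    def swap(self):
        # Done once at the start of an update; no data is copied.
        self._write = 1 - self._write
```

After a `swap()`, the data written during the last update becomes readable, while the simulation is free to overwrite the other buffer without disturbing the reader.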


6.2.1 Data Layout

The water surface is built up by an array of square cells, each containing a quad-tree of square grids. These grids are stored in a global pool for all surfaces in the game world and are distributed according to the Level Of Detail scheme. Each of these grids represents either a section of the whole surface height-field or a partial height-field used in the summation with other grids. By using a quad-tree structure for distributing grids in each cell instead of a list, all grids can be of constant size. This greatly simplifies the data layout, since the data for all grids can be stored as a simple indexed array.

Each grid also has attributes describing, for example, the location and scale of that grid. The attributes for all grids are each stored separately using the Structures Of Arrays (4.2.3) design pattern to reduce cache misses when reading the same attribute for many grids. The data block of a grid is split into data fields of the same dimensions containing partial height data $H_{partial}(x, y)$, total height data $H_{total}(x, y)$ and velocity data $V(x, y)$. A simulated grid requires $H_{partial}$ and $V$ for both the current and the previous frame along with a double-buffered $H_{total}$. This means that each grid contains two copies of each data field ($H_{partial}$, $H_{total}$, $V$), for a total of 6 fields. Each field is stored as one float32 array in order to enable SIMD operations to process 4 values at once. The whole simulation is performed on the partial height data, using the velocity data of the previous frame, which is then accumulated into the total height data together with up-scaled data from lower-level grids.
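The layout described here can be sketched as follows, assuming one array per attribute and one flat float32 array holding a 6-field data block per grid. The field names, the `GridPool` class and the default dimension of 64 are assumptions for the example; only the field count and the float32 storage come from the text.

```python
from array import array

# Two copies (current and previous) of each of the three data fields.
FIELDS = ("h_partial_cur", "h_partial_prev",
          "h_total_cur", "h_total_prev",
          "v_cur", "v_prev")

class GridPool:
    def __init__(self, capacity, dim=64):
        self.capacity = capacity
        self.dim = dim
        self.cells = dim * dim
        # Attributes stored separately, one array per attribute.
        self.pos_x = array("f", [0.0]) * capacity
        self.pos_y = array("f", [0.0]) * capacity
        self.scale = array("f", [1.0]) * capacity
        # Data blocks: grid after grid, each block holding all 6 fields.
        self.data = array("f", [0.0]) * (capacity * len(FIELDS) * self.cells)

    def field_offset(self, grid, field):
        # Index of the first value of a field inside the flat data array.
        return (grid * len(FIELDS) + FIELDS.index(field)) * self.cells

    def bytes_per_grid(self):
        return len(FIELDS) * self.cells * self.data.itemsize
```

With 64x64 grids, each field takes 16 KiB and a full data block takes 96 KiB, so reading one attribute for all grids touches a single small array rather than striding across the data blocks.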

6.2.2 Grid Dimensions

The choice of grid dimensions is an important factor when optimizing the simulation for the right hardware. When using grids of small dimensions, more grids can be stored in the global pool for the same memory cost, which leads to greater freedom when distributing grids. However, each grid comes with an overhead, both in attributes and in the communication with nearby grids. This needs to be taken into account if memory operations have a high latency, which is the case for the SPEs. Having grid dimensions of powers of 2 speeds up DMA transfers, which work best in blocks of 16 or 128 bytes. Also, keeping in mind the limited size of the SPE Local Store (3.3.1) and that vectorized instructions work best on multiples of 4, the grid size 64x64 was chosen for the Frostbite 2 implementation.
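The constraints mentioned above can be checked for a candidate grid dimension with simple arithmetic. The helper below is a hypothetical sketch that assumes 4-byte float32 values and a SIMD width of 4, as stated in the text; it does not model the Local Store budget.

```python
def row_properties(dim, bytes_per_value=4, simd_width=4):
    # Size and alignment properties of one grid row and one full field.
    row_bytes = dim * bytes_per_value
    return {
        "row_bytes": row_bytes,
        "multiple_of_16": row_bytes % 16 == 0,
        "multiple_of_128": row_bytes % 128 == 0,
        "simd_aligned": dim % simd_width == 0,
        "field_bytes": row_bytes * dim,
    }
```

For a dimension of 64, one row is 256 bytes, which is a whole number of both 16-byte and 128-byte blocks, and one field is 16384 bytes.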


Figure 6.2: Job dependency graph for one step of simulation. For simplicity, only 3 worker threads are shown for the update passes.

6.3 Interaction Setup

Interaction with the water is done by displacing the height field with the desired shape before performing the wave propagation. Interaction data is gathered during a frame from objects that are intersecting the water surface and written to a double-buffered array, synchronized with the rest of the simulation. To avoid having to read properties directly from the intersecting objects, all data necessary for applying a disturbance is written into the array. All disturbances are abstracted with ellipses coupled with a vertical force. An object in the water would generate ellipses best matching the intersection with the resting water surface plane. In order to conserve mass when moving along the surface, two overlapping ellipses with opposing vertical forces are generated which separate when the object is in motion. The length of the separation is determined by the frame delta-time and results in water being pushed up in front of a moving object depending on speed.
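The paired-ellipse idea can be sketched on a small height field. The example makes several simplifying assumptions that are not from the text: the displacement is constant inside each ellipse, motion is along the x axis only, and the decomposition into frequency ranges is left out. All names are hypothetical.

```python
def apply_ellipse(height, cx, cy, rx, ry, force):
    # Add a constant displacement to every sample inside the ellipse.
    for y, row in enumerate(height):
        for x in range(len(row)):
            if ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0:
                row[x] += force

def apply_moving_object(height, cx, cy, rx, ry, force, vx, dt):
    # Two ellipses with opposing forces, separated by the distance moved
    # during the frame. At rest they cancel; in motion the water is raised
    # in front of the object and lowered behind it.
    shift = vx * dt
    apply_ellipse(height, cx + shift, cy, rx, ry, +force)
    apply_ellipse(height, cx, cy, rx, ry, -force)
```

When both ellipses lie inside the grid and the shift is a whole number of samples, the positive and negative displacements cover equally many samples, so the sum over the height field stays zero and no water is created or destroyed.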

6.4 Frame Setup

Before any simulation can be done, the attributes of all grids must be determined and updated. These attributes include the position and scale of each grid and its position in the quad-tree, along with information about the neighboring same-dimension grids.

6.4.1 Level Of Detail

Updating the grids' positions in the quad-trees is done by a Level Of Detail manager (LOD manager). The LOD manager distributes and culls grids depending on the position of the camera, as shown in figure 6.3. Since removing grids from the simulation effectively destroys waves that cannot be restored, removing grids must be done with care. By not taking the direction of the camera into account, the LOD manager ensures that a player cannot destroy surrounding waves simply by turning the camera on the spot.

In order to determine which grids should be culled and where grids need to be added, a sorted priority list of candidates is constructed every frame, which is then truncated to the same size as the global grid pool. Candidates can be either existing grids, grids that should be added as a child to another grid, or root-level grids that should be added to an empty cell. The priority of a candidate is determined by the position of the camera and the distance to the surface section of the candidate relative to its scale. The formula is shown here:

$$p_{cand} = \frac{s}{|v|} \left( n \cdot \frac{v}{|v|} \right) \qquad (6.1)$$

where s is the scale of the candidate, v the distance vector from the nearest point inside the section to the camera, and n the normal of the surface. By using this formula, where the surface-relative angle to the candidate section is taken into account, the LOD manager will down-prioritize candidates that are seen narrowly. It is important not to distribute grids unnecessarily in a situation where the water surface is too far away from the camera to be seen. Therefore, a lowest threshold exists that prunes candidates that are not interesting regardless of other candidates.
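The candidate selection can be sketched as below. The sketch reads equation 6.1 as the scale divided by the distance, multiplied by the cosine of the angle between the surface normal and the direction to the camera, and it assumes that n is a unit vector. The dictionary layout of a candidate, the function names and the threshold value are illustrative assumptions.

```python
import math

def priority(scale, v, n):
    # v: vector from the nearest point of the section to the camera.
    # n: unit normal of the water surface.
    dist = math.sqrt(sum(c * c for c in v))
    if dist == 0.0:
        return math.inf
    cos_angle = sum(a * b for a, b in zip(n, v)) / dist
    return (scale / dist) * cos_angle

def select_candidates(candidates, pool_size, threshold):
    # Score, prune below the threshold, sort by priority, truncate to the pool.
    scored = [(priority(c["scale"], c["v"], c["n"]), c["name"]) for c in candidates]
    kept = [item for item in scored if item[0] >= threshold]
    kept.sort(key=lambda item: item[0], reverse=True)
    return [name for _, name in kept[:pool_size]]
```

A candidate seen from straight above at distance 4 with scale 8 gets priority 2.0, while the same candidate seen at a grazing angle or from far away scores lower and is culled first when the pool is full.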

Figure 6.3: An example of grid distribution over a surface with 4 cells at a given camera position. For simplicity reasons, the grid resolution is shown as 4x4.


6.4.2 Fading

When the LOD manager culls a grid, it is not removed immediately. This is done to prevent popping, where the camera sees large quick changes in the detail of the water simulation. Instead, when a grid is culled or added, the height data of that grid is faded towards either zero or one over a short period of time. Each frame, the fade levels of all grids are updated and the grids that have faded away completely are removed. This res
