Blasting Sand with MPM - GPU Technology...

Post on 19-Mar-2020

Blasting Sand with CUDA: MPM Sand Simulation for VFX

Gergely Klár

DreamWorks Animation

[Figures: three panels, each showing a step from t_n to t_n+1]

Grid influence

Naïve Particles-to-Grid

Gather Particles-to-Grid
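The slides give only the titles of these two approaches, so the following is a minimal CPU-side sketch of the usual contrast, not the code from the talk. It assumes a 1D grid with spacing 1, non-negative particle positions, and a linear (hat) weight; `scatterP2G` and `gatherP2G` are hypothetical names. In the scatter form, each particle writes to its neighboring nodes, which on a GPU would require atomic additions. In the gather form, each node sums over particles and needs no atomics, but here it has to scan every particle.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scatter: loop over particles; each particle adds mass to its two nearest nodes.
// In a parallel version, each "+=" below would have to be an atomic add.
std::vector<float> scatterP2G(const std::vector<float>& x,
                              const std::vector<float>& m,
                              std::size_t numNodes) {
    std::vector<float> grid(numNodes, 0.0f);
    for (std::size_t p = 0; p < x.size(); ++p) {
        const auto base = static_cast<std::size_t>(std::floor(x[p]));
        const float f = x[p] - static_cast<float>(base);  // offset inside the cell
        if (base < numNodes) grid[base] += m[p] * (1.0f - f);
        if (base + 1 < numNodes) grid[base + 1] += m[p] * f;
    }
    return grid;
}

// Gather: loop over nodes; each node sums the particles within its support.
// No write conflicts, but without a spatial index every particle is visited.
std::vector<float> gatherP2G(const std::vector<float>& x,
                             const std::vector<float>& m,
                             std::size_t numNodes) {
    std::vector<float> grid(numNodes, 0.0f);
    for (std::size_t i = 0; i < numNodes; ++i) {
        for (std::size_t p = 0; p < x.size(); ++p) {
            const float w = 1.0f - std::fabs(x[p] - static_cast<float>(i));
            if (w > 0.0f) grid[i] += m[p] * w;
        }
    }
    return grid;
}
```

Both functions produce the same grid; they differ only in which loop is outermost and therefore in where write conflicts and search costs appear.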

Our Solution

• Each particle is read only once,

• We efficiently use shared memory for the grids,

• We significantly reduce the number of atomic operations,

• And our secret sauce: a special data structure for particle queries.

[Figure: the domain split into nine regions (tiles), each labeled "1 CUDA Block"]

[Figure: CellBins → ParticleIDs → actual particle data]

[Figure: TileBins → CellBins → ParticleIDs → actual particle data]

• In each block/tile:

  – Get blockIdx

  – Cells in the tile are TileBins[blockIdx-1] .. TileBins[blockIdx]-1

  – Get a cellId for each warp from this list

    • Each thread works on two affected grid nodes

    • Particles of a cell are CellBins[cellId-1] .. CellBins[cellId]-1

    • Compute the contribution from the particle

    • Store in shared memory

  – Write back to global memory
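The loop structure above can be sketched on the CPU as follows. This is an illustration of the bin traversal only, not the CUDA kernel from the talk: the outer loop stands in for one CUDA block per tile, the middle loop for one warp per cell, and a local accumulator for shared memory. It assumes the bins hold inclusive running sums and that bin 0 starts at offset 0 (the slide's `[i-1]` index for `i = 0`); `binBegin` and `massPerTile` are hypothetical names, and summing mass per tile is a placeholder for the real per-node contribution.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Start offset of bin i in an inclusive running-sum array (assumed: bin 0 starts at 0).
static std::uint32_t binBegin(const std::vector<std::uint32_t>& bins, std::size_t i) {
    return i == 0 ? 0u : bins[i - 1];
}

// Visits every particle tile by tile and cell by cell, summing particle mass per tile.
std::vector<float> massPerTile(const std::vector<std::uint32_t>& tileBins,
                               const std::vector<std::uint32_t>& cellBins,
                               const std::vector<std::uint32_t>& particleIds,
                               const std::vector<float>& mass) {
    std::vector<float> result(tileBins.size(), 0.0f);
    for (std::size_t tile = 0; tile < tileBins.size(); ++tile) {      // one block per tile
        float local = 0.0f;                                            // stand-in for shared memory
        for (std::uint32_t cell = binBegin(tileBins, tile);            // one warp per cell
             cell < tileBins[tile]; ++cell) {
            for (std::uint32_t p = binBegin(cellBins, cell);           // particles of this cell
                 p < cellBins[cell]; ++p) {
                local += mass[particleIds[p]];                         // the particle's contribution
            }
        }
        result[tile] = local;                                          // write back to global
    }
    return result;
}
```

Each particle is read exactly once, and all accumulation for a tile happens in the local buffer before a single write-back.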

Tile & Cell Keys

●Particle coordinates: (px, py, pz)

●Cell coordinates: (ci, cj, ck) = ⌊(px, py, pz)/Δx⌋

●Tile and in-tile coordinates: (ci, cj, ck) = (ti, tj, tk)∙TILE_SIZE + (ri, rj, rk)

Key layout in a 32-bit unsigned integer, fields in the order shown on the slide:

tj (7 bits) | ti (7 bits) | tk (7 bits) | rj (3 bits) | rk (3 bits) | ri (3 bits)
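A minimal sketch of how such a key could be packed. It assumes the fields are stored from most to least significant in the order listed on the slide (tj, ti, tk, rj, rk, ri), that TILE_SIZE is 8 (three in-tile bits per axis), and that positions are non-negative and fall within 1024 cells per axis. `makeKey` and `tileOf` are hypothetical helper names.

```cpp
#include <cmath>
#include <cstdint>

constexpr std::uint32_t CELL_BITS = 3;               // in-tile bits per axis
constexpr std::uint32_t TILE_SIZE = 1u << CELL_BITS;  // 8 cells per tile per axis (assumed)

// Packs a particle position into a 30-bit tile-and-cell key.
std::uint32_t makeKey(float px, float py, float pz, float dx) {
    // Cell coordinates: (ci, cj, ck) = floor((px, py, pz) / dx)
    const auto ci = static_cast<std::uint32_t>(std::floor(px / dx));
    const auto cj = static_cast<std::uint32_t>(std::floor(py / dx));
    const auto ck = static_cast<std::uint32_t>(std::floor(pz / dx));
    // Tile and in-tile coordinates: c = t * TILE_SIZE + r
    const std::uint32_t ti = ci / TILE_SIZE, ri = ci % TILE_SIZE;
    const std::uint32_t tj = cj / TILE_SIZE, rj = cj % TILE_SIZE;
    const std::uint32_t tk = ck / TILE_SIZE, rk = ck % TILE_SIZE;
    // Tile fields occupy the high bits, so sorting keys groups particles by tile first.
    return (tj << 23) | (ti << 16) | (tk << 9) | (rj << 6) | (rk << 3) | ri;
}

// Extracts the 21 tile bits; two keys with equal tileOf() belong to the same tile.
std::uint32_t tileOf(std::uint32_t key) {
    return key >> (3 * CELL_BITS);
}
```

For example, with `dx = 1`, a particle at `(9.5, 0, 0)` lies in cell `ci = 9`, which is tile `ti = 1` with in-tile offset `ri = 1`.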

[Figure: binning pipeline. Tile & Cell Keys and Initial Particle IDs → sort → Particle IDs; sorted keys → RLE → inc. sum → Cell Bins; sorted keys → masked RLE → inc. sum → Tile Bins]

Tile & Cell Keys

● When sorted as uint32s, keys of the same tile will be consecutive

● RLE encoding counts the number of particles per cell

● The running sum of the counts gives the offsets to particles

● RLE encoding with a mask for the tile bits counts the number of non-empty cells per tile

● The running sum of these counts gives the offsets to cells
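The sort, RLE, and running-sum steps can be sketched sequentially as below. This is a CPU illustration under stated assumptions, not the GPU implementation from the talk: the masked RLE is applied to the list of distinct cell keys, the running sums are inclusive, and `Bins`, `runLengthsScanned`, and `buildBins` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

struct Bins {
    std::vector<std::uint32_t> particleIds;  // particle indices ordered by key
    std::vector<std::uint32_t> cellBins;     // running sum of particles per non-empty cell
    std::vector<std::uint32_t> tileBins;     // running sum of non-empty cells per tile
};

// Run-length encodes (key & mask), then replaces the counts by their inclusive running sum.
static std::vector<std::uint32_t> runLengthsScanned(const std::vector<std::uint32_t>& keys,
                                                    std::uint32_t mask) {
    std::vector<std::uint32_t> counts;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i == 0 || (keys[i] & mask) != (keys[i - 1] & mask)) counts.push_back(0);
        ++counts.back();
    }
    std::partial_sum(counts.begin(), counts.end(), counts.begin());
    return counts;
}

Bins buildBins(const std::vector<std::uint32_t>& keys, std::uint32_t tileMask) {
    Bins b;
    // Sort the initial particle IDs 0..n-1 by their keys.
    b.particleIds.resize(keys.size());
    std::iota(b.particleIds.begin(), b.particleIds.end(), 0u);
    std::stable_sort(b.particleIds.begin(), b.particleIds.end(),
                     [&](std::uint32_t a, std::uint32_t c) { return keys[a] < keys[c]; });
    std::vector<std::uint32_t> sortedKeys;
    for (std::uint32_t id : b.particleIds) sortedKeys.push_back(keys[id]);
    // RLE over full keys: particles per cell, then offsets into particleIds.
    b.cellBins = runLengthsScanned(sortedKeys, 0xFFFFFFFFu);
    // Masked RLE over the distinct cell keys: non-empty cells per tile, then offsets into cellBins.
    std::vector<std::uint32_t> cellKeys = sortedKeys;
    cellKeys.erase(std::unique(cellKeys.begin(), cellKeys.end()), cellKeys.end());
    b.tileBins = runLengthsScanned(cellKeys, tileMask);
    return b;
}
```

With a key whose low 9 bits hold the in-tile coordinates, `tileMask` would be `~0x1FFu`. Each step is a standard data-parallel primitive (sort by key, run-length encode, inclusive scan), which is what makes the approach suitable for a GPU.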

Results

Overall

[Chart: GPU vs. CPU, milliseconds per time step (smaller is better), axis from 0 to 1000, for 262K, 884K, 2,097K, and 7,000K particles]

NVIDIA Quadro K5200; Intel Xeon CPU E5-2697 v3 @ 2.60GHz with 28 cores

Particles to Grids

[Chart: milliseconds per time step (smaller is better), axis from 0 to 600, for 262K, 884K, 2,097K, and 7,000K particles]

Grids to Particles

[Chart: milliseconds per time step (smaller is better), axis from 0 to 600, for 262K, 884K, 2,097K, and 7,000K particles]

Summary

• Particle binning with sort-RLE-scan

• Breaking the domain into tiles that fit in shared memory

• Processing the particles of a cell with a single warp

Special thanks to:

• Ken Museth

• Stephen Jones

• Jeff Budsberg

• Lawrence Lee

• Rob Tesdahl

• David Tonnesen

• Ibrahim Sani

Thank you!