Masked Software Occlusion Culling

Post on 23-Jan-2017

Magnus Andersson

2

Occlusion Culling

Stanford Bunny in the Crytek Sponza Atrium

Eye

View frustum

3

Occlusion Culling

Stanford Bunny in the Crytek Sponza Atrium

Fully occluded

4

Occlusion Culling

Stanford Bunny in the Crytek Sponza Atrium

Partially occluded

Pixel processing

Geometry processing

Draw call

5

Hardware Fixed-function Occlusion Culling

Handled automatically under the hood

Per-tile culling granularity

– Semi-occluded triangles can be partially culled

Very late in the pipeline

[Figure: two frame timelines, each split into a CPU side and a GPU side. Stages shown: Game logic, Upload frame data, Draw call, Geometry processing, Z Tile Culling, Pixel processing. In the second timeline the first stage is labeled "Game logic + SW culling".]

6

Software Occlusion Culling

Cull very early in the pipeline

– Cull both CPU and GPU work

Short delay

– Can be integrated with scene traversal

7

Binary Space Partitioning (BSP) trees & portals

Precomputed – very efficient

Scene (occluders) must be static

Difficult to handle general scenes

Potentially Visible Sets (PVS)

Quake II, id Software, 1997

Half-Life 2, Valve Corporation, 2004

8

Potentially Visible Sets (PVS)

Quake II, id Software, 1997

Half-Life 2, Valve Corporation, 2004

Player

Not part of PVS

Leaf boundaries

9

Increasingly popular

Modern games have more complex and dynamic worlds

No complex pre-computation

– Simpler content pipeline

Dynamic Occlusion Culling

Assassin’s Creed Unity, Ubisoft, 2014 [HA15]

Battlefield 4, EA DICE, 2013 [Col11]

10

Hierarchical Z Buffer (HiZ) [Greene93]

Rasterize to full resolution z buffer

Create HiZ buffer

– Find the maximum depth in each NxN tile

Perform occlusion query with HiZ buffer

General algorithm works for both SW and HW occlusion culling
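
The three HiZ steps listed above can be sketched in a few lines. This is a minimal illustration, not the code of any of the cited systems: the names `HiZBuffer`, `BuildHiZ` and `IsOccluded` are hypothetical, and the sketch assumes that larger depth values are farther from the camera.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical names; assumed depth convention: larger z = farther away.
struct HiZBuffer {
    int tilesX = 0, tilesY = 0, tileSize = 0;
    std::vector<float> zMax;  // one maximum depth per N x N tile
};

// Reduce a full-resolution depth buffer to the maximum depth of each N x N tile.
HiZBuffer BuildHiZ(const std::vector<float>& depth, int width, int height, int n) {
    assert(n > 0 && width % n == 0 && height % n == 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    HiZBuffer hiz;
    hiz.tilesX = width / n;
    hiz.tilesY = height / n;
    hiz.tileSize = n;
    hiz.zMax.assign(static_cast<std::size_t>(hiz.tilesX) * hiz.tilesY,
                    -std::numeric_limits<float>::infinity());
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float& tile = hiz.zMax[(y / n) * hiz.tilesX + (x / n)];
            tile = std::max(tile, depth[y * width + x]);
        }
    }
    return hiz;
}

// Conservative query for a bounding rectangle (inclusive pixel bounds) whose
// nearest depth is zMin: it is occluded only if it lies behind the maximum
// depth of every tile it touches.
bool IsOccluded(const HiZBuffer& hiz, int x0, int y0, int x1, int y1, float zMin) {
    for (int ty = y0 / hiz.tileSize; ty <= y1 / hiz.tileSize; ++ty)
        for (int tx = x0 / hiz.tileSize; tx <= x1 / hiz.tileSize; ++tx)
            if (zMin <= hiz.zMax[ty * hiz.tilesX + tx])
                return false;  // may be visible somewhere in this tile
    return true;
}
```

Storing the maximum makes the query conservative: an object can be reported visible when it is in fact hidden, but never the other way around.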

Z-buffer Based Culling

Full resolution depth buffer

HiZ buffer

Complex object

Bounding shape

Dragon model courtesy of Stanford University Computer Graphics Laboratory

11

Intel Software Occlusion Culling Framework [CMK16]

Algorithm phases:

1. Rasterize a few designated occluder objects to z buffer

– Heavily SSE/AVX optimized

– Parallel triangle setup

– Parallel pixel depth computation

2. Compute 1-level HiZ buffer (and throw away z buffer)

3. Perform queries and render surviving objects

12

Rendering to z-buffer per pixel

Updating HiZ tile needs all pixels within the tile

Occlusion Query per tile

Wouldn’t it be nice to compute HiZ directly?

– Being conservative is the only requirement

Idea: use alternative HiZ representation

Z-buffer Based Culling

Full resolution depth buffer

HiZ buffer

13

Alternative HiZ buffer representation

Masked Depth Culling for Graphics Hardware [AHAM15]

Two depth values per tile

Per-pixel selection mask
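
As a rough illustration of this representation, a tile could be stored as below. The struct layout, the 8x4 tile size and the bit ordering are assumptions made for the sketch, not the layout used in [AHAM15].

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical layout for one 8x4-pixel tile (32 pixels, one mask bit each).
struct MaskedTile {
    float zMax[2];   // the two depth values stored per tile
    uint32_t mask;   // bit i selects which of the two layers pixel i belongs to
};

// Conservative depth of a single pixel: the zMax of the layer its mask bit selects.
// Assumed bit order: row-major, bit index = y * 8 + x.
float PixelDepth(const MaskedTile& tile, int x, int y) {
    assert(x >= 0 && x < 8 && y >= 0 && y < 4);
    const int bit = y * 8 + x;
    return tile.zMax[(tile.mask >> bit) & 1u];
}
```

With this layout a tile costs two floats and 32 bits instead of 32 per-pixel depth values.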

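[Figure: a tile stores two depth values, $z_{\max}^{0}$ and $z_{\max}^{1}$, plus a layer selection mask with one bit per pixel. Mask bits as extracted, in groups of four:]

0001 0011 0011 0111

0000 0000 0001 0001

1111 1111 1111 1111

0001 0011 0001 0001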

14

Masked Occlusion Culling [AHAM15]

[Slides 15 to 23 repeat this title over an animated example. Labels shown: "Merge?", "Culled", "Not culled", "Triangle meshes".]

24

Originally designed for graphics hardware

Directly update HiZ buffer without computing a full res z buffer

Decouples coverage sampling (rasterization) and depth computation

Masked Occlusion Culling [AHAM15]

Approximate, conservative HiZ buffer

Depth buffer

25

Masked Software Occlusion Culling

Could Masked Occlusion Culling [AHAM15] be really fast for software occlusion culling?

Much less memory to read/write than full res z-buffer

Updates use bitmasks – can process many pixels in parallel (i.e. SSE/AVX)

No need to compute per-pixel depths

– Would need a fast SW rasterizer to compute coverage

Turns out it can

Paper presented at High Performance Graphics this year [HAAM16]

Source code available!

26

Single Instruction, Multiple Data (SIMD)

[Figure: a scalar x86 add of two 32-bit values next to an AVX add of eight 32-bit lanes (256 bits). Values as extracted, nine columns in total:]

A     =  3  3  5  6  2  4  1  4 10

B     =  5  5  7  3  5  5 11  4  5

A + B =  8  8 12  9  7  9 12  8 15

27

Single Instruction, Multiple Data (SIMD)

[Figure: bitwise AND on x86 (32 bits) and on AVX (256 bits)]

x86, 32 bits:

A     = 0xAC1DBA5E

B     = 0x51CAFE37

A & B = 0x0008BA16

AVX, 256 bits:

A     = 0xAC1DBA5EAC1DBA5EAC1DBA5EAC1DBA5E51CAFE3751CAFE3751CAFE3751CAFE37

B     = 0x51CAFE3751CAFE3751CAFE3751CAFE37AC1DBA5EAC1DBA5EAC1DBA5EAC1DBA5E

A & B = 0x0008BA160008BA160008BA160008BA160008BA160008BA160008BA160008BA16
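
The lane-wise behavior on the two SIMD slides can be mimicked in portable C++. The sketch below uses a plain array as a stand-in for a 256-bit register; `Lanes8`, `Add` and `And` are hypothetical names, and the comments name the AVX2 intrinsics that would do each operation in one instruction.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Portable stand-in for a 256-bit AVX register: eight 32-bit lanes.
using Lanes8 = std::array<uint32_t, 8>;

// Lane-wise add; with AVX2 this maps to a single _mm256_add_epi32.
Lanes8 Add(const Lanes8& a, const Lanes8& b) {
    Lanes8 r{};
    for (int i = 0; i < 8; ++i) r[i] = a[i] + b[i];
    return r;
}

// Lane-wise bitwise AND; with AVX2 this maps to a single _mm256_and_si256.
Lanes8 And(const Lanes8& a, const Lanes8& b) {
    Lanes8 r{};
    for (int i = 0; i < 8; ++i) r[i] = a[i] & b[i];
    return r;
}
```

The bitwise case is the one that matters for masked culling: a single 256-bit AND processes 256 one-bit pixels at once.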

New algorithm target architecture

Supported in our library code

Easily extended to AVX-512

28

An abridged history of Intel’s SIMD instruction sets

SSE, 1999: 128b wide

SSE2, 2001

SSE4, 2006: Intel® microarchitecture code name Nehalem

AVX, 2011: 256b wide, 2nd Gen Intel® Core™ Processors

AVX2, 2013: 4th Gen Intel® Core™ Processors

AVX-512, 2016: 512b wide

[Timeline axis: 1998 to 2017]

Masked software occlusion culling

30

Algorithm Overview

Update

Depth test

Compute coverage

Traversal setup

Depth plane

Compute bounds

Clip

Transform

8-wide triangle setup

8 scanlines

256 pixels (8 tiles with 8x4 pixels)

Tile traversal

Triangle setup

31

Transform and Clip

32

Compute Bounding Box

Padded to 32x8 pixel supertiles
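
A possible way to compute such a padded bounding box is sketched below. `PixelRect` and `PaddedBounds` are hypothetical names, and the sketch assumes a half-open pixel rectangle and a screen size that is a multiple of the supertile size.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

struct PixelRect { int x0, y0, x1, y1; };  // half-open: [x0, x1) x [y0, y1)

// Screen-space bounding box of a triangle, clamped to the screen and padded
// outward to 32x8-pixel supertile boundaries.
PixelRect PaddedBounds(const float xs[3], const float ys[3], int width, int height) {
    assert(width % 32 == 0 && height % 8 == 0);
    auto clampToInt = [](float v, int hi) {
        return static_cast<int>(std::min(std::max(v, 0.0f), static_cast<float>(hi)));
    };
    const int minX = clampToInt(std::floor(std::min({xs[0], xs[1], xs[2]})), width);
    const int maxX = clampToInt(std::ceil(std::max({xs[0], xs[1], xs[2]})), width);
    const int minY = clampToInt(std::floor(std::min({ys[0], ys[1], ys[2]})), height);
    const int maxY = clampToInt(std::ceil(std::max({ys[0], ys[1], ys[2]})), height);
    // Round the minimum down and the maximum up to the supertile grid.
    return {(minX / 32) * 32, (minY / 8) * 8,
            ((maxX + 31) / 32) * 32, ((maxY + 7) / 8) * 8};
}
```

Padding to 32x8 means every traversal step covers a whole group of 256 pixels, which matches one 256-bit register of coverage bits.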

33

Compute Depth Plane

Depth = ax + by + c

– Conservative tile depth: Check sign of a and b

– Can be incrementally updated

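[Figure: the tile corner used for the conservative depth is chosen from the signs of (a, b): (-, -), (+, -), (-, +), (+, +). Other labels: "+ a", "+ b", "Clamp to vertex depths".]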
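
The depth plane step can be sketched as follows. The function names are hypothetical, the depth convention (larger = farther) is an assumption, and the sketch computes the conservative maximum depth an occluder triangle can have inside a tile.

```cpp
#include <algorithm>
#include <cassert>

struct DepthPlane { float a, b, c; };  // z(x, y) = a*x + b*y + c

// Plane through three screen-space vertices given as (x, y, z).
DepthPlane MakeDepthPlane(const float v0[3], const float v1[3], const float v2[3]) {
    const float e1x = v1[0] - v0[0], e1y = v1[1] - v0[1], e1z = v1[2] - v0[2];
    const float e2x = v2[0] - v0[0], e2y = v2[1] - v0[1], e2z = v2[2] - v0[2];
    const float det = e1x * e2y - e2x * e1y;
    assert(det != 0.0f);  // degenerate (zero-area) triangles have no plane
    DepthPlane p;
    p.a = (e1z * e2y - e2z * e1y) / det;
    p.b = (e1x * e2z - e2x * e1z) / det;
    p.c = v0[2] - p.a * v0[0] - p.b * v0[1];
    return p;
}

// Conservative maximum depth over the tile [x, x+w] x [y, y+h]: the signs of
// a and b select the corner where the plane is largest. The result is clamped
// to the largest vertex depth because the plane extends beyond the triangle.
float TileZMax(const DepthPlane& p, float x, float y, float w, float h, float triZMax) {
    const float cx = (p.a > 0.0f) ? x + w : x;
    const float cy = (p.b > 0.0f) ? y + h : y;
    return std::min(p.a * cx + p.b * cy + p.c, triZMax);
}
```

Because the plane is linear, moving one tile to the right changes the unclamped value by exactly a times the tile width, which is what makes an incremental update possible.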

34

Supertile Traversal Order

36

AVX Register Layout

One scanline per SIMD lane

Compute slopes (∆y/∆x) once

– Similar to regular scanline rasterizers

37

Edge Slopes

38

Compute Intersections

Compute intersections for each scanline

– Eight scanlines in parallel using AVX
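
A scalar sketch of the per-scanline intersection step is shown below. `EdgeIntersections` is a hypothetical name; the sketch steps x by the change in x per scanline, which is an assumption about how the slope is used. In an AVX version the eight results would sit in the eight lanes of one register.

```cpp
#include <array>
#include <cassert>

// x-coordinate where the edge (x0, y0)-(x1, y1) crosses each of eight
// consecutive scanlines, the first one sampled at y = firstY.
std::array<float, 8> EdgeIntersections(float x0, float y0, float x1, float y1,
                                       float firstY) {
    assert(y0 != y1);  // a horizontal edge never crosses a scanline
    const float dxPerLine = (x1 - x0) / (y1 - y0);  // computed once per edge
    std::array<float, 8> xs{};
    float x = x0 + (firstY - y0) * dxPerLine;
    for (int i = 0; i < 8; ++i) {
        xs[i] = x;       // lane i holds the intersection for scanline firstY + i
        x += dxPerLine;  // incremental step to the next scanline
    }
    return xs;
}
```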

40

Compute Coverage Mask

Start with full coverage mask

– Shift each lane (scanline) to intersection

– AVX2 and later have per-lane shift instruction

41

Compute Coverage Mask

Repeat the same process for the next edge

Left edge

Right edge

Right edge

42

Compute Coverage Mask

Repeat the same process for the next edge

– Edge is facing right: invert mask

44

Compute Coverage Mask

Combine masks of all overlapping edges

– Using bitwise AND
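
The coverage mask steps (full mask, shift to the intersection, invert for the opposite edge, AND everything together) can be sketched in scalar code. All names are hypothetical, and the bit order (bit i is pixel column i) is an assumption. With AVX2 the per-lane shift would be a single `_mm256_sllv_epi32`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Mask8 = std::array<uint32_t, 8>;  // 8 scanlines x 32 pixels; bit i = column i

// Full coverage mask shifted to the intersection: columns >= x stay set.
uint32_t ShiftedMask(int x) {
    if (x <= 0) return ~0u;
    if (x >= 32) return 0u;  // shifting a 32-bit value by 32 or more is undefined in C++
    return ~0u << x;
}

// Coverage of one edge for eight scanlines. An edge whose inside lies to its
// left keeps the columns below the intersection, so the mask is inverted.
Mask8 EdgeMask(const std::array<int, 8>& intersectX, bool insideIsLeft) {
    Mask8 m{};
    for (int i = 0; i < 8; ++i) {
        const uint32_t rightOfEdge = ShiftedMask(intersectX[i]);
        m[i] = insideIsLeft ? ~rightOfEdge : rightOfEdge;
    }
    return m;
}

// A pixel is covered only if it is inside every edge: bitwise AND of the masks.
Mask8 Combine(const Mask8& a, const Mask8& b) {
    Mask8 m{};
    for (int i = 0; i < 8; ++i) m[i] = a[i] & b[i];
    return m;
}
```

No per-pixel work is needed: each scanline's coverage falls out of one shift, an optional NOT, and one AND per edge.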

47

Shuffle Mask

Shuffle mask to form better shaped tiles

– Before: each SIMD lane is a scanline

– After: each SIMD lane is an 8x4 tile
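
The shuffle amounts to a small byte transpose. The sketch below is a scalar version under assumed conventions: `ScanlinesToTiles` is a hypothetical name, tiles are numbered left to right and then top to bottom, and bit `r * 8 + c` of a tile is row r, column c. In SIMD code a byte shuffle instruction could do the same rearrangement.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Mask8 = std::array<uint32_t, 8>;

// Input:  8 lanes, each one scanline of 32 pixels (a 32x8 pixel block).
// Output: 8 lanes, each one 8x4 tile (4 tiles across, 2 tiles down).
Mask8 ScanlinesToTiles(const Mask8& rows) {
    Mask8 tiles{};
    for (int ty = 0; ty < 2; ++ty) {
        for (int tx = 0; tx < 4; ++tx) {
            uint32_t tile = 0;
            for (int r = 0; r < 4; ++r) {
                // Byte tx of scanline (ty * 4 + r) holds the 8 pixels of tile row r.
                const uint32_t byte = (rows[ty * 4 + r] >> (tx * 8)) & 0xFFu;
                tile |= byte << (r * 8);
            }
            tiles[ty * 4 + tx] = tile;
        }
    }
    return tiles;
}
```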

48

Depth Test

Interpolate conservative depth (per 8x4 tile)

Test against buffer

49

Update Tile

Two code paths (can be switched at compile time)

– Original update method [AHAM15]

– New update method tailored for SW [HAAM16]

Why use a new update method?

– Faster – same culling power

– Less accurate than original, more dependent on render order

– Works best if you render front-to-back

50

Update Tile, New Method [HAAM16]

$z_{\max}^{0}$ is the reference layer

– Maximum value for the entire tile

$z_{\max}^{1}$ is the working layer

– Maximum value for a subset of the tile

– Updated as

– New depth = $\max(z_{\max}^{1}, z_{\max}^{tri})$

– New mask = TriangleMask OR LayerMask

Whenever working layer mask is full, overwrite reference layer

51

Update Tile

Discard heuristic: If $z_{\max}^{1} - z_{\max}^{tri} > z_{\max}^{0} - z_{\max}^{1}$, discard working layer

56

Update Tile

Restart

Full overwrite: restart from new value

60

Update Tile

Update is quicker than original [AHAM15]

Test is also quicker

– Need only to test against reference layer ($z_{\max}^{0}$)
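
The update and test steps from the Update Tile slides can be put together in a short sketch. It follows the rules stated on the slides (merge into the working layer, discard heuristic, overwrite when the mask is full), but the names, the [0, 1] depth range with larger values farther away, and the use of 0 as the empty working-layer depth are assumptions. The sketch also assumes the incoming triangle mask has already passed the depth test.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct MaskedTile {
    float zMax0 = 1.0f;  // reference layer: max depth for the entire tile
    float zMax1 = 0.0f;  // working layer: max depth for the pixels set in mask
    uint32_t mask = 0;   // pixels covered by the working layer (8x4 = 32 bits)
};

// Merge one triangle (coverage mask + conservative max depth) into a tile.
void UpdateTile(MaskedTile& tile, uint32_t triMask, float triZMax) {
    if (triMask == 0) return;
    // Discard heuristic: if the triangle is farther from the working layer than
    // the working layer is from the reference layer, restart the working layer.
    if (tile.zMax1 - triZMax > tile.zMax0 - tile.zMax1) {
        tile.zMax1 = 0.0f;
        tile.mask = 0;
    }
    tile.zMax1 = std::max(tile.zMax1, triZMax);  // new depth
    tile.mask |= triMask;                        // new mask
    // Working layer covers the whole tile: it becomes the new reference layer.
    if (tile.mask == ~0u) {
        tile.zMax0 = tile.zMax1;
        tile.zMax1 = 0.0f;
        tile.mask = 0;
    }
}

// Occlusion test against the reference layer only.
bool TileOccludes(const MaskedTile& tile, float zMin) {
    return zMin > tile.zMax0;
}
```

Testing against the reference layer alone is conservative: the working layer can only describe pixels that are closer than the reference depth, so ignoring it never reports a visible object as hidden.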

62

Results

Intel Occlusion Culling Sample

Clear: Clearing the depth buffer

Geom: Transform & project geometry

Rast: Triangle setup & occluder rasterization

Gen: Compute HiZ buffer from full resolution z buffer

Test: Perform occlusion queries

[Chart: time in μs per phase for Old [CMK16] and New [HAAM16]; speedup labels shown on the chart: 3.7x and 16x]

63

Performance comparison for camera animation

Results

First frame

Last frame

Old New Frustum only

Code is available as open-source

65

Masked Occlusion Culling API

Setup:

void SetResolution();
void SetNearClipPlane();
void ClearBuffer();

Render & query:

static void TransformVertices();
Result RenderTriangles();
Result TestTriangles();
Result TestRect();

Debug:

void ComputePixelDepthBuffer();
OcclusionCullingStatistics GetStatistics();

66

Masked Occlusion Culling API

Render to the software HiZ buffer

Result RenderTriangles(
    float *inVtx,          // Clip space vertex positions
    uint *inTris,          // Index array (indices to inVtx buffer)
    int nTris,             // Triangle count (the number of index triplets in inTris)
    ClipPlanes mask,       // Mask for potential frustum bound overlap
    ScissorRect *scissor,  // Scissor region
    VertexLayout &layout   // Vertex format of inVtx. There is a fast path for AoS with (x, y, z, w) coordinates
);

67

Masked Occlusion Culling API

Result RenderTriangles(
    float *inVtx,
    uint *inTris,
    int nTris,
    ClipPlanes mask,
    ScissorRect *scissor,
    VertexLayout &layout
);

Eye

View frustum

Near plane

mask = 0

mask = leftPlane | nearPlane

Clipping is not free...

– If you’re already doing frustum culling, let the API know the outcome
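
One way an application might derive such a mask from its own frustum test is sketched below. The enum and function are hypothetical and are not the library's `ClipPlanes` type; the sketch assumes clip-space bounding box corners and a D3D-style near plane at z = 0.

```cpp
#include <cassert>

// Hypothetical plane bits, not the library's ClipPlanes type.
enum ClipPlaneBits : unsigned {
    kClipNone   = 0,
    kClipNear   = 1u << 0,
    kClipLeft   = 1u << 1,
    kClipRight  = 1u << 2,
    kClipBottom = 1u << 3,
    kClipTop    = 1u << 4
};

struct Vec4 { float x, y, z, w; };

// Set a bit for every frustum plane that has at least one corner outside it.
// A result of kClipNone means the object is fully inside and needs no clipping.
unsigned PlanesCrossed(const Vec4* corners, int count) {
    unsigned mask = kClipNone;
    for (int i = 0; i < count; ++i) {
        const Vec4& v = corners[i];
        if (v.z < 0.0f) mask |= kClipNear;  // assumed near plane at z = 0
        if (v.x < -v.w) mask |= kClipLeft;
        if (v.x > v.w)  mask |= kClipRight;
        if (v.y < -v.w) mask |= kClipBottom;
        if (v.y > v.w)  mask |= kClipTop;
    }
    return mask;
}
```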

68

Masked Occlusion Culling API

Result RenderTriangles(
    float *inVtx,
    uint *inTris,
    int nTris,
    ClipPlanes mask,
    ScissorRect *scissor,
    VertexLayout &layout
);

Eye

View frustum

Scissor region (screen space AABB)

Can be used for threading

– One scissor region per thread
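
A simple way to produce one scissor region per thread is to cut the screen into horizontal bands. The struct below is a hypothetical stand-in for the library's `ScissorRect`, and aligning the bands to 8 rows is an assumption made so that no two threads touch the same tile.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a scissor rectangle: [minX, maxX) x [minY, maxY).
struct Scissor { int minX, minY, maxX, maxY; };

// Split the screen into up to `threads` horizontal bands aligned to `align` rows.
std::vector<Scissor> SplitIntoBands(int width, int height, int threads, int align = 8) {
    assert(threads > 0 && align > 0 && height % align == 0);
    const int tileRows = height / align;
    std::vector<Scissor> bands;
    for (int t = 0; t < threads; ++t) {
        const int y0 = (tileRows * t / threads) * align;
        const int y1 = (tileRows * (t + 1) / threads) * align;
        if (y1 > y0) bands.push_back({0, y0, width, y1});
    }
    return bands;
}
```

Each thread would then render the same occluders with its own band as the scissor region, so the threads write to disjoint parts of the buffer.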

69

Masked Occlusion Culling API

Test triangles against the software HiZ buffer

– Does not update the buffer

Result TestTriangles(      // Returns the collective culling outcome of the triangles
    float *inVtx,          // Clip space vertex positions
    uint *inTris,          // Index array (indices to inVtx buffer)
    int nTris,             // Triangle count (the number of index triplets in inTris)
    ClipPlanes mask,       // Mask for potential frustum bound overlap
    ScissorRect *scissor,  // Scissor region
    VertexLayout &layout   // Vertex format of inVtx. There is a fast path for AoS with (x, y, z, w) coordinates
);

70

Masked Occlusion Culling API

Test rectangle against the software HiZ buffer

– Does not update the buffer

Result TestRect(   // Returns the culling outcome of the screen space rectangle
    float xmin,    // Screen space bounds: [xmin, ymin] to [xmax, ymax]
    float ymin,
    float xmax,
    float ymax,
    float wmin     // Conservative clip space w (typically the w-component of the nearest bbox vertex in clip space)
);
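
The inputs to a rectangle test could be derived from a bounding box as sketched below. `RectQuery` and `MakeRectQuery` are hypothetical names; the sketch assumes the bounds are expressed in normalized device coordinates and gives up when a corner is at or behind the camera plane.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

struct Vec4 { float x, y, z, w; };
struct RectQuery { float xmin, ymin, xmax, ymax, wmin; };

// Project clip-space bounding box corners and collect the 2D bounds plus the
// smallest w. Returns false if any corner has w <= 0; the caller should then
// treat the object as visible instead of testing it.
bool MakeRectQuery(const Vec4* corners, int count, RectQuery& out) {
    assert(count > 0);
    const float big = std::numeric_limits<float>::max();
    out = {big, big, -big, -big, big};
    for (int i = 0; i < count; ++i) {
        const Vec4& v = corners[i];
        if (v.w <= 0.0f) return false;
        const float invW = 1.0f / v.w;
        out.xmin = std::min(out.xmin, v.x * invW);
        out.xmax = std::max(out.xmax, v.x * invW);
        out.ymin = std::min(out.ymin, v.y * invW);
        out.ymax = std::max(out.ymax, v.y * invW);
        out.wmin = std::min(out.wmin, v.w);  // nearest corner gives the conservative w
    }
    return true;
}
```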

71

Example use case: Scene Bounding Volume Hierarchy (BVH) traversal and culling

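ClearBuffer();
prioQueue.push(root, 0);                  // queue ordered by dist, so nodes are visited front to back
while (!prioQueue.empty()) {
    Node node = prioQueue.pop();
    if (FrustumTest(node) == Culled)
        continue;
    bounds = compute_screen_space_bounds(node);
    if (TestRect(bounds) == Culled)
        continue;                         // hidden behind occluders rendered so far
    if (node is InnerNode) {
        prioQueue.push(node.left, dist);  // dist: sort key of the child
        prioQueue.push(node.right, dist);
    } else {                              // node is Leaf
        xfVertices = TransformVertices(node.vertices);
        RenderTriangles(xfVertices);      // the leaf becomes an occluder for later nodes
        send_leaf_to_GPU();
    }
}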

RenderFrame

Culled!

72

Essential Tools We Have Relied On

Intel® VTune™

– https://software.intel.com/en-us/intel-vtune-amplifier-xe

SSE/AVX intrinsics guide

– https://software.intel.com/sites/landingpage/IntrinsicsGuide/

73

References

[AHAM15] ANDERSSON M., HASSELGREN J., AKENINE-MÖLLER T.: Masked Depth Culling for Graphics Hardware. ACM Transactions on Graphics 34, 6 (2015), pp. 188:1–188:9

[CMK16] CHANDRASEKARAN C., MCNABB D., KUAH K., FAUCONNEAU M., GIESEN F.: Software Occlusion Culling. Published online at: https://software.intel.com/en-us/articles/software-occlusion-culling, (2013–2016)

[Col11] COLLIN D.: Culling the Battlefield. Game Developers Conference (presentation), (2011)

[Greene93] GREENE N., KASS M., MILLER G.: Hierarchical Z-Buffer Visibility. In Proceedings of SIGGRAPH, (1993), pp. 231–238

[HA15] HAAR U., AALTONEN S.: GPU-Driven Rendering Pipelines. SIGGRAPH Advances in Real-Time Rendering in Games course, (2015)

[HAAM16] HASSELGREN J., ANDERSSON M., AKENINE-MÖLLER T.: Masked Software Occlusion Culling. High Performance Graphics, (2016)

74

Check it out!

GitHub: Lightweight library

– https://github.com/GameTechDev/MaskedOcclusionCulling

GitHub: Example integrated in Intel’s Software Occlusion Culling demo

– https://github.com/GameTechDev/OcclusionCulling

Project page: Masked Software Occlusion Culling

– https://software.intel.com/en-us/articles/masked-software-occlusion-culling

Questions and feedback welcome

– magnus.andersson@intel.com

Legal Notices and Disclaimers

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit http://www.intel.com/performance.

Cost reduction scenarios described are intended as examples of how a given Intel-based product, in the specified circumstances and configurations, may affect future costs and provide cost savings. Circumstances will vary. Intel does not guarantee any costs or cost reduction.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Statements in this document that refer to Intel’s plans and expectations for the quarter, the year, and the future, are forward-looking statements that involve a number of risks and uncertainties. A detailed discussion of the factors that could affect Intel’s results and plans is included in Intel’s SEC filings, including the annual report on Form 10-K.

All products, computer systems, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. © 2016 Intel Corporation. Intel, the Intel logo, VTune and others are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.