PT-4055, Optimizing Raytracing on GCN with AMD Development Tools, by Tzachi Cohen

OPTIMIZING RAYTRACING ON GCN WITH AMD DEVELOPMENT TOOLS

TZACHI COHEN NOVEMBER 2013


AGENDA

Overview of Raytracing & KD Trees

Review of GCN Architecture

Mapping Raytracing to GPUs

Optimizing Raytracing using CodeXL

Overview Of Raytracing


ACCELERATION STRUCTURES TRADE OFFS

[Figure: Bounding Volume Hierarchies, KD Tree and Uniform Grid compared on two axes: Construction Speed and Tracing Speed]


HIERARCHICAL KD TREE – 2D

[Figure: a 2D scene partitioned by a hierarchical KD tree, shown next to the tree itself; nodes are labelled A through G]


KD TREE – 3D


STACK BASED TRAVERSAL KD TREE – 2D

[Figure: a ray crossing the 2D KD tree (nodes A through G), with the parameters tMin, t1, t2 and tMax marked along the ray]


TRAVERSING KD TREES – PSEUDO CODE

stack.push(KDroot, sceneMin, sceneMax)
tHit = infinity
while !(stack.empty()):
    (node, tStart, tEnd) = stack.pop()
    while !(node.isLeaf()):
        tSplit = (node.value - ray.origin[node.axis]) / ray.direction[node.axis]
        (near, far) = findNear(ray.origin[node.axis], node.left, node.right)
        if (tSplit >= tEnd or tSplit < 0)
            node = near
        else if (tSplit <= tStart)
            node = far
        else
            stack.push(far, tSplit, tEnd)
            node = near
            tEnd = tSplit
    for prim in node.primitives():
        tHit = min(tHit, prim.Intersect(ray))
    if tHit < tEnd:
        return tHit
return tHit
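The pseudo code above can be tried out on a toy scene. The following is a minimal Python sketch, not the presenter's implementation: the `Circle` and `Node` classes, the two-leaf scene and the guard for a zero direction component are all assumptions made for illustration, and `find_near` is reduced to "the child on the ray origin's side of the split plane".

```python
import math

class Circle:
    """Toy 2D primitive (hypothetical stand-in for real scene geometry)."""
    def __init__(self, cx, cy, r):
        self.cx, self.cy, self.r = cx, cy, r

    def intersect(self, origin, direction):
        # Solve |origin + t*direction - centre|^2 = r^2 for the smallest t >= 0.
        ox, oy = origin[0] - self.cx, origin[1] - self.cy
        dx, dy = direction
        a = dx * dx + dy * dy
        b = 2.0 * (ox * dx + oy * dy)
        c = ox * ox + oy * oy - self.r * self.r
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return math.inf
        root = math.sqrt(disc)
        for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
            if t >= 0.0:
                return t
        return math.inf

class Node:
    """Inner node (axis, value, left, right) or leaf (prims only)."""
    def __init__(self, axis=None, value=None, left=None, right=None, prims=()):
        self.axis, self.value = axis, value
        self.left, self.right = left, right
        self.prims = list(prims)

    def is_leaf(self):
        return self.left is None

def trace(root, origin, direction, scene_min, scene_max):
    """Stack-based KD-tree traversal; returns the nearest hit distance or inf."""
    stack = [(root, scene_min, scene_max)]
    t_hit = math.inf
    while stack:
        node, t_start, t_end = stack.pop()
        while not node.is_leaf():
            o, d = origin[node.axis], direction[node.axis]
            # Assumption: a ray parallel to the plane never crosses it.
            t_split = (node.value - o) / d if d != 0.0 else math.inf
            if o < node.value:
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            if t_split >= t_end or t_split < 0.0:
                node = near
            elif t_split <= t_start:
                node = far
            else:
                stack.append((far, t_split, t_end))  # visit the far child later
                node = near
                t_end = t_split
        for prim in node.prims:
            t_hit = min(t_hit, prim.intersect(origin, direction))
        if t_hit < t_end:  # hit lies inside the current leaf's interval
            return t_hit
    return t_hit

# Toy scene: one split at x = 0, one circle on each side.
scene = Node(axis=0, value=0.0,
             left=Node(prims=[Circle(-2.0, 0.0, 0.5)]),
             right=Node(prims=[Circle(2.0, 0.0, 0.5)]))
```

Calling `trace(scene, (-5.0, 0.0), (1.0, 0.0), 0.0, 10.0)` terminates in the near leaf, while a ray starting at `(-1.0, 0.0)` misses the left circle and has to pop the far child from the stack before it finds a hit.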

GCN ARCHITECTURE


First introduced with the “Southern Islands” family of GPUs.

Is available with the upcoming “Kaveri” APU.

Scalar architecture.

ECC support (on some models).

Double precision support.

Multiple concurrent queues for compute.


GPU SCALAR ARCHITECTURE VS CPU SSE EXTENSIONS

Thread 1 Thread 2 Thread 3 Thread 4

float x;

x = x + 1;




Scalar code does not utilize the SSE capabilities of the CPU.


GCN

float x;

x = x + 1;

HOW SCALAR CODE IS EXECUTED

[Figure: threads T1 through T16]


IMPLICATIONS FOR RAY TRACING

Ray Packetization – having a single thread trace several rays in one KD tree traversal to achieve better utilization of the SIMD and the cache.

No explicit ray packetization is required on GCN.

The HW is implicitly packetizing every 64 threads. All 64 threads of a Wavefront execute the same instruction together.


A SEQUENCER FOR EVERY COMPUTE UNIT

[Figure: four Compute Units, each with its own sequencer (SQ)]

A sequencer is a HW block responsible for issuing program instructions.

A compute unit can run up to 40 Wavefronts, each with a distinct program counter.

GPU under-utilization due to long-traversing rays can therefore occur only at the Wavefront level: a long ray stalls only the other 63 threads of its own Wavefront.
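To make the Wavefront-level effect concrete, here is a rough Python model. It is an illustration only, not a measurement or a CodeXL metric: it assumes every lane of a Wavefront stays occupied until the longest ray in that Wavefront finishes, and the function name and the per-ray step counts are hypothetical.

```python
def wavefront_utilization(steps_per_ray, wave_size=64):
    """Fraction of lane-iterations doing useful work, one value per wavefront.

    Assumes each ray needs at least one traversal step.
    """
    result = []
    for start in range(0, len(steps_per_ray), wave_size):
        wave = steps_per_ray[start:start + wave_size]
        longest = max(wave)
        # All lanes are held for `longest` iterations; shorter rays sit idle.
        result.append(sum(wave) / (wave_size * longest))
    return result

# Two wavefronts: one contains a single long ray, the other is uniform.
rays = [10] * 63 + [100] + [10] * 64
utilization = wavefront_utilization(rays)
```

In this model the single 100-step ray drags its own Wavefront down to roughly 11% useful work, while the neighbouring Wavefront, which has its own program counter, is unaffected.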


HOW MUCH ON CHIP MEMORY DO WE HAVE?

HD 7970 – “Tahiti”

256 KB VGPR per CU X 32 = 8.192 MB

8 KB SGPR per CU X 32 = 0.256 MB

16 KB L1 V-Data cache per CU X 32 = 0.512 MB

16 KB L1 S-Data cache per 4 CUs X 8 = 0.128 MB

32 KB instruction cache per 4 CUs X 8 = 0.256 MB

L2 Data Cache = 768 KB

64 KB LDS per CU X 32 = 2.048 MB

Total: 12.16 MB
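The total can be re-derived with a few lines of Python. The sizes are taken from the slide; the dictionary names are only illustrative, and the conversion follows the slide's convention of 1 MB = 1000 KB.

```python
CUS = 32  # compute units on the HD 7970 ("Tahiti"), as stated on the slide

per_cu_kb = {"VGPR": 256, "SGPR": 8, "L1 V-Data cache": 16, "LDS": 64}
per_4_cus_kb = {"L1 S-Data cache": 16, "instruction cache": 32}
l2_kb = 768

total_kb = (sum(size * CUS for size in per_cu_kb.values())
            + sum(size * (CUS // 4) for size in per_4_cus_kb.values())
            + l2_kb)
total_mb = total_kb / 1000  # the slide treats 1 MB as 1000 KB

print(f"{total_kb} KB = {total_mb} MB")
```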


AMD CODEXL

Coherent, innovative and unified developer tools suite

‒ Debug, Profile, and Analyze applications

‒ Supports OpenCL™ and OpenGL

‒ AMD CPUs, GPUs and APUs

‒ Standalone and integrated into Microsoft® Visual Studio®

‒ Supported on Windows® and Linux®

‒ Does not require source code modifications


BE SURE YOUR KERNEL SIZE DOES NOT EXCEED INSTRUCTION CACHE SIZE

Mapping Raytracing To GPUs


HOW CAN A GPU TRAVERSE A TREE?

[Figure: a binary tree of Node objects, three levels deep]

Pack all the nodes into a single buffer and wrap the buffer in a CL mem object.

When using HSA we can leverage the unified memory architecture and access the tree as-is.
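One way to lay the nodes out in a buffer is to replace child pointers with array indices. The Python sketch below is an assumption-laden illustration, not the layout used in the talk: the 16-byte record (`axis`, `split`, `left`, `right`), the `-1` leaf marker and the dictionary-based input tree are all hypothetical, and leaf primitive lists are left out for brevity.

```python
import struct

# Hypothetical node record: int32 axis, float32 split, int32 left, int32 right.
NODE = struct.Struct("<ifii")
LEAF = (-1, 0.0, -1, -1)  # axis == -1 marks a leaf in this sketch

def flatten(node, out):
    """Append `node` and its subtree to `out`; return the node's index."""
    index = len(out)
    out.append(None)  # reserve the slot so a parent precedes its children
    if node.get("left") is None:
        out[index] = LEAF
    else:
        left = flatten(node["left"], out)
        right = flatten(node["right"], out)
        out[index] = (node["axis"], node["split"], left, right)
    return index

def pack(nodes):
    """Serialize the flattened nodes into one contiguous byte buffer."""
    return b"".join(NODE.pack(*record) for record in nodes)

# Toy tree: the root splits on x, its right child splits on y.
tree = {"axis": 0, "split": 0.5,
        "left": {},
        "right": {"axis": 1, "split": 2.0, "left": {}, "right": {}}}

nodes = []
flatten(tree, nodes)
buffer = pack(nodes)
```

The resulting `buffer` is the kind of contiguous byte block a host program would hand to the OpenCL runtime when creating the mem object; a kernel would then follow `left` and `right` as indices rather than pointers.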


HOW MUCH MEMORY DO WE NEED FOR THE STACK?

Per Wavefront = maximal depth of the tree X size of a stack frame X 64 threads.

25 X 12 X 64 = ~19 KB

Leads to GPR spilling to local memory or low scheduling utilization.

GPRs spilled to local memory are also known as Scratch Registers.

GPR spilling is decided by the OpenCL compiler at compile time.
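The stack-size estimate above can be reproduced as follows. The depth of 25 and the 12-byte frame come from the slide; reading the frame as three 4-byte values (node, tStart, tEnd) is an assumption based on what the traversal pushes, and the function name is hypothetical.

```python
def stack_bytes_per_wavefront(max_depth, frame_bytes, wave_size=64):
    """Bytes needed if every thread of a wavefront keeps a full-depth stack."""
    return max_depth * frame_bytes * wave_size

per_wavefront = stack_bytes_per_wavefront(max_depth=25, frame_bytes=12)
per_thread = per_wavefront // 64      # bytes of stack per thread
dwords_per_thread = per_thread // 4   # the same amount in 32-bit values

print(per_wavefront, per_thread, dwords_per_thread)
```

This gives 19,200 bytes per Wavefront (the "~19 KB" on the slide), or 300 bytes, i.e. 75 32-bit values, per thread, on top of whatever registers the rest of the kernel needs.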


HOW TO DETECT SCRATCH REGISTERS USING CODEXL


STACKLESS TRACE – RESTART TRAVERSAL

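[Figure: restart traversal of the 2D KD tree (nodes A through G), with the parameters tMin, t1, t2, t3 and tMax marked along the ray and the [tStart, tEnd] interval shown for each restart]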


KD RESTART ALGORITHM

tStart = tEnd = sceneMin
tHit = infinity
while (tEnd < sceneMax):
    node = root
    tStart = tEnd
    tEnd = sceneMax
    while (not node.isLeaf()):
        axis = node.axis
        tSplit = (node.PlanePos - ray.origin[axis]) / ray.direction[axis]
        (near, far) = findNear(ray.origin[axis], node.left, node.right)
        if (tSplit >= tEnd or tSplit <= 0)
            node = near
        else if (tSplit <= tStart)
            node = far
        else
            node = near
            tEnd = tSplit
    for prim in node.primitives():
        tHit = min(tHit, prim.Intersect(ray))
    if tHit < tEnd:
        return tHit
return tHit
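A runnable Python sketch of the restart loop is given below. It is an illustration under stated assumptions rather than the kernel from the talk: nodes are plain tuples, primitives are toy circles `(cx, cy, r)`, a zero direction component is treated as "never crosses the plane", and the function additionally returns the number of root-to-leaf passes so the cost of restarting is visible.

```python
import math

def hit_circle(circle, origin, direction):
    """Smallest t >= 0 where the ray meets the circle (cx, cy, r), else inf."""
    cx, cy, r = circle
    ox, oy = origin[0] - cx, origin[1] - cy
    dx, dy = direction
    a = dx * dx + dy * dy
    b = 2.0 * (ox * dx + oy * dy)
    c = ox * ox + oy * oy - r * r
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return math.inf
    root = math.sqrt(disc)
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t >= 0.0:
            return t
    return math.inf

def kd_restart(root, origin, direction, scene_min, scene_max):
    """Stackless traversal: returns (nearest hit distance or inf, passes)."""
    t_end = scene_min
    t_hit = math.inf
    passes = 0
    while t_end < scene_max:
        node = root                 # restart from the root ...
        t_start = t_end             # ... but skip the part already visited
        t_end = scene_max
        passes += 1
        while node[0] == "split":
            _, axis, pos, left, right = node
            o, d = origin[axis], direction[axis]
            t_split = (pos - o) / d if d != 0.0 else math.inf
            near, far = (left, right) if o < pos else (right, left)
            if t_split >= t_end or t_split <= 0.0:
                node = near
            elif t_split <= t_start:
                node = far
            else:
                node = near
                t_end = t_split     # remember where the next pass resumes
        for circle in node[1]:
            t_hit = min(t_hit, hit_circle(circle, origin, direction))
        if t_hit < t_end:
            return t_hit, passes
    return t_hit, passes

# Toy tree: split at x = 0; the left half is split again at x = -2.
empty = ("leaf", [])
scene = ("split", 0, 0.0,
         ("split", 0, -2.0, empty, empty),
         ("leaf", [(2.0, 0.0, 0.5)]))

result = kd_restart(scene, (-5.0, 0.0), (1.0, 0.0), 0.0, 10.0)
```

For this ray the hit is found only on the third pass from the root: the traversal trades the per-thread stack for repeated descents, which is the cost the short-stack variant later tries to reduce.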


EFFECT ON GPR SPILLAGE

Demo

Optimizing Raytracing using CodeXL


CAN THIS BE FURTHER REFINED?

Which on-chip memory aren’t we using?

LDS = Local Data Store.

Short Stack Algorithm – initialize a stack smaller than the maximum depth of the tree. If it overflows, fall back to the KD-Restart algorithm.

If we place the short stack in the LDS, what should be the depth of the “short stack”?


HOW MANY WAVEFRONTS ARE EXECUTED CONCURRENTLY

Use CodeXL application trace to discover how many Wavefronts are executed concurrently with stackless traversal


OCCUPANCY GRAPHS


WHAT SHOULD BE THE SIZE OF THE SHORT STACK?

64 KB / 12 wavefronts / 64 threads / sizeof (Frame) = 7
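The division above can be checked directly. The 64 KB LDS, the 12 concurrent Wavefronts and the 64 threads are from the slides; the 12-byte frame is carried over from the earlier stack-size estimate and is an assumption here, as is the function name.

```python
def short_stack_depth(lds_bytes, wavefronts, frame_bytes, wave_size=64):
    """Largest per-thread stack depth that fits in one compute unit's LDS."""
    bytes_per_level = wavefronts * wave_size * frame_bytes
    return lds_bytes // bytes_per_level

depth = short_stack_depth(lds_bytes=64 * 1024, wavefronts=12, frame_bytes=12)
lds_used = depth * 12 * 64 * 12  # bytes of LDS taken by all short stacks

print(depth, lds_used)
```

A depth of 7 uses 64,512 of the 65,536 bytes, so one more stack level would not fit without lowering the number of concurrent Wavefronts.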

Demo


RESULTS

[Figure: bar chart comparing four variants: full stack, stackless, short stack, and short stack on LDS; vertical axis from 60 to 120]

Results are in Million rays per second on Radeon™ HD 7970.


Questions?


DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

OpenCL™ is a trademark of Apple Inc. which is licensed to the Khronos organization. Linux™ is the trademark of Linus Torvalds.

Microsoft™ and Windows™ are the trademarks of Microsoft Corp. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.


REFERENCES

Introduction to GCN

‒ http://developer.amd.com/wordpress/media/2013/06/2620_final.pdf

GCN white paper

‒ http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf

CodeXL home page

‒ http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/

AMD OpenCL programmers guide

‒ http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf

