Date post: | 13-Jan-2015 |
Category: |
Technology |
Upload: | amd-developer-central |
OPTIMIZING RAYTRACING ON GCN WITH AMD DEVELOPMENT TOOLS
TZACHI COHEN NOVEMBER 2013
2 | Optimizing Raytracing on GCN with AMD Development Tools | NOVEMBER 2013
AGENDA
Overview of Raytracing & KD Trees
Review of GCN Architecture
Mapping Raytracing to GPUs
Optimizing Raytracing using CodeXL
Overview Of Raytracing
ACCELERATION STRUCTURES TRADE-OFFS
[Figure: Bounding Volume Hierarchies, KD Tree and Uniform Grid compared by construction speed and tracing speed.]
HIERARCHICAL KD TREE – 2D
[Figure: a 2D KD tree and the matching spatial subdivision, with nodes labeled A to G.]
KD TREE – 3D
STACK BASED TRAVERSAL KD TREE – 2D
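[Figure: a ray crossing the 2D KD tree (nodes A to G); the ray parameters tMin, t1, t2 and tMax mark where it enters the scene, crosses split planes, and leaves the scene.]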
TRAVERSING KD TREES – PSEUDO CODE
    stack.push(KDroot, sceneMin, sceneMax)
    tHit = infinity
    while !(stack.empty()):
        (node, tStart, tEnd) = stack.pop()
        while !(node.isLeaf()):
            tSplit = (node.value - ray.origin[node.axis]) / ray.direction[node.axis]
            (near, far) = findNear(ray.origin[node.axis], node.left, node.right)
            if (tSplit >= tEnd or tSplit < 0)
                node = near
            else if (tSplit <= tStart)
                node = far
            else
                stack.push(far, tSplit, tEnd)
                node = near
                tEnd = tSplit
        for prim in node.primitives():
            tHit = min(tHit, prim.Intersect(ray))
        if tHit < tEnd:
            return tHit
    return tHit
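The pseudocode above can be tried out with a minimal Python sketch. The `Circle` primitive, the `Node` layout, the hand-built `demo_tree` and the guard for rays parallel to a split plane are all assumptions made for illustration; the deck does not show its data structures.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec = Tuple[float, float]

@dataclass
class Circle:
    """Hypothetical 2D primitive standing in for prim.Intersect."""
    center: Vec
    radius: float

    def intersect(self, origin: Vec, direction: Vec) -> float:
        # Nearest t >= 0 with |origin + t*direction - center| = radius, else inf.
        ox, oy = origin[0] - self.center[0], origin[1] - self.center[1]
        dx, dy = direction
        a = dx * dx + dy * dy
        b = 2.0 * (ox * dx + oy * dy)
        c = ox * ox + oy * oy - self.radius ** 2
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return math.inf
        root = math.sqrt(disc)
        for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
            if t >= 0.0:
                return t
        return math.inf

@dataclass
class Node:
    axis: int = 0
    split: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prims: Optional[List[Circle]] = None  # a list (possibly empty) marks a leaf

def trace_stack(root: Node, origin: Vec, direction: Vec,
                t_min: float, t_max: float) -> float:
    stack = [(root, t_min, t_max)]
    t_hit = math.inf
    while stack:
        node, t_start, t_end = stack.pop()
        while node.prims is None:
            d = direction[node.axis]
            # Assumption: a ray parallel to the plane never crosses it.
            t_split = (node.split - origin[node.axis]) / d if d != 0.0 else math.inf
            if origin[node.axis] < node.split:
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            if t_split >= t_end or t_split < 0.0:
                node = near                      # only the near side is visited
            elif t_split <= t_start:
                node = far                       # only the far side is visited
            else:
                stack.append((far, t_split, t_end))  # visit far side later
                node, t_end = near, t_split
        for prim in node.prims:
            t_hit = min(t_hit, prim.intersect(origin, direction))
        if t_hit < t_end:                        # hit lies inside this leaf
            return t_hit
    return t_hit

def demo_tree() -> Node:
    # Root splits x at 4; left child splits x at 2; right child splits y at 2.
    left = Node(0, 2.0, Node(prims=[]), Node(prims=[Circle((3.0, 1.0), 0.5)]))
    right = Node(1, 2.0, Node(prims=[Circle((6.0, 1.0), 0.5)]), Node(prims=[]))
    return Node(0, 4.0, left, right)

print(trace_stack(demo_tree(), (0.0, 1.0), (1.0, 0.0), 0.0, 8.0))  # 2.5
```

The stack holds one `(node, tStart, tEnd)` frame per deferred far child, so its worst-case depth equals the depth of the tree.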
GCN ARCHITECTURE
First introduced with the “Southern Islands” family of GPUs.
Also available in the upcoming “Kaveri” APU.
Scalar architecture.
ECC support (on some models).
Double precision support.
Multiple concurrent queues for compute.
GPU SCALAR ARCHITECTURE VS CPU SSE EXTENSIONS
float x;
x = x + 1;
[Figure: Thread 1 through Thread 16, each running the scalar code above.]
Scalar code does not utilize the SSE capabilities of the CPU.
HOW SCALAR CODE IS EXECUTED
float x;
x = x + 1;
[Figure: GCN executing the scalar code above for threads T1 through T16.]
IMPLICATIONS FOR RAY TRACING
Ray packetization: a single thread traces several rays in one KD tree traversal to achieve better utilization of the SIMD units and the cache.
No explicit ray packetization is required on GCN.
The HW implicitly packetizes every 64 threads: all 64 threads of a Wavefront execute the same instruction together.
A SEQUENCER FOR EVERY COMPUTE UNIT
[Figure: four compute units, each with its own sequencer (SQ).]
A sequencer is a HW block responsible for issuing program instructions.
A compute unit can run up to 40 Wavefronts each with a distinct program counter.
GPU under-utilization caused by rays with long traversals can therefore occur only at the Wavefront level.
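How many of the 40 wavefront slots a kernel can actually fill depends, among other things, on its register usage. The rough sketch below estimates the VGPR-imposed limit. It assumes 4 SIMD units per compute unit with 10 wavefront slots each (consistent with the 40 quoted above), 64 lanes per wavefront, 4-byte registers, and the 256 KB VGPR file per CU quoted elsewhere in this deck. It ignores register allocation granularity and the separate SGPR and LDS limits, so treat the result as an upper bound only.

```python
SIMDS_PER_CU = 4               # assumption: GCN compute unit layout
MAX_WAVES_PER_SIMD = 10        # 4 x 10 = 40 wavefronts per CU
LANES_PER_WAVE = 64
REGISTER_BYTES = 4
VGPR_BYTES_PER_CU = 256 * 1024

def vgprs_per_lane() -> int:
    """Vector registers available to one lane of one SIMD."""
    return VGPR_BYTES_PER_CU // SIMDS_PER_CU // LANES_PER_WAVE // REGISTER_BYTES

def max_waves_per_cu(vgprs_per_thread: int) -> int:
    """Upper bound on resident wavefronts per CU due to VGPR usage alone."""
    per_simd = min(MAX_WAVES_PER_SIMD, vgprs_per_lane() // vgprs_per_thread)
    return per_simd * SIMDS_PER_CU

for used in (24, 84, 128):
    print(used, "VGPRs per thread ->", max_waves_per_cu(used), "wavefronts per CU")
```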
HOW MUCH ON CHIP MEMORY DO WE HAVE?
HD 7970 – “Tahiti”
256 KB VGPR per CU X 32 = 8.192 MB
8 KB SGPR per CU X 32 = 0.256 MB
16 KB L1 V-Data cache per CU X 32 = 0.512 MB
16 KB L1 S-Data cache per 4 CUs X 8 = 0.128 MB
32 KB instruction cache per 4 CUs X 8 = 0.256 MB
L2 Data Cache = 768 KB
LDS 64 KB per CU X 32 = 2.048 MB
Total: 12.16 MB
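The total can be re-derived with a few lines of Python. The figures are the ones listed above; the dictionary keys are just labels chosen for this sketch, and 1 MB is taken as 1000 KB to match the slide's arithmetic.

```python
COMPUTE_UNITS = 32
CU_GROUPS = COMPUTE_UNITS // 4   # caches shared by 4 CUs

on_chip_kb = {
    "VGPR": 256 * COMPUTE_UNITS,
    "SGPR": 8 * COMPUTE_UNITS,
    "L1 vector data cache": 16 * COMPUTE_UNITS,
    "L1 scalar data cache": 16 * CU_GROUPS,
    "instruction cache": 32 * CU_GROUPS,
    "L2 data cache": 768,
    "LDS": 64 * COMPUTE_UNITS,
}

total_kb = sum(on_chip_kb.values())
print(f"total on-chip memory: {total_kb / 1000:.2f} MB")
```

Registers (VGPR) and LDS together account for most of the total, which is why the rest of the deck focuses on those two resources.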
AMD CODEXL
Coherent, innovative and unified developer tools suite
‒ Debug, Profile, and Analyze applications
‒ Supports OpenCL™ and OpenGL
‒ AMD CPUs, GPUs and APUs
‒ Standalone and integrated into Microsoft® Visual Studio®
‒ Supported on Windows® and Linux®
‒ Does not require source code modifications
BE SURE YOUR KERNEL SIZE DOES NOT EXCEED INSTRUCTION CACHE SIZE
Mapping Raytracing To GPUs
HOW CAN A GPU TRAVERSE A TREE?
[Figure: a binary tree of nodes.]
Pack all the nodes into a single buffer and wrap the buffer in a CL mem object.
When using HSA we can leverage the unified memory architecture and access the tree as-is.
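One way to pack a pointer-based tree into a single buffer is to replace child pointers with array indices. The sketch below does this on the host side in Python. The 16-byte node layout (`axis`, `split`, `left`, `right`), the use of `-1` as a leaf marker and the dictionary-based input tree are assumptions for illustration, not the layout used in the talk. The resulting bytes are the kind of data one would hand to `clCreateBuffer`.

```python
import struct
from collections import deque

# Hypothetical node layout: int32 axis, float32 split, int32 left, int32 right.
NODE_FMT = "<ifii"
NODE_SIZE = struct.calcsize(NODE_FMT)  # 16 bytes per node

def flatten(root: dict) -> bytes:
    """Pack a tree of dicts into one buffer in breadth-first order."""
    order, index = [], {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        index[id(node)] = len(order)       # position of this node in the buffer
        order.append(node)
        for key in ("left", "right"):
            child = node.get(key)
            if child is not None:
                queue.append(child)
    buf = bytearray()
    for node in order:
        left, right = node.get("left"), node.get("right")
        buf += struct.pack(
            NODE_FMT,
            node.get("axis", -1),          # -1 marks a leaf
            node.get("split", 0.0),
            index[id(left)] if left is not None else -1,
            index[id(right)] if right is not None else -1,
        )
    return bytes(buf)

def read_node(buf: bytes, i: int) -> tuple:
    """Decode node i the way a kernel would index into the buffer."""
    return struct.unpack_from(NODE_FMT, buf, i * NODE_SIZE)

tree = {
    "axis": 0, "split": 4.0,
    "left": {"axis": 0, "split": 2.0, "left": {}, "right": {}},
    "right": {"axis": 1, "split": 2.0, "left": {}, "right": {}},
}
buf = flatten(tree)
print(len(buf), read_node(buf, 0))
```

Because children are addressed by index, the same buffer is valid on the host and on the device without any pointer fix-up.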
HOW MUCH MEMORY DO WE NEED FOR THE STACK?
Per Wavefront = maximal depth of the tree X size of a stack frame X 64 threads.
25 X 12 bytes X 64 = 19,200 bytes (~19 KB)
A stack this large leads to GPR spilling to local memory or to low scheduling utilization.
GPRs spilled to local memory are also known as Scratch Registers.
The OpenCL compiler decides on GPR spilling at compile time.
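The estimate above is easy to reproduce. In the sketch below the 12-byte frame is assumed to be a 4-byte node index plus two 4-byte floats (`tStart`, `tEnd`); the slide only gives the frame size, not its fields.

```python
import struct

# Assumed frame layout: int32 node index, float32 tStart, float32 tEnd.
FRAME_BYTES = struct.calcsize("<iff")
THREADS_PER_WAVEFRONT = 64

def full_stack_bytes(max_depth: int) -> int:
    """Stack storage needed by one wavefront for a tree of the given depth."""
    return max_depth * FRAME_BYTES * THREADS_PER_WAVEFRONT

needed = full_stack_bytes(25)
print(needed, "bytes =", needed / 1024, "KB")
```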
HOW TO DETECT SCRATCH REGISTERS USING CODEXL
STACKLESS TRACE – RESTART TRAVERSAL
[Figure: restart traversal of the 2D KD tree (nodes A to G); the ray parameters tMin, t1, t2, t3 and tMax mark the intervals handled by successive restarts from the root.]
KD RESTART ALGORITHM
    tStart = tEnd = sceneMin
    tHit = infinity
    while (tEnd < sceneMax):
        node = root
        tStart = tEnd
        tEnd = sceneMax
        while (not node.isLeaf()):
            axis = node.axis
            tSplit = (node.PlanePos - ray.origin[axis]) / ray.direction[axis]
            (near, far) = findNear(ray.origin[axis], node.left, node.right)
            if (tSplit >= tEnd or tSplit <= 0)
                node = near
            else if (tSplit <= tStart)
                node = far
            else
                node = near
                tEnd = tSplit
        for prim in node.primitives():
            tHit = min(tHit, prim.Intersect(ray))
        if tHit < tEnd:
            return tHit
    return tHit
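A runnable Python sketch of the restart traversal follows. As before, the `Circle` primitive, the `Node` layout, the hand-built `demo_tree` and the guard for rays parallel to a split plane are illustrative assumptions rather than the presenter's code.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec = Tuple[float, float]

@dataclass
class Circle:
    """Hypothetical 2D primitive standing in for prim.Intersect."""
    center: Vec
    radius: float

    def intersect(self, origin: Vec, direction: Vec) -> float:
        # Nearest t >= 0 with |origin + t*direction - center| = radius, else inf.
        ox, oy = origin[0] - self.center[0], origin[1] - self.center[1]
        dx, dy = direction
        a = dx * dx + dy * dy
        b = 2.0 * (ox * dx + oy * dy)
        c = ox * ox + oy * oy - self.radius ** 2
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return math.inf
        root = math.sqrt(disc)
        for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
            if t >= 0.0:
                return t
        return math.inf

@dataclass
class Node:
    axis: int = 0
    split: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prims: Optional[List[Circle]] = None  # a list (possibly empty) marks a leaf

def trace_restart(root: Node, origin: Vec, direction: Vec,
                  t_min: float, t_max: float) -> float:
    t_end = t_min
    t_hit = math.inf
    while t_end < t_max:
        # Restart from the root; the interval begins where the last leaf ended.
        node, t_start, t_end = root, t_end, t_max
        while node.prims is None:
            axis = node.axis
            d = direction[axis]
            # Assumption: a ray parallel to the plane never crosses it.
            t_split = (node.split - origin[axis]) / d if d != 0.0 else math.inf
            if origin[axis] < node.split:
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            if t_split >= t_end or t_split <= 0.0:
                node = near
            elif t_split <= t_start:
                node = far            # near side was covered by an earlier pass
            else:
                node, t_end = near, t_split   # no push: far side found on restart
        for prim in node.prims:
            t_hit = min(t_hit, prim.intersect(origin, direction))
        if t_hit < t_end:
            return t_hit
    return t_hit

def demo_tree() -> Node:
    # Root splits x at 4; left child splits x at 2; right child splits y at 2.
    left = Node(0, 2.0, Node(prims=[]), Node(prims=[Circle((3.0, 1.0), 0.5)]))
    right = Node(1, 2.0, Node(prims=[Circle((6.0, 1.0), 0.5)]), Node(prims=[]))
    return Node(0, 4.0, left, right)

print(trace_restart(demo_tree(), (0.0, 1.0), (1.0, 0.0), 0.0, 8.0))  # 2.5
```

The per-thread state shrinks to `tStart`, `tEnd` and the current node, at the cost of walking down from the root once per visited leaf.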
EFFECT ON GPR SPILLAGE
Demo
Optimizing Raytracing using CodeXL
CAN THIS BE FURTHER REFINED?
Which on-chip memory aren’t we using?
LDS = Local Data Store.
Short Stack Algorithm – initialize a stack smaller than the maximum depth of the tree. If it overflows, fall back to the KD-Restart algorithm.
If we place the short stack in the LDS, how deep should the short stack be?
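The short stack idea can be sketched in Python as follows. This version makes one specific assumption about the fallback: when the bounded stack is full, the oldest frame is dropped, and when the stack runs dry before the ray has left the scene, traversal restarts from the root as in KD-Restart. The deck does not spell out its overflow policy, and the `Circle`, `Node` and `demo_tree` helpers are illustrative only.

```python
import math
from collections import deque
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec = Tuple[float, float]

@dataclass
class Circle:
    """Hypothetical 2D primitive standing in for prim.Intersect."""
    center: Vec
    radius: float

    def intersect(self, origin: Vec, direction: Vec) -> float:
        ox, oy = origin[0] - self.center[0], origin[1] - self.center[1]
        dx, dy = direction
        a = dx * dx + dy * dy
        b = 2.0 * (ox * dx + oy * dy)
        c = ox * ox + oy * oy - self.radius ** 2
        disc = b * b - 4.0 * a * c
        if disc < 0.0:
            return math.inf
        root = math.sqrt(disc)
        for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
            if t >= 0.0:
                return t
        return math.inf

@dataclass
class Node:
    axis: int = 0
    split: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prims: Optional[List[Circle]] = None  # a list (possibly empty) marks a leaf

def trace_short_stack(root: Node, origin: Vec, direction: Vec,
                      t_min: float, t_max: float, capacity: int) -> float:
    # A deque with maxlen silently drops its oldest entry when full.
    stack: deque = deque(maxlen=capacity)
    t_end = t_min
    t_hit = math.inf
    while t_end < t_max:
        if stack:
            node, t_start, t_end = stack.pop()
        else:
            node, t_start, t_end = root, t_end, t_max   # KD-Restart fallback
        while node.prims is None:
            axis = node.axis
            d = direction[axis]
            t_split = (node.split - origin[axis]) / d if d != 0.0 else math.inf
            if origin[axis] < node.split:
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            if t_split >= t_end or t_split <= 0.0:
                node = near
            elif t_split <= t_start:
                node = far
            else:
                stack.append((far, t_split, t_end))
                node, t_end = near, t_split
        for prim in node.prims:
            t_hit = min(t_hit, prim.intersect(origin, direction))
        if t_hit < t_end:
            return t_hit
    return t_hit

def demo_tree() -> Node:
    left = Node(0, 2.0, Node(prims=[]), Node(prims=[Circle((3.0, 1.0), 0.5)]))
    right = Node(1, 2.0, Node(prims=[Circle((6.0, 1.0), 0.5)]), Node(prims=[]))
    return Node(0, 4.0, left, right)

for depth in (0, 1, 8):
    print(depth, trace_short_stack(demo_tree(), (0.0, 1.0), (1.0, 0.0), 0.0, 8.0, depth))
```

With `capacity=0` the function behaves like pure KD-Restart, and with a capacity at least as large as the tree depth it never restarts, so the capacity trades memory for redundant root-to-leaf walks.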
HOW MANY WAVEFRONTS ARE EXECUTED CONCURRENTLY
Use the CodeXL application trace to discover how many Wavefronts execute concurrently with stackless traversal.
OCCUPANCY GRAPHS
WHAT SHOULD BE THE SIZE OF THE SHORT STACK?
64 KB / 12 wavefronts / 64 threads / sizeof(Frame) = 7 frames (rounded down, with a 12-byte frame)
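The same division in Python, as a small sketch. It assumes 1 KB = 1024 bytes and the 12-byte frame used in the full-stack estimate; the function name and its defaults are chosen here for illustration.

```python
def short_stack_depth(lds_bytes: int = 64 * 1024,
                      wavefronts: int = 12,
                      threads_per_wavefront: int = 64,
                      frame_bytes: int = 12) -> int:
    """Frames per thread that fit when all resident wavefronts share the LDS."""
    return lds_bytes // (wavefronts * threads_per_wavefront * frame_bytes)

print(short_stack_depth())
```

Fewer concurrent wavefronts would allow a deeper short stack, so the depth should be recomputed whenever the measured occupancy changes.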
Demo
RESULTS
[Figure: bar chart comparing four variants: full stack, stackless, short stack, and short stack on LDS. The vertical axis runs from 60 to 120.]
Results are in Million rays per second on Radeon™ HD 7970.
Questions?
DISCLAIMER & ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
OpenCL™ is a trademark of Apple Inc. which is licensed to the Khronos organization. Linux™ is the trademark of Linus Torvalds.
Microsoft™ and Windows™ are the trademarks of Microsoft Corp. All other names used in this presentation are for
informational purposes only and may be trademarks of their respective owners.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.