
Efficient Partitioning of Fragment Shaders for Multiple-Output Hardware

Tim Foley, Mike Houston, Pat Hanrahan

Computer Graphics Lab, Stanford University


Motivation

- GPU programming
  - Interactive shading
  - Offline rendering
  - Computation
    - physical simulations
    - numerical methods
    - BrookGPU [Buck et al. 2004]

- Shouldn’t be constrained by hardware limits, but demand high runtime performance


Motivation – Multipass Partitioning

- Divide GPU program (shader) into a partition
  - set of rendering passes
  - each pass satisfies all resource constraints
  - save/restore intermediate values in textures

- Many possible partitions exist

- The problem: given a program, find the best partition


Related Work

- SGI’s ISL [Peercy et al. 2000]
  - treat OpenGL machine as SIMD processor

- Recursive Dominator Split (RDS) [Chan et al. 2002]
  - graph partitioning of shader DAG

- Data-Dependent Multipass Control Flow on GPU [Popa and McCool 2004]
  - partition around flow control and schedule passes

- Mio [Riffel et al. 2004]
  - instruction scheduling with backtracking


Contribution

Merging Recursive Dominator Split (MRDS)

- MRDS extends RDS
  - supports shaders with multiple outputs
  - supports hardware with multiple render targets
  - generates better-performing partitions
  - same running time as RDS


Outline

- Motivation
- Related Work
- RDS Algorithm
- MRDS Algorithm
- Results
- Future Work


RDS - Overview

- Input: DAG of n nodes
  - shader ops
  - inputs: interpolants, constants, textures

- Goal: mark subset of nodes as splits
  - split nodes define pass boundaries
  - 2^n possible subsets
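
To make the search space concrete, here is a minimal Python sketch of how a set of split nodes could turn a shader DAG into passes. The DAG, the node names, and the helpers `passes_for` and `all_split_subsets` are hypothetical illustrations, not code from the RDS implementation. It assumes a split operand is read back from a texture, while a non-split operand is recomputed in every pass that needs it.

```python
from itertools import combinations

# Hypothetical shader DAG: each op maps to the ops it reads.
# Shader inputs (interpolants, constants, textures) are left out for brevity.
dag = {
    "out": ["a", "b"],
    "a": ["m"],
    "b": ["m"],
    "m": [],  # multiply referenced node
}

def passes_for(dag, root, splits):
    """Return {pass_root: set of ops evaluated in that pass}."""
    result = {}
    # The final output always ends a pass; every split node ends one too.
    for pass_root in sorted(set(splits) | {root}):
        ops, stack = set(), [pass_root]
        while stack:
            node = stack.pop()
            if node in ops:
                continue
            ops.add(node)
            for operand in dag[node]:
                # Assumption: a split operand is fetched from a texture,
                # so the pass does not descend into it.
                if operand not in splits:
                    stack.append(operand)
        result[pass_root] = ops
    return result

def all_split_subsets(dag):
    """Enumerate every subset of nodes that could be marked as splits."""
    nodes = sorted(dag)
    for k in range(len(nodes) + 1):
        for subset in combinations(nodes, k):
            yield set(subset)

saved = passes_for(dag, "out", {"m"})            # m saved once, read twice
recomputed = passes_for(dag, "out", {"a", "b"})  # m recomputed in two passes
subset_count = sum(1 for _ in all_split_subsets(dag))
```

With four nodes, `subset_count` equals 2^4, which is why exhaustive search is impractical for real shaders.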


RDS - Overview

Combination of approaches to limit search space

Save/recompute decisions primary performance tradeoff

Dominator tree used to avoid save/recompute tradeoffs


RDS – Save / Recompute

M – multiply referenced node


Dominator

B dom G: all paths to G go through B


Dominator Tree


Key Insight

if B, G in same pass and B dom G, then no save/recompute costs for G
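
As a rough illustration of the dominator relation, the sketch below computes dominator sets for a small DAG rooted at the shader output. The graph and the `dominators` helper are hypothetical and not the graph from the slides. It uses the textbook definition: x dominates y when every path from the root to y passes through x.

```python
def dominators(dag, root):
    """dag maps each node to its operands; returns node -> set of dominators."""
    # Invert the edges so each node knows which nodes use it.
    parents = {node: [] for node in dag}
    for node, operands in dag.items():
        for operand in operands:
            parents[operand].append(node)

    # Reverse DFS post-order gives a topological order: users before operands.
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for operand in dag[node]:
            visit(operand)
        order.append(node)

    visit(root)
    order.reverse()

    dom = {root: {root}}
    for node in order[1:]:
        # A node's dominators are itself plus whatever dominates all its users.
        preds = [dom[p] for p in parents[node] if p in dom]
        dom[node] = {node} | set.intersection(*preds)
    return dom

# Hypothetical graph: G is multiply referenced (by C and D), yet B dominates it.
dag = {"B": ["C", "D"], "C": ["G"], "D": ["G"], "G": []}
dom = dominators(dag, "B")
print(sorted(dom["G"]))  # ['B', 'G']
```

Here neither C nor D dominates G, but B does, so keeping B and G in the same pass would leave no save/recompute decision to make for G.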


MRDS – Multiple-Output Shaders


MRDS – Multiple-Output Hardware

float4 x, y;
...
for( i=0; i<N; i++ ) {
    x' = x*x - y*y;
    y' = 2*x*y;
    x = x'; y = y';
}
...


MRDS – Multiple-Output Hardware

State cannot fit in single output

float4 x, y;
...
for( i=0; i<N; i++ ) {
    x' = f( x, y );
    y' = g( x, y );
    x = x'; y = y';
}
...


MRDS – Dominating Sets

- Dominating set S = {A, D}
- S dom G: all paths to G go through an element of S
- S, G in same pass: avoid save/recompute for G
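
One way to test whether a set dominates a node is to search from the shader outputs while refusing to pass through members of the set. If the target is still reachable, the set does not dominate it. The sketch below takes that approach. The graph, the output names `o1` and `o2`, and the helper `set_dominates` are assumptions made for illustration.

```python
def set_dominates(dag, roots, dom_set, target):
    """True if every path from any root to target hits a member of dom_set."""
    stack, seen = list(roots), set()
    while stack:
        node = stack.pop()
        if node in seen or node in dom_set:
            continue  # paths through dom_set members are cut off here
        if node == target:
            return False  # reached the target without touching dom_set
        seen.add(node)
        stack.extend(dag[node])
    return True

# Hypothetical two-output shader: A and D both read G.
dag = {"o1": ["A"], "o2": ["D"], "A": ["G"], "D": ["G"], "G": []}
roots = ["o1", "o2"]

only_a = set_dominates(dag, roots, {"A"}, "G")       # G still reachable via D
both = set_dominates(dag, roots, {"A", "D"}, "G")    # every path is covered
```

In this graph no single node other than G itself dominates G, which is the situation where a dominating set such as {A, D} becomes useful.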


MRDS – Pass Merging

- Generate initial passes with RDS

- Find potential merges
  - check if valid
  - evaluate change in cost

- Execute from best to worst
  - revalidate

- Stop when no more beneficial merges
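
The merging loop above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation. The cost model (`PASS_OVERHEAD` plus one unit per op), the limits `MAX_OUTPUTS` and `MAX_OPS`, and the `is_valid` check standing in for the compiler call are all assumptions. Dependencies between passes are ignored.

```python
from itertools import combinations

PASS_OVERHEAD = 10  # assumed fixed cost of issuing one pass
MAX_OUTPUTS = 4     # assumed render-target limit
MAX_OPS = 8         # assumed instruction limit per pass

def cost(ops):
    return PASS_OVERHEAD + len(ops)

def is_valid(ops, outputs):
    # Stand-in for the compiler call that checks resource constraints.
    return len(outputs) <= MAX_OUTPUTS and len(ops) <= MAX_OPS

def merge_passes(passes):
    """passes: {split name: set of ops in the pass that produces it}."""
    group_of = {name: name for name in passes}
    groups = {name: ({name}, set(ops)) for name, ops in passes.items()}

    def saving(a, b):
        outs = groups[a][0] | groups[b][0]
        ops = groups[a][1] | groups[b][1]
        if not is_valid(ops, outs):
            return None
        return cost(groups[a][1]) + cost(groups[b][1]) - cost(ops)

    # Initial search: score every pair of passes once.
    candidates = []
    for a, b in combinations(sorted(passes), 2):
        s = saving(a, b)
        if s is not None and s > 0:
            candidates.append((s, a, b))
    candidates.sort(reverse=True)  # best merge first

    for _, a, b in candidates:
        ra, rb = group_of[a], group_of[b]
        if ra == rb:
            continue  # already in the same pass
        s = saving(ra, rb)  # revalidate against the current passes
        if s is None or s <= 0:
            continue
        outs = groups[ra][0] | groups[rb][0]
        ops = groups[ra][1] | groups[rb][1]
        groups[ra] = (outs, ops)
        del groups[rb]
        for name in outs:
            group_of[name] = ra
    return [(sorted(outs), sorted(ops)) for outs, ops in groups.values()]

# Passes A and D both recompute G; pass E is too large to merge with either.
example = {
    "A": {"A", "G"},
    "D": {"D", "G"},
    "E": {"E", "e1", "e2", "e3", "e4", "e5", "e6"},
}
merged = merge_passes(example)
```

In the example, merging A and D removes one pass and one duplicate evaluation of G, so it scores highest. Adding E to either would exceed the assumed instruction limit.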


MRDS – Pass Merging

- What if RDS chose to recompute G?

- Merge between passes A and D
  - eliminates duplicate instructions
  - gets high score


MRDS – Time Complexity

- Cost of merging dominated by initial search
  - iterates over s^2 pairs of splits
  - each pair requires size-s set operations and 1 compiler call
  - O(s^2 (s + n))

- s = O(n) in worst case
  - MRDS = O(n^3) in worst case
  - in practice we expect s << n

- Assumes compiler calls are linear
  - not true for fxc


MRDS'

- RDS uses linear search for save/recompute
  - evaluates cost of both alternatives with RDS_h
  - RDS = O(n * RDS_h) = O(n^3)

- MRDS merges after RDS has made these decisions
  - MRDS = O(RDS + n^3) = O(n^3)

- MRDS' merges during cost evaluation
  - adds linear factor in worst case
  - MRDS' = O(n * (RDS_h + n^3)) = O(n^4)
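
The worst-case bounds on the two complexity slides can be chained together as below. The step RDS_h = O(n^2) is not stated on the slides. It is inferred here from RDS = O(n * RDS_h) = O(n^3), so treat it as an assumption.

```latex
\begin{align*}
\text{RDS} &= O(n \cdot \text{RDS}_h) = O(n^3)
  \;\Rightarrow\; \text{RDS}_h = O(n^2) \\
\text{merge} &= O\!\left(s^2 (s + n)\right)
  = O\!\left(n^2 \cdot 2n\right) = O(n^3) \quad \text{for } s = O(n) \\
\text{MRDS} &= O(\text{RDS} + n^3) = O(n^3) \\
\text{MRDS}' &= O\!\left(n \cdot (\text{RDS}_h + n^3)\right)
  = O\!\left(n \cdot (n^2 + n^3)\right) = O(n^4)
\end{align*}
```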


Results

- 3 Brook programs
  - Procedural Fire
  - Mandelbrot Fractal
  - Matrix Multiply

- Compiled for ATI Radeon 9800 XT with
  - RDS
  - MRDS
  - MRDS'


Results – Procedural Fire

- MRDS' better than MRDS and RDS
  - better save/recompute decisions
  - results in less bandwidth used

[Chart: Time (ns) for RDS, MRDS, MRDS'; axis range 0 to 3500]


Results – Compile Times

[Chart: compile times for Fire, Fractal, Matrix under RDS, MRDS, MRDS'; axis range 0 to 3.5]


Results – Mandelbrot Fractal

- MRDS', MRDS better than RDS
  - iterative computation – state in 2 variables
  - RDS duplicates computation

[Chart: Time (ns) for RDS, MRDS, MRDS'; axis range 0 to 140]


Results – Matrix Multiply

- Matrix-matrix multiply benefits from blocking
  - blocking cuts computation by ~2

- Blocking requires multiple outputs
  - performance limited by MRT performance

[Chart: Time (ns) for RDS, MRDS, MRDS'; axis range 0 to 400]


Summary

- Modified RDS algorithm, MRDS
  - supports multiple-output shaders
  - generates code for multiple render targets
  - easy to implement, same running time
  - generates better-performing partitions


Future Work

- Implementations
  - Ashli
  - combine with Mio

- Exploit new hardware
  - data-dependent flow control
  - large numbers of outputs


Acknowledgements

- Eric Chan, Ren Ng, Pradeep Sen, Kekoa Proudfoot: RDS implementation, design discussions

- Kayvon Fatahalian, Ian Buck: GPUBench results

- ATI: hardware

- DARPA, ATI, IBM, NVIDIA, SONY: funding
