1 A Single (Unified) Shader GPU Microarchitecture for Embedded Systems Victor Moya, Carlos...

Date post: 15-Jan-2016
Page 1: 1 A Single (Unified) Shader GPU Microarchitecture for Embedded Systems Victor Moya, Carlos González, Jordi Roca, Agustín Fernández Department of Computer.

A Single (Unified) Shader GPU Microarchitecture for Embedded Systems

Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
Department of Computer Architecture, UPC

Roger Espasa
Intel DEG Barcelona

Page 2: Introduction

- Graphics, and specifically 3D graphics, have become an important element in current PDAs, mobile phones and other handheld systems
  - OpenGL ES: a simplified OpenGL specification for embedded systems
- The classic GPU architecture for the PC is not suited to embedded systems
  - Low power
  - Low area budget
- We propose a single unified shader GPU architecture for embedded systems

Page 3: Outline

- ATTILA PC
- ATTILA Embedded
- Triangle Setup in the Shader Unit
- ATTILA Simulation Framework
- Results

Page 5: Attila Classic for PCs

- Optimized for large resolutions
  - Above 1024x768
- Optimized for high performance
- High power requirements
  - No power optimizations
  - 100+ watts on current high-end GPUs
- Large area budget
  - 300+ million transistors on current high-end GPUs
- Large amount of dedicated memory bandwidth
  - 40+ GB/s on current high-end GPUs
- Specialized Shader Units
  - 2 to 8 vertex shader units
  - 1 to 6 fragment shader units

Page 6: Attila PC

[Figure: block diagram of the Attila PC pipeline. Blocks shown: Vertex Fetch, Vertex Shader (x2), Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Fragment Shader (x2), ROP (x2), Memory Controller (x2). Annotations: "Specialized Shaders"; "Four fragments processed in parallel".]

Page 8: Embedded Requirements

- Optimized for small resolutions
  - 320x240 to 640x480
- Optimized for low power
  - Reduce frequency
  - Power optimizations
  - Improve efficiency
- Small area budget
  - Remove non-crucial hardware
- Low available bandwidth
- Reduced shading power
- Reduce design complexity

Page 9: Attila Embedded

- No Hierarchical Z
- No Z compression
- Single unified shader
  - 1 SIMD ALU
  - Multithreaded
    - 16 threads of four vertex/triangle/fragment elements
    - 16 128-bit registers for temporary storage available per thread
  - Texture unit outputs 1 bilinear sample for a whole fragment quad every 4 cycles
  - 4 KB texture cache
- ROP
  - One Z value and one color value updated per cycle in the framebuffer (a fragment quad every 4 cycles)
- Single 64-bit DDR channel
  - Limited by the current simulator implementation
  - Treated as equivalent to a small (1 MB) embedded DRAM
- 32-bit high-latency bus to large system memory for textures

Page 10: Attila Embedded

[Figure: block diagram of the Attila Embedded pipeline. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Rasterization, Scheduler, Distributor, Shader, ROP, Memory Controller. Labels: "Vertices", "Triangles", "Fragments". Annotations: "Single Unified Shader"; "Single fragment per cycle pipeline".]

Page 12: Triangle Setup in the Shader

- 2D Homogeneous Rasterization
  - Olano & Greer
- Triangle setup algorithm:
  - Calculate the setup matrix from the triangle vertex matrix
  - Calculate the interpolation equation for fragment Z
  - Cull triangles based on their facing direction (area sign)
- Algorithm suited for a SIMD implementation in the Unified Shader
- Inputs:
  - Four 3-component vectors as input for the triangle vertex positions
- Outputs:
  - Three 4-component vectors as output for the triangle edge and Z interpolation equation coefficients
  - One signed triangle area register as output for the face culling stage
- 26-instruction Triangle Shader program
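
The three setup steps listed above can be sketched in a few lines. This is only an illustration of the general 2D homogeneous setup idea, written in Python under stated assumptions; it is not the 26-instruction shader program from the slide. The function names (`triangle_setup`, `inside`) and the per-vertex `(x, y, z, w)` layout are hypothetical. The sketch assumes the setup matrix is the adjugate of the 3x3 matrix built from the vertices' `(x, y, w)` values, that the matrix determinant plays the role of the signed area, and that Z is interpolated linearly from the per-vertex depth.

```python
def cross(a, b):
    """Cross product of two 3-component vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-component vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangle_setup(v0, v1, v2):
    """Each vertex is (x, y, z, w) in homogeneous coordinates (assumed layout).

    Returns (edges, z_eq, signed_area), or None for a zero-area triangle.
    """
    # Columns of the vertex matrix: one (x, y, w) vector per vertex.
    c = [(v[0], v[1], v[3]) for v in (v0, v1, v2)]
    # Rows of the adjugate of the vertex matrix: one edge equation
    # (a, b, c) per vertex, scaled by the determinant.
    edges = [cross(c[1], c[2]), cross(c[2], c[0]), cross(c[0], c[1])]
    # Determinant of the vertex matrix; its sign gives the facing direction.
    signed_area = dot(c[0], edges[0])
    if signed_area == 0:
        return None  # degenerate triangle, nothing to rasterize
    # Z interpolation equation: combine the edge equations with the
    # per-vertex depths, then undo the determinant scaling.
    z = (v0[2], v1[2], v2[2])
    z_eq = tuple(sum(z[i] * edges[i][k] for i in range(3)) / signed_area
                 for k in range(3))
    return edges, z_eq, signed_area


def inside(edges, signed_area, x, y):
    """True if pixel (x, y) passes all three edge tests."""
    sign = 1 if signed_area > 0 else -1
    return all(sign * (a * x + b * y + c) >= 0 for a, b, c in edges)
```

For example, `triangle_setup((0, 0, 0.1, 1), (1, 0, 0.5, 1), (0, 1, 0.9, 1))` yields a positive signed area, and swapping the last two vertices flips the sign, which is the value a culling stage would test.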

Page 13: Triangle Setup in the Shader

- Benefits
  - Reduce area
    - No specialized hardware required for triangle setup
  - Reduce design complexity
  - Improve efficiency
    - The graphics workload in embedded applications may not fully utilize specialized triangle setup hardware in most cases
    - Higher utilization of the shader
- Costs
  - Shader workload increases
  - Rerouting of the rasterization pipeline required

Page 15: ATTILA Simulation Framework

[Figure: framework diagram in four phases: Collect, Verify, Simulate, Analyze. Components shown: OpenGL Application, GLInterceptor, Vendor OpenGL Driver, ATI R520/NVidia G70, Trace, GLPlayer, ATTILA OpenGL Driver, ATTILA Simulator, Framebuffer (x3), Signal Visualizer, Statistics, Signal Traffic. Two "CHECK!" annotations.]

Page 16: GLInterceptor

[Figure: same framework diagram as on Page 15.]

- Capture a trace of OpenGL API calls from a real game

Page 17: GLPlayer

[Figure: same framework diagram as on Page 15.]

- Replay the captured trace
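
The capture and replay steps can be illustrated with a toy sketch. It is not how GLInterceptor or GLPlayer are implemented; the classes `Recorder` and `FakeGL`, the function `replay`, and the two API calls are hypothetical stand-ins. The only idea shown is that every call is logged with its arguments while being forwarded to the real backend, and that the log can later be fed to a different backend.

```python
class Recorder:
    """Wraps a backend, logs each call with its arguments, then forwards it."""

    def __init__(self, backend):
        self._backend = backend
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not found on the Recorder itself,
        # i.e. the API calls of the wrapped backend.
        target = getattr(self._backend, name)

        def wrapper(*args):
            self.trace.append((name, args))
            return target(*args)

        return wrapper


def replay(trace, backend):
    """Issue every recorded call, in order, against another backend."""
    for name, args in trace:
        getattr(backend, name)(*args)


class FakeGL:
    """Hypothetical stand-in for a driver; it just remembers what it was asked."""

    def __init__(self):
        self.calls = []

    def clear(self, r, g, b):
        self.calls.append(("clear", (r, g, b)))

    def draw_triangles(self, count):
        self.calls.append(("draw_triangles", (count,)))


vendor = FakeGL()
app_view = Recorder(vendor)      # the application talks to the recorder
app_view.clear(0, 0, 0)
app_view.draw_triangles(12)

simulated = FakeGL()
replay(app_view.trace, simulated)
```

After the replay, both backends have seen the same call sequence, which is what makes a framebuffer comparison between the two paths meaningful.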

Page 18: OpenGL Library and Driver

[Figure: same framework diagram as on Page 15.]

OpenGL Library
- Transform Fixed Function API into Shader code
- 200 API calls supported
- ARB Vertex and Fragment extensions
- Alpha and Fog emulated via Shader code

Driver
- Low level interface to GPU hardware
- Attila memory management
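
As a rough illustration of what emulating fixed-function state in shader code involves, the sketch below expresses an alpha test and linear fog as plain per-fragment arithmetic. It assumes that "Alpha" on the slide refers to the fixed-function alpha test (here with a greater-than comparison) and that fog uses the linear mode defined by fixed-function OpenGL. The function names and parameters are hypothetical, and this is not the code generated by the ATTILA library.

```python
def linear_fog_factor(eye_dist, fog_start, fog_end):
    """Linear fog: 1.0 means no fog, 0.0 means fully fogged."""
    f = (fog_end - eye_dist) / (fog_end - fog_start)
    return min(1.0, max(0.0, f))


def shade_fragment(color, eye_dist, fog_color, fog_start, fog_end, alpha_ref):
    """Apply alpha test, then fog. Returns None for a discarded fragment."""
    r, g, b, a = color
    # Alpha test with a greater-than comparison (assumed mode):
    # fragments that fail are discarded.
    if not a > alpha_ref:
        return None
    # Blend the fragment color toward the fog color; alpha is untouched.
    f = linear_fog_factor(eye_dist, fog_start, fog_end)
    rgb = tuple(f * c + (1.0 - f) * fc for c, fc in zip((r, g, b), fog_color))
    return rgb + (a,)
```

A fragment halfway between `fog_start` and `fog_end` gets a fog factor of 0.5, so its color lands midway between the fragment color and the fog color.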

Page 19: ATTILA Simulator

[Figure: same framework diagram as on Page 15.]

- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Execute@Execute: functionality embedded at each pipeline stage
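
A minimal sketch of a cycle-by-cycle model built from boxes may help picture this. The term "box" comes from the slide and "signal" from the framework diagram; the classes `Signal` and `Box` and the function `simulate` below are hypothetical and do not reproduce the simulator's actual interfaces. Each box performs its function in the cycle it receives data (in the spirit of execute-at-execute), and each signal delivers data after a fixed latency.

```python
from collections import deque


class Signal:
    """Fixed-latency connection between two boxes."""

    def __init__(self, latency):
        self.latency = latency
        self.in_flight = deque()  # entries are (arrival_cycle, data)

    def write(self, cycle, data):
        self.in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        # Return the oldest item if it has arrived by this cycle.
        if self.in_flight and self.in_flight[0][0] <= cycle:
            return self.in_flight.popleft()[1]
        return None


class Box:
    """One pipeline stage: computes its result when the data arrives."""

    def __init__(self, func, inp, out):
        self.func, self.inp, self.out = func, inp, out

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.out.write(cycle, self.func(data))


def simulate(values, cycles):
    """Two-box pipeline; returns (cycle, value) pairs seen at the output."""
    s0, s1, s2 = Signal(1), Signal(2), Signal(1)
    boxes = [Box(lambda x: x + 1, s0, s1), Box(lambda x: x * 2, s1, s2)]
    results = []
    for cycle in range(cycles):
        if cycle < len(values):
            s0.write(cycle, values[cycle])  # inject one input per cycle
        for box in boxes:
            box.clock(cycle)
        out = s2.read(cycle)
        if out is not None:
            results.append((cycle, out))
    return results
```

With signal latencies of 1, 2 and 1 cycles, a value injected at cycle 0 reaches the output at cycle 4, so the total pipeline depth is the sum of the signal latencies.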

Page 20: Spot the differences

[Figure: two rendered frames side by side, labeled "Attila" and "NVidia GeForce FX 5900XT".]

Page 22: Benchmark

- Unreal Tournament 2004
  - NOT AN EMBEDDED BENCHMARK
    - Up to 300K vertices per frame!
  - Fixed function OpenGL API
    - Vertex and fragment shaders generated by our library
- 320x240 resolution
- 140 of 450 frames simulated
- 100+ frames ~ 1 day of simulation
  - On a Xeon P4 @ 2.0 GHz

Page 23: Configurations

- We have evaluated
  - 3 mid-range to low-end PC GPU configurations
  - 2 configurations for chipset-integrated GPUs and high-end PDA GPUs
  - 4 embedded low-end GPU configurations
- We tried to keep a balance between memory bandwidth and shading compute power
  - From 4 to no vertex shader units
  - From 2 quad fragment shader units to a single unified shader unit
  - From four to one 64-bit DDR memory channels
  - Store the framebuffer in small (1 MB) GPU memory and textures in system memory
- Halved the frequency for embedded systems
  - Restricted design rules
  - Reduce power consumption
- Removed all optional features at the low end
  - Hierarchical Z
  - Z compression
  - Specialized Triangle Setup hardware
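
A quick capacity check shows why a 1 MB GPU memory is plausible for the framebuffer at the benchmark resolution. The buffer layout is an assumption, not something the slides state: 32-bit color, double buffering, and a single 32-bit depth/stencil buffer. The function name `framebuffer_bytes` is hypothetical.

```python
BYTES_PER_PIXEL = 4          # assumption: 32-bit color, 32-bit depth/stencil
EDRAM_BYTES = 1 * 1024 * 1024


def framebuffer_bytes(width, height, color_buffers=2, depth_buffers=1):
    """Total storage for color and depth buffers at a given resolution."""
    pixels = width * height
    return pixels * BYTES_PER_PIXEL * (color_buffers + depth_buffers)


qvga = framebuffer_bytes(320, 240)   # 921600 bytes
vga = framebuffer_bytes(640, 480)    # 3686400 bytes
print(qvga, qvga <= EDRAM_BYTES)
print(vga, vga <= EDRAM_BYTES)
```

Under these assumptions the 320x240 framebuffer fits in 1 MB, while a 640x480 framebuffer would not.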

Page 24: Evaluated Configurations

Conf  Res   MHz  VSh  (F)Sh  Fetch Way  Regs Thread  Setup   Buses  Cache  eDRAM  HZ  Z Compr
A     1024  400  4    2x4    2          16x32        Fixed   4      16 KB  -      Y   Y
B     320   400  4    2x4    2          16x32        Fixed   4      16 KB  -      Y   Y
C     320   400  2    1x4    2          16x32        Fixed   2      16 KB  -      Y   Y
D     320   400  2    1x4    2          16x32        Fixed   2      8 KB   -      N   Y
E     320   200  -    1x2    2          8x32         Fixed   1      8 KB   -      N   Y
F     320   200  -    1x2    2          8x32         Fixed   1      4 KB   -      N   N
G     320   200  -    1x1    2          16x16        Fixed   1      4 KB   -      N   N
H     320   200  -    1x1    1          16x16        Fixed   1      4 KB   -      N   N
I     320   200  -    1x1    1          16x16        Shader  1      4 KB   -      N   N
J     320   200  -    1x1    1          16x16        Shader  1      4 KB   1 MB   N   N
K     320   200  -    1x1    1          16x16        Shader  1      4 KB   1 MB   Y   Y

Page 25: Configuration Comparison

[Figure: three bar charts over configurations A to K. Data labels:]

Conf         A     B     C     D     E     F     G     H     I     J     K
BW (GB/s)    23.8  23.8  11.9  11.9  2.98  2.98  2.98  2.98  2.98  4.47  4.47
GFlops       76.8  76.8  38.4  38.4  6.4   6.4   3.2   1.6   1.6   1.6   1.6
Caches (KB)  96    96    48    24    24    12    12    12    12    12    12
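
The bandwidth and GFlops figures can be reproduced from the parameters of the evaluated configurations under one particular reading, sketched below. The formulas are an inference that happens to match the charted numbers; the slides do not state them. The assumptions are: each shader unit is 4-wide and retires one multiply-add per lane per cycle, the "Fetch Way" value acts as a multiplier, each 64-bit bus transfers data twice per clock, configurations J and K add a 32-bit double-data-rate system bus at the same clock, and a gigabyte is 2^30 bytes. All names in the code are hypothetical.

```python
SIMD_LANES = 4          # assumption: 4-wide SIMD ALU per shader unit
FLOPS_PER_LANE = 2      # assumption: one multiply-add per lane per cycle
BYTES_PER_GB = 2 ** 30  # assumption: binary gigabytes

# name: (MHz, vertex shaders, fragment or unified shader units, fetch way,
#        64-bit DDR buses, extra 32-bit system memory bus present)
CONFIGS = {
    "A": (400, 4, 8, 2, 4, False),
    "B": (400, 4, 8, 2, 4, False),
    "C": (400, 2, 4, 2, 2, False),
    "D": (400, 2, 4, 2, 2, False),
    "E": (200, 0, 2, 2, 1, False),
    "F": (200, 0, 2, 2, 1, False),
    "G": (200, 0, 1, 2, 1, False),
    "H": (200, 0, 1, 1, 1, False),
    "I": (200, 0, 1, 1, 1, False),
    "J": (200, 0, 1, 1, 1, True),
    "K": (200, 0, 1, 1, 1, True),
}


def peak_gflops(mhz, vsh, fsh, fetch_way, buses, system_bus):
    """Peak shader throughput in GFlops under the assumptions above."""
    return (vsh + fsh) * fetch_way * SIMD_LANES * FLOPS_PER_LANE * mhz / 1000


def peak_bandwidth_gb(mhz, vsh, fsh, fetch_way, buses, system_bus):
    """Peak memory bandwidth in GB/s under the assumptions above."""
    bytes_per_cycle = buses * 8 * 2      # 64-bit bus, double data rate
    if system_bus:
        bytes_per_cycle += 4 * 2         # 32-bit bus, double data rate
    return bytes_per_cycle * mhz * 1e6 / BYTES_PER_GB


for name, params in CONFIGS.items():
    print(name, round(peak_gflops(*params), 1),
          round(peak_bandwidth_gb(*params), 2))
```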

Page 26: Performance

- Average of 20 frames per second at 320x240 for the lower-end single shader configurations

[Figure: bar chart of frames per second for configurations A to K. Data labels:]

Conf  A     B    C    D    E     F     G     H     I     J     K
FPS   80.2  339  209  202  61.4  60.1  33.6  24.2  20.2  20.2  20.5

Page 27: Efficiency

- The limiting factor for PC and high-end embedded configurations is memory bandwidth
  - Shaders underutilized for the evaluated benchmark
- The limiting factor for low-end configurations is shader processing
  - Memory bandwidth could be further reduced
- Caches seem overdimensioned for the low-end embedded configurations

[Figure: FPS per GFlops, FPS per BW and FPS per cache KB for configurations A to K.]
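
The three efficiency ratios can be recomputed from the data labels on the Configuration Comparison and Performance slides. The sketch below simply divides the FPS of each configuration by its GFlops, bandwidth and cache size; it assumes the chart was produced the same way, which the slides do not confirm. The variable and function names are hypothetical.

```python
CONFS = "ABCDEFGHIJK"
FPS = [80.2, 339, 209, 202, 61.4, 60.1, 33.6, 24.2, 20.2, 20.2, 20.5]
GFLOPS = [76.8, 76.8, 38.4, 38.4, 6.4, 6.4, 3.2, 1.6, 1.6, 1.6, 1.6]
BW_GBS = [23.8, 23.8, 11.9, 11.9, 2.98, 2.98, 2.98, 2.98, 2.98, 4.47, 4.47]
CACHE_KB = [96, 96, 48, 24, 24, 12, 12, 12, 12, 12, 12]


def per_unit(resource):
    """FPS divided by a resource, keyed by configuration name."""
    return {c: f / r for c, f, r in zip(CONFS, FPS, resource)}


fps_per_gflops = per_unit(GFLOPS)
fps_per_bw = per_unit(BW_GBS)
fps_per_cache_kb = per_unit(CACHE_KB)

for c in CONFS:
    print(c, round(fps_per_gflops[c], 2), round(fps_per_bw[c], 2),
          round(fps_per_cache_kb[c], 2))
```

In this reading, FPS per GFlops is far higher for the single-shader configurations than for the PC configurations, which is consistent with the slide's remark that shaders are underutilized at the high end.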

Page 28: Shaded Triangle Setup Performance

- No overhead on fragment-limited benchmarks
- 16% less performance in vertex- and triangle-limited traces

[Figure: bar chart (vertical axis from 0.7 to 1) comparing triangle setup "on shader" and "on specific hardware" for the traces torus, UT-2004, lit spheres, spaceship and VL-II.]

Page 29: Conclusion

- The Attila Embedded achieves 20 frames per second on a single unified shader architecture at a 320x240 resolution when using a year-old PC benchmark
  - 1 MB of fast embedded DRAM provides more than enough bandwidth for framebuffer accesses
  - Texture data stored in system memory
  - 16% performance reduction when removing the specialized Triangle Setup unit in the worst tested case

Page 30: Questions?

Page 31: Attila PC

[Figure: block diagram of the Attila PC pipeline with a unified shader pool. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader (x4), ROP (x4), Memory Controller (x4). Annotation: "Unified Shader Pool".]

Page 32:

[Figure: chart over configurations A to K with the series Performance, Performance per Gflop, Performance per BW and Performance per Cache KB.]

Page 33: PowerVR SGX

