A Single (Unified) Shader GPU Microarchitecture for Embedded Systems
Victor Moya, Carlos González, Jordi Roca, Agustín Fernández
Department of Computer Architecture, UPC
Roger Espasa
Intel DEG Barcelona

Introduction
- Graphics, and specifically 3D graphics, have become an important element in current PDAs, mobile phones and other handheld systems
  - OpenGL ES: a simplified OpenGL specification for embedded systems
- The classic GPU architecture for the PC is not suited for embedded systems
  - Low power
  - Low area budget
- We propose a single unified shader GPU architecture for embedded systems

Outline
- ATTILA PC
- ATTILA Embedded
- Triangle Setup in the Shader Unit
- ATTILA Simulation Framework
- Results

Attila Classic for PCs
- Optimized for large resolutions
  - Above 1024x768
- Optimized for high performance
- High power requirements
  - No power optimizations
  - 100+ watts on current high-end GPUs
- Large area budget
  - 300+ million transistors on current high-end GPUs
- Large amount of dedicated memory bandwidth
  - 40+ GB/s on current high-end GPUs
- Specialized Shader Units
  - 2 to 8 vertex shader units
  - 1 to 6 fragment shader units

Attila PC
[Figure: Attila PC pipeline with specialized shaders; four fragments processed in parallel. Blocks shown: Vertex Fetch, Vertex Shader (x2), Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Fragment Shader (x2), ROP (x2), Memory Controller (x2).]

Embedded Requirements
- Optimized for small resolutions
  - 320x240 to 640x480
- Optimized for low power
  - Reduce frequency
  - Power optimizations
  - Improve efficiency
- Small area budget
  - Remove non-crucial hardware
- Low available bandwidth
- Reduced shading power
- Reduce design complexity

Attila Embedded
- No Hierarchical Z
- No Z compression
- Single unified shader
  - 1 SIMD ALU
  - Multithreaded
    - 16 threads of four vertex/triangle/fragment elements
    - 16 128-bit registers for temporary storage available per thread
- Texture unit outputs 1 bilinear sample for a whole fragment quad every 4 cycles
  - 4 KB texture cache
- ROP
  - One Z value and one color value updated per cycle in the framebuffer (a fragment quad every 4 cycles)
- Single 64-bit DDR channel
  - Limited by current simulator implementation
  - Assimilated to small (1 MB) embedded DRAM
- 32-bit high-latency bus to large system memory for textures

Attila Embedded
[Figure: Attila Embedded pipeline with a single unified shader and a single-fragment-per-cycle pipeline. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Rasterization, Scheduler, Distributor, Shader, ROP, Memory Controller. Labels: Vertices, Triangles, Fragments.]

Triangle Setup in the Shader
- 2D Homogeneous Rasterization
  - Olano & Greer
- Triangle setup algorithm:
  - Calculate the setup matrix from the triangle vertex matrix
  - Calculate the interpolation equation for fragment Z
  - Cull triangles based on their facing direction (area sign)
- Algorithm suited for a SIMD implementation in the Unified Shader
- Inputs:
  - Four 3-component vectors as input for the triangle vertex positions
- Outputs:
  - Three 4-component vectors as output for the triangle edge and Z interpolation equation coefficients
  - One signed triangle area register as output for the face culling stage
- 26-instruction Triangle Shader program
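
The three setup steps listed above can be sketched in a few lines of Python. This is only an illustration of the general 2D homogeneous approach (edge equations taken from the adjugate of the vertex matrix, with the determinant as the signed area), not the 26-instruction shader program itself. The function names, the (x, y, w) tuple layout and the choice that a negative area means back-facing are assumptions made for this example.

```python
def cross(a, b):
    """Cross product of two 3-component vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-component vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def triangle_setup(v0, v1, v2, z):
    """Hypothetical setup routine.

    v0, v1, v2 are (x, y, w) homogeneous vertex positions and z holds the
    three per-vertex Z values. Returns (edges, z_eq, area), or None for a
    degenerate triangle.
    """
    # The columns of the adjugate of the vertex matrix are the (a, b, c)
    # coefficients of the three edge equations.
    edges = (cross(v1, v2), cross(v2, v0), cross(v0, v1))
    # The determinant of the vertex matrix acts as the signed area.
    area = dot(v0, edges[0])
    if area == 0:
        return None
    # Coefficients of the plane giving z/w at a projected position (X, Y).
    z_eq = tuple((z[0] * edges[0][k] + z[1] * edges[1][k] + z[2] * edges[2][k]) / area
                 for k in range(3))
    return edges, z_eq, area


def evaluate(eq, X, Y):
    """Evaluate a*X + b*Y + c for an equation eq = (a, b, c)."""
    return eq[0] * X + eq[1] * Y + eq[2]


def inside(edges, area, X, Y):
    """A point is covered when every edge value has the sign of the area."""
    return all(evaluate(e, X, Y) * area >= 0 for e in edges)


def is_back_facing(area):
    """Assumed convention for this sketch: negative area is back-facing."""
    return area < 0
```

Each edge equation is a cross product of two vertex rows and the area is a dot product, which is the kind of work that maps onto a 4-wide SIMD ALU.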

Triangle Setup in the Shader
- Benefits
  - Reduce area
    - No specialized hardware required for triangle setup
  - Reduce design complexity
  - Improve efficiency
    - Graphics workloads in embedded applications may not fully utilize specialized triangle setup hardware in most cases
    - Higher utilization of the shader
- Costs
  - Shader workload increases
  - Rerouting of the rasterization pipeline required

[Figure: ATTILA simulation tool flow. Phases: Collect, Verify, Simulate, Analyze. Blocks shown: OpenGL Application, GLInterceptor, Vendor OpenGL Driver, ATI R520/NVidia G70, Trace, Framebuffer, GLPlayer, ATTILA OpenGL Driver, ATTILA Simulator, Signal Visualizer, Statistics, Signal Traffic. The framebuffers are marked "CHECK!".]

GLInterceptor
- Capture a trace of OpenGL API calls from a real game

GLPlayer
- Reproduce the captured trace
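
The capture-and-replay idea behind GLInterceptor and GLPlayer can be illustrated with a toy Python sketch. It does not reproduce the real tools or their trace format: `Interceptor`, `replay`, `FakeGL` and the JSON trace layout are all hypothetical names invented for this example, and the backend is a stand-in that only logs the calls it receives.

```python
import json


class Interceptor:
    """Hypothetical recorder: forwards each call to a backend and logs it."""

    def __init__(self, backend):
        self._backend = backend
        self.trace = []

    def __getattr__(self, name):
        # Only reached for names not defined on the interceptor itself,
        # that is, for the API calls of the wrapped backend.
        target = getattr(self._backend, name)

        def wrapper(*args):
            self.trace.append({"call": name, "args": list(args)})
            return target(*args)

        return wrapper


def replay(trace, backend):
    """Hypothetical player: issue the recorded calls, in order, on a backend."""
    for entry in trace:
        getattr(backend, entry["call"])(*entry["args"])


class FakeGL:
    """Toy backend (not a real OpenGL binding) that logs what it is asked to do."""

    def __init__(self):
        self.log = []

    def clear_color(self, r, g, b, a):
        self.log.append(("clear_color", r, g, b, a))

    def draw_arrays(self, mode, first, count):
        self.log.append(("draw_arrays", mode, first, count))


def capture_and_replay():
    """Record two calls, store the trace as JSON text, replay on a fresh backend."""
    original = FakeGL()
    recorder = Interceptor(original)
    recorder.clear_color(0, 0, 0, 1)
    recorder.draw_arrays("TRIANGLES", 0, 3)
    stored = json.dumps(recorder.trace)
    target = FakeGL()
    replay(json.loads(stored), target)
    return original.log, target.log
```

Because the stored trace is independent of the backend, the same recording can be played against different targets and the resulting framebuffers compared.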

OpenGL Library
- Transform Fixed Function API into shader code
- 200 API calls supported
- ARB Vertex and Fragment extensions
- Alpha and Fog emulated via shader code
Driver
- Low-level interface to GPU hardware
- Attila memory management

ATTILA Simulator
- Detailed cycle-by-cycle simulation of all pipeline stages
- 20 boxes, modeling a 100-deep pipeline
- Execute@Execute: functionality embedded at each pipeline stage
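
A cycle-by-cycle simulation built from boxes can be pictured with the following minimal Python sketch. It is not the ATTILA simulator code: the `Signal`, `Producer`, `Doubler` and `Sink` classes, the latencies and the `clock` method are assumptions made for illustration, and Execute@Execute is taken here to mean that a box performs its functional work at the cycle in which the data reaches it.

```python
from collections import deque


class Signal:
    """Hypothetical wire between two boxes with a fixed latency in cycles."""

    def __init__(self, latency):
        self.latency = latency
        self._in_flight = deque()  # entries are (arrival_cycle, data)

    def write(self, cycle, data):
        self._in_flight.append((cycle + self.latency, data))

    def read(self, cycle):
        # Deliver the oldest item once its arrival cycle has been reached.
        if self._in_flight and self._in_flight[0][0] <= cycle:
            return self._in_flight.popleft()[1]
        return None


class Producer:
    """Box that emits one queued item per cycle."""

    def __init__(self, out, items):
        self.out = out
        self.items = deque(items)

    def clock(self, cycle):
        if self.items:
            self.out.write(cycle, self.items.popleft())


class Doubler:
    """Box that does its functional work (doubling) when the data arrives."""

    def __init__(self, inp, out):
        self.inp = inp
        self.out = out

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.out.write(cycle, data * 2)


class Sink:
    """Box that records when each result arrives."""

    def __init__(self, inp):
        self.inp = inp
        self.received = []

    def clock(self, cycle):
        data = self.inp.read(cycle)
        if data is not None:
            self.received.append((cycle, data))


def simulate(boxes, cycles):
    """Clock every box once per simulated cycle."""
    for cycle in range(cycles):
        for box in boxes:
            box.clock(cycle)


def run_example():
    a, b = Signal(latency=2), Signal(latency=3)
    sink = Sink(b)
    simulate([Producer(a, [1, 2, 3]), Doubler(a, b), sink], cycles=10)
    return sink.received
```

The pipeline depth in such a model comes from the signal latencies rather than from the number of boxes, which is one way a small set of boxes can stand for a much deeper pipeline.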

Spot the differences
[Figure: rendered images labeled Attila and NVidia GeForce FX 5900XT]

Benchmark
- Unreal Tournament 2004
  - NOT AN EMBEDDED BENCHMARK
    - Up to 300K vertices per frame!
  - Fixed function OpenGL API
    - Vertex and fragment shaders generated by our library
- 320x240 resolution
- 140 of 450 frames simulated
- 100+ frames ~ 1 day of simulation
  - On a Xeon P4 @ 2.0 GHz

Configurations
- We have evaluated
  - 3 mid-range to low-end PC GPU configurations
  - 2 configurations for GPUs integrated on chipsets and high-end PDA GPUs
  - 4 embedded low-end GPU configurations
- We tried to keep a balance between memory bandwidth and shading computing power
  - From 4 to no vertex shader units
  - From 2 quad fragment shader units to a single unified shader unit
  - From four to one 64-bit DDR memory channels
  - Store framebuffer in small (1 MB) GPU memory and textures in system memory
- Halved the frequency for embedded systems
  - Restricted design rules
  - Reduce power consumption
- Removed all optional features at the low end
  - Hierarchical Z
  - Z compression
  - Specialized Triangle Setup hardware

Evaluated Configurations
Conf  Res   MHz  VSh  (F)Sh  Fetch Way  Regs Thread  Setup   Buses  Cache  eDRAM  HZ  Z Compr
A     1024  400  4    2x4    2          16x32        Fixed   4      16 KB  -      Y   Y
B     320   400  4    2x4    2          16x32        Fixed   4      16 KB  -      Y   Y
C     320   400  2    1x4    2          16x32        Fixed   2      16 KB  -      Y   Y
D     320   400  2    1x4    2          16x32        Fixed   2      8 KB   -      N   Y
E     320   200  -    1x2    2          8x32         Fixed   1      8 KB   -      N   Y
F     320   200  -    1x2    2          8x32         Fixed   1      4 KB   -      N   N
G     320   200  -    1x1    2          16x16        Fixed   1      4 KB   -      N   N
H     320   200  -    1x1    1          16x16        Fixed   1      4 KB   -      N   N
I     320   200  -    1x1    1          16x16        Shader  1      4 KB   -      N   N
J     320   200  -    1x1    1          16x16        Shader  1      4 KB   1 MB   N   N
K     320   200  -    1x1    1          16x16        Shader  1      4 KB   1 MB   Y   Y

Configuration Comparison
[Figure: three bar charts per configuration; data labels below]
Conf         A     B     C     D     E     F     G     H     I     J     K
BW (GB/s)    23.8  23.8  11.9  11.9  2.98  2.98  2.98  2.98  2.98  4.47  4.47
GFlops       76.8  76.8  38.4  38.4  6.4   6.4   3.2   1.6   1.6   1.6   1.6
Caches (KB)  96    96    48    24    24    12    12    12    12    12    12
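
The bandwidth figures can be checked against the configuration table with a short calculation. The sketch below assumes that each 64-bit DDR channel transfers twice per clock, that the memory clock equals the listed MHz value, and that the slide's GB/s are binary gigabytes; none of this is stated on the slides, but under these assumptions the listed values are reproduced. The 32-bit system-memory term used for configurations J and K is likewise an inference from the numbers. The function name is hypothetical.

```python
GIB = 2 ** 30  # assumption: the slide's "GB/s" are binary gigabytes per second


def ddr_bandwidth_gib_s(bus_bits, clock_mhz, channels=1):
    """Peak bandwidth of `channels` DDR buses, two transfers per clock cycle."""
    bytes_per_transfer = bus_bits / 8
    return channels * bytes_per_transfer * 2 * clock_mhz * 1e6 / GIB


# Configurations A/B: four 64-bit channels at 400 MHz.
bw_a = ddr_bandwidth_gib_s(64, 400, channels=4)
# Configurations C/D: two 64-bit channels at 400 MHz.
bw_c = ddr_bandwidth_gib_s(64, 400, channels=2)
# Configurations E to I: one 64-bit channel at 200 MHz.
bw_e = ddr_bandwidth_gib_s(64, 200)
# Configurations J/K (inferred): the 64-bit channel plus a 32-bit bus at 200 MHz.
bw_j = bw_e + ddr_bandwidth_gib_s(32, 200)

print(round(bw_a, 1), round(bw_c, 1), round(bw_e, 2), round(bw_j, 2))
```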

Performance
- Average of 20 frames per second at 320x240 for the lower-end single shader configurations
[Figure: bar chart of frames per second per configuration; data labels below]
Conf  A     B    C    D    E     F     G     H     I     J     K
FPS   80.2  339  209  202  61.4  60.1  33.6  24.2  20.2  20.2  20.5

Efficiency
- The limiting factor for PC and high-end embedded configurations is memory bandwidth
  - Shaders underutilized for the evaluated benchmark
- The limiting factor for low-end configurations is shading processing
  - Memory bandwidth could be further reduced
- Caches seem oversized for the low-end embedded configurations
[Figure: bar chart for configurations A to K with three series: FPS per GFlops, FPS per BW, FPS per Cache KB]
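
The three efficiency series can be recomputed from the data labels of the earlier charts. The sketch below simply divides the frames per second of each configuration by its GFlops, its bandwidth in GB/s and its total cache size in KB; the chart's own values are not legible in the slides, so the results are a recomputation under that reading and not a transcription. The variable and function names are assumptions for this example.

```python
CONFIGS = "ABCDEFGHIJK"
# Data labels copied from the Configuration Comparison and Performance charts.
FPS = [80.2, 339, 209, 202, 61.4, 60.1, 33.6, 24.2, 20.2, 20.2, 20.5]
GFLOPS = [76.8, 76.8, 38.4, 38.4, 6.4, 6.4, 3.2, 1.6, 1.6, 1.6, 1.6]
BW_GB_S = [23.8, 23.8, 11.9, 11.9, 2.98, 2.98, 2.98, 2.98, 2.98, 4.47, 4.47]
CACHE_KB = [96, 96, 48, 24, 24, 12, 12, 12, 12, 12, 12]


def efficiency():
    """Map each configuration to (FPS per GFlops, FPS per GB/s, FPS per cache KB)."""
    return {
        conf: (fps / gflops, fps / bw, fps / cache)
        for conf, fps, gflops, bw, cache in zip(CONFIGS, FPS, GFLOPS, BW_GB_S, CACHE_KB)
    }


for conf, (per_gflops, per_bw, per_kb) in efficiency().items():
    print(f"{conf}: {per_gflops:5.1f} FPS/GFlops  {per_bw:5.1f} FPS/(GB/s)  {per_kb:5.2f} FPS/KB")
```

Read this way, frames per GFlops rise toward the low-end configurations while frames per GB/s fall, which matches the bullets above: the high-end configurations leave shader capacity unused and the low-end ones leave bandwidth unused.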

Shaded Triangle Setup Performance
- No overhead on fragment-limited benchmarks
- 16% less performance in vertex- and triangle-limited traces
[Figure: bar chart (axis from 0.7 to 1) for the traces torus, UT-2004, lit spheres, spaceship and VL-II, with two series: on shader, on specific hardware]

Conclusion
- The Attila Embedded achieves 20 frames per second on a single unified shader architecture at a 320x240 resolution when using a year-old PC benchmark
- 1 MB of fast embedded DRAM provides more than enough bandwidth for framebuffer accesses
  - Texture data stored in system memory
- 16% performance reduction when removing the specialized Triangle Setup unit in the worst tested case

Questions?

Attila PC
[Figure: Attila PC pipeline with a unified shader pool. Blocks shown: Vertex Fetch, Primitive Assembly, Clipping, Triangle Setup, Rasterization, Hierarchical Z, Scheduler, Distributor, Shader (x4), ROP (x4), Memory Controller (x4).]

[Figure: bar chart for configurations A to K with four series: Performance, Performance per GFlop, Performance per BW, Performance per Cache KB]

PowerVR SGX