Post on 29-Jan-2016
A User-Programmable Vertex Engine
Erik Lindholm
Mark Kilgard
Henry Moreton
NVIDIA Corporation
Presented by Han-Wei Shen
Where does the Vertex Engine fit?
[Figure: Traditional Graphics Pipeline stages: Transform & Lighting, setup and rasterizer, texture blending, frame-buffer anti-aliasing]
[Figure: GeForce 3 Vertex Engine pipeline stages: Transform & Lighting / Vertex Program, setup and rasterizer, texture blending, frame-buffer anti-aliasing]
API Support
• Designed to fit into the OpenGL and D3D APIs
• Program mode vs. fixed-function mode
• Load and bind program
• Simple to add to old D3D and OpenGL programs
Programming Model
• Enable vertex program: glEnable(GL_VERTEX_PROGRAM_NV);
• Create vertex program object
• Bind vertex program object
• Execute vertex program object
Create Vertex Program
• Programs (assembly) are defined inline as character strings
static const GLubyte vpgm[] = "\
!!VP1.0 \
DP4 o[HPOS].x, c[0], v[0]; \
DP4 o[HPOS].y, c[1], v[0]; \
DP4 o[HPOS].z, c[2], v[0]; \
DP4 o[HPOS].w, c[3], v[0]; \
MOV o[COL0], v[3]; \
END";
Create Vertex Program (2)
• Load and bind vertex programs, similar to texture objects
glLoadProgramNV(GL_VERTEX_PROGRAM_NV, 7, strlen(programString), programString);
…
glBindProgramNV(GL_VERTEX_PROGRAM_NV, 7);
Invoke Vertex Program
• The vertex program runs each time a vertex is specified, i.e., on each glVertex call:
glBegin(…)
glVertex3f(x,y,z)
…
glEnd()
Let's look at the sample program
static const GLubyte vpgm[] = "\
!!VP1.0 \
DP4 o[HPOS].x, c[0], v[0]; \
DP4 o[HPOS].y, c[1], v[0]; \
DP4 o[HPOS].z, c[2], v[0]; \
DP4 o[HPOS].w, c[3], v[0]; \
MOV o[COL0], v[3]; \
END";
o[HPOS] = M(c[0], c[1], c[2], c[3]) * v[0]   (what is HPOS?)
o[COL0] = v[3]   (what is COL0?)
The program calculates the clip-space position of the vertex (HPOS) and assigns v[3] to the vertex as its diffuse color (COL0).
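The sample program can be traced on the CPU. The following is a minimal C sketch, not NVIDIA's implementation: Vec4, dp4 and run_sample_program are hypothetical names, and c[0] … c[3] are assumed to hold the rows of the matrix M.

```c
#include <assert.h>

/* Hypothetical model of one quad-float register: f[0..3] = x, y, z, w. */
typedef struct { float f[4]; } Vec4;

/* DP4: four-component dot product. */
float dp4(Vec4 a, Vec4 b) {
    return a.f[0] * b.f[0] + a.f[1] * b.f[1] + a.f[2] * b.f[2] + a.f[3] * b.f[3];
}

/* Emulates the sample program: c[0..3] are the matrix rows,
   v0 is the input position, v3 is the input color. */
void run_sample_program(const Vec4 c[4], Vec4 v0, Vec4 v3, Vec4 *hpos, Vec4 *col0) {
    assert(c && hpos && col0);
    for (int i = 0; i < 4; ++i)
        hpos->f[i] = dp4(c[i], v0);   /* DP4 o[HPOS].{x,y,z,w}, c[i], v[0]; */
    *col0 = v3;                       /* MOV o[COL0], v[3]; */
}
```

Each DP4 produces one component of the matrix-vector product, so four of them give the full transformed position.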
Programming Model
[Figure: register model of the vertex program]
Vertex Source: v[0] … v[15], 16x4 registers
Program Constants: c[0] … c[95], 96x4 registers
Temporary Registers: R0 … R11, 12x4 registers
Vertex Program: 128 instructions
Vertex Output: o[HPOS], o[COL0], o[COL1], o[FOGP], o[PSIZ], o[TEX0] … o[TEX7], 15x4 registers
All quad floats
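The register counts in the diagram can be written down as a data structure. This is only an illustrative C sketch; VertexState, constant_reg and the enum names are hypothetical. With 96 constant registers, the valid constant indices are taken to be 0 to 95.

```c
#include <assert.h>

/* Sizes taken from the diagram. */
enum {
    NUM_ATTRIBS = 16,        /* vertex source registers v[0..15] */
    NUM_CONSTS = 96,         /* program constants */
    NUM_TEMPS = 12,          /* temporaries R0..R11 */
    NUM_OUTPUTS = 15,        /* vertex output registers */
    MAX_INSTRUCTIONS = 128   /* program length limit */
};

/* Every register is a quad float: f[0..3] = x, y, z, w. */
typedef struct { float f[4]; } Vec4;

typedef struct {
    Vec4 v[NUM_ATTRIBS];
    Vec4 c[NUM_CONSTS];
    Vec4 R[NUM_TEMPS];
    Vec4 o[NUM_OUTPUTS];
} VertexState;

/* Returns a pointer to constant register c[index], checking the range. */
Vec4 *constant_reg(VertexState *s, int index) {
    assert(s && index >= 0 && index < NUM_CONSTS);
    return &s->c[index];
}
```

In this model the whole per-vertex state is 139 quad-float registers.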
Input Vertex Attributes
• v[0] – v[15]
• Aliased (tracked) with conventional per-vertex attributes (Table 3)
• Use glVertexAttribNV() to explicitly assign values
• Can also specify values for several vertex attributes at once from an array: glVertexAttribsNV()
• Can change values inside or outside a glBegin()/glEnd() pair
Program Constants
• Can only change values outside a glBegin()/glEnd() pair
• No automatic aliasing
• Can be used to track OpenGL matrices (modelview, projection, texture, etc.)
• Example:
glTrackMatrixNV(GL_VERTEX_PROGRAM_NV, 0, GL_MODELVIEW_PROJECTION_NV, GL_IDENTITY_NV)
- tracks 4 contiguous program constants starting with c[0]
Program Constants (cont'd)
DP4 o[HPOS].x, c[0], v[OPOS];
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
What does it do?
- With c[0] … c[3] tracking the modelview-projection matrix, it transforms the object-space position v[OPOS] into clip space
Program Constants (cont'd)
glTrackMatrixNV(GL_VERTEX_PROGRAM_NV, 4, GL_MODELVIEW, GL_INVERSE_TRANSPOSE_NV)
DP3 R0.x, c[4], v[NRML];
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];
What does it do?
- It transforms the normal v[NRML] into eye space with the inverse transpose of the modelview matrix, tracked in the 4 constants starting at c[4]
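The three DP3 instructions can be traced with a small C sketch. It is an illustration under assumptions: Vec4, dp3 and transform_normal are hypothetical names, and the constants starting at c[base] are assumed to already hold the rows of the inverse-transpose modelview matrix.

```c
#include <assert.h>

/* Hypothetical model of one quad-float register: f[0..3] = x, y, z, w. */
typedef struct { float f[4]; } Vec4;

/* DP3: three-component dot product (w is ignored). */
float dp3(Vec4 a, Vec4 b) {
    return a.f[0] * b.f[0] + a.f[1] * b.f[1] + a.f[2] * b.f[2];
}

/* Emulates the three DP3 lines: only x, y and z of R0 are written,
   so R0.w keeps whatever value it had before. */
void transform_normal(Vec4 *r0, const Vec4 *c, int base, Vec4 nrml) {
    assert(r0 && c && base >= 0);
    r0->f[0] = dp3(c[base + 0], nrml);   /* DP3 R0.x, c[4], v[NRML]; */
    r0->f[1] = dp3(c[base + 1], nrml);   /* DP3 R0.y, c[5], v[NRML]; */
    r0->f[2] = dp3(c[base + 2], nrml);   /* DP3 R0.z, c[6], v[NRML]; */
}
```

Only three rows are needed because a normal is a direction: the translation part of the matrix does not apply to it.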
Hardware Block Diagram
[Figure: Vertex In, Vertex Attribute Buffer (VAB), Vector FP Core, Vertex Out]
Vertex Attribute Buffer (VAB)
[Figure: VAB entries 0 … 15, each 128 bits (32 x 4), with dirty bits, feeding input buffers (IB) 0 … n-1]
HW Block Diagram
[Figure: input buffers (IB) and output buffers (OB) 0 … n-1; SIMD Vector Unit; Special Function Unit; Constant Memory; Instruction Memory; Registers; swizzle/negate (sw/neg) on sources; write mask on results]
Data Path
[Figure: three source operands (X Y Z W each) pass through Negate and Swizzle into the FPU Core; the result (X Y Z W) passes through the Write Mask]
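The data path stages around the FPU core can be modeled in a few lines of C. This is a sketch, not a description of the hardware: swizzle_negate, write_masked, the selector array and the bit-mask encoding are all hypothetical choices made for illustration.

```c
#include <assert.h>

/* Hypothetical model of one quad-float register: f[0..3] = x, y, z, w. */
typedef struct { float f[4]; } Vec4;

/* Source side: sel[i] (0..3) picks which source component feeds
   component i; negate flips the sign of the whole operand. */
Vec4 swizzle_negate(Vec4 src, const int sel[4], int negate) {
    Vec4 out;
    for (int i = 0; i < 4; ++i) {
        assert(sel[i] >= 0 && sel[i] <= 3);
        out.f[i] = negate ? -src.f[sel[i]] : src.f[sel[i]];
    }
    return out;
}

/* Destination side: bit i of mask enables the write to component i;
   masked-off components keep their old value. */
void write_masked(Vec4 *dst, Vec4 result, unsigned mask) {
    assert(dst);
    for (int i = 0; i < 4; ++i)
        if (mask & (1u << i))
            dst->f[i] = result.f[i];
}
```

For example, the operand -R1.yzxw corresponds to sel = {1, 2, 0, 3} with negate set, and a destination of R0.xz corresponds to mask bits 0 and 2.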
Instruction Set: The ops
• 17 instructions total
• MOV, MUL, ADD, MAD, DST
• DP3, DP4
• MIN, MAX, SLT, SGE
• RCP, RSQ, LOG, EXP, LIT
• ARL
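The list contains no branch instruction, so comparisons are expressed as data. The C sketch below models SLT, SGE, MIN and MAX as component-wise operations; the function names and the Vec4 type are hypothetical, and the 1.0/0.0 results for SLT and SGE follow the usual reading of "set on less than" and "set on greater or equal".

```c
#include <assert.h>

/* Hypothetical model of one quad-float register: f[0..3] = x, y, z, w. */
typedef struct { float f[4]; } Vec4;

/* SLT dst, a, b: dst = (a < b) ? 1.0 : 0.0, per component. */
void slt(Vec4 *dst, Vec4 a, Vec4 b) {
    assert(dst);
    for (int i = 0; i < 4; ++i)
        dst->f[i] = (a.f[i] < b.f[i]) ? 1.0f : 0.0f;
}

/* SGE dst, a, b: dst = (a >= b) ? 1.0 : 0.0, per component. */
void sge(Vec4 *dst, Vec4 a, Vec4 b) {
    assert(dst);
    for (int i = 0; i < 4; ++i)
        dst->f[i] = (a.f[i] >= b.f[i]) ? 1.0f : 0.0f;
}

/* MIN dst, a, b: component-wise minimum. */
void vmin(Vec4 *dst, Vec4 a, Vec4 b) {
    assert(dst);
    for (int i = 0; i < 4; ++i)
        dst->f[i] = (a.f[i] < b.f[i]) ? a.f[i] : b.f[i];
}

/* MAX dst, a, b: component-wise maximum. */
void vmax(Vec4 *dst, Vec4 a, Vec4 b) {
    assert(dst);
    for (int i = 0; i < 4; ++i)
        dst->f[i] = (a.f[i] > b.f[i]) ? a.f[i] : b.f[i];
}
```

The 1.0/0.0 results can then be multiplied into other values with MUL or MAD to select between alternatives without any control flow.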
Instruction Set: The Core Features
• Immediate access to sources
• Swizzle/negate on all sources
• Write mask on all destinations
• DP3, DP4 most common graphics ops
• Cross product is MUL+MAD with swizzling
• LIT instruction implements Phong lighting
Dot Product Instruction
DP3 R0.x, R1, R2
R0.x = R1.x * R2.x + R1.y * R2.y + R1.z * R2.z
DP4 R0.x, R1, R2
4-component dot product
MUL instruction
MUL R1, R0, R2 (component-wise mult.)
R1.x = R0.x * R2.x
R1.y = R0.y * R2.y
R1.z = R0.z * R2.z
R1.w = R0.w * R2.w
MAD instruction
MAD R1, R2, R3, R4
R1 = R2 * R3 + R4
*: component-wise multiplication
Example:
MAD R1, R0.yzxw, R2.zxyw, -R1
What does it do?
Cross Product Coding Example
# Cross product R2 = R0 x R1
MUL R2, R0.zxyw, R1.yzxw;
MAD R2, R0.yzxw, R1.zxyw, -R2;
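The two-instruction cross product can be checked by emulating MUL, MAD and the swizzles in C. This is a minimal sketch with hypothetical helper names (swz, neg, mul, mad, cross); it mirrors the two assembly lines one for one.

```c
#include <assert.h>

/* Hypothetical model of one quad-float register: f[0..3] = x, y, z, w. */
typedef struct { float f[4]; } Vec4;

/* Component selectors for the two swizzles used in the example. */
static const int ZXYW[4] = {2, 0, 1, 3};
static const int YZXW[4] = {1, 2, 0, 3};

Vec4 swz(Vec4 s, const int sel[4]) {
    Vec4 out;
    for (int i = 0; i < 4; ++i) out.f[i] = s.f[sel[i]];
    return out;
}

Vec4 neg(Vec4 s) {
    Vec4 out;
    for (int i = 0; i < 4; ++i) out.f[i] = -s.f[i];
    return out;
}

/* MUL dst, a, b: component-wise product. */
void mul(Vec4 *dst, Vec4 a, Vec4 b) {
    assert(dst);
    for (int i = 0; i < 4; ++i) dst->f[i] = a.f[i] * b.f[i];
}

/* MAD dst, a, b, c: component-wise a * b + c. */
void mad(Vec4 *dst, Vec4 a, Vec4 b, Vec4 c) {
    assert(dst);
    for (int i = 0; i < 4; ++i) dst->f[i] = a.f[i] * b.f[i] + c.f[i];
}

/* R2 = R0 x R1, using the same two instructions as the example. */
void cross(Vec4 *r2, Vec4 r0, Vec4 r1) {
    assert(r2);
    mul(r2, swz(r0, ZXYW), swz(r1, YZXW));            /* MUL R2, R0.zxyw, R1.yzxw; */
    mad(r2, swz(r0, YZXW), swz(r1, ZXYW), neg(*r2));  /* MAD R2, R0.yzxw, R1.zxyw, -R2; */
}
```

The MUL computes the terms that are subtracted, and the MAD computes the positive terms and subtracts the MUL result in the same instruction. The w component of the result is not meaningful.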
Lighting instruction
LIT R1, R0 (Phong light model)
Input: R0 = (diffuse, specular, ??, shininess)
Output: R1 = (1, diffuse, specular^shininess, 1)
Usually followed by
DP3 o[COL0], c[21], R1 (assuming c[21] is used)
where c[xx] = (ka, kd, ks, ??)
Ready to trace some programs?
Previous Work: Geometry Engine
• High bandwidth + lots of Flops
• Low clock rate
• No architectural continuity
• VERY hard to program
• Some high-level language support (maybe)
• A compromise solution (vtx, prim, pix, …)
Alternative: The CPU
• Low bandwidth + reasonable Flops
• High clock rate
• Excellent architectural continuity
• VERY hard to use efficiently
• Excellent high-level language support
• Flexible, but often too slow
New Design: The Vertex Engine
• Simple hardware for a commodity GPU
• Allows user to manipulate vertex transform
• Simple-to-use programming model
• Superset of fixed-function mode
Why Vertex Processing?
• Very parallel
• Use single vertex programming model
• Hardware can batch or interleave
• KISS
Why Not Primitive Processing?
• Face culling and clipping break parallelism
• Complicates memory accesses
• Inefficient (control takes time)
• Let hardware designers optimize
Programming Model: Vertex I/O
• Streaming vertex architecture
• Source data converted to floats
• Source data loaded
• Run program
• Destination data drained
• Destination data re-formatted for hardware
Hardware Implementation
• Vector SIMD Unit + Special Function Unit
• Multithreaded and pipelined to hide latency
• Any one instruction/cycle
• All instructions equal latency
• Free swizzling/negate/write mask support
Conclusion
• Very simple, efficient implementation
• Allows vertex programming continuity
• Stanford Imagine Architecture
• A work in progress, lots more to come…
• We welcome your feedback