Rasterization on Larrabee:
A First Look at the Larrabee New
Instructions (LRBni) in Action *
Michael Abrash
GDC, March 2009
* Warning: this is a pretty technical talk, so if you have no interest
in processor architecture, instruction sets, or programming and
were trying to decide between this talk and something about game
design or the production pipeline, you might want to reconsider!
I never did get that lerp instruction!
Larrabee
What is Larrabee?
Better yet, why is Larrabee?
For decades, processors have gotten faster
Higher clock speeds
Smaller (== faster and more) gates
Bigger caches
Hardware that extracts more work per clock
Why Larrabee?
That process certainly hasn’t stopped
But it’s getting harder
Low-hanging fruit already taken
Running up against power budgets
Why Larrabee?
More recently, developments along another axis
Vector processing
And another
Multiple hardware threads
And yet another
Multiple cores
Performance can scale linearly with gates
More work per clock by moving burden of extracting parallelism to software
What is Larrabee?
Larrabee is the logical conclusion of these trends
Lots of power-efficient cores
In-order pipeline
Clocked at the power/performance sweet spot
4 threads per core
16-wide vector units
Streaming support
What is Larrabee?
Very powerful
Very power-efficient
Very highly parallel
Very dependent on software to make use of that
parallelism
What is Larrabee?
Gets as much work out of each watt and each
square millimeter as possible
Scales well far into the future
Massive potential performance; 1 teraflop & up
My favorite part
Excellent software control of performance
Relies heavily on software to get it to live up to its
potential
Larrabee hardware architecture
Obligatory vague architecture slide
[Figure: block diagram. Multi-threaded wide-SIMD cores, each with its own instruction and data caches (I$, D$), share an L2 cache. Also shown: memory controllers, fixed-function texture logic, a display interface, and a system interface.]
Larrabee hardware architecture
Lots of enhanced in-order x86 cores
Fully capable of running an OS and apps
Great flexibility in graphics pipeline design
Can support a wide variety of software
4 threads per core
Hide pipeline latency, cache misses
Coherent caches, connected by a fast ring
Larrabee hardware architecture
Tried to maximize general usability of Larrabee
hw
Fixed-function texture sampler units
Also a per-core cache-management unit
No other fixed-function units
Larrabee hardware architecture
These features boost performance via thread-
level parallelism
Key element of Larrabee performance, but it’s not
unique to Larrabee, so I’m not going to talk about it
further today
Larrabee hardware architecture
I’m going to focus on per-thread (data-level,
SIMD) parallelism
16-wide vector unit
Why 16-wide?
The wider the better – if it gets used
LRBni
Larrabee New Instructions
>100 new instructions
Mostly vector instructions
Architected in close collaboration with developers
Design philosophy
It’s hard to leverage data parallelism without all the right
pieces in the instruction set
Enable generally-usable extraction of data-level parallelism
The fundamentals of LRBni
32 vector registers, each 512 bits wide
v0-v31
Full complement of vector instructions
Operate on int32, float32, float64
Mul, add, sub, adc, sbb, subr, and, or, xor, multiply-add, multiply-sub
Vector compares
Aligned and unaligned store/load
Gather/scatter
Bit manipulation: insert field, interleave, shuffle
The fundamentals of LRBni
Ternary (three-argument) operations
Load-operand – can read one arg from memory
No-cost type conversions on load/store
All math is 32- or 64-bit wide
Smaller data in memory to save bw and footprint
Common upconversions on load-operand
Upconversion and/or broadcast on memory load
Downconversion and/or selection on memory store
All common DX/OpenGL types including float16, unorm8, etc.
The fundamentals of LRBni
8 16-bit mask registers
Every vector instruction can do no-cost predication
Most often set by vector compares
Can be copied from scalar registers (eax, ebx, …)
Set of logical instructions that operate on masks
Mask tests allow real branches and loops
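A minimal scalar sketch of predication, for readers who want to see the mask semantics spelled out. It models a 16-element vector as a Python list and a mask register as an integer whose bit i controls element i. The function names `compare_lt` and `masked_add` are hypothetical, and the sketch assumes that masked-off elements keep the destination's previous value.

```python
WIDTH = 16  # vector width in elements, per the slides

def compare_lt(a, b):
    """Model of a vector compare: bit i of the result is 1 when a[i] < b[i]."""
    mask = 0
    for i in range(WIDTH):
        if a[i] < b[i]:
            mask |= 1 << i
    return mask

def masked_add(dest, a, b, mask):
    """Model of a predicated add: element i is written only where mask bit i is 1."""
    return [a[i] + b[i] if (mask >> i) & 1 else dest[i] for i in range(WIDTH)]

a = list(range(WIDTH))          # 0, 1, ..., 15
b = [8] * WIDTH
k1 = compare_lt(a, b)           # elements 0..7 are less than 8
result = masked_add([0] * WIDTH, a, b, k1)
```

Only the low eight elements of `result` are updated; the rest keep the zeroes that were already in the destination.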
The fundamentals of LRBni
Parallel <=> serial conversion
Pack-store; load-unpack
Gather; scatter
Bit scan initialized
Streaming support
Prefetching
Cache control
Tim Sweeney on Larrabee
Quotes from Tim Sweeney on Larrabee:
Short version: Larrabee rocks!
Larrabee instructions are “vector complete”
More precisely: Any loop written in a traditional programming language can be vectorized, to execute 16 iterations of the loop in parallel on Larrabee vector units, provided the loop body meets the following criteria:
Its call graph is statically known.
There are no data dependencies between iterations.
Michael Abrash on Larrabee
“Tim’s absolutely right, but I’ll bet there’s still a
lot of performance to be had from mucking
around under the hood”
Sample LRBni code
kxnor k2, k2
vorpi v0, v2, v2
vorpi v1, v3, v3
vxorpi v4, v4, v4
mov ebx, 256
loop:
cmp ebx,0
jl endloop
dec ebx
vmulps v21 {k2}, v0, v1
vaddps v21 {k2}, v21, v21
vmadd213ps v0 {k2}, v0, v2
vmsub231ps v0 {k2}, v1, v1
vaddps v1 {k2}, v21, v3
vaddps v4 {k2}, v4, [ConstantFloatOne] {1to16}
vmulps v25 {k2}, v0, v0
vmadd231ps v25 {k2}, v1, v1
vcmpleps k2 {k2}, v25, [ConstantFloatOne] {1to16}
kortest k2, k2
jnz loop
endloop:
(this happens to be a Mandelbrot-set generator)
(thanks Dean for fixing this!)
LRBni examples
Ternary: start with a simple vector multiply
vmulps v0, v5, v6 ; v0 = v5 * v6
LRBni examples
Multiply-add: destination is also third source
vmadd231ps v0, v5, v6 ; v0 = v5 * v6 + v0
Operand 2 times operand 3 plus operand 1
LRBni examples
Predication: mask the writing of the elements
vmadd231ps v0 {k1}, v5, v6
LRBni examples
Load-operand: src2 is the memory location
specified by rbx+rcx*4
vmadd231ps v0 {k1}, v5, [rbx+rcx*4]
LRBni examples
The operands can be plugged in differently
vmadd213ps v0 {k1}, v5, [rbx+rcx*4]
Operand 2 times operand 1 plus operand 3
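The digits in the mnemonic say which operands are multiplied and which is added. A small scalar sketch of the two orderings described on these slides, with vectors modeled as Python lists; the Python function names simply mirror the mnemonics and are not a real API.

```python
def vmadd231(op1, op2, op3):
    """op1 = op2 * op3 + op1, element by element."""
    return [b * c + a for a, b, c in zip(op1, op2, op3)]

def vmadd213(op1, op2, op3):
    """op1 = op2 * op1 + op3, element by element."""
    return [b * a + c for a, b, c in zip(op1, op2, op3)]

v0 = [1.0, 2.0, 3.0, 4.0]       # operand 1 (also the destination)
v5 = [10.0, 10.0, 10.0, 10.0]   # operand 2
v6 = [0.5, 0.5, 0.5, 0.5]       # operand 3

accumulated = vmadd231(v0, v5, v6)   # 10 * 0.5 + v0
scaled = vmadd213(v0, v5, v6)        # 10 * v0 + 0.5
```

The 231 form suits accumulation into the destination; the 213 form suits scaling the destination and adding a third value, which may come from memory.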
LRBni examples
Broadcast: expand 4 (or 1) elements in memory
vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {4to16}
LRBni examples
Conversion: upconvert from float16 format
vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {float16}
LRBni examples
Pack-store
vcompressd [rbx] {k1}, v0
LRBni examples
Gather
vgatherd v1 {k1}, [rbx + v2*4]
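Pack-store and gather are the parallel-to-serial and serial-to-parallel pieces. A rough scalar model of both, assuming memory is a Python list of 32-bit elements (so the `*4` byte scale becomes a plain index) and that pack-store writes the selected elements lowest element first. `pack_store` and `gather` are hypothetical helper names.

```python
def pack_store(vec, mask):
    """Return the elements of vec whose mask bit is 1, packed contiguously."""
    return [v for i, v in enumerate(vec) if (mask >> i) & 1]

def gather(dest, memory, indices, mask):
    """For each enabled element i, load memory[indices[i]]; others keep dest[i]."""
    return [memory[indices[i]] if (mask >> i) & 1 else dest[i]
            for i in range(len(dest))]

packed = pack_store([10, 11, 12, 13], 0b1010)            # keeps elements 1 and 3
memory = [100, 101, 102, 103, 104]
loaded = gather([0, 0, 0, 0], memory, [4, 0, 2, 1], 0b0111)
```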
The fundamentals of LRBni
All these instructions run at full speed!
I know this has been way too fast
There just isn’t time for an in-depth look
LRBni in more detail:
Tom Forsyth’s talk (10:30: Room 3002, West Hall)
Dr. Dobb’s Journal article in April
These slides (complete with notes)
LRBni in Action
Everything you need to run fully-parallel code well
General code running on a CPU
Can run anything
How well can it run less-than-perfectly-parallelizable code?
RAD has been working on the Larrabee D3D/OpenGL graphics pipeline
Pipeline is not an ideal candidate for parallelization
Retirement must be in order
Rasterization is not easy to vectorize efficiently
I’ll look at rasterization today
The process of determining which pixels are inside a triangle
Applying LRBni to the graphics
pipeline
For the most part, this is easy
Z/stencil buffering, pixel shading, blending all
fall out of processing 4x4 blocks
16 vertices can be shaded and cached in parallel
Vertex usage tends to be localized
Triangle set-up lends itself pretty well to
vectorization
Some efficiency cost from culling
Rasterization is not easy to vectorize
At least not with usable performance
In fact, we were sure it couldn’t be done!
Forced to reexamine assumptions
So irritated at being asked for the hundredth time if
it was possible that we sat down to prove it wasn’t
We failed
Dedicated hardware can do any
given task more efficiently
In general, dedicated hardware will be able to do any
given graphics task more efficiently per square
millimeter than software
However, CPU flexibility can gain some or all of that
back by applying square millimeters as needed
Hardware needs worst-case capacity for each component
Often partly or entirely idle
When little rasterization is needed (long shaders, say), CPUs
can just use cycles for other purposes; ALUs are never idle
Can even have higher peak rates in many cases, because the
whole chip can work on a single task if necessary
A quick refresher
Three edges per triangle, each defined by an equation Bx+Cy relative to any point on edge
Sign indicates in/out
X and y snapped to 15.8 fixed point
Range [-16K,+16K]
Tested at pixel/sample centers
Fill rules must be observed
Must be exact (discrete math)
Must support multisampled antialiasing (MSAA)
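The refresher above can be sketched in a few lines. This is an illustrative model, not the renderer's code: it assumes 8 fractional bits, assumes a vertex order for which the interior evaluates negative on every edge, treats a value of exactly zero as outside (fill rules are ignored), and uses hypothetical helper names throughout.

```python
FRAC_BITS = 8  # assumed: 8 fractional bits, as in "15.8"

def to_fixed(v):
    """Snap a coordinate to fixed point."""
    return int(round(v * (1 << FRAC_BITS)))

def make_edge(p0, p1):
    """Return (B, C, x0, y0) so that B*(x-x0) + C*(y-y0) is negative inside."""
    (x0, y0), (x1, y1) = p0, p1
    return (y1 - y0, x0 - x1, x0, y0)

def edge_value(edge, x, y):
    B, C, x0, y0 = edge
    return B * (x - x0) + C * (y - y0)   # exact integer math

def pixel_inside(edges, px, py):
    """Test the center of pixel (px, py) against all three edges."""
    cx, cy = to_fixed(px + 0.5), to_fixed(py + 0.5)
    return all(edge_value(e, cx, cy) < 0 for e in edges)

verts = [(to_fixed(0), to_fixed(0)), (to_fixed(8), to_fixed(0)), (to_fixed(0), to_fixed(8))]
edges = [make_edge(verts[i], verts[(i + 1) % 3]) for i in range(3)]
```

Because everything is integer arithmetic, the sign test is exact, which is what the "discrete math" requirement asks for.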
Pixel rasterization: example
[Figure: example triangle, with + and - signs marking the positive and negative sides of its edges.]
Pixomatic 1 rasterization
Pixo 1 decomposed triangles into 1 or 2
trapezoids
Stepped down 2 edges at a time on pixel centers
Could do with only 1 branch, for loop
Branch misprediction is very expensive
Rasterization code tends to predict poorly
Pixomatic 1 rasterization
Problems with Pixo 1 rasterization
Required expensive IMUL per edge
Poorly suited to small triangles
Not well suited to MSAA sample jittering
Never could figure out how to vectorize it
efficiently
Sweep rasterization
Generally well suited to vectorization
Problems for CPUs
Lots of badly-predicted branching
Significant work to figure out where to descend
Example of an approach that dedicated
hardware can perform much more efficiently
Larrabee rasterization
CPU smarts are useful for
rasterization
Not-rasterizing
CPU smarts are useful for
rasterization
The Larrabee renderer uses chunking (binning, tiling)
For chunking, separate per-tile and intra-tile rasterization
Per-tile rasterization (triangle->tile assignment)
Bounding box tests for triangles up to 1 tile in size
Sweep or just walk bounding box for larger triangles
Tile-assignment time is insignificant for larger triangles
CPUs make it easy to do the 90% case well and the difficult 10% case
adequately
If large-triangle tile assignment important, could use a form
of the approach used for intra-tile (discussed next)
A tiled render target
[Figure: a 256x256 render target, from (0,0) to (256,256), divided into four tiles, Tile 0 through Tile 3.]
Tile assignment test – trivial reject
Calculate value of each edge equation at its
trivial reject tile corner
If any edge is non-negative, triangle does not
intersect tile (<0 == inside so we can just test sign
bit)
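A sketch of the trivial reject test, under some assumptions: integer pixel coordinates, square tiles, edges stored as (B, C, x0, y0) with negative meaning inside, and the trivial reject corner taken to be the tile corner where the edge equation is most negative. The helper names are hypothetical.

```python
def make_edge(p0, p1):
    """Return (B, C, x0, y0) so that B*(x-x0) + C*(y-y0) is negative inside."""
    (x0, y0), (x1, y1) = p0, p1
    return (y1 - y0, x0 - x1, x0, y0)

def trivial_reject_corner(B, C, tile_x, tile_y, size):
    """Pick the tile corner where this edge's equation is most negative."""
    x = tile_x if B >= 0 else tile_x + size
    y = tile_y if C >= 0 else tile_y + size
    return x, y

def tile_rejected(edges, tile_x, tile_y, size):
    """True if any edge is non-negative at its trivial reject corner."""
    for B, C, x0, y0 in edges:
        x, y = trivial_reject_corner(B, C, tile_x, tile_y, size)
        if B * (x - x0) + C * (y - y0) >= 0:
            return True
    return False

verts = [(0, 0), (100, 0), (0, 100)]
edges = [make_edge(verts[i], verts[(i + 1) % 3]) for i in range(3)]
```

With 128x128 tiles on a 256x256 target, this triangle touches only the tile at (0, 0); the other three tiles are rejected by the diagonal edge.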
Trivial reject: example
[Figure: the 256x256 render target, from (0,0) to (256,256), divided into Tile 0 through Tile 3 and crossed by a black edge. The + and - sides of the edge are marked, with labels "More positive" and "More negative". The trivial reject corner of each tile for the black edge is marked.]
Trivial reject corner of tile 0 for black edge: if this point isn’t inside the black edge, no point in the tile can be inside the black edge
Tile assignment test – trivial accept
For each edge, sum trivial reject corner value
with the equation step to opposite corner
If any edge is negative, the whole tile is trivially
accepted for that edge
No need to consider it when rasterizing within tile
In general, only relevant edges need to be considered
Scissors, user clip
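Putting reject and accept together gives a three-way tile classification. The sketch below assumes the same conventions as the slides (negative means inside) plus a few of its own: integer pixel coordinates, square tiles, and a step to the opposite corner of (|B| + |C|) * size. `classify_tile` and its return format are hypothetical.

```python
def make_edge(p0, p1):
    """Return (B, C, x0, y0) so that B*(x-x0) + C*(y-y0) is negative inside."""
    (x0, y0), (x1, y1) = p0, p1
    return (y1 - y0, x0 - x1, x0, y0)

def classify_tile(edges, tile_x, tile_y, size):
    """Return ('reject', []), ('accept', []), or ('partial', relevant_edge_indices)."""
    relevant = []
    for i, (B, C, x0, y0) in enumerate(edges):
        # Trivial reject corner: where this edge's equation is most negative.
        rx = tile_x if B >= 0 else tile_x + size
        ry = tile_y if C >= 0 else tile_y + size
        reject_value = B * (rx - x0) + C * (ry - y0)
        if reject_value >= 0:
            return ("reject", [])
        # Step to the opposite (trivial accept) corner.
        accept_value = reject_value + (abs(B) + abs(C)) * size
        if accept_value >= 0:
            relevant.append(i)   # edge still matters inside this tile
    return ("accept", []) if not relevant else ("partial", relevant)

verts = [(0, 0), (1000, 0), (0, 1000)]
edges = [make_edge(verts[i], verts[(i + 1) % 3]) for i in range(3)]
```

For a partially covered tile, only the edges in the returned list need to be tested when rasterizing within the tile.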
Trivial accept: example
[Figure: the 256x256 render target, from (0,0) to (256,256), divided into Tile 0 through Tile 3 and crossed by a black edge, with its + and - sides marked. The trivial accept corner of each tile for the black edge is marked.]
Trivial accept corner of tile 0 for black edge: if this point is inside the black edge, all points in the tile must be inside the black edge
Not-rasterizing
If all edges are negative at their trivial accept
corners, the whole tile is inside the triangle
No further rasterization is needed
Can just store a “draw-whole-tile” command in bin
Back end can then effectively do two nested loops
around shaders
Full-screen triangle rasterization speed ~= infinity
Intra-tile rasterization
Same principle, but vectorized
Assume tile is 64x64 pixels and vector width is 16
Vector-calculate the 16 trivial-reject and trivial-
accept corners of the 16x16 blocks as a delta
from the tile trivial-reject corner
Just two adds per edge, using tables generated at
triangle set-up time
AND edge results together
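A scalar model of the vectorized step, with each 16-element vector written as a Python list. It assumes element i maps to 16x16 block (i % 4, i // 4) within the tile, integer pixel coordinates, and negative meaning inside. The table layout and every function name here are illustrative guesses, not the renderer's actual data structures.

```python
TILE, BLOCK, LANES = 64, 16, 16   # tile size, block size, vector width

def make_edge(p0, p1):
    """Return (B, C, x0, y0) so that B*(x-x0) + C*(y-y0) is negative inside."""
    (x0, y0), (x1, y1) = p0, p1
    return (y1 - y0, x0 - x1, x0, y0)

def reject_step_table(B, C):
    """Steps from the tile trivial reject corner to each block's reject corner.
    Would be built once per triangle at set-up time."""
    table = []
    for lane in range(LANES):
        bx, by = lane % 4, lane // 4
        dx = bx * BLOCK if B >= 0 else (bx + 1) * BLOCK - TILE
        dy = by * BLOCK if C >= 0 else (by + 1) * BLOCK - TILE
        table.append(B * dx + C * dy)
    return table

def sign_mask(values):
    """Bit i is 1 where values[i] is negative (inside)."""
    mask = 0
    for i, v in enumerate(values):
        if v < 0:
            mask |= 1 << i
    return mask

def block_masks(edges, tile_x, tile_y):
    """Return (trivially accepted, partially covered) masks for the 16 blocks."""
    not_rejected = accepted = (1 << LANES) - 1
    for B, C, x0, y0 in edges:
        rx = tile_x if B >= 0 else tile_x + TILE
        ry = tile_y if C >= 0 else tile_y + TILE
        tile_value = B * (rx - x0) + C * (ry - y0)
        accept_step = (abs(B) + abs(C)) * BLOCK
        reject_vals = [tile_value + s for s in reject_step_table(B, C)]  # first add
        accept_vals = [v + accept_step for v in reject_vals]             # second add
        not_rejected &= sign_mask(reject_vals)   # AND edge results together
        accepted &= sign_mask(accept_vals)
    return accepted, not_rejected & ~accepted

verts = [(-32, -32), (160, -32), (-32, 160)]
edges = [make_edge(verts[i], verts[(i + 1) % 3]) for i in range(3)]
accepted, partial = block_masks(edges, 0, 0)
```

In this example the triangle covers every block of the tile except the last, which the diagonal edge clips, so fifteen blocks are trivially accepted and one is partial.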
Vector rasterization to 16x16s: trivial reject example
[Figure: a 64x64 tile crossed by a black edge, with its + and - sides marked. White dots are the trivial reject corners of the 16x16s for the black edge. Gray 16x16s are trivially rejected by the black edge. The tile trivial reject corner for the black edge is also marked.]
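Vector rasterization to 16x16s: trivial reject example
[Figure: a 64x64 tile crossed by a black edge, with its + and - sides marked. White dots are the trivial reject corners of the 16x16s for the black edge; gray 16x16s are trivially rejected by the black edge. Stepping starts from the value at the tile trivial reject corner for the black edge: the corner at offset (-48, 0) is reached by stepping by B(-48) + C(0), and the corner at offset (-48, -48) by stepping by B(-48) + C(-48).]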
Vector rasterization to 16x16s: trivial accept example
[Figure: a 64x64 tile crossed by a black edge, with its + and - sides marked. White dots are the trivial accept corners of the 16x16s for the black edge. Gray 16x16s are trivially rejected by the black edge. Pink 16x16s are trivially accepted by the black edge.]
Intra-tile rasterization
Vector-test sign of edge equations at trivial
accept and trivial reject corners and AND
together
Bit-scan through resulting masks to find trivially
and partially accepted 16x16 blocks
Each trivial accept becomes a draw-block command
Again, no further rasterization needed for those pixels
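The bit-scan loop can be modeled in a few lines. This sketch assumes the scan visits 1-bits from the least significant bit upward and that bit i corresponds to 16x16 block (i % 4, i // 4) in the tile; the command tuple format and function names are made up for illustration.

```python
def set_bits(mask):
    """Yield the indices of 1-bits, lowest first, as a bit-scan loop would."""
    while mask:
        low = mask & -mask            # isolate the lowest set bit
        yield low.bit_length() - 1
        mask ^= low                   # clear it and continue scanning

def draw_commands(accept_mask, tile_x, tile_y, block=16, blocks_per_row=4):
    """Emit one draw-block command per trivially accepted 16x16 block."""
    return [("draw_block",
             tile_x + (i % blocks_per_row) * block,
             tile_y + (i // blocks_per_row) * block)
            for i in set_bits(accept_mask)]

found = list(set_bits(0b0000000000010001))
commands = draw_commands(0b0000000000010001, 0, 0)
```

The loop runs once per set bit rather than once per block, so sparse masks cost very little.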
Intra-tile rasterization
[Figure: computing the edge #1 trivial accept mask for the 16 16x16 blocks. Element 15 is at the left, element 0 at the right.]
   3  2  1  0  2  1  0 -1  1  0 -1 -2  0 -1 -2 -3   Edge #1 equation step values from tile trivial accept corner to trivial accept corners of 16x16s
+  1 (broadcast to all 16 elements)                 Edge #1 equation value at tile trivial accept corner
=  4  3  2  1  3  2  1  0  2  1  0 -1  1  0 -1 -2   Edge #1 equation values at trivial accept corners of 16x16s
<  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0   Preset zeroes, in a register or broadcast from memory
=  0  0  0  0  0  0  0  0  0  0  0  1  0  0  1  1   Bit mask, in mask register, for edge #1 trivial accept 16x16s
Intra-tile rasterization
[Figure: ANDing the per-edge masks into the composite trivial accept mask. Element 15 is at the left, element 0 at the right.]
   0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1   Bit mask, in mask register, for edge #1 trivial accept 16x16s
&  0 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1   Bit mask, in mask register, for edge #2 trivial accept 16x16s
=  0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1   Intermediate result
&  0 0 1 1 0 0 1 1 0 0 1 1 0 0 0 1   Bit mask, in mask register, for edge #3 trivial accept 16x16s
=  0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1   Composite trivial accept mask for 16x16 blocks
Block #0 found by first bit-scan
Block #4 found by second bit-scan
Bit-scan completed
Trivial accept for 16x16 blocks
; Step to edge values at 16 16x16 block trivial accept corners
vaddpi v0, v3, [rsi+Edge0TileTrivialAcceptCornerValue]{1to16}
vaddpi v1, v4, [rsi+Edge1TileTrivialAcceptCornerValue]{1to16}
vaddpi v2, v5, [rsi+Edge2TileTrivialAcceptCornerValue]{1to16}
; See if each trivial accept corner is inside all three edges
vcmplepi k1, v0, [rsi+ConstantZero]
vcmplepi k1 {k1}, v1, [rsi+ConstantZero]
vcmplepi k1 {k1}, v2, [rsi+ConstantZero]
; Get the mask; 1-bits are trivial accept corners that are
; inside all three edges
kmov eax, k1
; Loop through 1-bits, issuing a draw-16x16-block command
; for each trivially-accepted 16x16 block
bsf ecx, eax
jz TrivialAcceptDone ; bsf sets ZF when eax is zero (no 1-bits)
TrivialAcceptLoop:
; <Store draw-16x16-block command, along with (x,y) location>
bsfi ecx, eax
jnz TrivialAcceptLoop
TrivialAcceptDone:
Trivial accept for 16x16 blocks
; Step to edge values at 16 16x16 block trivial accept corners &
; see if each trivial accept corner is inside all three edges
vaddsetspi v0, v3, [rsi+Edge0TileTrivialAcceptCornerValue]{1to16}
vaddsetspi v1, v4, [rsi+Edge1TileTrivialAcceptCornerValue]{1to16}
vaddsetspi v2, v5, [rsi+Edge2TileTrivialAcceptCornerValue]{1to16}
; Get the mask; 1-bits are trivial accept corners that are
; inside all three edges
kmov eax, k1
; Loop through 1-bits, issuing a draw-16x16-block command
; for each trivially-accepted 16x16 block
bsf ecx, eax
jz TrivialAcceptDone ; bsf sets ZF when eax is zero (no 1-bits)
TrivialAcceptLoop:
; <Store draw-16x16-block command, along with (x,y) location>
bsfi ecx, eax
jnz TrivialAcceptLoop
TrivialAcceptDone:
Partially accepted 16x16 blocks
For each partial 16x16, descend again to 4x4s
Put trivially accepted 4x4s in bins
Partially accepted 4x4s need to be processed
into pixel masks
Vector add of equation step for each edge
AND signs together to form pixel mask
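A scalar model of forming the pixel mask for one partially covered 4x4 block. Several simplifications are assumed: the reference point is the block's top-left corner rather than the trivial reject corner, coordinates use 8 fractional bits, bit i of the mask is pixel (i % 4, i // 4), and a value of exactly zero counts as outside (fill rules are ignored). All helper names are hypothetical.

```python
FRAC_BITS = 8
ONE = 1 << FRAC_BITS    # one pixel in fixed point
HALF = ONE // 2         # offset to a pixel center

def make_edge(p0, p1):
    """Pixel-coordinate vertices in; (B, C, x0, y0) in fixed point out."""
    x0, y0, x1, y1 = p0[0] * ONE, p0[1] * ONE, p1[0] * ONE, p1[1] * ONE
    return (y1 - y0, x0 - x1, x0, y0)

def pixel_center_table(B, C):
    """Edge-equation steps from the block corner to its 16 pixel centers."""
    return [B * ((i % 4) * ONE + HALF) + C * ((i // 4) * ONE + HALF)
            for i in range(16)]

def pixel_mask(edges, block_x, block_y):
    """16-bit coverage mask for the 4x4 block whose top-left pixel is (block_x, block_y)."""
    mask = 0xFFFF
    for B, C, x0, y0 in edges:
        corner = B * (block_x * ONE - x0) + C * (block_y * ONE - y0)
        values = [corner + s for s in pixel_center_table(B, C)]   # one vector add
        edge_mask = sum(1 << i for i, v in enumerate(values) if v < 0)
        mask &= edge_mask                                         # AND signs together
    return mask

verts = [(0, 0), (4, 0), (0, 4)]
edges = [make_edge(verts[i], verts[(i + 1) % 3]) for i in range(3)]
coverage = pixel_mask(edges, 0, 0)
```

For this triangle the covered pixels are the six whose centers lie strictly below the diagonal, giving mask 0x137.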
Pixel mask for partial 4x4: example
[Figure: a 4x4 block of pixels crossed by a black edge, with its + and - sides marked. White dots are pixel centers. Blue pixels are inside the black edge; grey pixels are outside the black edge. The trivial reject corner for the black edge for the 4x4 is marked.]
Resulting mask register: 1110111011001100
Partially accepted 4x4 blocks
; Store values at corners of 16 4x4s in 16x16 for indexing into
vstored [rsi+Edge0TrivialRejectCornerValues4x4], v0
vstored [rsi+Edge1TrivialRejectCornerValues4x4], v1
vstored [rsi+Edge2TrivialRejectCornerValues4x4], v2
; Load step tables from corners of 4x4s to pixel centers
vloadd v3, [rsi+Edge0PixelCenterTable]
vloadd v4, [rsi+Edge1PixelCenterTable]
vloadd v5, [rsi+Edge2PixelCenterTable]
; Loop through 1-bits from trivial reject test on 16x16 block,
; descending to rasterize each partially-accepted 4x4
kmov eax, k1
bsf ecx, eax
jz Partial4x4Done ; bsf sets ZF when eax is zero (no 1-bits)
Partial4x4Loop:
; See if each of 16 pixel centers is inside all three edges
vcmpgtpi k2, v3, [rsi+Edge0TrivialRejectCornerValues4x4+rcx*4] {1to16}
vcmpgtpi k2 {k2}, v4, [rsi+Edge1TrivialRejectCornerValues4x4+rcx*4] {1to16}
vcmpgtpi k2 {k2}, v5, [rsi+Edge2TrivialRejectCornerValues4x4+rcx*4] {1to16}
; Store the mask
kmov edx, k2
mov [rbx], dx
; <Store the (x,y) location and advance rbx>
bsfi ecx, eax
jnz Partial4x4Loop
Partial4x4Done:
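Partially accepted 4x4 blocks - MSAA
[Figure: a 4x4 block of pixels crossed by a black edge, with its + and - sides marked. Yellow dots are sample 2 centers. Blue samples are inside the black edge; grey samples are outside the black edge. The trivial reject corner for the black edge for the 4x4 is marked.]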
Larrabee rasterization
Adaptive rasterization
Don’t have the luxury of custom data and ALU sizes
Do have the luxury of adapting to input data
Edge equations have to be evaluated with 48 bits in the worst case
We have to use 64 bits
Can use 32 bits for triangles that fit in 128x128 bounding boxes
90+% of triangles
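A back-of-the-envelope check of why a small bounding box permits 32-bit math. The sketch assumes 8 fractional bits, evaluates each edge equation only at the four corners of the triangle's own bounding box, and uses Python's unbounded integers to measure how large the values get. `fits_32_bit_path` and `worst_edge_magnitude` are hypothetical names, and the real renderer's selection logic may differ.

```python
FRAC_BITS = 8

def to_fixed(pixels):
    return pixels << FRAC_BITS

def fits_32_bit_path(verts, max_box=128):
    """True if the triangle's bounding box is at most max_box pixels on each side."""
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    limit = to_fixed(max_box)
    return max(xs) - min(xs) <= limit and max(ys) - min(ys) <= limit

def worst_edge_magnitude(verts):
    """Largest |edge equation value| over the four bounding-box corners."""
    xs = [x for x, _ in verts]
    ys = [y for _, y in verts]
    corners = [(x, y) for x in (min(xs), max(xs)) for y in (min(ys), max(ys))]
    worst = 0
    for i in range(3):
        (x0, y0), (x1, y1) = verts[i], verts[(i + 1) % 3]
        B, C = y1 - y0, x0 - x1
        for x, y in corners:
            worst = max(worst, abs(B * (x - x0) + C * (y - y0)))
    return worst

small = [(to_fixed(x), to_fixed(y)) for x, y in [(0, 0), (128, 0), (0, 128)]]
large = [(to_fixed(x), to_fixed(y))
         for x, y in [(-16384, -16384), (16383, -16384), (-16384, 16383)]]
```

Under these assumptions the 128x128 triangle stays within signed 32 bits, while the triangle spanning the full coordinate range needs more than 32 bits but fewer than 48.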
Adaptive rasterization (cont.)
When we do have to do 64-bit edge evaluation
64-bit only required for tile assignment
Any tile up to 128x128 that’s not trivially accepted or rejected can be rasterized using 32 bits
Adaptive intra-tile rasterization
Triangles that fit in a 16x16 bounding box
One less level of descent, less set-up, no trivial accept test
Direct mask stamping for 4x4, 4x8, 8x4 bounding boxes
Non-rasterization-based z for small triangles
Easy to try things out
Implementation
Will still not match dedicated hardware peak
rates per square millimeter on average
Efficient enough, and avoids dedicating area and
design effort for a narrow purpose
Generality improves overall perf for a wide range of
tasks
For example, can bring more mm^2 to bear – the
whole chip!
Implementation
Texture sampling and filtering remain as
significant challenges for software
Apart from that, everything can be implemented
in software
Not always obvious how at first, but surprisingly
often doable
Still evolving
A whole new way to think about optimization
Further Larrabee Information
Tom Forsyth’s talk
10:30: Room 3002, West Hall
SIGGRAPH paper
“Larrabee: A Many-Core x86 Architecture for Visual Computing,” Seiler et al.
Just search on “Larrabee SIGGRAPH paper”
Dr. Dobb’s Journal article in April
www.intel.com/software/graphics
GDC Larrabee talks
C++ Larrabee Prototype Library