GDC09 Abrash Larrabee+Final

Rasterization on Larrabee: A First Look at the Larrabee New Instructions (LRBni) in Action * Michael Abrash GDC, March 2009 * Warning: this is a pretty technical talk, so if you have no interest in processor architecture, instruction sets, or programming and were trying to decide between this talk and something about game design or the production pipeline, you might want to reconsider!
Transcript
Page 1: GDC09 Abrash Larrabee+Final

Rasterization on Larrabee: A First Look at the Larrabee New Instructions (LRBni) in Action *

Michael Abrash

GDC, March 2009

* Warning: this is a pretty technical talk, so if you have no interest in processor architecture, instruction sets, or programming and were trying to decide between this talk and something about game design or the production pipeline, you might want to reconsider!

Page 2: GDC09 Abrash Larrabee+Final

I never did get that lerp instruction!

Page 3: GDC09 Abrash Larrabee+Final

Larrabee

What is Larrabee?

Better yet, why is Larrabee?

For decades, processors have gotten faster

Higher clock speeds

Smaller (== faster and more) gates

Bigger caches

Hardware that extracts more work per clock

Page 4: GDC09 Abrash Larrabee+Final

Why Larrabee?

That process certainly hasn’t stopped

But it’s getting harder

Low-hanging fruit already taken

Running up against power budgets

Page 5: GDC09 Abrash Larrabee+Final

Why Larrabee?

More recently, developments along another axis

Vector processing

And another

Multiple hardware threads

And yet another

Multiple cores

Performance can scale linearly with gates

More work per clock by moving burden of extracting parallelism to software

Page 6: GDC09 Abrash Larrabee+Final

What is Larrabee?

Larrabee is the logical conclusion of these trends

Lots of power-efficient cores

In-order pipeline

Clocked at the power/performance sweet spot

4 threads per core

16-wide vector units

Streaming support

Page 7: GDC09 Abrash Larrabee+Final

What is Larrabee?

Very powerful

Very power-efficient

Very highly parallel

Very dependent on software to make use of that parallelism

Page 8: GDC09 Abrash Larrabee+Final

What is Larrabee?

Gets as much work out of each watt and each square millimeter as possible

Scales well far into the future

Massive potential performance; 1 teraflop & up

My favorite part

Excellent software control of performance

Relies heavily on software to get it to live up to its potential

Page 9: GDC09 Abrash Larrabee+Final

Larrabee hardware architecture

Obligatory vague architecture slide

[Block diagram: many multi-threaded wide SIMD cores, each with its own I$ and D$, sharing an L2 cache, together with memory controllers, fixed-function texture logic, a display interface, and a system interface]

Page 10: GDC09 Abrash Larrabee+Final

Larrabee hardware architecture

Lots of enhanced in-order x86 cores

Fully capable of running an OS and apps

Great flexibility in graphics pipeline design

Can support a wide variety of software

4 threads per core

Hide pipeline latency, cache misses

Coherent caches, connected by a fast ring

Page 11: GDC09 Abrash Larrabee+Final

Larrabee hardware architecture

Tried to maximize general usability of Larrabee hw

Fixed-function texture sampler units

Also a per-core cache-management unit

No other fixed-function units

Page 12: GDC09 Abrash Larrabee+Final

Larrabee hardware architecture

These features boost performance via thread-level parallelism

Key element of Larrabee performance, but it’s not unique to Larrabee, so I’m not going to talk about it further today

Page 13: GDC09 Abrash Larrabee+Final

Larrabee hardware architecture

I’m going to focus on per-thread (data-level, SIMD) parallelism

16-wide vector unit

Why 16-wide?

The wider the better – if it gets used

Page 14: GDC09 Abrash Larrabee+Final

LRBni

Larrabee New Instructions

>100 new instructions

Mostly vector instructions

Architected in close collaboration with developers

Design philosophy

It’s hard to leverage data parallelism without all the right pieces in the instruction set

Enable generally-usable extraction of data-level parallelism

Page 15: GDC09 Abrash Larrabee+Final

The fundamentals of LRBni

32 vector registers, each 512 bits wide

v0-v31

Full complement of vector instructions

Operate on int32, float32, float64

Mul, add, sub, adc, sbb, subr, and, or, xor, multiply-add, multiply-sub

Vector compares

Aligned and unaligned store/load

Gather/scatter

Bit manipulation: insert field, interleave, shuffle

Page 16: GDC09 Abrash Larrabee+Final

The fundamentals of LRBni

Ternary (three-argument) operations

Load-operand – can read one arg from memory

No-cost type conversions on load/store

All math is 32- or 64-bit wide

Smaller data in memory to save bw and footprint

Common upconversions on load-operand

Upconversion and/or broadcast on memory load

Downconversion and/or selection on memory store

All common DX/OpenGL types including float16, unorm8, etc.

Page 17: GDC09 Abrash Larrabee+Final

The fundamentals of LRBni

8 16-bit mask registers

Every vector instruction can do no-cost predication

Most often set by vector compares

Can be copied from scalar registers (eax, ebx, …)

Set of logical instructions that operate on masks

Mask tests allow real branches and loops

Page 18: GDC09 Abrash Larrabee+Final

The fundamentals of LRBni

Parallel <=> serial conversion

Pack-store; load-unpack

Gather; scatter

Bit scan initialized

Streaming support

Prefetching

Cache control

Page 19: GDC09 Abrash Larrabee+Final

Tim Sweeney on Larrabee

Quotes from Tim Sweeney on Larrabee:

Short version: Larrabee rocks!

Larrabee instructions are “vector complete”

More precisely: Any loop written in a traditional programming language can be vectorized, to execute 16 iterations of the loop in parallel on Larrabee vector units, provided the loop body meets the following criteria:

Its call graph is statically known.

There are no data dependencies between iterations.

Page 20: GDC09 Abrash Larrabee+Final

Michael Abrash on Larrabee

“Tim’s absolutely right, but I’ll bet there’s still a lot of performance to be had from mucking around under the hood”

Page 21: GDC09 Abrash Larrabee+Final

Sample LRBni code

kxnor k2, k2

vorpi v0, v2, v2

vorpi v1, v3, v3

vxorpi v4, v4, v4

mov ebx, 256

loop:

cmp ebx,0

jl endloop

dec ebx

vmulps v21 {k2}, v0, v1

vaddps v21 {k2}, v21, v21

vmadd213ps v0 {k2}, v0, v2

vmsub231ps v0 {k2}, v1, v1

vaddps v1 {k2}, v21, v3

vaddps v4 {k2}, v4, [ConstantFloatOne] {1to16}

vmulps v25 {k2}, v0, v0

vmadd231ps v25 {k2}, v1, v1

vcmpleps k2 {k2}, v25, [ConstantFloatOne] {1to16}

kortest k2, k2

jnz loop

endloop:

(this happens to be a Mandelbrot-set generator)

(thanks Dean for fixing this!)
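For readers who don't read LRBni yet, the overall shape of the loop above can be modelled in plain Python. This is only a rough sketch, not a line-by-line translation: it assumes v2/v3 hold the per-lane constant c, v0/v1 the running value z (starting at c), v4 the per-lane iteration count, and k2 the mask of lanes still iterating. The function name, the list-per-register representation, and the default escape bound of 4.0 (the textbook |z|² bound; the slide compares against a constant called ConstantFloatOne) are illustrative choices.

```python
def mandelbrot_lanes(cx, cy, max_iters=256, limit=4.0):
    """Iterate z = z*z + c for several points at once, one point per lane."""
    lanes = len(cx)
    x, y = list(cx), list(cy)      # like v0, v1: z starts out equal to c
    count = [0] * lanes            # like v4: per-lane iteration count
    mask = [True] * lanes          # like k2: lanes that are still iterating
    for _ in range(max_iters):
        if not any(mask):          # like kortest + jnz: stop once the mask is empty
            break
        for i in range(lanes):
            if not mask[i]:
                continue           # predication: masked-off lanes keep their values
            two_xy = 2.0 * x[i] * y[i]
            x[i] = x[i] * x[i] - y[i] * y[i] + cx[i]
            y[i] = two_xy + cy[i]
            count[i] += 1
            mask[i] = x[i] * x[i] + y[i] * y[i] <= limit
    return count
```

The point of the mask is that lanes which have escaped simply stop updating, so all lanes can share one loop and one branch.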

Page 22: GDC09 Abrash Larrabee+Final

LRBni examples

Ternary: start with a simple vector multiply

vmulps v0, v5, v6 ; v0 = v5 * v6

Page 23: GDC09 Abrash Larrabee+Final

LRBni examples

Multiply-add: destination is also third source

vmadd231ps v0, v5, v6 ; v0 = v5 * v6 + v0

Operand 2 times operand 3 plus operand 1

Page 27: GDC09 Abrash Larrabee+Final

LRBni examples

Predication: mask the writing of the elements

vmadd231ps v0 {k1}, v5, v6

Page 28: GDC09 Abrash Larrabee+Final

LRBni examples

Load-operand: src2 is the memory location specified by rbx+rcx*4

vmadd231ps v0 {k1}, v5, [rbx+rcx*4]

Page 29: GDC09 Abrash Larrabee+Final

LRBni examples

The operands can be plugged in differently

vmadd213ps v0 {k1}, v5, [rbx+rcx*4]

Operand 2 times operand 1 plus operand 3
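The operand-order digits and the mask can be modelled with a few lines of Python. This is a sketch under assumptions: registers are plain lists, `vmadd` is a hypothetical helper name, and four lanes are used instead of 16 to keep the example short. Operand 1 is the destination, operand 2 the first source, and operand 3 the second source, as on the slides.

```python
def vmadd(order, dst, src1, src2, mask=None):
    """Model a masked multiply-add on equal-length lists.

    order "231": operand 2 * operand 3 + operand 1
    order "213": operand 2 * operand 1 + operand 3
    """
    operands = {"1": dst, "2": src1, "3": src2}
    a, b, c = (operands[digit] for digit in order)
    out = list(dst)
    for i in range(len(dst)):
        if mask is None or mask[i]:   # masked-off lanes keep the old dst value
            out[i] = a[i] * b[i] + c[i]
    return out
```

For example, with dst = [1, 1, 1, 1], src1 = [2, 2, 2, 2], and src2 = [3, 3, 3, 3], order "231" computes 2 * 3 + 1 per lane and order "213" computes 2 * 1 + 3.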

Page 30: GDC09 Abrash Larrabee+Final

LRBni examples

Broadcast: expand 4 (or 1) elements in memory

vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {4to16}

Page 31: GDC09 Abrash Larrabee+Final

LRBni examples

Conversion: upconvert from float16 format

vmadd231ps v0 {k1}, v5, [rbx+rcx*4] {float16}
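The two load modifiers above can be pictured with a small Python model. This is a sketch under assumptions: the helper names are hypothetical, memory is passed in as a list or a bytes object, and the replication order shown for {4to16} (the four elements repeated four times) is an assumption about the layout rather than something the slides state.

```python
import struct

def load_4to16(values):
    """Model {4to16}: expand 4 elements read from memory across 16 lanes."""
    if len(values) != 4:
        raise ValueError("expected exactly 4 elements")
    return list(values) * 4

def load_float16(raw):
    """Model {float16}: read 16 half-precision values and widen each to a float."""
    return list(struct.unpack("<16e", raw))
```

In both cases the arithmetic still runs at full width; only the footprint in memory shrinks.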

Page 32: GDC09 Abrash Larrabee+Final

LRBni examples

Pack-store

vcompressd [rbx] {k1}, v0

Page 33: GDC09 Abrash Larrabee+Final

LRBni examples

Gather

vgatherd v1 {k1}, [rbx + v2*4]
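Pack-store and gather are the parallel-to-serial and serial-to-parallel pieces, and a Python model may make their effect concrete. This is a sketch under assumptions: memory is a list indexed by element rather than by byte (so the *4 scale in the address disappears), and the helper names are hypothetical.

```python
def pack_store(memory, base, src, mask):
    """Write the mask-selected elements of src to consecutive slots from base.

    Returns the number of elements written.
    """
    written = 0
    for value, keep in zip(src, mask):
        if keep:
            memory[base + written] = value
            written += 1
    return written

def gather(memory, base, indices, mask, dst):
    """For each enabled lane i, load memory[base + indices[i]]; others keep dst."""
    return [memory[base + index] if keep else old
            for index, keep, old in zip(indices, mask, dst)]
```

Pack-store squeezes out the masked-off lanes, which is what turns a sparse vector result into a dense stream.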

Page 34: GDC09 Abrash Larrabee+Final

The fundamentals of LRBni

All these instructions run at full speed!

I know this has been way too fast

There just isn’t time for an in-depth look

LRBni in more detail:

Tom Forsyth’s talk (10:30: Room 3002, West Hall)

Dr. Dobb’s Journal article in April

These slides (complete with notes)

Page 35: GDC09 Abrash Larrabee+Final

LRBni in Action

Everything you need to run fully-parallel code well

General code running on a CPU

Can run anything

How well can it run less-than-perfectly-parallelizable code?

RAD has been working on the Larrabee D3D/OpenGL graphics pipeline

Pipeline is not an ideal candidate for parallelization

Retirement must be in order

Rasterization is not easy to vectorize efficiently

I’ll look at rasterization today

The process of determining which pixels are inside a triangle

Page 36: GDC09 Abrash Larrabee+Final

Applying LRBni to the graphics pipeline

For the most part, this is easy

Z/stencil buffering, pixel shading, blending all fall out of processing 4x4 blocks

16 vertices can be shaded and cached in parallel

Vertex usage tends to be localized

Triangle set-up lends itself pretty well to vectorization

Some efficiency cost from culling

Page 37: GDC09 Abrash Larrabee+Final

Rasterization is not easy to vectorize

At least not with usable performance

In fact, we were sure it couldn’t be done!

Forced to reexamine assumptions

So irritated at being asked for the hundredth time if it was possible that we sat down to prove it wasn’t

We failed

Page 39: GDC09 Abrash Larrabee+Final

Dedicated hardware can do any given task more efficiently

In general, dedicated hardware will be able to do any given graphics task more efficiently per square millimeter than software

However, CPU flexibility can gain some or all of that back by applying square millimeters as needed

Hardware needs worst-case capacity for each component

Often partly or entirely idle

When little rasterization is needed (long shaders, say), CPUs can just use cycles for other purposes; ALUs are never idle

Can even have higher peak rates in many cases, because the whole chip can work on a single task if necessary

Page 40: GDC09 Abrash Larrabee+Final

A quick refresher

Three edges per triangle, each defined by an equation Bx+Cy relative to any point on edge

Sign indicates in/out

X and y snapped to 15.8

Range [-16K,+16K]

Tested at pixel/sample centers

Fill rules must be observed

Must be exact (discrete math)

Must support multisampled antialiasing (MSAA)
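The edge-equation test in this refresher can be sketched in a few lines of Python. The assumptions are mine: plain integers stand in for the snapped fixed-point coordinates, negative means inside (the convention used later in the talk), the vertex order is chosen so that the interior comes out negative, and fill rules for points exactly on an edge are not modelled. The function names are hypothetical.

```python
def edge_function(p0, p1):
    """Return (B, C, x0, y0) so that E(x, y) = B*(x - x0) + C*(y - y0)."""
    (x0, y0), (x1, y1) = p0, p1
    return (y1 - y0, -(x1 - x0), x0, y0)

def evaluate(edge, x, y):
    b, c, x0, y0 = edge
    return b * (x - x0) + c * (y - y0)

def inside(triangle, x, y):
    """True if (x, y) is on the negative side of all three edges."""
    edges = [edge_function(triangle[i], triangle[(i + 1) % 3]) for i in range(3)]
    return all(evaluate(edge, x, y) < 0 for edge in edges)
```

Because each equation is linear, moving one step in x changes its value by B and one step in y changes it by C, which is what makes the table-driven stepping later in the talk possible.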

Page 41: GDC09 Abrash Larrabee+Final

Pixel rasterization: example

[Figure: a triangle over a grid of pixels, with the + and - side of each of its edges marked]

Page 42: GDC09 Abrash Larrabee+Final

Pixomatic 1 rasterization

Pixo 1 decomposed triangles into 1 or 2 trapezoids

Stepped down 2 edges at a time on pixel centers

Could do with only 1 branch, for loop

Branch misprediction is very expensive

Rasterization code tends to predict poorly

Page 43: GDC09 Abrash Larrabee+Final

Pixomatic 1 rasterization

Page 44: GDC09 Abrash Larrabee+Final

Problems with Pixo 1 rasterization

Required expensive IMUL per edge

Poorly suited to small triangles

Not well suited to MSAA sample jittering

Never could figure out how to vectorize it efficiently

Page 45: GDC09 Abrash Larrabee+Final

Sweep rasterization

Page 51: GDC09 Abrash Larrabee+Final

Sweep rasterization

Generally well suited to vectorization

Problems for CPUs

Lots of badly-predicted branching

Significant work to figure out where to descend

Example of an approach that dedicated hardware can perform much more efficiently

Page 52: GDC09 Abrash Larrabee+Final

Larrabee rasterization

Page 62: GDC09 Abrash Larrabee+Final

CPU smarts are useful for rasterization

Not-rasterizing

Page 63: GDC09 Abrash Larrabee+Final

CPU smarts are useful for rasterization

The Larrabee renderer uses chunking (binning, tiling)

For chunking, separate per-tile and intra-tile rasterization

Per-tile rasterization (triangle->tile assignment)

Bounding box tests for triangles up to 1 tile in size

Sweep or just walk bounding box for larger triangles

Tile-assignment time is insignificant for larger triangles

CPUs make it easy to do the 90% case well, difficult 10% case adequately

If large-triangle tile assignment is important, could use a form of the approach used for intra-tile (discussed next)

Page 64: GDC09 Abrash Larrabee+Final

A tiled render target

[Figure: a 256x256 render target, spanning (0,0) to (256,256), divided into four tiles: Tile 0, Tile 1, Tile 2, and Tile 3]

Page 65: GDC09 Abrash Larrabee+Final

Tile assignment test – trivial reject

Calculate value of each edge equation at its trivial reject tile corner

If any edge is non-negative, triangle does not intersect tile (<0 == inside so we can just test sign bit)
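A minimal Python sketch of the trivial reject test follows. It assumes each edge is given as a hypothetical tuple (B, C, x0, y0) with E(x, y) = B*(x - x0) + C*(y - y0), that negative means inside, and that tile corners (not pixel centers) are tested. The trivial reject corner is taken to be the tile corner where the edge equation is smallest, since if even that corner is not inside, no point in the tile can be.

```python
def trivial_reject_corner(b, c, tile_x, tile_y, tile_size):
    """Pick the tile corner where B*x + C*y is smallest (most inside)."""
    x = tile_x if b > 0 else tile_x + tile_size
    y = tile_y if c > 0 else tile_y + tile_size
    return x, y

def tile_rejected(edges, tile_x, tile_y, tile_size):
    """True if some edge is non-negative at its trivial reject corner."""
    for b, c, x0, y0 in edges:
        x, y = trivial_reject_corner(b, c, tile_x, tile_y, tile_size)
        if b * (x - x0) + c * (y - y0) >= 0:
            return True
    return False
```

Which corner is the reject corner depends only on the signs of B and C, so it can be chosen once per edge at triangle set-up time rather than per tile.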

Page 66: GDC09 Abrash Larrabee+Final

Trivial reject: example

[Figure: the 256x256 render target, spanning (0,0) to (256,256) and divided into Tile 0 through Tile 3, crossed by a black edge with its + and - sides marked; the edge equation becomes more positive in one direction and more negative in the other. The trivial reject corner of each of tiles 0, 1, 2, and 3 for the black edge is marked.]

Trivial reject corner of tile 0 for black edge: if this point isn’t inside black edge, no point in the tile can be inside black edge

Page 69: GDC09 Abrash Larrabee+Final

Tile assignment test – trivial accept

For each edge, sum trivial reject corner value with the equation step to opposite corner

If any edge is negative, the whole tile is trivially accepted for that edge

No need to consider it when rasterizing within tile

In general, only relevant edges need to be considered

Scissors, user clip
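Putting trivial reject and trivial accept together, a tile can be classified as sketched below. The assumptions are mine: edges are hypothetical (B, C, x0, y0) tuples with E(x, y) = B*(x - x0) + C*(y - y0), negative means inside, and the step from the trivial reject corner to the opposite (trivial accept) corner is (|B| + |C|) times the tile size.

```python
def classify_tile(edges, tile_x, tile_y, tile_size):
    """Return "reject", "accept", or "partial" for one tile."""
    all_accepted = True
    for b, c, x0, y0 in edges:
        # Trivial reject corner: where this edge equation is smallest.
        rx = tile_x if b > 0 else tile_x + tile_size
        ry = tile_y if c > 0 else tile_y + tile_size
        reject_value = b * (rx - x0) + c * (ry - y0)
        if reject_value >= 0:
            return "reject"
        # Step to the opposite corner, where the equation is largest.
        step = (abs(b) + abs(c)) * tile_size
        if reject_value + step >= 0:
            all_accepted = False   # this edge still has to be rasterized
    return "accept" if all_accepted else "partial"
```

A fuller version would also remember which edges were trivially accepted, so that only the remaining edges are considered inside the tile.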

Page 70: GDC09 Abrash Larrabee+Final

Trivial accept: example

[Figure: the 256x256 render target, spanning (0,0) to (256,256) and divided into Tile 0 through Tile 3, crossed by a black edge with its + and - sides marked. The trivial accept corner of each of tiles 0, 1, 2, and 3 for the black edge is marked.]

Trivial accept corner of tile 0 for black edge: if this point is inside black edge, all points in the tile must be inside black edge

Page 73: GDC09 Abrash Larrabee+Final

Not-rasterizing

If all edges are negative at their trivial accept corners, the whole tile is inside the triangle

No further rasterization is needed

Can just store a “draw-whole-tile” command in bin

Back end can then effectively do two nested loops around shaders

Full-screen triangle rasterization speed ~= infinity

Page 74: GDC09 Abrash Larrabee+Final

Intra-tile rasterization

Same principle, but vectorized

Assume tile is 64x64 pixels and vector width is 16

Vector-calculate the 16 trivial-reject and trivial-accept corners of the 16x16 blocks as a delta from the tile trivial-reject corner

Just two adds per edge, using tables generated at triangle set-up time

AND edge results together
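The table-driven step can be sketched in Python as follows. This is an illustration under assumptions, not the renderer's code: edges are reduced to their (B, C) coefficients, negative means inside, blocks are numbered row-major, and the two 16-entry tables per edge (steps to the block trivial reject corners and to the block trivial accept corners) are rebuilt on every call here, whereas the slide generates them once at triangle set-up time. All names are hypothetical.

```python
TILE, BLOCK = 64, 16   # tile and block sizes assumed on the slide

def setup_tables(b, c):
    """Steps from the tile trivial reject corner to each block's two corners."""
    ox_tile, ox_blk = (0, 0) if b > 0 else (TILE, BLOCK)
    oy_tile, oy_blk = (0, 0) if c > 0 else (TILE, BLOCK)
    reject = [b * (i * BLOCK + ox_blk - ox_tile) + c * (j * BLOCK + oy_blk - oy_tile)
              for j in range(4) for i in range(4)]
    to_opposite = (abs(b) + abs(c)) * BLOCK
    accept = [step + to_opposite for step in reject]
    return reject, accept

def block_masks(edges, tile_reject_values):
    """Return (not_rejected, accepted) 16-bit masks; bit n is block n."""
    not_rejected = accepted = 0xFFFF
    for (b, c), tile_value in zip(edges, tile_reject_values):
        reject_tab, accept_tab = setup_tables(b, c)
        for n in range(16):                        # one vector add per table
            if tile_value + reject_tab[n] >= 0:
                not_rejected &= ~(1 << n)          # AND edge results together
            if tile_value + accept_tab[n] >= 0:
                accepted &= ~(1 << n)
    return not_rejected, accepted
```

The inner loop over n is what a single 16-wide vector add and compare would do in one go.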

Page 75: GDC09 Abrash Larrabee+Final

Vector rasterization to 16x16s: trivial reject example

[Figure: a 64x64 tile crossed by a black edge with its + and - sides marked. White dots are trivial reject corners of 16x16s for black edge; gray 16x16s are trivially rejected by black edge; the tile trivial reject corner for black edge is marked.]

Page 77: GDC09 Abrash Larrabee+Final

Vector rasterization to 16x16s: trivial reject example

[Figure: a 64x64 tile crossed by a black edge with its + and - sides marked. White dots are trivial reject corners of 16x16s for black edge; gray 16x16s are trivially rejected by black edge.]

Step from the value at the tile trivial reject corner for black edge: the corner at (-48, 0) is reached by stepping B(-48) + C(0), and the corner at (-48, -48) by stepping B(-48) + C(-48)

Page 78: GDC09 Abrash Larrabee+Final

Vector rasterization to 16x16s: trivial accept example

[Figure: a 64x64 tile crossed by a black edge with its + and - sides marked. White dots are trivial accept corners of 16x16s for black edge; gray 16x16s are trivially rejected by black edge; pink 16x16s are trivially accepted by black edge.]

Page 79: GDC09 Abrash Larrabee+Final

Intra-tile rasterization

Vector-test sign of edge equations at trivial accept and trivial reject corners and AND together

Bit-scan through resulting masks to find trivially and partially accepted 16x16 blocks

Each trivial accept becomes a draw-block command

Again, no further rasterization needed for those pixels
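Walking the masks with a bit-scan can be sketched in Python like this. The helper names and the row-major block numbering are assumptions; the masks are taken to be a 16-bit "not trivially rejected" mask and a 16-bit "trivially accepted" mask, so that a block which is not rejected and not accepted is partially accepted.

```python
def set_bits(mask):
    """Yield the index of each 1-bit, lowest first, like a bit-scan loop."""
    while mask:
        low = mask & -mask             # isolate the lowest set bit
        yield low.bit_length() - 1
        mask ^= low                    # clear it and scan again

def sort_blocks(not_rejected, accepted, block=16, per_row=4):
    """Return (draw, descend): origins of accepted and partial blocks."""
    def origin(n):
        return ((n % per_row) * block, (n // per_row) * block)
    draw = [origin(n) for n in set_bits(accepted)]
    descend = [origin(n) for n in set_bits(not_rejected & ~accepted)]
    return draw, descend
```

The loop runs once per 1-bit rather than once per block, so empty masks cost almost nothing.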

Page 80: GDC09 Abrash Larrabee+Final

Intra-tile rasterization

[Figure: the 16 edge #1 equation step values from the tile trivial accept corner to the trivial accept corners of the 16x16s (ranging from 3 down to -3 in this example) are each added to the edge #1 equation value at the tile trivial accept corner (1 in this example), giving the edge #1 equation values at the trivial accept corners of the 16x16s (ranging from 4 down to -2)]

Page 81: GDC09 Abrash Larrabee+Final

Intra-tile rasterization

[Figure: the 16 edge #1 equation step values from the tile trivial accept corner to the trivial accept corners of the 16x16s are added to the edge #1 equation value at the tile trivial accept corner, giving the edge #1 equation values at the trivial accept corners of the 16x16s. These are compared (<) against preset zeroes, held in a register or broadcast from memory, to produce the bit mask, in a mask register, for edge #1 trivial accept 16x16s; in this example three of the 16 values are negative and get 1-bits.]

Page 82: GDC09 Abrash Larrabee+Final

Intra-tile rasterization

[Figure: the bit mask, in a mask register, for edge #1 trivial accept 16x16s is ANDed with the bit mask for edge #2 trivial accept 16x16s to give an intermediate result, which is then ANDed with the bit mask for edge #3 trivial accept 16x16s to give the composite trivial accept mask for 16x16 blocks; in this example the composite mask has two 1-bits]

Block #0 found by first bit-scan

Block #4 found by second bit-scan

Bit-scan completed

Page 86: GDC09 Abrash Larrabee+Final

Trivial accept for 16x16 blocks

; Step to edge values at 16 16x16 block trivial accept corners

vaddpi v0, v3, [rsi+Edge0TileTrivialAcceptCornerValue]{1to16}

vaddpi v1, v4, [rsi+Edge1TileTrivialAcceptCornerValue]{1to16}

vaddpi v2, v5, [rsi+Edge2TileTrivialAcceptCornerValue]{1to16}

; See if each trivial accept corner is inside all three edges

vcmplepi k1, v0, [rsi+ConstantZero]

vcmplepi k1 {k1}, v1, [rsi+ConstantZero]

vcmplepi k1 {k1}, v2, [rsi+ConstantZero]

; Get the mask; 1-bits are trivial accept corners that are

; inside all three edges

kmov eax, k1

; Loop through 1-bits, issuing a draw-16x16-block command

; for each trivially-accepted 16x16 block

bsf ecx, eax

jz TrivialAcceptDone ; bsf sets ZF when eax has no 1-bits, so skip the loop

TrivialAcceptLoop:

; <Store draw-16x16-block command, along with (x,y) location>

bsfi ecx, eax

jnz TrivialAcceptLoop

TrivialAcceptDone:

Page 91: GDC09 Abrash Larrabee+Final

Trivial accept for 16x16 blocks

; Step to edge values at 16 16x16 block trivial accept corners &

; see if each trivial accept corner is inside all three edges

vaddsetspi v0, v3, [rsi+Edge0TileTrivialAcceptCornerValue]{1to16}

vaddsetspi v1, v4, [rsi+Edge1TileTrivialAcceptCornerValue]{1to16}

vaddsetspi v2, v5, [rsi+Edge2TileTrivialAcceptCornerValue]{1to16}

; Get the mask; 1-bits are trivial accept corners that are

; inside all three edges

kmov eax, k1

; Loop through 1-bits, issuing a draw-16x16-block command

; for each trivially-accepted 16x16 block

bsf ecx, eax

jz TrivialAcceptDone ; bsf sets ZF when eax has no 1-bits, so skip the loop

TrivialAcceptLoop:

; <Store draw-16x16-block command, along with (x,y) location>

bsfi ecx, eax

jnz TrivialAcceptLoop

TrivialAcceptDone:

Page 92: GDC09 Abrash Larrabee+Final

Partially accepted 16x16 blocks

For each partial 16x16, descend again to 4x4s

Put trivially accepted 4x4s in bins

Partially accepted 4x4s need to be processed into pixel masks

Vector add of equation step for each edge

AND signs together to form pixel mask
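The final step, turning a partially accepted 4x4 into a pixel mask, can be sketched in Python. The assumptions are mine: edges are hypothetical (B, C, x0, y0) tuples with E(x, y) = B*(x - x0) + C*(y - y0), negative means inside, pixel centers sit at +0.5 from the pixel origin, and values are doubled so the half-pixel offsets stay in integer math. For simplicity the sketch steps from the block's origin corner rather than from its trivial reject corner.

```python
def pixel_mask_4x4(edges, block_x, block_y):
    """Return a 16-bit mask; bit n covers pixel (n % 4, n // 4) of the block."""
    mask = 0xFFFF
    for b, c, x0, y0 in edges:
        # Edge value at the block origin, doubled to keep everything integral.
        corner2 = 2 * (b * (block_x - x0) + c * (block_y - y0))
        # Step table from the block origin to the 16 pixel centers (doubled).
        table = [b * (2 * i + 1) + c * (2 * j + 1)
                 for j in range(4) for i in range(4)]
        edge_mask = 0
        for n, step2 in enumerate(table):   # one vector add and sign test
            if corner2 + step2 < 0:
                edge_mask |= 1 << n
        mask &= edge_mask                   # AND signs together
    return mask
```

The step table depends only on B and C, so it can be built once per triangle and reused for every partial 4x4.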

Page 93: GDC09 Abrash Larrabee+Final

Pixel mask for partial 4x4: example

[Figure: a block of 4x4 pixels crossed by a black edge with its + and - sides marked. White dots are pixel centers; blue pixels are inside black edge; grey pixels are outside black edge; the trivial reject corner for black edge for the 4x4 is marked.]

Resulting mask register: 1110111011001100

Page 95: GDC09 Abrash Larrabee+Final

Partially accepted 4x4 blocks

; Store values at corners of 16 4x4s in 16x16 for indexing into

vstored [rsi+Edge0TrivialRejectCornerValues4x4], v0

vstored [rsi+Edge1TrivialRejectCornerValues4x4], v1

vstored [rsi+Edge2TrivialRejectCornerValues4x4], v2

; Load step tables from corners of 4x4s to pixel centers

vloadd v3, [rsi+Edge0PixelCenterTable]

vloadd v4, [rsi+Edge1PixelCenterTable]

vloadd v5, [rsi+Edge2PixelCenterTable]

; Loop through 1-bits from trivial reject test on 16x16 block,

; descending to rasterize each partially-accepted 4x4

kmov eax, k1

bsf ecx, eax

jz Partial4x4Done ; bsf sets ZF when eax has no 1-bits, so skip the loop

Partial4x4Loop:

; See if each of 16 pixel centers is inside all three edges

vcmpgtpi k2, v3, [rsi+Edge0TrivialRejectCornerValues4x4+rcx*4] {1to16}

vcmpgtpi k2 {k2}, v4, [rsi+Edge1TrivialRejectCornerValues4x4+rcx*4] {1to16}

vcmpgtpi k2 {k2}, v5, [rsi+Edge2TrivialRejectCornerValues4x4+rcx*4] {1to16}

; Store the mask

kmov edx, k2

mov [rbx], dx

; <Store the (x,y) location and advance rbx>

bsfi ecx, eax

jnz Partial4x4Loop

Partial4x4Done:

Page 98: GDC09 Abrash Larrabee+Final

Partially accepted 4x4 blocks; Store values at corners of 16 4x4s in 16x16 for indexing into

vstored [rsi+Edge0TrivialRejectCornerValues4x4], v0

vstored [rsi+Edge1TrivialRejectCornerValues4x4], v1

vstored [rsi+Edge2TrivialRejectCornerValues4x4], v2

; Load step tables from corners of 4x4s to pixel centers

vloadd v3, [rsi+Edge0PixelCenterTable]

vloadd v4, [rsi+Edge1PixelCenterTable]

vloadd v5, [rsi+Edge2PixelCenterTable]

; Loop through 1-bits from trivial reject test on 16x16 block,

; descending to rasterize each partially-accepted 4x4

kmov eax, k1

bsf ecx, eax

jnz Partial4x4Done

Partial4x4Loop:

; See if each of 16 pixel centers is inside all three edges

vcmpgtpi k2, v3, [rsi+Edge0TrivialRejectCornerValues4x4+rcx*4] {1to16}

vcmpgtpi k2 {k2}, v4, [rsi+Edge1TrivialRejectCornerValues4x4+rcx*4] {1to16}

vcmpgtpi k2 {k2}, v5, [rsi+Edge2TrivialRejectCornerValues4x4+rcx*4] {1to16}

; Store the mask

kmov edx, k2

mov [rbx], dx

; <Store the (x,y) location and advance rbx>

bsfi ecx, eax

jnz Partial4x4Loop

Partial4x4Done:

Page 99: GDC09 Abrash Larrabee+Final

Partially accepted 4x4 blocks; Store values at corners of 16 4x4s in 16x16 for indexing into

vstored [rsi+Edge0TrivialRejectCornerValues4x4], v0

vstored [rsi+Edge1TrivialRejectCornerValues4x4], v1

vstored [rsi+Edge2TrivialRejectCornerValues4x4], v2

; Load step tables from corners of 4x4s to pixel centers

vloadd v3, [rsi+Edge0PixelCenterTable]

vloadd v4, [rsi+Edge1PixelCenterTable]

vloadd v5, [rsi+Edge2PixelCenterTable]

; Loop through 1-bits from trivial reject test on 16x16 block,

; descending to rasterize each partially-accepted 4x4

kmov eax, k1

bsf ecx, eax

jnz Partial4x4Done

Partial4x4Loop:

; See if each of 16 pixel centers is inside all three edges

vcmpgtpi k2, v3, [rsi+Edge0TrivialRejectCornerValues4x4+rcx*4] {1to16}

vcmpgtpi k2 {k2}, v4, [rsi+Edge1TrivialRejectCornerValues4x4+rcx*4] {1to16}

vcmpgtpi k2 {k2}, v5, [rsi+Edge2TrivialRejectCornerValues4x4+rcx*4] {1to16}

; Store the mask

kmov edx, k2

mov [rbx], dx

; <Store the (x,y) location and advance rbx>

bsfi ecx, eax

jnz Partial4x4Loop

Partial4x4Done:

Page 100: GDC09 Abrash Larrabee+Final

Partially accepted 4x4 blocks - MSAA

4x4 pixels

+ -

Yellow dots are sample 2 centers

Blue samples are inside black edge

Grey samples are outside black edge

Trivial reject corner for black edge for 4x4
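One way to read this figure: with MSAA the same compare is repeated once per sample, each time against a table built for that sample's position within the pixel. The Python sketch below assumes that inside means a negative edge value and that each table stores negated steps from a reference corner, so a single greater-than compare decides coverage. The sample offsets are hypothetical, and the block origin stands in for the trivial reject corner to keep the arithmetic short.

```python
def negated_step_table(a, b, offset_x, offset_y):
    """Build 16 entries, one per pixel of a 4x4 block.

    Each entry is minus the change in the edge value a*x + b*y + c when
    moving from the block's reference corner to the sample located at
    (col + offset_x, row + offset_y).
    """
    return [-(a * (col + offset_x) + b * (row + offset_y))
            for row in range(4) for col in range(4)]


def sample_mask(corner_value, table):
    """Bit i is set when sample i is inside.

    table[i] > corner_value is equivalent to corner_value + step < 0,
    i.e. the edge value at the sample is negative (assumed: inside).
    """
    mask = 0
    for lane, entry in enumerate(table):
        if entry > corner_value:
            mask |= 1 << lane
    return mask


# Hypothetical 4-sample pattern (offsets inside a pixel); not from the talk.
SAMPLE_OFFSETS = [(0.375, 0.125), (0.875, 0.375), (0.125, 0.625), (0.625, 0.875)]


def msaa_masks(a, b, corner_value):
    """Return one 16-bit coverage mask per sample position."""
    return [sample_mask(corner_value, negated_step_table(a, b, ox, oy))
            for ox, oy in SAMPLE_OFFSETS]


# Vertical edge x - 2.5 = 0 through a block at the origin: inside where x < 2.5.
demo_masks = msaa_masks(1.0, 0.0, -2.5)
print([hex(mask) for mask in demo_masks])
```

Only the tables change from sample to sample; the corner values and the compare loop stay the same.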

Page 101: GDC09 Abrash Larrabee+Final

Larrabee rasterization

Page 112: GDC09 Abrash Larrabee+Final

Adaptive rasterization

Don’t have the luxury of custom data and ALU sizes

Do have the luxury of adapting to input data

Edge equations have to be evaluated with 48 bits in the worst case

We have to use 64 bits

Can use 32 bits for triangles that fit in 128x128 bounding boxes

90+% of triangles
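A quick Python check of why 32 bits can be enough for small triangles. It assumes 8 bits of subpixel precision (not stated on this slide) and only considers points inside the bounding box; a real rasterizer also evaluates edges at block corners, which may need some headroom. The edge function is twice a signed triangle area, so inside a 128x128 box its magnitude cannot exceed the area of the box.

```python
from itertools import product

SUBPIXEL_BITS = 8                 # assumed fixed-point fraction bits
BOX = 128 << SUBPIXEL_BITS        # 128 pixels expressed in fixed-point units


def edge(p, v0, v1):
    """Edge function: twice the signed area of the triangle (v0, v1, p)."""
    return (v1[0] - v0[0]) * (p[1] - v0[1]) - (v1[1] - v0[1]) * (p[0] - v0[0])


def signed_bits(value):
    """Bits needed to hold value in two's complement, sign bit included."""
    magnitude_bits = value.bit_length() if value >= 0 else (~value).bit_length()
    return magnitude_bits + 1


# edge() is linear in each point separately, so over the box its extreme
# values occur with all three points at box corners; trying those is enough.
corners = [(0, 0), (BOX, 0), (0, BOX), (BOX, BOX)]
worst_value = max(abs(edge(p, v0, v1)) for p, v0, v1 in product(corners, repeat=3))
worst_bits = max(signed_bits(edge(p, v0, v1)) for p, v0, v1 in product(corners, repeat=3))
print(worst_value, worst_bits)
```

Under these assumptions the largest magnitude is 2^30, which fits in a signed 32-bit lane.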

Page 113: GDC09 Abrash Larrabee+Final

Adaptive rasterization (cont.)

When we do have to do 64-bit edge evaluation

64-bit only required for tile assignment

Any tile up to 128x128 that’s not trivially accepted or rejected can be rasterized using 32 bits

Adaptive intra-tile rasterization

Triangles that fit in a 16x16 bounding box

One less level of descent, less set-up, no trivial accept test

Direct mask stamping for 4x4, 4x8, 8x4 bounding boxes

Non-rasterization-based z for small triangles

Easy to try things out

Page 114: GDC09 Abrash Larrabee+Final

Implementation

Will still not match dedicated hardware peak rates per square millimeter on average

Efficient enough, and avoids dedicating area and design effort for a narrow purpose

Generality improves overall perf for a wide range of tasks

For example, can bring more mm^2 to bear – the whole chip!

Page 115: GDC09 Abrash Larrabee+Final

Implementation

Texture sampling and filtering remain as significant challenges for software

Apart from that, everything can be implemented in software

Not always obvious how at first, but surprisingly often doable

Still evolving

A whole new way to think about optimization

Page 116: GDC09 Abrash Larrabee+Final

Further Larrabee Information

Tom Forsyth’s talk

10:30: Room 3002, West Hall

SIGGRAPH paper

“Larrabee: A Many-Core x86 Architecture for Visual Computing,” Seiler et al.

Just search on “Larrabee SIGGRAPH paper”

Dr. Dobb’s Journal article in April

www.intel.com/software/graphics

GDC Larrabee talks

C++ Larrabee Prototype Library

