Evaluating The Raw Microprocessor: Scalability and Versatility


Michael Taylor

Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal.

M.I.T.

Could processors be even more general purpose?

Square inch of silicon

Gets more powerful every generation

Custom Chip

“General Purpose” Microprocessor

Video/3D Graphics, Network, Encryption, Wireless/Cell Phone, Digital Camera, MP3 Player, Automotive

Why can custom chips run these apps?

Spec, Office

Custom Chips: Efficient Extraction of Parallelism

10’s, 100’s or 1000’s of parallel operators
10’s or 100’s of parallel memory ports
10’s or 100’s of parallel I/O ops

But, not general purpose! Can’t run GCC.

[Figure: parallel operators connected to several memories (mem)]

Customized placement and routing of operators & operands

- High locality
- Minimum Control
- Operands routed over wires, not thru register files

Area and Power Efficient

GP Micro3-821

The Raw Goal

Create an architecture that:

Scales to 100’s-1000’s of functional units, memory ports by exploiting custom-chip like features
- in particular, application-specific routing of operands

… while being “general purpose”:

Run ILP-based sequential programs
Support standard General Purpose Abstractions
- like context switching, caching and instruction virtualization

[IEEE Micro, “Billion Transistor” Issue, 1997]

Un-buildable Super-Wide Issue GP

[Figure: block diagram of a 16-ALU super-wide issue processor: Control, Wide Fetch (16 inst), Unified Load/Store Queue, PC, RF, 16 ALUs, Bypass Net]

Area and Frequency Scalability Problems

[Figure: N ALUs sharing one RF and one Bypass Net; scaling annotations on the figure: ~N^3, ~N^2]

Ex: Itanium 2

Without modification, freq decreases linearly or worse.

Operand Routing is Global

[Figure: 16 ALUs with shared RF and Bypass Net; a `>>` instruction and a `+` instruction are shown on ALUs]

Idea: Exploit Locality

[Figure: the 16 ALUs, RF and Bypass Net redrawn with the ALUs arranged as a grid]

Replace the crossbar with a point-to-point, pipelined, routed network.

[Figure: grid of 16 ALUs with RF; the `>>` and `+` instructions are shown on ALUs in the grid]

Operand Transport Scaling – Bandwidth and Area

                 Un-pipelined crossbar    Point-to-Point Routed Mesh Network
ALUs             N                        N
Bisection BW     ~ N^(1/2)                ~ N^(1/2)
Local BW         ~ N^(1/2)                ~ N
Area             ~ N^2                    ~ N

The point-to-point routed mesh network scales as 2-D VLSI.

If we want to keep our ALUs busy, we had better map communicating instructions nearby so communication is local.
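To make the asymptotics in the bandwidth and area table concrete, the following is a minimal sketch that evaluates the growth terms for a few ALU counts. The function name `scaling_row` is an assumption made for illustration, constant factors are ignored entirely, and N is assumed to be a perfect square so that N^(1/2) is an integer.

```python
import math

def scaling_row(n_alus):
    """Growth terms from the table for n_alus ALUs (constant factors ignored).

    Assumes n_alus is a perfect square so that sqrt(N) is an integer.
    """
    root = math.isqrt(n_alus)
    return {
        "crossbar": {"bisection_bw": root, "local_bw": root, "area": n_alus ** 2},
        "mesh": {"bisection_bw": root, "local_bw": n_alus, "area": n_alus},
    }

for n in (16, 64, 1024):
    row = scaling_row(n)
    area_ratio = row["crossbar"]["area"] // row["mesh"]["area"]
    bw_ratio = row["mesh"]["local_bw"] // row["crossbar"]["local_bw"]
    print(f"N={n}: crossbar area term is {area_ratio}x the mesh term, "
          f"mesh local-BW term is {bw_ratio}x the crossbar term")
```

These are ratios of growth terms only, not measured areas or bandwidths; they show why the gap widens as N grows.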

Operand Transport Scaling - Latency

Time for operand to travel between instructions mapped to different ALUs.

                             Un-pipelined crossbar    Point-to-Point Routed Mesh Network
Non-local Placement          ~ N                      ~ N^(1/2)
Locality Driven Placement    ~ N                      ~ 1

If we want to make sure that a latency-bound program doesn’t slow down when more ALUs are added, we must map the instructions to ALUs in a local fashion. [ASPLOS98]
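The mesh column of the latency table can be illustrated with a small hop-count sketch. It assumes a square mesh, uses Manhattan distance in hops as a stand-in for latency, and models non-local placement as a uniformly random pair of distinct ALUs; the function name `mean_hops` and these modeling choices are assumptions, not taken from the talk.

```python
import itertools
import math

def mean_hops(n_alus):
    """Average Manhattan distance between two distinct ALUs on a square mesh.

    Models non-local placement as a uniformly random pair of ALUs.
    Assumes n_alus is a perfect square.
    """
    side = math.isqrt(n_alus)
    tiles = list(itertools.product(range(side), repeat=2))
    pairs = list(itertools.combinations(tiles, 2))
    total = sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in pairs)
    return total / len(pairs)

LOCAL_HOPS = 1  # locality-driven placement: producer and consumer are neighbors

for n in (16, 64, 256):
    print(f"N={n}: non-local ~{mean_hops(n):.2f} hops, local {LOCAL_HOPS} hop")
```

Under these assumptions the non-local average grows in proportion to N^(1/2), while neighbor-to-neighbor placement stays at one hop no matter how many ALUs are added.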

Distribute the Register File

[Figure: the 16-ALU grid before and after distributing the register file: the single RF is replaced by one RF per ALU, and this part of the design is marked SCALABLE. Control, Wide Fetch (16 inst), Unified Load/Store Queue and PC are still centralized.]

More Scalability Problems

[Figure: the remaining centralized structures: Control, Wide Fetch (16 inst), Unified Load/Store Queue, PC]

Distribute the rest.

[Figure: the central Control, Wide Fetch (16 inst), Unified Load/Store Queue and PC are replaced by a separate I$, PC and D$ next to each of the 16 ALU/RF pairs]

[ISCA99]

Tiles!

[Figure: array of 16 tiles, each containing an ALU, RF, I$, PC and D$]

Tiled Processor Architectures

- composed of a replicated tile
- all signals registered at tile boundaries
- NO global signals
- wire delay problem much easier
- easy scalability story

Easier to Tune the Frequency
Easier to Verify
Easier to do the Physical Design

Raw Compute Internals

[Figure: the 16-tile array with one tile's compute pipeline expanded. Labels that survive from the figure: RF, A, TL, M1, M2, F, P, E, U, and registers r24, r25, r26, r27]

We could not find this type of network in Patterson & Hennessy.

- optimizes time for delivery of scalar operands between functional units
- we conceptualized this idea into the term “scalar operand network” or SON

- CMP: 15-100 cycles
- iWarp: 12 cycles
- Raw: 3 cycles
- Alpha 21264: 1 cycle
- Superscalar: 0 cycles

[Figure annotations on the list above: “scalable”, “Intended for use as SON”]

HPCA 2003 – “Scalar Operand Networks”

Evaluation of Raw

- holistic approach

- design a complete architecture

- design and build the processor and enclosing system

- build the compilers

- used the chip in real systems

- head-to-head versus Intel Chip in same litho generation

Raw

180 nm ASIC (IBM SA-27E)

16 tiles

Core Frequency:
425 MHz @ 1.8 V
500 MHz @ 2.2 V

Frequency competitive with IBM-implemented PowerPCs in same process.

18 W (vpenta)

Critical Path: ≈ Single-Ported 32 KB SRAM + 14-bit Mux + Flip Flop

Raw Chips

October 02

Raw motherboard

Support Chipset implemented in FPGA (vs. custom ASICs for P3)

Comparison to Pentium 3

Honest:

Self-comparisons hide architectural and compiler inefficiency.

What’s hard:

Normalization between processors is very tricky.

Especially academic projects versus indu$try.
- ASIC cannot attain the same frequencies.

Our solution:

- Pick closest Intel processor implementation
- Don’t scale any numbers in any way.

People can now compare to P3 and by extension to Raw.

Parameter                       IBM SA-27E (Raw)    Intel P858 (P3)    Favors
Litho                           180 nm              180 nm             -
Metal Layers                    Cu 6                Al 6               Raw
Wire sizing                     No                  Yes                Intel
Dielectric k                    4.1                 3.55               Intel
FO1 Delay                       23 ps               11 ps              Intel
Design Style                    Std Cell ASIC       Full custom        Intel
Voltage Tweak                   0 %                 10 %               Intel
Initial Freq (MHz)              425                 500-733            -
Presumed Ave. Chip Freq (MHz)   425                 600                -
Pins                            1100                190                Raw
Die Area                        331 mm²             106 mm²            Raw

Methodology - HW

Intel:
Pentium III Coppermine 600 MHz
Dell Precision 410, stocked with 2-2-2 PC100 DRAM

Raw:
Validated Cycle-Accurate Simulator
- Matches RTL for Raw Chip to the precise cycle for all 200,000+ lines of test code

Simulator used so we could:
- Normalize motherboard + DRAM timings
- replace (research) software i-caching system with conventional hardware i-cache.

Methodology - SW

When applicable
- normalize compiler:
  P3: gcc 3.3 -O3 -march=pentium3 -mfpmath=sse
  Raw: gcc 3.3 -O3 (non parallelizing)
- normalize stdio/stdlib:
  P3 & Raw: Newlib 1.9.0 w/ Deionizer

P3:
Intel Performance Primitives
LAPACK/BLAS with SSE for linear algebra routines

Raw:
rawcc - home brew parallelizing compiler
StreamIt - home brew parallelizing compiler
gcc 3.3 + snippets of inline assembly for some parallel apps

Performance Survey

Sources of Speedup vs. P3 or 1 Tile

Factor                          Approx. Upper Bound on Speedup
Tile Parallelism                16x
Streaming I/O Bandwidth         60x
Streaming v. cache thrashing    15x

Future Work: Raw supercomputing fabric

Emulator of a 1K-tile Raw chip, circa 2010

… Ultimate test of scaling

Related Work: AsTrO Taxonomy

Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined?

Transport (Static/Dynamic): Are operand routes predetermined?

Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined?

[Figure: classification tree branching on Assignment, Transport and Ordering, each Static or Dynamic. Architectures placed in the tree: Raw [97], RawDyn [00], Scale [04], GRID [01], WaveScalar [03], ILDP [00], OOO-Superscalar]

How Raw relates to other distributed microprocessors using AsTrO taxonomy
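The three AsTrO questions can be captured as a small record type. This is only a sketch: the class name, the field names and the three-letter label (S for static, D for dynamic, in Assignment/Transport/Ordering order) are shorthand assumed here for illustration, and the example design is hypothetical rather than a classification read off the slide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AsTrO:
    """Answers to the three AsTrO questions for one architecture."""
    static_assignment: bool  # is instruction assignment to ALUs predetermined?
    static_transport: bool   # are operand routes predetermined?
    static_ordering: bool    # is per-node execution order predetermined?

    def label(self):
        """Three-letter code, e.g. 'SDS' (assumed shorthand, not from the slide)."""
        flags = (self.static_assignment, self.static_transport, self.static_ordering)
        return "".join("S" if flag else "D" for flag in flags)

# Hypothetical design: compiler-chosen placement and execution order,
# but operands routed at run time.
example = AsTrO(static_assignment=True, static_transport=False, static_ordering=True)
print(example.label())
```

Three yes/no answers give eight possible classes, which is the space the taxonomy figure lays out as a tree.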

Conclusions

•VLSI Scalable microprocessors are possible.

Constant factors are beginning to give way to asymptotics:
- 16 ALU Raw - Oct 2002
- 64 ALU Raw - Now
- 1,024 ALU Raw - 2010
- 32,768 ALU Raw - If Moore’s Law makes it to 2 nm

•There is an opportunity to make processors more “versatile” i.e., steal applications from custom chips.

•Tiled Processor Architectures are a promising approach and merit further research.


Embedded system: 1020 Element Microphone Array
