Evaluating The Raw Microprocessor: Scalability and Versatility
Michael Taylor
Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal.
M.I.T.
Could processors be even more general purpose?
A square inch of silicon gets more powerful every generation.
Custom chip: Video/3D graphics, network, encryption, wireless/cell phone, digital camera, MP3 player, automotive
“General purpose” microprocessor: Spec, Office
Why can custom chips run these apps?
Custom Chips: Efficient Extraction of Parallelism
- 10’s, 100’s or 1000’s of parallel operators
- 10’s or 100’s of parallel memory ports
- 10’s or 100’s of parallel I/O ops
But not general purpose! Can’t run GCC.
Customized placement and routing of operators & operands
- High locality
- Minimum control
- Operands routed over wires, not through register files
Area and power efficient
GP micro: 3-8, 2, 1
The Raw Goal
Create an architecture that:
- Scales to 100’s-1000’s of functional units and memory ports by exploiting custom-chip-like features, in particular application-specific routing of operands
… while being “general purpose”:
- Runs ILP-based sequential programs
- Supports standard general-purpose abstractions like context switching, caching, and instruction virtualization
[IEEE Micro, “Billion Transistor” Issue, 1997]
Un-buildable Super-Wide Issue GP
[Figure: 16 ALUs on a shared bypass network and register file (RF), fed by centralized control, a PC, wide fetch (16 inst), and a unified load/store queue]
Area and Frequency Scalability Problems
[Figure: N ALUs with a shared bypass network and RF; area annotations ~N³ and ~N²]
Ex: Itanium 2
Without modification, freq decreases linearly or worse.
Operand Routing is Global
[Figure: 16 ALUs on a shared bypass network and RF; an operand passes from a “>>” instruction to a “+” instruction through the bypass network]
Idea: Exploit Locality
[Figure: 16 ALUs, bypass network, and RF]
Replace the crossbar with a point-to-point, pipelined, routed network.
[Figure: 16 ALUs and RF connected by a routed network, with the “>>” and “+” instructions placed on ALUs in that network]
Operand Transport Scaling – Bandwidth and Area

                Un-pipelined crossbar    Point-to-point routed mesh network
ALUs            N                        N
Bisection BW    ~ N½                     ~ N½
Local BW        ~ N½                     ~ N
Area            ~ N²                     ~ N

Scales as 2-D VLSI.
If we want to keep our ALUs busy, we had better map communicating instructions nearby so that communication is local.
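To get a feel for the table, the sketch below evaluates its growth terms when the ALU count increases. It is only an illustration: the exponents are copied from the table, constant factors are unknown, and the names `SCALING` and `growth` are made up for this example.

```python
import math

# Growth terms copied from the bandwidth/area table. Constant factors are
# unknown, so only ratios between two ALU counts are meaningful.
SCALING = {
    "crossbar": {
        "bisection_bw": lambda n: math.sqrt(n),
        "local_bw": lambda n: math.sqrt(n),
        "area": lambda n: n ** 2,
    },
    "mesh": {
        "bisection_bw": lambda n: math.sqrt(n),
        "local_bw": lambda n: n,
        "area": lambda n: n,
    },
}


def growth(network: str, metric: str, n_from: int, n_to: int) -> float:
    """Factor by which `metric` grows when going from n_from to n_to ALUs."""
    term = SCALING[network][metric]
    return term(n_to) / term(n_from)


if __name__ == "__main__":
    for network in SCALING:
        for metric in SCALING[network]:
            factor = growth(network, metric, 16, 64)
            print(f"{network:8s} {metric:12s} 16 -> 64 ALUs: x{factor:g}")
```

Under these assumptions, quadrupling the ALU count multiplies crossbar area by 16 but mesh area by only 4, while mesh local bandwidth keeps pace with the ALU count.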
Operand Transport Scaling - Latency
Time for an operand to travel between instructions mapped to different ALUs.

                             Un-pipelined crossbar    Point-to-point routed mesh network
Non-local placement          ~ N                      ~ N½
Locality-driven placement    ~ N                      ~ 1

If we want to make sure that a latency-bound program doesn’t slow down when more ALUs are added, we must map the instructions to ALUs in a local fashion. [ASPLOS98]
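The mesh column can be illustrated with a small hop-count model. This is a sketch under stated assumptions, not the Raw compiler's placement algorithm: tiles form a square grid, latency is taken to be proportional to Manhattan distance, "non-local" is modeled as uniformly random placement, and "locality-driven" is modeled by a hypothetical snake-order layout of a dependence chain.

```python
from itertools import product


def mean_hops_random(side: int) -> float:
    """Mean Manhattan distance between two uniformly random tiles
    on a side x side mesh (models non-local placement)."""
    coords = range(side)
    # x and y are independent, so compute one dimension and double it.
    one_dim = sum(abs(a - b) for a, b in product(coords, repeat=2)) / side ** 2
    return 2 * one_dim


def snake_placement(side: int) -> list[tuple[int, int]]:
    """Tile coordinates in boustrophedon order, so that consecutive
    entries are always mesh neighbors (models locality-driven placement)."""
    tiles = []
    for row in range(side):
        cols = range(side) if row % 2 == 0 else range(side - 1, -1, -1)
        tiles.extend((row, col) for col in cols)
    return tiles


def max_hops_along_chain(tiles: list[tuple[int, int]]) -> int:
    """Largest hop count between consecutive instructions of a chain."""
    return max(
        abs(r1 - r2) + abs(c1 - c2)
        for (r1, c1), (r2, c2) in zip(tiles, tiles[1:])
    )


if __name__ == "__main__":
    for side in (4, 8, 32):  # 16, 64, and 1,024 tiles
        print(side * side, "tiles:",
              "random placement", mean_hops_random(side), "hops on average;",
              "snake placement", max_hops_along_chain(snake_placement(side)), "hop at most")
```

In this model the random-placement hop count grows with the side of the mesh, that is, with N½, while the snake layout keeps every dependence at one hop regardless of N.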
Distribute the Register File
[Figure: the single RF is replaced by one RF per ALU; the ALU and RF array is labeled SCALABLE. Control, PC, wide fetch (16 inst), and the unified load/store queue are still centralized]
More Scalability Problems
[Figure: the remaining centralized structures: control, PC, wide fetch (16 inst), and the unified load/store queue]
Distribute the rest.
[Figure: centralized control, wide fetch (16 inst), and the unified load/store queue give way to a PC, I$, and D$ alongside each ALU and RF]
[ISCA99]
Tiles!
[Figure: 16 tiles, each with its own ALU, RF, PC, I$, and D$]
Tiled Processor Architectures
- composed of a replicated tile
- all signals registered at tile boundaries
- NO global signals
- wire delay problem much easier
- easy scalability story
Easier to tune the frequency
Easier to verify
Easier to do the physical design
Raw Compute Internals
[Figure: tile array with the compute pipeline internals called out; labels: RF, A, TL, M1, M2, F, P, E, U; registers r24, r25, r26, r27]
We could not find this type of network in Patterson & Hennessy.
- optimizes time for delivery of scalar operands between functional units
- we conceptualized this idea into the term “scalar operand network” or SON
- CMP: 15-100 cycles
- iWarp: 12 cycles
- Raw: 3 cycles
- Alpha 21264: 1 cycle
- Superscalar: 0 cycles
scalable
HPCA 2003 – “Scalar Operand Networks”
Intended for use as SON
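To see why a few cycles of operand latency matter, the sketch below charges the cycle counts listed above to a chain of dependent instructions. The model is deliberately crude and the assumptions are ours: every instruction takes one cycle, and every dependence crosses between functional units, so each one pays the full delivery latency. The names `OPERAND_LATENCY_CYCLES`, `chain_cycles`, and `chain_cycle_range` are invented for this example.

```python
# Operand delivery latencies in cycles, as listed on the slide.
# CMP is given as a range, so every entry is stored as (low, high).
OPERAND_LATENCY_CYCLES = {
    "CMP": (15, 100),
    "iWarp": (12, 12),
    "Raw": (3, 3),
    "Alpha 21264": (1, 1),
    "Superscalar": (0, 0),
}


def chain_cycles(n_ops: int, hop_latency: int, op_cycles: int = 1) -> int:
    """Cycles for a chain of n_ops dependent instructions when each
    dependence pays hop_latency cycles of operand delivery (assumed model)."""
    if n_ops <= 0:
        return 0
    return n_ops * op_cycles + (n_ops - 1) * hop_latency


def chain_cycle_range(design: str, n_ops: int) -> tuple[int, int]:
    """(best, worst) chain time for one of the designs in the table."""
    low, high = OPERAND_LATENCY_CYCLES[design]
    return chain_cycles(n_ops, low), chain_cycles(n_ops, high)


if __name__ == "__main__":
    for design in OPERAND_LATENCY_CYCLES:
        print(f"{design:12s} 10-instruction chain: {chain_cycle_range(design, 10)} cycles")
```

Under this worst-case model a 10-instruction chain takes 10 cycles with 0-cycle delivery, 37 cycles at 3 cycles per operand, and 145 or more at CMP latencies, which is why instruction-level parallelism needs a network built for scalar operands.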
Evaluation of Raw
- holistic approach
- design a complete architecture
- design and build the processor and enclosing system
- build the compilers
- used the chip in real systems
- head-to-head versus Intel chip in same litho generation
Raw
180 nm ASIC (IBM SA-27E)
16 tiles
Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V
Frequency competitive with IBM-implemented PowerPCs in same process.
18 W (vpenta)
Critical path: ≈ single-ported 32 KB SRAM + 14-bit mux + flip-flop
[Photo: Raw chips, October 02]
[Photo: Raw motherboard]
Support Chipset implemented in FPGA (vs. custom ASICs for P3)
Comparison to Pentium 3
Honest:
Self-comparisons hide architectural and compiler inefficiency.
What’s hard:
Normalization between processors is very tricky, especially for academic projects versus indu$try.
- An ASIC cannot attain the same frequencies.
Our solution:
- Pick the closest Intel processor implementation
- Don’t scale any numbers in any way.
People can now compare to P3 and by extension to Raw.
Parameter                       IBM SA-27E (Raw)    Intel P858 (P3)    Favors
Litho                           180 nm              180 nm             -
Metal layers                    Cu 6                Al 6               Raw
Wire sizing                     No                  Yes                Intel
Dielectric k                    4.1                 3.55               Intel
FO1 delay                       23 ps               11 ps              Intel
Design style                    Std cell ASIC       Full custom        Intel
Voltage tweak                   0 %                 10 %               Intel
Initial freq (MHz)              425                 500-733            -
Presumed ave. chip freq (MHz)   425                 600                -
Pins                            1100                190                Raw
Die area                        331 mm²             106 mm²            Raw
Methodology - HW
Intel:
Pentium III Coppermine 600 MHz
Dell Precision 410, stocked with 2-2-2 PC100 DRAM
Raw:
Validated cycle-accurate simulator
- Matches RTL for Raw chip to the precise cycle for all 200,000+ lines of test code
Simulator used so we could:
- normalize motherboard + DRAM timings
- replace (research) software i-caching system with conventional hardware i-cache.
Methodology - SW
When applicable
- normalize compiler:
  P3: gcc 3.3 -O3 -march=pentium3 -mfpmath=sse
  Raw: gcc 3.3 -O3 (non-parallelizing)
- normalize stdio/stdlib:
  P3 & Raw: Newlib 1.9.0 w/ Deionizer
P3:
Intel Performance Primitives
LAPACK/BLAS with SSE for linear algebra routines
Raw:
rawcc - home brew parallelizing compiler
StreamIt - home brew parallelizing compiler
gcc 3.3 + snippets of inline assembly for some parallel apps
Performance Survey
Sources of Speedup vs. P3 or 1 Tile

Factor                          Approx. upper bound on speedup
Tile parallelism                16x
Streaming I/O bandwidth         60x
Streaming v. cache thrashing    15x
Future Work: Raw supercomputing fabric
Emulator of a 1K-tile Raw chip, circa 2010
… Ultimate test of scaling
Related Work: AsTrO Taxonomy
[Figure: “>>” and “+” instructions mapped onto an array of ALUs]
Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined?
Transport (Static/Dynamic): Are operand routes predetermined?
Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined?
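The three questions amount to a three-letter classification. The sketch below encodes it as a small data type; the names `Mode`, `AsTrO`, and `classify` are hypothetical, and the example input is an imaginary design rather than a classification of any processor named in this talk.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    STATIC = "S"
    DYNAMIC = "D"


@dataclass(frozen=True)
class AsTrO:
    """One point in the taxonomy: Assignment, Transport, Ordering."""
    assignment: Mode
    transport: Mode
    ordering: Mode

    def code(self) -> str:
        # Three-letter shorthand, e.g. "SDD".
        return self.assignment.value + self.transport.value + self.ordering.value


def classify(assignment_predetermined: bool,
             routes_predetermined: bool,
             order_predetermined: bool) -> AsTrO:
    """Answer the three questions with yes/no; 'yes' means Static."""
    def pick(predetermined: bool) -> Mode:
        return Mode.STATIC if predetermined else Mode.DYNAMIC

    return AsTrO(pick(assignment_predetermined),
                 pick(routes_predetermined),
                 pick(order_predetermined))


if __name__ == "__main__":
    # Imaginary design: the compiler picks the ALU for each instruction,
    # but routes and execution order are decided at run time.
    print(classify(True, False, False).code())
```

Since each of the three properties is independently Static or Dynamic, the taxonomy has eight possible classes.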
[Figure: AsTrO decision tree branching on Assignment, then Transport, then Ordering (each Static or Dynamic), with leaves RawDyn [00], Raw [97], Scale [04], GRID [01], WaveScalar [03], ILDP [00], and OOO Superscalar]
How Raw relates to other distributed microprocessors using the AsTrO taxonomy
Conclusions
• VLSI-scalable microprocessors are possible.
Constant factors are beginning to give way to asymptotics:
- 16 ALU Raw – Oct 2002
- 64 ALU Raw – Now
- 1,024 ALU Raw – 2010
- 32,768 ALU Raw – If Moore’s Law makes it to 2 nm
• There is an opportunity to make processors more “versatile”, i.e., steal applications from custom chips.
• Tiled Processor Architectures are a promising approach and merit further research.
Embedded system: 1020 Element Microphone Array