Evaluating The Raw Microprocessor: Scalability and Versatility

Date post: 04-Feb-2016
Evaluating The Raw Microprocessor: Scalability and Versatility. Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. M.I.T.
Transcript
Page 1: Evaluating The Raw Microprocessor: Scalability and Versatility

Evaluating The Raw Microprocessor: Scalability and Versatility

Michael Taylor

Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal.

M.I.T.

Page 2: Evaluating The Raw Microprocessor: Scalability and Versatility

Could processors be even more general purpose?

A square inch of silicon gets more powerful every generation.

Custom Chip: Video/3D Graphics, Network, Encryption, Wireless/Cell Phone, Digital Camera, MP3 Player, Automotive

“General Purpose” Microprocessor: Spec, Office

Why can custom chips run these apps?

Page 3: Evaluating The Raw Microprocessor: Scalability and Versatility

Custom Chips: Efficient Extraction of Parallelism

- 10’s, 100’s or 1000’s of parallel operators
- 10’s or 100’s of parallel memory ports
- 10’s or 100’s of parallel I/O ops

But, not general purpose! Can’t run GCC.

[Figure: operators wired to several parallel memory (mem) blocks]

Customized placement and routing of operators & operands

- High locality
- Minimum control
- Operands routed over wires, not through register files

Area and power efficient

GP Micro3-821

Page 4: Evaluating The Raw Microprocessor: Scalability and Versatility

The Raw Goal

Create an architecture that:

Scales to 100’s-1000’s of functional units and memory ports by exploiting custom-chip-like features
- in particular, application-specific routing of operands

… while being “general purpose”:

Runs ILP-based sequential programs
Supports standard general-purpose abstractions
- like context switching, caching and instruction virtualization

[IEEE Micro, “Billion Transistor” Issue, 1997]

Page 5: Evaluating The Raw Microprocessor: Scalability and Versatility

Un-buildable Super-Wide Issue GP

[Figure: Control, Wide Fetch (16 inst), Unified Load/Store Queue, and PC, with a register file (RF) and 16 ALUs joined by a Bypass Net]

Page 6: Evaluating The Raw Microprocessor: Scalability and Versatility

Area and Frequency Scalability Problems

[Figure: 16 ALUs, Bypass Net, and RF, annotated ~N³, ~N², and N ALUs]

Ex: Itanium 2

Without modification, freq decreases linearly or worse.
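
To make the scaling labels on this slide concrete, here is a minimal Python sketch. It assumes only the growth orders shown on the slide (linear, quadratic, and cubic in the number of ALUs N). The function name `relative_growth` and the 4-ALU baseline are illustrative choices, not figures from the talk.

```python
def relative_growth(n_alus, base_alus=4):
    """Return how linear, quadratic, and cubic costs grow versus a baseline.

    base_alus is a hypothetical reference design size, chosen for illustration.
    """
    if n_alus <= 0 or base_alus <= 0:
        raise ValueError("ALU counts must be positive")
    scale = n_alus / base_alus
    return {"linear": scale, "quadratic": scale ** 2, "cubic": scale ** 3}


if __name__ == "__main__":
    for n in (4, 16, 64):
        print(n, relative_growth(n))
```

Under these assumptions, going from 4 to 16 ALUs multiplies a cubic-cost structure by 64 while the ALU count itself grows only 4x.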

Page 7: Evaluating The Raw Microprocessor: Scalability and Versatility

Operand Routing is Global

[Figure: 16 ALUs, Bypass Net, and RF, with a >> and a + operation marked]

Page 8: Evaluating The Raw Microprocessor: Scalability and Versatility

Idea: Exploit Locality

[Figure: 16 ALUs, Bypass Net, and RF]

Page 9: Evaluating The Raw Microprocessor: Scalability and Versatility

Idea: Exploit Locality

[Figure: 16 ALUs, RF, and Bypass Net]

Page 10: Evaluating The Raw Microprocessor: Scalability and Versatility

[Figure: 16 ALUs and RF]

Replace the crossbar with a point-to-point, pipelined, routed network.

Page 11: Evaluating The Raw Microprocessor: Scalability and Versatility

[Figure: 16 ALUs and RF, with a >> and a + operation marked]

Replace the crossbar with a point-to-point, pipelined, routed network.

Page 12: Evaluating The Raw Microprocessor: Scalability and Versatility

Operand Transport Scaling - Bandwidth and Area

               Un-pipelined crossbar   Point-to-Point Routed Mesh Network
ALUs           N                       N
Bisection BW   ~ N½                    ~ N½
Local BW       ~ N½                    ~ N
Area           ~ N²                    ~ N

If we want to keep our ALUs busy, we better map communicating instructions nearby so communication is local.

Scales as 2-D VLSI
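
The mesh column of the bandwidth table can be checked by counting links in a square mesh. The sketch below assumes a √N x √N grid with one link between each pair of neighboring ALUs; the function name `mesh_links` and the link-counting model are assumptions made here for illustration, not taken from the talk.

```python
import math


def mesh_links(n_alus):
    """Count bisection and neighbor-to-neighbor links in a square 2-D mesh.

    Assumes n_alus is a perfect square and each adjacent pair shares one link.
    """
    side = math.isqrt(n_alus)
    if side * side != n_alus:
        raise ValueError("n_alus must be a perfect square")
    # A vertical cut through the middle severs one link per row.
    bisection = side
    # Horizontal links: side rows * (side - 1); vertical links: the same count.
    local = 2 * side * (side - 1)
    return {"bisection_links": bisection, "local_links": local}


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, mesh_links(n))
```

In this model the bisection link count grows as N½ while the number of local links grows roughly as N, which is why mapping communicating instructions onto neighbors uses the plentiful resource.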

Page 13: Evaluating The Raw Microprocessor: Scalability and Versatility

Operand Transport Scaling - Latency

Time for operand to travel between instructions mapped to different ALUs.

                            Un-pipelined crossbar   Point-to-Point Routed Mesh Network
Non-local Placement         ~ N                     ~ N½
Locality Driven Placement   ~ N                     ~ 1

If we want to make sure that a latency-bound program doesn’t slow down when more ALUs are added, we must map the instructions to ALUs in a local fashion. [ASPLOS98]
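
The mesh latency entries can be illustrated by counting hops. The sketch below assumes a square mesh, uses Manhattan hop count as a stand-in for latency, and models non-local placement as a producer and consumer placed uniformly at random. All three are simplifying assumptions, and the function names are hypothetical.

```python
import itertools
import math


def mean_hops_random_placement(n_alus):
    """Average Manhattan distance between two uniformly random mesh positions."""
    side = math.isqrt(n_alus)
    if side * side != n_alus:
        raise ValueError("n_alus must be a perfect square")
    tiles = list(itertools.product(range(side), repeat=2))
    total = sum(
        abs(ax - bx) + abs(ay - by)
        for (ax, ay), (bx, by) in itertools.product(tiles, repeat=2)
    )
    return total / (len(tiles) ** 2)


def hops_local_placement():
    """Locality-driven placement puts the consumer on a neighbor: one hop."""
    return 1


if __name__ == "__main__":
    for n in (16, 64, 256):
        print(n, mean_hops_random_placement(n), hops_local_placement())
```

Under these assumptions the random-placement average rises with the mesh side length (about N½), while the neighbor-placement cost stays at one hop regardless of N.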

Page 14: Evaluating The Raw Microprocessor: Scalability and Versatility

Distribute the Register File

[Figure: 16 ALUs, the central RF, and 16 distributed RFs]

Page 15: Evaluating The Raw Microprocessor: Scalability and Versatility

[Figure: 16 ALUs with 16 distributed RFs and the label SCALABLE, alongside Control, Wide Fetch (16 inst), Unified Load/Store Queue, and PC]

Page 16: Evaluating The Raw Microprocessor: Scalability and Versatility

More Scalability Problems

- Control
- Wide Fetch (16 inst)
- Unified Load/Store Queue
- PC

Page 17: Evaluating The Raw Microprocessor: Scalability and Versatility

Distribute the rest.

[Figure: 16 ALUs and 16 RFs; Control, Wide Fetch (16 inst), Unified Load/Store Queue, and PC shown next to 16 replicated sets of I$, PC, and D$]

[ISCA99]

Page 18: Evaluating The Raw Microprocessor: Scalability and Versatility

Tiles!

[Figure: 16 tiles, each with an ALU, RF, I$, PC, and D$]

Page 19: Evaluating The Raw Microprocessor: Scalability and Versatility

Tiles!

Page 20: Evaluating The Raw Microprocessor: Scalability and Versatility

Tiled Processor Architectures

- composed of a replicated tile
- all signals registered at tile boundaries
- NO global signals
- wire delay problem much easier
- easy scalability story

Easier to Tune the Frequency
Easier to Verify
Easier to do the Physical Design

Page 21: Evaluating The Raw Microprocessor: Scalability and Versatility

Raw Compute Internals

[Figure: 16 tiles (ALU, RF, I$, PC, D$ each) with one tile's compute pipeline expanded; recovered labels: RF, A, TL, M1, M2, F, P, E, U, and registers r24, r25, r26, r27]

Page 22: Evaluating The Raw Microprocessor: Scalability and Versatility

[Figure: 16 tiles (ALU, RF, I$, PC, D$ each)]

We could not find this type of network in Patterson & Hennessy.

- optimizes time for delivery of scalar operands between functional units
- we captured this idea in the term “scalar operand network” or SON

Operand delivery latency:
- CMP: 15-100 cycles
- iWarp: 12 cycles
- Raw: 3 cycles
- Alpha 21264: 1 cycle
- Superscalar: 0 cycles

scalable

HPCA 2003, “Scalar Operand Networks”

Intended for use as SON
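
To see why these operand latencies matter, consider a chain of dependent instructions in which every result must travel to a different functional unit. The sketch below is a simplified cost model added for illustration: it assumes one cycle per instruction and charges the quoted operand latency between each pair of consecutive instructions. The function name `chain_cycles` is hypothetical, and the CMP entry uses the low end of the quoted 15-100 cycle range.

```python
# Operand latencies in cycles, as quoted on the slide (CMP: low end of 15-100).
OPERAND_LATENCY_CYCLES = {
    "CMP (low end)": 15,
    "iWarp": 12,
    "Raw": 3,
    "Alpha 21264": 1,
    "Superscalar": 0,
}


def chain_cycles(chain_length, operand_latency, op_cycles=1):
    """Cycles to run a dependent chain where each operand crosses units.

    Simplified model: op_cycles per instruction plus operand_latency between
    each producer and its consumer.
    """
    if chain_length < 1:
        raise ValueError("chain_length must be at least 1")
    return chain_length * op_cycles + (chain_length - 1) * operand_latency


if __name__ == "__main__":
    for name, latency in OPERAND_LATENCY_CYCLES.items():
        print(name, chain_cycles(10, latency))
```

In this model a 10-instruction chain takes 10 cycles at 0-cycle latency and 37 cycles at 3-cycle latency, so latency-bound code is sensitive to every cycle of operand transport.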

Page 23: Evaluating The Raw Microprocessor: Scalability and Versatility

Evaluation of Raw

- holistic approach

- design a complete architecture

- design and build the processor and enclosing system

- build the compilers

- used the chip in real systems

- head-to-head versus Intel Chip in same litho generation

Page 24: Evaluating The Raw Microprocessor: Scalability and Versatility

Raw

180 nm ASIC (IBM SA-27E)
16 tiles

Core Frequency:
425 MHz @ 1.8 V
500 MHz @ 2.2 V

Frequency competitive with IBM-implemented PowerPCs in same process.

18 W (vpenta)

Critical Path: ≈ Single-Ported 32 KB SRAM + 14-bit Mux + Flip Flop

Page 25: Evaluating The Raw Microprocessor: Scalability and Versatility

Raw Chips

October 02

Page 26: Evaluating The Raw Microprocessor: Scalability and Versatility

Raw motherboard

Support Chipset implemented in FPGA (vs. custom ASICs for P3)

Page 27: Evaluating The Raw Microprocessor: Scalability and Versatility

Comparison to Pentium 3

Self-comparisons hide architectural and compiler inefficiency.

What’s hard:

Normalization between processors is very tricky.

Especially academic projects versus indu$try.
- ASIC cannot attain the same frequencies.

Honest:

Our solution:

- Pick closest Intel processor implementation
- Don’t scale any numbers in any way.

People can now compare to P3 and by extension to Raw.

Page 28: Evaluating The Raw Microprocessor: Scalability and Versatility

Parameter                       IBM SA-27E (Raw)   Intel P858 (P3)   Favors
Litho                           180 nm             180 nm            -
Metal Layers                    Cu 6               Al 6              Raw
Wire sizing                     No                 Yes               Intel
Dielectric k                    4.1                3.55              Intel
FO1 Delay                       23 ps              11 ps             Intel
Design Style                    Std Cell ASIC      Full custom       Intel
Voltage Tweak                   0 %                10 %              Intel
Initial Freq (MHz)              425                500-733           -
Presumed Ave. Chip Freq (MHz)   425                600               -
Pins                            1100               190               Raw
Die Area                        331 mm²            106 mm²           Raw
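
One way to read the FO1 and frequency rows together is to express each chip's clock period in units of its own FO1 delay. This normalization is an illustration added here, not an analysis from the slides. It uses the presumed average chip frequencies in MHz, and the function name `cycle_time_in_fo1` is hypothetical.

```python
def cycle_time_in_fo1(freq_mhz, fo1_delay_ps):
    """Clock period divided by FO1 delay, i.e. FO1 delays per cycle."""
    if freq_mhz <= 0 or fo1_delay_ps <= 0:
        raise ValueError("frequency and FO1 delay must be positive")
    cycle_ps = 1e6 / freq_mhz  # 1 MHz corresponds to a 1e6 ps period
    return cycle_ps / fo1_delay_ps


if __name__ == "__main__":
    # Values copied from the table: (frequency in MHz, FO1 delay in ps).
    print("Raw:", round(cycle_time_in_fo1(425, 23), 1))
    print("P3: ", round(cycle_time_in_fo1(600, 11), 1))
```

The same table values also give a die area ratio of 331 / 106, roughly 3.1 in Raw's favor.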

Page 29: Evaluating The Raw Microprocessor: Scalability and Versatility

Methodology - HW

Intel:
Pentium III Coppermine 600 MHz
Dell Precision 410, stocked with 2-2-2 PC100 DRAM

Raw:
Validated Cycle-Accurate Simulator
- Matches RTL for Raw Chip to the precise cycle for all 200,000+ lines of test code

Simulator used so we could:
- Normalize motherboard + DRAM timings
- replace (research) software i-caching system with conventional hardware i-cache.

Page 30: Evaluating The Raw Microprocessor: Scalability and Versatility

Methodology - SW

When applicable
- normalize compiler:
  P3: gcc 3.3 -O3 -march=pentium3 -mfpmath=sse
  Raw: gcc 3.3 -O3 (non parallelizing)
- normalize stdio/stdlib:
  P3 & Raw: Newlib 1.9.0 w/ Deionizer

P3:
Intel Performance Primitives
LAPACK/BLAS with SSE for linear algebra routines

Raw:
rawcc - home brew parallelizing compiler
StreamIt - home brew parallelizing compiler
gcc 3.3 + snippets of inline assembly for some parallel apps

Page 31: Evaluating The Raw Microprocessor: Scalability and Versatility

Performance Survey

Page 32: Evaluating The Raw Microprocessor: Scalability and Versatility

Sources of Speedup vs. P3 or 1 Tile

Factor                          Approx. Upper Bound on Speedup
Tile Parallelism                16x
Streaming I/O Bandwidth         60x
Streaming v. cache thrashing    15x
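
The 16x bound for tile parallelism matches the 16 tiles on the chip. As a hedged illustration of why it is an upper bound, the sketch below applies Amdahl's law, which is brought in here for explanation and is not cited in the slides. The function name `amdahl_speedup` and the example parallel fractions are assumptions.

```python
def amdahl_speedup(parallel_fraction, n_tiles=16):
    """Speedup over one tile when only part of the work spreads across tiles."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_tiles < 1:
        raise ValueError("n_tiles must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_tiles)


if __name__ == "__main__":
    for fraction in (0.5, 0.9, 1.0):
        print(fraction, amdahl_speedup(fraction))
```

Only a fully parallel workload reaches 16x in this model; with 90% of the work parallelized, the speedup on 16 tiles is 6.4x.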

Page 33: Evaluating The Raw Microprocessor: Scalability and Versatility

Future Work: Raw supercomputing fabric

Emulator of a 1K-tile Raw chip circa 2010

… Ultimate test of scaling

Page 34: Evaluating The Raw Microprocessor: Scalability and Versatility

Related Work: AsTrO Taxonomy

[Figure: eight ALUs with >>, +, and other operations (%, &, /) marked]

Assignment (Static/Dynamic): Is instruction assignment to ALUs predetermined?

Transport (Static/Dynamic): Are operand routes predetermined?

Ordering (Static/Dynamic): Is the execution order of instructions assigned to a node predetermined?
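
The three AsTrO questions each have a Static or Dynamic answer, which yields eight possible classes. The sketch below enumerates them as a small data structure. The class and field names are hypothetical, and no architecture is assigned to a class here because the slide text alone does not fix those placements.

```python
from enum import Enum
from itertools import product
from typing import NamedTuple


class Mode(Enum):
    STATIC = "Static"
    DYNAMIC = "Dynamic"


class AsTrO(NamedTuple):
    """One point in the taxonomy: Assignment, Transport, Ordering."""
    assignment: Mode
    transport: Mode
    ordering: Mode

    def code(self):
        # Three-letter shorthand, e.g. "SDS" for Static/Dynamic/Static.
        return "".join(mode.value[0] for mode in self)


# All 2 * 2 * 2 combinations of the three Static/Dynamic choices.
ALL_CLASSES = [AsTrO(*modes) for modes in product(Mode, repeat=3)]

if __name__ == "__main__":
    for entry in ALL_CLASSES:
        print(entry.code(), [mode.value for mode in entry])
```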

Page 35: Evaluating The Raw Microprocessor: Scalability and Versatility

How Raw relates to other distributed microprocessors using AsTrO taxonomy

[Figure: classification tree over Assignment, Transport, and Ordering (each Static or Dynamic), with leaves RawDyn [00], Raw [97], Scale [04], GRID [01], WaveScalar [03], ILDP [00], and OOO-Superscalar]

Page 36: Evaluating The Raw Microprocessor: Scalability and Versatility

Conclusions

• VLSI Scalable microprocessors are possible.

Constant factors are beginning to give way to asymptotics:
- 16 ALU Raw - Oct 2002
- 64 ALU Raw - Now
- 1,024 ALU Raw - 2010
- 32,768 ALU Raw - If Moore’s Law makes it to 2 nm

• There is an opportunity to make processors more “versatile”, i.e., steal applications from custom chips.

• Tiled Processor Architectures are a promising approach and merit further research.

Page 37: Evaluating The Raw Microprocessor: Scalability and Versatility

* * * *

Page 38: Evaluating The Raw Microprocessor: Scalability and Versatility

Embedded system: 1020 Element Microphone Array

