1 Digital Space Anant Agarwal MIT and Tilera Corporation.

1

Digital Space

Anant Agarwal

MIT and Tilera Corporation

2

Arecibo

3

Stages of Reality

[Figure: timeline of chip diagrams, from memory arrays (mem) through CPU plus Mem to large arrays of p/m/s tiles, labeled 1996, 1997, 2002, 2007, 2014 and 2018, with annotations “1B transistors” and “100B transistors”]

4

Virtual reality

Simulator reality

Prototype reality

Product reality

5

The Opportunity

20 MIPS CPU in 1987

1996…

Few thousand gates

6

The Opportunity

The billion transistor chip of 2007

7

How to Fritter Away Opportunity

the x1786? does not scale

100 ported RegFil and RR

Caches

Control

More resolution buffers, control

“1/10 ns”

8

• Lots of ALUs, lots of registers, lots of local memories – huge on-chip parallelism – but with a slower clock

• Custom-routed, short wires optimized for specific applications

Fast, low power, area efficient

But not programmable

[Figure: several mem blocks joined by custom-routed wires]

Take Inspiration from ASICs

9

Our Early Raw Proposal

[Figure labels: CPU, Mem]

E.g., 100-way unrolled loop, running on 100 ALUs, 1000 regs, 100 memory banks

But how to build programmable, yet custom, wires?

Got parallelism?

10

A digital wire

[Figure: a chain of five Ctrl blocks along the wire]

Software orchestrate it!

• Customize to application and maximize utilization

Multiplex it!

• Improve utilization

Pipeline it!

• Fast clock (10GHz in 2010)

Uh! What were we smoking!

A dynamic router!

Replace custom wires with routed on-chip networks

11

Static Router

[Figure labels: Application, Compiler, Switch Code (one per switch, five shown)]

A static router!

12

Replace Wires with Routed Networks

Ctrl

13

50-Ported Register File → Distributed Registers

Gigantic 50-ported register file

14

50-Ported Register File → Distributed Registers

Gigantic 50-ported register file

15

Distributed Registers + Routed Network

Distributed register file

R

Called NURA [ASPLOS 1998]

16

16-Way ALU Clump → Distributed ALUs

[Figure: 16 ALUs on a shared bypass network (Bypass Net) with one register file (RF)]

17

Distributed ALUs, Routed Bypass Network

[Figure: ALUs spread across tiles, each paired with a router (R)]

Scalar Operand Network (SON) [TPDS 2005]

18

Mongo Cache → Distributed Cache

Gigantic 10-ported cache

19

Distributing the Cache

20

Distributed Shared Cache

[Figure: tiles, each with an ALU, a router (R) and a cache ($)]

Like DSM (distributed shared memory), the cache is distributed; but, unlike NUCA, caches are local to processors, not far away

[ISCA 1999]

21

Tiled Multicore Architecture

[Figure: grid of tiles, each with an ALU, a router (R) and a cache ($)]

22

E.g., Operand Routing in 16-way Superscalar

[Figure: 16 ALUs sharing one bypass network (Bypass Net) and register file (RF); an operand travels from a >> operation to a + operation]

Source: [Taylor ISCA 2004]

23

Operand Routing in a Tiled Architecture

>>

[Figure: a >> operation and a + operation placed on different tiles, each tile with an ALU, a router (R) and a cache ($)]

24

Tiled Multicore

• Scales to large numbers of cores
• Modular – design, layout and verify 1 tile
• Power efficient [MIT-CSAIL-TR-2008-066]
– Short wires CV²f
– Chandrakasan effect CV²f
– Dynamic and compiler scheduled routing

Processor Core + Switch (S) = Tile
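The CV²f shorthand on this slide is the usual dynamic power relation P = C·V²·f. A minimal sketch of the two power arguments, assuming illustrative numbers that are not from the talk (capacitance halved by short wires, supply voltage lowered to 0.7 when two units run at half clock):

```python
def dynamic_power(c_farads: float, v_volts: float, f_hertz: float) -> float:
    """Switching power of a CMOS node: P = C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hertz


# Hypothetical baseline: one unit at nominal capacitance, voltage and clock.
base = dynamic_power(1.0, 1.0, 1.0)

# Short wires: less capacitance switched per transfer (assumed halved here).
short_wire = dynamic_power(0.5, 1.0, 1.0)

# Parallelism for power: two units at half the clock keep the throughput,
# and the slower clock is assumed to permit a lower supply voltage (0.7).
two_slow_units = 2 * dynamic_power(1.0, 0.7, 0.5)

print(f"short wires:    {short_wire / base:.2f}x baseline power")
print(f"2 units at f/2: {two_slow_units / base:.2f}x baseline power")
```

The voltage term is squared, which is why trading clock speed for parallelism can pay off even though the number of units doubles.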

25

A Prototype Tiled Architecture: The Raw Microprocessor

[Figure: the Raw chip, an array of tiles with Disk stream, Video1, DRAM and Packet stream at its edges]

[Figure: a Raw tile with IMEM, DMEM, PC, REG, FPU and ALU, plus a Raw switch (SWITCH) with its own PC and SMEM]

[Billion transistor IEEE Computer Issue ’97]

www.cag.csail.mit.edu/raw

Scalar operand network (SON): Capable of low latency transport of small (or large) packets

[IEEE TPDS 2005]

26

Virtual reality

Simulator reality

Prototype reality

Product reality

27

Scalar Operand Transport in Raw

fmul r24, r3, r4

software controlled crossbar

fadd r5, r3, r24

route P->E, N->S

route W->P, S->N

Goal: flow controlled, in order delivery of operands
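A minimal sketch of what "flow controlled, in order delivery" means for the fmul/fadd pair above. Everything here is a hypothetical model, not Raw's hardware: `OperandLink` stands in for the P->E and W->P routes, and the consumer is made to drain only every other cycle so that back-pressure shows up.

```python
from collections import deque


class OperandLink:
    """A bounded FIFO between two tiles (hypothetical model of a routed link)."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self.fifo = deque()

    def try_send(self, word) -> bool:
        # Flow control: refuse the word when the buffer is full, so the
        # sender stalls and retries instead of overwriting data.
        if len(self.fifo) >= self.capacity:
            return False
        self.fifo.append(word)
        return True

    def try_recv(self):
        # In-order delivery: words leave in the order they entered.
        return self.fifo.popleft() if self.fifo else None


def run(pairs, consumer_r3, capacity=2):
    """Tile 0 executes the fmul for each (r3, r4); tile 1 executes the fadd."""
    link = OperandLink(capacity)
    pending = deque(pairs)
    results, stalls, cycle = [], 0, 0
    while len(results) < len(pairs):
        if pending:
            r3, r4 = pending[0]
            if link.try_send(r3 * r4):      # fmul r24, r3, r4 (r24 = network port)
                pending.popleft()
            else:
                stalls += 1                 # producer stalls on a full link
        if cycle % 2 == 1:                  # slow consumer, to provoke stalls
            r24 = link.try_recv()
            if r24 is not None:
                results.append(consumer_r3 + r24)   # fadd r5, r3, r24
        cycle += 1
    return results, stalls


results, stalls = run([(1, 2), (3, 4), (5, 6)], consumer_r3=10, capacity=1)
print(results, "producer stalls:", stalls)
```

No operand is dropped or reordered; a full link only delays the producer.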

28

RawCC: Distributed ILP Compilation (DILP)

tmp0 = (seed*3+2)/2
tmp1 = seed*v1+2
tmp2 = seed*v2 + 2
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
v3 = tmp3 - v2

pval5=seed.0*6.0

pval4=pval5+2.0

tmp3.6=pval4/3.0

tmp3=tmp3.6

v3.10=tmp3.6-v2.7

v3=v3.10

v2.4=v2

pval3=seed.0*v2.4

tmp2.5=pval3+2.0

tmp2=tmp2.5

pval6=tmp1.3-tmp2.5

v2.7=pval6*5.0

v2=v2.7

seed.0=seed

pval1=seed.0*3.0

pval0=pval1+2.0

tmp0.1=pval0/2.0

tmp0=tmp0.1

v1.2=v1

pval2=seed.0*v1.2

tmp1.3=pval2+2.0

tmp1=tmp1.3

pval7=tmp1.3+tmp2.5

v1.8=pval7*3.0

v1=v1.8

v0.9=tmp0.1-v1.8

v0=v0.9


Black arrows = Operand Communication over SON

[ASPLOS 1998]

Partitioning

Place, Route, Schedule

C
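To see how much parallelism the renamed code on this slide exposes, one can level its dataflow graph. The sketch below is not RawCC's algorithm; it is a plain as-soon-as-possible (ASAP) leveling that ignores communication latency and tile count. The dependence table is transcribed from the renamed listing (copies such as tmp3=tmp3.6 are left out, and seed.0, v1.2 and v2.4 are treated as inputs).

```python
# Operation -> operands it consumes, transcribed from the renamed listing.
DEPS = {
    "pval1": ("seed.0",),            # seed.0*3.0
    "pval0": ("pval1",),             # pval1+2.0
    "tmp0.1": ("pval0",),            # pval0/2.0
    "pval2": ("seed.0", "v1.2"),     # seed.0*v1.2
    "tmp1.3": ("pval2",),            # pval2+2.0
    "pval3": ("seed.0", "v2.4"),     # seed.0*v2.4
    "tmp2.5": ("pval3",),            # pval3+2.0
    "pval5": ("seed.0",),            # seed.0*6.0
    "pval4": ("pval5",),             # pval5+2.0
    "tmp3.6": ("pval4",),            # pval4/3.0
    "pval6": ("tmp1.3", "tmp2.5"),   # tmp1.3-tmp2.5
    "v2.7": ("pval6",),              # pval6*5.0
    "pval7": ("tmp1.3", "tmp2.5"),   # tmp1.3+tmp2.5
    "v1.8": ("pval7",),              # pval7*3.0
    "v0.9": ("tmp0.1", "v1.8"),      # tmp0.1-v1.8
    "v3.10": ("tmp3.6", "v2.7"),     # tmp3.6-v2.7
}


def asap_levels(deps):
    """Earliest step at which each operation can run, given unlimited ALUs."""
    levels = {}

    def level(name):
        if name not in deps:         # a live-in value, available at step 0
            return 0
        if name not in levels:
            levels[name] = 1 + max(level(src) for src in deps[name])
        return levels[name]

    for op in deps:
        level(op)
    return levels


levels = asap_levels(DEPS)
depth = max(levels.values())
width = [sum(1 for l in levels.values() if l == step) for step in range(1, depth + 1)]
print(len(DEPS), "ops,", depth, "steps, ops per step:", width)
```

Under these idealized assumptions the 16 operations fit in 5 dependent steps with at most 4 running at once, which is the kind of slack a partitioner can spread across tiles.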

29

Virtual reality

Simulator reality

Prototype reality

Product reality

30

A Tiled Processor Architecture Prototype: the Raw Microprocessor

October 02

Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Rajeev Barua, Elliot Waingold, Jonathan Babb, Sri Devabhaktuni, Saman Amarasinghe, Anant Agarwal

31

Raw Die Photo

IBM 0.18 micron process, 16 tiles, 425MHz, 18 Watts (vpenta)

[ISCA 2004]

32

Raw Motherboard

33

Raw Ideas and Decisions: What Worked, What Did Not

• Build a complete prototype system
• Simple processor with single issue cores
• FPGA logic block in each tile
• Distributed ILP and static network
• Static network for streaming
• Multiple types of computation – ILP, streams, TLP, server
• PC in every tile

34

Why Build?

• Compiler (Amarasinghe), OS and runtimes (ISI), apps (ISI, Lincoln Labs, Durand) folks will not work with you unless you are serious about building hardware
• Need motivation to build software tools – compilers, runtimes, debugging, visualization – many challenges here
• Run large data sets (simulation takes forever even with 100 servers!)
• Many hard problems show up or are better understood after you begin building (how to maintain ordering for distributed ILP, slack for streaming codes)
• Have to solve hard problems – no magic!
• The more radical the idea, the more important it is to build
– World will only trust end-to-end results since it is too hard to dive into details and understand all assumptions
– Would you believe this: “Prof. John Bull has demonstrated a simulation prototype of a 64-way issue out-of-order superscalar”
• Cycle simulator became cycle accurate simulator only after HW got precisely defined
• Don’t bother to commercialize unless you have a working prototype
• Total network power few percent for real apps [Aug 2003 ISLPED, Kim et al. Energy characterization of a tiled architecture processor with on-chip networks] [MIT-CSAIL-TR-2008-066 Energy scalability of on-chip interconnection networks in multicore architectures]
– Network power is a few percent in Raw for real apps; it reaches 36% only for a highly contrived synthetic sequence meant to toggle every network wire

35

Raw Ideas and Decisions: What Worked, What Did Not

• Build a complete prototype system
• Simple processor, single issue
• FPGA logic block in each tile
• Distributed ILP
• Static network for streaming
• Multiple types of computation – ILP, streams, TLP, server
• PC in every tile

Yes
1GHz, 2-way, in-order in 2016
No
Yes ‘02, No ‘06, Yes ‘14
Yes
Yes

36



Raw Ideas and Decisions: Streaming – Interconnect Support

route P->E, N->S

route W->P, S->N

Forced synchronization in static network

37

Streaming in Tilera’s Tile Processor

sub r5, r3, r55

add r55, r3, r4

[Figure labels: Dynamic Switch (two shown), TAG, Catch all]

• Streaming done over dynamic interconnect with stream demuxing (AsTrO SDS)
• Automatic demultiplexing of streams into registers
• Number of streams is virtualized
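The demultiplexing idea in these bullets can be sketched as tag matching. This is a hypothetical software model, not Tilera's hardware interface: the class name, the (tag, word) packet shape and the catch-all handling are all assumptions made for illustration.

```python
from collections import deque


class StreamDemux:
    """Sorts incoming (tag, word) packets into per-stream queues (hypothetical)."""

    def __init__(self, tags):
        self.queues = {tag: deque() for tag in tags}  # one queue per bound tag
        self.catch_all = deque()                      # everything else lands here

    def deliver(self, tag, word):
        queue = self.queues.get(tag)
        if queue is not None:
            queue.append(word)
        else:
            # Unbound tag: keep the tag so software can sort it out later.
            self.catch_all.append((tag, word))

    def read(self, tag):
        """What a register read mapped to `tag` would return (None if empty)."""
        queue = self.queues[tag]
        return queue.popleft() if queue else None


demux = StreamDemux(["A", "B"])
for tag, word in [("A", 1), ("C", 9), ("B", 2), ("A", 3)]:
    demux.deliver(tag, word)

print(demux.read("A"), demux.read("A"), demux.read("B"), list(demux.catch_all))
```

Words of one stream stay in order even when packets of other streams are interleaved on the same dynamic network; streams beyond the bound tags fall through to the catch-all queue, which is one way the stream count could be virtualized.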

38

Virtual reality

Simulator reality

Prototype reality

Product reality

39

Why Do We Care? Markets Demanding More Performance

Wireless Networks
- Demand for high throughput – more channels
- Fast moving standards LTE, services

Networking market
- Demand for high performance – 10Gbps
- Demand for more services, intelligence

Digital Multimedia market
- Demand for high performance – H.264 HD
- Demand for more services – VoD, transcode

[Figure labels: Cable & Broadcast, Video Conferencing, Switches, Security Appliances, Routers, GGSN, Base Station]

… and with power efficiency and programming ease

40

Tilera’s TILEPro64™ Processor

Multicore Performance (90nm)
Number of tiles: 64
Cache-coherent distributed cache: 5 MB
Operations @ 750MHz (32, 16, 8 bit): 144-192-384 BOPS
Bisection bandwidth: 2 Terabits per second

Power Efficiency
Power per tile (depending on app): 170 – 300 mW
Core power for h.264 encode (64 tiles): 12W
Clock speed: Up to 866 MHz

I/O and Memory Bandwidth
I/O bandwidth: 40 Gbps
Main Memory bandwidth: 200 Gbps

Programming
ANSI standard C
SMP Linux programming
Stream programming

Product reality

[Tile64, Hotchips 2007] [Tile64, Microprocessor Report Nov 2007]
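A quick sanity check on the quoted figures, using only numbers from this slide. The interpretation of the results (operations per tile per cycle, and a chip-level power range obtained by multiplying the per-tile range by 64) is an assumption, not something the slide states.

```python
TILES = 64
CLOCK_MHZ = 750                      # the clock the BOPS figures are quoted at
BOPS = {32: 144, 16: 192, 8: 384}    # billions of operations per second, by operand width

# Operations per tile per cycle implied by each BOPS figure.
ops_per_tile_cycle = {
    bits: bops * 1000 / (TILES * CLOCK_MHZ) for bits, bops in BOPS.items()
}

# Range implied by 170 to 300 mW per tile across all 64 tiles, in watts.
core_power_w = (TILES * 170 / 1000, TILES * 300 / 1000)

print(ops_per_tile_cycle)
print(core_power_w)
```

The 32-bit figure works out to 3 operations per tile per cycle, and the 12W quoted for h.264 encode sits inside the implied power range.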

41

Tile Processor Block Diagram: A Complete System on a Chip

[Figure: block diagram. On-chip I/O: PCIe 0 and PCIe 1 (MAC, PHY), XAUI 0 and XAUI 1 (MAC, PHY), Serdes, GbE 0, GbE 1, Flexible IO, UART, HPI, JTAG, I2C, SPI, DDR2 Memory Controller 3. Inside a tile: PROCESSOR (P0, P1, P2, Reg File), CACHE (L2 CACHE, L1I, L1D, ITLB, DTLB, 2D DMA), SWITCH (STN, MDN, TDN, UDN, IDN)]

42

Tile Processor NoC

• 5 independent non-blocking networks
– 64 switches per network
– 1 Terabit/sec per Tile

• Each network switch directly and independently connected to tiles

• One hop per clock on all networks

• I/O write example

• Memory write example

• Tile to Tile access example

• All accesses can be performed simultaneously on non-blocking networks
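With one hop per clock, network latency between two tiles is set by the number of switches crossed. A minimal sketch, assuming the 64 switches form an 8 x 8 mesh and that packets are routed along X first and then Y (both are assumptions for illustration; the slide states neither):

```python
GRID = 8  # 64 tiles arranged as an 8 x 8 mesh (assumed layout)


def tile_xy(tile_id: int):
    """Map a tile number 0..63 to (x, y) mesh coordinates."""
    return tile_id % GRID, tile_id // GRID


def xy_route(src: int, dst: int):
    """Switches visited after the source, moving along X first, then Y."""
    x, y = tile_xy(src)
    dx, dy = tile_xy(dst)
    path = []
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def hop_cycles(src: int, dst: int) -> int:
    """Clock cycles spent hopping, at one hop per clock."""
    return len(xy_route(src, dst))


print(hop_cycles(0, 63), xy_route(0, 9))
```

Under these assumptions the worst case, corner to corner, costs 14 hop cycles, and neighboring tiles are one cycle apart.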

[Figure labels: UDN, STN, IDN, MDN, TDN, VDN, Tiles]

[IEEE Micro Sep 2007]

43

Multicore Hardwall Implementation
Or Protection and Interconnects

[Figure: tile partitions OS1/APP1, OS1/APP3 and OS2/APP2; two Switch blocks joined by data and valid signals, with a HARDWALL_ENABLE control]
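One way to read the figure is that HARDWALL_ENABLE gates the valid signal on links that cross a partition boundary. The sketch below models only that reading; the class, the gating rule and the tile numbering are hypothetical and not taken from Tilera documentation.

```python
from dataclasses import dataclass


@dataclass
class SwitchLink:
    """A link between two neighboring switches (hypothetical model)."""

    hardwall_enable: bool = False  # models the HARDWALL_ENABLE bit

    def forward(self, data: int, valid: bool):
        """Return (data, valid) as seen by the neighboring switch."""
        # With the hardwall up, valid is forced low, so the neighbor
        # never accepts the word even though data is still driven.
        if self.hardwall_enable:
            return data, False
        return data, valid


# Three tiles in a row; the partition names follow the figure labels.
owner = {0: "OS1/APP1", 1: "OS1/APP1", 2: "OS2/APP2"}
links = {
    (a, b): SwitchLink(hardwall_enable=(owner[a] != owner[b]))
    for a, b in [(0, 1), (1, 2)]
}

print(links[(0, 1)].forward(42, True))  # inside one partition
print(links[(1, 2)].forward(42, True))  # across the partition boundary
```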

44

Product Reality Differences

• Market forces
– Need crisper answer to “who cares”
– SMP Linux programming with pthreads – fully cache coherent
– C + API approach to streaming vs new language StreamIt in Raw
– Special instructions for video, networking
– Floating point needed in research project, but not in product for embedded market
• Lessons from Raw
– E.g., Dynamic network for streams
– HW instruction cache
– Protected interconnects
• More substantial engineering
– 3-way VLIW CPU, subword arithmetic
– Engineering for clock speed and power efficiency
– Completeness – I/O interfaces on chip – complete system chip. Just add DRAM for system
– Support for virtual memory, 2D DMA
– Runs SMP Linux (can run multiple OSes simultaneously)

45

Virtual reality


Product reality

46

What Does the Future Look Like?

Corollary of Moore’s law: Number of cores will double every 18 months

            ‘02    ‘05    ‘08    ‘11    ‘14
Research     16     64    256   1024   4096
Industry      4     16     64    256   1024

(Cores minimally big enough to run a self respecting OS!)

1K cores by 2014! Are we ready?
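The core counts on this slide follow from repeated doubling. A minimal sketch, assuming the 2002 starting points read off the slide (16 cores for research, 4 for industry) and whole 18-month periods:

```python
def projected_cores(year: int, base_year: int, base_cores: int) -> int:
    """Cores in `year`, doubling every 18 months from `base_cores` in `base_year`."""
    months = (year - base_year) * 12
    return base_cores * 2 ** (months // 18)


YEARS = (2002, 2005, 2008, 2011, 2014)
for label, base in (("Research", 16), ("Industry", 4)):
    print(label, [projected_cores(y, 2002, base) for y in YEARS])
```

Three years is two doublings, so each column is 4x the previous one, and the industry row reaches 1024 cores in 2014.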

47

Vision for the Future

• The ‘core’ is the logic gate of the 21st century

[Figure: a chip filled with a large array of small tiles, each labeled p, m and s]

48

Research Challenges for 1K Cores

• 4-16 cores not interesting. Industry is there. University must focus on “1K cores”; Everything will change!
• Can we use 4 cores to get 2X through DILP? Remember cores will be 1GHz and simple! What is the interconnect?
• How should we program 1K cores? Can interconnect help with programming?
• Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
• Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier?
• What is the right grain size for a core?
• How must our computational models change in the face of small memories per core?
• How to “feed the beast”? I/O and external memory bandwidth
• Can we assume perfect reliability any longer?

49

ATAC Architecture

[Figure: 16 tiles, each with a processor (p), a memory (m) and a switch]

Optical Broadcast WDM Interconnect

Electrical Mesh Interconnect (EMesh)

[Proc. BARC Jan 2007, MIT-CSAIL-TR-2009-018 ]

50



• Can we use 4 cores to get 2X through DILP? What is the interconnect?
• How should we program 1K cores? Can interconnect help with programming?
• Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
• Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier?


51

FOS – Factored Operating System

• Today: User app and OS kernel thrash each other in a core’s cache
• User/OS time sharing is inefficient

• Angstrom: OS assumes abstracted space model. OS services bound to distinct cores, separate from user cores. OS service cores collaborate to achieve best resource management
• User/OS space sharing is efficient

The key idea: space sharing replaces time sharing

OS cores collaborate, inspired by distributed internet services model

[Figure labels: OS (six cores shown), File System (FS), I/O, User App; message “Need new page”]

[OS Review 2008]
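Space sharing can be pictured as user cores sending messages to cores that run nothing but one OS service. The sketch below is a hypothetical single-threaded model, not the FOS API: `ServiceCore`, `PageAllocator` and the message format are invented for illustration, with the request text borrowed from the figure.

```python
from collections import deque


class ServiceCore:
    """A core dedicated to one OS service, reached only by messages (hypothetical)."""

    def __init__(self, name, handler):
        self.name = name
        self.handler = handler
        self.mailbox = deque()

    def step(self):
        """Handle one pending request; return (reply_to, reply) or None if idle."""
        if not self.mailbox:
            return None
        reply_to, request = self.mailbox.popleft()
        return reply_to, self.handler(request)


class PageAllocator:
    """Hands out free pages in order (stand-in for a memory service)."""

    def __init__(self, free_pages):
        self.free = deque(free_pages)

    def __call__(self, request):
        if request == "need new page" and self.free:
            return ("page", self.free.popleft())
        return ("error", request)


memory_core = ServiceCore("page-allocator", PageAllocator([0x1000, 0x2000]))

# The user core posts a message instead of trapping into a kernel that
# shares its cache; the service runs elsewhere and replies.
memory_core.mailbox.append(("user-core-7", "need new page"))
reply = memory_core.step()
print(reply)
```

In this model the user core's cache only ever holds user code and data, since the service's working set lives on a different core.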

52



• Can we use 4 cores to get 2X through DILP? What is the interconnect?
• How should we program 1K cores? Can interconnect help with programming?
• Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
• Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier?


53

The following are trademarks of Tilera Corporation: Tilera, the Tilera Logo, Tile Processor, TILE64, Embedding Multicore, Multicore Development Environment, Gentle Slope Programming, iLib, iMesh and Multicore Hardwall. All other trademarks and/or registered trademarks are the property of their respective owners.
