1 Digital Space Anant Agarwal MIT and Tilera Corporation.

Date post: 20-Dec-2015
Transcript
Page 1: 1 Digital Space Anant Agarwal MIT and Tilera Corporation.

1

Digital Space

Anant Agarwal

MIT and Tilera Corporation

Page 2

Arecibo

Page 3

Stages of Reality

[Figure: timeline of chips, from a single CPU with memory (Mem) to arrays of many small tiles labeled p, m, s. Year labels: 1996, 1997, 2002, 2007, 2014, 2018. Transistor labels: 1B transistors, 100B transistors.]

Page 4

Virtual reality

Simulator reality

Prototype reality

Product reality

Page 5

The Opportunity

20 MIPS CPU in 1987

1996…

Few thousand gates

Page 6

The Opportunity

The billion transistor chip of 2007

Page 7

How to Fritter Away Opportunity

the x1786? does not scale

100 ported RegFil and RR

Caches

Control

More resolution buffers, control

“1/10 ns”

Page 8

Take Inspiration from ASICs

• Lots of ALUs, lots of registers, lots of local memories – huge on-chip parallelism – but with a slower clock

• Custom-routed, short wires optimized for specific applications

Fast, low power, area efficient

But not programmable

Page 9

Our Early Raw Proposal

CPU

Mem

E.g., 100-way unrolled loop, running on 100 ALUs, 1000 regs, 100 memory banks

But how to build programmable, yet custom, wires?

Got parallelism?

Page 10

A digital wire

Software orchestrate it!

• Customize to application and maximize utilization

Multiplex it!

• Improve utilization

Pipeline it!

• Fast clock (10GHz in 2010)

Uh! What were we smoking!

A dynamic router!

Replace custom wires with routed on-chip networks

Page 11

Static Router

[Figure: Application, Compiler, and a Switch Code block at each of five switches]

A static router!

Page 12

Replace Wires with Routed Networks

Page 13

50-Ported Register File → Distributed Registers

Gigantic 50-ported register file

Page 14

50-Ported Register File → Distributed Registers

Gigantic 50-ported register file

Page 15

Distributed Registers + Routed Network

Distributed register file

R

Called NURA [ASPLOS 1998]

Page 16

16-Way ALU Clump → Distributed ALUs

[Figure: 16 ALUs, a Bypass Net, and a register file (RF)]

Page 17

Distributed ALUs, Routed Bypass Network

[Figure: distributed blocks labeled ALU and R]

Scalar Operand Network (SON) [TPDS 2005]

Page 18

Mongo Cache → Distributed Cache

Gigantic 10-ported cache

Page 19

Distributing the Cache

Page 20

Distributed Shared Cache

[Figure: tiles labeled ALU, R, and $]

Like DSM (distributed shared memory), the cache is distributed; but, unlike NUCA, caches are local to processors, not far away

[ISCA 1999]

Page 21

Tiled Multicore Architecture

[Figure: array of tiles labeled ALU, R, and $]

Page 22

E.g., Operand Routing in 16-way Superscalar

[Figure: 16 ALUs, a Bypass Net, and a register file (RF), with a >> operation and a + operation marked]

Source: [Taylor ISCA 2004]

Page 23

Operand Routing in a Tiled Architecture

[Figure: tiles labeled ALU, R, and $, with a >> operation and a + operation marked]

Page 24

Tiled Multicore

• Scales to large numbers of cores
• Modular – design, layout and verify 1 tile
• Power efficient [MIT-CSAIL-TR-2008-066]
– Short wires CV²f
– Chandrakasan effect CV²f
– Dynamic and compiler scheduled routing

Tile = Processor Core + Switch (S)
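Both power bullets refer to the standard CMOS switching-power relation P = C·V²·f. The following is a minimal sketch, with made-up normalized numbers that are not from the talk, of how shorter wires (smaller C) and the Chandrakasan effect (taken here to mean parallel units that allow a lower f and V at equal throughput) each lower P. The scaling factors 0.5 and 0.7 are assumptions chosen only for illustration.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Switching power of CMOS logic: P = C * V^2 * f (arbitrary units)."""
    return capacitance * voltage ** 2 * frequency


# Baseline: one unit, all quantities normalized to 1.
baseline = dynamic_power(1.0, 1.0, 1.0)

# Short wires: assume (hypothetically) that switched capacitance is halved.
short_wires = dynamic_power(0.5, 1.0, 1.0)

# Parallelism: two units (2x capacitance), each clocked at half the frequency,
# which is assumed here to permit 0.7x the supply voltage.
# Throughput is unchanged: 2 units * 0.5 f = 1.
parallel = dynamic_power(2.0, 0.7, 0.5)

print(baseline, short_wires, round(parallel, 2))
```

The quadratic dependence on V is what makes the second case pay off: capacitance doubles, but the lower voltage more than compensates.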

Page 25

A Prototype Tiled Architecture: The Raw Microprocessor

[Figure: The Raw Chip, an array of tiles, with Disk stream, Video1, DRAM, and Packet stream attached]

[Figure: A Raw Tile, with blocks IMEM, DMEM, PC, REG, FPU, ALU, and SWITCH; the Raw Switch, with PC and SMEM]

[Billion transistor IEEE Computer Issue ’97]

www.cag.csail.mit.edu/raw

Scalar operand network (SON): Capable of low latency transport of small (or large) packets

[IEEE TPDS 2005]

Page 26

Virtual reality

Simulator reality

Prototype reality

Product reality

Page 27

Scalar Operand Transport in Raw

fmul r24, r3, r4

software controlled crossbar: route P->E, N->S

software controlled crossbar: route W->P, S->N

fadd r5, r3, r24

Goal: flow controlled, in order delivery of operands
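The operand path on this slide can be sketched as a chain of small FIFOs. This is a hypothetical Python model, not Raw's hardware: the `Link` class, its capacity, and the function name `transport` are illustrative, and it assumes that writing or reading r24 pushes to or pops from the network. Only the P->E and W->P routes are modeled.

```python
from collections import deque


class Link:
    """One flow-controlled, in-order hop (a hypothetical model)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.fifo = deque()

    def send(self, value):
        if len(self.fifo) >= self.capacity:
            raise RuntimeError("link full: the sender would stall")
        self.fifo.append(value)

    def recv(self):
        if not self.fifo:
            raise RuntimeError("link empty: the receiver would stall")
        return self.fifo.popleft()


def transport(r3_first, r4_first, r3_second):
    proc_to_switch = Link()   # first tile: processor to its switch
    east_west = Link()        # wire between the two switches
    switch_to_proc = Link()   # second tile: switch to its processor

    # fmul r24, r3, r4 : writing r24 pushes the product into the network
    proc_to_switch.send(r3_first * r4_first)
    # route P->E on the first switch
    east_west.send(proc_to_switch.recv())
    # route W->P on the second switch
    switch_to_proc.send(east_west.recv())
    # fadd r5, r3, r24 : reading r24 pops the operand from the network
    r5 = r3_second + switch_to_proc.recv()
    return r5


print(transport(2.0, 3.0, 1.0))
```

Bounded FIFOs stand in for flow control, and `popleft` preserves arrival order, which are the two properties named in the goal line.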

Page 28

RawCC: Distributed ILP Compilation (DILP)

C source:

tmp0 = (seed*3+2)/2
tmp1 = seed*v1+2
tmp2 = seed*v2+2
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
v3 = tmp3 - v2

[Figure: data-flow graph of the renamed instructions, drawn twice on the slide; its nodes are listed once below]

pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10

v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7

seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1

v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9

Black arrows = Operand Communication over SON

[ASPLOS 1998]

Partitioning

Place, Route, Schedule
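To see how much parallelism the source block above exposes, here is a minimal sketch that assigns each source statement to the earliest step at which its inputs are ready. It works at statement level rather than on the renamed instructions, tracks only true (read-after-write) dependences, and assumes renaming has removed the reuse of v1 and v2, as the renamed form on the slide (v1.2, v1.8, v2.4, v2.7) suggests. It is not RawCC's partitioning algorithm; `STATEMENTS` and `schedule_levels` are illustrative names.

```python
# (destination, sources) for each statement of the source block, in order.
STATEMENTS = [
    ("tmp0", ["seed"]),
    ("tmp1", ["seed", "v1"]),
    ("tmp2", ["seed", "v2"]),
    ("tmp3", ["seed"]),
    ("v2", ["tmp1", "tmp3"]),
    ("v1", ["tmp1", "tmp2"]),
    ("v0", ["tmp0", "v1"]),
    ("v3", ["tmp3", "v2"]),
]


def schedule_levels(statements):
    """Return the earliest step for each statement, honoring true dependences."""
    writer_level = {}  # variable -> step of the statement that last wrote it
    levels = []
    for dest, sources in statements:
        level = 0
        for src in sources:
            if src in writer_level:  # value produced inside this block
                level = max(level, writer_level[src] + 1)
        levels.append(level)
        writer_level[dest] = level
    return levels


levels = schedule_levels(STATEMENTS)
print(levels)
```

Under these assumptions the eight statements fit in three steps, with four independent statements in the first step, which is the kind of slack a compiler can spread across tiles.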

Page 29

Virtual reality

Simulator reality

Prototype reality

Product reality

Page 30

A Tiled Processor Architecture Prototype: the Raw Microprocessor

October 02

Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Rajeev Barua, Elliot Waingold, Jonathan Babb, Sri Devabhaktuni, Saman Amarasinghe, Anant Agarwal

Page 31

Raw Die Photo

IBM .18 micron process, 16 tiles, 425MHz, 18 Watts (vpenta)

[ISCA 2004]

Page 32

Raw Motherboard

Page 33

Raw Ideas and Decisions: What Worked, What Did Not

• Build a complete prototype system
• Simple processor with single issue cores
• FPGA logic block in each tile
• Distributed ILP and static network
• Static network for streaming
• Multiple types of computation – ILP, streams, TLP, server
• PC in every tile

Page 34

Why Build?

• Compiler (Amarasinghe), OS and runtimes (ISI), apps (ISI, Lincoln Labs, Durand) folks will not work with you unless you are serious about building hardware
• Need motivation to build software tools -- compilers, runtimes, debugging, visualization – many challenges here
• Run large data sets (simulation takes forever even with 100 servers!)
• Many hard problems show up or are better understood after you begin building (how to maintain ordering for distributed ILP, slack for streaming codes)
• Have to solve hard problems – no magic!
• The more radical the idea, the more important it is to build
– World will only trust end-to-end results since it is too hard to dive into details and understand all assumptions
– Would you believe this: “Prof. John Bull has demonstrated a simulation prototype of a 64-way issue out-of-order superscalar”
• Cycle simulator became cycle accurate simulator only after HW got precisely defined
• Don’t bother to commercialize unless you have a working prototype
• Total network power few percent for real apps [Aug 2003 ISLPED, Kim et al. Energy characterization of a tiled architecture processor with on-chip networks] [MIT-CSAIL-TR-2008-066 Energy scalability of on-chip interconnection networks in multicore architectures]
– Network power is few percent in Raw for real apps; it reaches 36% only for a highly contrived synthetic sequence meant to toggle every network wire

Page 35

Raw Ideas and Decisions: What Worked, What Did Not

• Build a complete prototype system
• Simple processor, single issue
• FPGA logic block in each tile
• Distributed ILP
• Static network for streaming
• Multiple types of computation – ILP, streams, TLP, server
• PC in every tile

Verdicts shown beside the bullets (their alignment is not preserved in this transcript):

Yes

1GHz, 2-way, inorder in 2016

No

Yes ‘02, No ‘06, Yes ‘14

Yes

Yes

Page 36

Raw Ideas and Decisions: Streaming – Interconnect Support

software controlled crossbar: route P->E, N->S

software controlled crossbar: route W->P, S->N

Forced synchronization in static network

Page 37

Streaming in Tilera’s Tile Processor

sub r5, r3, r55

Dynamic Switch

add r55, r3, r4

Dynamic Switch

[Figure labels: TAG, Catch all]

• Streaming done over dynamic interconnect with stream demuxing (AsTrO SDS)
• Automatic demultiplexing of streams into registers
• Number of streams is virtualized
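The demultiplexing idea in these bullets can be sketched as tag matching: each arriving word carries a tag, words with a known tag go to that stream's queue, and anything else lands in a catch-all queue. This is a hypothetical Python model, not Tilera's hardware or API; the class `StreamDemux` and its method names are invented for illustration.

```python
from collections import deque


class StreamDemux:
    """Sort tagged words from one shared network into per-stream queues."""

    def __init__(self, tags):
        self.queues = {tag: deque() for tag in tags}
        self.catch_all = deque()  # words whose tag matches no stream

    def deliver(self, tag, word):
        queue = self.queues.get(tag)
        if queue is None:
            self.catch_all.append((tag, word))
        else:
            queue.append(word)

    def read(self, tag):
        """Pop the next word of one stream, in arrival order."""
        return self.queues[tag].popleft()


demux = StreamDemux(tags=["video", "audio"])
for tag, word in [("video", 1), ("audio", 10), ("video", 2), ("ctrl", 99)]:
    demux.deliver(tag, word)

print(demux.read("video"), demux.read("video"), demux.read("audio"))
print(list(demux.catch_all))
```

Because streams are identified by tag rather than by a dedicated wire, the number of streams is limited by the tag table rather than by the physical links, which is one way to read "virtualized" here.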

Page 38

Virtual reality

Simulator reality

Prototype reality

Product reality

Page 39

Why Do We Care? Markets Demanding More Performance

Wireless Networks
- Demand for high throughput – more channels
- Fast moving standards LTE, services
- Base Station, GGSN

Networking market
- Demand for high performance – 10Gbps
- Demand for more services, intelligence
- Switches, Security Appliances, Routers

Digital Multimedia market
- Demand for high performance – H.264 HD
- Demand for more services – VoD, transcode
- Cable & Broadcast, Video Conferencing

… and with power efficiency and programming ease

Page 40

Tilera’s TILEPro64™ Processor

Multicore Performance (90nm)
Number of tiles: 64
Cache-coherent distributed cache: 5 MB
Operations @ 750MHz (32, 16, 8 bit): 144-192-384 BOPS
Bisection bandwidth: 2 Terabits per second

Power Efficiency
Power per tile (depending on app): 170 – 300 mW
Core power for h.264 encode (64 tiles): 12W
Clock speed: Up to 866 MHz

I/O and Memory Bandwidth
I/O bandwidth: 40 Gbps
Main Memory bandwidth: 200 Gbps

Programming
ANSI standard C
SMP Linux programming
Stream programming

Product reality

[Tile64, Hotchips 2007]
[Tile64, Microprocessor Report Nov 2007]
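The headline numbers on this slide can be cross-checked with simple arithmetic. The sketch below divides the quoted BOPS figures by 64 tiles × 750 MHz to get the operations per tile per cycle they imply, and divides the quoted 12 W by 64 tiles. The derived per-tile values are not stated on the slide; they are only what the quoted figures work out to.

```python
TILES = 64
CLOCK_HZ = 750_000_000  # the 750 MHz at which the BOPS figures are quoted

# Quoted billions of operations per second for 32-, 16-, and 8-bit data.
quoted_bops = {32: 144, 16: 192, 8: 384}

# Implied operations per tile per cycle = total ops/s / (tiles * clock).
ops_per_tile_cycle = {
    width: bops * 1_000_000_000 / (TILES * CLOCK_HZ)
    for width, bops in quoted_bops.items()
}

# Quoted core power for h.264 encode on 64 tiles, spread evenly per tile.
per_tile_mw = 12 * 1000 / TILES

print(ops_per_tile_cycle)
print(per_tile_mw)
```

The 32-bit result of 3 operations per tile per cycle agrees with the 3-way VLIW CPU mentioned later in the deck, and 187.5 mW per tile falls inside the quoted 170 to 300 mW range.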

Page 41

Tile Processor Block Diagram
A Complete System on a Chip

[Figure: block diagram. Chip-level blocks: PCIe 0 and PCIe 1 (MAC, PHY, Serdes); XAUI 0 and XAUI 1 (MAC, PHY, Serdes); GbE 0; GbE 1; Flexible IO; UART, HPI, JTAG, I2C, SPI; DDR2 Memory Controllers 0, 1, 2, 3. Per-tile blocks: PROCESSOR (P0, P1, P2, Reg File); CACHE (L2 CACHE, L1I, L1D, ITLB, DTLB, 2D DMA); SWITCH (STN, MDN, TDN, UDN, IDN).]

Page 42

Tile Processor NoC

• 5 independent non-blocking networks
– 64 switches per network
– 1 Terabit/sec per Tile

• Each network switch directly and independently connected to tiles

• One hop per clock on all networks

• I/O write example

• Memory write example

• Tile to Tile access example

• All accesses can be performed simultaneously on non-blocking networks

[Figure: Tiles and network layers labeled UDN, STN, IDN, MDN, TDN, VDN]

[IEEE Micro Sep 2007]
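With one hop per clock, the transit time between two tiles is just the number of hops on the route. The sketch below assumes the 64 switches form an 8×8 mesh and that packets travel along X first and then Y (dimension-ordered routing); neither assumption is stated on the slide, and `xy_route` and `hop_cycles` are illustrative names. It ignores injection, ejection, and contention.

```python
def xy_route(src, dst):
    """List the switches visited from src to dst, moving in X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_cycles(src, dst):
    """Clock cycles spent hopping, at one hop per clock."""
    return len(xy_route(src, dst)) - 1


print(xy_route((1, 1), (3, 0)))
print(hop_cycles((0, 0), (7, 7)))
```

Under these assumptions the worst case on an 8×8 mesh is corner to corner, 14 hops, and the cost grows with distance, which is why locality matters in the later research-challenge slides.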

Page 43

Multicore Hardwall Implementation
Or Protection and Interconnects

OS1/APP1

OS1/APP3

OS2/APP2

[Figure: two Switch blocks joined by data and valid signals, with a HARDWALL_ENABLE control]
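One plausible reading of the diagram is that HARDWALL_ENABLE suppresses the valid signal on links that cross from one protected region to another. The sketch below works under that assumption only; it is not Tilera's implementation. The layout, `hardwalled_links`, and `forward` are hypothetical, and regions are distinguished simply by their label.

```python
def hardwalled_links(partition):
    """Return the mesh links whose two end tiles belong to different regions.

    `partition` is a list of rows; each entry names the OS/app owning a tile.
    A link is a pair of neighbouring (row, col) coordinates.
    """
    links = set()
    rows, cols = len(partition), len(partition[0])
    for r in range(rows):
        for c in range(cols):
            for nr, nc in ((r, c + 1), (r + 1, c)):  # east and south neighbours
                if nr < rows and nc < cols and partition[r][c] != partition[nr][nc]:
                    links.add(((r, c), (nr, nc)))
    return links


def forward(data, valid, hardwall_enable):
    """Gate the valid bit: with the hardwall enabled, nothing crosses the link."""
    return data, valid and not hardwall_enable


layout = [
    ["OS1/APP1", "OS1/APP1", "OS2/APP2"],
    ["OS1/APP3", "OS1/APP3", "OS2/APP2"],
]
walls = hardwalled_links(layout)
print(sorted(walls))
print(forward(5, True, True), forward(5, True, False))
```

Links inside a region keep working normally; only the boundary links are walled, so each region sees a private piece of the mesh.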

Page 44

Product Reality Differences

• Market forces
– Need crisper answer to “who cares”
– SMP Linux programming with pthreads – fully cache coherent
– C + API approach to streaming vs new language StreamIt in Raw
– Special instructions for video, networking
– Floating point needed in research project, but not in product for embedded market

• Lessons from Raw
– E.g., Dynamic network for streams
– HW instruction cache
– Protected interconnects

• More substantial engineering
– 3-way VLIW CPU, subword arithmetic
– Engineering for clock speed and power efficiency
– Completeness – I/O interfaces on chip – complete system chip. Just add DRAM for system
– Support for virtual memory, 2D DMA
– Runs SMP Linux (can run multiple OSes simultaneously)

Page 45

Virtual reality

Simulator reality

Prototype reality

Product reality

Page 46

What Does the Future Look Like?

Corollary of Moore’s law: Number of cores will double every 18 months

            ‘02    ‘05    ‘08    ‘11    ‘14
Research     16     64    256   1024   4096
Industry      4     16     64    256   1024

(Cores minimally big enough to run a self respecting OS!)

1K cores by 2014! Are we ready?
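The projection follows directly from the doubling rule: three years is two 18-month periods, so the core count quadruples between columns. A minimal sketch, taking the 2002 starting points (16 for research, 4 for industry) from the slide; the function name `cores` is illustrative.

```python
def cores(start_cores, start_year, year, months_per_doubling=18):
    """Core count after doubling once per `months_per_doubling` months."""
    doublings = (year - start_year) * 12 // months_per_doubling
    return start_cores * 2 ** doublings


years = [2002, 2005, 2008, 2011, 2014]
research = [cores(16, 2002, y) for y in years]
industry = [cores(4, 2002, y) for y in years]

print(research)
print(industry)
```

Starting from 4 cores in 2002, the industry row reaches 1024 in 2014, which is the "1K cores by 2014" claim.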

Page 47

Vision for the Future

• The ‘core’ is the logic gate of the 21st century

[Figure: a large array of tiles, each labeled p, m, s]

Page 48

Research Challenges for 1K Cores

• 4-16 cores not interesting. Industry is there. University must focus on “1K cores”; Everything will change!
• Can we use 4 cores to get 2X through DILP? Remember cores will be 1GHz and simple! What is the interconnect?
• How should we program 1K cores? Can interconnect help with programming?
• Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
• Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier?
• What is the right grain size for a core?
• How must our computational models change in the face of small memories per core?
• How to “feed the beast”? I/O and external memory bandwidth
• Can we assume perfect reliability any longer?

Page 49

ATAC Architecture

[Figure: 16 tiles, each labeled p, switch, m]

Optical Broadcast WDM Interconnect

Electrical Mesh Interconnect (EMesh)

[Proc. BARC Jan 2007, MIT-CSAIL-TR-2009-018 ]

Page 50

Research Challenges for 1K Cores

• 4-16 cores not interesting. Industry is there. University must focus on “1K cores”; Everything will change!
• Can we use 4 cores to get 2X through DILP? What is the interconnect?
• How should we program 1K cores? Can interconnect help with programming?
• Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
• Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier?
• What is the right grain size for a core?
• How must our computational models change in the face of small memories per core?
• How to “feed the beast”? I/O and external memory bandwidth
• Can we assume perfect reliability any longer?

Page 51

FOS – Factored Operating System

• Today: User app and OS kernel thrash each other in a core’s cache
– User/OS time sharing is inefficient

• Angstrom: OS assumes abstracted space model. OS services bound to distinct cores, separate from user cores. OS service cores collaborate to achieve best resource management
– User/OS space sharing is efficient

The key idea: space sharing replaces time sharing

[Figure: grid of cores labeled OS, FS (File System), I/O, and User App, with a “Need new page” request]

OS cores collaborate, inspired by distributed internet services model

[OS Review 2008]

Page 52

Research Challenges for 1K Cores

• 4-16 cores not interesting. Industry is there. University must focus on “1K cores”; Everything will change!
• Can we use 4 cores to get 2X through DILP? What is the interconnect?
• How should we program 1K cores? Can interconnect help with programming?
• Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
• Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier?
• What is the right grain size for a core?
• How must our computational models change in the face of small memories per core?
• How to “feed the beast”? I/O and external memory bandwidth
• Can we assume perfect reliability any longer?

Page 53

The following are trademarks of Tilera Corporation: Tilera, the Tilera Logo, Tile Processor, TILE64, Embedding Multicore, Multicore Development Environment, Gentle Slope Programming, iLib, iMesh and Multicore Hardwall. All other trademarks and/or registered trademarks are the property of their respective owners.

