1
Digital Space
Anant Agarwal
MIT and Tilera Corporation
2
Arecibo
3
Stages of Reality
[Figure: timeline of chip designs with year labels 1996, 1997, 2002, 2007, 2014, 2018; blocks labeled CPU and Mem; grids of p/m/s tiles; annotations "1B transistors" and "100B transistors"]
4
Virtual reality
Simulator reality
Prototype reality
Product reality
5
The Opportunity
20 MIPS CPU in 1987
1996…
Few thousand gates
6
The Opportunity
The billion transistor chip of 2007
7
How to Fritter Away Opportunity
the x1786? does not scale
100-ported RegFile and RR
Caches
Control
More resolution buffers, control
“1/10 ns”
8
Take Inspiration from ASICs
• Lots of ALUs, lots of registers, lots of local memories: huge on-chip parallelism, but with a slower clock
• Custom-routed, short wires optimized for specific applications
Fast, low power, area efficient
But not programmable
9
Our Early Raw Proposal
CPU
Mem
E.g., 100-way unrolled loop, running on 100 ALUs, 1000 regs, 100 memory banks
But how to build programmable, yet custom, wires?
Got parallelism?
10
A digital wire
Software orchestrate it!
• Customize to application and maximize utilization
Multiplex it!
• Improve utilization
Pipeline it!
• Fast clock (10GHz in 2010)
Uh! What were we smoking!
A dynamic router!
Replace custom wires with routed on-chip networks
11
Static Router
[Figure: an application passes through a compiler, which emits Switch Code for each switch]
A static router!
12
Replace Wires with Routed Networks
13
50-Ported Register File → Distributed Registers
[Figure: a gigantic 50-ported register file]
15
Distributed Registers + Routed Network
[Figure: distributed register file with routers (R)]
Called NURA [ASPLOS 1998]
16
16-Way ALU Clump → Distributed ALUs
[Figure: 16 ALUs sharing a bypass network (Bypass Net) and a register file (RF)]
17
Distributed ALUs, Routed Bypass Network
[Figure: ALUs distributed across the chip, connected by routers (R)]
Scalar Operand Network (SON) [TPDS 2005]
18
Mongo Cache → Distributed Cache
[Figure: a gigantic 10-ported cache]
19
Distributing the Cache
20
Distributed Shared Cache
[Figure: tiles, each with an ALU, a router (R), and a cache ($)]
Like DSM (distributed shared memory), the cache is distributed; but, unlike NUCA, caches are local to processors, not far away
[ISCA 1999]
21
Tiled Multicore Architecture
[Figure: tiles, each with an ALU, a router (R), and a cache ($)]
22
E.g., Operand Routing in 16-way Superscalar
[Figure: 16 ALUs sharing a bypass network (Bypass Net) and a register file (RF), with >> and + operations marked]
Source: [Taylor ISCA 2004]
23
Operand Routing in a Tiled Architecture
[Figure: tiles, each with an ALU, a router (R), and a cache ($), with >> and + operations marked on tiles]
24
Tiled Multicore
• Scales to large numbers of cores
• Modular: design, layout and verify 1 tile
• Power efficient [MIT-CSAIL-TR-2008-066]
  – Short wires: CV²f
  – Chandrakasan effect: CV²f
  – Dynamic and compiler scheduled routing
Processor core + switch (S) = tile
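The CV²f on the power bullets is the usual dynamic switching power relation, P = C·V²·f. As a rough sketch of the two effects, with every number below hypothetical: shorter wires reduce the switched capacitance C, and spreading work over more, slower cores lets the supply voltage V drop. The slide does not spell out the "Chandrakasan effect"; this sketch reads it as that parallelism-for-voltage trade.

```python
def dynamic_power(c_farads, v_volts, f_hertz):
    """Dynamic switching power P = C * V^2 * f, in watts."""
    return c_farads * v_volts ** 2 * f_hertz

# Hypothetical baseline: one core, 1 nF switched capacitance, 1.0 V, 1 GHz.
baseline = dynamic_power(1e-9, 1.0, 1e9)

# Short wires: assume half the switched capacitance at the same V and f.
short_wires = dynamic_power(0.5e-9, 1.0, 1e9)

# Parallelism traded for voltage: two cores, each at half the clock,
# assumed able to run at 0.7 V. Nominal total throughput is unchanged.
two_slow_cores = 2 * dynamic_power(1e-9, 0.7, 0.5e9)

print(baseline, short_wires, two_slow_cores)
```

Under these assumed numbers, both changes roughly halve power; the voltage term matters most because it enters squared.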
25
A Prototype Tiled Architecture: The Raw Microprocessor
[Figure: the Raw chip as a grid of tiles, with I/O labeled Disk stream, Video1, DRAM, and Packet stream]
[Figure: a Raw tile, with processor blocks IMEM, DMEM, PC, REG, FPU, ALU and a Raw switch with SMEM and PC]
[Billion transistor IEEE Computer Issue ’97]
www.cag.csail.mit.edu/raw
Scalar operand network (SON): capable of low latency transport of small (or large) packets
[IEEE TPDS 2005]
26
Virtual reality
Simulator reality
Prototype reality
Product reality
27
Scalar Operand Transport in Raw
fmul r24, r3, r4
software-controlled crossbar: route P->E, N->S
software-controlled crossbar: route W->P, S->N
fadd r5, r3, r24
Goal: flow controlled, in order delivery of operands
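A minimal sketch of the goal stated on this slide, flow-controlled, in-order operand delivery, can be written as two switches moving words between small FIFOs. The `Port` class, the buffer capacity of 2, and the operand values are all assumptions for illustration; this is not Raw's actual switch microarchitecture.

```python
from collections import deque

class Port:
    """A small FIFO link; capacity models the flow-control buffer (assumed size)."""
    def __init__(self, capacity=2):
        self.fifo = deque()
        self.capacity = capacity

    def can_push(self):
        return len(self.fifo) < self.capacity

    def can_pop(self):
        return bool(self.fifo)

def route(src, dst):
    """One static-switch route instruction: move one word, or stall."""
    if src.can_pop() and dst.can_push():
        dst.fifo.append(src.fifo.popleft())
        return True
    return False  # stall: no data yet, or no room downstream

proc_out = Port()  # sending tile: processor -> switch (P)
link = Port()      # sending tile's east port -> receiving tile's west port
proc_in = Port()   # receiving tile: switch -> processor (P)

# Results of successive fmul instructions on the sending tile.
to_send = deque(a * b for a, b in [(2.0, 3.0), (4.0, 5.0), (6.0, 7.0)])
local_r3 = 1.0     # the receiving tile's own r3 (hypothetical value)
received = []

for cycle in range(20):
    if proc_in.can_pop():
        received.append(local_r3 + proc_in.fifo.popleft())  # fadd r5, r3, r24
    route(link, proc_in)    # receiving switch: route W->P
    route(proc_out, link)   # sending switch: route P->E
    if to_send and proc_out.can_push():
        proc_out.fifo.append(to_send.popleft())              # fmul r24, r3, r4

print(received)
```

Because each hop is a FIFO and a route stalls rather than drops, operands arrive in the order they were produced.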
28
RawCC: Distributed ILP Compilation (DILP)
tmp0 = (seed*3+2)/2
tmp1 = seed*v1+2
tmp2 = seed*v2 + 2
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
v3 = tmp3 - v2
pval5=seed.0*6.0
pval4=pval5+2.0
tmp3.6=pval4/3.0
tmp3=tmp3.6
v3.10=tmp3.6-v2.7
v3=v3.10
v2.4=v2
pval3=seed.0*v2.4
tmp2.5=pval3+2.0
tmp2=tmp2.5
pval6=tmp1.3-tmp2.5
v2.7=pval6*5.0
v2=v2.7
seed.0=seed
pval1=seed.0*3.0
pval0=pval1+2.0
tmp0.1=pval0/2.0
tmp0=tmp0.1
v1.2=v1
pval2=seed.0*v1.2
tmp1.3=pval2+2.0
tmp1=tmp1.3
pval7=tmp1.3+tmp2.5
v1.8=pval7*3.0
v1=v1.8
v0.9=tmp0.1-v1.8
v0=v0.9
Black arrows = Operand Communication over SON
[ASPLOS 1998]
Partitioning
Place, Route, Schedule
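To make the partitioning step concrete, the sketch below extracts the true dependences among the eight source statements on this slide and counts how many operands would have to cross between tiles for a given placement. The helper names and the placements are hypothetical, and this is not RawCC's partitioning algorithm, only a way to see what it is trying to minimize.

```python
import re

SOURCE = """\
tmp0 = (seed*3+2)/2
tmp1 = seed*v1+2
tmp2 = seed*v2 + 2
tmp3 = (seed*6+2)/3
v2 = (tmp1 - tmp3)*5
v1 = (tmp1 + tmp2)*3
v0 = tmp0 - v1
v3 = tmp3 - v2
"""

def dependences(source):
    """Return (producer, consumer) statement-index pairs for true dependences."""
    last_writer = {}
    edges = []
    for i, line in enumerate(source.splitlines()):
        target, expr = (part.strip() for part in line.split("=", 1))
        for name in sorted(set(re.findall(r"[A-Za-z_]\w*", expr))):
            if name in last_writer:
                edges.append((last_writer[name], i))
        last_writer[target] = i
    return edges

def cross_tile_operands(edges, placement):
    """Count dependences whose producer and consumer sit on different tiles."""
    return sum(1 for src, dst in edges if placement[src] != placement[dst])

edges = dependences(SOURCE)

# placement[i] is the tile assigned to statement i (hypothetical choices).
round_robin = [0, 1, 0, 1, 0, 1, 0, 1]
clustered = [0, 0, 0, 1, 1, 0, 0, 1]

print(len(edges), cross_tile_operands(edges, round_robin),
      cross_tile_operands(edges, clustered))
```

Each cross-tile dependence corresponds to one operand sent over the scalar operand network, so a placement that clusters dependent statements needs fewer routes.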
29
Virtual reality
Simulator reality
Prototype reality
Product reality
30
A Tiled Processor Architecture Prototype: the Raw Microprocessor
October 02
Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Rajeev Barua, Elliot Waingold, Jonathan Babb, Sri Devabhaktuni, Saman Amarasinghe, Anant Agarwal
31
Raw Die Photo
IBM 0.18 micron process, 16 tiles, 425 MHz, 18 Watts (vpenta)
[ISCA 2004]
32
Raw Motherboard
33
Raw Ideas and Decisions: What Worked, What Did Not
• Build a complete prototype system
• Simple processor with single issue cores
• FPGA logic block in each tile
• Distributed ILP and static network
• Static network for streaming
• Multiple types of computation: ILP, streams, TLP, server
• PC in every tile
34
Why Build?
• Compiler (Amarasinghe), OS and runtimes (ISI), apps (ISI, Lincoln Labs, Durand) folks will not work with you unless you are serious about building hardware
• Need motivation to build software tools (compilers, runtimes, debugging, visualization); many challenges here
• Run large data sets (simulation takes forever even with 100 servers!)
• Many hard problems show up or are better understood after you begin building (how to maintain ordering for distributed ILP, slack for streaming codes)
• Have to solve hard problems: no magic!
• The more radical the idea, the more important it is to build
  – World will only trust end-to-end results since it is too hard to dive into details and understand all assumptions
  – Would you believe this: “Prof. John Bull has demonstrated a simulation prototype of a 64-way issue out-of-order superscalar”
• Cycle simulator became cycle accurate simulator only after HW got precisely defined
• Don’t bother to commercialize unless you have a working prototype
• Total network power few percent for real apps [Aug 2003 ISLPED, Kim et al. Energy characterization of a tiled architecture processor with on-chip networks] [MIT-CSAIL-TR-2008-066 Energy scalability of on-chip interconnection networks in multicore architectures]
  – Network power is a few percent in Raw for real apps; it reaches 36% only for a highly contrived synthetic sequence meant to toggle every network wire
35
Raw Ideas and Decisions: What Worked, What Did Not
• Build a complete prototype system
• Simple processor, single issue
• FPGA logic block in each tile
• Distributed ILP
• Static network for streaming
• Multiple types of computation: ILP, streams, TLP, server
• PC in every tile
Yes
1GHz, 2-way, in-order in 2016
No
Yes ‘02, No ‘06, Yes ‘14
Yes
Yes
36
Raw Ideas and Decisions: Streaming – Interconnect Support
software-controlled crossbar: route P->E, N->S
software-controlled crossbar: route W->P, S->N
Forced synchronization in static network
37
Streaming in Tilera’s Tile Processor
add r55, r3, r4
Dynamic Switch
Dynamic Switch
sub r5, r3, r55
[Figure labels: TAG, Catch all]
• Streaming done over dynamic interconnect with stream demuxing (AsTrO SDS)
• Automatic demultiplexing of streams into registers
• Number of streams is virtualized
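One way to picture the tag-based demultiplexing and the "catch all" path is sketched below. The `StreamDemux` class, its method names, and the queue counts are hypothetical; the slide does not describe the hardware at this level, so treat the mechanism as an assumption: a fixed number of dedicated queues matched by tag, with unmatched tags falling through to a catch-all queue that software could drain, which is how the number of streams might be virtualized.

```python
from collections import deque

class StreamDemux:
    """Tag-matching demultiplexer with a fixed number of dedicated queues."""

    def __init__(self, hardware_queues):
        self.queues = [deque() for _ in range(hardware_queues)]
        self.tag_to_queue = {}   # tag -> index of a dedicated queue
        self.catch_all = deque() # words whose tag has no dedicated queue

    def bind(self, tag):
        """Bind a tag to a free dedicated queue; return False if none is left."""
        if len(self.tag_to_queue) >= len(self.queues):
            return False
        self.tag_to_queue[tag] = len(self.tag_to_queue)
        return True

    def deliver(self, tag, word):
        """Steer an arriving word by tag, preserving per-stream order."""
        if tag in self.tag_to_queue:
            self.queues[self.tag_to_queue[tag]].append(word)
        else:
            self.catch_all.append((tag, word))

demux = StreamDemux(hardware_queues=2)
bound = [demux.bind(tag) for tag in ("video", "audio", "control")]

# Words from three streams arrive interleaved on the dynamic network.
for tag, word in [("video", 1), ("audio", 10), ("control", 100),
                  ("video", 2), ("audio", 20)]:
    demux.deliver(tag, word)

print(bound, list(demux.queues[0]), list(demux.queues[1]), list(demux.catch_all))
```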
38
Virtual reality
Simulator reality
Prototype reality
Product reality
39
Why Do We Care? Markets Demanding More Performance
Wireless Networks
- Demand for high throughput: more channels
- Fast moving standards: LTE, services
- Examples: GGSN, Base Station
Networking market
- Demand for high performance: 10Gbps
- Demand for more services, intelligence
- Examples: Switches, Security Appliances, Routers
Digital Multimedia market
- Demand for high performance: H.264 HD
- Demand for more services: VoD, transcode
- Examples: Cable & Broadcast, Video Conferencing
… and with power efficiency and programming ease
40
Tilera’s TILEPro64™ Processor
Multicore Performance (90nm)
  Number of tiles: 64
  Cache-coherent distributed cache: 5 MB
  Operations @ 750MHz (32, 16, 8 bit): 144-192-384 BOPS
  Bisection bandwidth: 2 Terabits per second
Power Efficiency
  Power per tile (depending on app): 170 – 300 mW
  Core power for h.264 encode (64 tiles): 12W
  Clock speed: Up to 866 MHz
I/O and Memory Bandwidth
  I/O bandwidth: 40 Gbps
  Main Memory bandwidth: 200 Gbps
Programming
  ANSI standard C
  SMP Linux programming
  Stream programming
Product reality
[Tile64, Hotchips 2007] [Tile64, Microprocessor Report Nov 2007]
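The headline numbers on this slide can be sanity-checked with simple arithmetic. The per-tile operation counts below (3, 4, and 8 per cycle for 32-, 16-, and 8-bit data) are an assumption chosen because they reproduce the quoted 144-192-384 BOPS; the deck elsewhere describes a 3-way VLIW CPU with subword arithmetic, which is consistent with the 32-bit figure, but the slide does not give this breakdown.

```python
TILES = 64
CLOCK_HZ = 750e6

# Assumed operations per tile per cycle, keyed by operand width in bits.
ops_per_tile_per_cycle = {32: 3, 16: 4, 8: 8}

# Billions of operations per second across the whole chip.
bops = {
    width: TILES * ops * CLOCK_HZ / 1e9
    for width, ops in ops_per_tile_per_cycle.items()
}

# Chip-level range implied by the quoted 170 to 300 mW per tile.
tile_power_w = (0.170, 0.300)
chip_power_w = tuple(TILES * p for p in tile_power_w)

print(bops, chip_power_w)
```

The quoted 12 W core power for H.264 encode falls inside the resulting range of roughly 10.9 W to 19.2 W.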
41
Tile Processor Block Diagram
A Complete System on a Chip
[Figure: chip block diagram. On-chip I/O: PCIe 0 and PCIe 1 (MAC, PHY) with SerDes; XAUI 0 and XAUI 1 (MAC, PHY) with SerDes; GbE 0 and GbE 1; Flexible I/O; UART, HPI, JTAG, I2C, SPI; DDR2 Memory Controllers 0 to 3. Each tile: PROCESSOR (P0, P1, P2, Reg File), CACHE (L2 CACHE, L1I, L1D, ITLB, DTLB, 2D DMA), SWITCH (STN, MDN, TDN, UDN, IDN).]
42
Tile Processor NoC
• 5 independent non-blocking networks
  – 64 switches per network
  – 1 Terabit/sec per Tile
• Each network switch directly and independently connected to tiles
• One hop per clock on all networks
• I/O write example
• Memory write example
• Tile to Tile access example
• All accesses can be performed simultaneously on non-blocking networks
[Figure: layered networks over the tiles, labeled UDN, STN, IDN, MDN, TDN, VDN]
[IEEE Micro Sep 2007]
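With one hop per clock, the network latency between two tiles is set by the hop count. The sketch below assumes the 64 switches form an 8 x 8 mesh and that packets use dimension-ordered (X then Y) routing; neither assumption is stated on the slide, and the function names are hypothetical.

```python
MESH_SIDE = 8  # assumed: 64 switches arranged as an 8 x 8 mesh

def xy_route(src, dst):
    """Return the list of (x, y) switches visited, routing X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

def hops(src, dst):
    """Hop count, which equals clock cycles at one hop per clock."""
    return len(xy_route(src, dst)) - 1

corner_to_corner = hops((0, 0), (MESH_SIDE - 1, MESH_SIDE - 1))
print(corner_to_corner, hops((2, 5), (6, 1)), xy_route((0, 0), (2, 1)))
```

Under these assumptions the worst case is 14 hops, corner to corner, which is why locality of placement matters as the mesh grows.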
43
Multicore Hardwall Implementation
Or Protection and Interconnects
[Figure: tiles partitioned among OS1/APP1, OS2/APP2, and OS1/APP3; switch-to-switch links carry data and valid signals, with a HARDWALL_ENABLE control]
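The figure labels suggest a per-link control that isolates one partition's traffic from another. A very rough behavioral sketch, assuming the hardwall simply suppresses the valid signal on a link when enabled (the real mechanism may differ, for example by raising an exception to system software), could look like this. The function name and return convention are hypothetical.

```python
def hardwall_link(data, valid, hardwall_enable):
    """Return the (data, valid) pair seen on the far side of a switch link.

    Assumption: when the hardwall is enabled on this link, the far side
    never sees valid asserted, so no word crosses the partition boundary.
    """
    if hardwall_enable:
        return (None, False)
    return (data, valid)

inside_partition = hardwall_link(0xCAFE, True, hardwall_enable=False)
across_boundary = hardwall_link(0xCAFE, True, hardwall_enable=True)
print(inside_partition, across_boundary)
```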
44
Product Reality Differences
• Market forces
  – Need crisper answer to “who cares”
  – SMP Linux programming with pthreads, fully cache coherent
  – C + API approach to streaming vs new language StreamIt in Raw
  – Special instructions for video, networking
  – Floating point needed in research project, but not in product for embedded market
• Lessons from Raw
  – E.g., dynamic network for streams
  – HW instruction cache
  – Protected interconnects
• More substantial engineering
  – 3-way VLIW CPU, subword arithmetic
  – Engineering for clock speed and power efficiency
  – Completeness: I/O interfaces on chip, a complete system chip. Just add DRAM for system
  – Support for virtual memory, 2D DMA
  – Runs SMP Linux (can run multiple OSes simultaneously)
45
Virtual reality
Simulator reality
Prototype reality
Product reality
46
What Does the Future Look Like?
Corollary of Moore’s law: Number of cores will double every 18 months

            ‘02    ‘05    ‘08    ‘11    ‘14
Research     16     64    256   1024   4096
Industry      4     16     64    256   1024

(Cores minimally big enough to run a self respecting OS!)
1K cores by 2014! Are we ready?
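The projection follows directly from the stated doubling period: three years is two doublings, so each column is four times the previous one. A small sketch, with a hypothetical function name, reproduces both rows from their 2002 starting points.

```python
def projected_cores(start_cores, start_year, year, doubling_months=18):
    """Cores in `year` if the count doubles every `doubling_months` months."""
    doublings = (year - start_year) * 12 / doubling_months
    return round(start_cores * 2 ** doublings)

years = [2002, 2005, 2008, 2011, 2014]
research = [projected_cores(16, 2002, y) for y in years]
industry = [projected_cores(4, 2002, y) for y in years]
print(research, industry)
```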
47
Vision for the Future
• The ‘core’ is the logic gate of the 21st century
[Figure: a chip filled with a large grid of p/m/s tiles]
48
Research Challenges for 1K Cores
• 4-16 cores not interesting. Industry is there. University must focus on “1K cores”; Everything will change!
• Can we use 4 cores to get 2X through DILP? Remember cores will be 1GHz and simple! What is the interconnect?
• How should we program 1K cores? Can interconnect help with programming?
• Locality and reliability WILL matter for 1K cores. Spatial view of multicore?
• Can we add architectural support for programming ease? E.g., suppose I told you cores are free. Can you discover mechanisms to make programming easier?
• What is the right grain size for a core?
• How must our computational models change in the face of small memories per core?
• How to “feed the beast”? I/O and external memory bandwidth
• Can we assume perfect reliability any longer?
49
ATAC Architecture
[Figure: 16 cores, each with a processor (p), a switch, and a memory (m)]
Optical Broadcast WDM Interconnect
Electrical Mesh Interconnect (EMesh)
[Proc. BARC Jan 2007, MIT-CSAIL-TR-2009-018 ]
51
FOS – Factored Operating System
• Today: User app and OS kernel thrash each other in a core’s cache
• User/OS time sharing is inefficient
• Angstrom: OS assumes abstracted space model. OS services bound to distinct cores, separate from user cores. OS service cores collaborate to achieve best resource management
• User/OS space sharing is efficient
The key idea: space sharing replaces time sharing
OS cores collaborate, inspired by distributed internet services model
[Figure: a grid of cores with labels OS, File System, FS, User App, I/O, and a "Need new page" request]
[OS Review 2008]
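The space-sharing idea can be sketched as OS services that live on their own cores and are reached only by messages, so the user core never runs kernel code. The `ServiceCore` class, the page allocator, the core numbers, and the message format below are all hypothetical illustrations and not fos's actual interfaces.

```python
from collections import deque
from itertools import count

class ServiceCore:
    """An OS service pinned to its own core, driven only by messages."""

    def __init__(self, core_id, handler):
        self.core_id = core_id
        self.handler = handler
        self.inbox = deque()

    def send(self, sender_core, request):
        """Called by other cores: enqueue a request instead of trapping."""
        self.inbox.append((sender_core, request))

    def run(self):
        """Drain the inbox and return (destination core, reply) pairs."""
        replies = []
        while self.inbox:
            sender_core, request = self.inbox.popleft()
            replies.append((sender_core, self.handler(request)))
        return replies

def make_page_allocator(first_page=100):
    """Build a handler that hands out page numbers in increasing order."""
    next_page = count(start=first_page)

    def handle(request):
        if request == "need new page":
            return ("page", next(next_page))
        return ("error", request)

    return handle

# Core 5 runs the page service; user apps on cores 0 and 1 message it.
page_service = ServiceCore(core_id=5, handler=make_page_allocator())
page_service.send(sender_core=0, request="need new page")
page_service.send(sender_core=1, request="need new page")
replies = page_service.run()
print(replies)
```

In this picture the service's working set stays in its own core's cache, which is the contrast the slide draws with time sharing.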
53
The following are trademarks of Tilera Corporation: Tilera, the Tilera Logo, Tile Processor, TILE64, Embedding Multicore, Multicore Development Environment, Gentle Slope Programming, iLib, iMesh and Multicore Hardwall. All other trademarks and/or registered trademarks are the property of their respective owners.