Date post: 18-Dec-2015
Page 1: CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips. Ciprian Seiculescu, Stavros Volos, Naser Khosro Pour, Babak Falsafi, Giovanni De Micheli.

CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips

Ciprian Seiculescu, Stavros Volos, Naser Khosro Pour, Babak Falsafi, Giovanni De Micheli

LSI Integrated Systems Laboratory

Page 2

NoCs Major Power Consumer

Move towards manycore
• Tiled architectures

Network-on-Chip (NoC)
• Significant power consumer
• 40% MIT RAW
• 30% Intel Tera-scale

Cache coherent CMP
• Server workloads

[Figure: tiled CMP with 16 tiles labeled C$; tile detail labeled Core, $, and Crossbar]

Page 3

Proposals to Reduce NoC Power

Multiple networks
• Better area and power [Balfour & Dally ICS 2006]

Commercial server workloads
• Traffic patterns are different

Run on cache coherent CMPs
• Strong relation between coherence protocol and NoC

Not optimized for commercial server workload traffic

Page 4

Contributions

Commercial server workloads
• Optimized for reuse in L1, little sharing
• Full-blown coherence protocol in CMPs
• Only some transitions are frequent

Duality in Request/Response message size

CCNoC
• Full advantage of heterogeneity
• Same number of buffers
• 16% less power, same performance as Mesh

Page 5

Outline

Overview

Why CCNoC?

Dual-router design

Evaluation

Conclusions

Page 6

Dual Router is More Efficient

Dual router
• Two crossbars per routing node

Wires less expensive on-chip
• Use more wires for better performance

Area and power grow faster than connectivity
• Balfour & Dally ICS 2006
• Dual router: better performance, power and area

[Figure: one N-bit-wide router compared with two N/2-bit-wide routers]
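The dual-router argument can be illustrated with a rough calculation. The sketch below assumes that crossbar cost (area and, to first order, power) grows with the square of ports times channel width. That quadratic model is an assumption made here for illustration, not a figure from the slides, and the function names and the 5-port, 128-bit example are hypothetical.

```python
def crossbar_cost(ports: int, width_bits: int) -> int:
    """First-order crossbar cost, ASSUMED proportional to (ports * width)^2."""
    return (ports * width_bits) ** 2


def dual_router_ratio(ports: int, width_bits: int) -> float:
    """Cost of two half-width crossbars relative to one full-width crossbar."""
    single = crossbar_cost(ports, width_bits)
    dual = 2 * crossbar_cost(ports, width_bits // 2)
    return dual / single


if __name__ == "__main__":
    # Hypothetical 5-port mesh router with a 128-bit datapath.
    print(dual_router_ratio(5, 128))
```

Under this model, two N/2-bit crossbars together cost half as much as one N-bit crossbar while carrying the same total number of wires.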

Page 7

Right Dual Router Design

Avoid protocol level deadlock
• Separate
  - Requests
  - Responses
• Use Virtual Channels

CCNoC
• Sub-networks
  - Request / Response
• No VCs needed
• Same number of buffers

Buffers are power hungry

[Figure: MIT RAW power breakdown into Buffers and Crossbar + Links. Source: H.S. Wang & L.S. Peh, MICRO 2003]

Page 8

Protocol Activity

CMPs implement full-blown coherence protocol

• Some transitions are frequent [Hardavellas ISCA 2009]
  - Read clean block
  - Evict clean block
  - Write to unshared block

• Other transitions needed for correctness (infrequent)
  - Read dirty block
  - Evict dirty block
  - Write to shared block

Page 9

Frequent Read Protocol Activity

[Figure: message sequence among Reader, Directory, and Writer. Messages: Read Req, Read Resp, Evict Clean Req. Legend as extracted: Short Req, Short Req, Short Resp, Long Resp]

Page 10

Frequent Write Protocol Activity

[Figure: message sequence between Writer and Directory. Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp. Legend as extracted: Short Req, Short Req, Short Resp, Long Resp]

Page 11

Infrequent Read Protocol Activity

[Figure: message sequence among Reader, Directory, and Writer. Messages: Read Req, Downgrade Req, Downgrade Resp, Read Resp. Legend as extracted: Short Req, Short Req, Short Resp, Long Resp]

Page 12

Infrequent Write Protocol Activity

[Figure: message sequence among Writer, Directory, Reader 1, and Reader 2. Messages: Fetch/Upgrade Req, Inv Req (two), Inv Resp (two), Fetch Resp, Upgrade Resp, Evict Dirty Req. Legend as extracted: Short Req, Short Req, Short Resp, Long Resp]
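The protocol slides distinguish short messages from long ones. A minimal sketch of that split follows, assuming a message is long exactly when it carries a cache block. Which messages carry data is an assumption made here for illustration (the slides only name the messages), and the Python names are hypothetical.

```python
# True means the message is ASSUMED to carry a cache block (a "long" message).
# These flags are illustrative assumptions, not taken from the slides.
CARRIES_DATA = {
    "Read Req": False,
    "Read Resp": True,
    "Evict Clean Req": False,
    "Evict Dirty Req": True,
    "Fetch/Upgrade Req": False,
    "Fetch Resp": True,
    "Upgrade Resp": False,
    "Downgrade Req": False,
    "Downgrade Resp": True,
    "Inv Req": False,
    "Inv Resp": False,
}


def is_long(message: str) -> bool:
    """Return True if the message is assumed to carry a cache block."""
    return CARRIES_DATA[message]


def long_requests() -> list:
    """Requests that are long under the assumptions above."""
    return sorted(m for m in CARRIES_DATA if m.endswith("Req") and is_long(m))


def short_responses() -> list:
    """Responses that are short under the assumptions above."""
    return sorted(m for m in CARRIES_DATA if m.endswith("Resp") and not is_long(m))
```

Under these assumptions the only long request is the dirty eviction, which the Protocol Activity slide lists among the infrequent transitions, so request traffic would be expected to consist mostly of short messages.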

Page 13

Traffic Analysis

[Figure: traffic distribution (0% to 100%) per workload, split into Short Req, Long Req, Short Resp, and Long Resp. Bar labels: DB2, ORACLE, DB2, MIX, APACHE, ZEUS, EM3D, SPEC2K. Groups: OLTP, DSS, WEB, SCI, MIX]

Request: 93% short
Response: 86% long

Page 14

CCNoC Router

Request network narrow: optimized for short messages
Response network wide: optimized for long messages

[Figure: CCNoC router; labels: Request Switch, Response Switch, NI, Router]
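How a network interface might steer traffic onto the two sub-networks can be sketched as a toy model. The class, its queues, and the rule of reading the message class from the "Req"/"Resp" suffix are all assumptions made for illustration and do not describe the actual CCNoC hardware.

```python
from collections import deque


class NetworkInterface:
    """Toy NI that steers each message to one of two sub-networks."""

    def __init__(self):
        self.request_queue = deque()   # feeds the narrow request switch
        self.response_queue = deque()  # feeds the wide response switch

    def inject(self, message: str) -> str:
        """Queue a message by class; the class is read from the name suffix."""
        if message.endswith("Req"):
            self.request_queue.append(message)
            return "request"
        if message.endswith("Resp"):
            self.response_queue.append(message)
            return "response"
        raise ValueError(f"unknown message class: {message}")
```

Because requests and responses never share a queue in this model, a blocked request cannot hold up a response, which is the property the slides rely on to avoid protocol-level deadlock without virtual channels.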

Page 15

Previous Work

Balfour et al. ICS 2006
• Better than single large router
• Read/Write traffic
• Same number of reads and writes

Yoon et al. DAC 2010
• Physical channel better than virtual channel

Not optimized for cache coherent CMP
• Running commercial server workloads

Page 16

Outline

Overview

Why CCNoC?

Dual-router design

Evaluation

Conclusions

Page 17

Evaluation Methodology

FLEXUS
• Full system simulation
• 16 or 8 UltraSPARC III ISA cores
• Split I/D, 64KB L1
• 1 or 2 MB L2

ORION 2.0
• Power estimation
• Area estimation

Workloads
• OLTP: TPC-C
  - IBM DB2 and Oracle
• DSS: TPC-H
  - IBM DB2
  - Q1, Q6, Q13, Q16
• Web: SPECweb99
  - Apache and Zeus
• Scientific: EM3D
• Multiprogrammed:
  - SPEC2K
  - 2x: gcc, twolf, art, mcf

Page 18

Evaluation NoCs

Mesh-128 - baseline
• 128 bit flit width

Torus - reference
• 128 bit flit width

Mesh-176 - high performance
• 176 bit flit width

CCNoC
• Request: 48 bit flit width
• Response: 128 bit flit width

Switches
• Wormhole flow control
• Input queued
• Transmission protocol
  - On/Off
• Input buffers
  - 2 entry

Page 19

Performance

[Figure: IPC normalized to Torus (axis 0 to 1.2) for Mesh-128, Mesh-176, and CCNoC. Bar labels: DB2, ORACLE, DB2, MIX, APACHE, ZEUS, EM3D, SPEC2K. Groups: OLTP, DSS, WEB, SCI, MIX]

CCNoC performance loss: 2% vs. Torus, 8% vs. Mesh-176

Page 20

Power Savings

CCNoC power savings: 16% vs. Mesh-128, 22% vs. Torus, 38% vs. Mesh-176

[Figure: normalized total power (axis 0 to 1.4) for Torus, Mesh-128, Mesh-176, and CCNoC. Bar labels: DB2, ORACLE, DB2, MIX, APACHE, ZEUS, EM3D, SPEC2K. Groups: OLTP, DSS, WEB, SCI, MIX]
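The performance and power figures can be combined into a rough performance-per-watt comparison. The sketch below uses only the percentages reported on the Performance and Power Savings slides and assumes each is measured relative to the named baseline. The resulting ratios are derived here and do not appear on the slides, and they relate system IPC to the reported power only.

```python
# CCNoC relative to each baseline, as reported on the slides.
POWER_SAVING = {"Mesh-128": 0.16, "Torus": 0.22, "Mesh-176": 0.38}
PERF_LOSS = {"Torus": 0.02, "Mesh-176": 0.08}


def relative_perf_per_watt(baseline: str) -> float:
    """CCNoC performance per watt divided by that of the baseline."""
    perf = 1.0 - PERF_LOSS[baseline]      # CCNoC IPC relative to baseline
    power = 1.0 - POWER_SAVING[baseline]  # CCNoC power relative to baseline
    return perf / power


if __name__ == "__main__":
    for name in PERF_LOSS:
        print(f"{name}: {relative_perf_per_watt(name):.2f}x")
```

Mesh-128 is left out of the ratio because the Performance slide gives no loss figure for it.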

Page 21

Conclusions

Duality in Request/Response traffic
• Request: dominated by short messages
• Response: dominated by long messages

Proposed CCNoC
• Narrow request network
• Wide response network

Showed significant power savings
• 22% against Torus
• 38% against Mesh-176

Page 22

Thank you!

Q&A

