
Post on 18-Dec-2015


CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips

Ciprian Seiculescu

Stavros Volos

Naser Khosro Pour

Babak Falsafi

Giovanni De Micheli

LSI Integrated Systems Laboratory

NoCs Major Power Consumer

Move towards manycore
• Tiled architectures

Network-on-Chip (NoC)
• Significant power consumer
• 40% MIT RAW
• 30% Intel Tera-scale

Cache coherent CMP
• Server workloads

[Figure: tiled chip with 16 tiles, each labeled C$. Detail labels: Core, $, Crossbar]

Proposals to Reduce NoC Power

Multiple networks
• Better area and power [Balfour & Dally ICS 2006]

Commercial server workloads
• Traffic patterns are different

Run on cache coherent CMPs
• Strong relation between coherence protocol and NoC

Not optimized for commercial server workload traffic

Contributions

Commercial server workloads
• Optimized for reuse in L1, little sharing
• Full-blown coherence protocol in CMPs
• Only some transitions are frequent

Duality in Request/Response message size

CCNoC
• Full advantage of heterogeneity
• Same number of buffers
• 16% less power, same performance as Mesh

Outline

Overview

Why CCNoC?

Dual-router design

Evaluation

Conclusions

Dual Router is More Efficient

Dual router
• Two crossbars per routing node

Wires less expensive on-chip
• Use more wires for better performance

Area and power grow faster than connectivity
• Balfour & Dally ICS 2006
• Dual router: better performance, power and area

[Figure: one N-bit-wide router compared with two N/2-bit-wide routers]
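The claim that area and power grow faster than connectivity can be illustrated with a minimal sketch. It assumes a first-order model in which crossbar cost scales with the square of (ports x width); that model, the function names, and the 5-port, 128-bit example are illustrative assumptions, not figures from the slides.

```python
def crossbar_cost(ports: int, width_bits: int) -> int:
    """Relative crossbar cost under an ASSUMED quadratic model: (ports * width)^2."""
    return (ports * width_bits) ** 2


def dual_to_single_ratio(ports: int, width_bits: int) -> float:
    """Cost of two half-width crossbars divided by the cost of one full-width crossbar."""
    single = crossbar_cost(ports, width_bits)
    dual = 2 * crossbar_cost(ports, width_bits // 2)
    return dual / single


# Hypothetical example: a 5-port mesh router with a 128-bit datapath.
ratio = dual_to_single_ratio(ports=5, width_bits=128)
print(f"two N/2 crossbars cost {ratio:.2f}x one N-bit crossbar")
```

Under this assumed model, the two narrower crossbars together carry the same number of wires as the wide one while the crossbar cost is halved.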

Right Dual Router Design

Avoid protocol-level deadlock
• Separate
- Requests
- Responses
• Use Virtual Channels

CCNoC
• Sub-networks
- Request / Response
• No VCs needed
• Same number of buffers
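The "same number of buffers" point can be checked with a small count. This is a sketch under stated assumptions: a 5-port mesh router, two virtual channels per port in the single-network design, and 2-entry buffers everywhere. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RouterConfig:
    ports: int           # input ports per router (ASSUMED 5 for a mesh)
    vcs_per_port: int    # virtual channels per input port
    entries_per_vc: int  # buffer entries per virtual channel

    def buffer_entries(self) -> int:
        """Total number of input buffer entries in one router."""
        return self.ports * self.vcs_per_port * self.entries_per_vc


# Single network: request and response classes share links via two VCs.
vc_router = RouterConfig(ports=5, vcs_per_port=2, entries_per_vc=2)

# Two sub-networks: one router each, no VCs needed within a sub-network.
request_router = RouterConfig(ports=5, vcs_per_port=1, entries_per_vc=2)
response_router = RouterConfig(ports=5, vcs_per_port=1, entries_per_vc=2)

dual_entries = request_router.buffer_entries() + response_router.buffer_entries()
print(vc_router.buffer_entries(), dual_entries)
```

The entry count matches in both designs; the entries in a narrower sub-network hold fewer bits each, which this count does not capture.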

Buffers are power hungry

[Chart: MIT RAW power breakdown. Legend: Buffers, Crossbar + Links]

Source: H.S. Wang & L.S. Peh, MICRO 2003

Protocol Activity

CMPs implement full-blown coherence protocol

• Some transitions are frequent [Hardavellas ISCA 2009]
- Read clean block
- Evict clean block
- Write to unshared block

• Other transitions needed for correctness (infrequent)
- Read dirty block
- Evict dirty block
- Write to shared block

Frequent Read Protocol Activity

[Diagram: message sequence between Reader, Directory, and Writer. Messages: Read Req, Read Resp, Evict Clean Req. Legend: Short Req, Short Resp, Long Resp]

Frequent Write Protocol Activity

[Diagram: message sequence between Writer and Directory. Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp. Legend: Short Req, Short Resp, Long Resp]

Infrequent Read Protocol Activity

[Diagram: message sequence between Reader, Directory, and Writer. Messages: Read Req, Read Resp, Downgrade Req, Downgrade Resp. Legend: Short Req, Short Resp, Long Resp]

Infrequent Write Protocol Activity

[Diagram: message sequence between Writer, Directory, Reader 1, and Reader 2. Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp, Inv Req (x2), Inv Resp (x2), Evict Dirty Req. Legend: Short Req, Short Resp, Long Resp]
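The message names used in the protocol activity diagrams fall into two classes, and the request/response separation can be sketched as a simple classifier. The function name and the name-suffix rule are illustrative assumptions; an actual network interface would presumably use a message-type field rather than a string.

```python
# Message names as they appear in the protocol activity diagrams.
MESSAGES = [
    "Read Req", "Read Resp", "Evict Clean Req",
    "Fetch/Upgrade Req", "Fetch Resp", "Upgrade Resp",
    "Downgrade Req", "Downgrade Resp",
    "Inv Req", "Inv Resp", "Evict Dirty Req",
]


def subnetwork(message: str) -> str:
    """Pick the sub-network for a message based on its class (hypothetical rule)."""
    if message.endswith("Req"):
        return "request"
    if message.endswith("Resp"):
        return "response"
    raise ValueError(f"unknown message class: {message}")


for name in MESSAGES:
    print(f"{name:18s} -> {subnetwork(name)} network")
```

Because a response never waits behind a request on the same physical network, the two classes cannot block each other, which is the protocol-level deadlock that separation is meant to avoid.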

Traffic Analysis

[Chart: Traffic Distribution (0% to 100%) per workload: DB2, ORACLE (OLTP); DB2 MIX (DSS); APACHE, ZEUS (WEB); EM3D (SCI); SPEC2K (MIX). Legend: Long Resp, Short Resp, Long Req, Short Req]

Request: 93% short

Response: 86% long

CCNoC Router

Request network narrow: optimized for short messages

Response network wide: optimized for long messages

[Figure: CCNoC router with a Request Switch and a Response Switch, connected to the network interface (NI)]
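To see why a narrow request network and a wide response network fit the measured traffic mix (93% of requests short, 86% of responses long), the sketch below counts flits per message. The message sizes are assumptions chosen for illustration: a 48-bit short message and a long message carrying a 64-byte cache block plus the same 48 bits of header. Only the flit widths and the traffic percentages come from the slides.

```python
SHORT_BITS = 48             # ASSUMED size of a control-only message
LONG_BITS = 48 + 64 * 8     # ASSUMED: header plus a 64-byte cache block


def flits(message_bits: int, flit_width: int) -> int:
    """Number of flits needed to carry a message (ceiling division)."""
    return -(-message_bits // flit_width)


def mean_flits(short_fraction: float, flit_width: int) -> float:
    """Average flits per message for a given share of short messages."""
    long_fraction = 1.0 - short_fraction
    return (short_fraction * flits(SHORT_BITS, flit_width)
            + long_fraction * flits(LONG_BITS, flit_width))


# Request traffic is 93% short; response traffic is 86% long (14% short).
request_avg = mean_flits(short_fraction=0.93, flit_width=48)
response_avg = mean_flits(short_fraction=0.14, flit_width=128)
print(f"request network:  {request_avg:.2f} flits per message on average")
print(f"response network: {response_avg:.2f} flits per message on average")
```

Under these assumed sizes, most requests fit in a single 48-bit flit, so widening the request network would mostly add unused wires and crossbar area.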

Previous Work

Balfour et al. ICS 2006
• Better than single large router
• Read/Write traffic
• Same number of reads and writes

Yoon et al. DAC 2010
• Physical channel better than virtual channel

Not optimized for cache coherent CMP
• Running commercial server workloads

Outline

Overview

Why CCNoC?

Dual-router design

Evaluation

Conclusions

Evaluation Methodology

FLEXUS
• Full system simulation
• 16 or 8 UltraSPARC III ISA cores
• Split I/D, 64KB L1
• 1 or 2 MB L2

ORION 2.0
• Power estimation
• Area estimation

Workloads
• OLTP: TPC-C
- IBM DB2 and Oracle
• DSS: TPC-H
- IBM DB2
- Q1, Q6, Q13, Q16
• Web: SPECweb99
- Apache and Zeus
• Scientific: EM3D
• Multiprogrammed:
- SPEC2K
- 2x: gcc, twolf, art, mcf

Evaluation NoCs

Mesh-128 - baseline
• 128 bit flit width

Torus - reference
• 128 bit flit width

Mesh-176 - high performance
• 176 bit flit width

CCNoC
• Request: 48 bit flit width
• Response: 128 bit flit width

Switches
• Wormhole flow control
• Input queued
• Transmission protocol
- On/Off
• Input buffers
- 2 entry
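On/Off signaling with a 2-entry input buffer can be sketched as a small cycle loop. This is a simplified illustration, not the evaluated router: it assumes the ON/OFF signal reaches the sender with zero latency, and the class name, the drain rate, and the five-flit packet are hypothetical.

```python
from collections import deque


class OnOffInputBuffer:
    """Input buffer that signals ON while it has a free entry, OFF when full."""

    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self.flits = deque()

    @property
    def on(self) -> bool:
        return len(self.flits) < self.capacity

    def receive(self, flit) -> None:
        if not self.on:
            raise RuntimeError("sender ignored the OFF signal")
        self.flits.append(flit)

    def drain(self):
        """Forward the oldest flit downstream, or return None if empty."""
        return self.flits.popleft() if self.flits else None


def send_packet(flits, buffer, drain_every=2):
    """Send one flit per cycle while ON; the receiver drains every drain_every cycles."""
    pending, delivered = deque(flits), []
    cycle = stalls = 0
    while pending or buffer.flits:
        cycle += 1
        if cycle % drain_every == 0:
            flit = buffer.drain()
            if flit is not None:
                delivered.append(flit)
        if pending:
            if buffer.on:
                buffer.receive(pending.popleft())
            else:
                stalls += 1  # the sender holds the flit while the signal is OFF
    return delivered, stalls


delivered, stalls = send_packet(range(5), OnOffInputBuffer(capacity=2))
print(delivered, "stall cycles:", stalls)
```

With wormhole flow control the flits of a packet stay in order, and the sender simply pauses whenever the downstream buffer reports OFF.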

Performance

[Chart: Normalized IPC (to Torus), axis 0 to 1.2, per workload: DB2, ORACLE (OLTP); DB2 MIX (DSS); APACHE, ZEUS (WEB); EM3D (SCI); SPEC2K (MIX). Legend: Mesh-128, Mesh-176, CCNoC]

Performance loss: 2% vs. Torus, 8% vs. Mesh-176

Power Savings

Power savings: 16% vs. Mesh-128, 22% vs. Torus, 38% vs. Mesh-176

[Chart: Normalized Total Power, axis 0 to 1.4, per workload: DB2, ORACLE (OLTP); DB2 MIX (DSS); APACHE, ZEUS (WEB); EM3D (SCI); SPEC2K (MIX). Legend: Torus, Mesh-128, Mesh-176, CCNoC]
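The three savings figures on this slide imply a relative power level for each network. The sketch below works that out, assuming the percentages are averages that can be combined directly, so the results are approximate. The function name is hypothetical.

```python
# CCNoC power savings relative to each network, as stated on the slide.
SAVINGS = {"Mesh-128": 0.16, "Torus": 0.22, "Mesh-176": 0.38}


def relative_power(savings: dict, reference: str = "Torus") -> dict:
    """Power of each network normalized to the reference network."""
    ccnoc = 1.0 - savings[reference]  # CCNoC power as a fraction of the reference
    levels = {"CCNoC": ccnoc}
    for name, saving in savings.items():
        # CCNoC = (1 - saving) * network, so network = CCNoC / (1 - saving).
        levels[name] = ccnoc / (1.0 - saving)
    return levels


for name, level in sorted(relative_power(SAVINGS).items(), key=lambda kv: kv[1]):
    print(f"{name:9s} {level:.2f}")
```

Read this way, Mesh-128 draws somewhat less power than Torus and Mesh-176 draws noticeably more, with CCNoC below all three.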

Conclusions

Duality in Request/Response traffic
• Request: dominated by short messages
• Response: dominated by long messages

Proposed CCNoC
• Narrow request network
• Wide response network

Showed significant power savings
• 22% against Torus
• 38% against Mesh-176

Thank you!

Q&A