Date post: 01-Jan-2016

Quantifying and Comparing the Impact of Wrong-Path Memory References in Multiple-CMP Systems

Ayse Yilmazer, University of Rhode Island
Resit Sendag, University of Rhode Island
Joshua J. Yi, Freescale Semiconductor, Inc.

Motivation

• Previous work on wrong-path (WP) effects in uniprocessors
  • Positive effects: prefetching
    • Up to 20% better performance for 181.mcf (SPECint 2000)
  • Negative effects: pollution
    • L1 and L2 cache pollution
    • Extra traffic
• It is important to simulate WP references, especially for some applications
• How about WP effects in multiple-CMP systems?

Outline

• Wrong-path effects in SMPs and multi-CMPs
• Simulation methodology
• Evaluation results
• Conclusion

Wrong-path effects in SMPs – 0 / 4

• Broadcast (snoop)- and directory-based SMP systems
• MSI, MOSI, MESI, MOESI cache coherence protocols
• The same issues as in uniprocessors apply
  • Pollution effect
  • Prefetching effect
  • Extra cache/memory traffic
• In contrast to uniprocessors, WP references also cause:
  • Extra coherence traffic: data, invalidations, write-backs, acknowledgements
  • Additional cache block state transitions

Wrong-path effects in SMPs – 1 / 4: Replacements

[Figure: four-step example with two processors. Cache block states at each step:]

1. Initial: P1 writes block A
   Processor 0: Block B in M (LRU)
   Processor 1: Block A, I -> M
2. P0 speculatively reads block A
   Processor 0: Block A, I -> S
   Processor 1: Block A, M -> O
3. Speculation resolves: mis-speculation!
   Processor 0: Block A in S
   Processor 1: Block A in O
4. P1 writes block A
   Processor 0: Block A, S -> I
   Processor 1: Block A, O -> M

• A speculatively replaces B
• A is a wrong-path block!

Wrong-path effects in SMPs – 2 / 4: Write-backs

[Figure: the same four-step example as on the Replacements slide (1 / 4), annotated with the write-backs it causes.]

• Write-back of the dirty copy of B
• Write-back of the dirty copy of A, only for MESI (or MSI), where block A goes M -> S

Wrong-path effects in SMPs – 3 / 4: Invalidations

[Figure: the same four-step example as on the Replacements slide (1 / 4), annotated with the invalidation it causes.]

• P1 loses its write privileges for block A
• P1 asks for permission to write and sends an invalidation
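
The four-step example used on the Replacements, Write-backs, and Invalidations slides can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the protocol used in the study: the `Cache` class, its one-block capacity, and the `read` and `write` helpers are hypothetical names introduced for illustration, and only the MOSI transitions shown on the slides are modeled.

```python
from enum import Enum


class State(Enum):
    M = "Modified"
    O = "Owned"
    S = "Shared"
    I = "Invalid"


class Cache:
    """Toy cache (hypothetical): tracks only the coherence state per block."""

    def __init__(self, name, capacity=1):
        self.name = name
        self.capacity = capacity
        self.state = {}  # block -> State; insertion order stands in for LRU

    def get(self, block):
        return self.state.get(block, State.I)

    def install(self, block, new_state, log):
        """Install a block, evicting the oldest block if the cache is full."""
        if block not in self.state and len(self.state) >= self.capacity:
            victim = next(iter(self.state))
            if self.state[victim] in (State.M, State.O):
                log.append(f"{self.name}: write back dirty {victim}")
            log.append(f"{self.name}: {block} replaces {victim}")
            del self.state[victim]
        self.state[block] = new_state


def write(writer, others, block, log):
    """Writer ends in M; every other valid copy is invalidated."""
    for c in others:
        if c.get(block) is not State.I:
            log.append(f"{c.name}: {block} invalidated")
            c.state.pop(block, None)
    writer.install(block, State.M, log)


def read(reader, others, block, log):
    """Reader ends in S; a remote M copy is downgraded to O (MOSI)."""
    for c in others:
        if c.get(block) is State.M:
            log.append(f"{c.name}: {block} downgraded M -> O")
            c.state[block] = State.O
    if reader.get(block) is State.I:
        reader.install(block, State.S, log)


log = []
p0, p1 = Cache("P0"), Cache("P1")
p0.state["B"] = State.M          # initial state: P0 holds dirty block B

write(p1, [p0], "A", log)        # step 1: P1 writes block A (I -> M)
read(p0, [p1], "A", log)         # step 2: P0 speculatively reads block A
# step 3: the mis-speculation is detected; cache states are not rolled back
write(p1, [p0], "A", log)        # step 4: P1 writes block A again

for event in log:
    print(event)
```

In this sketch the wrong-path read alone produces one downgrade, one write-back, one replacement, and one invalidation; without step 2, both writes by P1 would complete without any of these events.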

Wrong-path effects in SMPs – 4 / 4

• Data/bus and coherence traffic increases
  • L1 references, L2 references, coherence traffic
  • Snoop and directory requests for data and invalidations
• Power consumption increases
  • Due to extra cache references, coherence traffic, and cache block state transitions
• Resource contention
  • WP references compete with correct-path references for resources
  • In contrast to uniprocessors, the increased frequency of full service buffers becomes critical when there are many cache-to-cache transfers

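WP effects in Multiple-CMPs – 0 / 2

• CMP node and a 4-CMP system
• We studied inclusive L1 and L2 caches
• The L2 cache also tracks the coherence of cache blocks in L1

[Figure (a): CMP node. Four processors (P), each with instruction (I) and data (D) caches, an interconnection network, four L2 units with an interface (I/F), and a memory controller with a memory interface (Mem I/F).]
[Figure (b): system of four CMPs, with intra-CMP coherence within each node and inter-CMP coherence between nodes.]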

WP effects in Multiple-CMPs – 1 / 2

State transitions on replacement of an SO line in the L2 cache

[Figure: state-transition diagrams for the L2 cache and the L1 cache.]

• L2 cache states shown: SO, OIV, OIN, I
• L2 cache edge labels:
  • Wrong-path replacement
  • Place invalidation for sharer copies
  • All ACKs received from sharers
  • Write-back cache block
  • Write-back ACK received from directory
• L1 cache states shown: S, I
• L1 cache edge labels:
  • Invalidation received from L2 cache
  • Send ACK to L2 cache

WP effects in Multiple-CMPs – 2 / 2

State transitions when an MT line in the L2 cache receives a WP request

[Figure: state-transition diagrams for the L2 cache and the L1 cache.]

• L2 cache states shown: MT, MO, SO
• L2 cache edge labels:
  • Wrong-path shared copy request from local L1s or the external world
  • Issue downgrade to the writing L1 cache
  • Data and ACK received from the writing L1 cache
  • Send data to requestor
• L1 cache states shown: M, S
• L1 cache edge labels:
  • Downgrade requested from L2 cache
  • Send data and ACK to L2 cache

Experimental Methodology

• GEMS simulator (Wisconsin Multifacet Group)
  • Based on Virtutech SIMICS
  • Aggressive out-of-order superscalar processor
  • Detailed shared-memory model
• We evaluate a 16-processor (4- and 8-CMP) SPARC V9 system running unmodified Solaris 9
• We evaluate a 2-level MOSI directory coherence protocol
  • MOSI: Modified, Owned, Shared, Invalid
• We track speculatively generated memory references and mark them as wrong-path once the branch misprediction is known
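
The tracking step in the last bullet can be illustrated with a minimal Python sketch. It is not the GEMS implementation: `MemRef`, `WrongPathTracker`, and their methods are hypothetical names, and the sketch assumes each memory reference is tagged with a single unresolved branch, whereas a real out-of-order core would also squash everything younger than a mispredicted branch.

```python
from dataclasses import dataclass, field


@dataclass
class MemRef:
    addr: int
    branch_id: int            # unresolved branch this reference was issued under
    wrong_path: bool = False  # set only once the branch outcome is known


@dataclass
class WrongPathTracker:
    """Hypothetical tracker: holds speculative references until branches resolve."""
    pending: dict = field(default_factory=dict)     # branch_id -> [MemRef]
    classified: list = field(default_factory=list)  # references with known status

    def issue(self, addr, branch_id):
        """Record a speculatively issued memory reference."""
        ref = MemRef(addr, branch_id)
        self.pending.setdefault(branch_id, []).append(ref)
        return ref

    def resolve(self, branch_id, mispredicted):
        """Mark every reference issued under this branch once it resolves."""
        for ref in self.pending.pop(branch_id, []):
            ref.wrong_path = mispredicted
            self.classified.append(ref)

    def wrong_path_fraction(self):
        """Fraction of classified references that were on the wrong path."""
        if not self.classified:
            return 0.0
        return sum(r.wrong_path for r in self.classified) / len(self.classified)


tracker = WrongPathTracker()
tracker.issue(0x100, branch_id=1)
tracker.issue(0x180, branch_id=1)
tracker.issue(0x200, branch_id=2)
tracker.resolve(1, mispredicted=True)    # both references under branch 1 are wrong-path
tracker.resolve(2, mispredicted=False)   # the reference under branch 2 is correct-path
print(tracker.wrong_path_fraction())
```

The point of deferring the classification is that a reference cannot be labeled when it is issued; it reaches the memory system either way, which is why wrong-path references affect cache and coherence state.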

Experimental Methodology

Processor Configuration

• UltraSPARC III ISA
• 2 GHz, 15-stage pipeline, OoO execution
• 8-wide dispatch/retirement
• 256/128-entry ROB/scheduler
• 10-cycle branch misprediction penalty
• gshare branch predictor with 4K PHT
• 64-entry RAS and RAS exception table
• 32-entry CAS and CAS exception table

CMP Configuration

• 4 CMPs
• 4 processors per CMP
• L1 caches: 32 KB, 2-way, 2-cycle latency
• L2 cache: 2 MB, 2-way, 20-cycle latency
• 128-byte cache block size
• 32-entry MSHRs
• 4 GByte per bank
• 240-cycle DRAM latency
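
As a worked check of what the cache parameters above imply, the following sketch derives the number of sets and the address bit split for each level. It assumes that KB and MB are binary units (1024 and 1024 * 1024 bytes) and that the "2M" L2 size means 2 MB; the helper names are hypothetical.

```python
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes, ways, block_bytes):
    """Sets in a set-associative cache: size / (associativity * block size)."""
    if size_bytes % (ways * block_bytes) != 0:
        raise ValueError("cache size must be a multiple of ways * block size")
    return size_bytes // (ways * block_bytes)


def log2_exact(n):
    """Number of address bits needed to index n entries (n a power of two)."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1


BLOCK = 128                               # 128-byte cache blocks
l1_sets = num_sets(32 * KB, 2, BLOCK)     # 32 KB, 2-way L1
l2_sets = num_sets(2 * MB, 2, BLOCK)      # 2 MB, 2-way L2 (size is an assumption)

offset_bits = log2_exact(BLOCK)
print(l1_sets, log2_exact(l1_sets))       # 128 sets, 7 index bits
print(l2_sets, log2_exact(l2_sets))       # 8192 sets, 13 index bits
print(offset_bits)                        # 7 block-offset bits
```

With only two ways per set, a single wrong-path fill displaces half of a set, which is one reason low associativity makes pollution effects visible.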

Benchmark | Input Data Set
fft | 64K points
radix | 2M integers, radix 1024
ocean | 128x128 ocean
water-spatial | 512 molecules
em3d | 400K nodes, degree 2, span 5, 15% remote

Evaluation Results 1 / 5 -- L1 and L2 Cache Traffic

• Total memory references increase by 16% and 14% for 4- and 8-CMPs, respectively.

• L2 cache references increase by 35% and 36%, respectively.

• For em3d, the number of L1 misses increases by as much as 70%.

[Figure: two bar charts, one for 4 CMPs and one for 8 CMPs. Bars: L1 and L2. Benchmarks: FFT, RADIX, OCEAN, WATER, em3d, AVERAGE. Vertical axis: 0 to 100.]

Evaluation Results 2 / 5 -- Coherence Traffic

• Internal -- 36%
• External -- 30%

[Figure: two bar charts, one for 4 CMPs and one for 8 CMPs. Bars: Internal Net. and External Net. Benchmarks: FFT, RADIX, OCEAN, WATER, em3d, AVERAGE. Vertical axis: 0 to 100.]

Evaluation Results 3 / 5 -- L1 and L2 Cache Replacements

• L1 -- 30%, L2 -- 17%

Potential Cache Performance Impact

Type | Meaning | L1 | L2
Used | Used by a correct-path reference | 50% | 7%
Unused | Evicted before being used, or never used by a correct-path reference | 42% | 70%
Direct Miss | Replaces a cache block that is needed by a later correct-path load, and is evicted before being used | 4% | 20%
Indirect Miss | Changes the LRU order of a set, which may eventually cause correct-path misses | 4% | 3%
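
To make the categories in the table concrete, here is a small Python sketch of a single cache set with LRU replacement. It is an illustration under stated assumptions, not the accounting used for the numbers above: the `LRUSet` class is hypothetical, it counts events rather than classifying each wrong-path block exactly once, and it does not attempt to detect indirect misses.

```python
from collections import OrderedDict


class LRUSet:
    """One cache set with LRU replacement (hypothetical illustration)."""

    def __init__(self, ways):
        self.ways = ways
        # block -> True while it is a wrong-path fill not yet used by the correct path
        self.blocks = OrderedDict()
        self.evicted_by_wp = set()  # correct-path blocks displaced by wrong-path fills
        self.stats = {"used": 0, "unused": 0, "direct_miss": 0}

    def access(self, block, wrong_path=False):
        """Return True on a hit, False on a miss; update the counters."""
        if block in self.blocks:
            self.blocks.move_to_end(block)        # hit: becomes most recently used
            if not wrong_path and self.blocks[block]:
                self.stats["used"] += 1           # wrong-path fill acted as a prefetch
                self.blocks[block] = False
            return True
        if not wrong_path and block in self.evicted_by_wp:
            self.stats["direct_miss"] += 1        # miss on a block a wrong-path fill evicted
            self.evicted_by_wp.discard(block)
        if len(self.blocks) >= self.ways:
            victim, victim_is_wp = self.blocks.popitem(last=False)  # evict the LRU block
            if victim_is_wp:
                self.stats["unused"] += 1         # wrong-path fill evicted without being used
            elif wrong_path:
                self.evicted_by_wp.add(victim)
        self.blocks[block] = wrong_path
        return False


# Pollution: a wrong-path fill evicts X, which the correct path needs again.
polluted = LRUSet(ways=2)
polluted.access("X")
polluted.access("Y")
polluted.access("Z", wrong_path=True)   # evicts X, the LRU block
polluted.access("X")                    # direct miss caused by the wrong-path fill
polluted.access("Y")                    # evicts Z, which was never used

# Prefetching: a wrong-path fill is later used by the correct path.
prefetched = LRUSet(ways=2)
prefetched.access("P", wrong_path=True)
prefetched.access("P")                  # hit on the wrong-path fill

print(polluted.stats, prefetched.stats)
```

The second miss on Y in the pollution example is the kind of knock-on effect the Indirect Miss row describes, although this sketch does not count it separately.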

Evaluation Results 4 / 5 -- Write Misses

• 4 CMPs: on average 4%
• 8 CMPs: on average 7%

[Figure: two bar charts, one for 4 CMPs and one for 8 CMPs. Bars: External Net and Internal Net. Benchmarks: FFT, RADIX, OCEAN, WATER, em3d, AVERAGE. Vertical axis: 0 to 10.]

Evaluation Results 5 / 5 -- Cache Line State Transitions

4 CMPs
• Internal: 2% to 13%
• External: 1% to 9%

8 CMPs
• Internal: 2% to 17%
• External: 1% to 10%

[Figure: two bar charts, one for 4 CMPs and one for 8 CMPs. Series: O/S -> M and M -> O/S. Bars for Internal Net and External Net per benchmark: FFT, RADIX, OCEAN, WATER, em3d, AVERAGE. Vertical axis: 0 to 18.]

Conclusion

• It is important to model WP memory references in cache-coherent multi-CMP systems.
• For multi-CMPs, WP references not only affect the performance of individual processors through prefetching and pollution; they also affect the performance of the entire system by increasing:
  • cache coherence transactions
  • cache block state transitions
  • write-backs
  • invalidations
  • resource contention
• For a workload with many cache-to-cache transfers, WP references can significantly affect coherence actions.

The End

Thank You !

