Proximity-Aware Directory-based Coherence for Multi-core Processor Architectures

Jeff Brown, Rakesh Kumar, Dean Tullsen

UC San Diego ● University of Illinois at Urbana-Champaign

SPAA 19 ● June 9, 2007

Introduction

● The chip multiprocessor (CMP) era is upon us!

● Caching complicates writes

● Cache coherence ensures caching is done safely

● Multi-core designs offer new tradeoffs

[Figure: processor (P) and memory (M) nodes]

Background: Directory-based Cache Coherence

● Directory-based: explicit per-block accounting
– Doesn't rely on broadcasts

● Directory operation: client/server
– Processors request data, permissions
– Directory controllers manage memory access

● Updates, conflicts

[Figure: two processors (P), a directory controller (Dir), and memory (M)]
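As a rough illustration of the client/server operation above, the following minimal sketch models a directory that records, per block, which processors hold a copy. The names (`Directory`, `read`, `write`) and the simplified invalidate-on-write behavior are assumptions made for illustration; the slides do not specify the protocol's states or messages.

```python
from collections import defaultdict


class Directory:
    """Toy directory: explicit per-block accounting of sharers (illustrative only)."""

    def __init__(self):
        # block address -> set of processor ids currently holding a copy
        self.sharers = defaultdict(set)

    def read(self, proc, block):
        """A processor requests data; the directory records it as a sharer."""
        self.sharers[block].add(proc)
        return sorted(self.sharers[block])

    def write(self, proc, block):
        """A processor requests write permission.

        Other cached copies conflict with the update, so the directory
        invalidates exactly the recorded sharers (point-to-point, with no
        broadcast) and leaves the writer as the only holder.
        """
        invalidated = sorted(self.sharers[block] - {proc})
        self.sharers[block] = {proc}
        return invalidated


directory = Directory()
directory.read(0, 0x40)
directory.read(2, 0x40)
print(directory.write(1, 0x40))  # prints [0, 2]: the copies to invalidate
```

Because the directory knows the exact sharer set, a write only contacts the processors that actually hold the block.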

Background: Historical MP Cache Coherence

● Distributed directory, memory

[Figure: four nodes, each pairing a processor (P) with memory (M). A cache miss at one node sends a data request to the "home node", which returns a reply.]
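The request/reply exchange in the figure can be sketched in a few lines. This is only an illustration: the block size, the block-interleaved mapping in `home_node`, and the message tuples are assumptions, since the slides do not say how addresses are assigned to home nodes.

```python
BLOCK_BYTES = 64   # assumed cache-block size
NUM_NODES = 4      # the figure shows four processor/memory nodes


def home_node(address):
    """Map a physical address to the node whose memory and directory own it.

    Block interleaving across nodes is an assumption made for this sketch.
    """
    return (address // BLOCK_BYTES) % NUM_NODES


def service_miss(requester, address):
    """Return the (kind, source, destination) messages for a read miss."""
    home = home_node(address)
    if home == requester:
        return []  # block lives in local memory: no network messages
    return [
        ("data request", requester, home),
        ("reply", home, requester),
    ]


print(service_miss(requester=0, address=0x1C0))
# prints [('data request', 0, 3), ('reply', 3, 0)]
```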

Motivation: Multi-core Cache Coherence

[Figure: four processors (P) and four memories (M). A cache miss sends a data request to the "home node", which sends a reply. An additional sharer of the block is then marked.]

● Multi-core designs present radically different relative latency & bandwidth

Outline

● Introduction & Background

● System Architecture

● Proximity-Aware Coherence

● Results

● Conclusion

Directory-based Cache Coherence

● Directory structures
– Directory Memory
– Directory Entries
– Directory Controller

[Figure: main memory, directory memory, and a controller]
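The three structures can be pictured as a small data model. The sketch below is hypothetical: the state names, the bit-vector encoding of sharers, and the `handle_read` method are assumptions chosen for illustration, not the layout used by the authors.

```python
from dataclasses import dataclass


@dataclass
class DirectoryEntry:
    """One entry per memory block: a state plus a sharer bit vector."""
    state: str = "uncached"   # assumed states: "uncached", "shared", "exclusive"
    sharer_bits: int = 0      # bit i set => node i holds a copy

    def add_sharer(self, node):
        self.sharer_bits |= 1 << node

    def sharers(self):
        return [i for i in range(self.sharer_bits.bit_length())
                if (self.sharer_bits >> i) & 1]


class DirectoryController:
    """Looks up and updates entries in directory memory on each request."""

    def __init__(self):
        self.directory_memory = {}   # block number -> DirectoryEntry

    def handle_read(self, node, block):
        # Simplification: an "exclusive" owner would need a downgrade first.
        entry = self.directory_memory.setdefault(block, DirectoryEntry())
        entry.add_sharer(node)
        entry.state = "shared"
        return entry


controller = DirectoryController()
controller.handle_read(3, block=7)
entry = controller.handle_read(5, block=7)
print(entry.state, entry.sharers())  # prints: shared [3, 5]
```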

A Traditional Multiprocessor

[Figure: two nodes, each with a core, L2 cache (L2 $), directory (Dir), and memory (Mem), joined by an interconnect (chassis, board, etc.)]

Our 16-Core Chip Multiprocessor

[Figure: Tile 0, Tile 1, ..., Tile 15. Labeled components: core, L2 cache (L2 $), bus, directory control, network switch, directory cache (Dir $), memory channel.]

Proximity-Aware Coherence

● Idea: home node asks the sharer nearest the requester to forward its cached copy
– Stay on-chip when possible
– Minimize transit of large data-carrying replies

[Figure: four processors (P) and four memories (M). A cache miss sends a data request to the "home node". The home node sends a forward request to an additional sharer, which sends the reply.]
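The message flow in the figure can be sketched as follows. The 4x4 mesh, the tile numbering, the `hops` distance function, and the message tuples are all assumptions made for illustration; the slides show only the request, forward request, and reply arrows.

```python
def hops(a, b, width=4):
    """Manhattan distance between tiles on an assumed width x width mesh."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def read_miss_messages(requester, home, sharers):
    """Return (kind, source, destination) messages for one read miss.

    If the home tile cannot supply the block from its own cache but another
    tile can, the home node forwards the request to the sharer nearest the
    requester, and that sharer replies directly to the requester.
    """
    request = ("data request", requester, home)
    candidates = [s for s in sharers if s != requester]
    if not candidates or home in candidates:
        # No other cached copy, or the home tile can supply the data itself.
        return [request, ("reply", home, requester)]
    nearest = min(candidates, key=lambda s: hops(s, requester))
    return [request,
            ("forward request", home, nearest),
            ("reply", nearest, requester)]


for message in read_miss_messages(requester=5, home=15, sharers={6, 12}):
    print(message)
```

In this example tile 6 is one hop from the requester and tile 12 is three hops away, so the large data-carrying reply travels a single hop while only the small control messages cross the chip.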

Proximity-Aware Coherence

● To service read misses for shared data, traditional protocols use main memory

● Other nodes may hold copies

● On the CMP landscape, inter-node latency is much less than memory latency
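One way to make this comparison concrete is to write both service paths symbolically. The symbols are introduced here for illustration and do not appear in the slides: $r$ is the requester, $h$ the home node, $s$ a node holding a copy, $t(a,b)$ the network latency from $a$ to $b$, $T_{\text{mem}}$ the main-memory access latency, and $T_{\text{L2}}$ the access latency of the sharer's cache. Directory lookup time and contention are ignored.

```latex
\begin{align*}
T_{\text{memory}}  &= t(r,h) + T_{\text{mem}} + t(h,r) \\
T_{\text{forward}} &= t(r,h) + t(h,s) + T_{\text{L2}} + t(s,r)
\end{align*}
Forwarding from a sharer is faster when
\[
t(h,s) + T_{\text{L2}} + t(s,r) \;<\; T_{\text{mem}} + t(h,r).
\]
```

When the network terms are small next to $T_{\text{mem}}$, as the last bullet states for CMPs, the inequality holds even though forwarding adds an extra message.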

Sharer Selection

● When the home node lacks a cached copy, it selects a sharer to ask
– rand
– near1
– via1

● Retries didn't prove beneficial

[Figure: positions of the node with the miss (Miss) and the home node (Home)]
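The slides name the three policies without defining them. The sketch below is a guess based on the names alone: `rand` picks any sharer, `near1` picks the sharer closest to the requester, and `via1` picks the sharer that minimizes the total home-to-sharer-to-requester path. These readings, the 4x4 mesh, and the `hops` and `select_sharer` helpers are assumptions; the paper is the authority on the actual definitions.

```python
import random


def hops(a, b, width=4):
    """Manhattan distance between tiles on an assumed width x width mesh."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def select_sharer(policy, requester, home, sharers, rng=random):
    """Pick the sharer the home node asks to forward its copy (assumed semantics)."""
    candidates = sorted(s for s in sharers if s != requester)
    if not candidates:
        return None  # nobody else holds the block
    if policy == "rand":
        return rng.choice(candidates)
    if policy == "near1":
        return min(candidates, key=lambda s: hops(s, requester))
    if policy == "via1":
        return min(candidates,
                   key=lambda s: hops(home, s) + hops(s, requester))
    raise ValueError(f"unknown policy: {policy}")


# Tile 4 is adjacent to the requester but off the path from the home node;
# tile 10 lies on a shortest path between home and requester.
print(select_sharer("near1", requester=5, home=15, sharers={4, 10}))  # prints 4
print(select_sharer("via1", requester=5, home=15, sharers={4, 10}))   # prints 10
```

Under these assumed definitions the two distance-based policies can disagree: `near1` shortens only the data-carrying reply, while `via1` also counts the forward request's trip from the home node.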

Methodology

● Detailed, execution-driven processor and network simulation

● "RSIM" simulator, adapted to our CMP model

● Parallel workloads from several suites

● Hardware, benchmark details in paper

Proximity-Aware: Potential Coverage

[Figure: bar chart. X-axis: appbt, fft, lu, mp3d, ocean, quicksort, unstruct. Y-axis: fraction of read misses to shared lines, 0 to 1. Series labels: 6, 5, 4, 3, 2, 1. Callouts: Overall x = 43%; dist 1 x = 75%.]

Proximity-Aware: Latency Benefit

[Figure: bar chart. X-axis: appbt, fft, lu, mp3d, ocean, quicksort, unstruct, mean. Y-axis: normalized L2 miss latency, 0 to 1.1. Series: rand, near1, via1. Callouts: Latency -25%; Reply traffic -6%.]

Proximity-Aware: Speedup

● L2 latency sensitivity of workloads

[Figure: bar chart. X-axis: appbt, fft, lu, mp3d, ocean, quicksort, unstruct, mean. Y-axis: speedup, 0.00 to 1.80. Series: rand, near1, via1. Callout: Speedup 16%.]

Conclusion

● The latency/bandwidth aspects of CMPs motivate multicore-aware coherence redesign

● One such change: Proximity-Aware Coherence
– Ideas: stay on-chip, decrease "bulk" transit
– Mean speedup 16%, mean L2 miss latency down 25%

● More aggressive techniques are under study

● Questions?