A Tuneable Software Cache Coherence Protocol for Heterogeneous MPSoCs

Marco Bekooij & Frank Ophelders

Outline

  Context

  What is cache coherence

  Addressed challenge

  Short overview of related work

  Related issue: memory consistency

  Proposed software cache coherence protocol

  Performance evaluation results

  Concluding remarks


Multi-stream car-entertainment system

Car-radio IC of NXP

[Figure: block diagram of the car-radio IC. Four tiles (Tile 0 to Tile 3), each with an EPICS DSP, local memory and an Inter Tile Communication (ITC) AHB interface; an ARM-based subsystem with controller and memory; a 3-layer multi-layer AHB bus; three VPB domains (0 to 2) attached through AHB2VPB bridges; accelerators and peripherals (DMA, SPI, CD block decoder); and audio/radio I/O blocks (audio ADC 4x, audio DAC 4x, IIS and SPDIF interfaces, SRC, FIR, Cordic, keyed AGC) behind a Digital In Out (DIO) switch.]

Unsuitable for general-purpose applications (e.g. Pthreads)


Developed experimental embedded multiprocessor system

  Processors communicate through shared memory

  Processors have private caches

  Cache coherence problem!

[Figure: two ARM926EJ-S processing elements (PE1, PE2), each with private instruction and data caches ($) and its own instruction memory, connected through the Æthereal network-on-chip to Shared Memory 1 (8 MB, TDM), Shared Memory 2 (8 MB, TDM), a 256 MB SDRAM (RR) and peripherals (RS232, display, touchscreen, audio in/out, video in/out, timers); implemented on a Virtex 4.]

Cache coherence problem

  A cache coherency protocol ensures that eventually writes become visible to all processors

[Figure: P1 and P2 with private caches ($) above the shared memory. X is updated from 3 to 10, but one cache still holds the old copy X: 3, so a read returns 3.]
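The stale read in the figure can be reproduced with a toy model. The sketch below is an illustration only, not code from the slides: `cache_t`, `cache_read`, `cache_write`, `cache_clean` and `cache_invalidate` are hypothetical names, and each "cache" is assumed to hold a single word with write-back behaviour.

```c
#include <stdbool.h>

/* Toy model (assumption, not from the slides): one shared word X and a
 * one-entry private write-back cache per processor. */
typedef struct {
    bool valid;  /* cache holds a copy of X */
    bool dirty;  /* copy was modified and not yet written back */
    int  value;
} cache_t;

static int shared_x = 3;  /* X in shared memory */

/* Read X through a private cache: a hit never consults shared memory. */
int cache_read(cache_t *c) {
    if (!c->valid) {
        c->value = shared_x;
        c->valid = true;
        c->dirty = false;
    }
    return c->value;
}

/* Write X into the private cache only (write-back policy). */
void cache_write(cache_t *c, int v) {
    c->value = v;
    c->valid = true;
    c->dirty = true;
}

/* Clean: write a dirty copy back to shared memory. */
void cache_clean(cache_t *c) {
    if (c->valid && c->dirty) {
        shared_x = c->value;
        c->dirty = false;
    }
}

/* Invalidate: drop the copy so the next read refetches from memory. */
void cache_invalidate(cache_t *c) {
    c->valid = false;
    c->dirty = false;
}
```

If both caches first read X and P1 then writes 10, a read through P2's cache still returns 3. In this model P2 only observes 10 after P1 cleans its cache and P2 invalidates its own copy.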


Addressed challenge

Define a cache coherence protocol that is suitable for real-time embedded systems with a NoC and with off-the-shelf processors

Related work on cache coherency

  Hardware cache coherency protocols
    –  Snooping-based protocols
        •  Require processors to observe all memory accesses
        •  Do not match well with a NoC: preferably point-to-point communication instead of broadcasting
    –  Directory-based protocols
        •  Significant overhead as a result of accessing the directory
    –  Transactional memory
        •  Relies on speculation: suitable for real-time systems?
    –  Remark: most embedded processors do not support a hardware cache coherency protocol

  Software cache coherency protocols
    –  Require a specific programming style: explicit coupling between each synchronization operation and the data structure it protects

  Prevent cache coherency issues: put shared data in an uncached address range
    –  Low efficiency


Issues in sharing cache lines

  Cache operations often operate on lines

[Figure: P1 and P2 each hold a cached copy of the same line, which contains both A and B.]
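One consequence can be sketched with a toy model: when two processors each hold a write-back copy of a line containing both A and B, cleaning whole lines can overwrite the other processor's update. This scenario is an assumed reading of the figure rather than something the slides spell out, and `line_t`, `fetch_line`, `clean_line` and `run_lost_update` are hypothetical names.

```c
/* Toy model (assumption): one cache line holding two words, A and B. */
typedef struct { int a; int b; } line_t;

static line_t memory_line = { 0, 0 };  /* the line in shared memory */

/* Fetch: copy the whole line from shared memory into a private cache. */
line_t fetch_line(void) { return memory_line; }

/* Clean: write the whole cached line back, including unmodified words. */
void clean_line(const line_t *cached) { memory_line = *cached; }

/* P1 updates A and P2 updates B in their own cached copies, then both
 * clean. Returns the line as it ends up in shared memory. */
line_t run_lost_update(void) {
    line_t p1 = fetch_line();
    line_t p2 = fetch_line();
    p1.a = 1;          /* P1 writes A in its copy */
    p2.b = 2;          /* P2 writes B in its copy */
    clean_line(&p1);   /* memory now holds A=1, B=0 */
    clean_line(&p2);   /* memory now holds A=0, B=2: P1's write is lost */
    return memory_line;
}
```

In this model the final line in memory has A = 0 and B = 2. A write-through policy, where each store reaches memory individually, would not exhibit this whole-line overwrite.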

Related issue: memory consistency

  Memory accesses are reordered by
    –  Memory system
    –  Processor
    –  Compiler

  We need a memory consistency model
    –  Defines constraints on the order in which memory operations become visible to other processors
    –  Enables programmers to reason about the outcome

P1:  A = 1
     flag = 1

P2:  while ( flag != 1 );
     print A
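In C11, the intended outcome of this example (P2 prints 1) can be expressed with release/acquire atomics. This is a generic sketch and not the protocol from the slides; `producer` and `consumer` are hypothetical names for the code run by P1 and P2.

```c
#include <stdatomic.h>

static int A = 0;            /* ordinary shared data */
static atomic_int flag = 0;  /* synchronization variable */

/* P1: write the data, then publish it. The release store keeps the
 * write to A from being reordered after the write to flag. */
void producer(void) {
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);
}

/* P2: wait for the flag, then read the data. The acquire load keeps the
 * read of A from being reordered before the read of flag. */
int consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_acquire) != 1)
        ;  /* spin until P1 has published */
    return A;
}
```

The ordering annotations constrain the compiler and the processor. On a platform without hardware cache coherence, such as the one discussed in these slides, the caches would additionally have to be maintained in software for the write to become visible.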


Sequential Memory Consistency

[Figure: P1, P2 and P3 connected through a network-on-chip; example operation sequence: read, write, lock, write, read, unlock, write, write.]

P1:  A = 1

P2:  while (A != 1);
     B = 1

P3:  while (B != 1);
     print A

•  All writes must be seen in one single order by all processors (write atomicity)

•  Likely to be inefficient in combination with a NoC


Proposed software cache coherence protocol

Tuneable software cache coherence protocol

  Proposed software cache coherence protocol
    –  Minimal hardware requirements
        •  Suitable for heterogeneous MPSoCs with a NoC
        •  Off-the-shelf processors and caches are supported
    –  Caches should support cache maintenance operations (clean, invalidate)
    –  Sufficient for POSIX threads (Pthreads)
        •  Explicit synchronization operations

  Tuneable
    –  Separate shared and private data
        •  Shared data in a write-through and private data in a write-back cache region
    –  Minimize unnecessary invalidations
        •  Putting shared data in a specific cache way

  Suitable for real-time systems
    –  Bounded protocol overhead: WCET is independent of accesses by other processors


Release Consistency

  Ensuring sequential consistency efficiently is (too) costly, so release consistency is supported instead

  Acquire
    –  Guarantees reading the most recent data from memory

  Release
    –  Makes writes visible to other processors

  Cache coherence operations are only required on acquire and release

[Figure: example operation sequence: read, write, acquire(S), write, read, release(S), write, write.]

SWCC (software cache coherence) protocol in POSIX threads

  POSIX threads rule: "No two threads can access data at the same memory location simultaneously while at least one of the threads is modifying the location..."

  pthread_mutex_lock (acquire)
    –  Obtain lock
    –  Clean & invalidate Dcache

  pthread_mutex_unlock (release)
    –  Clean Dcache
    –  Release lock

[Figure: timeline of one thread: reads / writes, pthread_mutex_lock(S), reads / writes (exclusive access to shared data), pthread_mutex_unlock(S), reads / writes.]
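The lock and unlock steps listed on this slide can be sketched as thin wrappers around the Pthreads mutex calls. This is a minimal sketch under stated assumptions, not the authors' implementation: `swcc_mutex_lock`, `swcc_mutex_unlock`, `dcache_clean` and `dcache_clean_invalidate` are hypothetical names, and the cache hooks are stubs that only count calls. On a real target they would issue the platform's cache maintenance operations.

```c
#include <pthread.h>

/* Hypothetical platform hooks (assumption, not from the slides).
 * Here they are stubs that count how often they are called. */
static unsigned clean_count;
static unsigned invalidate_count;

static void dcache_clean(void) { clean_count++; }

static void dcache_clean_invalidate(void) {
    clean_count++;
    invalidate_count++;
}

/* Acquire: obtain the lock first, then clean and invalidate the data
 * cache so that later reads fetch the most recent data from memory. */
int swcc_mutex_lock(pthread_mutex_t *m) {
    int rc = pthread_mutex_lock(m);
    if (rc == 0)
        dcache_clean_invalidate();
    return rc;
}

/* Release: clean the data cache first so that writes made inside the
 * critical section reach memory, then release the lock. */
int swcc_mutex_unlock(pthread_mutex_t *m) {
    dcache_clean();
    return pthread_mutex_unlock(m);
}
```

The order of the two steps matters in both wrappers: the clean precedes the unlock so that the next lock owner finds the writes in memory, and the invalidate follows the lock so that no stale copy is read inside the critical section.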


Tuning the protocol

  Place shared and private data in different address ranges

  Private data does not need to become visible to other processors
    –  Private data in write-back region of the cache

  Shared data: solve the sharing problem
    –  Shared data in write-through region of the cache

[Figures: execution time (FFT) and memory accesses (FFT).]
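One possible way to place shared and private data in different address ranges with a GNU-style toolchain is a dedicated linker section. The sketch below is an assumption about how this could be done, not the method used in the slides: the section name `swcc_shared`, the macro `SWCC_SHARED` and the helper `swcc_is_shared` are hypothetical, and it relies on GCC's `section` attribute and on the `__start_`/`__stop_` symbols that GNU ld defines for sections whose names are valid C identifiers. Mapping that range as write-through would still have to be done in the platform's memory configuration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical marker: place a variable in the "swcc_shared" section. */
#define SWCC_SHARED __attribute__((section("swcc_shared")))

/* Section bounds, provided by the linker (GNU ld behaviour). */
extern char __start_swcc_shared[];
extern char __stop_swcc_shared[];

SWCC_SHARED int shared_counter = 0;       /* visible to other processors */
SWCC_SHARED int shared_buffer[64] = {0};  /* visible to other processors */
int private_counter = 0;                  /* stays in the default section */

/* Returns true if p lies inside the shared address range. */
bool swcc_is_shared(const void *p) {
    uintptr_t a = (uintptr_t)p;
    return a >= (uintptr_t)__start_swcc_shared &&
           a <  (uintptr_t)__stop_swcc_shared;
}
```

With all shared objects gathered in one contiguous range, cache maintenance can be restricted to that range, and private data is left untouched by synchronization.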

Experiments

  Embedded the software cache coherence operations in POSIX threads calls

  Clean and invalidate entire shared address range on each synchronization
    –  Entire cache
    –  Way with shared data
    –  Address range (MVA)

  Executed Splash2 applications

  Low latency 4 cycles / word

  Each processor gets equal budget
    –  TDM arbitration on memory port

[Figure: experimental setup with two ARM926EJ-S processing elements (PE1, PE2), each with private instruction and data caches ($) and its own instruction memory, a 16 MB shared memory behind a TDM arbiter, and peripherals (RS232, display, touchscreen, audio in/out, video in/out, timers).]
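Of the three granularities listed in the experiments, the address-range variant can be sketched as a loop over the cache lines that overlap the range. This is an illustrative sketch only: `swcc_clean_invalidate_range` and `dcache_clean_invalidate_line` are hypothetical names, the 32-byte line size is an assumption, and the per-line hook is a stub that counts calls instead of issuing a real cache operation.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 32u  /* assumed line size in bytes */

static unsigned long lines_maintained;  /* number of per-line operations */

/* Hypothetical per-line hook (assumption): a real port would clean and
 * invalidate the line containing line_addr. */
static void dcache_clean_invalidate_line(uintptr_t line_addr) {
    (void)line_addr;
    lines_maintained++;
}

/* Clean and invalidate every cache line that overlaps
 * [start, start + size). */
void swcc_clean_invalidate_range(uintptr_t start, size_t size) {
    if (size == 0)
        return;
    uintptr_t mask = ~(uintptr_t)(CACHE_LINE_SIZE - 1);
    uintptr_t addr = start & mask;               /* first line */
    uintptr_t last = (start + size - 1) & mask;  /* last line */
    for (;;) {
        dcache_clean_invalidate_line(addr);
        if (addr == last)
            break;
        addr += CACHE_LINE_SIZE;
    }
}
```

The number of per-line operations grows with the size of the range, so the cost of each synchronization depends on how tightly the shared data is packed.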


Cost of cache coherence operations

Two cost types:
•  cost of the cache maintenance operation
•  cost of unnecessary invalidations

Speedup Splash2 applications

Speedup between 1.89 and 2.01


Increase of memory accesses

Protocol does not increase number of memory accesses significantly

Conclusion

  Presented a cache coherence protocol that is suitable for real-time systems with a NoC and with off-the-shelf processors

  Most important optimization is the separation of shared and private data

  Experimental results
    –  Speedup between 1.89 and 2.01
        •  Higher synchronization/computation ratio (e.g. with hardware floating-point support): lower speed-up?
    –  Protocol does not significantly increase memory bandwidth requirements

  Suitable for real-time systems because software cache coherency protocol overhead is predictable


Questions?

Backup slides


SWCC protocol in POSIX threads

[Figure: P1 and P2, connected through the NoC to memory, each execute: reads / writes, pthread_mutex_lock(S), reads / writes (exclusive access to shared data), pthread_mutex_unlock(S), reads / writes.]

•  pthread_mutex_lock (acquire)
    •  Obtain lock
    •  Clean and invalidate

•  pthread_mutex_unlock (release)
    •  Clean
    •  Release lock

Existing cache coherence protocols

  Transactional Memory multiprocessor systems are based on speculation
    –  Suitable for real-time systems?

  Hardware protocols
    –  Snooping in a NoC
        •  Requires processors to observe all memory accesses
        •  Writes to one location are serialized

[Figure: processors P1 ... Pn with private caches ($) snooping cache-to-memory transactions on the bus to the shared memory.]


Existing cache coherence protocols

  Hardware protocols
    –  Directories in a NoC
        •  A directory is consulted on memory accesses
        •  Increase in memory access latency
    –  Hardware protocols require support from processors
        •  Supported by off-the-shelf processors?

[Figure: processors P1 ... Pn with private caches ($), connected through an interconnect to the shared memory and a directory.]

Existing cache coherence protocols

  Software protocols
    •  Explicit coupling between synchronization and data structure
        –  Conditional invalidation [Tartalja, HICSS 1992]
        –  Shared regions [Sandhu, ACM SIGPLAN 1993]
    •  Private data cached, shared data not cached
        –  In [Petrot, DSD 2006]

[Figure: Enter critical region (D): 1) check administration, 2) invalidate?; Access D; Exit critical region (D): clean if write-back.]

