Page 1: ECE8833 Polymorphous and Many-Core Computer Architecture

ECE8833 Polymorphous and Many-Core Computer Architecture

Prof. Hsien-Hsin S. Lee
School of Electrical and Computer Engineering

Lecture 7 Lock Elision and Transactional Memory

Page 2: ECE8833 Polymorphous and Many-Core Computer Architecture

Speculative Lock Elision (SLE) & Speculative Synchronization

Page 3: ECE8833 Polymorphous and Many-Core Computer Architecture

3

Lock May Not Be Needed
• OoO execution won't speculate beyond a lock acquisition, so critical section (CS) executions are serialized
• In the example, if the condition fails, no shared data is updated
• Potential thread-level parallelism is lost

LOCK(queue);
if (!search_queue(input))
    enqueue(input);
UNLOCK(queue);

(Thread 1 and Thread 2 each execute this critical section.)

How to detect such hidden parallelism?

Page 4: ECE8833 Polymorphous and Many-Core Computer Architecture

4

Bottom Line
• Appearance of instantaneous changes (i.e., atomicity)
• The lock can be elided if
  – Data read in the CS is not modified by other threads
  – Data written in the CS is not read by other threads
• Any violation of the above prevents the instructions in the CS from committing

Page 5: ECE8833 Polymorphous and Many-Core Computer Architecture

5

Speculative Lock Elision (SLE) [Rajwar & Goodman, MICRO-34]

• Hardware-based scheme (no ISA extension)
• Dynamically identifies synchronization operations
• Predicts them to be unnecessary
• Elides them
• When speculation is wrong, recovers using the existing cache coherence mechanism

Page 6: ECE8833 Polymorphous and Many-Core Computer Architecture

6

SLE Scenario

• Silent store pair
  – stl_c (store-conditional on the lock flag): performs the "lock acquire" (lock = 1)
  – stl (regular store): performs the "lock release" (lock = 0)
  – Why silent? The "release" undoes the write performed by the "acquire"
• Goal (see the sketch below)
  – Elide this silent store pair
  – Speculate that all memory operations inside the critical section occur atomically
  – Buffer store results during execution within the critical section
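For concreteness, here is a minimal sketch of the lock acquire/release pattern SLE targets, written in C with GCC atomic builtins standing in for the ldl_l/stl_c sequence (the function and variable names are illustrative, not taken from the SLE paper):

    /* Minimal sketch of the code pattern SLE elides (hypothetical example).
       The acquire writes lock = 1 and the release writes 0 back, so as a pair
       the two stores are "silent": memory ends with its original value. */
    void locked_update(volatile int *lock, int *shared, int value)
    {
        while (__sync_lock_test_and_set(lock, 1))   /* "acquire": ldl_l/stl_c-style store of 1 */
            while (*lock)                           /* spin while the lock appears held        */
                ;
        *shared += value;                           /* critical section                        */
        __sync_lock_release(lock);                  /* "release": regular store of 0           */
    }

Under SLE, the two lock stores are elided, and only the loads and stores inside the critical section are monitored (via coherence) for conflicts with other threads.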

Page 7: ECE8833 Polymorphous and Many-Core Computer Architecture

7

Predict a Lock Acquire

• A lock predictor detects ldl_l/stl_c pairs
• View an "elided lock acquire" as making a "branch prediction"
  – Buffer register and memory state until SLE is validated
• View an "elided lock release" as a "branch outcome resolution"

Page 8: ECE8833 Polymorphous and Many-Core Computer Architecture

8

Speculation During Critical Section
• Speculative register state: use either of the following
  – ROB
    • The critical section must be smaller than the ROB
    • Instructions cannot speculatively retire
  – Register checkpoint
    • Taken once, after the elided lock acquire
    • Allows speculative retirement of registers
• Speculative memory state
  – Use the write buffer
  – Multiple writes can be collapsed inside the write buffer
  – The write buffer cannot be flushed prior to the elided lock release
• Roll back the state on mis-speculation

Page 9: ECE8833 Polymorphous and Many-Core Computer Architecture

9

Mis-speculation Triggers
• Atomicity violation
  – Use the existing coherence protocol; two basic principles (see the sketch below):
    (1) Any external invalidation to an accessed line
    (2) Any external request to access an "exclusive" line
  – Use the LSQ if the ROB approach is used (in-flight CS instructions cannot retire and are checked via snooping)
  – Add an access bit to the cache if the checkpoint approach is used
• Violation due to limited resources
  – The write buffer fills before the elided lock release
  – The ROB fills before the elided lock release
  – Uncached access events
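A minimal sketch of the two atomicity-violation checks, assuming per-line tracking of accesses made during the elided critical section (the structure and names are illustrative, not from the SLE paper):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-line tracking state while a lock is elided. */
    typedef struct {
        uint64_t addr;      /* cache-line address                           */
        bool     accessed;  /* read or written inside the elided CS         */
        bool     exclusive; /* held in an exclusive (written) state locally */
    } sle_line_t;

    /* Called on a snooped external request; returns true if speculation must be
       aborted and the critical section re-executed with the lock actually taken. */
    bool sle_snoop_violation(const sle_line_t *line,
                             bool external_invalidate, bool external_read)
    {
        if (external_invalidate && line->accessed)
            return true;        /* (1) external invalidation to an accessed line      */
        if (external_read && line->exclusive)
            return true;        /* (2) external request to access an "exclusive" line */
        return false;
    }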

Page 10: ECE8833 Polymorphous and Many-Core Computer Architecture

10

Microbenchmark Result

[Rajwar & Goodman, MICRO-34]

Page 11: ECE8833 Polymorphous and Many-Core Computer Architecture

11

Percentage of Dynamic Locks Elided

[Rajwar & Goodman, MICRO-34]

Page 12: ECE8833 Polymorphous and Many-Core Computer Architecture

12

Speculative Synchronization [Martinez et al. ASPLOS-02]
• Similar rationale
  – Synchronization may be too conservative
  – Bypass synchronization speculatively
• Off-load synchronization operations from the processor to a Speculative Synchronization Unit (SSU)
• Use a "speculative thread" to pass
  – Active barriers
  – Busy locks
  – Unset flags
• TLS (thread-level speculation) hardware
  – Disambiguates data violations
  – Rolls back
• Always keep at least one "safe" thread to guarantee forward progress
  – In case the speculative buffer overflows
  – In case an actual conflict occurs

Page 13: ECE8833 Polymorphous and Many-Core Computer Architecture

13

Speculative Lock Example

[Figure: threads A–E and a critical section delimited by ACQUIRE … RELEASE; legend: safe vs. speculative]

Slide Source: Jose Martinez

Page 14: ECE8833 Polymorphous and Many-Core Computer Architecture

14

Speculative Lock Example

[Figure: animation frame: threads A–E progressing through the ACQUIRE…RELEASE region; one thread safe, the rest speculative]

Slide Source: Jose Martinez

Page 15: ECE8833 Polymorphous and Many-Core Computer Architecture

15

Speculative Lock Example

[Figure: animation frame: threads A–E progressing through the ACQUIRE…RELEASE region; one thread safe, the rest speculative]

Slide Source: Jose Martinez

Page 16: ECE8833 Polymorphous and Many-Core Computer Architecture

16

Speculative Lock Example

[Figure: animation frame: threads A–E progressing through the ACQUIRE…RELEASE region; one thread safe, the rest speculative]

Slide Source: Jose Martinez

Page 17: ECE8833 Polymorphous and Many-Core Computer Architecture

17

Speculative Lock Example

[Figure: final animation frame of threads A–E around the ACQUIRE…RELEASE region]

Slide Source: Jose Martinez

C becomes the new “safe” thread and “lock owner”

Page 18: ECE8833 Polymorphous and Many-Core Computer Architecture

18

Hardware Support for Speculative Synchronization

[Figure: processor with L1 and L2 caches, plus a Speculative Synchronization Unit (SSU) beside the L1]
• The SSU keeps the synchronization variable under speculation; its Acquire (A) and Release (R) bits and associated logic take over the job of "acquiring the lock"
• A speculative bit per cache line indicates speculative memory operations
• Upon hitting a "lock acquire" instruction, a library call issues a request to the SSU, and the processor moves on past the lock into speculative execution

Page 19: ECE8833 Polymorphous and Many-Core Computer Architecture

19

Speculative Lock Request
• Processor side
  – Program the SSU for the speculative lock
  – Checkpoint the register file
• Speculative Synchronization Unit (SSU) side
  – Initiate a Test&Test&Set loop on the lock variable (see the sketch below)
• Use caches as the speculative buffer (as in TLS)
  – Set the "speculative bit" in lines accessed speculatively

Slide Source: Jose Martinez
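The Test&Test&Set loop that the SSU runs on the lock variable would look roughly like the following C sketch (illustrative only; the real SSU implements this in hardware next to the L1 cache):

    /* Test&Test&Set: spin on plain reads (cache-friendly) and attempt the
       atomic Test&Set only when the lock is observed to be free. */
    void test_and_test_and_set(volatile int *lock)
    {
        for (;;) {
            while (*lock)                                 /* "test": read-only spin     */
                ;
            if (__sync_lock_test_and_set(lock, 1) == 0)   /* "test&set": atomic attempt */
                return;                                   /* acquired                   */
        }
    }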

Page 20: ECE8833 Polymorphous and Many-Core Computer Architecture

20

Lock Acquire
• The SSU acquires the lock (i.e., Test&Set succeeds)
  – Clears all speculative bits
  – Becomes idle
• The release (a store) is performed later by the processor

Page 21: ECE8833 Polymorphous and Many-Core Computer Architecture

21

Release While Speculative
• The processor issues the release while the SSU is still trying to acquire the lock
  – The SSU intercepts the release (store) from the processor
  – The SSU toggles the Release bit: the thread is "already done"
• The SSU can pretend that ownership has been acquired and released (although it never happened)
  – The Acquire and Release bits are cleared
  – All speculative bits in the caches are cleared

Page 22: ECE8833 Polymorphous and Many-Core Computer Architecture

22

Violation Detection
• Rely on the underlying cache coherence protocol
  – A thread receives an external invalidation
  – An external read arrives for a local dirty cache line
• If the accessed line is *not* marked speculative, the normal coherence protocol applies
• If a *speculative thread* receives an external message for a line marked speculative (see the sketch below)
  – The SSU squashes the local thread
  – All dirty lines with speculative bits set are gang-invalidated
  – All speculative bits are cleared
  – The processor restores the checkpointed state
• The lock owner is never squashed (since none of its cache lines are marked speculative)
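A minimal sketch of the squash sequence, with hypothetical helper functions standing in for the SSU and cache hardware actions listed above:

    /* Hypothetical helpers representing SSU/cache hardware actions. */
    void gang_invalidate_dirty_speculative_lines(void);
    void clear_all_speculative_bits(void);
    void restore_register_checkpoint(void);
    void restart_at_synchronization_point(void);

    /* Squash of a speculative thread (illustrative sketch): triggered when an
       external coherence message hits a line whose speculative bit is set. */
    void ssu_squash_speculative_thread(void)
    {
        gang_invalidate_dirty_speculative_lines();  /* discard speculative stores             */
        clear_all_speculative_bits();
        restore_register_checkpoint();              /* roll the processor back                */
        restart_at_synchronization_point();         /* re-execute, possibly as the safe thread */
    }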

Page 23: ECE8833 Polymorphous and Many-Core Computer Architecture

23

Speculative Synchronization Results
• Average synchronization time reduction: 40%
• Execution time reduction: up to 15%, 7.5% on average

Page 24: ECE8833 Polymorphous and Many-Core Computer Architecture

Transactional Memory

Page 25: ECE8833 Polymorphous and Many-Core Computer Architecture

25

Current Parallel Programming Model
• Shared data consistency
• Use "locks"
• Fine-grained locks
  – Error prone
  – Deadlock prone
  – Overhead
• Coarse-grained locks
  – Serialize threads
  – Prevent parallelism

// WITH LOCKS
void move(T s, T d, Obj key) {
    LOCK(s);
    LOCK(d);
    tmp = s.remove(key);
    d.insert(key, tmp);
    UNLOCK(d);
    UNLOCK(s);
}

Thread 0: move(a, b, key1);
Thread 1: move(b, a, key2);
DEADLOCK! (and can't abort)

Code example source: Mark Hill @Wisconsin

Page 26: ECE8833 Polymorphous and Many-Core Computer Architecture

26

Parallel Software Problems
• Parallel systems are often programmed with
  – Synchronization through barriers
  – Shared object access control through locks
• Lock granularity and organization must balance performance and correctness
  – Coarse-grain locking: lock contention
  – Fine-grain locking: extra overhead
  – Must be careful to avoid deadlocks or data races
  – Must be careful not to leave anything unprotected, for correctness
• Performance tuning is not intuitive
  – Performance bottlenecks are related to low-level events (e.g., false sharing, coherence misses)
  – Feedback is often indirect (cache lines rather than variables)

Page 27: ECE8833 Polymorphous and Many-Core Computer Architecture

27

Parallel Hardware Complexity (TCC's view)
• Cache coherence protocols are complex
  – Must track ownership of cache lines
  – Difficult to implement and verify all corner cases
• Consistency protocols are complex
  – Must provide rules to correctly order individual loads/stores
  – Difficult for both hardware and software
• Current protocols rely on low latency, not bandwidth
  – Critical short control messages on ownership transfers
  – Latency of short messages is unlikely to scale well in the future
  – Bandwidth is likely to scale much better
    • High-speed inter-chip connections
    • Multicore (CMP) = on-chip bandwidth

Page 28: ECE8833 Polymorphous and Many-Core Computer Architecture

28

What do we want?
• A shared memory system with

– A simple, easy programming model (unlike message passing)

– A simple, low-complexity hardware implementation (unlike conventional cache-coherent shared memory)

– Good performance

Page 29: ECE8833 Polymorphous and Many-Core Computer Architecture

29

Lock Freedom
• Why are locks bad?
• Common problems with conventional locking mechanisms in concurrent systems
  – Priority inversion: a low-priority process is preempted while holding a lock needed by a high-priority process
  – Convoying: a process holding a lock is de-scheduled (e.g., page fault, quantum expired), and no forward progress is possible for other processes capable of running
  – Deadlock (or livelock): processes attempt to lock the same set of objects in different orders (could be programmer bugs)
• Error-prone

Page 30: ECE8833 Polymorphous and Many-Core Computer Architecture

30

Using Transactions
• What is a transaction?
  – A sequence of instructions that is guaranteed to execute and complete only as an atomic unit

        Begin Transaction
        Inst #1
        Inst #2
        Inst #3
        ...
        End Transaction

  – Satisfies the following properties
    • Serializability: transactions appear to execute serially
    • Atomicity (or failure atomicity): a transaction either
      – commits its changes when complete, visible to all; or
      – aborts, discarding its changes (and will retry)
    • Isolation: concurrently executing threads cannot affect the result of a transaction, so a transaction produces the same result as if no other task were executing
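As a sketch of how a transaction would replace the locks in the earlier move() example, using an atomic{} block as illustrative pseudocode for Begin/End Transaction (not the syntax of any particular TM system):

    // WITH TRANSACTIONS (sketch): there is no lock ordering to get wrong, and a
    // conflicting transaction simply aborts and retries instead of deadlocking.
    void move(T s, T d, Obj key) {
        atomic {                       /* Begin Transaction                       */
            tmp = s.remove(key);
            d.insert(key, tmp);
        }                              /* End Transaction: commit, or abort+retry */
    }

Thread 0's move(a, b, key1) and Thread 1's move(b, a, key2) can now run concurrently; if their read/write sets conflict, one of them aborts and retries.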

Page 31: ECE8833 Polymorphous and Many-Core Computer Architecture

31

TCC (Stanford) [Hammond et al. ISCA 2004]
• Transactional Coherence and Consistency
• Programmer-defined groups of instructions within a program

        Begin Transaction      <- start buffering results
        Inst #1
        Inst #2
        Inst #3
        ...
        End Transaction        <- commit results now

• Only commit machine state at the end of each transaction
  – Each transaction must update machine state atomically, all at once
  – To other processors, all instructions within one transaction appear to execute only when the transaction commits
  – These commits impose an order on how processors may modify machine state

Page 32: ECE8833 Polymorphous and Many-Core Computer Architecture

32

Transaction Code Example
• MIT LTM instruction set

    xstart:   XBEGIN on_abort
              lw   r1, 0(r2)
              addi r1, r1, 1
              . . .
              XEND
              . . .
    on_abort: . . .            // back off
              j    xstart      // retry

Page 33: ECE8833 Polymorphous and Many-Core Computer Architecture

33

Transactional Memory
• Transactions appear to execute in commit order
  – Flow (RAW) dependences cause transaction violation and restart

[Figure: timeline of three transactions]
• Transaction A: ld 0xdddd, ..., st 0xbeef -> arbitrate -> commit
• Transaction B: ld 0xdddd, ld 0xbbbb -> arbitrate -> commit (no conflict)
• Transaction C: ld 0xbeef -> violation when A's committed store to 0xbeef is seen -> re-execute with the new data

Page 34: ECE8833 Polymorphous and Many-Core Computer Architecture

34

Transaction Atomicity

T0: Load r = A; Add r = r + 5; Store A = r
T1: Load r = A; Add r = r + 5; Store A = r
Initial memory: A = 0

What are the values when T0 and T1 are executed atomically?
  T0 then T1: A = 10
  T1 then T0: A = 10

Page 35: ECE8833 Polymorphous and Many-Core Computer Architecture

35

Transaction Atomicity

T0: Load r = A; Add r = r + 5; Store A = r
T1: Load r = A; Add r = r + 5; Store A = r
Initial memory: A = 0

What are the values when T0 and T1 are executed atomically?

[Figure: timeline showing the two transactions executing one after the other, each reading A, adding 5, and storing the result]

Page 36: ECE8833 Polymorphous and Many-Core Computer Architecture

36

Transaction Atomicity

T0: Store A = 2
T1: Load r = A; Add r = r + 5; Store A = r
Initial memory: A = 0

What are the values when T0 and T1 are executed atomically?
  T0 then T1: A = 7
  T1 then T0: A = 2

Page 37: ECE8833 Polymorphous and Many-Core Computer Architecture

37

Transaction Atomicity

T0: Store A = 2
T1: Load r = A; Add r = r + 5; Store A = r
Initial memory: A = 0

[Figure: timeline of a non-atomic interleaving]
T1 tries to be atomic, but another operation modifies the shared variable A in the middle:
  T1: Load r = A       (r = 0)
  T0: Store A = 2
  T1: Add r = r + 5    (r = 0 + 5 = 5)
  T1: Store A = r      (A = 5, and T0's update is lost)

Page 38: ECE8833 Polymorphous and Many-Core Computer Architecture

38

Transaction Atomicity

T0: Store A = 2
T1: Store A = 9; Load r = A; Add r = r + 2; Store A = r
Initial memory: A = 0

What are the values when T0 and T1 are executed atomically?
  T0 then T1: A = 11
  T1 then T0: A = 2

Page 39: ECE8833 Polymorphous and Many-Core Computer Architecture

39

Transaction Atomicity

[Figure: timeline of transactions T0, T1, and T2 with their read and write sets]
• T0: ReadSet = {A}, WriteSet = {}   (Load r = A)
• T1: ReadSet = {B, C}, WriteSet = {A}   (arbitrates and commits)
• T2: ReadSet = {X, Y}, WriteSet = {A}   (Store A = Y)
• A committed write set containing A conflicts with T0's read set, which would trigger a violation and re-execution of T0

Page 40: ECE8833 Polymorphous and Many-Core Computer Architecture

40

Hardware Transactional Memory Taxonomy

Conflict Detection
• Check the write set against other threads' read sets and write sets
  – Lazy: wait until the last minute (commit time)
  – Eager: check on each write; squash during the transaction

Version Management
• Where to put speculative data
  – Lazy: into a speculative buffer (assuming the transaction will abort); no rollback needed on abort
  – Eager: into the cache hierarchy (assuming the transaction will commit); no data copy needed on commit

Page 41: ECE8833 Polymorphous and Many-Core Computer Architecture

41

HTM Taxonomy [LogTM 2006]

                              Version Management
                              Lazy                          Eager
Conflict         Lazy         Optimistic C. Ctrl. DBMS      None
Detection        Eager        MIT LTM, Intel/Brown VTM      Conservative C. Ctrl. DBMS, MIT UTM, LogTM

Page 42: ECE8833 Polymorphous and Many-Core Computer Architecture

42

TCC System
• Similar to prior thread-level speculation (TLS) techniques
  – CMU Stampede
  – Stanford Hydra
  – Wisconsin Multiscalar
  – UIUC speculative multithreading CMP
• Loosely coupled TLS system
• Completely eliminates conventional cache coherence and consistency models
  – No MESI-style cache coherence protocol
• But requires new hardware support

Page 43: ECE8833 Polymorphous and Many-Core Computer Architecture

43

The TCC Cycle
• Transactions run in a cycle (see the sketch below)
• Speculatively execute code and buffer the results
• Wait for commit permission
  – A phase number provides synchronization, if necessary (the oldest phase commits first)
  – Arbitrate with other processors
• Commit stores together (as a packet)
  – Provides a well-defined write ordering
  – Can invalidate or update other caches
  – A large packet utilizes bandwidth effectively
• And repeat
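One processor's trip around this cycle can be summarized by the following sketch (hypothetical C; the helper functions are illustrative stand-ins for TCC hardware actions, not a real interface):

    /* Hypothetical helpers standing in for TCC hardware actions. */
    void checkpoint_register_state(void);
    void execute_transaction_speculatively(void);
    void wait_for_my_phase(void);
    void arbitrate_for_commit(void);
    void broadcast_commit_packet(void);
    void clear_read_and_modified_bits(void);

    /* One processor's TCC execution cycle (illustrative sketch). */
    void tcc_processor_cycle(void)
    {
        for (;;) {
            checkpoint_register_state();
            execute_transaction_speculatively();  /* loads set Read bits, stores are buffered  */
            wait_for_my_phase();                  /* ordering, only if the program requires it */
            arbitrate_for_commit();
            broadcast_commit_packet();            /* one large packet; other caches snoop it   */
            clear_read_and_modified_bits();
            /* a snooped conflict at any point above rolls back to the checkpoint */
        }
    }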

Page 44: ECE8833 Polymorphous and Many-Core Computer Architecture

44

Advantages of TCC
• Trades bandwidth for simplicity and latency tolerance
  – Easier to build
  – Not dependent on the timing/latency of loads and stores
• Transactions eliminate locks
  – Transactions are inherently atomic
  – Catches most common parallel programming errors
• Shared memory consistency is simplified
  – The conventional model sequences individual loads and stores
  – Now hardware only has to sequence transaction commits
• Shared memory coherence is simplified
  – Processors may have copies of cache lines in any state (no MESI!)
  – Commit order implies an ownership sequence

Page 45: ECE8833 Polymorphous and Many-Core Computer Architecture

45

How to Use TCC
• Divide code into potentially parallel tasks
  – Usually loop iterations
  – For the initial division, tasks = transactions
    • But transactions can be subdivided or grouped to match HW limits (buffering)
  – Similar to threading in conventional parallel programming, but:
    • We do not have to verify parallelism in advance
    • Locking is handled automatically
    • Easier to get parallel programs running correctly
• The programmer then orders transactions as necessary
  – Ordering techniques are implemented using phase numbers
  – Deadlock-free (at least one transaction is the oldest one)
  – Livelock-free (watchdog HW can easily insert barriers anywhere)

Page 46: ECE8833 Polymorphous and Many-Core Computer Architecture

46

How to Use TCC
• Three common ordering scenarios
  – Unordered, for purely parallel tasks
  – Fully ordered, to specify sequential tasks (algorithm level)
  – Partially ordered, to insert synchronization such as barriers

Page 47: ECE8833 Polymorphous and Many-Core Computer Architecture

47

Basic TCC Transaction Control Bits
• In each local cache
  – Read bits (per cache line, or per word to eliminate false sharing)
    • Set on speculative loads
    • Snooped by a committing transaction (writes by another CPU)
  – Modified bits (per cache line)
    • Set on speculative stores
    • Indicate what to roll back if a violation is detected
    • Different from the dirty bit
  – Renamed bits (optional)
    • At word or byte granularity
    • Indicate local updates (RAW) that do not cause a violation
    • Subsequent reads of lines with these bits set do NOT set read bits, because a local RAW is not considered a violation (see the sketch below)
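A minimal sketch of how these per-line control bits might be represented and consulted (hypothetical C; field and function names are illustrative, not from the TCC papers):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-cache-line transactional metadata for a TCC-style cache. */
    typedef struct {
        uint64_t tag;
        bool     valid;
        bool     read;      /* set on speculative loads                          */
        bool     modified;  /* set on speculative stores; marks rollback targets */
        bool     renamed;   /* word written locally before being read (RAW)      */
    } tcc_line_t;

    /* Snoop check when another processor's commit packet arrives: a committed
       write to a line we speculatively read is a violation. */
    bool commit_snoop_violates(const tcc_line_t *line, uint64_t committed_tag)
    {
        return line->valid && line->tag == committed_tag && line->read;
    }

    /* On a speculative load: only set the read bit if the word was not locally
       renamed, since a local RAW does not constitute a violation. */
    void on_speculative_load(tcc_line_t *line)
    {
        if (!line->renamed)
            line->read = true;
    }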

Page 48: ECE8833 Polymorphous and Many-Core Computer Architecture

48

During a Transaction Commit
• Need to collect all of the modified cache lines together into a commit packet
• Potential solutions
  – A separate write buffer, or
  – An address buffer maintaining a list of the line tags to be committed
  – Size?
• Broadcast all writes out as one single (large) packet to the rest of the system

Page 49: ECE8833 Polymorphous and Many-Core Computer Architecture

49

Re-execute a Transaction
• Rollback is needed when a transaction cannot commit
• Checkpoints are needed prior to a transaction
• Checkpoint memory
  – Use the local cache
  – Overflow issue
    • Conflict or capacity misses require all the victim lines to be kept somewhere (e.g., a victim cache)
• Checkpoint register state
  – Hardware approach: flash-copy the rename table / architectural register file
  – Software approach: extra instruction overhead

Page 50: ECE8833 Polymorphous and Many-Core Computer Architecture

50

Sample TCC Hardware
• Write buffers and L1 transaction control bits
  – Write buffer in the processor, before broadcast
• A broadcast bus or network to distribute commit packets
  – All processors see the commits in a single order
  – Snooping on broadcasts triggers violations, if necessary
• Commit arbitration/sequence logic

Page 51: ECE8833 Polymorphous and Many-Core Computer Architecture

51

Ideal Speedups with TCC
• equake_l: long transactions
• equake_s: short transactions

Page 52: ECE8833 Polymorphous and Many-Core Computer Architecture

52

Speculative Write Buffer Needs
• Only a few KB of write buffering is needed
  – Set by the natural transaction sizes in applications
  – A small write buffer can capture 90% of the modified state
  – Infrequent overflow can always be handled by committing early

Page 53: ECE8833 Polymorphous and Many-Core Computer Architecture

53

Broadcast Bandwidth
• Broadcast is bursty
• Average bandwidth
  – Needs ~16 bytes/cycle @ 32 processors with whole modified lines
  – Needs ~8 bytes/cycle @ 32 processors with dirty data only
• High, but feasible on-chip

Page 54: ECE8833 Polymorphous and Many-Core Computer Architecture

54

TCC vs. MESI [PACT 2005]
• Application, protocol + processor count

Page 55: ECE8833 Polymorphous and Many-Core Computer Architecture

55

Implementation of MIT's LTM [HPCA 05]
• Transactional memory should support transactions of arbitrary size and duration
• LTM: Large Transactional Memory
• No change to the cache coherence protocol
• Abort when a memory conflict is detected
  – Use the coherence protocol to check for conflicts
  – Abort the (younger) transaction during conflict resolution to guarantee forward progress
• For potential rollback
  – Checkpoint the rename table and physical registers
  – Use the local cache for all speculative memory operations
  – Use the shared L2 (or lower-level memory) for non-speculative data storage

Page 56: ECE8833 Polymorphous and Many-Core Computer Architecture

56

Multiple In-Flight Transactions
• During instruction decode:
  – Maintain the rename table and "saved" bits in physical registers
  – "Saved" bits track registers mentioned in the current rename table
    • Constant number of set bits: every time a register is added to the "saved" set, we also remove one

Original code:
    XBEGIN L1
    ADD R1, R1, R1
    ST  1000, R1
    XEND
    XBEGIN L2
    ADD R1, R1, R1
    ST  2000, R1
    XEND

Rename table: R1 → P1, …
Saved set: {P1, …}
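A sketch of the per-XBEGIN snapshot implied above (hypothetical C; the real structures are rename-stage hardware):

    #include <stdint.h>

    #define NUM_ARCH_REGS 32   /* assumed architectural register count */

    /* Snapshot captured when an XBEGIN is decoded; it becomes the "active"
       snapshot when that XBEGIN retires, and aborting restores it. */
    typedef struct {
        uint16_t rename_table[NUM_ARCH_REGS]; /* arch register -> physical register         */
        uint64_t saved_set;                   /* bitmap of physical regs named in the table */
    } xbegin_snapshot_t;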

Page 57: ECE8833 Polymorphous and Many-Core Computer Architecture

57

Multiple In-Flight Transactions

• When XBEGIN is decoded
  – A snapshot is taken of the current rename table and S (saved) bits
  – This snapshot is not active until XBEGIN retires

[Figure: same code; rename-table entries R1 → P1, then R1 → P2, with saved sets {P1, …} and {P2, …}]

Page 58: ECE8833 Polymorphous and Many-Core Computer Architecture

58

Multiple In-Flight Transactions

[Figure: animation step over the same code, rename table, and saved sets; no additional text on this slide]

Page 59: ECE8833 Polymorphous and Many-Core Computer Architecture

59

Multiple In-Flight Transactions

[Figure: animation step over the same code, rename table, and saved sets; no additional text on this slide]

Page 60: ECE8833 Polymorphous and Many-Core Computer Architecture

60

Multiple In-Flight Transactions

• When XBEGIN retires
  – The snapshot taken at decode becomes active, which prevents P1 from being reused
  – The 1st transaction is queued to become active in memory
  – To abort, we just restore the active snapshot's rename table

[Figure: same code; the snapshot with saved set {P1, …} is now the active snapshot]

Page 61: ECE8833 Polymorphous and Many-Core Computer Architecture

61

Multiple In-Flight Transactions

• We are only reserving registers in the active set
  – This implies that exactly (number of architectural registers) physical registers are saved
  – This number is strictly limited, even as we speculatively execute through multiple transactions

[Figure: same code; the rename table now maps R1 → P3; saved sets {P1, …}, {P2, …}, {P3, …}; the active snapshot is still the one with {P1, …}]

retireretire

Activesnapshot

Page 62: ECE8833 Polymorphous and Many-Core Computer Architecture

62

Multiple In-Flight Transactions

• Normally, P1 would be freed here
• Since it is in the active snapshot's "saved" set, we place it onto the register reserved list instead

[Figure: same code and rename table; P1 is held on the reserved list rather than freed]

Page 63: ECE8833 Polymorphous and Many-Core Computer Architecture

63

Multiple In-Flight Transactions

• When XEND retires:
  – Reserved physical registers (e.g., P1) are freed, and the active snapshot is cleared
  – The store queue is empty

[Figure: same code; remaining rename-table entries R1 → P2 and R1 → P3, with saved sets {P2, …} and {P3, …}]

Page 64: ECE8833 Polymorphous and Many-Core Computer Architecture

64

Multiple In-Flight Transactions

• The second transaction becomes active in memory

[Figure: same code; rename table R1 → P2, saved set {P2, …}, which is now the active snapshot]

Page 65: ECE8833 Polymorphous and Many-Core Computer Architecture

65

Cache Overflow Mechanism

• Need to keep
  – Current (speculative) values
  – Rollback values
• The common case is commit, so keep the current values in the cache
• Problem: uncommitted current values may not fit in the local cache
• Solution: an overflow hashtable as an extension of the cache

[Figure: 2-way cache with an O bit per set and a T bit per line (fields: tag, data), plus an overflow hashtable (key, data)]

Running example:
    ST 1000, 55
    XBEGIN L1
    LD R1, 1000
    ST 2000, 66
    ST 3000, 77
    LD R1, 1000
    XEND

Page 66: ECE8833 Polymorphous and Many-Core Computer Architecture

66

Cache Overflow Mechanism

• T bit per cache line
  – Set if the line is accessed during a transaction
• O bit per cache set
  – Indicates set overflow
• Overflow storage in physical DRAM
  – Allocated and resized by the OS
  – Searched on a miss: complexity of a page-table walk
  – If the line is found, it is swapped with a line in the set

[Figure: same cache, overflow hashtable, and running example as on the previous slide]
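A minimal sketch of the miss path with the O/T bits and the overflow hashtable (hypothetical C; the real mechanism is a hardware state machine plus the OS-managed DRAM table):

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct line line_t;     /* a cache line: tag, data, T bit         */
    typedef struct {
        bool overflow_bit;          /* O bit: some transactional line spilled */
        /* ... ways of the set ... */
    } set_t;

    /* Hypothetical helpers standing in for hardware/OS actions. */
    line_t *overflow_table_lookup(uint64_t tag);
    line_t *swap_with_victim(set_t *set, line_t *spilled);
    line_t *fetch_from_lower_level(set_t *set, uint64_t tag);

    /* Miss handling with LTM's cache overflow mechanism (illustrative sketch). */
    line_t *handle_miss(set_t *set, uint64_t tag)
    {
        if (set->overflow_bit) {                        /* the line may have spilled   */
            line_t *spilled = overflow_table_lookup(tag);
            if (spilled)
                return swap_with_victim(set, spilled);  /* behaves like a victim cache */
        }
        return fetch_from_lower_level(set, tag);        /* otherwise a normal miss     */
    }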

Page 67: ECE8833 Polymorphous and Many-Core Computer Architecture

67

Cache Overflow Mechanism

• Start with non-transactional data in the cache

[Figure: way 0 holds line 1000 = 55 with T = 0; the overflow hashtable is empty. Running example as above]

Page 68: ECE8833 Polymorphous and Many-Core Computer Architecture

68

Cache Overflow Mechanism

• Transactional read sets the T bit

[Figure: line 1000 = 55 now has T = 1]

Page 69: ECE8833 Polymorphous and Many-Core Computer Architecture

69

Cache Overflow Mechanism

• Expect most transactional writes to fit in the cache

[Figure: way 0 holds 1000 = 55 (T = 1); way 1 holds 2000 = 66 (T = 1)]

Page 70: ECE8833 Polymorphous and Many-Core Computer Architecture

70

Cache Overflow Mechanism

• A conflict miss occurs
• Overflow sets the O bit
• Replacement takes place (LRU)
• Old data is spilled to DRAM (the overflow hashtable)

[Figure: way 0 now holds 3000 = 77 (T = 1), way 1 holds 2000 = 66 (T = 1), O = 1; the overflow hashtable holds 1000 = 55]

Page 71: ECE8833 Polymorphous and Many-Core Computer Architecture

71

Cache Overflow Mechanism

• A miss to an overflowed line checks the overflow table
• If found, swap (like a victim cache)
• Else, proceed as a normal miss

[Figure: 1000 = 55 is swapped back into way 0 (T = 1); the overflow hashtable now holds 3000 = 77; O = 1]

Page 72: ECE8833 Polymorphous and Many-Core Computer Architecture

72

Cache Overflow Mechanism

• Abort
  – Invalidate all lines with T set (assume L2 or lower-level memory contains the original values)
  – Discard the overflow hashtable
  – Clear the O and T bits
• Commit
  – Write back the hashtable; NACK interventions during this
  – Clear the O and T bits in the cache

[Figure: after commit, the T and O bits are cleared (1000 = 55 and 2000 = 66 remain in the cache) and the overflow entry 3000 = 77 is written back toward L2]
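The abort and commit actions above, written out as a sketch (hypothetical C helpers mirroring the bullets; not an actual LTM interface):

    /* Hypothetical helpers standing in for LTM cache/overflow hardware actions. */
    void invalidate_lines_with_T_set(void);    /* originals live in L2 or below         */
    void discard_overflow_hashtable(void);
    void writeback_overflow_hashtable(void);   /* NACK external interventions meanwhile */
    void clear_O_and_T_bits(void);

    void ltm_abort(void)
    {
        invalidate_lines_with_T_set();
        discard_overflow_hashtable();
        clear_O_and_T_bits();
    }

    void ltm_commit(void)
    {
        writeback_overflow_hashtable();
        clear_O_and_T_bits();
    }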

Page 73: ECE8833 Polymorphous and Many-Core Computer Architecture

73

LTM vs. Lock-based

