Software Transactional Memory TiC 2010 Adam Welc Programming Systems Lab Intel Labs.

Page 1: Software Transactional Memory TiC 2010 Adam Welc Programming Systems Lab Intel Labs.

Software Transactional Memory

TiC 2010

Adam Welc

Programming Systems Lab, Intel Labs

Page 2: Agenda

Part 1: STM Overview
– Introduction
– Language Constructs and Semantics
– Design space

Part 2: STM Implementation
– Runtime
– Compiler
– Performance

Page 3: Concurrent Programming Today

• Mutual exclusion locks (Java monitors, pthread locks, etc.) are used for concurrency control
– Coarse-grained locking limits concurrency
– Fine-grained locking is hard: composability, possibility of deadlocks, etc.

• Transactional Memory (TM) offers an alternative

Page 4: Designing Map Structure

• Operations

    get(Key k)             { seqGet(k); }
    put(Key k, Value v)    { seqPut(k, v); }
    remove(Key k)          { seqRemove(k); }

    T1: m.get(k);
    T2: m.put(k,v);
    T3: m.remove(k);

• How to make it thread-safe?

Page 5: ConcurrentMap Class

    synchronized Value get(Key k) {
        return seqGet(k);
    }

    synchronized void put(Key k, Value v) {
        seqPut(k, v);
    }

    synchronized void remove(Key k) {
        seqRemove(k);
    }

What if the workload is mostly read-only?

Page 6: Refined ConcurrentMap Class

    Value get(Key k) {
        // try unsynchronized
        Value tmp = seqGet(k);
        if (tmp != null) return tmp;
        else synchronized(this) {
            // possible interference
            return seqGet(k);
        }
    }

    void put(Key k, Value v) {
        synchronized(this) {
            seqPut(k, v);
        }
    }

    void remove(Key k) {
        synchronized(this) {
            seqRemove(k);
        }
    }

Page 7: Actual Code

    public Object get(Object key) {
        int hash = hash(key);
        // Try first without locking...
        Entry[] tab = table;
        int index = hash & (tab.length - 1);
        Entry first = tab[index];
        Entry e;

        for (e = first; e != null; e = e.next) {
            if (e.hash == hash && eq(key, e.key)) {
                Object value = e.value;
                if (value != null)
                    return value;
                else
                    break;
            }
        }
        …
        // Recheck under synch if key not there or interference
        Segment seg = segments[hash & SEGMENT_MASK];
        synchronized(seg) {
            tab = table;
            index = hash & (tab.length - 1);
            Entry newFirst = tab[index];
            if (e != null || first != newFirst) {
                for (e = newFirst; e != null; e = e.next) {
                    if (e.hash == hash && eq(key, e.key))
                        return e.value;
                }
            }
            return null;
        }
    }

DO YOU REALLY WANT TO WRITE THIS KIND OF CODE?

Page 8: Composition

• Simple concurrent accesses work

• Consider concurrent value deposit (initially map.get(k) == 100)

T1:
    int v1 = map.get(k);    // == 100
    v1 += 10;
    map.put(k, v1);         // == 110

T2:
    int v2 = map.get(k);    // == 100
    v2 += 20;
    map.put(k, v2);         // == 120

• Both threads read 100; T2 stores 120, then T1 stores 110, so T2's deposit IS LOST

• Fix: wrap each read-modify-write sequence in synchronized(map) { }
– Back to coarse-grained locking

Page 9: TM Approach

    get(Key k)             { __tm_atomic { seqGet(k); } }
    put(Key k, Value v)    { __tm_atomic { seqPut(k, v); } }
    remove(Key k)          { __tm_atomic { seqRemove(k); } }

    __tm_atomic {
        int v = map.get(k);
        v += amount;
        map.put(k, v);
    }

Let the TM system take care of the rest
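The composed deposit can be tried out in standard C++ without a TM compiler. The sketch below is only an approximation: `atomically` is a hypothetical helper, not part of any STM, and it emulates an atomic block with one global recursive lock (the single-global-lock view of transactions discussed later in the deck). The `Map` class, `deposit`, and `run_deposits` are likewise illustrative names.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for __tm_atomic: runs `body` under one global lock.
// The mutex is recursive so that atomic blocks can nest.
inline std::recursive_mutex& global_tm_lock() {
    static std::recursive_mutex m;
    return m;
}

template <typename F>
auto atomically(F&& body) -> decltype(body()) {
    std::lock_guard<std::recursive_mutex> guard(global_tm_lock());
    return body();
}

// Map whose individual operations are atomic blocks.
class Map {
    std::map<int, int> data_;  // plays the role of the sequential map
public:
    int get(int k) { return atomically([&] { return data_[k]; }); }
    void put(int k, int v) { atomically([&] { data_[k] = v; }); }
};

// Composed operation: the whole read-modify-write is one atomic block,
// so no other deposit can slip in between get and put.
inline void deposit(Map& map, int k, int amount) {
    atomically([&] {
        int v = map.get(k);
        v += amount;
        map.put(k, v);
    });
}

// Starts `threads` threads, each depositing `amount` into key 0 `times` times,
// and returns the final value (the initial value is 100, as on the slide).
inline int run_deposits(int threads, int times, int amount) {
    Map map;
    map.put(0, 100);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < times; ++i) deposit(map, 0, amount);
        });
    for (auto& th : pool) th.join();
    return map.get(0);
}
```

Calling `run_deposits(2, 1, 10)` mirrors the two-thread scenario: no deposit is lost because each read-modify-write executes as a unit. A real STM would aim for the same result while letting non-conflicting transactions run in parallel.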

Page 11: Managed vs. Unmanaged STM

• Same core semantics and language constructs (and algorithms)

• Managed (e.g. Java, .NET)
– Controlled execution of native code
– Dynamic compilation

• Unmanaged (e.g. C, C++)
– Problem with legacy binaries
– Have to know upfront if code is executed transactionally

Page 12: Atomic Blocks == Transactions

• Originally a database concept

• Transactional executions
– Atomic
– Consistent
– Isolated
– Durable

• Serial vs. serializable
– Serializable: appearance of serial

Page 13: Serial Execution

    int x = 0; int y = 0;

T1:
    __tm_atomic {
        int tmp1 = x;
        int tmp2 = y;
    }

T2:
    __tm_atomic {
        x = 42;
        y = 42;
    }

• T2 before T1: tmp1 == 42, tmp2 == 42
• T1 before T2: tmp1 == 0, tmp2 == 0

BOTH RESULTS CORRECT

Page 14: Serializable Execution

    int x = 0; int y = 0;

T1:
    __tm_atomic {
        int tmp1 = x;
        int tmp2 = y;
    }

T2:
    __tm_atomic {
        x = 42;
        y = 42;
    }

• The two transactions interleave, yet tmp1 == 42 and tmp2 == 42, which matches the serial order T2, T1

BOTH RESULTS THE SAME DESPITE INTERLEAVING

Page 15: Non-Serializable Execution

    int x = 0; int y = 0;

Interleaving (time runs downward):

    T2: __tm_atomic { x = 42;
    T1: __tm_atomic { int tmp1 = x;       // == 42
    T1:               int tmp2 = y; }     // == 0
    T2:               y = 42; }

• tmp1 == 42 but tmp2 == 0: DIFFERENT FROM ANY SERIAL execution

• TM's role is to "fix" conflicting executions: on ! CONFLICT !, ROLL BACK
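The three executions above can be checked mechanically: an observation by T1 is acceptable only if some serial order of T1 and T2 produces it. The following is a minimal sketch for this two-transaction example only; `State`, `Observation`, and the function names are illustrative and not taken from the slides.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Shared state from the slides: two integers, both initially 0.
struct State {
    int x = 0;
    int y = 0;
};

// What the reading transaction T1 observed: (tmp1, tmp2).
using Observation = std::pair<int, int>;

// T1 reads x then y; T2 writes 42 to both.
inline Observation run_t1(const State& s) { return Observation(s.x, s.y); }
inline void run_t2(State& s) { s.x = 42; s.y = 42; }

// Every observation T1 can make when T1 and T2 run one after the other.
inline std::vector<Observation> serial_observations() {
    std::vector<Observation> out;
    {
        State s;                      // serial order T1, T2
        out.push_back(run_t1(s));
        run_t2(s);
    }
    {
        State s;                      // serial order T2, T1
        run_t2(s);
        out.push_back(run_t1(s));
    }
    return out;
}

// An execution is serializable if its observation matches some serial order.
inline bool is_serializable(const Observation& seen) {
    for (const Observation& o : serial_observations())
        if (o == seen) return true;
    return false;
}
```

With this helper, `is_serializable(Observation(42, 42))` and `is_serializable(Observation(0, 0))` hold, while `Observation(42, 0)` from the interleaving above is rejected, which is the case a TM system has to detect and roll back.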

Page 16: Transaction Nesting

• Required for composability

• Open nesting
– Results exposed upon inner transaction commit
– Compensating actions used upon outer transaction abort
– May lead to serializability violations

• Closed nesting
– Computation results exposed only upon outermost transaction commit
– Transactions can be flattened: the inner transaction is semantically a no-op

Page 17: Open Nesting

    int x = 0;
    void inc() { x++; }
    void dec() { x--; }

T1:
    __tm_atomic {
        __tm_atomic { inc(); }    // register dec()
    }
    dec();                        // compensating action if the outer transaction aborts

T2:
    __tm_atomic { if (x == 1) { … } }

• Conditional can be entered after inner commit

• Effect is undone but T2 has seen the result!

Page 18: Closed Nesting

    int x = 0;
    void inc() { x++; }
    void dec() { x--; }

T1:
    __tm_atomic {
        __tm_atomic { inc(); }
    }

T2:
    __tm_atomic { if (x == 1) { … } }

• Conditional can be entered only after outermost commit

Page 19: Flatten Or Not To Flatten?

[Diagram: an outer __tm_atomic { } block containing an inner __tm_atomic { } block with a potential conflict; ROLL BACK is shown twice, presumably once for the inner transaction alone and once for the whole flattened transaction]

Page 20: More on Execution Semantics

• Transactions are serializable, but

• The notion comes from the database world, where all actions are transactional

• What about non-transactional code?

Page 21: Problematic Behavior

    int * p = &x;

T1:
    __tm_atomic {
        if (p != NULL)      // true
            tmp = *p;       // p == null by now: NULL POINTER
    }

T2 (non-transactional):
    p = null;

Should this behavior be allowed?
– Yes: this program is buggy, p = null should be inside a transaction
– No: transactions should be atomic no matter what

Page 22: Two Points of View on Atomicity

• Weak atomicity
– Transactions serializable with respect to other transactions

• Strong atomicity
– Transactions serializable with respect to all memory accesses

[Diagram: STRENGTH axis running from WEAK ATOMICITY to STRONG ATOMICITY]

Page 23: Weak Atomicity

• Non-transactional accesses bypass the STM access protocol
– Non-transactional code remains un-instrumented
– Most STMs behave this way

• Requires segregation of transactional and non-transactional data
– Hard to enforce

• Otherwise, behavior depends on implementation
– Unexpected results can be observed

Page 24: Non-Repeatable Read

    int x = 0;

T1:
    __tm_atomic {
        tmp1 = x;
        tmp2 = x;
    }

T2 (non-transactional):
    x = 42;

• Non-txn code can affect transactional computation

• Possible outcomes: tmp1 == tmp2 (both 0 or both 42), or tmp1 != tmp2 (tmp1 == 0, tmp2 == 42) when x = 42 lands between the two reads

Page 25: Dirty Read

    int x = 0;

T1:
    __tm_atomic {
        x++;        // x == 1
        x++;        // x == 2
    }

T2 (non-transactional):
    tmp = x;

• Txn code can leak intermediate results to non-transactional computation

• Possible outcomes: tmp is even (0 or 2), or tmp is odd (tmp == 1) when the read lands between the two increments

Page 26: Strong Atomicity

• Non-transactional accesses turned into micro-transactions
– Reads and writes block until write gets committed
– Interleaved writes can invalidate a transaction

• Avoids all undesirable behaviors of weak atomicity, but

• All code needs to be instrumented

Page 27: Non-Repeatable Read

    int x = 0;

T1:
    __tm_atomic {
        tmp1 = x;       // == 0
        tmp2 = x;
    }

T2 (micro-transaction):
    __tm_atomic { x = 42; }

• Write by T2 invalidates T1's transaction: ROLL BACK

Page 28: Dirty Read

    int x = 0;

T1:
    __tm_atomic {
        x++;        // x == 1
        x++;        // x == 2
    }

T2 (micro-transaction):
    __tm_atomic { tmp = x; }    // BLOCKs until T1 commits, then tmp == 2

• Blocking effectively reschedules and serializes non-transactional operations

Page 29: Are We Done?

• Overhead of strong atomicity can be huge (up to 10x slowdown)

• Non-txn code instrumentation may be problematic (precompiled libraries, system calls, etc.)

• Can we find an in-between solution?

[Diagram: STRENGTH axis from WEAK ATOMICITY to STRONG ATOMICITY, with SGLA in between]

Page 30: Single Global Lock Atomicity

• Transactions execute as if protected by a single global lock

    __tm_atomic { S; }    behaves like    synchronized(m) { S; }

• Matches intuition of weakly atomic STM
– Transactions are serialized w.r.t. each other
– And, no surprises compared to locks

• STM must provide additional guarantees
– Consistency
– Privatization safety

Page 31: Consistency

    int *ptr = NULL;
    int x = 0; int y = 1;

T1:
    __tm_atomic {
        int t1 = x;
        int t2 = x;
        if (t1 != t2)       // cannot happen
            *ptr = x;
    }

T2:
    __tm_atomic {
        x = y;
    }

Lock-based equivalents:

T1:
    lock(mutex);
    int t1 = x;
    int t2 = x;
    if (t1 != t2)           // cannot happen
        *ptr = x;
    unlock(mutex);

T2:
    lock(mutex);
    x = y;
    unlock(mutex);

• If T1 nevertheless sees t1 == 0 and t2 == 1, it dereferences ptr: NULL POINTER

Page 32: Privatization Safety

[Diagram: head points to a list node with fields x == 0, y == 0 and next; t1 and t2 both point to that node]

T1:
    __tm_atomic {
        t1 = head;
        if (t1)
            t1->x = t1->y = 1;
    }

T2:
    __tm_atomic {
        t2 = head;
        head = t2->next;
        t2->next = NULL;
    }
    priv = t2->x;
    …
    assert (priv == t2->y);

Lock-based equivalents:

T1:
    lock(mutex);
    t1 = head;
    if (t1)
        t1->x = t1->y = 1;
    unlock(mutex);

T2:
    lock(mutex);
    t2 = head;
    head = t2->next;
    t2->next = NULL;
    unlock(mutex);
    priv = t2->x;
    …
    assert (priv == t2->y);

• Without privatization safety, T1's writes can land between T2's non-transactional reads (e.g. priv == 0 but t2->y == 1), so the assert fails; with locks this cannot happen

Page 34: Transactional Execution Modes

Optimistic
– Lock data on write (exclusive write locks)
– Record reads
– Release write locks and validate reads on commit
– Pros: cache effects; no read locking cost
– Cons: providing privatization and consistency incurs extra cost; no filtering

Pessimistic
– Lock data on write (exclusive write locks)
– Lock data on read (shared read locks)
– Release read and write locks on commit
– Pros: privatization safety and consistency for free; filtering
– Cons: cache effects; additional read locking cost

• Obstinate: a pessimistic transaction that wins all conflicts

Page 35: Write Buffering vs. In-Place Update

Write Buffering (a.k.a. Lazy Versioning)
– Write to private buffer
– Copy to memory on commit
– Lazy Locking (acquire locks on commit) or Eager Locking (acquire locks on access)
– Pros: fast abort
– Cons: slow commit; reads have to search buffer

In-Place Update (a.k.a. Eager Versioning)
– Directly write shared memory
– Record old values in an undo log
– Eager Locking: acquire write-locks on write
– Pros: fast commit; direct reads
– Cons: slow abort
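The trade-off between the two versioning schemes can be made concrete with a small sketch. It is a single-threaded illustration only: locking and conflict detection are deliberately left out, and the class names `WriteBufferTxn` and `InPlaceTxn` are invented for this example rather than taken from any runtime.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Lazy versioning: writes go to a private buffer and reach memory on commit.
class WriteBufferTxn {
    std::unordered_map<int*, int> buffer_;  // address -> pending value
public:
    void write(int* addr, int val) { buffer_[addr] = val; }
    int read(int* addr) const {             // reads have to search the buffer
        auto it = buffer_.find(addr);
        return it != buffer_.end() ? it->second : *addr;
    }
    void commit() {                         // slow commit: copy buffer to memory
        for (const auto& entry : buffer_) *entry.first = entry.second;
        buffer_.clear();
    }
    void abort() { buffer_.clear(); }       // fast abort: just drop the buffer
};

// Eager versioning: writes update memory directly and log the old value.
class InPlaceTxn {
    std::vector<std::pair<int*, int>> undo_log_;  // (address, old value)
public:
    void write(int* addr, int val) {
        undo_log_.emplace_back(addr, *addr);
        *addr = val;
    }
    int read(int* addr) const { return *addr; }   // direct read
    void commit() { undo_log_.clear(); }          // fast commit: discard the log
    void abort() {                                // slow abort: undo newest first
        for (auto it = undo_log_.rbegin(); it != undo_log_.rend(); ++it)
            *it->first = it->second;
        undo_log_.clear();
    }
};
```

The undo log is replayed in reverse so that a location written several times ends up with its oldest value, which is why abort is the expensive path for in-place update.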

Page 36: Conflict Detection Granularity

• Object-based (Java/C#)

    class Foo { int x; int y; }

[Diagram: object layout with vtbl, metadata, x, y]

• Word-based (cacheline-based) (C/C++)

    struct Foo { int x; int y; }

[Diagram: fields x and y mapped to metadata entries in a separate Owner Table]

Page 38: Intel C/C++ STM

http://whatif.intel.com (NEW RELEASE IN Q3 2010)

• Based on Intel's product compiler

• Features
– Consistency and privatization safety preserving closed-nested atomic blocks (__tm_atomic) to support SGLA semantics
– User abort (__tm_abort) for failure atomicity
– Transaction retry (__tm_retry) for condition synchronization
– Multiple transactional execution modes: optimistic and pessimistic STM, obstinate
– Serial execution mode (for I/O and calls to legacy binaries)
– TM support for C++: virtual functions, (multiple) inheritance, function and class templates, exceptions

Page 39: System Architecture

[Diagram: layered stack. Layers: APPLICATION, LANGUAGE SUPPORT, TM RUNTIME, HARDWARE. Boxes: transactional C/C++, Intel C/C++ compiler, C/C++ support, multicore system]

Page 40: Runtime Overview

• In-place updates

• Cacheline-level conflict detection granularity

• Information for rollback recorded in undo log

• Reads recorded in read set:
– For validation (optimistic mode)
– For locking/unlocking (pessimistic and obstinate modes)

• Writes recorded in write set for locking/unlocking (all transactional modes)

• Two-phase locking (2PL) protocol

Page 41: Per thread metadata

• Transaction Descriptor
– Read set: validation or unlocking
– Write set: unlocking
– Undo log: rollback
– … local timestamp, execution mode …

• Transaction Memento
– Checkpoint of machine and transaction state
– For nesting & partial rollback

Page 42: Transaction Record (TxnRec)

• Tracks transactional state of shared data

– For optimistic transactions (OptTxnRec)
    • Unlocked: contains timestamp (more on this later!)
    • Write-locked: contains transaction descriptor of lock owner

– For pessimistic transactions (PessTxnRec)
    • Unlocked: contains special mark
    • Read-locked: contains info about all readers
    • Write-locked: contains info about single writer

• Stored in the owner table, which maps each memory word to a single transaction record

Page 43: Optimistic STM Algorithm

• Timestamp-based
– Global Timestamp (G_TS): incremented every time a writing transaction commits
– Local Timestamp (L_TS): records the last time the transaction was valid
– On a transactional read of shared data, record the timestamp associated with its OptTxnRec in the transaction's read set
– On transaction termination, update local timestamps and write them to the OptTxnRec-s of all data updated by this transaction

• Validation for serializability and consistency

• Quiescence for privatization safety

Page 45: Validation

• For every entry in read set, abort transaction if recorded timestamp greater than local timestamp

• Performed on commit to guarantee serializability

• Performed on read to guarantee consistency (when data's OptTxnRec > local timestamp)
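The validation rule can be sketched in a few lines. This is one possible reading of the bullets above, not the runtime's actual code: the record is modelled as a bare timestamp (the lock bit is ignored), everything is single-threaded, and the names `ReadEntry`, `OptimisticTxn`, `validate`, and `tx_read` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Timestamp = std::uint64_t;

// One read-set entry: which record was read and the timestamp seen at that time.
struct ReadEntry {
    const Timestamp* txn_rec;  // the datum's OptTxnRec (simplified to a timestamp)
    Timestamp recorded;        // timestamp observed when the read happened
};

struct OptimisticTxn {
    Timestamp local_ts = 0;            // L_TS: last time the transaction was valid
    std::vector<ReadEntry> read_set;
};

// Valid if no datum in the read set carries a timestamp newer than L_TS,
// i.e. nothing read so far was overwritten by a later commit.
inline bool validate(const OptimisticTxn& txn) {
    for (const ReadEntry& e : txn.read_set)
        if (*e.txn_rec > txn.local_ts) return false;
    return true;
}

// Optimistic read barrier. Returns false if the transaction must abort.
inline bool tx_read(OptimisticTxn& txn, const int* addr, const Timestamp* rec,
                    Timestamp global_ts, int* out) {
    if (*rec > txn.local_ts) {        // datum is newer than our snapshot
        if (!validate(txn)) return false;
        txn.local_ts = global_ts;     // still consistent: extend snapshot to now
    }
    txn.read_set.push_back({rec, *rec});
    *out = *addr;
    return true;
}
```

In the consistency example, T1 reads x with L_TS = 0, T2 then commits x = y and stamps x's record with 1, and T1's second read of x fails validation, so T1 aborts before it can observe t1 != t2.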

Page 46: Validation

T1:
    __tm_atomic {
        int t1 = x;
        int t2 = x;
        if (t1 != t2)       // cannot happen
            *ptr = x;       // NULL POINTER
    }

T2:
    __tm_atomic {
        x = y;
    }

[Diagram: G_TS advances from 0 to 1; OptTxnRec-s for x and y; T1 has L_TS = 0 and R_SET = <&x>; T2 has L_TS = 0, R_SET = <&y>, W_SET = <&x>, and L_TS = 1 after commit; T1's second read of x finds timestamp 1 in x's OptTxnRec, validation fails, and T1 ABORTs]

Page 48: Quiescence

• Maintain list of active transactions containing their current local timestamp

• Implicit infinite timestamp for pessimistic transactions

• Committing transaction waits for all active transactions whose timestamp is smaller than its own timestamp
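The waiting rule can be written down as a small helper that, given the list of active transactions, returns the ones a committing transaction has to wait for. This is an illustrative sketch only; `ActiveTxn`, `kInfiniteTs`, and `quiescence_wait_set` are assumed names, and a real runtime would spin or block on these transactions rather than return their ids.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

using Timestamp = std::uint64_t;

// Pessimistic transactions carry an implicit infinite timestamp,
// so nobody ever has to wait for them.
constexpr Timestamp kInfiniteTs = std::numeric_limits<Timestamp>::max();

struct ActiveTxn {
    int id;
    Timestamp local_ts;  // current local timestamp (kInfiniteTs if pessimistic)
};

// Ids of the active transactions that a committer with timestamp `commit_ts`
// must wait for: all those whose timestamp is smaller than its own.
inline std::vector<int> quiescence_wait_set(const std::vector<ActiveTxn>& active,
                                            int self_id, Timestamp commit_ts) {
    std::vector<int> wait_for;
    for (const ActiveTxn& t : active)
        if (t.id != self_id && t.local_ts < commit_ts)
            wait_for.push_back(t.id);
    return wait_for;
}
```

For the privatization example, a committing T2 with timestamp 1 would wait for an optimistic T1 still at timestamp 0, which keeps T1's late writes from racing with T2's non-transactional accesses to the privatized node.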

Page 49: Quiescence

T1:
    __tm_atomic {
        t1 = head;
        if (t1)
            t1->x = t1->y = 1;
    }

T2:
    __tm_atomic {
        t2 = head;
        head = t2->next;
        t2->next = NULL;
    }
    priv = t2->x;
    assert (priv == t2->y);

[Diagram: G_TS and the L_TS values of T1 and T2 over time; T2's commit advances G_TS from 0 to 1 while T1's L_TS is still 0, so T2 must WAIT for T1 before running its non-transactional code; G_TS reaches 2 once T1 commits]

Page 50: Unified STM

• Both optimistic and pessimistic readers can co-exist

• Owner table is shared and contains both OptTxnRec and PessTxnRec

• Read barriers:
– Optimistic: reads only OptTxnRec
– Pessimistic: reads only PessTxnRec

• Write barriers need to write both TxnRec-s

Page 51: Owner Table for Unified STM

    typedef uintptr_t TxnRec;
    typedef struct OwnerTableEntryS {
        TxnRec optimistic;
        TxnRec pessimistic;
    } OwnerTableEntry;

[Diagram: Owner Table whose entries each hold a PessTxnRec and an OptTxnRec]

Page 52: OptTxnRec

Bits 31 … 1: upper bits
– Owner TxnDesc upper bits
– Or timestamp upper bits

Bit 0: lock bit
– 0: Write-locked (Exclusive)
– 1: Unlocked (Shared)

Page 53: PessTxnRec

Bits 31 … 2: owner bits
– Each bit represents a pessimistic transaction
– Locked if non-zero

Bit 1: upgrading bit
– 0: no upgrading request
– 1: upgrading requested

Bit 0: lock bit
– 0: Write-locked (Exclusive)
– 1: Unlocked (Shared)
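The two record layouts translate naturally into bit-manipulation helpers. The sketch below follows the bit positions given on the slides, but the exact encoding is an assumption: storing the timestamp shifted left by one bit and mapping a transaction slot `i` to owner bit `i + 2` are illustrative choices, and all function names are made up for this example.

```cpp
#include <cassert>
#include <cstdint>

using TxnRec = std::uintptr_t;

constexpr TxnRec kLockBit = 0x1;     // bit 0: 1 = unlocked (shared), 0 = write-locked
constexpr TxnRec kUpgradeBit = 0x2;  // bit 1 (PessTxnRec only): upgrade requested

// --- OptTxnRec ---------------------------------------------------------
// Unlocked: timestamp in the upper bits, lock bit set (assumed encoding).
inline TxnRec opt_make_unlocked(std::uintptr_t timestamp) {
    return (timestamp << 1) | kLockBit;
}
inline std::uintptr_t opt_timestamp(TxnRec r) { return r >> 1; }
// Write-locked: the owner's descriptor pointer; alignment keeps bit 0 clear.
inline bool opt_is_write_locked(TxnRec r) { return (r & kLockBit) == 0; }

// --- PessTxnRec --------------------------------------------------------
// Each owner bit (bits 2 and up) stands for one pessimistic transaction.
inline TxnRec pess_owner_bit(unsigned slot) { return TxnRec{1} << (slot + 2); }
inline TxnRec pess_owner_bits(TxnRec r) { return r >> 2; }
inline bool pess_has_owners(TxnRec r) { return pess_owner_bits(r) != 0; }
inline bool pess_is_write_locked(TxnRec r) { return (r & kLockBit) == 0; }
inline bool pess_upgrade_requested(TxnRec r) { return (r & kUpgradeBit) != 0; }
```

Keeping the lock state in bit 0 means a barrier can classify a record with a single mask test before it decides whether the remaining bits are a timestamp, a descriptor pointer, or a reader bitmap.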

Page 54: Unified STM Algorithm

    int x = 0;

T1 (PESS):
    __tm_atomic {
        r1 = x;
        r3 = x;
    }

T2 (OPT):
    __tm_atomic {
        r2 = x;
        x = r2 + 1;
    }

[Diagram: owner table entry for x showing the bit patterns of its PessTxnRec and OptTxnRec as T1 (pessimistic reader) and T2 (optimistic writer) access x]

Page 56: Compiler/Runtime Interaction

• Decouple compiler from the runtime
– Enables use of different library implementations with the same compiler (e.g. in-place updates vs. write-buffering)
– Enables use of different algorithms within the library itself (e.g. optimistic vs. pessimistic)

• Calls to the runtime realized through a vtable-like mechanism

• Compiler/runtime ABI:
– General: same code used for different algorithms
– Rich: to enable additional optimizations

Page 57: ABI: Txn Begin and Commit

    _ITM_transaction * _ITM_getTransaction()
– Returns (creates if necessary) a transaction descriptor

    uint32 _ITM_beginTransaction(_ITM_transaction* td, uint32 props)
– Saves machine state
– Passes information to the runtime via props (e.g. pr_multiwayCode: both instrumented and uninstrumented code is available)
– Can return more than once (e.g. on abort); possible return values: a_saveLiveVariables, a_restoreLiveVariables

    void _ITM_commitTransaction(_ITM_transaction *td)

Page 58: ABI: Read and Write Barriers

• Templates:

    void _ITM_Wtypesig(_ITM_transaction* td, type *addr, type val)
    type _ITM_Rtypesig(_ITM_transaction* td, type *addr)

  typesig:
– U[1248]: unsigned int
– [FDE]: float, double, long…

• Examples:

    _ITM_WF(_ITM_transaction *td, float *addr, float val);
    _ITM_RU4(_ITM_transaction *td, uint32 *addr);

Page 59: Simple Atomic Block Translated

Source:

    __tm_atomic {
        uint32Val = 42;
    }

Translation:

    uint32 props = pr_multiwayCode;
    _ITM_transaction *td = _ITM_getTransaction();
    uint32 doWhat = _ITM_beginTransaction(td, props);
    if (doWhat & a_restoreLiveVariables) {
        /* code to restore live local variables */
    }
    if (doWhat & a_saveLiveVariables) {
        /* code to save live local variables */
    }
    _ITM_WU4(td, &uint32Val, 42);
    _ITM_commitTransaction(td);

• On ! CONFLICT !, control returns from _ITM_beginTransaction again

Page 60: User Abort and Retry Translated

Source:

    __tm_atomic {
        if (cond) __tm_retry;
        if (error) __tm_abort;
        uint32Val = 42;
    }

Translation:

    uint32 props = pr_multiwayCode;
    _ITM_transaction *td = _ITM_getTransaction();
    uint32 doWhat = _ITM_beginTransaction(td, props);
    if (doWhat & a_restoreLiveVariables) {
        /* code to restore live local variables */
    }
    if (doWhat & a_abortTransaction) goto ABORT_TXN;
    if (doWhat & a_saveLiveVariables) {
        /* code to save live local variables */
    }
    if (!_ITM_RU(td, &cond))
        _ITM_abortTransaction(td, userRetry);
    if (_ITM_RU(td, &error))
        _ITM_abortTransaction(td, userAbort);
    _ITM_WU4(td, &uint32Val, 42);
    _ITM_commitTransaction(td);
    ABORT_TXN:

Page 61: Optimizations for Transactions

• Standard optimizations
– Careful IR design enables existing optimizations
    • Partial redundancy elimination (PRE), dead code elimination, …
– Subtle in presence of nesting

• STM-specific optimizations
– No instrumentation when executing in serial mode
– Conversion of generic STM read/write barriers to cheaper variants
– Also:
    • Flattening nested transactions if no user abort is inside
    • Barrier elimination for __thread (thread local) or const data

Page 62: Un-instrumented Serial Mode

Source:

    __tm_atomic {
        if (flag) {
            printf("Hello!");
        }
    }

Translation:

    uint32 props = pr_multiwayCode;
    _ITM_transaction *td = _ITM_getTransaction();
    uint32 doWhat = _ITM_beginTransaction(td, props);
    if (doWhat & a_restoreLiveVariables) {
        /* code to restore live local variables */
    }
    if (doWhat & a_saveLiveVariables) {
        /* code to save live local variables */
    }
    if (doWhat & a_instrumentedCode) {
        if (_ITM_RU4(td, &flag)) {
            _ITM_changeTransactionMode(td, modeSerialIrrevocable);
            printf("Hello!");
        }
    } else {
        if (flag) printf("Hello!");
    }
    _ITM_commitTransaction(td);

Page 63: ABI: Optimized Barrier Templates

• After read or after write (e.g. eliminate redundant locking operations)

    void _ITM_W{aRW}typesig(_ITM_transaction* td, type *addr, type val)
    type _ITM_R{aRW}typesig(_ITM_transaction* td, type *addr)

• Read-for-write (e.g. acquire write lock early and eliminate read lock)

    type _ITM_RfWtypesig(_ITM_transaction* td, type *addr)

Page 64: Barrier Optimization Example

Source:

    __tm_atomic {
        if (x < N) {
            x++;
        }
    }

Generic barriers:

    …
    t1 = _ITM_RU4(td, &x);
    if (t1 < N) {
        t2 = _ITM_RU4(td, &x);
        _ITM_WU4(td, &x, t2+1);
    }
    …

Redundant read removed:

    …
    t1 = _ITM_RU4(td, &x);
    if (t1 < N) {
        _ITM_WU4(td, &x, t1+1);
    }
    …

Write-after-read barrier:

    …
    t1 = _ITM_RU4(td, &x);
    if (t1 < N) {
        _ITM_WaRU4(td, &x, t1+1);
    }
    …

Read-for-write and write-after-write barriers:

    …
    t1 = _ITM_RfWU4(td, &x);
    if (t1 < N) {
        _ITM_WaWU4(td, &x, t1+1);
    }
    …

Page 65: ABI: Undo and Commit Functions

• Programmers may register actions executed by the runtime on transaction termination

    void _ITM_addUserCommitAction(_ITM_transaction *td, _ITM_userCommitFunction fn, _ITM_transactionId tid, void *arg)
    void _ITM_addUserUndoAction(_ITM_transaction *td, _ITM_userUndoFunction, void *arg)

• Current transaction id

    _ITM_transactionId _ITM_getTransactionId(_ITM_transaction *tid)

  (1: non-txn, 2: outer txn begin, ++: inner txn begin)

• Undo and commit actions can be used inside of function wrappers

Page 66: Transactional Function Wrappers

• Transparently replace a call to a non-transactional function with a call to its transactional version

• Transactional wrapper's code:
– Un-instrumented
– Can use explicit calls to the runtime

• Intended use: implementation of library functions (e.g. transaction-aware memory management)

    __declspec (tm_wrap(foo)) void fooTxn();

Page 67: Memory Management Risks

• Txn allocation, non-txn de-allocation
– Re-executions leading to multiple allocations but only one de-allocation operation

• Non-txn allocation, txn de-allocation
– Re-executions leading to the same region being de-allocated more than once

• Txn allocation, txn de-allocation
– Combination of two previous cases depending on when re-execution gets triggered

Page 68: Memory Management Algorithm

• Uses the function wrappers mechanism to take advantage of the existing allocators

• Allocation and de-allocation sites marked with tid

• Allocation creates an allocation record
– If allocation record exists on outer commit: remove it
– On abort: de-allocate and remove allocation record

• De-allocation removes allocation record
– De-allocate immediately if txn_id(de-alloc) <= txn_id(alloc)
– Otherwise, de-allocate on commit at the nesting level where the condition holds

Page 69: Safe Memory Management

    p1 = malloc(size);              // allocation record: p1, txn_id 1
    __tm_atomic {                   // txn_id 2
        p2 = malloc(size);          // allocation record: p2, txn_id 2
        __tm_atomic {               // txn_id 3
            free(p2);               // 3 > 2: defer until txn_id <= 2
            p3 = malloc(size);      // allocation record: p3, txn_id 3
            p4 = malloc(size);      // allocation record: p4, txn_id 3
        }
        free(p1);                   // 2 > 1: defer until txn_id <= 1
        free(p3);                   // 2 < 3: execute
        __tm_atomic {               // txn_id 4
            free(p4);               // 4 > 3: defer until txn_id <= 3
        }
    }

Page 70: Functions Code Generation

• tm_callable
– Generate two copies, instrumented (transactional) and uninstrumented (non-transactional)

• tm_pure
– Only generate uninstrumented code: does not cause transaction to go serial

• tm_unknown
– Switch to serial mode before a call is made inside a transaction
– May be promoted to tm_callable or tm_pure by compiler

Page 71: Code Generation for tm_callable

Source:

    __declspec(tm_callable)
    int inc (int *p)
    {
        p++;
    }

Generated code:

    inc:
        jmp inc_$nontxn
        mov eax, MAGIC
        jmp inc_$txn
    inc_$nontxn:
    inc_$txn:

Page 72: Code Generation for tm_pure

Source:

    __declspec(tm_pure)
    int peek(int *p)
    {
        return *p;
    }

Generated code:

    peek:
        jmp peek_$nontxn
        mov eax, MAGIC
        jmp peek_$nontxn
    peek_$nontxn:

Page 73: Indirect Calls

    if (*(fp + MAGIC_OFFSET) == MAGIC) {
        call fp + TXN_TWIN_OFFSET;
    } else {
        switchToSerialMode();
        call fp;
    }

• No overhead for indirect calls outside of transactions

• Same execution mode available across inheritance hierarchy thanks to virtual function overriding rules

• No annotation on function pointers
– Indirect call to non-recompiled tm_pure function causes switch to serial mode

Page 75: TM in Real World

• Realistic workloads: STAMP, SPLASH, and PARSEC benchmark suites (fluid dynamics, raytracing, etc.)

• Performance bottlenecks
– Sometimes we use a single global lock (GLOCK) as a baseline
– Bottleneck discovery performed on optimistic STM only

Page 76: False Conflicts

• Poor scalability due to conflicts: >90% false conflicts

• The same STM had no problems on SPLASH-2

[Charts: Genome and Vacation; Execution Time (s) vs. # threads (1, 2, 4, 8); series: GLOCK, STM]

Page 77: Mapping to TxnRec-s

• Addresses map to a transaction record via a hash function

• Different addresses can map to the same record

[Diagram: 32-bit Address with bit positions 0, 5, 6, 19, 20, 31 marked; Ownership Table with entries 0x0000 to 0x3FFF; each entry holds a Transaction Record plus space reserved to avoid cache line ping-ponging]

Page 78: Refined Hash Function

• 4 additional bits to index into transaction record

• Reduces false conflicts vs. potentially increasing cache ping-ponging

[Diagram: 32-bit Address with bit positions 0, 5, 6, 19, 20, 23, 31 marked; Ownership Table with entries 0x0000 to 0x3FFF; Transaction Record]
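The effect of the refinement can be illustrated with a small sketch of the two mappings. The bit ranges are an assumption read off the diagram labels (bits 6 to 19 as the table index, matching the 0x0000 to 0x3FFF range, and bits 20 to 23 as the four additional bits); the function and type names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kLineShift = 6;    // assumed: 64-byte cache lines
constexpr std::uint32_t kTableBits = 14;   // 0x0000..0x3FFF: 16384 entries
constexpr std::uint32_t kExtraShift = 20;  // assumed: additional bits 20..23
constexpr std::uint32_t kExtraMask = 0xF;

// Original mapping: address bits 6..19 select the owner-table entry.
inline std::uint32_t owner_index(std::uint32_t addr) {
    return (addr >> kLineShift) & ((1u << kTableBits) - 1);
}

// Refined mapping: the same entry, plus one of 16 record slots inside it.
struct RecordSlot {
    std::uint32_t entry;
    std::uint32_t slot;
};

inline RecordSlot refined_slot(std::uint32_t addr) {
    return RecordSlot{owner_index(addr), (addr >> kExtraShift) & kExtraMask};
}

// Two addresses conflict falsely under a mapping if they share a record
// although they lie on different cache lines.
inline bool same_record_old(std::uint32_t a, std::uint32_t b) {
    return owner_index(a) == owner_index(b);
}
inline bool same_record_new(std::uint32_t a, std::uint32_t b) {
    RecordSlot ra = refined_slot(a), rb = refined_slot(b);
    return ra.entry == rb.entry && ra.slot == rb.slot;
}
```

Under these assumptions, two addresses exactly 1 MB apart collide in the original table but land in different slots after the refinement, at the price of packing more independently updated records into the same cache line.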

Page 79: False Conflicts Reduced

[Charts: Genome and Vacation; Execution Time (s) vs. # threads (1, 2, 4, 8); series: GLOCK, STM (old hash), STM (new hash)]

Page 80: Over-Instrumentation

• Compiler generates more barriers than necessary
– Thread-local memory accesses
– Objects alternating between modification and constant phase
– Constant global objects

Transactional Barrier Counts for STAMP

              TxLD (optimal)   TxLD (compiler)   TxST (optimal)   TxST (compiler)   TxLD overhead   TxST overhead
    Genome        58,701,959       624,073,490        2,252,291        19,078,705          10.63x           8.60x
    Kmeans        86,666,710       255,662,754       86,666,710        86,666,711           2.95x           1.00x
    Vacation     785,775,435       925,584,125       26,300,714       122,543,905           1.18x           4.66x

Page 81: __tm_waiver

• No instrumentation for a block or function marked with __tm_waiver

• Allows incremental optimizations but should be used with caution

    __tm_atomic {
        y = ++x;            // instrumented
        __tm_waiver {
            ++local;        // no instrumentation
        }
    }

Page 82: Over-Instrumentation Reduced

• __tm_waiver used for
– thread-local object allocation routines
– quasi-static shared objects

[Charts: Genome and Vacation; Execution Time (s) vs. # threads (1, 2, 4, 8); series: GLOCK, STM (new hash), STM (new hash + __tm_waiver)]

Page 83: Quiescence Overhead

• Only some programs use the privatization idiom

• Provide API to let programmer selectively disable privatization safety

[Chart: speedup (0 to 2) for sphinx, genome, kmeans, vacation, average; series: 2 threads, 4 threads, 8 threads]

Page 84: Other Issues

• Small transactions overwhelmed by fixed costs
– Fluidanimate: ~1 load and ~1 store per transaction
– Different code for small transactions

• Atomic blocks make porting of some benchmarks (e.g., BerkeleyDB) difficult but are more amenable to compiler optimizations

• Annotating transactional functions can be a burden (40% of functions in vacation)

• Many workloads require condition synchronization

Page 85: Finding the Bottlenecks

• Many workloads would not scale at first

• Cumulative stats would shed no light: low contention, no false conflicts, …

• And then we remembered … the devil is in the details …

Page 86: Per Critical Section Statistics

Only critical section 601 suffers from high abort rate and prevents scaling

Transactional Statistics for Sphinx

    critical section   tx_begin    commit    abort   abort %   code size (lines)
    602                    1314      1312        2     0.15%   O(1)
    542                  222481    221043     1438     0.65%   O(1)
    559                  220908    220908        0     0.00%   O(1)
    601                   12306      6194     6112    49.67%   O(1000)
    571                   42917     42889       28     0.07%   O(1)
    588                   42770     42770        0     0.00%   O(1)
    301                    1313      1312        1     0.08%   O(1)

Page 87: Overall Performance

STM vs. single-thread GLOCK

[Chart: speedup (0 to 8) for genome, kmeans/low, kmeans/high, vacation/low, vacation/high, cholesky, fft, lu/cont., lu/non cont., radix, barnes, fmm, ocean/cont., ocean/non cont., radiosity, raytrace, volrend, water-nsquared, water-spatial, fluidanimate; series: 1 thread, 2 threads, 4 threads, 8 threads]
