Making the Fast Case Common and the Uncommon Case Simple in Unbounded Transactional Memory

Qi Zhu
CSE 340, Spring 2008
University of Connecticut

Paper Source: ISCA’07, San Diego, CA

Motivation

Hardware transactional memory has great potential to simplify the creation of correct and efficient multithreaded programs.

Unfortunately, supporting the concurrent execution of an unbounded number of unbounded transactions is challenging. Many proposed implementations are very complex.

History

In 1993, Herlihy and Moss proposed an elegant and efficient hardware implementation of transactional memory. Unfortunately, this implementation limits the volume of data a transaction can access and does not allow transactions to survive a context switch.

Recent Development

Recently, several proposals have emerged to provide unbounded hardware transactions and support overflowed transactions with the same concurrency as non-overflowed transactions. However, these systems do not have simple implementations.

Methodology Proposed

In order to simplify the design, the authors explore a different approach.

The two key concepts the authors propose are:

1. Permissions-only cache
2. OneTM, in two variants: OneTM-Serialized and OneTM-Concurrent

An example execution on three systems

[Figure: execution timelines over time steps t1 to t7 for (a) fully-concurrent overflow, (b) OneTM-Serialized, and (c) OneTM-Concurrent. Red fields indicate a stall; black fields indicate overflow.]

Making the Fast Case Common: Permissions-Only Cache

The permissions-only cache allows a transaction to access more data than can be buffered on-chip without transitioning to a higher-overhead overflowed execution mode.

Only when the permissions-only cache itself overflows does the system need to fall back on some other mechanism for detecting conflicts for overflowed blocks.

Operation

The permissions-only cache is:

(1) read by external coherence requests as part of conflict detection
(2) updated when a transactional block is replaced from the data cache
(3) invalidated on a commit or abort
(4) read on transactional store misses
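
The four operations above can be sketched as a small software model. This is only an illustration under stated assumptions, not the hardware design from the paper: the class name, the `capacity` parameter, and the dictionary layout are hypothetical, and reading operation (4) as a check for write permission already held is an interpretation.

```python
from dataclasses import dataclass, field


@dataclass
class PermissionsOnlyCache:
    """Toy model: tracks read/write permission bits only, never data."""
    capacity: int                               # max tracked blocks (hypothetical)
    perms: dict = field(default_factory=dict)   # block address -> set of "R"/"W"

    def on_data_cache_eviction(self, block, read, written):
        """(2) A transactional block left the data cache: keep its bits here.

        Returns False when this cache is itself full, meaning the
        transaction has to fall back to the overflowed execution mode.
        """
        if block not in self.perms and len(self.perms) >= self.capacity:
            return False
        bits = self.perms.setdefault(block, set())
        if read:
            bits.add("R")
        if written:
            bits.add("W")
        return True

    def conflicts_with_external(self, block, external_is_write):
        """(1) An external request conflicts if it touches a block we wrote,
        or writes a block we read."""
        bits = self.perms.get(block, set())
        return "W" in bits or (external_is_write and "R" in bits)

    def has_write_permission(self, block):
        """(4) Consulted on a transactional store miss (assumed meaning)."""
        return "W" in self.perms.get(block, set())

    def on_commit_or_abort(self):
        """(3) All tracked permissions are discarded at once."""
        self.perms.clear()
```

In this model a `False` return from `on_data_cache_eviction` is the only point at which a transaction would need the slower overflow mechanism.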

Efficient Encoding

Because the permissions-only cache does not contain data, it can more efficiently encode the transactional read/write bits.

By using sector cache techniques, the tag overhead can be reduced dramatically. For example, with good page-level spatial locality, a 4KB permissions-only cache allows a transaction to access up to 1MB of data without overflow.
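
The 4KB-to-1MB figure can be checked with a short calculation. The sketch below assumes two permission bits (one read, one write) per block, the 64B block size listed in the simulated machine configuration, and zero tag overhead; these are simplifying assumptions, so the result is an upper bound that the sector organization approaches only with good spatial locality.

```python
BLOCK_BYTES = 64      # block size from the simulated configuration
BITS_PER_BLOCK = 2    # assumption: one read bit and one write bit per block


def max_tracked_bytes(cache_bytes, block_bytes=BLOCK_BYTES,
                      bits_per_block=BITS_PER_BLOCK):
    """Upper bound on the data covered, ignoring tag overhead entirely."""
    tracked_blocks = (cache_bytes * 8) // bits_per_block
    return tracked_blocks * block_bytes


# 4KB of permission bits -> 16384 blocks -> 1MB of data
print(max_tracked_bytes(4 * 1024) // (1024 * 1024), "MB")
```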

Discussion

By efficiently tracking transactions’ read and write sets, the permissions-only cache increases the size of transactions that can successfully complete without invoking an overflowed execution mode.

With a sufficiently large permissions-only cache, overflowed transactions will likely be rare.

Background on Unbounded Hardware TM

To provide a basis for later comparison, we first take a look at three hardware-based unbounded transactional memory proposals that precisely detect conflicts at the cache block granularity:

1. UTM
2. VTM
3. PTM

UTM

UTM (Unbounded Transactional Memory) was the first transactional memory proposal to support unbounded transactions. UTM maintains its transactional state in a single, shared, memory-resident data structure called the xstate.

VTM

VTM (Virtualizing Transactional Memory) tracks overflowed transactional state using a shared data structure mapped into the virtual address space (called the XADT). Entries in the XADT are allocated when blocks overflow the cache.

Much like the xstate, the XADT also uses linked lists and supports accessing all entries for a specific virtual memory block or all entries for a specific transaction.

PTM

PTM (Unbounded Page-Based Transactional Memory) supports unbounded transactional memory by associating state with physical addresses at the block granularity but allocating/reallocating this shadow state on a per-page basis. PTM’s shadow pages behave similarly to UTM’s log pointers, except that PTM’s Transaction Access Vector (TAV) lists track data for an entire page.

Discussion

These three proposals require the hardware to dynamically allocate/deallocate, maintain, and concurrently manipulate complex linked structures (UTM’s xstate, VTM’s XADT, and PTM’s Transaction Access Vectors) and the corresponding cached versions of these structures.

Discussion

Manipulating and accessing these structures can add overhead to both overflowed transactions and concurrently-executing non-overflowed transactions.

More importantly, the hardware for correctly manipulating these structures is not simple.

Making the Uncommon Case Simple

The authors propose OneTM, a transactional memory system in which only a single overflowed transaction per process can be active at a time. The principal advantage of this design is that the implementation is relatively simple. The impact of the concurrency restrictions on overall system throughput is small because the permissions-only cache ensures that overflow will be the uncommon case.

OneTM-Serialized

The serialized implementation simply stalls all other threads in an application when one of the threads needs to execute an overflowed transaction. At the same time, overflowed transactions still support an explicit abort operation because overflowed transactions continue to log.

OneTM-Serialized

STSW: Shared (per-process) transaction status word

PTSW: Private (per-thread) transaction status word

Description of the transaction status words

(a) Shared Transaction Status Word (STSW): located at a fixed address in virtual memory known to all threads

STSW Field      Description
Overflowed?     Is there a current overflowed transaction?
OTID            ID of current overflowed transaction

(b) Private Transaction Status Word (PTSW): per-thread architected register, saved/restored on context switches

PTSW Field      Description
Overflowed?     Is this thread in an overflowed transaction?
TND             Nesting depth of current transaction
No-user-abort   Disable logging, allows IO
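
The two status words and the OneTM-Serialized stall rule can be modeled in a few lines. This is a minimal sketch under assumptions: the field names follow the tables above, but the function names are hypothetical, and real hardware would need an atomic update of the STSW rather than the plain assignment used here.

```python
from dataclasses import dataclass


@dataclass
class STSW:
    """Shared (per-process) transaction status word."""
    overflowed: bool = False   # is there a current overflowed transaction?
    otid: int = 0              # ID of the current overflowed transaction


@dataclass
class PTSW:
    """Private (per-thread) transaction status word."""
    overflowed: bool = False     # is this thread in an overflowed transaction?
    tnd: int = 0                 # nesting depth of the current transaction
    no_user_abort: bool = False  # disable logging, allows IO


def try_enter_overflowed_mode(stsw, ptsw):
    """Return True if this thread becomes the single overflowed transaction."""
    if stsw.overflowed:
        return False             # another thread holds the slot: wait and retry
    stsw.overflowed = True       # assumed atomic in a real implementation
    ptsw.overflowed = True
    return True


def must_stall(stsw, ptsw):
    """OneTM-Serialized: every thread except the overflowed one stalls."""
    return stsw.overflowed and not ptsw.overflowed


def leave_overflowed_mode(stsw, ptsw):
    """Commit or abort of the overflowed transaction releases the slot."""
    stsw.overflowed = False
    ptsw.overflowed = False
```

The sketch leaves the OTID untouched; it only matters once metadata is tracked per block, as in OneTM-Concurrent.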

Discussion

Although this implementation is simple, the price of its simplicity is the loss of all concurrency when a transaction overflows, which will have a significant negative impact on performance.

OneTM-Concurrent

In order to allow other code to execute concurrently with a single overflowed transaction, we introduce per-block persistent metadata.

The system uses this metadata to track the read and write set of the single overflowed transaction; other threads then check the metadata to detect conflicts.

Metadata Operation

When a transaction overflows, it transitions to overflowed execution mode. A simple way to accomplish this is to abort the transaction and restart it in overflowed mode after ensuring that no other thread in the application is already executing in overflowed mode.

Metadata Operation

Alternatively, OneTM-Concurrent can avoid an abort by transitioning to overflowed mode more gracefully. As before, the processor must first ensure that no other thread in the application is executing in overflowed mode.

Next, the processor walks both the data cache and the permissions-only cache to set the overflowed metadata for blocks read or written by the transaction; this action ensures that the conflict detection information for these blocks is not lost if the overflowed transaction is context switched.
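
The two steps of the abort-free transition can be sketched as follows. The function name and the representation of both caches and the metadata as dictionaries (block address to a set of "R"/"W" bits) are assumptions made for illustration; the slides do not describe the hardware at this level of detail.

```python
def graceful_transition(other_thread_overflowed, data_cache,
                        permissions_cache, metadata):
    """Sketch of the abort-free transition to overflowed mode.

    All three tables are plain dicts mapping a block address to the set of
    transactional bits ("R", "W") held for it (hypothetical layout).
    """
    # Step 1: only one overflowed transaction may exist per application.
    if other_thread_overflowed:
        return False                  # caller must wait and retry
    # Step 2: walk both on-chip structures and publish their bits so the
    # conflict information survives a context switch.
    for cache in (data_cache, permissions_cache):
        for block, bits in cache.items():
            if bits:                  # blocks with no bits are non-transactional
                metadata.setdefault(block, set()).update(bits)
    return True
```

For example, calling it with `data_cache={0x40: {"R"}}` and `permissions_cache={0xC0: {"R", "W"}}` leaves both blocks recorded in `metadata`.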

Lazy Metadata Clearing

OTID: Overflowed Transaction Identifier

The OTID is used to differentiate between stale and current metadata. The OTID of the active overflowed transaction is also stored in the STSW (see the status word table earlier).

When a transaction transitions to overflowed mode, it increments the OTID in the STSW. Instead of explicitly clearing the metadata bits when it completes, the overflowed transaction simply clears the overflowed bit in the STSW as before.
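
A minimal sketch of lazy clearing follows, assuming each block's metadata is stored as a pair of (bits, OTID). The function names and the dictionary layout are hypothetical. The point illustrated is that metadata counts as current only while the STSW overflowed bit is set and the stored OTID matches, so old entries never need to be erased.

```python
from dataclasses import dataclass


@dataclass
class STSW:
    overflowed: bool = False
    otid: int = 7                # starting value chosen for illustration


def begin_overflow(stsw):
    stsw.otid += 1               # a fresh OTID makes all older metadata stale
    stsw.overflowed = True


def end_overflow(stsw):
    stsw.overflowed = False      # metadata bits are deliberately left in place


def mark(metadata, stsw, block, bit):
    """The overflowed transaction records a read ("R") or a write ("W")."""
    bits, otid = metadata.get(block, (set(), None))
    if otid != stsw.otid:
        bits = set()             # overwrite a stale entry instead of merging
    metadata[block] = (bits | {bit}, stsw.otid)


def conflicts(metadata, stsw, block, is_write):
    """Check performed by other threads before accessing a block."""
    entry = metadata.get(block)
    if entry is None or not stsw.overflowed or entry[1] != stsw.otid:
        return False             # missing or stale metadata: no conflict
    bits = entry[0]
    return "W" in bits or (is_write and "R" in bits)
```

After `end_overflow`, the entry for a block is still present in `metadata`, but `conflicts` ignores it; the next overflowed transaction overwrites it on first use.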

Lazily Coherent Metadata

To prevent out-of-order writebacks from overwriting more recent metadata with stale metadata, the system allows only the owner of the block (non-exclusive or exclusive) to set the metadata.

Once the metadata has been written, it is the owner’s responsibility to ensure the data is eventually written back to memory (or to transfer the ownership, and thus the responsibility, to another processor).

Lazily Coherent Metadata

The key to the correctness of this lazy updating of metadata is that the system guarantees that any new request for the block receives the most recent version of the metadata. Once an overflowed transaction has set the read bit (and thus holds the block in the owned state), any other processor that tries to write the block will issue a cache request and receive the most recent version of the metadata, indicating the conflict.

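Example Execution

[Figure: example execution on processors P0 to P3 over time steps t1 to t7, showing loads of block A (Ld A), the coherence state and metadata of each cached copy (for example O:{R,#8}, S:{R,#8}, M:{R,#8}, M:{0}), one stall, and the STSW {Overflowed?, OTID} taking the values {No, #7}, {Yes, #8}, and {No, #8}.]

The ‘O’ state indicates a single non-exclusive, read-only, dirty owner.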

Experimental Evaluation

Simulated machine configuration

Parameter         Value
Processor         Eight in-order x86 cores, 1 IPC
L1 cache          64KB, 4-way set associative, 64B blocks
L1 miss latency   10 cycles
L2 cache          4MB, 4-way set associative, 64B blocks
L2 miss latency   200 cycles

Experimental Evaluation

Benchmark summary

Program                Input
barnes                 Input-2K
cholesky               tk14.O
Ocean-non-contiguous   n130
radix                  N262144 r1024
raytrace               Teapot.env
volrend                Head-scaleddown2
Water-spatial          Input-512
Tree-<n>               N% scanning

Performance

[Figure: execution time, normalized to a uniprocessor (y-axis from 0.0 to 1.0), for idealized, OneTM-Concurrent, OneTM-Concurrent + permissions-only cache, OneTM-Serialized, OneTM-Serialized + permissions-only cache, and locks.]

Scalability analysis using the tree-10% microbenchmark

[Figure: execution time, normalized to idealized (y-axis from 0.0 to 2.0), at P=1, 2, 4, 8, 16, 32, and 64, for idealized, OneTM-Concurrent, OneTM-Concurrent + permissions-only cache, OneTM-Serialized, and OneTM-Serialized + permissions-only cache.]

Results Summary

These results show that allowing only one overflowed transaction at a time can result in performance competitive with an ideally concurrent implementation.

The addition of even a small permissions-only cache closes the gap for our benchmarks.

For larger multiprocessors, the performance of OneTM-Concurrent suffers versus the ideal unbounded transactional memory.

Conclusion

The permissions-only cache eases the memory-footprint restrictions imposed by the bounded execution mode of the processor. This structure permits bounded transactions to access potentially hundreds of megabytes before overflowing. The permissions-only cache can be introduced into hardware-only and hardware-software proposed systems to make the fast case common.

Conclusion

We also proposed OneTM, an unbounded transactional memory system that assumes overflowed transactions will be rare in order to simplify the implementation. Specifically, we limit the number of overflowed transactions that may exist in an application at a time to one.

By bounding concurrency among overflowed transactions, OneTM avoids the complexity of managing and traversing linked data structures in hardware, admitting simple conflict detection and transaction commit.

Questions?

Thank you!

I wish everyone a great summer!

