
More on Thread Level Speculation Anthony Gitter Dafna Shahaf Or Sheffet.

Date post: 19-Dec-2015

More on Thread Level Speculation

Anthony Gitter, Dafna Shahaf, Or Sheffet

Thread Level Speculation (TLS)

A technique for automatic parallelization.
• Run threads in parallel, but in a speculative state.
• Check for violations.
• Commit upon successful completion.
• Squash when detecting a violation.
  – Propagate the squash onwards.
  – Re-run the thread.
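The run / check / commit / squash cycle above can be sketched in software. This is a minimal illustration only: real TLS does this in hardware, and the names `SpecThread`, `run_tls` and `add_to_sum` are hypothetical. It assumes threads commit in program order and that a squashed thread re-runs immediately against the committed memory.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SpecThread:
    body: Callable                      # body(load, store) does the thread's work
    reads: set = field(default_factory=set)
    writes: dict = field(default_factory=dict)

    def run(self, memory):
        """(Re-)execute speculatively: buffer stores, record exposed loads."""
        self.reads.clear()
        self.writes.clear()

        def load(addr):
            if addr in self.writes:     # reading our own buffered store
                return self.writes[addr]
            self.reads.add(addr)
            return memory.get(addr, 0)

        def store(addr, value):
            self.writes[addr] = value   # invisible to others until commit

        self.body(load, store)


def run_tls(bodies, memory):
    """Run bodies (given in program order) speculatively; return squash count."""
    threads = [SpecThread(b) for b in bodies]
    for t in threads:                   # stands in for the parallel phase
        t.run(memory)
    squashes = 0
    for i, t in enumerate(threads):     # commit in program order
        memory.update(t.writes)
        for later in threads[i + 1:]:
            if not later.reads.isdisjoint(t.writes):  # later thread read stale data
                later.run(memory)                     # squash and re-run
                squashes += 1
    return squashes


def add_to_sum(x):
    """Hypothetical loop body with a true dependence on 'sum'."""
    def body(load, store):
        store("sum", load("sum") + x)
    return body
```

With three `add_to_sum` bodies every thread depends on its predecessor, so the sketch squashes repeatedly yet still ends with the sequential result; bodies that touch disjoint addresses commit without any squash.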

Thread Level Speculation Example

Mechanism of TLS

1. Managing speculative state.
2. Disambiguation: checking addresses for violating dependencies
   – Eager vs. Lazy
3. Upon commit
   – Broadcast (Everybody? Relevant?)
   – Invalidate/update of other threads
   – Leave speculative state
4. Upon squash
   – Broadcast
   – Invalidate changes for this thread
   – Re-run

Done at the hardware level; involves the cache.

Simple. Fast.
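The eager vs. lazy distinction in step 2 is about when a violating dependence is noticed. A small sketch, under the assumption that threads are numbered in program order and that a trace lists events in global time order; the function name `find_violations` and the trace format are made up for illustration.

```python
def find_violations(trace, mode):
    """Return (time, squashed_thread) pairs for a trace of (thread, op, addr).

    op is 'load', 'store' or 'commit'. mode is 'eager' (check on every
    store) or 'lazy' (check only when the storing thread commits).
    """
    loads = {}    # thread -> addresses it has loaded speculatively
    stores = {}   # thread -> addresses it has stored
    found = []
    for time, (tid, op, addr) in enumerate(trace):
        if op == "load":
            loads.setdefault(tid, set()).add(addr)
        elif op == "store":
            stores.setdefault(tid, set()).add(addr)
            if mode == "eager":
                # A later thread already read this address: violation now.
                found += [(time, r) for r in sorted(loads)
                          if r > tid and addr in loads[r]]
        elif op == "commit" and mode == "lazy":
            written = stores.get(tid, set())
            found += [(time, r) for r in sorted(loads)
                      if r > tid and loads[r] & written]
    return found


trace = [(1, "load", "p"), (0, "store", "p"), (0, "commit", None)]
```

On this trace the eager check flags thread 1 at the store (time 1), while the lazy check flags it only at the commit (time 2).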

Scenarios

• Thread attributes:
  – Length
  – Memory accesses
  – Dependences

Length   Accesses   Depend.   Outcome
??       ??         Many      Serial
??       ??         0         Easily parallel
Short    Many       Few       TLS costly
Short    Few        Few       TLS works
Long     Few        Few       TLS costly

When is TLS Too Costly?

• “Too much data” scenario
  – Thread touches too many addresses.
  – Bulk Disambiguation of Speculative Threads in Multiprocessors. Ceze, Tuck, Cascaval, Torrellas.
• “Too much time” scenario
  – Execution involves many instructions (e.g. database transactions).
  – Tolerating Dependences Between Large Speculative Threads Via Sub-Threads. Colohan, Ailamaki, Steffan, Mowry.

Too Many Addresses – Solution 1

Each thread maintains a bitwise mask of the cache.
• Flip bit on when touching an address.
• Upon completion, check addresses you and others touched. (Lazy)
• Commit / Squash: send mask.
• Invalidating/replacing/changing address state in cache: use mask.

All bitwise operations. Very simple!
Infeasible for size reasons (won’t scale).
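A sketch of the exact-mask idea, using a Python integer as the bit vector. The class name `TouchMask` and the slot numbering are assumptions for illustration. With a single mask per thread, any shared slot counts as a conflict, which is conservative.

```python
class TouchMask:
    """One bit per slot; a bit is set once the thread touches that slot."""

    def __init__(self, num_slots):
        self.num_slots = num_slots
        self.bits = 0

    def touch(self, slot):
        if not 0 <= slot < self.num_slots:
            raise IndexError(slot)
        self.bits |= 1 << slot          # flip the bit on

    def conflicts_with(self, other):
        return (self.bits & other.bits) != 0   # a single bitwise AND

    def slots(self):
        return [i for i in range(self.num_slots) if (self.bits >> i) & 1]
```

The check is one AND, but the mask needs one bit per slot it can ever name, which is the size problem the slide points out.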

Solution: Hash!

Introducing BULK - hardware that hashes the address space into a signature (~2k in size).

Figure: addresses from the address space are hashed into signature bit vectors, which are combined by bitwise OR:

  0 1 0 1 0 0 0 0 1
  0 0 1 1 0 0 1 0 0   (Bitwise OR)
  0 1 1 1 0 0 1 0 1

Upon completion, send signature!

Upon receiving, pull back to a superset of possible addresses.
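The signature idea can be sketched as a one-hash Bloom-filter-style bit vector. This is not the Bulk hardware encoding: a software hash (`hashlib.blake2b`) stands in for the hardware's address hashing, and the class and method names are hypothetical.

```python
import hashlib


class Signature:
    """Fixed-size bit vector summarizing a set of addresses (no false negatives)."""

    def __init__(self, num_bits=2048):
        self.num_bits = num_bits
        self.bits = 0

    def _slot(self, addr):
        # Stand-in hash; the real hardware uses a much cheaper bit-level scheme.
        digest = hashlib.blake2b(addr.to_bytes(8, "little"), digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.num_bits

    def insert(self, addr):
        self.bits |= 1 << self._slot(addr)

    def may_contain(self, addr):
        return bool((self.bits >> self._slot(addr)) & 1)

    def union(self, other):
        out = Signature(self.num_bits)
        out.bits = self.bits | other.bits      # bitwise OR, as in the figure
        return out

    def intersects(self, other):
        return (self.bits & other.bits) != 0   # possible shared address

    def expand(self, candidates):
        """'Pull back' to a superset of the inserted addresses among candidates."""
        return [a for a in candidates if self.may_contain(a)]
```

A committing thread would send its write signature; a receiver calls `intersects` against its own signatures and `expand` over its cached addresses to decide what to invalidate.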

Bulk Features:

• Separate Reading / Writing signatures.
• Committing: sending signature.
• Invalidating: pulling back signature into a superset.
• Granularity is on word level (not cache line)
  – since we map addresses

Caveat: We might see violations even if there weren't any!
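Why false violations appear, and why a longer signature helps, can be seen from a back-of-the-envelope model. It assumes a single hash that spreads addresses uniformly over the signature bits, which is an idealization and not the measured Bulk behaviour; the function name is hypothetical.

```python
def false_positive_rate(num_addresses, signature_bits):
    """Chance that an address NOT in the set still maps to a set bit.

    Each inserted address misses a given bit with probability 1 - 1/m,
    so after n insertions that bit is set with probability 1 - (1 - 1/m)^n.
    """
    return 1.0 - (1.0 - 1.0 / signature_bits) ** num_addresses


for bits in (512, 2048, 8192):
    print(bits, false_positive_rate(64, bits))
```

Under this model the rate falls as the signature grows and rises as the thread touches more addresses.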

Bulk Performance

Figure: Fraction of False Positives as a function of Signature Length

When is TLS Too Costly?

• “Too much data” scenario
  – Thread touches too many addresses.
  – Bulk Disambiguation of Speculative Threads in Multiprocessors. Ceze, Tuck, Cascaval, Torrellas.
• “Too much time” scenario
  – Execution involves many instructions (e.g. database transactions).
  – Tolerating Dependences Between Large Speculative Threads Via Sub-Threads. Colohan, Ailamaki, Steffan, Mowry.

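Handling Long Threads (Attempt 1)

Upon violation – we re-execute a long thread.

Figure (image courtesy Chris Colohan): threads running in parallel. One thread stores to *p and *q; a later thread loads *p and *q. The load of *p is flagged "Violation!" and the later thread is re-executed.

Q: Does eliminating a data dependence help?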

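Handling Long Threads (Attempt 1)

Figure (image courtesy Chris Colohan): the same parallel execution, shown next to a version labelled "Eliminate *p Dep." With the *p dependence gone, the store to *q and the later thread's load of *q still produce a "Violation!".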

Handling Long Threads (Attempt 2): Sub-Threads

• Sub-threads are checkpoints during thread execution
• No longer “all or nothing”
• Must be lightweight
• Help with primary and secondary violations

Figure (image courtesy Chris Colohan): a store to *q violates a later thread's load of *q; the later thread re-executes the load from a sub-thread checkpoint.
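The benefit of checkpoints can be made concrete by counting how much work a violated thread must redo. This sketch assumes, purely for illustration, that sub-threads start at fixed instruction intervals; the slides do not say how boundaries are chosen, and `replay_cost` is a hypothetical name.

```python
def replay_cost(violated_load_at, detected_at, checkpoint_every=None):
    """Instructions a thread must re-execute after a violation.

    violated_load_at: instruction index of the load that read stale data.
    detected_at:      instruction index reached when the violation is detected.
    checkpoint_every: sub-thread length, or None for no sub-threads.
    """
    if checkpoint_every is None:
        restart = 0                     # all or nothing: rewind to thread start
    else:
        # Rewind to the start of the sub-thread containing the bad load.
        restart = (violated_load_at // checkpoint_every) * checkpoint_every
    return detected_at - restart


print(replay_cost(7300, 9000))          # no sub-threads
print(replay_cost(7300, 9000, 1000))    # with sub-threads
```

Without sub-threads all 9000 instructions are replayed; with a checkpoint every 1000 instructions only the work since instruction 7000 is lost.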

Sub-thread Implementation

• Assume CMP with shared L2
• L1 is unaware of sub-threads
  – Speculatively modified bit per cache line
• L2 performs eager violation detection
  – 2 additional bits per cache line per sub-thread
  – Replication to track different sub-thread contexts
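The two extra bits per line per sub-thread can be read as a load bit and a store bit. A sketch of how an L2 line could use them for eager detection, assuming threads and sub-threads are numbered in program order. The class `L2Line` is hypothetical and ignores line replication and loads that follow a thread's own store.

```python
from collections import defaultdict


class L2Line:
    """Per-line speculative metadata: one load bit and one store bit
    for each (thread, sub-thread) context."""

    def __init__(self):
        self.loaded = defaultdict(set)   # thread -> sub-threads that loaded the line
        self.stored = defaultdict(set)   # thread -> sub-threads that stored to it

    def load(self, thread, sub):
        self.loaded[thread].add(sub)

    def store(self, thread, sub):
        """Record the store and eagerly report violated later threads.

        Returns {thread: earliest sub-thread to roll back to}.
        """
        self.stored[thread].add(sub)
        return {t: min(subs) for t, subs in self.loaded.items()
                if t > thread and subs}


line = L2Line()
line.load(thread=2, sub=1)               # thread 2 loads in its second sub-thread
rollback = line.store(thread=1, sub=1)   # earlier thread 1 then stores to the line
```

Here `rollback` names thread 2 and its sub-thread 1, so only that sub-thread and what follows it would be re-executed.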


Sub-thread Evaluation

Figure (image courtesy Chris Colohan): stacked bars of normalized time (axis from 0 to 1.2), broken down into Idle CPU, Failed, Cache Miss and Busy, for the benchmarks New Order, New Order 150, Delivery, Delivery Outer, Stock Level, Payment and Order Status. Each benchmark has three bars:

N = no sub-threads
S = with sub-threads
L = limit, ignoring violations

Summary

• Thread attributes:
  – Length
  – Memory accesses
  – Dependences

Length   Accesses   Depend.   Outcome
??       ??         Many      Serial
??       ??         0         Easily parallel
Short    Few        Few       TLS works
Short    Many       Few       TLS costly: BULK
Long     Few        Few       TLS costly: Sub-Threads
Long     Many       Few       Hopeless??

Open Questions

• Long threads that also touch many addresses.
  – Bulk on top of sub-threads?
• Combining lazy/eager evaluations

Thank you!

Backup Slides

Buffering Large Threads

Operations so far: store X, 0x00

Figure (slide courtesy Chris Colohan): two L1 caches and an L2 cache, each with lines 0x00 and 0x01. X is buffered at 0x00 and tagged S1.

Callout: Store and load bit per thread

Buffering Large Threads

Operations so far: store X, 0x00; store A, 0x01

Figure (slide courtesy Chris Colohan): X is buffered at 0x00 and A at 0x01, both tagged S1.

Buffering Large Threads

Operations so far: store X, 0x00; store A, 0x01; load 0x00

Figure (slide courtesy Chris Colohan): the load of 0x00 returns X; the line is now tagged S1 and L2, while A at 0x01 is tagged S1.

Buffering Large Threads

Operations so far: store X, 0x00; store A, 0x01; load 0x00; store Y, 0x00

Figure (slide courtesy Chris Colohan): line 0x00 now has two versions, X tagged S1 and Y tagged S2 L2; A at 0x01 is tagged S1.

Callout: Replicate line – one version per thread

Buffering Large Threads

Operations so far: store X, 0x00; store A, 0x01; load 0x00; store Y, 0x00; load 0x01

Figure (slide courtesy Chris Colohan): line 0x00 holds X and Y (Y tagged S2 L2); line 0x01 holds A, tagged S1 L2.

Buffering Large Threads

Operations so far: store X, 0x00; store A, 0x01; load 0x00; store Y, 0x00; load 0x01; store B, 0x01

Figure (slide courtesy Chris Colohan): line 0x00 holds X and Y (Y tagged S2 L2); line 0x01 is tagged S1 L2 and B now appears there.

Sub-thread Support

Operations: store X, 0x00; store A, 0x01; load 0x00; store Y, 0x00; load 0x01; store B, 0x01

Figure (slide courtesy Chris Colohan): the same cache state as on the previous slide, with the execution split into two sub-threads, a and b.

Callouts: Divide into two sub-threads. Only roll back violated sub-thread.

Sub-thread Support

Operations: store X, 0x00; store A, 0x01; load 0x00; store Y, 0x00; load 0x01; then store B, 0x01

Figure (slide courtesy Chris Colohan, copyright 2006): the cache lines are now tagged per sub-thread: S1a, S2a, L2a and L2b.

Callout: Store and load bit per sub-thread

Sub-thread Support

Operations: store X, 0x00; store A, 0x01; load 0x00; store Y, 0x00; load 0x01; store B, 0x01

Figure (slide courtesy Chris Colohan, copyright 2006): after store B, 0x01 the lines carry the tags S1a, S1b, S2a, L2a and L2b, with both A and B shown at 0x01.

Sub-thread Evaluation

• Evaluate using large database transactions
• Parallelize the loops
• Can we place an upper bound on the possible speedup?

