SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads
Islam Atta
Andreas Moshovos
Pınar Tözün
Anastasia Ailamaki
Online Transaction Processing (OLTP)
$100 billion/yr market, growing 10% annually
• E.g., banking, online purchases, stock market…
Benchmarking
• Transaction Processing Performance Council (TPC)
• TPC-C: wholesale retailer
• TPC-E: brokerage market
OLTP drives innovation for HW and DB vendors
© Islam Atta
Transactions Suffer from Instruction Misses
Many concurrent transactions
[Figure: over time, each transaction's instruction footprint exceeds the L1-I size]
Instruction stalls due to L1 instruction cache thrashing
Even on a CMP, All Transactions Suffer
[Figure: transactions running on the cores of a CMP, each thrashing its own L1-I cache over time]
All caches are thrashed with similar code blocks
Opportunity
Spreading the footprint over multiple cores reduces instruction misses
Technology:
• A CMP's aggregate L1 instruction cache capacity is large enough
Application behavior:
• Instructions overlap within and across transactions
[Figure: multiple threads spread the footprint across multiple L1-I caches over time]
SLICC Overview
Dynamic hardware solution:
• How to divide a transaction
• When to move
• Where to go
Performance:
• Reduces instruction misses by 44% (TPC-C), 68% (TPC-E)
• Improves performance by 60% (TPC-C), 79% (TPC-E)
Robust:
• Non-OLTP workloads remain unaffected
Talk Roadmap
• Intra/inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
OLTP Facts
Many concurrent transactions
Few DB operations
• 28-65KB each
Few transaction types
• TPC-C: 5, TPC-E: 12
Transactions fit in 128-512KB
Instructions overlap within and across different transactions
[Figure: New Order and Payment transactions composed of shared operations R(), U(), I(), D(), IT(), ITP()]
A CMP's aggregate L1-I cache is large enough
Instruction Commonality Across Transactions
[Figure: code-reuse maps for TPC-C and TPC-E, shown across all threads and per transaction type; more yellow means reuse by most threads rather than few or a single one]
Lots of code reuse
Even higher across same-type transactions
Requirements
Enable usage of aggregate L1-I capacity
• Large cache capacity without increased latency
Exploit instruction commonality
• Localize common transaction instructions
Dynamic
• Independent of footprint size or cache configuration
Talk Roadmap
• Intra/inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
Example for Concurrent Transactions
[Figure: control-flow graphs of transactions T1, T2, T3, divided into code segments that can each fit into the L1-I]
Scheduling Threads
[Figure: threads T1, T2, T3 time-multiplexed over cores 0-3. Conventional: each thread stays on one core and refills that core's L1-I for every code segment. SLICC: threads migrate to the core whose L1-I already holds the next segment.]
Conventional: cache filled 10 times. SLICC: cache filled 4 times.
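The scheduling idea above can be illustrated with a toy model: each thread is a sequence of code segments, and each segment is assumed to fill a whole L1-I. This is our own idealized sketch, not the hardware mechanism; the thread/segment mix and the assumption that each core caches exactly one segment are illustrative.

```python
def conventional_fills(threads, num_cores):
    """Each thread is pinned to its own core (assumes one thread per core);
    the core's L1-I is refilled whenever the thread enters a segment that
    is not the one currently cached there."""
    fills = 0
    cached = [None] * num_cores
    for core, segs in enumerate(threads):
        for s in segs:
            if cached[core] != s:
                cached[core] = s
                fills += 1
    return fills


def slicc_fills(threads, num_cores):
    """Idealized SLICC-style policy: before running a segment, a thread
    migrates to a core that already caches it; otherwise the segment is
    filled into the next core's L1-I (contention and eviction ignored)."""
    fills = 0
    cached = [None] * num_cores
    nxt = 0
    for segs in threads:
        for s in segs:
            if s in cached:
                continue        # migrate to the core already holding s
            cached[nxt] = s     # fill one core's L1-I with this segment
            nxt = (nxt + 1) % num_cores
            fills += 1
    return fills


# Three threads sharing segments A, B, C, in the spirit of the slide:
threads = ["ABAC", "BAC", "CAB"]
print(conventional_fills(threads, 4))
print(slicc_fills(threads, 4))
```

With this mix, pinned scheduling refills a cache ten times, while the idealized migration policy fills each distinct segment only once; the real SLICC example on the slide pays a few more fills because migration decisions are made with imperfect information.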
Talk Roadmap
• Intra/inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
Migration Ingredients
When to migrate?
• Step 1: detect cache full
• Step 2: detect new code segment
Where to go?
• Step 3: predict where the next code segment is
Migration Ingredients (example: T1)
[Figure: T1's timeline applying the three steps; loops keep the thread on one core, migrations target idle cores, and the thread later returns to a core that still caches an earlier segment]
Migration Ingredients (example: T2)
[Figure: T2's timeline applying the same three steps, reusing segments left behind in remote caches]
Implementation
When to migrate?
• Step 1, detect cache full: Miss Counter
• Step 2, detect new segment: Miss Dilution
Where to go?
• Step 3: find signature blocks on remote cores
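The two detection steps can be sketched in software. This is a hedged behavioral model, not the hardware: the threshold values and the exact trigger condition (a miss arriving after a mostly-hitting window, taken as the start of a new segment) are our illustrative choices.

```python
class MigrationDetector:
    """Toy model of SLICC's two detection steps. Thresholds are
    illustrative placeholders, not the paper's values."""

    def __init__(self, fill_up_t=8, window=16, dilution_t=4):
        self.misses = 0          # Miss Counter (MC): cache-full detection
        self.history = []        # recent outcomes, 1 = miss, 0 = hit
        self.fill_up_t = fill_up_t
        self.window = window
        self.dilution_t = dilution_t

    def access(self, is_miss):
        """Record one L1-I access. Returns True when a migration should
        trigger: the cache looks full (MC >= Fill-up_t) and a fresh miss
        arrives after a diluted (mostly-hitting) window."""
        diluted = (len(self.history) == self.window
                   and sum(self.history) <= self.dilution_t)
        trigger = self.misses >= self.fill_up_t and is_miss and diluted
        self.misses += 1 if is_miss else 0
        self.history.append(1 if is_miss else 0)
        if len(self.history) > self.window:
            self.history.pop(0)     # keep a sliding window of outcomes
        return trigger


# A burst of misses fills the cache, a run of hits dilutes the misses,
# and the next miss (a new segment starting) triggers migration:
d = MigrationDetector()
stream = [True] * 8 + [False] * 16 + [True]
decisions = [d.access(m) for m in stream]
print(decisions[-1])
```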
Boosting Effectiveness
More overlap across transactions of the same type
SLICC: transaction-type-oblivious
Transaction-type-aware variants:
• SLICC-Pp: pre-processing to detect similar transactions
• SLICC-SW: software provides type information
Talk Roadmap
• Intra/inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary
Experimental Evaluation
How does SLICC affect INSTRUCTION misses?
• Our primary goal
How does it affect DATA misses?
• Expected to increase, but by how much?
Performance impact:
• Are DATA misses and MIGRATION OVERHEADS amortized?
Methodology
Simulation
• Zesto (x86)
• 16 OoO cores, 32KB L1-I, 32KB L1-D, 1MB per-core L2
• QEMU extension
• User and kernel space
Workloads
• Shore-MT
Effect on Misses
Baseline: no effort to reduce instruction misses
[Figure: I-MPKI and D-MPKI (0-45) for Base, SLICC, and SLICC-SW on TPC-C-10, TPC-E, and MapReduce; lower is better]
SLICC reduces I-MPKI by 58% and increases D-MPKI by 7%.
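For readers unfamiliar with the chart's metric, MPKI is misses per kilo-instruction; the numbers in this small helper are made up for illustration, not taken from the evaluation.

```python
def mpki(misses, instructions):
    """Misses per kilo-instruction: miss count normalized to
    thousands of retired instructions."""
    return misses / (instructions / 1000.0)


# Illustrative numbers only: 58,000 misses over 1M instructions.
print(mpki(58_000, 1_000_000))
```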
Performance
Next-line: always prefetch the next line
PIF: upper bound for Proactive Instruction Fetch [Ferdman, MICRO'11]
[Figure: speedup (1.0-2.0; higher is better) of Next-Line, PIF (no overhead), SLICC, and SLICC-SW on TPC-C-1, TPC-C-10, TPC-E, and MapReduce]
TPC-C: +60%, TPC-E: +79%
Storage per core: PIF ~40KB, SLICC <1KB
Summary
OLTP performance suffers due to instruction stalls.
Technology and application opportunities:
• The instruction footprint fits in the aggregate L1-I capacity of CMPs.
• Inter- and intra-thread locality.
SLICC:
• Thread migration spreads the instruction footprint over multiple cores.
• Reduces I-MPKI by 58%.
• Improves performance over: Baseline +70%, Next-line +44%, PIF ±2% to +21%.
Why Do Data Misses Increase?
Example: a thread migrates from core A to core B.
• It re-reads data on core B that was already fetched on core A.
• Its writes on core B invalidate data cached on core A.
• When it returns to core A, its cache blocks may have been evicted by other threads.
SLICC Agent per Core
[Figure: per-core agent hardware. Cache-full detection: a Miss Counter (MC) compared against Fill-up_t. Miss-dilution tracking: a Miss Shift-Vector (MSV) shifts in miss(1)/hit(0) outcomes; counting its "1"s against Dilution_t enables migration. Locating missed blocks on remote cores: a Miss Tag-Queue (MTQ) feeds a remote cache-segment search, and a core with at least Matched_t matching entries is selected as the migration target.]
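The target-selection step (Step 3) can be sketched as follows. This is our simplified model: recent miss tags (an MTQ analogue) are checked against each remote core's set of cached block tags, and a core is chosen if enough tags match (Matched_t). Real SLICC queries compact cache signatures rather than exact sets, and the threshold here is illustrative.

```python
def select_core(miss_tags, remote_cached, matched_t=3):
    """Pick the remote core whose cached blocks cover the most recent
    miss tags; return None unless coverage reaches the Matched_t
    threshold (i.e., no migration target found)."""
    best_core, best_hits = None, 0
    for core, cached in remote_cached.items():
        hits = sum(1 for tag in miss_tags if tag in cached)
        if hits > best_hits:
            best_core, best_hits = core, hits
    return best_core if best_hits >= matched_t else None


# Hypothetical example: core 1 holds most of the recently missed blocks.
remote = {1: {"a", "b", "c", "d"}, 2: {"x", "y"}}
print(select_core(["a", "b", "c", "q"], remote))
```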
Detailed Methodology
• Zesto (x86)
• Qtrace (QEMU extension)
• Shore-MT
Hardware Cost
Larger I-Caches?
[Figure: instruction and data MPKI (0-60, broken into conflict, capacity, and compulsory misses) and speedup (0-1.4) versus cache size (16-512KB) for TPC-C-10, TPC-E, and MapReduce; lower MPKI and higher speedup are better]
Different Replacement Policies?
[Figure: L1 instruction MPKI (0-40; lower is better) under LRU, LIP, BIP, DIP, SRRIP, BRRIP, and DRRIP for TPC-C, TPC-E, and MapReduce]
Parameter Space (1)
[Figure: I-MPKI, D-MPKI (0-70), and speedup (0-1.6) sweeping Fill-up_t (128-512, top) and Matched_t (2-10, bottom) for TPC-C and TPC-E; lower MPKI and higher speedup are better]
Parameter Space (2)
[Figure: I-MPKI, D-MPKI (0-60), and speedup (0-2.0) sweeping Dilution_t (2-30) for TPC-C and TPC-E; lower MPKI and higher speedup are better]
Cache Signature Accuracy
Partial Bloom filter
[Figure: Bloom-filter accuracy (96-100%; higher is better) for signature sizes from 512 to 8K bits, on TPC-C and TPC-E]
© Islam Atta 35