
Software and Hardware Support for Locality Aware High Performance Computing

Xiaodong Zhang

National Science Foundation

College of William and Mary

This talk does not necessarily reflect NSF's official opinions.

Acknowledgement

Participants of the project: David Bryan, Jefferson Labs (DOE); Stefan Kubricht, Vsys Inc.; Song Jiang and Zhichun Zhu, William and Mary; Li Xiao, Michigan State University; Yong Yan, HP Labs; Zhao Zhang, Iowa State University.

Sponsors of the project: Air Force Office of Scientific Research, National Science Foundation, and Sun Microsystems Inc.

CPU-DRAM Gap

[Figure: processor and DRAM performance over time on a logarithmic scale (1 to 10,000). CPU performance improves about 60% per year while DRAM speed improves about 7% per year, so the CPU-DRAM gap grows roughly 50% per year.]

Cache Miss Penalty

A cache miss costs the equivalent of executing hundreds of CPU instructions (thousands in the future). At 2 GHz with a 2.5 average issue rate, the CPU can issue 350 instructions during a 70 ns memory access.

Even a small cache miss rate leads to a large memory stall time in the total execution time: on average, 62% of the time is memory stall time for SPEC2000.
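As a quick arithmetic check of the figures above, a minimal sketch (the clock rate, issue rate, and miss latency are the numbers quoted on this slide):

    clock_ghz = 2.0        # 2 GHz core, as quoted above
    issue_rate = 2.5       # average instructions issued per cycle
    miss_latency_ns = 70   # memory access latency on a cache miss

    cycles_stalled = miss_latency_ns * clock_ghz      # 140 cycles
    instructions_lost = cycles_stalled * issue_rate   # about 350 instructions
    print(round(instructions_lost), "instructions issued during one miss")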

I/O Bottleneck is Much Worse Disk access time is limited by mechanical delays. A fast Seagate Cheetah X15 disk (15000 rpm):

average seek time: 3.9 ms; rotation latency: 2 ms; internal transfer time for a strip unit (8 KB): 0.16 ms; total disk latency: 6.06 ms.

The external transfer rate increases 40% per year; from disk to DRAM it is 160 MBps (UltraSCSI I/O bus).

Getting 8 KB from disk to DRAM takes 11.06 ms, more than 22 million cycles of a 2 GHz CPU!
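The same back-of-the-envelope arithmetic for the disk numbers (a sketch using only the figures quoted on this slide; the 11.06 ms end-to-end time is the slide's total for moving the 8 KB block into DRAM):

    seek_ms, rotation_ms, internal_xfer_ms = 3.9, 2.0, 0.16
    disk_latency_ms = seek_ms + rotation_ms + internal_xfer_ms   # 6.06 ms
    end_to_end_ms = 11.06        # slide's total disk-to-DRAM time for 8 KB
    cpu_cycles = end_to_end_ms * 1e-3 * 2.0e9                    # at 2 GHz
    print(disk_latency_ms, "ms disk latency;",
          round(cpu_cycles / 1e6, 1), "million cycles")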

Memory Hierarchy with Multi-level Caching

[Figure: the memory hierarchy. On the CPU side: registers, TLB, and the L1/L2/L3 caches; across the CPU-memory bus: the DRAM row buffer, the DRAM with its bus adapter/controller buffer, and the buffer cache kept in DRAM; across the I/O bus: the I/O controller, the disk cache, and the disk. Data is buffered at every level (registers, TLB, L1/L2/L3, row buffer, controller buffer, buffer cache, disk cache).]

Other System Effects on Locality

[Figure: locality is shaped at every layer of the system: algorithm implementation, compiler, microarchitecture, and operating system.]

Locality exploitation is not guaranteed by the buffers!

Initial and runtime data placement: static and dynamic data allocation, and interleaving.

Data replacement at different caching levels: LRU is widely used but sometimes fails.

Locality-aware memory access scheduling: reorder access sequences to reuse cached data.

Outline Cache optimization at the application level.

Designing fast and high associativity caches

Exploiting multiprocessor cache locality at runtime.

Exploiting locality in DRAM row buffer.

Fine-grain memory access scheduling.

Efficient replacement in buffer cache.

Conclusion

Application Software Effort: Algorithm

Restructuring for Cache Optimization

Traditional algorithm design gives a sequence of computing steps subject to minimizing CPU operations. It ignores:

inherent parallelism and interactions (e.g., ILP, pipelining, and multiprogramming),

the memory hierarchy where data are laid out, and the increasingly high data access cost.

Mutually Adaptive Between Algorithms and Architecture

Restructure commonly used algorithms to use caches and the TLB effectively, minimizing cache and TLB misses. A highly optimized application library is very useful.

Restructuring techniques: data blocking (grouping data in the cache for repeated use), data padding (to avoid conflict misses), and using registers as fast data buffers. A sketch of the blocking idea follows below.
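A minimal sketch of data blocking, shown here as a blocked matrix transpose (illustrative only; the tile size B is a hypothetical tuning parameter, and the actual library codes are written at a lower level where cache behavior can be controlled directly):

    def transpose_blocked(a, n, B=64):
        """Transpose an n x n matrix stored row-major in a flat list, tile by tile."""
        t = [0.0] * (n * n)
        for ii in range(0, n, B):            # tile row
            for jj in range(0, n, B):        # tile column
                # the B x B source and destination tiles are reused while
                # they still fit in the cache
                for i in range(ii, min(ii + B, n)):
                    for j in range(jj, min(jj + B, n)):
                        t[j * n + i] = a[i * n + j]
        return t

Padding plays the complementary role: when n is a power of two, adding a few unused elements per row shifts conflicting rows to different cache sets.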

Two Case Studies Bit-Reversals:

basic operations in FFT and other applications; their data layout and operations cause large numbers of conflict misses

Sorting: merge-, quick-, and insertion-sort; TLB and cache misses are sensitive to the operations.

Our library outperforms system-level approaches: we know exactly where to pad and block!

Usage of the two libraries (both open source): the bit-reversal code is an alternative in Sun's scientific library, and the sorting codes are used as a benchmark for testing compilers.

Microarchitecture Effort: Exploit DRAM Row Buffer Locality

DRAM features: high density and high capacity; low cost but slow access (compared to SRAM); non-uniform access latency.

The row buffer serves as a fast cache, but its access patterns have received little attention. Reusing buffered data minimizes the DRAM latency.

Locality Exploitation in Row Buffer

[Figure: the memory hierarchy diagram again, this time highlighting the DRAM row buffer level.]

DRAM Access = Latency + Bandwidth Time

[Figure: a DRAM access consists of precharge, row access, and column access between the DRAM core and the row buffer (the DRAM latency), followed by the bus bandwidth time to move the data to the processor.]

Nonuniform DRAM Access Latency

Case 1: Row buffer hit (20+ ns)

Case 2: Row buffer miss (core is precharged, 40+ ns)

Case 3: Row buffer miss (core not precharged, ≈ 70 ns)

[Figure: case 1 needs only a column access; case 2 needs a row access plus a column access; case 3 needs precharge, row access, and column access.]

Row buffer misses come from a sequence of accesses to different pages in the same bank.
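A toy latency model for the three cases (a sketch: the 20/40/70 ns totals are from the slide, while the split into precharge, row access, and column access components is assumed for illustration):

    def dram_latency_ns(row_buffer_hit, bank_precharged):
        COL_ACCESS = 20   # case 1: column access only (row buffer hit)
        ROW_ACCESS = 20   # added when the row buffer misses
        PRECHARGE = 30    # added when the bank is not precharged
        if row_buffer_hit:
            return COL_ACCESS                          # ~20 ns
        if bank_precharged:
            return ROW_ACCESS + COL_ACCESS             # ~40 ns
        return PRECHARGE + ROW_ACCESS + COL_ACCESS     # ~70 ns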

Amdahl’s Law applies in DRAM

Time (ns) to fetch a 128-byte cache block:

PC100 bus (0.8 GB/s): 70 ns DRAM latency + 160 ns bandwidth time
PC2100 bus (2.1 GB/s): 70 ns DRAM latency + 60 ns bandwidth time
Rambus (6.4 GB/s): 70 ns DRAM latency + 20 ns bandwidth time

As the bandwidth improves, DRAM latency will decide the cache miss penalty.

Row Buffer Locality Benefit

Objective: serve memory requests without accessing the DRAM core as much as possible.

Latency(row buffer hit) << Latency(row buffer miss)

Reduce latency by up to 67%.

Row Buffer Misses are Surprisingly High

Standard configuration: conventional cache mapping, page interleaving for DRAM memories, 32 DRAM banks, 2 KB page size, SPEC95 and SPEC2000.

What is the reason behind this?

[Figure: row-buffer miss rates (%) for SPEC benchmarks such as tomcatv, hydro2d, mgrid, applu, compress, and ijpeg.]

Conventional Page Interleaving

[Figure: pages are interleaved across banks: Pages 0, 1, 2, 3 map to Banks 0, 1, 2, 3; Pages 4, 5, 6, 7 map to Banks 0, 1, 2, 3 again; and so on.]

Address format: page index (r bits) | bank (k bits) | page offset (p bits)

Address Mapping Symmetry

cache: cache tag (t bits) | cache set index (s bits) | block offset (b bits)

page: page index (r bits) | bank (k bits) | page offset (p bits)

Cache-conflicting addresses: same cache set index, different tags. Row-buffer-conflicting addresses: same bank index, different pages. Address mapping: the bank index bits fall within the cache set index bits. Property: if two addresses x and y conflict in the cache, they also conflict in the row buffer.

Sources of Misses Symmetry: invariance in results under transformations.

Address mapping symmetry propagates conflicts from the cache address space to the memory address space:

cache-conflicting addresses are also row-buffer-conflicting addresses;

a cache write-back address conflicts in the row buffer with the address of the block about to be fetched.

Cache conflict misses are also row-buffer conflict misses.

Breaking the Symmetry by Permutation-based Page Interleaving

[Figure: the k-bit bank index is XORed with k bits taken from the L2 cache tag portion of the address to form the new bank index; the page index and page offset fields are unchanged.]

new bank index = old bank index XOR (k bits of the L2 cache tag)
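A minimal sketch of the permutation in code (the page-offset and bank widths follow the evaluated configuration of 2 KB pages and 32 banks; the position of the L2 tag bits is an assumption for illustration):

    PAGE_OFFSET_BITS = 11    # 2 KB pages
    BANK_BITS = 5            # 32 banks
    L2_TAG_SHIFT = 22        # assumed position of the low L2-tag bits

    def conventional_bank(addr):
        return (addr >> PAGE_OFFSET_BITS) & ((1 << BANK_BITS) - 1)

    def permuted_bank(addr):
        tag_bits = (addr >> L2_TAG_SHIFT) & ((1 << BANK_BITS) - 1)
        return conventional_bank(addr) ^ tag_bits   # XOR breaks the symmetry

Two cache-conflicting addresses differ only in their tag bits, so they XOR into different banks, while addresses within one page share both fields and stay in the same bank.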

Permutation Property (1)

Conflicting addresses are distributed onto different banks

[Figure: a set of L2-conflicting addresses share the same conventional bank index (1010) but differ in their tag bits (1000, 1001, 1010, 1011, ...); conventional interleaving sends them all to the same bank, while XORing the bank index with the tag bits yields different bank indexes and spreads them across the memory banks.]

Permutation Property (2)

The spatial locality of memory references is preserved.

[Figure: addresses within one page share both their tag bits (1000) and their bank index bits (1010), so the XOR produces the same bank index for all of them; conventional and permutation-based interleaving map the whole page to the same bank, preserving spatial locality.]

Permutation Property (3)

Pages are uniformly mapped onto ALL memory banks.

(P is the page size and C is the L2 cache size.)

bank 0    bank 1    bank 2    bank 3
0         1P        2P        3P
4P        5P        6P        7P
...       ...       ...       ...
C+1P      C         C+3P      C+2P
C+5P      C+4P      C+7P      C+6P
...       ...       ...       ...
2C+2P     2C+3P     2C        2C+1P
2C+6P     2C+7P     2C+4P     2C+5P
...       ...       ...       ...

Row-buffer Miss Rates

[Figure: row-buffer miss rates (%) under cache-line, page, swap, and permutation-based interleaving.]

Comparison of Memory Stall Time

[Figure: normalized memory stall time (0 to 1.4) under cache-line, page, swap, and permutation-based interleaving.]

Improvement of IPC

[Figure: normalized IPC for tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, wave5, and TPC-C under cache-line, page, swap, and permutation-based interleaving.]

Where to Break the Symmetry?

Breaking the symmetry at the bottom level (the DRAM address) is most effective:

it is far away from the critical path (little overhead), and

it reduces both address conflicts and write-back conflicts.

Our experiments confirm this (30% difference).

System Software Effort: Efficient Buffer Cache Replacement

The buffer cache borrows a variable amount of space in DRAM.

Accessing I/O data in the buffer cache is about a million times faster than accessing it on disk.

Performance of data intensive applications relies on exploiting locality of buffer cache.

Buffer cache replacement is a key factor.

Locality Exploitation in Buffer Cache

[Figure: the memory hierarchy diagram again, this time highlighting the buffer cache kept in DRAM.]

The Problem of LRU Replacement

File scanning: one-time accessed blocks are not replaced promptly;

Loop-like accesses: blocks to be accessed soonest can be unfortunately replaced;

Accesses with distinct frequencies: Frequently accessed blocks can be unfortunately replaced.

Inability to cope with weak access locality

Why LRU Fails yet Remains Popular

• Why does LRU sometimes fail?

• A recently used block will not necessarily be used again, or used again soon.

• The prediction is based on a single source of information.

• Why is it so widely used?

• Simplicity: an easy and simple data structure.

• It works well for accesses that follow the LRU assumption.

Our Objectives and Contributions

Significant efforts have been made to improve or replace LRU, but they are either

• case by case, or

• high in runtime overhead.

Our objectives:

• Address the limits of LRU fundamentally.

• Retain the low overhead and strong locality merits of LRU.

Related Work

Aided by user-level hints: application-hinted caching and prefetching [OSDI, SOSP, ...] rely on users' understanding of data access patterns.

Detection of and adaptation to access regularities: SEQ, EELRU, DEAR, AFC, UBM [OSDI, SIGMETRICS, ...] are case-by-case approaches.

Tracing and utilizing deeper history information: LRFU, LRU-k, 2Q [VLDB, SIGMETRICS, SIGMOD, ...] have high implementation cost and runtime overhead.

Observation of Data Flow in LRU Stack

• Blocks are ordered by recency in the LRU stack.

• Blocks enter the stack top, and leave from its bottom.

A block evicted from the bottom of the stack should have been evicted much earlier!

[Figure: blocks ordered by recency in a long LRU stack; the bottom is the only exit.]

Inter-Reference Recency (IRR)

IRR of a block: number of other unique blocks accessed between two consecutive references to the block.

Recency: the number of other unique blocks accessed from the last reference to a block up to the current time.

Example access sequence: 1 2 3 4 3 1 5 6 5. For block 1, IRR = 3 (blocks 2, 3, and 4 are accessed between its two references) and R = 2 (blocks 5 and 6 have been accessed since its last reference).
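A small sketch that computes IRR and recency for one block (the sequence and the values are the example above):

    def irr_and_recency(trace, block):
        refs = [i for i, b in enumerate(trace) if b == block]
        # IRR: unique other blocks between the last two references to `block`
        irr = (len(set(trace[refs[-2] + 1:refs[-1]]) - {block})
               if len(refs) >= 2 else float("inf"))
        # Recency: unique other blocks accessed since the last reference
        recency = len(set(trace[refs[-1] + 1:]) - {block})
        return irr, recency

    print(irr_and_recency([1, 2, 3, 4, 3, 1, 5, 6, 5], 1))   # -> (3, 2)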

Basic Ideas of LIRS

A block with a high IRR is unlikely to be frequently used; high-IRR blocks are selected for replacement.

Recency is used as a second source of information. LIRS: the Low Inter-reference Recency Set algorithm.

Keep low-IRR blocks in the buffer cache.

Foundations of LIRS: effectively use multiple sources of access information, responsively determine and change the status of each block, and keep the implementation cost low.

Data Structure: Keep LIR Blocks in Cache Low IRR (LIR) block and High IRR (HIR) block

[Figure: the physical cache of size L is divided into an LIR block set of size Llirs and an HIR block set of size Lhirs, with L = Llirs + Lhirs.]

Replacement Operations of LIRS

Llirs = 2, Lhirs = 1

State at virtual time 9 (the original slide marks each block's references over virtual times 1-10 with an X):

Block   R   IRR
A       1   1
B       3   1
C       4   inf
D       2   3
E       0   inf

LIR block set = {A, B}, HIR block set = {C, D, E}.

E becomes a resident HIR block, determined by its low recency.

D is referenced at time 10.

Which block is replaced? Replace an HIR block: the resident HIR block E is replaced!

State after D is referenced at time 10:

Block   R   IRR
A       2   1
B       3   1
C       4   inf
D       0   2
E       1   inf

How is the LIR set updated? The LIR block recency is used.

HIR is a natural place for the re-referenced D, but that is not insightful. After D is referenced at time 10, D enters the LIR set and B steps down to the HIR set, because D's new IRR (2, equal to its old recency) is smaller than Rmax, the maximum recency (3, that of B) among the LIR blocks.

The Power of LIRS Replacement

File scanning: one-time accessed blocks will be replaced promptly (due to their high IRRs);

Loop-like accesses: blocks to be accessed soonest will NOT be replaced; (due to their low IRRs)

Accesses with distinct frequencies: Frequently accessed blocks will NOT be replaced. (dynamic status changes)

Capability to cope with weak access locality

LIRS Efficiency: O(1)

Can O(LIRS) = O(LRU)? Yes! The decision hinges on comparing IRR_HIR (the new IRR of an HIR block) with Rmax (the maximum recency of the LIR blocks), and this efficiency is achieved by our LIRS stack:

• Both recencies and useful IRRs are automatically recorded in the stack.

• Rmax, the recency of the LIR block at the stack bottom, is larger than the IRRs of the other blocks in the stack.

• No explicit comparison operations are needed.

LIRS Operations

[Figure: an example configuration with cache size L = 5, Llir = 3, and Lhir = 2; the LIRS stack S holds blocks 5, 3, 2, 1, 6, 9, 4, 8 (LIR blocks plus resident and non-resident HIR blocks), and a small LRU stack Q holds the resident HIR blocks 5 and 3.]

• Initialization: all referenced blocks are given LIR status until the LIR block set is full.

We place resident HIR blocks in a small LRU Stack.

• Upon accessing an LIR block (a hit)

• Upon accessing a resident HIR block (a hit)

• Upon accessing a non-resident HIR block (a miss)

Access an LIR block (a Hit)

[Figure: stack states of S and Q before and after accessing the LIR blocks 4 and 8 (hits): each accessed block moves to the top of the LIRS stack S; Q is unchanged.]

Access an HIR Resident Block (a Hit)

[Figure: stack states before and after accessing the resident HIR blocks 3 and 5 (hits): the accessed block moves to the top of S and, since it was still in S, is promoted to LIR status; the LIR block at the bottom of S is demoted to a resident HIR block and moved into Q, and S is pruned.]

Access a Non-Resident HIR Block (a Miss)

[Figure: stack states before and after accessing block 7 (a miss): the resident HIR block at the front of Q is evicted to free a buffer, and 7 is loaded, pushed onto the top of S as an HIR block, and placed at the end of Q.]

Access a Non-Resident HIR block (a Miss) (Cont)

[Figure: continued miss examples, accessing blocks 9 and then 5: when the missed block still has an entry in S, it is promoted to LIR status, the LIR block at the bottom of S is demoted into Q, S is pruned, and the block at the front of Q is evicted to free a buffer.]

LIRS Stack Simplifies Replacement

Recency is ordered in the stack, with the Rmax LIR block at the bottom.

There is no need to keep track of each HIR block's IRR, because

a newly accessed HIR block that is still in the stack has a new IRR equal to its recency, which is necessarily smaller than Rmax.

A small LRU stack is used to store resident HIR blocks.

Additional operations of pruning and demoting are constant.

Although LIRS operations are much more dynamic than LRU, its complexity is identical to LRU.
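A simplified simulator sketch of these operations (not the authors' implementation; warm-up and the bounding of the stack size are handled loosely, and the names are illustrative):

    from collections import OrderedDict

    class LIRS:
        """Sketch of LIRS replacement: stack S orders blocks by recency,
        queue Q holds the resident HIR blocks (front = eviction candidate)."""

        def __init__(self, llirs, lhirs):
            self.llirs, self.lhirs = llirs, lhirs
            self.S = OrderedDict()   # block -> 'LIR' or 'HIR'; last item = stack top
            self.Q = OrderedDict()   # resident HIR blocks
            self.resident = set()
            self.nlir = 0

        def _prune(self):
            # keep an LIR block at the stack bottom
            while self.S and next(iter(self.S.values())) != 'LIR':
                self.S.popitem(last=False)

        def _demote_bottom_lir(self):
            blk, _ = self.S.popitem(last=False)   # bottom block is LIR by invariant
            self.nlir -= 1
            self.Q[blk] = True                    # it becomes a resident HIR block
            self._prune()

        def access(self, b):
            hit = b in self.resident
            if not hit:
                if self.nlir < self.llirs:        # cold start: fill the LIR set first
                    self.S[b] = 'LIR'
                    self.resident.add(b)
                    self.nlir += 1
                    return hit
                if len(self.resident) >= self.llirs + self.lhirs:
                    victim, _ = self.Q.popitem(last=False)   # evict oldest resident HIR
                    self.resident.discard(victim)
                self.resident.add(b)
            if self.S.get(b) == 'LIR':            # LIR hit: move to top and prune
                self.S.move_to_end(b)
                self._prune()
                return hit
            was_in_stack = b in self.S            # its recency is still tracked in S
            self.S[b] = 'HIR'
            self.S.move_to_end(b)                 # new top of the stack
            self.Q.pop(b, None)
            if was_in_stack:                      # recency < Rmax: low IRR, promote
                self.S[b] = 'LIR'
                self.nlir += 1
                self._demote_bottom_lir()
            else:
                self.Q[b] = True                  # stays HIR, re-enters Q at the end
            return hit

For example, replaying the IRR example trace: cache = LIRS(llirs=2, lhirs=1); hits = [cache.access(b) for b in [1, 2, 3, 4, 3, 1, 5, 6, 5]].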

Performance Evaluation

Trace-driven simulations on different access patterns show that

LIRS outperforms existing replacement algorithms in almost all the cases.

The performance of LIRS is not sensitive to its only parameter Lhirs.

Performance is not affected even when LIRS stack size is bounded.

The time/space overhead is as low as LRU. LRU can be regarded as a special case of LIRS.

Selected Workload Traces

• 2-pools is a synthetic trace to simulate the distinct frequency case.

• cpp is a GNU C compiler pre-processor trace

• cs is an interactive C source program examination tool trace.

• glimpse is a text information retrieval utility trace.

• link is a UNIX link-editor trace.

• postgres is a trace of join queries among four relations in a relational database system

• sprite is from the Sprite network file system

• multi1: by executing 2 workloads, cs and cpp, together.

• multi2: by executing 3 workloads, cs, cpp, and postgres, together.

• multi3: by executing 4 workloads, cpp, gnuplot, glimpse, and postgres, together

(1) various patterns, (2) non-regular accesses, (3) large traces.

Looping Pattern: postgres (Time-space map)

Looping Pattern: postgres (Hit Rates)

Potential Impact of LIRS

An LIRS patent has been filed and is pending approval.

It has been positively evaluated by IBM Almaden Research.

There is a potential adoption by LaserFiche for digital libraries.

The trace-driven simulation package has been distributed to many universities for research and classroom teaching.

Conclusion

Locality-aware research is long term and multidisciplinary.

Application software support
+: optimization is effective for architecture-dependent libraries.
-: cache optimization only, and case by case.

Hardware support
+: touches fundamental problems, such as address symmetry.
-: the optimization space is very limited due to cost considerations.

System software support
+: a key for locality optimization of I/O and virtual memory.
-: lacks application knowledge, and requires kernel modifications.

Selected References

Application software for cache optimization: cache-effective sorting, ACM Journal of Experimental Algorithmics, 2000; fast bit-reversals, SIAM Journal on Scientific Computing, 2001.

Fast and high-associativity cache designs: multicolumn caches, IEEE Micro, 1997; low-power caches, IEEE Micro, 2002.

Hardware support for DRAM locality exploitation: permutation-based page interleaving, MICRO-33, 2000; fine-grain memory access scheduling, HPCA-8, 2002.

System software support for buffer cache optimization: LIRS replacement, SIGMETRICS 2002; TPF systems, Software: Practice & Experience, 2002.

