CS671 Parallel Programming in the Many-Core Era
Lecture 4: Introduction to Locality Theory and Practice
Zheng Zhang
Rutgers University
Review: Memory Wall
‣ The processor-memory performance gap
Memory Hierarchy
‣ Hierarchical memory
* L1, L2, L3 cache
* scratch-pad, off-chip memory, disk cache ...
* automatic placement and replacement
* separation of concerns: data usage vs. coherence management
‣ Trading space for time
* the faster the access, the smaller the data capacity
‣ Software solution
* exploit locality -- temporal and/or spatial
* transform computation order or data layout
* compilers, runtime, performance tuning tools
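As a rough illustration of "transform computation order", the sketch below builds the sequence of cache blocks touched when a row-major 2D array is traversed in two loop orders, then counts how often consecutive accesses land in different blocks. The names `block_trace` and `block_switches` and the 8-elements-per-block figure are assumptions made for this example, not part of the lecture.

```python
def block_trace(rows, cols, order, block_elems=8):
    """Cache-block ids touched when traversing a row-major rows x cols array.

    order="ij" walks row by row; order="ji" walks column by column.
    block_elems (elements per cache block) is an illustrative assumption.
    """
    if order == "ij":
        indices = [(i, j) for i in range(rows) for j in range(cols)]
    elif order == "ji":
        indices = [(i, j) for j in range(cols) for i in range(rows)]
    else:
        raise ValueError("order must be 'ij' or 'ji'")
    # Row-major layout: element (i, j) lives at linear offset i * cols + j.
    return [(i * cols + j) // block_elems for i, j in indices]


def block_switches(trace):
    """Number of consecutive access pairs that touch different blocks."""
    return sum(1 for a, b in zip(trace, trace[1:]) if a != b)


row_order = block_trace(64, 64, "ij")
col_order = block_trace(64, 64, "ji")
print(block_switches(row_order), block_switches(col_order))
```

The row-by-row order stays inside one block for several accesses in a row, while the column-by-column order changes block on every access. Interchanging the loops is one way to recover spatial locality without changing the data layout.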
The Story of the Locality Theory
‣ Started as an empirical observation
“During any interval of execution, a program favors a subset of its pages, and this set of favored pages changes slowly” -- Peter Denning
‣ How to quantify?
* the performance of a machine
* the demand of a program
* the locality of an operation
* is there a “primary” metric?
‣ Two example quantities
* reuse time & footprint
Locality Statistics
‣ Miss Ratio
‣ Reuse Distance
‣ Footprint
Cache Miss Ratio
‣ Cache performance of the integer portion of SPEC CPU2000
Reuse Distance
‣ Reuse distance of an access to datum d
* the number of distinct data accessed between this access and the previous access to d
‣ Locality signature of an execution
* the distribution of all finite reuse distances
* determines working set size and miss rate of caches of all sizes
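To make the link between the reuse distance distribution and the miss rate concrete, here is a minimal sketch that turns a list of per-access reuse distances into a miss ratio curve. It assumes a fully associative LRU cache and the convention that the distance counts the distinct data between two accesses to the same datum; the function name `miss_ratio_curve` is hypothetical.

```python
from collections import Counter


def miss_ratio_curve(distances, max_size):
    """Miss ratio of a fully associative LRU cache for sizes 1..max_size.

    distances: one entry per access; None marks a first access (infinite
    reuse distance). Under the convention assumed here, an access hits in
    a cache of c blocks exactly when its reuse distance is < c.
    """
    n = len(distances)
    if n == 0:
        return {}
    hist = Counter(d for d in distances if d is not None)
    curve = {}
    hits = 0
    for c in range(1, max_size + 1):
        hits += hist.get(c - 1, 0)  # distances 0..c-1 are hits at size c
        curve[c] = (n - hits) / n
    return curve


# Reuse distances of the trace "abcabc": three first accesses, then 2, 2, 2.
curve = miss_ratio_curve([None, None, None, 2, 2, 2], max_size=4)
print(curve)
```

One pass over the histogram gives the miss ratio for every cache size at once, which is why the distribution can serve as a signature of the execution.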
Reuse Distance Calculation I
‣ Naive counting: O(N) time per access, O(N) space
* N is the number of memory accesses
* M is the number of distinct data elements
‣ Too costly: N up to 120 billion, M up to 25 million
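Naive counting can be sketched as a backward scan from each access to the previous access of the same datum. This is only an illustration of why the cost is O(N) per access; the name `reuse_distances_naive` is an assumption, and first accesses are reported as `None` (infinite distance).

```python
def reuse_distances_naive(trace):
    """Reuse distance of every access, by scanning the trace backwards.

    Keeps the whole trace in memory (O(N) space) and may scan all earlier
    accesses for each access (O(N) time per access).
    """
    distances = []
    for t, datum in enumerate(trace):
        seen = set()
        distance = None  # stays None if datum was never accessed before
        for k in range(t - 1, -1, -1):
            if trace[k] == datum:
                distance = len(seen)  # distinct data in between
                break
            seen.add(trace[k])
        distances.append(distance)
    return distances


print(reuse_distances_naive("abcabc"))
```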
Reuse Distance Calculation II
‣ Stack algorithm [Mattson+ IBM 70]
* O(M) time per access, O(M) space
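The idea of the stack algorithm can be sketched with an LRU stack that holds each distinct datum once, most recently used first. The depth at which a datum is found is its reuse distance. This is a simplified sketch in the spirit of the cited algorithm, not a reproduction of it; `reuse_distances_stack` is a hypothetical name.

```python
def reuse_distances_stack(trace):
    """Reuse distances via an LRU stack (most recently used at index 0).

    The stack holds at most M distinct data, so the linear search costs
    O(M) time per access and the stack uses O(M) space.
    """
    stack = []
    distances = []
    for datum in trace:
        try:
            depth = stack.index(datum)  # data above it were used more recently
        except ValueError:
            depth = None  # first access: infinite reuse distance
        else:
            stack.pop(depth)
        distances.append(depth)
        stack.insert(0, datum)  # datum becomes the most recently used
    return distances


print(reuse_distances_stack("abcabc"))
```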
Reuse Distance Calculation III
‣ Tree-based algorithm -- search tree [Olken LBL 81, Sugumar&Abraham UM 93]
* O(log M) time per access, O(M) space
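A logarithmic-time variant can be sketched with a Fenwick (binary indexed) tree in place of the search tree used in the cited work. Note the substitution: indexing the tree by access time makes this version O(log N) time per access and O(N) space, not the O(log M) and O(M) bounds quoted above. The function name is an assumption.

```python
def reuse_distances_tree(trace):
    """Reuse distances using a Fenwick tree over access times.

    The tree marks, for each datum, the time of its most recent access.
    The reuse distance is the number of marks strictly between the
    previous and the current access to the same datum.
    """
    n = len(trace)
    tree = [0] * (n + 1)  # 1-based Fenwick tree

    def add(pos, delta):
        while pos <= n:
            tree[pos] += delta
            pos += pos & -pos

    def prefix(pos):
        total = 0
        while pos > 0:
            total += tree[pos]
            pos -= pos & -pos
        return total

    last = {}
    distances = []
    for t, datum in enumerate(trace, start=1):
        prev = last.get(datum)
        if prev is None:
            distances.append(None)  # first access
        else:
            distances.append(prefix(t - 1) - prefix(prev))
            add(prev, -1)  # the old access is no longer the latest one
        add(t, 1)
        last[datum] = t
    return distances


print(reuse_distances_tree("abcabc"))
```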
Reuse Distance Calculation III
• Stack algorithm [Mattson+ IBM 70]: O(M) time per access, O(M) space
• Search tree [Olken LBL 81, Sugumar&Abraham UM 93]: O(log M) time per access, O(M) space
• Space cost remains a major problem
• [Ding+ PLDI’03/TOPLAS’09]: O(N log log M) time and O(log M) space
Footprint
‣ Amount of data accessed in an execution period
‣ fp(w): average footprint of ALL windows of length w
* a length-n trace has O(n^2) windows
* 1 billion accesses, half a quintillion windows
‣ Example: “abbb”
* 3 length-2 windows: “ab”, “bb”, “bb”
* footprints 2, 1, 1
* the average fp(2) = (2 + 1 + 1)/3 = 4/3
‣ Example: “xyz xyz”
* fp(i) = i for 0 <= i <= 3
* fp(i) = 3 for i > 3
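The definition of fp(w) can be checked directly by enumerating every window, as in the brute-force sketch below. The name `average_footprint` is hypothetical, the space in “xyz xyz” is treated as a visual separator and not as an access, and exact fractions are used so the result can be compared with the hand calculation.

```python
from fractions import Fraction


def average_footprint(trace, w):
    """Average number of distinct data over all windows of length w."""
    n = len(trace)
    if not 0 <= w <= n:
        raise ValueError("window length must be between 0 and len(trace)")
    windows = n - w + 1
    total = sum(len(set(trace[s:s + w])) for s in range(windows))
    return Fraction(total, windows)


print(average_footprint("abbb", 2))
print([average_footprint("xyzxyz", w) for w in range(7)])
```

Each call visits every window of one length, so computing fp for all lengths this way touches all O(n^2) windows. That is fine for short traces and hopeless for a billion accesses.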
Reuse Time? [Xiang+ ASPLOS’13]
Footprint Measurement
‣ Working set
* limit value in an infinitely long trace [Denning & Schwartz 1972]
‣ Direct counting
* single window size [Thiebaut & Stone TOCS’87]
* seminal paper on footprints in shared cache
‣ Statistical approximation
* [Denning & Schwartz 1972; Suh et al. ICS’01; Berg & Hagersten PASS’04; Chandra et al. HPCA’05; Shen et al. POPL’07]
‣ Precise definition/solution
* footprint distribution, O(n log m) [Xiang et al. PPoPP’11]
* footprint function, O(n) [Xiang et al. PACT’11]
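One way to see why a linear-time solution is possible: count, for each datum, the windows that do NOT contain it, then subtract from the total instead of enumerating windows. A stretch of g consecutive accesses without the datum contains max(0, g - w + 1) windows of length w. The sketch below follows that reasoning in a single pass over the trace for a given w. It is an illustration under these assumptions and is not claimed to be the exact formulation of the cited papers; `footprint_by_gaps` is a hypothetical name.

```python
from fractions import Fraction


def footprint_by_gaps(trace, w):
    """Average footprint fp(w), computed from per-datum absence gaps."""
    n = len(trace)
    if not 1 <= w <= n:
        raise ValueError("window length must be between 1 and len(trace)")
    first, last = {}, {}
    absent = 0  # total (datum, window) pairs where the window misses the datum
    for t, datum in enumerate(trace, start=1):
        if datum in last:
            reuse_time = t - last[datum]     # gap of reuse_time - 1 accesses
            absent += max(0, reuse_time - w)
        else:
            first[datum] = t
        last[datum] = t
    for datum in first:
        absent += max(0, first[datum] - w)            # gap before first access
        absent += max(0, (n + 1 - last[datum]) - w)   # gap after last access
    m = len(first)  # number of distinct data
    return m - Fraction(absent, n - w + 1)


print(footprint_by_gaps("abbb", 2))
```

On “abbb” this gives the same 4/3 as the hand calculation, using only first-access times, last-access times, and reuse times.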