Hardware Modeling 2: Cache Analyses
Peter Puschner
Slide credits: P. Puschner, R. Kirner, B. Huber
VU 2.0 182.101, SS 2015
Recap: Caches in WCET Analysis
Purpose: bridge the gap between fast CPU and slow memory. Analyzing caches is essential on many architectures (example: 40 cycles for a miss on the MPC755).
What is cached: instructions, data, BTB, TLB
Design: direct mapped, set/fully associative
Replacement policy: LRU, FIFO, PLRU, PRR
More characteristics: read-only / write-through / write-back, write (no) allocate, multi-level caches (inclusive/exclusive), ...
Caches in WCET Analysis
For software running on hardware with caches, computing the WCET by IPET alone (CFG + CCG) gets too complex.
Ignoring caches leads to unacceptable overestimations.
⇒ Decompose the WCET analysis into 2+ phases:
1. Categorization of memory accesses w.r.t. cache behavior (e.g., always hit, always miss, etc.); the low-level analysis uses this cache categorization.
2. WCET computation: IPET with no or a simplified cache model
Categories of Cache Behavior
• ah (always hit): each access to the cache is a hit (MUST analysis)
• am (always miss): each access to the cache is a miss (complement of the MAY analysis)
• ps(S) (persistent): on each entry of context S, the first access is nc, but all other accesses are hits (PERSISTENCE analysis)
• nc (not classified): the access is not classified as one of the above categorizations
Direct Mapped Cache
[Figure: a direct-mapped cache with m lines. Each line holds a valid bit (v), a tag, and data (k bytes w1 ... wk). An address is split into tag, line index (ld(m) bits), and word offset (ld(k) bits); the line index directly selects the line.]
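The address decomposition above can be sketched as follows; num_lines and line_size are hypothetical example parameters (both powers of two), not values from the slides:

```python
def split_address(addr, num_lines=4, line_size=4):
    """Split a byte address into (tag, line, offset) for a direct-mapped
    cache with num_lines lines of line_size bytes each (powers of two)."""
    offset = addr % line_size                # the ld(k) offset bits
    line = (addr // line_size) % num_lines   # the ld(m) line-index bits
    tag = addr // (line_size * num_lines)    # the remaining address bits
    return tag, line, offset
```

For example, with 4 lines of 4 bytes, address 21 (binary 10101) splits into tag 1, line 1, offset 1.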
DM-$ Analysis Example
Compiled from, e.g.:
  x, y, z = a, b, 0
  while (x > 0 && y > 0) { z += x-- + y-- }
  x, y = 0, 0
[Figure: CFG from START to END; each instruction is labeled with its (tag, line, offset) address triple: (0,0,0), (0,0,1), (0,1,0), (0,1,1), (0,2,0), (0,2,1), (0,3,0), (0,3,1), and (1,0,0).]
DM-$ Analysis Example (continued)
[Figure: the same CFG annotated with the analysis results. The access (1,0,0) after the loop is always miss, since its line conflicts with the (0,0,x) accesses. Continuing with the 2nd loop iteration, the loop-body accesses (0,1,0), (0,1,1), (0,2,0), (0,2,1), (0,3,0), (0,3,1) are always hit in iterations 2..n.]
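The classifications in this example can be reproduced by simulating the direct-mapped cache at block granularity. This is a sketch: it tracks only the tag per line, ignores offsets, and starts from an empty cache:

```python
def simulate_direct_mapped(trace, num_lines=4):
    """Simulate a direct-mapped cache on a trace of (tag, line, offset)
    triples and classify each access; the offset is irrelevant because
    all words of one block are fetched together."""
    lines = [None] * num_lines        # tag stored in each line (None = invalid)
    result = []
    for tag, line, _ in trace:
        if lines[line] == tag:
            result.append("hit")
        else:
            result.append("miss")     # fetch the block on a miss
            lines[line] = tag
    return result

# First loop iteration of the example: each block misses on its first word
# and hits on its second; the final access (1,0,0) evicts block (0,0).
trace = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (0, 2, 0),
         (0, 2, 1), (0, 3, 0), (0, 3, 1), (1, 0, 0)]
```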
Cache Classification (Hit/Miss)
Goal: a mechanized analysis that classifies each cache access in a certain context (e.g., call context) as either
• Always hit: in all possible executions, this access to the cache will be a cache hit (the accessed cache block is guaranteed to be in the cache)
• Always miss: in all possible executions, this access to the cache will be a cache miss (the accessed cache block is guaranteed NOT to be in the cache)
• Not classified: the accessed cache block may or may not be in the cache
Automated Categorization of Memory Accesses
• Based on abstract interpretation and a fixed-point analysis of the cache states in the CFG
• Cache update function: models the changes of the cache state for memory accesses
• Join function: combines states at control-flow joins
• Concrete semantics: the set of possible cache configurations (tags only, no data) at each program point
• Abstract semantics: an efficient approximation in an abstract, "more efficient" domain
Data-Flow Analysis (DFA)
DFA is based on the data-flow structure of the system behavior of interest (e.g., forward or backward propagation).
• PRED(n) are the virtual predecessors of CFG node n with regard to the data flow of interest (cache analysis: usually the CFG predecessors)
• The data domain L of the analysis forms a lattice, on which the transfer function Fn: L → L models the semantics of the system behavior of interest
• To merge two or more states, a join function ⊔: L × L → L is used to compute the least upper bound
Data-Flow Analysis (2)
Data-flow equations modeling the data flow between nodes:
  IN(n) = ⊔ ( { OUT(j) | j ∈ PRED(n) } )
  OUT(n) = Fn ( IN(n) )
[Figure: node n receives IN(n) as the join of its predecessors' outputs and produces OUT(n) = Fn(IN(n)).]
Data-Flow Analysis (3)
Monotonicity requirements for solving the data-flow equations iteratively:
• The transfer functions Fn as well as the join function s1 ⊔ s2 must be monotone to ensure termination of the analysis.
Monotonicity: a function f: A → B is monotone iff
  ∀a, a' ∈ A. (a ⊆_A a') → (f(a) ⊆_B f(a'))
Data-Flow Analysis (4)
Iterative algorithm to find the least fixpoint of the data-flow equations:

  for i ← 1 to N do          /* initialize node i: */
      OUT(i) = ⊥
  while (sets are still changing) do
      for i ← 1 to N do      /* recompute sets at node i: */
          IN(i)  = ⊔ ( { OUT(j) | j ∈ PRED(i) } )
          OUT(i) = Fi ( IN(i) )
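The iteration above can be sketched generically; nodes, pred, transfer, join, and bottom are placeholders supplied by the caller, not names from the slides:

```python
def solve_dataflow(nodes, pred, transfer, join, bottom):
    """Least-fixpoint iteration: IN(i) = join of OUT(j) for j in pred[i],
    OUT(i) = transfer[i](IN(i)); terminates when transfer and join are
    monotone on a lattice of finite height."""
    out = {i: bottom for i in nodes}      # initialize OUT(i) = bottom
    changed = True
    while changed:                        # iterate until stable
        changed = False
        for i in nodes:
            in_i = bottom
            for j in pred[i]:             # merge predecessor states
                in_i = join(in_i, out[j])
            new_out = transfer[i](in_i)
            if new_out != out[i]:
                out[i] = new_out
                changed = True
    return out
```

For instance, with sets as the lattice, union as join, and transfer functions that each add one element, the loop converges to the set of elements reaching each node.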
Concrete & Abstract Semantics
Concrete cache semantics: models the semantics of the relevant aspects of the program (here: cache state & update). The concrete semantics collects the set of all possible cache states for each program point.
Abstract cache semantics: semantics in a different, usually finite domain, connected to the concrete semantics by an abstraction/concretization function.
N-way Set-Associative Cache
[Figure: a cache with m sets of n blocks (ways) each. A block (line) holds a valid bit (v), a tag, and data (k bytes w1 ... wk). An address is split into tag, set index (ld(m) bits), and word offset (ld(k) bits); the set index selects the set, and the replacement strategy updates the blocks within one set.]
Fully-Associative Cache (Associativity N)
[Figure: a single set of N ways. A line holds a valid bit (v), a tag, and data (k bytes w1 ... wk); an address is split into tag and word offset only. The cache is updated based on the value of the tag; the replacement policy determines the update strategy. For LRU and FIFO, way 1 holds the youngest block and way N the oldest, which is evicted on a miss.]
Concrete Cache Semantics (Fully Associative Cache)
• Cache configuration: a mapping from cache lines to tags S (the data is irrelevant)
• Domain: for each program point, the set of all possible cache states
• State at the start node: the singleton set with the empty cache, or the set of all possible cache configurations
• Update: for a cache configuration C and cache reference S, the new cache configuration C' after accessing S
Concrete LRU Update (Fully Associative Cache)
Update function for a 4-way cache (1 line per way) with LRU:
[Figure: accessing c in the cache [a, b, c, d] (youngest first) is a HIT and yields [c, a, b, d]; accessing e in [a, b, c, d] is a MISS and yields [e, a, b, c], evicting d.]
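The concrete update can be written down directly; this sketch represents a fully associative set as a list ordered youngest first:

```python
def lru_update(cache, tag):
    """Concrete LRU update: return (new_cache, hit). On a hit the tag
    moves to the front; on a miss it is inserted in front and the
    oldest entry (last position) is evicted."""
    n = len(cache)
    if tag in cache:
        return [tag] + [t for t in cache if t != tag], True
    return ([tag] + cache)[:n], False
```

This reproduces the figure: accessing c in [a, b, c, d] hits and yields [c, a, b, d]; accessing e misses and yields [e, a, b, c].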
Abstract Cache Semantics for MUST / MAY Analysis
• Abstract cache configuration: a compact representation of a set of cache configurations.
  MUST: for each tag S, the maximum age. MAY: for each tag S, the minimum age.
• Join:
  MUST: for each tag S, the maximum age. MAY: for each tag S, the minimum age.
• Update (LRU): the accessed tag is set youngest.
  MUST: other tags increase their age if they may be aged. MAY: other tags increase their age if they must be aged.
Abstract Cache Representation
MUST analysis (4-way example): age bounds a ≤ 1, b ≤ 3, c ≤ 4, d,e ≤ 5+, or, equivalently, as age-indexed sets [ {a}, {}, {b}, {c} ]. Top element: ⊤ = ∀x, x ≤ N+1.
MAY analysis: age bounds a ≥ 2, b ≥ 4, c ≥ 5, d,e ≥ 1, or, equivalently, [ {d,e}, {}, {a}, {b} ]. Top element: ⊤ = ∀x, x ≥ 1.
Abstract Cache Semantics (MUST Concretization)
The abstract MUST state a ≤ 1, b ≤ 3, c ≤ 4, d,e ≤ 5+ (sets [ {a}, {}, {b}, {c} ]) concretizes to the concrete 4-way LRU caches (youngest first) that satisfy all age bounds:
[a,b,c,d], [a,b,c,e], [a,b,d,c], [a,b,e,c], [a,c,b,d], [a,c,b,e], [a,d,b,c], [a,e,b,c]
Abstract Cache Semantics (MUST Join)
Joining the MUST state a ≤ 1, b ≤ 3, c ≤ 4, d,e ≤ 5+ (sets [ {a}, {}, {b}, {c} ]) with a ≤ 2, c ≤ 4, d ≤ 4, b,e ≤ 5+ (sets [ {}, {a}, {}, {c,d} ]) takes, for each tag, the maximum of the two age bounds:
a ≤ 2, c ≤ 4, b,d,e ≤ 5+ (sets [ {}, {a}, {}, {c} ])
Abstract Cache Update Function (LRU Cache, MUST Analysis)
When accessing block c:
  max-age'(c) = 1
  max-age(d) ≥ max-age(c) → max-age'(d) = max-age(d)
  max-age(d) < max-age(c) → max-age'(d) = max-age(d) + 1
Abstract Cache Update Function (LRU Cache, MUST Analysis)
When accessing block c: max-age'(c) = 1
Case max-age(d) ≥ max-age(c), so max-age'(d) = max-age(d):
  1. assume age(d) < age(c): then age'(d) = age(d) + 1 ≤ age(c) ≤ max-age(c) ≤ max-age(d), i.e. max-age(d) ≥ age(d) + 1
  2. assume age(d) > age(c): then age'(d) = age(d)
Case max-age(d) < max-age(c), so max-age'(d) = max-age(d) + 1:
  1. if age(d) < age(c), then age'(d) = age(d) + 1 ≤ max-age(d) + 1
  2. if age(d) > age(c), then age'(d) = age(d) ≤ max-age(d) + 1
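A sketch of the MUST update and join in code; the dictionary representation (tag → upper age bound, with tags whose bound exceeds N omitted) is an assumption of this sketch, not notation from the slides:

```python
N = 4  # associativity of the example cache

def must_update(state, c):
    """MUST-analysis LRU update: accessing c sets its bound to 1; another
    tag d ages only if its bound is below c's old bound (it may be
    younger than c), and it is dropped once its bound exceeds N."""
    bound_c = state.get(c, N + 1)
    new = {}
    for d, b in state.items():
        if d != c:
            nb = b if b >= bound_c else b + 1
            if nb <= N:
                new[d] = nb
    new[c] = 1
    return new

def must_join(s1, s2):
    """MUST join: keep only tags known in both states, each with the
    maximum (least precise) of the two age bounds."""
    return {t: max(s1[t], s2[t]) for t in s1 if t in s2}
```

With the join example from the slides, {a: 1, b: 3, c: 4} joined with {a: 2, c: 4, d: 4} yields {a: 2, c: 4}.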
Cache Hit/Miss Classification using MUST and MAY Analysis
• If at some program point tag S must be in the cache, i.e., its maximum age is less than or equal to the associativity, then the cache access is classified as ALWAYS HIT.
• If at some program point it is not the case that tag S may be in the cache, i.e., its minimum age is greater than the associativity of the cache, then the cache access is classified as ALWAYS MISS.
• Otherwise, the cache access is NOT CLASSIFIED.
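The classification rule is a direct case distinction; a minimal sketch (the function name and argument shape are illustrative, not from the slides):

```python
def classify(max_age, min_age, associativity):
    """Classify one access from the MUST upper bound (max_age) and the MAY
    lower bound (min_age) on the accessed tag's age; a bound of
    associativity + 1 stands for 'not in the cache'."""
    if max_age <= associativity:
        return "always hit"    # MUST: tag is guaranteed in the cache
    if min_age > associativity:
        return "always miss"   # not even MAY in the cache
    return "not classified"
```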
Abstract Cache Semantics (MAY Concretization)
The abstract MAY state a ≥ 2, b ≥ 4, c ≥ 5, d,e ≥ 1 (sets [ {d,e}, {a}, {}, {b} ]) concretizes to concrete caches (youngest first) such as:
[d,a,e,b], [e,a,d,b], [d,e,a,b], [e,d,a,b]
Abstract Cache Semantics (MAY Join)
Joining the MAY state a ≥ 2, b ≥ 4, c ≥ 5, d,e ≥ 1 (sets [ {d,e}, {a}, {}, {b} ]) with a ≥ 4, b ≥ 5, c ≥ 5, d ≥ 5, e ≥ 2 (sets [ {}, {e}, {}, {a} ]) takes, for each tag, the minimum of the two age bounds:
a ≥ 2, b ≥ 4, c ≥ 5, d ≥ 1, e ≥ 1 (sets [ {d,e}, {a}, {}, {b} ])
Abstract Cache Update Function (LRU Cache, MAY Analysis)
When accessing block c: min-age'(c) = 1
Case min-age(d) ≤ min-age(c), so min-age'(d) = min-age(d) + 1:
  1. if age(d) > age(c) ≥ min-age(d), then age'(d) = age(d) ≥ min-age(d) + 1
  2. assume age(d) < age(c): then age'(d) = age(d) + 1 ≥ min-age(d) + 1
Case min-age(d) > min-age(c): min-age'(d) = min-age(d)
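A sketch of the MAY update and join in code; the dictionary representation (tag → lower age bound, capped at n + 1 meaning "surely not in the cache") is an assumption of this sketch:

```python
def may_update(state, c, n=4):
    """MAY-analysis LRU update for an n-way cache: accessing c sets its
    bound to 1; another tag d ages only if it must be at least as young
    as c (its bound does not exceed c's old bound)."""
    bound_c = state.get(c, n + 1)
    new = {}
    for d, b in state.items():
        if d != c:
            new[d] = min(b + 1, n + 1) if b <= bound_c else b
    new[c] = 1
    return new

def may_join(s1, s2, n=4):
    """MAY join: keep tags that may be cached in either state, each with
    the minimum (least precise) of the two lower bounds."""
    return {t: min(s1.get(t, n + 1), s2.get(t, n + 1))
            for t in set(s1) | set(s2)}
```

The join example from the slides: joining {a: 2, b: 4, c: 5, d: 1, e: 1} with {a: 4, b: 5, c: 5, d: 5, e: 2} yields the first state again.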
Cache Hit/Miss Classification using MUST and MAY Analysis
• If at some program point tag S must be in the cache, i.e., its maximum age is less than or equal to the associativity, then the cache access is classified as ALWAYS HIT.
• If at some program point it is not the case that tag S may be in the cache, i.e., its minimum age is greater than the associativity of the cache, then the cache access is classified as ALWAYS MISS.
• Otherwise, the cache access is NOT CLASSIFIED.
What is the benefit of ALWAYS MISS over NOT CLASSIFIED?
Discussion
Consider a data cache (1-word line size), with the address of odd_even_counter statically known:

  static unsigned odd_even_counter[2];
  ++odd_even_counter[sensor() % 2];
  ++odd_even_counter[sensor() % 2];
  ++odd_even_counter[sensor() % 2];
  ++odd_even_counter[sensor() % 2];
  ++odd_even_counter[sensor() % 2];

Which accesses will be cache misses? How many accesses will be cache hits?
Persistence Analysis
Sometimes we do not know whether one particular access will always be a hit or a miss.
A cache element is said to be persistent (with respect to a program scope S) if, in every execution of the scope, all but the first access are guaranteed to be cache hits.
Data caches benefit from persistence analysis, because the address (which implies the tag) is often not exactly known (e.g., for arrays).
Persistence Analysis
Published persistence analyses until ~2009 were unsound. Only recently have correct persistence analyses been developed (LRU only), published e.g. in (Huynh, Ju, Roychoudhury).
Abstract domain: for each tag, the set of possible younger tags (YS) accessed in the program scope of interest.
If |YS(c)| is less than the associativity of the cache, the element is persistent in the scope (i.e., it is not evicted once loaded).
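The persistence criterion can be sketched as follows. Note this is a deliberately crude, flow-insensitive over-approximation in which every other tag accessed in the scope counts as a possible younger tag; a real analysis computes YS per program point:

```python
def persistent_tags(scope_accesses, associativity):
    """For each tag c accessed in the scope, over-approximate YS(c) by
    all other tags accessed in the scope; c is persistent if
    |YS(c)| < associativity, i.e. c can never be pushed beyond the
    associativity once it has been loaded."""
    tags = set(scope_accesses)
    return {c: len(tags - {c}) < associativity for c in tags}
```

A scope touching three distinct blocks is persistent in a 4-way cache, but not in a 2-way cache.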
DFA-Based Persistence Analysis
Known DFA persistence analyses only work with LRU caches.
Another technique is based on static scopes (LRU, FIFO): if during one execution of a program scope at most N elements are accessed, then all of them are persistent in an N-way cache.
Open problem (for all persistence analyses): how to find good program scopes? Functions and loops are obvious candidates. Which heuristics?
Scope-Based Persistence Analysis
Usually assumes that the address of each accessed element is known, or lies within some small interval (e.g., if an array index is unknown).
Precision can be further improved by analyzing array indices and access patterns.
If an address is unknown, set-associative caches become less effective for the analysis: the access may affect any set. Modularity?
To improve analysis results, cache locking or cache splitting can be used, disabling the cache for "unpredictable accesses".

Data Cache Analysis Remarks
Applying the Cache Categorizations to ILP
In integer linear programming (ILP) we typically calculate the WCET by maximizing Σ xi · ti
• ti … execution time of CFG edge i (constant)
• xi … execution frequency of CFG edge i (to be determined)
The hit and miss counts of the cache are modeled by additional flow variables: xi = xi,h + xi,m
Thus, the updated goal function is Σ xi,h · ti,h + Σ xi,m · ti,m
Applying the Cache Categorizations to ILP (2)
Depending on the cache categorization of the memory reference at edge i, additional flow constraints are added:
• always hit [ah]: xi,m = 0
• always miss [am]: xi,h = 0
• global persistence [gp]: xi,h ≥ xi − 1
• local persistence [ps(S)]: xi,h ≥ xi − (∑ xk | edge k is an entry to context S)
• not classified [nc]: no additional constraints are created
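The effect of these constraints on the objective can be illustrated with hypothetical numbers (hit time 1 cycle, miss time 40 cycles, edge executed 10 times); the worst case puts as much frequency as the constraints allow on the miss variable:

```python
def worst_case_access_cost(x, t_hit, t_miss, category):
    """Maximize x_h*t_hit + x_m*t_miss subject to x_h + x_m = x and the
    flow constraint of the given categorization (t_miss > t_hit, so the
    maximum uses the largest feasible x_m)."""
    if category == "ah":          # x_m = 0
        x_m = 0
    elif category == "am":        # x_h = 0
        x_m = x
    elif category == "gp":        # x_h >= x - 1, hence x_m <= 1
        x_m = min(x, 1)
    else:                         # nc: no extra constraint
        x_m = x
    return (x - x_m) * t_hit + x_m * t_miss

# ah: 10 hits = 10 cycles; gp: 1 miss + 9 hits = 49 cycles;
# am and nc: 10 misses = 400 cycles
```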
Remarks on DFA-Based Cache Modeling
• Persistence analysis is not necessary to distinguish first from subsequent loop iterations. To this end, the CFG is virtually rewritten to separate the first loop iteration from the others (virtual loop unpeeling¹).
• The separation of cache classification and WCET calculation in DFA-based cache analysis scales well compared to the integrated approach, where the cache classification was modeled as a cache conflict graph within the ILP problem.
¹ Sometimes called "virtual loop unrolling"
Remarks on DFA-Based Cache Modeling (2)
• DFA-based cache analysis works quite well for set-associative caches with the LRU (least recently used) replacement strategy:
  – LRU has the nice locality property that the content of one cache line is not affected by memory accesses that map to other cache lines.
• However, to improve hardware performance, often much less predictable replacement strategies are used:
  – ColdFire MCF 5307: pseudo-round-robin replacement
  – PowerPC 750/755: pseudo-LRU replacement
Remarks on DFA-Based Cache Modeling (3)
Pseudo-LRU (PLRU): the cache lines are the leaves of a tree with a path bit at each inner node. The replacement line is determined by following, from the root, the path indicated by the path bits. On each regular access, the path bits along the access path are set to the other direction.
[Figure: a binary tree of path bits b0 (root), b1, b2, and b3 to b6, whose 8 leaves are the cache lines L0 to L7.]
The average performance of PRR and PLRU is similar to LRU, but their predictability is much worse! Analysis results with PLRU:
• MAY analysis does not yield any information at all (starting with an unknown cache, no block is ever found to be removed).
• MUST analysis provides some information (but less than for LRU): at most 4 blocks are found per cache set (out of the 8 blocks present in practice).
Still ongoing research (WCET 2010).
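A concrete simulation of tree-PLRU for one 8-way set may help; the bit-orientation convention (bit 0 = victim in the left subtree) and the class layout are assumptions of this sketch, as the slides do not fix them:

```python
class PLRUSet:
    """One 8-way tree-PLRU cache set: 7 path bits form a binary tree
    (node i has children 2i+1 and 2i+2) whose leaves are the 8 ways."""

    def __init__(self):
        self.bits = [0] * 7        # path bits b0..b6, 0 = victim on the left
        self.tags = [None] * 8     # tag currently stored in each way

    def _touch(self, way):
        """Flip the path bits along the accessed way's path to point away."""
        node, lo, hi = 0, 0, 8
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if way < mid:                  # accessed way is in left subtree
                self.bits[node] = 1        # point the victim to the right
                node, hi = 2 * node + 1, mid
            else:
                self.bits[node] = 0        # point the victim to the left
                node, lo = 2 * node + 2, mid

    def access(self, tag):
        if tag in self.tags:               # hit: refresh the path bits
            self._touch(self.tags.index(tag))
            return "hit"
        node, lo, hi = 0, 0, 8             # miss: follow bits to the victim
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if self.bits[node] == 0:
                node, hi = 2 * node + 1, mid
            else:
                node, lo = 2 * node + 2, mid
        self.tags[lo] = tag
        self._touch(lo)
        return "miss"
```

Unlike LRU, an access in one subtree can redirect the victim pointer away from blocks it never touched, which is exactly what makes the block ages hard to bound.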
Remarks on DFA-Based Cache Modeling (4)
Pseudo-Round-Robin (PRR): on a 4-way set-associative cache, a two-bit replacement counter is used. This counter is shared by all cache lines and is only modified (incremented mod 4) on a replacement. Thus, each cache line has an influence on the others!
Analysis results with PRR:
• MAY analysis does not yield any information at all (without counter or age information, one can never know which block is removed from the cache).
• MUST analysis provides only little information (much less than for LRU): when a block b is accessed, it goes into the cache, but without counter or age information we do not know which block is removed → all elements currently in the abstract set must be removed (only 1 out of possibly 4 elements can be shown to be in the cache). With PRR, only 1 way is effectively usable.
FIFO caches: cache hit/miss classification is difficult (ECRTS 2010).
Summary & Discussion
• Topic of this lecture: cache access classification
• Abstract interpretation: DFA + abstract cache states
• Cache hit/miss classification: MUST/MAY analysis, for instruction caches
• Replacement policies: most work is published on LRU, which also covers direct-mapped caches; FIFO, PLRU & PRR are less predictable
• Discussion: preemption? Unpredictable accesses? Alternatives (scratchpads)?
References
1. CMHC: Henrik Theiling, Christian Ferdinand, Reinhard Wilhelm. Fast and Precise WCET Prediction by Separate Cache and Path Analyses. Real-Time Systems 18(2/3), Kluwer, 2000.¹
2. Data-Cache Analysis: Bach Khoa Huynh, Lei Ju, and Abhik Roychoudhury. Scope-Aware Data Cache Analysis for WCET Estimation. Proc. IEEE RTAS 2011.
3. FIFO Cache Analysis: Daniel Grund and Jan Reineke. Precise and Efficient FIFO-Replacement Analysis Based on Static Phase Detection. Proc. 22nd Euromicro Conference on Real-Time Systems (ECRTS 2010).
¹ For persistence analysis, refer to [2], not [1].
References (2)
1. Preemption: Chang-Gun Lee, Joosun Hahn, Yang-Min Seo, Sang Lyul Min, Rhan Ha, Seongsoo Hong, Chang Yun Park, Minsuk Lee, and Chong Sang Kim. Analysis of Cache-Related Preemption Delay in Fixed-Priority Preemptive Scheduling. IEEE Transactions on Computers 47(6), June 1998.
2. Abstract Interpretation: Julien Bertrane, Patrick Cousot, Radhia Cousot, Jérôme Feret, Laurent Mauborgne, Antoine Miné, and Xavier Rival. Static Analysis and Verification of Aerospace Software by Abstract Interpretation. Paper 2010-3385, American Institute of Aeronautics and Astronautics (AIAA), 2010.
Extra Material SS 2011
Exercise: 2-way set-associative cache: MUST, MAY, PS

Compiled from, e.g.:
  x, y, z = a, b, 0
  while (x > 0 && y > 0) { z += x-- + y-- }
  x, y = 0, 0

[Figure: CFG from START to END; each instruction is labeled with its (tag, set, offset) triple: (0,0,0), (0,0,1), (0,1,0), (0,1,1), (1,0,0), (1,0,1), (1,1,0), (1,1,1), (2,0,0).]