Selfish-LRU: Preemption-Aware Caching for Predictability and Performance
Jan Reineke (Saarland University, Germany), Sebastian Altmeyer (University of Amsterdam, Netherlands), Daniel Grund (Thales Germany), Sebastian Hahn (Saarland University, Germany), Claire Maiza (INP Grenoble, Verimag, France)
20th IEEE Real-Time and Embedded Technology and Applications Symposium April 15-17, 2014 Berlin, Germany
Context: Preemptive Scheduling
Non-preemptive Execution: [timeline figure: Task 1, Task 2]
Preemptive Execution: [timeline figure: Task 1, Task 2]
Caveat: Preemptions are not free!
Preemptive Execution: [timeline figure annotated with the Cache-Related Preemption Delay (CRPD)]
Contribution of this paper
Selfish-LRU: a new cache replacement policy that
➡ Increases performance by reducing the CRPD
➡ Simplifies static analysis of the CRPD
Selfish-LRU is a preemption-aware variant of least-recently used (LRU)
Least-Recently Used (LRU)
“Replace data that has not been used for the longest time”
➡ Usually works well due to temporal locality
Example (cache states listed from most-recently used to least-recently used):
ABCD
E: miss ➡ EABC
B: hit ➡ BEAC
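The LRU example on this slide can be replayed with a few lines of Python. This is a minimal sketch, assuming a single, always-full cache set whose state is a list ordered from most- to least-recently used; the function name `lru_access` is illustrative and not taken from the paper.

```python
def lru_access(state, block):
    """Access `block` in one full LRU cache set.

    `state` lists the cached blocks from most- to least-recently used.
    Returns the new state and whether the access was a hit.
    """
    hit = block in state
    if hit:
        # Hit: keep all other blocks in their relative order.
        rest = [b for b in state if b != block]
    else:
        # Miss: evict the least-recently used block (last position).
        rest = state[:-1]
    # The accessed block always becomes the most-recently used one.
    return [block] + rest, hit


state = list("ABCD")
trace = []
for block in "EB":
    state, hit = lru_access(state, block)
    trace.append(("".join(state), "hit" if hit else "miss"))

print(trace)  # [('EABC', 'miss'), ('BEAC', 'hit')]
```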
CRPD Example under LRU Replacement
Assume simple preempted task:
for i in [1,10]:
    do something()
Concretely:
for i in [1,10]:
    access A
    access B
    access C
    access D
Without preemption (after warmup): 0 misses
DCBA
A: hit ➡ ADCB
B: hit ➡ BADC
C: hit ➡ CBAD
D: hit ➡ DCBA
Assume simple preempting task:
do something_else()
Concretely:
access X
Preemption between loop iterations: 1 access
DCBA
X: miss ➡ XDCB
First loop iteration after preemption: 4 misses
XDCB
A: miss ➡ AXDC
B: miss ➡ BAXD
C: miss ➡ CBAX
D: miss ➡ DCBA
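CRPD Example under LRU Replacement: Two types of misses related to preemption
1. Replaced Misses: the block was evicted by the preempting task (here: A, evicted by X)
2. Reordered Misses: the block survived the preemption but is evicted afterwards because the preemption changed the LRU order (here: B, C, D)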
Liu et al., PACT 2008: reordered misses account for 10% to 28% of all preemption-related misses
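The split into replaced and reordered misses can be reproduced for the slide example. The sketch below is a simplified illustration, not the methodology of Liu et al.: it assumes one full LRU set and assumes that every access would hit without preemption (true for this example after warmup), so every miss is preemption-related. The names `lru_access` and `survivors` are illustrative.

```python
def lru_access(state, block):
    """One access to a full LRU set; state is ordered MRU first."""
    hit = block in state
    rest = [b for b in state if b != block] if hit else state[:-1]
    return [block] + rest, hit


state = list("DCBA")                  # warmed-up state of the preempted task
state, _ = lru_access(state, "X")     # the preempting task accesses X once

# Blocks of the preempted task that are still cached when it resumes.
survivors = set(state) - {"X"}

replaced, reordered = [], []
for block in "ABCD":                  # first loop iteration after preemption
    state, hit = lru_access(state, block)
    if not hit:
        # Survivors that miss anyway were pushed out by the changed LRU order.
        (reordered if block in survivors else replaced).append(block)

print(replaced, reordered)  # ['A'] ['B', 'C', 'D']
```

In this example a single replaced miss drags three reordered misses along with it, which is the effect Selfish-LRU targets.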
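Selfish-LRU: Idea
Prioritize blocks of currently running task:
Example (cache states listed from most-recently used to least-recently used; the trace is consistent with C belonging to another task and all other blocks to the running task):
ABCD
E: miss ➡ EABD
B: hit ➡ BEAD
F: miss ➡ FBEA
Intuition: “Memory blocks of currently running task more likely to be accessed again soon.”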
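One way to read "prioritize blocks of the currently running task" is: on a miss, evict the least-recently used block owned by another task if there is one, and fall back to plain LRU otherwise. The sketch below implements that reading for a single full cache set. The function name, the `(block, owner)` representation, and the choice to re-tag a block with the accessing task's id on every access are assumptions made for illustration; the paper's exact policy may differ in details.

```python
def selfish_lru_access(state, block, tid):
    """Access `block` on behalf of task `tid`.

    `state` is a list of (block, owner_tid) pairs, most-recently used first.
    Returns the new state and whether the access was a hit.
    """
    blocks = [b for b, _ in state]
    if block in blocks:
        rest = [(b, o) for b, o in state if b != block]
        return [(block, tid)] + rest, True
    # Miss: positions of blocks owned by other tasks, in MRU-to-LRU order.
    foreign = [i for i, (_, owner) in enumerate(state) if owner != tid]
    # Evict the least-recently used foreign block, else the overall LRU block.
    victim = foreign[-1] if foreign else len(state) - 1
    rest = state[:victim] + state[victim + 1:]
    return [(block, tid)] + rest, False


# Assumed ownership: C belongs to task 2, everything else to running task 1.
state = [("A", 1), ("B", 1), ("C", 2), ("D", 1)]
trace = []
for block in "EBF":
    state, hit = selfish_lru_access(state, block, tid=1)
    trace.append(("".join(b for b, _ in state), "hit" if hit else "miss"))

print(trace)  # [('EABD', 'miss'), ('BEAD', 'hit'), ('FBEA', 'miss')]
```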
Selfish-LRU: CRPD Example Revisited
Assume simple preempted task:
for i in [1,10]:
    do something()
Concretely:
for i in [1,10]:
    access A
    access B
    access C
    access D
Without preemption (after warmup): 0 misses
➡ Same behavior as LRU
DCBA
A: hit ➡ ADCB
B: hit ➡ BADC
C: hit ➡ CBAD
D: hit ➡ DCBA
Assume simple preempting task:
do something_else()
Concretely:
access X
Preemption between loop iterations: 1 access
DCBA
X: miss ➡ XDCB
First loop iteration after preemption: 1 miss
➡ No reordered misses
XDCB
A: miss ➡ ADCB
B: hit ➡ BADC
C: hit ➡ CBAD
D: hit ➡ DCBA
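The difference between the two policies on this example can be checked side by side. The following is a sketch under the same assumptions as the slides (one full cache set, one preemption between loop iterations) plus one more: Selfish-LRU is modeled as "evict the least-recently used block of another task if any, else plain LRU". The helper names and the task ids 1 and 2 are hypothetical.

```python
def access(state, block, tid, selfish):
    """One access to a full cache set of (block, owner_tid) pairs, MRU first."""
    blocks = [b for b, _ in state]
    if block in blocks:
        victim, hit = blocks.index(block), True
    else:
        foreign = [i for i, (_, owner) in enumerate(state) if owner != tid]
        # Selfish-LRU prefers the LRU block of another task; LRU ignores owners.
        victim = foreign[-1] if (selfish and foreign) else len(state) - 1
        hit = False
    return [(block, tid)] + state[:victim] + state[victim + 1:], hit


def misses_after_preemption(selfish):
    preempted, preempting = 1, 2
    state = [(b, preempted) for b in "DCBA"]               # warmed-up state
    state, _ = access(state, "X", preempting, selfish)     # preemption
    misses = 0
    for block in "ABCD":                                   # next loop iteration
        state, hit = access(state, block, preempted, selfish)
        misses += not hit
    return misses


lru_misses = misses_after_preemption(selfish=False)
selfish_misses = misses_after_preemption(selfish=True)
print(lru_misses, selfish_misses)  # 4 1
```

The single remaining miss under Selfish-LRU is the replaced miss on A; the three reordered misses on B, C, and D disappear because X, not a block of the resumed task, is evicted first.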
Selfish-LRU: Properties
Property 1: Selfish-LRU does not exhibit reordered misses.
➡ Often: smaller CRPD
➡ Simplifies static analysis of the CRPD
Property 2: In non-preempted execution, Selfish-LRU = LRU.
➡ No change in “regular” WCET analysis
Selfish-LRU: CRPD Analysis
Setting: a preempting task interrupts a preempted task.
1. Number of useful cache blocks (UCBs) of the preempted task?
2. Number of evicting cache blocks (ECBs) of the preempting task?
➡ Smaller Bound
3. Combination of ECBs and UCBs based on Resilience
➡ Simplified and Smaller Bound
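To make the three ingredients more concrete, the sketch below computes simple per-set bounds on the number of additional misses. The formulas are assumptions chosen for illustration and are not the resilience-based analysis from the paper: the UCB-only bound counts useful blocks (at most k per set), the LRU ECB-only bound charges a full set of k misses to every set the preempting task touches (as in the slide example, where one access X causes 4 misses), the Selfish-LRU ECB-only bound charges at most one miss per evicting block, and the combined Selfish-LRU bound takes the minimum of UCBs and ECBs per set. The function and key names are hypothetical.

```python
def crpd_bounds(ucbs, ecbs, k):
    """Illustrative per-set bounds on additional misses.

    ucbs / ecbs map a cache set index to the set of useful blocks of the
    preempted task / evicting blocks of the preempting task; k is the
    associativity.
    """
    bounds = {"ucb_only": 0, "ecb_only_lru": 0,
              "ecb_only_selfish": 0, "combined_selfish": 0}
    for s in set(ucbs) | set(ecbs):
        u = min(len(ucbs.get(s, ())), k)   # at most k useful blocks per set
        e = len(ecbs.get(s, ()))
        bounds["ucb_only"] += u
        bounds["ecb_only_lru"] += k if e > 0 else 0
        bounds["ecb_only_selfish"] += min(e, k)
        bounds["combined_selfish"] += min(u, e)
    return bounds


# The slide example: one set, UCBs A-D, a single evicting block X, k = 4.
bounds = crpd_bounds({0: {"A", "B", "C", "D"}}, {0: {"X"}}, k=4)
print(bounds)
# {'ucb_only': 4, 'ecb_only_lru': 4, 'ecb_only_selfish': 1, 'combined_selfish': 1}
```

For the slide example these numbers match the observed behavior: 4 additional misses under LRU and 1 under Selfish-LRU.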
Selfish-LRU: Implementation
Required modifications:
• Manage task ids (TID) in operating system
• Make TID available to cache in TID register
• Augment cache lines with TID of “owner” task
  ‣ Conservative estimate: < 3% space overhead
• Modified replacement logic
Similar to virtually-addressed caches
Experimental Evaluation
Main goal: Compare Selfish-LRU with LRU in terms of performance and predictability
➡ Modified MPARM simulator
➡ CRPD analyses implemented in AbsInt’s aiT
Secondary goal: Compare CRPD approach with cache partitioning (see paper for details)
Experimental Evaluation: Benchmarks and Cache Configuration
Benchmarks:
• Four of the largest Mälardalen benchmarks
• Four models from the SCADE distribution
• Two SCADE models from an embedded systems course
Cache configuration:
• Capacity: 2 KiB, 4 KiB, 8 KiB
• Associativity: 4, 8
• Number of sets: 32, 64, 128
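The cache parameters are tied together by capacity = associativity × number of sets × line size. As a small worked example, the line size implied by the (k, n, C) combinations named in the figure captions can be computed as follows; the variable names are illustrative, and the line size itself is derived here, not stated on the slides.

```python
KIB = 1024

# (associativity k, number of sets n, capacity C in KiB) from the captions.
configs = [(4, 32, 2), (8, 32, 4), (4, 128, 8), (8, 64, 8)]

# Line size in bytes: C / (k * n).
line_sizes = [c * KIB // (k * n) for k, n, c in configs]

print(line_sizes)  # [16, 16, 16, 16]
```

All four combinations are consistent with a 16-byte cache line.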
Experimental Evaluation: Simulation Results, “Large” Preempting Task
Cache configuration: Capacity: 2 KiB, Associativity 4, Number of sets: 32
[Plot: measured number of additional misses for each preempted task]
Fig. 1. Measured number of context-switch misses for each benchmark when preempted by pilot. k = 4, n = 32, and thus C = 2 KiB.
Fig. 2. Measured number of context-switch misses for each benchmark when preempted by pilot. k = 8, n = 32, and thus C = 4 KiB.
Fig. 3. Measured number of context-switch misses for each benchmark when preempted by edn. k = 4, n = 32, and thus C = 2 KiB.
Fig. 4. Bounds on the number of context-switch misses when preempted by pilot, determined by static analysis. k = 4, n = 32, and thus C = 2 KiB.
Fig. 5. Bounds on the number of context-switch misses when preempted by pilot, determined by static analysis. k = 8, n = 32, and thus C = 4 KiB.
Fig. 6. Bounds on the number of context-switch misses when preempted by edn, determined by static analysis. k = 4, n = 32, and thus C = 2 KiB.
Fig. 7. Observed misses during the execution of different task sets. k = 4, n = 128, and thus C = 8 KiB.
Fig. 8. Observed misses during the execution of different task sets. k = 8, n = 64, and thus C = 8 KiB.
Large share of replaced misses ➡ Fairly small improvement
Experimental Evaluation: Simulation Results, “Small” Preempting Task
Cache configuration: Capacity: 2 KiB, Associativity 4, Number of sets: 32
[Plot: measured number of additional misses for each preempted task]
Small share of replaced misses ➡ Fairly significant improvement
Experimental Evaluation: Analysis Results, “Large” Preempting Task
Cache configuration: Capacity: 2 KiB, Associativity 4, Number of sets: 32
[Plot: bound on number of additional misses for each preempted task]
All misses are replaced misses ➡ No improvement
Experimental Evaluation: Analysis Results, “Small” Preempting Task
Cache configuration: Capacity: 4 KiB, Associativity 8, Number of sets: 32
[Plot: bound on number of additional misses for each preempted task]
Small share of replaced misses ➡ Fairly large improvement
Summary and Future Work
Selfish-LRU eliminates reordered misses:
➡ Increases performance by reducing the CRPD
➡ Simplifies static analysis of the CRPD
➡ Large improvements for small preempting tasks like interrupt handlers
Apply same idea in shared caches in multi-cores?