11/18/2009
“CACHE OPTIMIZATION
FOR REAL-TIME
EMBEDDED SYSTEMS”
Abu Asaduzzaman
Florida Atlantic University
Department of Computer and Electrical Engineering and Computer Science
Boca Raton, Florida, USA
Ph.D. Dissertation Defense
NOTE: References and abbreviations used in this presentation are as they
appear in the original dissertation manuscript.
Asaduzzaman — Ph.D. Dissertation Defense 2
Dissertation Committee
Dissertation Advisor:
Dr. Imad Mahgoub,
Professor & Director of CEECS Graduate Programs and Research
Director of Mobile Computing Lab
Committee Members:
Dr. Borko Furht, Professor & Chair CEECS Department
Dr. Ravi Shankar, Professor & Director CSI Lab
Dr. Mohammad Ilyas, Professor & Associate Dean of College of Eng & CS
Dr. Valentine Aalo, Professor
Selected Publications (out of 21)
• “Improving Performance / Power Ratio of Embedded Multicore Systems by Using Miss Table and Victim Caches”, IJHPSA-2009, under review, Sub. Code: IJHPSA-11388, Oct. 2009.
• “Cache Modeling and Optimization for Portable Devices Running MPEG-4 Video Decoder”, MTAP-2006, Vol. 28(1), pp. 239-256, 2006.
• “Impact of L1 Entire Locking and L2 Way Locking on Performance, Power Consumption, and Predictability of Multicore Real-Time Systems”, IEEE AICCSA’09, May 2009, Rabat, Morocco.
• “Cache Optimization for Mobile Devices Running Multimedia Applications”, IEEE ISMSE'04, Dec. 2004, Miami, Florida.
• “Evaluation of Application-Specific Multiprocessor Mobile System”, SPECTS’04, July 2004, San Jose, California.
Outline
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (execution time predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization (for cache locking)
  – Simulation (predictability and performance/power ratio)
• Conclusion and Future Extensions
Problem Statement
“Cache Optimization for Real-Time Embedded Systems”
• Even though cache improves performance, it poses serious challenges in real-time embedded systems, as it consumes a vast amount of energy and introduces execution time unpredictability.
• Existing cache optimization solutions are not adequate for real-time embedded systems because they do not address performance, power consumption, and predictability issues at the same time. Moreover, existing solutions are not suitable for dealing with multi-core architecture issues.
Contributions
“Cache Optimization for Real-Time Embedded Systems”
• Cache modeling and optimization technique for single-core systems to improve performance.
• Cache modeling and optimization technique for multi-core systems to improve the performance/power ratio.
• Cache locking scheme to improve predictability for real-time systems.
• Miss Table based cache locking scheme with victim caches to improve predictability and the performance/power ratio for real-time embedded systems at the same time. This scheme is effective for both single-core and multi-core systems.
• Workload characterization for cache optimization.
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation Results (predictability and performance/power ratio)
• Conclusion and Future Extensions
Background
[Figure: evolution of the cache memory hierarchy – (a) CPU and main memory (IBM, 1960s); (b) processor cache (Intel, 1990s); (c) multi-level cache in a single-core processor (split I1/D1 plus CL2); (d) multi-level cache in a multi-core processor. In each diagram the CPUs run at 400 MHz while the buses and main memory run at 100 MHz.]
Almost every processor has cache in its architecture, so cache optimization is important.
Background (1)
Cache in single- and multi-core: cache solutions should cover both single- and multi-core architectures.
[Figure: cache architectures – (a) a single-core CPU with either a single-level cache (only CL1) or a multi-level cache (CL1, CL2, etc.); (b) multi-level caches may be inclusive, exclusive, or victim; (c) a multi-core CPU where either only one core has CL1 and CL2 is shared by all cores, or each core has its own CL1 and CL2 is shared or dedicated (with or without a shared CL3).]
[Figure: cache geometry – a cache holds S sets of W blocks each (W = associativity level), with block size Sb; an instruction at address Ad can be mapped into the blocks of set Si = (Ad / Sb) mod S.]
Important cache parameters: cache size C = Sb * B; number of cache blocks B = S * W; where Sb is the line size, S the number of sets, and W the associativity level.
Background (2)
• Based on the current design trend, most processors have multi-level caches (to improve performance).*
• Cache makes the system more unpredictable and consumes a significant amount of power.**
• Power consumption and execution time unpredictability may defeat the performance gain of real-time embedded systems.***
* [35'82], [109'06], [110'06], [134'05], [166'09]
** [128'04], [131'04], [144'06], [163'07]
*** [47'06], [70'06], [93'03], [127'06]
Background (3)
• System performance can be improved by customizing cache parameters, using a victim cache, and applying stream buffering.* (power?)
• Multi-core improves the performance/power ratio.** (multi-core, caches, power?)
• Cache locking improves predictability.*** (performance, power, predictability?)
* [5'06], [44'06], [56'06], [58'90], [82'03], [122'07], [155'04], [158'09], [163'07]
** [47'06], [127'06], [166'09]
*** [9'09], [11'07], [12'05], [97'06], [128'04], [139'03]
Background (4): Literature Survey
[Figure: survey map centered on "Cache Optimization for Real-Time Embedded Systems", linking (1) Workload, (2) Cache Optimization with (1a) cache modeling in ES, (1b) pre-fetching, and (1c) cache locking, (3) Simulation (apps, tools), (4) Embedded System covering System-On-a-Chip design, MP-SOC, and RT MP-SOC, and (5) other related issues; the survey dimensions are (1) performance, (2) power, and (3) predictability (P, P/P).]
Legend: ES – Embedded System; MP-SOC – Multi-Processor System-On-a-Chip; RT – Real-Time.
Reviewed article summary: 172 articles in total (1982–2009); 45% from 2006 onward and 85% from 2000 onward.
Background (5): Literature Survey
• A general-purpose computing platform is studied using an MPEG-2 application.*
  – Cache can be optimized to increase the performance of multimedia applications.
  – Techniques like selective caching, cache locking, and pre-fetching may improve cache performance for application-specific systems.
  – However, no power analysis is conducted, and the methodology is not reusable.
* [121]
Background (6): Literature Survey
• Victim cache (VC) and stream buffering were introduced by Jouppi to improve performance.*
  – With a VC, the effective cache size is larger than CL2 alone.
  – A separate cache supports stream buffering.
• Stream buffering improves the performance/power ratio.**
• The VC is suitable for embedded systems (limited area) and multimedia applications (large memory accesses).***
• However, no predictability analysis is provided; the VC can be reused for stream buffering.
* [58]
** [44]
*** [155]
Background (7): Literature Survey
• Cache locking improves predictability in real-time systems.*
  – Blocks are locked inside the cache during execution.
  – Trade-off: predictability vs. performance.
• Cache locking in the shared caches of multi-core architectures is explored by Suhendra.**
  – No L1 locking, but L2 locking, on the PowerPC 750GX.
• Existing solutions do not address performance, power consumption, and predictability at the same time.
* [12], [21], [97], [99], [128], [139]
** [A], [B], [124]
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation (predictability and performance/power ratio)
• Conclusion and Future Extensions
Cache Optimization for Single-Core Systems
• Supported by the Motorola-FAU OPP Project.
• Uses a single-core architecture.
  – Split CL1 (I1 and D1); unified CL2.
• H.264/AVC decoding algorithm: encoded video is decoded through the cache subsystem.
• Impact of cache size, line size, and associativity level on performance (miss ratio).
• Published in the IEEE ISMSE'04 conference proceedings.
Cache Optimization for Single-Core Systems (2)
[Figure: selected architecture and workload – the simulated single-core architecture has a processing core with split CL1 (I1 and D1) and a unified CL2, connected over a shared bus to main memory; reads and writes flow through the cache hierarchy. Workload (using Cachegrind): decoding a .264 file.]
Cache Optimization for Single-Core Systems (3)
[Figure: four panels of VisualSim simulation results.]
VisualSim simulation results show that this cache optimization technique can be used to improve the performance of single-core systems.
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation Results (predictability and performance/power ratio)
• Conclusion and Future Extensions
Cache Optimization for Multi-Core Systems
• Supported by the Motorola-FAU OPP Project.
• Uses a multi-core architecture.
  – Selected a 2-core system.
  – Each core has its own unified CL1; no CL2.
• MPEG4 decoding algorithm.
• Impact of cache size, line size, and associativity level on mean delay and power consumption.
• Published in the SPECTS’04 conference proceedings.
Cache Optimization for Multi-Core Systems (2)
[Figure: selected architecture and workload – the simulated dual-core architecture pairs a DSP (running an RT/OS) with an application processor (running an open OS), each with its own CL1, communicating through an IPC/MU and sharing a bus to main memory; encoded video is read, decoded, and handed to the playback application. Workload: decoding an MPEG4 file – a Group Of Pictures with 7 frames (I B B P B B P, 1 frame ≈ 152 KB), where P frames use prediction and B frames use bi-directional prediction.]
AP – Application Processor; DSP – Digital Signal Processor; IPC/MU – Inter-Process Communication / Messaging Unit; RT/OS – Real-Time Operating System; R/W – Read / Write
Cache Optimization for Multi-Core Systems (3)
[Figure: four panels of simulation results.]
Adding a core reduces mean delay by more than 45% while consuming less than 5% additional energy.
Simulation results show that this cache optimization technique can be used to improve the performance/power ratio of multi-core systems.
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation Results (predictability and performance/power ratio)
• Conclusion and Future Extensions
Cache Locking in Real-Time Systems
• Introduces a block selection scheme for cache locking.
  – Lock the blocks that would cause more misses if not locked.
• Uses the Heptane-provided simulation platform.
  – Single-core processor; level-1 instruction cache locking.
  – Takes C code; generates a syntax tree-graph.
  – Post-processes the tree-graph information (block address, cache misses) to select the blocks to be locked.
• Impact of cache locking and cache parameters on execution time predictability.
• Published in the IEEE IIT'07 conference proceedings.
Heptane (Hades Embedded Processor Timing ANalyzEr) supports Pentium1, Hitachi H8/300, StrongARM, and MIPS architectures.
Cache Locking in Real-Time Systems (2)
[Figure: block selection scheme and simulation results.]
Cache Locking in Real-Time Systems (3)
[Figure: three panels of simulation results.]
Simulation results show that this cache locking technique can be used to improve predictability, and also performance, for locking up to 25% of the cache.
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation (predictability and performance/power ratio)
• Conclusion and Future Extensions
Miss Table Based Cache Locking with Victim Caches
Miss Table (MT)
• Aggressive and/or ill-chosen cache locking may decrease performance.
• We introduce the MT at the cache level to hold information about the blocks that will cause more misses if not locked.
• MT size (expected) = cache size / block size.
• The MT should be accessible to CL1, CL2, and the VCs (if any).
• The MT improves cache locking and cache replacement performance.
Miss Table Based Cache Locking with Victim Caches (2)
Victim Cache (VC)
• The VC, placed between the CL1s and CL2, temporarily stores the victim blocks from CL1 (instead of discarding them).
• We use the VC to support stream buffering.
• The VC reduces access latencies.*
• The VC has the potential to reduce energy consumption and to improve predictability for real-time applications.**
* [75’07]
** [4’99]
MT Based Cache Locking with VCs (3)
Schematic diagram showing MT and VC
[Figure: schematic diagram showing one core, the MT, the VC, and the cache memory hierarchy – the core accesses CL1 (I1/D1); the VC sits between CL1 and CL2, and CL2 connects to main memory (MM); each cache line carries lock (L) and valid (V) bits, a tag, and data; the Miss Table (MT) holds block-number and miss-count pairs and is consulted by CL1, the VC, and CL2.]
MT Based Cache Locking with VCs (8)
Schematic diagram of a 4-core system with MT and VCs (simulated architecture)
[Figure: Cores 1–4, each with private I1/D1 caches and its own VC, sit on Bus1; the shared CL2 and the MT connect over Bus2 to main memory.]
MT Based Cache Locking with VCs (4)
Workflow diagram of a multi-core system
• The maximum delay among each set of N tasks is counted toward the mean delay per task.
• The power consumed by all cores to finish all tasks is counted toward the total power consumption.
[Figure: workflow – Start; load the MT (from offline analysis); if cache locking is used, lock up to 25% of CL2 (the maximum-miss blocks) and load CL2 and CL1 with the selected blocks; generate N (or fewer) tasks; Core.1 … Core.N each process a task (see Figure 6.8b); Delay = SUM(MAX(delay across the N cores)); Power = SUM(power for all N cores); repeat until all tasks are done; Mean delay per task = total delay / number of tasks; Total power = total power for all tasks; End.]
MT Based Cache Locking with VCs (4)
Workflow diagram of Core.k (1 ≤ k ≤ N)
• The VC temporarily holds the victim blocks from CL1.
• In case of stream buffering, the requested block goes to CL1 and the additional blocks go to the VC.
[Figure: workflow – on a CL1 miss, the victim block (VB) is found from CL1 using the MT and, again using the MT, a VC block is replaced with the VB; on a VC hit, blocks are swapped between CL2 and the VC; on a CL2 hit, CL1 is filled from CL2; on a CL2 miss, CL2 is filled with the required block from main memory and, if stream buffering applies, stream buffering is performed and the VC is updated with the additional blocks from main memory.]
MT Based Cache Locking with VCs (5)
Workload Characterization
• Phase I (Code Division): the application is divided into smaller (end-to-end) functions so that each function can be assigned to a core.
• Phase II (Code Estimation): the important operations (integer, floating-point, load/store, and branch) of each application are estimated.
• Phase III (Block Selection): memory blocks are selected for cache locking by post-processing the Heptane tree-graph.
Data needed to run the simulation:
• Type of each instruction
• Number of instructions
• Hit (and miss) ratio
• Block address and cache misses
MT Based Cache Locking with VCs (6)
Block Selection for Cache Locking
[Figure: block selection results for the FFT code.]
The best scenario for the FFT code: by locking 25% of the cache, 124 cache misses out of 246 (more than 50%) may be avoided.
MT Based Cache Locking with VCs (7)
Simulation: Assumptions
• We simulate a 4-core system where CL1 (I1, D1) is private to each core and CL2 is shared by all cores.
• Cache locking is done at the level-2 cache (the CL2 cache replacement policy is modified to exclude the locked blocks).
• Selective pre-loading and stream buffering are considered.
• The write-back memory update policy is used.
• The bus delay between CL2 and main memory is 10 times longer than that between CL1 and CL2.
MT Based Cache Locking with VCs (8)
Simulation: Simulated Architecture
[Figure: a 4-core system in which each core has private I1/D1 caches and its own VC on Bus1; the shared CL2 and the MT connect over Bus2 to main memory.]
MT Based Cache Locking with VCs (9)
Simulation: Applications and Parameters
• Moving Picture Experts Group’s MPEG4 (part 2)
• Advanced Video Coding, widely known as H.264/AVC
• Fast Fourier Transform (FFT)
• Matrix Inversion (MI)
• Discrete Fourier Transform (DFT)
MT Based Cache Locking with VCs (10)
Simulation Tools
• Heptane
• VisualSim
MT Based Cache Locking with VCs (11)
Simulation Results: Impact of MT based cache locking
For MPEG4, MT based cache locking (up to 25%) decreases the mean delay per task. MT and cache locking have no positive impact on FFT delay.
MT Based Cache Locking with VCs (12)
Simulation Results: Impact of MT based cache locking with VCs
For MPEG4, MT based cache locking (up to 25%) with VCs improves performance. However, FFT performance is not improved by the MT and VCs.
MT Based Cache Locking with VCs (13)
Simulation Results: Impact of MT based cache locking with VCs
[Figure: two panels of total power consumption results.]
Similarly, MT based cache locking (up to 25%) with VCs reduces total power consumption for MPEG4. However, the MT and VCs have no positive impact on total power consumption for FFT.
MT Based Cache Locking with VCs (14)
Summary: Impact of MT based cache locking with VCs
In summary, a 33% reduction in mean delay per task and a 41% reduction in total power consumption are achieved for MPEG4 using the MT and VCs. Also, as explained earlier, predictability is improved by avoiding more than 50% of the cache misses when 25% of the cache is locked using the MT.
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Major Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation Results (predictability and performance/power ratio)
• Conclusion and Future Extensions
Conclusion
• We develop cache modeling and optimization techniques to improve performance and decrease power consumption.
• We develop a cache locking scheme to improve predictability.
• We introduce a Miss Table based cache locking scheme with victim caches to improve predictability and the performance/power ratio for real-time embedded systems at the same time.
• We develop techniques to generate workloads for cache optimization.
• Simulation results show that the proposed Miss Table improves the performance of cache locking and the cache replacement policy.
• Simulation results also show that victim caches improve the hit ratio (by supporting stream buffering).
• It is also observed that the proposed Miss Table based cache locking with victim caches has a significant impact on predictability and the performance/power ratio for the MPEG4 code, but not for the FFT code.
Future Extensions
• This work can be extended to address the following important open issues:
  – Investigating data cache locking for embedded systems
  – Exploring cache locking at various cache levels
  – Studying different cache memory subsystems for power-aware multi-core architectures
CACHE OPTIMIZATION FOR REAL-TIME
EMBEDDED SYSTEMS
Abu Asaduzzaman
E-mail:
Telephone:
+1-561-297-2452 (Land)
+1-561-843-2231 (Cell)