11/18/2009
“CACHE OPTIMIZATION
FOR REAL-TIME
EMBEDDED SYSTEMS”
Abu Asaduzzaman
Florida Atlantic University
Department of Computer and Electrical Engineering and Computer Science
Boca Raton, Florida, USA
Ph.D. Dissertation Defense
NOTE: References and abbreviations used in this presentation are as they
appear in the original dissertation manuscript.
Asaduzzaman — Ph.D. Dissertation Defense 2
Dissertation Committee
Dissertation Advisor:
Dr. Imad Mahgoub,
Professor & Director of CEECS Graduate Programs and Research
Director of Mobile Computing Lab
Committee Members:
Dr. Borko Furht, Professor & Chair CEECS Department
Dr. Ravi Shankar, Professor & Director CSI Lab
Dr. Mohammad Ilyas, Professor & Associate Dean of College of Eng & CS
Dr. Valentine Aalo, Professor
Selected Publications (out of 21)
• “Improving Performance / Power Ratio of Embedded Multicore Systems by Using Miss Table and Victim Caches”, IJHPSA-2009, under review, Sub. Code: IJHPSA-11388, Oct. 2009.
• “Cache Modeling and Optimization for Portable Devices Running MPEG-4 Video Decoder”, MTAP-2006, Vol. 28(1), pp. 239-256, 2006.
• “Impact of L1 Entire Locking and L2 Way Locking on Performance, Power Consumption, and Predictability of Multicore Real-Time Systems”, IEEE AICCSA’09, May 2009, Rabat, Morocco.
• “Cache Optimization for Mobile Devices Running Multimedia Applications”, IEEE ISMSE'04, Dec. 2004, Miami, Florida.
• “Evaluation of Application-Specific Multiprocessor Mobile System”, SPECTS’04, July 2004, San Jose, California.
Outline
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (execution time predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization (for cache locking)
  – Simulation (predictability and performance/power ratio)
• Conclusion and Future Extensions
Problem Statement
“Cache Optimization for Real-Time Embedded Systems”
• Even though cache improves performance, it poses serious challenges in real-time embedded systems, as it consumes a vast amount of energy and introduces execution time unpredictability.
• Existing cache optimization solutions are not adequate for real-time embedded systems because they do not address performance, power consumption, and predictability issues at the same time. Moreover, existing solutions are not suitable for dealing with multi-core architecture issues.
Contributions
“Cache Optimization for Real-Time Embedded Systems”
• Cache modeling and optimization technique for single-core systems to improve performance.
• Cache modeling and optimization technique for multi-core systems to improve the performance/power ratio.
• Cache locking scheme to improve predictability for real-time systems.
• Miss Table based cache locking scheme with victim caches to improve predictability and the performance/power ratio for real-time embedded systems at the same time. This scheme is effective for both single-core and multi-core systems.
• Workload characterization for cache optimization.
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation Results (predictability and performance/power ratio)
• Conclusion and Future Extensions
Background
[Figure: evolution of the cache memory hierarchy – (a) CPU and main memory (IBM, 1960s); (b) processor cache (Intel, 1990s); (c) multi-level cache in a single-core processor (split I1/D1 plus CL2); (d) multi-level cache in a multi-core processor. In each diagram the CPUs run at 400 MHz while the buses and main memory run at 100 MHz.]
Almost every processor has cache in its architecture, so cache optimization is important.
Background (1)
Cache in single- and multi-core: cache solutions should cover both single- and multi-core architectures.
[Figure: cache architectures – (a) a single-core CPU with either a single-level cache (only CL1) or a multi-level cache (CL1, CL2, etc.); (b) multi-level caches may be inclusive, exclusive, or victim; (c) a multi-core CPU where either only one core has CL1 and CL2 is shared by all cores, or each core has its own CL1 and CL2 is shared or dedicated (with or without a shared CL3).]
[Figure: cache geometry – a cache holds S sets of W blocks each (W = associativity level), with block size Sb; an instruction at address Ad can be mapped into the blocks of set Si = (Ad / Sb) mod S.]
Important cache parameters: cache size C = Sb * B; number of cache blocks B = S * W; where Sb is the line size, S the number of sets, and W the associativity level.
Background (2)
• Based on the current design trend, most processors have multi-level caches (to improve performance).*
• Cache makes the system more unpredictable and consumes a significant amount of power.**
• Power consumption and execution time unpredictability may defeat the performance gain of real-time embedded systems.***
* [35'82], [109'06], [110'06], [134'05], [166'09]
** [128'04], [131'04], [144'06], [163'07]
*** [47'06], [70'06], [93'03], [127'06]
Background (3)
• System performance can be improved by customizing cache parameters, using a victim cache, and applying stream buffering.* (power?)
• Multi-core improves the performance/power ratio.** (multi-core, caches, power?)
• Cache locking improves predictability.*** (performance, power, predictability?)
* [5'06], [44'06], [56'06], [58'90], [82'03], [122'07], [155'04], [158'09], [163'07]
** [47'06], [127'06], [166'09]
*** [9'09], [11'07], [12'05], [97'06], [128'04], [139'03]
Background (4): Literature Survey
[Figure: survey map centered on "Cache Optimization for Real-Time Embedded Systems", linking (1) Workload, (2) Cache Optimization with (1a) cache modeling in ES, (1b) pre-fetching, and (1c) cache locking, (3) Simulation (apps, tools), (4) Embedded System covering System-On-a-Chip design, MP-SOC, and RT MP-SOC, and (5) other related issues; the survey dimensions are (1) performance, (2) power, and (3) predictability (P, P/P).]
Legend: ES – Embedded System; MP-SOC – Multi-Processor System-On-a-Chip; RT – Real-Time.
Reviewed article summary: 172 articles in total (1982–2009); 45% from 2006 onward and 85% from 2000 onward.
Background (5): Literature Survey
• A general-purpose computing platform is studied using an MPEG-2 application.*
  – Cache can be optimized to increase the performance of multimedia applications.
  – Techniques like selective caching, cache locking, and pre-fetching may improve cache performance for application-specific systems.
  – However, no power analysis is conducted, and the methodology is not reusable.
* [121]
Background (6): Literature Survey
• Victim cache (VC) and stream buffering were introduced by Jouppi to improve performance.*
  – With a VC, the effective cache size is larger than CL2 alone.
  – A separate cache supports stream buffering.
• Stream buffering improves the performance/power ratio.**
• The VC is suitable for embedded systems (limited area) and multimedia applications (large memory accesses).***
• However, no predictability analysis is provided; the VC can be reused for stream buffering.
* [58]
** [44]
*** [155]
Background (7): Literature Survey
• Cache locking improves predictability in real-time systems.*
  – Blocks are locked inside the cache during execution.
  – Trade-off: predictability vs. performance.
• Cache locking in the shared caches of multi-core architectures is explored by Suhendra.**
  – No L1 locking, but L2 locking, on the PowerPC 750GX.
• Existing solutions do not address performance, power consumption, and predictability at the same time.
* [12], [21], [97], [99], [128], [139]
** [A], [B], [124]
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation (predictability and performance/power ratio)
• Conclusion and Future Extensions
Cache Optimization for Single-Core Systems
• Supported by the Motorola-FAU OPP Project.
• Uses a single-core architecture.
  – Split CL1 (I1 and D1); unified CL2.
• H.264/AVC decoding algorithm: encoded video is decoded through the cache subsystem.
• Impact of cache size, line size, and associativity level on performance (miss ratio).
• Published in the IEEE ISMSE'04 conference proceedings.
Cache Optimization for Single-Core Systems (2)
[Figure: selected architecture and workload – the simulated single-core architecture has a processing core with split CL1 (I1 and D1) and a unified CL2, connected over a shared bus to main memory; reads and writes flow through the cache hierarchy. Workload (using Cachegrind): decoding a .264 file.]
Cache Optimization for Single-Core Systems (3)
[Figure: four panels of VisualSim simulation results.]
VisualSim simulation results show that this cache optimization technique can be used to improve the performance of single-core systems.
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation Results (predictability and performance/power ratio)
• Conclusion and Future Extensions
Cache Optimization for Multi-Core Systems
• Supported by the Motorola-FAU OPP Project.
• Uses a multi-core architecture.
  – Selected a 2-core system.
  – Each core has its own unified CL1; no CL2.
• MPEG4 decoding algorithm.
• Impact of cache size, line size, and associativity level on mean delay and power consumption.
• Published in the SPECTS’04 conference proceedings.
Cache Optimization for Multi-Core Systems (2)
[Figure: selected architecture and workload – the simulated dual-core architecture pairs a DSP (running an RT/OS) with an application processor (running an open OS), each with its own CL1, communicating through an IPC/MU and sharing a bus to main memory; encoded video is read, decoded, and handed to the playback application. Workload: decoding an MPEG4 file – a Group Of Pictures with 7 frames (I B B P B B P, 1 frame ≈ 152 KB), where P frames use prediction and B frames use bi-directional prediction.]
AP – Application Processor; DSP – Digital Signal Processor; IPC/MU – Inter-Process Communication / Messaging Unit; RT/OS – Real-Time Operating System; R/W – Read / Write
Cache Optimization for Multi-Core Systems (3)
[Figure: four panels of simulation results.]
Adding a core reduces mean delay by more than 45% while consuming less than 5% additional energy.
Simulation results show that this cache optimization technique can be used to improve the performance/power ratio of multi-core systems.
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation Results (predictability and performance/power ratio)
• Conclusion and Future Extensions
Cache Locking in Real-Time Systems
• Introduces a block selection scheme for cache locking.
  – Lock the blocks that would cause more misses if not locked.
• Uses the Heptane-provided simulation platform.
  – Single-core processor; level-1 instruction cache locking.
  – Takes C code; generates a syntax tree-graph.
  – Post-processes the tree-graph information (block address, cache misses) to select the blocks to be locked.
• Impact of cache locking and cache parameters on execution time predictability.
• Published in the IEEE IIT'07 conference proceedings.
Heptane (Hades Embedded Processor Timing ANalyzEr) supports Pentium1, Hitachi H8/300, StrongARM, and MIPS architectures.
Cache Locking in Real-Time Systems (2)
[Figure: block selection scheme and simulation results.]
Cache Locking in Real-Time Systems (3)
[Figure: three panels of simulation results.]
Simulation results show that this cache locking technique can be used to improve predictability, and also performance, for locking up to 25% of the cache.
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation (predictability and performance/power ratio)
• Conclusion and Future Extensions
Miss Table Based Cache Locking with Victim Caches
Miss Table (MT)
• Aggressive and/or ill-chosen cache locking may decrease performance.
• We introduce the MT at the cache level to hold information about the blocks that will cause more misses if not locked.
• MT size (expected) = cache size / block size.
• The MT should be accessible to CL1, CL2, and the VCs (if any).
• The MT improves cache locking and cache replacement performance.
Miss Table Based Cache Locking with Victim Caches (2)
Victim Cache (VC)
• The VC, placed between the CL1s and CL2, temporarily stores the victim blocks from CL1 (instead of discarding them).
• We use the VC to support stream buffering.
• The VC reduces access latencies.*
• The VC has the potential to reduce energy consumption and to improve predictability for real-time applications.**
* [75’07]
** [4’99]
MT Based Cache Locking with VCs (3)
Schematic diagram showing MT and VC
[Figure: schematic diagram showing one core, the MT, the VC, and the cache memory hierarchy – the core accesses CL1 (I1/D1); the VC sits between CL1 and CL2, and CL2 connects to main memory (MM); each cache line carries lock (L) and valid (V) bits, a tag, and data; the Miss Table (MT) holds block-number and miss-count pairs and is consulted by CL1, the VC, and CL2.]
MT Based Cache Locking with VCs (8)
Schematic diagram of a 4-core system with MT and VCs (simulated architecture)
[Figure: Cores 1–4, each with private I1/D1 caches and its own VC, sit on Bus1; the shared CL2 and the MT connect over Bus2 to main memory.]
MT Based Cache Locking with VCs (4)
Workflow diagram of a multi-core system
• The maximum delay among each set of N tasks is counted toward the mean delay per task.
• The power consumed by all cores to finish all tasks is counted toward the total power consumption.
[Figure: workflow – Start; load the MT (from offline analysis); if cache locking is used, lock up to 25% of CL2 (the maximum-miss blocks) and load CL2 and CL1 with the selected blocks; generate N (or fewer) tasks; Core.1 … Core.N each process a task (see Figure 6.8b); Delay = SUM(MAX(delay across the N cores)); Power = SUM(power for all N cores); repeat until all tasks are done; Mean delay per task = total delay / number of tasks; Total power = total power for all tasks; End.]
MT Based Cache Locking with VCs (4)
Workflow diagram of Core.k (1 ≤ k ≤ N)
• The VC temporarily holds the victim blocks from CL1.
• In case of stream buffering, the requested block goes to CL1 and the additional blocks go to the VC.
[Figure: workflow – on a CL1 miss, the victim block (VB) is found from CL1 using the MT and, again using the MT, a VC block is replaced with the VB; on a VC hit, blocks are swapped between CL2 and the VC; on a CL2 hit, CL1 is filled from CL2; on a CL2 miss, CL2 is filled with the required block from main memory and, if stream buffering applies, stream buffering is performed and the VC is updated with the additional blocks from main memory.]
MT Based Cache Locking with VCs (5)
Workload Characterization
• Phase I (Code Division): the application is divided into smaller (end-to-end) functions so that each function can be assigned to a core.
• Phase II (Code Estimation): the important operations (integer, floating-point, load/store, and branch) of each application are estimated.
• Phase III (Block Selection): memory blocks are selected for cache locking by post-processing the Heptane tree-graph.
Data needed to run the simulation:
• Type of each instruction
• Number of instructions
• Hit (and miss) ratio
• Block address and cache misses
MT Based Cache Locking with VCs (6)
Block Selection for Cache Locking
[Figure: block selection results for the FFT code.]
The best scenario for the FFT code: by locking 25% of the cache, 124 cache misses out of 246 (more than 50%) may be avoided.
MT Based Cache Locking with VCs (7)
Simulation: Assumptions
• We simulate a 4-core system where CL1 (I1, D1) is private to each core and CL2 is shared by all cores.
• Cache locking is done at the level-2 cache (the CL2 cache replacement policy is modified to exclude the locked blocks).
• Selective pre-loading and stream buffering are considered.
• The write-back memory update policy is used.
• The bus delay between CL2 and main memory is 10 times longer than that between CL1 and CL2.
MT Based Cache Locking with VCs (8)
Simulation: Simulated Architecture
[Figure: a 4-core system in which each core has private I1/D1 caches and its own VC on Bus1; the shared CL2 and the MT connect over Bus2 to main memory.]
MT Based Cache Locking with VCs (9)
Simulation: Applications and Parameters
• Moving Picture Experts Group’s MPEG4 (part 2)
• Advanced Video Coding, widely known as H.264/AVC
• Fast Fourier Transform (FFT)
• Matrix Inversion (MI)
• Discrete Fourier Transform (DFT)
MT Based Cache Locking with VCs (10)
Simulation Tools
• Heptane
• VisualSim
MT Based Cache Locking with VCs (11)
Simulation Results: Impact of MT based cache locking
For MPEG4, MT based cache locking (up to 25%) decreases the mean delay per task. MT and cache locking have no positive impact on FFT delay.
MT Based Cache Locking with VCs (12)
Simulation Results: Impact of MT based cache locking with VCs
For MPEG4, MT based cache locking (up to 25%) with VCs improves performance. However, FFT performance is not improved by the MT and VCs.
MT Based Cache Locking with VCs (13)
Simulation Results: Impact of MT based cache locking with VCs
[Figure: two panels of total power consumption results.]
Similarly, MT based cache locking (up to 25%) with VCs reduces total power consumption for MPEG4. However, the MT and VCs have no positive impact on total power consumption for FFT.
MT Based Cache Locking with VCs (14)
Summary: Impact of MT based cache locking with VCs
In summary, a 33% reduction in mean delay per task and a 41% reduction in total power consumption are achieved for MPEG4 using the MT and VCs. Also, as explained earlier, predictability is improved by avoiding more than 50% of the cache misses when 25% of the cache is locked using the MT.
“Cache Optimization for Real-Time
Embedded Systems”
• Problem Statement and Major Contributions
• Background (survey, motivation)
• Cache Optimization for Single-Core Systems (performance)
• Cache Optimization for Multi-Core Systems (performance/power)
• Cache Locking in Real-Time Systems (predictability)
• Miss Table Based Cache Locking with Victim Caches
  – Schematic Diagram, Workflow
  – Workload Characterization
  – Simulation Results (predictability and performance/power ratio)
• Conclusion and Future Extensions
Conclusion
• We develop cache modeling and optimization techniques to improve performance and decrease power consumption.
• We develop a cache locking scheme to improve predictability.
• We introduce a Miss Table based cache locking scheme with victim caches to improve predictability and the performance/power ratio for real-time embedded systems at the same time.
• We develop techniques to generate workloads for cache optimization.
• Simulation results show that the proposed Miss Table improves the performance of cache locking and the cache replacement policy.
• Simulation results also show that victim caches improve the hit ratio (by supporting stream buffering).
• It is also observed that the proposed Miss Table based cache locking with victim caches has a significant impact on predictability and the performance/power ratio for the MPEG4 code, but not for the FFT code.
Future Extensions
• This work can be extended to address the following important open issues:
  – Investigating data cache locking for embedded systems
  – Exploring cache locking at various cache levels
  – Studying different cache memory subsystems for power-aware multi-core architectures
CACHE OPTIMIZATION FOR REAL-TIME
EMBEDDED SYSTEMS
Abu Asaduzzaman
E-mail:
Telephone:
+1-561-297-2452 (Land)
+1-561-843-2231 (Cell)