An Efficient Memory Block Selection Strategy to Improve the Performance of Cache Memory Subsystem

Abu Asaduzzaman
Dept. of Electrical Engineering and Computer Science, Wichita State University, Wichita, Kansas, USA

[email protected]

Abstract

Although cache improves performance by reducing the speed gap between the CPU and main memory, it increases timing unpredictability due to its dynamic nature. Cache also requires a significant amount of power to operate. Unpredictability and power consumption become even worse in multicore systems due to the presence of multiple levels of caches. Recent studies indicate that predictability can be increased and total power consumption decreased, without compromising performance, by locking appropriate memory blocks. The success of cache locking depends on the accurate selection of the blocks to be locked. In this work, we propose a simple but efficient memory block selection strategy to enhance cache locking, cache replacement, and overall cache memory subsystem performance. The proposed scheme determines the blocks that would produce more cache misses if not locked and stores the block address and miss information (BAMI) at the cache level. Using BAMI, the cache locking technique locks memory blocks with higher cache misses and the cache replacement policy selects victim blocks with lower cache misses. We simulate single-core and multi-core systems, both with a two-level cache memory subsystem, to evaluate the proposed block selection scheme. Experimental results show that predictability can be improved by increasing the hit ratio by up to 11%, and total power consumption can be decreased by up to 20%, using our memory block selection scheme.

Keywords: cache memory subsystem, cache replacement policy, memory block selection, total power consumption

I. INTRODUCTION

The cache memory subsystem is the primary performance bottleneck for processing real-time and hard real-time applications on single-core and multi-core systems. Real-time application processing suffers seriously from memory inefficiency in the form of dropped frames, blocking, or other annoying artifacts [1][2]. In order to support real-time and hard real-time processing, the cache memory hierarchy is changing. Cache is introduced to improve performance by reducing the speed gap between the processor and the main memory. However, two critical issues, namely total power consumption and execution time predictability, need to be addressed to support real-time applications. A significant amount of power is required to keep caches operational, as caches are power-hungry. Cache is also one of the major sources of unpredictability due to its adaptive and dynamic characteristics. In the presence of cache, programs may behave in unexpected ways, and it becomes very difficult to develop real-time applications for such a system. A lot of work has been done to predict the worst-case behavior of real-time applications in order to determine safe and precise bounds on tasks' worst-case execution time (WCET) and cache-related preemption delay [3].

Cache locking is an important mechanism that adapts caches to the needs of real-time application processing. Recent studies show that the time required to perform a memory access is predictable with static/dynamic data/instruction cache locking [1][4][5]. It is also observed that cache locking improves predictability by removing both intra-task and inter-task interference [6][7]. Cache locking performance can be improved if blocks are effectively selected and copied into the way(s) to be locked. Here, the term 'cache locking' means 'way cache locking' (not 'entire cache locking'), and the term 'cache' refers to a 'w-way set-associative cache'. In this work, we present a memory block selection scheme that can be used to boost (way) cache locking, cache replacement, and overall cache memory performance. This paper is organized as follows. Section II discusses related articles. Section III presents the proposed memory block selection scheme. Experimental details to evaluate the proposed scheme are described in Section IV. Section V discusses the experimental results. We conclude our work in Section VI. Finally, we present a memory block selection table in Appendix A.

II. SURVEY

Much progress has been made in the past on cache memory performance, addressing performance, predictability, and power consumption issues. Recently published articles also address cache memory issues in multicore systems. A selection of related articles is discussed in this section. A static cache locking approach is presented in [3]. In this approach, caches in real-time systems statically lock their contents so as to make memory access times and cache-related preemption delays more predictable. However, it is not clear how, and which, cache blocks are selected for locking. This approach should also be tested on larger real (non-synthetic) benchmarks, specifically on task sets whose size is much larger than the cache size. In [4], a static method is presented to limit the worst-case instruction cache miss ratio of a program. This method needs no manual annotations in the code and is safe in the sense that no under-estimation is possible. However, the method is not reliable; it may produce inaccurate results. The impact of I1 (level-1 instruction cache) entire locking and CL2 (level-2 cache) partial locking on the predictability of multicore real-time systems is studied in [1]. Experimental results show that for smaller code segments (like FFT), CL2 partial locking improves predictability more than I1 entire locking does, but for large applications (like MPEG4), I1 entire locking outperforms CL2 partial locking. This work needs to be tested and verified against standard benchmarks.

In [6], a technique is proposed to obtain predictability in preemptive multitasking systems in the presence of data caches. Cache partitioning, dynamic cache locking, and static cache analysis are combined to provide worst-case performance estimates in a safe and tight way. Cache partitioning divides the cache among tasks to eliminate inter-task cache interference. The cache is loaded with the data most likely to be accessed so that cache utilization is maximized. Experimental results indicate that this scheme is very predictable without compromising the performance of the transformed programs. However, the technique focuses only on data cache locking and is not suitable for instruction cache locking analysis. In [7], compile-time cache analysis is combined with a data-cache locking algorithm to estimate the worst-case memory performance (WCMP) in a safe, tight, and fast way. To obtain predictable cache behavior, the cache is locked for those parts of the code where static analysis fails. According to the experimental results, this scheme is very predictable; it eliminates all overestimation for the set of benchmarks even when the state of the cache is unknown, giving an exact WCMP of the transformed program, though the transformation may decrease system performance. Finally, a methodology to select a set of instructions to be preloaded and locked in the cache using a genetic algorithm is proposed in [8]. The implemented algorithm performs a well-directed search, improves performance, and simultaneously estimates a tight upper bound on the response time of tasks. However, the algorithm does not use information about the task structure, and there is no guarantee that the selection will improve predictability.

III. PROPOSED MEMORY BLOCK SELECTION STRATEGY

In this work, we introduce an efficient memory block selection strategy, primarily for way/cache locking, based on the static tree-graph generated by the Heptane package [9]. The main objective of this scheme is to select as many as possible of the blocks that might generate more misses in the near future. Heptane takes the C source file of the application as input and generates a tree-graph. The nodes of the syntax tree represent the structure of the C program and the leaves represent the basic blocks. Leaves in the syntax tree coincide with the nodes in the control-flow graph. We collect the following information from the tree-graph (a sketch of such a per-node record follows the list).

• Name of the node

• Number of instructions

• Total number of cycles

• Cache miss (and cache hit) for each node
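
As an illustration, the per-node information listed above can be kept in a small record. The following Python sketch is our own; the type and field names are illustrative and do not reflect Heptane's actual output format.

    # Hypothetical per-node record for the tree-graph information above.
    from dataclasses import dataclass

    @dataclass
    class NodeInfo:
        name: str          # name of the node
        instructions: int  # number of instructions in the node
        cycles: int        # total number of cycles
        misses: int        # cache misses observed for this node
        hits: int          # cache hits observed for this node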

From off-line analysis, we determine which code sections of the source file cause more misses. We divide the analysis into several parts, including the root node of the main C source file, the calling functions of the C source file, all leaf-node analysis for the root node, and top loop-node level analysis. The major steps of the memory block selection scheme are shown in Figure 1. The process starts by generating a Heptane tree-graph from the application code. All blocks that cause cache misses are collected through off-line analysis of the tree-graph. We collect instruction block (IB) address cache miss (and cache hit) information based on the tree-graph generated by the Heptane WCET analyzer. The list of blocks is then sorted so that the block causing the maximum number of misses becomes the first candidate to be locked, and so on. If the number of miss-causing blocks is larger than the locked cache size, the candidate blocks are determined based on cache size, line size, and locking capacity. To implement the cache locking scheme, a small routine must be executed at system start-up time to load the cache with the selected IB-address contents and lock the cache/ways so that the contents remain available during the whole system execution.
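
To make the sorting and capacity check concrete, the following hedged Python sketch shows the off-line selection step. The function name, input format, and example miss counts are our own illustration, not part of the Heptane tool flow.

    def select_blocks_to_lock(block_misses, cache_size, line_size, locked_fraction):
        # block_misses maps an instruction block (IB) address to its miss count.
        total_blocks = cache_size // line_size          # blocks the cache holds
        capacity = int(total_blocks * locked_fraction)  # blocks the locked ways hold
        # Rank blocks so the one causing the most misses is the first candidate;
        # blocks with zero misses gain nothing from locking and are dropped.
        ranked = sorted(((a, m) for a, m in block_misses.items() if m > 0),
                        key=lambda am: am[1], reverse=True)
        return [addr for addr, _ in ranked[:capacity]]

    # Example with the Appendix A parameters (2 KB cache, 128 B lines, 25% locked):
    # 16 blocks total, so the 4 highest-miss blocks are selected.
    locked = select_blocks_to_lock(
        {0x000: 52, 0x080: 30, 0x300: 22, 0x100: 20, 0x440: 0},
        cache_size=2 * 1024, line_size=128, locked_fraction=0.25)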

Fig. 1 Workflow diagram of proposed memory block selection strategy.

The proposed memory block selection scheme determines the right set of effective blocks to be locked by cache locking and helps the cache replacement scheme select effective blocks to replace, which is very important for increasing predictability and decreasing power consumption. In addition to cache locking, this block selection scheme may be used for pre-loading, pre-fetching, and stream buffering to select the block(s) with the maximum number of cache misses. The scheme may also be used to select the victim block (the block with the minimum number of cache misses) for cache replacement.

IV. EXPERIMENTAL SETUP

In this work, we model two systems, one with a single core and the other with multiple cores; both systems have two-level cache memory subsystems. We use two simulation tools, Heptane and VisualSim. Heptane is used to obtain the number of misses per block and to simulate single-core I1 cache locking. VisualSim is used to simulate both the single-core and the multi-core systems. We use the proposed memory block selection strategy in I1/CL2 cache locking and I1 cache replacement. In the following subsections, we briefly describe the target architecture, the Heptane package, the VisualSim simulator, static cache analysis, I1 cache locking, and the applications used.

A. Target Architecture

Two computing architectures are considered to evaluate our proposed memory block selection scheme. The first is a simplified version of the Intel Pentium 4 processor, as illustrated in Figure 2. The architecture consists of a cache memory subsystem with two cache levels. The first-level cache (CL1) is split into an instruction cache (I1) and a data cache (D1); the second-level cache (CL2) is unified. Both I1 and CL2 cache locking are simulated (I1 cache locking by Heptane and CL2 cache locking by VisualSim). Block address and miss information (BAMI) is implemented at the cache level such that both CL1 (to enhance cache locking and replacement) and CL2 (to improve cache locking) can access it when needed.

Fig. 2 A single-core system with CL1 (I1+D1), CL2, and BAMI.

The schematic diagram in Figure 3 represents the second architecture we simulate, a simplified version of the Intel Xeon quad-core. Each core has its own private I1 and D1, and a unified CL2 is shared. All I1s and the CL2 have access to BAMI. VisualSim is used to model the multi-core system, and a Heptane-generated workload is used to run the VisualSim simulation program.

Fig. 3 A multicore system with four cores (each core has CL1), shared CL2, and BAMI.

In this study, we consider I1 = D1 with cache sizes ranging from 4 KB to 64 KB, CL2 with cache sizes ranging from 256 KB to 4096 KB, line sizes from 32 to 512 bytes, and associativity levels from 1 (direct-mapped) to 16 (set-associative). An instruction is assumed to execute in 1 clock cycle in the case of a cache hit and in 10 clock cycles otherwise.
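
Under this 1-cycle-hit/10-cycle-miss assumption, the benefit of a higher hit ratio can be read off a simple expected access time calculation. The short Python sketch below is illustrative only; the hit ratios are those later reported in Table II.

    def avg_access_cycles(hit_ratio, hit_cycles=1, miss_cycles=10):
        # Expected cycles per access under the stated latency assumption.
        return hit_ratio * hit_cycles + (1 - hit_ratio) * miss_cycles

    # Table II's FFT hit ratios at 25% locking:
    print(avg_access_cycles(0.897))  # random selection -> ~1.93 cycles
    print(avg_access_cycles(0.946))  # proposed strategy -> ~1.49 cycles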

B. Simulation Tools

We use Heptane (short for Hades Embedded Processor Timing ANalyzEr) [9] and VisualSim (short for VisualSim Architect) [10] in this work. Heptane is suitable for characterizing the applications and for simulating single-core, one-level cache systems. VisualSim is suitable for modeling and simulating multi-core systems with multi-level caches. VisualSim uses the Heptane-generated workload to run the simulation program; the simulation cockpit is used to change the values of the input parameters (without changing the program) and to store the results as text and/or graph files.

C. Cache Locking

In this work, we consider a w-way set-associative cache in which some ways can be locked (excluded from cache replacement) for the entire execution period. Under the way/cache locking technique, blocks that cause more misses have higher chances of being locked. The algorithm aims at optimizing task-set schedulability by maximizing the hit ratio. On the considered architecture, when cache locking is used, the cache-related preemption delay is constant and equal to the delay required to refill the processor pre-fetch buffer (10 clock cycles in this case).
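
As an illustration of way locking in one set, the following Python sketch (a simplification of ours, not the simulators' internal representation) marks the first ways as locked so that replacement never touches them.

    class LockedSet:
        # Minimal model of one w-way set with way locking: the first `locked`
        # ways hold preloaded blocks and are excluded from replacement.
        def __init__(self, ways, locked, preload):
            self.blocks = [None] * ways   # block address resident in each way
            self.locked = locked          # ways 0..locked-1 are never evicted
            for way, addr in enumerate(preload[:locked]):
                self.blocks[way] = addr   # selected high-miss blocks stay resident

        def replaceable_ways(self):
            # Replacement may only use the ways past the locked region.
            return list(range(self.locked, len(self.blocks)))

    # Example: an 8-way set with 2 ways (25%) locked on the two top-miss blocks.
    s = LockedSet(ways=8, locked=2, preload=[0x000, 0x080])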

D. Cache Replacement Policy

We start with a random cache replacement policy for both CL1 and CL2. However, to take advantage of the BAMI introduced by the proposed block selection scheme, we modify the random cache replacement policy for CL1. While selecting a victim block, it consults BAMI and selects a block with the minimum number of cache misses. This modification makes better use of the cache by maximizing the hit ratio.
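
A hedged Python sketch of this modified victim selection follows; the form of the BAMI lookup and the handling of empty ways are our assumptions.

    def choose_victim(set_blocks, locked_ways, bami):
        # set_blocks: resident block address per way (None if the way is empty);
        # bami: assumed mapping from block address to recorded miss count.
        unlocked = range(locked_ways, len(set_blocks))
        for way in unlocked:              # an empty way needs no eviction
            if set_blocks[way] is None:
                return way
        # Otherwise evict the resident block with the fewest recorded misses,
        # i.e., the block least worth keeping according to BAMI.
        return min(unlocked, key=lambda w: bami.get(set_blocks[w], 0))

    # Example: 8 ways, 2 locked; way 5 holds the fewest-miss block and is evicted.
    victim = choose_victim(
        [0x000, 0x080, 0x300, 0x100, 0x2c0, 0x1a0, 0x240, 0x3c0], 2,
        {0x300: 7, 0x100: 5, 0x2c0: 9, 0x1a0: 1, 0x240: 6, 0x3c0: 4})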

E. Applications

In this work, we use five popular algorithms/applications to run the simulation program: Fast Fourier Transform (FFT), Matrix Inversion (MI), Discrete Fourier Transform (DFT), the Moving Picture Experts Group's MPEG4, and Advanced Video Coding, widely known as H.264/AVC. Important information about these applications is listed in Table I. Computing time (no locking and I1 locking) is obtained using the Heptane simulator for a single-core system with CL1 (I1+D1) size 4 KB + 4 KB, line size 128 B, and 8-way associativity. Here, memory blocks are randomly selected for locking, and cache locking shows some performance improvement. Our proposed memory block selection technique should select blocks that cause more cache misses and should further increase the cache hit rate.

Table I Important parameters and their values

    Application       Code Size   Computing Time (Kilo Cycles)
    (Code Segment)    (Bytes)     No Locking   Cache Locking
    FFT                  8507        445368          396294
    MI                   6606        839335          751468
    DFT                  5211        983052          880442
    MPEG4               87503       5475836         4076419
    H.264/AVC           75037       4805167         3363616

V. RESULTS AND DISCUSSION

In this work, we introduce an efficient memory block selection scheme to improve cache locking, cache replacement, and overall cache memory performance. In the following subsections we evaluate the proposed scheme by presenting some important experimental results.

A. Proposed Block Selection Scheme Vs. Random Selection

Using the proposed memory block selection scheme in a single-core system for I1 cache locking, the I1 hit ratio is obtained by varying the locked cache size from 0% (no locking) to 50% of the cache size for the FFT code. For comparison, the experiment is repeated 25 times with randomly selected memory blocks locked. As shown in Table II, the hit ratio keeps increasing as the locked portion grows from 0% (no locking), reaches its maximum at 25% locking, and starts decreasing once more than 25% of the cache is locked. We observe that our proposed scheme offers a higher hit ratio (i.e., better predictability) than the random scheme.

Table II Cache Locking and I1 Hit Ratio for FFT – Proposed Scheme vs. Random Selection
(I1 size 4 KB, line size 128 B, associativity 8-way)

    Portion of   Num of Blocks   I1 Hit Ratio
    I1 Locked    Locked          Random Selection   Proposed Strategy
    0%             0             0.771              0.822
    12.5%          4             0.865              0.915
    25.0%          7             0.897              0.946
    37.5%         10             0.862              0.914
    50.0%         16             0.851              0.905

We also collect the I1 hit ratio for MI and DFT; the results for FFT, MI, and DFT are very similar. Therefore, FFT also represents the omitted MI and DFT results in the rest of the paper. As 25% cache locking shows the highest cache hit ratio, in the following subsections we discuss the impact of the proposed memory block selection scheme at 25% cache locking.

B. Impact of Proposed Block Selection Scheme on Single-Core Cache System

We apply our memory block selection strategy in a single-core system with varying I1 cache size and line size and obtain the I1 hit ratio. As shown in Figure 4, for a fixed line size and associativity level, the hit ratio increases with I1 cache size for no locking, I1 locking (random), and I1 locking (selection), as the number of capacity misses decreases. We observe that the hit ratio due to the proposed scheme is higher than that of random selection.

Fig. 4 Impact of proposed scheme on I1 hit ratio for various I1 cache size.

Similarly, the hit ratio increases with increasing line size (from 32 bytes). However, after a certain point (here, 128 bytes), the hit ratio starts decreasing as conflict misses increase due to cache pollution (see Figure 5). It is important to note that, for any cache size or line size, the hit ratio due to the proposed block selection scheme is higher than that of randomly selected blocks. This is because our block selection scheme deliberately selects blocks that would cause more misses if not locked.

Fig. 5 Impact of proposed scheme on I1 hit ratio for various I1 line size.

C. Impact of Proposed Block Selection Scheme on Multicore Cache System

We apply our memory block selection strategy in a 4-core system with varying total I1 size and locked I1 size (in %) and obtain the mean delay per task and total power consumption for the FFT code (see Figures 6 and 7). Figure 6 illustrates how the mean delay per task decreases as I1 size increases. Similarly, Figure 7 illustrates how total power consumption decreases with increased locked I1 cache size (up to 25%). In all cases, the mean delay per task and total power consumption due to the proposed block selection scheme are lower than those of randomly selected blocks. We obtain similar results using the MI and DFT codes.

Fig. 6 Impact of proposed scheme on mean delay per task for FFT code on a system with 4 cores.

Fig. 7 Impact of proposed scheme on total power consumption for FFT code on a system with 4 cores.

VI. CONCLUSION

Using cache is a very effective and popular way to improve overall system performance. However, due to its dynamic and adaptive behavior, cache introduces execution time unpredictability. Unpredictability and power consumption become even worse in most multicore systems, as they have multiple levels of caches. Various cache optimization techniques, including cache replacement and cache locking, have been proposed to address these issues. The success of cache locking and cache replacement schemes depends on an effective memory block selection strategy, so that blocks with the maximum numbers of misses can be locked and blocks with no or a minimum number of misses can be replaced. In this work, we present a simple but efficient memory block selection strategy that helps enhance cache locking, cache replacement, and overall cache memory subsystem performance. In the proposed scheme, memory blocks that cause more cache misses are determined, and the block address and miss information (BAMI) is stored at the cache level. Cache locking and cache replacement techniques use BAMI to determine the blocks to be locked and replaced, respectively. To evaluate our scheme, we simulate single-core and multi-core systems, both with a two-level cache memory subsystem. In the single-core system, we implement cache replacement and cache locking at I1 using Heptane and cache locking at CL2 using VisualSim, and obtain the I1 hit ratio for varying I1 cache size and line size. In the multi-core system, we implement cache locking and cache replacement at CL2 using VisualSim and obtain the mean delay per task and total power consumption for varying CL2 cache size and locked CL2 size. In all cases, we consider both random selection and the proposed selection strategy. We use FFT, MI, and DFT workloads to simulate the single-core system and MPEG4 and H.264/AVC workloads to simulate the multi-core system. Experimental results show that our proposed memory block selection scheme improves predictability by increasing the cache hit ratio, and that the hit ratio is maximum at 25% locking. From the single-core simulation, it is observed that the highest hit ratio is obtained using our memory block selection scheme for different I1 cache sizes and line sizes. From the multi-core simulation, it is observed that the lowest mean delay per task and the lowest total power consumption are obtained using our memory block selection scheme for different CL2 cache sizes and locked CL2 sizes. According to the experimental results, predictability is improved by increasing the hit ratio by up to 11%, and total power consumption is decreased by up to 20%, using our memory block selection scheme, as it selects blocks that would create more misses if not locked. We plan to study the impact of multi-level cache locking in a multi-core system in our next endeavor.

VII. APPENDIX

APPENDIX A: Memory Block Selection Table

Table III shows the block addresses and cache misses per block obtained for the FFT code. The table is sorted in descending order of the number of cache misses per block. Cache blocks can be selected for cache locking depending on the cache size, line size, and locked cache size. For example, if the cache size is 2 KB, the line size is 128 B, and the associativity level is 8-way, then the number of blocks is 2*1024/128 = 16, so information about the top 16 blocks (or fewer) should be stored. Locking 2 of the 8 ways (i.e., 25% of the cache size) means 4 blocks (25% of 16) should be selected for locking. In the best case, the top four entries (block addresses 0, 80, 300, and 100) are selected to be locked. In this example, the total number of cache misses is 246. By locking the first four entries, 124 cache misses can be avoided (i.e., cache misses are reduced by more than 50%). Therefore, execution time predictability can be increased by locking a small portion of the cache.
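
The arithmetic of this example can be reproduced directly; the following short Python sketch uses the stated cache parameters and the miss totals reported for Table III.

    cache_size, line_size = 2 * 1024, 128   # 2 KB cache, 128 B lines
    ways, locked_ways = 8, 2                # lock 2 of 8 ways (25%)

    blocks = cache_size // line_size        # 2*1024/128 = 16 blocks
    locked = blocks * locked_ways // ways   # 25% of 16 = 4 blocks to lock

    total_misses, avoided = 246, 124        # totals reported for Table III
    print(locked, "of", blocks, "blocks locked;",
          round(100 * avoided / total_misses), "% of misses avoided")  # ~50%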

It should be noted that blocks with zero or a small number of cache misses have little or no impact on predictability and should be excluded from locking. In this case, the bottom three entries (block addresses 440, 120, and 260) are not considered for cache locking.

REFERENCES

[1] A. Asaduzzaman, I. Mahgoub, F.N. Sibai, “Impact of L1 Entire Locking and L2 Way Locking on the Performance, Power Consumption, and Predictability of Multicore Real-Time Systems”, IEEE AICCSA'09, Morocco, 2009.

[2] A. Asaduzzaman, I. Mahgoub, et al., “Cache Optimization for Mobile Devices Running Multimedia Applications,” IEEE International Symposium on Multimedia Software Engineering (ISMSE’04), Miami, Florida, 2004.

[3] I. Puaut, “Cache analysis vs static cache locking for schedulability analysis in multitasking real-time systems,” 23rd Real-Time System Symposium, INSA/IRISA, France, 2004.

[4] F. Sebek and J. Gustafsson, “Determining the Worst-Case Instruction Cache Miss-Ratio,” Sweden, 2002.

[5] A.M. Molnos, M.J.M. Heijligers, et al., “Data Cache Optimization in Multimedia Applications,” Proceedings of the 14th Annual Workshop on Circuits, Systems and Signal Processing (ProRISC’03), Veldhoven, The Netherlands, pp. 529-532, 2003.

[6] X. Vera, B. Lisper, J. Xue, “Data caches in multitasking hard real-time systems,” Real-Time Systems Symposium, 24th IEEE Volume, pp. 154-165, 2003.

[7] X. Vera, B. Lisper, J. Xue, “Data Cache locking for Higher Program Predictability,” SIGMETRICS’03, CA, USA, June 2003.

[8] A.M. Campoy, et al., “Using Genetic Algorithms in Content Selection for Locking-Caches,” IASTED International Symposia Applied Informatics, pp. 271-276, Austria, 2001.

[9] “Heptane (Hades Embedded Processor Timing ANalyzEr),” A Static WCET Analyzer, 2011. www.irisa.fr/aces/work/heptane-demo/heptane.html

[10] “VisualSim Architecture,” A system-level simulator, Mirabilis Design, 2011. www.mirabilisdesign.com

