
PACT: Priority-Aware Phase-based Cache Tuningfor Embedded Systems

Sam Gianelli and Tosiron AdegbijaDepartment of Electrical & Computer Engineering

University of Arizona, Tucson, AZ, USAEmail: {samjgianelli, tosiron}@email.arizona.edu

Abstract—Due to the cache’s significant impact on an embedded system, much research has focused on cache optimizations, such as reduced energy consumption or improved performance. However, throughout an embedded system’s lifetime, the system may have different optimization priorities, due to variable operating conditions and requirements. Variable optimization priorities, embedded systems’ stringent design constraints, and the fact that applications typically have execution phases with varying runtime resource requirements necessitate new, robust optimization techniques that can dynamically adapt to different optimization goals. In this paper, we present priority-aware phase-based cache tuning (PACT), which tunes an embedded system’s cache at runtime in order to dynamically adapt the cache configurations to varying optimization goals (specifically EDP, energy, and execution time), application execution phases, and operating conditions, while accruing minimal runtime overheads.

Index Terms—Configurable memory, design space exploration, cache tuning, cache memories, low-power design, low-power embedded systems, adaptable hardware

I. INTRODUCTION AND MOTIVATION

Caches contribute significantly to an embedded system’s energy consumption and performance. As a result, much optimization research focuses on optimizations that improve the cache’s energy efficiency and performance, while accruing minimal optimization overheads. Configurable caches [20] have been widely studied as a viable architecture for minimizing the cache’s energy consumption. Using configurable caches, the cache configurations (i.e., cache size, associativity, and line/block size) can be dynamically tuned or specialized to executing applications’ resource requirements, in order to minimize energy overheads from an over-provisioned cache.

However, cache optimization, otherwise known as cache tuning, is very challenging due to embedded systems’ typical design constraints (e.g., area, energy, real-time constraints), potentially large configurable cache design spaces, and the runtime variability of embedded systems applications and executing conditions [3]. Depending on different factors, such as resource availability, user requirements, and application characteristics, an embedded system’s optimization goals (e.g., energy/execution time minimization) may change throughout its lifetime. Ideally, a runtime cache tuning technique must account for variable application requirements and optimization goals.

Cache optimization challenges are further compounded by the presence of different execution phases in emerging embedded systems applications [17]. A phase is an execution interval during which the application’s execution characteristics (e.g., cache miss rates, instructions per cycle (IPC), etc.) are relatively stable. Different phases within the same application, due to their different characteristics, may have different resource requirements. To account for these variable resource requirements, phase-based cache tuning [3], [9] can achieve fine-grained optimization by tuning the cache configurations to determine configurations that satisfy the different application phases’ requirements. Prior work has shown that phase-based cache tuning can reduce the cache’s energy consumption by up to 62% [10].

However, phase-based cache tuning can result in runtime execution time overhead. To mitigate this overhead, prior work [3] proposed phase distance mapping (PDM) as an analytical approach to phase-based cache tuning. PDM calculates the relative distance between a new executing phase and a previously tuned base phase, and approximates the new phase’s best cache configuration based on the distance between the new and base phases’ execution characteristics. PDM, much like several prior cache optimization techniques [9], [15], [18], has the drawback of only focusing on one optimization goal throughout the system’s lifetime. Due to the adversarial nature of optimization goals—improving the energy, for example, may degrade execution time—effective optimization techniques must be adaptable to changing runtime operating conditions (the current state of the device, e.g., critically low battery) that may necessitate variable optimization goals.

In this paper, we present Priority-Aware phase-based Cache Tuning (PACT). PACT incorporates the notion of priorities into phase-based cache tuning using PDM, and allows different optimization goals to be dynamically prioritized in order to satisfy variable operating conditions. Specifically, PACT allows the energy delay product (EDP), energy, or execution time to be prioritized, without incurring any additional tuning overhead with respect to prior work. Using experimental results, we show that PACT can trade off non-prioritized optimization goals for prioritized optimization goals in order to satisfy varying runtime resource requirements. PACT determines configurations that are optimal or near-optimal for the specified optimization priority, and improves over PDM for energy and execution time optimization.

978-1-5090-6762-6/17/$31.00 © 2017 IEEE


II. BACKGROUND AND RELATED WORK

Cache tuning uses a configurable cache and a cache tuning mechanism—software or hardware cache tuners—to determine the best cache configurations that satisfy an application’s execution requirements. The key challenge in cache tuning is accurately determining the best configuration without incurring significant tuning overheads. To address this challenge, several prior cache tuning techniques [3], [6], [9]–[11], [14], [16], [19] have been proposed. In general, these prior phase-based cache tuning techniques can be broadly categorized as exhaustive search, heuristic, or analytical methods [3].

Chen et al. [6] proposed an algorithm that exhaustively explored the entire cache design space to determine the optimal configuration. While this method was highly accurate, it also incurred significant overheads due to the amount of time required to fully explore the cache design space, thus impeding optimization potential. To reduce the cache tuning overhead, Rawlins et al. [16] proposed a cache tuning heuristic that significantly pruned the design space during the exploration process, while achieving near-optimal results.

To further reduce cache tuning overheads, analytical methods have been proposed to directly determine the best cache configurations without the need to explore inferior configurations. Gordon-Ross et al. [10] used an oracle-based approach to non-intrusively predict the best configuration without incurring runtime tuning overheads with respect to execution time. However, the proposed approach incurred hardware overheads as a result of the oracle hardware. Other analytical techniques have been proposed (e.g., [8], [15]); however, most of these techniques are computationally complex, thus impeding optimization potential in resource-constrained embedded systems.

To address the challenges of prior work, Adegbija et al. [3] proposed phase distance mapping (PDM), which was based on the hypothesis that the more disparate two phases’ characteristics are, the more disparate the phases’ best configurations are likely to be. Given a known phase’s best configuration, PDM predicts a new phase’s best configuration using the distance between the two phases’ characteristics—the phase distance—to estimate the distance between the two phases’ best configurations. However, PDM only targets EDP optimization and cannot satisfy changing runtime optimization goals.
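PDM’s core idea can be sketched as follows. This is only an illustration of the hypothesis above, not the authors’ exact model: the `approximate_config` mapping and its doubling step are hypothetical.

```python
# Illustrative sketch of PDM's idea (hypothetical mapping, not the
# authors' exact model): step away from the base phase's tuned
# configuration in proportion to the phase distance.
def approximate_config(base_config, phase_distance, step=0.5):
    """base_config = (size_kb, assoc, line_bytes). Each `step` of phase
    distance doubles the cache size (capped at 32 KB), reflecting the
    hypothesis that more disparate phases likely need more disparate
    configurations."""
    size_kb, assoc, line_b = base_config
    for _ in range(int(phase_distance // step)):
        size_kb = min(size_kb * 2, 32)
    return (size_kb, assoc, line_b)

# A phase far from the base phase gets a proportionally larger cache.
print(approximate_config((4, 2, 32), phase_distance=1.2))
```

The point of the sketch is that no inferior configurations are explored at runtime: the new phase’s configuration is computed directly from the base phase’s configuration and the phase distance.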

We address challenges in prior work by developing a new priority-aware phase-based cache tuning (PACT) method that allows dynamic cache tuning to specifically prioritize different optimization goals based on the current system operating conditions/device state. We show that PACT achieves these variable optimization goals without incurring any runtime tuning overheads with respect to prior work.

III. PRIORITIZED PHASE-BASED CACHE TUNING

In the discussions herein, we assume that the embedded system is equipped with a highly configurable cache [20] with configurable size, line size, and associativity. A highly

Fig. 1: Flowchart of PACT (the phase characteristics and device state feed a phase history table lookup; a previously seen phase’s configuration Cpi,s is retrieved from the table and used to execute the phase, while a new phase invokes the PACT algorithm and its configuration is added to the phase history table)

configurable cache utilizes multiple memory banks, each of which can be shut down to configure the cache size, or concatenated to configure the associativity. Given a physical cache line size (e.g., 16B), multiple lines can be concatenated to logically form larger line sizes. In this work, we use a physical base 32KB configurable cache with 2KB banks. This cache offers configurable cache sizes ranging from 2KB to 32KB, associativities ranging from 1- to 4-way, and line sizes ranging from 16B to 64B, all in power-of-two increments. We direct the reader to [20] for details on the configurable cache’s circuitry and design.

To perform phase-based cache tuning, executing applications’ characteristics must first be classified into phases. Phase classification [17] breaks an application into execution intervals—intervals can be measured in number of instructions or time—and clusters the intervals with similar execution characteristics to form phases. Since phase classification has been widely studied in prior work, we skip the details in this paper. In this section, we present the problem statement, motivated by mobile devices (e.g., smartphones, tablets), describe an overview of the proposed approach, and present the PACT algorithm.
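The interval-clustering step can be sketched minimally as below. This is illustrative only: real classifiers such as SimPoint cluster richer interval signatures (e.g., basic-block vectors), and the miss-rate threshold here is a stand-in.

```python
# Minimal sketch of phase classification: split execution into fixed
# intervals, then cluster intervals whose cache miss rates fall within
# a similarity threshold of a phase's representative interval.
def classify_phases(interval_miss_rates, threshold=0.02):
    phases = []  # list of (representative miss rate, [interval ids])
    for i, mr in enumerate(interval_miss_rates):
        for phase in phases:
            if abs(phase[0] - mr) <= threshold:
                phase[1].append(i)   # interval joins an existing phase
                break
        else:
            phases.append((mr, [i]))  # interval starts a new phase
    return phases

# Five intervals collapse into two phases.
print(classify_phases([0.10, 0.11, 0.30, 0.29, 0.10]))
```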

A. Problem Statement

The primary challenge of phase-based cache tuning is accurately determining the best configurations at runtime without accruing significant runtime overheads. For example, our configurable cache in this work features a design space with 432 possible configurations (both instruction and data caches). This design space could exponentially increase in more complex caches. Exploring such a large design space at runtime (using exhaustive search, heuristics, or even analytical methods) for multiple optimization goals could be prohibitive for resource-constrained embedded systems. Thus, the objective of PACT is to determine the best cache configurations that satisfy variable runtime optimization goals, while accruing minimal runtime tuning overhead.


Given an embedded system, the optimization goals can be determined by different operating conditions or user requirements. For example, a battery-operated mobile device may have different operating conditions depending on the battery state. When operating on a full battery, the EDP may be prioritized to account for both energy and execution time optimization. When the battery is at critical levels (e.g., 10% of battery power), energy efficiency is prioritized over execution time. When the system is being charged, the execution time is prioritized, since energy is no longer a significant resource constraint.

B. Overview of PACT

To ensure low-overhead, high-speed, and accurate tuning, we considered two options for adding priorities to phase-based tuning. Since a phase’s best configurations, after they are determined, are traditionally stored in a phase history table [3], the first option uses log(n) additional bits to store each phase’s current optimization priority, where n is the number of device states (corresponding to optimization goals). An alternative option includes a separate transformation lookup table that stores configuration transformations between different priorities for each application phase. Thus, the table only stores the phases’ configurations for a single priority and the configuration transformations (i.e., change in configuration). When a new priority is required, the transformation is applied to the current configuration to determine a new configuration that satisfies the new optimization goal. In this work, we use the first option, due to its simplicity and low-overhead characteristics.
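The storage cost of the first option works out as follows (the entry layout is our illustration; only the ceil(log2(n))-bit priority field comes from the description above):

```python
import math

# Each phase history table entry stores the tuned configuration plus
# ceil(log2(n)) bits identifying which of the n device states
# (optimization priorities) the entry was tuned for.
def priority_bits(n_states):
    return math.ceil(math.log2(n_states))

# Three priorities (EDP, energy, execution time) fit in 2 bits.
print(priority_bits(3))  # -> 2
```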

Figure 1 presents an overview of PACT. PACT takes as input the phase characteristics, obtained during phase classification, and the current device state, which indicates the prioritized optimization goal. For example, the device state could be informed by a smartphone’s battery state: ‘fully charged’, ‘low power’, or ‘charging’, to indicate prioritization of EDP, energy, or execution time, respectively.

When a phase Pi is executed, PACT searches the phase history table for Pi or for phases with similar characteristics (cache miss rates) within a similarity threshold to Pi. The similarity threshold is a designer-specified or runtime-tunable feature of PACT that represents a tradeoff between tuning accuracy and tuning overheads. A larger similarity threshold reduces the tuning overhead at the expense of accuracy, while a smaller similarity threshold increases accuracy at the expense of tuning overhead.

In this work, we empirically established our similarity threshold, using the base cache configuration, by normalizing each phase’s cache miss rate to the base phase’s cache miss rate. We used the first executing phase as the base phase, and used a similarity threshold of 10. Phases with a normalized miss rate within the range 0–10 used the same phase history table entries; similarly for phases with normalized miss rates within 10–20, 20–30, etc. At runtime, the similarity threshold can easily be dynamically determined as shown in prior work [3].
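The binning above can be sketched as follows. Treating the normalized miss rate as a percentage of the base phase’s miss rate is our assumption; the paper does not state the normalization scale.

```python
# Phases whose miss rates, normalized to the base phase, land in the
# same width-10 bin (0-10, 10-20, ...) share a phase history table entry.
def history_table_bin(phase_miss_rate, base_miss_rate, threshold=10):
    normalized = 100.0 * phase_miss_rate / base_miss_rate  # assumed: percent
    return int(normalized // threshold)

# Phases at 80% and 85% of the base miss rate share the same bin.
print(history_table_bin(0.4, 0.5), history_table_bin(0.425, 0.5))
```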

Fig. 2: PACT algorithm. (Inputs: ConfigPMRU, P, n; output: ConfigPbest. The flowchart loops while Ci ≤ Cmax with Ci = Ci × 2, then while Ai ≤ Amax with Ai = Ai × 2, then while Li ≤ Lmax with Li = Li × 2, invoking checkState(P) at each step, repeated while j ≤ n.)

Fig. 3: PACT algorithm’s checkState subroutine

If an entry is found, the phase Pi, or a similar phase Pi,s, has been previously executed, and the phase’s best configuration Cpi is retrieved from the phase history table and used to execute Pi. A new phase’s similarity to an existent phase is a function of the phase distance between the two phases, which can be measured by the Euclidean distance between the two phases’ instruction and data cache miss rates.

If no entry is found, Pi is a new phase, and PACT executes the tuning algorithm (Section III-C) to determine Cpi. Cpi is then added to the phase history table, and the phase Pi is executed using Cpi.
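The phase distance used in the lookup above can be sketched as a Euclidean distance over the two phases’ instruction and data cache miss rates:

```python
import math

# Phase distance: Euclidean distance between two phases'
# (instruction cache miss rate, data cache miss rate) pairs.
def phase_distance(phase_a, phase_b):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(phase_a, phase_b)))

# A 3-4-5 example: miss-rate deltas of 0.03 and 0.04 give distance 0.05.
print(phase_distance((0.02, 0.05), (0.05, 0.09)))
```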

C. PACT Algorithm

Figure 2 depicts the PACT algorithm, which determines an executing phase’s best cache configuration, given a specified priority. The inputs to the algorithm are the number of iterations n, the most recently used cache configuration ConfigPMRU—comprised of the most recently used cache size, associativity, and line size, CMRU, AMRU, and LMRU, respectively—and the current priority, R. The number of iterations, n, specifies the number of phase executions required to fine-tune the phase’s best configuration; the default value of n is designer-specified, depending on the executing applications, but can be dynamically adjusted at runtime depending on the quality of the configurations being determined by PACT. For example, n may be dynamically increased or decreased for an executing phase depending on how much improvement is achieved with respect to the base configuration, and how often the phase is executed. A larger value of n would benefit a phase with several executions and may yield cache configurations that are closer to the optimal. We analyzed different values of n, and empirically determined that n = 3, which we used for our experiments (Section IV), provided a sufficient tradeoff between the number of iterations and optimization potential.

When a new phase is executed, the algorithm starts with the most recently used configurations and iteratively cycles through the cache sizes, associativities, and line sizes in power-of-two increments—this process is performed for all caches in the system (e.g., data and instruction)—until the maximum values are reached, checking at each step whether the current configuration yields better results than the stored best configuration. At system startup, the most recently used configurations default to the base configurations. Figure 3 depicts the algorithm for the checkState subroutine, which PACT uses to monitor the prioritized optimization goal at each iteration. For each phase, checkState(R) determines if the currently executing configuration [ConfigPi] improves over the best configuration [ConfigPbest] stored in the phase history table for priority R. If [ConfigPi] improves over [ConfigPbest], [ConfigPbest] is set to [ConfigPi] in the phase history table.
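The tuning loop can be sketched as below, reconstructed from the description above. The `measure` callback, the exact iteration order, and the tuple layout are our assumptions; in the real system, each candidate configuration is evaluated by executing the phase and reading back the prioritized metric.

```python
# Sketch of the PACT loop (cf. Figures 2 and 3): starting from the most
# recently used configuration, double the cache size, then the
# associativity, then the line size up to their maxima, running the
# checkState step (keep the better configuration) at each candidate.
C_MAX, A_MAX, L_MAX = 32 * 1024, 4, 64  # 32 KB, 4-way, 64 B

def pact_tune(mru_config, n, measure):
    """mru_config = (size_bytes, assoc, line_bytes); `measure` returns
    the prioritized metric for a configuration (lower is better)."""
    best = mru_config
    best_cost = measure(best)

    def check_state(config):                 # checkState subroutine
        nonlocal best, best_cost
        cost = measure(config)
        if cost < best_cost:
            best, best_cost = config, cost

    for _ in range(n):                       # n tuning iterations
        c, a, l = best
        while c <= C_MAX:                    # cycle cache sizes
            check_state((c, a, l))
            c *= 2
        c = best[0]
        while a <= A_MAX:                    # then associativities
            check_state((c, a, l))
            a *= 2
        a = best[1]
        while l <= L_MAX:                    # then line sizes
            check_state((c, a, l))
            l *= 2
    return best

# Toy metric whose minimum is a 16 KB, 2-way, 32 B configuration.
cost = lambda cfg: abs(cfg[0] - 16384) + 1000 * abs(cfg[1] - 2) + 10 * abs(cfg[2] - 32)
print(pact_tune((2048, 1, 16), n=3, measure=cost))
```

Note how the one-parameter-at-a-time sweep explores only a small fraction of the full design space, which is what keeps the runtime tuning overhead low.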

IV. EXPERIMENTS

A. Experimental Setup

To evaluate PACT’s effectiveness, we used 12 benchmarks from the SPEC CPU2006 benchmark suite [2]. We used SPEC benchmarks since they feature greater execution complexity and runtime variability, and provide a more rigorous test for PACT. We fast-forwarded each application for 300 million instructions, and ran the reference input sets for 1 billion instructions. We used SimPoint 3.2 [12] to determine the distinct phases in each application.

To model a system similar to modern embedded systems microprocessors and gather execution statistics, we implemented the proposed approach using GEM5 [5]. We simulated a system comprised of private level one instruction and data caches, with a base configuration featuring 32 KB size, 4-way set associativity, and 64 byte line size, similar to an ARM Cortex-A9 [1]. Given this base configuration, the configurable size ranged from 2 KB to 32 KB, associativity ranged from 1-way to 4-way, and the line size ranged from 16 byte to 64 byte, all in power-of-two increments. We assumed a system with dedicated tuners within each core; thus, the cores are tuned independently of each other. We used McPAT [13] to calculate the system’s total power consumption, which we then used, combined with execution statistics from GEM5, to calculate the energy consumption.

B. Results

1) PACT Results and Comparison to the Optimal and Prior Work: We evaluated PACT by comparing the results using configurations determined by PACT to results obtained using the base configuration, the optimal configuration (determined through exhaustive search), and PDM (to represent prior work). Figure 4 depicts the EDP, energy, and execution time achieved by PACT as compared to the optimal and PDM configurations. The PACT, optimal, and PDM results are all normalized to the base configuration in order to evaluate the improvements with respect to the base configuration.

Figure 4(a) compares the EDP achieved by PACT to the optimal and PDM when the EDP is prioritized. As compared to the base configuration, PACT reduced the EDP by 16.5% on average across all the applications, with reductions as high as 36% for omnetpp. On average, PACT determined configurations that were within 7.7% of the optimal; for 6 out of the 12 applications, PACT’s configurations were within 1% of the optimal. In a few of the applications, PACT’s configurations were worse than the optimal. For mcf, for example—the worst case—PACT’s configuration was within 20.1% of the optimal and increased the EDP by 3% with respect to the base configuration. We attribute this behavior to the fact that PACT’s tuning was oblivious to some of the applications’ intrinsic memory access behaviors. For instance, mcf has long memory access latencies, which conflicted with PACT’s attempt to simultaneously tune both the energy and the execution time (delay). However, as expected, PACT achieved similar EDP savings as PDM, since PDM natively optimizes the EDP.

Figure 4(b) compares the energy consumption achieved by PACT to the optimal and PDM when the energy is prioritized. On average, PACT reduced the energy by 18.2% compared to the base configuration, with reductions as high as 28% for libquantum. Unlike in the case of EDP prioritization, PACT did not degrade any application’s energy consumption with respect to the base configuration. PACT determined configurations that were within 4.2% of the optimal, on average, and outperformed PDM by 2.2% across all the applications.

Figure 4(c) compares the execution time achieved by PACT to the optimal and PDM when the execution time is prioritized. On average across all the applications, PACT reduced the execution time by 2.7% compared to the base configuration, with reductions as high as 10% for omnetpp. The execution time reduction was much smaller (compared to the EDP and energy reductions) because the base configuration was near-optimal for most applications, as evidenced by the fact that


Fig. 4: PACT compared to the optimal and PDM for (a) EDP, (b) energy, and (c) execution time. PACT, optimal, and PDM are all normalized to the base configuration.

PACT’s configurations were within 0.3% of the optimal. These results show PACT’s ability to prioritize optimization goals as required.

2) Prioritization Tradeoffs: To further illustrate PACT’s ability to trade off non-prioritized optimization goals for the prioritized optimization goals, we compared the non-prioritized EDP, energy, or execution time obtained with PACT (with one of the goals prioritized) to the base configuration. Figure 5 shows the results of all the optimization goals (EDP, energy, and execution time) when the system is running under each of the three different priorities. Figure 5(a) compares PACT’s energy and execution time with the EDP prioritized to the base configuration’s energy and execution time. When the EDP is prioritized, PACT reduced the EDP, energy, and execution time by 16.5%, 16.1%, and 1%, respectively, on average across all the applications. For two applications, hmmer and mcf, prioritizing the EDP increased the execution time by 6% and 13%, respectively. However, for both applications, we observed energy reductions of 15% and 8%, respectively, as compared to the base configuration.

Figure 5(b) compares PACT’s EDP and execution time, with the energy prioritized, to the base configuration’s EDP and execution time. On average, PACT reduced the EDP and energy by 17.8% and 18.2%, respectively, while the execution time was reduced by less than 1%. We observed that energy prioritization resulted in significant tradeoffs of execution time for some of the benchmarks. For example, in the largest tradeoff observed, PACT reduced the energy by 3% for leslie3d, while the EDP and execution time increased by 16.4% and 20.4%, respectively. Similarly to mcf (Section IV-B1), we attribute this behavior to leslie3d’s memory access characteristics, which feature long memory access latencies.

Figure 5(c) compares PACT’s EDP and energy, with the execution time prioritized, to the base configuration’s EDP and energy. On average, PACT reduced the EDP, execution time, and energy by 12.7%, 10.7%, and 2.7%, respectively. For several of the applications, PACT determined the base configuration to be the best configuration (since the base configuration had the best execution time for those applications); thus, the energy and EDP for those applications were identical to the base. In general, these results illustrate the adversarial nature of energy and execution time—prioritizing one is usually at the expense of the other—and PACT’s ability to trade off the execution time to adhere to more stringent energy constraints or vice versa, based on the executing conditions and device state (e.g., when the system’s battery is in a critical state).

3) PACT Overhead: PACT’s overhead comprises the hardware and runtime tuning overheads. The hardware overhead comprises the phase history table and the tuner (Section II), which orchestrates the runtime tuning process. We estimated, using synthesizable VHDL and Synopsys Design Compiler [7] simulations, that PACT incurs 1.2% and 1% hardware area and power overheads, respectively, with respect to an ARM Cortex-A9 microprocessor.

We quantified the runtime tuning overhead using the total tuning stall cycles [4] as: total tuning stall cycles = (number of configurations explored − 1) × tuning stall cycles per configuration. On average, for each phase history table entry, PACT explored 5% of the design space (Section III-A) and incurred 258 stall cycles for each configuration change. Using these estimates, PACT accrued a runtime tuning overhead of 4799 cycles per benchmark. With a 1.9GHz clock frequency, this overhead translates to 2.526µs across all benchmarks.
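The arithmetic above works out as follows (a check of the reported numbers using the stated formula, not new measurements):

```python
# Reported averages: 258 stall cycles per configuration change and a
# total of 4799 tuning stall cycles per benchmark; at the simulated
# 1.9 GHz clock this is roughly 2.5 microseconds of tuning overhead.
stall_cycles_per_change = 258
total_tuning_stall_cycles = 4799

# Invert the formula: total = (configs explored - 1) * cycles per change.
configs_explored = total_tuning_stall_cycles / stall_cycles_per_change + 1

clock_hz = 1.9e9
overhead_us = total_tuning_stall_cycles / clock_hz * 1e6
print(round(configs_explored, 1), round(overhead_us, 3))
```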

V. CONCLUSIONS

In this paper, we presented Priority-Aware phase-based Cache Tuning (PACT), which uses the existing phase distance mapping (PDM) framework to determine the best cache configurations for varying runtime optimization goals. We showed PACT’s ability to trade off non-prioritized optimization goals when a specific goal must be prioritized due to changing operating conditions or device states. Our

Page 6: PACT: Priority-Aware Phase-based Cache Tuning for …tosiron/papers/2017/pact_ISVLSI17.pdfembedded systems, adaptable hardware I. INTRODUCTION AND MOTIVATION Caches contribute significantly

(a)

(b)

(c)

0

0.2

0.4

0.6

0.8

1

1.2

Op

tim

izat

ion

go

als

no

rmal

ize

d t

o

bas

e c

ach

e c

on

figu

rati

on

(E

DP

pri

ori

tize

d)

OptimalEDP EDP Energy ExTime

00.20.40.60.81

1.21.4

Op

tim

izat

ion

go

als

no

rmal

ize

d

to b

ase

cac

he

co

nfi

gura

tio

n

(En

erg

y p

rio

riti

zed

)

OptimalEnergy EDP Energy ExTime

0

0.2

0.4

0.6

0.8

1

1.2

Op

tim

izat

ion

go

als

no

rmal

ize

d

to b

ase

cac

he

co

nfi

gura

tio

n

(ExT

ime

pri

ori

tize

d)

OptimalExTime EDP Energy ExTime

Fig. 5: Impact of prioritizing one optimization goals on thenon-prioritized optimization goals.

experimental results show that PACT performed similarly toPDM for EDP optimization (since PDM focuses on EDPoptimization). Furthermore, PACT improved over PDM forenergy, and execution time optimizations. For future work,we intend to explore techniques for achieving results thatare closer to the optimal, without degrading the prioritizationpotential. In addition, we intend to extend PACT to complexsystems with multilevel cache hierarchies.

REFERENCES

[1] ARM. http://www.arm.com. Accessed: December 2016.

[2] SPEC CPU2006. http://www.spec.org/cpu2006. Accessed: January 2016.

[3] T. Adegbija, A. Gordon-Ross, and A. Munir. Phase distance mapping: a phase-based cache tuning methodology for embedded systems. Design Automation for Embedded Systems, 18(3-4):251–278, 2014.

[4] T. Adegbija, A. Gordon-Ross, and M. Rawlins. Analysis of cache tuner architectural layouts for multicore embedded systems. In Performance Computing and Communications Conference (IPCCC), 2014 IEEE International, pages 1–8. IEEE, 2014.

[5] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. Computer Architecture News, 40(2):1, 2012.

[6] L. Chen, X. Zou, J. Lei, and Z. Liu. Dynamically reconfigurable cache for low-power embedded system. In Third International Conference on Natural Computation (ICNC 2007), volume 5, pages 180–184. IEEE, 2007.

[7] Design Compiler. Synopsys Inc., 2000.

[8] A. Ghosh and T. Givargis. Cache optimization for embedded processor cores: An analytical approach. ACM Transactions on Design Automation of Electronic Systems (TODAES), 9(4):419–440, 2004.

[9] A. Gordon-Ross, J. Lau, and B. Calder. Phase-based cache reconfiguration for a highly-configurable two-level cache hierarchy. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI, pages 379–382. ACM, 2008.

[10] A. Gordon-Ross and F. Vahid. A self-tuning configurable cache. In Proceedings of the 44th Annual Design Automation Conference, pages 234–237. ACM, 2007.

[11] H. Hajimiri, P. Mishra, and S. Bhunia. Dynamic cache tuning for efficient memory based computing in multicore architectures. In VLSI Design and 2013 12th International Conference on Embedded Systems (VLSID), 2013 26th International Conference on, pages 49–54. IEEE, 2013.

[12] G. Hamerly, E. Perelman, J. Lau, and B. Calder. SimPoint 3.0: Faster and more flexible program phase analysis. Journal of Instruction Level Parallelism, 7(4):1–28, 2005.

[13] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. The McPAT framework for multicore and manycore architectures: Simultaneously modeling power, area, and timing. ACM Transactions on Architecture and Code Optimization (TACO), 10(1):5, 2013.

[14] O. Navarro, T. Leiding, and M. Hubner. Configurable cache tuning with a victim cache. In Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), 2015 10th International Symposium on, pages 1–6. IEEE, 2015.

[15] M. Peng, J. Sun, and Y. Wang. A phase-based self-tuning algorithm for reconfigurable cache. In Digital Society, 2007. ICDS'07. First International Conference on the, pages 27–27. IEEE, 2007.

[16] M. Rawlins and A. Gordon-Ross. CPACT: the conditional parameter adjustment cache tuner for dual-core architectures. In Computer Design (ICCD), 2011 IEEE 29th International Conference on, pages 396–403. IEEE, 2011.

[17] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder. Discovering and exploiting program phases. IEEE Micro, 23(6):84–93, 2003.

[18] T. Sondag and H. Rajan. Phase-based tuning for better utilization of performance-asymmetric multicore processors. In Code Generation and Optimization (CGO), 2011 9th Annual IEEE/ACM International Symposium on, pages 11–20. IEEE, 2011.

[19] K. Vivekanandarajah, T. Srikanthan, and C. T. Clarke. Profile directed instruction cache tuning for embedded systems. In Emerging VLSI Technologies and Architectures, 2006. IEEE Computer Society Annual Symposium on, pages 6 pp. IEEE, 2006.

[20] C. Zhang, F. Vahid, and W. Najjar. A highly configurable cache architecture for embedded systems. In Computer Architecture, 2003. Proceedings. 30th Annual International Symposium on, pages 136–146. IEEE, 2003.
