Coding for Efficient Caching in Multicore Embedded Systems
Tosiron Adegbija and Ravi Tandon
Department of Electrical and Computer Engineering, University of Arizona
Email: {tosiron,tandonr}@email.arizona.edu
Abstract—We present an information theoretic approach to caching in multicore embedded systems. In contrast to conventional approaches, where caches are treated independently, we leverage novel cache placement and coded data delivery algorithms that treat the caches holistically and provably reduce the communication overhead resulting from main memory accesses. The proposed approach intelligently places data across the processors' caches such that, in the event of cache misses, the main memory can opportunistically send coded data blocks that are simultaneously useful to multiple processors. Using architectural simulations, we demonstrate that the coded caching approach significantly reduces the communication overhead, thus reducing the overall memory access energy and latency, while imposing minimal overheads. In a quad-core embedded system, compared to conventional caching schemes, the coded caching approach reduced the access energy and latency by an average of 36% and 16%, respectively.
Index Terms—Cache optimization, coded caching, energy savings, low-power embedded systems.
I. INTRODUCTION AND MOTIVATION
Caches are commonly used in embedded systems to bridge the processor-memory performance gap by exploiting executing applications' spatial and temporal locality. Caches also account for a significant portion of an embedded system's power/energy consumption, which has necessitated much research focus on cache optimization techniques [17]. Due to high memory latency and memory bandwidth limitations, cache optimization is critical for improving the end-to-end performance of embedded systems. Cache optimization, however, is challenging, especially in embedded systems, since these systems typically have stringent design constraints with respect to size, battery capacity, real-time deadlines, cost, etc. Despite these design constraints, embedded systems are expected to execute algorithmically complex and memory-intensive applications due to consumer demands for complex applications and functionalities. To satisfy this growing demand, embedded systems are being equipped with multicore processors that feature complex memory hierarchies, as opposed to single-core processors. These technological advances make cache optimization even more challenging.
To meet the often conflicting goals of achieving the best possible cache performance and energy efficiency in modern embedded microprocessors, several researchers have proposed circuit-level [18], [22] and architectural [4], [10] cache optimization techniques. Due to the increasing compute and memory complexity of modern embedded systems applications, radical and innovative techniques are required to achieve optimal caching that maximizes performance and energy efficiency, without introducing the attendant overheads of traditional optimization techniques, such as area, design time, and computational complexity.
In this work, we approach cache optimization by focusing on minimizing the required main-memory-to-cache communication during application execution, as this communication is a major source of overhead in embedded systems [19]. In the event of a cache miss, data must be transferred from a lower memory level (e.g., the level two (L2) cache or main memory) to the level one (L1) cache for subsequent use by the processor. Modern embedded microprocessors' memory subsystems consume significant amounts of energy and time by continuously transferring large amounts of data from main memory to the processor caches [19]. In several cases, the data transferred to the cache is only briefly used by the processor and subsequently replaced by other data that are not reused often. Sometimes, the data stored in the cache feature high levels of redundancy. These caching behaviors exacerbate cache optimization challenges.
Drawing from information theory concepts, we explore the role of coded caching in improving the efficiency of caching in multicore embedded systems. Efficient cache utilization is especially critical for multicore systems that feature random scheduling [23] and highly persistent applications, i.e., applications that recur several times throughout a system's lifetime (e.g., smartphone apps). Our goal is to reduce the overheads resulting from main memory accesses in these systems, while introducing minimal optimization overheads. Recent work in information theory [16] has shown that the conventional approach of treating distributed caches independently can be far from optimal, and that coding across caches can be leveraged to reduce the communication overhead. We therefore leverage this information theoretic approach to analyze the optimal utilization of multicore caches in state-of-the-art and emerging embedded microprocessors.
In this work, we investigate the idea of coding data across caches and jointly operating the caches in order to significantly reduce the communication overhead and provably improve optimization goals, such as memory access energy and latency. To this end, we propose a runtime Systematic Cache Placement Algorithm (SCPA) that intelligently determines the cache fill data, and a Systematic Coded Delivery Algorithm (SCDA) that delivers the appropriate data block(s) to the processor in the event of a data request. This approach places data across the processors' caches such that, in the event of cache misses, the main memory opportunistically sends coded data blocks that are simultaneously useful to multiple processors, thus significantly reducing the overhead from main memory accesses.

978-1-5090-6762-6/17/$31.00 ©2017 IEEE
We model a state-of-the-art embedded system's memory hierarchy using GEM5 [6] architectural simulations, and use benchmarks from the SPEC CPU2006 benchmark suite [3] to represent the increasing compute and memory complexity of emerging embedded systems applications. Through our experiments, we show that coded caching can reduce the average energy consumption and improve performance by 36% and 16%, respectively, as compared to conventional caching.
II. BACKGROUND AND RELATED WORK
The work presented herein is complementary to previous cache optimization techniques. Thus, in this section, we present a brief background and overview of related work on cache optimization in computer architecture, and recent results on information theoretic aspects of caching.
A. Cache Optimization

The memory hierarchy can consume more than 50% of the total system power, especially in embedded systems. As a result, much research has focused on optimization techniques to improve the efficiency of the memory hierarchy, especially the cache [17]. Significant emphasis has been placed on dynamic optimizations as an alternative to static cache optimizations [11]. Unlike static optimizations, dynamic optimizations can accommodate changing application behaviors during runtime and appreciably improve the performance and energy savings potential.
Several cache optimizations target low cache access latency through data migration, replication, or banked shared level two (L2) caches, and through techniques that place data in cache banks close to the referencing core [7]. However, these methods are usually not fully dynamic and cannot react to changing runtime application dynamics [5]. Cache partitioning [9] is another well-researched optimization that dedicates portions of a shared last level cache (typically the L2 cache) to executing tasks in order to increase performance.
Our approach to cache optimization complements previous techniques; we exploit information theory concepts to optimize cache data placement in order to minimize main memory accesses. The proposed approach incurs minimal overheads in terms of hardware, computational complexity/execution time, and energy, and is therefore practical for resource-constrained embedded systems. Furthermore, the proposed approach requires minimal design time effort and reduces the overall memory access energy and latency as a direct consequence of reducing the main memory accesses.
Fig. 1: A high-level overview of the coded caching approach.
B. Information Theory of Distributed Caching
Much recent research has focused on understanding the fundamental information theoretic limits of distributed caching systems. A novel information theoretic model for distributed caching was recently introduced in [16]; the authors showed that coded content delivery can lead to significant reductions in communication overhead compared to conventional approaches that treat caches independently. This work has been extended to a variety of problems and scenarios, such as wireless device-to-device (D2D) networks [13], minimizing content delivery latency over wireless networks [21], etc.
In addition to this wide variety of applications, significant recent progress has also been made on understanding the fundamental tradeoffs between the memories of distributed caches and the communication overhead in the event of cache misses. In particular, the optimal tradeoff between cache memories and communication was characterized to within a constant factor of 12 in [16]. Recent work [20] improved the approximation ratio to 8 by developing new information theoretic converse techniques.
To the best of our knowledge, there is no prior work that explores the benefits of coded caching in multicore embedded systems. The information theoretic and coding aspects of the proposed approach and algorithms are inspired by these recent advances in our understanding of distributed caching.
III. CODING FOR EFFICIENT CACHING
Figure 1 illustrates the high-level flow of the coded caching approach. At system startup, or during runtime when new applications are executed, the cache placement algorithm preemptively fills all the caches with data from the most persistent executing applications. Note that this approach also applies to application threads; however, we describe our work in terms of applications. When applications are executed across the different cores, if the caches contain some or all of the applications' data, the coded data delivery algorithm systematically fetches data from the main memory to be useful to all the
cores, while complementing the data already present in the caches.

Fig. 2: Coded caching scheme for a quad-core microprocessor (coded transmission: 6 × F/4 = 3F/2).
If the applications' data are not in the caches, a cache update is performed, wherein the caches are filled using the cache placement algorithm. The coded caching approach will be most effective for a system with persistent applications, such as a smartphone, where multiple applications may be executed several times throughout the system's lifetime, during different time intervals, and on different cores. In this section, we describe the coded caching approach for optimally utilizing multicore systems' caches using a specific sample system, and subsequently present the ideas for a general scenario.
A. Description of Coded Caching for Multicore Embedded Systems

Figure 2 illustrates the coded caching approach in a quad-core microprocessor. For this description, we make the following simplifying assumptions: each core has a private level one (L1) cache, denoted P1, P2, P3, and P4, respectively; there are four executing applications, A, B, C, and D; all the applications have equal persistence on all cores (i.e., they are randomly scheduled and have equal probability of being executed); and the data blocks have equal persistence. We generalize this technique to an arbitrary system in Section III-B. Coded caching comprises two key stages:

1. Cache placement stage: At system startup or at execution intervals, the cores' L1 caches are filled with coded data, while satisfying the memory constraints. This stage is agnostic of the applications that will be requested by the processors.
In the cache placement stage, each application's data is broken into four parts and preemptively stored across the caches. For example, application A is broken into four 8 KB parts, a1, a2, a3, and a4, which are stored in caches P1, P2, P3, and P4, respectively. Similarly, B, C, and D are broken into four parts and stored across the caches. Thus, cache P1 is filled with a1, b1, c1, and d1; P2 is filled with a2, b2, c2, and d2; and so on.

2. Data delivery stage: During runtime, when each core requests data blocks, a portion of the data blocks will be guaranteed cache hits in the core's private L1 cache; the remaining data, if any, is then fetched from the main memory.
In the data delivery stage, coded data is fetched from the main memory in order to be simultaneously useful to multiple cores. Assuming the cores with caches P1, P2, P3, and P4 are executing applications (or threads) A, B, C, and D, respectively, six coded (XOR'd) data 'chunks' are sent from main memory on a common data bus [15] as follows: a2 ⊕ b1 for P1 and P2; a3 ⊕ c1 for P1 and P3; a4 ⊕ d1 for P1 and P4; b3 ⊕ c2 for P2 and P3; b4 ⊕ d2 for P2 and P4; and c4 ⊕ d3 for P3 and P4. If, during the next time period, the four cores execute the applications in a different order (C, D, A, and B, for example), different coded data chunks are fetched from main memory to satisfy all four cores.
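To make the delivery stage concrete, the quad-core example above can be sketched in a few lines of Python. This is our own illustrative sketch, not the paper's artifact: the 4-byte sub-block contents and the helper names (xor, decode) are hypothetical, and real sub-blocks would be cache-line-sized.

```python
from itertools import combinations

def xor(x, y):
    """Bitwise XOR of two equal-length byte strings."""
    return bytes(p ^ q for p, q in zip(x, y))

K = 4
apps = ["A", "B", "C", "D"]

# Placement stage (t = KM/N = 1): sub_block[n][k] is the k-th quarter of
# application n's data; core k's cache holds one quarter of every application.
sub_block = {n: [f"{n.lower()}{k + 1}".encode().ljust(4) for k in range(K)]
             for n in apps}
cache = {k: {(n, k): sub_block[n][k] for n in apps} for k in range(K)}

request = {0: "A", 1: "B", 2: "C", 3: "D"}   # app executed by each core

# Delivery stage: one XOR'd chunk per pair of cores (subsets of size t+1 = 2),
# broadcast on the common data bus.
transmissions = {}
for i, j in combinations(range(K), 2):
    transmissions[(i, j)] = xor(sub_block[request[i]][j],
                                sub_block[request[j]][i])

def decode(core, other, chunk):
    """A core cancels the cached sub-block of the other core's application."""
    return xor(chunk, cache[core][(request[other], core)])

recovered = decode(0, 1, transmissions[(0, 1)])
assert recovered == sub_block["A"][1]        # core 0 recovers a2 from a2 XOR b1
print(len(transmissions), "coded chunks, versus 12 uncoded sub-block transfers")
```

With uncoded delivery, each of the four cores would fetch its three missing sub-blocks individually (12 transfers of size F/4 = 3F); the six coded chunks amount to 3F/2, matching Figure 2.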
B. Generalizing Coded Caching to an Arbitrary Multicore System

The description in Section III-A is a first step towards using coded caching for optimal cache utilization in multicore embedded systems. However, some of the assumptions in the description introduce important caveats that present opportunities for future work. To address some of these caveats, we present a systematic cache placement algorithm (SCPA) and a systematic coded delivery algorithm (SCDA) that create coding opportunities across all cores in order to significantly reduce the communication overhead [16]. These algorithms generalize coded caching to an arbitrary number of cores, $K$, any number of applications, $N$, and any L1 cache size, $M$.
Systematic Cache Placement Algorithm (SCPA). Consider a cache size $M \in \{0, N/K, 2N/K, 3N/K, \ldots, N\}$, and set $t = KM/N$, such that $t$ is an integer in $\{0, 1, \ldots, K\}$. The cache placement algorithm works as follows: each application's data, $D_n$, is split into $\binom{K}{t}$ non-overlapping sub-blocks, where the size of each sub-block is $F/\binom{K}{t}$. Formally, $D_n$ is split into sub-blocks as follows:

$$D_n = \{D_{n,S} : S \subseteq \{1, \ldots, K\},\ |S| = t\}$$

The cache placement scheme works as follows: the cache of core $k$ stores the sub-block $D_{n,S}$ if $k \in S$. Hence, each cache stores $N\binom{K-1}{t-1}$ sub-blocks. The total amount of memory used at each cache is thus $N\binom{K-1}{t-1} \times F/\binom{K}{t} = FNt/K = FM$, thereby satisfying the cache memory constraint. We can observe that the cache placement described in Figure 2 follows this algorithm, for $K = 4$ processors, $N = 4$ applications, and $M = N/K = 1$.
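Under the stated assumptions (t = KM/N an integer, F bits per application), the SCPA placement can be sketched as follows; the function name scpa and the dictionary representation of cache contents are our own illustrative choices, not the paper's implementation.

```python
from itertools import combinations
from math import comb

def scpa(K, N, M, F=1.0):
    """Sketch of the SCPA: return each core's cache contents as a map
    (application n, subset S) -> sub-block size, with F normalized."""
    t = K * M // N
    assert t * N == K * M, "t = KM/N must be an integer"
    sub_size = F / comb(K, t)                 # each sub-block is F / C(K,t) bits
    cache = {k: {} for k in range(1, K + 1)}
    for n in range(N):                        # split every application's data
        for S in combinations(range(1, K + 1), t):
            for k in S:                       # core k caches D_{n,S} iff k is in S
                cache[k][(n, S)] = sub_size
    return cache

cache = scpa(K=4, N=4, M=1)
# Each cache holds N * C(K-1, t-1) sub-blocks, i.e., M*F bits in total.
print(len(cache[1]), sum(cache[1].values()))
```

For the Figure 2 setting (K = N = 4, M = 1), each cache holds four sub-blocks totaling M·F bits, matching the memory constraint derived above.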
Systematic Coded Delivery Algorithm (SCDA). Now, consider that each core executes a particular application at a given time. We denote the requested applications at any given time as $r_1, r_2, \ldots, r_K$, where $r_k \in \{1, \ldots, N\}$. In other words, core $k$ references the data $D_{r_k}$ corresponding to application $r_k$. The coded delivery of data blocks works as follows: consider a subset $S$ of $|S| = t + 1$ cores. Every $t$ cores in this subset have sub-blocks stored in the cores' local caches, and these sub-blocks are needed at the other cores in
the set $S$. Given a core $s \in S$, the sub-block $D_{r_s, S\setminus\{s\}}$ is needed at core $s$ to process application $r_s$. Hence, for each subset $S \subseteq \{1, \ldots, K\}$ of cardinality $|S| = t + 1$, the main memory can transmit $\oplus_{s \in S} D_{r_s, S\setminus\{s\}}$ over the common bus to all the cores, where $\oplus$ denotes the bitwise XOR operation. Each of these XOR'd sums contributes a communication of $F/\binom{K}{t}$ bits, and the total number of such subsets is $\binom{K}{t+1}$. Hence, the total communication by the main memory over the common data bus is:

$$F R_{\text{coded}} = \binom{K}{t+1} \frac{F}{\binom{K}{t}} = F\,\frac{K-t}{t+1} = F\left(\frac{K(1-M/N)}{1+KM/N}\right)$$

In summary, by exploiting coding opportunities, the total amount of communication [16] is:

$$R_{\text{coded}} = \underbrace{K(1-M/N)}_{\text{Local caching gain}} \times \underbrace{\frac{1}{1+KM/N}}_{\text{Global caching gain}}$$

The conventional scheme (treating each core independently) leads to a total communication of $R_{\text{conventional}} = K(1-M/N)$. We thus observe that conventional schemes can only extract a local caching gain of $1-M/N$, since each core/cache is treated independently. The coded scheme, on the other hand, extracts not only the local caching gain, but also a global caching gain of $\frac{1}{1+KM/N}$, by virtually coupling the caches together and exploiting this coupling to create coding opportunities.

TABLE I: Cache and main memory configuration parameters.
L1 data cache: 32 KB, 4-way, 64-byte blocks, 0.67 ns access time, 1 R port, 1 W port, 4 banks, 32 nm technology.
DRAM main memory: 2 GB, 4 banks, 1 R/W port, 128-bit bus, 9.12 ns access time, 32 nm technology.
IV. EXPERIMENTAL RESULTS
A. Experimental Setup

To quantify the effectiveness of coded caching, we evaluated the energy and latency of the executing applications' cache and main memory accesses; every other metric remained constant. We used fifteen benchmarks from the SPEC CPU2006 benchmark suite, cross-compiled for the ARM instruction set architecture and executed using the reference input sets. The benchmarks were selected to represent a variety of application characteristics (memory/compute intensity and integer/floating point applications). We used SPEC benchmarks, as opposed to traditional embedded systems benchmarks (e.g., EEMBC [2]), because SPEC benchmarks exhibit greater memory and compute complexity, and more accurately represent the increasing complexities of emerging embedded systems applications.
To reduce simulation time while maintaining the overall application behaviors, we profiled the SPEC benchmarks using a combination of SimPoint [12] and techniques described in previous work [4] to extract the benchmarks' execution phases and the phases' persistence. A phase is a length of
execution during which an application exhibits stable execution characteristics (e.g., cache miss rates, branch mispredicts, instructions per cycle, etc.). A phase's persistence is the rate of the phase's recurrence throughout the application's execution. Thus, the higher a phase's persistence, the more accurately it represents the application's full execution. We used 100 million instructions of each benchmark's most persistent phases for our simulations.

TABLE II: Workload groups used in our experiments.
workload 1: calculix/h264ref/bzip/bwaves
workload 2: omnetpp/milc/gromacs/xalancbmk
workload 3: libquantum/hmmer/mcf/namd
workload 4: astar/gobmk/gobmk/soplex
workload 5: soplex/bwaves/hmmer/omnetpp
workload 6: astar/mcf/calculix/xalancbmk
workload 7: gobmk/h264ref/milc/gromacs
workload 8: bzip2/libquantum/namd/bwaves
workload 9: bwaves/omnetpp/h264ref/xalancbmk
workload 10: namd/hmmer/libquantum/gobmk
To simulate the coded caching approach, we used GEM5 [6] to model a quad-core embedded microprocessor with cache configurations similar to the ARM Cortex-A15 microprocessor [1], featuring 32 KB, 4-way set associative private L1 caches with 64-byte line sizes. We used CACTI to determine the access energies and latencies for the L1 caches and main memory, as shown in Table I. For this paper, we focused on the L1 data cache. We have begun to evaluate the scalability to more complex systems with more than four cores and with multi-level caches, but limit the results presented herein to a quad-core, single-level cache system.
To reduce the sensitivity of the results to a particular set of simulated workloads, we created ten multiprogrammed workloads by randomly selecting four of the fifteen SPEC CPU2006 benchmarks; each workload executes one application to completion on each core. This kind of execution is analogous to a multithreaded application with no inter-core data dependencies [8]. Table II lists the workload groups and benchmarks used in our experiments. To account for the potential difference in application data sizes, we bounded the XOR'd data by the smallest data size in each workload group.
B. Main Memory Accesses

The key goal of the proposed approach is to reduce the overall access energy and latency by reducing the number of main memory accesses, which constitute a significant portion of the access overheads. The approach targets a system with no application-core affinity, i.e., equal application persistence and equal probability of execution on any given core; these kinds of systems are more prone to overheads from memory accesses. Figure 3 depicts the percentage reduction in main memory accesses achieved by coded caching as compared to uncoded caching. On average over all the workloads, coded caching reduced the number of memory accesses by 31%. The reductions ranged from 27% to 34%, illustrating the consistency of the coded approach in a system with no application-core affinity.
Fig. 3: Percentage reduction in memory accesses achieved by coded caching as compared to uncoded caching.
Fig. 4: Percentage access energy and latency reduction of coded caching compared to uncoded caching.
We also investigated a system with high application-core affinity, in which applications were executed multiple times on the same core before executing on a different core. In this system, coded caching only reduced the average number of main memory accesses by 1%; the maximum reduction was 6%, while the number of main memory accesses increased for one workload (workload 10) by 5% (graphs omitted for brevity). In future work, we plan to explore the tradeoff point, with respect to application-core affinity, at which coded caching remains beneficial.
C. Overall Access Energy and Latency

In this subsection, we compare the overall access (cache + main memory) energy and latency achieved by the coded caching technique to that of conventional uncoded caching. Note that the coded approach can be used in synergy with other cache optimizations; however, we ignore the impact of other cache optimizations, as they are orthogonal to coded caching. We assume a system with random scheduling of applications across cores and no application-core affinity, similar to the illustration in Figure 2. The access energy and latency evaluations include cache fills, cache accesses, and main memory accesses, and consider a dynamic execution scenario where the executing applications are not known a priori. We limited the access latency of each workload to that of the application with the longest latency.
In general, the proposed approach reduced the overall access energy and latency as a direct result of reducing the number of main memory accesses (Section IV-B). Figure 4 depicts the percentage access energy and latency reductions achieved by the coded caching approach compared to the traditional uncoded caching approach for the different workloads (Table II). On average over all the workloads, coded caching reduced the access energy and latency by 36% and 16%, respectively. Although coded caching reduced the latency by up to 27% (for workload 7), there was a wide range of behaviors across the different workloads. Thus, we analyzed the individual applications' memory characteristics (memory references, working set size, data reuse, etc.), and observed that the efficiency of coded caching is strongly tied to the memory characteristics of the applications being executed.
For example, coded caching achieved a much lower latency reduction for workload 5 and workload 8, and increased the latency for workload 9. The common factor among these workloads is that they all include bwaves. We initially attributed this behavior to bwaves' high memory footprint (as a function of the working set size) compared to the other applications. However, we observed that the benefits of coded caching are impacted more by an application's data reuse than by its memory footprint. Bwaves exhibits lower data reuse than most of the other applications, resulting in a higher potential for conflict cache misses and more main memory accesses. Thus, bwaves' low data reuse resulted in lower coded caching benefits for its workload group.
We observed similar behavior with workloads involving omnetpp. Even though omnetpp's working set size was similar to those of the other applications in its workload groups, it exhibited low data reuse and high instruction dependency (with very low IPC), resulting in higher communication overhead. Thus, workload 9's overall access latency increased because it contains both bwaves and omnetpp. Workload 9's access energy reduction was also the lowest among all the workloads, at 32%.
D. Key Observations

Our experimental results demonstrate three key observations. First, coded caching is most effective in systems with high application persistence and random execution, i.e., systems with no determinism in the scheduling of applications or application threads to cores. Second, coded caching is significantly impacted by the applications' execution characteristics, especially data reuse. Applications with high data reuse benefit more from coded caching; however, some benefit can be derived from coded caching even with applications featuring low data reuse, by co-scheduling them with other applications that have high data reuse. Finally, our results show that coded caching has much potential, which warrants further studies on the intricacies of coded caching, especially in synergy with other cache optimization techniques (e.g., configurable caches).
E. Implementation Overhead
The implementation overheads comprise of the computa-tional complexity of the proposed algorithms, and the hard-ware overhead of implementing the XOR operations. The
hardware overhead comprises of a low-overhead coprocessor,with bitwise XOR gates, that codes the data blocks beforetransmission from the main memory. Since XOR operationsare very cheap operations, we propose that the system corescan decode the data prior to execution without imposingsignificant computational overhead.
The computational complexity of coded caching is proportional to the number of sub-blocks created. In particular, for a $K$-core processor (Section III-B), we create $\binom{K}{t}$ sub-blocks, where $t$ is a parameter that depends on the amount of memory per cache. Since $(K/t)^t \le \binom{K}{t} \le K^t/t!$, the worst case complexity of the coded caching process is exponential in $K$, the number of cores. While this may be reasonable for small values of $K$ (i.e., $K = 2, 4$), an interesting future research direction is to design coded caching mechanisms with linear complexity.
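The exponential growth in the sub-block count can be seen numerically. The sketch below (our own construction) fixes the cache-to-data ratio at a hypothetical M/N = 1/4, so that t = K/4, and evaluates the bounds above:

```python
from math import comb, factorial

def bounds(K, t):
    """Standard binomial bounds: (K/t)^t <= C(K,t) <= K^t / t!."""
    return (K / t) ** t, comb(K, t), K ** t / factorial(t)

for K in (4, 8, 16, 32):
    t = K // 4                      # e.g., a fixed ratio M/N = 1/4
    lo, mid, hi = bounds(K, t)
    assert lo <= mid <= hi          # the sub-block count sits between the bounds
    print(f"K={K:2d} t={t:2d} sub-blocks={mid}")
```

Already at K = 32 the placement requires over ten million sub-blocks, which illustrates why linear-complexity mechanisms are an attractive research direction.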
To estimate the coprocessor's overhead, we assumed a system similar to the ARM Cortex-A15 [15] with a 128-bit AXI data bus. For maximum throughput, the coprocessor can perform bitwise XOR operations on 128-bit inputs. Using standard XOR cells from a 180 nm fabrication process [14], the coprocessor's propagation delay, power, and area are 91.23 ps, 0.019 mW, and 5.85×10⁻³ mm², respectively. Relative to most state-of-the-art microprocessors (e.g., the ARM Cortex-A15), these overheads are negligible. We plan to validate these estimates in future work.
V. CONCLUSIONS AND FUTURE WORK
In this paper, we explore the potential of coded caching to minimize the communication overhead between the L1 caches and the main memory. Drawing from information theoretic concepts, we explore the use of cache data placement and data delivery algorithms that treat the caches holistically and intelligently place data across the caches in a microprocessor, such that, in the event of cache misses, the main memory opportunistically sends coded data blocks that are simultaneously useful to multiple processors. Our experiments show that coded caching can reduce the energy consumption and improve the performance by an average of 36% and 16%, respectively, as compared to uncoded caching.
We have begun to extend the approach presented herein to more complex systems comprising more than four cores and multilevel caches, and we intend to present this extended approach in future work. In addition, we intend to validate the hardware overheads estimated herein, and to explore implementing coding for other embedded systems microprocessor components, involving data storage and transfers, that impact system performance and energy consumption. For example, instruction window resources, such as the reorder buffer (ROB), instruction queue (IQ), and load-store queue (LSQ), can benefit from the proposed techniques, since they serve as a form of caching for application instructions.
REFERENCES
[1] ARM. http://www.arm.com. Accessed: January 2016.
[2] The Embedded Microprocessor Benchmark Consortium. http://www.eembc.org/. Accessed: January 2016.
[3] SPEC CPU2006. http://www.spec.org/cpu2006. Accessed: January 2016.
[4] T. Adegbija and A. Gordon-Ross. Phase-based cache locking for embedded systems. In Proceedings of the 25th Edition of the Great Lakes Symposium on VLSI, pages 115–120. ACM, 2015.
[5] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. In 37th International Symposium on Microarchitecture (MICRO-37), pages 319–330. IEEE, 2004.
[6] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. Computer Architecture News, 40(2):1, 2012.
[7] J. A. Brown, R. Kumar, and D. Tullsen. Proximity-aware directory-based coherence for multi-core processor architectures. In Proceedings of the Nineteenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 126–134. ACM, 2007.
[8] T. E. Carlson, W. Heirman, K. Van Craeynest, and L. Eeckhout. BarrierPoint: Sampled simulation of multi-threaded applications. In 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 2–12. IEEE, 2014.
[9] D. Chandra, F. Guo, S. Kim, and Y. Solihin. Predicting inter-thread cache contention on a chip multi-processor architecture. In 11th International Symposium on High-Performance Computer Architecture (HPCA-11), pages 340–351. IEEE, 2005.
[10] H. Hajimiri and P. Mishra. Intra-task dynamic cache reconfiguration. In 2012 25th International Conference on VLSI Design (VLSID), pages 430–435. IEEE, 2012.
[11] H. Hajimiri, P. Mishra, and S. Bhunia. Dynamic cache tuning for efficient memory based computing in multicore architectures. In 2013 26th International Conference on VLSI Design and 12th International Conference on Embedded Systems (VLSID), pages 49–54. IEEE, 2013.
[12] G. Hamerly, E. Perelman, J. Lau, and B. Calder. SimPoint 3.0: Faster and more flexible program phase analysis. Journal of Instruction Level Parallelism, 7(4):1–28, 2005.
[13] M. Ji, G. Caire, and A. F. Molisch. Wireless device-to-device caching networks: Basic principles and system performance. arXiv:1305.5216, May 2013.
[14] K. Juretus and I. Savidis. Reduced overhead gate level logic encryption. In Proceedings of the 26th Edition of the Great Lakes Symposium on VLSI, pages 15–20. ACM, 2016.
[15] T. Lanier. Exploring the design of the Cortex-A15 processor. http://www.arm.com/files/pdf/atexploring the design of the cortex-a15.pdf (visited on 12/11/2013), 2011.
[16] M. A. Maddah-Ali and U. Niesen. Fundamental limits of caching. IEEE Transactions on Information Theory, 60(5):2856–2867, May 2014.
[17] S. Mittal. A survey of architectural techniques for improving cache power efficiency. Sustainable Computing: Informatics and Systems, 4:33–43, 2014.
[18] S. Motaman, A. Iyengar, and S. Ghosh. Synergistic circuit and system design for energy-efficient and robust domain wall caches. In Proceedings of the 2014 International Symposium on Low Power Electronics and Design, pages 195–200. ACM, 2014.
[19] O. Mutlu and L. Subramanian. Research problems and opportunities in memory systems. Supercomputing Frontiers and Innovations, 1(3), 2014.
[20] A. Sengupta, R. Tandon, and T. Clancy. Improved approximation of storage-rate tradeoff for caching via new outer bounds. In IEEE International Symposium on Information Theory, pages 1691–1695, June 2015.
[21] A. Sengupta, R. Tandon, and O. Simeone. Cache aided wireless networks: Tradeoffs between storage and latency. In 50th Annual Conference on Information Sciences and Systems (CISS), March 2016.
[22] L. Yavits, A. Morad, and R. Ginosar. Cache hierarchy optimization. IEEE Computer Architecture Letters, 13(2):69–72, 2014.
[23] S. Zhuravlev, S. Blagodurov, and A. Fedorova. Addressing shared resource contention in multicore processors via scheduling. In ACM SIGPLAN Notices, volume 45, pages 129–142. ACM, 2010.