Cross-layer Techniques for Optimizing Systems Utilizing Memories with Asymmetric Access Characteristics
Yong Li and Alex K. Jones
Department of ECE, University of Pittsburgh
Pittsburgh, USA
Email: {yol26, akjones}@pitt.edu
Abstract—Non-volatile memory technologies promise a variety of advantages for memory architectures of next generation computing systems. However, these capabilities come at the cost of some inefficiencies governing the operation of these memories. The most well understood is the asymmetry of access. In order to most effectively take advantage of the benefits of these memory technologies, in terms of density and reduced static power, while mitigating access complexity, a one-size-fits-all method is not sufficient for all types of applications. Instead, cross-layer techniques that include the compiler, operating system, and hardware layer can extract characteristics from the application that can be used to deliver the highest possible performance while minimizing power consumption for systems using these memories.
Keywords—Compiler; Retention; STT-RAM; Network-on-Chip; Cache; Buffer
I. Introduction
CMOS technology scaling is increasingly affected by
the high leakage power and thermal challenges due to
reduced device dimensions and near-threshold voltage op-
eration. Potential improvements include the use of high
performance and low power alternative memory technolo-
gies. In particular, Spin-Torque Transfer Magnetic RAM
(STT-RAM) has received considerable attention [1] and
is regarded as one of the most promising future memory
candidates. Substantial research efforts have been made
to mitigate its drawbacks, most notably the asymmetric
access characteristics. Common methodologies to address
the memory asymmetry issue include using counter-based
mechanisms to track the data accesses and appropriately
migrate data blocks [2], pro-actively operating on predicated
data elements [3] and sacrificing data retention time for
balanced read/write penalties [4].
In this paper, we discuss cross-layer methods to optimize
systems that use STT-RAM in various components within
chip-multiprocessor (CMP) systems such as the shared
caches and buffers within the network-on-chip (NoC). We
discuss how high-level knowledge of the application behav-
ior can be used to mitigate the access asymmetry of read
and write behaviors. A cross-layer approach can leverage a
high-level view of the data access behavior of an application
and use this information to configure features of the system
hardware a priori in order to increase the performance
and power-efficiency. In contrast to a hardware-oriented ap-
proach that reacts to system behavior, a cross layer approach
can often avoid misleading or temporary program behavior
that can lead to inefficiencies and configuration thrashing.
In particular, we describe methodologies that leverage
compiler analyses, which expose data access and commu-
nication patterns for multi-threaded programs. Further, we
present mechanisms for how this data can be communicated
from the system level into the runtime system. Finally,
we demonstrate how including configurability into the ar-
chitecture can use this information to optimize the use
of STT-RAM for improved efficiency. We consider sys-
tem configurability such as: hybrid SRAM/STT-RAM cache
designs [2], hybrid SRAM/STT-RAM buffer designs [5]
combined with fast, circuit-like paths [6] in NoCs, and use
of reduced data retention times in STT-RAM to improve
write performance [4].
We demonstrate the effectiveness of this approach through
two case studies. In the first case study we examine compiler-
guided writing within a hybrid STT-RAM cache composed
primarily of standard STT-RAM, which requires a long write
delay, combined with a small amount of STT-RAM optimized
for faster writing through a reduced retention time, which in
turn requires refreshing. Our results show that the compiler-
guided data distribution ensures that 89% of writes are
handled in the faster-writing memory, which comprises only
3% of the total cache capacity, and yields a 5% performance
improvement and nearly 10% power saving compared to a
traditional SRAM design. In our second case study we
examine a method to integrate STT-RAM buffers into a
configurable NoC. In our studies, nearly 90% of the network
traffic can take advantage of a limited set of compiler-guided
circuit-switch-style fast paths, leading to a more than 5%
reduction in network delay in spite of the increased buffer
write time of STT-RAM while leveraging its low leakage
benefit.
II. Related Work
Due to the asymmetrically high delay/energy write charac-
teristics of many emerging non-volatile memories, intensive
research efforts have been made to mitigate these write
2012 IEEE Computer Society Annual Symposium on VLSI
978-0-7695-4767-1/12 $26.00 © 2012 IEEE
DOI 10.1109/ISVLSI.2012.65
penalties at various levels of the system. For example, Guo
et al. [7] use STT-RAM to re-design a number of non-
write-intensive micro-architectural components and adopt a
subbank write buffering policy with read-write bypassing to
increase write throughput and hide the high write latency.
Rasquinha et al. [8] address the high write energy of STT-
RAM by adopting a new replacement policy that increases
the residency time of dirty lines at the expense of higher
miss rates.
Wu et al. [2] propose a region-based hybrid cache archi-
tecture (RHCA) and level-based hybrid cache architecture
(LHCA) with various non-volatile memories including STT-
RAM and phase-change memory (PCM). Data is swapped
between slow and fast regions based on data access behavior
recorded with hardware counters. A similar concept was
proposed for a hybrid STT-RAM/SRAM cache [9]. Jang
et al. [5] propose a hybrid STT-RAM/SRAM input buffer
design for NoCs to save static power while achieving high
network throughput.
Recently, it has been proposed to reduce STT-RAM write
delay/energy at the expense of a shorter data retention
time [4], [10]. This requires the system to ensure that data
does not degrade.
These hybrid architectures coupled with the reduced
retention STT-RAM provide opportunities for cross-layer
optimizations where system configurability can exploit ap-
plication specific characteristics to perform more efficiently
than a generic approach. We describe two examples of this
in the next two sections of the paper.
III. Compiler-Cache Cross-layer Optimization
Most applications have a relatively small working-set of
frequently written (FW) data. FW data can be identified by
the compiler. Through effective use of a small bank of STT-
RAM with relaxed1 retention (STT-RR), the write penalties
of STT-RAM can be mitigated.
A. Cache Design
Building on a standard set-associative cache, each set
contains a small fraction of STT-RR blocks serving as
write-friendly (WF) memory and a significant portion of
standard STT-RAM, as depicted in Figure 1. To leverage the
static power benefit of STT-RAM and achieve write
performance comparable to SRAM, STT-RR must be
predominantly used to service writes. However, because
STT-RR requires a refresh mechanism, frequently read (FR)
data should be predominantly stored in standard retention
STT-RAM (STT-SR). As access characteristics change during
application execution, data must be appropriately migrated or
swapped between STT-SR and STT-RR banks.
1We use the terms reduced and relaxed retention interchangeably to refer to the same concept.
The migration/swap decision can be made through
hardware, for example by detecting multiple writes in
sequence (indicating FW data) and migrating this location
to the WF memory [9]. However, this approach can be
easily misled by unpredictable runtime data access behavior.
Significant mispredictions can incur an expensive penalty
of serving accesses in the wrong type of memory (e.g.,
heavy write accesses occurring in standard STT-SR) and
large migration/swap overheads. In contrast, a cross-layer
compiler mechanism has the advantage of taking actions
preemptively to hide migration/swap latencies while more
effectively detecting data access patterns compared to simple
run-time approaches.
Figure 1: Cache configuration utilizing hybrid STT-RAM.
B. Compiler Analysis
Using compiler-based dispatch, which identifies write
reuse patterns and inserts instructions to guide the hardware
to perform the migration/swap dynamically, it is possible
to avoid the pitfalls of a runtime-only system [11]. High
temporal reuse serves as an indicator of FW data. In the next
two sections we describe compiler methods to detect high
temporal reuse in array and linked-style data structures.
1) Write Reuse Identification for Arrays: The temporal
reuse information of affine array accesses can be analyzed
by solving linear algebra equations [12]. Consider Figure 2
as an example. Given the array accesses A[i+2][j] and
B[2][i][2*i+1] in the nested loop shown in Figure 2(a),
we first convert the subscript functions to matrix expressions,
as illustrated in Figure 2(b). The array access now
can be represented as C ∗ k + O, where C is the coefficient
matrix, k is the index vector and O denotes the offset vector.
Determining whether the array access has temporal write
reuse is now equivalent to deriving the condition under
which the equation C ∗ k′ + O = C ∗ k′′ + O has solutions,
where k′ and k′′ represent two different index vectors in the
iteration space. In linear algebra, the necessary and sufficient
condition under which this equation has solutions with
k′ ≠ k′′ is that C does not have full column rank. In our
example, the coefficient matrix of A[i+2][j] has a rank of 2,
indicating no temporal reuse. B[2][i][2*i+1] has temporal
reuse since the rank of its coefficient matrix is 1, which is
smaller than the number of loop indices.
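As a minimal illustration (not part of the paper's toolchain), the rank test can be sketched in Python with NumPy; the matrices below are the coefficient matrices of the two accesses above:

```python
import numpy as np

def has_temporal_reuse(coeff):
    # C*k + O hits the same element for two distinct index vectors
    # iff C*(k' - k'') = 0 has a nontrivial solution, i.e. C does
    # not have full column rank.
    C = np.array(coeff)
    return bool(np.linalg.matrix_rank(C) < C.shape[1])

# A[i+2][j]: subscripts (i+2, j) over loop indices (i, j)
C_A = [[1, 0],
       [0, 1]]
# B[2][i][2*i+1]: subscripts (2, i, 2*i+1) over loop indices (i, j)
C_B = [[0, 0],
       [1, 0],
       [2, 0]]

print(has_temporal_reuse(C_A))  # False: rank 2 equals the 2 loop indices
print(has_temporal_reuse(C_B))  # True: rank 1 < 2
```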
An array write access exhibits spatial write reuse when
the innermost enclosing loop index varies only the last
coordinate of that array. To discover spatial write reuse, we
for (i = 0; i < N; i++)
  for (j = 0; j < M; j++) {
    A[i+2][j] = 0;
    B[2][i][2*i+1] += 1;
  }

Coefficient matrix for A[i+2][j]:      Coefficient matrix for B[2][i][2*i+1]:

  [1 0] [i]   [2]                        [0 0] [i]   [2]
  [0 1] [j] + [0]                        [1 0] [j] + [0]
                                         [2 0]       [1]
Figure 2: Array accesses and the corresponding matrix
representations (a): array accesses (b): matrix representation
use a truncated coefficient matrix by dropping the last row of
the original coefficient matrix, as illustrated in Figure 2(b). If
the rightmost column in the truncated coefficient matrix (the
coefficients that correspond to the innermost loop index) is
a null vector and the rightmost element in the dropped row
is nonzero, it is assured that the innermost loop only varies
the last coordinate of the corresponding array.
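This truncated-matrix test can likewise be sketched in Python (a hedged illustration using NumPy, not the paper's implementation):

```python
import numpy as np

def has_spatial_reuse(coeff):
    # Drop the last row of the coefficient matrix; spatial write reuse
    # requires the innermost-index column (last column) of the truncated
    # matrix to be null while the dropped row's last element is nonzero.
    C = np.array(coeff)
    dropped_row, truncated = C[-1], C[:-1]
    return bool(not truncated[:, -1].any() and dropped_row[-1] != 0)

print(has_spatial_reuse([[1, 0], [0, 1]]))          # A[i+2][j]       -> True
print(has_spatial_reuse([[0, 0], [1, 0], [2, 0]]))  # B[2][i][2*i+1]  -> False
```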
In the above example, A[i+2][j] exhibits spatial reuse
since the rightmost column in the truncated matrix is a
null vector and the rightmost element in the dropped row
is nonzero. Using the same rule we can determine that
B[2][i][2*i+1] does not have spatial reuse.
2) Write Reuse Identification for Linked Data Structures:
To analyze the write reuse pattern for linked data structures
such as linked lists and trees, a CFG (control flow graph)
of the program is constructed. A CFG G = (V, E, r) is a
directed graph with nodes V, edges E, and an entry node
r. Each node v in V is a basic block, which consists of a
sequence of statements with exactly one entry point and
one exit point.
(a): typedef struct node {
       ...
       int x;
       struct node *next;
       struct node *prev;
       ...
     } node_t;
     node_t *nd;

(b): nd->x = 5; nd->next = A; nd->prev = B;

(c): a basic block followed by two successors, each of which writes
     nd->x, nd->next and nd->prev (e.g., nd->next = A; nd->prev = B;
     versus nd->next = B; nd->prev = A;)

(d): nd->x = 5; foo(); nd->next = A; nd->prev = B;

(e): one successor writes nd->next = B; nd->prev = A; while the
     other path does not store to nd->prev
Figure 3: Code and control flow graph examples for write
reuse identification. (a): type definition code (b): write reuse
in the same basic block (c): write reuse across one basic
block and all its successors (d): write reuse broken by
function call (e): write reuse broken by one successor
Presuming standard compiler passes such as constant
propagation, expression folding and branch elimination are
applied on the CFG, the CFG is traversed to examine the ba-
sic blocks. Memory writes (left-hand sides of assignments)
exhibit memory reuse when several accesses (e.g., three or
more) within the basic block are dereferenced from the same
base pointer. This implies multiple writes within a small
address range, such as a 64-128 byte contiguous region,
indicating reuse in the same cache line. For example, in
Figure 3 the basic block in (b) exhibits reuse as the three
elements in nd defined in (a) stored in contiguous memory
are all written in the same block.
In a similar fashion, a group of accesses can indicate
reuse even if they span multiple contiguous basic blocks
assuming no function calls separate them in the CFG. They
can also span conditionals in the case that all corresponding
paths in the CFG exhibit similar frequency of access. Thus,
in (c) reuse is present because three contiguous locations
are written in all paths. However, in (d) the function call
separates the accesses, and in (e) one path does not store to
nd->prev; each indicates no reuse. These rules help ensure
these analyzed writes occur in a single memory block. From
this perspective the writes are intensive/frequent.
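A sketch of this basic-block scan, using a toy statement representation rather than the paper's compiler IR (the 'store'/'call' tuples and the threshold of three writes are illustrative assumptions):

```python
from collections import Counter

def block_write_reuse(stmts, threshold=3):
    # Flag base pointers written `threshold` or more times within a
    # basic block; a function call breaks the write sequence, mirroring
    # the rule illustrated in Figure 3(d).
    # Each statement is ('store', base_pointer) or ('call', name).
    reuse, counts = set(), Counter()
    for op, arg in stmts:
        if op == 'call':
            counts.clear()          # call may disturb write locality
        else:
            counts[arg] += 1
            if counts[arg] >= threshold:
                reuse.add(arg)
    return reuse

# Figure 3(b): three stores through nd -> reuse detected
print(block_write_reuse([('store', 'nd'), ('store', 'nd'), ('store', 'nd')]))
# Figure 3(d): foo() separates the stores -> no reuse
print(block_write_reuse([('store', 'nd'), ('call', 'foo'),
                         ('store', 'nd'), ('store', 'nd')]))
```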
C. Code Instrumentation
To communicate the information about the FW data to
the system, we use code instrumentation. When write reuse
has been identified, the compiler inserts a pre-dispatch
instruction into the code prior to the memory access to
notify the CPU to perform the migration or swap operation
to ensure the writes occur in WF memory.
IV. Compiler-Network Cross-layer Optimization
Multi-threaded applications often imply a particular traffic
pattern, arising either from data partitioning or from data
flow between threads. Shared caches can often obfuscate this
traffic pattern [13], but the underlying data partitioning
and resulting traffic pattern can often be extracted by the
compiler [13] and used to configure a hybrid packet/circuit-
switched network [13], [14]. By employing compiler, cache,
and network cooperation in this fashion, the traditional
SRAM buffers can be predominantly replaced with STT-
RAM in the NoC to reduce static power while maintaining
a high performance.
A. Hybrid Network Infrastructure
A configurable hybrid NoC using express virtual channels [15]
allows a circuit-like function to exist within a
packet-switched network. Packet-switched traffic incurs
significant overhead at each switch point, including stages
such as virtual channel allocation, route computation, switch
allocation and switch traversal. Express channels are
pre-configured and allow express traffic to bypass the routing
logic at each switch, adding only one cycle of latency per
hop for buffering.
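A back-of-the-envelope latency model makes the benefit concrete; the per-hop cycle counts below are illustrative defaults in the spirit of the paper's configuration tables, not measured numbers:

```python
def path_latency(hops, circuit_hops, packet_cycles=3, circuit_cycles=1):
    # Hops covered by a pre-configured express circuit pay only the
    # one-cycle buffering delay; the remaining hops pay the full
    # router pipeline cost.
    return circuit_hops * circuit_cycles + (hops - circuit_hops) * packet_cycles

print(path_latency(6, 0))  # fully packet-switched 6-hop path: 18 cycles
print(path_latency(6, 5))  # express circuit covers 5 of 6 hops: 8 cycles
```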
Express channels/circuits can be established at runtime
using hardware counters that track the communication char-
acteristics of the network [14]. For a cross-layer approach
the compiler extracted communication pattern can be used to
schedule the available circuits efficiently without the runtime
overheads [13]. If the majority of the traffic uses circuits,
then virtual channel buffers for the packet-switch traffic
are less performance critical and can employ STT-RAM
with a slower writing time. Figure 4 provides an example
switch configuration for an STT-RAM-based NoC, where
the shaded buffers support the packet-switched traffic and
employ STT-RAM. The unshaded buffers support circuit
traffic and use WF memory for best performance.
Figure 4: Hybrid router design using STT-RAM
B. Compiled Communication Pattern
Discovering the data partitioning and communication pattern
of the application begins with array access region analysis.
Within a loop nest of a single-threaded program,
a reference to an array A can be generally represented as
A[f(L)], where f(L) is the subscript function defined on a
set of loop indices L = i1, ..., im. The span of f(L) resulting
from ik is the maximum distance traversed by varying only
ik from its lower bound lk to its upper bound uk [16], [17]:
span_ik = | f(i1, ..., ik−1, uk, ik+1, ..., im) − f(i1, ..., ik−1, lk, ik+1, ..., im) |   (1)
Similarly, the stride is defined as the minimum distance
across memory by changing only ik by its step sk :
stride_ik = | f(i1, ..., ik−1, ik + sk, ik+1, ..., im) − f(i1, ..., ik−1, ik, ik+1, ..., im) |   (2)
Thus, an array access region can be described by the
following form, where O denotes the starting offset:
R = ( A^(stride_i1, ..., stride_im)_(span_i1, ..., span_im) + O )   (3)
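Equations (1) and (2) can be evaluated directly for a concrete access; the row-major layout function f(i, j) = 100·i + j below is a hypothetical example, not taken from the paper:

```python
def span(f, point, k, lower, upper):
    # Eq. (1): distance traversed by varying only index k from its
    # lower to its upper bound, holding the other indices at `point`.
    lo, hi = list(point), list(point)
    lo[k], hi[k] = lower, upper
    return abs(f(*hi) - f(*lo))

def stride(f, point, k, step):
    # Eq. (2): minimum distance across memory when index k advances
    # by its step.
    nxt = list(point)
    nxt[k] += step
    return abs(f(*nxt) - f(*point))

f = lambda i, j: 100 * i + j  # A[i][j] with a (hypothetical) row length of 100

print(span(f, (0, 0), 0, 0, 9))   # varying i over [0, 9]: 900
print(stride(f, (0, 0), 0, 1))    # unit step in i: 100
print(span(f, (0, 0), 1, 0, 9))   # varying j over [0, 9]: 9
print(stride(f, (0, 0), 1, 1))    # unit step in j: 1
```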
The region concept can be extended to multi-threaded ap-
plications to represent data partitions and resulting commu-
nication [13]. The MMAP (Multi-threaded Memory Access
Pattern) provides a data access pattern representation for
multi-threaded applications. Given a phase of parallel code
for n threads, the MMAP for an array access is a list of n
regions with each region corresponding to one thread:
M = {R(0), ..., R(n − 1)}   (4)
Each region in the MMAP can be calculated by replacing
Table I: SRAM and STT Parameters (45nm)
Size:          128K SRAM | 512K STT
Area:          1.339mm2  | 1.362mm2
Read Latency:  1.778ns   | 1.823ns
Write Latency: 1.868ns   | 8.725ns (STT-SR), 1.982ns (STT-RR, 26.5us)
Read Energy:   0.574nJ   | 0.550nJ
Write Energy:  0.643nJ   | 3.243nJ
Leakage Power: 0.185W    | 0.013W
thread specific variables2 with the actual values for the
corresponding thread. An MMAP, M, is non-overlapping
when ∀R(x), R(y) in M, R(x) ∩ R(y) = ∅. Otherwise, the
MMAP is overlapping.
The access weight for region R(x), denoted as W(R(x)),
is defined as the number of accesses from thread x on its
region R(x). The access weight can be calculated based on
the information from the loops associated with the region.
An estimate of the data partition can be determined by the
compiler through MMAP analysis and the calculation of
their relative access weights. For a program with n threads
numbered from 0 to n − 1, a partition P is a set of non-
overlapping regions: {R∗(0), ...,R∗(n−1)}. Tile x is the owner
for region R∗(x) in the partition.
When data owned by a particular tile is accessed by a
remote tile this results in communication in the system. Let
R∗(y) represent the data region owned by tile y in a partition.
Then in the same fashion, if data is distributed to the tiles as
dictated by the compiler, accesses by other tiles to data in
R∗(y) require communication. The volume of these remote
accesses is defined as non-local access weight and provides
the potential communication weight between two tiles x and
y in a particular phase within the application execution.
To determine a partition from an overlapping MMAP, each
overlapping array element can be placed into the region that
minimizes non-local access weight.
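One way to realize this placement rule is a greedy assignment: give each overlapping element to the thread with the largest access weight on it, so every other thread's accesses become the (minimized) non-local weight. The dictionary-based model below is a simplified sketch, not the paper's implementation:

```python
def partition(mmap_weights):
    # mmap_weights maps each array element to {thread: access weight}.
    # Assign each element to the thread that accesses it most; all
    # remaining accesses to that element count as non-local weight.
    owner, remote = {}, 0
    for elem, w in mmap_weights.items():
        best = max(w, key=w.get)
        owner[elem] = best
        remote += sum(v for t, v in w.items() if t != best)
    return owner, remote

# Two threads overlap on element 'a'; thread 0 dominates its accesses.
own, rem = partition({'a': {0: 10, 1: 2}, 'b': {1: 5}})
print(own)  # {'a': 0, 'b': 1}
print(rem)  # 2 non-local accesses (thread 1 touching 'a')
```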
The compiler can communicate this partitioning infor-
mation into the runtime system through the page-table.
During data allocation the compiler-extracted partitioning
provides each page (or potentially sub-pages) with an owner.
During data access the owner is used to enforce placement
within the cache. The communication pattern extracted by
the compiler, based on the partitioning, can be used to
guide the establishment of the circuits in the network. The
circuit configurations can be communicated to the runtime
system by adding network configuration instructions into the
code [18] at appropriate points determined by the compiler.
V. Evaluation
In our case studies, we use a modified version of
CACTI [19] and HSPICE simulations to model the latency,
power and area of our memory designs, as shown in Table I.
The write latencies for STT-RAM with different retention
times are calculated based on the relationship between
2A variable is thread specific if it has different values for different threads.
Table II: Baseline architecture configurations
System:      16 cores, 2GHz, 2-issue, 64-bit Solaris OS
L1 Cache:    16KB/core (private), 4-way assoc., 64B block SRAM, 1-cycle latency, write-through, MESI protocol
L2 Cache:    128KB/bank (shared), 32-way assoc., 64B block SRAM, 4-cycle latency, write-back
L2 Energy:   refer to Table I
Network:     4×4 mesh, SRAM buffers, 3 cycles/hop (packet switch)
Main Memory: 4GB, 150-cycle latency
Table III: STT-RAM cache configuration (* requires refresh)
L2 Cache:          512KB/bank (same-area replacement), 64B block, 32-way assoc. (31 STT-SR + 1 STT-RR*)
L2 Read Latency:   4 cycles
L2 Write Latency:  18 cycles (STT-SR), 4 cycles (STT-RR*)
L2 Retention Time: 4 years (STT-SR), 26.5 us (STT-RR*)
MTJ switching time and the switching current at the given
retention levels [4]. The simulated architecture is described
in Table II including a baseline and our hybrid STT-RAM
design for comparison. We utilize multi-threaded workloads
from the SPLASH-2 [20] and PARSEC [21] benchmark
suites and Wind River Simics [22] as our simulation en-
vironment.
A. Compiler-Cache Cross-layer Results
Figure 5: Ratio of writes on STT-SR vs STT-RR memories
Figure 6: Normalized memory access delay
Table III shows parameters for replacing the SRAM
L2 cache with a hybrid STT-RAM cache, which provides
additional capacity with the same die area as SRAM. Fig-
ure 5 shows that STT-RR, which comprises just over 3%
of the cache capacity, serves 89% of the write requests.
This indicates an efficient data distribution that dispatches
most writes onto the WF memory (STT-RR). By utilizing
various forms of STT-RAM, the cache saves considerable
static power. The result is a reduced average memory
access latency due to the increased cache capacity and
reduced power consumption from the static power savings,
as illustrated in Figure 6 and Figure 7, respectively. For
the outlier, SWAPTIONS, the memory latency is degraded
compared to the SRAM design: it does not benefit from
the extra capacity of the STT-RAM cache, and the longer
latency of the writes that occur in STT-SR causes the
performance degradation.

Figure 7: Normalized power consumption

On average, the hybrid STT-
RAM cache design reduces the memory latency by 4.6%
while achieving 10% total energy savings. If the refresh
of STT-RR cannot be tolerated, standard SRAM can serve
the role of WF memory without significant static power
overhead.
B. Network Performance
Figure 8: Communication pattern for OCEAN
To demonstrate the effectiveness of the compiler ex-
tracting the communication pattern, Figure 8 compares the
OCEAN benchmark where the left matrix is the compiler
generated pattern and the right is the runtime generated
pattern, each for 64 processors. Communication volume is
indicated by the intensity of the color, black is negligible
communication, and red, orange, yellow, and white represent
increasingly heavy communication. The compiler generated
pattern provides a good approximation of the runtime effect,
which is less precise due to effects of classification at the
page granularity.
Table IV: Hybrid network configuration
Network:        4×4 mesh, STT-RR buffers, hybrid packet/circuit switch
Packet Switch:  4 virtual channels per port, 6 cycles/hop, 3.2s retention time
Circuit Switch: 4 planes, 1 cycle/hop, 26.5us retention time
Figure 9: Amount of flits traveling on different networks
The network parameters are shown in Table IV. We
utilize a medium retention time of 3.2s for the packet switch
buffers, which write approximately 4X slower than 26.5us
retention memory [4]. We assume packets remaining in the
network for longer than 3s would be dropped. Circuit switch
traffic is never delayed, making 26.5us retention adequate.
Using the communication patterns to guide circuit
establishment in the hybrid network helps efficiently utilize
both the fast-writing and the low-power memory in
our hybrid network design. As shown in Figure 9, nearly
90% of the network traffic traverses circuits. Consequently,
Figure 10 reports an average of 5.3% reduction in network
latency due to the use of hybrid network design, while
achieving significant static power savings.
Figure 10: Normalized network latency
VI. Conclusion
In this paper, we describe cross-layer approaches to detect
data access and communication patterns within the applica-
tions and use this information to efficiently configure hard-
ware that uses STT-RAM. Our evaluation through two case
studies demonstrates that caches and on-chip interconnects
built with STT-RAM storage can effectively cooperate and
utilize the cross-layer information to improve performance
and reduce power consumption.
References
[1] M. Hosomi, H. Yamagishi, T. Yamamoto, K. B. et al., "A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM," IEDM Technical Digest, vol. 2, no. 25, pp. 459–462, 2005.
[2] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie, "Hybrid cache architecture with disparate memory technologies," in Proc. of the Intl. Symp. on Computer Architecture (ISCA). New York, NY, USA: ACM, 2009, pp. 34–45.
[3] M. Qureshi, M. Franceschini, A. Jagmohan, and L. Lastras, "PreSET: Improving read-write performance of phase change memories by exploiting asymmetry in write times," in Proc. of the Intl. Symp. on Computer Architecture (ISCA), 2012.
[4] Z. Sun, X. Bi, H. Li, W.-F. Wong, Z.-L. Ong, X. Zhu, and W. Wu, "Multi retention level STT-RAM cache designs with a dynamic refresh scheme," 2011, pp. 329–338.
[5] H. Jang, B. S. An, N. Kulkarni, K. H. Yum, and E. J. Kim, "A hybrid buffer design with STT-MRAM for on-chip interconnects," in Proc. of the 6th ACM/IEEE Intl. Symp. on Networks-on-Chip (NOCS), 2012.
[6] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, "Express virtual channels: Towards the ideal interconnection fabric," in Proc. of the Intl. Symp. on Computer Architecture (ISCA), 2007, pp. 150–161.
[7] X. Guo, E. Ipek, and T. Soyata, "Resistive computation: Avoiding the power wall with low-leakage, STT-MRAM based computing," in Proc. of the Intl. Symp. on Computer Architecture (ISCA). New York, NY, USA: ACM, 2010, pp. 371–382. [Online]. Available: http://doi.acm.org/10.1145/1815961.1816012
[8] M. Rasquinha, D. Choudhary, S. Chatterjee, S. Mukhopadhyay, and S. Yalamanchili, "An energy efficient cache design using spin torque transfer (STT) RAM," in Proc. of the 16th ACM/IEEE Intl. Symp. on Low Power Electronics and Design (ISLPED). New York, NY, USA: ACM, 2010, pp. 389–394. [Online]. Available: http://doi.acm.org/10.1145/1840845.1840931
[9] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," in Proc. of High Performance Computer Architecture (HPCA), 2009, pp. 239–249.
[10] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan, "Relaxing non-volatility for fast and energy-efficient STT-RAM caches," in Proc. of High-Performance Computer Architecture (HPCA), 2011, pp. 50–61.
[11] Y. Li, Y. Chen, and A. K. Jones, "A software approach for combating asymmetries of non-volatile memories," in Proc. of the Intl. Symp. on Low Power Electronics and Design (ISLPED), 2012.
[12] M. E. Wolf, "Improving locality and parallelism in nested loops," Ph.D. dissertation, Stanford University, Stanford, CA, USA, 1992, UMI Order No. GAX93-02340.
[13] Y. Li, A. Abousamra, R. Melhem, and A. Jones, "Compiler-assisted data distribution and network configuration for chip multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, p. 1, 2011.
[14] A. Abousamra, R. Melhem, and A. K. Jones, "Winning with pinning in NoC," in Proc. of IEEE Hot Interconnects, 2009.
[15] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, "Express virtual channels: Towards the ideal interconnection fabric," in Proc. of the Intl. Symp. on Computer Architecture (ISCA), June 2007.
[16] Y. Paek, "Automatic parallelization for distributed memory machines based on access region analysis," Ph.D. dissertation, Univ. of Illinois at Urbana-Champaign, Dept. of Computer Science, Apr. 1997.
[17] Y. Paek, E. Z. A. Navarro, J. Hoeflinger, and D. Padua, "An advanced compiler framework for noncache-coherent multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 3, pp. 241–259, Mar. 2002.
[18] S. Shao, A. Jones, and R. Melhem, "Compiler techniques for efficient communications in circuit switched networks for multiprocessor systems," IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 3, pp. 331–345, March 2009.
[19] P. Shivakumar and N. P. Jouppi, "CACTI 3.0: An integrated cache timing, power, and area model," HP, Tech. Rep., August 2001.
[20] J. M. Arnold, D. A. Buell, and E. G. Davis, "Splash 2," in SPAA '92: Proc. of the Fourth Annual ACM Symp. on Parallel Algorithms and Architectures. New York, NY, USA: ACM, 1992, pp. 316–322.
[21] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," Princeton University, Tech. Rep. TR-811-08, January 2008.
[22] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," IEEE Computer, vol. 35, no. 2, pp. 50–58, February 2002.