Page 1: [IEEE 2012 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) - Amherst, MA, USA (2012.08.19-2012.08.21)] 2012 IEEE Computer Society Annual Symposium on VLSI - Cross-Layer Techniques

Cross-layer Techniques for Optimizing Systems Utilizing Memories with Asymmetric Access Characteristics

Yong Li and Alex K. Jones

Department of ECE, University of Pittsburgh
Pittsburgh, USA
Email: {yol26, akjones}@pitt.edu

Abstract—Non-volatile memory technologies promise a variety of advantages for memory architectures of next generation computing systems. However, these capabilities come at the cost of some inefficiencies governing the operation of these memories. The most well understood is the asymmetry of access. In order to most effectively take advantage of the benefits of these memory technologies in terms of density and reduced static power in systems while mitigating access complexity, a one-size-fits-all method is not sufficient for all types of applications. Instead, cross-layer techniques that include the compiler, operating system, and hardware layer can extract characteristics from the application that can be used to deliver the highest possible performance while minimizing power consumption for systems using these memories.

Keywords-Compiler; Retention; STT-RAM; Network-on-Chip; Cache; Buffer;

I. Introduction

CMOS technology scaling is increasingly affected by the high leakage power and thermal challenges due to reduced device dimensions and near-threshold voltage operation. Potential improvements include the use of high performance and low power alternative memory technologies. In particular, Spin-Torque Transfer Magnetic RAM (STT-RAM) has received considerable attention [1] and is regarded as one of the most promising future memory candidates. Substantial research efforts have been made to mitigate its drawbacks, most notably the asymmetric access characteristics. Common methodologies to address the memory asymmetry issue include using counter-based mechanisms to track the data accesses and appropriately migrate data blocks [2], pro-actively operating on predicated data elements [3], and sacrificing data retention time for balanced read/write penalties [4].

In this paper, we discuss cross-layer methods to optimize systems that use STT-RAM in various components within chip-multiprocessor (CMP) systems, such as the shared caches and buffers within the network-on-chip (NoC). We discuss how high-level knowledge of the application behavior can be used to mitigate the access asymmetry of read and write behaviors. A cross-layer approach can leverage a high-level view of the data access behavior of an application and use this information to configure features of the system hardware a priori in order to increase performance and power-efficiency. In contrast to a hardware-oriented approach that reacts to system behavior, a cross-layer approach can often avoid misleading or temporary program behavior that can lead to inefficiencies and configuration thrashing.

In particular, we describe methodologies that leverage compiler analyses, which expose data access and communication patterns for multi-threaded programs. Further, we present mechanisms for how this data can be communicated from the system level into the runtime system. Finally, we demonstrate how including configurability into the architecture can use this information to optimize the use of STT-RAM for improved efficiency. We consider system configurability such as: hybrid SRAM/STT-RAM cache designs [2], hybrid SRAM/STT-RAM buffer designs [5] combined with fast, circuit-like paths [6] in NoCs, and the use of reduced data retention times in STT-RAM to improve write performance [4].

We demonstrate the effectiveness of this approach through two case studies. In the first case study we examine compiler-guided writing within a hybrid STT-RAM cache consisting primarily of standard STT-RAM, which requires a long write delay, combined with a small amount of STT-RAM optimized for faster writing through a reduced retention time, which in turn requires refreshing. Our results show that the compiler-guided data distribution ensures that 89% of writes are handled in the faster writing memory, which comprises only 3% of the total cache capacity, and yields a 5% performance improvement and nearly 10% power saving compared to a traditional SRAM design. In our second case study we examine a method to integrate STT-RAM buffers into a configurable NoC. In our studies, nearly 90% of the network traffic can take advantage of a limited set of compiler-guided circuit-switch style fast-paths, leading to a more than 5% reduction in network delay in spite of the increased buffer write time of STT-RAM, while leveraging its low leakage benefit.

978-0-7695-4767-1/12 $26.00 © 2012 IEEE. DOI 10.1109/ISVLSI.2012.65

II. Related Work

Due to the asymmetrically high delay/energy write characteristics of many emerging non-volatile memories, intensive research efforts have been made to mitigate these write penalties at various levels of the system. For example, Guo et al. [7] use STT-RAM to re-design a number of non-write-intensive micro-architectural components and adopt a subbank write buffering policy with read-write bypassing to increase write throughput and hide the high write latency. Rasquinha et al. [8] address the high write energy of STT-RAM by adopting a new replacement policy that increases the residency time of dirty lines at the expense of higher miss rates.

Wu et al. [2] propose a region-based hybrid cache architecture (RHCA) and a level-based hybrid cache architecture (LHCA) with various non-volatile memories including STT-RAM and phase-change memory (PCM). Data is swapped between slow and fast regions based on data access behavior recorded with hardware counters. A similar concept was proposed for a hybrid STT-RAM/SRAM cache [9]. Jang et al. [5] propose a hybrid STT-RAM/SRAM input buffer design for NoCs to save static power while achieving high network throughput.

Recently, it has been proposed to reduce STT-RAM write delay/energy at the expense of a shorter data retention time [4], [10]. This requires that the system ensure data does not degrade.

These hybrid architectures, coupled with reduced retention STT-RAM, provide opportunities for cross-layer optimizations where system configurability can exploit application-specific characteristics to perform more efficiently than a generic approach. We describe two examples of this in the next two sections of the paper.

III. Compiler-Cache Cross-layer Optimization

Most applications have a relatively small working set of frequently written (FW) data, which can be identified by the compiler. Through effective use of a small bank of STT-RAM with relaxed retention (STT-RR), the write penalties of STT-RAM can be mitigated.

A. Cache Design

Building on a standard set-associative cache, each set can contain a small fraction of STT-RR blocks serving as write-friendly (WF) memory and a significant portion of standard STT-RAM, as depicted in Figure 1. To leverage the static power benefit of STT-RAM and achieve write performance comparable to SRAM, STT-RR must be predominantly used to service writes. However, because STT-RR requires a refresh mechanism, frequently read (FR) data should be predominantly stored in standard retention STT-RAM (STT-SR). As access characteristics change during application execution, data must be appropriately migrated or swapped between STT-SR and STT-RR banks.

(We use the terms reduced and relaxed retention interchangeably to refer to the same concept.)

The migration/swap decision can be made through hardware, for example by detecting multiple writes in sequence, indicating FW data, and migrating this location to the WF memory [9]. However, this approach can be easily misled by unpredictable runtime data access behavior. Significant mispredictions can incur an expensive penalty of serving accesses in the wrong type of memory (e.g., heavy write accesses occurring in standard STT-SR) and large migration/swap overheads. In contrast, a cross-layer compiler mechanism has the advantage of taking actions preemptively to hide migration/swap latencies while more effectively detecting data access patterns compared to simple run-time approaches.
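The contrast between a reactive counter-based policy and compiler-guided pre-dispatch can be illustrated with a toy model. This is entirely our own simplification for illustration; the threshold, access encoding, and function names are assumptions, not the hardware mechanism of [9]:

```python
def reactive_migrate(accesses, threshold=2):
    """Counter-style policy: migrate a line to WF memory only after
    `threshold` consecutive writes have been observed (toy model)."""
    slow_writes = 0   # writes served in slow STT-SR before migration
    streak = 0
    in_wf = False
    for op in accesses:
        if op == "W":
            if not in_wf:
                slow_writes += 1
                streak += 1
                if streak >= threshold:
                    in_wf = True   # migrates only after paying the penalty
        else:
            streak = 0             # a read resets the write streak
    return slow_writes

def predispatch_migrate(accesses):
    """Compiler-guided policy: the line is moved to WF memory before
    the write burst begins, so no writes land in slow memory."""
    return 0

burst = ["W"] * 6
print(reactive_migrate(burst))      # 2 slow writes before migration triggers
print(predispatch_migrate(burst))   # 0
```

An alternating write/read stream never satisfies the counter threshold, so every write stays in slow memory; this is the misprediction cost the cross-layer approach avoids.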

Figure 1: Cache configuration utilizing hybrid STT-RAM. (Each cache set combines STT-SR blocks with an STT-RR (WF) block; a counter and a software guide drive the migration/swap control for write requests.)

B. Compiler Analysis

Using compiler-based dispatch, which identifies the write reuse patterns and inserts instructions to guide the hardware to perform the migration/swap dynamically, it is possible to avoid the pitfalls of a runtime-only system [11]. High temporal reuse serves as an indicator of FW data. In the next two sections we describe compiler methods to detect high temporal reuse in array and linked-style data structures.

1) Write Reuse Identification for Arrays: The temporal reuse information of affine array accesses can be analyzed by solving linear algebra equations [12]. Consider Figure 2 as an example. Given the array accesses A[i+2][j] and B[2][i][2*i+1] in the nested loop shown in Figure 2(a), we first convert the subscript functions to the matrix expressions illustrated in Figure 2(b). The array access can now be represented as C*k + O, where C is the coefficient matrix, k is the index vector and O denotes the offset vector. Determining whether the array access has temporal write reuse is now equivalent to deriving the condition under which the equation C*k' + O = C*k'' + O has solutions (k' and k'' represent two different index vectors in the iteration space). In linear algebra theory, the necessary and sufficient condition under which the above equation has distinct solutions is that C does not have full column rank. In our example, the coefficient matrix of A[i+2][j] has a rank of 2, indicating no temporal reuse. B[2][i][2*i+1] has temporal reuse since the rank of its coefficient matrix is 1, which is smaller than the number of loop indices.
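This rank test is straightforward to check mechanically. Below is a minimal pure-Python sketch applied to the two example accesses; the helper names matrix_rank and has_temporal_reuse are our own, not from the paper:

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination over exact rationals."""
    m = [[Fraction(x) for x in row] for row in rows]
    r = 0  # number of pivots found so far
    for col in range(len(m[0])):
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                f = m[i][col] / m[r][col]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
    return r

def has_temporal_reuse(C):
    # C*k' + O = C*k'' + O has distinct solutions iff
    # C lacks full column rank (nontrivial null space).
    return matrix_rank(C) < len(C[0])

# A[i+2][j] over loop indices (i, j): the coefficient matrix has full column rank.
print(has_temporal_reuse([[1, 0], [0, 1]]))          # False
# B[2][i][2*i+1]: every subscript ignores j, so rank 1 < 2.
print(has_temporal_reuse([[0, 0], [1, 0], [2, 0]]))  # True
```

Exact rational arithmetic avoids the floating-point tolerance issues a numerical rank computation would introduce for integer coefficient matrices.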

for(i=0;i<N;i++)
  for(j=0;j<M;j++) {
    A[i+2][j] = 0;
    B[2][i][2*i+1] += 1;
  }

Coefficient matrix for A[i+2][j]: [[1, 0], [0, 1]], offset vector (2, 0).
Coefficient matrix for B[2][i][2*i+1]: [[0, 0], [1, 0], [2, 0]], offset vector (2, 0, 1).

Figure 2: Array accesses and the corresponding matrix representations. (a): array accesses (b): matrix representations

An array write access exhibits spatial write reuse when the innermost enclosing loop index varies only the last coordinate of that array. To discover spatial write reuse, we use a truncated coefficient matrix formed by dropping the last row of the original coefficient matrix, as illustrated in Figure 2(b). If the rightmost column in the truncated coefficient matrix (the coefficients that correspond to the innermost loop index) is a null vector and the rightmost element in the dropped row is nonzero, it is assured that the innermost loop only varies the last coordinate of the corresponding array.

In the above example, A[i+2][j] exhibits spatial reuse since the rightmost column in the truncated matrix is a null vector and the rightmost element in the dropped row is nonzero. Using the same rule we can determine that B[2][i][2*i+1] does not have spatial reuse.

2) Write Reuse Identification for Linked Data Structures:

To analyze the write reuse pattern for linked data structures such as linked lists and trees, a control flow graph (CFG) of the program is constructed. A CFG G = (V, E, r) is a directed graph with nodes V, edges E, and an entry node r. Each node v in V is a basic block, which consists of a sequence of statements that has exactly one entry point and one exit point.

typedef struct node {
  ...
  int x;
  struct node *next;
  struct node *prev;
  ...
} node_t;
node_t *nd;

(b): nd->x = 5; nd->next = A; nd->prev = B;
(d): nd->x = 5; foo(); nd->next = A; nd->prev = B;

Figure 3: Code and control flow graph examples for write reuse identification. (a): type definition code (b): write reuse in the same basic block (c): write reuse across one basic block and all its successors (d): write reuse broken by function call (e): write reuse broken by one successor

Presuming standard compiler passes such as constant propagation, expression folding and branch elimination are applied on the CFG, the CFG is traversed to examine the basic blocks. Memory writes (the left-hand sides of assignments) exhibit memory reuse when several accesses (e.g., three or more) within the basic block are dereferenced from the same base pointer. This implies multiple writes within a small address range, such as a 64-128 byte contiguous region, indicating reuse in the same cache line. For example, in Figure 3 the basic block in (b) exhibits reuse, as the three elements of nd defined in (a), stored in contiguous memory, are all written in the same block.

In a similar fashion, a group of accesses can indicate reuse even if they span multiple contiguous basic blocks, assuming no function calls separate them in the CFG. They can also span conditionals in the case that all corresponding paths in the CFG exhibit similar frequency of access. Thus, in (c) reuse is present because three contiguous locations are written in all paths. However, in (d) the function call separates the accesses, and in (e) one path does not store to nd->prev, each indicating no reuse. These rules help ensure the analyzed writes occur in a single memory block. From this perspective the writes are intensive/frequent.
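As a concrete illustration of the intra-block rule, the sketch below flags base pointers with three or more stores falling within one cache line. It is a simplification we wrote for illustration; the paper's analysis runs on compiler IR, and the lowering of field accesses to byte offsets is an assumption:

```python
from collections import defaultdict

def frequently_written_bases(stores, threshold=3, line_bytes=64):
    """Flag base pointers written >= threshold times within one cache line.

    stores: ordered (base_pointer, byte_offset) pairs for one basic block,
    e.g. nd->x = 5 lowered to ("nd", 0) for a field at offset 0.
    """
    offsets = defaultdict(list)
    for base, off in stores:
        offsets[base].append(off)
    return {
        base for base, offs in offsets.items()
        if len(offs) >= threshold and max(offs) - min(offs) < line_bytes
    }

# Figure 3(b): nd->x, nd->next, nd->prev written back to back -> reuse.
print(frequently_written_bases([("nd", 0), ("nd", 8), ("nd", 16)]))  # {'nd'}
# Only two stores to the same node -> no reuse flagged.
print(frequently_written_bases([("nd", 0), ("nd", 8)]))  # set()
```

The range check mirrors the 64-128 byte contiguity condition above: many stores to one base pointer scattered across a large structure would not share a cache line and are not flagged.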

C. Code Instrumentation

To communicate the information about the FW data to the system, we use code instrumentation. When write reuse has been identified, the compiler inserts a pre-dispatch instruction into the code prior to the memory access to notify the CPU to perform the migration or swap operation, ensuring the writes occur in WF memory.

IV. Compiler-Network Cross-layer Optimization

Multi-threaded applications often imply a particular traffic pattern, either from data partitioning or from data flow between threads. Shared caches can often obfuscate this traffic pattern [13], but the underlying data partitioning and resulting traffic pattern can often be extracted by the compiler [13] and used to configure a hybrid packet/circuit-switched network [13], [14]. By employing compiler, cache, and network cooperation in this fashion, the traditional SRAM buffers can be predominantly replaced with STT-RAM in the NoC to reduce static power while maintaining high performance.

A. Hybrid Network Infrastructure

A configurable hybrid NoC using express virtual channels [15] allows a circuit-like function to exist within a packet-switched network. Packet-switched traffic incurs significant overhead at each switch point, including stages such as virtual channel allocation, route computation, switch allocation and switch traversal. Express channels are pre-configured and allow express traffic to bypass the routing logic at each switch, adding only one cycle of latency per hop for buffering.

Express channels/circuits can be established at runtime using hardware counters that track the communication characteristics of the network [14]. For a cross-layer approach, the compiler-extracted communication pattern can be used to schedule the available circuits efficiently without the runtime overheads [13]. If the majority of the traffic uses circuits, then the virtual channel buffers for the packet-switched traffic are less performance critical and can employ STT-RAM with a slower writing time. Figure 4 provides an example switch configuration for an STT-RAM-based NoC where the shaded buffers support the packet-switched traffic and employ STT-RAM. The unshaded buffers support circuit traffic and use WF memory for best performance.

Figure 4: Hybrid router design using STT-RAM. (STT-RAM buffers serve packet-switched traffic; WF memory buffers serve circuit traffic.)

B. Compiled Communication Pattern

Discovering the data partitioning and communication pattern of the application begins with array access region analysis. Within a loop nest for a single-threaded program, a reference to an array A can be generally represented as A[f(L)], where f(L) is the subscript function defined on a set of loop indices L = i_1, ..., i_m. The span of f(L) resulting from i_k is the maximum distance traversed by varying only i_k from its lower bound l_k to its upper bound u_k [16], [17]:

span_{i_k} = |f(i_1, ..., i_{k-1}, u_k, i_{k+1}, ..., i_m) - f(i_1, ..., i_{k-1}, l_k, i_{k+1}, ..., i_m)|    (1)

Similarly, the stride is defined as the minimum distance across memory by changing only i_k by its step s_k:

stride_{i_k} = |f(i_1, ..., i_{k-1}, i_k + s_k, i_{k+1}, ..., i_m) - f(i_1, ..., i_{k-1}, i_k, i_{k+1}, ..., i_m)|    (2)

Thus, an array access region can be described by the following form, where O denotes the starting offset:

R = (A^{stride_{i_1}, ..., stride_{i_m}}_{span_{i_1}, ..., span_{i_m}} + O)    (3)
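Under definitions (1) and (2), span and stride can be evaluated numerically for a concrete subscript function. The sketch below uses a hypothetical row-major access A[i][j] with row length N = 100; the function names, bounds, and steps are illustrative assumptions, not from the paper:

```python
def span(f, idx, bounds, k):
    """Max distance traversed by varying index k from its lower to upper bound."""
    lo, hi = list(idx), list(idx)
    lo[k], hi[k] = bounds[k][0], bounds[k][1]
    return abs(f(*hi) - f(*lo))

def stride(f, idx, steps, k):
    """Min distance across memory when index k advances by its step."""
    nxt = list(idx)
    nxt[k] += steps[k]
    return abs(f(*nxt) - f(*idx))

# Row-major A[i][j] with row length N = 100: element address f(i, j) = N*i + j.
N = 100
f = lambda i, j: N * i + j
idx = (0, 0)                  # a sample point in the iteration space
bounds = [(0, 9), (0, 49)]    # i in [0, 9], j in [0, 49]
steps = [1, 1]
print(span(f, idx, bounds, 0))    # 900 (varying i sweeps 9 rows)
print(span(f, idx, bounds, 1))    # 49  (varying j stays within a row)
print(stride(f, idx, steps, 0))   # 100
print(stride(f, idx, steps, 1))   # 1
```

For this affine subscript the results match the closed forms: span over i is N*(u_i - l_i) and the stride over i is N*s_i, while j contributes unit stride.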

The region concept can be extended to multi-threaded applications to represent data partitions and the resulting communication [13]. The MMAP (Multi-threaded Memory Access Pattern) provides a data access pattern representation for multi-threaded applications. Given a phase of parallel code for n threads, the MMAP for an array access is a list of n regions, with each region corresponding to one thread:

M = {R(0), ..., R(n - 1)}    (4)

Each region in the MMAP can be calculated by replacing thread-specific variables (variables that take different values for different threads) with the actual values for the corresponding thread. An MMAP, M, is non-overlapping when ∀R(x), R(y) in M, R(x) ∩ R(y) = ∅. Otherwise, the MMAP is overlapping.

Table I: SRAM and STT Parameters (45nm)

              | 128K SRAM | 512K STT
Area          | 1.339mm2  | 1.362mm2
Read Latency  | 1.778ns   | 1.823ns
Write Latency | 1.868ns   | 8.725ns (STT-SR), 1.982ns (STT-RR, 26.5us)
Read Energy   | 0.574nJ   | 0.550nJ
Write Energy  | 0.643nJ   | 3.243nJ
Leakage Power | 0.185W    | 0.013W

The access weight for region R(x), denoted as W(R(x)), is defined as the number of accesses from thread x on its region R(x). The access weight can be calculated based on the information from the loops associated with the region. An estimate of the data partition can be determined by the compiler through MMAP analysis and the calculation of the relative access weights. For a program with n threads numbered from 0 to n - 1, a partition P is a set of non-overlapping regions: {R*(0), ..., R*(n-1)}. Tile x is the owner for region R*(x) in the partition.

When data owned by a particular tile is accessed by a remote tile, this results in communication in the system. Let R*(y) represent the data region owned by tile y in a partition. Then, in the same fashion, if data is distributed to the tiles as dictated by the compiler, accesses by other tiles to data in R*(y) require communication. The volume of these remote accesses is defined as the non-local access weight and provides the potential communication weight between two tiles x and y in a particular phase within the application execution. To determine a partition from an overlapping MMAP, each overlapping array element can be placed into the region that minimizes non-local access weight.
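To make the overlap-resolution rule concrete, the following sketch assigns each contested element to its heaviest accessor, which minimizes the non-local access weight for that element. This is our own element-granularity simplification for illustration; the paper operates on regions and, at runtime, on pages:

```python
from collections import defaultdict

def assign_overlaps(mmap_accesses):
    """Place each array element with the thread that accesses it most.

    mmap_accesses: dict mapping thread id -> {element: access count}.
    Returns: dict element -> owning thread; ties broken by lower thread id.
    Remote accesses by the other threads become non-local access weight,
    so giving the element to the heaviest accessor minimizes that weight.
    """
    per_element = defaultdict(dict)
    for tid, counts in mmap_accesses.items():
        for elem, n in counts.items():
            per_element[elem][tid] = n
    return {
        elem: min(counts, key=lambda t: (-counts[t], t))
        for elem, counts in per_element.items()
    }

# Two threads overlap on element 100: thread 0 writes it 8 times, thread 1 twice,
# so ownership goes to thread 0 and only 2 accesses remain non-local.
mmap = {0: {100: 8, 101: 5}, 1: {100: 2, 200: 7}}
print(assign_overlaps(mmap))  # {100: 0, 101: 0, 200: 1}
```

The resulting ownership map is exactly the kind of partition the compiler records per page so that placement and circuit scheduling can be derived from it.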

The compiler can communicate this partitioning information into the runtime system through the page table. During data allocation, the compiler-extracted partitioning provides each page (or potentially sub-pages) with an owner. During data access, the owner is used to enforce placement within the cache. The communication pattern extracted by the compiler, based on the partitioning, can be used to guide the establishment of the circuits in the network. The circuit configurations can be communicated to the runtime system by adding network configuration instructions into the code [18] at appropriate points determined by the compiler.

V. Evaluation

In our case studies, we use a modified version of CACTI [19] and HSPICE simulations to model the latency, power and area of our memory designs, as shown in Table I. The write latencies for STT-RAM with different retention times are calculated based on the relationship between MTJ switching time and the switching current at the given retention levels [4]. The simulated architecture is described in Table II, including a baseline and our hybrid STT-RAM design for comparison. We utilize multi-threaded workloads from the SPLASH-2 [20] and PARSEC [21] benchmark suites and Wind River Simics [22] as our simulation environment.

Table II: Baseline architecture configurations

System      | 16 cores, 2GHz, 2-issue, 64-bit Solaris OS
L1 Cache    | 16KB/core (private), 4-way assoc., 64B block, SRAM, 1-cycle latency, write-through, MESI protocol
L2 Cache    | 128KB/bank (shared), 32-way assoc., 64B block, SRAM, 4-cycle latency, write-back
L2 Energy   | refer to Table I
Network     | 4×4 mesh, SRAM buffers, 3 cycles/hop (packet switch)
Main Memory | 4GB, 150-cycle latency

Table III: STT-RAM cache configuration (* requires refresh)

L2 Cache          | 512KB/bank (same-area replacement), 64B block, 32-way assoc. (31 STT-SR + 1 STT-RR*)
L2 Read Latency   | 4 cycles
L2 Write Latency  | 18 cycles (STT-SR), 4 cycles (STT-RR*)
L2 Retention Time | 4 years (STT-SR), 26.5 us (STT-RR*)

A. Compiler-Cache Cross-layer Results

Figure 5: Ratio of writes on STT-SR vs STT-RR memories

Figure 6: Normalized memory access delay (SRAM vs hybrid)

Table III shows parameters for replacing the SRAM L2 cache with a hybrid STT-RAM cache, which provides additional capacity within the same die area as SRAM. Figure 5 shows that STT-RR, which comprises just over 3% of the cache capacity, serves 89% of the write requests. This indicates an efficient data distribution that dispatches most writes onto the WF memory (STT-RR). By utilizing various forms of STT-RAM, the cache saves considerable static power. The result is a reduced average memory access latency due to the increased cache capacity, and reduced power consumption from the static power savings, as illustrated in Figure 6 and Figure 7, respectively. For the outlier, SWAPTIONS, the memory latency is degraded compared to the SRAM design: it does not benefit from the extra capacity achieved by the STT-RAM cache, and the longer latency of the writes that occur in STT-SR causes the performance degradation. On average, the hybrid STT-RAM cache design reduces the memory latency by 4.6% while achieving 10% total energy savings. If the refresh of STT-RR cannot be tolerated, standard SRAM can serve the role of WF memory without significant static power overhead.

Figure 7: Normalized power consumption (SRAM vs hybrid, dynamic and static)

B. Network Performance

Figure 8: Communication pattern for OCEAN

To demonstrate the effectiveness of the compiler extracting the communication pattern, Figure 8 compares the OCEAN benchmark, where the left matrix is the compiler-generated pattern and the right is the runtime-generated pattern, each for 64 processors. Communication volume is indicated by the intensity of the color: black is negligible communication, and red, orange, yellow, and white represent increasingly heavy communication. The compiler-generated pattern provides a good approximation of the runtime effect, which is less precise due to effects of classification at the page granularity.

Table IV: Hybrid network configuration

Network        | 4×4 mesh, STT-RR buffers, hybrid packet/circuit switch
Packet Switch  | 4 virtual channels per port, 6 cycles/hop, 3.2s retention time
Circuit Switch | 4 planes, 1 cycle/hop, 26.5us retention time

Figure 9: Amount of flits traveling on different networks (packet, partial circuit, circuit)

The network parameters are shown in Table IV. We utilize a medium retention time of 3.2s for packet-switch buffers, which write approximately 4X slower than the 26.5us retention memory [4]. We assume packets in the network for longer than 3s would be dropped. Circuit-switch traffic is never delayed, making 26.5us retention adequate. Using the communication patterns to guide circuit establishment in the hybrid network helps satisfy the aims of efficient utilization of both the fast-writing and the low-power memory in our hybrid network design. As shown in Figure 9, nearly 90% of the network traffic traverses circuits. Consequently, Figure 10 reports an average 5.3% reduction in network latency due to the use of the hybrid network design, while achieving significant static power savings.

Figure 10: Normalized network latency (packet vs hybrid)

VI. Conclusion

In this paper, we describe cross-layer approaches to detect data access and communication patterns within applications and use this information to efficiently configure hardware that uses STT-RAM. Our evaluation through two case studies demonstrates that caches and on-chip interconnects built with STT-RAM storage can effectively cooperate and utilize the cross-layer information to improve performance and reduce power consumption.

References

[1] M. Hosomi, H. Yamagishi, T. Yamamoto, K. B. et al., "A novel nonvolatile memory with spin torque transfer magnetization switching: Spin-RAM," IEDM Technical Digest, vol. 2, no. 25, pp. 459–462, 2005.

[2] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie, "Hybrid cache architecture with disparate memory technologies," in Proc. of the International Symposium on Computer Architecture (ISCA). New York, NY, USA: ACM, 2009, pp. 34–45.

[3] M. Qureshi, M. Franceschini, A. Jagmohan, and L. Lastras, "PreSET: Improving read-write performance of phase change memories by exploiting asymmetry in write times," in Proc. of the International Symposium on Computer Architecture (ISCA), 2012.

[4] Z. Sun, X. Bi, H. Li, W.-F. Wong, Z.-L. Ong, X. Zhu, and W. Wu, "Multi retention level STT-RAM cache designs with a dynamic refresh scheme," 2011, pp. 329–338.

[5] H. Jang, B. S. An, N. Kulkarni, K. H. Yum, and E. J. Kim, "A hybrid buffer design with STT-MRAM for on-chip interconnects," in Proc. of the 6th ACM/IEEE International Symposium on Networks-on-Chip (NOCS), 2012.

[6] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, "Express virtual channels: Towards the ideal interconnection fabric," in Proc. of the International Symposium on Computer Architecture (ISCA), 2007, pp. 150–161.

[7] X. Guo, E. Ipek, and T. Soyata, "Resistive computation: Avoiding the power wall with low-leakage, STT-MRAM based computing," in Proc. of the International Symposium on Computer Architecture (ISCA). New York, NY, USA: ACM, 2010, pp. 371–382. [Online]. Available: http://doi.acm.org/10.1145/1815961.1816012

[8] M. Rasquinha, D. Choudhary, S. Chatterjee, S. Mukhopadhyay, and S. Yalamanchili, "An energy efficient cache design using spin torque transfer (STT) RAM," in Proc. of the 16th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED). New York, NY, USA: ACM, 2010, pp. 389–394. [Online]. Available: http://doi.acm.org/10.1145/1840845.1840931

[9] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen, "A novel architecture of the 3D stacked MRAM L2 cache for CMPs," in Proc. of the International Symposium on High Performance Computer Architecture (HPCA), 2009, pp. 239–249.

[10] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan, "Relaxing non-volatility for fast and energy-efficient STT-RAM caches," in Proc. of the International Symposium on High-Performance Computer Architecture (HPCA), 2011, pp. 50–61.

[11] Y. Li, Y. Chen, and A. K. Jones, "A software approach for combating asymmetries of non-volatile memories," in Proc. of the International Symposium on Low Power Electronics and Design (ISLPED), 2012.

[12] M. E. Wolf, "Improving locality and parallelism in nested loops," Ph.D. dissertation, Stanford University, Stanford, CA, USA, 1992, UMI Order No. GAX93-02340.

[13] Y. Li, A. Abousamra, R. Melhem, and A. K. Jones, "Compiler-assisted data distribution and network configuration for chip multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. PP, no. 99, p. 1, 2011.

[14] A. Abousamra, R. Melhem, and A. K. Jones, "Winning with pinning in NoC," in Proc. of IEEE Hot Interconnects, 2009.

[15] A. Kumar, L.-S. Peh, P. Kundu, and N. K. Jha, "Express virtual channels: Towards the ideal interconnection fabric," in Proc. of the International Symposium on Computer Architecture (ISCA), June 2007.

[16] Y. Paek, "Automatic parallelization for distributed memory machines based on access region analysis," Ph.D. dissertation, Univ. of Illinois at Urbana-Champaign, Dept. of Computer Science, Apr. 1997.

[17] Y. Paek, E. Z. A. Navarro, J. Hoeflinger, and D. Padua, "An advanced compiler framework for noncache-coherent multiprocessors," IEEE Transactions on Parallel and Distributed Systems, vol. 3, no. 3, pp. 241–259, Mar. 2002.

[18] S. Shao, A. Jones, and R. Melhem, "Compiler techniques for efficient communications in circuit switched networks for multiprocessor systems," IEEE Transactions on Parallel and Distributed Systems, vol. 20, no. 3, pp. 331–345, March 2009.

[19] P. Shivakumar and N. P. Jouppi, "CACTI 3.0: An integrated cache timing, power, and area model," HP Labs, Tech. Rep., August 2001.

[20] J. M. Arnold, D. A. Buell, and E. G. Davis, "Splash 2," in SPAA '92: Proc. of the Fourth Annual ACM Symposium on Parallel Algorithms and Architectures. New York, NY, USA: ACM, 1992, pp. 316–322.

[21] C. Bienia, S. Kumar, J. P. Singh, and K. Li, "The PARSEC benchmark suite: Characterization and architectural implications," Princeton University, Tech. Rep. TR-811-08, January 2008.

[22] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," IEEE Computer, vol. 35, no. 2, pp. 50–58, February 2002.


