CS-2002-05

Building DRAM-based High Performance Intermediate

Memory System

Junyi Xie and Gershon Kedem

Department of Computer Science

Duke University

Durham, North Carolina 27708-0129

May 15, 2002

Abstract

Although the speed of modern microprocessors improves at a rate of around 80% per year, memory speed unfortunately improves only about 5% per year. According to Amdahl's law, overall computer performance is determined not only by the processor but also by the speed of the memory system. Previous research shows that the access time to the memory system typically accounts for more than 50% of a program's execution time. Given the widening performance gap between processors and the memory system, it is essential to build high-performance memory systems that alleviate this disparity. In this research project, we build a cost-effective, large, and fast DRAM-based cache system that reduces the high cost incurred by power-hungry static SRAM while improving cache and overall system performance. In addition, based on this design, we integrate into the cache model two prefetch schemes that predict and aggressively prefetch the next referenced cache line from the next level of the memory hierarchy to further improve cache performance. Compared with a traditional prefetching cache, our prefetching schemes incur much less hardware cost, since we only need to maintain prediction information for the small SRAM buffer. Our simulations show that a 1M DRAM-based on-chip L2 cache with only a 64K fast SRAM buffer can outperform a typical 256K on-chip SRAM-based L2 cache by 118% on average, over 10 benchmarks from SPEC95 and SPEC2000 whose miss rate on the baseline cache exceeds 5%.

1 Introduction

The memory hierarchy of current computing systems has been under pressure from continued improvements in processor performance, in particular the sharp increase in clock frequencies. The speed gap between the processor and the memory hierarchy keeps widening as general-purpose processors get faster and faster. Generally speaking, processor speed increases by a factor of 2 every 18 months, while DRAM speed improves only at a rate of 5% per year. As a result, overall performance is no longer determined by processor speed alone: the performance of computing systems now depends heavily on the performance of the memory system. Current system designers employ a wide range of techniques to reduce or tolerate memory system delays, including dynamic scheduling, speculative execution, multilevel caches, non-blocking accesses, and prefetching in the cache hierarchy. One intensively explored direction among these techniques is to build a high-performance cache hierarchy with current memory technologies. Currently almost all microprocessors use on-chip level-1 or level-2 caches to provide fast access to data and achieve high processor performance. These on-chip caches are usually built from fast and expensive static memory (SRAM) to keep pace with the processor. However, due to chip area constraints and other restrictions such as energy consumption, an on-chip cache cannot be built large enough to provide a high hit rate. In consequence, because of the huge miss penalty resulting from the wide speed gap between processor and memory, system performance is slowed down greatly. In this project, we explore the possibility of building a high-performance and cost-effective external cache based on dense and relatively slow DRAM memory.

In our design, we combine large DRAM arrays with small, fast SRAM buffers on a single integrated circuit to form a cache subsystem. The cache design proposed in this paper can serve as either an external or an on-chip L2 cache. Typically, the size ratio of the DRAM arrays to the SRAM buffer is over 16. Since the DRAM arrays account for most of the size of the cache system, the cost per bit of the cache can be kept correspondingly low. Moreover, we also design it as a DRAM-page-based prefetching cache using different prediction schemes. We organize this report as follows. In section 2 we review previous related research in this field. In section 3 we describe the design of the DRAM-based cache system in detail. We introduce page-based prediction and prefetching for the DRAM-based cache in section 4. In section 5 we present the simulation environment and experimental results comparing the performance of this cache with the baseline cache architecture. We conclude this report with a summary and future work in section 6.

2 Related Work

To reduce the overhead of retrieving data from the memory system for programs with large working sets, different technologies, including fast tag matching for wide set-associative caches [15] and cached DRAM, have been used for some time. In 1995, Alexander & Kedem [2][1] introduced a new main memory architecture in which the L2 cache is integrated into the DRAM array to form a distributed cache. In 1997, Wong [18] quantified the performance improvement obtained by adding SRAM cache to the DRAM chip, for various design alternatives for the line size and associativity of the SRAM cache. Newer DRAM technologies such as CDRAM and EDRAM use a small SRAM page to replace the page buffer and use on-chip DRAM caching to eliminate the drawbacks of page-mode DRAMs [4][8][14]. Koganti & Kedem [13][12] introduced the WCDRAM architecture, which takes advantage of very wide cache lines integrated into DRAM to reduce the average DRAM access time. Because data prefetching can hide memory latency by bringing data into a higher level of memory before it is required, it is widely used, and different prefetching and prediction schemes have been proposed. Smith [16] proposed one-block-lookahead prefetching and evaluated the scheme extensively. In 1990, Jouppi [11] introduced the concept of the stream buffer, which prefetches consecutive blocks, starting from the missed cache block, into a small buffer associated with the cache. Stride prefetching was first proposed by Chen & Baer [3][5][6]; it uses the stride of past references to predict the next referenced cache block. Although history-based prediction schemes do not help the DRAM-based cache in this project, they are quite effective in conventional cache systems. 1st-order Markov prediction was proposed by Pomerene et al. [7]. Alexander & Kedem [1][2] use distributed table-based prediction to prefetch cache blocks, and in 1997 Joseph & Grunwald [10] extensively evaluated cache-block prefetching based on 1st-order Markov predictors. Yu & Kedem [19] explored DRAM-page-based prefetching from memory to the L2 cache using different prediction schemes. Lin, Reinhardt & Burger [17] evaluated an aggressive prefetch unit integrated into the L2 cache and memory controller; their scheme only issues prefetches when the Rambus channels are idle. Finally, the authors of [9] described the circuit technologies developed for a high-speed, large-bandwidth, on-chip 12ns 8MB DRAM secondary cache.

3 DRAM-based External Cache

3.1 Overview

Figure 1 is a block diagram of the DRAM-SRAM cache, and Figure 2 is a block diagram of the data module shown in Figure 1. We use cached-DRAM ICs to implement this cache design. By using a DRAM array associated with a fast SRAM buffer, our design can implement a very large external cache. The cache subsystem consists of three components: the cache controller, the cache-data memory, and the cache-tag memory. The cache-data memory is where the cached data are stored. It can be divided into multiple independent banks. For illustration purposes, we describe here an instance with four banks, each of which can operate independently. The data stored in the DRAM array is therefore four-way interleaved. Within each bank, the DRAM array does not interact with the outside directly. Instead, data is transferred between the DRAM array and the outside memory bus via a small, fast SRAM buffer. Generally, the SRAM buffer is used to hold a set of active DRAM cache pages. Logically, it is the SRAM buffer that is connected to the outside, and the sense amplifiers work as an interface between the DRAM array and the associated SRAM buffer. Because we integrate the DRAM array and SRAM buffer on the same IC, we can easily achieve very high internal bus bandwidth between the SRAM buffer and the DRAM array. In particular, it takes one cycle to transfer one complete DRAM page between the sense amplifiers and the associated SRAM buffer. In addition, the DRAM array is implemented with technology optimized for speed instead of density. Specifically, if we assume that the row access time of conventional DRAM chips is around 30ns, it takes only

Figure 1: DRAM-based Cache Block Diagram (cache controller with primary and secondary tag arrays, a processor interface, and a data module of four DRAM banks, each paired with an SRAM buffer)

12-18ns for the DRAM arrays used in this cache subsystem to perform one row access operation and make the data ready in the sense amplifiers. The cache-tag memory is composed of two sub tag arrays: the primary tag array and the second tag array. Like any conventional cache architecture, the tag arrays use fast SRAM to hold the tag information and speed up cache access. Since most of the cached data resides in the DRAM array, we can build a very large yet cost-effective cache subsystem without being limited by the cost and size constraints imposed by an SRAM-based cache. In this project, we simulate an internal four-way set-associative cache; alternatives such as direct-mapped or other set-associative organizations could also be implemented.

Figure 2: Data Module of DRAM-based Cache (a write queue and four banks, each consisting of a DRAM array, sense amplifiers with latches, and an SRAM array)

3.2 Cache Controller and Tag Arrays

As the block diagram in Figure 1 shows, there are three major components in the cache subsystem: a cache controller, tag memory, and cache-data memory. Just like its counterpart in a conventional cache system, the cache controller mediates between the processor and the cache memory to coordinate data traffic between them. When a reference arrives at the cache, the controller first looks up the tag arrays to see whether the required data is already stored in the cache. If it is a cache hit, the cache controller updates the tag memory and transfers the required data back to the processor. If not, it issues a memory request to fetch the corresponding cache line from the next level of the memory hierarchy. In addition, it also manages the internal data transfer between the DRAM arrays and the associated SRAM buffers. If prefetching hardware is integrated into the DRAM-based cache, the

cache controller also predicts candidates for the next referenced cache line of the current cache access and issues the corresponding memory requests to fetch the prefetched cache lines.

In addition to the DRAM banks and associated SRAM buffers, the cache system, like any other cache system, uses tag memory to manage the cached data. There are two tag arrays connected to the cache controller, the primary tag array and the second tag array, both implemented in fast SRAM. Just like the tag array in a conventional cache, the primary tag array is used to access the data in the cache by matching the tag portion of the incoming request address against that of the stored data. However, there are two differences between the tag array of the DRAM-based external cache and that of a conventional cache.

First, in a conventional cache, each cache line or each set of cache lines can be identified by the tag portion of its address. The cache tag array maintains one tag entry for each set of cache blocks in a set-associative cache, or for each cache block in a direct-mapped cache. The index portion of the incoming reference address is used to locate the target cache block, and the tag portion is sufficient to decide whether that block is in the cache. In the DRAM-based cache, however, the tag array maintains one tag entry for each DRAM page rather than for each cache block in the DRAM arrays. In consequence, the tag portion of the incoming reference address can only indicate the residence of the corresponding DRAM page in which the target cache block may reside. The residence of a DRAM page does not necessarily mean that every cache line in this page also resides in the cache, because the DRAM-based cache stores cached data with one DRAM page as one cache block. Therefore, when a missing cache line is fetched from the next level of memory and the corresponding DRAM page in which this cache line resides is not yet allocated in the cache, the cache controller allocates a whole DRAM page for this cache line. At this point, the allocated DRAM page in the cache has only one valid cache line, and all other cache lines are invalid or empty. We call the invalid or empty cache lines in a cached DRAM page holes. Obviously, the tag information of the incoming reference address alone is not enough to decide whether the target cache line is already in the cache, since there can be holes in the corresponding cached DRAM page. To address this problem, we introduce an array of bits for each DRAM page in the cache indicating whether each specific cache line is present. For example, if the DRAM page size is 4K bytes and the cache line size is 128 bytes, there are 32 cache lines in one DRAM page; as a result, we maintain an array of 32 bits for each DRAM page to indicate the presence of its 32 cache lines. When a cache line is fetched from the next level of memory, the corresponding presence bit is set to 1, which means this cache line is now in the cache. The presence array is also implemented in tag memory. The tag array together with the presence array is enough for the cache controller to decide whether the cache line addressed by the incoming reference is already cached in the DRAM-based cache. Figure 3 shows the structure of a tag entry in the primary tag memory, in which TAG is the tag portion of the cached DRAM page, bits V and B indicate whether this page is valid and whether it is buffered in SRAM, PA is the presence array, and I is the index of the DRAM page in the SRAM buffer if it is buffered. Figure 4 shows an example entry in the primary tag array. In this example, the tag of a valid, buffered cache page is 0x0724 and its index is 0x4E. In this page only the first two cache lines are present, so the presence-bit array is 0x3. This cache page is buffered in the 11th (0xB) entry of the SRAM buffer.

The second difference between the DRAM-based tag memory and that of a conventional cache is that there are two sub tag memories, the primary and the second tag array, in the DRAM-based cache, while there is only one tag array in a conventional cache. We maintain two tag arrays for the DRAM-based cache because, besides the primary tag array used to index the target cache line, we need a second tag array as a backward pointer to index the cached DRAM page that is buffered in fast SRAM. We employ a write-back policy between the SRAM buffer and the DRAM array; thus when a page in SRAM must be evicted, we need to write it back to DRAM if it is dirty. The second tag array is used to determine where to write back a dirty page in SRAM. Figure 5 shows an entry in the second tag array, in which Index is used to locate the tag entry in the primary tag memory, TAG is the tag portion of the buffered page in SRAM, and bit D indicates whether this page is dirty. Figure 6 shows the entry of a dirty page in SRAM that buffers the cache page of Figure 4. We describe the indexing algorithms and the replacement policy of this cache in detail in the section on cache operations.

V   Tag   Index   Present Bits   Buffered Address   B

Figure 3: Primary Tag Entry

1   0x0724   0x4E   0x3   0xB   1

Figure 4: Sample Primary Tag Entry

Tag Index D

Figure 5: Second Tag Entry

0x0724 0x4E 1

Figure 6: Sample Second Tag Entry
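For concreteness, the two entry formats of Figures 3 through 6 can be pictured as in the C sketch below. This is only an illustration: the concrete field widths (a 32-bit tag field, a 32-entry presence array, an 8-bit buffer index) are assumptions matching the 4K-byte page / 128-byte line example above, not the exact layout of the hardware tag memory.

    #include <stdint.h>
    #include <stdbool.h>

    /* Primary tag entry, one per DRAM page frame in the cache (Figures 3 and 4). */
    typedef struct {
        uint32_t tag;           /* tag portion of the cached DRAM page address        */
        bool     valid;         /* V: this frame holds a valid DRAM page              */
        bool     buffered;      /* B: the page is currently held in the SRAM buffer   */
        uint32_t present_bits;  /* PA: one bit per cache line in the page (32 lines)  */
        uint8_t  sram_index;    /* I: SRAM buffer row holding this page, if buffered  */
    } primary_tag_entry;

    /* Second tag entry, one per SRAM buffer row (Figures 5 and 6): the backward
     * pointer used when a dirty buffered page has to be written back to DRAM.   */
    typedef struct {
        uint32_t tag;           /* tag of the DRAM page buffered in this row          */
        uint32_t index;         /* index of that page's entry in the primary array    */
        bool     dirty;         /* D: the row differs from its copy in the DRAM array */
    } second_tag_entry;

    /* A reference hits only if the page is valid, its tag matches, and the
     * presence bit of the requested line within the page is set.             */
    static bool line_present(const primary_tag_entry *e, uint32_t tag, unsigned line)
    {
        return e->valid && e->tag == tag && ((e->present_bits >> line) & 1u);
    }

With this layout, the sample entry of Figure 4 corresponds to tag = 0x0724, valid = 1, buffered = 1, present_bits = 0x3, and sram_index = 0xB.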

3.3 Cache-data Memory

The DRAM array is used to store the cached data. It accounts for most of the IC area compared with the SRAM buffer, and thus greatly reduces the cost per bit. The DRAM arrays can be organized as one or several independent banks; each bank is associated with an SRAM buffer and operates independently. Figure 7 shows a single-bank DRAM-based cache in which only the cache-data memory is illustrated; the cache controller and tag memories are omitted. Within each bank, the DRAM arrays are associated with a buffer made of fast and expensive static memory (SRAM). Logically, the SRAM buffer is organized as a set of rows, each of which is as wide as a row of the DRAM array, so each SRAM row can hold the data of one DRAM page. Additionally, the SRAM buffer acts as a fully associative cache with an LRU replacement policy. To understand the cache behavior, we can view the DRAM array together with the SRAM buffer as a data cache in which the SRAM buffer serves as an internal cache for the DRAM arrays. Moreover, the wide, high-bandwidth on-chip data bus between the SRAM buffer and the DRAM array makes it possible to transfer one DRAM page between the SRAM buffer and the sense amplifiers in only one clock cycle. The data stored in one DRAM page (one SRAM row) maps one segment of consecutive main memory, but this does not mean that the whole segment, with the size of one DRAM page, is transferred together between main memory and the cache system. From the perspective of either the upper or the next level of memory, this cache subsystem works like any conventional cache: the data transfer between this cache and the upper or next level of the memory hierarchy is still cache-line based, and each external data transfer moves one cache line.

Figure 7: DRAM-based Cache Block Diagram (a single bank: DRAM bank, sense amplifiers, column select, SRAM buffer, and DRAM and SRAM address lines)
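As a rough illustration of the bookkeeping described above, the following C sketch models one bank's SRAM buffer as a fully associative, LRU-replaced set of page frames. The 16-row size and the linear searches are illustrative assumptions, not the actual circuit implementation.

    #include <stdint.h>

    #define SRAM_ROWS 16            /* assumed number of buffered DRAM pages per bank */

    typedef struct {
        uint32_t page_tag;          /* which DRAM page this row currently holds       */
        int      valid;
        uint64_t last_use;          /* timestamp for LRU replacement                  */
    } sram_row;

    static sram_row buffer[SRAM_ROWS];
    static uint64_t now;

    /* Return the buffer row holding page_tag, or -1 if the page is not buffered. */
    int sram_lookup(uint32_t page_tag)
    {
        for (int i = 0; i < SRAM_ROWS; i++)
            if (buffer[i].valid && buffer[i].page_tag == page_tag) {
                buffer[i].last_use = ++now;       /* touch the row for LRU          */
                return i;
            }
        return -1;
    }

    /* Pick the victim row when a page must be copied in from the DRAM array. */
    int sram_victim(void)
    {
        int victim = 0;
        for (int i = 0; i < SRAM_ROWS; i++) {
            if (!buffer[i].valid)
                return i;                         /* prefer an empty row            */
            if (buffer[i].last_use < buffer[victim].last_use)
                victim = i;                       /* otherwise take the LRU row     */
        }
        return victim;
    }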

3.4 Cache Operations

When the processor initiates a data request to the on-chip L1 cache and an L1 miss occurs, the request is passed from the L1 cache to this external cache system. The cache controller extracts the tag, the index portion, and the cache-line offset within the page from the incoming address and does a look-up in the primary tag memory to see whether the target cache line is already in the cache. If this is a cache hit, the incoming address is parsed and the row access address is sent to the DRAM array address bus to activate one DRAM page into the sense amplifiers. At the same time, if the DRAM page is already buffered in the SRAM buffer, the buffered address is read from the primary tag and sent to the address bus of the SRAM buffer, and the cache line is served to the request. We call this kind of hit a fast hit, since the data can be served directly from the fast SRAM buffer. Otherwise, if the DRAM page is not yet buffered, we first need to transfer this page from the DRAM array to the SRAM buffer and then serve the data to the processor. In this case we call it a delayed hit, since it takes longer to serve the data due to the internal page transfer. The last possible case is a cache miss, which means the data is in neither the DRAM array nor the SRAM buffer; in this case the cache system initiates an access request to the next level of the memory hierarchy, as any conventional cache does. The next two subsections give detailed descriptions of the cache behavior when an external memory request arrives.
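As a rough sketch of the first step, the controller's address decomposition could look like the code below. The bit widths assume a 4 KB DRAM page, a 128-byte cache line, and a hypothetical 8-bit index into the primary tag array, so they are illustrative rather than the simulator's exact parameters.

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_BITS  7     /* 128-byte cache line                                 */
    #define PAGE_BITS  12    /* 4 KB DRAM page, i.e. 32 cache lines per page        */
    #define INDEX_BITS 8     /* assumed number of index bits into the primary tags  */

    /* Split a physical address into the cache-line offset within its DRAM page,
     * the primary-tag index, and the tag, as described in Section 3.4.          */
    static void split_address(uint64_t addr, unsigned *line, unsigned *index,
                              uint64_t *tag)
    {
        *line  = (unsigned)(addr >> LINE_BITS) & ((1u << (PAGE_BITS - LINE_BITS)) - 1);
        *index = (unsigned)(addr >> PAGE_BITS) & ((1u << INDEX_BITS) - 1);
        *tag   = addr >> (PAGE_BITS + INDEX_BITS);
    }

    int main(void)
    {
        unsigned line, index;
        uint64_t tag;
        /* With these assumed widths, this address lands in the page of Figure 4. */
        split_address(0x7244E080ULL, &line, &index, &tag);
        printf("tag=%#llx index=%#x line=%u\n", (unsigned long long)tag, index, line);
        return 0;
    }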

3.4.1 Read Memory Request

Tables 1 and 2 summarize the cache behaviors and corresponding actions when the cache system receives a data read request. First of all, the controller does a look-up in the primary tag array. Depending on the status of the tag match, valid bit, presence bit, and buffered bit, there are at most six possible cases of cache access, as illustrated in Tables 1 and 2:

1. The corresponding page in DRAM is invalid. This is obviously a cache miss, since the whole cache page does not reside in the DRAM array. In this case, the cache controller opens an unused cache page in DRAM and SRAM, evicting an old page if necessary, and fetches the cache line from the next level of memory.

2. The tag does not match although the related cache page is valid in DRAM. As in case one, the cache controller needs to allocate a cache page in DRAM and SRAM and fetch the corresponding cache line from the next level of memory.

3. The tag matches and the page is valid, but the presence and buffered bits are both zero. That means that although the enclosing cache page is already in DRAM, the target cache line is missing. In this case, the cache controller fetches the line first and allocates a row in SRAM for this page.

4. This case is the same as case three except that the buffered bit is set. The cache controller does the same thing as in case three; the only difference is that it does not need to allocate an SRAM row, since the page is already buffered.

5. The target cache line is already in the DRAM array. This is a cache hit. If the page is not yet buffered (the buffered bit is zero), it is copied from DRAM to SRAM first and then the cache can serve the data. This is a delayed hit, since the data is delayed by the miss in SRAM.

6. The last case is a fast hit. The target cache line is not only already in the DRAM array but also buffered in SRAM, so the data can be served directly at the latency of the fast SRAM buffer.
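Taken together, the six cases above amount to a small decision procedure; Tables 1 and 2 below summarize the same behavior. The sketch that follows encodes only that control flow: open_page, evict_sram_row, copy_page_to_sram, fetch_line, and serve_data are empty placeholders for the operations described above, not real interfaces.

    #include <stdbool.h>

    /* Placeholder helpers standing in for the operations named above. */
    static void open_page(void)         {}  /* allocate a page frame in DRAM and SRAM  */
    static void evict_sram_row(void)    {}  /* free an SRAM row, writing back if dirty */
    static void copy_page_to_sram(void) {}  /* internal DRAM -> SRAM page transfer     */
    static void fetch_line(void)        {}  /* request the line from the next level    */
    static void serve_data(void)        {}  /* return the line to the processor        */

    /* Dispatch a read request according to the six cases above. */
    void handle_read(bool valid, bool tag_match, bool present, bool buffered)
    {
        if (!valid || !tag_match) {          /* cases 1 and 2: the page itself misses  */
            open_page();                     /* evicting an old page if necessary      */
            fetch_line();
        } else if (!present) {               /* cases 3 and 4: page here, line missing */
            if (!buffered) {                 /* case 3: also bring the page into SRAM  */
                evict_sram_row();
                copy_page_to_sram();
            }
            fetch_line();
        } else if (!buffered) {              /* case 5: delayed hit                    */
            evict_sram_row();
            copy_page_to_sram();
            serve_data();
        } else {                             /* case 6: fast hit                       */
            serve_data();
        }
    }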

3.4.2 Write Memory Request

When a write access request arrives, the cache behaviour is almost the same as for a read access. When a write miss occurs, the request is passed to the next level of memory, a cache page is allocated in both DRAM and SRAM to hold the fetched cache line, and the dirty bit is set for the page. In this sense, it is a write-allocate cache. From the perspective of the processor, this cache system applies a write-through policy, so when no unused DRAM page is available, we simply discard a DRAM page under a certain replacement policy, e.g., LRU, and nothing is written back to the next level of memory. Write-through is not the only choice, because we could apply a write-back policy instead. For write-back, we could choose to write back the whole DRAM page or just the dirty cache lines in the page. In either case, however, we would need to maintain a write-back buffer the size of one cache page, because we cannot predict how many cache lines in the page are dirty and need to be written back. This obviously incurs extra hardware cost, and the extra write-back buffer is much larger than that of a conventional cache, since it needs to hold a whole cache page rather than a single cache line. In this project we choose write-through due to its hardware simplicity. On the other hand, from the viewpoint of the cache internals, the cache subsystem applies a write-back policy between the SRAM buffer and the DRAM array. That means that when a dirty SRAM row is to be evicted under the replacement policy, the cache controller needs to first copy the victim cache

Action   Valid   Tag Match   Present   Buffered   Status
1        0       -           -         -          Miss
2        1       N           -         -          Miss
3        1       Y           N         N          Miss
4        1       Y           Y         N          Delayed Hit
5        1       Y           N         Y          Miss
6        1       Y           Y         Y          Hit

Table 1: Cache Behavior on Data Read

Action   Description
1        open a cache page in both DRAM and SRAM, evicting if necessary; fetch the cache line
2        replace the cache page in DRAM and open a page in SRAM, evicting if necessary; fetch the cache line
3        evict an SRAM page, copy the page from DRAM into SRAM, fetch the cache line
4        evict an SRAM page, copy the page from DRAM into SRAM, serve the data
5        fetch the cache line
6        serve the data directly

Table 2: Cache Action Description on Data Read

page in the SRAM buffer into the DRAM array and then allocate the victim row to a new DRAM page. With the write-back policy, the data in the DRAM array may be inconsistent with that in the SRAM buffer, and thus the data in DRAM may not be the most up to date. This has no effect on incoming data reference requests, however, since on a hit all data requests are served by the SRAM buffer. We apply the write-back policy to internal data transfers in order to reduce traffic on the wide internal data bus between SRAM and DRAM. In this inner write-back process, the second tag array serves as the backward pointer from the SRAM buffer to the DRAM array.
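A minimal sketch of this internal write-back path, reusing the tag-entry layout assumed in section 3.2, might look like the following. The routine that actually copies a page over the wide internal bus is left as an empty stub, since its timing depends on the bank implementation.

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct { uint32_t tag; bool valid, buffered; uint32_t present_bits;
                     uint8_t sram_index; } primary_tag_entry;
    typedef struct { uint32_t tag; uint32_t index; bool dirty; } second_tag_entry;

    /* Placeholder: copy one page from SRAM row `row` back into the DRAM array
     * frame identified by (tag, index) in a single wide internal transfer.    */
    static void copy_sram_row_to_dram(int row, uint32_t tag, uint32_t index)
    {
        (void)row; (void)tag; (void)index;
    }

    /* Evict one SRAM row: write it back to the DRAM array only if it is dirty,
     * then clear the buffered bit of the page it held. The second tag array is
     * the backward pointer from the SRAM row to its primary tag entry.         */
    void evict_sram_row(int row, second_tag_entry *second, primary_tag_entry *primary)
    {
        second_tag_entry *s = &second[row];
        if (s->dirty)
            copy_sram_row_to_dram(row, s->tag, s->index);
        primary[s->index].buffered = false;     /* page now lives only in DRAM */
        s->dirty = false;
    }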

4 Data Prefetching

4.1 Overview

Data prefetching predicts, based on some heuristic, what data will be referenced in the near future and prefetches it into the cache before the processor asks for it. Data prefetching can effectively hide the high cache miss penalty since it makes data ready for the processor before the miss occurs, and a variety of prefetching techniques are now widely employed in different cache designs. In the DRAM-based cache subsystem there are, intuitively, two opportunities we can exploit for data prefetching. One is to prefetch cache lines between the cache subsystem and the next level of memory, to increase the overall hit rate of the cache subsystem. The other lies inside the cache, between the small fast SRAM buffer and the DRAM arrays. Inner prefetching cannot improve the overall hit rate, since it does not prefetch anything from outside, but it can help reduce misses in SRAM and turn more cache hits into SRAM hits, thus reducing the average cache access time. In this project we take advantage of page-based prediction and prefetching between the cache and the next level of memory. We do not focus on inner prefetching, for two reasons. First, our simulations show that a very small SRAM buffer, e.g., 64K or 128K, can catch most of the cache hits to this cache subsystem. In particular, our simulation of a 1M DRAM array shows that a 64K SRAM buffer catches more than 96% of cache hits on average. That means more than 96% of the hits to the cache system fall into the SRAM buffer and less than 4% of hits are delayed hits. Prefetching another page from DRAM to SRAM may not improve performance, since there is not much room to reduce the delayed hit rate, and due to the small size of the SRAM it could cause cache pollution in SRAM that counteracts the benefits of prefetching. The other reason is that implementing prefetching from DRAM to SRAM incurs much more hardware cost than prefetching between the cache subsystem and the next level of memory, because we need to maintain a prediction table for each cache page to predict which cache line to prefetch. The SRAM holds a very limited number of pages, such as 16 cache pages, so we would only need 16 prediction tables, one for each; the DRAM array, however, is usually bigger than the SRAM by a factor of 16 or even more. Therefore, implementing inner prefetching does not make much sense as long as the miss rate in DRAM is kept low enough by prefetching between the cache subsystem and the next level of memory.

4.2 DRAM-page Based Prefetching

In DRAM-page-based prefetching, when a cache line is demand fetched from main memory into the cache, the cache system predicts another cache line that resides in the same DRAM page as the demand-fetched line and prefetches it into the cache together with the demand-fetched line. A good prediction heuristic picks the cache line most likely to be referenced in the near future. In current DRAM technology, one DRAM access cycle can be divided into three parts, in order: row access, column access, and the precharge period. A row access operation reads a whole page of data into the sense amplifiers, and consecutive pipelined column access commands are then issued to read out specific cache lines. Once an access is complete, the memory controller must precharge the DRAM bank and the associated sense amplifiers, and during the precharge period the data in the sense amplifiers is lost. The key fact making DRAM-page-based prefetching effective is that once a DRAM page is open, the incremental task of fetching an additional cache line from that page takes much less time than fetching a cache line from another DRAM page. In fact, reading another cache line from the same open page requires only about 20% of the time needed to read a cache line from a closed DRAM page.

4.3 Prediction Heuristics

When prefetching is triggered by a trigger address, different prediction schemes can be used to generate prefetching candidates. Commonly used schemes include history-based prediction such as 1st-order Markov prediction, stride prediction, and one-block-lookahead (OBL) prediction. For the DRAM-based cache subsystem, history-based prediction schemes are excluded first. This is because all cache lines within one DRAM page are evicted together when the page is evicted. Therefore, the history reference pattern does not provide useful predictions: all cache lines in the history reference pattern are already in the cache and thus cannot generate any prefetching candidate. For example, assume that for a certain DRAM page in SRAM (we only do prefetching for pages in SRAM) the past reference pattern is a b c a b, where a, b, and c are different cache lines in the page. Suppose the currently referenced cache line is b; according to the pattern we should prefetch c, since c is the cache line most likely to be referenced next. However, since all referenced cache lines are in the same DRAM page in SRAM and are always evicted together when this page is evicted from SRAM, we can be sure that cache line c is already in the page when the prediction is made, because it appears in the history reference pattern. Thus any prediction based on the history reference pattern is meaningless: all predicted cache lines are already in the cache and do not require any prefetching. In consequence, we use stride prediction and an OBL-based aggressive hole-searching prediction as the prediction schemes in this DRAM-based cache subsystem.

These two schemes are very simple and easy to implement in hardware without incurring much overhead. The first is stride prefetching. In stride prefetching, we use two shift registers to keep the last two referenced cache lines. When a trigger address is encountered, we compare the difference (stride) between the last two referenced cache lines in the shift registers with the difference between the currently referenced cache line and the last referenced cache line. If they are the same, the three references so far demonstrate a fixed stride, so it is reasonable to predict that the next referenced cache line is the current one plus this stride. For example, if the shift registers hold lines a and a + s and the currently referenced cache line is a + 2s, then the stride is s and the next cache line to be referenced is probably a + 3s. So we prefetch that cache line if it is not yet in the cache subsystem. The second scheme is hole-searching prefetching. In this scheme, when a trigger address is encountered, the presence array of this page in tag memory is searched, starting from the currently referenced cache line, to see whether there is a hole - an absent cache line - in the cache page; if so, this absent cache line is used as the prediction candidate. If the search reaches the end of the page, it wraps around to the beginning until the whole presence array has been searched. If there is no hole in the page, which means all cache lines in the page are already in the cache, then no prefetching is issued since no prediction is generated. Our simulations show that these simple schemes work quite well for all cache configurations.
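Both heuristics are simple enough to state directly in code. The sketch below assumes 32 cache lines per page and models the two per-row shift registers simply as the last two referenced line numbers; it is meant only to illustrate the logic described above, not the hardware realization.

    #include <stdint.h>

    #define LINES_PER_PAGE 32

    /* Per-SRAM-row stride state: the last two referenced line numbers (-1 = none). */
    typedef struct { int prev2, prev1; } stride_state;

    /* Stride prediction: if the last two strides agree, predict the next line.
     * Returns the predicted line number, or -1 if no constant stride is seen.   */
    int predict_stride(stride_state *st, int current)
    {
        int candidate = -1;
        if (st->prev2 >= 0 && st->prev1 >= 0 &&
            st->prev1 - st->prev2 == current - st->prev1) {
            int next = current + (current - st->prev1);
            if (next >= 0 && next < LINES_PER_PAGE)
                candidate = next;
        }
        st->prev2 = st->prev1;          /* shift the current reference in */
        st->prev1 = current;
        return candidate;
    }

    /* Hole-searching prediction: scan the presence array, starting after the
     * current line and wrapping around, for the first absent line (a "hole"). */
    int predict_hole(uint32_t present_bits, int current)
    {
        for (int k = 1; k < LINES_PER_PAGE; k++) {
            int line = (current + k) % LINES_PER_PAGE;
            if (!((present_bits >> line) & 1u))
                return line;            /* first hole found                */
        }
        return -1;                      /* no holes: page fully cached     */
    }

In the combined scheme evaluated later, the controller tries the stride predictor first and falls back to the hole search when no constant stride is observed; in either case a candidate is prefetched only if its presence bit is not already set.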

4.4 Hardware Cost

Since we only make predictions and do prefetching for the cache pages in the SRAM buffer, and the SRAM buffer is generally very small, the extra hardware overhead for prefetching can be kept much lower than in a traditional external cache. In this DRAM-based cache, each row in the SRAM buffer is associated with two shift registers to implement stride prediction. The length of the shift register is the number of cache lines contained in one cache page. For example, if the size of a cache page is P bytes, which is typical of current DRAM technology, and the cache line size is L bytes, then the length n of the shift register is

    n = P / L    (1)

The shift registers are used to implement the stride prediction scheme. For the hole-searching prediction scheme, we do not need any extra storage, since we can decide whether a hole exists in a cache page from the presence

array in the primary tag memory. Since history-based prediction schemes can no longer be applied, we do not need to allocate extra memory to record the past reference pattern of each cache line. In contrast, a conventional prefetching cache requires a prediction table cache (PTC) built from fast and expensive SRAM and associated with the prefetch controller. For a conventional cache of size 256KB, the PTC is usually 32K, holding a reference trace for each cache line in the cache. The DRAM-based cache does not require any extra memory except a limited number of shift registers. Therefore, the DRAM-based cache has much lower hardware overhead than a conventional prefetching cache.
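As a worked instance of equation (1) for the configuration simulated in the next section: with P = 4096 bytes and L = 128 bytes,

    n = P / L = 4096 / 128 = 32 cache lines per page,

so each buffered page needs two shift registers of this length, and with a buffer of around 16 cache pages (the figure mentioned in section 4.1) the prediction state stays a small, fixed amount of hardware.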

5 Simulation Results

In this section we first present the evaluation methodology and simulation environment, and then present the simulation results for the DRAM-based cache with different configurations and prefetching schemes.

5.1 Methodology

To evaluate the performance of the DRAM-based cache, we build an event-driven cache simulator based on the detailed execution-driven out-of-order processor simulator in SimpleScalar. We also model the main memory behavior precisely using event-driven simulation. The processor we simulate is a four-way issue superscalar processor with speculative out-of-order execution. It has 32 register update units (RUU) and a 16-entry load/store queue (LSQ). The processor has separate on-chip 16K

Parameter                                                                    Cycles
SRAM latency in cache                                                        6
DRAM latency in cache                                                        12
latency to start a memory transaction                                        60
time to send row address                                                     10
latency between accepting the row address and accepting the column address  20
latency between accepting the row address and sending back data             50
time to send all column addresses                                            80
time to send first chunk of data back                                        10
time to send data back                                                       80
time to precharge before accessing another page                              30
latency to finish a memory transaction                                       60

Table 3: Parameters of the DRAM Cache and Main Memory

direct-mapped data and instruction L1 caches with a cache line size of 32 bytes. The default processor clock frequency is 1GHz. The unified DRAM-based cache system is pipelined so it can handle one request each processor cycle. The data L1 cache applies a write-allocate, write-back policy with LRU replacement. For main memory, there are three sorts of transactions: read one cache line, write one cache line, and read two cache lines in one DRAM page. Table 3 shows the parameters of the DRAM-based cache and main memory. The round-trip memory latency is thus 190 cycles (latency to start a memory transaction + time to send row address + latency between accepting the row address and sending back data + time to send first chunk of data back + latency to finish a memory transaction).
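Plugging the Table 3 values into this breakdown confirms the total:

    60 + 10 + 50 + 10 + 60 = 190 cycles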

The benchmarks we use are drawn from the SPEC95 and SPEC2000 suites and include both integer and floating-point programs.

Figure 8: Selected Benchmarks

The baseline cache architecture we use is a conventional 256K four-way associative

L2 cache with a block size of 64 bytes. Since we would like to measure the DRAM cache on the high-miss-rate benchmarks in SPEC95 and SPEC2000, we run simulations on all the benchmarks we have and from these choose the following ten benchmarks, which have L2 cache miss rates higher than 5%. Figure 8 shows the hit rates of the selected integer and FP benchmarks. The selected benchmarks are drawn from both SPEC95 and SPEC2000, and since integer benchmarks generally have lower miss rates than FP benchmarks, only two of the ten chosen programs are integer benchmarks (mcf and twolf). The average L2 cache hit rate for these ten benchmarks is 71%.

5.2 DRAM-based Cache Subsystem

When using cheap DRAM arrays as the back-end storage for cached data, we can build a very large on-chip L2 cache. In the first set of experiments we explore what impact a large back-end storage has on system performance. We build an internal on-chip DRAM-based prefetching cache simulator in which the DRAM arrays are modeled as a four-bank, four-way associative cache. The DRAM arrays are equally distributed across the four banks. Each bank is associated with a fully associative SRAM buffer of 64 KBytes, so the total size of SRAM in the cache is 256 KBytes. The cache page size is 4096 bytes and the cache line size is 128 bytes, so each cache page can hold 32 cache lines. We also implement the page-based prediction and prefetching mechanisms for this cache. Figure 9 shows the performance comparison between the conventional cache and the DRAM-based cache. The prefetching results cover both the stride prediction and the hole-searching prediction; actually, there is not much performance difference between stride prediction and hole-searching prediction, as we describe in the next subsection.

Figure 9: Selected Benchmarks

The IPC shown in the figure is the average over all

selected benchmarks, normalized to the baseline architecture. From the figure we can first see that using the DRAM-based cache increases performance by about 120% for the 1MB, 2MB, 4MB, and 8MB DRAM arrays, and by more than 150% for the 16M and 64M DRAM arrays. The average performance gain over all DRAM cache sizes is 171%. With the prefetching mechanisms, system performance increases by another 17% on average. We can also see that when we increase the DRAM size from 1M to 8M, overall performance is not improved much for either the non-prefetching or the prefetching DRAM-based cache. When the DRAM size increases to 16MB, system performance increases by about 27% for the non-prefetching cache and about 18% for the prefetching cache. However, when we continue increasing the DRAM size from 16MB to 64MB, the performance does not increase in proportion to the size. Therefore, it is reasonable to conclude that DRAM-based caches with 1MB

or 16MB DRAM arrays and only a 256KB SRAM buffer should be cost-effective choices for most benchmarks, depending on the expected performance gain. Figure 10 shows the performance improvement of all selected benchmarks on the 16M DRAM cache with a 256K SRAM buffer. On average the benchmark programs benefit from the DRAM-based cache and from prefetching, but prefetching does not help two of the benchmarks; for one of them this can be explained by its low prefetch accuracy. Figure 11 shows the prefetch accuracy of the hole-searching prediction and the stride prediction for all benchmarks. Prefetch accuracy means the percentage of prefetched cache lines that are actually referenced before being evicted from the SRAM buffer; only a prefetched cache line referenced while in SRAM counts as an accurate prefetch. It is easy to see that the prefetch accuracy of that benchmark is extremely low: 1.7% for both prediction schemes. This is because of its frequent cross-page references: due to the limited size of the SRAM buffer, most referenced pages are evicted from the SRAM buffer before another reference falls into the page. Low prediction accuracy may reduce system performance due to cache pollution. For the other benchmarks, the difference in prefetch accuracy between the two prediction schemes is within 5%.

5.3 Fast SRAM Buffer

We also explore the performance impact of different SRAM buffer sizes. Obviously, the more SRAM buffer the DRAM-based cache uses, the more cache hits are caught by the SRAM buffer, which increases system performance since data in SRAM can be served immediately.

Figure 10: 16M DRAM w/ 256K SRAM

Figure 12 shows the delayed hit rate comparison for the 1M four-bank, four-way associative DRAM-based cache system with different SRAM buffer sizes. Here the delayed hit rate is the average over

all selected benchmarks we use in simulation. The cache page size is 4KB and the cache line size is 128 bytes. We simulate SRAM buffer sizes from 16KB to 256KB. At the low end, the SRAM buffer is only 16KB, and each of the four banks of DRAM arrays has only 4K of SRAM buffer associated with it. This is the size of one cache page, so in this case there is only one cache page buffered in SRAM at any time. In Figure 12 the delayed hit rate is defined as follows:

    DelayedHitRate = DelayedHits / (FastHits + DelayedHits)

From Figure 12 we can see that when the SRAM buffer is 64KB or more, the delayed hit rate is reduced to less than 4%. In particular, the delayed hit rate is only 3.31% for 64KB of SRAM, and it decreases to 1.68% when we increase the size to 256KB. However, if the SRAM size is less than 64KB, the delayed hit rate jumps to more than 10%. In the case of 32KB SRAM, which means the SRAM buffer associated with

each bank can only hold two cache pages, the delayed hit rate is 11.9%, and if each bank of DRAM is associated with an SRAM buffer holding only one cache page, the delayed hit rate increases to 32.7%. Thus we can see that a 64KB SRAM buffer is enough to catch more than 96% of the cache hits from the L1 cache, and less than 4% of hits require a cache page transfer from the DRAM arrays to the SRAM buffer.

Figure 11: Prefetching Accuracy

In the 64K case, each bank of DRAM arrays is associated with a 16KB SRAM buffer that can only hold 4 cache pages, assuming a 4KB cache page, yet the DRAM-based cache with this small 64KB SRAM buffer works quite well. Figure 13 shows the performance comparison between different SRAM buffer sizes. In Figure 13 the performance impact of the SRAM buffer size is apparent. Using a 64KB SRAM buffer, the IPC normalized to the baseline architecture is 2.18, and it improves by less than 1%, to around 2.20, if we use 256KB of SRAM. On the other hand, performance drops noticeably when we use an SRAM buffer smaller than 64KB, and there is only a trivial performance difference between the 32KB and 16KB SRAM buffers.

Figure 12: Delayed Hit Rate

5.4 Prediction Schemes Comparison

Because the history-based prediction schemes no longer work in the DRAM-based cache subsystem, we simulate two other prediction schemes: one aggressively searches for holes in the cache page, and the other catches a constant stride between the past two consecutive references and uses this stride to make the prediction. In the second case, if there is no such stride in the past references, we fall back to the hole-searching prediction scheme to generate the prefetching candidate. From Figures 9 and 10 it is easy to see that there is no apparent performance difference between these two prediction schemes: using the stride prediction scheme does not provide much performance improvement over simply looking for holes in the cache page.

Figure 13: Normalized IPC

To explain this, we define the rate of stride prediction as

    StridePredictionRate = (prefetching candidates generated by stride prediction) / (all prefetching candidates)    (2)

Figure 14 shows the stride prediction rate of all ten benchmarks for the 16M 4-bank, 4-way associative DRAM-based cache with a 256K SRAM buffer. From Figure 14 we can see that, on average, stride prediction accounts for less than 1% of all prefetching candidates; the benchmark with the highest stride prediction rate reaches only 4.4%. That is because, for each cache reference, no matter whether it is a cache hit or a cache miss, hole-searching prediction will generate a prefetching candidate whenever there is a hole in the cache page and there is no constant stride among the current and past two references. This differs from a conventional cache, in which the one-block-lookahead prediction scheme only probes a limited number of times and does not issue a prefetch to the next level of memory if the candidate is already in the cache, while in the DRAM-based cache a prefetch will be

issued whenever there is any hole in the cache page. Since stride-predicted candidates account for less than 1% of all candidates, it is reasonable that there is not much performance difference between the two prediction schemes we use.

Figure 14: Stride Prediction Rate

6 Conclusion and Future Work

With the widening speed gap between fast modern processors and the relatively slow memory system, it is more important than ever to reduce the miss rate of the cache system, because of the long miss penalty. Cost and energy constraints limit the use of large on-chip or off-chip cache subsystems. In this project we propose a DRAM-based cache design that uses cheap and slow DRAM as cost-effective storage for cached data. To speed up data access, we use a small but fast SRAM buffer associated with each bank of DRAM arrays. Our simulation results show that most accessed data can be served by

the SRAM buffer rather than the DRAM arrays. This cache subsystem presents the same interface to both the processor and the next level of memory as any conventional cache. In addition, we employ DRAM-page-based data prefetching between this cache and main memory to further improve system performance; the prefetching schemes are simple and require very limited extra storage, so they are easy to integrate into the cache controller. Our simulation results indicate that this design is practical and quite effective. As for future work, in this project we only simulate an internal on-chip level-two DRAM-based cache, but this design can also be used in external cache systems. The current benchmark suites we use show that only 1M of cache is enough to provide a high hit rate for most benchmarks. Therefore, the performance improvement brought by prefetching and by even larger DRAM-based caches, such as 16M or 64M, is not apparent with the current benchmark suites. We plan to run simulations on more benchmarks with bigger working sets than the current programs to gain more insight into this design. We also want to explore the opportunity for inner prefetching within the cache, that is, prefetching cache pages from the DRAM arrays to the SRAM buffers to reduce misses in SRAM. Obviously, inner prefetching only makes sense when the miss rate in SRAM is high, but the benchmarks we currently have do not show such characteristics in simulation.

References

[1] T. Alexander. A Distributed Predictive Cache for High Performance Computer Systems. PhD thesis, Department of Computer Science, Duke University, 1995.

[2] T. Alexander and G. Kedem. Distributed predictive cache design for high performance memory system. In Proceedings of the Second International Symposium on High-Performance Computer Architecture, pages 254-263, 1996.

[3] J. L. Baer and T. F. Chen. An effective on-chip pre-loading scheme to reduce data access penalty. In Supercomputing, 1991.

[4] Dave Bursky. Fast DRAMs can be swapped for SRAM caches. Electronic Design, pages 55-56, 60-67, July 1993.

[5] Tien-Fu Chen. Data Prefetching for High Performance Processors. PhD thesis, Computer Science and Engineering Department, University of Washington, 1993.

[6] Tien-Fu Chen. A performance study of software and hardware data prefetching schemes. In Proceedings of the 21st International Symposium on Computer Architecture, pages 223-232, 1994.

[7] J. Pomerene et al. Prefetching system for a cache having a second directory for sequentially accessed blocks. US Patent 4,807,110, 1989.

[8] Charles Hart. CDRAM in a unified memory architecture. In Proceedings of Spring CompCon, pages 261-266, 1994.

[9] I. Naritake, T. Sugibayashi, Y. Nakajima, S. Utsugi, and M. Hamada. A 12ns 8Mb DRAM secondary cache for a 64b microprocessor. IEEE Journal of Solid-State Circuits, 35(8):1153-1157, 2000.

[10] Doug Joseph and Dirk Grunwald. Prefetching using Markov predictors. In Proceedings of the 24th International Symposium on Computer Architecture, pages 252-263, 1997.

[11] Norman P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th International Symposium on Computer Architecture, pages 364-373, 1990.

[12] G. Kedem and Ram P. Koganti. WCDRAM: A fully associative integrated cached-DRAM with wide cache lines. In Proceedings of the 11th Annual International Symposium on High Performance Computing Systems, July 1997.

[13] Ram P. Koganti. WCDRAM: A fully associative integrated cached-DRAM with wide cache lines. Master's thesis, Department of Computer Science, Duke University, 1997.

[14] Ray Ng. Fast computer memories. IEEE Spectrum, pages 36-39, October 1992.

[15] Richard E. Kessler, Richard Jooss, Alvin R. Lebeck, and Mark D. Hill. Inexpensive implementations of set-associativity. In ACM/IEEE International Symposium on Computer Architecture, 1989.

[16] Alan J. Smith. Cache memories. Computing Surveys, 14(3), September 1982.

[17] W.-F. Lin, S. K. Reinhardt, and D. Burger. Designing a modern memory hierarchy with hardware prefetching. IEEE Transactions on Computers, 50(11):1202-1218, November 2001.

[18] W. A. Wong and J.-L. Baer. DRAM caching. Technical Report 97-03-04, Department of Computer Science and Engineering, University of Washington, 1997.

[19] Haifeng Yu and Gershon Kedem. DRAM-page based prediction and prefetching. In Proceedings of the International Conference on Computer Design, September 2000.
