
Programmable DDRx Controllers

Mahdi Nazm Bojnordi and Engin Ipek
University of Rochester

IEEE Micro, May/June 2013 (Top Picks). Published by the IEEE Computer Society. 0272-1732/13/$31.00 © 2013 IEEE

Making modern memory controllers programmable improves their versatility and efficiency. However, the stringent latency and throughput requirements of modern DDRx (double data rate memory interface technology) devices have rendered such programmability largely impractical, confining DDRx controllers to fixed-function hardware. Pardis is the first programmable memory controller that can meet these challenges and thus satisfy the performance requirements of a high-speed DDRx interface.

The off-chip memory subsystem is a significant performance, power, and quality-of-service (QoS) bottleneck in modern computers, necessitating a high-performance memory controller that can overcome DRAM (dynamic random-access memory) timing and resource constraints by orchestrating data movement between the processor and main memory. Contemporary DDRx memory controllers implement sophisticated address mapping, command scheduling, power management, and refresh algorithms to maximize system throughput and minimize DRAM energy, while ensuring that system-level QoS targets and real-time deadlines are met. The conflicting requirements imposed by this multiobjective optimization, compounded by diversity in both workload and memory system characteristics, make high-performance memory controller design a significant challenge.

A promising way of improving the versatility and efficiency of a memory controller is to make it programmable, a proven technique that has seen wide use in other control tasks ranging from direct memory access (DMA) scheduling [1,2] to NAND flash and directory control [3-9]. In these and other architectural control problems, programmability allows the processor designers to customize the controller on the basis of system requirements and performance objectives, perform in-field firmware updates to the controller, and set up application-specific control policies. Unfortunately, the stringent latency and throughput requirements of modern DDRx devices have rendered such programmability largely impractical. As a result, contemporary memory controllers are invariably confined to implementing DRAM control policies in hardwired, fixed-function hardware blocks.

Pardis (programmable architecture for the DDRx interfacing standards) is the first programmable memory controller that provides sufficiently high performance to make the firmware implementation of DDRx control policies practical [10]. Pardis divides the tasks associated with high-performance DRAM control among a request processor, a transaction processor, and dedicated command logic. The request and transaction processors each have a domain-specific instruction set architecture (ISA) for accelerating common request and memory transaction processing tasks, respectively. Pardis enforces the correctness of the derived schedule in hardware through dedicated command logic, which inspects (and, if necessary, stalls) each DDRx command to DRAM to ensure that all DDRx timing constraints are met. This separation between performance optimization and correctness allows the firmware to dedicate request and transaction processor resources exclusively to optimizing performance and QoS, without expending limited compute cycles on verifying the derived schedule's correctness.

Organization of DRAM systems

Modern DRAM systems are organized into a hierarchy of channels, ranks, banks, rows, and columns to exploit locality and request-level parallelism. Contemporary high-performance microprocessors commonly integrate two to four independent memory controllers, each with a dedicated DDRx channel. Each channel consists of multiple ranks that can be accessed in parallel, and each rank comprises multiple banks organized as rows by columns, sharing common data and address buses. A set of timing constraints dictates the minimum delay between each pair of commands issued to the memory system; maintaining high throughput and low latency necessitates a sophisticated memory controller that can correctly schedule requests around these timing constraints.

A typical DDRx memory controller receives a request stream consisting of reads and writes from the cache subsystem, and generates a corresponding DRAM command stream. Every read or write request requires accessing multiple columns of a row within the DRAM system. A row must be loaded into a row buffer by an activate command prior to a column access. Consecutive accesses to the same row, called row hits, enjoy the lowest access latency; however, a row miss necessitates issuing a precharge command to precharge the bitlines within the memory array, and then loading a new row to the row buffer using an activate command.
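The row-hit and row-miss cases described above can be illustrated with a minimal Python sketch. It models a single bank and a single column access per request; the function name `commands_for_access` and the treatment of an already-precharged bank (`open_row is None`) are assumptions made for illustration and do not come from the article.

```python
from typing import List, Optional


def commands_for_access(open_row: Optional[int], target_row: int,
                        is_write: bool) -> List[str]:
    """Return the DRAM commands needed to access target_row in one bank.

    open_row is the row currently held in the bank's row buffer, or None if
    the bank is already precharged (that idle case is an assumption here).
    """
    column_command = "WRITE" if is_write else "READ"
    if open_row == target_row:
        # Row hit: the row is already in the row buffer.
        return [column_command]
    if open_row is None:
        # Precharged bank: only an activate is needed before the column access.
        return ["ACTIVATE", column_command]
    # Row miss: precharge the bitlines, then load the new row.
    return ["PRECHARGE", "ACTIVATE", column_command]


if __name__ == "__main__":
    print(commands_for_access(open_row=7, target_row=7, is_write=False))
    print(commands_for_access(open_row=7, target_row=9, is_write=True))
```

A row hit needs one command where a row miss needs three, which is why schedulers that favor row hits tend to lower average access latency.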

Pardis overview

Figure 1 shows an example computer system comprising a multicore processor with Pardis, interfaced to off-chip DRAM over a third-generation double data rate (DDR3) memory channel. Pardis receives read and write requests from the last-level cache controller via a first-in, first-out (FIFO) queue, called the request queue, and generates DDR3 commands to orchestrate data movement between the processor and main memory using three tightly coupled processing elements.

Request processor

The request processor dequeues the next request from the head of the request queue, generates a set of DRAM coordinates (channel, rank, bank, row, and column IDs) for the requested address, and enqueues a new DDRx transaction with the generated coordinates in a transaction queue. Hence, the request processor represents the first level of translation, from requests to memory transactions, in Pardis, and is primarily responsible for DRAM address mapping.
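As a rough illustration of this address-mapping step, the Python sketch below slices a block address into DRAM coordinates. The field order and bit widths in `FIELD_BITS` are hypothetical; the article does not specify a layout, and on Pardis the mapping would be written in request processor firmware rather than Python.

```python
from typing import NamedTuple


class Coordinates(NamedTuple):
    channel: int
    rank: int
    bank: int
    row: int
    column: int


# Hypothetical field widths in bits, listed from least to most significant.
FIELD_BITS = (("column", 10), ("channel", 1), ("bank", 3),
              ("rank", 1), ("row", 14))


def map_address(block_address: int) -> Coordinates:
    """Split an address into DRAM coordinates by peeling off bit fields."""
    fields = {}
    for name, bits in FIELD_BITS:
        fields[name] = block_address & ((1 << bits) - 1)  # keep the low bits
        block_address >>= bits                            # move to next field
    return Coordinates(**fields)


if __name__ == "__main__":
    address = (3 << 15) | (1 << 14) | (5 << 11) | (1 << 10) | 7
    print(map_address(address))
```

Changing the order of the entries in `FIELD_BITS` changes which address bits select the bank or channel, which is the kind of policy decision a programmable request processor leaves to firmware.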

Transaction processor

The transaction processor tracks each memory transaction's resource needs and timing constraints and uses this information to emit a sequence of DDRx commands that achieves performance, energy, and QoS goals. Therefore, the transaction processor is primarily in charge of DRAM command scheduling and tasks such as DRAM refresh and power management. The end result of transaction processing is a sequence of commands that are enqueued at a FIFO command queue.

Figure 1. Example of Pardis in a computer system. Pardis receives read and write requests and generates DDRx commands to help move data between the processor and main memory. Blocks shown: processor, L2 cache, Pardis (request processor, transaction processor, SRAM, command logic, and the request, transaction, command, and data queues), DDR3 interface, and DRAM. (DDR3: third-generation double data rate; SRAM: static RAM.)

Command logic

The command logic inspects the generated command stream, checks (and, if necessary, stalls) the command at the head of the command queue to ensure all DDRx timing constraints are met, and synchronizes the issue of each command with the DDRx clock. The command logic is not programmable through an ISA; nevertheless, it provides configurable control registers specifying the value of each DDRx timing constraint, thereby making it possible to interface Pardis to different DDRx systems. The command logic enforces all timing constraints and guarantees the timing correctness of the scheduled command stream, making it possible to separate timing correctness from performance optimization.

Pardis architecture

Programming Pardis involves writing code for the request and transaction processors and configuring the control registers specifying DDRx timing constraints to the command logic. Together, the request and transaction processors provide the programmer with seven data types, 44 instructions, and three instruction flags (see Figure 2).

Request processor

The request processor is a 16-bit reduced-instruction-set computing (RISC) architecture with separate instruction and data memories; it provides specialized data types, storage structures, and instructions for address manipulation. The request processor's ISA supports two data types: an unsigned integer and a request. Programmer-visible storage structures within the request processor include the architectural registers, the data memory, and the request queue. The request processor supports fourteen 32-bit instructions of four different types: arithmetic logic unit (ALU), control flow, memory access, and queue access. Queue access instructions provide a mechanism for dequeuing requests from the request queue and enqueuing transactions at the transaction queue. After a request is dequeued from the request queue, its fields are available for processing in the register file.

Transaction processor

The transaction processor implements a 16-bit RISC ISA with split instruction and data memories; due to the computational intensity of the tasks it supports (for example, command scheduling), the transaction processor's ISA is more powerful than that of the request processor. The transaction processor defines two new data types, called a transaction and a command. A transaction uses two key fields, the fixed and variable keys, for performing associative lookups on the outstanding transactions in the transaction queue. For example, it is possible to search the fixed-key fields of all outstanding transactions to identify those transactions that occurred due to cache-missing loads. The fixed key is written by the request processor, and is read-only and searchable within the transaction processor. The variable key reflects the state of a transaction based on timing constraints, resource availability, and the state of the DRAM system. The variable key makes it possible, for example, to search for all transactions whose next command is a precharge to a specific bank.

Figure 2. Data types and instructions supported by Pardis. The request and transaction processors provide seven data types, 44 instructions, and three instruction flags. (ALU: arithmetic logic unit.)

Data types shown in the figure:

Memory request (fields: address, metadata)
  Metadata: read/write, data/instruction access, load miss flag, thread ID, prefetch flag, and application-defined flags.

Memory transaction (fields: address, key; the key has a fixed part and a variable part)
  Fixed key: read-only flags, such as read/write.
  Variable key: writable and status flags, such as the valid flag.
  Hardware managed: six flags maintained by hardware on a cycle-by-cycle basis; for example, ready flags that indicate whether the next command of a transaction can be issued in the next cycle without violating any DRAM constraints.
  Software managed: eight flags fully accessible by the programmer to mark any transaction based on user-defined criteria; for example, for marking all transactions with a ready activate command.

Memory command (fields: address, type)
  Type: valid, read, write, activate, precharge, power up, power down, refresh, sleep.

Request processor instructions:
  ALU:           ADD, SUB, SLL, SRL, AND, OR, XOR, NOT
  Control flow:  JMP, BEQ, BNEQ, BTQE
  Data memory:   LOAD, STORE
  Queue access:  any instruction annotated with -R or -T

Transaction processor instructions:
  ALU:           ADD, SUB, MIN, MAX, SLL, SRL, AND, OR, XOR, NOT
  Control flow:  JMP, JR, RETI, BLT, BLSG, BMSK, BEQ, BNEQ, BTQE, BCQE
  Data memory:   LOAD, STORE
  Queue access:  LTQ, CTQ, UTQ, SRT, LCQ, ICQ, any instruction annotated with -C
  Interrupt:     MFSR, SIC

The transaction processor provides 30 instructions comprising ALU, control flow, memory access, interrupt processing, and queue access operations. It provides 64 programmable counters for capturing processor and queue states (for example, the number of commands issued to the command queue). Each counter counts up and fires an interrupt when it reaches a preprogrammed threshold. The programmer can use the transaction processor to search for a given transaction by matching against fixed and variable keys among all valid transactions in the transaction queue; in the case of multiple matches, the transaction processor gives priority to the oldest matching transaction. A search operation requires two register operands specifying the fixed and variable keys. After a search, the transaction processor typically either

• loads a matching transaction into the architectural registers,

• updates a transaction in the queue with the contents of architectural registers, or

• counts the number of matches for a pair of fixed and variable keys.
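The search-then-load, update, or count behavior just described can be modeled in software as follows. This is only a behavioral sketch: the real transaction queue is a hardware structure, and the class names, the exact-match comparison on both keys, and the simplified `update` (which rewrites only the variable key) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Transaction:
    address: int
    fixed_key: int      # written when the transaction is enqueued
    variable_key: int   # reflects the transaction's current state
    valid: bool = True


class TransactionQueue:
    def __init__(self) -> None:
        self.entries: List[Transaction] = []  # index 0 holds the oldest entry

    def enqueue(self, transaction: Transaction) -> None:
        self.entries.append(transaction)

    def _matches(self, fixed_key: int, variable_key: int) -> List[Transaction]:
        # Assumption: a match means both keys are equal on a valid entry.
        return [t for t in self.entries
                if t.valid and t.fixed_key == fixed_key
                and t.variable_key == variable_key]

    def load(self, fixed_key: int, variable_key: int) -> Optional[Transaction]:
        """Return the oldest matching transaction, or None if nothing matches."""
        matches = self._matches(fixed_key, variable_key)
        return matches[0] if matches else None

    def update(self, fixed_key: int, variable_key: int,
               new_variable_key: int) -> bool:
        """Rewrite the variable key of the oldest match; report success."""
        target = self.load(fixed_key, variable_key)
        if target is None:
            return False
        target.variable_key = new_variable_key
        return True

    def count(self, fixed_key: int, variable_key: int) -> int:
        """Count valid transactions matching the given pair of keys."""
        return len(self._matches(fixed_key, variable_key))
```

Keeping the entries in arrival order makes the oldest-first priority rule fall out of taking the first match.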

Eventually, Pardis creates a DDRx command sequence for each transaction in the transaction processor and enqueues them in the command queue. The transaction processor allows the programmer to issue a legal command to the command queue using a dedicated instruction or an instruction flag. In addition to precharge, activate, read, and write commands, the firmware can also issue predefined control commands to control the command queue. For example, it can use a sleep command to throttle the DRAM system for active power management. Other DRAM maintenance commands allow changing DRAM power states and issuing a refresh to DRAM. By relying on dedicated command logic to stall each command until it is free of all timing constraints, Pardis lets the programmer write firmware code for the DDRx DRAM system without expending limited compute cycles on ensuring that all timing constraints are met.

Implementation

This article builds upon our ISCA 2012 paper and examines a scalar, pipelined implementation of Pardis as depicted in Figure 3 [10]. The proposed implementation follows a six-step procedure for processing an incoming DRAM request, ultimately generating the corresponding DRAM command stream. First, Pardis assigns a unique request ID (URID) to a new DRAM request before it is enqueued at the FIFO request queue (1); the URID accompanies the request throughout the pipeline, and is used to associate the request with commands and DRAM data blocks. After a request is processed and its DRAM coordinates are assigned, a new transaction for the request is enqueued at the transaction queue (2). At the time the transaction is enqueued, the fixed key of the transaction is initialized to the request type, while the variable key is initialized based on the current state of the DRAM subsystem. A queued transaction is prioritized based on fixed and variable keys (3), after which the processor issues the next command of the transaction to the command queue (4). The command logic processes commands that are available in the command queue in FIFO order (5). A DRAM command is dequeued when it is ready to appear on the DDRx command bus (6), and is issued to the DRAM subsystem at the next rising edge of the DRAM clock.

Request processor

The request processor implements a five-stage pipeline with a read interface to the request queue and a write interface to the transaction queue. In the first stage, the processor fetches an instruction from the instruction memory. The request processor predicts that all branches are taken, so when it mispredicts a branch, it nullifies the wrong-path instruction. In the second stage, the processor decodes the fetched instruction to extract control signals, reads operands from the register file, and dequeues the next request from the request queue if the instruction is annotated with a request flag (R-flag). If a request must be dequeued but the request queue is empty, the request processor stalls the decode and fetch stages until a new request arrives at the request queue. (Instructions in later pipeline stages continue uninterrupted.) Request registers (R1 through R4) can be written only from the request queue side (on a dequeue), and are read-only to the request processor. In the third stage, a simple 16-bit ALU executes the desired ALU operation or computes the effective address if the instruction is a load or a store. Loads and stores access the data memory in the fourth stage. In the final stage, the result of every instruction is written back to the register file, and if the transaction flag (T-flag) of the instruction is set, a new transaction is enqueued at the transaction queue.

Figure 3. Example of the proposed Pardis implementation. This implementation follows a six-step procedure for processing an incoming DRAM request, ultimately generating the corresponding DRAM command stream. Blocks shown: request queue (fed from the processor); request processor (IF, ID, EX, MEM, and WB stages with instruction memory, register file, ALU, and data memory); transaction queue; transaction processor (IF, ID, EX, MEM, and WB stages with instruction memory, Brn pred, GP reg file, SP regs, ALU, and data memory); command queue; command logic (state counters and timing table); and the DDRx bus. The numbers 1 through 6 in the figure mark the six processing steps. (IF: instruction fetch; ID: instruction decode; EX: instruction execute; MEM: memory access; WB: writeback; DDRx bus: double data rate bus; Brn pred: branch prediction; SP regs: special-purpose registers; GP reg file: general-purpose register file; ALU: arithmetic logic unit.)

Transaction processor

The transaction processor is a 16-bit, five-stage, pipelined processor. In the first stage, the processor fetches the next instruction from a 64-Kbyte instruction memory. The implementation divides branch and jump instructions into two categories: fast and slow. Fast branches include jump and branch on queue status instructions such as "branch if the transaction queue is empty" (BTQE) and "branch if the command queue is empty" (BCQE), for which the next instruction can be determined in the fetch stage; as such, these branches are not predicted and incur no performance losses due to branch mispredictions. Slow branches depend on register contents and are predicted by an 8-Kbyte-entry g-share branch predictor. Critical branches in the transaction processor are usually coded using the fast branch instructions (for example, infinite scheduling loops, or queue state checking).
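For readers unfamiliar with g-share prediction, the following Python sketch shows the general technique: the branch address is XORed with a global history register to index a table of 2-bit saturating counters. It is a textbook-style model, not the Pardis hardware; the 13 index bits (8,192 entries), the initial counter value, and the class name `GsharePredictor` are assumptions.

```python
class GsharePredictor:
    """Textbook-style gshare model with 2-bit saturating counters."""

    def __init__(self, index_bits: int = 13) -> None:
        self.mask = (1 << index_bits) - 1
        # Counter values 0-1 predict not taken, 2-3 predict taken.
        self.table = [1] * (1 << index_bits)  # start at weakly not taken
        self.history = 0                      # global branch history register

    def _index(self, pc: int) -> int:
        # gshare: XOR the branch address with the global history.
        return (pc ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

Queue-status branches such as BTQE and BCQE bypass this kind of predictor entirely, since their outcome is known in the fetch stage.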

In the second stage, the processor decodes the instruction, reads general- and special-purpose registers, and sets special-purpose interrupt registers. Special-purpose registers are implemented using a 64-entry array of programmable counters. The proposed implementation of Pardis uses 32 of these programmable counters (S0 through S31) for timer interrupts, and the remaining 32 programmable counters (S32 through S63) for collecting statistics to aid in decision making.
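A programmable counter of the kind used for timer interrupts can be modeled with a few lines of Python. The article states only that a counter counts up and fires an interrupt at a preprogrammed threshold; the class name, the `tick` method, and the reset-to-zero behavior after firing are assumptions of this sketch.

```python
class ProgrammableCounter:
    """Counts events and signals an interrupt at a programmed threshold."""

    def __init__(self, threshold: int) -> None:
        if threshold <= 0:
            raise ValueError("threshold must be positive")
        self.threshold = threshold
        self.value = 0

    def tick(self) -> bool:
        """Count one event; return True when the interrupt fires."""
        self.value += 1
        if self.value >= self.threshold:
            self.value = 0  # assumption: the counter restarts after firing
            return True
        return False


if __name__ == "__main__":
    timer = ProgrammableCounter(threshold=3)
    print([timer.tick() for _ in range(7)])
```

Firmware could, for example, program one such counter per rank as a periodic timer and handle the resulting interrupt in a routine that ends with RETI.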

After decode, in the third stage, a 16-bit ALU performs arithmetic and logic operations; the transaction queue is accessed in parallel. Command queue and data memory accesses occur in the fourth stage, and the processor writes the result of the instruction back to the register file in the fifth stage.

Command logic

The command logic implementation uses masking and timing tables initialized at boot time based on DDRx parameters, plus a dedicated down counter for each DRAM timing constraint imposed by the DDRx standard. During each DRAM cycle, the command logic inspects the command at the head of the command queue, and retrieves a bit mask from the masking table to mask out timing constraints that are irrelevant to the command under consideration, such as column address latency (tCAS) in the case of a precharge. The remaining unmasked timers are used to generate a ready signal indicating whether the command is ready to be issued to the DRAM subsystem at the next rising edge of the DRAM clock.
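The interplay of masking table, timing table, and down counters can be sketched as a small cycle-level model. The sketch below covers one bank and only three constraints (tRCD, tRP, tRAS); the cycle values in `TIMING`, the contents of `GATED_BY` and `STARTS`, and all names are hypothetical stand-ins for tables that the real hardware initializes at boot from DDRx parameters.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple

# Hypothetical timing table: constraint name -> delay in DRAM cycles.
TIMING = {"tRCD": 3, "tRP": 3, "tRAS": 7}
# Stand-in for the masking table: which timers gate each command.
GATED_BY = {"ACTIVATE": ("tRP",), "READ": ("tRCD",),
            "WRITE": ("tRCD",), "PRECHARGE": ("tRAS",)}
# Which down counters a command starts when it is issued.
STARTS = {"ACTIVATE": ("tRCD", "tRAS"), "PRECHARGE": ("tRP",),
          "READ": (), "WRITE": ()}


class CommandLogic:
    def __init__(self) -> None:
        self.counters = {name: 0 for name in TIMING}  # one down counter each

    def cycle(self, queue: Deque[str]) -> Optional[str]:
        """Advance one DRAM cycle; issue the head command or stall."""
        for name in TIMING:
            if self.counters[name] > 0:
                self.counters[name] -= 1
        if not queue:
            return None
        head = queue[0]
        # Only the unmasked (relevant) timers contribute to the ready signal.
        if any(self.counters[name] > 0 for name in GATED_BY[head]):
            return None  # stall: a relevant constraint is still pending
        queue.popleft()
        for name in STARTS[head]:
            self.counters[name] = TIMING[name]
        return head


def run(commands: List[str], cycles: int) -> List[Tuple[int, str]]:
    """Feed commands through the model and record (cycle, command) issues."""
    logic, queue, trace = CommandLogic(), deque(commands), []
    for cycle in range(cycles):
        issued = logic.cycle(queue)
        if issued is not None:
            trace.append((cycle, issued))
    return trace


if __name__ == "__main__":
    print(run(["ACTIVATE", "READ", "PRECHARGE", "ACTIVATE"], 12))
```

Because the check happens here, the code that produced the command order never has to reason about tRCD or tRP; a command that arrives too early simply waits at the head of the queue.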

Evaluation highlights

We evaluate the performance potential of Pardis by comparing fixed-function hardware and Pardis-based firmware implementations of the first-come, first-served (FCFS) [11], first-ready, first-come, first-served (FR-FCFS) [11], parallelism-aware batch scheduler (Par-BS) [12], and thread cluster memory scheduling (TCMS) [13] algorithms. We also implement in firmware a recent DRAM power management algorithm proposed by Hur and Lin [14] and compare both the performance and the energy of this implementation to the fixed-function hardware implementation of the same algorithm. We evaluate DRAM refresh management on Pardis by comparing the fixed-function hardware implementation of the Elastic Refresh technique to its firmware implementation [15]. Finally, we evaluate the performance potential of application-specific optimizations enabled by Pardis by implementing custom address-mapping mechanisms. We evaluate DRAM energy and system performance by simulating 13 memory-intensive parallel applications, running on a heavily modified version of the SuperEscalar (SESC) simulator [16]. We measure the area, frequency, and power dissipation of Pardis by implementing the proposed system in Verilog HDL and synthesizing the proposed hardware.

Area, power, and delay: Where are the bottlenecks?

Figure 4 shows synthesis results on the area, power, and delay contributions of different hardware components. At 22 nm, a fully synthesizable implementation of Pardis operates at over 2 GHz, occupies 1.8 mm² of die area, and dissipates 152 mW of peak power; higher frequencies, lower power dissipation, or a smaller area footprint can be attained through custom, rather than fully synthesized, circuit design. Most of the area is occupied by the request and transaction processors because of four 64-Kbyte instruction and data static RAM (SRAM) arrays; however, the transaction queue, which implements associative lookups using content-addressable memory (CAM), is a major consumer of power (29 percent of the total). Other major consumers of peak power are the transaction processor (29 percent) and the request processor (28 percent).

Figure 4. Delay, area, and peak-power characteristics of the synthesized Pardis implementation. Pardis operates at over 2 GHz, occupies 1.8 mm² of die area, and dissipates 152 mW of peak power. The black section at the bottom of the area column (about 2 percent) represents the data queue, transaction queue, command logic, request queue, and command queue combined. Panels: critical path delay in picoseconds (axis from 410 to 500) for the request processor, transaction processor, and command logic; area (1.8 mm²) and peak power (152 mW) breakdown in percent by component.

Scheduling policies

Figure 5 compares Pardis-based firmware implementations of FCFS [11], FR-FCFS [11], Par-BS [12], and TCMS [13] scheduling algorithms to their fixed-function hardware implementations. Pardis achieves virtually the same performance as fixed-function hardware on FCFS and FR-FCFS schedulers across all applications. For some benchmarks (for example, Art and Ocean with FR-FCFS), the Pardis version of a scheduling algorithm outperforms the fixed-function hardware implementation of the same algorithm by a small margin. This improvement is an artifact of the higher latency incurred in making decisions when using Pardis, which generally results in greater queue occupancies. As a result of having more requests to choose from, the scheduling algorithm can exploit bank parallelism and row buffer locality more effectively under the Pardis implementation. However, for Par-BS and TCMS, two compute-intensive scheduling algorithms, Pardis suffers from higher processing latency, thus hurting performance by eight percent and five percent, respectively.

Figure 5. Performance of Pardis-based and hardwired implementations for the first-come, first-served (FCFS), first-ready, first-come, first-served (FR-FCFS), parallelism-aware batch scheduler (Par-BS), and thread cluster memory scheduling (TCMS) algorithms. Pardis-based implementations achieve performance within 8 percent of a hardwired memory controller. Y-axis: speedup over hardwired implementation (0.0 to 1.2). X-axis: benchmarks Art, CG, Equake, FFT, Histogram, Linear, MG, Ocean, Radix, ScalParC, String, Swim, and Word, plus their geometric mean (Gmean).
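To make the difference between the two simplest policies concrete, the Python sketch below contrasts FCFS with FR-FCFS as the latter is commonly described: requests that hit an open row are served first, and age breaks ties. The `Request` type and function names are hypothetical, and a real scheduler would also check DRAM timing readiness, which this sketch ignores.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Request:
    arrival: int  # smaller means older
    bank: int
    row: int


def fcfs_pick(pending: List[Request]) -> Optional[Request]:
    """FCFS: always serve the oldest pending request."""
    return min(pending, key=lambda r: r.arrival) if pending else None


def fr_fcfs_pick(pending: List[Request],
                 open_rows: Dict[int, int]) -> Optional[Request]:
    """FR-FCFS (simplified): prefer row hits, then the oldest request.

    open_rows maps a bank ID to the row currently in its row buffer.
    """
    if not pending:
        return None
    # False sorts before True, so row hits win; arrival order breaks ties.
    return min(pending,
               key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))


if __name__ == "__main__":
    queue = [Request(0, bank=0, row=1), Request(1, bank=0, row=2),
             Request(2, bank=1, row=7)]
    open_rows = {0: 2, 1: 9}
    print(fcfs_pick(queue))                # oldest request
    print(fr_fcfs_pick(queue, open_rows))  # the row hit on bank 0
```

A larger pending list gives `fr_fcfs_pick` more candidates that may hit an open row, which is consistent with the queue-occupancy effect noted above.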

Address mapping

To evaluate the performance of different DRAM address-mapping techniques on Pardis, we mapped the permutation-based interleaving technique [17] onto Pardis and compared it to its fixed-function hardware implementation (Figure 6a). The average performance of the two implementations differed by less than 1 percent.
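Permutation-based interleaving is commonly described as XORing the bank index with a few low-order bits of the row portion of the address, so that rows that would otherwise collide in one bank are spread across banks. The sketch below shows that idea only; the 3-bit bank field and the function name `permuted_bank` are assumptions, and the firmware evaluated in the article may differ in detail.

```python
BANK_BITS = 3  # hypothetical: eight banks per rank


def permuted_bank(bank: int, row: int) -> int:
    """XOR the bank index with the low-order bits of the row index."""
    return bank ^ (row & ((1 << BANK_BITS) - 1))


if __name__ == "__main__":
    # Eight consecutive rows that conventionally map to bank 5
    # are spread over all eight banks after the permutation.
    print([permuted_bank(5, row) for row in range(8)])
```

On a programmable controller this costs a couple of extra ALU instructions (for example, AND and XOR) in the address-mapping routine.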

Power management

DRAM power management with Pardis was evaluated by implementing Hur and Lin's queue-aware power management technique [14] in firmware and comparing the results to a fixed-function hardware implementation (see Figure 6c for energy and Figure 6d for performance); in both cases, the underlying command scheduling algorithm is FR-FCFS. The hardwired implementation reduces average DRAM energy by 32 percent over conventional FR-FCFS at a cost of four percent lower performance. The firmware implementation of queue-aware power management with Pardis shows similar results: 29 percent DRAM energy savings at a cost of five percent performance loss.

Refresh

To evaluate DRAM refresh management on Pardis, we considered a conventional on-demand DDR3 refresh method [18] as the baseline to which we compared fixed-function hardware and Pardis-based firmware implementations of the recently proposed Elastic Refresh algorithm [15] (Figure 6b). The Pardis-based refresh mechanism takes advantage of interrupt programming to manage the state of the ranks and to issue refresh commands at the right time. The results indicate that the average performance of firmware-based Elastic Refresh is within one percent of fixed-function hardware.

Figure 6. Comparison of Pardis-based and hardwired implementations: performance of address-mapping schemes (a), performance of refresh management (b), DRAM energy consumption under Hur and Lin's power management algorithm [14] (c), and performance under Hur and Lin's power management algorithm [14] (d). Each panel plots one bar for the hardwired implementation (ASIC) and one for Pardis, on an axis from 0.00 to 1.00. Panel (a): FR-FCFS + permutation-based interleaving, speedup over hardwired FR-FCFS + conventional page interleaving. Panel (b): FR-FCFS + elastic refresh, speedup over hardwired FR-FCFS + basic refresh. Panel (c): FR-FCFS + Hur & Lin, DRAM energy normalized to hardwired FR-FCFS. Panel (d): FR-FCFS + Hur & Lin, speedup over hardwired FR-FCFS.

Memory system bandwidth and power are two extremely important problems that have a significant impact on overall system performance and energy efficiency. As a result, researchers have designed numerous memory controller optimizations to improve system performance, aiming at different performance objectives and system requirements [12,14,15]. These proposals mainly focus on optimizing existing control functions or adding new capabilities to memory controllers; address mapping, command scheduling, QoS maintenance, DRAM refresh management, and memory power optimization are some examples. Such optimizations must satisfy different user objectives and system requirements; this complicates memory controller design. Not only is it challenging to satisfy multiple conflicting performance requirements in an existing hardwired memory controller, but it's impossible to incorporate application-specific optimizations into control policies implemented in fixed-function hardware. A programmable platform for memory controller design would therefore be a significant improvement over today's rigid and relatively inefficient systems.

In addition to its potential impact on existing memory systems, programmability also holds the promise of solving some of the key problems in next-generation memory interfaces. One effective solution to the off-chip memory bandwidth problem is to employ high-speed communication links between processor cores and a highly banked memory subsystem. This approach has recently been employed in Micron’s hybrid memory cube (HMC) [19] to achieve significant improvements in memory bandwidth and power efficiency. The HMC, however, implements the memory controller on the DRAM package; as a result, the processor designer loses the ability to dictate exactly how the memory controller operates. A programmable memory controller would give that control back to the processor designers by letting them develop firmware for memory control functions.

Designing a programmable memory controller is a significant challenge. Compared to a hardwired memory controller, a programmable controller could result in slower request processing, thereby decreasing throughput and efficiency. In addition, as a result of instruction processing overheads, the control firmware may add extra latency to every memory access. Moreover, the controller’s power dissipation and area may become serious limiting factors. Hence, it is critical to strike a careful balance between the controller’s versatility and complexity.

Pardis is the first programmable DRAM controller to address these challenges. Unlike prior work on intelligent memory controllers (for instance, Impulse [20]) that allow configurable access to memory blocks via physical address remapping, Pardis provides programmability and configurability down to internal DRAM resources. To achieve a high degree of versatility with acceptable complexity, Pardis introduces a judicious division of labor between specialized hardware and firmware: request and transaction processing in firmware, and configurable timing validation in hardware. This task separation allows request and transaction processing resources to be dedicated exclusively to deriving the best schedule, without the burden of any extra cycles to verify the derived schedule’s timing.
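The division of labor described above can be pictured with a small Python model: a "firmware" function chooses which command to issue next using an FR-FCFS-style rule, while a separate "hardware" checker decides only whether that command is legal at the current cycle. This is a toy sketch under stated assumptions, not the Pardis design: the class and function names are hypothetical, the checker enforces a single constraint (an ACT-to-column delay), and the `t_rcd=11` value is an assumed cycle count.

```python
from dataclasses import dataclass


@dataclass
class Command:
    kind: str  # "ACT", "RD", "WR", or "PRE"
    bank: int
    row: int


class TimingChecker:
    """Stand-in for the hardware timing validator (one rule only)."""

    def __init__(self, t_rcd=11):
        self.t_rcd = t_rcd   # assumed ACT-to-RD/WR delay, in cycles
        self.last_act = {}   # bank -> cycle of the most recent ACT

    def is_legal(self, cmd, now):
        if cmd.kind in ("RD", "WR"):
            act = self.last_act.get(cmd.bank)
            return act is not None and now - act >= self.t_rcd
        return True          # other constraints are omitted in this sketch

    def record(self, cmd, now):
        if cmd.kind == "ACT":
            self.last_act[cmd.bank] = now


def firmware_pick(pending, open_rows):
    """FR-FCFS-style choice: oldest row hit first, else the oldest command."""
    for cmd in pending:  # pending is ordered oldest first
        if cmd.kind in ("RD", "WR") and open_rows.get(cmd.bank) == cmd.row:
            return cmd
    return pending[0] if pending else None


def issue(cmd, checker, now):
    """Issue cmd only if the checker approves; return True on success."""
    if cmd is None or not checker.is_legal(cmd, now):
        return False
    checker.record(cmd, now)
    return True


checker = TimingChecker(t_rcd=11)
print(issue(Command("ACT", bank=0, row=5), checker, now=0))   # True
print(issue(Command("RD", bank=0, row=5), checker, now=5))    # False: too early
print(issue(Command("RD", bank=0, row=5), checker, now=11))   # True
```

Because the scheduling function never inspects timing state, it can be replaced with a different policy without touching the checker, which mirrors the separation the text describes.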

Pardis enables novel capabilities at the main memory controller. As opposed to a hardwired memory controller, a programmable controller allows application-specific control policies to manage the underlying main memory resources more efficiently; provides the required infrastructure for applying in-field updates, as well as patches to fix bugs and revise the control firmware; adds the ability to context-switch among different command schedulers in a multiprogrammed setting; and returns DRAM control to processor designers in next-generation HMC systems.

Acknowledgments

This work was supported in part by NSF grant CCF-1217418.

References

1. J. Martin et al., “A Microprogrammable Memory Controller for High-Performance Dataflow Applications,” Proc. 35th European Solid-State Circuits Conf. (ESSCIRC 09), IEEE, 2009, pp. 348-351.
2. G. Kornaros et al., “A Fully Programmable Memory Management System Optimizing Queue Handling at Multi Gigabit Rates,” Proc. 40th Design Automation Conf. (DAC 03), ACM, 2003, pp. 54-59.
3. Micron Technology, “TN-29-01: Increasing NAND Flash Performance,” 2006; www.micron.com/~/media/Documents/Products/Technical%20Note/NAND%20Flash/tn2901.pdf.


4. J. Kuskin et al., “The Stanford FLASH Multiprocessor,” Proc. 21st Int’l Symp. Computer Architecture (ISCA 94), IEEE CS, 1994, pp. 302-313.
5. S.K. Reinhardt, J.R. Larus, and D.A. Wood, “Tempest and Typhoon: User-level Shared Memory,” Proc. 21st Int’l Symp. Computer Architecture (ISCA 94), IEEE CS, 1994, pp. 325-336.
6. J. Carter et al., “Impulse: Building a Smarter Memory Controller,” Proc. 5th Int’l Symp. High-Performance Computer Architecture (HPCA 99), IEEE CS, 1999, pp. 70-79.
7. M. Browne et al., “Design Verification of the S3.mp Cache Coherent Shared-Memory System,” IEEE Trans. Computers, Jan. 1998, pp. 135-140.
8. A. Agarwal et al., “The MIT Alewife Machine: Architecture and Performance,” Proc. 22nd Ann. Int’l Symp. Computer Architecture (ISCA 95), ACM, 1995, pp. 2-13.
9. A. Firoozshahian et al., “A Memory System Design Framework: Creating Smart Memories,” Proc. 36th Int’l Symp. Computer Architecture (ISCA 09), ACM, 2009, pp. 406-417.
10. M.N. Bojnordi and E. Ipek, “PARDIS: A Programmable Memory Controller for the DDRx Interfacing Standards,” Proc. 39th Int’l Symp. Computer Architecture (ISCA 12), IEEE, 2012, pp. 13-24.
11. S. Rixner et al., “Memory Access Scheduling,” Proc. 27th Int’l Symp. Computer Architecture (ISCA 00), IEEE, 2000, pp. 128-138.
12. O. Mutlu and T. Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing Both Performance and Fairness of Shared DRAM Systems,” Proc. 35th Int’l Symp. Computer Architecture (ISCA 08), IEEE, 2008, pp. 32-41.
13. Y. Kim et al., “Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior,” Proc. 43rd Ann. IEEE/ACM Int’l Symp. Microarchitecture, IEEE, 2010, pp. 65-76.
14. I. Hur and C. Lin, “A Comprehensive Approach to DRAM Power Management,” Proc. Int’l Symp. High Performance Computer Architecture (HPCA 08), IEEE CS, 2008, pp. 305-316.
15. J. Stuecheli et al., “Elastic Refresh: Techniques to Mitigate Refresh Penalties in High Density Memory,” Proc. 43rd Ann. IEEE/ACM Int’l Symp. Microarchitecture, IEEE, 2010, pp. 375-384.
16. J. Renau et al., “SESC Simulator,” Jan. 2005; http://sesc.sourceforge.net.
17. Z. Zhang, Z. Zhu, and X. Zhang, “A Permutation-Based Page Interleaving Scheme to Reduce Row-Buffer Conflicts and Exploit Data Locality,” Proc. 33rd Ann. IEEE/ACM Int’l Symp. Microarchitecture, ACM, 2000, pp. 32-41.
18. JEDEC, DDR3 SDRAM Specification, 2010.
19. J. Jeddeloh and B. Keeth, “Hybrid Memory Cube New DRAM Architecture Increases Density and Performance,” Proc. IEEE Symp. VLSI Technology, IEEE, 2012, pp. 87-88.
20. J. Carter et al., “Impulse: Building a Smarter Memory Controller,” Proc. 5th Int’l Symp. High Performance Computer Architecture (HPCA 99), IEEE CS, 1999, pp. 70-79.

Mahdi Nazm Bojnordi is a doctoral candidate in electrical and computer engineering at the University of Rochester. His research focuses on designing main memory controllers capable of performing application-specific performance and energy-efficiency optimizations. Nazm Bojnordi has an MS in electrical and computer engineering from the University of Tehran.

Engin Ipek is an assistant professor in the Departments of Computer Science and Electrical and Computer Engineering at the University of Rochester. His research focuses on computer architecture, with an emphasis on multicore architectures, hardware-software interaction, and high-performance memory systems. Ipek has a PhD in electrical and computer engineering from Cornell University.

Direct questions and comments about this article to Mahdi Nazm Bojnordi, CSB Room 401, University of Rochester, Rochester, NY 14627; [email protected].
