
Reducing SRAM Power Using Fine-Grained Wordline Pulsewidth Control

Mohamed H. Abu-Rahma, Member, IEEE, Mohab Anis, Member, IEEE, and Sei Seung Yoon

Abstract—Embedded SRAM dominates modern SoCs, and there is a strong demand for SRAM with lower power consumption while achieving high performance and high density. However, the large increase of process variations in advanced CMOS technologies is considered one of the biggest challenges for SRAM designers. In the presence of large process variations, SRAMs are expected to consume larger power to ensure correct read operations and meet yield targets. In this paper, we propose a new architecture that significantly reduces the array switching power for SRAM. The proposed architecture combines built-in self-test and digitally controlled delay elements to reduce the wordline pulsewidth for memories while ensuring correct read operations, hence reducing the switching power. Monte Carlo simulations using a 1-Mb SRAM macro in an industrial 45-nm technology are used to verify the power saving for the proposed architecture. For a 48-Mb memory density, a 27% reduction in array switching power can be achieved for a read access yield target of 95%. In addition, the proposed system can provide larger power saving as process variations increase, which makes it an attractive solution for 45-nm-and-below technologies.

Index Terms—Built-in self test (BIST), low power, random variations, SRAM, statistical, statistical yield estimation.

I. INTRODUCTION

WITH TECHNOLOGY scaling, the requirements of higher density and lower power embedded SRAMs are increasing exponentially. It is expected that more than 90% of the die area in future systems-on-chip (SoCs) will be occupied by SRAM [1]. This is driven by the high demand for low-power mobile systems, which integrate a wide range of functionality such as digital cameras, 3-D graphics, MP3 players, and other applications. In the meantime, random variations are increasing significantly with technology scaling. Random dopant fluctuation (RDF) is the dominant source of random variation in the bit cell's transistors. The variations in threshold voltage ($V_T$) due to RDF are inversely proportional to the square root of the device area [2]. Therefore, SRAM bit cells experience the largest random variations on a chip, as bit-cell transistors are typically the smallest devices for the given design rules [3]–[6].

Embedded SRAMs usually dominate the SoC silicon area, and their power consumption (both dynamic and static) is a considerable portion of the total power consumption of a SoC.

Manuscript received May 27, 2008; revised November 19, 2008. First published June 10, 2009; current version published February 24, 2010.

M. H. Abu-Rahma and S. S. Yoon are with Qualcomm Incorporated, San Diego, CA 92121 USA (e-mail: [email protected]).

M. Anis is with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON N2L 3G1, Canada (e-mail: [email protected]).

Digital Object Identifier 10.1109/TVLSI.2009.2012511

Fig. 1. Memory read power and bitline differential versus the WL pulsewidth ($t_{WL}$) for a 512-kb memory in 65-nm technology.

Moreover, the SRAM yield can dominate the overall chip yield. Hence, statistical design margining techniques are used to guarantee high memory yield. However, to achieve high yield, memory power consumption (and speed) is negatively impacted. The stringent requirements of high yield and low power consumption require combining circuit and architectural techniques to reduce SRAM power consumption.

SRAM array switching power consumption is considered one of the largest components of power in high-density memories [7], [8]. This is mainly because of the large memory arrays and the requirements for high area efficiency, which force SRAM designers to use the maximum numbers of rows and columns enabled by the technology. Fig. 1 shows the dynamic power consumption for read operations versus the wordline (WL) pulsewidth ($t_{WL}$) for a 512-kb memory macro in an industrial 65-nm technology. Power consumption results are extrapolated to zero WL pulsewidth to estimate the component of switching power due to the peripheral circuits. For normal operating conditions, array power consumption is more than 60% of read power. Therefore, it is important to reduce the array switching power due to its strong impact on the memory's total power as well as the SoC's power.

Several circuit techniques have been proposed to reduce the SRAM array switching power consumption by reducing the WL pulsewidth. One of the most common techniques to control $t_{WL}$ is using a bit-cell replica path, which reduces the bitline differential, hence lowering power consumption [4], [6], [9]–[11]. Replica-path (e.g., self-timed) techniques provide a simple approach of process tracking for global variations (interdie or systematic within die) as well as environmental variations (voltage and temperature). However, these circuit techniques are not efficient when memory bit cells experience large random variation, since circuit techniques cannot adapt to random variations. Therefore, their effectiveness decreases with process scaling, and larger design margins are used, which increases power consumption due to a larger $t_{WL}$. To reduce the loss due to excessive margining, circuits and architectures must be designed together to reduce power and manage variability. Higher levels of design abstraction can have better variation-tolerance capabilities because the impact of random variation can be measured at that level. Therefore, combining architecture techniques with circuit-level designs can reduce the pessimism in using worst case approaches and can help adapt the circuit to random variations, which can reduce power consumption [3], [12], [13].

In this paper, we propose a new architecture that reduces SRAM switching power consumption by using fine-grained WL pulsewidth control. The proposed solution combines a memory built-in self-test (BIST) with additional logic to reduce the switching power consumption for the memory. The proposed architecture helps recover the switching power consumed due to the worst case assumption used in SRAM design to achieve a high yield. The proposed architecture utilizes infrastructure which is already available in the SoC, with very low area overhead.

The rest of this paper is organized as follows. In Section II, we derive statistical models for memory read access yield and read power consumption, which show the tradeoff between these metrics. In Section III, we describe the proposed system and its operation. In Section IV, we present the statistical simulation flow and the power savings using the proposed system when applied to memories in an industrial 45-nm technology. In addition, we discuss some design considerations related to the proposed system. In Section V, we present our conclusions.

II. YIELD AND POWER TRADEOFF

Due to the random variation in SRAM bit cells, there is a tight coupling between memory yield and power consumption. To achieve high yield, read access failures should be minimized. Read access yield1 is defined as the probability of a correct read operation. In a read operation, the selected WL is activated for a period of time to allow the bitlines to discharge. The WL activation time ($t_{WL}$) is a critical parameter for memory design since it affects the memory speed (access time) as well as the memory power. To reduce read access failures, the WL pulsewidth $t_{WL}$ should be large enough to guarantee an adequate bitline differential, which can be sensed correctly using the sense amplifier (SA).

1In this paper, we use the term yield to refer to read access yield.

Fig. 2. Typical SRAM architecture.

The total power consumption for a memory in a read or write cycle can be expressed as

$$P_{total} = P_{leakage} + P_{array} + P_{peripheral} \qquad (1)$$

where $P_{leakage}$ is the total leakage power from the array and the peripheral circuitry, and $P_{array}$ and $P_{peripheral}$ are the switching powers from the array and the peripheral circuitry, respectively (as shown in Fig. 2).

In a read access, the array switching power can be calculated as

$$P_{array,read} = N_{BL}\, N_{WL}\, C_{BL}\, \Delta V_{BL}\, V_{DD}\, f \qquad (2)$$

where $N_{BL}$ and $N_{WL}$ are the number of bitlines and WLs in a memory bank, respectively, $C_{BL}$ is the bitline capacitance per bit cell, $\Delta V_{BL}$ is the bitline differential in read access (used to sense the bit cell's stored value), $V_{DD}$ is the supply voltage, and $f$ is the operating frequency.

$\Delta V_{BL}$ can be calculated as

$$\Delta V_{BL} = \begin{cases} \dfrac{I_{read}\, t_{WL}}{N_{WL}\, C_{BL}}, & \text{for } t_{WL} < \dfrac{N_{WL}\, C_{BL}\, V_{DD}}{I_{read}} \\[1ex] V_{DD}, & \text{for } t_{WL} \ge \dfrac{N_{WL}\, C_{BL}\, V_{DD}}{I_{read}} \end{cases} \qquad (3)$$

where $I_{read}$ is the bit-cell read current. To a first order, $\Delta V_{BL}$ can be approximated by assuming a linear dependence on $t_{WL}$, for the range of $t_{WL}$ where $\Delta V_{BL} < V_{DD}$, as shown in Fig. 1.

Therefore, from (2) and (3), the array switching power can be computed as

$$P_{array,read} = \begin{cases} N_{BL}\, I_{read}\, t_{WL}\, V_{DD}\, f, & \text{for } t_{WL} < \dfrac{N_{WL}\, C_{BL}\, V_{DD}}{I_{read}} \\[1ex] N_{BL}\, N_{WL}\, C_{BL}\, V_{DD}^{2}\, f, & \text{for } t_{WL} \ge \dfrac{N_{WL}\, C_{BL}\, V_{DD}}{I_{read}} \end{cases} \qquad (4)$$


From (4), it is clear that $P_{array,read}$ is directly proportional to $t_{WL}$ (when $t_{WL} < N_{WL} C_{BL} V_{DD}/I_{read}$), which is confirmed by the read power results shown in Fig. 1.
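To make the dependence in (2)–(4) concrete, the short sketch below evaluates this first-order model for a hypothetical bank; the numerical values for $N_{BL}$, $N_{WL}$, $C_{BL}$, $I_{read}$, $V_{DD}$, and $f$ are illustrative assumptions, not parameters reported in the paper.

```python
# First-order array read power model of (2)-(4), with illustrative numbers.
N_BL, N_WL = 256, 256          # bitlines and wordlines per bank (assumed)
C_BL = 0.2e-15                 # bitline capacitance per bit cell [F] (assumed)
I_READ = 20e-6                 # bit-cell read current [A] (assumed)
V_DD = 1.0                     # supply voltage [V]
FREQ = 500e6                   # operating frequency [Hz] (assumed)

def bitline_differential(t_wl):
    """Eq. (3): linear discharge until the bitline is fully discharged."""
    return min(I_READ * t_wl / (N_WL * C_BL), V_DD)

def array_read_power(t_wl):
    """Eq. (2)/(4): all N_BL bitlines swing by dV every cycle."""
    dv = bitline_differential(t_wl)
    return N_BL * (N_WL * C_BL) * dv * V_DD * FREQ

if __name__ == "__main__":
    for t_wl in (0.2e-9, 0.5e-9, 1.0e-9, 5.0e-9):
        print(f"t_WL = {t_wl*1e9:4.1f} ns -> P_array = {array_read_power(t_wl)*1e3:.3f} mW")
```

For small pulsewidths, the printed power scales linearly with $t_{WL}$ and saturates once the bitline is fully discharged, which is the behavior described by (4).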

A correct read operation requires $\Delta V_{BL}$ to be large enough to guarantee correct sensing using the SA. Hence, a large $t_{WL}$ implies having a sufficiently large $\Delta V_{BL}$ that enables weak bit cells (with low $I_{read}$) to be correctly sensed for a given yield requirement. Increasing $t_{WL}$ increases the read access yield; however, at the same time, increasing $t_{WL}$ increases power consumption, as shown in (4). Moreover, $t_{WL}$ has a direct impact on a memory's access time [4], [14]. Therefore, $t_{WL}$ is usually set to ensure a correct read operation for a given read access yield requirement and memory density.

The bit-cell read current $I_{read}$ is strongly affected by the random variations in the bit-cell access device (pass gate) as well as the pull-down device. Due to these variations, $I_{read}$ has been shown to follow a Normal distribution with a mean of $\mu_{I_{read}}$ and a standard deviation of $\sigma_{I_{read}}$ [4]–[6], [14], [15]. Moreover, recent measurement results for a 1-Mb memory have confirmed that $I_{read}$ indeed follows a Normal distribution up to $5\sigma$ deviations from the mean value [16].

It is noteworthy to mention that, with technology/supply scaling and with the increase of $V_T$ variation, the device operation region may change from strong to moderate inversion. In the extreme case of the device working in the subthreshold region, the $I_{read}$ distribution will be lognormal due to the exponential dependence on $V_T$ [17].

Random within-die (WID) variations also have a strong impact on the SA offset [4], [14], [18]–[20], which causes SAs to show input offset voltages that affect the accuracy of the read operation. In addition, systematic variations due to asymmetric layout can increase the SA input offset; that is why highly symmetric layouts are typically used for the SA [4], [14]. Moreover, due to the small differential signal developed on the SA inputs, an aggressor located near the SA may couple a large noise at the SA input, which can affect the accuracy of the read operation. Nevertheless, by using layout noise shielding techniques and highly symmetric SA layout styles, the impact of this component can be minimized.

Typically, the SA input offset due to random variations can be modeled using a Normal distribution [19]. However, due to the complexity of deriving a closed-form yield relation that accounts for both $I_{read}$ and the SA offset statistically, we treat the SA input offset using a worst case approach instead of statistically, as was used in [4], [5], and [15].

To guarantee a correct read operation, using a statistical treatment for $I_{read}$ and a worst case approach for the SA offset, the following condition should be satisfied [4], [5], [15]:

$$\frac{\mu_{I_{read}}\left(1 - k\,\dfrac{\sigma_{I_{read}}}{\mu_{I_{read}}}\right) t_{WL}}{N_{WL}\, C_{BL}} \;\ge\; \Delta V_{min} \qquad (5)$$

where $\Delta V_{min}$ is the minimum required bitline differential voltage, which is a function of the SA input offset and its immunity against coupling noise. $\mu_{I_{read}}$ is the mean bit-cell read current, and $\sigma_{I_{read}}/\mu_{I_{read}}$ is the relative variation in $I_{read}$. $k$ is the required design coverage.

Fig. 3. $I_{read}$ PDF showing the $3\sigma$, $4\sigma$, and $5\sigma$ points, which correspond to different memory yield targets (for an assumed relative variation in $I_{read}$).

The design coverage $k$ is related to the target yield and the memory density [4], [15], and can be computed as

$$k = \Phi^{-1}\!\left(Y^{1/N}\right) \qquad (6)$$

where $\Phi^{-1}$ is the inverse standard Normal cumulative distribution function, $Y$ is the memory read access yield target, and $N$ is the total number of bit cells in the memory. For example, for a 1-Mb memory, if the target read access yield is 95%, then the required design coverage is $k \approx 5.3$. Therefore, to achieve the same yield for a larger memory density, $k$ should be increased. From (5), this means that a larger $t_{WL}$ is required.
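As a quick numerical check of (5) and (6), the snippet below reproduces the 1-Mb, 95%-yield coverage example and then solves (5) for the minimum pulsewidth. The electrical parameters and the 10% relative variation are assumed values used only for illustration.

```python
from scipy.stats import norm

def design_coverage(yield_target, n_cells):
    """Eq. (6): k = Phi^-1(Y^(1/N)), the per-cell quantile that meets the array yield."""
    return norm.ppf(yield_target ** (1.0 / n_cells))

def min_wordline_pulse(k, mu_iread, sigma_over_mu, n_wl, c_bl, dv_min):
    """Eq. (5) solved for t_WL: the weakest expected cell (mu - k*sigma) must develop dv_min."""
    i_weak = mu_iread * (1.0 - k * sigma_over_mu)
    return n_wl * c_bl * dv_min / i_weak

if __name__ == "__main__":
    k = design_coverage(0.95, 2**20)              # 1-Mb memory, 95% read access yield
    print(f"required coverage k = {k:.2f} sigma")  # roughly 5.3 sigma

    # Illustrative electrical assumptions (not values from the paper).
    t_wl = min_wordline_pulse(k, mu_iread=20e-6, sigma_over_mu=0.10,
                              n_wl=256, c_bl=0.2e-15, dv_min=0.1)
    print(f"minimum t_WL = {t_wl*1e9:.2f} ns")
```

Calling the same coverage function with a larger cell count shows why a bigger memory needs a higher $k$, and hence a larger worst case $t_{WL}$.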

It is important to note that the relation between $t_{WL}$ and $I_{read}$ is nonlinear, as shown in (3) and (5). In fact, assuming that $I_{read}$ follows a Normal distribution, the probability density function (PDF) of $t_{WL}$ can be calculated using a one-to-one mapping from (5) as follows [21] (details of the derivation are provided in Appendix A):

$$f_{t_{WL}}(t_{WL}) = \frac{N_{WL}\, C_{BL}\, \Delta V_{min}}{t_{WL}^{2}}\; f_{I_{read}}\!\left(\frac{N_{WL}\, C_{BL}\, \Delta V_{min}}{t_{WL}}\right) \quad \text{for } t_{WL} > 0 \qquad (7)$$

where $f_{t_{WL}}$ is the PDF for $t_{WL}$ and $f_{I_{read}}$ is the PDF for $I_{read}$, which is a Normal distribution.

Figs. 3 and 4 show the distributions of both the bit-cell $I_{read}$ and $t_{WL}$. Note that the $t_{WL}$ PDF is not symmetric but instead is skewed to larger values. Moreover, the $3\sigma$, $4\sigma$, and $5\sigma$ values are shown for $I_{read}$, together with the corresponding $t_{WL}$ values for these $\sigma$'s. It is clear that $t_{WL}$ is very sensitive to $I_{read}$ variations. At the $5\sigma$ point, $t_{WL}$ increases four times compared with its nominal value (calculated using $\mu_{I_{read}}$).

Fig. 4. $t_{WL}$ PDF, showing the points corresponding to $3\sigma$, $4\sigma$, and $5\sigma$ of $I_{read}$. The $t_{WL}$ PDF is skewed toward a higher $t_{WL}$ (for an assumed relative variation in $I_{read}$).

Because of the skewed $t_{WL}$ distribution, large values of $t_{WL}$ are required to ensure an acceptable read access yield. Moreover, to achieve the same yield as the memory size increases, a higher coverage $k$ is required, as shown in (6), which significantly increases $t_{WL}$ (due to the nonlinear relation between $t_{WL}$ and $I_{read}$). Therefore, due to statistical variations in the bit cell, $t_{WL}$ should be pessimistically large to achieve the required yield target. This will negatively impact the dynamic power for memories which do not have weak bit cells (or have a small spread of $I_{read}$ around its mean value). Therefore, it is desirable to have a postsilicon approach that can recover the pessimism in determining $t_{WL}$.
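The skew of the $t_{WL}$ distribution discussed above is easy to reproduce numerically: the sketch below draws Normal $I_{read}$ samples and maps each one to its minimum pulsewidth through the reciprocal relation of (5) and (7). The distribution parameters and the lumped constant are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

MU_I, SIGMA_I = 20e-6, 2e-6      # assumed mean and sigma of I_read [A]
ALPHA = 5.12e-15                 # assumed lumped constant N_WL * C_BL * dV_min [C]

i_read = rng.normal(MU_I, SIGMA_I, size=1_000_000)
i_read = i_read[i_read > 0]      # keep only the physical (positive-current) samples
t_wl = ALPHA / i_read            # one-to-one mapping used in (5)/(7)

t_nom = ALPHA / MU_I
print(f"nominal t_WL            : {t_nom*1e9:.3f} ns")
print(f"mean / median of t_WL   : {t_wl.mean()*1e9:.3f} / {np.median(t_wl)*1e9:.3f} ns")
print(f"99.99th percentile t_WL : {np.percentile(t_wl, 99.99)*1e9:.3f} ns")
# mean > median and a long upper tail: the t_WL PDF is skewed toward larger values.
```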

III. FINE-GRAINED WL PULSEWIDTH CONTROL

As discussed in the previous section, the read switching power increases significantly due to process variations, because at design time we do not have information about which memories will have weak bit cells with low $I_{read}$'s. Therefore, a worst case approach is applied to determine $t_{WL}$ for a given memory density and yield requirement, as shown in (5).

Memory BIST is nowadays an integral part of a SoC [1], [22]. Typically, a memory BIST engine is used to generate patterns to test a memory and detect memory failures. Based on the memory failure information, the unit under test is either discarded or repaired using memory redundancy. To reduce SoC repair cost, memory repair using built-in self-repair (BISR) is also used to enable soft repair instead of laser repair (hard repair) and without using automatic test equipment (ATE) [1], [22].

A postsilicon approach is required to recover the loss in switching power due to the worst case design practices used in SRAM design. Fig. 5 shows a conceptual view of the proposed architecture. The system is composed of three main components: the BIST, the "WL delay" block, and the "pulsewidth control" block. The memory BIST is used to test the memory functionality using different testing patterns. Each memory instance contains a WL delay block, which is a digitally programmable delay element that controls the WL pulsewidth $t_{WL}$. This adjustment of $t_{WL}$ is achieved by adding the programmable delay element in the disable path of the WL. The delay element is controlled using a digital code provided by the pulsewidth control logic [23], [24]. The pulsewidth control logic increases or decreases the digital code based on the BIST result. Therefore, the proposed system creates a closed feedback loop between the BIST, the WL pulsewidth control, and the memory internal timing, which can be used to reduce the power consumption, as explained in the following.

Fig. 5. Proposed architecture: Fine-grained WL pulsewidth control.

Next, we present more details on the three main components of the proposed system: the BIST, WL delay, and pulsewidth control blocks.

A. SRAM BIST

SRAMs are prone to different types of failure (catastrophic and parametric). To address yield problems in SRAMs, large arrays are provided with redundant rows and/or columns, which can be used to replace defective bit cells, hence repairing the memory. To enable the repair, the memory should first be tested to locate the defective bit cells. However, since embedded memories lack direct access to chip input/output signals, embedded memory testing becomes complicated and time consuming [25].

Memory BIST is nowadays an integral part of SoCs [1], [22]. A memory BIST engine is used to generate patterns to test a memory and detect memory failures. Typical data patterns include checkerboard, solid, row stripe, column stripe, double row stripe, and double column stripe. Different test algorithms are used to cover different types of faults; however, one of the most widely used algorithms is the MARCH test [22]. After testing the memory, the unit under test is either discarded or repaired using memory redundancy (depending on the failure information). BIST not only reduces testing time (and cost) but also allows testing the memory at the actual clock speed (at-speed testing). Moreover, BIST can be used to enable BISR, where the BISR logic analyzes the failing addresses from the BIST and generates a failure bitmap for the memory. The BIST then uses the failure bitmap to replace failing bit cells using the redundant rows and columns. BISR is also used to enable soft repair instead of laser repair (hard repair) and without using ATE, which further reduces testing cost [1], [22]. Memory self-test can be performed every time the chip is reset (in power-up test mode).
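As an illustration of the kind of test algorithm such a BIST engine runs, the sketch below applies the standard March C- elements to a tiny behavioral memory model; it is a simplified stand-in for illustration, not the BIST implementation used in the paper.

```python
# March C-: {up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); down(r0)}
def march_c_minus(memory):
    """Return the failing addresses for a word-addressable memory model."""
    n = len(memory)
    failures = set()

    def element(order, ops):
        for addr in order:
            for op, val in ops:
                if op == "w":
                    memory[addr] = val
                else:  # "r": read and compare against the expected value
                    if memory[addr] != val:
                        failures.add(addr)

    up, down = range(n), range(n - 1, -1, -1)
    element(up,   [("w", 0)])
    element(up,   [("r", 0), ("w", 1)])
    element(up,   [("r", 1), ("w", 0)])
    element(down, [("r", 0), ("w", 1)])
    element(down, [("r", 1), ("w", 0)])
    element(down, [("r", 0)])
    return sorted(failures)

if __name__ == "__main__":
    mem = [0] * 16
    print("fails:", march_c_minus(mem))   # an ideal memory reports no failures
```

Replacing the ideal list with a model that injects a stuck-at or coupling fault makes the corresponding addresses appear in the returned failure list, which is the information a BISR engine would turn into a repair bitmap.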

B. WL Programmable Delay Elements

WL and SA timing are of utmost importance for a correct read operation. The timer block shown in Fig. 2 is responsible for generating these critical internal timing signals. For SRAM postsilicon debugging purposes and yield learning, programmable delay elements are used to control the internal timing for the WL and SA [23], [24], [26], [27]. For example, these delay elements are used to characterize the margin in the bitline differential voltage. Moreover, they are used to relax address setup time requirements by delaying the clock edge which starts an access [23]. Fig. 6 shows one type of programmable delay element [23], [26].

Fig. 6. One type of programmable delay element [23].

C. Pulsewidth Control Logic

Pulsewidth control logic is the control logic for the proposed architecture. It can be as simple as a digital counter, which varies the digital code for the delay element depending on the output of the BIST. It can also be implemented in software. For example, a programmable processor can be used to control both the BIST and the programmable delay elements. Fortunately, modern SoCs include programmable processors that can be used at start-up time to test and verify the operation of other modules [25], [28].

D. System Operation

Fig. 7 shows the operation of the proposed system: Initially, the pulsewidth control logic provides the initial code for the memory. This initial digital code will correspond to the required worst case $t_{WL}$ for a given yield requirement, which is determined at design time [using (5)]. The BIST tests the memory instance using the initial code. If the memory fails, it may require repairing or it may be discarded, which is a typical BIST testing sequence. However, if the memory passes the BIST testing using the initial code (or after repair), the BIST signals the pulsewidth control logic to reduce the digital code, hence reducing $t_{WL}$ using the programmable delay element. Using the new digital code for $t_{WL}$, the BIST tests the memory once again, and this process of BIST and $t_{WL}$ reduction is repeated until the memory fails in the read operation. The last passing code is then stored in the built-in registers inside the memory instances. If the lowest code is reached without the memory failing, the code is also stored, and the operation is terminated. The aforementioned steps are repeated for all the memories in the chip. Hence, the proposed architecture reduces $t_{WL}$ for all memories in which the bit cells have sufficient $I_{read}$ to ensure a correct read operation. Therefore, by reducing $t_{WL}$, the read power of the memories can be reduced. This operation can be part of the system testing or power-up, where the final codes for each memory can be stored in built-in registers or burned into fuses.
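A behavioral sketch of this feedback loop is shown below. Here, run_bist and the code range are hypothetical stand-ins for the real BIST engine and the delay-element settings; the real system would iterate over every memory instance and persist the result in registers or fuses.

```python
def calibrate_pulsewidth(run_bist, initial_code, min_code):
    """Shrink the WL pulsewidth code until BIST fails; keep the last passing code.

    run_bist(code) -> True if the memory instance passes all test patterns
    with the WL delay element programmed to `code` (larger code = wider pulse).
    """
    if not run_bist(initial_code):
        return None                  # fails even at the worst-case setting: repair or discard

    code = initial_code
    while code > min_code and run_bist(code - 1):
        code -= 1                    # memory still reads correctly with a narrower pulse
    return code                      # store in built-in registers or burn into fuses

if __name__ == "__main__":
    # Toy model: this particular instance happens to pass for any code >= 9.
    final = calibrate_pulsewidth(run_bist=lambda c: c >= 9, initial_code=15, min_code=0)
    print("final WL pulsewidth code:", final)   # -> 9
```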

IV. RESULTS AND DISCUSSION

Fig. 7. Flowchart for the operation of the proposed fine-grained WL pulsewidth control system.

To test the power savings using the proposed architecture, Monte Carlo simulations are used to capture the impact of device variation on the bit-cell $I_{read}$ and the corresponding $t_{WL}$. Simulations are performed using a 1-Mb macro from an industrial 45-nm technology. The 1-Mb macro uses a replica path to reduce power consumption and improve process tracking [4], [9]–[11]. Hardware-correlated bit-cell statistical models2 are used to compute $\mu_{I_{read}}$ and $\sigma_{I_{read}}$, which are used in the simulation flow shown in the following. Postlayout switching power simulations were used to measure the power versus $t_{WL}$ dependence, as shown in Fig. 1.

In SoC design, typically, a high-density macro (on the order of 512 kb or 1 Mb) is used as a building block for larger memories. Therefore, in our simulations, we assume that the minimum memory macro size is 1 Mb, and multiple instances of that macro are used to realize larger memories.

To estimate the power saving using the proposed system, a Monte Carlo simulation flow has been developed as follows (a simplified sketch of the flow is given after the list).

1) For every memory instance in the chip, generate samples of a Normal distribution with a mean of $\mu_{I_{read}}$ and a standard deviation of $\sigma_{I_{read}}$ (one sample per bit cell) to represent the read current variation in the macro.

2) Find the minimum read current in each memory instance and compute $t_{WL,min}$ from the information of the weakest bit cell, which has the lowest $I_{read}$ [using (3)]. Therefore, $t_{WL,min}$ represents the minimum WL pulsewidth that guarantees a correct read operation for that instance. This value of $t_{WL,min}$ should be automatically determined by the proposed system, since the WL control block shown in Fig. 5 will reduce $t_{WL}$ until that memory instance fails a read operation.

2The statistical models are provided by the foundry, and the statistical data are extracted from measuring a large sample of bit cells as shown in [16]. Random WID variations are reflected into the Spice model by varying the threshold voltage and other parameters of the bit-cell devices according to the measured statistics.

Fig. 8. Power reduction using the proposed architecture versus chip-level memory density. Different values of yield target are shown.

3) Using (4), calculate the power consumption of that memory instance.

4) Repeat all the aforementioned steps for all memory instances on that chip and estimate the total read power for the memories on that chip.3

5) Repeat all the aforementioned steps for a large number of chips and get the average power.

6) From the chip-level yield requirement and the total memory density, calculate the design coverage $k$ using (6), and find the worst case $t_{WL}$. This will be the value of $t_{WL}$ which would have been used for all memory instances if the proposed architecture were not enabled. For this worst case $t_{WL}$, the corresponding power consumption can be calculated using (4).

7) From the last two steps, calculate the power reduction using the proposed architecture.
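A condensed version of steps 1)–7) is sketched below. The per-instance cell count, read-current statistics, and electrical constants are illustrative assumptions rather than the foundry data used in the paper, and the power model is the first-order linear region of (4).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Illustrative assumptions (not the paper's foundry data).
CELLS_PER_MACRO = 2**20                  # 1-Mb minimum controlled instance
MU_I, SIGMA_I = 20e-6, 2e-6              # bit-cell I_read statistics [A]
N_BL, N_WL, C_BL = 1024, 1024, 0.2e-15   # macro organization, bitline cap per cell [F]
V_DD, FREQ, DV_MIN = 1.0, 500e6, 0.1     # supply [V], frequency [Hz], min differential [V]

def array_power(t_wl):
    """Eq. (4), linear region, evaluated with the mean read current."""
    return N_BL * MU_I * t_wl * V_DD * FREQ

def chip_power(n_macros):
    """Steps 1)-4): set each instance's t_WL from its weakest cell, then sum the read power."""
    total = 0.0
    for _ in range(n_macros):
        i_read = rng.normal(MU_I, SIGMA_I, CELLS_PER_MACRO)   # step 1)
        t_min = N_WL * C_BL * DV_MIN / i_read.min()           # step 2)
        total += array_power(t_min)                           # steps 3)-4)
    return total

if __name__ == "__main__":
    n_macros, yield_target = 48, 0.95                                     # 48-Mb chip-level density
    k = norm.ppf(yield_target ** (1.0 / (n_macros * CELLS_PER_MACRO)))    # eq. (6)
    t_worst = N_WL * C_BL * DV_MIN / (MU_I * (1 - k * SIGMA_I / MU_I))    # step 6)

    trials = 5                                                            # step 5): average over chips
    adaptive = np.mean([chip_power(n_macros) for _ in range(trials)])
    fixed = n_macros * array_power(t_worst)                               # no per-instance adaptation
    print(f"estimated array read power reduction: {100 * (1 - adaptive/fixed):.1f}%")   # step 7)
```

With these toy numbers, the adaptive per-instance $t_{WL}$ falls below the chip-level worst case value, which is where the reported savings come from.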

Fig. 8 shows the power reduction achieved using the proposed architecture for different memory densities and yield targets. Note that these yield values represent the intrinsic yield before applying any repair or correction [i.e., redundancy or error-correction code (ECC)]. As the memory density increases, the power saving increases, which shows the effectiveness of the proposed system, particularly if a high yield target is required (as in high-volume low-cost products). Array switching power consumption can be reduced between 15% and 35% for a 48-Mb memory density, depending on the yield target. Fig. 9 shows the achievable power saving versus the memory yield target. For a 1-Mb memory density, the array switching power savings can be as high as 15% for a yield of 99%. From Figs. 8 and 9, it is clear that the proposed architecture reduces array switching power significantly. As the SRAM content increases in a SoC, it is expected that a higher power saving can be achieved using the proposed architecture.

3Here, we assume that all memories are accessed simultaneously. Hence, we add the individual read power of each memory. For our analysis, this is a fair assumption since we do not make any assumptions related to how the system accesses these memories. Nevertheless, the same simulation flow can be used if the switching activity for each memory is known.

Fig. 9. Power reduction using the proposed architecture versus yield target for two cases of chip-level density: (a) 1 and (b) 48 Mb.

Fig. 10. Power reduction using the proposed architecture versus chip-level memory density for two values of the assumed bit-cell variation, for a yield target of 95%.

SRAM bit cells show strong sensitivity to device variations, which increase significantly with scaling. To investigate the impact of variations on the power, we increase the variation from 10% to 12.5% [29]. Fig. 10 shows the power reduction for these two cases of variation. It is obvious that significant power savings can be achieved. For the 48-Mb case, the power saving increases by more than two times, from 25% to 55% (even though the variation only increased from 10% to 12.5%). This shows the effectiveness of the proposed architecture, which makes it attractive for power reduction in the presence of large process variations, which are expected to worsen with technology scaling.

Fig. 11. Power reduction using the proposed architecture versus chip-level memory density for different values of the minimum-controlled memory instance (or subbanks) for a yield target of 95%.

For the previous simulation results, we assumed that a 1-Mb memory is the minimum memory instance size that can have a specific $t_{WL}$. However, in memory design, multiple subarrays (i.e., banks) are used to implement a single memory instance, as shown in Fig. 5. Typically, a timer circuit is used per subarray; hence, the fine-grained concept can be applied further, at the subarray level. This requires adding the digitally controlled delay element per subarray and the capability of storing the digital code per subarray. Nevertheless, the area overhead can still be very small since the additional area is amortized over the large size of the memory macro. To evaluate the benefits of $t_{WL}$ control at the subarray level, we assume that we have 16 subarrays in the 1-Mb macro. Each subarray is composed of 256 bitlines by 256 rows. Therefore, using the proposed architecture, $t_{WL}$ can be adjusted for a 64-kb block. Fig. 11 shows the achieved power saving for the 64-kb block size as well as the 1-Mb full macro. By adjusting $t_{WL}$ at the subarray level, the power saving increases from 24% to 42% for the same 48-Mb chip-level memory density (1.75 times improvement). Furthermore, for the 1-Mb chip-level memory density, the power saving increases from 7.5% to 20% (2.67 times improvement). This shows the importance of reducing the size of the memory block which can be individually controlled using $t_{WL}$, as this will significantly reduce the array switching power consumption.

The proposed architecture reduces power consumption by reducing the WL pulsewidth for memories which have enough margin for a read operation. It is important to investigate how reducing $t_{WL}$ will affect the system timing for both setup and hold time requirements. Reducing $t_{WL}$ also reduces the access time of the memory, which implies that the logic delay through the memory is now faster. Therefore, the setup time requirement will be automatically met. Reducing $t_{WL}$, however, may cause hold violations. This situation can be easily resolved at design time by using the lowest expected $t_{WL}$ for hold time verification (i.e., using the minimum $t_{WL}$ that the programmable delay element provides).

In the aforementioned analysis, the discrete quantization effect on power savings is not considered. In reality, since we are using a digitally controlled delay element, $t_{WL}$ can only take discrete values. However, this may not have a significant impact on the shown results, since small-area and low-power delay elements can cover a large range of delays with fine control [23], [24]. In addition, by using the Monte Carlo simulation flow described previously, we can determine the range of $t_{WL}$ with the highest probability of occurring and modify the delay elements at design time to have enough steps in that region. It is important to note that the delay elements do not add extra area, as they are typically used in memories for debugging purposes [23], [24], [27].

While the proposed system accounts for process variations, which are static in nature, it cannot adapt to environmental variations such as voltage or temperature variation [13] due to their dynamic nature. This is because $t_{WL}$ will be fixed after running the pulsewidth control system at start-up. Hence, environmental variations may cause the memory to fail if they reduce the sensing margin. This problem can be addressed at design time by ensuring that the minimum step size for the delay control provides sufficient margin for voltage and temperature variations. Moreover, if self-timed memories are used, the replica path will provide efficient tracking for environmental variations [4], [9]–[11]. In addition, in product testing, low-voltage screening can be used at power-up to set $t_{WL}$, which guarantees a sufficient margin for environmental variations, since the product will operate at a supply voltage typically larger than the low-voltage test condition.

It is important to note that, in addition to environmental variations, transistor aging mechanisms such as hot carrier injection, time-dependent dielectric breakdown, and negative bias temperature instability cause device characteristics to shift with time, which increases gate delay [30], [31]. To prevent the circuit from failing due to device aging, additional guard banding (in delay or in the minimum supply voltage) is required, which may lower the expected power savings of the proposed architecture. For an accurate estimate of power savings including device aging, reliability models for these effects need to be included in the proposed simulation flow.

In this paper, we presented the analysis and results only for read power reduction. Nevertheless, the proposed system can be extended to reduce the switching power in write operations. In a write operation, one side of the bitline pair is pulled down to zero, while the other side is held at $V_{DD}$ using the write drivers. This operation is enabled for the columns which are selected for write (using the column multiplexer). However, half-selected bit cells (bit cells on the same WL which are not accessed for the write operation) still experience a dummy read operation, and their bitlines discharge through $I_{read}$ for a period determined by $t_{WL}$. The array switching power for a write operation can therefore be calculated as in (8), where $W$ is the input/output data width for the memory.

$$P_{array,write} = \begin{cases} \left[\, W\, N_{WL}\, C_{BL}\, V_{DD} + (N_{BL}-W)\, I_{read}\, t_{WL} \right] V_{DD}\, f, & \text{for } t_{WL} < \dfrac{N_{WL}\, C_{BL}\, V_{DD}}{I_{read}} \\[1ex] N_{BL}\, N_{WL}\, C_{BL}\, V_{DD}^{2}\, f, & \text{for } t_{WL} \ge \dfrac{N_{WL}\, C_{BL}\, V_{DD}}{I_{read}} \end{cases} \qquad (8)$$

The first term in (8) refers to the power consumption in the selected bitlines for the write operation, which are pulled down to zero using the write drivers (not through the bit-cell read current $I_{read}$; therefore, there is no dependence on $t_{WL}$). The second term is the power consumption in the half-selected bitlines in the write operation, which is dependent on $t_{WL}$. The array switching power due to half-selected bit cells can contribute significantly to the total write power, particularly in high-density memories with a small IO width and a large muxing option (i.e., $N_{BL} \gg W$). Therefore, the proposed architecture can be used to reduce $t_{WL}$ in write operations, which reduces the power consumption of the half-selected bit cells. However, as $t_{WL}$ is reduced, the bit cell becomes more susceptible to write failures. A write failure happens when the cell fails to write the desired value during the write operation [5], [15]. Therefore, the proposed system can reduce $t_{WL}$ until a write failure occurs. To evaluate the power savings in this case, a statistical simulation flow that includes models for write failure is required.
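The split just described can be evaluated with the same first-order model: selected columns see a full-swing discharge, while half-selected columns discharge by $\Delta V_{BL}(t_{WL})$. The sketch below follows the reconstructed form of (8); all parameter values, including the IO width, are illustrative assumptions.

```python
# Write-cycle array switching power split into selected and half-selected bitlines,
# following the first-order form of (8); all parameter values are assumed.
N_BL, N_WL, C_BL = 1024, 1024, 0.2e-15   # bank organization, bitline cap per cell [F]
I_READ, V_DD, FREQ = 20e-6, 1.0, 500e6
IO_WIDTH = 32                            # W: columns actually written per access

def write_array_power(t_wl):
    c_bl_total = N_WL * C_BL
    dv_half = min(I_READ * t_wl / c_bl_total, V_DD)        # eq. (3) for half-selected columns
    p_selected = IO_WIDTH * c_bl_total * V_DD**2 * FREQ     # full-swing write columns
    p_half_sel = (N_BL - IO_WIDTH) * c_bl_total * dv_half * V_DD * FREQ
    return p_selected, p_half_sel

if __name__ == "__main__":
    for t_wl in (0.5e-9, 1.0e-9, 2.0e-9):
        sel, half = write_array_power(t_wl)
        print(f"t_WL={t_wl*1e9:.1f} ns: selected={sel*1e3:.2f} mW, "
              f"half-selected={half*1e3:.2f} mW (only this part scales with t_WL)")
```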

To implement both read and write power reduction, additional logic is required to control $t_{WL}$ for read and write separately, as shown recently in [27]. Nevertheless, the area penalty can be negligible due to the small size of the digital delay elements. Moreover, as mentioned earlier, these delay elements are typically used in memories for debugging purposes [23], [24], [27].

In addition to reducing power, it is noteworthy to mention that reducing $t_{WL}$ (for read or write operations) has another benefit, which is the reduction of read-disturb failures [32]–[35]. Read disturb happens when the bit cell flips its stored data when it is accessed (or half-selected). As $t_{WL}$ becomes smaller, the bit cell has a shorter time to flip, which reduces the read-disturb-failure probability [32]–[35]. While, in this work, we focused on read access yield only, it is interesting to evaluate the reduction in read disturb using the proposed system, which will be a point of future research.

V. CONCLUSION

Array switching power is one of the largest components of power consumption in high-density SRAM. Moreover, with the large increase in process variation, SRAMs are expected to consume even larger array switching power to ensure correct read operations and meet yield targets. In this paper, we propose a new architecture that significantly reduces the array switching power. The proposed architecture combines BIST and digitally controlled delay elements to reduce the WL pulsewidth for memories while ensuring a correct read operation, hence reducing the switching power. Combining both architecture and circuit techniques enables the system to detect weak bit cells using the BIST and adjust $t_{WL}$ accordingly. Therefore, the proposed architecture recovers the power consumption since it tests each memory individually and adjusts $t_{WL}$ to ensure a correct read operation. A new simulation flow was developed to evaluate the power savings for the proposed architecture. Monte Carlo simulations using a 1-Mb SRAM macro from an industrial 45-nm technology were used to examine the power reduction achieved by the system. The proposed architecture can reduce the array switching power significantly and shows large power saving, particularly as the chip-level memory density increases. For a 48-Mb memory density, a 27% reduction in array switching power can be achieved for a read access yield target of 95%. In addition, the proposed system can provide larger power saving as process variations increase, which makes it a very attractive solution for 45-nm-and-below technologies.

APPENDIX

EXPRESSION FOR THE $t_{WL}$ PDF ($f_{t_{WL}}$)

In this section, we derive the PDF for $t_{WL}$ as shown in (7). Given a function $y = g(x)$, where the PDF of $x$ is known, the PDF of $y$ can be expressed as [21]

$$f_{y}(y) = \sum_{i} \frac{f_{x}(x_{i})}{\left|g'(x_{i})\right|} \qquad (9)$$

where $f_{x}$ and $f_{y}$ are the PDFs of $x$ and $y$, respectively, $g'$ is the derivative of $g$, and the $x_{i}$ are the real roots of the equation $y = g(x)$.

Therefore, to determine the PDF for $t_{WL}$ given the PDF of $I_{read}$, we begin with the relationship between $t_{WL}$ and $I_{read}$

$$t_{WL} = \frac{N_{WL}\, C_{BL}\, \Delta V_{min}}{I_{read}} \qquad (10)$$

and the derivative w.r.t. $I_{read}$

$$\frac{d t_{WL}}{d I_{read}} = -\frac{N_{WL}\, C_{BL}\, \Delta V_{min}}{I_{read}^{2}} \qquad (11)$$

Equation (10) has a single solution with $I_{read} = N_{WL} C_{BL} \Delta V_{min}/t_{WL}$. Using (9), (10), and (11)

$$\left|\frac{d t_{WL}}{d I_{read}}\right| = \frac{N_{WL}\, C_{BL}\, \Delta V_{min}}{I_{read}^{2}} = \frac{t_{WL}^{2}}{N_{WL}\, C_{BL}\, \Delta V_{min}} \qquad (12)$$

$$f_{t_{WL}}(t_{WL}) = \frac{f_{I_{read}}\!\left(N_{WL} C_{BL} \Delta V_{min}/t_{WL}\right)}{\left|d t_{WL}/d I_{read}\right|} \qquad (13)$$

Therefore

$$f_{t_{WL}}(t_{WL}) = \frac{N_{WL}\, C_{BL}\, \Delta V_{min}}{t_{WL}^{2}}\; f_{I_{read}}\!\left(\frac{N_{WL}\, C_{BL}\, \Delta V_{min}}{t_{WL}}\right) \quad \text{for } t_{WL} > 0 \qquad (14)$$

where $f_{I_{read}}$ is the PDF for $I_{read}$, which is a Normal distribution.

REFERENCES

[1] Y. Zorian and S. Shoukourian, “Embedded-memory test and repair: Infrastructure IP for SoC yield,” IEEE Des. Test Comput., vol. 20, no. 3, pp. 58–66, May/Jun. 2003.

[2] M. Pelgrom, H. Tuinhout, and M. Vertregt, “Transistor matching in analog CMOS applications,” in IEDM Tech. Dig., 1998, pp. 915–918.

[3] T.-C. Chen, “Where is CMOS going: Trendy hype versus real technology,” in Proc. ISSCC, 2006, pp. 22–28.

[4] R. Heald and P. Wang, “Variability in sub-100 nm SRAM designs,” in Proc. Int. Conf. Comput. Aided Des., 2004, pp. 347–352.

[5] K. Agarwal and S. Nassif, “Statistical analysis of SRAM cell stability,” in Proc. 43rd DAC, 2006, pp. 57–62.

[6] R. Venkatraman, R. Castagnetti, and S. Ramesh, “The statistics of device variations and its impact on SRAM bitcell performance, leakage and stability,” in Proc. ISQED, 2006, pp. 190–195.

[7] M. Q. Do, M. Drazdziulis, P. Larsson-Edefors, and L. Bengtsson, “Parameterizable architecture-level SRAM power model using circuit-simulation backend for leakage calibration,” in Proc. ISQED, 2006, pp. 557–563.

[8] A. Macii, L. Benini, and M. Poncino, Memory Design Techniques for Low Energy Embedded Systems. Norwell, MA: Kluwer, 2002.

[9] B. Amrutur and M. Horowitz, “A replica technique for wordline and sense control in low-power SRAM’s,” IEEE J. Solid-State Circuits, vol. 33, no. 8, pp. 1208–1219, Aug. 1998.

[10] K. Osada, J.-U. Shin, M. Khan, Y.-D. Liou, K. Wang, K. Shoji, K. Kuroda, S. Ikeda, and K. Ishibashi, “Universal-Vdd 0.65–2.0 V 32 kB cache using voltage-adapted timing-generation scheme and a lithographical-symmetric cell,” in Proc. ISSCC, 2001, pp. 168–169.


[11] M. Yamaoka, N. Maeda, Y. Shinozaki, Y. Shimazaki, K. Nii, S. Shimada, K. Yanagisawa, and T. Kawahara, “Low-power embedded SRAM modules with expanded margins for writing,” in Proc. ISSCC, 2005, vol. 1, pp. 480–611.

[12] D. Marculescu and E. Talpes, “Variability and energy awareness: A microarchitecture-level perspective,” in Proc. 42nd DAC, 2005, pp. 11–16.

[13] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, “Parameter variations and impact on circuits and microarchitecture,” in Proc. 40th DAC, 2003, pp. 338–342.

[14] M. H. Abu-Rahma, K. Chowdhury, J. Wang, Z. Chen, S. S. Yoon, and M. Anis, “A methodology for statistical estimation of read access yield in SRAMs,” in Proc. 45th DAC, 2008, pp. 205–210.

[15] S. Mukhopadhyay, H. Mahmoodi, and K. Roy, “Statistical design and optimization of SRAM cell for yield enhancement,” in Proc. Int. Conf. Comput. Aided Des., 2004, pp. 10–13.

[16] T. Fischer, C. Otte, D. Schmitt-Landsiedel, E. Amirante, A. Olbrich, P. Huber, M. Ostermayr, T. Nirschl, and J. Einfeld, “A 1 Mbit SRAM test structure to analyze local mismatch beyond 5 sigma variation,” in Proc. IEEE ICMTS, 2007, pp. 63–66.

[17] R. Rao, A. Srivastava, D. Blaauw, and D. Sylvester, “Statistical analysis of subthreshold leakage current for VLSI circuits,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 2, pp. 131–139, Feb. 2004.

[18] P. Kinget, “Device mismatch and tradeoffs in the design of analog circuits,” IEEE J. Solid-State Circuits, vol. 40, no. 6, pp. 1212–1224, Jun. 2005.

[19] B. Wicht, T. Nirschl, and D. Schmitt-Landsiedel, “Yield and speed optimization of a latch-type voltage sense amplifier,” IEEE J. Solid-State Circuits, vol. 39, no. 7, pp. 1148–1158, Jul. 2004.

[20] S. Mukhopadhyay, K. Kim, K. Jenkins, C.-T. Chuang, and K. Roy, “Statistical characterization and on-chip measurement methods for local random variability of a process using sense-amplifier-based test structure,” in Proc. ISSCC, Feb. 2007, pp. 400–611.

[21] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 3rd ed. New York: McGraw-Hill, 1991.

[22] R. Rajsuman, “Design and test of large embedded memories: An overview,” IEEE Des. Test Comput., vol. 18, no. 3, pp. 16–27, May 2001.

[23] W. Kever, S. Ziai, M. Hill, D. Weiss, and B. Stackhouse, “A 200 MHz RISC microprocessor with 128 kB on-chip caches,” in Proc. ISSCC, Feb. 1997, pp. 410–411.

[24] Y. H. Chan, T. J. Charest, J. R. Rawlins, A. D. Tuminaro, J. K. Wadhwa, and O. M. Wagner, “Programmable sense amplifier timing generator,” U.S. Patent 6 958 943, Oct. 25, 2005.

[25] A. Chandrakasan, W. J. Bowhill, and F. Fox, Design of High-Performance Microprocessor Circuits. New York: Wiley-IEEE Press, 2000.

[26] M. Min, P. Maurine, M. Bastian, and M. Robert, “A novel dummy bitline driver for read margin improvement in an eSRAM,” in Proc. IEEE Int. Symp. DELTA, Jan. 2008, pp. 107–110.

[27] R. Joshi, R. Houle, D. Rodko, P. Patel, W. Huott, R. Franch, Y. Chan, D. Plass, S. Wilson, S. Wu, and R. Kanj, “A high performance 2.4 Mb L1 and L2 cache compatible 45 nm SRAM with yield improvement capabilities,” in Proc. IEEE Symp. VLSI Circuits, Jun. 2008, pp. 208–209.

[28] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2002.

[29] H. Yamauchi, “Embedded SRAM circuit design technologies for a 45 nm and beyond,” in Proc. 7th Int. Conf. ASIC, Oct. 2007, pp. 1028–1033.

[30] K. Kang, S. P. Park, K. Roy, and M. A. Alam, “Estimation of statistical variation in temporal NBTI degradation and its impact on lifetime circuit performance,” in Proc. IEEE/ACM ICCAD, 2007, pp. 730–734.

[31] X. Li, J. Qin, B. Huang, X. Zhang, and J. Bernstein, “SRAM circuit-failure modeling and reliability simulation with SPICE,” IEEE Trans. Device Mater. Rel., vol. 6, no. 2, pp. 235–246, Jun. 2006.

[32] M. Khellah, Y. Ye, N. Kim, D. Somasekhar, G. Pandya, A. Farhang, K. Zhang, C. Webb, and V. De, “Wordline and bitline pulsing schemes for improving SRAM cell stability in low Vcc 65 nm CMOS designs,” in Proc. IEEE Symp. VLSI Circuits, 2006, pp. 9–10.

[33] S. Ikeda, Y. Yoshida, K. Ishibashi, and Y. Mitsui, “Failure analysis of 6T SRAM on low-voltage and high-frequency operation,” IEEE Trans. Electron Devices, vol. 50, no. 5, pp. 1270–1276, May 2003.

[34] J. B. Khare, A. B. Shah, A. Raman, and G. Rayas, “Embedded memory field returns—Trials and tribulations,” in Proc. IEEE ITC, Oct. 2006, pp. 1–6.

[35] M. Yamaoka, K. Osada, and T. Kawahara, “A cell-activation-time controlled SRAM for low-voltage operation in DVFS SoCs using dynamic stability analysis,” in Proc. 34th ESSCIRC, Sep. 2008, pp. 286–289.

Mohamed H. Abu-Rahma (S’01–M’09) received the B.Sc. degree (with honors) in electronics and communication engineering from Ain Shams University, Cairo, Egypt, in 2001, the M.Sc. degree in electronics and communication engineering from Cairo University, Cairo, Egypt, in 2004, and the Ph.D. degree in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 2009.

From 2001 to 2004, he was with Mentor Graphics, Egypt, where he worked on MOSFET compact model development and extraction. He has been with Qualcomm Incorporated, San Diego, CA, since 2005, where he has been engaged in the research and development of low-power embedded SRAM and CMOS circuits. He is the inventor/coinventor of six pending patents. His research interests include low-power digital circuits, variation-tolerant memory design, and statistical design methodologies.

Mohab Anis (S’98–M’03) received the B.Sc. degree (with honors) in electronics and communication engineering from Cairo University, Cairo, Egypt, in 1997 and the M.A.Sc. and Ph.D. degrees in electrical engineering from the University of Waterloo, Waterloo, ON, Canada, in 1999 and 2003, respectively.

He is currently an Assistant Professor and the Codirector of the VLSI Research Group, Department of Electrical and Computer Engineering, University of Waterloo. He is the Cofounder of Spry Design Automation. He has authored/coauthored over 60 papers in international journals and conferences and is the author of the book Multi-Threshold CMOS Digital Circuits—Managing Leakage Power (Kluwer, 2003). He is an Associate Editor of the Journal of Circuits, Systems and Computers, the American Scientific Publishers Journal of Low Power Electronics, and the Journal of VLSI Design. His research interests include integrated circuit design and design automation for VLSI systems in the deep submicrometer regime.

Dr. Anis is an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS–II. He is also a member of the program committee for several IEEE conferences. He was the recipient of the 2004 Douglas R. Colton Medal for Research Excellence in recognition of his excellence in research leading to new understanding and novel developments in microsystems in Canada and won the 2002 International Low-Power Design Contest.

Sei Seung Yoon received the B.S.E.E. degree from Yonsei University, Korea, in 1988.

He joined the Qualcomm Memory Design Team, San Diego, CA, in 2004. Prior to joining Qualcomm, he had more than 16 years of experience at Micron Technology, T-RAM, and Samsung as a memory circuit designer. He is currently working on low-power and high-performance embedded SRAM design. He holds more than 20 U.S. patents.

