Mitigating Bitline Crosstalk Noise in DRAM Memories
Seyed Mohammad Seyedzadeh, Donald Kline Jr, Alex K. Jones, Rami Melhem
University of Pittsburgh
[email protected], {dek61,akjones}@pitt.edu, [email protected]
people.cs.pitt.edu/~seyedzadeh/download/MEMSYS17.pdf

ABSTRACT
DRAM cells in deeply scaled CMOS confront significant challenges to ensure reliable operation. Parasitic capacitances induced by certain bit storage patterns, or bad patterns, create coupling noise that can cause crosstalk-induced faults when the coupling exceeds tolerable margins. These margins decrease and their variabilities increase with scaling, leading to weak cells that are highly susceptible to this form of crosstalk.

This paper explores coding techniques to address row-based crosstalk. First, n-to-m bit encoding is explored to remove bad bit patterns from code words. Second, a Periodic Flip Encoding (PFE) technique is proposed to flip specific bits in a repeated pattern with different offsets and produce multiple code word candidates. PFE encoding can be used in a fault-oblivious or fault-aware fashion. Fault-oblivious PFE mitigates faults when the location of weak cells is unknown by minimizing the number of bad patterns in the encoded data. Fault-aware PFE avoids faults when the location of the weak cells is known by selecting the code word in which the center of any bad pattern does not coincide with a weak cell. Fault-aware and fault-oblivious PFE provide two fault tolerance solutions with a trade-off between reliability improvement and performance and power overheads.

Experimental evaluation demonstrates that PFE outperforms n-to-m bit encoding as well as other leading approaches, including error correction pointers (ECP) and error correction codes (ECC). For example, when 0.01% of the cells are weak, fault-aware PFE achieves an Uncorrectable Bit Error Rate (UBER) smaller than 3 × 10^-12 compared to 1.4 × 10^-6 for ECP and 6.8 × 10^-6 for ECC-1. When a relatively high 1% of the cells are weak, fault-aware PFE improves the UBER by more than two orders of magnitude compared to ECC-1 and one order of magnitude compared to ECP. This is accomplished in both cases with a low performance overhead of 1-2%, depending on the hardware implementation.

1 INTRODUCTION
Dynamic random access memory (DRAM) has been the primary building block of main memory systems in recent decades. Advancements in integrated circuit process technology have resulted in shrinking circuit and interconnect feature sizes, allowing increases in storage density. Unfortunately, this scaling has negative impacts on memory cell function, which can jeopardize reliability [19, 21]. For example, smaller DRAM cells hold a decreasing amount of charge, resulting in a lower noise margin [8, 28, 32, 54], which causes them to more easily lose their data. Additionally, due to the increasing proximity of cells, electromagnetic coupling [9, 15, 16, 22, 24, 27, 28, 36] between cells also increases. This increased coupling causes cells to interact in undesirable ways. Moreover, the parasitic capacitance among bitlines due to higher process variation results in some weak cells that are more vulnerable to undesired coupling with their neighboring cells. Therefore, it becomes increasingly common for a DRAM cell to become disturbed beyond its coupling noise margin, resulting in a disturbance fault from this crosstalk.

To mitigate crosstalk-induced faults, DRAM manufacturers have typically employed solutions for improving inter-cell isolation through circuit-level techniques [3, 35, 59] and screening for disturbance errors during post-production testing [2, 40, 45, 53]. Although these approaches reduce the coupling noise, they have several disadvantages. First, they increase the complexity of the layout as well as the likelihood of creating short circuits between non-adjacent bitlines. Second, they can decrease storage density due to more complex layouts. Third, some solutions for addressing one form of crosstalk actually exacerbate another source of crosstalk. Therefore, some DRAM manufacturers have opted to abandon circuit techniques in favor of the open bitline structure at the expense of higher crosstalk noise, particularly between bitlines [28].

A number of studies [16, 54, 58] have concentrated on how the coupling process depends on the data stored in DRAM cells, a dependence referred to as neighborhood-pattern faults. Neighborhood-pattern faults occur when certain data patterns (i.e., bad patterns) induce faults from coupling. The specific bad patterns depend on the memory design and layout. Solutions to problems resulting from data pattern dependencies have been proposed for other memory technologies based on minimizing the destructive data patterns in data blocks. An example is an encoding that maps n-bit data words to m-bit code words, reducing destructive patterns at the cost of a space overhead. A 3-bit to 4-bit insulation mapping was proposed to remove any two consecutive zeros to mitigate write disturbance errors in phase change memory while incurring a 33.3% space overhead [13].

In this paper, we study the applicability of an n-bit to m-bit insulation encoding for avoiding specific bit patterns (bad patterns) vulnerable to crosstalk in DRAM cells, while minimizing the encoding overhead. In particular, a Four-bit to Five-bit Encoding (FFE) is derived that partitions the original data into 4-bit groups (i.e., words) and then maps these 4-bit data words into 5-bit code words. It is also proven by contradiction that this incurs the minimum space overhead among insulation coding-based techniques. FFE is fault-oblivious as it minimizes the number of bad patterns that can lead to faults rather than the number of actual faults.

In this paper, a low overhead technique, called Periodic Flip Encoding (PFE), is also proposed to avoid bad patterns that lead to crosstalk. This technique reduces bitline crosstalk without harming the protection against wordline crosstalk. PFE first partitions the data into groups and then flips the same bit position of each group. For groups of three bits, this approach provides four different code words for a block, and only two auxiliary bits are needed to specify the code word used. In the absence of information about weak cells, PFE is fault-oblivious and selects a code word that decreases the number of bad patterns. However, when the locations of the weak cells are known, PFE may also be employed in a fault-aware fashion by selecting a code word that may have a bad pattern, as long as the center of the bad pattern does not overlap with a weak cell.

This paper makes the following contributions:

• To our knowledge, this is the first work that uses coding-based techniques to mitigate bitline crosstalk in DRAM cells by reducing the number of bit patterns vulnerable to crosstalk.

• It explores the applicability of n-bit to m-bit coding to crosstalk fault mitigation.

• It introduces fault-oblivious and fault-aware periodic flip encodings that increase the encoding space in return for crosstalk fault mitigation.

• It provides a characterization of the fault mitigation of the proposed techniques and conducts an extensive study to illustrate the tradeoffs between reliability, cost, performance, and power.

2 BACKGROUND AND RELATED WORK
DRAM is built from a two-dimensional array of cells. It consists of memory cells at the intersections of bitline pairs and wordlines. The memory cell is composed of a transistor and a capacitor in which the data is stored. Depending on whether its capacitor is fully charged or fully discharged, a cell is either in the charged state or in the discharged state, respectively. These two states are utilized to represent a '0' or '1' data value. Each bitline pair is connected to a sense amplifier (SA) that consists of a latch with some auxiliary transistors to control its operation. The read operation of a memory cell using the bitline pair consists of three stages: precharging, turning on the wordline, and turning on the SA. The voltages of the bitlines connected to the same SA depend on the values of the cells. Therefore, the stored voltage levels determine how much influence the coupling capacitance between the bitlines will have on the final values [22, 41]. Since each bitline is coupled to the adjacent bitline in the neighboring bitline pair, the probability that crosstalk noise causes a false output from the read operation increases. The probability of a false output further depends on the stored data patterns due to the various coupling capacitances.

2.1 Origins of Crosstalk
Crosstalk occurs due to parallel bitlines on the memory chip surface, which are particularly prone to relatively large capacitive coupling noise from adjacent bitlines. At smaller feature sizes, the integration density of memory devices increases and the cell voltage does not scale down proportionally to the feature size. Consequently, weak cell signals are not sensed reliably, aggravating problems (faults) associated with bitline coupling noise.

In DRAM, the coupling capacitances among bitlines, among wordlines, and between wordlines and bitlines are the root causes of crosstalk [41]. Coupling capacitances between wordlines (i.e., inter-row crosstalk) can cause data retention problems. When cells from neighboring wordlines are accessed in DRAM, a value stored in a given cell can be changed. Due to capacitive coupling between memory cells on adjacent wordlines, voltage levels used while accessing data on one wordline can affect data quality on non-accessed neighboring wordlines. This is the root of the crosstalk problem highlighted as "row hammering" [18, 47] and is sometimes referred to as "read disturbance crosstalk." Lowering the effective resistance of wordlines is one solution to efficiently decrease this coupling's voltage magnitude [5]. Crosstalk between bitlines and wordlines similarly affects data retention, but to a smaller extent.

Coupling between bitlines (i.e., intra-row crosstalk) is also a significant form of crosstalk [3, 4, 16, 24, 28, 38, 43]. This can be manifested in deeply scaled DRAM as the value of a weak cell (due to process variation) is deflected by inadvertent capacitive coupling with its adjacent bitlines [57, 58]. Bitline crosstalk occurs for two fundamental reasons: first, the long parallel bitlines are often switched simultaneously during read and write operations; second, they require particularly sensitive voltage-sensing circuits.

Recently, wordline crosstalk has been the focus of attention in modern memories due to the row hammering security exploit. While bitline crosstalk has long been known to be a problem [38], it has, until recently, been successfully mitigated using the circuit technique of bitline twisting [4]. Unfortunately, bitline twisting is known to exacerbate wordline crosstalk, which is a much more difficult problem to solve as it is harder to detect [10]. Moreover, the trend toward open bitline approaches [28] reintroduces bitline crosstalk as a significant concern. Thus, an architectural solution to mitigate bitline crosstalk can also benefit wordline crosstalk mitigation. In the next section we focus on the analysis of data pattern dependency between bitlines.

2.2 Analysis of Data Pattern Dependency
The faults resulting from crosstalk noise depend on the DRAM architecture (e.g., open or folded bitline architecture), bitline twisting, DRAM cell characteristics (true or anti cell), coupling during precharge sensing or post-sensing, etc. [3, 4, 16, 28]. The worst case patterns for different array structures are quite different. For example, patterns that alternate between '0' and '1' in the open bitline structure increase crosstalk noise more than the all-zeros and all-ones patterns, whereas the worst case patterns in the folded bitline array are all zeros and all ones [22, 26, 27, 36, 38, 58]. The pre-sense coupling noise is generated after the wordline is activated and cells are accessed but before the sense amplifier is activated. The noise on a given floating bitline results from coupling between the two bitlines adjacent to the victim bitline [3, 26]. The severity of coupling noise depends on the background data stored in the accessed cells. The worst-case background here is when both neighboring cells contain the same data (either both '0' or both '1'). In contrast, post-sense coupling noise is generated after the sense amplifier is activated and requires a circuit solution to decrease the time difference between the sense amplifier activation time and the time the result is determined. In our paper, we propose a solution for mitigating bad patterns in both the folded and open bitline architectures.

2.3 Related Work
In this section we discuss relevant related work in the area of mitigation of crosstalk in DRAM and advances in error correction schemes that can be applied in DRAM systems.

2.3.1 Crosstalk in DRAM. As described in Section 2.1, DRAM crosstalk can be categorized into two main groups: wordline crosstalk and bitline crosstalk. For the first group, there has been a relatively significant body of work proposed for determining wordlines that are more likely to suffer from read disturbance, where reads to a localized region of the memory can cause inter-row crosstalk. Specifically, Kim et al. analyzed the crosstalk disturbance probability in DRAM cells by considering adversarial access patterns that open and close rows just frequently enough to create bit changes within a refresh interval. They proposed a probabilistic adjacent row activation method to refresh the impacted neighboring rows through a weighted randomization method (i.e., flipping a biased coin) at the memory controller [21]. Another approach was to leverage a counter-based row activation method that counts the number of accesses per row and refreshes rows that reach a predefined threshold [18]. Edward et al. presented a similar refreshing mechanism for neighboring cells that detects "hot rows" [51]. Based on determining which rows were subjected to potential crosstalk, the system can refresh the neighboring data while keeping the rest of the block intact. These solutions avoid the need for creating increasingly short refresh intervals to eliminate read disturbance errors, which have negative performance and energy-efficiency impacts [1].

Regarding the second group, as mentioned in Section 2.1, bitline twisting [4] is a circuit technique that mitigates bitline crosstalk by "twisting" complementary bitline pairs such that the bitline and its complement are equally exposed to the adjacent bitline and complement to cancel out coupling. Unfortunately, bitline twisting in DRAM can exacerbate the likelihood of wordline read disturbance crosstalk [10]. Moreover, bitline crosstalk has been shown to be present and significant in modern memory products.

DRAM vendors internally scramble the layout of bits [16, 17, 28] so that adjacent bits in an addressed word do not necessarily map to adjacent cells in the physical DRAM arrays. An efficient system-level technique, PARBOR, was proposed to use errors due to bitline crosstalk to determine the locations of physically neighboring cells in DRAM [16]. Using PARBOR, it is possible to determine the logical to physical bit mapping (i.e., unscramble the bit locations) and enumerate the patterns most likely to incur data dependent failures (bad patterns).

To our knowledge, this paper is the first work to minimize bitline crosstalk through coding techniques. These techniques attempt to mitigate this crosstalk by preventing the middle bit of a bad pattern from being stored in one of the weak cells present in the memory due to scaling and process variation.

2.3.2 Error Correction. Amongst the existing error correction coding (ECC) schemes, the Hamming coding scheme is the most popular and is currently adopted to correct single transient errors in DRAM memories. Server-grade systems apply traditional ECC to correct read disturbance errors. However, the high overhead of multi-bit ECC correction is only justified when the likelihood of multi-bit errors in the system is high. Error Correction Pointers (ECP) has recently been proposed [44] as a more efficient method compared to Hamming codes from the standpoint of both hardware overhead and failure recovery [46]. ECP is designed to tolerate a large number of "stuck-at" faults in phase change memory by recording the position of failed bits in the line and storing their correct value. For example, a 64-byte line requires a 9-bit pointer plus one replacement bit, resulting in a total of 10 bits for each ECP entry.

Our goal in this work is to present a low overhead encoding that mitigates bitline crosstalk noise in DRAM memory. Our techniques apply to either the folded bitline or open bitline architectures, where '000' and '111' or '010' and '101' induce the worst-case crosstalk (i.e., bad patterns), respectively [22, 26, 27, 36, 38, 58]. To illustrate our techniques, we focus on addressing crosstalk in the folded bitline architecture, where '000' and '111' are bad patterns, and show in Section 5 how to generalize the techniques to the open bitline architecture. We start, in the next section, by exploring n-bit to m-bit encoding that removes bad patterns from every group of n bits and study the effectiveness of such encoding in removing bad patterns from larger data blocks. The scheme is an extension of the 2-to-3 bit and 3-to-4 bit encodings that were previously applied to avoid write disturbances [13] and minimize write latency [25, 31, 48-50] in phase change memory, respectively.

3 N-BIT TO M-BIT ENCODING
In this section, we develop an n-bit to (n + 1)-bit encoding that eliminates 3-bit bad patterns such that n is maximized to reduce the code overhead. We prove that the maximum value for n is 4 by showing that, for n > 4, fewer than 2^n of the 2^(n+1) (n + 1)-bit words are free of bad patterns.

A binary sequence b = b1 b2 ... bm is said not to include any bad pattern if bi−1 bi bi+1 ≠ 000 and bi−1 bi bi+1 ≠ 111 for all 2 ≤ i ≤ m − 1. The goal of an (n, m) encoding for DRAM cells is to eliminate bad patterns from n-bit groups, where n and m denote the original data word size and the code word size, respectively.
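This definition can be stated operationally. The sketch below (Python, with hypothetical helper names not taken from the paper) counts overlapping 3-bit windows that match either bad pattern:

```python
# Count occurrences of the bad patterns '000' and '111' in a bit string,
# scanning every overlapping 3-bit window as in the definition above.
def count_bad_patterns(bits: str) -> int:
    return sum(1 for i in range(len(bits) - 2)
               if bits[i:i + 3] in ("000", "111"))

def is_bad_pattern_free(bits: str) -> bool:
    return count_bad_patterns(bits) == 0
```

Note that windows overlap: a run of five zeros contains three distinct '000' occurrences.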

To map n-bit data words to m-bit code words that do not have three consecutive zeros or ones, the space of 2^m elements must include 2^n elements which do not contain three consecutive zeros or ones. The number of strings without three consecutive ones is given by the following well-known integer Fibonacci-style recurrence [39]:

f3(m) = f3(m − 1) + f3(m − 2) + f3(m − 3) (1)

for m > 3, with f3(1) = 2, f3(2) = 4, and f3(3) = 7. Note that the zero and one bits in binary strings are symmetric, so the number of binary strings which do not contain three consecutive ones is the same as the number of binary strings that do not contain three consecutive zeros. To derive a formula for the number of m-bit strings in the space of 2^m m-bit strings that do not have any three consecutive zeros or ones, we follow Eq. 2 to compute the number of m-bit strings with at least one run of three consecutive zeros or ones as follows:

am = 2 am−1 + 2^(m−3) − am−3,   a0 = a1 = a2 = 0, a3 = 2    (2)

where am denotes the number of m-bit strings that include at least one sequence of three consecutive zeros or ones. Parameters a0, a1, a2, and a3 are the initial values of the recursive equation. Note that a3 = 2 because there are exactly two 3-bit strings, '000' and '111', that contain a bad pattern. Accordingly, the number of m-bit strings containing neither '000' nor '111' can be expressed as bm = 2^m − am.
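Recurrence (2) can be cross-checked against exhaustive enumeration of all m-bit strings. The sketch below (illustrative helper names) confirms the am values of Table 1 and the key fact b5 = 16:

```python
from itertools import product

# a(m): recurrence (2) for the number of m-bit strings containing
# '000' or '111'; a_brute(m): the same count by brute-force enumeration.
def a(m: int) -> int:
    vals = [0, 0, 0, 2]                     # a_0 .. a_3
    for k in range(4, m + 1):
        vals.append(2 * vals[k - 1] + 2 ** (k - 3) - vals[k - 3])
    return vals[m]

def a_brute(m: int) -> int:
    return sum(1 for bits in product("01", repeat=m)
               if "000" in "".join(bits) or "111" in "".join(bits))

for m in range(3, 11):
    assert a(m) == a_brute(m)               # matches the a_m row of Table 1
assert 2 ** 5 - a(5) == 16                  # b_5 = 16 bad-pattern-free strings
```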

Table 1 shows the binary string size m; the number of strings, am, containing at least one substring of 000's or 111's; and the number of binary strings, bm, not containing substrings of 000's or 111's. According to Table 1, there are 16 elements in the space of 2^5 = 32 5-bit strings that do not contain any '000' or '111', so we can use a one-to-one mapping for encoding 4-bit strings by 5-bit strings that removes all possible 000's or 111's and incurs a 25% overhead. Note that for any m > 5, bm < 2^(m−1), which means that we cannot find 2^(m−1) m-bit words that do not contain '000' or '111', and hence no (m − 1)-bit to m-bit encoding can remove all occurrences of '000' and '111'.

Table 1: The number of m-bit binary strings: containing any specific 3-bit substring (fm); containing substrings of 000's or 111's (am); not containing substrings of 000's or 111's (bm).

m    1  2  3  4   5   6   7   8    9    10
fm   0  0  7  13  24  44  81  149  274  504
am   0  0  2  6   16  38  86  188  402  846
bm   0  0  6  10  16  26  42  68   110  178

Table 2: FFE mapping to eliminate '000' and '111'.

Data word:  0000   0001   0010   0011   0100   0101   0110   0111
Code word:  10101  00101  00110  01001  11011  01011  01100  00100
Data word:  1000   1001   1010   1011   1100   1101   1110   1111
Code word:  01101  10010  10011  10100  10110  11001  11010  01010

Table 2 presents the code words of the 4-bit to 5-bit encoding (4,5) for DRAM cells such that none of the 16 5-bit code words contains '000' or '111'. Since the most frequent patterns existing in real benchmarks are '0000' and '1111', we selected code words '10101' and '01010' to represent these values, respectively. This ensures that no bad patterns are created through runs of 0's or runs of 1's. In other words, the encodings of '00...00' and '11...11' are guaranteed not to include '000' or '111'.
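The mapping of Table 2 can be written directly as a lookup table. The sketch below (hypothetical function names) encodes and decodes 4-bit groups and checks that every code word is free of bad patterns:

```python
# FFE (4-bit to 5-bit) encoding per Table 2; a one-to-one mapping, so
# decoding is just the inverted dictionary.
FFE = {"0000": "10101", "0001": "00101", "0010": "00110", "0011": "01001",
       "0100": "11011", "0101": "01011", "0110": "01100", "0111": "00100",
       "1000": "01101", "1001": "10010", "1010": "10011", "1011": "10100",
       "1100": "10110", "1101": "11001", "1110": "11010", "1111": "01010"}
FFE_INV = {c: d for d, c in FFE.items()}

def ffe_encode(data: str) -> str:
    assert len(data) % 4 == 0
    return "".join(FFE[data[i:i + 4]] for i in range(0, len(data), 4))

def ffe_decode(code: str) -> str:
    return "".join(FFE_INV[code[i:i + 5]] for i in range(0, len(code), 5))

# No individual code word contains a bad pattern.
assert all("000" not in c and "111" not in c for c in FFE.values())
# The 0x638 example of Figure 1: 0110 0011 1000 -> 01100 01001 01101.
assert ffe_encode("011000111000") == "011000100101101"
assert ffe_decode("011000100101101") == "011000111000"
```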

Consider Figure 1, which depicts a 12-bit data word that includes two '000' bad patterns and one '111' bad pattern. The 4-to-5 bit encoding (FFE) utilizes the mapping in Table 2 to map each 4-bit data word to a 5-bit code word. While there are two '000' substrings and one '111' substring in the data word 0x638, FFE removes all of these bad patterns, and the encoded result contains only one '000' substring, created by concatenating the first and second code words. This demonstrates that although each code word is free of bad patterns, bad patterns can still occur across adjacent encoded words when a code word that ends with '00' (or '11') is followed by a code word that starts with '0' (or '1'). The same occurs when a code word that ends with '0' (or '1') is followed by one that starts with '00' (or '11').

Assuming random data, we can estimate the expected number of bad patterns in an M-bit encoded string resulting from concatenating M/5 5-bit code words (c1, c2, ..., cM/5). To do this, we use Table 2 to count the number, E00, of 5-bit code words that end with '00'. We also count the number, E10, of code words that end with '10' (i.e., end with one zero but not two zeros). According to the table, E00 = 3 and E10 = 5. The number of code words starting with '01' and '00' can also be counted, resulting in S01 = 5 and S00 = 3. Note that the total number of code words that end with '10' or '00' is the same as the number of code words that end with '01' or '11'. While appending code words ending with '10' or '00' to code words starting with '00' or '01' generates E10 × S00 + E00 × S01 occurrences of '000', appending code words ending with '00' to code words starting with '00' produces 2 × E00 × S00 occurrences of '000'. When two 5-bit code words c1 and c2 are randomly selected to be concatenated, the expected Number of Bad Patterns (NBP), i.e., '000' and '111', in the resulting 10-bit string can be calculated as:

NBP^10_FFE = 2 × (2 × S00 × E00 + E10 × S00 + E00 × S01) / (b5)^2    (3)

Figure 1: Encoding the 0x638 string using FFE. Data word: 0110 0011 1000; encoded code word: 01100 01001 01101.

where the multiplicative factor of 2 reflects the symmetry with respect to '000' and '111'. When M/5 5-bit code words c1...cM/5 are concatenated, the expected number of bad patterns increases by a factor of (M/5) − 1 and can be expressed as:

NBP^M_FFE = ((M − 5)/5) × NBP^10_FFE    (4)
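Equations (3) and (4) can be checked by brute force over all ordered pairs of Table 2 code words (only junction windows can form bad patterns, since each code word is bad-pattern-free). A sketch, with illustrative names:

```python
from itertools import product

# Count E00, E10, S00, S01 from the Table 2 code word list, evaluate Eq. (3),
# and verify it against exhaustive enumeration of all 16 x 16 = 256 pairs.
CODES = ["10101", "00101", "00110", "01001", "11011", "01011", "01100",
         "00100", "01101", "10010", "10011", "10100", "10110", "11001",
         "11010", "01010"]
E00 = sum(c.endswith("00") for c in CODES)
E10 = sum(c.endswith("10") for c in CODES)
S00 = sum(c.startswith("00") for c in CODES)
S01 = sum(c.startswith("01") for c in CODES)

def count_bad(bits: str) -> int:
    return sum(bits[i:i + 3] in ("000", "111") for i in range(len(bits) - 2))

nbp10 = 2 * (2 * S00 * E00 + E10 * S00 + E00 * S01) / len(CODES) ** 2
brute = sum(count_bad(c1 + c2) for c1, c2 in product(CODES, CODES)) / 256
assert (E00, E10, S00, S01) == (3, 5, 3, 5)
assert nbp10 == brute == 0.375              # Figure 2, block size 8
assert (20 - 5) / 5 * nbp10 == 1.125        # Eq. (4) for M = 20 (block size 16)
```

The values 0.375 and 1.125 match the FFE curve of Figure 2 for 8-bit and 16-bit blocks.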

To compare with the expected number of bad patterns in a random unencoded word, we note that, given a memory block of size N, the expected number of existing bad patterns can be expressed as:

NBP^N_RND = [ Σ_{i=0}^{N−3} (N − 2)! / ((N − i − 3)! × i!) ] / 2^(N−1)    (5)
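Eq. (5) can likewise be checked numerically against the exact expected count obtained by enumerating all N-bit strings. A sketch with illustrative names:

```python
from itertools import product
from math import factorial

# Evaluate the closed-form sum of Eq. (5) and compare it with the exact
# average bad-pattern count over all 2^N equally likely N-bit strings.
def nbp_rnd(n: int) -> float:
    s = sum(factorial(n - 2) // (factorial(n - i - 3) * factorial(i))
            for i in range(n - 2))          # i = 0 .. N-3
    return s / 2 ** (n - 1)

def count_bad(bits: str) -> int:
    return sum(bits[i:i + 3] in ("000", "111") for i in range(len(bits) - 2))

for n in (8, 16):
    exact = sum(count_bad("".join(w)) for w in product("01", repeat=n)) / 2 ** n
    assert nbp_rnd(n) == exact
assert nbp_rnd(8) == 1.5                    # matches Figure 2 for N = 8
```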

where N ≥ 3. Figure 2 shows the average number of bad patterns after concatenating 5-bit substrings versus the number of bad patterns existing in N-bit original data blocks. As the block size grows, FFE reduces the number of bad patterns on average by about 63% for blocks of random data.

Note that if only one of the two patterns must be avoided (e.g., '000' but not '111'), there are more than 256 9-bit strings that do not contain '000', and hence we can develop an eight-bit to nine-bit encoding that reduces the encoding overhead from 25% to 12.5% in comparison to the FFE technique.

4 PERIODIC FLIPPING ENCODING (PFE)
The FFE encoding scheme presented in the previous section incurs a large overhead. In this section, we present a simple and effective solution to reduce the number of '000' and '111' sequences in data to be written, which is based on flipping '0' and '1' bits of sub-blocks interleaved throughout the data. Specifically, to write an N-bit data word W = x0, ..., xN−1, we propose a low overhead PFE technique that reduces the occurrence of bad patterns by partitioning W into 3-bit groups. We either keep the original data unchanged or flip a specific bit (1st bit, 2nd bit, or 3rd bit) in all groups, obtaining three additional candidate code words W1, W2, and W3. The code word that minimizes the number of bad patterns is selected. To decode the generated code word, the encoder supplements the data word with two auxiliary bits that record which of the four code words was used and enable the decoder to retrieve the original data word through simple logic operations.
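A minimal fault-oblivious PFE sketch follows. The mapping of auxiliary bits to flipped positions ('00' = no flip, '01'/'10'/'11' = 1st/2nd/3rd bit of each group) is an assumption for illustration; the paper only states that two auxiliary bits select among the four candidates.

```python
# Fault-oblivious PFE: generate four candidates (no flip, or flip bit p of
# every 3-bit group), append the auxiliary bits, and keep the candidate
# with the fewest bad patterns. Helper names are hypothetical.
def count_bad(bits: str) -> int:
    return sum(bits[i:i + 3] in ("000", "111") for i in range(len(bits) - 2))

def flip_position(w: str, p: int) -> str:
    """Flip bit p (0-2) of every 3-bit group of w; requires len(w) % 3 == 0."""
    out = list(w)
    for g in range(0, len(w), 3):
        out[g + p] = "1" if out[g + p] == "0" else "0"
    return "".join(out)

def pfe_encode(w: str) -> str:
    cands = [w + "00"] + [flip_position(w, p) + format(p + 1, "02b")
                          for p in range(3)]
    return min(cands, key=count_bad)        # code word plus auxiliary bits

def pfe_decode(code: str) -> str:
    body, aux = code[:-2], code[-2:]
    p = int(aux, 2)
    return body if p == 0 else flip_position(body, p - 1)

w = "011000111000"                          # the 0x638 example of Figure 1
code = pfe_encode(w)
assert pfe_decode(code) == w
assert count_bad(code) == 0                 # flipping the 2nd bits wins here
```

Decoding is a simple re-flip of the recorded position, so only XOR-style logic is needed in hardware.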

Figure 2: The average number of bad patterns in FFE-encoded data versus random (unencoded) data for block sizes of 8 to 512 bits (e.g., 47.625 versus 127.5 bad patterns for 512-bit blocks).


Page 5: Mitigating Bitline Crosstalk Noise in DRAM Memoriespeople.cs.pitt.edu/~seyedzadeh/download/MEMSYS17.pdfbitline crosstalk has long been known to be a problem [38] it has, until recently,

We use the idea of PFE to tolerate crosstalk errors via two different modes: fault-oblivious PFE and fault-aware PFE. In fault-oblivious PFE, all four code word candidates are generated and the code word containing the minimum number of bad patterns is written to memory without knowing the location of weak cells. In fault-aware PFE, an off-line map of weak cells maintains information about their locations and is used to minimize or avoid the overlap of bad patterns and weak cells. We describe the two modes of PFE in the following sections.

4.1 Fault-Oblivious PFE (PFE_FO)

Figure 3 depicts fault-oblivious PFE using the same example string used in Figure 1. Figure 3(a) shows the original data word appended by two auxiliary bits set to '00' to indicate that no bits are flipped in the encoded data word. In addition to the bad patterns in the original data word, the last bit of the data word and the two auxiliary bits are concatenated, and thus may introduce a new bad pattern. For example, concatenating '000' and '00' in Figure 3(a) increases the number of '000' sequences from two to four. Note that if k consecutive zero (one) bits are concatenated with '0' ('1'), the number of '000' ('111') sequences is k+1. Figure 3(b) shows the code word when flipping the first bit of each group. While flipping the first bit of each group breaks up the first '000' sequence, it introduces other '111' sequences that are equally problematic. Figure 3(c) flips the second bit of each group and removes all bad patterns from the data word. Finally, Figure 3(d) toggles the third bit of each group, which results in four bad patterns. Hence, PFE examines four possible code words and selects the one with the minimum number of bad patterns to be written into memory. Note that this type of PFE does not require any information about the location of weak cells.

4.2 Fault-Aware PFE (PFE_FA)

In contrast to fault-oblivious PFE, where the encoded candidate with the minimum number of bad patterns is chosen, fault-aware PFE utilizes weak cell information to select the code word that minimizes (or eliminates) the occurrence of crosstalk errors. As will be shown in Section 4.2.2, PFE_FA has the additional advantage of being able to guarantee the mitigation of at least three weak cells in a block. When writing an N-bit data word, W = x_0, ..., x_{N−1}, onto N DRAM cells c_0, ..., c_{N−1}, PFE_FA uses information about the weakness of cells to select the code word that avoids the overlap of the center of a 3-bit bad pattern with a weak cell.
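The fault-aware selection reduces to checking bad-pattern centers against the weak cell map. A minimal sketch (ours, not the paper's implementation; the 2-bit auxiliary convention 00/01/10/11 is an assumption, and real hardware would use the priority encoder described in Section 4.2.1):

```python
def flip_groups(word, flip):
    """Flip bit `flip` (0, 1, or 2) of every 3-bit group; None = no flip."""
    return ''.join(('1' if b == '0' else '0')
                   if flip is not None and i % 3 == flip else b
                   for i, b in enumerate(word))

def pfe_encode_fault_aware(word, weak_cells):
    """Return the first candidate code word in which no '000'/'111' window
    is centered on a weak cell; Theorem 1 guarantees one exists for up to
    three weak data cells. Aux bits 00/01/10/11 are an assumed convention."""
    for flip, aux in zip((None, 0, 1, 2), ('00', '01', '10', '11')):
        cw = flip_groups(word, flip) + aux
        if all(cw[i-1:i+2] not in ('000', '111')
               for i in weak_cells if 1 <= i < len(cw) - 1):
            return cw
    return None  # more weak cells than the guarantee covers
```

For example, with weak cells at positions 1, 4, and 11 of the Figure 3 word, the first two candidates place a bad-pattern center on a weak cell, and the flip-2nd-bit candidate is selected.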

A map of weak cells can be discovered as part of memory regression tests during the memory testing phase. For example, this testing can write bad patterns (all zeros, all ones, alternating 01's, etc.) many times to map out the locations of weak cells, or by using

(a) 0 1 1 0 0 0 1 1 1 0 0 0 | 0 0   (original word and auxiliary bits)
(b) 1 1 1 1 0 0 0 1 1 1 0 0 | 0 1   (1st bit of each group flipped)
(c) 0 0 1 0 1 0 1 0 1 0 1 0 | 1 0   (2nd bit of each group flipped)
(d) 0 1 0 0 0 1 1 1 0 0 0 1 | 1 1   (3rd bit of each group flipped)

Figure 3: Encoding W = 0x638 using PFE.

well-known testing algorithms [16, 29, 55]. The weak cell information can be stored in resident memory and cached in the on-chip cache on demand, as was proposed in [37]. Alternatively, the weak cell information can be stored in a ROM that can be accessed by the hardware.

4.2.1 Memory Controller Implementation. Figure 4 shows the implementation flow of fault-aware PFE in a memory controller. The memory controller first computes the address of the fault map entry and then checks ① whether the entry exists in the last level cache or not. If not, a request is issued to main memory ② in order to bring and store the fault map entry of the corresponding address in the last level cache [37]. The location of weak cell(s) is temporarily recorded in a Weak Cell Map (WCM). The 512-bit original data and WCM ③ are then divided into 32-bit blocks and sent to 16 encoding modules that leverage the weak cell information to map each 32-bit block to a 34-bit code word. The encoding module utilizes a priority encoder ④ to select the code word which avoids the overlap of bad patterns with the locations of weak cells. The 16 encodings are done in parallel, and the final 16 generated code words are concatenated and sent ⑤, along with the memory address, to be written in memory ②.

The PFE_FO encoder implementation is similar to the PFE_FA encoder implementation except that it does not include the fault map generator. Furthermore, it requires 5-bit counters in the encoding modules to count the number of bad patterns in each of the four generated code words, in order to select the code word with the minimum number of bad patterns. For a 512-bit cache line, the PFE_FO encoder requires 64 5-bit counters, which incur an extra area overhead in comparison to the PFE_FA encoder.

Note that an additional step may be required to unscramble the logical bit locations in order to determine the logical locations of physically neighboring bits, which are subjected to the highest levels of crosstalk. This information, which is known to vendors, can also be discovered [37]. We discuss the encoder and decoder implementations for fault-oblivious and fault-aware PFE in detail in Section 6.

4.2.2 Tolerance Capability. Given the four code word candidates from PFE, in PFE_FA there are some guarantees that can be made about the protection capability. In what follows, we will first prove that PFE_FA can tolerate at least three of the weak cells storing the data bits; then we will present an extension which will enable PFE to tolerate at least three weak cells in either the data bits or the auxiliary bits.

Theorem 1. If an N-bit data word, W = x_0, ..., x_{N−1}, is to be written to N cells, c_0, ..., c_{N−1}, then using PFE_FA encoding will tolerate at least three weak cells, assuming that the auxiliary bits used in the encoding are stored in a reliable memory.

Proof: Given any sequence of three consecutive bits p = "x_{i−1} x_i x_{i+1}", applying the three PFE transformations will change this sequence to p1 = "x′_{i−1} x_i x_{i+1}", p2 = "x_{i−1} x′_i x_{i+1}", or p3 = "x_{i−1} x_i x′_{i+1}", where x′ is the complement of x. Table 3 indicates that for any sequence p, at most one of p, p1, p2, and p3 will be equal to '000' or '111'. Hence, if c_i is a weak cell, then three of the four sequences p, p1, p2, or p3 (and thus three of the four PFE code words) will tolerate the weakness of this cell. Consequently, if three



Figure 4: Memory controller implementation of fault-aware PFE.

cells c_i, c_j, and c_k are weak, 0 < i, j, k < N − 1, then at least one of the four PFE code words will tolerate the weakness of the three cells. ■

If two auxiliary bits x_N and x_{N+1} are used to indicate which of W, W1, W2, or W3 is used in the encoding, and these bits are written to the two cells c_N, c_{N+1} following the data cells, then it is only possible to prove that PFE can tolerate the weakness of at least two of the cells c_0, ..., c_{N+1}. This is because the two auxiliary bits can generate a bad pattern in more than one of the four possible encodings. However, we will prove in the next theorem that by using three auxiliary bits, x_N, x_{N+1}, and x_{N+2}, to record which of the code words W, W1, W2, or W3 is used, and storing these bits in the three cells c_N, c_{N+1}, and c_{N+2}, it is possible to tolerate the weakness of at least three of the N + 3 cells storing the data and the auxiliary bits. Specifically, we define PFE+ as a PFE scheme which uses three auxiliary bits such that '000' indicates the original data word, '100' indicates the encoding in which bit x_{N−3} is flipped (call it W1), '010' indicates the encoding in which bit x_{N−2} is flipped (call it W2), and '001' indicates the encoding in which bit x_{N−1} is flipped (call it W3). With this scheme, we prove the following:

Theorem 2. If an N-bit data word, W = x_0, ..., x_{N−1}, is to be written to N cells, c_0, ..., c_{N−1}, then using fault-aware PFE+ encoding with the three auxiliary bits x_N, x_{N+1}, and x_{N+2} written into cells c_N, c_{N+1}, and c_{N+2}, will tolerate at least three weak cells c_i, c_j, and c_k, 0 < i, j, k < N + 2.

Proof: It was shown in the proof of Theorem 1 that if c_i, 0 < i < N − 1, is a weak cell, then three of the four code words W, W1, W2, and W3 will tolerate the weakness of this cell. Here, we will prove that the same applies to any weak cell c_i, N − 1 ≤ i < N + 2. For this, we observe that the five bits x_{N−2} ... x_{N+2} will have one of four possible forms: "x_{N−2} x_{N−1} 000" (if W is used), "x_{N−2} x_{N−1} 100" (if W1 is used), "x′_{N−2} x_{N−1} 010" (if W2 is used), or "x_{N−2} x′_{N−1} 001" (if W3 is used). We consider the following three cases:

1. Cell c_{N+1} is weak: the three bits x_N, x_{N+1}, x_{N+2} can produce a bad pattern (000) only if W is used in the encoding. Hence, the three other encodings can tolerate a weak c_{N+1}.

Table 3: PFE transformations of 3-bit sequences.

p    000  001  010  011  100  101  110  111
p1   100  101  110  111  000  001  010  011
p2   010  011  000  001  110  111  100  101
p3   001  000  011  010  101  100  111  110

2. Cell c_N is weak: the 3-bit sequence x_{N−1}, x_N, x_{N+1} can equal "x_{N−1} 0 0", "x_{N−1} 1 0", "x_{N−1} 0 1", or "x′_{N−1} 0 0" if W, W1, W2, or W3 is used for the encoding, respectively. Hence, "x_{N−1} x_N x_{N+1}" cannot be equal to 111 and can only be equal to 000 either if W is used (when x_{N−1} = 0) or if W3 is used (when x_{N−1} = 1). In other words, for any value of x_{N−1}, three encodings can tolerate a weak c_N.

3. Cell c_{N−1} is weak: the 3-bit sequence x_{N−2}, x_{N−1}, x_N can equal "x_{N−2} x_{N−1} 0", "x_{N−2} x_{N−1} 1", "x′_{N−2} x_{N−1} 0", or "x_{N−2} x′_{N−1} 0" if W, W1, W2, or W3 is used for the encoding, respectively. It is straightforward to check that for any specific values of x_{N−2} x_{N−1}, only one of the encodings W, W1, W2, or W3 will produce 000 or 111. Hence, the three other encodings can tolerate a weak c_{N−1}.

Consequently, if three cells c_i, c_j, and c_k are weak, 0 < i, j, k < N + 2, then at least one of the four encodings will tolerate the weakness of the three cells. ■
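Theorem 2 is small enough to check exhaustively for a modest N (a brute-force sanity check of ours, using the PFE+ auxiliary codes '000'/'100'/'010'/'001' from the text):

```python
from itertools import product, combinations

AUX_PLUS = {None: '000', 0: '100', 1: '010', 2: '001'}  # PFE+ aux codes

def encode_plus(word, flip):
    """PFE+ candidate: flip bit `flip` of every 3-bit group (None = keep
    the original word) and append the three auxiliary bits."""
    bits = ''.join(('1' if b == '0' else '0')
                   if flip is not None and i % 3 == flip else b
                   for i, b in enumerate(word))
    return bits + AUX_PLUS[flip]

def tolerates(stored, weak):
    """No '000'/'111' window may be centered on a weak cell."""
    return all(stored[i-1:i+2] not in ('000', '111') for i in weak)

def theorem2_holds(N):
    """Every N-bit word and every choice of three interior weak cells
    (0 < i < N+2) must be tolerated by at least one of the 4 candidates."""
    return all(any(tolerates(encode_plus(word, f), weak)
                   for f in (None, 0, 1, 2))
               for word in map(''.join, product('01', repeat=N))
               for weak in combinations(range(1, N + 2), 3))
```

Running it for N = 6 covers all 64 data words and all 35 interior weak-cell triples without finding a counterexample.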

Clearly, PFE relies on specific information about the nature of the faults to largely outperform more general error correcting codes. Specifically, at least 2 log N and 3 log N auxiliary bits have to be used when ECC-2 and ECC-3 are used to tolerate the errors resulting from two or three weak cells, respectively. In contrast, only 2 or 3 auxiliary bits (independent of N) are needed to tolerate two or three faults using PFE or PFE+, respectively. Moreover, ECC-k cannot tolerate more than k faults, while PFE and PFE+ can tolerate more than two or three faults, respectively, with some probability. This is because when q instances of the same 3-bit bad pattern overlap q weak cells, the same code word in PFE/PFE+ will tolerate all q weak cells. Note, however, that PFE/PFE+ are designed to tolerate errors induced by crosstalk faults, while ECC can tolerate errors due to any type of faults, including transient faults. To overcome this limitation, it is possible to combine PFE+ with ECC-k to tolerate k transient and three crosstalk faults. This will be more efficient than using ECC-(k + 3) to tolerate the same faults. Finally, PFE/PFE+ encoding and decoding are much simpler than ECC-2/ECC-3 and lead to a much simpler implementation, as described next.

5 GENERALIZATION TO OTHER 3-BIT BAD PATTERNS
As explained in Section 3, FFE eliminates specific 3-bit bad patterns from 4-bit binary strings. Similar to removing the bad patterns '000' and '111' in the folded bitline structure, FFE can eliminate the 3-bit bad patterns '010' and '101' from 4-bit binary strings in the open bitline structure. In fact, Table 1 represents the number of m-bit binary strings containing any 3-bit substring α0α1α2 and its complement α′0α′1α′2. Specifically, Table 4 shows the FFE mapping from 4-bit strings to 5-bit strings which removes '010' and '101'.

Regarding PFE, the proof of Theorem 1 uses the fact that each column of Table 3 contains at most one of the patterns '000' or '111'. We observe from Table 3 that this holds true for any 3-bit substring α0α1α2 (and its complement), which confirms the correctness of



Table 4: FFE mapping to eliminate '101' and '010'.

Data word   0000   0001   0010   0011   0100   0101   0110   0111
Code word   00000  00011  00110  00111  01100  01110  01111  10000
Data word   1000   1001   1010   1011   1100   1101   1110   1111
Code word   10001  10011  11000  11001  11100  11110  00001  11111
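The mapping can be checked mechanically (our own verification, with Table 4 transcribed as a dict): every 5-bit code word must be free of '010' and '101', and the mapping must be invertible.

```python
FFE_OPEN_BITLINE = {  # Table 4: 4-bit data -> 5-bit code, avoiding '010'/'101'
    '0000': '00000', '0001': '00011', '0010': '00110', '0011': '00111',
    '0100': '01100', '0101': '01110', '0110': '01111', '0111': '10000',
    '1000': '10001', '1001': '10011', '1010': '11000', '1011': '11001',
    '1100': '11100', '1101': '11110', '1110': '00001', '1111': '11111',
}

def has_bad(code, bad=('010', '101')):
    """True if any 3-bit window of `code` is one of the bad patterns."""
    return any(code[i:i+3] in bad for i in range(len(code) - 2))
```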

Theorem 1 for any substring and its complement, including '010' and '101'.
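This observation also admits a short exhaustive check (ours): {p, p1, p2, p3} is the radius-1 Hamming ball around p, and any 3-bit pattern and its complement are at Hamming distance 3, so exactly one element of the ball can hit the pair.

```python
from itertools import product

def pfe_candidates(p):
    """A 3-bit sequence together with its three single-bit flips
    (one column of Table 3)."""
    flips = [p[:j] + ('1' if p[j] == '0' else '0') + p[j+1:] for j in range(3)]
    return [p] + flips

def hits(p, alpha):
    """How many candidates equal `alpha` or its bitwise complement."""
    comp = ''.join('1' if c == '0' else '0' for c in alpha)
    return sum(q in (alpha, comp) for q in pfe_candidates(p))
```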

It is possible to show that Theorem 2 also works for any 3-bit substring α0α1α2. We demonstrate this for '010' and '101' by considering the following cases, corresponding to the cases enumerated in the proof of Theorem 2 for '000' and '111':

(1) Cell c_{N+1} is weak: the sequence x_N, x_{N+1}, x_{N+2} produces a bad pattern '010' only if W2 is used in the encoding.

(2) Cell c_N is weak: the 3-bit sequence x_{N−1}, x_N, x_{N+1} equals "x_{N−1} 0 0", "x_{N−1} 1 0", "x_{N−1} 0 1", or "x′_{N−1} 0 0" when W, W1, W2, or W3 is used. The sequence x_{N−1}, x_N, x_{N+1} can be equal to '010' ('101') only if W1 (W2) is used and x_{N−1} is equal to 0 (1).

(3) Cell c_{N−1} is weak: the 3-bit sequence x_{N−2}, x_{N−1}, x_N equals "x_{N−2} x_{N−1} 0", "x_{N−2} x_{N−1} 1", "x′_{N−2} x_{N−1} 0", or "x_{N−2} x′_{N−1} 0" when W, W1, W2, or W3 is used. For any specific values of x_{N−2} x_{N−1}, only one of the encodings produces a bad pattern of '101' or '010'. Thus, three other encodings tolerate a weakness in c_{N−1}.

6 HARDWARE IMPLEMENTATION

In this section, we evaluate the latency, power, area, and number of cells of the FFE and PFE encoders and decoders. Verilog implementations were synthesized using Synopsys Design Compiler targeting a 45nm FreePDK standard cell library [52]. The PFE_FO and PFE_FA implementations assume 32-bit block granularity, requiring 16 encoding modules to simultaneously encode a 512-bit cache line with a space overhead of 6.25%. The FFE implementation utilizes 128 LUTs to store the code words shown in Table 2 to encode 4-bit substrings (25% space overhead) and decode 5-bit substrings in parallel.

Table 5 shows the synthesis results for three different encoders and decoders in two different scenarios that optimize latency and power consumption, respectively. When optimizing latency, the table shows that the latency of the FFE encoder is about 83% and 58% lower than the PFE_FO and PFE_FA latencies, respectively. Conversely, since the decoders of PFE_FO and PFE_FA take advantage of simple XOR operations, their latency is 39% lower than that of the FFE decoder. In all cases, the latency of encoding and decoding is ≤ 1.65 ns, which is significantly lower than the latency of commodity DRAM (approximately 50 ns [14, 23]). Hence, the PFE_FO encoder requires only a few clock cycles to encode a data block.

We implemented the PFE_FO encoder using four 5-bit counters to count in parallel the number of bad patterns in each of the four 32-bit potential code words. As a result, the design requires a larger area in comparison to the PFE_FA encoder, which only checks the overlap of bad patterns with weak cell locations. The maximum area occupied by the PFE_FO encoder is 0.084 mm², which amounts to a 0.23% area overhead compared to a 2Gb DRAM chip with a 36.5 mm² die area [12, 42]. For decoding, both PFE_FO and PFE_FA have a 73% smaller area overhead compared to FFE. Note that the PFE_FO and PFE_FA decoders are identical and simply XOR the code word with the corresponding bit-flip vector.

With respect to power consumption, we find that although the FFE encoder consumes 68% and 34% less power compared to PFE_FO and PFE_FA, respectively, the FFE decoder uses 65% more power compared to the PFE decoders. However, the largest encoder power consumption is 20 mW, which is less than 9% of the power consumption of DRAM at the same technology node [12, 33, 56]. Furthermore, the energy per access (power times latency) of the encoding/decoding modules is orders of magnitude smaller than the actual DRAM energy per access (pJ as opposed to tens of nJ).

In contrast to the designs optimized for latency, the designs optimized for power decrease the power consumption of FFE, PFE_FO, and PFE_FA in the encoding process by about 19%, 50%, and 24%, respectively, while not increasing latency dramatically. We note that the implementations reported in Table 5 for PFE_FO and PFE_FA generate the four potential code words in parallel. Alternative implementations could generate the code words serially to reduce the area by a factor of almost four at the expense of increasing the worst-case latency by a factor of almost four. We note, however, that when the number of weak cells is low, PFE_FA usually tries only one code word and very rarely tries more than two code words, which means that the average latency would not increase much beyond the parallel implementation.

7 EVALUATION

To evaluate the effectiveness of both the fault-oblivious and fault-aware PFE approaches¹, we conducted experiments to compare the Uncorrectable Bit Error Rates (UBER²) [34] and performance overheads of PFE, FFE, ECP, and ECC. Our experiments consider iso-storage-overhead comparisons based on the overheads of FFE and PFE, and consider the cases of a moderate weak cell incidence rate of 0.01% and a high weak cell incidence rate of 1%. Note that the high weak cell rate of 1% may result when the refresh rate is relaxed to save energy, especially when technology is scaled [20, 40]. Finally, in addition to the weak cell rates, we include a sensitivity study of PFE for different block sizes as we vary the weak cell map. First, we begin with a description of our experimental setup.

7.1 Experimental Methodology

To evaluate the fault tolerance capability of ECC, ECP, FFE, and PFE, we developed a PIN-based simulator [30] to model the cache hierarchy in order to determine the accesses to main memory. We also used the gem5 full system simulator [7] to evaluate the performance overheads of the PFE fault-aware scheme. The system parameters were designed to be similar in both simulation environments, and a listing of the relevant parameters for gem5 is shown in Table 6. The PIN simulator uses a similar L1/L2 cache configuration. To evaluate the fault rate, the PIN simulator evaluates main memory writes by encoding the data and recording a fault if the center bit of a bad pattern in the encoded block is aligned with a weak cell. To model weak cells of the memory, maps of weak cells were created using

¹We use PFE rather than PFE+ because it achieves almost the same fault tolerance using fewer auxiliary bits.
²The percentage of bits that have errors relative to the total number of bits that have been read.



Table 5: The overhead of different schemes (512-bit block size) with latency optimization and power optimization.

                     Latency Optimization                      Power Optimization
         Scheme   Latency(ns) Power(mW) Area(µm²)  #Cells   Latency(ns) Power(mW) Area(µm²)  #Cells
Encoder  FFE         0.28       6.55    19402.74    6592       0.36       5.33    18141.26    5888
         PFE_FA      0.66       9.90    32348.38   10443       0.84       7.50    30223.86    9820
         PFE_FO      1.65      20.46    84268.44   26830       2.02       8.25    76139.23   24640
Decoder  FFE         0.28       8.10    18152.05    7341       0.38       6.82    20784.36    6656
         PFE_FA      0.17       2.57     4835.67    1376       0.40       1.40     2905.91     656
         PFE_FO      0.17       2.57     4835.67    1376       0.40       1.40     2905.91     656

Bayesian distributions to mimic the impact of process variation and include spatial correlation of faults [2, 60]. We followed the model described in [60] to generate maps of weak cells.

Errors can be mitigated by (1) reducing the probability that bad patterns coincide with weak cells by reducing the number of bad patterns, as in FFE and PFE_FO, (2) avoiding bad patterns that coincide with weak cells, as in PFE_FA, or (3) correcting faults, as in ECC. ECP can be used to protect against potential faults by pointing to weak cells and providing reliable storage for their content (called ECP_FO). Alternatively, ECP can be used to correct faults by pointing to and storing the values of weak cells that overlap with the center of any bad pattern (called ECP_FA). We consider both approaches.

In our experimental tests, we measure UBER when applying FFE, PFE, ECP, or ECC. The storage overheads for all of these schemes are summarized in Table 7, where k, in ECC-k and ECP-k, refers to the maximum number of errors that can be corrected by the scheme. We consider 512-bit cache lines that are divided into n-bit protected blocks. For FFE, n is always 4, and for PFE we use n = 32. We use ECC-1_32 and ECC-1_128 to denote single error correcting ECC with n = 32 and n = 128, respectively. Finally, we use n = 512 for ECP because it results in 10-bit pointers, compared to 6 bits for 32-bit blocks. This makes the overhead of ECP-3 comparable to the overheads of PFE and ECC-1_128, and the overhead of ECP-12 comparable to the overhead of FFE.

We model our fault-aware implementations as described in Section 4.2.1, where the memory controller must obtain the fault map entry. If the fault map entry is not cached, this can increase traffic on the memory bus, impacting performance. To illustrate this performance overhead, we compare a conservative system without a fault map cache (MemCTRL) to an ideal system (MemDIMM). The conservative system requires access to the memory bus for each encoding operation, while the ideal system knows the fault map and can encode the data on the memory DIMMs directly, without incurring additional memory traffic due to querying the fault map.

We performed our evaluations for workloads from the PARSEC [6] and selected SPEC CPU2006 [11] benchmarks³ for different maps of weak cells, including the moderate and high weak

³The benchmarks selected were those we were able to run successfully within gem5. All SPEC CPU2006 benchmarks were tested using the PIN tool for fault tolerance and are consistent with the results presented here.

Table 6: Simulator Parameters

CPU            4-core, 8-issue width per core, out of order
L1 Cache       16K private Inst. & Data, 8-way set-assoc.
L2 Cache       1MB shared, 16-way set-assoc.
Cache Block    512 bits
Write Buffer   64 entries

cell rates of 0.01% and 1%, respectively. In this context, the weak cell rate is defined as the fraction of weak cells relative to the total number of DRAM cells.

7.2 Fault-Oblivious Effectiveness

Knowledge of both the location of weak cells and bad patterns is necessary for a fault-aware scheme in the context of intra-row crosstalk. FFE and PFE_FO attempt to minimize the number of bad patterns without regard to the location of weak cells. Thus, each selects the encoding with the fewest bad patterns and, upon a tie, selects a candidate arbitrarily.

FFE requires a 25% encoding overhead (see Table 7). For an iso-storage-overhead comparison, we combined PFE with ECC-1_32 (PFE_FO+ECC-1_32), which requires the same overhead of eight bits (2+6) required for FFE. We also compare with the ECP_FO-12 approach with similar overhead (24%). Figures 5(a) and 5(b) show this comparison for weak cell incidence rates of 0.01% and 1%, respectively. For the 0.01% rate of weak cells, the UBER of FFE is 1.2 × 10^−5. However, for the same overhead, ECP_FO-12 does a much better job, achieving an UBER of approximately 2.3 × 10^−6, while PFE_FO with ECC-1_32 is even more effective with an UBER of approximately 2.5 × 10^−7. FFE demonstrates its scalability to higher numbers of weak cells, reaching an UBER of approximately 6 × 10^−3 at a weak cell incidence rate of 1%. ECP_FO-12 degrades to an UBER of 8.3 × 10^−3, comparable to FFE. In contrast, PFE_FO+ECC-1_32 still provides an UBER of 5 × 10^−5.

Although the proposed techniques decrease the susceptibility to crosstalk by avoiding bad patterns, they cannot completely guarantee tolerance to crosstalk faults. Additionally, bad patterns may be formed when consecutive blocks are concatenated or when the number of weak cells is larger than what can be tolerated. A single parity bit can be added to provide a capability to detect an uncorrectable fault, similar to the addition of a parity bit in ECC-k to detect a (k+1)th fault. However, the greatest value of using PFE in combination with an ECC-k is to increase the effectiveness of ECC with a lower overhead than moving to ECC-(k + β) for β > 1.

7.3 Fault-Aware Effectiveness

In this section, PFE_FA is compared to ECC-1_128 and ECP_FA-3. Each scheme has a storage overhead of approximately 6% additional bits (see Table 7). ECP can be made fault-aware in a similar fashion to PFE_FA by similarly retrieving the fault locations from the on-chip cache and using the pointers only to store the corrected values for locations at which a bad pattern and a weak cell intersect.

Figures 6(a) and 6(b) show the comparison of PFE_FA with ECC-1_128 and ECP_FA-3 for weak cell incidence rates of 0.01% and 1%, respectively. For the 0.01% weak cell incidence rate, ECC-1 achieves



Table 7: Bit overheads for fault tolerance schemes.

Overhead per n-bit block:   ECC-k: k(⌈log(n)⌉ + 1)   FFE: ⌈n/4⌉   PFE: 2   ECP-k: k(⌈log(n)⌉ + 1) + 1

                          ECC-1_32  ECC-2_32  ECC-1_128   FFE    PFE    ECP-3   ECP-12
Block size                   32        32        128        4      32     512     512
Overhead bits per block       6        11          8        1       2      31     121
Overhead %                 18.75%    34.37%      6.25%     25%    6.25%   6.05%  23.63%
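Table 7's per-block overhead formulas translate directly into code (a sketch of ours; the scheme names are just labels):

```python
import math

def overhead_bits(scheme, n, k=1):
    """Auxiliary bits per n-bit protected block, per Table 7's formulas."""
    lg1 = math.ceil(math.log2(n)) + 1        # ceil(log(n)) + 1
    return {'ECC': k * lg1,                  # k syndrome groups
            'FFE': math.ceil(n / 4),         # one extra bit per 4-bit substring
            'PFE': 2,                        # two flip-selection bits
            'ECP': k * lg1 + 1}[scheme]      # k pointers plus one full bit
```

For instance, ECP with n = 512 yields 10-bit pointers, so ECP-3 costs 3 × 10 + 1 = 31 bits (6.05%) and ECP-12 costs 121 bits (23.63%), matching the table.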

Figure 5: Comparison of "moderate-overhead" (~25%) fault-oblivious approaches of FFE with ECP_FO-12 and PFE_FO+ECC-1_32: (a) 0.01% incidence of weak cells; (b) 1% incidence of weak cells. UBER across the PARSEC and SPEC CPU2006 benchmarks (lower is better).

an UBER of 6.8 × 10^−6 while ECP_FA achieves an UBER of 1.4 × 10^−6. In contrast, PFE at 0.01% error produced no errors during our experiments. From the number of writes and repeated experiments, this guarantees that in the worst case it has an UBER of at most 3 × 10^−12. This actually exceeds the capability of PFE_FO+ECC-1_32 with 75% less storage overhead. For the 1% weak cell incidence rate, even with fault-awareness, ECP is only able to correct one-third of the faults, dropping below ECC-1's roughly 50% correction rate at an UBER of 2.5 × 10^−3. In contrast, PFE still reaches an UBER of 1.3 × 10^−5 and corrects 99.7% of the faults.

7.4 Impact on Performance

Enabling a fault-aware implementation of PFE and ECP requires additional memory accesses to query the weak cell locations if they miss in the cache, which is an overhead, as described in Section 4.2.1. The performance overhead only impacts writing, while read operations proceed normally with a small increase in delay due to decoding (see Table 5). The encoding overheads for fault-oblivious schemes of 2 ns or less (see Table 5), combined with the

fact that this encoding could occur in the write buffer, which masks write latency, resulted in a negligible impact on performance. Thus, we only report the performance impact of the fault-aware schemes, which also include this encoding delay in addition to the access to the weak cell map. These performance overheads are reported as instructions per cycle (IPC) in Figure 7 and compared against the performance of a fault-oblivious scheme that does not query the fault map (Baseline). The write buffer seems to still mask much of the additional latency from the fault-aware schemes. In general, a controller-level (MemCTRL) implementation does have a slightly higher performance degradation than an ideal one implemented at the memory chip level (MemDIMM). However, in most benchmarks, the degradation from either scheme is not dramatic. Only in a handful of applications, vips, gobmk, sjeng, GemsFDTD, and milc, do we see a noticeable reduction of IPC for MemCTRL. Overall, the fault-aware MemCTRL and MemDIMM implementations see a 2.3% and 1.1% degradation, respectively, in IPC. However, the improvement in fault tolerance with a fault-aware scheme makes it a good choice given a minimal performance overhead.



Figure 6: Comparison of PFE_FA to other "low-overhead" (6.25%) fault-aware approaches, ECC-1_128 and ECP_FA-3: (a) 0.01% incidence of weak cells; (b) 1% incidence of weak cells. UBER across the PARSEC and SPEC CPU2006 benchmarks (lower is better).


Figure 7: Performance impact (IPC) of run-time determination of weak cells at the memory controller (MemCTRL) or within the memory DIMM (MemDIMM) compared to a fault-oblivious baseline that does not query the weak cell map.

7.5 Comparison of different fault mitigation schemes
As the number of weak cells increases in newly fabricated memories as a result of increasingly smaller technology nodes, fault mitigation strategies will have to scale with the increased propensity for faults. In Figure 8, we compare fault-oblivious and fault-aware versions of ECC-1 and ECC-2, FFE, PFE, ECP-3 and ECP-12, and PFE+ECC-1 for weak cell rates ranging from 0.01% to 1%. The figure shows that PFE approaches the desired system UBER (10^−14 [21]) at a weak cell rate of 0.01% while continuing to outperform other baseline correction schemes at higher weak cell rates (e.g., 1%). Hence, PFE shows propensity for being effective at moderate weak cell rates

(e.g., 0.01%) while being extremely valuable for high weak cell rates (e.g., 1%).

For a 0.01% weak cell rate, ECC-1_128 only achieves an UBER of 6.8 × 10^−6 while ECC-2_32 achieves an UBER of 1.3 × 10^−6. Moreover, the UBER degrades dramatically as the number of errors increases, to 2.5 × 10^−3 and 1.3 × 10^−3, respectively, at a 1% weak cell incidence rate. For FFE, the UBER changes linearly with respect to the incidence rate. In other words, the percentage of faults corrected by FFE remains invariant as the weak cell rate increases. As a result, while FFE has a worse UBER at lower error rates than traditional correction schemes such as ECC, it becomes the more effective correction scheme at high error rates.



[Figure 8 plot: UBER (lower is better) on a log scale from 1.E-09 to 1.E-02 for each scheme at weak cell incidence rates of 0.01%, 0.10%, and 1.00%; the PFE-based configurations fall below 1.0E-11, i.e., less than 3 × 10^-12.]

Figure 8: UBER for different fault mitigation schemes as weak cell incidence rate varies from 0.01% to 1%.

[Figure 9 plot: UBER (lower is better) on a log scale from 1.E-11 to 1.E-04 for block sizes n = 8, 16, 32, 64, 128, and 512 bits under fault-oblivious (FO) and fault-aware (FA) PFE at 0.01% and 1% weak cell rates; the FA bars at 0.01% fall below 3 × 10^-12.]

Figure 9: PFE for different block sizes, n.

ECP_FO-3 and ECP_FA-3 achieve relatively good UBERs of 2.3 × 10^-6 and 1.4 × 10^-6, respectively, at a 0.01% weak cell incidence rate. However, like ECC, ECP's fault tolerance drops off sharply as the weak cell incidence rate increases. With a larger overhead, ECP_FO-12 and ECP_FA-12 are very effective for low weak cell incidence rates. At this higher overhead, ECP has a similar overhead to FFE and is slightly less effective at lower weak cell incidence rates (e.g., 0.01%) for similar reasons.

While for a 0.01% weak cell incidence rate PFE_FO achieves an UBER close to that of ECP_FA-12, the effectiveness of PFE_FO increases at higher weak cell incidence rates. Moreover, the results show that adding ECC-1 to PFE_FO improves the UBER by an order of magnitude versus PFE_FO alone across the range of weak cell incidence rates. PFE_FA reaches an UBER that is less than 3 × 10^-12 for a 0.01% weak cell incidence rate and is about five orders of magnitude more effective than PFE_FO+ECC-1. When adding ECC-1 on top of PFE_FA, the PFE scheme reaches the best fault mitigation efficiency (by at least one order of magnitude) at a 1% weak cell incidence rate against all fault mitigation schemes shown in the figure. PFE clearly provides the best fundamental protection against intra-row crosstalk and, when coupled with ECC-1, can achieve reasonable error rates even with extremely large numbers of weak cells.
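The fault-oblivious PFE behavior compared above can be sketched in a few lines: try each flip offset, count the bad patterns each candidate produces, and keep the best. The bad-pattern test used here (a cell whose two bitline neighbors both store the opposite value, i.e., a '010'/'101' triple) and the flip period are illustrative assumptions, not the paper's exact parameters.

```python
def count_bad_patterns(bits):
    """Count cells whose two neighbors both hold the opposite value
    (an assumed proxy for a crosstalk-prone '010'/'101' pattern)."""
    return sum(1 for i in range(1, len(bits) - 1)
               if bits[i - 1] != bits[i] and bits[i + 1] != bits[i])

def _flip(bits, offset, period):
    """Flip every `period`-th bit starting at `offset`."""
    return [b ^ (1 if i % period == offset else 0)
            for i, b in enumerate(bits)]

def pfe_encode(bits, period=4):
    """Fault-oblivious PFE sketch: evaluate every offset of the
    periodic flip pattern and keep the candidate with the fewest bad
    patterns. Returns (offset, encoded); the offset is the auxiliary
    information the decoder needs to undo the flips."""
    best = min(range(period),
               key=lambda off: count_bad_patterns(_flip(bits, off, period)))
    return best, _flip(bits, best, period)

def pfe_decode(offset, encoded, period=4):
    """Undo the periodic flips (XOR is its own inverse)."""
    return _flip(encoded, offset, period)

# Worst case for this bad-pattern definition: fully alternating data.
data = [0, 1, 0, 1, 0, 1, 0, 1]
off, enc = pfe_encode(data)
assert pfe_decode(off, enc) == data           # lossless round trip
assert count_bad_patterns(enc) < count_bad_patterns(data)
```

A fault-aware variant would instead pick the offset for which no remaining bad pattern is centered on a known weak cell, using the weak cell map rather than the raw pattern count.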

7.6 Sensitivity to block size

PFE is also effective for a variety of block sizes. Figure 9 shows the impact of varying the block size for PFE. For the fault-oblivious case, as the size of the block covered by PFE increases, the ability to reduce bad patterns decreases linearly. However, even for a 512-bit block with 1% weak cells, PFE reaches an UBER of 4.8 × 10^-4.

For the fault-aware approach, the UBER of PFE is at most 3 × 10^-12 for low weak cell rates and 512-bit blocks. For a 1% weak cell rate, a 32-bit block size results in an UBER of 1.3 × 10^-5, but as the block size increases, the effectiveness drops, achieving only an UBER of 8.2 × 10^-5 for 512-bit blocks.

8 CONCLUSION

In this paper, we discussed how the presence of specific patterns of stored data exacerbates the likelihood of crosstalk occurrence in DRAM cells and triggers crosstalk faults. Reducing the number of bad patterns can decrease the occurrence of crosstalk incurred by process variation. We studied two orthogonal crosstalk mitigating techniques for DRAM cells. Experimental results conducted on PARSEC and SPEC benchmarks showed that the effects of crosstalk can be destructive, especially as the percentage of weak cells increases. The results, however, showed that the proposed PFE scheme is effective at avoiding the occurrence of crosstalk faults and, if combined with single error correcting ECC, may eliminate crosstalk faults when fewer than 0.01% of the cells are weak. Furthermore, combining PFE with ECC provides tolerance to other types of faults, such as transient faults, that are traditionally tolerated by ECC.

9 ACKNOWLEDGMENT

This work is supported by NSF grant CCF-1064976, NSF GRFP Grant No. 1247842, and an SGMI grant from Samsung Electronics.

REFERENCES

[1] JEDEC. 2012. Standard No. 79-3F. DDR3 SDRAM Specification.
[2] Zaid Al-Ars. 2005. DRAM fault analysis and test generation. TU Delft, Delft University of Technology.
[3] Zaid Al-Ars, Said Hamdioui, and Ad J van de Goor. 2004. Effects of bit line coupling on the faulty behavior of DRAMs. In Symp. on VLSI Test. IEEE, 117–122.
[4] Zaid Al-Ars, Martin Herzog, Ivo Schanstra, and Ad J van de Goor. 2004. Influence of bit line twisting on the faulty behavior of DRAMs. In Memory Technology, Design and Testing, 2004. Records of the 2004 International Workshop on. IEEE, 32–37.
[5] R Anglada and A Rubio. 1988. An approach to crosstalk effect analysis and avoidance techniques in digital CMOS VLSI circuits. Journal of Electronics 65, 1 (1988), 9–17.
[6] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 72–81.
[7] Nathan Binkert et al. 2011. The Gem5 Simulator. SIGARCH Comput. Archit. News 39, 2 (Aug. 2011), 1–7.
[8] SY Cha. 2011. DRAM and future commodity memories. In VLSI Technology Short Course (2011).
[9] Kevin K Chang et al. 2016. Understanding Latency Variation in Modern DRAM Chips: Experimental Characterization, Analysis, and Optimization. In SIGMETRICS. ACM, 323–336.
[10] Robert Feurle, Sabine Mandel, Dominique Savignac, and Helmut Schneider. 2001. Semiconductor memory configuration with a bit-line twist. US Patent 6,310,399 (Oct. 30, 2001).
[11] John L. Henning. 2006. SPEC CPU2006 Benchmark Descriptions. SIGARCH Comput. Archit. News 34, 4 (Sept. 2006).
[12] Wei Huang, Karthick Rajamani, Mircea R Stan, and Kevin Skadron. 2011. Scaling with design constraints: Predicting the future of big chips. IEEE Micro 31, 4 (2011), 16–29.
[13] Lei Jiang, Youtao Zhang, and Jun Yang. 2014. Mitigating write disturbance in super-dense phase change memories. In Dependable Systems and Networks (DSN), 2014 44th Annual IEEE/IFIP International Conference on. IEEE, 216–227.
[14] Onur Kayiran, Nachiappan Chidambaram Nachiappan, Adwait Jog, Rachata Ausavarungnirun, Mahmut T Kandemir, Gabriel H Loh, Onur Mutlu, and Chita R Das. 2014. Managing GPU concurrency in heterogeneous architectures. In Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on. IEEE, 114–126.
[15] Samira Khan, Donghyuk Lee, Yoongu Kim, Alaa R Alameldeen, Chris Wilkerson, and Onur Mutlu. 2014. The efficacy of error mitigation techniques for DRAM retention failures: A comparative experimental study. In SIGMETRICS, Vol. 42. ACM, 519–532.
[16] Samira Khan, Donghyuk Lee, and Onur Mutlu. 2016. PARBOR: An Efficient System-Level Technique to Detect Data-Dependent Failures in DRAM. In Dependable Systems and Networks (DSN), 2016 46th Annual IEEE/IFIP International Conference on. IEEE, 239–250.
[17] Samira Khan, Chris Wilkerson, Donghyuk Lee, Alaa R Alameldeen, and Onur Mutlu. 2016. A case for memory content-based detection and mitigation of data-dependent failures in DRAM. IEEE Computer Architecture Letters (2016).
[18] Dae-Hyun Kim, Prashant J Nair, and Moinuddin K Qureshi. 2015. Architectural support for mitigating row hammering in DRAM memories. IEEE Computer Architecture Letters 14, 1 (2015), 9–12.
[19] Kinam Kim. 2005. Technology for sub-50nm DRAM and NAND flash manufacturing. In Electron Devices Meeting, 2005. IEDM Technical Digest. IEEE International. IEEE, 323–326.
[20] Kinam Kim and Jooyoung Lee. 2009. A new investigation of data retention time in truly nanoscaled DRAMs. IEEE Electron Device Letters 30, 8 (2009), 846–848.
[21] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. 2014. Flipping bits in memory without accessing them: An experimental study of DRAM disturbance errors. In SIGARCH Comput. Archit. News, Vol. 42. IEEE Press, 361–372.
[22] Yasuhiro Konishi et al. 1989. Analysis of coupling noise between adjacent bit lines in megabit DRAMs. Journal of Solid-State Circuits 24, 1 (1989), 35–42.
[23] Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, and Onur Mutlu. 2013. Tiered-latency DRAM: A low latency and low cost DRAM architecture. In High Performance Computer Architecture (HPCA), 2013 IEEE 19th International Symposium on. IEEE, 615–626.
[24] Myoung Jin Lee and Kun Woo Park. 2010. A mechanism for dependence of refresh time on data pattern in DRAM. IEEE Electron Device Letters 31, 2 (2010), 168–170.
[25] Jiayin Li and Kartik Mohanram. 2014. Write-once-memory-code phase change memory. In Design, Automation and Test in Europe Conference and Exhibition (DATE), 2014. IEEE, 1–6.
[26] Yan Li, Helmut Schneider, Florian Schnabel, Roland Thewes, and Doris Schmitt-Landsiedel. 2011. DRAM yield analysis and optimization by a statistical design approach. IEEE Transactions on Circuits and Systems 58, 12 (2011), 2906–2918.
[27] Kyu-Nam Lim, Woong-Ju Jang, Hyung-Sik Won, Kang-Yeol Lee, Hyungsoo Kim, Dong-Whee Kim, Mi-Hyun Cho, Seung-Lo Kim, Jong-Ho Kang, Keun-Woo Park, et al. 2012. A 1.2 V 23nm 6F² 4Gb DDR3 SDRAM with local-bitline sense amplifier, hybrid LIO sense amplifier and dummy-less array architecture. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International. IEEE, 42–44.
[28] Jamie Liu, Ben Jaiyen, Yoongu Kim, Chris Wilkerson, and Onur Mutlu. 2013. An experimental study of data retention behavior in modern DRAM devices: Implications for retention time profiling mechanisms. In SIGARCH Comput. Archit. News, Vol. 41. ACM, 60–71.
[29] Jamie Liu, Ben Jaiyen, Richard Veras, and Onur Mutlu. 2012. RAIDR: Retention-aware intelligent DRAM refresh. In SIGARCH Comput. Archit. News, Vol. 40. IEEE Computer Society, 1–12.
[30] Chi-Keung Luk et al. 2005. Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation. In PLDI.
[31] Rakan Maddah, Seyed Mohammad Seyedzadeh, and Rami Melhem. 2015. CAFO: Cost aware flip optimization for asymmetric memories. In High Performance Computer Architecture (HPCA), 2015 IEEE 21st International Symposium on. IEEE, 320–330.
[32] Jack A Mandelman, Robert H Dennard, Gary B Bronner, John K DeBrosse, Rama Divakaruni, Yujun Li, and Carl J Radens. 2002. Challenges and future directions for the scaling of dynamic random-access memory (DRAM). IBM Journal of Research and Development 46, 2.3 (2002), 187–212.
[33] Micron. 2010. DDR3 SDRAM System-Power Calculator.
[34] Neal Mielke, Todd Marquart, Ning Wu, Jeff Kessenich, Hanmant Belgal, Eric Schares, Falgun Trivedi, Evan Goodness, and Leland R Nevill. 2008. Bit error rate in NAND flash memories. In Symp. Reliability Physics. IEEE, 9–19.
[35] Dong Min, Dong Seo, Jehwan You, Soo Cho, Daeje Chin, and YE Park. 1990. Wordline coupling noise reduction techniques for scaled DRAMs. In Symposium on VLSI Circuits. 81–82.
[36] Yongsam Moon, Yong-Ho Cho, Hyun-Bae Lee, Byung-Hoon Jeong, Seok-Hun Hyun, Byung-Chul Kim, In-Chul Jeong, Seong-Young Seo, Jun-Ho Shin, Seok-Woo Choi, et al. 2009. 1.2 V 1.6 Gb/s 56nm 6F² 4Gb DDR3 SDRAM with hybrid-I/O sense amplifier and segmented sub-array architecture. In Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2009 IEEE International. IEEE, 128–129.
[37] Prashant J Nair, Dae-Hyun Kim, and Moinuddin K Qureshi. 2013. ArchShield: Architectural framework for assisting DRAM scaling by tolerating high error rates. In ACM SIGARCH Computer Architecture News, Vol. 41. ACM, 72–83.
[38] Yoshinobu Nakagome, M Aoki, S Ikenaga, M Horiguchi, S Kimura, Y Kawamoto, and K Itoh. 1988. The impact of data-line interference noise on DRAM scaling. Journal of Solid-State Circuits 23, 5 (1988), 1120–1127.
[39] MA Nyblom. Enumerating binary strings without r-runs of ones. In Int. Math. Forum, Vol. 7.
[40] Moinuddin K Qureshi, Dae-Hyun Kim, Samira Khan, Prashant J Nair, and Onur Mutlu. 2015. AVATAR: A variable-retention-time (VRT) aware refresh for DRAM systems. In Dependable Systems and Networks (DSN), 2015 45th Annual IEEE/IFIP International Conference on. IEEE, 427–437.
[41] Jan M Rabaey, Anantha P Chandrakasan, and Borivoje Nikolic. 2002. Digital Integrated Circuits. Vol. 2. Prentice Hall, Englewood Cliffs.
[42] Rambus. 2010. DRAM Power Model.
[43] Michael Redeker, Bruce F Cockburn, and Duncan G Elliott. 2002. An investigation into crosstalk noise in DRAM structures. In Memory Technology, Design and Testing (MTDT 2002). Proceedings of the 2002 IEEE International Workshop on. IEEE, 123–129.
[44] Stuart Schechter, Gabriel H Loh, Karin Straus, and Doug Burger. 2010. Use ECP, not ECC, for hard failures in resistive memories. In SIGARCH Comput. Archit. News.
[45] T Schloesser, F Jakubowski, J v Kluge, A Graham, S Slesazeck, M Popp, P Baars, K Muemmler, P Moll, K Wilson, et al. 2008. 6F² buried wordline DRAM cell for 40nm and beyond. In Electron Devices Meeting, 2008. IEDM 2008. IEEE International. IEEE, 1–4.
[46] Nak Hee Seong, Dong Hyuk Woo, Vijayalakshmi Srinivasan, Jude A Rivers, and Hsien-Hsin S Lee. 2010. SAFER: Stuck-at-fault error recovery for memories. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 115–124.
[47] Seyed Mohammad Seyedzadeh, Alex K Jones, and Rami Melhem. 2017. Counter-based tree structure for row hammering mitigation in DRAM. IEEE Computer Architecture Letters 16, 1 (2017), 18–21.
[48] Seyed Mohammad Seyedzadeh, Rakan Maddah, Alex Jones, and Rami Melhem. 2015. PRES: Pseudo-random encoding scheme to increase the bit flip reduction in the memory. In Design Automation Conference (DAC), 2015 52nd ACM/EDAC/IEEE. IEEE, 1–6.
[49] Seyed Mohammad Seyedzadeh, Rakan Maddah, Alex Jones, and Rami Melhem. 2016. Leveraging ECC to Mitigate Read Disturbance, False Reads and Write Faults in STT-RAM. In Dependable Systems and Networks (DSN), 2016 46th Annual IEEE/IFIP International Conference on. IEEE, 215–226.
[50] Seyed Mohammad Seyedzadeh, Rakan Maddah, Donald Kline, Alex K Jones, and Rami Melhem. 2016. Improving bit flip reduction for biased and random data. IEEE Trans. Comput. 65, 11 (2016), 3345–3356.
[51] D.E. Tuers, Y. Ataklti, and A. Manohar. 2015. Detection of read disturbances on non-volatile memories through counting of read accesses within divisions of the memory. WO Patent App. PCT/US2015/018,527 (Sept. 24, 2015). http://www.google.com/patents/WO2015142513A1?cl=en
[52] North Carolina State Univ. FreePDK45. http://www.eda.ncsu.edu/wiki/.
[53] Ad J van de Goor and J De Neef. 1999. Industrial evaluation of DRAM tests. In Proceedings of the Conference on Design, Automation and Test in Europe. ACM, 123.
[54] Ad J van de Goor and Ivo Schanstra. 2002. Address and data scrambling: Causes and impact on memory tests. In DELTA. 128–136.
[55] Ravi K Venkatesan, Stephen Herr, and Eric Rotenberg. 2006. Retention-aware placement in DRAM (RAPID): Software methods for quasi-non-volatile DRAM. In High-Performance Computer Architecture, 2006. The Twelfth International Symposium on. IEEE, 155–165.
[56] Thomas Vogelsang. 2010. Understanding the energy consumption of dynamic random access memories. In Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 363–374.
[57] Zemo Yang and Samiha Mourad. 2000. Crosstalk in Deep Submicron DRAMs. In MTDT. 125–130.
[58] Zemo Yang and Samiha Mourad. 2006. Crosstalk induced fault analysis and test in DRAMs. Journal of Electronic Testing 22, 2 (2006), 173–187.
[59] Tsutomu Yoshihara, Hideto Hidaka, Yoshio Matsuda, and Kazuyasu Fujishima. 1988. A twisted bit line technique for multi-Mb DRAMs. In 1988 IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers. 238–239.
[60] Tao Yuan, Saleem Z Ramadan, and Suk Joo Bae. 2011. Yield prediction for integrated circuits manufacturing through hierarchical Bayesian modeling of spatial defects. IEEE Transactions on Reliability 60, 4 (2011), 729–741.
