
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 23, NO. 7, JULY 2015 1235

An MPCN-Based BCH Codec Architecture With Arbitrary Error Correcting Capability

Chi-Heng Yang, Yi-Min Lin, Hsie-Chia Chang, and Chen-Yi Lee, Member, IEEE

Abstract: This paper presents an area-efficient architecture of an arbitrary-error-correction Bose–Chaudhuri–Hocquenghem codec for NAND flash memory. By factorizing the generator polynomial into several minimal polynomials and utilizing linear feedback shift registers based on minimal polynomials, our reconfigurable design can not only support multiple error correcting capabilities at little extra cost, but also merge the encoder and syndrome calculator to reduce hardware complexity. Implemented in 65-nm CMOS technology, the test chip supporting t = 1–24 bits achieves 1.33-Gb/s measured throughput with a 73k gate count, while another design supporting t = 60–84 bits provides 1.60-Gb/s synthesized throughput with a 168.6k gate count.

Index Terms: Bose–Chaudhuri–Hocquenghem (BCH) codes, encoder, error correcting codes (ECC), NAND flash, syndrome.

I. INTRODUCTION

The NAND flash memory is currently one of the most popular memory systems. Owing to its nonvolatility, shock tolerance, high speed, high density, and low cost, it is widely used in flash drives, embedded multimedia controllers, solid-state drives, and big-data servers.

To fulfil the need for higher storage capacity and lower storage cost, the feature size of NAND flash devices is continuously scaling down, and the techniques of multiple bits per cell or triple bits per cell are widely adopted [1]–[4].

However, along with these remarkable breakthroughs, the two most dominant error types in NAND flash devices have also increased: 1) the data retention errors ($e_{retention}$); and 2) the program disturbance errors ($e_{disturb}$). $e_{retention}$ occurs when the charge stored in the floating gate is slowly lost through leakage over time, and $e_{disturb}$ occurs when the state of a cell changes while another cell on the same wordline is being programmed. Moreover, because of the physical structure of flash devices, the consecutive electron tunneling resulting from program/erase (P/E) operations is harmful to the tunnel oxide of flash devices. As the number of P/E cycles increases, the probabilities of $e_{retention}$ and $e_{disturb}$ increase accordingly, raising the raw bit error rate (RBER) and degrading the reliability [5], [6].

Manuscript received April 24, 2013; revised February 17, 2014; accepted June 23, 2014. Date of publication August 12, 2014; date of current version June 23, 2015. This work was supported in part by the National Science Council of Taiwan, and in part by the Ministry of Economic Affairs, Taiwan, under Grants NSC-101-2628-E-009-013-MY3 and 100-EC-17-A-01-S1-124.

The authors are with the Department of Electronics Engineering, Institute of Electronics, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: [email protected]; [email protected]; [email protected]; [email protected]).

Digital Object Identifier 10.1109/TVLSI.2014.2338309

To ensure system reliability and improve the endurance of NAND flash devices, the Bose–Chaudhuri–Hocquenghem (BCH) code [7], [8] is one of the most widely used error correcting codes (ECCs) because of its low complexity and random error immunity [9]–[13].

Since the RBER of a flash device degrades during its lifetime and the RBER distribution is not consistent among chips, the ECC must be designed for the worst-case RBER to ensure that the system is reliable enough. However, in the early stage of a flash device's lifetime, or for a flash chip with a lower RBER distribution, a strong single-mode ECC (designed for the worst-case RBER) wastes decoding latency, spare space, and power.

To meet the varying performance requirements, it is desirable for a BCH codec to support multiple error correcting capabilities, denoted as $t$. However, since the generator polynomial is different for each $t$, the hardware cost of the encoder is unaffordable if the logic for every $t$ is simply put together without further simplification.

Nevertheless, because the generator polynomial is defined as the product of several minimal polynomials, and the set of minimal polynomials required for a smaller $t$ is contained in the set for a greater $t$, the minimal polynomials of the greatest $t$ can be shared with all smaller $t$. Furthermore, by factorizing the generator polynomial into a set of minimal polynomials, the proposed architecture can support an arbitrary error correcting capability within the predefined range using a partial set of these minimal polynomials, and the hardware complexity is lower than that of the conventional linear feedback shift register (LFSR) approach and the prior art [14].

Furthermore, to meet the high-throughput demand of NAND flash applications, minimal polynomial combination networks (MPCNs) [15] are applied to exploit higher parallelism. Let the minimal polynomial of $\alpha^i$ over GF($2^m$) be $M_i(x) = x^m + M_{i,m-1}x^{m-1} + M_{i,m-2}x^{m-2} + \cdots + M_{i,2}x^2 + M_{i,1}x + 1$, where $M_{i,j}$, $1 \le j \le m-1$, are the binary coefficients of $M_i(x)$; the block diagram of MPCN$_i$ is shown in Fig. 1.

Moreover, the hardware resources of the encoder can be easily shared with the syndrome calculator for further complexity reduction. For Reed–Solomon (RS) codes, an architecture combining the encoder and syndrome calculator is proposed in [16]. It factorizes the generator polynomial of the RS code into the product of $(x - \alpha^i)$. However, if the same method is applied to a BCH code, the generator polynomial, which consists of minimal polynomials $M_i(x)$ with binary coefficients, will be factorized into the product of $(x - \alpha^i)$, and the encoding process will be transformed from binary operations into more complicated finite field operations. Therefore, it is not efficient for BCH codes.

Fig. 1. Block diagram of MPCN$_i$.

1063-8210 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

For BCH codes, a combined encoder and syndrome calculator (SC) is proposed in [9]. By preprocessing the received codeword with the encoder, the evaluation range of the SC is reduced from degree $(n-1)$ to degree $(n-k-1)$. Nevertheless, the preprocessing of the received codeword increases the total latency of the SC, and the hardware resources cannot be shared with this method. In contrast, because the syndrome coefficients $S_i$ can be calculated with minimal polynomials [17], and the proposed encoder architecture is based on minimal polynomials as well, these two major components of the BCH codec can be combined efficiently.

In this paper, our test chip provides a three-stage pipelined parallel-8 BCH codec with t = 1–24 bits to demonstrate the proposed architectures. To meet the current reliability requirement of NAND flash memory, we also present an enhanced t = 60–84 bits design with similar architectures.

This paper is organized as follows. Section II gives the background. Section III describes how the proposed multimode encoder architecture is developed. The combined encoder and SC architecture and a comparison with prior art are presented in Section IV. Section V describes the other blocks of the proposed parallel-8 BCH codecs, including the key equation solver (KES) and Chien search logic (CSL). Section VI lists measurement results and comparisons. Section VII concludes this paper.

II. BACKGROUND

A. Construction of BCH Codes

A finite field GF($2^m$) contains $2^m$ elements and can be viewed as an $m$-dimensional vector space over the subfield GF(2). Let $\beta \neq 0$ be an element of GF($2^m$). Since $\beta^{2^m-1} = 1$, $\beta$ is a root of $x^{2^m-1} + 1$. However, $\beta$ can also be a root of another polynomial over GF(2) with degree less than $2^m$. The minimal polynomial of $\beta$ is defined as the monic polynomial with the smallest degree among all polynomials over GF(2) having $\beta$ as a root. In a monic polynomial, the coefficient of the highest-order term is always 1. If there were two distinct minimal polynomials of $\beta$, $\phi(x)$ and $\phi'(x)$, the difference $\phi(x) - \phi'(x)$ would be nonzero, have a smaller degree, and still have $\beta$ as a root. Since this contradicts the definition, the minimal polynomial is unique. Besides, a minimal polynomial must be irreducible over GF(2); otherwise, it would not be the least-degree polynomial having $\beta$ as a root. The roots of a minimal polynomial $\phi(x)$ are conjugates of each other. In general, if $e$ is the smallest positive integer such that $(\alpha^i)^{2^e} = \alpha^i$, then the minimal polynomial of $\alpha^i \in$ GF($2^m$) is defined as

$$M_i(x) = \prod_{j=0}^{e-1} \left(x + (\alpha^i)^{2^j}\right). \tag{1}$$

From (1), the minimal polynomial of $\alpha^i$ is the same as that of $(\alpha^i)^{2^j}$

$$M_i(x) = M_{i \cdot 2^j}(x) \tag{2}$$

and $(\alpha^i)^{2^j}$, $1 \le j \le e$, are the conjugates of $\alpha^i$.

An $(n, k; t)$ BCH code over GF($2^m$) with a $k$-bit message and an $n$-bit codeword can correct at most $t$ errors [18]. The codeword $c(x)$ of the BCH code is a polynomial of degree $(n-1)$, $c(x) = c_{n-1}x^{n-1} + \cdots + c_1x + c_0$, $c_i \in$ GF(2), and the information polynomial is $u(x) = u_{k-1}x^{k-1} + \cdots + u_1x + u_0$, $u_i \in$ GF(2). The relationship between $c(x)$ and $u(x)$ is defined as

$$c(x) = u(x)x^{n-k} + p(x) \tag{3}$$

where the parity polynomial $p(x) = u(x)x^{n-k} \bmod g_{BCH}(x)$. Note that the generator polynomial is defined as

\begin{align}
g_{BCH}(x) &= \mathrm{LCM}\{M_1(x), M_2(x), \ldots, M_{2t}(x)\} \tag{4} \\
&= \mathrm{LCM}\{M_1(x), M_3(x), \ldots, M_{2t-1}(x)\} \tag{5} \\
&= M_1(x) \times M_3(x) \times \cdots \times M_{2t-1}(x) \tag{6} \\
&= g_{n-k}x^{n-k} + \cdots + g_2x^2 + g_1x + g_0 \tag{7}
\end{align}

where each coefficient $g_j$, $0 \le j \le n-k$, is a binary value and $M_i(x)$ is the minimal polynomial of $\alpha^i$ with coefficients over GF(2). Equation (4) can be reduced to (5) because $(\alpha^i)^{2^j}$ is a conjugate of $\alpha^i$, as in (2). If $M_i(x) \neq M_j(x)$ for all distinct odd $i$, $j$ with $1 \le i, j \le 2t-1$, then (5) can be reduced to (6).
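As an illustrative check of (1) and (4)–(7), the following Python sketch builds the minimal polynomials and the generator polynomial of a small BCH code. It assumes that GF($2^4$) is generated by the primitive polynomial $x^4 + x + 1$, which the text does not state explicitly; the helper names (`coset`, `minimal_poly`, `generator`) and the bit-mask representation of GF(2) polynomials are choices made for this sketch only and are not part of the proposed hardware.

```python
# GF(2^4) arithmetic; PRIM = x^4 + x + 1 is an assumed primitive polynomial.
M, PRIM = 4, 0b10011
N = (1 << M) - 1  # multiplicative order of alpha

# exp[e] = alpha^e as a bit vector in the standard basis; log is the inverse map.
exp, log = [0] * (2 * N), {}
v = 1
for e in range(N):
    exp[e] = exp[e + N] = v
    log[v] = e
    v <<= 1
    if v >> M:
        v ^= PRIM

def gf_mul(a, b):
    """Product of two field elements via the log/antilog tables."""
    return 0 if a == 0 or b == 0 else exp[log[a] + log[b]]

def coset(i):
    """Exponents of alpha^i and its conjugates: i, 2i, 4i, ... (mod 2^m - 1)."""
    out, j = [], i % N
    while j not in out:
        out.append(j)
        j = (2 * j) % N
    return out

def minimal_poly(i):
    """M_i(x) from (1), returned as a GF(2) bit mask (bit d = coefficient of x^d)."""
    poly = [1]  # coefficients in GF(2^m), lowest degree first
    for j in coset(i):
        root = exp[j]
        nxt = [0] * (len(poly) + 1)
        for d, c in enumerate(poly):
            nxt[d + 1] ^= c              # contribution of x * (c x^d)
            nxt[d] ^= gf_mul(c, root)    # contribution of root * (c x^d)
        poly = nxt
    assert all(c in (0, 1) for c in poly)  # coefficients collapse to GF(2)
    return sum(c << d for d, c in enumerate(poly))

def poly_mul_gf2(a, b):
    """Carry-less product of two GF(2) polynomials stored as bit masks."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def generator(t):
    """g_BCH(x) per (6): product of the distinct M_i(x) for odd i up to 2t - 1."""
    g, seen = 1, set()
    for i in range(1, 2 * t, 2):
        rep = min(coset(i))
        if rep not in seen:
            seen.add(rep)
            g = poly_mul_gf2(g, minimal_poly(i))
    return g

print(bin(minimal_poly(1)), bin(minimal_poly(3)), bin(generator(2)))
```

Under these assumptions, `generator(2)` corresponds to $x^8 + x^7 + x^6 + x^4 + 1$, so $n - k = 8$, as expected for the (15, 7; 2) code, and `minimal_poly(2)` equals `minimal_poly(1)`, in line with (2).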

B. SC Using Minimal Polynomials

Since any polynomial over GF($2^m$) can be expressed in terms of the minimal polynomial $M_i(x)$ of $\alpha^i$, the received codeword $r(x)$ can be rewritten as

$$r(x) = M_i(x) \times Q_i(x) + b_i(x) \tag{8}$$

where $b_i(x)$ is the remainder polynomial resulting from dividing $r(x)$ by $M_i(x)$, and $Q_i(x)$ is the quotient polynomial. Because the degree of $b_i(x)$ is always less than $\deg(M_i(x))$, which generally equals $m$, $b_i(x)$ can be written as $b_i(x) = b_{i,m-1}x^{m-1} + b_{i,m-2}x^{m-2} + \cdots + b_{i,1}x + b_{i,0}$. Because the syndrome $S_i = r(\alpha^i)$ and $M_i(\alpha^i) = 0$, (8) becomes

\begin{align}
S_i &= r(x)|_{x=\alpha^i} \notag \\
&= M_i(\alpha^i) \times Q_i(\alpha^i) + b_i(\alpha^i) \notag \\
&= b_i(\alpha^i). \tag{9}
\end{align}

Therefore, the SC can be implemented using the minimal polynomials [17].

The following is an example for the BCH (15, 7; 2) code over GF($2^4$). Since the even syndromes $S_2$ and $S_4$ can be calculated from the odd ones $S_1$ and $S_3$, here we only discuss the calculation of the odd syndromes. From (9), $S_1$ and $S_3$ become

\begin{align}
S_1 &= b_1(\alpha) \notag \\
&= b_{1,3}\alpha^3 + b_{1,2}\alpha^2 + b_{1,1}\alpha + b_{1,0} \tag{10} \\
S_3 &= b_3(\alpha^3) \notag \\
&= b_{3,3}\alpha^9 + b_{3,2}\alpha^6 + b_{3,1}\alpha^3 + b_{3,0} \notag \\
&= b_{3,3}(\alpha^3 + \alpha) + b_{3,2}(\alpha^3 + \alpha^2) + b_{3,1}\alpha^3 + b_{3,0} \notag \\
&= (b_{3,3} + b_{3,2} + b_{3,1})\alpha^3 + b_{3,2}\alpha^2 + b_{3,3}\alpha + b_{3,0}. \tag{11}
\end{align}

As a result, the SC for $S_1$ and $S_3$ based on (10) and (11) can be implemented as in Fig. 2.

Fig. 2. SC for $S_1$ and $S_3$ in BCH (15, 7; 2) code.

Fig. 3. Block diagram of conventional LFSR-based BCH encoder. (a) Conventional LFSR-based BCH encoder. (b) Transformed LFSR-based BCH encoder with system function $1/G(z) + 1$.
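The example in (8)–(11) can be checked numerically with a short Python sketch. It assumes the primitive polynomial $x^4 + x + 1$ for GF($2^4$) (an assumption, although the reductions $\alpha^9 = \alpha^3 + \alpha$ and $\alpha^6 = \alpha^3 + \alpha^2$ used in (11) are consistent with it); the received word `r` and all function names are hypothetical and serve only this illustration. Polynomials over GF(2) and field elements are both stored as integer bit masks.

```python
M, PRIM = 4, 0b10011          # GF(2^4); x^4 + x + 1 is an assumed primitive polynomial
N = (1 << M) - 1

def alpha_pow(e):
    """alpha^e as a bit vector in the standard basis {alpha^3, alpha^2, alpha, 1}."""
    v = 1
    for _ in range(e % N):
        v <<= 1
        if v >> M:
            v ^= PRIM
    return v

def poly_mod_gf2(r, m):
    """Remainder of the GF(2) polynomial r(x) divided by m(x); both are bit masks."""
    dm = m.bit_length() - 1
    while r.bit_length() - 1 >= dm:
        r ^= m << (r.bit_length() - 1 - dm)
    return r

def evaluate(poly, i):
    """Evaluate a binary polynomial at alpha^i by summing alpha^(i*d) over its set bits."""
    s, d = 0, 0
    while poly:
        if poly & 1:
            s ^= alpha_pow(i * d)
        poly >>= 1
        d += 1
    return s

def s3_closed_form(b):
    """Equation (11): S_3 computed from the four remainder bits b_{3,3}, ..., b_{3,0}."""
    b3, b2, b1, b0 = (b >> 3) & 1, (b >> 2) & 1, (b >> 1) & 1, b & 1
    return ((b3 ^ b2 ^ b1) << 3) | (b2 << 2) | (b3 << 1) | b0

M1, M3 = 0b10011, 0b11111     # minimal polynomials of alpha and alpha^3 in this field
r = 0b100000000010001         # hypothetical received word r(x) = x^14 + x^4 + 1
b1, b3 = poly_mod_gf2(r, M1), poly_mod_gf2(r, M3)

# (9): evaluating the remainder gives the same syndrome as evaluating r(x) itself.
print(evaluate(r, 1) == b1, evaluate(r, 3) == s3_closed_form(b3))
```

In this representation the remainder bits $b_{1,3}, \ldots, b_{1,0}$ are already the standard-basis coordinates of $S_1$, which is why (10) needs no extra logic, whereas $S_3$ needs the XOR network described by (11).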

III. PROPOSED MULTIPLE t ENCODER ARCHITECTURE

The conventional BCH encoder is implemented by LFSRs based on the generator polynomial, and the block diagram is shown in Fig. 3(a), in which the connections are determined by the coefficients in (13).

From the viewpoint of a discrete-time system [19], and given that $g_{n-k}$ is always 1, (13) can be obtained from (7)

\begin{align}
G(z) &= 1 + g_{n-k-1}z^{-1} + \cdots + g_1z^{-(n-k-1)} + g_0z^{-(n-k)} \tag{12} \\
&= M_1(z) \times M_3(z) \times \cdots \times M_{2t-1}(z). \tag{13}
\end{align}

If we regard the information $u(x)$ and codeword $c(x)$ as discrete-time signals $u[n]$ and $c[n]$, respectively, their z-transforms are $U(z)$ and $C(z)$, which are related by $C(z) = C(z) \cdot (G(z) + 1) + U(z)$. In addition, the system function in Fig. 3(a) is

$$H(z) = \frac{C(z)}{U(z)} = \frac{1}{G(z)}. \tag{14}$$

To remove the upper switch, which blocks the feedback signal in the last $(n-k)$ cycles during encoding, the block diagram in Fig. 3(a) can be transformed into Fig. 3(b), and the system function becomes $1/G(z) + 1$.

Fig. 4. Block diagram of the proposed BCH encoder. (a) BCH encoder based on factorized generator polynomial. (b) Proposed BCH encoder removing zero-delay loop.

Based on (13), the generator polynomial is factorized into several minimal polynomials, implying that the encoder can be implemented by several separate $M_i(z)$ sub-LFSRs, whose structures are determined by the binary coefficients $M_{i,k}$ of $M_i(x)$. The encoder can be redrawn as Fig. 4(a). To remove the path that causes a zero-delay loop in Fig. 4(a), the design can be modified by adding the input to each sub-LFSR. Fig. 4(b) is the block diagram of the proposed encoder architecture.

Unlike in the conventional approach, the data bits stored in the registers at the $(k+1)$th cycle are not the parity bits; the parity bits are instead computed continuously using the feedback signal in the last $(n-k)$ cycles. Besides, each sub-LFSR in Fig. 4(b) requires at most $m$ XOR gates, and the XOR chain below these sub-LFSRs involves $(t-1)$ XOR gates. As a result, the complexity of the proposed approach in terms of XOR gates is $m \cdot t + (t-1)$.

For a multimode BCH encoder with error correcting capabilities $t_1 < t_2 < \cdots < t_{\max}$, the function $G_{t_j}(z)$ of each $t_j$ can be listed from (13)

\begin{align}
G_{t_1}(z) &= M_1(z)M_3(z) \cdots M_{2t_1-1}(z) \tag{15} \\
G_{t_2}(z) &= M_1(z)M_3(z) \cdots M_{2t_2-1}(z) \notag \\
&= G_{t_1}(z)M_{2(t_1+1)-1}(z) \cdots M_{2t_2-1}(z) \tag{16} \\
&\;\;\vdots \notag \\
G_{t_{\max}}(z) &= G_{t_1}(z) \cdots M_{2t_{\max}-1}(z) \tag{17} \\
&= G_{t_{\max}-1}(z)M_{2t_{\max}-1}(z). \tag{18}
\end{align}

Since $G_{t_j}(z) = \prod_l M_{2l-1}(z)$ for $1 \le l \le t_j$, it can be seen from (17) that $G_{t_j}(z)$ is a partial product of $G_{t_{\max}}(z)$, where $t_j < t_{\max}$. As a result, the hardware resource of $t_j$ is contained in that of $t_{\max}$, leading to a significant hardware reduction for a multiple-$t$ design.

Fig. 5. BCH encoder in [14].

If there are repeated minimal polynomials such that $M_{2t_i-1}(x) = M_{2t_j-1}(x)$ for some $t_i \neq t_j$, $1 \le t_i, t_j \le t_{\max}$, then either $M_{2t_i-1}(z)$ or $M_{2t_j-1}(z)$ is removed from $G_{t_{\max}}(z)$. For example, since $\alpha^{17}$ is a conjugate of $\alpha^5$ in GF($2^6$), the minimal polynomial $M_5(x) = M_{17}(x)$, and the function of a 9-error-correcting BCH code in GF($2^6$) is defined as

$$G_9(z) = M_1(z)M_3(z) \cdots M_{13}(z)M_{15}(z). \tag{19}$$

Note that $M_{17}(z)$ is removed because it repeats $M_5(z)$. Since $G_9(z)$ can still be factorized into a product of minimal polynomials, the proposed encoding architecture can still be applied in such a case.
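Whether two indices share a minimal polynomial depends only on their conjugacy class, that is, on the set $\{i, 2i, 4i, \ldots\}$ modulo $2^m - 1$, as follows from (2). The Python sketch below uses this to list which $M_i(z)$ survive in $G_t(z)$; the function names `coset_leader` and `distinct_factors` are hypothetical, and the sketch only manipulates exponents, not the polynomials themselves.

```python
def coset_leader(i, m):
    """Smallest exponent in the conjugacy class {i, 2i, 4i, ...} modulo 2^m - 1."""
    n = (1 << m) - 1
    start = i % n
    j, best = start, start
    while True:
        j = (2 * j) % n
        if j == start:
            return best
        best = min(best, j)

def distinct_factors(t, m):
    """Odd indices i <= 2t - 1 whose M_i(x) is kept in G_t(z), in increasing order."""
    kept, leaders = [], set()
    for i in range(1, 2 * t, 2):
        lead = coset_leader(i, m)
        if lead not in leaders:   # a repeated minimal polynomial is skipped
            leaders.add(lead)
            kept.append(i)
    return kept

# GF(2^6), t = 9: index 17 falls into the class of 5, so M_17 is dropped as in (19).
print(coset_leader(17, 6), distinct_factors(9, 6))
```

For GF($2^6$) and $t = 9$ the surviving indices are 1, 3, ..., 15, matching (19).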

A prior-art architecture for a multimode BCH encoder is proposed in [14], which supports multiple error correcting capabilities using cascaded LFSRs based on the minimal polynomials. As shown in Fig. 5, the function of each LFSR is similar to the conventional LFSR encoder in Fig. 3(a). For the first $k$ cycles, the information is fed into the first LFSR, and the quotient $q_1$ is propagated to the next LFSR. For the last $(n-k)$ cycles, the switch in each LFSR is opened and blocks the feedback signal so that the values $(p_1, p_3, \ldots, p_{2t-1})$ in the registers of each LFSR can be sent to the weighting and combining block. The parity $p(x)$ is then computed according to

$$p(x) = \bigl[\cdots\bigl[\,[p_{2t-1} \cdot M_{2(t-1)-1}(x) + p_{2(t-1)-1}] \cdot M_{2(t-2)-1}(x) + p_{2(t-2)-1}\bigr] \cdots\bigr] \times M_1(x) + p_1. \tag{20}$$

Due to the overhead of the weighting and combining block, the complexity in terms of XOR gates is $m \cdot t + (m+1)(t-1)$, which exceeds that of the proposed approach by $m(t-1)$ XOR gates.
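To give a feel for the two XOR-gate expressions, the following Python sketch evaluates $m \cdot t + (t-1)$ and $m \cdot t + (m+1)(t-1)$ for $m = 14$ and the two largest error correcting capabilities discussed in this paper. The function names are hypothetical, and both expressions are treated as the serial (one bit per cycle) counts stated in the text, with every sub-LFSR assumed to use the full $m$ XOR gates, so the numbers are upper-bound estimates rather than synthesis results.

```python
def xor_count_proposed(m, t):
    """m XOR gates per sub-LFSR plus the (t - 1)-gate XOR chain."""
    return m * t + (t - 1)

def xor_count_cascaded(m, t):
    """Cascaded-LFSR encoder of [14], including the weighting and combining block."""
    return m * t + (m + 1) * (t - 1)

for t in (24, 84):
    a, b = xor_count_proposed(14, t), xor_count_cascaded(14, t)
    print(f"t={t}: proposed {a}, cascaded {b}, difference {b - a}")
```

The difference is always $m(t-1)$, so the gap widens linearly with the supported error correcting capability.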

Two examples are given in the following: 1) one with t = 1–24 and 2) the other with t = 60–84. In these sets of $t$, the minimal polynomials are all distinct, that is, $M_{t_i}(x) \neq M_{t_j}(x)$ for distinct odd indices $t_i$, $t_j$ with $1 \le t_i, t_j \le 2 \times 84 - 1$. Both codes are over GF($2^{14}$). According to (18), the function for $t_{\max} = 24$ in the first example is $G_{24}(z) = G_{23}(z)M_{2 \times 24-1}(z)$. For the second example, the function for $t_{\max} = 84$ can be derived from (17) as $G_{84}(z) = G_{60}(z)M_{2 \times 61-1}(z) \cdots M_{2 \times 84-1}(z)$, with $G_{60}(z) = M_1(z)M_3(z) \cdots M_{2 \times 60-1}(z)$. The encoder block diagrams of these two examples are depicted in Fig. 6(a) and (b). In the first $k$ cycles, the message bits are fed into each minimal polynomial sub-LFSR block; in the last $n-k$ cycles, the encoder circuit outputs the parity bits corresponding to the error correcting capability signal t_select, which is determined from the performance requirement, and the feedback signal is selected by the multiplexer.

Fig. 6. Block diagrams of proposed BCH encoder. (a) BCH encoder with t = 1–24 over GF($2^{14}$). (b) BCH encoder with t = 60–84 over GF($2^{14}$).

Conventionally, a multiple-$t$ encoder is implemented with the LFSR architecture. Although the registers for storing parity bits can be shared for hardware reduction, the combinational logic for each $t$ is distinct because each generator polynomial is unique. As the required error correcting capability gets larger, the storage space for parity bits and the distinct combinational logic of each $t$ result in a huge hardware overhead.

Compared with the conventional BCH encoder supporting the same number of error correcting capabilities, the encoder of the first example requires only the sub-LFSRs of $M^{-1}_{2 \times 1-1}(z), M^{-1}_{2 \times 2-1}(z), \ldots, M^{-1}_{2 \times 24-1}(z)$, and the encoder of the second example involves only the LFSR of $G^{-1}_{60}(z)$ and the sub-LFSRs of $M^{-1}_{2 \times 61-1}(z), M^{-1}_{2 \times 62-1}(z), \ldots, M^{-1}_{2 \times 84-1}(z)$, while the conventional approach needs all 24 LFSRs $G_1(z), G_2(z), \ldots, G_{24}(z)$ and all 25 LFSRs $G_{60}(z), G_{61}(z), \ldots, G_{84}(z)$, respectively. Therefore, our approach efficiently reduces the hardware complexity of the BCH encoder.

Fig. 7. Hardware complexity of parallel-8 BCH encoder over GF($2^{14}$).

Fig. 8. Block diagram of a BCH decoder.

To fulfil the high-throughput demand of NAND flash memory systems, the proposed architecture can also be adapted to $p$-bit parallelism. In Fig. 6(b), each minimal polynomial block can be separated into two parts: 1) the sequential part, which contains the registers; and 2) the combinational part, which is exactly the MPCN. To support $p$-bit parallelism, the encoder has to process $p$ information bits and output the same number of codeword bits simultaneously. Since the computation of the parity bits is done by the MPCNs, $p$-bit parallelism can be achieved by duplicating the MPCN $p$ times.

The comparison between the proposed encoder architecture and the conventional LFSR architecture is shown in Fig. 7. In these examples, two different groups of error correcting capabilities are provided: 1) t = 1–24 bits; and 2) t = 60–84 bits. Both the proposed approaches and the conventional LFSR approaches are designed with 8-bit parallelism for protecting 1024 bytes of information. According to the synthesis results in a standard 65-nm CMOS process, the gate count of the conventional LFSR approach for t = 1–24 bits is 21.2k, whereas the proposed approach requires only 6.6k, which is 31.1% of the conventional one. Furthermore, the area efficiency of the proposed encoder architecture is even more remarkable in the case of t = 60–84 bits. The conventional design requires 109.6k gates, whereas the proposed one requires only 16.9k, an 84.6% reduction.

IV. COMBINED MPCN-BASED ENCODER AND SC

As shown in Fig. 8, the BCH decoder involves three major parts: 1) SC; 2) KES; and 3) CSL. In this section, the method for combining the encoder architecture introduced in Section III with the SC is presented.

To allow a clear comparison with the proposed architecture, the approach of combining the encoder and SC for RS codes in [16] is introduced first. Then, the method for BCH codes in [9] is introduced and compared. To support the discussion, the characteristics of RS codes are briefly reviewed.

Fig. 9. Computation of syndrome $S_i$.

The RS code [20] is an important subclass of nonbinary BCH codes. Owing to the similarity between RS codes and BCH codes, the algorithms and VLSI architectures for encoding and decoding these two codes are quite similar.

In an $(n, k; t)$ RS code over $F$ = GF($2^m$), an $n$-symbol codeword consists of $k$ message symbols and $r = n - k = 2t$ parity symbols, where each symbol is an element of GF($2^m$) and can be viewed as an $m$-tuple vector. The codeword and information of the RS code can be expressed as polynomials $c'(x)$ and $u'(x)$, respectively, as for the BCH code, but note that the coefficients $c'_i$ and $u'_i$ are now over $F$. The relationship between $c'(x)$ and $u'(x)$ is similar to (3): $c'(x) = u'(x)x^{n-k} + p'(x)$. The parity polynomial $p'(x)$ is also defined as the remainder of $u'(x)x^{n-k}$ divided by the generator polynomial $g_{RS}(x)$, which is the product of $2t$ consecutive degree-1 polynomials

\begin{align}
g_{RS}(x) &= (x + \alpha)(x + \alpha^2) \cdots (x + \alpha^{2t}) \notag \\
&= x^{2t} + G_{2t-1}x^{2t-1} + \cdots + G_1x + G_0 \tag{21}
\end{align}

where $G_i \in F$ = GF($2^m$), $1 \le i \le 2t-1$. A prior-art architecture combining the encoder and SC for RS codes is proposed in [16]. The factor polynomial $(x - \alpha^i)$ in (21), which equals $(x + \alpha^i)$ in characteristic 2, can be implemented by a constant finite field multiplier (CFFM). In addition, the syndrome polynomial $S(x)$ is defined as $S(x) = S_1 + S_2x + \cdots + S_{2t}x^{2t-1}$. From the received codeword polynomial $r(x) = r_0 + r_1x + \cdots + r_{n-1}x^{n-1}$, the coefficients $S_i$ are calculated as

\begin{align}
S_i &= r(\alpha^i) = \sum_{j=0}^{n-1} r_j(\alpha^i)^j, \quad 1 \le i \le 2t \tag{22} \\
&= r_{n-1}\alpha^{(n-1)i} + r_{n-2}\alpha^{(n-2)i} + \cdots + r_0 \notag \\
&= (r_{n-1}\alpha^i + r_{n-2})\alpha^{(n-2)i} + \cdots + r_0 \notag \\
&= ((r_{n-1}\alpha^i + r_{n-2})\alpha^i + r_{n-3})\alpha^{(n-3)i} + \cdots + r_0 \notag \\
&= (\cdots(r_{n-1}\alpha^i + r_{n-2})\alpha^i + \cdots)\alpha^i + r_0. \tag{23}
\end{align}

Since $S_i$ can be reformulated in the iterative form (23), it can be implemented as in Fig. 9, where $\beta$ and $\gamma$ denote the value stored in the register and the input value of each iteration, respectively. Accordingly, $S_i$ can be calculated iteratively by a CFFM, and the SC and encoder can be combined by sharing the CFFM resources.
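The iteration in (23) is Horner's rule: the register value $\beta$ is multiplied by the constant $\alpha^i$ and the next input $\gamma$ is added. A minimal Python sketch of this recurrence follows, assuming GF($2^4$) with the primitive polynomial $x^4 + x + 1$ and a binary received word; the field choice, the test word, and the function names are assumptions made for illustration and do not describe the CFFM hardware of [16].

```python
M, PRIM = 4, 0b10011   # GF(2^4) with x^4 + x + 1, an assumed primitive polynomial
N = (1 << M) - 1

def gf_mul(a, b):
    """Shift-and-add product of two field elements held as bit vectors."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= PRIM
    return r

def gf_pow(a, e):
    """a^e by repeated multiplication (adequate for this small field)."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r

def syndrome_horner(received, i):
    """Iterate beta <- beta * alpha^i + gamma over r_{n-1}, ..., r_0 as in (23)."""
    alpha_i = gf_pow(0b0010, i)   # the constant multiplier alpha^i
    beta = 0
    for gamma in received:        # received[0] is r_{n-1}
        beta = gf_mul(beta, alpha_i) ^ gamma
    return beta

def syndrome_direct(received, i):
    """Direct evaluation of the sum in (22), used only as a cross-check."""
    n = len(received)
    s = 0
    for pos, r_j in enumerate(received):
        j = n - 1 - pos
        s ^= gf_mul(r_j, gf_pow(0b0010, (i * j) % N))
    return s

# Hypothetical 15-bit word, listed from r_14 down to r_0.
word = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0]
print([syndrome_horner(word, i) == syndrome_direct(word, i) for i in range(1, 5)])
```

The Horner form needs one constant multiplication and one addition per received symbol, which is what makes the single-register structure with inputs $\beta$ and $\gamma$ possible.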

However, this method is not efficient for BCH codes. If the generator polynomial $g_{BCH}(x)$, consisting of binary-coefficient $M_i(x)$, is factorized into the product of $(x - \alpha^i)$ as in [16], the encoding process will be transformed from binary operations into more complicated finite field operations, resulting in a hardware complexity overhead.

Fig. 10. Combined BCH encoder and SC in [9].

On the other hand, a method for combining the BCH encoder and its SC is proposed in [9]. It utilizes the property that $r(x)$ can be expressed in terms of the generator polynomial $g_{BCH}(x)$. By defining $r(x) = q(x) + g_{BCH}(x) \cdot a(x)$, the syndromes $S_i$, $1 \le i \le 2t$, can be calculated by

$$S_i = r(\alpha^i) = q(\alpha^i). \tag{24}$$

In addition, by expressing the received polynomial as $r(x) = r_1(x)x^{n-k} + r_2(x)$ and by reencoding $r_1(x)$ with the original LFSR encoder, the remainder polynomial $d(x)$ can be obtained from

$$r_1(x)x^{n-k} = a(x)g_{BCH}(x) + d(x). \tag{25}$$

With (24) and (25), the syndromes can be calculated from $q(x) = d(x) + r_2(x)$ by the following procedure: 1) reencode the first $k$ bits of $r(x)$; 2) add the remaining $n-k$ bits; and 3) finally calculate $q(\alpha^i)$. The block diagram of this method is shown in Fig. 10.

Although the calculation of the syndromes $S_i$ can be eased from $r(\alpha^i)$ to $q(\alpha^i)$ with the help of the encoder, leading to an integrated SC and encoder, there are several drawbacks. First, since the encoder is used only to shorten the evaluation range [from $\deg(r(x)) = n-1$ to $\deg(q(x)) = n-k-1$], the hardware resources of the encoder and SC cannot be shared with this method. Second, the computing latency of the syndromes is not reduced but actually increased. The latency of reencoding is $\lceil n/p \rceil$ cycles, and the calculation of $q(\alpha^i)$ requires $\lceil (n-k)/p \rceil$ cycles. Because the calculation of $q(\alpha^i)$ must be done after the reencoding, the latency of reencoding should also be taken into account for a fair comparison. Therefore, the required latency of this method is actually $\lceil n/p \rceil + \lceil (n-k)/p \rceil$ cycles, which is larger than that of an ordinary SC, and the additional $\lceil (n-k)/p \rceil$ cycles can be a timing overhead for a strong BCH codec.
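The latency argument can be made concrete with a small Python calculation. The parameters below are illustrative assumptions rather than figures taken from the measurements: $k = 8192$ information bits (1024 bytes), $m = 14$, $t = 24$ with 14 parity bits per correctable error (so $n - k = 336$), and parallelism $p = 8$. The function names are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def latency_reencode(n, k, p):
    """Cycles for the method of [9]: reencoding, then evaluating q(alpha^i)."""
    return ceil_div(n, p) + ceil_div(n - k, p)

def latency_ordinary(n, p):
    """Cycles for an ordinary syndrome calculator processing p bits per cycle."""
    return ceil_div(n, p)

k, parity, p = 8192, 14 * 24, 8   # assumed parameters, see the lead-in
n = k + parity
print(latency_reencode(n, k, p), latency_ordinary(n, p))
```

With these assumed parameters the reencoding-based method needs 42 more cycles than an ordinary SC, and the gap grows in proportion to $n - k$, that is, with the error correcting capability.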

Fig. 11. Combined BCH encoder and SC with t = 60–84 over GF(2^14).

In contrast, since the architectures of both the proposed encoder and the SC are based on the sub-LFSRs of minimal polynomials, the hardware resources of these two components can be efficiently shared. Moreover, the computation latency of the SC is the same as that of the ordinary one, without any additional latency. Since the syndromes S_i can be transformed into b_i(α^i), the original polynomial evaluation, such as the example shown in (10) and (11), is changed into a multiplication between an m × m matrix and an m-tuple vector to enhance the efficiency of computation

\[
\begin{aligned}
S_i &= b_i(\alpha^i) \\
    &= b_{i,m-1}(\alpha^i)^{m-1} + b_{i,m-2}(\alpha^i)^{m-2} + \cdots + b_{i,0} \\
    &= \underbrace{\begin{bmatrix}
         \alpha^{i(m-1)}_{m-1} & \alpha^{i(m-2)}_{m-1} & \cdots & \alpha^{0}_{m-1} \\
         \alpha^{i(m-1)}_{m-2} & \alpha^{i(m-2)}_{m-2} & \cdots & \alpha^{0}_{m-2} \\
         \vdots & \vdots & \ddots & \vdots \\
         \alpha^{i(m-1)}_{0} & \alpha^{i(m-2)}_{0} & \cdots & \alpha^{0}_{0}
       \end{bmatrix}}_{BT_i}
       \begin{bmatrix}
         b_{i,m-1} \\ b_{i,m-2} \\ \vdots \\ b_{i,0}
       \end{bmatrix}
\end{aligned}
\tag{26}
\]

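where the matrix in (26) is called the ith basis transformer (BT_i), and the subscript on each matrix entry selects one standard-basis coordinate bit of the corresponding power of α^i. BT_i transforms the received bits from the basis of b_i(x), {α^(i(m−1)), α^(i(m−2)), ..., α^i, 1}, to the standard basis of GF(2^m), {α^(m−1), α^(m−2), ..., α, 1}.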
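The matrix form in (26) can be illustrated with a small sketch, assuming the toy field GF(2^4) with primitive polynomial x^4 + x + 1 instead of GF(2^14). The function names are hypothetical. Each column of BT_i holds the standard-basis bits of one power of α^i, and the product is taken over GF(2).

```python
PRIM, M, ALPHA = 0b10011, 4, 2  # assumed toy field GF(2^4), alpha = x


def gf_mul(a, b):
    """Multiply two GF(2^4) elements in the standard basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= PRIM
    return r


def gf_pow(a, e):
    """Raise a field element to a non-negative integer power."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r


def basis_transformer(i):
    """Build BT_i as M rows of GF(2) bits, most significant coordinate first."""
    # Columns are alpha^{i(m-1)}, ..., alpha^{i}, alpha^{0} in the standard basis.
    cols = [gf_pow(ALPHA, i * k) for k in range(M - 1, -1, -1)]
    return [[(c >> bit) & 1 for c in cols] for bit in range(M - 1, -1, -1)]


def apply_bt(bt, b_bits):
    """Multiply BT_i by (b_{m-1}, ..., b_0) over GF(2); return a packed element."""
    out = 0
    for row in bt:
        bit = 0
        for entry, b in zip(row, b_bits):
            bit ^= entry & b
        out = (out << 1) | bit
    return out


def eval_direct(b_bits, i):
    """Evaluate b_i(x) at x = alpha^i with Horner's rule."""
    x, acc = gf_pow(ALPHA, i), 0
    for b in b_bits:
        acc = gf_mul(acc, x) ^ b
    return acc


bits = [1, 0, 1, 1]  # example coefficients (b_3, b_2, b_1, b_0)
print(apply_bt(basis_transformer(3), bits) == eval_direct(bits, 3))
```

Because the matrix entries are constant bits, the transformer reduces to fixed XOR networks in hardware, with no general finite field multiplier involved.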

By transforming the bits of each sub-LFSR in Fig. 6(b), the previously introduced encoder can be turned into an SC [17]. To be consistent, the example for the proposed combined encoder and SC uses the same specifications as Fig. 6(b). The block diagram is shown in Fig. 11. The major difference between this hardware and the encoder-only one is the addition of the basis transformers BT_i, 2 × 61 − 1 ≤ i ≤ 2 × 84 − 1. As an encoder, the computation is exactly the same as in Fig. 6(b); as an SC, the data bits of each sub-LFSR M_(2×t_j−1)(z)^(−1) are processed within its own block without passing through the other sub-LFSRs, and the data bits go through the basis transformers at the last clock cycle of computation.

To support parallel processing, the serial hardware in Fig. 11 can be modified with the concept introduced in Section III. As shown in Fig. 12, in encoder mode, the hardware passes the processed information bits through the MPCNs vertically and passes the feedback bit to adjacent MPCN loops horizontally. In SC mode, the received bits are added into the MPCN loop at a different position, and the hardware passes the processed bits vertically without passing any cross-loop signal. After ⌈n/p⌉ clock cycles, the syndromes are available by performing the basis transform on the processed bits of each loop.

Table I lists the synthesis results of two encoder architectures (the proposed MPCN-based encoder and the LFSR encoder) and three approaches to a combined encoder and SC. For the first approach, since the original implementation in [9] is a (4148, 4096; 4) BCH code, and its SC and CSL are merged for complexity reduction, it is reimplemented for fair comparison


TABLE I

SYNTHESIS RESULTS FOR COMBINED BCH ENCODER AND SC

Fig. 12. Parallel-p combined BCH encoder and SC with t = 60–84 over GF(2^14).

with our pipelined designs. Without loss of consistency, the architecture of the SC for the LFSR approach is the MPCN-based SC, which is the same as in the proposed approach. From the results shown in Table I, the gate-count of the proposed approach is 16.2k, whereas that of the LFSR approach is 36k for the case of t = 1–24 bits; the gate-count of the proposed approach is 58.3k, whereas that of the LFSR approach is 154.9k for the case of t = 60–84 bits. For the LFSR approach, the additional overhead of supporting syndrome calculation with the encoder hardware is 14.8k for t = 1–24 bits and 45.3k for t = 60–84 bits. In the proposed approach, the overhead is reduced to 9.6k for t = 1–24 bits and 41.4k for t = 60–84 bits.
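The relative savings implied by the gate-counts quoted above can be worked out with a short sketch. Only the figures stated in the text are used. The dictionary layout and function name are illustrative, and the percentages are derived here rather than reported in Table I.

```python
# Gate-counts in thousands of gates, as quoted in the text for the two cases.
GATE_COUNT_K = {
    "t=1-24": {"proposed": 16.2, "lfsr": 36.0},
    "t=60-84": {"proposed": 58.3, "lfsr": 154.9},
}
# Extra gates for supporting syndrome calculation with the encoder hardware.
SC_OVERHEAD_K = {
    "t=1-24": {"proposed": 9.6, "lfsr": 14.8},
    "t=60-84": {"proposed": 41.4, "lfsr": 45.3},
}


def reduction_percent(table, case):
    """Percentage by which the proposed figure is below the LFSR figure."""
    entry = table[case]
    return (1.0 - entry["proposed"] / entry["lfsr"]) * 100.0


for case in GATE_COUNT_K:
    total = reduction_percent(GATE_COUNT_K, case)
    overhead = reduction_percent(SC_OVERHEAD_K, case)
    print(f"{case}: gate-count reduced by {total:.1f}%, SC overhead by {overhead:.1f}%")
```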

V. PROPOSED PARALLEL-8 BCH CODEC ARCHITECTURE

This paper presents two designs: 1) a BCH codec chip with t = 1–24 bits and 1-kB information bits over GF(2^14); and 2) a synthesized design with enhanced error correcting capabilities t = 60–84 bits to meet the current reliability requirement of NAND flash memory. In these designs, the encoder and SC with 8-bit parallelism are efficiently combined utilizing the proposed MPCN-based architecture in Sections III and IV.

The key equation solver (KES) in our t = 1–24 chip utilizes the inversionless Berlekamp–Massey (iBM) algorithm with the reversed error locator polynomial (RELP) σ̃(x) [21] to eliminate the hardware overhead of dummy location search in shortened BCH codes. To reduce the hardware complexity of the KES, which is the most computation-intensive part of the BCH decoder, the folded architecture is used such that the required FFMs are reduced from 3t + 1 to ⌈t/3⌉ + 1. Accordingly, it requires nine FFMs, and the required clock cycles for the KES are increased from 2t to 18t, which is at most 432 clock cycles.
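The multiplier and cycle counts quoted above follow directly from the stated expressions. The sketch below evaluates them for t = 24. The function name and dictionary keys are illustrative, and the formulas are taken from the text rather than derived from the folded architecture itself.

```python
def folded_kes_2t_iteration(t):
    """FFM and cycle counts for the 2t-iteration KES, unfolded versus folded."""
    return {
        "ffms_unfolded": 3 * t + 1,
        "ffms_folded": -(-t // 3) + 1,  # ceil(t / 3) + 1
        "cycles_unfolded": 2 * t,
        "cycles_folded": 18 * t,
    }


print(folded_kes_2t_iteration(24))
```

For t = 24 the folding trades a ninefold increase in cycles for roughly an eightfold reduction in multipliers.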

However, for the proposed design with t = 60–84, since the required number of clock cycles exceeds the maximal clock cycles of a single stage in the three-stage pipelined BCH decoder, the original 2t-iteration folded architecture is not suitable. Because S_(2i) = (S_i)^2 in a finite field of characteristic 2 and the discrepancy Δ is always 0 in the odd iterations of the BM algorithm, the number of iterations can be reduced from 2t to t [17]. Owing to the halved iteration count, the required FFMs can be further reduced from ⌈t/3⌉ + 1 to ⌈t/4⌉ + 1. As a result, it requires 22 FFMs, and the computation latency is 12t clock cycles, which is at most 1008 clock cycles. To eliminate the overhead of dummy location search, the coefficients of the error locator polynomial are reciprocally shifted before being fed into the CSL.
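The squaring property used above can be checked with a minimal sketch, assuming the toy field GF(2^4) and an arbitrary 15-bit received word in place of the GF(2^14) codes of this paper. The same block evaluates the quoted ⌈t/4⌉ + 1 and 12t expressions for t = 84. All names are illustrative.

```python
PRIM, M, ALPHA = 0b10011, 4, 2  # assumed toy field GF(2^4), alpha = x


def gf_mul(a, b):
    """Multiply two GF(2^4) elements in the standard basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= PRIM
    return r


def gf_pow(a, e):
    """Raise a field element to a non-negative integer power."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r


def syndrome(r, i):
    """S_i = r(alpha^i) for a binary polynomial r (bit k = coefficient of x^k)."""
    x, acc = gf_pow(ALPHA, i), 0
    for k in range(r.bit_length() - 1, -1, -1):
        acc = gf_mul(acc, x) ^ ((r >> k) & 1)
    return acc


def folded_kes_t_iteration(t):
    """FFM count ceil(t/4) + 1 and latency 12t, as quoted in the text."""
    return -(-t // 4) + 1, 12 * t


received = 0b101100111110001  # arbitrary 15-bit word, not necessarily a codeword
even_ok = all(
    syndrome(received, 2 * i) == gf_mul(syndrome(received, i), syndrome(received, i))
    for i in range(1, 8)
)
print(even_ok, folded_kes_t_iteration(84))
```

The identity holds for any binary polynomial because squaring is linear in characteristic 2 and the coefficients 0 and 1 are their own squares.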

The CSL is used to iteratively find the error locations. Traditionally, the parallel-p CSL utilizes a chain of p CFFMs to find p locations simultaneously. It is called a direct-unfolded architecture with the unfolded factor p [22]. Since the timing


complexity of a CFFM over GF(2^m) is usually proportional to m · t_XOR, where t_XOR is the delay of an XOR gate, the path delay of the CFFM chain involving p CFFMs is p · m · t_XOR. As the parallelism p gets larger, the lengthened CFFM chain would cause a large timing overhead to the hardware. Thus, similar to the architecture for the encoder and syndrome generator, the proposed work utilizes the MPCNs to replace the CFFMs in the CSL. Since the delay of each single MPCN is at most t_XOR, the path delay is reduced to p · t_XOR.
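As a rough worked example of the two delay expressions, take the parallelism p = 8 and the field GF(2^14) used by the designs in this paper. The figures are only indicative, since the CFFM delay is stated as proportional to m · t_XOR rather than equal to it.

```latex
\begin{align*}
T_{\text{CFFM chain}} &\approx p \cdot m \cdot t_{\text{XOR}} = 8 \cdot 14 \cdot t_{\text{XOR}} = 112\, t_{\text{XOR}}, \\
T_{\text{MPCN chain}} &\le p \cdot t_{\text{XOR}} = 8\, t_{\text{XOR}}.
\end{align*}
```

Under these assumptions the chain delay shrinks by a factor of about m = 14.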

Although both the parallel architecture of the MPCN-based Chien search and the proposed combined encoder and syndrome calculator utilize the MPCNs as basic components, the ways of adapting the MPCNs into the implementation are quite different. The approach for the encoder starts from factorizing the generator polynomial into a product of minimal polynomials, whereas the approach for the Chien search is developed from expressing the jth coefficient of the error locator polynomial σ̃ in terms of the powers of α^j. Since α^j is one of the roots of the minimal polynomial M_j(x), the logic of the Chien search can be implemented with the sub-LFSRs of minimal polynomials.

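The key idea of the CSL is that it exhaustively examines whether σ̃(α^i) = 0 for i = 0, ..., n − 1

\[
\tilde{\sigma}(\alpha^i) = \sum_{j=0}^{t} \tilde{\sigma}_j (\alpha^i)^j
                        = \sum_{j=1}^{t} \tilde{\sigma}_j \alpha^{ij} + \tilde{\sigma}_0.
\tag{27}
\]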

Thus, from (27), the examination of σ̃(α^i) = 0 involves the calculation of σ̃_j α^(ij). To calculate σ̃_j α^(ij) with minimal polynomials, we first define an (m − 1)-degree polynomial T_j(x) = t_(j,m−1) x^(m−1) + t_(j,m−2) x^(m−2) + ··· + t_(j,1) x + t_(j,0); the relationship between σ̃_j and T_j(x) is then defined as

\[
\begin{aligned}
\tilde{\sigma}_j &= T_j(x)\big|_{x=\alpha^j} \\
  &= t_{j,m-1}\alpha^{j(m-1)} + \cdots + t_{j,1}\alpha^{j} + t_{j,0}\alpha^{0} \\
  &= \tilde{\sigma}_{j,m-1}\alpha^{m-1} + \cdots + \tilde{\sigma}_{j,1}\alpha^{1} + \tilde{\sigma}_{j,0}\alpha^{0}.
\end{aligned}
\tag{28}
\]

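From (28), {σ̃_(j,m−1), ..., σ̃_(j,1), σ̃_(j,0)} and {t_(j,m−1), ..., t_(j,1), t_(j,0)} can be viewed as coordinates with respect to the standard basis of GF(2^m), {α^(m−1), ..., α^1, α^0}, and the basis {α^(j(m−1)), ..., α^j, α^0}, respectively. Thus, the relationship between these two bases can be expressed as

\[
\begin{bmatrix}
  \tilde{\sigma}_{j,m-1} \\ \tilde{\sigma}_{j,m-2} \\ \vdots \\ \tilde{\sigma}_{j,0}
\end{bmatrix}
=
\underbrace{\begin{bmatrix}
  \alpha^{j(m-1)}_{m-1} & \alpha^{j(m-2)}_{m-1} & \cdots & \alpha^{0}_{m-1} \\
  \alpha^{j(m-1)}_{m-2} & \alpha^{j(m-2)}_{m-2} & \cdots & \alpha^{0}_{m-2} \\
  \vdots & \vdots & \ddots & \vdots \\
  \alpha^{j(m-1)}_{0} & \alpha^{j(m-2)}_{0} & \cdots & \alpha^{0}_{0}
\end{bmatrix}}_{BT_j}
\begin{bmatrix}
  t_{j,m-1} \\ t_{j,m-2} \\ \vdots \\ t_{j,0}
\end{bmatrix}.
\tag{29}
\]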

Similar to the matrix operation in (26), the matrix operation in (29) converts σ̃_j from the basis {α^(j(m−1)), ..., α^j, α^0} to the basis {α^(m−1), ..., α^1, α^0}. Thus, this matrix is indeed the jth basis transformer (BT_j). As a result, the coefficients of T_j(x) can be determined by the inverse matrix of BT_j, which is called the jth inverse basis transformer (IBT_j)

\[
\begin{bmatrix}
  t_{j,m-1} \\ t_{j,m-2} \\ \vdots \\ t_{j,0}
\end{bmatrix}
=
\underbrace{\begin{bmatrix}
  \alpha^{j(m-1)}_{m-1} & \alpha^{j(m-2)}_{m-1} & \cdots & \alpha^{0}_{m-1} \\
  \alpha^{j(m-1)}_{m-2} & \alpha^{j(m-2)}_{m-2} & \cdots & \alpha^{0}_{m-2} \\
  \vdots & \vdots & \ddots & \vdots \\
  \alpha^{j(m-1)}_{0} & \alpha^{j(m-2)}_{0} & \cdots & \alpha^{0}_{0}
\end{bmatrix}^{-1}}_{IBT_j}
\begin{bmatrix}
  \tilde{\sigma}_{j,m-1} \\ \tilde{\sigma}_{j,m-2} \\ \vdots \\ \tilde{\sigma}_{j,0}
\end{bmatrix}.
\tag{30}
\]

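Based on the previously defined terms, let P_(ij) = σ̃_j α^(ij); it can be represented in terms of T_j(x) as

\[
\begin{aligned}
P_{ij} &= \tilde{\sigma}_j \alpha^{ij} \\
  &= T_j(x)\big|_{x=\alpha^j} \times (\alpha^j)^i \\
  &= x^i T_j(x)\big|_{x=\alpha^j} \\
  &= \left[ M_j(x) \times W_j(x) + D_j(x) \right]\big|_{x=\alpha^j} \\
  &= M_j(\alpha^j) \times W_j(\alpha^j) + D_j(\alpha^j)
\end{aligned}
\tag{31}
\]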

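where D_j(x) is the remainder polynomial resulting from dividing x^i T_j(x) by M_j(x), and W_j(x) is the corresponding quotient. Since α^j is a root of M_j(x), P_(ij) = σ̃_j α^(ij) is equal to D_j(α^j), and (27) can be rewritten as

\[
\tilde{\sigma}(\alpha^i) = \sum_{j=1}^{t} \tilde{\sigma}_j \alpha^{ij} + \tilde{\sigma}_0
  = \sum_{j=1}^{t} P_{ij} + \tilde{\sigma}_0
  = \sum_{j=1}^{t} D_j(\alpha^{j}) + \tilde{\sigma}_0.
\tag{32}
\]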

In (32), the Chien search can be further integrated by summing up all the evaluation results of the t BTs. Compared with summation after the basis transformations, the integrated operation removes unnecessary logic operations and leads to lower hardware complexity. Thus, (32) can be reformulated with a group basis transformer as

\[
\tilde{\sigma}(\alpha^i) = \sum_{j=1}^{t} \sum_{k=0}^{m-1} d_{j,k}\,\alpha^{jk} + \tilde{\sigma}_0
  = \sum_{v=0}^{mt} \; \sum_{\forall\, jk = v} d_{j,k}\,\alpha^{v} + \tilde{\sigma}_0
\tag{33}
\]

where d_(j,k) is the kth coefficient of D_j(x). Owing to the structural similarity, the MPCN-based parallel Chien search is also potentially capable of sharing common resources with the encoder and syndrome generator introduced previously. The MPCN-based parallel-8 Chien search architecture is shown in Fig. 13.
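The derivation in (27) to (32) can be exercised with a serial (p = 1) sketch, assuming the toy field GF(2^4) and a degree-2 polynomial whose roots are known in advance. The brute-force search below merely stands in for the inverse basis transformer IBT_j, and the names and example coefficients are illustrative rather than taken from the paper's hardware.

```python
PRIM, M, N, ALPHA = 0b10011, 4, 15, 2  # assumed toy field GF(2^4), n = 15
# Minimal polynomials M_1(x) = M_2(x) = x^4 + x + 1 (alpha and alpha^2 are conjugates).
MIN_POLY = {1: 0b10011, 2: 0b10011}


def gf_mul(a, b):
    """Multiply two GF(2^4) elements in the standard basis."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:
            a ^= PRIM
    return r


def gf_pow(a, e):
    """Raise a field element to a non-negative integer power."""
    r = 1
    for _ in range(e):
        r = gf_mul(r, a)
    return r


def poly_eval(p, x):
    """Evaluate a binary-coefficient polynomial at a field element (Horner)."""
    acc = 0
    for k in range(p.bit_length() - 1, -1, -1):
        acc = gf_mul(acc, x) ^ ((p >> k) & 1)
    return acc


def inverse_basis_transform(sigma_j, j):
    """Brute-force stand-in for IBT_j: find T_j with T_j(alpha^j) = sigma_j."""
    beta = gf_pow(ALPHA, j)
    for cand in range(1 << M):
        if poly_eval(cand, beta) == sigma_j:
            return cand
    raise ValueError("powers of alpha^j do not form a basis")


def chien_search_lfsr(sigma):
    """Return every i with sigma(alpha^i) = 0, using one sub-LFSR per coefficient."""
    t = len(sigma) - 1
    state = {j: inverse_basis_transform(sigma[j], j) for j in range(1, t + 1)}
    roots = []
    for i in range(N):
        total = sigma[0]
        for j in range(1, t + 1):
            total ^= poly_eval(state[j], gf_pow(ALPHA, j))  # BT_j: D_j(alpha^j)
            state[j] <<= 1                                   # D_j <- x * D_j mod M_j
            if state[j] >> M:
                state[j] ^= MIN_POLY[j]
        if total == 0:
            roots.append(i)
    return roots


def chien_search_direct(sigma):
    """Reference: evaluate sigma(alpha^i) with field multiplications."""
    roots = []
    for i in range(N):
        x, total = gf_pow(ALPHA, i), 0
        for j, coeff in enumerate(sigma):
            total ^= gf_mul(coeff, gf_pow(x, j))
        if total == 0:
            roots.append(i)
    return roots


# sigma(x) = (x + alpha^3)(x + alpha^11), coefficients listed from degree 0 upward.
SIGMA = [9, 6, 1]
print(chien_search_lfsr(SIGMA), chien_search_direct(SIGMA))
```

Each iteration needs only a one-bit LFSR shift per coefficient plus constant XOR networks, which is the property the MPCN-based architecture exploits in place of constant finite field multipliers.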

VI. MEASUREMENT RESULTS AND COMPARISON

Given the limited silicon area, our test chip provides a three-stage pipelined parallel-8 BCH codec with error correcting capability t = 1–24 bits for protecting 1024 bytes of information (Design-I), as shown in Fig. 14, and we also propose a synthesized design with a similar architecture but enhanced t = 60–84 bits (Design-II) to meet the current reliability requirement of NAND flash memory. The combined MPCN-based encoder and SC is utilized for efficiently supporting multiple error correcting capabilities. The RELP-based iBM algorithm and the t-iteration iBM algorithm with the reciprocally shifting approach are utilized in the proposed Design-I and Design-II, respectively, for eliminating the need


Fig. 13. MPCN-based Chien search architecture with 8-bit parallelism.

Fig. 14. Microphoto of proposed BCH chip with t = 1–24.

TABLE II

IMPLEMENTATION RESULTS OF PROPOSED BCH CODECS

for dummy location search in shortened codes. The MPCN-based architecture is also applied in the CSL for further reducing complexity.

According to the measurement results in CMOS 65-nm technology, the test chip achieves 1.33-Gb/s throughput with 73k gate-count, and Design-II achieves 1.60-Gb/s synthesized throughput at the cost of 168.6k gate-count. Both designs meet the 133 MB/s target for the toggle-mode double data rate NAND 1 standard [23].
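The comparison against the 133 MB/s target is a unit conversion, sketched below. It assumes decimal prefixes (1 Gb = 10^9 bits, 1 MB = 10^6 bytes), which the text does not state explicitly. The function and constant names are illustrative.

```python
TARGET_MB_PER_S = 133.0  # target quoted in the text


def gbps_to_mbytes_per_s(gbps):
    """Convert Gb/s to MB/s, assuming decimal prefixes and 8 bits per byte."""
    return gbps * 1000.0 / 8.0


for name, gbps in (("Design-I", 1.33), ("Design-II", 1.60)):
    rate = gbps_to_mbytes_per_s(gbps)
    print(f"{name}: {rate:.2f} MB/s, meets target: {rate >= TARGET_MB_PER_S}")
```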

Table II lists the implementation and comparison results. Compared with other designs, our designs both support 24 or more modes with competitive hardware cost, whereas the works in [10] and [12] only support 1 and 4 modes, respectively. Moreover, our design integrates the hardware resources for the encoder, providing a low-complexity yet much more powerful codec than that in [12]. Compared with [13], which supports t = 32 (single mode), our design supports t = 60–84 (25 modes), but the hardware complexity is only 1.53 times that of [13]. As a result, the proposed MPCN-based architecture enables an area-efficient BCH codec with arbitrary error correcting capability.

VII. CONCLUSION

This paper introduces not only a multiple-t BCH encoder architecture, but also a combined architecture for the encoder and SC. Exploiting the property that the generator polynomial is the product of minimal polynomials, the encoder can efficiently support multiple error correcting capabilities by using sub-LFSRs of minimal polynomials as basic components. Moreover, the proposed encoder can be merged with the SC by adding basis transformers. To meet the high-throughput demand of NAND flash applications, parallel processing is achieved utilizing the MPCNs. Two BCH codec designs are presented in this paper. From the implementation results in CMOS 65-nm technology, the test chip supporting t = 1–24 bits can achieve 1.33-Gb/s measured throughput with 73k gate-count; the other design supporting t = 60–84 bits can provide 1.60-Gb/s throughput with 168.6k gate-count. Both designs meet the throughput requirement of the industry standard while retaining area efficiency.

REFERENCES

[1] N. Shibata et al., “A 70 nm 16 Gb 16-level-cell NAND flash memory,” IEEE J. Solid-State Circuits, vol. 43, no. 4, pp. 929–937, Apr. 2008.

[2] H. Kim et al., “A 159 mm² 32 nm 32 Gb MLC NAND-flash memory with 200 MB/s asynchronous DDR interface,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2010, pp. 442–443.

[3] D. Lee et al., “A 64 Gb 533 Mb/s DDR interface MLC NAND flash in sub-20 nm technology,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2012, pp. 430–432.

[4] Y. Li et al., “128 Gb 3b/cell NAND flash memory in 19 nm technology with 18 MB/s write rate and 400 Mb/s toggle mode,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2012, pp. 436–437.

[5] Y. Cai, E. F. Haratsch, O. Mutlu, and K. Mai, “Error patterns in MLC NAND flash memory: Measurement, characterization, and analysis,” in Proc. Design, Autom. Test Eur. Conf. Exhibit. (DATE), Mar. 2012, pp. 521–526.

[6] S. Tanakamaru, C. Hung, and K. Takeuchi, “Highly reliable and low power SSD using asymmetric coding and stripe bitline-pattern elimination programming,” IEEE J. Solid-State Circuits, vol. 47, no. 1, pp. 85–96, Jan. 2012.

[7] R. C. Bose and D. K. Ray-Chaudhuri, “On a class of error correcting binary group codes,” Inform. Control, vol. 3, no. 1, pp. 68–69, Mar. 1960.

[8] A. Hocquenghem, “Codes correcteurs d’erreurs,” Chiffres, vol. 2, pp. 147–159, Sep. 1959.


[9] W. Liu, J. Rho, and W. Sung, “Low-power high-throughput BCH error correction VLSI design for multi-level cell NAND flash memories,” in Proc. IEEE Workshop Signal Process. Syst. Design Implement., Oct. 2006, pp. 303–308.

[10] T.-H. Chen, Y.-Y. Hsiao, Y.-T. Hsing, and C.-W. Wu, “An adaptive-rate error correction scheme for NAND flash memory,” in Proc. 27th IEEE VLSI Test Symp. (VTS), May 2009, pp. 53–58.

[11] S. Li and T. Zhang, “Improving multi-level NAND flash memory storage reliability using concatenated BCH-TCM coding,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 10, pp. 1412–1420, Oct. 2010.

[12] K. Lee, S. Lim, and J. Kim, “Low-cost, low-power and high-throughput BCH decoder for NAND flash memory,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), May 2012, pp. 413–415.

[13] Y. Lee, H. Yoo, I. Yoo, and I.-C. Park, “6.4 Gb/s multi-threaded BCH encoder and decoder for multi-channel SSD controllers,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers (ISSCC), Feb. 2012, pp. 426–428.

[14] R. Cherukuri, “Agile encoder architectures for strength-adaptive long BCH codes,” in Proc. IEEE GLOBECOM Workshops, Dec. 2010, pp. 1900–1904.

[15] Y.-M. Lin, C.-H. Yang, C.-H. Hsu, H.-C. Chang, and C.-Y. Lee, “A MPCN-based parallel architecture in BCH decoders for NAND flash memory devices,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 58, no. 10, pp. 682–686, Oct. 2011.

[16] G. Fettweis and M. Hassner, “A combined Reed-Solomon encoder and syndrome generator with small hardware complexity,” in Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), vol. 4. May 1992, pp. 1871–1874.

[17] S. Lin and D. J. Costello, Error Control Coding, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1983.

[18] R. E. Blahut, Theory and Practice of Error Control Codes. Reading, MA, USA: Addison-Wesley, 1983.

[19] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-Time Signal Processing, 2nd ed. Englewood Cliffs, NJ, USA: Prentice-Hall, 1999.

[20] I. S. Reed and G. Solomon, “Polynomial codes over certain finite fields,” J. Soc. Ind. Appl. Math., vol. 8, no. 2, pp. 300–304, Jun. 1960.

[21] Y.-M. Lin, J.-Y. Wu, C.-C. Lin, and H.-C. Chang, “A long block length BCH decoder for DVB-S2 application,” in Proc. 12th Int. Symp. Integr. Circuits (ISIC), Dec. 2009, pp. 171–174.

[22] H.-C. Chang, C.-C. Lin, and C.-Y. Lee, “A low-power Reed-Solomon decoder for STM-16 optical communications,” in Proc. IEEE Asia-Pacific Conf. ASIC, 2002, pp. 351–354.

[23] (2010). Toshiba Introduces Double Data Rate Toggle Mode NAND in MLC and SLC Configurations, Toshiba Corp., Tokyo, Japan. [Online]. Available: http://www.toshiba.com/taec/news/press_releases/2010/memy_10_599.jsp

Chi-Heng Yang received the B.S. degree in electronics engineering from National Chiao Tung University, Hsinchu, Taiwan, in 2008, where he is currently working toward the Ph.D. degree at the Institute of Electronics.

His current research interests include coding theory, VLSI implementation, and architecture of error control codes.

Yi-Min Lin received the B.S. degree in electrical engineering from National Tsing Hua University, Hsinchu, Taiwan, in 2005 and the Ph.D. degree from National Chiao Tung University, Hsinchu, in 2011.

He was a Postdoctoral Scholar with the Department of Electrical Engineering, University of California at Los Angeles, Los Angeles, CA, USA, from 2011 to 2012, where he was involved in the area of coding architectures for 60-GHz baseband. In 2012, he joined SK Hynix Memory Solutions, San Jose, CA, as a Systems Architect of Solid-State Drives.

His current research interests include coding theory, VLSI architectures, integrated circuit design for communication and storage systems, and signal processing.

Hsie-Chia Chang received the B.S., M.S., and Ph.D. degrees from the Department of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, in 1995, 1997, and 2002, respectively.

He was with OSP/DE1, MediaTek Corporation, Hsinchu, from 2002 to 2003, where he was involved in the area of decoding architectures for combo single chip. In 2003, he joined the faculty of the Department of Electronics Engineering, National Chiao Tung University, where he has been a Professor since 2010. He has recently committed himself to designing high code-rate ECC schemes for flash memory and rate-less coding scheme for video streaming applications. His current research interests include algorithms and VLSI architectures in signal processing, in particular, error control codes and cryptosystems.

Dr. Chang has served as the Deputy Director General with the Chip Implementation Center, Taiwan, since 2014. He has also served as an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I: REGULAR PAPERS since 2012, and served as a Technical Program Committee Member of the IEEE Asian Solid-State Circuits Conference from 2011 to 2013. He was a recipient of the Outstanding Youth Electrical Engineer Award from the Chinese Institute of Electrical Engineering in 2010, and the Outstanding Youth Researcher Award from the Taiwan IC Design Society in 2011.

Chen-Yi Lee (M’14) received the B.S. degree in electrical engineering from National Chiao Tung University, Hsinchu, Taiwan, in 1982 and the M.S. and Ph.D. degrees in electrical engineering from Katholieke University Leuven, Leuven, Belgium, in 1986 and 1990, respectively.

He was with IMEC, Leuven, and the VLSI Systems and Design Methodology Division, IMEC, from 1986 to 1990, where he was involved in architecture synthesis for digital signal processing (DSP). In 1991, he joined the faculty of the Department of Electronics Engineering, National Chiao Tung University, where he is currently a Professor. He is involved in various aspects of short-range wireless communications, system-on-chip design technology, very low-power designs, and multimedia signal processing. He has published more than 200 journal/conference papers and holds more than 25 China/U.S. patents. His current research interests include VLSI algorithms and architectures for high-throughput and energy-efficient DSP applications.

Dr. Lee served as the Director with the Chip Implementation Center, Hsinchu, from 2000 to 2003. He was the IEEE CAS Taipei Chapter Chair from 2000 to 2001, the SIP Task Leader of National System-on-Chip Research Program from 2003 to 2005, and the Microelectronics Program Coordinator of Engineering Division under the National Science Council of Taiwan from 2003 to 2005.

