
EURASIP Journal on Applied Signal Processing 2003:6, 530–542
© 2003 Hindawi Publishing Corporation

An FPGA Implementation of (3, 6)-Regular Low-Density Parity-Check Code Decoder

Tong Zhang
Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy, NY 12180, USA
Email: [email protected]

Keshab K. Parhi
Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA
Email: [email protected]

Received 28 February 2002 and in revised form 6 December 2002

Because of their excellent error-correcting performance, low-density parity-check (LDPC) codes have recently attracted a lot of attention. In this paper, we are interested in practical LDPC code decoder hardware implementations. The direct fully parallel decoder implementation usually incurs too high hardware complexity for many real applications; thus partly parallel decoder design approaches that can achieve appropriate trade-offs between hardware complexity and decoding throughput are highly desirable. Applying a joint code and decoder design methodology, we develop a high-speed (3, k)-regular LDPC code partly parallel decoder architecture, based on which we implement a 9216-bit, rate-1/2 (3, 6)-regular LDPC code decoder on a Xilinx FPGA device. This partly parallel decoder supports a maximum symbol throughput of 54 Mbps and achieves BER 10^−6 at 2 dB over an AWGN channel while performing a maximum of 18 decoding iterations.

Keywords and phrases: low-density parity-check codes, error-correcting coding, decoder, FPGA.

1. INTRODUCTION

In the past few years, the recently rediscovered low-density parity-check (LDPC) codes [1, 2, 3] have received a lot of attention and have been widely considered as next-generation error-correcting codes for telecommunication and magnetic storage. Defined as the null space of a very sparse M × N parity-check matrix H, an LDPC code is typically represented by a bipartite graph, usually called a Tanner graph, in which one set of N variable nodes corresponds to the set of codeword bits, another set of M check nodes corresponds to the set of parity-check constraints, and each edge corresponds to a nonzero entry in the parity-check matrix H. (A bipartite graph is one in which the nodes can be partitioned into two sets, X and Y, so that the only edges of the graph are between the nodes in X and the nodes in Y.) An LDPC code is known as a (j, k)-regular LDPC code if each variable node has degree j and each check node has degree k or, equivalently, if each column and each row of its parity-check matrix contain j and k nonzero entries, respectively. The code rate of a (j, k)-regular LDPC code is 1 − j/k, provided that the parity-check matrix has full rank. The construction of LDPC codes is typically random. LDPC codes can be effectively decoded by the iterative belief-propagation (BP) algorithm [3] that, as illustrated in Figure 1, directly matches the Tanner graph: decoding messages are iteratively computed on each variable node and check node and exchanged through the edges between the neighboring nodes.

Recently, tremendous efforts have been devoted to analyzing and improving the error-correcting capability of LDPC codes; see [4, 5, 6, 7, 8, 9, 10, 11] and so forth. Besides their powerful error-correcting capability, another important reason why LDPC codes attract so much attention is that the iterative BP decoding algorithm is inherently fully parallel, thus a great potential decoding speed can be expected.

The high-speed decoder hardware implementation is obviously one of the most crucial issues determining the extent of LDPC applications in the real world. The most natural solution for the decoder architecture design is to directly instantiate the BP decoding algorithm in hardware: each variable node and check node is physically assigned its own processor, and all the processors are connected through an interconnection network reflecting the Tanner graph connectivity. By completely exploiting the parallelism of the BP decoding algorithm, such a fully parallel decoder can achieve very high decoding speed; for example, a 1024-bit, rate-1/2 LDPC code fully parallel decoder with a maximum symbol throughput of 1 Gbps has been physically implemented using ASIC technology [12]. The main disadvantage of such


Figure 1: Tanner graph representation of an LDPC code and the decoding message flow (variable-to-check messages travel from variable nodes to check nodes, and check-to-variable messages travel back).

a fully parallel design is that, as the code length increases (LDPC code lengths are typically very large, at least several thousand bits), the incurred hardware complexity becomes more and more prohibitive for many practical purposes; for example, for a 1-K code length, the ASIC decoder implementation [12] consumes 1.7M gates. Moreover, as pointed out in [12], the routing overhead for implementing the entire interconnection network becomes quite formidable due to the large code length and the randomness of the Tanner graph. Thus, high-speed partly parallel decoder design approaches that achieve appropriate trade-offs between hardware complexity and decoding throughput are highly desirable.

For any given LDPC code, due to the randomness of its Tanner graph, it is nearly impossible to directly develop a high-speed partly parallel decoder architecture. To circumvent this difficulty, Boutillon et al. [13] proposed a decoder-first code design methodology: instead of trying to conceive a high-speed partly parallel decoder for a given random LDPC code, use an available high-speed partly parallel decoder to define a constrained random LDPC code. We may consider it an application of the well-known "think in the reverse direction" methodology. Inspired by the decoder-first code design methodology, we proposed a joint code and decoder design methodology in [14] for (3, k)-regular LDPC code partly parallel decoder design. By jointly conceiving the code construction and the partly parallel decoder architecture, we presented a (3, k)-regular LDPC code partly parallel decoder structure in [14], which not only defines very good (3, k)-regular LDPC codes but also can potentially achieve high-speed partly parallel decoding.

In this paper, applying the joint code and decoder design methodology, we develop an elaborate (3, k)-regular LDPC code high-speed partly parallel decoder architecture, based on which we implement a 9216-bit, rate-1/2 (3, 6)-regular LDPC code decoder using a Xilinx Virtex FPGA (field-programmable gate array) device. In this work, we significantly modify the original decoder structure [14] to improve the decoding throughput and simplify the control logic design. To achieve good error-correcting capability, the LDPC code decoder architecture has to possess randomness to some extent, which makes FPGA implementations more challenging since an FPGA has fixed and regular hardware resources. We propose a novel scheme to realize the random connectivity by concatenating two routing networks, where all the random hardwire routings are localized and the overall routing

complexity is significantly reduced. Exploiting the good minimum distance property of LDPC codes, this decoder employs the parity check as an early stopping criterion for decoding, achieving adaptive decoding for energy reduction. With a maximum of 18 decoding iterations, this FPGA partly parallel decoder supports a maximum symbol throughput of 54 Mbps and achieves a BER (bit error rate) of 10^−6 at 2 dB over an AWGN channel.

This paper begins with a brief description of the LDPC code decoding algorithm in Section 2. In Section 3, we briefly describe the joint code and decoder design methodology for (3, k)-regular LDPC code partly parallel decoder design. In Section 4, we present the detailed high-speed partly parallel decoder architecture design. Finally, an FPGA implementation of a (3, 6)-regular LDPC code partly parallel decoder is discussed in Section 5.

2. DECODING ALGORITHM

Since the direct implementation of the BP algorithm would incur too high hardware complexity due to the large number of multiplications, we introduce some logarithmic quantities to convert these complicated multiplications into additions, which leads to the Log-BP algorithm [2, 15].

Before describing the Log-BP decoding algorithm, we introduce some definitions as follows. Let H denote the M × N sparse parity-check matrix of the LDPC code and Hi,j denote the entry of H at position (i, j). We define the set of bits n that participate in parity-check m as N(m) = {n : Hm,n = 1}, and the set of parity checks m in which bit n participates as M(n) = {m : Hm,n = 1}. We denote the set N(m) with bit n excluded by N(m)\n, and the set M(n) with parity-check m excluded by M(n)\m.

Figure 2: Joint design flow diagram (the construction of H = [H1^T, H2^T]^T provides a deterministic input to a high-speed partly parallel decoder which, configured by constrained random parameters and a random input H3, defines a (3, k)-regular LDPC code ensemble with parity-check matrix [H^T, H3^T]^T, from which a code is selected).

Algorithm 1 (Iterative Log-BP Decoding Algorithm).

Input
The prior probabilities p0n = P(xn = 0) and p1n = P(xn = 1) = 1 − p0n, n = 1, ..., N.

Output
Hard decisions x̂ = {x̂1, ..., x̂N}.

Procedure
(1) Initialization: for each n, compute the intrinsic (or channel) message γn = log(p0n/p1n), and for each (m, n) ∈ {(i, j) | Hi,j = 1}, compute

    αm,n = sign(γn) · log[(1 + e^−|γn|) / (1 − e^−|γn|)],    (1)

where

    sign(γn) = +1 if γn ≥ 0, −1 if γn < 0.    (2)

(2) Iterative decoding
(i) Horizontal (or check node computation) step: for each (m, n) ∈ {(i, j) | Hi,j = 1}, compute

    βm,n = log[(1 + e^−α) / (1 − e^−α)] · ∏_{n′∈N(m)\n} sign(αm,n′),    (3)

where α = Σ_{n′∈N(m)\n} |αm,n′|.

(ii) Vertical (or variable node computation) step: for each (m, n) ∈ {(i, j) | Hi,j = 1}, compute

    αm,n = sign(γm,n) · log[(1 + e^−|γm,n|) / (1 − e^−|γm,n|)],    (4)

where γm,n = γn + Σ_{m′∈M(n)\m} βm′,n. For each n, update the pseudoposterior log-likelihood ratio (LLR) λn as

    λn = γn + Σ_{m∈M(n)} βm,n.    (5)

(iii) Decision step:
(a) perform hard decisions on {λ1, ..., λN} to obtain x̂ = {x̂1, ..., x̂N} such that x̂n = 0 if λn > 0 and x̂n = 1 if λn ≤ 0;
(b) if H · x̂ = 0, the algorithm terminates; otherwise go to the horizontal step, until the preset maximum number of iterations has been reached.
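The steps of Algorithm 1 can be sketched directly in software. The following Python sketch is an illustration, not the paper's fixed-point hardware: it uses floating-point arithmetic, a dense parity-check matrix, and hypothetical function names, but it follows equations (1) through (5) and the parity-check-based early termination:

```python
import numpy as np

def phi(x):
    # phi(x) = log((1 + e^-x) / (1 - e^-x)); clipped to avoid overflow near x = 0
    x = np.clip(x, 1e-9, 30.0)
    return np.log((1.0 + np.exp(-x)) / (1.0 - np.exp(-x)))

def log_bp_decode(H, gamma, max_iter=18):
    """Iterative Log-BP decoding (Algorithm 1), behavioral sketch.

    H     : (M, N) 0/1 parity-check matrix
    gamma : length-N intrinsic messages, gamma_n = log(p_n^0 / p_n^1)
    Returns (x_hat, parity_ok).
    """
    M, N = H.shape
    mask = H.astype(bool)
    # (1) Initialization: alpha_{m,n} = sign(gamma_n) * phi(|gamma_n|), eqs. (1)-(2)
    sign_g = np.where(gamma >= 0, 1.0, -1.0)
    alpha = np.where(mask, sign_g * phi(np.abs(gamma)), 0.0)
    x_hat = (gamma < 0).astype(int)
    for _ in range(max_iter):
        # (i) Horizontal step, eq. (3): exclude the edge (m, n) itself
        beta = np.zeros_like(alpha)
        for m in range(M):
            idx = np.nonzero(mask[m])[0]
            mags = np.abs(alpha[m, idx])
            signs = np.where(alpha[m, idx] >= 0, 1.0, -1.0)
            for j, n in enumerate(idx):
                # signs.prod() * signs[j] = product of signs excluding edge j
                beta[m, n] = signs.prod() * signs[j] * phi(mags.sum() - mags[j])
        # (ii) Vertical step, eqs. (4)-(5)
        col_sum = beta.sum(axis=0)                         # sum over m in M(n)
        lam = gamma + col_sum                              # pseudoposterior LLR, eq. (5)
        g_mn = gamma[None, :] + (col_sum[None, :] - beta)  # excludes beta_{m,n}
        alpha = np.where(mask,
                         np.where(g_mn >= 0, 1.0, -1.0) * phi(np.abs(g_mn)), 0.0)
        # (iii) Decision step: x_n = 0 iff lambda_n > 0; stop when H x = 0 (mod 2)
        x_hat = (lam <= 0).astype(int)
        if not ((H @ x_hat) % 2).any():
            return x_hat, True
    return x_hat, False
```

Note that phi(x) is its own inverse on x > 0, which is why the same kernel appears in the initialization, horizontal, and vertical steps.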

We call αm,n and βm,n in the above algorithm extrinsic messages, where αm,n is delivered from variable node to check node and βm,n is delivered from check node to variable node.

Each decoding iteration can be performed in a fully parallel fashion by physically mapping each check node to one individual check node processing unit (CNU) and each variable node to one individual variable node processing unit (VNU). Moreover, by delivering the hard decision x̂i from each VNU to its neighboring CNUs, the parity check H · x̂ can be easily performed by all the CNUs. Thanks to the good minimum distance property of LDPC codes, such an adaptive decoding scheme can effectively reduce the average energy consumption of the decoder without performance degradation.

In partly parallel decoding, the operations of a certain number of check nodes or variable nodes are time-multiplexed, or folded [16], onto a single CNU or VNU. For an LDPC code with M check nodes and N variable nodes, if its partly parallel decoder contains Mp CNUs and Np VNUs, we denote M/Mp as the CNU folding factor and N/Np as the VNU folding factor.
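To make the folding-factor definitions concrete, the snippet below evaluates them for the 9216-bit (3, 6)-regular decoder described later in the paper, which uses k² VNUs and 3·k CNUs for a code with N = L·k² variable nodes and M = 3L·k check nodes (so L = 9216/6² = 256 follows from the paper's parameters; the variable names are ours):

```python
# Folding factors for the (3, 6)-regular decoder of this paper (illustrative).
k = 6                # check node degree
L = 9216 // k**2     # N = L * k^2 = 9216  ->  L = 256
N = L * k**2         # number of variable nodes (code length)
M = 3 * L * k        # number of check nodes of the (3, k)-regular code

N_p = k**2           # VNUs: one per PE block (Section 4)
M_p = 3 * k          # CNUs: 3k in total (Section 4)

vnu_folding = N // N_p   # N / Np
cnu_folding = M // M_p   # M / Mp
print(L, vnu_folding, cnu_folding)   # both folding factors equal L
```

This matches the claim in Section 4 that both the VNU and CNU folding factors equal L.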

3. JOINT CODE AND DECODER DESIGN

In this section, we briefly describe the joint (3, k)-regular LDPC code and decoder design methodology [14]. It is well known that the BP (or Log-BP) decoding algorithm works well if the underlying Tanner graph is 4-cycle free and does not contain too many short cycles. Thus the motivation of this joint design approach is to construct an LDPC code that not only fits a high-speed partly parallel decoder but also has an average cycle length as large as possible in its 4-cycle-free Tanner graph. This joint design process is outlined as follows, and the corresponding schematic flow diagram is shown in Figure 2.

(1) Explicitly construct two matrices H1 and H2 in such a way that H = [H1^T, H2^T]^T defines a (2, k)-regular LDPC code C2 whose Tanner graph has a girth¹ of 12.

(2) Develop a partly parallel decoder that is configured by a set of constrained random parameters and defines a (3, k)-regular LDPC code ensemble, in which each code is a subcode of C2 and has the parity-check matrix [H^T, H3^T]^T.

(3) Select a good (3, k)-regular LDPC code from the code ensemble based on the criteria of large Tanner graph average cycle length and computer simulations. Typically the parity-check matrix of the selected code has only a few redundant checks, so we may assume that the code rate is always 1 − 3/k.

¹Girth is the length of a shortest cycle in a graph.


Figure 3: Structure of H = [H1^T, H2^T]^T (both H1 and H2 are L·k by L·k² submatrices, with N = L·k² columns; H1 consists of L × L identity blocks Ix,y and H2 of cyclically shifted L × L identity blocks Px,y).

Construction of H = [H1^T, H2^T]^T

The structure of H is shown in Figure 3, where both H1 and H2 are L·k by L·k² submatrices. Each block matrix Ix,y in H1 is an L × L identity matrix, and each block matrix Px,y in H2 is obtained by a cyclic shift of an L × L identity matrix. Let T denote the right cyclic shift operator, where T^u(Q) represents right cyclic shifting of matrix Q by u columns; then Px,y = T^u(I), where u = ((x − 1) · y) mod L and I represents the L × L identity matrix. For example, if L = 5, x = 3, and y = 4, we have u = ((x − 1) · y) mod L = 8 mod 5 = 3, and then

    P3,4 = T³(I) =
    [0 0 0 1 0]
    [0 0 0 0 1]
    [1 0 0 0 0]
    [0 1 0 0 0]
    [0 0 1 0 0].    (6)

Notice that in both H1 and H2, each row contains k 1's and each column contains a single 1. Thus, the matrix H = [H1^T, H2^T]^T defines a (2, k)-regular LDPC code C2 with L·k² variable nodes and 2L·k check nodes. Let G denote the Tanner graph of C2; we have the following theorem regarding the girth of G.

Theorem 1. If L cannot be factored as L = a · b, where a, b ∈ {0, . . . , k − 1}, then the girth of G is 12 and there is at least one 12-cycle passing through each check node.
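The construction above is concrete enough to reproduce in a few lines. The following Python sketch builds H = [H1^T, H2^T]^T under the stated rule Px,y = T^u(I) with u = ((x − 1)·y) mod L; the exact placement of the identity and shifted blocks into block rows is our assumption, read off Figure 3 (block row y of H1 holds I1,y through Ik,y; block row x of H2 holds Px,1 through Px,k), and the (2, k)-regularity claimed above can then be checked directly:

```python
import numpy as np

def cyclic_shift(L, u):
    # T^u(I): right cyclic shift of the L x L identity matrix by u columns
    return np.roll(np.eye(L, dtype=int), u, axis=1)

def build_H(L, k):
    # Column groups ordered (1,1), ..., (k,1), (1,2), ..., (k,k); group (x, y)
    # starts at column ((y-1)*k + (x-1)) * L.  Assumed block-row placement:
    # H1 block row y holds I_{x,y} for x = 1..k; H2 block row x holds
    # P_{x,y} for y = 1..k.
    H1 = np.zeros((L * k, L * k * k), dtype=int)
    H2 = np.zeros((L * k, L * k * k), dtype=int)
    for y in range(1, k + 1):
        for x in range(1, k + 1):
            g = ((y - 1) * k + (x - 1)) * L
            H1[(y - 1) * L:y * L, g:g + L] = np.eye(L, dtype=int)
            H2[(x - 1) * L:x * L, g:g + L] = cyclic_shift(L, ((x - 1) * y) % L)
    return np.vstack([H1, H2])
```

Here cyclic_shift(5, 3) reproduces P3,4 of equation (6), and for small parameters such as L = 5, k = 3, every row of build_H(5, 3) has weight k and every column has weight 2, i.e., a (2, k)-regular code with L·k² variable nodes and 2L·k check nodes.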

Partly parallel decoder

Based on the specific structure of H, a principal (3, k)-regular LDPC code partly parallel decoder structure was presented in [14]. This decoder is configured by a set of constrained random parameters and defines a (3, k)-regular LDPC code ensemble. Each code in this ensemble is essentially constructed by inserting extra L·k check nodes into the high-girth (2, k)-regular LDPC code C2 under the constraints specified by the decoder. Therefore, it is reasonable to expect that the codes in this ensemble are unlikely to contain too many short cycles, and we may easily select a good code from it. For real applications, we can select a good code from this code ensemble as follows: first, find several codes in the ensemble with relatively high average cycle lengths, then select the one leading to the best results in computer simulations.

The principal partly parallel decoder structure presentedin [14] has the following properties.

(i) It contains k² memory banks, each of which consists of several RAMs to store all the decoding messages associated with L variable nodes.

(ii) Each memory bank is associated with one address generator that is configured by one element in a constrained random integer set.

(iii) It contains a configurable random-like one-dimensional shuffle network S with routing complexity scaled by k².

(iv) It contains k² VNUs and k CNUs, so that the VNU and CNU folding factors are L·k²/k² = L and 3L·k/k = 3L, respectively.

(v) Each iteration completes in 3L clock cycles, in which only the CNUs work in the first 2L clock cycles and both CNUs and VNUs work in the last L clock cycles.

Over all the possible choices of the random integer set and of the configuration of S, this decoder defines a (3, k)-regular LDPC code ensemble in which each code has the parity-check matrix [H^T, H3^T]^T, where the submatrix H3 is jointly specified by the integer set and S.

4. PARTLY PARALLEL DECODER ARCHITECTURE

In this paper, applying the joint code and decoder design methodology, we develop a high-speed (3, k)-regular LDPC code partly parallel decoder architecture, based on which a 9216-bit, rate-1/2 (3, 6)-regular LDPC code partly parallel decoder has been implemented using a Xilinx Virtex FPGA device. Compared with the structure presented in [14], this partly parallel decoder architecture has the following distinct characteristics.

(i) It employs a novel concatenated configurable random two-dimensional shuffle network implementation scheme to realize the random-like connectivity with low routing overhead, which is especially desirable for FPGA implementations.

(ii) To improve the decoding throughput, both the VNU folding factor and the CNU folding factor are L, instead of L and 3L as in the structure presented in [14].

(iii) To simplify the control logic design and reduce the memory bandwidth requirement, this decoder completes each decoding iteration in 2L clock cycles, in which the CNUs and VNUs work in the first and second L clock cycles, alternately.
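The forward path of such a concatenated two-dimensional shuffle network (detailed later in Figure 8) can be sketched as follows: the k × k arrangement of PE outputs first passes through k configurable intrarow shuffles and then through k configurable intracolumn shuffles, each selected cycle by cycle by a 1-bit control to apply either a fixed hardwired permutation or the identity. The function name, data layout, and choice of permutations below are illustrative assumptions:

```python
def two_stage_shuffle(data, row_perms, col_perms, s_row, s_col):
    """Forward path of a concatenated 2-D shuffle network (behavioral sketch).

    data       : k x k list of lists; entry [x][y] comes from PE block (x, y)
    row_perms  : k fixed permutations of range(k), one per row (hardwired)
    col_perms  : k fixed permutations of range(k), one per column (hardwired)
    s_row/s_col: per-row / per-column 1-bit controls (e.g., read from a small
                 ROM each clock cycle); 1 -> apply the fixed permutation,
                 0 -> pass through unchanged (identity)
    """
    k = len(data)
    # Stage I: intrarow shuffle (each row permuted independently)
    stage1 = [[data[i][row_perms[i][j]] if s_row[i] else data[i][j]
               for j in range(k)] for i in range(k)]
    # Stage II: intracolumn shuffle (each column permuted independently)
    stage2 = [[stage1[col_perms[j][i]][j] if s_col[j] else stage1[i][j]
               for j in range(k)] for i in range(k)]
    return stage2
```

Because each stage only permutes within a single row or column, every hardwired routing stays local to k wires, while the per-cycle control bits make the overall connectivity pattern random-like and time-varying.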

Following the joint design methodology, this decoder should define a (3, k)-regular LDPC code ensemble in which each code has L·k² variable nodes and 3L·k check nodes and, as illustrated in Figure 4, the parity-check matrix of each code has the form H = [H1^T, H2^T, H3^T]^T, where H1 and H2 have the explicit structures shown in Figure 3 and the random-like H3 is specified by certain configuration parameters of the decoder.

Figure 4: The parity-check matrix H = [H1^T, H2^T, H3^T]^T (the L consecutive columns going through block Ix,y form the submatrix H(x,y), with columns labeled h(x,y)1, leftmost, through h(x,y)L, rightmost).

To facilitate the description of the decoder architecture, we introduce some definitions as follows: we denote the submatrix consisting of the L consecutive columns in H that go through the block matrix Ix,y as H(x,y), in which, from left to right, each column is labeled h(x,y)i with i increasing from 1 to L, as shown in Figure 4. We label the variable node corresponding to column h(x,y)i as v(x,y)i, and the L variable nodes v(x,y)i for i = 1, ..., L constitute a variable node group VGx,y. Finally, we arrange the L·k check nodes corresponding to all the L·k rows of submatrix Hi into check node group CGi.

Figure 5 shows the principal structure of this partly parallel decoder. It mainly contains k² PE blocks PEx,y, for 1 ≤ x, y ≤ k, three bidirectional shuffle networks π1, π2, and π3, and 3·k CNUs. Each PEx,y contains one memory bank RAMsx,y that stores all the decoding messages, including the intrinsic and extrinsic messages and hard decisions, associated with all the L variable nodes in the variable node group VGx,y, and contains one VNU to perform the variable node computations for these L variable nodes. Each bidirectional shuffle network πi realizes the extrinsic message exchange between all the L·k² variable nodes and the L·k check nodes in CGi. The k units CNUi,j, for j = 1, ..., k, perform the check node computations for all the L·k check nodes in CGi.

This decoder completes each decoding iteration in 2L clock cycles; during the first and second L clock cycles, it works in check node processing mode and variable node processing mode, respectively. In check node processing mode, the decoder not only performs the computations of all the check nodes but also completes the extrinsic message exchange between neighboring nodes. In variable node processing mode, the decoder only performs the computations of all the variable nodes.

The intrinsic and extrinsic messages are all quantized to five bits, and the iterative decoding datapaths of this partly parallel decoder are illustrated in Figure 6, in which the datapaths in check node processing and variable node processing are represented by solid lines and dash-dot lines, respectively. As shown in Figure 6, each PE block PEx,y contains five RAM blocks: EXT RAM i for i = 1, 2, 3, INT RAM, and DEC RAM. Each EXT RAM i has L memory locations, and the location with address d − 1 (1 ≤ d ≤ L) contains the extrinsic messages exchanged between the variable node v(x,y)d in VGx,y and its neighboring check node in CGi. The INT RAM and DEC RAM store the intrinsic message and hard decision associated with node v(x,y)d at the memory location with address d − 1 (1 ≤ d ≤ L). As we will see later, such a decoding message storage strategy greatly simplifies the control logic for generating the memory access addresses.

For simplicity, Figure 6 does not show the datapath from INT RAM to the EXT RAM i's for extrinsic message initialization, which can be easily realized in L clock cycles before the decoder enters the iterative decoding process.

4.1. Check node processing

During check node processing, the decoder performs the computations of all the check nodes and realizes the extrinsic message exchange between all neighboring nodes. At the beginning of check node processing, in each PEx,y the memory location with address d − 1 in EXT RAM i contains 6-bit hybrid data consisting of the 1-bit hard decision and the 5-bit variable-to-check extrinsic message associated with the variable node v(x,y)d in VGx,y. In each clock cycle, this decoder performs read-shuffle-modify-unshuffle-write operations to convert one variable-to-check extrinsic message in each EXT RAM i to its check-to-variable counterpart. As illustrated in Figure 6, we may outline the datapath loop in check node processing as follows:

(1) read: one 6-bit hybrid datum h(i)x,y is read from each EXT RAM i in each PEx,y;

(2) shuffle: each hybrid datum h(i)x,y goes through the shuffle network πi and arrives at CNUi,j;

(3) modify: each CNUi,j performs the parity check on the 6 input hard decision bits and generates the 6 output 5-bit check-to-variable extrinsic messages β(i)x,y based on the 6 input 5-bit variable-to-check extrinsic messages;

(4) unshuffle: each check-to-variable extrinsic message β(i)x,y is sent back to its PE block via the same path as its variable-to-check counterpart;

(5) write: each β(i)x,y is written to the same memory location in EXT RAM i as its variable-to-check counterpart.

All the CNUs deliver their parity-check results to a central control block that will, at the end of check node processing, determine whether all the parity-check equations specified by the parity-check matrix have been satisfied; if so, the decoding of the current code frame terminates.
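Step (3) of the loop above lends itself to a compact behavioral model. The following Python sketch is behavioral only (the real CNU is a pipelined 5-bit fixed-point circuit, and the function names are ours): it consumes k = 6 hybrid inputs, each a (hard decision, extrinsic message) pair, and produces the parity-check result together with the 6 check-to-variable messages of equation (3):

```python
import math

def phi(x):
    # phi(x) = log((1 + e^-x) / (1 - e^-x)), the kernel of eqs. (1), (3), (4)
    x = min(max(x, 1e-9), 30.0)
    return math.log((1.0 + math.exp(-x)) / (1.0 - math.exp(-x)))

def cnu(hybrid_in):
    """Behavioral check node unit for one degree-k check node.

    hybrid_in: list of k (hard_bit, alpha) pairs, one per neighboring
               variable node (the 6-bit hybrid data of the text).
    Returns (parity_ok, betas): the parity check on the hard decision bits
    and the k check-to-variable messages beta_{m,n} of eq. (3).
    """
    hard_bits = [b for b, _ in hybrid_in]
    alphas = [a for _, a in hybrid_in]
    parity_ok = sum(hard_bits) % 2 == 0          # XOR of all hard decisions
    total_mag = sum(abs(a) for a in alphas)
    total_sign = 1
    for a in alphas:
        total_sign *= 1 if a >= 0 else -1
    betas = []
    for a in alphas:
        excl_sign = total_sign * (1 if a >= 0 else -1)  # product excluding this edge
        betas.append(excl_sign * phi(total_mag - abs(a)))
    return parity_ok, betas
```

Computing one running magnitude sum and one running sign product, then excluding each edge's own contribution, is what lets a hardware CNU serve all k outgoing messages from shared intermediate results.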

Figure 5: The principal (3, k)-regular LDPC code partly parallel decoder structure (π1 and π2 are regular and fixed; π3 is random-like and configurable; the PE blocks are active during variable node processing and the CNUs during check node processing).

Figure 6: Iterative decoding datapaths (each PEx,y contains EXT RAM 1, EXT RAM 2, EXT RAM 3, INT RAM, DEC RAM, and one VNU; extrinsic messages are 5 bits and hybrid data 6 bits).

Figure 7: Five-stage pipelining of the check node processing datapath (read, shuffle, CNU first half, CNU second half combined with unshuffle, write).

To achieve higher decoding throughput, we implement the read-shuffle-modify-unshuffle-write loop with five-stage pipelining as shown in Figure 7, where the CNU is one-stage pipelined. To make this pipelining scheme feasible, we realize each bidirectional I/O connection in the three shuffle networks with two distinct sets of wires running in opposite directions, which means that the hybrid data from PE blocks to CNUs and the check-to-variable extrinsic messages from CNUs to PE blocks are carried on distinct sets of wires. Compared with sharing one set of wires in a time-multiplexed fashion, this approach has higher wire routing overhead but obviates the logic gate overhead due to the realization of time-multiplexing and, more importantly, makes it feasible to directly pipeline the datapath loop for higher decoding throughput.

In this decoder, one address generator AG(i)x,y is associated with each EXT RAM i in each PEx,y. In check node processing, AG(i)x,y generates the address for reading the hybrid data and, due to the five-stage pipelining of the datapath loop, the address for writing back the check-to-variable message is obtained by delaying the read address by five clock cycles. It is clear that the connectivity among all the variable nodes and check nodes, that is, the entire parity-check matrix realized by this decoder, is jointly specified by all the address generators and the three shuffle networks. Moreover, for i = 1, 2, 3, the connectivity among all the variable nodes and the check nodes in CGi is completely determined by the AG(i)x,y and πi. Following the joint design methodology, we implement all the address generators and the three shuffle networks as follows.

4.1.1 Implementations ofAG(1)x,y and π1

The bidirectional shuffle network π1 and AG(1)x,y realize the connectivity among all the variable nodes and all the check nodes in CG1 as specified by the fixed submatrix H1.

[Figure 8: Forward path of π3. Input data ax,y from the PE blocks pass through Stage I (intrarow shuffle), in which each Ψ(r)x applies the fixed random permutation Rx or the identity Id under a 1-bit control s(r)x, producing bx,y; then through Stage II (intracolumn shuffle), in which each Ψ(c)y applies Cy or Id under a 1-bit control s(c)y, producing the output data cx,y delivered to the CNU3,j's. ROM R and ROM C each store L control words, addressed by r = 0, ..., L − 1.]

Recall that node v(x,y)d corresponds to the column h(x,y)i as illustrated in Figure 4 and that the extrinsic messages associated with node v(x,y)d are always stored at address d − 1. Exploiting the explicit structure of H1, we easily obtain the implementation schemes for AG(1)x,y and π1 as follows:

(i) each AG(1)x,y is realized as a ⌈log2 L⌉-bit binary counter that is cleared to zero at the beginning of check node processing;

(ii) the bidirectional shuffle network π1 connects the k PEx,y with the same x-index to the same CNU.

4.1.2 Implementations of AG(2)x,y and π2

The bidirectional shuffle network π2 and AG(2)x,y realize the connectivity among all the variable nodes and all the check nodes in CG2 as specified by the fixed matrix H2. Similarly, exploiting the extrinsic message storage strategy and the explicit structure of H2, we implement AG(2)x,y and π2 as follows:

(i) each AG(2)x,y is realized as a ⌈log2 L⌉-bit binary counter that only counts up to the value L − 1 and is loaded with the value of ((x − 1) · y) mod L at the beginning of check node processing;

(ii) the bidirectional shuffle network π2 connects the k PEx,y with the same y-index to the same CNU.

Notice that the counter load value for each AG(2)x,y comes directly from the construction of each block matrix Px,y in H2 as described in Section 3.
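As a behavioral sketch (not the decoder's VHDL), both address generators can be modeled as mod-L counters that differ only in their initial value; the function names and 1-based (x, y) convention below are illustrative.

```python
def ag_addresses(L, init):
    """Read addresses produced by a mod-L binary counter over one
    check node processing phase (L clock cycles)."""
    return [(init + t) % L for t in range(L)]

def ag1_addresses(L):
    # AG(1)x,y: counter cleared to zero at the start of the phase
    return ag_addresses(L, 0)

def ag2_addresses(L, x, y):
    # AG(2)x,y: counter loaded with ((x - 1) * y) mod L (1-based x, y)
    return ag_addresses(L, ((x - 1) * y) % L)
```

Each counter sweeps all L addresses exactly once, so every stored extrinsic message is read exactly once per check node processing phase.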

4.1.3 Implementations of AG(3)x,y and π3

The bidirectional shuffle network π3 and AG(3)x,y jointly define the connectivity among all the variable nodes and all the check nodes in CG3, which is represented by H3 as illustrated in Figure 4. In the above, we showed that by exploiting the specific structures of H1 and H2 and the extrinsic message storage strategy, we can directly obtain the implementations of each AG(i)x,y and πi for i = 1, 2. However, the implementations of AG(3)x,y and π3 are not as easy because of the following requirements on H3:

(1) the Tanner graph corresponding to the parity-check matrix H = [H1^T, H2^T, H3^T]^T should be 4-cycle free;

(2) to make H random to some extent, H3 should be random-like.

As proposed in [14], to simplify the design process, we conceive AG(3)x,y and π3 separately, in such a way that the implementations of AG(3)x,y and π3 accomplish the above first and second requirements, respectively.

Implementations of AG(3)x,y

We implement each AG(3)x,y as a ⌈log2 L⌉-bit binary counter that counts up to the value L − 1 and is initialized with a constant value tx,y at the beginning of check node processing. Each tx,y is selected at random under the following two constraints:

(1) given x, tx,y1 ≠ tx,y2, for all y1, y2 ∈ {1, ..., k};

(2) given y, tx1,y − tx2,y ≢ ((x1 − x2) · y) mod L, for all x1, x2 ∈ {1, ..., k}.

It can be proved that these two constraints on tx,y are sufficient to make the entire parity-check matrix H always correspond to a 4-cycle-free Tanner graph no matter how we implement π3.
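A small sketch of how the constant values could be drawn at random while enforcing the two constraints; `valid_offsets`, `draw_offsets`, and the 0-based indexing (with y + 1 standing in for the 1-based column index) are my own conventions, not from the paper.

```python
import random

def valid_offsets(t, L, k):
    """Check the two 4-cycle-freedom constraints on the counter init
    values t[x][y], given as a k-by-k list with 0-based indices."""
    # constraint 1: for a given x, all t[x][y] are distinct
    for x in range(k):
        if len(set(t[x])) != k:
            return False
    # constraint 2: for a given y, t[x1][y] - t[x2][y] must differ
    # from ((x1 - x2) * y) mod L for every pair x1 != x2
    for y in range(k):
        for x1 in range(k):
            for x2 in range(k):
                if x1 != x2 and \
                   (t[x1][y] - t[x2][y]) % L == ((x1 - x2) * (y + 1)) % L:
                    return False
    return True

def draw_offsets(L, k, rng=random):
    """Rejection-sample one constant-value set satisfying both
    constraints (random.sample already makes each row distinct)."""
    while True:
        t = [rng.sample(range(L), k) for _ in range(k)]
        if valid_offsets(t, L, k):
            return t
```

The paper additionally screens many such candidate sets for high average cycle length and simulated performance; this sketch covers only the hard constraints.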

Implementation of π3

Since each AG(3)x,y is realized as a counter, the pattern of the shuffle network π3 cannot be fixed; otherwise the shuffle pattern of π3 would be regularly repeated in H3, which means that H3 would always contain very regular connectivity patterns no matter how random-like the pattern of π3 itself is. Thus we should make π3 configurable to some extent. In this paper, we propose the following concatenated configurable random shuffle network implementation scheme for π3.

Figure 8 shows the forward path (from PEx,y to CNU3,j) of the bidirectional shuffle network π3. In each clock cycle, it realizes the data shuffle from ax,y to cx,y in two concatenated stages: intrarow shuffle and intracolumn shuffle. First, the ax,y data block, where each ax,y comes from PEx,y, passes through an intrarow shuffle network array in which each shuffle network Ψ(r)x shuffles the k input data ax,y to bx,y for 1 ≤ y ≤ k. Each Ψ(r)x is configured by a 1-bit control signal s(r)x, leading to the fixed random permutation Rx if s(r)x = 1, or to the identity permutation (Id) otherwise. The reason why we use the Id pattern instead of another random shuffle pattern is to minimize the routing overhead, and our simulations suggest that there is no gain in error-correcting performance from using another random shuffle pattern instead of the Id pattern. The k-bit configuration word s(r) changes every clock cycle, and all the L k-bit control words are stored in ROM R. Next, the bx,y data block goes through an intracolumn shuffle network array in which each Ψ(c)y shuffles the k bx,y to cx,y for 1 ≤ x ≤ k. Similarly, each Ψ(c)y is configured by a 1-bit control signal s(c)y, leading to the fixed random permutation Cy if s(c)y = 1, or to Id otherwise. The k-bit configuration word s(c) changes every clock cycle, and all the L k-bit control words are stored in ROM C. As the output of the forward path, the k cx,y with the same x-index are delivered to the same CNU3,j. To realize the bidirectional shuffle, we only need to implement each configurable shuffle network Ψ(r)x and Ψ(c)y as bidirectional so that π3 can unshuffle the k^2 data backward from CNU3,j to PEx,y along the same route as the forward path on distinct sets of wires. Notice that, due to the pipelining of the datapath loop, the backward path control signals are obtained via delaying the forward path control signals by three clock cycles.

To make the connectivity realized by π3 random-like and changing each clock cycle, we only need to randomly generate the control words s(r)x and s(c)y for each clock cycle and the fixed shuffle patterns of each Rx and Cy. Since most modern FPGA devices have multiple metal layers, the implementations of the two shuffle arrays can be overlapped from the bird's-eye view. Therefore, the above concatenated implementation scheme confines all the routing wires to a small area (one row or one column), which significantly reduces the possibility of routing congestion and reduces the routing overhead.
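Functionally, the forward path of π3 can be sketched as two gather stages; the representation of the fixed permutations as index lists and the function name are my own, not the hardware's.

```python
def pi3_forward(a, R, C, s_r, s_c):
    """Shuffle a k-by-k data block a[x][y] through the two
    concatenated stages of Figure 8. R[x] / C[y] are the fixed random
    permutations given as index lists; s_r[x] / s_c[y] are the 1-bit
    controls from ROM R / ROM C (1 selects the random permutation,
    0 selects the identity Id)."""
    k = len(a)
    # Stage I: intrarow shuffle, a -> b (each row permuted by Rx or Id)
    b = [[a[x][R[x][y]] if s_r[x] else a[x][y] for y in range(k)]
         for x in range(k)]
    # Stage II: intracolumn shuffle, b -> c (each column by Cy or Id)
    c = [[b[C[y][x]][y] if s_c[y] else b[x][y] for y in range(k)]
         for x in range(k)]
    return c
```

The backward path of the decoder traverses the same route in reverse (applying the inverse permutations), with its controls delayed by the three-cycle pipeline latency.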

4.2. Variable node processing

Compared with the above check node processing, the operations performed in the variable node processing are quite simple, since the decoder only needs to carry out all the variable node computations. Notice that at the beginning of variable node processing, the three 5-bit check-to-variable extrinsic messages associated with each variable node v(x,y)d are stored at address d − 1 of the three EXT RAM i in PEx,y. The 5-bit intrinsic message associated with variable node v(x,y)d is also stored at address d − 1 of the INT RAM in PEx,y. In each clock cycle, this decoder performs read-modify-write operations to convert the three check-to-variable extrinsic messages associated with the same variable node into three hybrid data consisting of variable-to-check extrinsic messages and hard decisions.

[Figure 9: Three-stage pipelining of the variable node processing datapath: 5-bit read, VNU (1st half), VNU (2nd half), then write of the 6-bit hybrid data and 1-bit hard decision.]

As shown in Figure 6, we may outline the datapath loop in variable node processing as follows:

(1) read: in each PEx,y, three 5-bit check-to-variable extrinsic messages β(i)x,y and one 5-bit intrinsic message γx,y associated with the same variable node are read from the three EXT RAM i and the INT RAM at the same address;

(2) modify: based on the input check-to-variable extrinsic messages and the intrinsic message, each VNU generates the 1-bit hard decision xx,y and three 6-bit hybrid data h(i)x,y;

(3) write: each h(i)x,y is written back to the same memory location as its check-to-variable counterpart, and xx,y is written to DEC RAM.

The forward path from memory to VNU and the backward path from VNU to memory are implemented by distinct sets of wires, and the entire read-modify-write datapath loop is pipelined in three stages as illustrated in Figure 9.

Since all the extrinsic and intrinsic messages associated with the same variable node are stored at the same address in different RAM blocks, we can use only one binary counter to generate all the read addresses. Due to the pipelining of the datapath, the write address is obtained via delaying the read address by three clock cycles.
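The read/write address relation amounts to one counter plus a delay line matching the pipeline depth (three cycles here, five in the check node phase). A minimal behavioral model, with names of my choosing:

```python
from collections import deque

def addr_pairs(L, depth):
    """(read, write) address per clock cycle: one mod-L counter drives
    the reads, and each write address is the read address issued
    `depth` cycles earlier (None while the pipeline fills/drains)."""
    pipe = deque([None] * depth)
    out = []
    for t in range(L + depth):
        rd = t % L if t < L else None  # counter runs for L cycles
        pipe.append(rd)
        out.append((rd, pipe.popleft()))
    return out
```

The model makes explicit that a full pass takes L + depth cycles: the last `depth` cycles only drain pending writes.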

4.3. CNU and VNU architectures

Each CNU carries out the operations of one check node, including the parity check and the computation of check-to-variable extrinsic messages. Figure 10 shows the CNU architecture for a check node with degree 6. Each input x(i) is 6-bit hybrid data consisting of a 1-bit hard decision and a 5-bit variable-to-check extrinsic message. The parity check is performed by XORing all six 1-bit hard decisions. Each 5-bit variable-to-check extrinsic message is represented in sign-magnitude format with one sign bit and four magnitude bits. The architecture for computing the check-to-variable extrinsic messages is directly obtained from (3). The function f(x) = log((1 + e−|x|)/(1 − e−|x|)) is realized by a LUT (lookup table) implemented as a combinational logic block in the FPGA. Each output 5-bit check-to-variable extrinsic message y(i) is also represented in sign-magnitude format.
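Equation (3) is not reproduced in this excerpt; the sketch below assumes the standard sum-product check node update built from the self-inverse function f, computed in floating point rather than in the 5-bit sign-magnitude LUT arithmetic of the hardware.

```python
import math

def f(x):
    # f(x) = log((1 + e^-|x|) / (1 - e^-|x|)); note f(f(x)) = |x|
    e = math.exp(-abs(x))
    return math.log((1 + e) / (1 - e))

def cnu(msgs):
    """One check node update: for each edge i, the outgoing magnitude
    is f(sum of f(|m_j|) over j != i), and the outgoing sign is the
    product of the other incoming signs."""
    out = []
    for i in range(len(msgs)):
        others = [m for j, m in enumerate(msgs) if j != i]
        sign = -1 if sum(1 for m in others if m < 0) % 2 else 1
        out.append(sign * f(sum(f(abs(m)) for m in others)))
    return out
```

Because f is its own inverse, the hardware can reuse the same LUT contents for both directions of the transform.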

Each VNU generates the hard decision and all the variable-to-check extrinsic messages associated with one variable node. Figure 11 shows the VNU architecture for a variable node with degree 3.

[Figure 10: Architecture for CNU with k = 6. The six 6-bit inputs x(1), ..., x(6) split into 1-bit hard decisions, XORed into the parity-check result, and 5-bit sign-magnitude messages processed through LUTs in a pipelined datapath to produce the six 5-bit outputs y(1), ..., y(6).]

[Figure 11: Architecture for VNU with j = 3. S-to-T: sign-magnitude to two's complement; T-to-S: two's complement to sign-magnitude. The 5-bit intrinsic message z and the three 5-bit inputs y(1), y(2), y(3) are combined in a pipelined datapath to produce the 1-bit hard decision and the three 6-bit outputs x(1), x(2), x(3).]

With the input 5-bit intrinsic message z and three 5-bit check-to-variable extrinsic messages y(i) associated with the same variable node, the VNU generates three 5-bit variable-to-check extrinsic messages and the 1-bit hard decision according to (4) and (5), respectively. To enable each CNU to receive the hard decisions to perform the parity check as described above, the hard decision is combined with each 5-bit variable-to-check extrinsic message to form the 6-bit hybrid data x(i), as shown in Figure 11. Since each input check-to-variable extrinsic message y(i) is represented in sign-magnitude format, we need to convert it to two's complement format before performing the additions. Before going through the LUT that realizes f(x) = log((1 + e−|x|)/(1 − e−|x|)), each data is converted back to sign-magnitude format.
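The S-to-T and T-to-S blocks can be modeled as below for the 5-bit messages (1 sign bit, 4 magnitude bits); this is a behavioral sketch only, with function names of my choosing.

```python
def s_to_t(word, n=5):
    """Sign-magnitude n-bit word (MSB = sign) -> Python signed int."""
    sign = (word >> (n - 1)) & 1
    mag = word & ((1 << (n - 1)) - 1)
    return -mag if sign else mag

def t_to_s(value, n=5):
    """Signed int -> sign-magnitude n-bit word; |value| must fit in
    the n-1 magnitude bits."""
    assert abs(value) < (1 << (n - 1))
    return ((1 if value < 0 else 0) << (n - 1)) | abs(value)
```

Note that sign-magnitude has a negative zero (10000), so the word-level mapping is not one-to-one even though every representable value round-trips.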

4.4. Data Input/Output

This partly parallel decoder works simultaneously on three consecutive code frames in a two-stage pipelining mode: while one frame is being iteratively decoded, the next frame is loaded into the decoder, and the hard decisions of the previous frame are read out from the decoder. Thus each INT RAM contains two RAM blocks to store the intrinsic messages of both the current and the next frame. Similarly, each DEC RAM contains two RAM blocks to store the hard decisions of both the current and the previous frame.

The design scheme for intrinsic message input and hard decision output depends heavily on the floor planning of the k^2 PE blocks. To minimize the routing overhead, we develop a square-shaped floor plan for the PE blocks as illustrated in Figure 12, and the corresponding data input/output scheme is described in the following.

[Figure 12: Data input/output structure.]

(1) Intrinsic data input. The intrinsic messages of the next frame are loaded one symbol per clock cycle. As shown in Figure 12, the memory location of each input intrinsic datum is determined by the input load address, which is (⌈log2 L⌉ + ⌈log2 k^2⌉) bits wide: ⌈log2 k^2⌉ bits specify which PE block (i.e., which INT RAM) is being accessed, and the other ⌈log2 L⌉ bits locate the memory location in the selected INT RAM. The primary intrinsic data and load address inputs connect directly to the k PE blocks PE1,y for 1 ≤ y ≤ k, and from each PEx,y the intrinsic data and load address are delivered to the adjacent PE block PEx+1,y in pipelined fashion.

(2) Decoded data output. The decoded data (or hard decisions) of the previous frame are read out in pipelined fashion. As shown in Figure 12, the primary ⌈log2 L⌉-bit read address input connects directly to the k PE blocks PEx,1 for 1 ≤ x ≤ k, and from each PEx,y the read address is delivered to the adjacent block PEx,y+1 in pipelined fashion. Based on its input read address, each PE block outputs one 1-bit hard decision per clock cycle. Therefore, as illustrated in Figure 12, the width of the pipelined decoded data bus increases by 1 after going through each PE block, and at the rightmost side we obtain k k-bit decoded outputs that are combined into the k^2-bit primary decoded data output.

5. FPGA IMPLEMENTATION

Applying the above decoder architecture, we implemented a (3, 6)-regular LDPC code partly parallel decoder for L = 256 using the Xilinx Virtex-E XCV2600E device in the FG1156 package. The corresponding LDPC code length is N = L · k^2 = 256 · 6^2 = 9216 and the code rate is 1/2. We obtained the constrained random parameter set for implementing π3 and each AG(3)x,y as follows: first, we generated a large number of parameter sets, from which we found the few sets leading to relatively high Tanner graph average cycle length; then we selected the one set leading to the best performance based on computer simulations.

The target XCV2600E FPGA device contains 184 large on-chip block RAMs, each a fully synchronous dual-port 4K-bit RAM. In this decoder implementation, we configure each dual-port 4K-bit RAM as two independent single-port 256 × 8-bit RAM blocks, so that each EXT RAM i can be realized by one single-port 256 × 8-bit RAM block. Since each INT RAM contains two RAM blocks for storing the intrinsic messages of both the current and the next code frame, we use two single-port 256 × 8-bit RAM blocks to implement one INT RAM. Due to its relatively small memory size requirement, the DEC RAM is realized by distributed RAM, which provides shallow RAM structures implemented in CLBs. Since this decoder contains k^2 = 36 PE blocks, each incorporating one INT RAM and three EXT RAM i's, we utilize 180 single-port 256 × 8-bit RAM blocks (or 90 dual-port 4K-bit RAM blocks) in total. We manually configured the placement of each PE block according to the floor-planning scheme shown in Figure 12. Notice that this placement scheme exactly matches the structure of the configurable shuffle network π3 as described in Section 4.1.3; thus the routing overhead for implementing π3 is also minimized in this FPGA implementation.

Table 1: FPGA resource utilization statistics.

  Resource         Number   Utilization rate
  Slices           11,792   46%
  Slice registers  10,105   19%
  4-input LUTs     15,933   31%
  Bonded IOBs          68    8%
  Block RAMs           90   48%
  DLLs                  1   12%

[Figure 13: The placed and routed decoder implementation, showing the 6 × 6 array of PE blocks PE1,1 through PE6,6.]

From the architecture description in Section 4, we know that, during each clock cycle of the iterative decoding, this decoder needs to perform both a read and a write operation on each single-port RAM block EXT RAM i. Therefore, if the primary clock frequency is W, we must generate a 2 × W clock signal as the RAM control signal to achieve the read-and-write operation in one clock cycle. This 2 × W clock signal is generated using the delay-locked loop (DLL) in the XCV2600E.

To facilitate the entire implementation process, we extensively utilized highly optimized Xilinx IP cores to instantiate many function blocks, that is, all the RAM blocks, all the counters for generating addresses, and the ROMs used to store the control signals for the shuffle network π3. Moreover, all the adders in the CNUs and VNUs are implemented as ripple-carry adders, which are well suited to Xilinx FPGA implementations thanks to the on-chip dedicated fast arithmetic carry chain.

This decoder was described in VHDL, and Synopsys FPGA Express was used to synthesize the VHDL implementation. We used the Xilinx Development System tool suite to place and route the synthesized implementation for the target XCV2600E device with speed grade −7. Table 1 shows the hardware resource utilization statistics. Notice that 74% of the total utilized slices, or 8691 slices, were used for implementing all the CNUs and VNUs. Figure 13 shows the placed and routed design, in which the placement of all the PE blocks is constrained based on the on-chip RAM block locations.

Based on the results reported by the Xilinx static timing analysis tool, the maximum decoder clock frequency is 56 MHz. If this decoder performs s decoding iterations for each code frame, the total number of clock cycles for decoding one frame is 2s · L + L, where the extra L clock cycles are due to the initialization process, and the maximum symbol decoding throughput is 56 · k^2 · L/(2s · L + L) = 56 · 36/(2s + 1) Mbps. Here, we set s = 18 and obtain a maximum symbol decoding throughput of 54 Mbps. Figure 14 shows the corresponding performance over the AWGN channel with s = 18, including the BER, FER (frame error rate), and the average number of iterations.
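The throughput figure follows directly from the cycle count; a one-line check (function name mine):

```python
def symbol_throughput_mbps(f_mhz, k, L, s):
    """f * k^2 * L / (2sL + L) Mbps: 2L cycles per iteration over s
    iterations, plus L cycles of initialization, to decode k^2 * L
    symbols at f MHz."""
    return f_mhz * k * k * L / (2 * s * L + L)
```

With f = 56 MHz, k = 6, L = 256, and s = 18 this evaluates to 56 · 36/37 ≈ 54.5 Mbps, consistent with the 54 Mbps quoted above.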

6. CONCLUSION

Due to the unique characteristics of LDPC codes, we believe that jointly conceiving the code construction and the partly parallel decoder design is key to practical high-speed LDPC coding system implementations. In this paper, applying a joint design methodology, we developed a high-speed (3, k)-regular LDPC code partly parallel decoder architecture and implemented a 9216-bit, rate-1/2 (3, 6)-regular LDPC code decoder on the Xilinx XCV2600E FPGA device. The detailed decoder architecture and floor planning scheme have been presented, and a concatenated configurable random shuffle network implementation has been proposed to minimize the routing overhead of the random-like shuffle network realization. With a maximum of 18 decoding iterations, this decoder achieves up to 54 Mbps symbol decoding throughput and BER 10^−6 at 2 dB over the AWGN channel. Moreover, exploiting the good minimum distance property of LDPC codes, this decoder uses the parity check after each iteration as an early stopping criterion to effectively reduce the average energy consumption.


[Figure 14: Simulation results on BER, FER, and the average number of iterations versus Eb/N0 (dB).]

REFERENCES

[1] R. G. Gallager, "Low-density parity-check codes," IRE Transactions on Information Theory, vol. IT-8, no. 1, pp. 21–28, 1962.

[2] R. G. Gallager, Low-Density Parity-Check Codes, MIT Press, Cambridge, Mass, USA, 1963.

[3] D. J. C. MacKay, "Good error-correcting codes based on very sparse matrices," IEEE Transactions on Information Theory, vol. 45, no. 2, pp. 399–431, 1999.

[4] M. C. Davey and D. J. C. MacKay, "Low-density parity check codes over GF(q)," IEEE Communications Letters, vol. 2, no. 6, pp. 165–167, 1998.

[5] M. Luby, M. Mitzenmacher, M. Shokrollahi, and D. Spielman, "Improved low-density parity-check codes using irregular graphs and belief propagation," in Proc. IEEE International Symposium on Information Theory, p. 117, Cambridge, Mass, USA, August 1998.

[6] T. Richardson and R. Urbanke, "The capacity of low-density parity-check codes under message-passing decoding," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 599–618, 2001.

[7] T. Richardson, M. Shokrollahi, and R. Urbanke, "Design of capacity-approaching irregular low-density parity-check codes," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 619–637, 2001.

[8] S.-Y. Chung, T. Richardson, and R. Urbanke, "Analysis of sum-product decoding of low-density parity-check codes using a Gaussian approximation," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 657–670, 2001.

[9] M. Luby, M. Mitzenmacher, M. Shokrollahi, and D. A. Spielman, "Improved low-density parity-check codes using irregular graphs," IEEE Transactions on Information Theory, vol. 47, no. 2, pp. 585–598, 2001.

[10] S.-Y. Chung, G. D. Forney, T. Richardson, and R. Urbanke, "On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit," IEEE Communications Letters, vol. 5, no. 2, pp. 58–60, 2001.

[11] G. Miller and D. Burshtein, "Bounds on the maximum-likelihood decoding error probability of low-density parity-check codes," IEEE Transactions on Information Theory, vol. 47, no. 7, pp. 2696–2710, 2001.

[12] A. J. Blanksby and C. J. Howland, "A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder," IEEE Journal of Solid-State Circuits, vol. 37, no. 3, pp. 404–412, 2002.

[13] E. Boutillon, J. Castura, and F. R. Kschischang, "Decoder-first code design," in Proc. 2nd International Symposium on Turbo Codes and Related Topics, pp. 459–462, Brest, France, September 2000.

[14] T. Zhang and K. K. Parhi, "VLSI implementation-oriented (3, k)-regular low-density parity-check codes," in IEEE Workshop on Signal Processing Systems (SiPS), pp. 25–36, Antwerp, Belgium, September 2001.

[15] M. Chiani, A. Conti, and A. Ventura, "Evaluation of low-density parity-check codes over block fading channels," in Proc. IEEE International Conference on Communications, pp. 1183–1187, New Orleans, La, USA, June 2000.

[16] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, New York, NY, USA, 1999.

Tong Zhang received his B.S. and M.S. degrees in electrical engineering from Xi'an Jiaotong University, Xi'an, China, in 1995 and 1998, respectively. He received the Ph.D. degree in electrical engineering from the University of Minnesota in 2002. Currently, he is an Assistant Professor in the Electrical, Computer, and Systems Engineering Department at Rensselaer Polytechnic Institute. His current research interests include the design of VLSI architectures and circuits for digital signal processing and communication systems, with emphasis on error-correcting coding and multimedia processing.


Keshab K. Parhi is a Distinguished McKnight University Professor in the Department of Electrical and Computer Engineering at the University of Minnesota, Minneapolis. He was a Visiting Professor at Delft University and Lund University, a Visiting Researcher at NEC Corporation, Japan (as a National Science Foundation Japan Fellow), and a Technical Director, DSP Systems, at Broadcom Corp. Dr. Parhi's research interests have spanned the areas of VLSI architectures for digital signal and image processing, adaptive digital filters and equalizers, error control coders, cryptography architectures, high-level architecture transformations and synthesis, low-power digital systems, and computer arithmetic. He has published over 350 papers in these areas, authored the widely used textbook VLSI Digital Signal Processing Systems (Wiley, 1999), and coedited the reference book Digital Signal Processing for Multimedia Systems (Wiley, 1999). He has received numerous best paper awards, including the 2001 IEEE W. R. G. Baker Prize Paper Award. He is a Fellow of IEEE and the recipient of a Golden Jubilee medal from the IEEE Circuits and Systems Society in 1999. He is the recipient of the 2003 IEEE Kiyo Tomiyasu Technical Field Award.
