
A Massively Parallel Implementation of QC-LDPC Decoder on GPU

Guohui Wang, Michael Wu, Yang Sun, and Joseph R. Cavallaro
Department of Electrical and Computer Engineering, Rice University, Houston, Texas 77005

{wgh, mwb2, ysun, cavallaro}@rice.edu

Abstract—The graphics processor unit (GPU) is able to provide a low-cost and flexible software-based multi-core architecture for high performance computing. However, it is still very challenging to efficiently map real-world applications to the GPU and fully utilize its computational power. As a case study, we present a GPU-based implementation of a real-world digital signal processing (DSP) application: a low-density parity-check (LDPC) decoder. The paper shows the efforts we made to map the algorithm onto the massively parallel architecture of the GPU and to fully utilize the GPU's computational resources to significantly boost performance. Moreover, several algorithmic optimizations and efficient data structures are proposed to reduce the memory access latency and the memory bandwidth requirement. Experimental results show that the proposed GPU-based LDPC decoding accelerator can take advantage of the multi-core computational power provided by the GPU and achieve throughput up to 100.3 Mbps.

Keywords—GPU, parallel computing, CUDA, LDPC decoder, accelerator

I. INTRODUCTION

A graphics processing unit (GPU) provides a parallel architecture which combines raw computation power with programmability. The GPU provides extremely high computational throughput by employing many cores working on a large set of data in parallel. In the field of wireless communication, although the power and strict latency requirements of real communication systems continue to be the main challenges for a practical real-time GPU-based platform, GPU-based accelerators remain attractive due to their flexibility and scalability, especially in the realm of simulation acceleration and software-defined radio (SDR) test-beds. Recently, GPU-based implementations of several key components of communication systems have been studied. For instance, a soft information multiple-input multiple-output (MIMO) detector implemented on GPU achieves very high throughput [1]. In [2], a parallel turbo decoding accelerator implemented on GPU is studied for wireless channels.

The low-density parity-check (LDPC) decoder [3] is another key communication component, and GPU implementations of LDPC decoders have drawn much attention recently due to their high computational complexity and inherently massively parallel nature. LDPC codes are a class of powerful error correcting codes that can achieve near-capacity error correcting performance. This class of codes is widely used in many wireless standards, such as WiMAX (IEEE 802.16e) and WiFi (IEEE 802.11n), and in high speed magnetic storage devices.

Their flexibility and scalability make GPUs a good simulation platform for studying the characteristics of different LDPC codes or for developing new ones. However, it is still very challenging to efficiently map the LDPC algorithm to the GPU's massively parallel architecture and achieve very high performance. Recently, parallel implementations of high throughput LDPC decoders were studied in [4]. In [5], the researchers optimize the memory access and develop parallel decoding software for cyclic and quasi-cyclic LDPC (QC-LDPC) codes. However, there is still great potential to achieve higher performance by developing a better algorithm mapping matched to the GPU's architecture.

This work focuses on techniques to fully utilize the GPU's computational resources to implement a computation-intensive digital signal processing (DSP) algorithm. As a case study, a highly optimized and massively parallel LDPC decoder implementation on GPU is presented. Several efforts have been made in the areas of algorithmic optimizations, memory access optimizations and efficient data structures, which enable our implementation to achieve much higher performance than prior work.

This paper is organized as follows. In Section II, the CUDA platform for GPU computing is introduced. Section III gives an overview of the LDPC decoding algorithm. Different aspects of the GPU implementation of the LDPC decoder are discussed in Section IV. Section V focuses on memory access optimization techniques. Section VI provides the experimental results for performance and throughput of the proposed implementation. Finally, Section VII concludes this paper.

II. CUDA ARCHITECTURE

Compute Unified Device Architecture (CUDA) [6] is widely used to program massively parallel computing applications. The NVIDIA Fermi GPU architecture consists of multiple streaming multiprocessors (SMs). Each SM consists of 32 pipelined cores and two instruction dispatch units. During execution, each dispatch unit can issue a 32-wide single instruction multiple data (SIMD) instruction, which is executed on a group of 16 cores. Although CUDA makes it possible to unleash the GPU's computational power, several restrictions prevent programmers from achieving peak performance. The programmer should pay attention to the following aspects to achieve near-peak performance.

In the CUDA model, a frequently executed task can be mapped onto a kernel and executed in parallel on many threads in the GPU. CUDA divides the threads within a thread block into groups of 32 threads, which are executed together using the same common instruction (a warp). As instructions are issued in order, a stall can occur when an instruction fetches data from device memory or when there is a data dependency. Stalls can be minimized by coalescing memory accesses, by using fast on-die memory resources, and through hardware support for fast thread switching.

A CUDA device has a large amount (>1 GB) of off-chip device memory with high latency (400∼800 cycles) [6]. To improve efficiency, device memory accesses can be coalesced. For CUDA devices, if all threads access 1-, 2-, 4-, 8-, or 16-byte words and the threads within half of a warp access sequential words in device memory, the memory requests of the half-warp can be coalesced into a single memory request. Coalescing memory accesses improves performance by reducing the number of device memory requests.
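To make the coalescing rule concrete, here is a minimal sketch (not from the paper; kernel names are illustrative) contrasting a coalesced access pattern with a strided one:

// Coalesced: consecutive threads read consecutive 4-byte words, so the
// requests of a half-warp fall into one aligned segment and become a
// single memory transaction.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: consecutive threads touch addresses `stride` words apart, so a
// half-warp can generate up to 16 separate memory transactions.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}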

Fast on-chip resources such as registers, shared memory, and constant memory can reduce the memory access time. An access to shared memory usually takes one load or store operation. However, random accesses to shared memory with bank conflicts should be avoided, since they are serialized and cause performance degradation. Constant memory, on the other hand, is cached even though it resides in device memory; a read therefore takes one cached access if all threads access the same location. Other constant memory access patterns are serialized and require device memory accesses, and should therefore be avoided.

Furthermore, to hide processor stalls and achieve peak performance, one should map sufficient concurrent threads onto an SM. However, both shared memory and registers are shared among the concurrent threads on an SM. The limited amounts of shared memory and registers require the programmer to partition the workload effectively, such that a sufficient number of blocks can be mapped onto the same processor to hide stalls.

III. INTRODUCTION TO LDPC DECODING ALGORITHM

A. LDPC Codes

Low-density parity-check (LDPC) codes are a class of widely used error correcting codes. A binary LDPC code is defined by the equation H · xᵀ = 0, in which x is a codeword and H is an M × N sparse parity check matrix. The number of 1's in a row of H is called the row weight, denoted by ωr. If ωr is the same for all rows, the code is regular; otherwise, the code is irregular. Regular codes are easier to implement but have poorer error correcting performance than irregular codes.

Quasi-Cyclic LDPC (QC-LDPC) codes [7] are a special class of LDPC codes with a structured H matrix, which can be generated by the expansion of a base matrix into Z × Z sub-matrices. As an example, Fig. 1 shows the parity check matrix for the (1944, 972) 802.11n LDPC code with sub-matrix size Z = 81. In this matrix representation, each square box with a label Ix represents an 81 × 81 circularly right-shifted identity matrix with a shift value of x, and each empty box represents an 81 × 81 zero matrix.


Fig. 1. Parity check matrix H for the block length 1944 bits, code rate 1/2, IEEE 802.11n (1944, 972) LDPC code. H consists of Msub × Nsub sub-matrices (Msub = 12, Nsub = 24 in this example).

For a QC-LDPC code, H can be viewed as an array of Msub × Nsub sub-matrices, where Msub and Nsub are the number of sub-matrices in a column and in a row, respectively. A row of Nsub sub-matrices is also known as a layer.

Because of its well-organized structure, QC-LDPC is very efficient for hardware implementation. In addition, irregular codes have better error correcting performance than regular codes. Therefore, irregular QC-LDPC codes have been adopted by many communication systems such as 802.11n WiFi and 802.16e WiMAX. In this paper, we take the irregular QC-LDPC codes from the 802.11n standard as a case study; however, the proposed accelerator can be used for other cyclic or quasi-cyclic LDPC codes.
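As a concrete illustration of the expansion (a sketch under the conventions above, not code from the paper), the sub-matrix for shift value x has a 1 in row r at column (r + x) mod Z:

// Sketch: expand one base-matrix entry into its Z x Z sub-matrix.
// shift >= 0 gives a circularly right-shifted identity; a negative shift
// marks an empty (all-zero) sub-matrix. Names and conventions are illustrative.
void expand_submatrix(int shift, int Z, unsigned char *sub /* Z*Z, row-major */)
{
    for (int r = 0; r < Z; r++)
        for (int c = 0; c < Z; c++)
            sub[r * Z + c] = (shift >= 0 && c == (r + shift) % Z) ? 1 : 0;
}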

B. Sum-product Algorithm for LDPC Decoder

The sum-product algorithm (SPA) is based on iterative message passing between check nodes (CNs) and variable nodes (VNs) and provides a powerful method to decode LDPC codes [3]. The SPA is usually performed in the log domain. The SPA has a computational complexity of O(N³), in which N is normally very large, resulting in very intensive computation.

Let cn denote the n-th bit of a codeword, and let xn denote the n-th bit of the decoded codeword. The a posteriori probability (APP) log-likelihood ratio (LLR) is the soft information of cn and is defined as Ln = log(Pr(cn = 0)/Pr(cn = 1)).

1) Initialization: Ln is initialized to the input channel LLR. The VN-to-CN (VTC) messages Qmn and the CN-to-VN (CTV) messages Rmn are initialized to 0.

2) Begin the iterative decoding process: For each VN n, calculate Qmn by

Qmn = Ln + ∑_{m′ ∈ Mn\m} Rm′n,   (1)

where Mn \ m denotes the set of all the CNs connected with VN n except CN m. Then, for each CN m, compute the new CTV message R′mn and Δmn by

R′mn = Qmn1 ⊞ Qmn2 ⊞ · · · ⊞ Qmnk,   (2)

Δmn = R′mn − Rmn,   (3)

where n1, n2, · · · , nk ∈ {Nm \ n} and Nm \ n denotes the set of all the VNs connected with CN m except VN n. R′mn and Δmn are saved for the APP update and for the next iteration. The ⊞ operation is defined as:

x ⊞ y = sign(x) · sign(y) · min(|x|, |y|) + S(x, y),   (4)

S(x, y) = log(1 + e^{−|x+y|}) − log(1 + e^{−|x−y|}).   (5)

3) Update the APP values and make hard decisions:

L′n = Ln + ∑_m Δmn.   (6)

The decoder makes a hard decision to obtain the decoded bit xn by quantizing the APP value L′n into 1 and 0: if L′n < 0 then xn = 1, otherwise xn = 0. The decoding process terminates when the codeword x satisfies H · xᵀ = 0 or the preset maximum number of iterations is reached; otherwise, go back to step 2 and start a new decoding iteration.

C. Scaled Min-Sum Algorithm

The min-sum algorithm (MSA) reduces the decoding complexity of the SPA with minor performance loss [8][9]. The MSA approximates the log-SPA by removing the correction term S(x, y) from (4). The Rmn calculation in the scaled MSA can be expressed as:

R′mn = α · ∏_{n′ ∈ Nm\n} sign(Qmn′) · min_{n′ ∈ Nm\n} |Qmn′|,   (7)

where α is a scaling factor that compensates for the performance loss of the min-sum approximation (α = 0.75 is used in this paper) [9].
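For illustration, the CTV update (7) for one check node can be computed with the common two-minima trick: one pass finds the two smallest input magnitudes and the overall sign product, after which each edge's "minimum excluding itself" is available directly. This is a plain C sketch of the arithmetic, not the paper's kernel (which uses the two-pass recursion of Section IV-B):

#include <math.h>

// Sketch of the scaled min-sum CTV update of eq. (7) for one check node.
// Q[] holds the wr incoming VTC messages of this row; Rnew[] receives the
// updated CTV messages. alpha = 0.75 as in the paper. Illustrative code.
void check_node_update_msa(const float *Q, float *Rnew, int wr, float alpha)
{
    float min1 = INFINITY, min2 = INFINITY;  // two smallest magnitudes
    int   idx1 = -1, sign_prod = 1;          // index of min1, sign over all edges
    for (int j = 0; j < wr; j++) {
        float mag = fabsf(Q[j]);
        if (Q[j] < 0.0f) sign_prod = -sign_prod;
        if (mag < min1) { min2 = min1; min1 = mag; idx1 = j; }
        else if (mag < min2) { min2 = mag; }
    }
    for (int j = 0; j < wr; j++) {
        float mag  = (j == idx1) ? min2 : min1;               // min over the other edges
        int   sign = (Q[j] < 0.0f) ? -sign_prod : sign_prod;  // sign product excluding edge j
        Rnew[j] = alpha * (float)sign * mag;
    }
}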

IV. MAPPING LDPC DECODING ALGORITHM ONTO GPU

In this work, we map both the MSA and the log-SPA onto the GPU architecture. In this section, we present modified decoding algorithms that reduce the decoding latency and the device memory bandwidth requirement. The implementations of the LDPC decoding kernels are also described.

A. Loosely Coupled LDPC Decoding Algorithm

The log-SPA or min-sum algorithm can be simplified by reviewing equations (1) to (6). Instead of saving all the Q values and R values in the device memory, only the R values are stored [10]. To compute the new CTV message R′, we first recover the Q value from R and the APP value of the previous calculation. Based on this idea, equation (1) becomes:

Qmn = Ln − Rmn.   (8)

Due to the limited size of the on-chip shared memory, the Rmn values can only fit in the device memory. The loosely coupled LDPC decoding algorithm significantly reduces the device memory bandwidth by reducing the amount of device memory used. For example, with an M × N parity check matrix H with row weight ωr (the number of 1's in a row of H), we save up to Mωr memory slots. Also, the number of accesses to the device memory is reduced by at least 2Mωr, since each Qmn is read and written at least once per iteration.

Fig. 2. Code structure of the GPU implementation of the LDPC decoder using two CUDA kernels. Serial code on the host (CPU) performs initialization and transfers the data from host to device. The device (GPU) then runs the iterative decoding loop, alternating between CUDA Kernel 1 (horizontal processing) and CUDA Kernel 2 (updating APP values and making hard decisions), with a synchronization between the kernels. Finally, the data is transferred from device to host and decoding finishes.

B. Two-Pass Recursion CTV Message Update

Based on (2), we need to traverse one row of the H matrix multiple times to update all the CTV messages Rmn. Each traversal pass requires (ωr − 2) compare-select operations to find the minimum among (ωr − 1) values. Moreover, (ωr − 1) XOR operations are needed to compute the sign of Rmn. In addition, the traversal introduces branch instructions, which cause an unbalanced workload and reduce the throughput.

The traversal process can be optimized by using a forward and backward two-pass recursion scheme [11]. In the forward recursion, intermediate values are calculated and stored. During the backward recursion, the CTV message for each node is computed from the intermediate values of the forward recursion. A rough calculation shows that the number of operations per LDPC decoding iteration is reduced from Mωr(ωr − 2) to M(3ωr − 2). For the 802.11n (1944, 972) code, this two-pass recursion scheme saves about 50% of the operations, which significantly reduces the decoding latency. The number of memory accesses and the number of required registers are also greatly reduced.
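A sketch of the scheme for one row is given below, with boxplus() implementing (4)-(5): the forward pass accumulates prefixes F, the backward pass accumulates suffixes B, and edge j combines F[j−1] with B[j+1], i.e., the ⊞ of all inputs except Q[j]. This is an illustration of the technique in [11], not the exact GPU code:

#include <math.h>

#define WR_MAX 8  /* illustrative bound on the row weight */

// x (boxplus) y = sign(x)sign(y)min(|x|,|y|) + S(x,y), per eqs. (4)-(5).
static float boxplus(float x, float y)
{
    float s   = logf(1.0f + expf(-fabsf(x + y))) - logf(1.0f + expf(-fabsf(x - y)));
    float sgn = ((x < 0.0f) != (y < 0.0f)) ? -1.0f : 1.0f;
    return sgn * fminf(fabsf(x), fabsf(y)) + s;
}

// Two-pass CTV update for one row of weight wr (wr >= 2 assumed).
void check_node_update_two_pass(const float *Q, float *Rnew, int wr)
{
    float F[WR_MAX], B[WR_MAX];
    F[0] = Q[0];
    for (int j = 1; j < wr; j++)       // forward recursion: prefixes
        F[j] = boxplus(F[j - 1], Q[j]);
    B[wr - 1] = Q[wr - 1];
    for (int j = wr - 2; j >= 0; j--)  // backward recursion: suffixes
        B[j] = boxplus(B[j + 1], Q[j]);
    Rnew[0]      = B[1];               // all inputs except Q[0]
    Rnew[wr - 1] = F[wr - 2];          // all inputs except Q[wr-1]
    for (int j = 1; j < wr - 1; j++)
        Rnew[j] = boxplus(F[j - 1], B[j + 1]);
}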

C. Implementation of the LDPC Decoder Kernel on GPU

According to equations (2), (3), (6) and (8), the decoding process can be split into two stages: the horizontal processing stage and the APP update stage. We therefore create one computational kernel for each stage. The relationship between the host (CPU) code and the device (GPU) kernels is shown in Fig. 2.

1) CUDA Kernel 1: Horizontal Processing: During the horizontal processing stage, all the CTV messages are calculated independently, so we can use many parallel threads to process them. For an M × N H matrix, M threads are spawned, and each thread processes one row. Since all non-zero entries in a sub-matrix of H have the same shift value (one square box in Fig. 1), threads processing the same layer perform almost exactly the same operations when calculating the CTV messages. In addition, there is no data dependency among different layers. Therefore, each layer is processed by one thread block: Msub thread blocks are used, and each consists of Z threads. CUDA Kernel 1 is described in Algorithm 1. Taking the 802.11n (1944, 972) LDPC code as an example, 12 thread blocks are generated, each containing 81 threads, so a total of 972 threads calculate the CTV messages.

Fig. 3. Multi-codeword parallel decoding scheme. The 802.11n (1944, 972) code is assumed; the H matrix has 12 layers and Z = 81. NCW denotes the number of codewords in one macro-codeword (MCW) and NMCW the number of MCWs. Each thread block of Z × NCW threads processes one layer of one MCW, for a total of NMCW × 12 thread blocks and NMCW × 12 × Z × NCW threads.

Algorithm 1 CUDA Kernel 1: Horizontal processing
1: iLayer = blockIdx.x; /* the index of the layer in H */
2: iSubRow = threadIdx.x; /* the row index within the layer */
3: Calculate the new CTV messages using the two-pass recursion;
4: Write the CTV messages (Rmn, Δmn) back into device memory;
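The listing below is a structural CUDA sketch of Kernel 1 under stated assumptions: col[] stores, column-wise (the layout of Section V-B), the column index of the j-th non-zero entry of each row, derived from the compressed Hkernel1, and check_node_update_two_pass() is a __device__ version of the routine sketched in Section IV-B. It is an illustration, not the authors' exact kernel:

#define WR_MAX 8  /* illustrative bound on the row weight */

__global__ void ldpc_kernel1(const float *L,       // APP values, length N
                             const int *col,       // column of the j-th 1 per row, laid out j*M + row
                             float *R, float *dR,  // Rmn and delta_mn, same column-wise layout
                             int Z, int wr, int M)
{
    int iLayer  = blockIdx.x;            // the index of the layer in H
    int iSubRow = threadIdx.x;           // the row index within the layer
    int row     = iLayer * Z + iSubRow;  // global row index

    float Q[WR_MAX], Rnew[WR_MAX];
    for (int j = 0; j < wr; j++)
        Q[j] = L[col[j * M + row]] - R[j * M + row];  // loosely coupled recovery, eq. (8)

    check_node_update_two_pass(Q, Rnew, wr);          // two-pass recursion, Sec. IV-B

    for (int j = 0; j < wr; j++) {
        dR[j * M + row] = Rnew[j] - R[j * M + row];   // delta_mn, eq. (3)
        R[j * M + row]  = Rnew[j];                    // column-wise writes coalesce (Sec. V-B)
    }
}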

2) CUDA Kernel 2: APP Value Update: During the APP update stage, there are N APP values to be updated. The APP update is independent among different variable nodes, so Nsub thread blocks are used to update the APP values, with Z threads in each thread block. For the 802.11n (1944, 972) LDPC code, 1944 threads grouped into 24 thread blocks work concurrently in the APP update stage. CUDA Kernel 2 is described in Algorithm 2.

Algorithm 2 CUDA Kernel 2: Update APP value
1: iBlkCol = blockIdx.x; /* the column index of a sub-matrix in H */
2: iSubCol = threadIdx.x; /* the column index within the sub-matrix */
3: Calculate the device memory address of the APP value;
4: Read the old APP value;
5: for all sub-matrices in the iBlkCol-th block column of H do
6:   Read the corresponding Δmn value;
7:   Update the APP value using (6);
8: end for
9: Write the updated APP value into device memory;
10: Make a hard decision for the current bit.
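A matching structural sketch of Kernel 2 is given below, assuming for simplicity a fixed column weight cw and a hypothetical table edge[] mapping bit n and its k-th check to the flat index of the corresponding Δmn value (an irregular code would additionally consult the valid flag of Hkernel2):

__global__ void ldpc_kernel2(float *L,             // APP values, length N
                             const float *dR,      // delta_mn values from Kernel 1
                             const int *edge,      // dR index for bit n's k-th check, laid out k*N + n
                             int Z, int cw, int N,
                             unsigned char *hard)  // hard decisions
{
    int iBlkCol = blockIdx.x;       // the column index of a sub-matrix in H
    int iSubCol = threadIdx.x;      // the column index within the sub-matrix
    int n = iBlkCol * Z + iSubCol;  // global bit index

    float app = L[n];               // read the old APP value
    for (int k = 0; k < cw; k++)    // all sub-matrices in this block column
        app += dR[edge[k * N + n]]; // accumulate delta_mn, eq. (6)

    L[n]    = app;                  // write the updated APP value
    hard[n] = (app < 0.0f) ? 1 : 0; // make a hard decision for this bit
}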

D. Multi-codeword Parallel Decoding

Since the numbers of threads and thread blocks are limited by the dimensions of the H matrix, it is hard to keep all the cores fully occupied by decoding a single codeword. Multi-codeword decoding is needed to further increase the parallelism of the workload. A two-level multi-codeword scheme is designed: NCW codewords are first packed into one macro-codeword (MCW); each MCW is decoded by a thread block, and NMCW MCWs are decoded by a group of thread blocks. The multi-codeword parallel decoding scheme is illustrated in Fig. 3.

Since the multiple codewords in one MCW are decoded by threads within the same thread block, all the threads follow the same execution path during the decoding process. The workload is therefore well balanced across the codewords in one MCW, which helps increase the throughput. Another benefit is that the latency of read-after-write dependencies and memory bank conflicts can be completely hidden by a sufficient number of active threads. To implement this multi-codeword parallel decoding scheme, NMCW × NCW codewords are written into the device memory in a specified order before the kernel launch.
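Under this scheme, the index arithmetic inside a kernel might look as follows for the 802.11n code (12 layers, Z = 81); the names and the exact layout are illustrative assumptions:

__global__ void ldpc_kernel1_multi(/* data arguments as before */ int Z, int NCW)
{
    int iMCW    = blockIdx.x / 12;   // which macro-codeword
    int iLayer  = blockIdx.x % 12;   // which layer of H
    int iCW     = threadIdx.x / Z;   // which codeword within the MCW
    int iSubRow = threadIdx.x % Z;   // which row of the layer
    int cw_id   = iMCW * NCW + iCW;  // global codeword index

    // ... proceed as in Kernel 1, offsetting all per-codeword arrays by cw_id ...
}

// Launch configuration: a grid of NMCW x 12 blocks, each with Z * NCW threads:
// dim3 grid(NMCW * 12); dim3 block(Z * NCW);
// ldpc_kernel1_multi<<<grid, block>>>(/* ... */, Z, NCW);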

E. Implementation of Early Termination Scheme

The early termination (ET) algorithm avoids unnecessary computations when the decoder has already converged to the correct codeword. For software implementation and simulation, the ET algorithm can significantly boost the throughput, especially in the high SNR (signal-to-noise ratio) regime.

The parity check equations H · xᵀ = 0 can be used to verify the correctness of the decoded codeword. Since the parity check equations are inherently parallel, a new CUDA kernel with many threads is launched to perform the ET check. We create M threads, and each thread calculates one parity check equation independently. Since the decoded codeword x, the compact H matrix and the parity check results are used by all the threads, on-chip shared memory is used to enable high speed memory access. After the concurrent threads finish computing the parity check equations, we reuse these threads to perform a reduction operation on all the parity check results to generate the final ET check result, which indicates the correctness of the current codeword.

This parallel ET algorithm is straightforward for a single codeword. For multi-codeword parallel decoding, however, ET increases the throughput only if all the codewords meet the ET condition at the same time, which rarely happens. To overcome this problem, we propose a tag-based ET algorithm for multi-codeword parallel decoding. All the NMCW × NCW codewords are checked by multiple thread blocks. We assign one tag per codeword and mark the tag once the corresponding parity check equations are satisfied; the decoding process for that particular codeword is then skipped in the following iterations. Once the tags of all the codewords are marked, the iterative decoding process terminates. The experimental results in Section VI-B show that the proposed parallel ET algorithm significantly increases the throughput.
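A simplified sketch of the tag-based check follows. The paper reduces the per-row results in shared memory; this variant instead clears a per-codeword tag with an atomic operation, and reuses the column table layout from the Kernel 1 sketch. Names and layouts are illustrative:

// Host sets tag[c] = 1 for every codeword before the launch; a codeword whose
// tag is still 1 afterwards satisfies all parity checks and is skipped in the
// following iterations.
__global__ void et_check(const unsigned char *hard,  // decoded bits, codeword c at offset c*N
                         const int *col,             // column of the j-th 1 per row, laid out j*M + row
                         int wr, int M, int N,
                         int *tag)
{
    int c   = blockIdx.y;                             // codeword index
    int row = blockIdx.x * blockDim.x + threadIdx.x;  // parity check (row) index
    if (row >= M) return;

    unsigned parity = 0;
    for (int j = 0; j < wr; j++)     // XOR the bits on this row of H
        parity ^= hard[c * N + col[j * M + row]];

    if (parity)                      // one failed check invalidates codeword c
        atomicAnd(&tag[c], 0);
}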

V. OPTIMIZING MEMORY ACCESS ON GPU

In this section, several memory access optimization techniques that further increase the throughput are explained.

A. Memory Optimization for the H Matrix

The constant memory on the GPU device can be utilized to improve the throughput. As mentioned in Section II, reading from constant memory is as fast as reading from a register as long as all the threads within a half-warp read the same address. Since all Z threads in one thread block access the same entry of the H matrix simultaneously, we can store the H matrix in constant memory and take advantage of its broadcast mode. Simulation shows that using constant memory increases the throughput by about 8%.

The quasi-cyclic structure of the QC-LDPC code allows us to store the sparse H matrix efficiently. Two compact representations of H (Hkernel1 and Hkernel2) are designed. We regard the cyclic H matrix in Fig. 1 as a 12 × 24 matrix H̄ whose entries are the shift values of the sub-matrices of H. Horizontally compressing H̄ to the left gives Hkernel1; similarly, vertically compressing H̄ to the top gives Hkernel2. The structures of Hkernel1 and Hkernel2 are shown in Fig. 4.

The proposed compact representation of H has two advantages. First, since the compressed format reduces memory usage, the time spent reading the H matrix from device memory (with its long latency) is reduced. Second, branch instructions cause throughput degradation; the compressed matrix records the positions of all the non-empty entries of H, so during the two-pass recursion there is no need to check whether an entry of H is empty, which avoids branch instructions. Taking the 802.11n (1944, 972) H matrix as an example, 40% of the memory accesses and branch instructions are eliminated by using the compressed Hkernel1 and Hkernel2.
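Placing the two compressed matrices in constant memory could then look like the sketch below; the sizes follow Fig. 4 for the rate-1/2 802.11n code, and the structure is the h_element of Fig. 4 (field comments are our reading of the figure):

typedef unsigned char byte;

struct h_element {
    byte x;            // row index within the base matrix
    byte y;            // column index within the base matrix
    byte shift_value;  // cyclic shift of the sub-matrix
    byte valid;        // 0 if the entry is empty
};                     // packs into one 32-bit word

__constant__ struct h_element h_kernel1[12 * 8];   // horizontally compressed H
__constant__ struct h_element h_kernel2[11 * 24];  // vertically compressed H

// Host side, once at start-up:
// cudaMemcpyToSymbol(h_kernel1, host_h1, sizeof(h_kernel1));
// cudaMemcpyToSymbol(h_kernel2, host_h2, sizeof(h_kernel2));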

B. Coalescing Device Memory Access

Since a device memory access takes several hundred cycles to complete, optimizing the device memory access time helps increase the throughput.

Fig. 4. The compact representation of the H matrix (the same H as in Fig. 1). Horizontal compression gives the 12 × 8 matrix Hkernel1; vertical compression gives the 11 × 24 matrix Hkernel2. Each entry of a compressed matrix is a customized structure (h_element) containing four 8-bit fields: the row and column index of the element in the original H matrix, the shift value, and a valid flag that shows whether the entry is empty. Each entry can therefore be expressed as one 32-bit value occupying a single memory slot.

In CUDA kernel 1, the Rmn and Δmn values are read from device memory during the two-pass recursion, and once the computations are done they are written back to device memory. Since there is only one Rmn value and one Δmn value per row in each sub-matrix of H, the compressed format can also be used to store Rmn and Δmn: two M × ωr matrices hold them, cutting the memory used for Rmn and Δmn by more than half. More importantly, the GPU supports very efficient coalesced access if all the threads in a warp access memory locations with contiguous addresses. By writing the compressed Rmn and Δmn matrices column-wise into device memory, all memory operations on Rmn and Δmn are coalesced. Simulation shows that coalescing the device memory accesses for Rmn and Δmn yields a 20% throughput improvement.
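Concretely, with the compressed M × ωr matrix stored column-wise, edge j of row `row` lives at flat index j·M + row, so the consecutive rows handled by the threads of a half-warp map to consecutive addresses. An illustrative indexing helper:

// Column-wise layout of the compressed M x wr matrices for Rmn and delta_mn:
// threads with consecutive `row` values access consecutive words, so every
// read and write of Rmn / delta_mn coalesces.
static __device__ inline int r_index(int j, int row, int M)
{
    return j * M + row;  // column-wise (edge-major) flat index
}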

VI. EXPERIMENTAL RESULTS

The experimental setup to evaluate the performance of the proposed architecture consists of an NVIDIA GTX 470 GPU with 448 stream processors running at 1.215 GHz, with 1280 MB of GDDR5 device memory.

The implementation goals are flexibility and scalability. Key parameters such as the codeword length and the code rate can easily be reconfigured. In addition, this LDPC decoder supports both the log-SPA and the min-sum algorithm. The implementation of the log-SPA employs the GPU's intrinsic functions logf() and expf(), which are mapped directly onto GPU hardware and are very fast [6]. Furthermore, this LDPC decoding architecture supports different cyclic or quasi-cyclic LDPC codes, including the standard-compliant 802.11n WiFi and 802.16e WiMAX LDPC codes.

A. Decoder Error Probability Performance

Fig. 5 shows the block error rate (BLER) performance for the IEEE 802.11n (1944, 972) LDPC code versus the signal-to-noise ratio (SNR), represented by EbN0. Binary phase shift keying (BPSK) modulation and an additive white Gaussian noise (AWGN) channel model are employed.

Page 6: A Massively Parallel Implementation of QC-LDPC Decoder ...gw2/pdf/sasp2011_gpu_ldpc_long.pdfQuasi-Cyclic LDPC (QC-LDPC) codes [1] have been widely used in many practical systems, such

Fig. 5. BLER performance of the 802.11n (1944, 972) LDPC code with 15 iterations versus signal strength (EbN0, in dB), comparing the scaled min-sum and log-SPA algorithms.

In this simulation, the maximum number of iterations is set to a typically used value of 15. The simulated curve is the same as that of the CPU version, since both the algorithm and the data precision (single-precision floating-point) are identical. The simulation results show that the log-SPA algorithm outperforms the min-sum algorithm by about 0.4 dB. As shown in the next subsection, the throughput of the log-SPA on the GPU is comparable to that of the min-sum algorithm, which means the GPU implementation can enjoy the error-rate gain of the log-SPA while still achieving good throughput.

B. Throughput Results

Let Nbits be the codeword length, Ncodeword the total number of codewords, and NSim the number of simulation runs. The total running time is Ttotal = THostToDevice + Titerations + TDeviceToHost, and the throughput is calculated as

Throughput = (Nbits × NSim × Ncodeword) / Ttotal.

TABLE I
LDPC DECODING THROUGHPUT ON GPU WITHOUT EARLY TERMINATION

                              CPU (Mbps)           GPU (Mbps)
Code Type            Niter    log-SPA   min-sum    log-SPA   min-sum
802.11n (1944, 972)    5      0.065     0.166      74.85     74.65
                      10      0.039     0.117      39.98     39.82
                      15      0.028     0.092      27.25     27.18
                      20      0.021     0.077      20.72     20.63
WiMAX (2304, 1152)     5      0.071     0.169      95.8      96.12
                      10      0.042     0.118      52.15     52.31
                      15      0.029     0.096      35.84     35.98
                      20      0.023     0.079      27.31     27.43

Table I shows the throughput of the CPU and GPU implementations for both the 802.11n code and the WiMAX code with different numbers of iterations (Niter). The CPU implementation is a highly optimized single-core version running on an Intel Core i5 CPU at 3.2 GHz. The results show that the GPU implementation outperforms the CPU version by more than 300x in most cases. Although a multi-core programming approach could increase the performance of the CPU implementation, the achievable speed-up would still be limited by the computational resources of the CPU.

Moreover, for the CPU implementation the throughput of the log-SPA is much lower than that of the min-sum algorithm, because the log-SPA executes more instructions, including the more complicated log() and exp() functions. For the GPU implementation, however, the throughput of the log-SPA is comparable to that of the min-sum algorithm. The reason is that the GPU implementation employs the very efficient intrinsic functions logf() and expf(), and the bottleneck of the GPU implementation is the long device memory access latency; the run time of the extra instructions in the log-SPA is therefore hidden behind the memory access latency.

Furthermore, the results show that the decoder for the WiMAX code has higher throughput than the one for the 802.11n code, for two reasons. First, the block size of the WiMAX code is longer while the row weights (ωr) of the two codes are similar, so the computational workload per codeword is comparable; according to the throughput equation, the code with the longer codeword therefore tends to have higher throughput. Second, a longer codeword means more arithmetic instructions per memory access, which hides the memory access latency and the data transfer overhead.

Fig. 6. Simulation results for early termination: average number of iterations (dashed line) and throughput (solid line) of the 802.11n (1944, 972) LDPC decoder on GPU versus EbN0 (dB), for (a) the log-SPA algorithm and (b) the scaled min-sum algorithm. The maximum number of iterations is set to 50.

The parallel early termination scheme is implemented as described in Section IV-E. Fig. 6 shows the throughput and the average number of iterations when the early termination scheme is used. As the SNR (represented by EbN0) increases, the LDPC decoder converges faster, so the average number of iterations decreases and the decoding throughput increases. The simulation results show that the parallel early termination scheme significantly speeds up the simulation in the high SNR regime. At low SNR, the ET version may be slower than the non-ET version due to the overhead of the ET check kernel. Therefore, an adaptive scheme can be used to speed up the simulation across the SNR range: the ET kernel is launched only when the simulation SNR is higher than a specific threshold.

C. Comparison with Prior Work

Table II compares our work with related prior work. There are two main differences among these implementations. First, the block sizes of the LDPC codes are different. LDPC codes with

TABLE II
LDPC DECODING THROUGHPUT COMPARISON WITH OTHER WORK (ET MEANS EARLY TERMINATION IS USED)

Work              GPU           Code          Code Type                 # iterations   Throughput
Wang, 2008 [12]   8800 GT       (2048, 1024)  regular code              ET             2.95∼8.0 Kbps (ωr = 3∼9)
Chang, 2010 [13]  Tesla C1060   (4000, 2000)  regular code              10             2.34 Mbps
Falcao, 2011 [4]  8800 GTX      (1024, 512)   regular code              10             10.0 Mbps
Falcao, 2011 [4]  8800 GTX      (4896, 2448)  regular code              10             17.9 Mbps
Ji, 2010 [5]      GTX 285       (1944, 972)   802.11n code, irregular   ET             0.75 Mbps
This work         GTX 470       (1944, 972)   802.11n code, irregular   10             39.98 Mbps
This work         GTX 470       (1944, 972)   802.11n code, irregular   50∼4.3 (ET)    22.5∼100.3 Mbps (EbN0 = 1.5∼5 dB)
This work         GTX 470       (2304, 1152)  WiMAX code, irregular     10             52.15 Mbps

longer codewords have higher throughput than shorter codes due to the higher ratio of arithmetic operations to memory accesses. Second, most of the previous implementations use regular LDPC codes, which have a more balanced workload than irregular LDPC codes. It is difficult to keep the GPU's computational resources fully occupied with massive numbers of threads when decoding irregular LDPC codes: the imbalanced workload causes the threads on the GPU to finish their computations at different times, so the runtime is bounded by the threads with the largest amount of work.

Table II shows that although the irregular codes we use make high throughput theoretically harder to attain than the codes in the prior work, our decoder still outperforms the prior work by a significant margin, especially when the parallel ET scheme is used. Our work is directly comparable to [5], which also implements a decoder for the 802.11n (1944, 972) QC-LDPC code. Although the GPU used in this work has approximately twice the computational resources of the one in [5], our decoder achieves more than 50 times the throughput of their work. This large improvement can be attributed to our highly optimized algorithm mapping, efficient data structures, and memory access optimizations.

D. The Occupancy Ratio and Instruction Throughput Ratio

According to the profiling results of the NVIDIA Visual Profiler, the occupancy of Kernel 1 is 0.833 and that of Kernel 2 is 1 (occupancy is measured as the ratio of active warps). The high occupancy of our kernels implies that they have a sufficient number of threads to hide stalls. This is confirmed by the instruction throughput ratio (ITR), the ratio of the achieved instruction rate to the peak single-issue instruction rate, which is around 0.8 for this implementation. This shows that our algorithm mapping and workload partitioning are efficient and that we are close to achieving the peak GPU throughput.

VII. CONCLUSION

This paper presents techniques and a design methodology to fully utilize the GPU's computational resources to accelerate a computation-intensive DSP algorithm. As a case study, a massively parallel implementation of an LDPC decoder on GPU is presented, and the challenges of mapping the algorithm onto the GPU's computational resources are described. To achieve high decoding throughput, several techniques are employed, including algorithmic optimizations, efficient data structures and memory access optimizations. We take LDPC decoders for the IEEE 802.11n WiFi and 802.16e WiMAX LDPC codes as examples to demonstrate the performance of our GPU-based implementation. The proposed GPU implementation of the LDPC decoder shows great flexibility and reconfigurability, and the simulation results show that it achieves throughput up to 100.3 Mbps.

REFERENCES

[1] M. Wu, Y. Sun, S. Gupta, and J. Cavallaro, "Implementation of a high throughput soft MIMO detector on GPU," Journal of Signal Processing Systems, pp. 1-14, 2010. [Online]. Available: http://dx.doi.org/10.1007/s11265-010-0523-4

[2] M. Wu, Y. Sun, and J. Cavallaro, "Implementation of a 3GPP LTE turbo decoder accelerator on GPU," in 2010 IEEE Workshop on Signal Processing Systems (SIPS), 2010, pp. 192-197.

[3] R. Gallager, "Low-density parity-check codes," IRE Transactions on Information Theory, vol. 8, no. 1, pp. 21-28, 1962.

[4] G. Falcao, L. Sousa, and V. Silva, "Massively LDPC decoding on multicore architectures," IEEE Transactions on Parallel and Distributed Systems, vol. 22, no. 2, pp. 309-322, 2011.

[5] H. Ji, J. Cho, and W. Sung, "Memory access optimized implementation of cyclic and quasi-cyclic LDPC codes on a GPGPU," Journal of Signal Processing Systems, pp. 1-11, 2010. [Online]. Available: http://dx.doi.org/10.1007/s11265-010-0547-9

[6] NVIDIA CUDA C programming guide v3.2. [Online]. Available: http://developer.nvidia.com/object/cuda_3_2_downloads.html

[7] R. Tanner, D. Sridhara, A. Sridharan, T. Fuja, and D. J. Costello, "LDPC block and convolutional codes based on circulant matrices," IEEE Transactions on Information Theory, vol. 50, no. 12, pp. 2966-2984, 2004.

[8] M. Fossorier, M. Mihaljevic, and H. Imai, "Reduced complexity iterative decoding of low-density parity check codes based on belief propagation," IEEE Transactions on Communications, vol. 47, no. 5, pp. 673-680, May 1999.

[9] J. Chen, A. Dholakia, E. Eleftheriou, M. Fossorier, and X.-Y. Hu, "Reduced-complexity decoding of LDPC codes," IEEE Transactions on Communications, vol. 53, no. 8, pp. 1288-1299, 2005.

[10] S.-H. Kang and I.-C. Park, "Loosely coupled memory-based decoding architecture for low density parity check codes," in Proceedings of the IEEE 2005 Custom Integrated Circuits Conference, 2005, pp. 703-706.

[11] X.-Y. Hu, E. Eleftheriou, D.-M. Arnold, and A. Dholakia, "Efficient implementations of the sum-product algorithm for decoding LDPC codes," in IEEE Global Telecommunications Conference, 2001, pp. 1036-1036E.

[12] S. Wang, S. Cheng, and Q. Wu, "A parallel decoding algorithm of LDPC codes using CUDA," in 2008 42nd Asilomar Conference on Signals, Systems and Computers, 2008, pp. 171-175.

[13] Y.-L. Chang, C.-C. Chang, M.-Y. Huang, and B. Huang, "High-throughput GPU-based LDPC decoding," Proc. SPIE, vol. 7810, p. 781008, 2010. [Online]. Available: http://link.aip.org/link/?PSI/7810/781008/1

