Imperial College
London
Hardware Designs for
Function Evaluation and LDPC Coding
A thesis submitted in partial satisfaction
of the requirements for the degree
Doctor of Philosophy in Computing
by
Dong-U Lee
October 2004
© Copyright by
Dong-U Lee
October 2004
To my parents for their love and support,
and my country Korea...
Acknowledgments
I thank my supervisor Prof. Wayne Luk for his advice and direction on both
academic and non-academic issues. I would also like to thank Prof. John D. Villasenor
from UCLA, Prof. Philip H.W. Leong from the Chinese University of Hong Kong,
Prof. Peter Y.K. Cheung from the Department of EEE and Dr. Oskar Mencer
from the Department of Computing for their help on my research topics.
Many thanks to my colleagues Altaf Abdul Gaffar, Andreas Fidjeland, Anthony Ng,
Arran Derbyshire, Danny Lee, David Pearce, David Thomas, Henry Styles,
Jose Gabriel de Fiqueiredo Coutinho, Jun Jiang, Ray Cheung, Shay Ping Seng,
Sherif Yusuf, Tero Rissa and Tim Todman from Imperial College, Chris Jones,
Connie Wang, David Choi, Esteban Valles and Mike Smith from UCLA,
and Dr. Guanglie Zhang from the Chinese University of Hong Kong for their
assistance. I am especially thankful to Altaf Abdul Gaffar and Ray Cheung, who
helped me with numerous Linux programming tasks, and Tim Todman, who
proofread this thesis.
The financial support of Celoxica Limited, Xilinx Inc., the U.K. Engineering
and Physical Sciences Research Council PhD Studentship from the Department of
Computing, Imperial College, and the U.S. Office of Naval Research is gratefully
acknowledged.
Abstract of the Thesis
Hardware-based implementations are desirable, since they can be several orders
of magnitude faster than software-based methods. Reconfigurable devices such as
Field-Programmable Gate Arrays (FPGAs) are ideal candidates for this purpose
because of their speed and flexibility. Three main contributions are presented in
this thesis, in the areas of function evaluation, Gaussian noise generation, and
Low-Density Parity-Check (LDPC) encoding. First, our function evaluation
research covers both elementary functions and compound functions. For elementary
functions, we automate the design of function evaluation units covering table
look-up, table-with-polynomial and polynomial-only methods. We also illustrate
a framework for adaptive range reduction based on a parametric function
evaluation library. The proposed approach is evaluated by exploring the effects of
several arithmetic functions on the throughput, latency and area of FPGA designs.
For compound functions, which are often non-linear, we present an evaluation
method based on piecewise polynomial approximation with a novel hierarchical
segmentation scheme, which combines uniform segments with segments whose
sizes vary by powers of two. Second, our research on Gaussian noise generation
results in two hardware architectures, both of which can be used for Monte Carlo
simulations such as evaluating the performance of LDPC codes. The first design is
based on the Box-Muller method and the central limit theorem, while the second
design is based on the Wallace method. The quality of the noise produced by the
two noise generators is characterized with various statistical tests. We also
examine how design parameters affect the noise quality of the Wallace method.
Third, our research on LDPC encoding describes a flexible hardware encoder
for regular and irregular LDPC codes. Our architecture, based on an encoding
method proposed by Richardson and Urbanke, has linear encoding complexity.
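As context for the noise-generation work summarized above, the standard Box-Muller transform, on which the first noise generator is based, can be sketched in software. The following is an illustrative Python sketch of the textbook transform under our own naming (`box_muller_pair` is not from the thesis), not the hardware architecture itself:

```python
import math
import random

def box_muller_pair(rng=random.random):
    """Produce two independent standard Gaussian samples from
    two uniform samples in [0, 1) via the Box-Muller transform."""
    u1 = rng()
    u2 = rng()
    # Guard against log(0): resample until u1 is strictly positive.
    while u1 == 0.0:
        u1 = rng()
    # Radius term: f(u1) = sqrt(-2 ln u1); angle term: 2*pi*u2.
    r = math.sqrt(-2.0 * math.log(u1))
    x0 = r * math.cos(2.0 * math.pi * u2)
    x1 = r * math.sin(2.0 * math.pi * u2)
    return x0, x1
```

The hardware designs of Chapter 6 evaluate the logarithm, square root and trigonometric terms with piecewise linear approximations rather than library calls; the sketch only shows the mathematical structure being approximated.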
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1 Objectives and Contributions . . . . . . . . . . . . . . . . . . . . 5
1.2 Computer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Error Correcting Coding and LDPC Codes . . . . . . . . . . . . . 9
1.4 Overview of our Approach . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Function Evaluation . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 Gaussian noise generation . . . . . . . . . . . . . . . . . . 13
1.4.3 LDPC Encoding . . . . . . . . . . . . . . . . . . . . . . . 18
2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Design Tools . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Function Evaluation Methods . . . . . . . . . . . . . . . . . . . . 24
2.3.1 CORDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Digit-recurrence and On-line Algorithms . . . . . . . . . . 26
2.3.3 Bipartite and Multipartite Methods . . . . . . . . . . . . . 27
2.3.4 Polynomial Approximation . . . . . . . . . . . . . . . . . . 28
2.3.5 Polynomial Approximation with Non-uniform Segmentation 30
2.3.6 Rational Approximation . . . . . . . . . . . . . . . . . . . 31
2.4 Issues on Function Evaluation . . . . . . . . . . . . . . . . . . . . 31
2.4.1 Evaluation of Elementary and Compound Functions . . . . 32
2.4.2 Approximation Method Selection . . . . . . . . . . . . . . 32
2.4.3 Range Reduction . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.4 Types of Errors . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Gaussian Noise Generation . . . . . . . . . . . . . . . . . . . . . . 36
2.6 LDPC Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6.1 Basics of LDPC Codes . . . . . . . . . . . . . . . . . . . . 38
2.6.2 LDPC Encoding . . . . . . . . . . . . . . . . . . . . . . . 42
2.6.3 RU LDPC Encoding Method . . . . . . . . . . . . . . . . 43
2.6.4 Hardware Aspects of LDPC codes . . . . . . . . . . . . . . 49
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3 Automating Optimized Table-with-Polynomial
Function Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3 Algorithmic Design Space Exploration with MATLAB . . . . . . . 54
3.4 Hardware Design Space Exploration with ASC . . . . . . . . . . . 57
3.5 Verification with ASC . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4 Adaptive Range Reduction
for Function Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 Design Overview . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Degrees of Freedom . . . . . . . . . . . . . . . . . . . . . . 80
4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1 Algorithmic Design Space Exploration . . . . . . . . . . . 83
4.4.2 ASC Code Generation and Optimizations . . . . . . . . . 87
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5 The Hierarchical Segmentation Method
for Function Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3 Optimum Placement of Segments . . . . . . . . . . . . . . . . . . 104
5.4 The Hierarchical Segmentation Method . . . . . . . . . . . . . . . 113
5.5 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.6 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.7 The Effects of Polynomial Degrees . . . . . . . . . . . . . . . . . . 127
5.8 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . 133
5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6 Gaussian Noise Generator
using the Box-Muller Method . . . . . . . . . . . . . . . . . . . . . . 144
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4 Function Evaluation for Non-uniform Segmentation . . . . . . . . 152
6.5 Function Evaluation for Noise Generator . . . . . . . . . . . . . . 156
6.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.7 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . 165
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7 Gaussian Noise Generator
using the Wallace Method . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.2 The Wallace Method . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.3.1 The First Stage . . . . . . . . . . . . . . . . . . . . . . . . 181
7.3.2 The Second Stage . . . . . . . . . . . . . . . . . . . . . . . 182
7.3.3 The Third Stage . . . . . . . . . . . . . . . . . . . . . . . 182
7.3.4 The Fourth Stage . . . . . . . . . . . . . . . . . . . . . . . 185
7.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.5 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . 193
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
8 Design Parameter Optimization
for the Wallace Method . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.2 Overview of the Wallace Method . . . . . . . . . . . . . . . . . . 205
8.3 Measuring the Wallace Correlations . . . . . . . . . . . . . . . . . 208
8.4 Reducing the Wallace Correlations . . . . . . . . . . . . . . . . . 211
8.5 Performance Comparisons . . . . . . . . . . . . . . . . . . . . . . 214
8.6 Hardware Design with Optimized Parameters . . . . . . . . . . . 220
8.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
9 Flexible Hardware Encoder for LDPC Codes . . . . . . . . . . . 226
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
9.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
9.4 Encoder Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.5 Components for the Encoder . . . . . . . . . . . . . . . . . . . . . 239
9.5.1 Vector Addition . . . . . . . . . . . . . . . . . . . . . . . . 239
9.5.2 Matrix-Vector Multiplication . . . . . . . . . . . . . . . . . 239
9.5.3 Forward-Substitution . . . . . . . . . . . . . . . . . . . . . 241
9.6 Implementation and Results . . . . . . . . . . . . . . . . . . . . . 242
9.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
10.2.1 Function Evaluation . . . . . . . . . . . . . . . . . . . . . 261
10.2.2 Gaussian Noise Generation . . . . . . . . . . . . . . . . . . 263
10.2.3 LDPC Coding . . . . . . . . . . . . . . . . . . . . . . . . . 264
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
List of Figures
1.1 Relations of the chapters in this thesis. . . . . . . . . . . . . . . . 7
1.2 Design flow for evaluating elementary functions. . . . . . . . . . . 13
1.3 Design flow for evaluating non-linear functions using the hierarchical segmentation method. . . . . . . . . . . . . . . . . 14
1.4 The BenONE board from Nallatech used to run our LDPC simulation experiments. . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Our LDPC hardware simulation framework. . . . . . . . . . . . . 17
1.6 LDPC encoding framework. . . . . . . . . . . . . . . . . . . . . . 18
2.1 Simplified view of a Xilinx logic cell. A single slice contains 2.25
logic cells. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Architecture of a typical FPGA. . . . . . . . . . . . . . . . . . . . 22
2.3 Certain approximation methods are better than others for a given
metric at different precisions. . . . . . . . . . . . . . . . . . . . . 33
2.4 Area comparison in terms of configurable logic blocks for different
methods with varying data widths [122]. . . . . . . . . . . . . . . 34
2.5 Comparison of (3,6)-regular LDPC code, Turbo code and optimized irregular LDPC code [151]. . . . . . . . . . . . . . . . 39
2.6 LDPC communication system model. . . . . . . . . . . . . . . . . 40
2.7 A bipartite graph of a (3,6)-regular LDPC code of length ten and rate 1/2. There are ten variable nodes and five check nodes. For each check node Ci, the sum (over GF(2)) of all adjacent variable nodes is equal to zero. . . . . . . . . . . . . . . . . . . 41
2.8 An equivalent parity-check matrix in lower triangular form. . . . . 43
2.9 The parity-check matrix in approximate lower triangular form . . 44
3.1 Block diagram of methodology for automation. . . . . . . . . . . . 55
3.2 Principles behind automatic design optimization with ASC. . . . 56
3.3 Accuracy graph: maximum error versus bitwidth for sin(x) with
the three methods. . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Area versus bitwidth for sin(x) with TABLE+POLY. OPT indicates the metric for which the design is optimized. Lower part: LUTs for logic; small top part: LUTs for routing. . . . . 62
3.5 Latency versus bitwidth for sin(x) with TABLE+POLY. Shows
the impact of latency optimization. . . . . . . . . . . . . . . . . . 62
3.6 Throughput versus bitwidth for sin(x) with TABLE+POLY. Shows
the impact of throughput optimization. . . . . . . . . . . . . . . . 63
3.7 Latency versus area for 12-bit approximations to sin(x). The
Pareto-optimal points [124] in the latency-area space are shown. 63
3.8 Latency versus throughput for 12-bit approximations to sin(x).
The Pareto-optimal points in the latency-throughput space are
shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.9 Area versus throughput for 12-bit approximations to sin(x). The
Pareto-optimal points in the throughput-area space are shown. . 64
3.10 Area versus bitwidth for the three functions with TABLE+POLY.
Lower part: LUTs for logic; small top part: LUTs for routing. . . 67
3.11 Latency versus bitwidth for the three functions with TABLE+POLY. 67
3.12 Throughput versus bitwidth for the three functions with TABLE+POLY.
Throughput is similar across functions, as expected. . . . . . . . . 68
3.13 Area versus bitwidth for sin(x) with the three methods. Note that the TABLE method already becomes too large at 14 bits. . . . 68
3.14 Latency versus bitwidth for sin(x) with the three methods. . . . 69
3.15 Throughput versus bitwidth for sin(x) with the three methods. . . 69
4.1 Design flow: MATLAB generates all the ASC code for the library.
The user simply indexes into the library to obtain the specific
function approximation unit. . . . . . . . . . . . . . . . . . . . . . 73
4.2 Description of range reduction, evaluation method and range reconstruction for the three functions sin(x), log(x) and √x. . . . . 75
4.3 Circuit for evaluating sin(x). . . . . . . . . . . . . . . . . . . . . . 76
4.4 Circuit for evaluating log(x). . . . . . . . . . . . . . . . . . . . . . 77
4.5 Circuit for evaluating √x. . . . . . . . . . . . . . . . . . . . . 78
4.6 Plot of the three functions over the range reduced intervals. . . . 79
4.7 Segmentation for evaluating log(y) with eight uniform segments.
The leftmost three bits of the inputs are used as the segment index. 82
4.8 Architecture of table-with-polynomial unit for degree d polynomials. Horner's rule is used to evaluate the polynomials. . . . . 83
4.9 ASC code for evaluating sin(x) for range 8 bits and precision 8 bits
with tp2. This code is automatically generated from our MATLAB
tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.10 Area matrix, which tells us for each input range/precision combination which design to use for minimum area. . . . . . . . . 91
4.11 Latency matrix, which tells us for each input range/precision combination which design to use for minimum latency. . . . . . . 91
4.12 Area cost of range reduction (upper part) for sin(x) implemented
using po with the designs optimized for area. . . . . . . . . . . . . 92
4.13 Area cost of range reduction (upper part) for sin(x) implemented
using tp3 with the designs optimized for area. . . . . . . . . . . . 92
4.14 Area cost of range reduction (upper part) for log(x) implemented
using po with the designs optimized for area. . . . . . . . . . . . . 93
4.15 Area cost of range reduction (upper part) for log(x) implemented
using tp3 with the designs optimized for area. . . . . . . . . . . . 93
4.16 Area for sin(x) with precision of eight bits for different methods
with (WRR, solid line) and without (WOR, dashed line) range
reduction, with the designs optimized for area. . . . . . . . . . . . 94
4.17 Latency for sin(x) with precision of eight bits for different methods
with (WRR, solid line) and without (WOR, dashed line) range
reduction, with the designs optimized for latency. . . . . . . . . . 94
4.18 Area for log(x) with precision of eight bits for different methods
with (WRR, solid line) and without (WOR, dashed line) range
reduction, with the designs optimized for area. . . . . . . . . . . . 95
4.19 Latency for sin(x) with precision of eight bits for different methods
with (WRR, solid line) and without (WOR, dashed line) range
reduction, with the designs optimized for latency. . . . . . . . . . 95
4.20 Area versus precision for sin(x) using tp3 for different ranges and
optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.21 Latency versus precision for sin(x) using tp3 for different ranges
and optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.22 Area versus range for all three functions using different methods
with the precision fixed at eight bits optimized for area. . . . . . . 97
4.23 Latency versus range for all three functions using different methods
with the precision fixed at eight bits optimized for latency. . . . . 97
4.24 Area versus range for all three functions using po for different
precisions optimized for area. . . . . . . . . . . . . . . . . . . . . 98
4.25 Latency versus range for all three functions using po for different
precisions optimized for latency. . . . . . . . . . . . . . . . . . . . 98
4.26 Area versus range for all three functions using po for different
precisions optimized for area. . . . . . . . . . . . . . . . . . . . . 99
4.27 Latency versus range for all three functions using po for different
precisions optimized for latency. . . . . . . . . . . . . . . . . . . . 99
5.1 MATLAB code for finding the optimum boundaries. . . . . . . . . 109
5.2 Optimum locations of the segments for the four functions in Section 5.1 for 16-bit operands and second order approximation. . . . 110
5.3 Numbers of optimum segments for first order approximations to
the functions for various operand bitwidths. . . . . . . . . . . . . 111
5.4 Numbers of optimum segments for second order approximations to
the functions for various operand bitwidths. . . . . . . . . . . . . 111
5.5 Ratio of the number of optimum segments required for first and
second order approximations to the functions. . . . . . . . . . . . 112
5.6 Circuit to calculate the P2S address for a given input δi, where δi = a_(v−1) a_(v−2) ... a_0. The adder counts the number of ones in the output of the two prefix circuits. . . . . . . . . . . . . . 115
5.7 Main MATLAB code for finding the hierarchical boundaries and
their polynomial coefficients. . . . . . . . . . . . . . . . . . . . . . 119
5.8 Variation of total number of segments against v0 for a 16-bit second
order approximation to f3. . . . . . . . . . . . . . . . . . . . . . . 120
5.9 The segmented functions generated by HFS for 16-bit second order
approximations. f1, f2, f3 and f4 employ P2S(US), P2SL(US),
US(US) and US(US) respectively. The black and grey vertical lines
are the boundaries for the outer and inner segments respectively. . 121
5.10 Design flow of our approach. . . . . . . . . . . . . . . . . . . . . . 123
5.11 HSM function evaluator architecture for λ = 2 and degree d approximations. Note that ':' is a concatenation operator. . . . . . 130
5.12 Variations of the table sizes to the four functions with varying
polynomial degrees and operand bitwidths. . . . . . . . . . . . . . 131
5.13 Variations of the HSM/Optimum segment ratio with polynomial
degrees and operand bitwidths. . . . . . . . . . . . . . . . . . . . 132
5.14 Xilinx System Generator design template used for first order US(US). . . . 135
5.15 Xilinx System Generator design template used for second order
P2SL(US). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.16 Error in ulp for 16-bit second order approximation to f3. . . . . . 137
6.1 Gaussian noise generator architecture. The black boxes are buffers. 150
6.2 The f function. The asterisks indicate the boundaries of the linear
approximations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.3 Circuit to calculate the segment address for a given input x. The adder counts the number of ones in the output of the two prefix circuits. Note that the least-significant bit x_0 is not required. . . . 155
6.4 Function evaluator architecture based on non-uniform segmentation. . . . 157
6.5 Variation of function approximation error with number of bits for
the gradient of the f function. . . . . . . . . . . . . . . . . . . . . 158
6.6 The g functions. Only the thick line is approximated; see Figure
4. The most significant 2 bits of u2 are used to choose which of
the four regions to use; the remaining bits select a location within
Region 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.7 Approximation for g1 over [0, 1/4). The asterisks indicate the segment boundaries of the linear approximations. . . . . . . . . . 160
6.8 Approximation error to f . The worst case and average errors are
0.031 and 0.000048 respectively. . . . . . . . . . . . . . . . . . . . 161
6.9 Approximation error to g1. The worst case and average errors are
0.00079 and 0.0000012 respectively. . . . . . . . . . . . . . . . . . 162
6.10 PDF of the generated noise with 17 approximations for f and 6
for g for a population of four million. The p-values of the χ2 and
A-D tests are 0.00002 and 0.0084 respectively. . . . . . . . . . . . 169
6.11 PDF of the generated noise with 59 approximations for f and 21
for g for a population of four million. The p-values of the χ2 and
A-D tests are 0.0012 and 0.3487 respectively. . . . . . . . . . . . . 169
6.12 PDF of the generated noise with 59 approximations for f and
21 for g with two accumulated samples for a population of four
million. The p-values of the χ2 and A-D tests are 0.3842 and
0.9058 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.13 Scatter plot of two successive accumulative noise samples for a
population of 10000. No obvious correlations can be seen. . . . . . 170
6.14 Variation of output rate against the number of noise generator
instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.1 Overview of the Wallace method. . . . . . . . . . . . . . . . . . . 177
7.2 Overview of our Gaussian noise generator architecture based on the
Wallace method. The triangle in Stage 4 is a constant coefficient
multiplier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.3 The transformation circuit of Stage 3. The square boxes are registers. The select signals for the multiplexors and the clock enable signals for the registers are omitted for simplicity. . . . . . . . 183
7.4 Detailed timing diagram of the transformation circuit and the dual-port "Pool RAM". A_z indicates the address of the data z, and WE is the write enable signal of the "Pool RAM". . . . . . . 184
7.5 Wallace architecture Stage 1 in Xilinx System Generator. The 30
LFSRs generate uniform random bits for Stage 2. . . . . . . . . . 188
7.6 Wallace architecture Stage 2 in Xilinx System Generator. Pseudo
random addresses for p, q, r, s are generated. . . . . . . . . . . . . 189
7.7 Wallace architecture Stage 3 and Stage 4 in Xilinx System Generator. Orthogonal transformation is performed and the sum of squares corrected. . . . . . . . . . . . . . . . . . . . . . . . . 190
7.8 Our Wallace design placed on a Xilinx Virtex-II XC2V4000-6 FPGA. . . . 192
7.9 Our Wallace design routed on a Xilinx Virtex-II XC2V4000-6 FPGA. . . . 192
7.10 Scatter plot of two successive noise samples for a population of
10000. No obvious correlations can be seen. . . . . . . . . . . . . 195
7.11 PDF of the generated noise from our design for a population of
one million. The p-values of the χ2 and A-D tests are 0.9994 and
0.2332 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.12 PDF of the generated noise from our design for a population of
four million. The p-values of the χ2 and A-D tests are 0.7303 and
0.8763 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.13 PDF of the generated noise from the Xilinx block for a population
of one million. The p-values of the χ2 and A-D tests are 0.0000
and 0.0002 respectively. . . . . . . . . . . . . . . . . . . . . . . . . 198
7.14 Variation of the χ2 test p-value with sample size for the Xilinx block and the 12-bit, 16-bit, 20-bit and 24-bit Wallace implementations. . . . 200
7.15 Variation of output rate against the number of noise generator
instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
8.1 Pseudo code of the Wallace method. . . . . . . . . . . . . . . . . 207
8.2 Four million samples of blocks immediately following the block containing a 5σ output, evaluated with the χ2 test with 200 bins over [−7, 7] for FastNorm2. The χ2_199 contributions of each of the bins are shown. . . . . . . . . . . . . . . . . . . . . . . . 209
8.3 The χ2_199 values of blocks relative to a block containing a realization with absolute value of 5σ or higher. Four million samples are compiled for each block. The dotted horizontal line indicates the 0.05 confidence level. . . . . . . . . . . . . . . . . . . . . 210
8.4 Impact of various design choices on the χ2_199 value. Four million samples are compiled from the block immediately after each block containing an absolute value of 5σ or higher for each data point. The dotted horizontal line indicates the 0.05 confidence level. . . . 222
8.5 Speed comparisons at various K at N = 4096 and R = 1. Lower
part: arithmetic operations. Upper part: table accesses. . . . . . . 223
8.6 Speed comparisons for different parameter choices. The solid,
dashed and dotted lines are for R = 1, R = 2 and R = 3 re-
spectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
8.7 Execution times for different pool sizes at R = 1 and K = 16. The
solid and dotted lines are for the Athlon XP and the Pentium 4
processors respectively. . . . . . . . . . . . . . . . . . . . . . . . . 224
8.8 Level 2 cache miss rates on the SimpleScalar x86 simulator for
different pool sizes at R = 1, K = 16 and various level 2 cache
sizes. Level 1 cache is fixed at 16KB and 65536 noise samples are
generated for each data point. . . . . . . . . . . . . . . . . . . . . 224
9.1 The parity-check matrix H in ALT form. A, B, C, and E are
sparse matrices, D is a dense matrix, and T is a sparse lower
triangular matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
9.2 LDPC encoding framework. . . . . . . . . . . . . . . . . . . . . . 229
9.3 An equivalent parity-check matrix in lower triangular form. Note
that n = block length and m = block length× (1− code rate). . . 230
9.4 Different starting columns for H and H^T. . . . . . . . . . . . . 235
9.5 Overview of our hardware encoder architecture. Double buffering
is used between the stages for concurrent execution. Grey and
white box indicate RAMs and operations respectively. . . . . . . . 236
9.6 Circuit for vector addition (VA). . . . . . . . . . . . . . . . . . . . 239
9.7 Circuit for matrix-vector multiplication (MVM). . . . . . . . . . . 241
9.8 Circuit for forward-substitution (FS). . . . . . . . . . . . . . . . . 243
9.9 Scatter plot of a preprocessed irregular 500 × 1000 H matrix in
ALT form with a gap of two. Ones appear as dots. . . . . . . . . 245
9.10 The four-stage LDPC encoder architecture in Xilinx System Generator. Each stage contains multiple subsystems performing MVM, FS, VA or CWG. . . . . . . . . . . . . . . . . . . . . . 246
9.11 LDPC encoder architecture Stage 2 and stage controller in Xilinx
System Generator. . . . . . . . . . . . . . . . . . . . . . . . . . . 247
9.12 The matrix-vector multiplication (MVM) circuit in Xilinx System
Generator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
9.13 The forward-substitution (FS) circuit in Xilinx System Generator. 249
9.14 Variation of throughput with the number of encoder instances. . . 255
List of Tables
2.1 Maximum absolute and average errors for various first order polynomial approximations to e^x over [−1, 1]. . . . . . . . . . . . 29
2.2 Efficient computation of p1^T = −φ^(−1)(−E T^(−1) A + C) s^T. . . . . . . 46
2.3 Efficient computation of p2^T = −T^(−1)(A s^T + B p1^T). . . . . . . . . 47
2.4 Summary of the RU encoding procedure. . . . . . . . . . . . . . . 48
3.1 Various place and route results of 12-bit approximations to sin(x). The logic-minimized LUT implementation of the tables minimizes latency and area, while keeping throughput comparable to that of the other methods, e.g. the block RAM (BRAM) based implementation. . . . 59
5.1 The ranges for P2S addresses for Λ1 = P2S, n = 8, v0 = 5 and
v1 = 3. The five P2S address bits δ0 are highlighted in bold. . . . 114
5.2 Number of segments for second order approximations to the four
functions. Results for uniform, HSM and optimum are shown. . . 122
5.3 Comparison of direct look-up, SBTM, STAM and HSM for 16 and 24-bit approximations to f2. The subscript for HSM denotes the polynomial degree, and the subscript for STAM denotes the number of multipartite tables used. Note that SBTM is equivalent to STAM2. . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.4 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA
for 16 and 24-bit, first and second order approximations to f2 and
f3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.5 Widths of the data paths, number of segments, table size and
percentage of exactly rounded results for 16 and 24-bit second
order approximations to f2 and f3. . . . . . . . . . . . . . . . . . 141
5.6 Performance comparison: computation of f2 and f3 functions. The
Athlon and the Pentium 4 PCs are equipped with 512MB and 1GB
DDR-SDRAMs respectively. . . . . . . . . . . . . . . . . . . . . . 142
6.1 Comparing two segmentation methods. The second column compares the number of segments for non-uniform and uniform segmentation. The third column shows the number of bits used for the coefficients to approximate f and g1. . . . . . . . . . . . 163
6.2 Performance comparison: time for producing one billion Gaussian
noise samples. All PCs are equipped with 1GB DDR-SDRAM. . . 171
7.1 Resource utilization for the four stages of the noise generator on a
Xilinx Virtex-II XC2V4000-6 FPGA. . . . . . . . . . . . . . . . . 191
7.2 Hardware implementation results of the noise generator using dif-
ferent types of FPGA resources on a Xilinx Virtex-II XC2V4000-6
FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.3 Comparisons of different hardware Gaussian noise generators im-
plemented on Xilinx Virtex-II XC2V4000-6 FPGAs. All designs
generate a noise sample every clock. . . . . . . . . . . . . . . . . . 199
7.4 Hardware implementation results on a Xilinx Virtex-II XC2V4000-
6 FPGA for different numbers of noise generator instances.
The device has 23040 slices, 120 block RAMs and 120 embedded
multipliers in total. . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.5 Performance comparison: time for producing one billion Gaussian
noise samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
8.1 Number of arithmetic operations per transform/sample for the
transformation at various sizes of K. . . . . . . . . . . . . . . . . 214
8.2 Specifications of the AMD Athlon XP and Intel Pentium 4 plat-
forms used in our experiments. . . . . . . . . . . . . . . . . . . . . 216
8.3 Details of the AMD Athlon XP and Intel Pentium 4 data caches. 217
8.4 Execution time in nanoseconds for the AMD Athlon XP and Intel
Pentium 4 platforms at N = 4096. . . . . . . . . . . . . . . . . . . 218
8.5 Performance comparison of different software Gaussian random
number generators. The Wallace implementations use N = 4096,
R = 1 and K = 16. . . . . . . . . . . . . . . . . . . . . . . . . . . 220
9.1 Computation of p1^T = −F^−1(−ET^−1A + C)s^T. Note that T^−1[As^T] =
y^T ⇒ Ty^T = [As^T]. . . . . . . . . . . . . . . . . . . . . . . . . . 232
9.2 Computation of p2^T = −T^−1(As^T + Bp1^T). . . . . . . . . . . . . . 232
9.3 Matrix X stored in memory. The location of the edges of each row
and an extra bit indicating the end of a row are stored. . . . . . . 240
9.4 Preprocessing times and gaps for H matrices with rate 1/2 for var-
ious block lengths performed on a Pentium 4 2.4GHz PC equipped
with 512MB DDR-SDRAM. . . . . . . . . . . . . . . . . . . . . . 244
9.5 Dimensions and number of edges for the matrices A, B, T , C, F
and E generated from a 1000× 2000 irregular H matrix. . . . . . 250
9.6 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA
for rate 1/2 for various block lengths. . . . . . . . . . . . . . . . . 252
9.7 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA
for block length of 2000 bits for various rates. . . . . . . . . . . . 253
9.8 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA
for block length of 2000 bits and rate 1/2 for different numbers of
encoder instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
9.9 Performance comparison of block length of 2000 bits and rate 1/2
encoders: time for producing 410 million codeword bits. . . . . . . 256
Abbreviations
A-D Anderson-Darling
ALT Approximate Lower Triangular
ASC A Stream Compiler
ASIC Application-Specific Integrated Circuit
AWGN Additive White Gaussian Noise
BER Bit Error Rate
CDF Cumulative Distribution Function
CORDIC COordinate Rotations DIgital Computer
CPC Cycles Per Codeword
CPS Codewords Per Second
CWG CodeWord Generation
DDR Double Data Rate
DSP Digital Signal Processor
ECC Error Correcting Coding
FPGA Field-Programmable Gate Array
FS Forward-Substitution
GF Galois Field
HFS Hierarchical Function Segmenter
HSM Hierarchical Segmentation Method
K-S Kolmogorov-Smirnov
LDGM Low-Density Generator-Matrix
LDPC Low-Density Parity-Check
LFSR Linear Feedback Shift Register
LNS Logarithmic Number Systems
LRU Least Recently Used
LUT Look-Up Table
Mbps Megabits per second
MVM Matrix-Vector Multiplication
P2S Powers of 2 Segments
PDF Probability Density Function
po polynomial only
RAM Random Access Memory
ROM Read Only Memory
RU Richardson and Urbanke
S1 Stage 1
SBTM Symmetric Bipartite Table Method
SNR Signal to Noise Ratio
STAM Symmetric Table Addition Method
tp2 table-with-polynomial of degree 2
ulp unit in the last place
US Uniform Segments
VA Vector Addition
VHDL VHSIC (Very High Speed Integrated Circuit) Hardware Description Language
WOR WithOut Range reduction
WRR With Range Reduction
Publications
Journal Papers
D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “Automating optimized hard-
ware function evaluation”, submitted to IEEE Transactions on Computers, 2004.
P.H.W. Leong, G. Zhang, D. Lee, W. Luk and J.D. Villasenor, “A comment on
the implementation of the Ziggurat method”, submitted to Journal of Statistical
Software, 2004.
D. Lee, W. Luk, J.D. Villasenor and P.H.W. Leong, “Design parameter optimiza-
tion for the Wallace Gaussian random number generator”, submitted to ACM
Transactions on Modeling and Computer Simulation, 2004.
D. Lee, W. Luk, J.D. Villasenor, G. Zhang and P.H.W. Leong, “A hardware
Gaussian noise generator using the Wallace method”, submitted to IEEE Trans-
actions on VLSI, 2004.
G. Zhang, P.H.W. Leong, C.H. Ho, K.H. Tsoi, R.C.C. Cheung, D. Lee and
W. Luk, “Monte Carlo simulation using FPGAs”, submitted to IEEE Trans-
actions on VLSI, 2004.
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “The hierarchical segmen-
tation method for function evaluation”, submitted to IEEE Transactions on Cir-
cuits and Systems I, 2004.
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “A hardware Gaussian noise
generator for hardware-based simulations”, IEEE Transactions on Computers,
volume 53, number 12, pages 1523-1534, 2004.
Book Chapter
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “The effects of polynomial
degrees on the hierarchical segmentation method”, Chapter in New Algorithms,
Architectures, and Applications for Reconfigurable Computing, W. Rosenstiel and
P. Lysaght (Eds.), Kluwer Academic Publishers, 2004.
Conference Papers
D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “MiniBit: Bit-width opti-
mization via affine arithmetic”, submitted to ACM/IEEE Design Automation
Conference, 2005.
D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “Adaptive range reduction for
hardware function evaluation”, In Proceedings of IEEE International Conference
on Field-Programmable Technology (FPT), pages 169-176, Brisbane, Australia,
Dec 2004.
D. Lee, “Gaussian noise generation for Monte Carlo simulations in hardware”, In
Proceedings of The Korean Scientists and Engineers Association in the UK 30th
Anniversary Conference, pages 182-185, London, UK, Sep 2004.
D. Lee, O. Mencer, D.J. Pearce and W. Luk, “Automating optimized table-
with-polynomial function evaluation for FPGAs”, In Proceedings of International
Conference on Field Programmable Logic and its Applications (FPL), pages 364-
373, LNCS 3203, Springer-Verlag, Antwerp, Belgium, Aug 2004.
D. Lee, W. Luk, C. Wang, C. Jones, M. Smith and J.D. Villasenor, “A flexible
hardware encoder for low-density parity-check codes”, In Proceedings of IEEE
Symposium on Field-Programmable Custom Computing Machines (FCCM), pages
101-111, Napa Valley, USA, Apr 2004.
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “Hierarchical segmentation
schemes for function evaluation”, In Proceedings of IEEE International Confer-
ence on Field-Programmable Technology (FPT), pages 92-99, Tokyo, Japan, Dec
2003.
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “Hardware function eval-
uation using non-linear segments”, In Proceedings of International Conference
on Field Programmable Logic and its Applications (FPL), pages 796-807, LNCS
2778, Springer-Verlag, Lisbon, Portugal, Sep 2003.
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “A hardware Gaussian noise
generator for channel code evaluation”, In Proceedings of IEEE Symposium on
Field-Programmable Custom Computing Machines (FCCM), pages 69-78, Napa
Valley, USA, Apr 2003.
D. Lee, T.K. Lee, W. Luk and P.Y.K. Cheung, “Incremental programming for re-
configurable engines”, In Proceedings of IEEE International Conference on Field-
Programmable Technology (FPT), pages 411-415, Shatin, Hong Kong, Dec 2002.
CHAPTER 1
Introduction
1.1 Objectives and Contributions
The objective of this thesis is to explore hardware designs for function evaluation,
Gaussian noise generation and Low-Density Parity-Check (LDPC) code encoding.
Our main contributions are:
• Methodology for the automation of function evaluation unit design, cov-
ering table look-up, table-with-polynomial and polynomial-only methods
(Chapter 3).
• Framework for adaptive range reduction based on a parametric function
evaluation library, on function approximation by polynomials and tables,
and on pre-computing all possible input and output ranges (Chapter 4).
• Efficient hierarchical segmentation method based on piecewise polynomial
approximations suitable for non-linear compound functions, which involves
uniform segments and segments with size varying by powers of two (Chap-
ter 5).
• Hardware Gaussian noise generator based on the Box-Muller method and
the central limit theorem capable of producing 133 million samples per sec-
ond with 10% resource usage on a Xilinx XC2V4000-6 FPGA (Chapter 6).
• Hardware Gaussian noise generator based on the Wallace method capable
of producing 155 million samples per second with 3% resource usage on a
Xilinx XC2V4000-6 FPGA (Chapter 7).
• Design parameter optimization for software implementations of the Wallace
method to reduce correlations and execution time (Chapter 8).
• Linear complexity hardware encoder for regular and irregular LDPC codes
with an efficient architecture for storing and performing computation on
sparse matrices (Chapter 9).
The most exciting contribution of this thesis is perhaps the hierarchical seg-
mentation method presented in Chapter 5. It is a systematic method for pro-
ducing fast and efficient hardware function evaluators for both compound and
elementary functions using piecewise polynomial approximations with a novel
hierarchical segmentation scheme. This method is particularly useful for approximating
non-linear functions or curves, using significantly less memory than the
traditional uniform segmentation approach. Depending on the function and precision,
the memory requirements can be reduced by several orders of magnitude.
We believe that numerous applications can benefit from our approach, including
data compression, function evaluation, non-linear filtering, pattern recognition
and picture processing.
Although the designs in this thesis target FPGA technology, we believe that
our methods are generic enough to be applied across different implementation
technologies such as ASICs. FPGAs are simply used as a platform to demonstrate
that our ideas can be efficiently mapped into hardware.
Figure 1.1 illustrates how the various chapters in this thesis are related to
each other. The chapters on function evaluation are 3, 4 and 5. The chapters on
LDPC coding are 6, 7, 8 and 9. Within the LDPC coding framework, Chapters
6, 7 and 8 are on Gaussian noise generation, which is needed for exploring LDPC
[Diagram: Chapter 1 (Introduction) and Chapter 2 (Background) lead into two strands. Function Evaluation comprises Chapter 3 (Automating Function Evaluation), Chapter 4 (Range Reduction) and Chapter 5 (Hierarchical Segmentation). LDPC Coding comprises Gaussian Noise Generation, covering Chapter 6 (Box-Muller Method), Chapter 7 (Wallace Method) and Chapter 8 (Wallace Optimization), together with Chapter 9 (LDPC Encoding). All strands converge on Chapter 10 (Conclusions).]
Figure 1.1: Relations of the chapters in this thesis.
code behavior in hardware. The Box-Muller method in Chapter 6 requires the
evaluation of functions and uses a variant of the hierarchical segmentation method
presented in Chapter 5.
The rest of this chapter provides historical information and an overview of
the material in Chapters 3 to 9. Chapter 2 covers background material and
previous work. Chapter 3 describes a methodology for the automation of el-
ementary function evaluation unit design. Chapter 4 presents a framework for
adaptive range reduction based on a parametric elementary function evaluation li-
brary. Chapter 5 presents an efficient hierarchical segmentation method suitable
for non-linear compound functions. Chapter 6 describes a hardware Gaussian
noise generator based on the Box-Muller method and the central limit theorem.
Chapter 7 presents a hardware Gaussian noise generator based on the Wallace
method. Chapter 8 analyzes correlations that can occur in the Wallace method,
and examines parameters to reduce correlations and execution time for software
implementations. Chapter 9 describes an efficient hardware encoder with linear
encoding complexity for both regular and irregular LDPC codes, and Chapter 10
offers conclusions and future work.
1.2 Computer Arithmetic
Arithmetic has played important roles in human civilization, especially in the
areas of science, engineering and technology. Machine arithmetic can be traced
back as early as 500 BC in the form of the abacus used in China. Many numerically
intensive applications, such as signal processing, require rapid execution
of arithmetic operations. The evaluation of functions is often the performance
bottleneck of many compute-bound applications. Examples of these functions
include elementary functions such as log(x) and √x, and compound functions
such as √(−log(x)) and x log(x). Computing these functions quickly and accurately
is a major goal in computer arithmetic. For instance, over 60% of the total run
time is devoted to function evaluation operations in a simulation of a jet engine
reported by O’Grady and Wang [133].
Recent studies have shown the increasing importance of these mathematical
functions in a wide variety of applications, including 3D computer graphics,
animation, scientific computing, artificial neural networks, digital signal
processing and multimedia. Software implementations are often too slow for
numerically intensive or real-time applications. The increasing speed and perfor-
mance constraints of such applications have led to the development of new ded-
icated hardware for the computation of these operations, providing high-speed
solutions implemented in coprocessors, graphic cards, Digital Signal Processors
(DSPs), Application-Specific Integrated Circuits (ASICs), Field-Programmable
Gate Arrays (FPGAs) [122] and numerical processors in general.
1.3 Error Correcting Coding and LDPC Codes
Error correcting coding (ECC) is a critical part of modern communications sys-
tems, where it is used to detect and correct errors introduced during a transmis-
sion over a channel [11], [126]. It relies on transmitting the data in an encoded
form, such that the redundancy introduced by the coding allows a decoding de-
vice at the receiver to detect and correct errors. In this way, no request for
retransmission is required, unlike systems which only detect errors (usually by
means of a checksum transmitted with the data). In many applications, a sub-
stantial portion of the baseband signal processing is dedicated to ECC. The wide
range of ECC applications [30] include space and satellite communications, data
transmission, data storage and mobile communications.
NASA’s space missions including Galileo, Odyssey, Rovers and Voyager would
not have been possible without the use of ECC [71]. Odyssey, NASA’s Mars
spacecraft, currently boasts the highest data transmission rate at 128,000 bits per
second via a radio link. However, for future space missions NASA are planning to
use optical communications via laser beams [60]. The new laser will beam back
between one million and 30 million bits per second, depending on the distance
between Mars and Earth [119]. Projects like this provide great challenges to
implement high-speed and low-power ECC systems with good error correcting
performance in deep space.
In 1948, Claude Shannon founded the field of study “Information Theory”
which is the basis of modern ECC with his discovery of the noisy channel cod-
ing theorem [164]. The theoretical contribution of Shannon’s work was a useful
definition of “information” and several “channel coding theorems” which gave ex-
plicit upper bounds, called the channel capacity, on the rate at which information
could be transmitted reliably on a given communication channel. In the context
of our work, the result of primary interest is the “noisy channel coding theorem
for continuous channels with average power limitations”. This theorem states
that the capacity C (which is now known as the Shannon limit) of a bandlimited
additive white Gaussian noise (AWGN) channel with bandwidth W , a channel
model that approximately represents many practical digital communication and
storage systems, is given by
C = W log2(1 + Es/N0) bits per second (bps) (1.1)
where Es is the average signal energy in each signaling interval of duration
T = 1/W , and N0/2 is the two-sided noise power spectral density. Perfect
Nyquist signalling is assumed. The proof of this theorem demonstrates that for
any transmission rate R less than or equal to the channel capacity C, there exists
a coding scheme that achieves an arbitrarily small probability of error; conversely,
if R is greater than C, no coding scheme can achieve reliable performance. Since
this theorem was published, an entire field of study has grown out of attempts
to design coding schemes that approach the Shannon limit of various channels.
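As a quick numerical illustration of Equation (1.1) (a sketch only, not part of the thesis tooling), the capacity of a bandlimited AWGN channel can be computed directly:

```python
import math

def shannon_capacity(bandwidth_hz, es_over_n0):
    """Channel capacity C = W * log2(1 + Es/N0) of a bandlimited AWGN
    channel (Equation 1.1), in bits per second."""
    return bandwidth_hz * math.log2(1.0 + es_over_n0)

# A 1 MHz channel at Es/N0 = 3 (about 4.8 dB) supports reliable
# transmission at up to W * log2(4) = 2 Mbps.
print(shannon_capacity(1e6, 3.0))  # 2000000.0
```

Any rate R below this value is achievable by some coding scheme; no scheme can operate reliably above it.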
In the past few years, LDPC codes have received much attention because
of their excellent performance, and have been widely considered as the most
promising candidate ECC scheme for many applications in telecommunications
and storage devices [132], [8]. LDPC codes were first proposed by Gallager in
1962 [48], [49]. He defined an (n, dv, dc) LDPC code as a code of block length
n in which each column of the parity-check matrix contains dv ones and each
row contains dc ones. Due to the regular structure (uniform column and row
weight) of Gallager’s codes, they are now called regular LDPC codes. Gallager
provided simulation results for codes with block lengths of the order of hundreds
of bits. The results indicated that LDPC codes have very good potential for error
correction. However, the high storage and computation requirements stalled
the research on LDPC codes. After the discovery of Turbo codes by Berrou et
al. in 1993 [7], MacKay [110] re-established the interest in LDPC codes during
the mid to late 1990s.
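Gallager's regularity condition is easy to state in code. The following minimal sketch checks the uniform column and row weights that define an (n, dv, dc) regular LDPC code; the hand-made (6, 2, 3) matrix is a toy example, far smaller than any practical code:

```python
# Gallager's (n, dv, dc) regular LDPC code: every column of the
# parity-check matrix H contains dv ones and every row contains dc ones.
H = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
]

def is_regular(H, dv, dc):
    """Check the uniform row and column weights of a regular code."""
    rows_ok = all(sum(row) == dc for row in H)
    cols_ok = all(sum(col) == dv for col in zip(*H))
    return rows_ok and cols_ok

print(is_regular(H, dv=2, dc=3))  # True
```

Irregular LDPC codes relax exactly this constraint, allowing the column and row weights to follow a degree distribution instead.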
1.4 Overview of our Approach
1.4.1 Function Evaluation
The evaluation of elementary functions is at the core of many compute-intensive
applications [133] which perform well on reconfigurable platforms. Yet, in or-
der to implement function evaluation efficiently, the FPGA programmer has to
choose between many function evaluation methods such as table look-up, polyno-
mial approximation, or table look-up combined with polynomial approximation.
We present a methodology and a partially automated implementation to select
the best function evaluation hardware for a given function, accuracy require-
ment, technology mapping and optimization metrics, such as area, throughput
or latency. The automation of function evaluation unit design is combined with
ASC [123], A Stream Compiler, for FPGAs. On the algorithmic side, we use
MATLAB to design approximation algorithms with polynomial coefficients and
minimize bitwidths. On the hardware implementation side, ASC provides par-
tially automated design space exploration. We illustrate our approach for sin(x),
log(1 + x) and 2x, which are commonly used in a variety of applications. We
provide a selection of graphs that characterize the design space with various di-
mensions, including accuracy, precision and function evaluation method. We also
demonstrate design space exploration by implementing more than 400 distinct
designs.
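As a rough software analogue of the table-with-polynomial approach (a sketch only: the generated hardware uses fixed-point arithmetic with optimized bitwidths, and the coefficients come from MATLAB minimax fits rather than endpoint interpolation), a degree-1 segment table for sin(x) on [0, π/2) can be built as follows:

```python
import math

SEGMENTS = 16  # table depth; more segments -> lower error, larger table
WIDTH = (math.pi / 2) / SEGMENTS

# Precompute degree-1 coefficients per uniform segment:
# f(x) ~ c0 + c1 * (x - x0), via simple endpoint interpolation.
TABLE = []
for i in range(SEGMENTS):
    x0 = i * WIDTH
    c0 = math.sin(x0)
    c1 = (math.sin(x0 + WIDTH) - c0) / WIDTH
    TABLE.append((c0, c1))

def sin_approx(x):
    i = min(int(x / WIDTH), SEGMENTS - 1)  # table look-up (segment index)
    c0, c1 = TABLE[i]
    return c0 + c1 * (x - i * WIDTH)       # one multiply-add

err = max(abs(sin_approx(k / 1000) - math.sin(k / 1000))
          for k in range(0, 1571))
print(err < 1e-2)  # True
```

The design-space trade-off explored in Chapter 3 is visible even here: a pure table needs many entries for the same error, while a pure polynomial needs a higher degree.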
The evaluation of a function f(x) typically consists of range reduction which
transforms the input into a small interval, and the actual function evaluation
on the small interval. We investigate optimization of range reduction given the
range and precision of x and f(x). For every function evaluation there exists
a convenient interval such as [0, π/2) for sin(x). An example of the adaptive
range reduction method, which we propose in our work, introduces another larger
interval for which it makes sense to skip range reduction. The decision depends
on the function being evaluated, precision, and optimization metrics such as area,
latency and throughput. In addition, the input and output range has an impact
on the choice of function evaluation method such as polynomial, table based, or
combinations of the two. We explore this vast design space of adaptive range
reduction for fixed-point sin(x), log(x) and √x accurate to one unit in the last
place (ulp) using MATLAB and ASC. These tools enable us to study over 1000
designs resulting in over 40 million Xilinx equivalent circuit gates, in a few hours’
time. The final objective is to progress towards a fully automated library that
provides optimal function evaluation hardware units given input and output range
and precision. Our design flow for evaluating elementary functions is illustrated
in Figure 1.2.
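The idea of a convenient interval can be illustrated with a small sketch (floating point here, whereas the thesis targets fixed-point hardware): any non-negative argument of sin(x) is folded into [0, π/2), and the sign and reflection are recovered from the quadrant index:

```python
import math

def sin_range_reduced(x, sin_core=math.sin):
    """Evaluate sin(x) for x >= 0 using a core routine that only needs
    to handle the convenient interval [0, pi/2)."""
    k = int(x // (math.pi / 2))   # quadrant index
    r = x - k * (math.pi / 2)     # reduced argument in [0, pi/2)
    quadrant = k % 4
    if quadrant == 0:
        return sin_core(r)
    if quadrant == 1:
        return sin_core(math.pi / 2 - r)   # reflect
    if quadrant == 2:
        return -sin_core(r)                # negate
    return -sin_core(math.pi / 2 - r)      # negate and reflect

print(abs(sin_range_reduced(5.0) - math.sin(5.0)) < 1e-12)  # True
```

Adaptive range reduction asks when this folding step is worth its hardware cost: if the input range is already small, skipping it can save area and latency.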
Compound functions often have non-linear properties, hence sophisticated
approximation techniques are needed. We present a method for evaluating such
functions based on piecewise polynomial approximation with a novel hierarchical
segmentation scheme. The use of hierarchical schemes of uniform segments and
segments with size varying by powers of two enables us to approximate non-
linear regions of a function particularly well. This partitioning is automated:
efficient look-up tables and their coefficients are generated for a given function,
input range, degree of the polynomials, desired accuracy and finite precision
constraints. Parameterized reference design templates are provided for various
predefined hierarchical schemes. We describe an algorithm to find the optimum
[Diagram: during library construction, the user supplies the function f(x), input format and method; f(x) is approximated in MATLAB, a Perl-script library generator emits ASC code into the function evaluation library (ASCLib), and the ASC hardware compiler produces the FPGA implementations. During library usage, the user draws on the generated library directly.]
Figure 1.2: Design flow for evaluating elementary functions.
number of segments and the placement of their boundaries, which is used to an-
alyze the properties of a function and to benchmark our hierarchical approach.
Our method is illustrated using four non-linear compound and elementary
functions: √(−log(x)), x log(x), a high order rational function and cos(πx/2). We
present results for various operand sizes between 8 and 24 bits for first and sec-
ond order polynomial approximations. For 24-bit data, our method requires a
look-up table 12 times smaller than that of the symmetric table addition method.
Our framework for the hierarchical segmentation method is shown in Figure 1.3.
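A simplified illustration of powers-of-two segmentation (a sketch of the idea only; the actual HSM address calculation of Chapter 5 combines P2S and uniform levels): if segment widths shrink by powers of two toward zero, the segment address of an integer input is just the position of its most significant one bit, which a priority encoder computes in hardware in a single cycle:

```python
def p2s_address(x):
    """Powers-of-2 segment (P2S) address of a non-negative integer input:
    segments halve in width toward zero, so the address is the position
    of the most significant one bit (0 for x == 0)."""
    return x.bit_length()

# For 8-bit inputs: [128, 255] -> segment 8, [64, 127] -> segment 7,
# ..., [1, 1] -> segment 1, giving fine segments near zero where
# functions such as sqrt(-log(x)) are most non-linear.
print(p2s_address(200), p2s_address(70), p2s_address(1))  # 8 7 1
```

This is why the hierarchical scheme needs no stored segment-boundary table: the boundaries are implicit in the bit pattern of the input.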
1.4.2 Gaussian noise generation
Evaluations of LDPC codes are based on computer simulations which can be time
consuming, particularly when the behavior at low bit error rates (BERs) in the
error floor region is being studied [57]. Tremendous efforts have been devoted
[Diagram: user input drives the Hierarchical Function Segmenter, which produces a data file; a design generator combines this with a reference design library, and synthesis followed by place and route yields the hardware together with its reports.]
Figure 1.3: Design flow for evaluating non-linear functions using the hierarchical
segmentation method.
to analyzing and improving their error-correcting performance, but little consideration
has been given to practical LDPC codec hardware implementations. If
the binary Hamming distance [148] between all combinations of codewords (the
distance spectrum) is known, then analytic techniques for describing the performance
of the codes in the presence of noise are available. However, in the case of
capacity achieving random linear codes (such as LDPC codes), the problem of
finding the distance spectrum of the code is intractable and researchers resort to
the use of Monte Carlo simulation in order to characterize various code constructions
in terms of BER versus signal-to-noise ratio (SNR). At very low SNRs, errors
occur often and a sufficient statistic can be gathered readily within a PC. However
at higher SNRs where errors occur rarely, the situation is different. Thorough
characterization of a code in this region may require simulation of 1010−1012 code
symbols, and computer based simulations provide inadequate means of finding
statistically sufficient set of error events, which can take several weeks.
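The scale of the problem is visible even in a toy Monte Carlo BER experiment for uncoded BPSK over AWGN (a sketch, unrelated to the thesis's hardware framework): at a bit error rate of 10^-9, roughly 10^10 bits must be simulated to observe a handful of errors.

```python
import math
import random

def estimate_ber(snr_db, num_bits, seed=0):
    """Monte Carlo BER estimate for uncoded BPSK over an AWGN channel.
    At high SNR the error count per simulated bit collapses, which is
    why software simulation of low-BER regimes is so slow."""
    rng = random.Random(seed)
    # Noise standard deviation for unit-energy symbols at the given Es/N0.
    sigma = math.sqrt(1.0 / (2.0 * 10 ** (snr_db / 10.0)))
    errors = 0
    for _ in range(num_bits):
        bit = rng.randrange(2)
        tx = 1.0 if bit else -1.0
        rx = tx + rng.gauss(0.0, sigma)   # AWGN channel
        if (rx > 0.0) != bool(bit):       # hard-decision detector
            errors += 1
    return errors / num_bits

ber = estimate_ber(snr_db=4.0, num_bits=100_000)  # theory: ~1.2e-2
```

At 4 dB this loop finds errors easily; at the SNRs of interest for LDPC error floors, the same loop would run for weeks, which motivates the hardware simulation framework below.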
Hardware based simulation offers the potential of speeding up code evaluation
by several orders of magnitude [99]. Such a simulation framework consists of three
main blocks: encoder, noise channel and decoder, where the noise channel is
generally modeled by Gaussian noise. Our LDPC code simulations are run on
a reconfigurable engine, which consists of a PC and a reconfigurable hardware
platform [85]. The reconfigurable hardware platform we use is a Xilinx Virtex-II
FPGA prototyping board from Nallatech [131] shown in Figure 1.4. It consists
of two Xilinx Virtex-II XC2V4000-6 FPGAs and 4MB of SRAM. The board can
be connected to a PC via the PCI bus or USB. The grey wires are connected to a
logic analyzer for debugging purposes. A block diagram of our LDPC simulation
framework is provided in Figure 1.5. The LDPC encoder follows an algorithm
suggested in [152]. Our noise generator block improves the overall value of the
system as a Monte Carlo simulator, since noise quality at high SNRs (tails of the
Gaussian distribution) is essential. Since the LDPC decoding process is iterative
and the number of required iterations is non-deterministic, a flow control buffer
is used to greatly increase the throughput of the overall system.
We present two methods for generating Gaussian noise. The first is based on
the Box-Muller method [13] and the central limit theorem [78], which involve the
computation of two functions: √(−ln(x)) and cos(2πx). The accuracy and speed
in computing these functions are essential for generating high-quality Gaussian
noise samples rapidly. The use of non-uniform segments enables us to approxi-
mate non-linear regions of a function particularly well. The appropriate segment
address for a given function can be rapidly calculated at run time by a simple
combinatorial circuit. Scaling factors are used to deal with large polynomial coef-
ficients and to trade precision for range. Our function evaluator is based on first
order polynomials, and is suitable for applications requiring high performance
with small area, at the expense of accuracy. We exploit the central limit theo-
rem to overcome quantization and approximation errors. An implementation at
133MHz on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 10% of the device
and produces 133 million samples per second, which is seven times faster than a
2.6GHz Pentium 4 PC.
Figure 1.4: The BenONE board from Nallatech used to run our LDPC simulation
experiments.
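The Box-Muller transform and central-limit smoothing described above can be sketched in software as follows (this sketch uses the standard √(−2 ln u) form in floating point, whereas the hardware evaluates the functions with fixed-point piecewise polynomials):

```python
import math
import random

def box_muller(rng):
    """One Box-Muller output: two uniforms -> one Gaussian sample, via
    the two functions the hardware must evaluate, a square root of a
    logarithm and a cosine."""
    u1 = 1.0 - rng.random()   # shift into (0, 1] to avoid ln(0)
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def clt_sample(rng, k=4):
    """Central-limit smoothing: averaging k outputs (rescaled back to
    unit variance) masks small quantization/approximation errors."""
    return sum(box_muller(rng) for _ in range(k)) / math.sqrt(k)

rng = random.Random(1)
samples = [clt_sample(rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples) - mean * mean
```

In the hardware, the accuracy of the two function evaluators directly limits how faithfully the tails of the distribution are reproduced, which is why the segmentation method of Chapter 5 matters here.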
The second method is based on the Wallace method [180]. Wallace proposed
a fast algorithm for generating normally distributed pseudo-random numbers
which generates the target distributions directly using their maximal-entropy
properties. This algorithm is particularly suitable for high throughput hardware
implementation since no transcendental functions such as √x, log(x) or sin(x)
are required. The Wallace method starts with a pool of normally distributed
random numbers; through a series of transformation steps, a new pool of
normally distributed random numbers is generated. An implementation
running at 155MHz on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 3% of the
device and produces 155 million samples per second.
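The pool-to-pool idea can be illustrated with a simplified Wallace-style pass (a sketch, not Wallace's exact algorithm: here a random permutation followed by normalized 4 × 4 Hadamard transforms, which are orthogonal and therefore preserve both the sum of squares and the normal distribution):

```python
import random

def wallace_step(pool, rng):
    """One simplified Wallace-style pass over a pool whose length is a
    multiple of 4: mix random 4-element groups with an orthogonal
    (Hadamard) transform, producing new normal variates without
    evaluating any transcendental function."""
    idx = list(range(len(pool)))
    rng.shuffle(idx)                      # random grouping of the pool
    out = [0.0] * len(pool)
    for g in range(0, len(idx), 4):
        a, b, c, d = (pool[idx[g + j]] for j in range(4))
        h = 0.5  # 1/2 makes the 4x4 Hadamard matrix orthonormal
        out[idx[g]]     = h * (a + b + c + d)
        out[idx[g + 1]] = h * (a - b + c - d)
        out[idx[g + 2]] = h * (a + b - c - d)
        out[idx[g + 3]] = h * (a - b - c + d)
    return out

rng = random.Random(0)
pool = [rng.gauss(0, 1) for _ in range(1024)]
new_pool = wallace_step(pool, rng)
# The orthogonal transform preserves the pool's energy (sum of squares):
print(abs(sum(x * x for x in pool) - sum(x * x for x in new_pool)) < 1e-6)
```

Because each output is a linear combination of recent pool values, successive pools are not fully independent, which is the source of the correlations examined in Chapter 8.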
[Diagram: a data source feeds the LDPC encoder, driven by a code definition; the Gaussian noise generator, driven by the SNR setting, corrupts the codewords, which pass through a flow control buffer to the LDPC decoder, driven by the same code definition; the decoder output is compared against the source data and errors are recorded.]
Figure 1.5: Our LDPC hardware simulation framework.
The outputs of the two noise generators accurately model a true Gaussian
PDF even at large multiples of σ (the tails of the Gaussian distribution). Their
properties are explored using: (a) several different statistical tests, including
the chi-square test and the Anderson-Darling test [32], and (b) an application for
decoding of LDPC codes. Although the Wallace design has smaller area and is
faster than the Box-Muller design, it has slight correlations between successive
transformations, which may be undesirable for certain types of simulations. We
examine design parameter optimizations to reduce such correlations.
[Diagram: the H matrix is brought into ALT form by a software preprocessor; the hardware encoder then turns message blocks into codewords.]
Figure 1.6: LDPC encoding framework.
1.4.3 LDPC Encoding
We describe a flexible hardware encoder for regular and irregular Low-Density
Parity-Check (LDPC) codes. Although LDPC codes achieve better performance
and lower decoding complexity than Turbo codes, a major drawback is their
apparently high encoding complexity: whereas Turbo codes can be encoded in
linear time, a straightforward implementation for an LDPC code has complexity
quadratic in the block length due to dense matrix-vector multiplication. Using
an efficient encoding method proposed by Richardson and Urbanke [152], we
present a hardware LDPC encoder with linear encoding complexity. The encoder
is flexible, supporting arbitrary H matrices, rates and block lengths. We develop
a software preprocessor to bring the parity-check matrix H into an approximate
lower triangular form. A hardware architecture with an efficient memory organi-
zation for storing and performing computations on sparse matrices is proposed.
An implementation of a rate-1/2, length-2000 irregular LDPC code encoder
on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 4% of the device. It runs at
143MHz and has a throughput of 45 million codeword bits per second (or 22 mil-
lion information bits per second) with a latency of 0.18ms. An implementation
of 16 instances of the encoder on the same device at 82MHz is capable of 410 mil-
lion codeword bits per second, 80 times faster than an Intel Pentium 4 2.4GHz
PC. The design flow of our LDPC encoder is illustrated in Figure 1.6. This
block is placed in front of the noise generator in our LDPC simulation framework
(Figure 1.5).
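The core encoder operation, sparse matrix-vector multiplication over GF(2), can be sketched as follows (the edge-list row storage loosely mirrors the hardware's memory organization for sparse matrices; the small H is illustrative only):

```python
def gf2_sparse_mvm(rows, v):
    """Sparse matrix-vector product over GF(2). Each row is stored as
    the column positions of its ones, so multiplication reduces to
    XORing the selected bits of v -- the operation the encoder's
    memory organization is built around."""
    return [sum(v[j] for j in row) & 1 for row in rows]

# H * x^T for a small sparse H and a bit vector x:
H_rows = [[0, 2, 5], [1, 3, 4], [0, 1, 5], [2, 3, 4]]
x = [1, 0, 1, 1, 0, 1]
print(gf2_sparse_mvm(H_rows, x))  # [1, 1, 0, 0]
```

Because only the positions of the ones are stored, both the memory footprint and the work per row scale with the number of edges rather than the block length squared, which is what makes linear-complexity encoding possible.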
CHAPTER 2
Background
2.1 Introduction
The purpose of this chapter is to present the background material and related
work of this thesis. Section 2.2 introduces the basics of FPGAs and the design
tools used in this thesis. Section 2.3 introduces six of the most popular methods
for approximating functions and the existing work. Section 2.4 discusses various
issues related to function evaluation, such as range reduction. Section 2.5 presents
different ways of generating Gaussian noise and explores the existing work in this
area. Finally, Section 2.6 introduces the basics of LDPC codes, LDPC encoding,
describes Richardson and Urbanke’s (RU) method for efficiently encoding LDPC
codes and looks at previous work on hardware related issues on LDPC codes.
2.2 FPGAs
2.2.1 Introduction
Field-Programmable Gate Arrays (FPGAs) have long been used for glue logic
and prototyping. More recently, they are being used for many real-life appli-
cations including communications [93], encryption [173], video image process-
ing [168], [175], medical imaging [72], network security [96] and numerical
computations [104].
Figure 2.1: Simplified view of a Xilinx logic cell. A single slice contains 2.25 logic
cells.
FPGAs can potentially approach the execution speed of application specific
hardware with the rapid programming time of microprocessors. In recent years,
the size of FPGAs has followed Moore’s law: the number of logic gates doubles
every 18 months. FPGAs can exploit improvements following Moore’s law better
than microprocessors because of their simpler and more regular structure.
The fundamental building block of Xilinx FPGAs is the logic cell [118]. A logic
cell comprises a 4-input look-up table (which can also act as a 16× 1 RAM or a
16-bit shift register), a multiplexer and a register. A simplified view of a logic cell
is depicted in Figure 2.1. Two logic cells are paired together in an element called
a slice. A slice contains additional resources such as multiplexers and carry logic
to increase the efficiency of the architecture. These extra resources are equivalent
to having more logic cells, and therefore a slice is counted as being equivalent to
2.25 logic cells. Recent-generation reconfigurable hardware has a large number
of slices. For instance, the Xilinx Virtex-II XC2V4000-6 has 23040 slices.
The architecture of a typical FPGA is illustrated in Figure 2.2. In general,
Figure 2.2: Architecture of a typical FPGA.
an FPGA will have an array of configurable logic blocks (which contain two
or four slices depending on the FPGA family), programmable wires, and pro-
grammable switches to realize any function out of the logic blocks and implement
any interconnection topology. Programming is done using one of the many popular
technologies such as SRAM cells, Antifuses, EPROM transistors and EEPROM
transistors. In addition to logic blocks, state-of-the-art FPGAs such as the Xilinx
Virtex-II or Virtex-4 devices contain embedded hardware elements for memory,
multiplication, multiply-and-add and even a number of hard microprocessor cores
(such as the IBM PowerPC) [189].
The long IC fabrication time is completely eliminated for these devices and
design realization times are only a few hours. The idea of user-programmability is
very attractive; most ASIC vendors now prefer FPGAs for low-cost prototyping and
fine tuning of designs before fabrication. Also, from a marketing point of view, the
FPGA technology allows quick product announcements, which is commercially
attractive. The two major FPGA vendors are Altera and Xilinx. A good review
on configurable computing and FPGAs is given in [28].
2.2.2 Design Tools
The following three FPGA design tools are used for the implementations pre-
sented in this thesis:
• ASC [123], A Stream Compiler for FPGAs, adopts C++ custom types and
operators to provide a programming interface at the algorithmic, architectural,
arithmetic and gate levels. As a unique feature,
all levels of abstraction are accessible from C++. This enables the user
to program on the desired level for each part of the application. Semi-
automated design space exploration further increases design productivity,
while supporting optimization at all available levels of abstraction. Object-
oriented design enables efficient code-reuse; ASC includes an integrated
arithmetic unit generation library, PAM-Blox II [121], which in turn builds
upon the PamDC [137] gate library. The elementary function evaluation
units in Chapters 3 and 4 are implemented with this tool.
• Handel-C [21] is based on ANSI-C with extensions to support flexible
width variables, signals, parallel blocks, bit-manipulation operations and
channel communication. A distinctive feature is that timing of the com-
piled circuit is fixed at one cycle per C assignment. This makes it easy for
programmers to know in which cycle a statement will be executed at the
expense of reducing the scope for optimization. It gives application devel-
opers the ability to schedule hardware resources manually, and Handel-C
tools generate the resulting designs automatically. The ideas of Handel-C
are based on work by Page and Luk in compiling Occam into FPGAs [134].
The Gaussian noise generator using the Box-Muller method in Chapter 6
is implemented with this tool.
• Xilinx System Generator [188] is a plug-in to the MATLAB Simulink
software [117] and provides bit-accurate models of FPGA circuits. It au-
tomatically generates synthesizable VHDL or Verilog code including a
testbench. Other unique capabilities include MATLAB m-code compila-
tion, fast system-level resource estimation, and high-speed hardware co-
simulation interfaces, both a generic JTAG interface [31] and PCI based co-
simulation for FPGA hardware platforms. The Xilinx Blockset in Simulink
enables bit-true and cycle-true modeling, and includes common parameter
blocks such as finite impulse response (FIR) filter, fast Fourier transform
(FFT), logic gates, adders, multipliers, RAMs, etc. Moreover, most of these
blocks utilize the Xilinx cores, which are highly optimized for Xilinx devices.
The function evaluator using the hierarchical segmentation method (HSM)
in Chapter 5, the Gaussian noise generator using the Wallace method in
Chapter 7, and the LDPC encoder in Chapter 8 are implemented with this
tool.
ASC designs are synthesized with PAM-Blox II and all others with Synplicity
Synplify Pro (versions 7.3 ∼ 7.5). Place-and-route for all designs is performed
with Xilinx ISE (versions 6.0 ∼ 6.2).
2.3 Function Evaluation Methods
Many FPGA applications including digital signal processing, computer graphics
and scientific computing require the evaluation of elementary or special purpose
functions. For applications that require low precision approximation at high
speeds, full look-up tables are often employed. However, this becomes imprac-
tical for precisions higher than a few bits, because the size of the table grows
exponentially with respect to the input size. Six well-known methods are de-
scribed below, which are better suited to high precision.
2.3.1 CORDIC
CORDIC is an acronym for COordinate Rotations DIgital Computer and offers
the opportunity to calculate desired functions in a rather simple and elegant way.
The CORDIC algorithm was first introduced by Volder [178] for the computation
of trigonometric functions, multiplication, division and data type conversion, and
later generalized to hyperbolic functions by Walther [182]. It has found its way
into diverse applications including the 8087 math coprocessor [38], the HP-35
calculator, radar signal processors and robotics.
It is based on simple iterative equations, involving only shift and add opera-
tions and was developed in an effort to avoid the time consuming multiply and
divide operations. The general CORDIC algorithm consists of the following three
iterative equations:
\[
\begin{aligned}
x_{k+1} &= x_k - m\,\delta_k y_k 2^{-k}\\
y_{k+1} &= y_k + \delta_k x_k 2^{-k}\\
z_{k+1} &= z_k - \delta_k \sigma_k
\end{aligned}
\]
The constants m, δk and σk depend on the specific computation being performed,
as explained below.
• m is either 0, 1 or −1. m = 1 is used for trigonometric and inverse trigono-
metric functions. m = −1 is used for hyperbolic, inverse hyperbolic, expo-
nential and logarithmic functions, as well as square roots. Finally, m = 0
is used for multiplication and division.
• δk is one of the following two signum functions:
\[
\delta_k = \operatorname{sgn}(z_k) = \begin{cases} 1, & z_k \ge 0 \\ -1, & z_k < 0 \end{cases}
\quad\text{or}\quad
\delta_k = -\operatorname{sgn}(y_k) = \begin{cases} 1, & y_k < 0 \\ -1, & y_k \ge 0 \end{cases}
\]
The first is often called the rotation mode, in which the z values are driven
to zero, whereas the second is the vectoring mode, in which the y values
are driven to zero. Note that δk requires nothing more than a comparison.
• The numbers σk are constants which depend on the value of m and are
stored in a table. For m = 1, σk = tan^{−1} 2^{−k}; for m = 0, σk = 2^{−k}; and for
m = −1, σk = tanh^{−1} 2^{−k}.
To use these equations, appropriate starting values x1, y1 and z1 must be given.
One of these inputs, say z1, might be the number whose hyperbolic sine we wish
to approximate, sinh(z1). In all cases, the starting values must be restricted to a
certain interval about the origin in order to ensure convergence. As the iterations
proceed, one of the variables tends to zero while another variable approaches the
desired approximation.
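As an illustration, the rotation mode with m = 1 can be sketched in floating-point software as below (the function name and iteration count are our own choices; a hardware implementation would use fixed-point arithmetic, where the multiplications by 2^{−k} become shifts):

```python
import math

def cordic_sincos(theta, iterations=32):
    """Approximate (cos(theta), sin(theta)) with CORDIC rotation mode (m = 1).

    theta must lie inside the convergence interval (about +/-1.74 rad).
    """
    # Precomputed rotation angles sigma_k = atan(2^-k).
    sigmas = [math.atan(2.0 ** -k) for k in range(iterations)]
    # Scale factor K = prod_k 1/sqrt(1 + 2^-2k); starting from (K, 0)
    # absorbs the gain of the un-normalized micro-rotations.
    K = 1.0
    for k in range(iterations):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * k))
    x, y, z = K, 0.0, theta
    for k in range(iterations):
        d = 1.0 if z >= 0 else -1.0                    # delta_k = sgn(z_k)
        x, y = x - d * y * 2.0 ** -k, y + d * x * 2.0 ** -k
        z -= d * sigmas[k]                             # drive z toward zero
    return x, y
```

Each iteration adds roughly one bit of accuracy, which is the linear convergence discussed below.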
The major disadvantage of the CORDIC algorithm is its linear convergence
resulting in an execution time which is linearly proportional to the number of
bits in the operands. In addition, CORDIC is limited to a relatively small set of
elementary functions. A comprehensive study of CORDIC algorithms on FPGAs
can be found in [3].
2.3.2 Digit-recurrence and On-line Algorithms
Digit-recurrence [41] and on-line algorithms [40] belong to the same type of meth-
ods for the approximation of functions in hardware, usually known as digit-by-
digit iterative methods, due to their linear convergence, which means that a fixed
number of bits of the result is obtained in each iteration. Implementations of this
type of algorithms are typically of low complexity, utilize small area and have rel-
atively large latencies. The fundamental choices in the design of a digit-by-digit
algorithm are the radix, the allowed digit values and the representation
of the partial remainder.
2.3.3 Bipartite and Multipartite Methods
The bipartite method, meaning that the table is divided into two parts, was
originally introduced by Das Sarma and Matula [159] with the aim of getting
accurate reciprocals. Improvements were suggested by Schulte and Stine [162],
[163], Muller [129], and generalizations from the bipartite to the multipartite
method are discussed by de Dinechin and Tisserand [34].
Assume an n-bit binary fixed-point system, and assume that n is a multiple
of 3 and n = 3k. We wish to design a table based implementation of function f .
A full look-up table would lead to a table of size n × 2n. Instead, we split the
input word x into three k-bit words x0, x1, and x2, that is,
\[
x = x_0 + x_1 2^{-k} + x_2 2^{-2k} \tag{2.1}
\]
where x0, x1 and x2 are multiples of 2^{−k} that are less than 1. The original bipartite
method consists in approximating the first order Taylor expansion
\[
f(x) = f(x_0 + x_1 2^{-k}) + x_2 2^{-2k} f'(x_0 + x_1 2^{-k}) + \frac{x_2^2\, 2^{-4k}}{2} f''(\xi), \tag{2.2}
\]
\[
\xi \in [x_0 + x_1 2^{-k},\, x]
\]
by
\[
f(x) = f(x_0 + x_1 2^{-k}) + x_2 2^{-2k} f'(x_0). \tag{2.3}
\]
That is, f(x) is approximated by the sum of two functions α(x0, x1) and β(x0, x2),
where
\[
\begin{aligned}
\alpha(x_0, x_1) &= f(x_0 + x_1 2^{-k})\\
\beta(x_0, x_2) &= x_2 2^{-2k} f'(x_0)
\end{aligned}
\]
The error of this approximation is roughly proportional to 2^{−3k}. Instead of di-
rectly tabulating function f , functions α and β are tabulated. Since they are
functions of 2k bits only, each of these tables has 2^{2n/3} entries. This results in
a total table size of 2n × 2^{2n/3} bits, which is a significant improvement over the
full look-up table. These methods basically exploit the symmetry of the Taylor
approximations and leading zeros in the table coefficients to reduce the look-up
table size. Although these methods yield significant improvements in table
size over direct table look-up, they can be inefficient for functions that are highly
non-linear [88].
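A small software sketch of the bipartite idea follows; the choice k = 4 and the dictionary representation of the tables are illustrative conveniences, not details from the cited papers:

```python
import math

K = 4                       # k bits per sub-word; input width n = 3k = 12 bits
STEP = 2.0 ** -K

def build_tables(f, fprime):
    """Tabulate alpha(x0, x1) = f(x0 + x1*2^-k) and
    beta(x0, x2) = x2 * 2^-2k * f'(x0).

    Each table has 2^(2k) entries instead of the 2^(3k) a full table needs.
    """
    alpha = {(i0, i1): f(i0 * STEP + i1 * STEP ** 2)
             for i0 in range(2 ** K) for i1 in range(2 ** K)}
    # x2 = i2 * 2^-k, so the beta coefficient is i2 * 2^-3k.
    beta = {(i0, i2): i2 * STEP ** 3 * fprime(i0 * STEP)
            for i0 in range(2 ** K) for i2 in range(2 ** K)}
    return alpha, beta

def bipartite_eval(alpha, beta, i0, i1, i2):
    """Approximate f at x = x0 + x1*2^-k + x2*2^-2k: two look-ups, one add."""
    return alpha[(i0, i1)] + beta[(i0, i2)]
```

Note that the evaluation needs no multiplier, which is what makes the method attractive in hardware.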
2.3.4 Polynomial Approximation
Polynomial approximation [58], [150] involves approximating a continuous func-
tion f with one or more polynomials p of degree d on a closed interval [a, b]. The
polynomials are of the form
\[
p(x) = c_d x^d + c_{d-1} x^{d-1} + \dots + c_1 x + c_0 \tag{2.4}
\]
and with Horner’s rule, this becomes
\[
p(x) = ((c_d x + c_{d-1})x + \dots)x + c_0 \tag{2.5}
\]
where x is the input. The aim is to minimize the distance ‖p − f‖. In our
work, we use minimax polynomial approximations, which involve minimizing the
maximum absolute error [128]. The distance for minimax approximations is:
\[
\|p - f\|_\infty = \max_{a \le x \le b} |f(x) - p(x)| \tag{2.6}
\]
Table 2.1: Maximum absolute and average errors for various first order polynomial
approximations to e^x over [−1, 1].
            Taylor   Legendre   Chebyshev   Minimax
Max. Error  0.718    0.439      0.372       0.279
Avg. Error  0.246    0.162      0.184       0.190
where [a, b] is the approximation interval. Many researchers rely on methods such
as Taylor series rather than minimizing the maximum absolute error. Table 2.1
shows the maximum and average errors of various first order polynomial approxi-
mations to ex over [−1, 1]. It can be seen that generally minimax gives the lowest
maximum error and Legendre provides the lowest average error. Therefore, when
low maximum absolute error is desired, minimax approximation should be used
(unless the polynomial coefficients are computed at run-time from stored func-
tion values [100]). The minimax polynomial is found in an iterative manner using
the Remez exchange algorithm [149], which is often used for determining optimal
coefficients for digital filters.
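The Horner scheme of Eq. (2.5) can be sketched as follows (the function name is ours):

```python
def horner(coeffs, x):
    """Evaluate p(x) = c_d x^d + ... + c_1 x + c_0 by Horner's rule.

    coeffs is ordered [c_d, c_{d-1}, ..., c_0], matching Eq. (2.5); a degree-d
    polynomial costs d multiplications and d additions, with no explicit
    powers of x.
    """
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc
```

For example, `horner([2.0, 3.0, 4.0], 5.0)` evaluates 2x² + 3x + 4 at x = 5 and returns 69.0. In hardware, the same recurrence maps naturally onto a chain of multiply-add units.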
Sidahao et al. [165] approximate functions over the whole interval with high
order polynomials. This polynomial-only approach has the advantage of low
memory requirements, but suffers from long latencies. In addition, it will not
generate acceptable results when the function is highly non-linear. Pineiro et
al. [147] divide the interval into several uniform segments. For each segment,
they store the second order minimax polynomial approximation coefficients, and
accumulate the partial terms in a fused accumulation tree. This scheme performs
well for the evaluation of elementary functions for moderate precisions (less than
24 bits).
2.3.5 Polynomial Approximation with Non-uniform Segmentation
Approximations using uniform segments are suitable for functions with relatively
linear regions, but are inefficient for non-linear functions, especially when the
function varies exponentially. It is desirable to choose the boundaries of the
segments to cater for the non-linearities of the function. Highly non-linear regions
will need smaller segments than linear regions. This approach minimizes the
amount of storage required to approximate the function, leading to more compact
and efficient designs. A number of techniques that utilize non-uniform segment
sizes to cater for such non-linearities have been proposed in the literature. Cantoni [18]
uses optimally placed segments and presents an algorithm to find such segment
boundaries. However, although this approach minimizes the number of segments
required, such arbitrarily placed segments are impractical for actual hardware
implementation, since the hardware circuitry to find the right segment for a
given input would be too complex. Combet et al. [27] and Mitchell Jr. [75] use
segments that increase by powers of two to approximate the base two logarithm.
Henkel [61] divides the interval into four arbitrarily placed segments based on
the non-linearity of the function. The address for a given input is approximated
by another function that approximates the segment number for an input. This
method only works if the number of segments is small and the desired accuracy is
low. Also, the function for approximating the segment addresses is non-linear, so
in effect the problem has been moved into a different domain. Coleman et al. [26]
divide the input interval into seven P2S (powers of two segments: segments with
the size varying by increasing or decreasing powers of two) that decrease by
powers of two, and employ constant numbers of US (uniform segments: segments
with the same sizes) nested inside each P2S, which we call P2S(US). Lewis [100]
divides the interval into US that vary by multiples of three, and each US has
variable numbers of uniform segments nested inside, which we call US(US). However,
in both cases the choice of inner and outer segment numbers is left to the user, and
a more efficient segmentation could be achieved using a systematic segmentation
scheme.
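To illustrate why powers-of-two boundaries are hardware-friendly, the following hypothetical sketch decodes a segment address for a simple P2S(US) layout using nothing more than a leading-zero count and a bit slice; this is a simplified stand-in and not the segmentation scheme developed later in this thesis:

```python
def p2s_us_address(x, n_bits, us_bits):
    """Decode (outer, inner) segment indices for an n_bits-wide unsigned x.

    Outer P2S index = leading-zero count, so segments halve in size toward
    zero (cheap priority-encoder logic in hardware). Inner index = the
    us_bits immediately below the leading one, giving 2**us_bits uniform
    segments nested inside each P2S.
    """
    assert 0 <= x < (1 << n_bits)
    if x == 0:
        return n_bits, 0                  # the segment nearest zero
    top = x.bit_length() - 1              # position of the leading one
    outer = n_bits - 1 - top              # leading-zero count
    frac = x - (1 << top)                 # bits below the leading one
    inner = (frac << us_bits) >> top if top else 0
    return outer, inner
```

An arbitrary-boundary scheme would instead need a comparison against every stored boundary, which is exactly the circuit-complexity problem noted above.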
2.3.6 Rational Approximation
Rational approximation offers efficient evaluation of analytic functions repre-
sented by the ratio of two polynomials:
\[
f(x) = \frac{c_n x^n + c_{n-1} x^{n-1} + \dots + c_1 x + c_0}{d_m x^m + d_{m-1} x^{m-1} + \dots + d_1 x + d_0} \tag{2.7}
\]
In general, rational approximations are the most efficient method to evaluate
functions on a microprocessor. However, they are less attractive for FPGA im-
plementations due to the presence of the divider. Typical polynomials for
floating-point single precision have fewer than ten coefficients [122]. Hardware
implementations of rational approximation are studied in [79].
2.4 Issues on Function Evaluation
In this section, we describe various issues related to function evaluation. We first
describe approximation methods and applications for elementary and compound
functions. Second, we examine the dilemma designers face, when the optimum
function evaluation method for a given metric is required. Third, range reduction
is explained, which is a technique used to transform the inputs of elementary
functions into a smaller linear interval. Finally, we look at the types of errors
that can arise when attempting to evaluate functions in hardware.
2.4.1 Evaluation of Elementary and Compound Functions
The evaluation of elementary functions [128] such as sin(x) or log(x) has re-
ceived significant interest in the research community. They are typically com-
puted by CORDIC [178], table look-up and addition methods [162], [167], or
polynomial/rational approximations with one or more uniform segments [161].
For the evaluation of elementary functions, range reduction techniques [128] such
as those presented in [25] and [182] are used to bring the input within a linear
range. In contrast, there has been little attention on the efficient approximation
of compound functions such as √(−log(x)) or x log(x) for special purpose
applications. Examples of such applications include N-body simulation [63], channel
coding [74], Gaussian noise generation [86] and image registration [158]. In prin-
ciple, these compound functions can be evaluated by splitting them into several
elementary functions, but this approach would result in long delay, propaga-
tion of rounding errors and possibilities of catastrophic cancellations [55]. Range
reduction is not feasible for compound functions (unless the sub-functions are
computed one by one), so highly non-linear regions of a compound function need
to be handled as well.
2.4.2 Approximation Method Selection
We can use polynomials and/or look-up tables for approximating a function f(x)
over a fixed range [a, b]. At one extreme, the entire function approximation
can be implemented as a table look-up. At the other extreme, the function
approximation can be implemented as a polynomial approximation with function-
specific coefficients. Between these two extremes, we use a table look-up to obtain
the appropriate polynomial coefficients followed by polynomial evaluation. This
table-with-polynomial method partitions the total approximation into several
segments.
Figure 2.3: Certain approximation methods are better than others for a given
metric at different precisions.
In [122], Mencer and Luk show that for a given accuracy requirement it is
possible to plot the area, latency, and throughput tradeoff and thus identify
the optimal function evaluation method. The optimality depends on further
requirements such as available area, required latency and throughput. Looking
at Figure 2.3, if one desires the metric to be low (e.g. area, latency or power),
one should use method 1 for bitwidths lower than x1, method 2 for bitwidths
between x1 and x2, and method 3 for bitwidths higher than x2. Figure 2.4 shows
the results from [122], comparing area requirements of various approximation
methods with varying precision.
2.4.3 Range Reduction
We evaluate an elementary function f(x), where x has a given range [a, b] and f(x)
has a precision requirement. The evaluation typically consists of three steps [128]:
Figure 2.4: Area comparison in terms of configurable logic blocks for different
methods with varying data widths [122].
(1) range reduction, reducing x over the interval [a, b] to a more convenient y
over a smaller interval [a′, b′],
(2) function evaluation on the reduced interval, and
(3) range reconstruction: expansion of the result back to the original result
range.
There are two main types of range reduction:
• additive reduction: y is equal to x−mC;
• multiplicative reduction: y is equal to x/C^m
where integer m and a constant C are defined by the evaluated function.
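A software sketch of additive reduction for sin(x) with C = π/2 is shown below (the helper names are ours; production libraries use extended-precision representations of C so that the subtraction x − mC does not lose accuracy for large x):

```python
import math

def reduce_sin(x):
    """Additive range reduction y = x - m*C with C = pi/2.

    Returns (y, q) with y in roughly [-pi/4, pi/4] and q = m mod 4, the
    quadrant needed later for range reconstruction.
    """
    C = math.pi / 2.0
    m = round(x / C)
    return x - m * C, m % 4

def sin_via_reduction(x):
    """Steps (1)-(3): reduce, evaluate on the small interval, reconstruct."""
    y, q = reduce_sin(x)
    # sin(y + q*pi/2) for q = 0, 1, 2, 3:
    return (math.sin(y), math.cos(y), -math.sin(y), -math.cos(y))[q]
```

The reconstruction step is just a sign flip and a sin/cos swap, which is why the quadrant q must be carried alongside the reduced argument.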
Range reduction is widely studied, especially for CORDIC [182] and floating-
point systems on microprocessors [25]. Li et al. [102] present theorems that prove
the correctness and effectiveness of commonly used range reduction techniques.
Lefevre and Muller [97] suggest a method for performing range reduction on the
fly: overlapping the computation with the reception of the input bits for bit-serial
systems. Defour et al. [35] present an algorithm suitable for small and medium
sized arguments in IEEE double precision. Their method is significantly faster
than Payne and Hanek’s modular range reduction method [128], at the expense
of larger table sizes. In contrast, range reduction which adapts to different input
ranges and precisions has received little attention.
2.4.4 Types of Errors
Classically, there are three different kinds of error which affect the global error
when evaluating a function:
• The input quantization error measures the fact that an input number usu-
ally represents a small interval centered around this number.
• The approximation error measures the difference between the pure mathe-
matical function and the approximate mathematical function that will be
used to evaluate it.
• Output rounding errors measure the difference between the approximated
mathematical function and the closest machine-representable value.
All of these errors need to be taken into account when approximating a func-
tion for a given output error requirement.
2.5 Gaussian Noise Generation
Sequences of random numbers with Gaussian probability distribution functions
are needed to simulate a wide variety of natural phenomena [51], [183]. Applica-
tions of such sequences include channel code evaluation [74], watermarking [39],
oscilloscope testing [176], simulation of economic systems [6], [156], financial mod-
eling [14] and molecular dynamics simulations [76].
Previous work on Gaussian noise generation can be divided into two types:
the generation of Gaussian noise using a combination of analog components [144],
[155], [196], and the generation of pseudo random noise using purely digital com-
ponents [5], [12], [23], [33], [53], [56], [69], [98], [127], [135], [170], [180]. The
first method tends to be practical only in highly restricted circumstances, and
suffers from its own problems with noise accuracy. The second method is often
more desirable, because of its flexibility and high performance. In addition, when
simulating communication systems, we may wish to use noise sequences that are
pseudo-random so that the same noise can be adopted for different systems. Also,
if the system fails we may wish to know which noise samples cause the system to
fail. Comprehensive but rather dated comparisons of such digital methods can
be found in [4], [130] and [143].
Digital methods for generating random Gaussian variables are almost al-
ways based on transformations or operations on uniform random variables [160].
The most widely used methods are: various rejection-acceptance methods [2],
[98], [101], [114], [115], the use of the central limit theorem [78], the inversion
method [65] and the Box-Muller method [13]. The rejection-acceptance methods,
while popular in software implementations, contain conditional loops such that
the output rates are not constant, making them less amenable to a hardware
simulation environment. The central limit theorem can, in principle, be used
to produce Gaussian samples, if a suitable number of samples are involved. In
practice however, approximating a Gaussian probability density function (PDF)
to high accuracy using the central limit theorem alone would require an imprac-
tically large number of samples.
The Box-Muller method, either alone or in combination with the central limit
theorem, has been the focus of most efforts in hardware implementation. For
example, Boutillon et al. [12] present a hardware Gaussian noise generator on
an Altera FPGA based on the Box-Muller algorithm in conjunction with the
central limit theorem. Their design occupies 437 logic cells on an Altera Flex
10K100EQC240-1 FPGA, and outputs 24.5 million noise samples per second at
a clock speed of 98MHz. Recently, Xilinx have released the “Additive White
Gaussian Noise (AWGN) Core 1.0” [186], which is based on the Boutillon et
al. architecture. The drawbacks of these designs are revealed by statistical tests
applied to evaluate the noise samples produced. Designs which fail such statistical
tests are inadequate for high quality hardware communications simulations such
as Low-Density Parity-Check (LDPC) codes [48].
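The Box-Muller transform itself is short; the hardware cost lies in evaluating the compound functions √(−2 ln u₁) and sin/cos(2πu₂). A minimal software sketch (the uniform inputs must avoid zero):

```python
import math

def box_muller(u1, u2):
    """One Box-Muller step: two uniforms in (0, 1] -> two Gaussian samples.

    The pair (r*cos(t), r*sin(t)) is jointly Gaussian with zero mean and
    unit variance; r = sqrt(-2 ln u1) and t = 2*pi*u2 are the compound
    functions a hardware implementation must approximate.
    """
    r = math.sqrt(-2.0 * math.log(u1))
    t = 2.0 * math.pi * u2
    return r * math.cos(t), r * math.sin(t)
```

Because every pair of uniforms yields exactly two Gaussian samples, the output rate is constant, unlike rejection-based schemes.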
Chen et al. [22] use a cumulative distribution function (CDF) conversion table
to transform uniform random variables to Gaussian random variables. They have
implemented the Gaussian noise generator as part of a readback-signal generator
on a Xilinx Virtex-E XCV1000E FPGA at 70MHz. The method they employ is
basically the inversion method [65] implemented with a look-up table. Again, our
experiments show that the use of direct table look-up for the inversion method can
produce noise samples of insufficient quality for the communications applications
that we are targeting. Fan et al. [43] present a hardware Gaussian noise generator
based on the polar method [78] in conjunction with the central limit theorem.
Their design is implemented on an Altera Mercury EP1M120F484C7 FPGA; it
takes up 336 logic elements and has a clock speed of 73MHz, generating a sample
every clock. The polar method is a variant of the Box-Muller method and belongs to
the class of rejection-acceptance methods, hence the output rate is not constant.
In order to overcome this problem, they employ a FIFO buffer with the read
speed set to half of the write speed. However, detailed statistical analyses have
not been performed to confirm the quality of the noise samples produced.
All of the methods described above produce normal variables by performing
operations on uniform variables. In contrast, Wallace proposes an algorithm
that completely avoids the use of uniform variables, operating instead using an
evolving pool of normal variables to generate additional normal variables [180].
The approach draws its inspiration from uniform random number generators that
generate one or more new uniform variables from a set of previously generated
uniform variables. Given a set of normally distributed random variables, a new
set of normally distributed random variables can be generated by applying a
linear transformation. Brent [16] has implemented a fast vectorized Gaussian
random number generator using the Wallace method on the Fujitsu VP2200 and
VPP300 vector processors. In [17] and [157], Brent and Rub outline possible
correlation problems associated with the Wallace method and discuss ways of
avoiding them.
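A much-simplified software sketch of one Wallace-style pass is shown below; the pool size, the shuffle and the 2×2 transform are illustrative choices, and Wallace's actual algorithm uses a different mixing scheme:

```python
import math

def wallace_step(pool, rng):
    """One simplified Wallace-style pass over a pool of normal variables.

    The pool is shuffled and a 2x2 orthogonal transform is applied to each
    pair. No uniform-to-Gaussian conversion is involved; orthogonality
    preserves the sum of squares, so the pool stays approximately N(0, 1).
    Assumes an even-sized pool.
    """
    rng.shuffle(pool)
    s = 1.0 / math.sqrt(2.0)
    out = []
    for i in range(0, len(pool) - 1, 2):
        a, b = pool[i], pool[i + 1]
        out.append(s * (a + b))       # rows of an orthogonal matrix:
        out.append(s * (a - b))       # [s  s; s -s], norm-preserving
    return out
```

The shuffling is essential: without it, repeated application of the same fixed transform would introduce exactly the correlations discussed in [17] and [157].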
2.6 LDPC Codes
2.6.1 Basics of LDPC Codes
Low-Density Parity-Check (LDPC) codes [48], [49] enable performance extremely
close to the best possible as determined by the Shannon capacity formula. For
the additive white Gaussian noise (AWGN) channel, the best code of rate 1/2
Figure 2.5: Comparison of (3,6)-regular LDPC code, Turbo code and optimized
irregular LDPC code [151].
presented in [151] has a threshold within 0.06dB of capacity, and their simu-
lation results show that the best LDPC code of length 10^6 achieves a bit error
probability of 10^{−6} less than 0.13dB away from capacity, beating the best codes
previously known. A performance comparison of various codes in terms of BER versus
SNR is illustrated in Figure 2.5. All codes are of length 10^6 and rate 1/2. The
BER for the AWGN channel is shown as a function of Eb/N0 (SNR per bit in
dB).
The communication system model we consider comprises an LDPC encoder,
a decoder and an AWGN channel as shown in Figure 2.6. Message bits are given
as inputs to the LDPC encoder, which creates parity bits for each block of message
bits, generating codewords. A binary antipodal modulation such as binary phase shift
keying (BPSK) is assumed at the transmitter. The signal gets corrupted by
AWGN noise during the transmission over the channel. At the receiver end,
the demodulator demodulates the received signal, filters it and performs A/D
conversion on it. This is further fed to the LDPC decoder, which iteratively
decodes the received codeword block and provides decoded bits at the output
end.
Figure 2.6: LDPC communication system model.
As originally suggested by Tanner [172], LDPC codes are well represented
by bipartite graphs in which one set of nodes, the variable nodes, corresponds to
elements of the codeword and the other set of nodes, the check nodes, corresponds
to the set of parity-check constraints which define the code. Regular LDPC codes
are those for which all nodes of the same type have the same degree. For example,
a (3,6)-regular LDPC code has a graphical representation in which all variable
nodes have degree three and all check nodes have degree six. The bipartite graph
determining such a code is shown in Figure 2.7.
Irregular LDPC codes are introduced in [108] and [107] and are further stud-
ied in [105], [106] and [111]. For such an irregular LDPC code, the degrees of each
set of nodes are chosen according to some distribution. Thus, an irregular LDPC
code might have a graphical representation in which half the variable nodes have
degree three and half have degree five, while half the constraint nodes have degree
six and half have degree eight. Luby et al. [107] formally showed that properly
constructed irregular codes can approach the channel capacity more closely than
regular codes. LDPC codes exhibit an asymptotically better performance than
Figure 2.7: A bipartite graph of a (3,6)-regular LDPC code of length ten and
rate 1/2. There are ten variable nodes and five check nodes. For each check node
Ci the sum (over GF(2)) of all adjacent variable nodes is equal to zero.
Turbo codes and admit a wide range of tradeoffs between performance and com-
plexity.
LDPC codes are linear codes. Hence, they can be expressed as the null space
of a parity-check matrix H, i.e., x is a code word if and only if
\[
Hx^T = 0^T. \tag{2.8}
\]
The sparseness of H enables efficient (sub-optimal) decoding, while the random-
ness ensures (in the probabilistic sense) a good code. The H matrix corresponding
to the bipartite graph in Figure 2.7 is shown below. Note that in this example,
H is not sparse; it is just for illustration.
\[
H = \begin{pmatrix}
1 & 1 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 1 & 1 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 & 1
\end{pmatrix} \tag{2.9}
\]
2.6.2 LDPC Encoding
LDPC codes are linear block codes. Encoding of such codes uses the following
property:
\[
Hx^T = 0^T \tag{2.10}
\]
where x represents the codeword and H represents the parity-check matrix. A
straightforward encoding scheme requires three steps: a) Gaussian elimination
to transform the H matrix into a lower triangular form (Figure 2.8), b) split
x into information bits and parity bits, i.e., x = (s, p1, p2) where s is the vector of
information bits and p1, p2 are vectors of parity bits, c) solve the equation Hx^T = 0
using forward-substitution. It takes about O(n^3) operations to perform Gaussian elimina-
tion. Since afterwards the H matrix will no longer be sparse, it takes O(n^2),
or more precisely, n^2 r(1−r)/2 XOR operations for the actual encoding, where r is
the code rate [151]. The code rate is the ratio of information bits to codeword
bits and has a value between 0 and 1. In order to reduce the quadratic com-
plexity, Richardson and Urbanke [152] took advantage of the sparsity of the H
matrix. They found that in most cases the encoding complexity is either linear, or
quadratic with a small constant and thus quite manageable. For example, for a (3,6)-regular code of length
n, even though the complexity is still quadratic, the actual number of operations
required is only 0.0172n^2 + O(n). Since 0.0172 is a small number, the
complexity of the encoder is still manageable for large n.

Figure 2.8: An equivalent parity-check matrix in lower triangular form.
2.6.3 RU LDPC Encoding Method
In this section, we describe the Richardson and Urbanke (RU) algorithm for con-
structing efficient encoders for LDPC codes as presented in [152]. The efficiency
of the encoder arises from the sparseness of the parity-check matrix H and the
algorithm can be applied to any ‘sparse’ H. Although our example is binary,
the algorithm applies generally to matrices H whose entries belong to a field F .
We assume throughout that the rows of H are linearly independent. If the rows
are linearly dependent, then the algorithm which constructs the encoder will de-
tect the dependency and either one can choose a different matrix H, or one can
eliminate the redundant rows from H in the encoding process.
Assume we are given an m× n parity-check matrix H over F . By definition,
the associated code consists of the set of n-tuples x over F such that
Hx^T = 0^T. (2.11)
As briefly discussed in the previous section, the most straightforward way of
constructing an encoder for such a code is the following. By means of Gaussian
Figure 2.9: The parity-check matrix in approximate lower triangular form.
elimination, bring H into an equivalent lower triangular form as shown in Fig-
ure 2.8. Split the vector x into a systematic part s, s ∈ F^{n−m}, and a parity part
p, p ∈ F^m, such that x = (s, p). Construct a systematic encoder as follows: i) Fill
s with the (n − m) desired information symbols. ii) Determine the m parity-check
symbols using back-substitution. More precisely, for l ∈ [m] calculate

p_l = Σ_{j=1}^{n−m} H_{l,j} s_j + Σ_{j=1}^{l−1} H_{l,j+n−m} p_j.   (2.12)
Bringing the matrix H into the desired form requires O(n^3) operations of pre-
processing. The actual encoding then requires O(n^2) operations since, in general,
after the preprocessing the matrix will no longer be sparse.
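Equation (2.12) maps directly onto a short routine. The sketch below is illustrative (the 3 × 6 matrix is a made-up toy example, not from the thesis): it assumes H is already in lower triangular form with a unit diagonal in its parity part, and computes the parity bits over GF(2), where the minus signs implicit in (2.12) vanish:

```python
def encode_lower_triangular(H, s):
    """Parity bits from (2.12): H is m x n in lower triangular form
    (unit diagonal in its right m x m part); s holds the n-m info bits."""
    m, n = len(H), len(H[0])
    k = n - m
    p = []
    for l in range(m):
        bit = sum(H[l][j] * s[j] for j in range(k)) % 2               # info-bit part
        bit = (bit + sum(H[l][k + j] * p[j] for j in range(l))) % 2   # earlier parities
        p.append(bit)
    return s + p   # codeword x = (s, p)

# Toy example: the right-hand 3x3 block is lower triangular with unit diagonal.
H = [[1,1,0, 1,0,0],
     [0,1,1, 1,1,0],
     [1,0,1, 0,1,1]]
x = encode_lower_triangular(H, [1, 1, 0])
# Every check of H is satisfied by the resulting codeword.
assert all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in H)
```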
Given that the original parity-check matrix H is sparse, one might wonder if
encoding can be accomplished in O(n). As will be shown, typically for codes
which allow transmission at rates close to capacity, linear time encoding is indeed
possible.
Assume that by performing row and column permutations only we can bring
the parity-check matrix into the form indicated in Figure 2.9. We say that H is
in approximate lower triangular form. Note that since this transformation was
accomplished solely by permutations, the matrix is still sparse. More precisely,
assume that we bring the matrix into the form

H = ⎡ A  B  T ⎤
    ⎣ C  D  E ⎦   (2.13)

where A is of size (m − g) × (n − m), B is (m − g) × g, C is g × (n − m), D is
g × g, and finally, E is g × (m − g). Further, all these matrices are sparse and
T is lower triangular with ones along the diagonal. Multiplying this matrix from
the left by

⎡ I       0 ⎤
⎣ −ET⁻¹   I ⎦   (2.14)

we get

⎡ A             B             T ⎤
⎣ −ET⁻¹A + C    −ET⁻¹B + D    0 ⎦   (2.15)
Let x = (s, p1, p2), where s denotes the systematic part, p1 and p2 combined
denote the parity part, p1 has length g, and p2 has length (m − g). The defining
equation Hx^T = 0^T splits naturally into two equations, namely

As^T + Bp1^T + Tp2^T = 0   (2.16)

and

(−ET⁻¹A + C)s^T + (−ET⁻¹B + D)p1^T = 0.   (2.17)

Define φ := −ET⁻¹B + D and assume for the moment that φ is nonsingular.
The general case will be discussed shortly. Then from (2.17) we conclude that

p1^T = −φ⁻¹(−ET⁻¹A + C)s^T.   (2.18)

Hence, once the g × (n−m) matrix −φ⁻¹(−ET⁻¹A + C) has been precomputed,
the determination of p1 can be accomplished in complexity O(g × (n−m)) simply
by performing a multiplication with this matrix. This complexity can be further
reduced as shown in Table 2.2. Rather than precomputing −φ−1(−ET−1A + C)
and then multiplying with sT , we can determine p1 by breaking the computation
into several smaller steps, each of which is efficiently computable.
To this end, we first determine As^T, which has complexity O(n) since A is
sparse. Next, we multiply the result by T⁻¹. Since T⁻¹[As^T] = y^T is equivalent
to the system [As^T] = Ty^T, this can also be accomplished in O(n) by back-
substitution, since T is lower triangular and also sparse. The remaining steps are
fairly straightforward. It follows that the overall complexity of determining p1 is
O(n + g²). In a similar manner, noting from (2.16) that p2^T = −T⁻¹(As^T + Bp1^T),
we can accomplish the determination of p2 in complexity O(n), as shown step by
step in Table 2.3.
Table 2.2: Efficient computation of p1^T = −φ⁻¹(−ET⁻¹A + C)s^T.

Operation | Comment | Complexity
As^T | multiplication by sparse matrix | O(n)
T⁻¹[As^T] | T⁻¹[As^T] = y^T ⇔ [As^T] = Ty^T | O(n)
−E[T⁻¹As^T] | multiplication by sparse matrix | O(n)
Cs^T | multiplication by sparse matrix | O(n)
[−ET⁻¹As^T] + [Cs^T] | addition | O(n)
−φ⁻¹[−ET⁻¹As^T + Cs^T] | multiplication by dense g × g matrix | O(g²)
A summary of the RU encoding procedure is given in Table 2.4. It consists of
two steps: a preprocessing step and the actual encoding step. In the preprocess-
ing step, we first perform row and column permutations to bring the parity-check
Table 2.3: Efficient computation of p2^T = −T⁻¹(As^T + Bp1^T).

Operation | Comment | Complexity
As^T | multiplication by sparse matrix | O(n)
Bp1^T | multiplication by sparse matrix | O(n)
[As^T] + [Bp1^T] | addition | O(n)
−T⁻¹[As^T + Bp1^T] | −T⁻¹[As^T + Bp1^T] = y^T ⇔ −[As^T + Bp1^T] = Ty^T | O(n)
matrix into approximate lower triangular form with as small a gap g as possible.
We also need to check whether φ := −ET⁻¹B + D is nonsingular. Rather than
premultiplying by the matrix

⎡ I       0 ⎤
⎣ −ET⁻¹   I ⎦

this task can be accomplished efficiently by Gaussian elimination. If after
clearing the matrix E the resulting
matrix φ is seen to be singular, we can simply perform further column permu-
tations to remove this singularity. This is always possible when H is not rank
deficient, as assumed. The actual encoding then entails the steps listed in Tables
2.2 and 2.3.
Table 2.4: Summary of the RU encoding procedure.

Preprocessing: Input: non-singular parity-check matrix H. Output: an equivalent
parity-check matrix of the form
⎡ A  B  T ⎤
⎣ C  D  E ⎦
such that −ET⁻¹B + D is non-singular.

1. Perform row and column permutations to bring the parity-check matrix H into
approximate lower triangular form

H = ⎡ A  B  T ⎤
    ⎣ C  D  E ⎦   (2.19)

with as small a gap g as possible. We will see in subsequent sections how this can
be accomplished efficiently.

2. Use Gaussian elimination to effectively perform the pre-multiplication

⎡ I       0 ⎤ ⎡ A  B  T ⎤   ⎡ A             B             T ⎤
⎣ −ET⁻¹   I ⎦ ⎣ C  D  E ⎦ = ⎣ −ET⁻¹A + C    −ET⁻¹B + D    0 ⎦   (2.20)

in order to check that −ET⁻¹B + D is non-singular; if it is singular, perform
further column permutations to ensure this property. (Singularity of H can be
detected at this point.)

Encoding: Input: parity-check matrix of the form
⎡ A  B  T ⎤
⎣ C  D  E ⎦
such that −ET⁻¹B + D is non-singular, and a vector s ∈ F^{n−m}. Output: the
vector x = (s, p1, p2), p1 ∈ F^g, p2 ∈ F^{m−g}, such that Hx^T = 0^T.

1. Determine p1 as shown in Table 2.2.
2. Determine p2 as shown in Table 2.3.
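The encoding phase of Table 2.4 can be sketched as follows. The blocks A, B, T, C, D, E below form a made-up toy instance (n = 6, m = 3, gap g = 1), chosen so that φ = ET⁻¹B + D is the 1 × 1 identity and the φ⁻¹ step becomes trivial; over GF(2), subtraction equals addition, so the minus signs disappear:

```python
def matvec(M, v):
    """Matrix-vector product over GF(2)."""
    return [sum(a * b for a, b in zip(row, v)) % 2 for row in M]

def tri_solve(T, b):
    """Solve T y = b for lower triangular T with unit diagonal (GF(2))."""
    y = []
    for l, row in enumerate(T):
        y.append((b[l] + sum(row[j] * y[j] for j in range(l))) % 2)
    return y

def vadd(u, v):
    return [(a + b) % 2 for a, b in zip(u, v)]

# Toy instance: n = 6, m = 3, gap g = 1 (blocks as in equation (2.13)).
A = [[1,0,1], [0,1,1]]; B = [[1], [0]]; T = [[1,0], [1,1]]
C = [[1,1,0]]; D = [[0]]; E = [[0,1]]
# Here phi = E T^-1 B + D = [1] over GF(2), so phi^-1 is the identity
# (checked by hand); in general a dense g x g inverse would be applied.

def ru_encode(s):
    As = matvec(A, s)
    y  = tri_solve(T, As)                        # y = T^-1 A s^T (Table 2.2)
    p1 = vadd(matvec(E, y), matvec(C, s))        # phi^-1 (E T^-1 A + C) s^T
    p2 = tri_solve(T, vadd(As, matvec(B, p1)))   # p2 from (2.16) (Table 2.3)
    return s + p1 + p2                           # x = (s, p1, p2)

H = [[1,0,1, 1, 1,0],   # [A B T]
     [0,1,1, 0, 1,1],
     [1,1,0, 0, 0,1]]   # [C D E]
x = ru_encode([1, 0, 0])
assert all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in H)
```

Every step except the φ⁻¹ multiplication touches only sparse matrices or a triangular solve, which is what gives the O(n + g²) encoding cost.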
2.6.4 Hardware Aspects of LDPC codes
Although there have been many publications on the hardware implementation
of Viterbi [20], [77], [136], [169], [190] and Turbo codes [64], [116], [146], [145],
[177], little attention has been given to the hardware implementation issues of
LDPC codes.
Levine and Schmidt [99] present a simple hardware architecture for the LDPC
codec, but it has not been implemented and is perhaps not practical for a real
design due to its large size and inefficiency. Zhang et al. investigate the finite pre-
cision effects in regular LDPC decoders in [195]. They introduce their hardware
architecture in [193] and present an FPGA based (3,6)-regular LDPC decoder
in [194]. Their design is capable of 54Mbps, but its error correcting performance
is rather poor due to the use of regular LDPC codes and the simplicity of their
design. In [9], Bhatt et al. present a regular LDPC implementation on a fixed-point
DSP, which achieves a bit rate of just 133.33Kbps. In [67], Howland et al. present
their parallel regular LDPC decoding architecture, and later published their ASIC
based LDPC decoder chip in [10] and [66]. Their chip is claimed to be capable of
1Gbps, but is again based on regular LDPC codes, which have lower error cor-
recting performance compared to irregular LDPC codes. Our closest competitors
are probably Metha et al. and Flarion Technologies Inc. Metha et al. are working
on an FPGA based regular LDPC simulation platform and recently published a
technical report on their preliminary architecture [120]. Flarion offer intellectual
property for LDPC encoders and decoders [45]. Their FPGA decoder is reported
to operate at up to 384Mbps and their ASIC decoder at 10Gbps. However, few
details are known, since these are commercial products.
To the best of our knowledge, existing hardware LDPC encoders in the lit-
erature [66], [120], [193] employ the straightforward encoding method where a
vector of information bits is multiplied by a dense generator matrix, which has
complexity quadratic in the block length.
Recently, Johnson and Weller [73] proposed a family of irregular LDPC codes
with low encoding complexity based on quasi-cyclic codes. The quasi-cyclic codes
can be encoded with a shift register circuit of size equal to the code dimension.
Low-Density Generator-Matrix (LDGM) codes [50] have also received consider-
able attention due to their linear encoding complexity. However, quasi-cyclic
and LDGM codes are subsets of LDPC codes and restrict the way the parity-
check matrix is constructed. In most cases, they are outperformed by properly
constructed irregular codes such as those in [174].
2.7 Summary
In this chapter, we have presented background material and related work of
this thesis. We have first introduced FPGAs: their architecture, applications
and design tools. Different methods for approximating functions have been
described, covering CORDIC, digit-recurrence and on-line algorithms, bipar-
tite/multipartite methods, polynomial approximation, polynomial approximation
with non-uniform segmentation, and rational approximation. Our implementa-
tions of some of these methods are presented in Chapters 3 to 5.
Various issues involved with function evaluation such as different approx-
imation requirements for elementary and compound functions, approximation
method selection, range reduction, and the types of errors that can occur during
the approximation process have been discussed. In Chapter 3, we address the automa-
tion of method selection when evaluating elementary functions, and in Chapter 4
we present a hardware architecture for range reduction. Chapter 5 presents a
hierarchical segmentation method suitable for approximating non-linear compound
functions.
Several ways of generating Gaussian noise have been discussed, which are used
for various applications including channel code simulations. In Chapters 6 and 7
we present two hardware architectures suitable for applications that require high
speed/quality noise generators.
Finally, we have introduced the basics of LDPC codes and LDPC encoding,
described the RU algorithm for efficient encoding of LDPC codes, and looked
at previous work that deals with the hardware related issues of LDPC codes.
We present a flexible hardware LDPC encoder based on the RU algorithm in
Chapter 8.
CHAPTER 3
Automating Optimized Table-with-Polynomial
Function Evaluation
3.1 Introduction
Hardware implementation of elementary functions is a widely studied field with
many research papers (e.g. [19], [163], [171], [185]) and books (e.g. [46], [128])
devoted to the topic. Even though many methods are available for evaluating
functions, it is difficult for designers to know which method to select for a given
implementation.
Advanced FPGAs enable the development of low-cost and high-speed function
evaluation units, customizable to particular applications. Such customization can
take place at run time by reconfiguring the FPGA, so that different functions,
function evaluation methods, or precision can be introduced according to run-
time conditions. Consequently, the automation of function evaluation design is
one of the key bottlenecks in the further application of function evaluation in
reconfigurable computing. The main contributions of this chapter are:
• A methodology for the automation of function evaluation unit design, cov-
ering table look-up, table-with-polynomial and polynomial-only methods.
• An implementation of a partially automated system for design space explo-
ration of function evaluation in hardware, including:
– Algorithmic design space exploration with MATLAB.
– Hardware design space exploration with ASC.
• Method selection results for three commonly used elementary functions:
sin(x), log(1 + x) and 2x.
The rest of this chapter is organized as follows. Section 3.2 provides an
overview of our approach. Section 3.3 presents the algorithmic design space
exploration with MATLAB. Section 3.4 describes the automation of the ASC
design space exploration process. Section 3.5 shows how ASC designs can be
verified. Section 3.6 discusses results, and Section 3.7 offers summary and future
work.
3.2 Overview
We can use polynomials and/or look-up tables for approximating a function f(x)
over a fixed range [a, b]. At one extreme, the entire function approximation
can be implemented as a table look-up. At the other extreme, the function
approximation can be implemented as a polynomial approximation with function-
specific coefficients. The polynomials are of the form

f(x) = c_d x^d + c_{d−1} x^{d−1} + ... + c_1 x + c_0.   (3.1)

We use Horner’s rule to reduce the number of multiplications:

f(x) = ((c_d x + c_{d−1})x + ...)x + c_0   (3.2)

where x is the input, d is the polynomial degree and the c_i are the coefficients.
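Evaluating (3.1) directly needs roughly twice as many multiplications as (3.2), which uses only d multiply-add steps, one per coefficient after the first. A minimal sketch (illustrative, in Python rather than hardware):

```python
def horner(coeffs, x):
    """Evaluate c_d*x^d + ... + c_1*x + c_0 via Horner's rule (3.2).
    coeffs are ordered highest degree first: [c_d, ..., c_1, c_0]."""
    result = 0.0
    for c in coeffs:
        result = result * x + c   # one multiply-add per coefficient
    return result

# 2x^2 + 3x + 1 at x = 2  ->  8 + 6 + 1 = 15
assert horner([2.0, 3.0, 1.0], 2.0) == 15.0
```

In hardware, each loop iteration maps naturally onto one multiply-and-add unit, which is why the polynomial degree directly sets the depth of the evaluation pipeline.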
Between these two extremes, we use a table followed by a polynomial. This
table-with-polynomial method partitions the total approximation into several
segments. In this work, we employ uniformly sized segments, which have been
widely studied in the literature [19], [70], [100].
As discussed in Section 2.4.2 in Chapter 2, for a given accuracy requirement
it is possible to plot the area, latency, and throughput tradeoff and thus identify
the optimal function evaluation method. The optimality depends on further
requirements such as available area, required latency and throughput. We shall
illustrate this approach using Figures 3.13 to 3.15, where several methods are
combined to provide the optimal implementations in area, latency or throughput
for different bit-widths for the function sin(x).
The contribution of this chapter is the design and implementation of a method-
ology to automate this process. Here, MATLAB automates the mathematical
side of function approximation (e.g. bitwidth and coefficient selection), while
ASC [123] automates the hardware design space exploration of area, latency
and throughput. Figure 3.1 shows the proposed methodology and Figure 3.2
illustrates how ASC optimizes designs automatically for the user specified met-
ric [123]. Area optimization time-shares common blocks, latency optimization
uses no registers in the intermediate data paths, and throughput optimization
inserts pipeline registers.
3.3 Algorithmic Design Space Exploration with MATLAB
Given a target accuracy, or number of output bits so that the required accu-
racy is one unit in the last place (ulp), it is straightforward to automate the
design of a sufficiently accurate table, and with help from MATLAB, also to
Figure 3.1: Block diagram of methodology for automation.
find the optimal coefficients for a polynomial-only implementation. The inter-
esting designs are between the table-only and polynomial-only designs – those
involving both a table and a polynomial. Three MATLAB programs have been
developed: TABLE (table look-up), TABLE+POLY (table-with-polynomial) and
POLY (polynomial-only). The programs take a set of parameters (e.g. function,
input range, operand bitwidth, required accuracy, bitwidths of the operations
and the coefficients and the polynomial degree) and generate function evaluation
units in ASC code.
TABLE produces a single table, holding results for all possible inputs; each
input is used to index the table. If the input is n bits and the precision of the
results is m bits, the size of the table would be 2n ×m. It can be seen that the
disadvantage of this approach is that the table size varies exponentially with the
input size.
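A sketch of this scheme (illustrative only; the real generator emits ASC code for a combinational table rather than a Python list):

```python
import math

def build_table(f, a, b, n_bits, m_bits):
    """Full look-up table: 2^n entries, each an m-bit fixed-point result."""
    size = 2 ** n_bits
    scale = 2 ** m_bits
    # Entry i holds f evaluated at the i-th of 2^n equally spaced inputs.
    return [round(f(a + (b - a) * i / size) * scale) for i in range(size)]

# An 8-bit input indexes 2^8 = 256 entries of 12-bit results for sin on [0, 1].
table = build_table(math.sin, 0.0, 1.0, n_bits=8, m_bits=12)
assert len(table) == 2 ** 8   # table size grows exponentially with n

# Look-up: the n input bits form the index directly.
x = 0.5
approx = table[int(x * 2 ** 8)] / 2 ** 12
assert abs(approx - math.sin(0.5)) < 2 ** -8
```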
Figure 3.2: Principles behind automatic design optimization with ASC.
TABLE+POLY implements the table-with-polynomial method. The input
interval [a, b] is split into N = 2I equally sized segments. The I leftmost bits
of the argument x serve as the index into the table, which holds the coefficients
for that particular interval. We use degree two polynomials for approximating
the segments, which are known to give good results for moderate precisions [91].
The program starts with I = 0 (i.e. one segment over the whole input range)
and finds the minimax polynomial coefficients, i.e. those that minimize the maximum
absolute error [128]. I is incremented until the maximum error over all segments
is lower than the requested error. The operations are performed in fixed-point
and in finite precision with the user supplied parameters, which are emulated by
MATLAB.
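The behaviour of TABLE+POLY can be mimicked in a few lines of Python. This is only a sketch: quadratic interpolation through three points per segment stands in for the minimax fit that the MATLAB program actually computes, but it shows how the leading bits of the argument select a segment and how the error shrinks as segments are added:

```python
import math

def seg_approx(f, a, b, I, x):
    """Approximate f(x) on [a, b] with 2^I uniform segments, degree 2 each.
    The top I bits of the normalized argument select the segment."""
    N = 2 ** I
    seg = min(int((x - a) / (b - a) * N), N - 1)    # segment index
    x0 = a + (b - a) * seg / N                      # segment endpoints
    x2 = a + (b - a) * (seg + 1) / N
    x1 = (x0 + x2) / 2
    f0, f1, f2 = f(x0), f(x1), f(x2)
    # Lagrange quadratic through three points -- a stand-in for the minimax
    # coefficients that would be stored in the segment's table entry.
    return (f0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
          + f1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
          + f2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))

def max_err(I, samples=1000):
    return max(abs(seg_approx(math.sin, 0.0, 1.0, I, i / samples)
                   - math.sin(i / samples)) for i in range(samples))

# Halving the segment width shrinks the error of a degree-2 fit ~8x,
# so going from 1 to 16 segments improves accuracy by orders of magnitude.
assert max_err(4) < max_err(0) / 100
```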
POLY generates an implementation which approximates the function over
the whole input range with a single polynomial. It starts with a degree one
polynomial and finds the minimax polynomial coefficients. The polynomial degree
is incremented until the desired accuracy is met.
3.4 Hardware Design Space Exploration with ASC
While Mencer et al. [123] show the details of the design space exploration pro-
cess with ASC, we now utilize ASC (version 0.5) to automate this process. The
idea is to retain user-control over all features available on the gate level, whilst
automating many of the tedious tasks involved in exploring the design space.
Therefore ASC allows the user to specify the dimensions of design space explo-
ration, e.g. bitwidths of certain variables, optimization metrics such as area,
latency, or throughput, and in fact anything else that is accessible in ASC code,
which includes algorithm level, arithmetic unit level and gate level constructs. For
example, suppose we wish to explore how the bitwidth of a particular ASC vari-
able affects area and throughput. To do this we first parameterize the bitwidth
definition of this variable in the ASC code. Then we specify the detail of the
exploration in the following manner:
RUN0 = -XBITWIDTH = 8, 16, 24, 32 (3.3)
which states that we wish to investigate bitwidths of 8, 16, 24 and 32. At this
point, typing ‘make run0’ begins an automatic exploration of the design space,
generating a vast array of data (e.g. number of 4-input LUTs, total equivalent
gate count, throughput and latency) for each different bitwidth. ASC also au-
tomatically generates graphs for key pieces of this data, in an effort to further
reduce the time required to evaluate it.
The design space explorer, or ‘user’, in our case is of course the MATLAB
program that mathematically designs the arithmetic units on the algorithmic level
and provides ASC with a set of ASC programs, each of which results in a large
number of implementations. Each ASC implementation in return results in a
Figure 3.3: Accuracy graph: maximum error versus bitwidth for sin(x) with the
three methods.
number of design space exploration graphs and data files. The remaining manual
step, which is difficult to automate, involves inspecting the graphs and extracting
useful information about the variation of the metrics. It would be interesting to
see how such information from the hardware design space exploration can be used
to steer the algorithmic design space exploration.
One dimension of the design space is technology mapping on the FPGA side.
Should we use block RAMs, LUT memory or LUT logic implementations of
the mathematical look-up tables generated by MATLAB? Table 3.1 shows ASC
results which substantiate the view that logic minimization of tables containing
smooth functions is usually preferable over using block RAMs or LUT memory
to implement the table for the precisions used in this chapter. Therefore, in
this chapter we limit the exploration to combinational logic implementations of
tables.
Table 3.1: Various place and route results of 12-bit approximations to sin(x). The
logic minimized LUT implementation of the tables minimizes latency and area,
while keeping comparable throughput to the other methods, e.g. block RAM
(BRAM) based implementation.

ASC optimization | memory type | 4-input LUTs | clock speed [MHz] | latency [ns] | throughput [Mbps]
latency | block RAM | 919 + 1 BRAM | 17.89 | 111.81 | 250.41
latency | LUT memory | 1086 | 15.74 | 63.51 | 220.43
latency | LUT logic | 813 | 16.63 | 60.11 | 232.93
throughput | block RAM | 919 + 1 BRAM | 39.49 | 177.28 | 552.79
throughput | LUT memory | 1086 | 36.29 | 192.88 | 508.09
throughput | LUT logic | 967 | 39.26 | 178.29 | 549.67
3.5 Verification with ASC
One major problem of automated hardware design is the verification of the re-
sults, to make sure that the output circuit is actually correct. ASC offers two
mechanisms for this activity based on a software version of the implementation.
• Accuracy Graphs: graphs showing the accuracy of the gate-level simu-
lation result (SIM) compared to a software version using double precision
floating-point (SW), automatically generated by MATLAB, plotting:

max. error = max(|SW − SIM|), or
max. error = max(|SW − FPGA|)

when comparing to an actual FPGA output (FPGA).
Figure 3.3 shows an example graph. Here the precisions of the coefficients
and the operations are increased according to the bitwidth (e.g. when
bitwidth=16, all coefficients and operations are set to 16 bits), and the
output bitwidth is fixed at 24 bits.
• Regression Testing: same as the accuracy graph, but instead of plotting a
graph, ASC compares the result to a maximally tolerated error and reports
only ‘pass’ or ‘fail’ at the end.
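The same kind of check can be mocked up in software. The sketch below assumes (purely for illustration) that the finite-precision result SIM behaves like the double-precision reference rounded to m fractional bits, and evaluates max |SW − SIM| over a sweep of the input range, as the accuracy graphs do:

```python
import math

def fixed_point_sin(x, m_bits):
    """Stand-in for the gate-level result: sin rounded to m fractional bits."""
    return round(math.sin(x) * 2 ** m_bits) / 2 ** m_bits

def max_error(m_bits, samples=1000):
    """max |SW - SIM| over a sweep of [0, 1], as plotted in Figure 3.3."""
    return max(abs(math.sin(i / samples) - fixed_point_sin(i / samples, m_bits))
               for i in range(samples + 1))

# Regression testing: compare the worst case against a tolerated error.
tolerance = 2 ** -12                # one ulp at 12 fractional bits
print('pass' if max_error(12) <= tolerance else 'fail')
```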
3.6 Results
We demonstrate our approach with three elementary functions: sin(x), log(1 + x)
and 2x. Five bit sizes 8, 12, 16, 20 and 24 bits are considered for the bitwidth. In
this chapter, we implement designs with n-bit inputs and n-bit outputs. However,
the position of the decimal (or binary) point in the input and output formats can
be different in order to maximize the precision that can be described. All results
are post place-and-route, and are implemented on a Xilinx Virtex-II XC2V6000-6
device [187].
In the algorithmic space explored by MATLAB, there are three methods, three
functions and five bitwidths, resulting in 45 designs. These designs are generated
by the user with hand-optimized coefficient and operation bitwidths. ASC takes
the 45 algorithmic designs and generates a large number of implementations in the
hardware space with different optimization metrics. With the aid of the automatic
design exploration features of ASC (Section 3.4), we are able to generate all the
implementation results in one go with a single ‘make’ file. It takes around twelve
hours on a dual Athlon MP 2.13GHz PC with 2GB DDR-SDRAM.
The following graphs are a subset of the full design space exploration which
we show for demonstration purposes. Figures 3.4 to 3.15 show a set of FPGA
implementations resulting from a 2D cut of the multidimensional design space.
In Figures 3.4 to 3.6, we fix the function and approximation method to sin(x)
and TABLE+POLY, and obtain area, latency and throughput results for various
bitwidths and optimization methods. Degree two polynomials are used for all
TABLE+POLY experiments in this chapter.
Figure 3.4 shows how the area (in terms of the number of 4-input LUTs) varies
with bitwidth. The lower part shows LUTs used for logic while the small top part
of the bars shows LUTs used for routing. We observe that designs optimized
for area are significantly smaller than other designs. In addition, as one would
expect, the area increases with the bitwidth. Designs optimized for throughput
have the largest area; this is due to the registers used for pipelining. Figure 3.5
shows that designs optimized for latency have significantly less delay, and the
increase in delay with the bitwidth is lower than others. Designs optimized for
area have the longest delay, which is due to hardware being shared in a time-
multiplexed manner. Figure 3.6 shows that designs optimized for throughput
perform significantly better than others. Designs optimized for area perform
worst, which is again due to the hardware sharing. We note that the throughput
is rather unpredictable with increasing bitwidth. This is because the throughput
is solely determined by the critical path, which does not necessarily increase with
bitwidth (circuit area).
Figures 3.7 to 3.9 show various metric-against-metric scatter plots of 12-bit
approximations to sin(x) with different methods and optimizations. For TABLE,
only results with area optimization are shown, because the results for the other
optimizations are identical (no such optimizations are possible for a pure table look-up).
With the aid of such plots, one can decide rapidly what methods to use for
meeting specific requirements in area, latency or throughput.
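Extracting the Pareto-optimal points from such a scatter plot, as marked in Figures 3.7 to 3.9, is a simple filter over the design points; the (area, latency) pairs below are invented for illustration:

```python
def pareto_optimal(points):
    """Return the Pareto-optimal subset of (area, latency) pairs,
    where smaller is better in both dimensions."""
    optimal = []
    for p in points:
        # p survives unless some other point is no worse in both metrics.
        if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points):
            optimal.append(p)
    return optimal

# Toy design points: (area in 4-input LUTs, latency in ns).
designs = [(500, 900), (800, 300), (2500, 120), (900, 350), (2600, 700)]
assert pareto_optimal(designs) == [(500, 900), (800, 300), (2500, 120)]
```

The surviving points are exactly the designs worth considering once both metrics matter; all others are dominated.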
Figure 3.4: Area versus bitwidth for sin(x) with TABLE+POLY. OPT indicates
the metric for which the design is optimized. Lower part: LUTs for logic; small
top part: LUTs for routing.
Figure 3.5: Latency versus bitwidth for sin(x) with TABLE+POLY. Shows the
impact of latency optimization.
Figure 3.6: Throughput versus bitwidth for sin(x) with TABLE+POLY. Shows
the impact of throughput optimization.
Figure 3.7: Latency versus area for 12-bit approximations to sin(x). The
Pareto-optimal points [124] in the latency-area space are shown.
Figure 3.8: Latency versus throughput for 12-bit approximations to sin(x). The
Pareto-optimal points in the latency-throughput space are shown.
Figure 3.9: Area versus throughput for 12-bit approximations to sin(x). The
Pareto-optimal points in the throughput-area space are shown.
In Figures 3.10 to 3.12, we fix the approximation method to TABLE+POLY,
and obtain area, latency and throughput results for all three functions at various
bitwidths. Optimization methods are used for all three experiments (e.g. area is
optimized to get the area results).
From Figure 3.10, we observe that sin(x) requires the most and 2x requires the
least area. The difference gets more apparent as the bitwidth increases. This is
because 2x is the most linear of the three functions, and hence requires fewer
segments for the approximation. This leads to a reduction in the number of
entries in the coefficient table and hence less area on the device.
Figure 3.11 shows the variations of the latency with the bitwidth. We observe
that all three functions have similar behavior. In Figure 3.12, we observe that
again the three functions have similar behavior, with 2x performing slightly better
than others for bitwidths higher than 16 bits. We suspect that this is because of
the lower area requirement of 2x, which leads to less routing delay.
Figures 3.13 to 3.15 show the main emphasis and contribution of this chap-
ter, illustrating which approximation method to use for the best area, latency or
throughput performance. We fix the function to sin(x) and obtain results for all
three methods at various bitwidths. Again, area/latency/throughput optimiza-
tions are performed for a given experiment. For experiments involving TABLE,
we have managed to obtain results up to 12 bits only, due to memory limitations
of our PCs.
From Figure 3.13, we observe that TABLE has the least area at 8 bits, but the
area increases rapidly, making it less desirable at higher bitwidths. The reason for
this is the exponential growth of the table size with the input size for full look-up tables.
The TABLE+POLY approach yields the least area for precisions higher than
eight bits. This is due to the efficiency of using multiple segments with minimax
coefficients. We have observed that for POLY, roughly one more polynomial term
(i.e. one more multiply-and-add module) is needed every four bits. Hence, we
see a linear behavior with the POLY curve. We are unable to generate TABLE
results beyond 12 bits, due to the device size restrictions.
Figure 3.14 shows that TABLE has significantly smaller latency than others.
We expect that this will be the case for bitwidths higher than 12 bits as well.
POLY has the worst delay, which is due to computations involving high-degree
polynomials, and the terms of the polynomials increase with the bitwidth. The
latency for TABLE+POLY is relatively low across all bitwidths, because the
number of memory accesses and polynomial degree are fixed.
In Figure 3.15, we observe how the throughput varies with bitwidth. For
low bitwidths, TABLE designs result in the best throughput, which is due to
the short delay for a single memory access. However, the performance quickly
degrades and we predict that at bitwidths higher than 12 bits, it will perform
worse than the other two methods due to rapid increase in routing congestion.
The performance of TABLE+POLY is better than that of POLY below 15 bits and
worse above. This is due to the increase in the size of the table with precision,
which leads to longer delays for memory accesses.
Figure 3.10: Area versus bitwidth for the three functions with TABLE+POLY.
Lower part: LUTs for logic; small top part: LUTs for routing.
Figure 3.11: Latency versus bitwidth for the three functions with TABLE+POLY.
Figure 3.12: Throughput versus bitwidth for the three functions with
TABLE+POLY. Throughput is similar across functions, as expected.
Figure 3.13: Area versus bitwidth for sin(x) with the three methods. Note that
the TABLE method gets too large already for 14 bits.
Figure 3.14: Latency versus bitwidth for sin(x) with the three methods.
[Figure: throughput (Mbps) versus bitwidth (8-24 bits) for sin(x) with TABLE, POLY and TABLE+POLY, throughput-optimized designs.]
Figure 3.15: Throughput versus bitwidth for sin(x) with the three methods.
3.7 Summary
We have presented a methodology for the automation of function evaluation
unit design, covering table look-up, table-with-polynomial and polynomial-only
methods. An implementation of a partially automated system for design space
exploration of function evaluation in hardware has been demonstrated, including
algorithmic design space exploration with MATLAB and hardware design space
exploration with ASC. We have also compared block RAMs, LUT memory and
LUT logic implementations for storing mathematical look-up tables generated by
MATLAB. It is observed that the logic minimized LUT implementation of the
tables minimizes latency and area, while keeping comparable throughput to the
other methods.
Method selection results for sin(x), log(1 + x) and 2^x have been shown, and area and speed results for area-, latency- and throughput-optimized designs have been examined, demonstrating that an optimal method does indeed exist for a given function, precision and metric. We conclude that the automation of function
evaluation unit design is within reach, even though there are many remaining
issues for further study, which are discussed in Chapter 10.
CHAPTER 4
Adaptive Range Reduction
for Function Evaluation
4.1 Introduction
One of the main challenges in function evaluation is to provide a programming
tool or library that delivers the best function evaluation unit for a given function,
with the associated input and output range and precision. In Chapter 3, we have
shown the connection between precision and function evaluation methods. This
chapter focuses on adaptive range reduction, which transforms the input domain
into a smaller manageable range, such as the ranges used for the functions in
Chapter 3. The main contributions of this chapter are:
• Framework for adaptive range reduction, based on a parametric function evaluation library, on function approximation by polynomials and tables, and on pre-computing all possible input and output ranges.
• Implementation of design space exploration for adaptive range reduction, using MATLAB to produce function evaluation parameters for hardware designs targeting the ASC system.
• Evaluation of the proposed approach by exploring the effects of range reduction for several arithmetic functions, such as sin(x) and log(x), on throughput, latency and area for FPGA designs accurate to one ulp.
The rest of this chapter is organized as follows. Section 4.2 covers overview
and background material. Section 4.3 shows the design of the adaptive func-
tion evaluation library for ASC. Section 4.4 presents the implementation of the
algorithmic design space exploration with MATLAB, ASC library code genera-
tion, and the automation of the ASC design space exploration process optimizing
area, latency or throughput. Section 4.5 discusses results, and Section 4.6 offers
summary and thoughts on future work.
4.2 Overview
We evaluate an elementary function f(x), where x and f(x) have a given range
[a, b] and precision requirement. The evaluation typically consists of three steps [128]:
(1) range reduction, reducing x over the interval [a, b] to a more convenient y
over a smaller interval [a′, b′],
(2) function evaluation on the reduced interval, and
(3) range reconstruction: expansion of the result back to the original result
range.
There are two main types of range reduction:
• additive reduction: y = x − mC;
• multiplicative reduction: y = x/C^m,
where the integer m and the constant C are defined by the evaluated function.
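Both reduction types can be exercised in software. The following Python sketch (an illustration, not part of the ASC library) applies additive reduction with C = 2π, as used for sin(x), and multiplicative reduction with C = 2, as used for log(x), and checks the reconstruction identities:

```python
import math

def additive_reduce(x, C):
    """Additive reduction: y = x - m*C with integer m = floor(x / C)."""
    m = math.floor(x / C)
    return x - m * C, m

def multiplicative_reduce(x, C):
    """Multiplicative reduction: y = x / C**m, choosing m so that y lands in [1/C, 1)."""
    m, y = 0, x
    while y >= 1.0:
        y /= C
        m += 1
    while y < 1.0 / C:
        y *= C
        m -= 1
    return y, m

# sin: reduce modulo 2*pi, reconstruction uses periodicity
y, m = additive_reduce(10.0, 2 * math.pi)
assert 0 <= y < 2 * math.pi
assert math.isclose(math.sin(y), math.sin(10.0))

# log: x = y * 2**m with y in [0.5, 1), so log(x) = log(y) + m*log(2)
y, m = multiplicative_reduce(10.0, 2.0)
assert 0.5 <= y < 1.0
assert math.isclose(math.log(y) + m * math.log(2), math.log(10.0))
```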
Range reduction is widely studied, especially for CORDIC [182] and floating-point number systems on microprocessors [25]. In contrast, range reduction which adapts to different input ranges and precisions has received little attention. To the best of our knowledge, this is the first work that deals with this issue. Our design flow is illustrated in Figure 4.1.
[Figure: design flow. Library construction: the function f(x), input format and method are supplied to MATLAB, which approximates f(x); a Perl script (the Library Generator) assembles the resulting ASC code into the Function Evaluation Library (ASC Lib); the Hardware Compiler (ASC) produces the FPGA implementations. Library usage: the user indexes into the library.]
Figure 4.1: Design flow: MATLAB generates all the ASC code for the library. The user simply indexes into the library to obtain the specific function approximation unit.
4.3 Design
This section describes our approach for adaptive range reduction. Section 4.3.1
provides an overview. Section 4.3.2 describes the degrees of freedom for choosing
different parameters in our method.
4.3.1 Design Overview
Figure 4.1 shows the design flow of this research. The function of interest, its
input range and precision, and evaluation method are supplied to our MATLAB
program, which automatically designs the function approximator and produces
its hardware description. In our case, MATLAB produces code for ASC. This
large collection of ASC functions is then transformed by a Perl script into an
ASC function evaluation library (ASC lib). ASC then takes care of design space
exploration on the architecture level, the arithmetic level, and the gate level of
abstraction. The result is an optimized function evaluation library for computing
with FPGAs.
Given a function f(x) and an interval [a, b] we approximate the function with
polynomials and tables. Tasks in designing a function evaluation library include
automating the selection of range reduction, the selection and design of the func-
tion evaluation method, and area, latency and throughput optimizations on the
lower levels of abstraction. This section shows how we design a function evalua-
tion library that contains optimized implementations for a large number of range
and precision combinations.
The conventional way of implementing function evaluation is shown below for
the three functions evaluated in this chapter. We use ASC code notation [123]
in Figure 4.2 to show various methods of function evaluation including range
reduction and range reconstruction. Figures 4.3, 4.4 and 4.5 show the circuit
diagrams.
The code in Figure 4.2 shows, as an example, a different function evaluation method for each function; in reality we use many combinations of method and function. sin(x) is an instance of additive reduction, whereas log(x) and √x are instances of multiplicative reduction.
Evaluating f(x) = sin(x)
// Range Reduction
x1 = abs(x) % (2*pi);
x2 = IF(x1>pi, x1-pi, x1);
y = IF(x2>(pi/2), pi-x2, x2);
// Evaluation Method
// f(y) where y = [0,pi/2)
// e.g. polynomial-only (po)
f1 = (a*y+b)*y+c;
// Range Reconstruction
f = IF(x1>pi, -f1, f1);
Evaluating f(x) = log(x)
// Range Reduction
exp = LeadingOneDetect(x)-FracWidth(x);
y = x >> exp;
// Evaluation Method
// f(y) where y = [0.5,1)
// e.g. table+degree-1-polynomial (tp1)
f1 = Table1[y]*y+Table2[y];
// Range Reconstruction
f = f1+exp*log(2);
Evaluating f(x) = √x
// Range Reduction
exp = LeadingOneDetect(x)-FracWidth(x);
x1 = x >> exp;
y = IF(exp[0], x1 >> 1, x1);
// Evaluation Method
// f(y) where y = [0.25,1)
// e.g. table+degree-2-polynomial (tp2)
f1 = (Table1[y]*y+Table2[y])*y+Table3[y];
// Range Reconstruction
exp1 = IF(exp[0], exp+1 >> 1, exp >> 1);
f = f1 << exp1;
Figure 4.2: Description of range reduction, evaluation method and range reconstruction for the three functions sin(x), log(x) and √x.
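The sin(x) flow can be checked numerically. This Python sketch (illustrative only) mirrors the reduce / approximate / reconstruct steps for non-negative x, negating the result on the upper half-period, consistent with the sign logic of the generated code in Figure 4.9:

```python
import math

def sin_reduced(x):
    """Evaluate sin(x) for x >= 0 via reduce / approximate / reconstruct."""
    x1 = x % (2 * math.pi)                        # range reduction to [0, 2*pi)
    x2 = x1 - math.pi if x1 > math.pi else x1     # fold to [0, pi)
    y = math.pi - x2 if x2 > math.pi / 2 else x2  # fold to [0, pi/2]
    f1 = math.sin(y)                              # stand-in for the approximation unit
    return -f1 if x1 > math.pi else f1            # range reconstruction

for x in [0.0, 0.3, 1.7, 3.5, 5.0, 12.0]:
    assert math.isclose(sin_reduced(x), math.sin(x), abs_tol=1e-12)
```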
[Figure: circuit with range reduction (mod 2π, comparisons against π and π/2), an approximation unit computing g(y), and range reconstruction by conditional negation.]
Figure 4.3: Circuit for evaluating sin(x).
[Figure: circuit with range reduction (leading-one detector, shift x >> exp), an approximation unit computing g(y), and range reconstruction adding exp × log(2).]
Figure 4.4: Circuit for evaluating log(x).
[Figure: circuit with range reduction (leading-one detector, shift x >> exp, extra shift by one when exp[0] is set), an approximation unit computing g(y), and range reconstruction shifting g(y) left by exp1.]
Figure 4.5: Circuit for evaluating √x.
Figure 4.6 shows the functions over the range reduced intervals. We observe that the functions behave almost linearly over these intervals.
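The near-linearity can be quantified with a short script (an illustration; the bounds below are loose): the maximum deviation of each function from the straight line through its interval endpoints is small relative to the function's span.

```python
import math

def max_dev_from_chord(f, a, b, n=1000):
    """Maximum deviation of f from the straight line through (a, f(a)) and (b, f(b))."""
    fa, fb = f(a), f(b)
    return max(abs(f(a + (b - a) * i / n) - (fa + (fb - fa) * i / n))
               for i in range(n + 1))

assert max_dev_from_chord(math.sin, 0.0, math.pi / 2) < 0.25   # sin on [0, pi/2]
assert max_dev_from_chord(math.log, 0.5, 1.0) < 0.07           # log on [0.5, 1]
assert max_dev_from_chord(math.sqrt, 0.25, 1.0) < 0.05         # sqrt on [0.25, 1]
```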
[Figure: three panels plotting sin(y) on [0, π/2], log(y) on [0.5, 1] and √y on [0.25, 1].]
Figure 4.6: Plot of the three functions over the range reduced intervals.
The central contribution of this chapter lies in reconsidering the above struc-
ture for user-defined fixed-point bitwidths. When programming FPGAs one can
select any bitwidth for the integer part and the fractional part of the fixed-point
number. As a consequence, a function evaluation library obtains the range and
precision of the input and can use this information to produce an optimized function evaluation unit. Previous work [122] addresses the subproblem of how to select function evaluation methods based on precision. In this work we add the issues of input/output range and range reduction. Based on input range and precision
we now have the following degrees of freedom:
1. applicability of range reduction
2. evaluation method selection
3. evaluation method design
• find minimal bitwidths
• find minimal polynomial degree
(for polynomial-only method)
• find minimal segments
(for table-with-polynomial method)
4. optimize: area, latency or throughput
The ASC function evaluation library takes the range, precision and optimiza-
tion metric, and instantiates one of many instances of the corresponding function
evaluation unit.
4.3.2 Degrees of Freedom
Applicability of Range Reduction
Assume we require a hardware unit to compute sin(x) and x is a fixed-point
variable with four integer bits and eight fraction bits. Then the range of the
input is [0, 16) and the expected range of the output is [−1, 1]. The precision of
the input and output is 2−8 which also sets the ulp (unit in last place). Given a
particular function that we want to evaluate, we can decide whether it is necessary
to implement range reduction or not. In order to make the correct decision we
need to consider the optimization metric (area, latency or throughput), design a
function evaluation unit with and without range reduction, and select the best
one.
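The format arithmetic in this example is simple to state explicitly (Python, for illustration):

```python
# Fixed-point format from the sin(x) example: 4 integer bits and 8 fraction bits
int_bits, frac_bits = 4, 8
input_range = (0, 2 ** int_bits)      # representable input range [0, 16)
ulp = 2.0 ** -frac_bits               # precision: unit in the last place

assert input_range == (0, 16)
assert ulp == 0.00390625
```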
In practice, we actually pre-compute all possible input ranges and store for
each function a particular range r so that for all input ranges smaller than r we do
not use range reduction, and for all input ranges above r we use range reduction.
We obtain the graphs which determine r after place-and-route. Section 4.5 shows
the detailed graphs of this step.
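Conceptually, the pre-computed decision reduces to a per-function threshold test. The sketch below uses hypothetical threshold values of r consistent with the observations in Section 4.5 (about six bits for sin(x), about two bits for log(x)):

```python
# Hypothetical per-function thresholds r (example values suggested by Section 4.5)
THRESHOLD_R = {"sin": 6, "log": 2}

def use_range_reduction(func_name, input_range_bits):
    """Apply range reduction only when the input range reaches the threshold r."""
    return input_range_bits >= THRESHOLD_R[func_name]

assert use_range_reduction("log", 2) is True
assert use_range_reduction("sin", 4) is False
```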
Evaluation Method Selection
As discussed in Section 2.3 in Chapter 2, there are many possible methods for
evaluating functions. In this chapter we explore polynomial-only (po) and table-
with-polynomial methods with polynomials of degree two to four (tp2∼tp4 ). The
architecture for an approximation unit with a table-with-polynomial scheme is
shown in Figure 4.8. The polynomial coefficients are found in a minimax sense.
For the table-with-polynomial approach, the input interval is split into 2k equally
sized segments. The k leftmost bits of the argument y serve as the index into the
table, which holds the coefficients for that particular interval. Segmentation for
evaluating log(y) with eight uniform segments (k = 3) is illustrated in Figure 4.7.
Note that for the polynomial-only approach, there would be just one entry (coef-
ficient) in the table and no addressing bits. The table-with-polynomial methods
(tp) trade off table area versus polynomial area.
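A software model of the tp2 scheme makes the indexing concrete. This Python sketch (illustrative; it fits interpolating quadratics rather than true minimax polynomials) splits [0.5, 1) into eight uniform segments for log(y), indexes the coefficient table by segment, and evaluates with Horner's rule:

```python
import math

K = 3                       # 2**K = 8 uniform segments over [0.5, 1)
A, B = 0.5, 1.0

def fit_quadratic(f, x0, x2):
    """Interpolating quadratic through segment endpoints and midpoint
    (a stand-in for the minimax fit used in the actual tool)."""
    x1 = (x0 + x2) / 2
    y0, y1, y2 = f(x0), f(x1), f(x2)
    d1 = (y1 - y0) / (x1 - x0)
    d2 = ((y2 - y1) / (x2 - x1) - d1) / (x2 - x0)
    return d2, d1 - d2 * (x0 + x1), y0 - d1 * x0 + d2 * x0 * x1  # c2, c1, c0

width = (B - A) / 2**K
TABLE = [fit_quadratic(math.log, A + s * width, A + (s + 1) * width)
         for s in range(2**K)]

def log_tp2(y):
    seg = int((y - A) / width)           # the k leftmost fraction bits in hardware
    c2, c1, c0 = TABLE[min(seg, 2**K - 1)]
    return (c2 * y + c1) * y + c0        # Horner's rule, as in Figure 4.8

assert abs(log_tp2(0.7) - math.log(0.7)) < 1e-4
```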
Evaluation Method Design
Once we know which method to use, we need to design the optimized unit. For
the polynomial-only method we can find the minimal degree of the polynomial
that will satisfy the required output precision. Then we need to find the optimized bitwidths of the computation inside the function evaluation units for all the methods.
[Figure: log(y) on [0.5, 1) divided into eight uniform segments with boundaries at 0.5, 0.5625, 0.625, 0.6875, 0.75, 0.8125, 0.875, 0.9375 and 1.]
Figure 4.7: Segmentation for evaluating log(y) with eight uniform segments. The leftmost three bits of the inputs are used as the segment index.
Optimize: Area, Latency or Throughput
While the selections for the previous degrees of freedom are pre-computed with MATLAB, the area, latency and throughput optimizations at the arithmetic and gate levels can be left to the compiler (as discussed in Section 3.2 in Chapter 3). The next section, on the implementation, contains details on how this is achieved.
4.4 Implementation
This section presents the implementation of the algorithmic design space explo-
ration with MATLAB, ASC library code generation, and the automation of the
ASC design space exploration process optimizing area, latency or throughput.
[Figure: datapath with a table of 2^k coefficient entries (c_0 ... c_d, bitwidths w_0 ... w_d) addressed by the k leftmost bits of the j-bit input y; the remaining j − k bits feed the polynomial datapath that produces g(y).]
Figure 4.8: Architecture of table-with-polynomial unit for degree d polynomials. Horner’s rule is used to evaluate the polynomials.
4.4.1 Algorithmic Design Space Exploration
We use MATLAB to generate a large number of implementations for function
evaluation. We consider several function evaluation methods: polynomial-only
(po), and table-with-polynomial of degree two to four (tp2∼tp4 ). For a given
function and any range/precision pair, the MATLAB code generates polyno-
mial coefficients which form entries of the look-up tables based on the Remez
method [128]. The range and precision are represented by the integer and fraction
bitwidths respectively. The Remez method computes the minimax coefficients
that minimize the maximum absolute error over an interval. In this fashion we
also obtain minimal bitwidths and the minimal number of polynomial terms for
the po method. For tp methods, we find the minimal table size and the coefficient
bitwidths for the given range and precision.
The structure of the 2000 lines of MATLAB code, outlined below, provides this functionality.
// for a given function f, input format i,
// method m and polynomial degree d
if (m==‘po’) // for polynomial-only
// find minimum polynomial degree
min_degree = find_min_degree(f,i);
// find minimum internal fraction bitwidth
int_bw = find_min_int_bw(f,i,min_degree);
// generate polynomial coefficients
coeffs = gen_coeffs(f,i,min_degree,int_bw);
// generate ASC code
gen_ASC(f,i,min_degree,int_bw,coeffs);
elseif (m==‘tp’) // for table-with-polynomial
// find minimum number of segments
min_segs = find_min_segs(f,i,d);
// find minimum internal fraction bitwidth
int_bw = find_min_int_bw(f,i,d,min_segs);
// generate coefficient look-up table
table = gen_table(f,i,d,int_bw,min_segs);
// generate ASC code
gen_ASC(f,i,d,int_bw,table,min_segs);
end
For this implementation we use a uniform bitwidth for the internal datapath fractions. This minimum bitwidth is found using a binary search method.
In the future we hope to support non-uniform minimum bitwidth by using more
advanced bitwidth minimization techniques such as BitSize [47]. The ASC code
automatically generated for evaluating sin(x) for range 8 bits and precision 8 bits
with tp2 is shown in Figure 4.9.
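The binary search can be sketched as follows (Python; `meets_precision` is a hypothetical stand-in for the fixed-point datapath simulation performed in MATLAB, using illustrative coefficients):

```python
def quantize(v, frac_bits):
    """Round v onto a fixed-point grid with frac_bits fraction bits."""
    scale = 1 << frac_bits
    return round(v * scale) / scale

def meets_precision(frac_bits, target_frac_bits):
    """Hypothetical check: a quantized degree-2 Horner evaluation must stay
    within one ulp (2**-target_frac_bits) of the exact polynomial."""
    ulp = 2.0 ** -target_frac_bits
    c2, c1, c0 = -0.1181640625, 0.2744140625, 0.8408203125  # illustrative coefficients
    for i in range(100):
        y = i / 100.0
        acc = quantize(quantize(c2 * y, frac_bits) + c1, frac_bits)
        acc = quantize(quantize(acc * y, frac_bits) + c0, frac_bits)
        exact = (c2 * y + c1) * y + c0
        if abs(acc - exact) > ulp:
            return False
    return True

def min_fraction_bitwidth(target_frac_bits, lo=4, hi=32):
    """Binary search for the smallest internal fraction bitwidth that passes."""
    while lo < hi:
        mid = (lo + hi) // 2
        if meets_precision(mid, target_frac_bits):
            hi = mid
        else:
            lo = mid + 1
    return lo

bw = min_fraction_bitwidth(8)
assert meets_precision(bw, 8)
```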
// ASC code for evaluating sin(x) in fixed-point with range reduction

// Range Reduction
HWfix &HWreduce_HWsin_16_8_tp2_wrr(HWfix &x, HWint &sign_x, HWint &temp_sign) {
  const double pit2_const = 6.2832031250000000000000000e+000; // Pi * 2
  const double pi_const   = 3.1416015625000000000000000e+000; // Pi
  const double pio2_const = 1.5703125000000000000000000e+000; // Pi / 2

  HWfix &pit2 = *(new HWfix(TMP, 13, 10, UNSIGNED)); HWfix &pi = *(new HWfix(TMP, 12, 10, UNSIGNED));
  HWfix &pio2 = *(new HWfix(TMP, 11, 10, UNSIGNED)); HWfix &abs_x = *(new HWfix(TMP, 18, 8));
  HWfix &x1 = *(new HWfix(TMP, 14, 10)); HWfix &x2 = *(new HWfix(TMP, 13, 10));
  HWfix &x3 = *(new HWfix(TMP, 12, 10));

  pit2 = pit2_const; pi = pi_const; pio2 = pio2_const;

  sign_x = x[15]; abs_x = HWabs(x);
  x1 = abs_x % pit2;
  temp_sign = x1 > pi;
  x2 = IF(temp_sign, x1-pi, x1);
  x3 = IF(x2 > pio2, pi-x2, x2);
  return x3;
}

// Approximation
HWfix &HWapproximate_HWsin_16_8_tp2_wrr(HWfix &reduced_x) {
  double c2_init[4] = {-3.125000000e-002, -8.496093750e-002, -1.181640625e-001, -1.220703125e-001};
  double c1_init[4] = { 5.126953125e-001,  4.482421875e-001,  2.744140625e-001,  3.417968750e-002};
  double c0_init[4] = {-9.765625000e-004,  4.785156250e-001,  8.408203125e-001,  9.980468750e-001};

  HWfix &reduced_x_temp = *(new HWfix(TMP, 14, 10)); HWfix &temp1 = *(new HWfix(TMP, 11, 10));
  HWfix &temp2 = *(new HWfix(TMP, 11, 10)); HWfix &_x = *(new HWfix(TMP, 8, 7));
  HWfix &c2 = *(new HWfix(TMP, 12, 11)); HWfix &c1 = *(new HWfix(TMP, 12, 11));
  HWfix &c0 = *(new HWfix(TMP, 12, 11)); HWfix &dp1 = *(new HWfix(TMP, 11, 10));
  HWfix &dp2 = *(new HWfix(TMP, 11, 10)); HWfix &_dp2 = *(new HWfix(TMP, 12, 11));
  HWfix &dp3 = *(new HWfix(TMP, 11, 10)); HWfix &dp4 = *(new HWfix(TMP, 10, 8));
  HWint &coeff_addr = *(new HWint(TMP, 2, UNSIGNED));

  HWvector<HWfix> &c2_mem = *new HWvector<HWfix>(4, new HWfix(TMP, 11, 10), c2_init);
  HWvector<HWfix> &c1_mem = *new HWvector<HWfix>(4, new HWfix(TMP, 11, 10), c1_init);
  HWvector<HWfix> &c0_mem = *new HWvector<HWfix>(4, new HWfix(TMP, 11, 10), c0_init);

  reduced_x_temp = reduced_x;
  coeff_addr = reduced_x_temp << 1;            // table index
  temp1 = coeff_addr; temp2 = reduced_x_temp << 1;
  _x = temp2 - temp1;                          // offset within the segment
  c2 = c2_mem[coeff_addr]; c1 = c1_mem[coeff_addr]; c0 = c0_mem[coeff_addr];
  dp1 = _x * c2; dp2 = dp1 + c1; _dp2 = dp2;   // Horner's rule
  dp3 = _x * _dp2; dp4 = dp3 + c0;
  return dp4;
}

// Range Reconstruction
HWfix &HWreconstruct_HWsin_16_8_tp2_wrr(HWfix &approximated_x, HWint &sign_x, HWint &temp_sign) {
  HWfix &fx = *(new HWfix(TMP, 10, 8));
  fx = IF(temp_sign==sign_x, approximated_x, -approximated_x);
  return fx;
}

// Evaluation
HWfix &HWsin_16_8_tp2_wrr(HWfix &x) {
  HWfix &fx = *(new HWfix(TMP, 10, 8)); HWfix &reduced_x = *(new HWfix(TMP, 12, 10));
  HWfix &approximated_x = *(new HWfix(TMP, 10, 8)); HWint &sign_x = *(new HWint(TMP, 1, UNSIGNED));
  HWint &temp_sign = *(new HWint(TMP, 1, UNSIGNED));

  // Range Reduction
  reduced_x = HWreduce_HWsin_16_8_tp2_wrr(x, sign_x, temp_sign);
  // Approximation
  approximated_x = HWapproximate_HWsin_16_8_tp2_wrr(reduced_x);
  // Range Reconstruction
  fx = HWreconstruct_HWsin_16_8_tp2_wrr(approximated_x, sign_x, temp_sign);
  return fx;
}

Figure 4.9: ASC code for evaluating sin(x) for range 8 bits and precision 8 bits with tp2. This code is automatically generated from our MATLAB tool.
4.4.2 ASC Code Generation and Optimizations
ASC code makes use of C++ syntax and ASC semantics which allow the user
to program on the architecture-level, the arithmetic-level and the gate-level. As
a consequence ASC code provides the productivity of high-level hardware design
tools and the performance of low-level optimized hardware design. ASC pro-
vides types and operators to enable research on custom data representation and
arithmetic. Currently supported types are HWint, HWfix and HWfloat. For this
chapter we use the HWfix type which is defined as follows:
HWfix x(TMP,size,fract_size,sign_mode);
All results in this chapter are given for sign-magnitude representation which
makes most sense for range reduction. ASC provides operator-level optimizations
of area, latency, and throughput, which is referred to below as the optimization
mode.
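A minimal software model of such a type (an assumption for illustration, not the actual ASC implementation) shows the role of the size and fraction parameters under sign-magnitude representation:

```python
class Fix:
    """Toy model of a sign-magnitude fixed-point type: 'size' total bits
    (one of which is the sign), 'fract_size' fraction bits."""
    def __init__(self, size, fract_size):
        self.size, self.fract = size, fract_size

    def quantize(self, v):
        scale = 1 << self.fract
        mag = min(abs(round(v * scale)), (1 << (self.size - 1)) - 1)  # clamp magnitude
        return (mag / scale) * (1 if v >= 0 else -1)

x = Fix(size=12, fract_size=8)
assert x.quantize(3.14159265) == 3.140625   # rounded to the 2**-8 grid
assert x.quantize(-0.5) == -0.5
```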
As a result of this work, ASC provides a function evaluation library call of
the form
y = HWsin(x);
In order to create an optimizing function evaluation library, we utilize MATLAB
to generate a vast amount of ASC code. This ASC code forms a two-dimensional
matrix, which is indexed by range and precision of the argument to the function
evaluation call. Each matrix entry consists of a pointer to an ASC function which
is called for the particular input x.
Note that for each function we determine two design selection matrices: for
minimum area (Figure 4.10) and for minimum latency (Figure 4.11) as shown
in Section 4.5. The HWsin(x) call indexes into the matrix to find the optimized
ASC implementation. For instance, from Figure 4.10, for a √x design with 12-bit range and 16-bit precision, the smallest implementation would be tp3.
The function evaluation code, for example for log(x), then indexes into the
matrix of function pointers (HWlog_matrix) and accesses the correct function
based on input range and precision:
HWfix &HWlog( HWfix &x ) {
  return HWlog_matrix[x.range][x.precision](x);
}
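In software terms, this dispatch resembles a map from (range, precision) to a pre-built evaluator. The Python sketch below (the unit names and entries are hypothetical) illustrates the indexing:

```python
import math

def log_unit_8_8(x):    # hypothetical pre-generated unit (8-bit range, 8-bit precision)
    return math.log(x)

def log_unit_12_16(x):  # hypothetical pre-generated unit (12-bit range, 16-bit precision)
    return math.log(x)

# two-dimensional matrix indexed by (range, precision) of the argument
HWlog_matrix = {(8, 8): log_unit_8_8, (12, 16): log_unit_12_16}

def HWlog(x, range_bits, precision_bits):
    return HWlog_matrix[(range_bits, precision_bits)](x)

assert math.isclose(HWlog(2.0, 8, 8), math.log(2.0))
```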
Altogether, the 2000 lines of MATLAB code generate 300,000 lines of ASC code, resulting in over 1000 designs with a total of over 40 million Xilinx equivalent circuit gates. This process took a few days on two Intel Xeon 2.6GHz dual-processor PCs, each fitted with 4GB DDR-SDRAM.
4.5 Results
After applying the method in Section 4.4.2, 1000 distinct designs are placed and routed on a Xilinx Virtex-II XC2V6000-6 device. These result in over 150
graphs/figures. We summarize all the results in two matrices which show the
Pareto-optimal solutions in Figure 4.10 for area and Figure 4.11 for latency. In
essence, these matrices tell us for each combination of range and precision of the
input which method to use for the three functions. Note that we use the term
range reduction to also include range reconstruction.
The remaining result figures show a sample of the graphs that we used to
arrive at the decisions presented in the matrices above.
Figures 4.12, 4.13, 4.14 and 4.15 show the area cost of range reduction for
sin(x) and log(x) implemented using po and tp3 methods. The lower part of the
bars shows LUTs used for function evaluation, and the small upper part shows
the LUTs used for range reduction. These figures show that the percentage area
used by range reduction increases with precision and range. Comparing sin(x)
with log(x), the cost of range reduction with increasing range is large for sin(x),
due to the use of the modulus operation which incorporates a divider. In contrast,
log(x) uses a barrel shifter to do the range reduction.
To decide when to use range reduction, we consider Figures 4.16, 4.17, 4.18
and 4.19, which show the area and latency results for sin(x) and log(x) evaluated
using range reduction (WRR) and without range reduction (WOR). In the case of
evaluating with WOR, we approximate the function over the entire user defined
range with the given methods (tp2∼tp4 ).
Considering the area for sin(x), WOR has a lower LUT usage than WRR
when the range is less than six bits. In the case of log(x), we observe that even
for ranges as low as two bits, the LUT usage for WOR is significantly higher
than WRR and this gap increases with range. This is due to the non-linear
region of log(x) near zero which requires more segments to approximate with
WOR. Considering the latency results for sin(x) and log(x), WOR is always
faster than the corresponding WRR method. This is due to the absence of the
range reduction step.
Figures 4.20 and 4.21 highlight the area and latency tradeoffs where the area
increase with precision is smaller for area optimized designs, and the latency
increase is smaller for latency optimized designs. Figures 4.22 and 4.23 show a
similar tradeoff when we consider the range while keeping the precision fixed.
By looking at these figures along with other figures, we are able to create the
resulting matrices in Figures 4.10 and 4.11. From the two matrices, we observe
that tp2 is usually the most attractive solution. This result is not too surprising, since second-order polynomials are known to give good tradeoffs between table size and circuit complexity for the bitwidths we target in this chapter. But when the precision requirement is high (such as 16 bits in Figure 4.10), we see that tp3 gives the smallest area. This is because at low precision requirements, table sizes are manageable with low-order polynomials; however, table sizes increase rapidly with precision, at which point higher-order polynomials result in significantly smaller tables.
[Figure: 4×4 matrix indexed by input range (4-16 bits) and precision (4-16 bits); each cell lists the minimum-area method for sin, log and sqrt. Nearly all cells are tp2; po appears in a few cells, and sqrt uses tp3 in two cells.]
Figure 4.10: Area matrix which tells us for each input range/precision combination which design to use for minimum area.
[Figure: 4×4 matrix indexed by input range (4-16 bits) and precision (4-16 bits); each cell lists the minimum-latency method for sin, log and sqrt. Nearly all cells are tp2, with po in a few cells.]
Figure 4.11: Latency matrix which tells us for each input range/precision combination which design to use for minimum latency.
[Figure: area (4-input LUTs) versus range (4-16 bits) for sin(x) with po, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.12: Area cost of range reduction (upper part) for sin(x) implemented using po with the designs optimized for area.
[Figure: area (4-input LUTs) versus range (4-16 bits) for sin(x) with tp3, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.13: Area cost of range reduction (upper part) for sin(x) implemented using tp3 with the designs optimized for area.
[Figure: area (4-input LUTs) versus range (4-16 bits) for log(x) with po, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.14: Area cost of range reduction (upper part) for log(x) implemented using po with the designs optimized for area.
[Figure: area (4-input LUTs) versus range (4-16 bits) for log(x) with tp3, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.15: Area cost of range reduction (upper part) for log(x) implemented using tp3 with the designs optimized for area.
[Figure: area (4-input LUTs) versus range (4-8 bits) for sin(x), comparing tp2, tp3 and tp4 with and without range reduction.]
Figure 4.16: Area for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for area.
[Figure: latency (ns) versus range (4-8 bits) for sin(x), comparing tp2, tp3 and tp4 with and without range reduction.]
Figure 4.17: Latency for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for latency.
[Figure: area (4-input LUTs) versus range (2-4 bits) for log(x), comparing tp2, tp3 and tp4 with and without range reduction.]
Figure 4.18: Area for log(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for area.
[Figure: latency (ns) versus range (2-4 bits) for log(x), comparing tp2, tp3 and tp4 with and without range reduction.]
Figure 4.19: Latency for log(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for latency.
[Figure: area (4-input LUTs) versus precision (4-16 bits) for sin(x) with tp3, for ranges of 4, 8, 12 and 16 bits, area- and latency-optimized.]
Figure 4.20: Area versus precision for sin(x) using tp3 for different ranges and optimization.
[Figure: latency (ns) versus precision (4-16 bits) for sin(x) with tp3, for ranges of 4, 8, 12 and 16 bits, area- and latency-optimized.]
Figure 4.21: Latency versus precision for sin(x) using tp3 for different ranges and optimization.
[Figure: area (4-input LUTs) versus range (4-16 bits) for sin, sqrt and log with tp2, tp3, tp4 and po, precision fixed at eight bits.]
Figure 4.22: Area versus range for all three functions using different methods with the precision fixed at eight bits optimized for area.
[Figure: latency (ns) versus range (4-16 bits) for sin, sqrt and log with tp2, tp3, tp4 and po, precision fixed at eight bits.]
Figure 4.23: Latency versus range for all three functions using different methods with the precision fixed at eight bits optimized for latency.
[Figure: area (4-input LUTs) versus range (4-16 bits) for sin, sqrt and log with po, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.24: Area versus range for all three functions using po for different precisions optimized for area.
[Figure: latency (ns) versus range (4-16 bits) for sin, sqrt and log with po, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.25: Latency versus range for all three functions using po for different precisions optimized for latency.
[Figure: area (4-input LUTs) versus range (4-16 bits) for sin, sqrt and log with tp3, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.26: Area versus range for all three functions using tp3 for different precisions optimized for area.
[Figure: latency (ns) versus range (4-16 bits) for sin, sqrt and log with tp3, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.27: Latency versus range for all three functions using tp3 for different precisions optimized for latency.
4.6 Summary
We have presented the design space exploration of function evaluation with cus-
tom range and precision. The result is an optimizing function evaluation library
for ASC. The novel aspect of this work is the method and range reduction se-
lection based on range and precision of the input/output variables. The detailed
research issues to which this chapter contributes are:
• exploration of the area and speed tradeoffs of function evaluation with and
without range reduction, using ASC;
• given a function, its input/output range/precision, and an optimization
metric, we automate the decision about whether range reduction helps to
optimize the metric by pre-computing a large library of function evaluation
generators;
• given the above and a decision regarding range reduction, we automate the decision about which evaluation method is best by examining the range/precision/method space and selecting the best method in each case;
• given the method, we automate the decision about which bitwidths and
number of polynomial terms to use by constructing the function evaluation
generators via MATLAB simulation and computation.
In addition, we show the productivity obtained from combining MATLAB with ASC, exploring over 40 million Xilinx equivalent circuit gates in a few days on two Intel Xeon 2.6GHz dual-processor PCs, each fitted with 4GB of DDR-SDRAM.
CHAPTER 5
The Hierarchical Segmentation Method
for Function Evaluation
5.1 Introduction
In Chapters 3 and 4, we presented the evaluation of elementary functions [128].
Range reduction techniques such as those presented in Chapter 4 are used to
bring the input within a linear range. In contrast, little attention has been paid to the efficient approximation of compound functions for special-purpose applications. Examples of such applications include N-body simulation [63], channel coding [74], Gaussian noise generation [86] and image registration [158]. In principle, these compound functions can be evaluated by splitting them into several elementary functions, but this approach results in long delays, propagation of rounding errors and the possibility of catastrophic cancellation [55]. Range reduction is not feasible for compound functions (unless the sub-functions are computed one by one), so highly non-linear regions of a function need to be handled
as well. Since we are looking at the entire function over a given input range,
the advantages of our method increase significantly as compound functions be-
come more complex. We present an efficient adaptive hierarchical segmentation
scheme based on piecewise polynomial approximations that caters well for these
non-linear regions. We illustrate our method with the following four functions:
f1 = √(−log(x))  (5.1)

f2 = x log(x)  (5.2)

f3 = (0.0004x + 0.0002) / (x^4 − 1.96x^3 + 1.348x^2 − 0.378x + 0.0373)  (5.3)

f4 = cos(πx/2)  (5.4)
where x is an n-bit number over [0, 1) of the form 0.x_{n−1}x_{n−2}..x_0. The function f1
is used in the Box-Muller algorithm for the generation of Gaussian noise (Chap-
ter 6), and f2 is commonly used for entropy calculation such as mutual informa-
tion computation in image registration [158]. The trigonometric function f4 is
widely used in many applications including Gaussian noise generation, robot arm
control [109] and direct digital frequency synthesizers [112].
Note that the functions f1 and f2 cannot be computed for x = 0, therefore we
approximate these functions over (0, 1) and generate an exception when x = 0. In
this chapter, we implement an n-bit in, n-bit out system. However, the position
of the decimal (or binary) point in the input and output formats can be different
in order to maximize the precision that can be described.
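For concreteness, the four functions can be written out in Python (a hypothetical software reference model, not part of the hardware design), with the x = 0 exception for f1 and f2 made explicit:

```python
import math

def f1(x):
    # f1 = sqrt(-log(x)); undefined at x = 0, so raise an exception
    if x == 0:
        raise ValueError("f1 is approximated over (0, 1) only")
    return math.sqrt(-math.log(x))

def f2(x):
    # f2 = x log(x); also excluded at x = 0
    if x == 0:
        raise ValueError("f2 is approximated over (0, 1) only")
    return x * math.log(x)

def f3(x):
    # f3 = (0.0004x + 0.0002) / (x^4 - 1.96x^3 + 1.348x^2 - 0.378x + 0.0373)
    num = 0.0004 * x + 0.0002
    den = ((x - 1.96) * x + 1.348) * x * x - 0.378 * x + 0.0373
    return num / den

def f4(x):
    # f4 = cos(pi * x / 2)
    return math.cos(math.pi * x / 2)
```

The denominator of f3 is evaluated in Horner form, mirroring the polynomial evaluation style used by the hardware architecture later in this chapter.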
The principal contribution of this chapter is a systematic method for pro-
ducing fast and efficient hardware function evaluators for both compound and
elementary functions using piecewise polynomial approximations with a hierar-
chical segmentation scheme. The novelties of our work include:
• an algorithm for locating optimum segment boundaries given a function,
input interval and maximum error;
• a scheme for piecewise polynomial approximations with a hierarchy of seg-
ments;
• evaluation of this method with four compound functions;
• hardware architecture and implementation of the proposed method.
The rest of this chapter is organized as follows: Section 5.2 covers related work.
Section 5.3 explains how our algorithm finds the optimum placement of the seg-
ments. Section 5.4 presents our hierarchical segmentation scheme. Section 5.5
describes our hardware architecture. Section 5.6 analyzes the various errors involved in our approximations. Section 5.7 examines the effect of polynomial degrees. Section 5.8 discusses evaluation and results, and Section 5.9 offers a summary.
5.2 Related Work
Approximations using uniform segments are suitable for functions with linear
regions, but are inefficient for non-linear functions, especially when the function
varies exponentially. It is desirable to choose the boundaries of the segments
to cater for the non-linearities of the function. Highly non-linear regions may
need smaller segments than linear regions. This approach minimizes the amount
of storage required to approximate the function, leading to more compact and
efficient designs. We use a hierarchy of uniform segments (US) and powers of two
segments (P2S), that is segments with the size varying by increasing or decreasing
powers of two.
Similar approaches to ours have been proposed for the approximation of the
non-linear functions in logarithmic number systems (LNS). Henkel [61] divides the interval into four arbitrarily placed segments based on the non-linearity of the function. The segment address for a given input is obtained from another function that approximates the segment number. This method only works if the number of segments is small and the desired accuracy is low. Also, the
function for approximating the segment addresses is non-linear, so in effect the
problem has been moved into a different domain. Coleman et al. [26] divide the
input interval into seven P2S that decrease by powers of two, and employ constant
numbers of US nested inside each P2S, which we call P2S(US). Lewis [100] divides
the interval into US that vary by multiples of three, and each US has variable
numbers of uniform segments nested inside, which we call US(US). However, in
both cases the choice of inner and outer segment numbers is done manually, and a
more efficient segmentation can be achieved using our segmentation scheme. We
generalize the idea of hierarchical segmentation and provide a systematic way of
partitioning a function.
5.3 Optimum Placement of Segments
The problem of piecewise approximation with variable segment boundaries has
received considerable attention in the mathematical literature, especially with the
theory of splines [15], [42]. To quote Rice [150]: “The key to the successful use
of splines is to have the location of knots as variables.” This section introduces a
method for computing the optimum placement of segments for function approxi-
mation. We shall use it as a reference in comparing the uniform segment method
and our proposed method, as shown in Table 5.2 (Section 5.4). Let f be a contin-
uous function on [a, b], and let an integer m ≥ 2 specify the number of contiguous
segments into which [a, b] has been partitioned: a = u0 ≤ u1 ≤ ... ≤ um = b.
Let d be a non-negative integer and let P_i denote the set of polynomials p_i of degree at most d. For i = 1, ..., m, define

h_i(u_{i−1}, u_i) = min_{p_i ∈ P_i} max_{u_{i−1} ≤ x ≤ u_i} |f(x) − p_i(x)|.  (5.5)
Let emax = emax(u) = max1≤i≤m hi(ui−1, ui). The segmented minimax approxi-
mation problem is that of minimizing emax over all partitions u of [a, b]. If the
error norm is a non-decreasing function of the length of the interval of approximation, the function to be approximated is continuous, and the goal is to minimize the maximum error norm on each interval, then a balanced error solution is locally optimal. The term “balanced error” means that the error norms
on each interval are equal [81]. One class of algorithms to tackle this problem is
based on the remainder formula [42] and assumes that the (d + 1)th derivative
of f is either of fixed sign or bounded away from zero [140]. However, in many
practical cases this assumption does not hold [138]. Often, the (d + 1)th deriva-
tive may be zero or very small over most of [a, b] except a few points where it has
very large values. This is precisely the case with the non-linear functions we are
approximating.
In Lawson's paper [81], an iterative technique for finding the balanced error solution is presented. However, his technique has a rather serious defect: if at some intermediate step of the algorithm an interval with zero error norm (or one much smaller than the others) is found, the method fails. This turns out to be a common occurrence in various practical applications [138]. Pavlidis and Maika present a better scheme in their paper [142] which results in a suboptimal balanced error solution. It is based on an iteration of the form

u_i^{k+1} = u_i^k + c(e_{i+1}^k − e_i^k),  i = 1, ..., m − 1.  (5.6)

Here u_i^k is the value of the i-th boundary at the k-th iteration, e_i^k is the error on (u_{i−1}^k, u_i^k] and c is an appropriate small positive number. It can be shown that for
sufficiently small c the scheme converges to a solution [142]. A reasonable choice for c is the inverse of the change in error norm divided by the size of the boundary motion that caused it. We have implemented this scheme in MATLAB,
where the function to be approximated f , the interval of approximation [a, b],
the degree d of the polynomial approximations and the number of segments m
are given as inputs. The program outputs the segment boundaries u1..m−1 and
the maximum absolute error emax. Our tests show that this scheme requires
large numbers of iterations for a reasonable value of m and balance criterion
(deviation of errors of the segments), and often fails to converge. In addition, for
our purposes we would like to give f , [a, b], d and emax as inputs and obtain m
and u1..m−1.
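The update (5.6) is straightforward to prototype. The sketch below (a hypothetical Python rendering; `seg_error` is a caller-supplied stand-in for the per-segment error norm) nudges each interior boundary by c times the error imbalance of its two neighboring segments:

```python
def balance_step(u, seg_error, c):
    """One iteration of update (5.6): u[0] and u[-1] are the fixed
    endpoints a and b; seg_error(x1, x2) is the error norm on [x1, x2]."""
    e = [seg_error(u[i], u[i + 1]) for i in range(len(u) - 1)]
    # each interior boundary moves toward the neighbor with the larger error
    return ([u[0]]
            + [u[i] + c * (e[i] - e[i - 1]) for i in range(1, len(u) - 1)]
            + [u[-1]])
```

For f(x) = x^2 with the chord error norm (x2 − x1)^2/4, the balanced solution has equal-length segments, and repeated application of balance_step converges to it for sufficiently small c, illustrating both the convergence behavior and the sensitivity to c described above.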
We have developed a novel algorithm to find the optimum boundaries for a
given f , [a, b], d, emax and unit in the last place (ulp). The ulp is the least
significant bit of the fraction of a number in its standard representation. For
instance, if a number has F fractional bits, the ulp of that number would be
2^−F. The ulp is required as an input, since the input is quantized to n bits.
The MATLAB code for the algorithm is shown in Figure 5.1. The algorithm is
based on binary search and finds the optimum boundaries over [a, b]. We first set
x1 = a and x2 = b and find the minimax approximation over the interval [x1, x2].
If the error e of this approximation is larger than emax, we set x2 = (x1 + x2)/2
and obtain the error for [a, x2]. We keep halving the interval of approximation
until e ≤ emax. At this point we increment x2 by a small amount and compute
the error again. This small amount is either abs(x2 − prev_x2)/2 or the ulp, whichever is smaller (prev_x2 is the value of x2 in the previous iteration). When this small amount is the ulp, in subsequent iterations x2 will keep oscillating around the ideal (un-quantized) boundary. We take the x2 whose error e is just below
emax as our boundary, set x1 = x2 and x2 = b, and move on to approximating
over [x1, x2]. This is performed until the error over [x1, x2] is less than or equal
to emax and x2 has the same value as b. We can see that the boundaries up to the last one are optimum for the given ulp (the last segment is always smaller than its optimum size, as can be seen in Figure 5.2 for f2). Although our segments are
not optimum in the sense that the errors of the segments are not fully balanced,
we can conclude that given the error constraint emax and the ulp, the placement
of our segment boundaries is optimum. This is because the maximum error we
obtain is less than or equal to emax and this is not achievable with fewer segments.
The results of our segmentation can be used for various other applications [138]
including pattern recognition [37], [141], data compression, non-linear filtering
and picture processing [140].
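The essence of the algorithm can be sketched as follows (a simplified Python version under stated assumptions: the chord-based `err` is only a stand-in for a true minimax fit, and the halve/grow refinement is condensed relative to Figure 5.1):

```python
def segment_boundaries(f, a, b, e_max, ulp):
    """Greedy left-to-right segmentation: each segment is grown to the
    largest width whose approximation error stays within e_max."""
    def err(x1, x2):
        # stand-in for the minimax error: max deviation of f from the
        # chord through (x1, f(x1)) and (x2, f(x2)), sampled at 65 points
        if x2 <= x1:
            return 0.0
        y1, y2 = f(x1), f(x2)
        slope = (y2 - y1) / (x2 - x1)
        return max(abs(f(x1 + t * (x2 - x1)) - (y1 + slope * t * (x2 - x1)))
                   for t in (i / 64.0 for i in range(65)))

    boundaries, x1 = [], a
    while x1 < b:
        x2 = b
        while err(x1, x2) > e_max and x2 - x1 > ulp:  # halve until feasible
            x2 = x1 + (x2 - x1) / 2.0
        step = (b - x2) / 2.0                          # then grow back up
        while step >= ulp:
            if err(x1, min(x2 + step, b)) <= e_max:
                x2 = min(x2 + step, b)
            step /= 2.0
        x2 = min(max(x2, x1 + ulp), b)
        boundaries.append(x2)
        x1 = x2
    return boundaries
```

On a linear function the routine returns a single segment, while on a curved function such as x^2 the segment widths adapt to the local curvature, mirroring the behavior of the MATLAB algorithm in Figure 5.1.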
In the ideal case, one would use these optimum boundaries to approximate
the functions. However, from a hardware implementation point of view, this can be impractical. The circuit to find the right segment for a given input could
be complex, hence large and slow. Nevertheless, the optimum segments give us
an indication of how well a given segmentation scheme matches the optimum
segmentation. Moreover, they provide information on the non-linearities of a
function. Figure 5.2 shows the optimum boundaries for the four functions in
Section 5.1 for 16-bit operands and second order approximations. We observe that
f1 needs more segments in the regions near 0 and 1, f2 requires more segments
near 0 and f3 requires more segments in the two regions in the lower and upper
half of the interval.
In Figure 5.3 and Figure 5.4, we observe how the optimum number of segments changes for first and second order approximations as the bit width increases. We can see that the numbers grow exponentially. The interesting observation is
that for all four functions, the optimum segment numbers vary by a factor of
around 4 per bit for first order and 1.6 per bit for second order approximations.
Therefore as the bitwidths get larger, the memory savings of using second order
approximations get larger (Figure 5.5).
Figure 5.5 compares the ratio of the number of optimum segments required by
first and second order approximations for 8, 12, 16, 20 and 24-bit approximations
to the four functions. We can see that savings of second order approximations
get larger as the bit width increases. However one should note that, whereas first
order approximations involve one multiply and one add, second order approxima-
tions involve two multiplies and two adds. Therefore, there is a tradeoff between
the look-up table size and the circuit complexity. For low latency and low accu-
racy applications, first order approximations may be appropriate. Second order
approximations may be suitable for applications that require small look-up tables
and high accuracies.
% Inputs: a, b, d, f, e_max, ulp
% Output: u()
x1 = a; x2 = b; m = 1; check_x2 = 0; prev_x2 = a;
oscillating = 0; done = 0;
while (~done)
e = minimax(f,d,x1,x2,ulp);
if (e <= e_max)
if (x2 == b)
u(m) = x2;
done = 1;
else
if (oscillating)
u(m) = x2;
prev_x2 = x2;
x1 = x2;
x2 = b;
m = m+1;
oscillating = 0;
else
change_x2 = abs(x2-prev_x2)/2;
prev_x2 = x2;
if (change_x2 > ulp)
x2 = x2 + change_x2;
else
x2 = x2 + ulp;
end
end
end
else
change_x2 = abs(x2-prev_x2)/2;
prev_x2 = x2;
if (change_x2 > ulp)
x2 = x2 - change_x2;
else
x2 = x2 - ulp;
if (check_x2 == x2)
oscillating = 1;
else
check_x2 = x2;
end
end
end
end
Figure 5.1: MATLAB code for finding the optimum boundaries.
[Figure 5.2 plots omitted: f1(x), f2(x), f3(x) and f4(x) over [0, 1) with their optimum segment boundaries marked.]
Figure 5.2: Optimum locations of the segments for the four functions in Section 5.1 for 16-bit operands and second order approximation.
[Figure 5.3 plot omitted: number of optimum segments (up to about 6,000) against operand bit width (8 to 24) for f1–f4.]
Figure 5.3: Numbers of optimum segments for first order approximations to the functions for various operand bitwidths.
[Figure 5.4 plot omitted: number of optimum segments (up to about 400) against operand bit width (8 to 24) for f1–f4.]
Figure 5.4: Numbers of optimum segments for second order approximations to the functions for various operand bitwidths.
[Figure 5.5 plot omitted: first order / second order segment-count ratio (up to about 25) against operand bit width (8 to 24) for f1–f4.]
Figure 5.5: Ratio of the number of optimum segments required for first and second order approximations to the functions.
5.4 The Hierarchical Segmentation Method
Let a segmentation scheme Λ ∈ {US, P2S}, where US denotes uniform segments and P2S denotes powers of two segments. The proposed segment hierarchy H is of the form Λ0(Λ1(...(Λλ−1))), where λ is the number of levels in the hierarchy. This structure can be implemented as a cascade of look-up tables, where the output of one table is used as the address of the next. Let i = 0..λ. The input x is split into λ + 1 partitions called δi. Let vi denote the bit width and si the number of segments of the ith partition δi. Therefore, n = Σ_{i=0}^{λ} vi, where n is the number of bits of the input x. Then si is defined by the following set of equations:

si = 2^vi, if Λi = US  (5.7)

si ≤ 2vi, if Λi = P2S  (5.8)
For US, it is clear that 2^vi segments can be formed, since uniform segments are addressed with vi bits. However for P2S, it is not so clear why up to 2vi (twice vi) segments can be formed. Consider the case when Λ0 = P2S, n = 8, v0 = 5 and v1 = 3. When v0 = 5 it is possible to construct 10 P2S as illustrated in Table 5.1. Notice
that the segment sizes increase by powers of two till “01111111” (end of location
4) and start decreasing by powers of two from “10000000” (beginning of location
5) until the end. It can be seen that the maximum number of P2S that can be
constructed with δi is 2vi. Fewer segments can be obtained by omitting parts
of the table. For example with locations 0-4, one can have segments that only
increase by powers of two. To compute the segment address for a given input
δ0, we need to detect the leading zeros for locations 0-4, and leading ones for
locations 5-9. A simple cascade of AND and OR gates and a 1-bit multi-operand
adder can be used to find the segment address for a given input δi as shown in
Table 5.1: The ranges for P2S addresses for Λ0 = P2S, n = 8, v0 = 5 and v1 = 3. In each row, the bits before the vertical bar determine the segment address.
P2S address range
0 0 0 0 0 0 | 0 0 0 ∼ 0 0 0 0 0 | 1 1 1
1 0 0 0 0 1 | 0 0 0 ∼ 0 0 0 0 1 | 1 1 1
2 0 0 0 1 | 0 0 0 0 ∼ 0 0 0 1 | 1 1 1 1
3 0 0 1 | 0 0 0 0 0 ∼ 0 0 1 | 1 1 1 1 1
4 0 1 | 0 0 0 0 0 0 ∼ 0 1 | 1 1 1 1 1 1
5 1 0 | 0 0 0 0 0 0 ∼ 1 0 | 1 1 1 1 1 1
6 1 1 0 | 0 0 0 0 0 ∼ 1 1 0 | 1 1 1 1 1
7 1 1 1 0 | 0 0 0 0 ∼ 1 1 1 0 | 1 1 1 1
8 1 1 1 1 0 | 0 0 0 ∼ 1 1 1 1 0 | 1 1 1
9 1 1 1 1 1 | 0 0 0 ∼ 1 1 1 1 1 | 1 1 1
Figure 5.6. The appropriate taps are taken from the cascades depending on the
choice of the segments and are added to work out the P2S address. For P2S that
increase and decrease by powers of two, the full circuit is used, and for P2S that
decrease only to the left side (P2SL), just the AND gates are used. Similarly for
P2S that decrease to the right side (P2SR), the cascade OR gates are used. These
circuits can be pipelined and a circuit with shorter critical path but requiring
more area can be used [80]. Note that in the last partition, δλ is not used as
an address. If Λi = US, then δi+1 uses the next set of bits vi+1. However if
Λi = P2S, then the location of δi+1 depends on the value of δi. Let j denote the
P2S address, where j = 0..si− 1. From the vertical lines in Table 5.1, we observe
that δi+1 should be placed after a0 for j = 0 and j = si − 1, after a_{j−1} for j = 1 to j = (si/2) − 1, and after a_{si−2−j} for j = si/2 to j = si − 2.

[Figure 5.6 diagram omitted: two prefix cascades over the bits a_{v−1}..a0 feeding a 1-bit multi-operand adder.]
Figure 5.6: Circuit to calculate the P2S address for a given input δi, where δi = a_{v−1}a_{v−2}..a0. The adder counts the number of ones in the output of the two prefix circuits.
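The prefix-cascade circuit of Figure 5.6 effectively counts leading zeros or leading ones. A behavioural model of the address mapping in Table 5.1 can be sketched in Python (a hypothetical software reference, not the hardware itself):

```python
def p2s_address(delta0, v):
    """Map the v-bit field delta0 to its P2S segment address (Table 5.1).
    Segment sizes grow by powers of two up to the midpoint of the range,
    then shrink again, giving at most 2*v segments."""
    bits = [(delta0 >> (v - 1 - i)) & 1 for i in range(v)]  # MSB first
    if bits[0] == 0:
        # count leading zeros: lower half, addresses 0 .. v-1
        clz = next((i for i, bit in enumerate(bits) if bit == 1), v)
        return v - clz
    # count leading ones: upper half, addresses v .. 2v-1
    clo = next((i for i, bit in enumerate(bits) if bit == 0), v)
    return (v - 1) + clo
```

For v = 5 this reproduces the ten rows of Table 5.1: for example, the field 00000 maps to address 0, 0001x to address 2, and 11111 to address 9.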
In principle it is possible to have any number of levels of nested Λ, as long as Σ_{i=0}^{λ} vi ≤ n. The more levels are used, the closer the total number of segments
m will be to the optimum. However as λ (the number of levels) increases the
partitioning problem becomes more complex, and the cascade of look-up tables
gets longer, increasing the delay to find the final segment. Therefore there is
a tradeoff between the partitioning complexity, delay and m. Our tests with
the functions we consider in this chapter show that the rate of reduction of m
decreases rapidly as λ increases. λ = 2 gives an m very close to the optimum with
acceptable partitioning complexity and delay. Moreover, λ > 2 gives diminishing
returns in terms of small improvement in m with high partitioning complexity
and long delays. Therefore in this work, we limit ourselves to λ = 2, which
consists of one outer segment Λ0 and one inner segment Λ1. P2S is used as the
outer segment if the function varies exponentially in the beginning and the end
of the interval. P2SL and P2SR are used as the outer segment when the function
varies exponentially at the beginning or at the end respectively. US is used if
the function is non-linear in arbitrary regions. Although we limit ourselves to λ = 2, higher levels of hierarchy could be useful for certain functions.
In Section 6.4 (Chapter 6), we approximate the functions √(−log(x)) with P2S and cos(πx/2) with US(P2S), which are needed by the Box-Muller algorithm. These two schemes are found to be sufficient to generate high quality noise samples. However, these schemes are perhaps inappropriate for applications that require high accuracies, since when P2S is used as the innermost segmentation, the segments in the middle regions are large, causing large errors. Moreover, the address calculation circuit is needed for P2S, so P2S should be avoided if the saving over US is small. US(P2S(US)) could be useful for cases when there are highly non-linear regions in the middle parts of the function. The hierarchy schemes we have chosen are H ∈ {P2S(US), P2SL(US), P2SR(US), US(US)}. These four schemes cover most of the non-linear functions of interest.
We have implemented the hierarchical segmentation method (HSM) in MAT-
LAB, which deals with the four schemes. The program called HFS (hierarchical
function segmenter) takes the following inputs: the function f to be approx-
imated, input range, operand size n, hierarchy scheme H, number of bits for
the outer segment v0, the requested output error emax, and the precision of the
polynomial coefficients and the data paths. HFS divides the input interval into
outer segments whose boundaries are determined by H and v0. HFS finds the
minimum number of bits v1 for the inner segments for each outer segment, which
meets the requested output error constraint. For each outer segment, HFS starts
with v1 = 0 and computes the error e of the approximation. If e > emax then v1 is
incremented and the error e for each inner segment is computed, i.e. the number
of inner segments is doubled in every iteration. If it detects that e > emax it incre-
ments v1 again. This process is repeated until e ≤ emax for all inner segments of
the current outer segment. This is the point at which HFS obtains the minimum
number of bits for the current outer segment. HFS performs this process for all
outer segments. The main MATLAB code for finding the hierarchical boundaries
and their polynomial coefficients is shown in Figure 5.7. Note that minimax2
takes the precisions of the polynomial coefficients and data paths into account.
The outer boundaries are determined by H and v0.
Experiments are carried out to find the minimum number of bits for v0. Fig-
ure 5.8 shows how the total number of segments varies with v0 for 16-bit second
order approximation to f3. We can observe a U-shaped curve: there is a value of v0 at which the number of segments is at a minimum, which is five bits in this particular case.
When v0 is too small, there are not enough outer segments to cater to local non-
linearities. When v0 is too large, there are too many unnecessary outer segments.
Note that when v0 = 0, it is equivalent to using standard uniform segmenta-
tion. Figure 5.9 shows the segmented functions obtained from HFS for 16-bit
second order approximations to the four functions. It can be seen that the seg-
ments produced by HFS closely resemble the optimum segments in Figure 5.2.
Table 5.2 shows a comparison in terms of numbers of segments for various second
order approximations for uniform, HSM, and the optimum number of segments.
Double precision is used for the data paths and the output for this comparison.
We can see that HSM is significantly more efficient for the first three functions
than using uniform segments, and the difference between the optimum ones are
around a factor of two. However, for f4, the improvements over uniform segments
are small due to the function being very linear. Looking at the results for 24-bit
approximation to f1, we can see that HSM performs worse than average. This is
due to the fact that insufficient bits are left for δ1 (19 bits are already used for δ0).
Figure 5.10 shows our design flow for approximating functions. First the
user supplies the following to the HFS: f , input range, H, n, v0, emax, and the
precision of the polynomial coefficients and the data paths. HFS computes the
segment boundaries and the polynomial coefficients and stores the data into a
file. It also provides the user with a report, which contains the total number of
segments m, maximum error, percentage of exactly rounded results, and the sizes
of the multipliers, adders and look-up tables. There is a parameterizable reference
design template library for the four hierarchy schemes defined by H for first
and second order approximations. A design generator instantiates the relevant
reference design templates with information from the data file and generates the
hardware design in VHDL.
An interesting aspect of our approach is that it could be used to accelerate applications that involve pure floating-point calculations, such as software applications. This is because our method computes compound functions at once
using polynomial approximations, instead of decomposing the compound func-
tions into sub-functions and computing the sub-functions one by one. Versions of
FastMath [44] used P2S to approximate the non-linear functions in logarithmic
number systems (LNS) to speed up software applications without the use of a
coprocessor.
% Inputs: d, f, e_max, ulp, v0, H, n, precisions
% Outputs: hier_boundaries_table, poly_coeffs_table (initialized empty)
for i=1:(length(outer_boundaries)-1)
  x1 = outer_boundaries(i);
  x2 = outer_boundaries(i+1);
  hier_boundaries = x1;
  [e, poly_coeffs] = minimax2(f,d,x1,x2,ulp);
  if (e > e_max)
    outer_seg_size = x2-x1;
    v1 = 1;
    while (e > e_max)
      inner_seg_size = outer_seg_size/(2^v1);
      hier_boundaries = [];
      poly_coeffs = [];
      e = 0; % track the worst error over all inner segments
      for j=1:2^v1
        x1 = outer_boundaries(i) ...
             + (inner_seg_size*j) - inner_seg_size;
        x2 = x1 + inner_seg_size;
        [e_j, pc] = minimax2(f,d,x1,x2,ulp);
        e = max(e, e_j);
        hier_boundaries(j,:) = x1;
        poly_coeffs(j,:) = pc;
      end
      v1 = v1 + 1; % double the number of inner segments
    end
  end
  hier_boundaries_table ...
      = [hier_boundaries_table; hier_boundaries];
  poly_coeffs_table ...
      = [poly_coeffs_table; poly_coeffs];
end
Figure 5.7: Main MATLAB code for finding the hierarchical boundaries and their
polynomial coefficients.
[Figure 5.8 plot omitted: total number of segments (about 100 to 550) against v0 (0 to 10), showing a U-shaped curve with a minimum at v0 = 5.]
Figure 5.8: Variation of total number of segments against v0 for a 16-bit second order approximation to f3.
[Figure 5.9 plots omitted: the four segmented functions over [0, 1).]
Figure 5.9: The segmented functions generated by HFS for 16-bit second order approximations. f1, f2, f3 and f4 employ P2S(US), P2SL(US), US(US) and US(US) respectively. The black and grey vertical lines are the boundaries for the outer and inner segments respectively.
Table 5.2: Number of segments for first and second order approximations to the four functions. Results for uniform, HSM and optimum are shown.
function order operand uniform HSM optimum HSM
width segments segments segments /optimum
f1 1 8 64 13 7 1.86
12 4,096 78 35 2.23
16 65,536 395 161 2.45
20 1,048,576 1,876 723 2.59
24 33,554,432 8,608 2,302 3.74
2 8 8 5 4 1.25
12 1,024 23 15 1.53
16 32,768 72 44 1.64
20 524,288 218 126 1.73
24 16,777,216 742 287 2.59
f2 1 8 32 19 11 1.73
12 512 93 45 2.07
16 8,192 381 181 2.10
20 131,072 1533 724 2.12
24 2,097,152 6,141 2,896 2.12
2 8 8 5 4 1.25
12 128 15 10 1.50
16 2,048 44 26 1.69
20 32,768 124 66 1.88
24 524,288 315 167 1.89
f3 1 8 256 36 20 1.80
12 1,024 172 81 2.12
16 4096 683 303 2.25
20 16,384 2,723 1,296 2.10
24 65,536 10,609 5,182 2.05
2 8 64 20 10 2.00
12 256 41 24 1.71
16 512 107 59 1.81
20 1,024 234 151 1.55
24 2,048 573 379 1.51
f4 1 8 8 7 5 1.40
12 32 27 20 1.35
16 128 110 77 1.43
20 512 435 307 1.42
24 2048 1,739 1,228 1.42
2 8 4 3 2 1.50
12 8 7 4 1.75
16 16 15 10 1.50
20 64 45 23 1.96
24 128 111 58 1.91
[Figure 5.10 diagram omitted: design flow from User Input through the Hierarchical Function Segmenter, Data File, Design Generator (with Reference Design Library), Synthesis, and Place and Route to the final Hardware and Report.]
Figure 5.10: Design flow of our approach.
5.5 Architecture
The architecture of our function evaluator for HSM is shown in Figure 5.11. The
P2S unit performs the P2S address calculation (Figure 5.6) on δ0 if δ0 is of type
P2S (Λ0 = P2S). If Λ0 = US, δ0 is bypassed. The bit selection unit selects the
appropriate bits from the input based on the values of v0 and v1. This variable
bit selection is implemented using a barrel shifter. There are two look-up tables:
one used for storing the v1 values and the offset (ROM0), and the other storing
the polynomial coefficients (ROM1). The offset in ROM0 stores the starting
address in ROM1 for the different δ0 values. The depth s0 of ROM0 is defined in
Equations (5.7) and (5.8), and the depth m of ROM1 is the total number of segments. The sizes of the two look-up tables are defined as follows:

ROM0 = (⌈log2(max(v1))⌉ + ⌈log2(max(offset))⌉) × s0  (5.9)

ROM1 = (Σ_{i=0}^{d} wi) × m.  (5.10)
In practice, ROM0 is significantly smaller than ROM1, since the depth is
bounded by v0 and the entries v1 and offset are small. There is an interesting
tradeoff factor for ROM1: the wider the widths of the coefficients w, the fewer
segments m are needed, since the approximations will be more accurate. However,
if w is over a certain threshold, it has negligible effect on m. It is desirable to
find the right widths that minimize the total ROM size.
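As a quick sanity check, equations (5.9) and (5.10) can be evaluated numerically (a hypothetical helper; the parameter names follow the text, and the example values below are illustrative only):

```python
import math

def rom_sizes(max_v1, max_offset, s0, coeff_widths, m):
    """Total bits of ROM0 and ROM1 per equations (5.9) and (5.10).
    coeff_widths holds w_0..w_d, the widths of the d+1 coefficients."""
    entry_bits = (math.ceil(math.log2(max_v1))
                  + math.ceil(math.log2(max_offset)))
    rom0 = entry_bits * s0          # (5.9): one (v1, offset) pair per outer segment
    rom1 = sum(coeff_widths) * m    # (5.10): one coefficient set per segment
    return rom0, rom1
```

For instance, with max(v1) = 8, max(offset) = 64, s0 = 32 outer segments, three 16-bit coefficients and m = 100 segments, ROM0 is (3 + 6) × 32 = 288 bits while ROM1 is 48 × 100 = 4800 bits, confirming that ROM0 is much the smaller of the two.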
Let bj denote the boundaries of the outer segments where j = 0..m − 1, and let θ = max(bj+1 − bj) be the maximum width of an outer segment. Instead of approximating each interval over [bj, bj+1), we perform the translation x' = x − bj, which translates the interval [bj, bj+1) to [0, θ). This reduces the widths of the data paths, since x' ∈ [0, θ) requires fewer bits to represent than x.
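The translated-argument evaluation can be illustrated with a small Horner-form sketch (hypothetical coefficient values; in the real datapath the translated argument is a narrow fixed-point quantity rather than a float):

```python
def eval_segment(coeffs, b_j, x):
    """Evaluate a segment polynomial at x using the translated argument
    t = x - b_j, in Horner form: ((c_d*t + c_{d-1})*t + ...) + c_0."""
    t = x - b_j  # t lies in [0, theta), so it needs fewer bits than x
    acc = 0.0
    for c in coeffs:  # coefficients listed highest degree first
        acc = acc * t + c
    return acc
```

For example, with coefficients [2, 3, 1] (the polynomial 2t^2 + 3t + 1) and boundary b_j = 0.5, evaluating at x = 0.75 uses only t = 0.25.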
Highly non-linear functions such as f1, which have regions varying exponentially toward infinity, have coefficients with a large dynamic range. For instance, the largest coefficient c2 of a 24-bit second order approximation to f1 is on the order of 10^12. In such cases, floating-point arithmetic is needed. For f2 and f3, where
the ranges of the coefficients are relatively small, standard fixed-point arithmetic
is used.
For high throughput applications, the P2S unit, the multipliers and the adders
can be pipelined. For typical applications targeting FPGAs, the two ROMs are small and can be implemented on-chip using distributed RAM or block RAM. Often the multiplier is the part taking up a significant portion of the area. The size of the multipliers depends on the width of v1 + v2 and the
coefficients. Recent FPGAs, such as Xilinx Virtex-II or Altera Stratix devices,
provide dedicated hardware resources for multiplication which can benefit the
proposed architecture.
5.6 Error Analysis
The error of the approximation etotal = f(x)− p(x) is the difference between the
ideal mathematical value and the approximation. In our work, we regard IEEE
double precision floating-point as the exact value. There are two ways the result can be rounded: exact rounding [161], where the result is rounded to the nearest, and faithful rounding [159], where it is rounded to the nearest or next nearest. Exact rounding requires etotal ≤ 0.5 ulp and faithful rounding requires etotal ≤ 1 ulp.
There is no known method for determining the accuracy required to guarantee
exactly rounded results for elementary functions, a problem known as the table
maker’s dilemma. The authors in [161] indicate that by computing an elementary
function to sufficiently high precision, it is possible to guarantee that all results
are exactly rounded. However, the degree of additional precision required is
almost doubled, greatly increasing the hardware complexity. Therefore, we have
opted for faithful rounding, which is good enough for most practical applications.
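The two rounding criteria are straightforward to state in code (a hypothetical model with F fractional bits, where 1 ulp = 2^−F):

```python
def ulp(frac_bits):
    # the unit in the last place for a number with frac_bits fractional bits
    return 2.0 ** -frac_bits

def round_nearest(x, frac_bits):
    """Round x to frac_bits fractional bits (round half to even, as
    Python's round does); the error is at most 0.5 ulp."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale
```

Exact rounding demands that the returned value always equals round_nearest applied to the infinitely precise result (error at most 0.5 ulp), whereas faithful rounding also accepts the next-nearest representable value (error at most 1 ulp).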
The total error etotal consists of the following four types of errors:
• ein due to interpolating f(x) with a polynomial;
• eco for rounding polynomial coefficients to finite precision;
• edp for rounding results from multipliers and adders in various data paths;
• erd for rounding the final result to n bits.
Thus, our error requirement for faithful rounding is:
etotal = ein + eco + edp + erd ≤ 1 ulp. (5.11)
The final rounding step erd rounds the result to the nearest (as do the other rounding steps), and thus introduces a maximum error of 0.5 ulp. So our requirement
is
etotal = ein + eco + edp ≤ 0.5 ulp. (5.12)
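To make the requirement concrete, the check can be expressed in a few lines of software (a sketch of our own; the helper names are not from the thesis). For a result with n fractional bits, one ulp is 2^−n:

```python
def ulp_error(exact, approx, frac_bits):
    """Absolute error measured in units in the last place (ulp),
    for a fixed-point result with frac_bits fractional bits."""
    return abs(exact - approx) / 2.0 ** -frac_bits

def is_faithful(exact, approx, frac_bits):
    """Faithful rounding: the returned value is one of the two
    representable neighbours of the exact result, i.e. the total
    error is below 1 ulp."""
    return ulp_error(exact, approx, frac_bits) < 1.0
```

A result off by half an ulp (2^−17 for a 16-bit fraction) is still faithful; one off by several ulp is not.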
In HSM, when we find the minimax approximation over an interval, all four
types of errors are already taken into account. This is because the user supplies
the parameters for the finite precisions of the coefficients, data paths and the
final result. Rather than computing the segments with great precision first and
then applying finite approximations, it is better to take the finite precisions into
account when the actual approximations are being made.
It is desirable to minimize the bitwidths for both the coefficients and the
data paths, which leads to size reductions in look-up tables, multipliers and
adders. Currently, the precisions of the coefficients and the internal data paths are supplied by the user. Pineiro et al. [147], and Schulte and Swartzlander Jr. [161] present an exhaustive iterative technique to find suboptimal bitwidths for the polynomial coefficients. Hauser and Purdy [59] use genetic algorithms to find the optimal coefficients. Unfortunately, both approaches become inefficient when
the output accuracy requirement is high. We plan to automate the optimization
of the bitwidths for both coefficients and data paths using bit analysis techniques
such as those presented in [29] and [47].
5.7 The Effects of Polynomial Degrees
The degrees of the polynomials play an important role when approximating func-
tions with HSM: the higher the degree, the fewer segments are needed to meet the
same error requirement. However, higher degree polynomials require more mul-
tipliers and adders, leading to higher circuit complexity and more delay. Hence,
there is a tradeoff between table size, circuit complexity and delay.
Table size. As the polynomial degree d increases, the width of the coefficient
look-up table increases linearly. One needs to store d+1 coefficients per segment.
Circuit Complexity. As d increases, one needs more adders and multipliers
to perform the actual polynomial evaluation. These increase in a linear manner:
since we are using Horner’s rule, d adders and d multipliers are required.
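As a software model of this data path (our own sketch), Horner's rule evaluates a degree-d polynomial with exactly d multiply-add stages:

```python
def horner(coeffs, x):
    """Evaluate p(x) = c0 + c1*x + ... + cd*x^d using Horner's rule.
    coeffs is [c0, c1, ..., cd]; each loop iteration corresponds to
    one multiplier and one adder stage, executed d times in total."""
    result = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        result = result * x + c
    return result
```

For a second order segment this computes p(x) = (c2·x + c1)·x + c0, i.e. two multipliers and two adders, matching the circuit complexity counts above.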
Delay. Note that the polynomial coefficients in the look-up table can be accessed
in parallel. Hence, the delay differences between the different polynomial degrees
occur when performing the actual polynomial evaluation. Due to the serial nature
of Horner’s rule, the increase in delay of the polynomial evaluation is again linear
(d adders and d multipliers).
As we vary the polynomial degree for a given approximation, the circuit complexity and delay are predictable. For the table size, however, although the number of coefficients per segment (the width of the table) is predictable, the look-up table size depends on the total number of segments (the depth of the table) as well. To explore the behavior of the table size, we have calculated table sizes at operand widths between 8 and 24 bits, and polynomial degrees from one to five. Double precision is used for the data paths to obtain these results.
The bitwidths of the polynomial coefficients are assumed to be the same as the
operand size. Mesh plots of these parameters for the four functions are shown in
Figure 5.12.
Interestingly, all four functions share the same behavior: the table size varies exponentially with both the operand width and the polynomial degree. Although first order approximations have the least
circuit complexity and delay (one adder and one multiplier), we can see that
they perform badly (in terms of table size) for operand widths of more than 16
bits. Second order approximations have reasonable circuit complexity and delay
(two adders and two multipliers) and for the bitwidths used in these experiments
(up to 24 bits), they yield reasonable table sizes for all four functions.
Another observation is that the improvements in table size of using third
or higher order polynomials are very small. For instance, looking at the 24-bit
results to f2, the table sizes of first, second and third order approximations are
294,768, 22,680 and 8,256 bits respectively. The difference between first and second order is a factor of 13.0, whereas the difference between second and third order is a factor of 2.7. Therefore, the overhead of an extra adder and multiplier stage for third order approximations may not be worthwhile for a table size reduction of just a factor of 2.7.
Hence, we conclude that for operand sizes of 16 bits or fewer, first order
approximations yield good results. For operand sizes between 16 and 24 bits,
second order approximations are perhaps more appropriate. We predict that for
operand sizes larger than 24 bits, the table size improvements of using third or
higher order polynomials will appear more dramatic.
Earlier, we compared the number of segments produced by HSM to the optimum segmentation. For first and second order approximations, the ratio of segments obtained by HSM to the optimum is around a factor of two. To explore how this ratio behaves with varying operand width and polynomial degree, we obtain the results shown in Figure 5.13. The ratios remain around a factor of two across the parameter range, with HSM showing no obvious signs of degradation.
Figure 5.11: HSM function evaluator architecture for λ = 2 and degree d approximations. Note that ':' is a concatenation operator.
Figure 5.12: Variations of the table sizes for the four functions with varying polynomial degrees and operand bitwidths.
Figure 5.13: Variations of the HSM/Optimum segment ratio with polynomial degrees and operand bitwidths.
5.8 Evaluation and Results
Table 5.3 compares HSM with direct table look-up, the symmetric bipartite table
method (SBTM) [162] and the symmetric table addition method (STAM) [63],
[167] for 16 and 24-bit approximations to f2. SBTM and STAM use bipartite
and multipartite tables to exploit the symmetry of the Taylor approximations
and leading zeros in the table coefficients to reduce the look-up table size, as
discussed in Section 2.3.3 in Chapter 2.
The uniform segmentation compared here is similar to those described in [147] and [70]. The polynomial-only method [165] is not considered here, since it requires impractically high order polynomials when the function is non-linear. For instance, in order to achieve just 8-bit accuracy for f2 with the polynomial-only method, one requires a polynomial of degree 12.
We observe that the table sizes for the direct look-up approach are infeasible when the accuracy requirement is high. SBTM/STAM significantly reduce the table sizes compared to direct table look-up, at the expense of some adders and control circuitry. In the 16-bit results, HSM4 has the smallest table size, being 546 and 8.5 times smaller than direct look-up and STAM respectively. The table size improvement of HSM is of course at the expense of more multipliers and adders, and hence higher latency. Generally, the higher the polynomial degree, the smaller the table size. However, in the 16-bit case, HSM5 actually has a larger table size than HSM4, because the extra overhead of storing one more polynomial coefficient per segment exceeds the reduction in the number of segments compared to HSM4. We observe that for the 24-bit results, the differences in table sizes between HSM and the other methods are even larger. Moreover, the reductions in table size from using higher order polynomials grow as the operand width increases (i.e. as the accuracy requirement increases). For applications that
require relatively low accuracy and latency, SBTM/STAM may be preferred. For high accuracy applications that can tolerate longer latencies, HSM would be more appropriate.
The reference design templates have been implemented using Xilinx System Generator. As mentioned in Section 5.4, these design templates are fully parameterizable: changes to the desired function, input interval, operand width or finite precision parameters automatically produce a new design. The Xilinx System Generator design templates used for first order US(US) and second order P2SL(US) are depicted in Figure 5.14 and Figure 5.15.
A variant [87] of our approximation scheme to f1 and f4, with one level of P2S
and US(P2S), has been implemented and successfully used for the generation of
Gaussian noise samples [86]. Table 5.4 contains implementation results for f2
and f3 with 16 and 24-bit operands and second order approximation, which are
mapped and tested on a Xilinx Virtex-II XC2V4000-6 FPGA. The precisions of the coefficients and the data paths have been optimized to minimize the size of the multipliers and look-up tables. The designs are fully pipelined, generating a result
every clock cycle. Designs with lower latency and clock speed can be obtained
by reducing the number of pipeline stages. The designs have been tested exhaus-
tively over all possible input values to verify that all outputs are indeed faithfully
rounded. There are many compelling arguments in the literature that if results are faithfully rounded, one should maximize the percentage that are exactly rounded. For the two designs, over 86% of the results are exactly rounded. A higher percentage can be achieved by tightening the precisions of the coefficients and data paths. We have observed that exact rounding is possible, but with a significant increase in the precisions. The data path widths, numbers of segments, table sizes and percentages of exactly rounded results for the implementations are
Figure 5.14: Xilinx System Generator design template used for first order US(US).
Figure 5.15: Xilinx System Generator design template used for second order P2SL(US).
Figure 5.16: Error in ulp for 16-bit second order approximation to f3.
presented in Table 5.5.
Although we have not synthesized designs for SBTM and STAM, we estimate that they would take significantly less area in terms of slices than HSM (since only adders and some control circuitry are required, and adders are efficiently implemented on Xilinx FPGAs using fast-carry chains), but at the expense of more block RAM usage. The difference in block RAM usage between HSM and SBTM/STAM becomes more significant as the accuracy requirement increases, as shown in Table 5.3.
Figure 5.16 shows how the error (in ulp) varies with the input for 16-bit second
order approximation to f3. We observe that most of the errors (in absolute terms)
are less than 0.5 ulp, i.e. most of the results are exactly rounded.
Our hardware implementations have been compared with software implemen-
tations (Table 5.6). The FPGA implementations compute the functions using
HSM with 24-bit operands and second order polynomials. Software implementa-
tions are written in C generating single precision floating-point numbers, and are
compiled with the GNU gcc 3.2.2 compiler [54]. This is a fair comparison in terms
of precision, since single precision floating-point has 24-bit mantissa accuracy.
For the f2 function, the FPGA implementation is 20 times faster than the
Athlon based PC in terms of throughput, and 1.3 times faster in terms of completion time. We suspect that the inferior results of the Pentium 4 PC are due to an inefficient implementation of the log function in the gcc math libraries for
the Pentium 4 CPU. Looking at the f3 function, the FPGA implementation is 90
times faster than the Athlon based PC in terms of throughput, and 7 times faster
in terms of completion time. This increase in performance gap is due to the f3
function being more 'compound' than the f2 function. Whereas a CPU computes each elementary operation of the function one by one, HSM approximates the entire function at once. Hence, the more compound a function is, the greater the advantage of HSM becomes.
Note that the FPGA implementations use only a fraction of the device (less than 2%); by instantiating multiple function evaluators on the same chip for parallel execution, we can expect even larger performance improvements.
Table 5.3: Comparison of direct look-up, SBTM, STAM and HSM for 16 and
24-bit approximations to f2. The subscript for HSM denotes the polynomial
degree, and the subscript for STAM denotes the number of multipartite tables
used. Note that SBTM is equivalent to STAM2.
operand width  method    table size [bits]  compression  multipliers  adders
16             direct        1,048,576         546.1          -          -
               SBTM             29,696          15.5          -          1
               STAM4            16,384           8.5          -          3
               HSM1             24,384          12.7          1          2
               HSM2              4,620           2.4          2          3
               HSM3              2,304           1.2          3          4
               HSM4              1,920           1.0          4          5
               HSM5              2,112           1.1          5          6
24             direct      402,653,184      77,672.3          -          -
               SBTM          2,293,760         442.5          -          1
               STAM6           491,520          94.8          -          5
               HSM1            393,024          75.8          1          2
               HSM2             40,446           7.8          2          3
               HSM3             11,008           2.1          3          4
               HSM4              6,720           1.3          4          5
               HSM5              5,184           1.0          5          6
Table 5.4: Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA
for 16 and 24-bit, first and second order approximations to f2 and f3.
function  order  operand width  speed [MHz]  latency [cycles]  slices  block RAMs  block multipliers
f2        1      16             202          11                332     2           2
                 24             160          12                897     44          4
          2      16             153          13                483     1           4
                 24             135          14                871     2           10
f3        1      16             232          8                 198     2           1
                 24             161          10                418     37          2
          2      16             198          12                234     1           3
                 24             157          13                409     3           4
Table 5.5: Widths of the data paths, number of segments, table size and percentage of exactly rounded results for 16 and 24-bit second order approximations to f2 and f3.

function  operand width  w(x)  w(C2)  w(C1)  w(C0)  w(xC2)  w(xC2C1)  w(xC2C1x)  w(px)  segments  table size  exactly
          [bits]                                                                                  [bits]      rounded [%]
f2        16             14    32     25     20     21      24        21         16     60        4,620       86
          24             22    48     31     28     31      34        31         24     378       40,446      87
f3        16             10    17     18     19     18      19        18         16     125       6,750       86
          24             16    26     27     27     27      27        27         24     764       61,120      87
Table 5.6: Performance comparison: computation of the f2 and f3 functions. The Athlon and Pentium 4 PCs are equipped with 512MB and 1GB of DDR SDRAM respectively.
function platform speed throughput completion
[MHz] [operations / second] time [ns]
f2 XC2V4000-6 FPGA 135 135 million 104
AMD Athlon PC 1400 7.14 million 140
Intel Pentium 4 PC 2600 0.48 million 2088
f3 XC2V4000-6 FPGA 157 157 million 83
AMD Athlon PC 1400 1.76 million 569
Intel Pentium 4 PC 2600 1.43 million 692
5.9 Summary
We have presented a novel method for evaluating functions using piecewise polynomial approximations with an efficient hierarchical segmentation scheme. Our method is illustrated using four non-linear compound functions: √(−log(x)), x log(x), a high order rational function and cos(πx/2). An algorithm that finds the optimum segments for a given function, input range, maximum error and ulp has been presented. The four hierarchical schemes P2S(US), P2SL(US), P2SR(US) and US(US) deal with the non-linearities that occur frequently in functions. A simple cascade of AND and OR gates can be used to rapidly calculate the P2S address for a given input. Results show the advantages of our hierarchical approach over the traditional uniform approach. Compared to other popular methods such as STAM, our approach has longer latencies due to the increased number of arithmetic operations. However, the look-up tables are considerably smaller (up to a factor of 94.8, depending on method and precision).
CHAPTER 6
Gaussian Noise Generator
using the Box-Muller Method
6.1 Introduction
The availability of high quality Gaussian random numbers is critical to many sim-
ulation, graphics and Monte Carlo applications. Numerical methods for Gaussian
random number generation have a long history in mathematics and communica-
tions. As described in [78] and the references cited therein, most methods involve
initially generating samples of a uniform random variable and then applying a
transformation to obtain samples drawn from a unit-variance, zero-mean Gaus-
sian PDF f(x) = (1/√(2π)) e^(−x²/2). In the overwhelming majority of cases, this
occurs in environments such as computer based simulation where functions such
as sine, cosine, and square roots are easily performed, and where there is sufficient
precision so that finite-word length effects are negligible.
There are many applications in which large simulations require Gaussian noise. These include financial modeling [14], simulation of economic systems [6] and molecular dynamics simulations [76]. For all of these applications, hardware
based simulation offers the potential to speed up simulation by several orders of
magnitude, but is feasible only if suitably fast and high-quality noise generators
can be implemented in environments with the limited word length, and the computational, memory and data flow properties typical of hardware systems. In
addition, while any deviation from an ideal Gaussian PDF creates the potential
for degrading the simulation results, very large simulations create particularly
stringent requirements on the quality of the PDF in the tails. Samples that lie at
large multiples of σ (standard deviations) away from the mean are by definition
extremely rare but are also exactly the noise realizations that are most likely to
induce events of high interest in understanding the behavior of the overall system.
Accurately obtaining good characteristics in the tails requires the combination of 1) an underlying method that creates high σ values with the proper frequency, and 2) a hardware implementation of the method that preserves the requisite precision at all stages to ensure that high σ behavior is not compromised.
There has been little attention focused on efficient hardware implementation
of Gaussian noise generators, as the noise in real hardware systems is of course
supplied by the environment and does not typically need to be generated inter-
nally. Recent advances in coding, however, have made the case for hardware
based simulation of channel codes much more compelling, and provide strong
motivation to examine the Gaussian noise generation problem in the framework
of limited word length, and limited computational and memory resources. For
example, computer simulations to examine LDPC code behavior can be time
consuming, particularly when the behavior at BERs in the error floor region is
being studied. Hardware based simulation offers the potential of speeding up
code evaluation by several orders of magnitude, but is feasible only if suitably
fast and high-quality noise generators can be implemented in hardware alongside
the channel decoder.
Probably the best known method for generating Gaussian noise is the Box-Muller transformation [13]. It allows us to transform uniformly distributed random variables into a new set of random variables with a Gaussian distribution. We start with two independent random numbers drawn from a uniform distribution (in the range from 0 to 1), then apply mathematical transformations to obtain two new independent random numbers which have a Gaussian distribution with zero mean and a standard deviation of one.
The principal contribution of this chapter is a hardware Gaussian noise generator based on the Box-Muller method that offers quality suitable for simulations
involving large numbers of noise samples. The noise generator occupies approx-
imately 10% of the resources on a Xilinx Virtex-II XC2V4000-6 device, while
producing over 133 million samples per second. In contrast to previous work, we
focus specific attention on the accuracy of the noise samples in the high σ regions
of the Gaussian PDF, which are particularly important in achieving accurate
results during large simulations. The key novelties of our work include:
• a hardware architecture which involves the use of non-uniform piecewise lin-
ear approximations in computing trigonometric and logarithmic functions;
• exploration of hardware implementations of the proposed architecture tar-
geting both advanced high-speed FPGAs and low-cost FPGAs;
• evaluation of the proposed approach using several different statistical tests,
including the chi-square test and the Anderson-Darling test, as well as
through application to decoding of LDPC codes.
The rest of this chapter is organized as follows. Section 6.2 covers related work.
Section 6.3 briefly reviews the Box-Muller algorithm, and discusses how each of its
steps can be handled in a hardware architecture. Section 6.4 presents a method
for function evaluation based on non-uniform segments. Section 6.5 explains
how the function evaluation method is used to compute the functions in the
Box-Muller algorithm. Section 6.6 describes the technology-specific implementation of the hardware architecture. Section 6.7 discusses evaluation and results, and Section 6.8 offers a summary.
6.2 Related Work
There is little previous work on high quality digital hardware Gaussian noise
generators. The most relevant publications are probably [12] and [186], which
discuss designs targeting FPGAs. We present a design with significantly improved
efficiency, which also passes statistical tests widely used for testing normality. In
addition, previous work produces noise samples that are targeted primarily for
the output region below about 4σ, and therefore does not specifically address the
high σ values of 4σ to 6σ and beyond; these are critical in the large simulations
motivating our work.
The Box-Muller algorithm requires the approximation of non-linear functions.
When using piecewise polynomials to approximate such functions, it is desirable
to choose the boundaries of the segments to cater for the non-linearities of the
function. Highly non-linear regions may need smaller segments than linear re-
gions. This approach minimizes the amount of storage required to approximate
the function, leading to more compact and efficient designs. We employ a novel
hierarchy of uniform segments and segments that vary by powers of two to cover
the non-linearities of different functions appropriately. Moreover, we present an architecture that is well suited to hardware implementation.
6.3 Architecture
This section provides an overview of the Box-Muller method and the associated
four-stage hardware architecture. The implementation of this architecture in
FPGA technology is presented in Section 6.6.
The Box-Muller method is conceptually straightforward. Given two indepen-
dent realizations u1 and u2 of a uniform random variable over the interval [0,1),
and a set of intermediate functions f , g1 and g2 such that
f(u1) = √(−ln(u1)) (6.1)
g1(u2) = √2 sin(2π u2) (6.2)
g2(u2) = √2 cos(2π u2) (6.3)
the products
x1 = f(u1) g1(u2) (6.4)
x2 = f(u1) g2(u2) (6.5)
provide two samples of a Gaussian distribution N(0, 1).
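These equations can be written directly as a software reference model (a sketch of our own; the hardware described below replaces the elementary-function calls with the piecewise linear approximations of Sections 6.4 and 6.5):

```python
import math

def box_muller(u1, u2):
    """Map two independent uniform samples u1, u2 in (0, 1) to two
    independent N(0, 1) samples, per equations (6.1)-(6.5)."""
    f  = math.sqrt(-math.log(u1))                       # f(u1), (6.1)
    g1 = math.sqrt(2.0) * math.sin(2.0 * math.pi * u2)  # g1(u2), (6.2)
    g2 = math.sqrt(2.0) * math.cos(2.0 * math.pi * u2)  # g2(u2), (6.3)
    return f * g1, f * g2                               # x1, x2: (6.4), (6.5)
```

For example, u2 = 0.25 makes g2 vanish, so the second output is (numerically) zero and the first is √(−ln u1) · √2.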
The above equations lead to an architecture that has four stages (Figure 6.1).
1. a shift register based uniform random number generator;
2. implementation of the functions f , g1, g2 and the subsequent multiplications;
3. a sample accumulation step that exploits the central limit theorem to overcome quantization and approximation errors; and
4. a simple multiplexor based circuit to support generation of one result per clock cycle.
A similar basic approach has been taken in other hardware Gaussian noise im-
plementations [12]; what distinguishes our work is the detail of the functional
implementation developed to deal with: (a) Gaussian noise with high σ values,
and (b) evaluations using commonly-used statistical tests.
In the following, each of the four stages in our architecture is described in
detail.
The first stage. This stage involves generation of the uniformly distributed
realizations u1 and u2. The implementation of this stage is straightforward, and
can be accomplished using well-known techniques based on Linear Feedback Shift
Registers (LFSRs) [24]. To ensure maximum randomness, we use an independent
shift register for each bit of u1 and u2. The resources needed are related to the
periodicity desired in the shift registers. Since m-bit LFSRs with irreducible
polynomials can produce random numbers with a periodicity of 2^m − 1, the hardware required will be proportional to the number of bits of precision needed in u1 and
u2.
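As an illustration, a maximal-length 16-bit LFSR in Galois form can be modelled as below. The feedback polynomial shown is one common maximal-length choice, an assumption of ours, not necessarily the taps used in the actual implementation:

```python
def lfsr16_step(state):
    """One step of a 16-bit Galois LFSR with feedback polynomial
    x^16 + x^14 + x^13 + x^11 + 1 (tap mask 0xB400). Starting from
    any nonzero state, it cycles through all 2**16 - 1 nonzero
    states before repeating."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state
```

One such register is used per bit of u1 and u2, each seeded differently, so that the bits are generated independently.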
The necessary precisions of u1 and u2 are related to the maximum σ value
that the full system will produce. Since g1 and g2 are bounded by [−√2, √2], the maximum output is determined by f, which in turn takes on its largest values when u1 is smallest. For example, when 16 bits are used for u1, the maximum possible Gaussian sample has an absolute value of 4.7σ. With the 32 bits we use in this chapter, we can get up to 6.7σ. Using more bits for u1 means that we need to approximate non-linear parts of f closer to zero. In addition, the precisions of u1, u2, g1 and g2 should be large enough so that there is enough diversity in the outputs. Low precisions will cause the statistical tests to fail.
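These maxima follow from the formulas: the smallest nonzero u1 with m bits is 2^−m, and |g1|, |g2| ≤ √2, so the largest possible sample is √2 · √(−ln 2^−m) = √(2m ln 2). A quick check (our own arithmetic, not code from the thesis):

```python
import math

def max_sigma(m_bits):
    """Largest |x| the Box-Muller unit can produce when u1 has
    m_bits of precision: sqrt(2 * m_bits * ln 2)."""
    return math.sqrt(2.0 * m_bits * math.log(2.0))

print(round(max_sigma(16), 1))  # 4.7 sigma for a 16-bit u1
print(round(max_sigma(32), 1))  # 6.7 sigma for the 32-bit u1 used here
```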
The second stage. This stage involves the most interesting challenges: efficient
implementation of the functions f , g1 and g2. Direct computation of the functions
Figure 6.1: Gaussian noise generator architecture. The black boxes are buffers.
using methods such as CORDIC leads to prohibitively long computation times. A
direct look-up table would allow outputs to be obtained in only a few clock cycles,
but this leads to prohibitively large memory requirements. For example, a look-up table for f(u1) with sufficient resolution for u1 would require 2^32 entries. Instead, we use a two-step process based on non-uniform piecewise linear approximation. Our approach is described in Sections 6.4 and 6.5.
The third stage. This stage involves a sample accumulation step that exploits
the central limit theorem to overcome quantization and approximation errors.
As is well known, given a sequence of realizations of independent and identically
distributed random variables x1, x2, ..., xl with unit variance and zero mean, the
distribution of (x1 + x2 + ... + xl)/√l tends to be normally distributed as l → ∞. We find that l = 2 is sufficient
to overcome the effects of the approximation errors, so we use an accumulator
(the ACC(2) component shown in Figure 6.1) that sums two successive inputs
to produce an output every other cycle. The central limit theorem calls for a division by √2, which is potentially problematic in hardware. Fortunately, since the computation of g1 and g2 involves a multiplication by √2 (equations (6.2) and (6.3)), this multiplication is in effect cancelled by the subsequent division, so it can be dispensed with in both places in the implementation. This optimization also alters the range of g as implemented to [−1, 1].
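The effect of the cancellation can be checked numerically with a behavioural sketch (the names are ours): if the unscaled samples have variance 1/2, i.e. f(u1)·g(u2) with the √2 factor removed from g, then the pairwise sums produced by ACC(2) have unit variance with no explicit divider.

```python
import math, random

def acc2(samples):
    """Model of the ACC(2) stage: sum pairs of successive inputs,
    producing one output for every two inputs."""
    return [samples[i] + samples[i + 1]
            for i in range(0, len(samples) - 1, 2)]

# Samples of variance 1/2 stand in for f(u1)*g(u2) with the sqrt(2)
# factor omitted; summing two of them yields unit variance.
random.seed(1)
xs = [random.gauss(0.0, 1.0) / math.sqrt(2.0) for _ in range(20000)]
ys = acc2(xs)
variance = sum(y * y for y in ys) / len(ys)
```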
The fourth stage. This stage involves a multiplexor based circuit to select one of
the two ACC(2) component outputs in alternate clock cycles. The multiplexor
is controlled by a circuit that toggles its output. This enables producing an
output every clock cycle, rather than two outputs every other cycle. The buffer
after the second ACC(2) is needed to ensure one valid noise sample is fed to
the multiplexor every clock cycle, rather than two valid samples every two clock
cycles.
Two further remarks about this architecture can be made. First, it is pos-
sible to speed up the output rate further by having multiple noise generators
running in parallel, provided that the LFSRs are initialized with different ran-
dom seeds. Second, the periodicity can be increased by using larger LFSRs and
higher σ values can be obtained using more bits for u1, both with little increase
in complexity.
6.4 Function Evaluation for Non-uniform Segmentation
This section presents a method for function evaluation based on an innovative
technique involving non-uniform segmentation. This method is a variant of the
segmentation ideas presented in Chapter 5. The interval of approximation is
divided into a set of sub-intervals, called segments. The best-fit straight line to each segment, in the minimax sense (minimizing the worst-case error), is found. A
look-up table is used to store the coefficients for each line segment, and the
functions can then be evaluated using a multiplier and an adder to calculate the
linear approximation. Uniform segmentation methods have been proposed, which
involve similar hardware [122].
With well-known methods that compute elementary functions, such as CORDIC, the evaluation of compound functions is a multi-stage process. Consider the
evaluation of the f function as defined in Equation (6.1) over the interval (0, 1)
(Figure 6.2). Using CORDIC, the computation of this function is a two-stage
process: the logarithm of x followed by the square root. With our approach, we
Figure 6.2: The f function. The asterisks indicate the boundaries of the linear approximations.
look at the entire function over the given domain, and therefore we do not need
to have two stages. As shown in Figure 6.2, the greatest non-linearities of the f
function occur in the regions close to zero and one. If uniform segments are used,
a large number of small segments would be required to get accurate approxima-
tions in the non-linear regions. However, in the middle part of the curve where it
is relatively linear, accurate approximation can be obtained using relatively few
segments. It would be efficient to use small segments for the non-linear regions,
and large segments for linear regions. Arbitrary-sized segments would enable us
to have the least error for a given number of segments; however, the hardware to
calculate the segment address for a given input can be complex. Our objective is
to provide near arbitrary-sized segments with a simple circuit to find the segment
address for a given input.
We have developed a novel method which can construct piecewise linear approximations. The main features of our proposed method include: (a) the segment lengths used in a given region depend on the local linearity, with more segments
deployed for regions of higher non-linearity; and (b) the boundaries between seg-
ments are chosen such that the task of identifying which segment to use for a
given input can be rapidly performed. The method is based on early ideas behind the hierarchical segmentation method (HSM) described in Chapter 5. It is not
as sophisticated as HSM, but is sufficient to generate high quality Gaussian noise
samples.
As an example to illustrate our approach, consider approximating f with an
8-bit input. Using the traditional approach, the most-significant bits of u are
used to index the uniform segments. For instance, if the most-significant four bits are taken, 16 uniform segments are used to approximate the function. Using our
approach, it is possible to adopt small segments for non-linear regions (regions
near 0 and 1), and large segments for linear regions (regions around 0.5). The
idea is to use segments that grow by a factor of two from 0 to 0.5, and segments
that shrink by a factor of two from 0.5 to 1 in the horizontal axis of Figure 6.2.
We use segment boundaries at locations 2^(n−8) and 1−2^(−n), where 0 ≤ n < 8. Up to
14 segments can be formed this way. A circuit based on prefix computation can
be used for calculating segment addresses (Figure 6.3, same as the circuit used for
HSM in Chapter 5) for a given input x. It checks the number of leading zeros and
ones to work out the segment address. A cascade of OR gates is used for segments
that grow by factors of two, and a cascade of AND gates is used for segments that
shrink by factors of two; these circuits can be pipelined and a circuit with shorter
critical path but requiring more area can be used [80]. Note that the segments
do not have to grow or shrink by factors of two; larger factors can be used. The appropriate
taps are taken from the cascades depending on the choice of the segments and are
added to work out the segment address. In Figure 6.3, the maximum available
taps are taken, giving 14 segment addresses. Some taps would not be taken if the
Figure 6.3: Circuit to calculate the segment address for a given input x. The
adder counts the number of ones in the output of the two prefix circuits. Note
that the least-significant bit x0 is not required.
segments grow or shrink by more than a factor of two. It can be seen that the
critical path of this circuit is the path from x6 or x7 to the output of the adder.
By introducing pipeline registers between the gates, higher throughput can easily
be achieved.
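The addressing scheme can be modelled in software. The sketch below (a behavioural Python model only; the hardware uses the OR/AND cascades and adder of Figure 6.3) maps an 8-bit input to a segment address by counting leading zeros below 0.5 and leading ones above it. Dropping the unused LSB x0, as the figure notes, merges the two narrowest low-end segments, which is one way to arrive at the 14 addresses quoted above.

```python
def segment_address(x, bits=8):
    """Software model of the prefix-circuit segment address calculator
    (behaviour of Figure 6.3, not its gate-level structure).

    For u = x / 2**bits in [0, 1), boundaries sit at 2**(n - bits) and
    1 - 2**(-n): segments double in width from 0 to 0.5 and halve from
    0.5 to 1.
    """
    x >>= 1                      # the LSB x0 is not required (Figure 6.3)
    b = bits - 1
    if x < (1 << (b - 1)):       # u < 0.5: address from leading zeros
        return x.bit_length()    # segments widen as u grows
    ones = 0                     # u >= 0.5: address from leading ones
    for i in range(b - 1, -1, -1):
        if (x >> i) & 1:
            ones += 1
        else:
            break
    return (b - 1) + ones        # segments narrow as u approaches 1

print(segment_address(0))    # 0  (narrowest segment near u = 0)
print(segment_address(255))  # 13 (narrowest segment near u = 1)
```

The address is monotonic in x, so each of the 14 values selects one contiguous segment.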
When approximating f with 32-bit inputs based on polynomials of the form
p(u) = c1 × u + c0 (6.6)
the gradient of the steepest part of the curve is on the order of 10^8, thus large
multipliers would be required. To overcome this problem, we use scaling factors
of multiples of two to reduce the magnitude of the gradient, essentially trading
precision for range. This is appropriate since the larger the gradient, the less
important precision becomes. The use of scaling factors provides the user the
ability to control the precision for both c1 and c0, resulting in variation of the
size of the multiplier and adder. Hence for each segment, four coefficients are
stored: c1 and its scaling factor, c0 and its scaling factor. Note that the precision
of the approximation p(x) depends on the maximum error desired between p(x)
and the actual function.
It is also possible to divide the input interval into uniform or non-uniform
intervals, and have uniform or non-uniform segments inside each interval. In this
case, the most-significant bits are used to address the intervals, and the least-
significant bits are used to address the segments inside each interval. It can be
seen that one can have any number of nested combinations of uniform and non-
uniform segments. This hybrid combination of nested uniform and non-uniform
segments provides a flexible way to choose the segment boundaries.
The architecture of our function evaluator, shown in Figure 6.4, is based on
first order polynomials. The most-significant bits are used to select the interval,
and the least-significant bits are passed through the segment address calculator
which calculates the segment address within the interval. The ROM outputs the
four coefficients for the chosen interval and segment. c1 is multiplied by the input
x and c_s1 is used to scale the output. The scaling circuit involves shifters, which
increase or decrease the value by powers of two. This scaled multiplication value
is added to the scaled c0 coefficient to produce the final result.
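The datapath just described can be modelled in a few lines. A behavioural Python sketch follows, with illustrative coefficient values rather than the thesis' actual tables; the shifts stand in for the scaling circuits.

```python
def evaluate(u, c1, s1, c0, s0):
    """Behavioural sketch of the Figure 6.4 datapath: a first-order
    polynomial with per-segment power-of-two scaling,
        p(u) = (c1 * u) * 2**s1 + c0 * 2**s0.
    c1 and c0 are small signed integers; s1 and s0 are the stored scale
    factors (c_s1, c_s0).  Values here are illustrative only.
    """
    prod = c1 * u                                   # small multiplier
    prod = prod << s1 if s1 >= 0 else prod >> -s1   # scaling shifter
    offs = c0 << s0 if s0 >= 0 else c0 >> -s0       # scaled y-intercept
    return prod + offs

print(evaluate(4, c1=3, s1=1, c0=5, s0=0))   # (3*4)<<1 + 5 = 29
```

In hardware the two coefficients and their scale factors come from the coefficient ROM, indexed by the interval and segment address.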
6.5 Function Evaluation for Noise Generator
This section explains in detail how the function evaluation method based on non-
uniform segmentation is used to compute the f and g functions for Gaussian
noise generation (Equations (6.1)∼(6.3)). We first consider the f function. As
stated earlier, the greatest non-linearities of this function occur in the regions
close to zero and one. To be consistent with the change in linearity, we use line
segment locations to boundaries at locations 2n−32 for 0 < u ≤ 0.5, and 1−2−n for
Figure 6.4: Function evaluator architecture based on non-uniform segmentation.
Figure 6.5: Variation of function approximation error with number of bits for the
gradient of the f function.
0.5 < u ≤ 1, where 0 ≤ n < 32. A total of 59 segments are used to approximate
this function as shown in Figure 6.2. Since f approaches infinity for u values
close to zero, the smallest u value is 2^(−32), resulting in a maximum output value
of around 4.7.
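Equations (6.1)–(6.3) fall outside this excerpt, but the quoted maximum of around 4.7 at u = 2^(−32) is consistent with f taking the form sqrt(−ln u), with the factor of two of the classic Box-Muller r = sqrt(−2 ln u) recovered later when two samples are accumulated. A quick numeric check under that assumption:

```python
import math

def f(u):
    # assumed form of f, inferred from the quoted maximum of ~4.7;
    # the thesis' exact definition is in Equations (6.1)-(6.3)
    return math.sqrt(-math.log(u))

u_min = 2.0 ** -32          # smallest representable input with 32 bits
print(round(f(u_min), 2))   # 4.71 -- the quoted maximum output value
```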
The maximum absolute error of this approximation is 0.020 (compared against
IEEE double precision). However this is the case only if we have infinite preci-
sion for the coefficients and data paths, which is not realistic. Multipliers take
significant amount of resources on FPGAs, therefore the coefficients for the gra-
dient should be as small as possible. Tests are carried out to find the optimum
number of bits for the gradient coefficients that provides the least absolute er-
ror. Figure 6.5 shows how the maximum absolute error varies with the number
of bits used for the gradient of the f function. Similar tests are performed for
the y-intercept coefficients and various data paths. The figure indicates that six
bits are sufficient to give a maximum absolute error of 0.031. Our requirement
Figure 6.6: The g functions. Only the thick line is approximated; see Figure 4.
The most significant 2 bits of u2 are used to choose which of the four regions to
use; the remaining bits select a location within Region 0.
is faithful rounding [159] (results are rounded to the nearest or next-nearest
value), where the approximation differs from the true value by less than
one ulp. This error is sufficient to give an output accuracy of eight bits
(three bits for integer and five for fraction). If uniform segments are used, small
segment size would be needed in order to cope with the highly non-linear parts
of the curve. In fact, one would require around 617 million segments to get the
same maximum absolute error with uniform segments. This is a good example to
demonstrate the effectiveness of our non-uniform approach. It is clear that our
approach works well especially for functions with exponential behavior.
The computation of g1 and g2 is carried out in a similar way. Given the
symmetry of the sine and cosine functions, the axis can be considered in four
regions related by symmetry, labeled 0 to 3 in Figure 6.6. To evaluate the func-
tions g1 and g2, due to the symmetry of the sine and cosine functions, only the
input range [0, 1/4) for g1 needs to be approximated [128]. The specific axis-
Figure 6.7: Approximation for g1 over [0, 1/4). The asterisks indicate the segment
boundaries of the linear approximations.
partitioning technique for f is unsuitable for g1, since the non-linearities of the
two functions are different. If the same technique is used, there would be many
unnecessary segments near the beginning and end of the curve, and not enough
segments in the middle regions. As before we consider both the local linearity
of the curve, and the computational concerns with respect to choosing specific
segment boundary locations, leading to the approximations shown in Figure 6.7.
The curve is divided into four uniform intervals and within each interval, non-
uniform segmentation is applied. Note that for each interval, not all taps are
taken from the segment address calculator. The boundaries are chosen in a way
to minimize the approximation error. For the first three intervals, non-uniform
segments increasing and decreasing by powers of two with six segments each are
used. For the last interval, only three segments are used by omitting taps. Since
this interval is the most non-linear, sufficiently good accuracy can be achieved
with only a few segments. We use a total of 21 segments to approximate this
function.
Figure 6.8: Approximation error to f . The worst case and average errors are
0.031 and 0.000048 respectively.
With finite precision on the coefficients and data paths, the maximum absolute
error of this approximation is 0.00079, which is sufficient to give an output accuracy
of eight bits (all eight bits for fraction). Using uniform segments, the same error
can be obtained with a slightly larger number of segments; this is because the
curve does not have high non-linearities.
The maximum absolute errors to the two functions, 0.031 and 0.00079, may
seem to be rather high. However, the average errors to the two functions are
in fact 0.000048 and 0.0000012 respectively. Lower average approximation er-
rors to the functions ensure overall higher noise quality. The error plots for the
approximations to f and g1 are shown in Figures 6.8 and 6.9.
Table 6.1 shows a comparison of the number of segments for the two functions
for non-uniform and uniform segmentation in order to achieve the same worst-case
error. Note that for uniform segmentation, the number of segments needs to be a
power of two. This is because the most-significant n bits are used for addressing.
For instance, the actual number of uniform segments needed for the f function
Figure 6.9: Approximation error to g1. The worst case and average errors are
0.00079 and 0.0000012 respectively.
is 617 million, but one billion segments are used, which is the next power of two
(2^30). We do not have this kind of restriction with our non-uniform addressing
scheme. The table also shows the number of bits used for each coefficient in the
look-up tables. The look-up table sizes are 59× (6 + 5 + 32 + 5) = 2832 bits for
the f function and 21× (8 + 4 + 16 + 4) = 672 bits for the g1 function, giving a
total look-up table size of just 3504 bits for all three functions. With such small
look-up table size, all the coefficients can be stored on-chip for fast access. Note
that the g2 function shares the same look-up table with g1.
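The table-size arithmetic quoted above can be checked directly:

```python
# Coefficient ROM sizes: segments x (c1 + c_s1 + c0 + c_s0) bits
f_table  = 59 * (6 + 5 + 32 + 5)
g1_table = 21 * (8 + 4 + 16 + 4)   # g2 shares this table with g1
print(f_table, g1_table, f_table + g1_table)   # 2832 672 3504
```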
6.6 Implementation
This section presents implementations of the four-stage architecture using FPGA
technology.
We use 32 bits for u1, allowing a maximum output of 6.7σ. Higher values
of σ can be supported by increasing the number of bits for u1; for instance 46
Table 6.1: Comparing two segmentation methods. Second column shows the
comparison of the number of segments for non-uniform and uniform segmentation.
Third column shows the number of bits used for the coefficients to approximate
f and g1.
function   non-uniform   uniform     c1   c_s1   c0   c_s0
f          59            1 billion   6    5      32   5
g1         21            32          8    4      16   4
bits would yield a maximum output of 8σ. For u2, 18 bits are found to be
sufficient without loss of performance (lower bitwidths cause the statistical tests
to fail). This is because the trigonometric functions in g1 and g2 can be computed
over [0,1/4) instead of [0,1), with symmetry used to derive the remainder of the
[0,1) interval. In terms of hardware resources, the size of these uniform random
number inputs (u1, u2, g1 and g2) affects the size of the multipliers and adders
(see Figure 6.4). The more bits there are, the larger the multipliers and adders
must be.
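The quoted output ranges follow from the width of u1: the extreme sample corresponds to the smallest representable u1, and accumulating two samples scales the reachable extreme by sqrt(2). A back-of-envelope check (assuming, as above, that f takes the form sqrt(−ln u)):

```python
import math

def max_sigma(u1_bits):
    """Largest |output| reachable with the given u1 width:
    sqrt(2) * f(2**-u1_bits) = sqrt(2 * u1_bits * ln 2).
    A sanity check of the quoted figures, not the thesis' derivation."""
    return math.sqrt(2.0 * u1_bits * math.log(2.0))

print(round(max_sigma(32), 1))   # 6.7 sigma with 32 bits
print(round(max_sigma(46), 1))   # 8.0 sigma with 46 bits
```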
The combination of 32 bits for u1 and 18 bits for u2 means that 50 shift
registers are needed. We choose to target a period of about 10^18 for the noise
generator, which exceeds by several orders of magnitude even the most ambitious
simulation size that can be contemplated with current hardware. Since 10^18 is
approximately 2^60, we use 60-bit LFSRs. In order for the LFSRs to iterate
through this large period, they are configured with polynomials which will
produce maximum sequence lengths for a given LFSR size [125].
The fifty 60-bit LFSRs can be implemented in configurable hardware using
surprisingly few resources. Recent-generation reconfigurable hardware has a
large amount of user-configurable elements. For instance the Xilinx Virtex-II
XC2V4000-6 has 23040 user-configurable elements known as slices. The SRL16
primitive in Xilinx Virtex FPGAs enables a look-up table to be configured as
a 16-bit shift register. A 60-bit LFSR using SRL16s instead of flipflops can be
packed into three slices instead of 32 [125]. So we just need 150 slices for the 50
LFSRs. Note that all 50 LFSRs are initialized with random seeds.
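One step of such a generator can be sketched in software. The tap positions below model x^60 + x^59 + 1, a polynomial commonly tabulated as maximal-length for 60 bits; the thesis' actual polynomials come from [125], so treat these taps as illustrative.

```python
MASK60 = (1 << 60) - 1

def lfsr60_step(state, taps=(59, 58)):
    """One step of a 60-bit Fibonacci LFSR (software sketch of the
    SRL16-based generators).  Tap positions model x^60 + x^59 + 1,
    an assumed maximal-length polynomial -- see [125] for the real ones."""
    fb = ((state >> taps[0]) ^ (state >> taps[1])) & 1
    return ((state << 1) | fb) & MASK60

s = 0x123456789ABCDEF          # a non-zero seed, as in the hardware
for _ in range(1000):
    s = lfsr60_step(s)
    assert 0 < s <= MASK60     # a non-zero state never reaches zero
```

A maximal-length polynomial guarantees the state cycles through all 2^60 − 1 non-zero values before repeating.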
It could also be argued that application of the central limit theorem should be
unnecessary if f , g1 and g2 are implemented with sufficient accuracy. However,
there is a hardware tradeoff involved in increasing the accuracy of these functions.
We have found that application of the central limit theorem once (by summing
two values as described above) results in a net reduction in complexity when
the corresponding looser tolerances in the piecewise linear approximations are
exploited.
Having a larger number of terms in the central limit theorem step would fur-
ther simplify the linear approximations, but would slow the execution speed due
to the need for accumulating more terms. For instance, when 17 approximations
are used for f and 6 for g, eight values need to be summed in order to pass the
statistical tests. When 59 approximations are used for f and 21 for g, without
summing, the statistical tests fail after around 700 million samples. Therefore,
we sum two samples to pass the tests.
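The effect of this central limit step can be sketched numerically. The model below uses ideal floating-point Box-Muller samples in place of the fixed-point pipeline, and sums two successive samples with a 1/sqrt(2) rescale to keep unit variance; exactly which two values are summed, and where the sqrt(2) lives, is fixed by Equations (6.1)–(6.3), outside this excerpt.

```python
import math, random

def box_muller(rng):
    """Ideal floating-point Box-Muller sample, standing in for the
    fixed-point approximation pipeline."""
    u1, u2 = 1.0 - rng.random(), rng.random()   # u1 in (0, 1]
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def accumulated(rng):
    """Central limit step: sum two successive samples, rescaled by
    1/sqrt(2) so the output keeps unit variance (behavioural sketch)."""
    return (box_muller(rng) + box_muller(rng)) / math.sqrt(2.0)

rng = random.Random(42)
xs = [accumulated(rng) for _ in range(100000)]
var = sum(x * x for x in xs) / len(xs)
print(var)   # close to 1.0: the accumulation preserves unit variance
```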
Several FPGA implementations have been developed, using the Handel-C
hardware compiler from Celoxica [21]. We have mapped and tested the design
onto a hardware platform with a Xilinx Virtex-II XC2V4000-6 device. This design
occupies 2514 slices, eight block multipliers and two block RAMs, which takes up
around 10% of the device. Stage two, the function evaluator, takes up 2137 slices
or 85% of the slices used. A pipelined version of our design operates at 133MHz,
and hence our design produces 133 million Gaussian noise samples per second.
We have also implemented our design on a Xilinx Spartan-IIE XC2S300E-7
FPGA. This design runs at 62MHz and has 2829 slices and eight block RAMs,
which requires over 90% of this device. This implementation can produce 133
million samples in around two seconds.
It is possible to increase the performance by exploiting parallelism. We have
experimented with placing multiple instances of our noise generator in an FPGA,
and find that there is a small reduction in clock speed probably due to the high
fan-out of the clock tree. For instance, a design with three instances of our noise
generator takes up around 32% of the resources in an XC2V4000-6 device; it runs
at 126MHz, producing 378 million noise samples per second.
In Section 6.7, the performance of the hardware designs presented above is
compared with those of software implementations.
6.7 Evaluation and Results
This section describes the statistical tests that we use to analyze the properties
of the generated Gaussian noise.
In order to ensure the randomness of the uniform random samples u1 and u2,
we have tested the LFSR with the Diehard test suite [113], which is a popular
tool among statisticians for testing uniformity. The LFSR passed all the Diehard
tests indicating that the uniform random samples generated are indeed uniformly
randomly distributed.
We use two well-known goodness-of-fit tests to check the normality of the ran-
dom variables: the chi-square (χ2) test and the Anderson-Darling (A-D) test [32].
The χ2 test involves quantizing the x axis into k bins, determining the actual
and expected number of samples appearing in each bin, and using the results
to derive a single number that serves as an overall quality metric. Let t be the
number of observations, p_i be the probability that an observation falls into
category i, and Y_i be the number of observations that actually do fall into category
i. The χ² statistic is given by
χ² = Σ_{i=1}^{k} (Y_i − t·p_i)² / (t·p_i)    (6.7)
This test, which is essentially a comparison between an experimentally deter-
mined histogram and the ideal PDF, is sensitive not only to the quality of the
noise generator itself, but also to the number and size of the k bins used on the
x axis. For example, a noise generator that models the true PDF accurately for
low absolute values of x but fails for large x could yield a good χ2 result if the
examined regions are too closely centered around the origin. Yet it is precisely
in these high |x| regions that a noise generator is critically important, and most
likely to be flawed.
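The χ² computation of Equation (6.7) can be sketched directly. The version below bins a finite interval and ignores the tail mass beyond it (the thesis bins over [−7, 7]); bin count and range here are illustrative.

```python
import math, random

def chi_square_stat(samples, k=20, lo=-4.0, hi=4.0):
    """Chi-square statistic of Equation (6.7): quantize [lo, hi] into k
    bins, count observed samples Y_i per bin, and compare with the t*p_i
    expected under the unit normal.  Tail mass beyond [lo, hi] is
    ignored in this sketch."""
    t = len(samples)
    ncdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    edges = [lo + (hi - lo) * i / k for i in range(k + 1)]
    stat = 0.0
    for i in range(k):
        p = ncdf(edges[i + 1]) - ncdf(edges[i])
        y = sum(edges[i] <= s < edges[i + 1] for s in samples)
        stat += (y - t * p) ** 2 / (t * p)
    return stat

rng = random.Random(7)
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
print(chi_square_stat(xs))   # on the order of k for true normals
```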
Consider a simulation involving generation of 1012 noise samples, conducted
with the goal of exploring performance for a channel decoder in the range of BERs
from 10−9 to 10−10. In samples drawn from a true unit-variance Gaussian PDF, we
would expect that approximately half a million samples from the set of 1012 would
have absolute value greater than x = 5. These high σ noise values are precisely
the ones likely to cause problems in decoding, so a hardware implementation
that fails to produce them faithfully risks creating incorrect and
deceptively optimistic results in simulation. To counter this, we extend the tests
to specifically examine the expected versus actual production of high σ values.
While the χ2 test deals with quantized aspects of a design, the A-D test deals
with continuous properties. It is a modification of the Kolmogorov-Smirnov (K-
S) test [78] and gives more weight to the tails than the K-S test does. The K-S
test is distribution free in the sense that the critical values do not depend on
the specific distribution being tested. The A-D test makes use of the specific
distribution (normal in our case) in calculating critical values. For comparing a
data set to a known CDF F(x), the A-D statistic A² is defined by

A² = Σ_{i=1}^{N} ((1 − 2i)/N) · [ln F(x_i) + ln(1 − F(x_{N+1−i}))] − N    (6.8)

where x_i is the ith sorted and standardized sample value, and N is the sample
size.
A p-value [32] can be obtained from the tests, which is the probability that the
deviation of the observed distribution from the expected one is due to chance alone. A sample set
with a small p-value means that it is less likely to follow the target distribution.
The general convention is to reject the null hypothesis – that the samples are
normally distributed – if the p-value is less than 0.05.
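Equation (6.8) transcribes directly into code. The sketch below evaluates the statistic against the standard normal CDF, assuming the samples are already standardized:

```python
import math, random

def anderson_darling(samples):
    """A-D statistic of Equation (6.8) against the standard normal CDF
    (samples assumed standardized; a direct transcription of the
    formula, not an optimized implementation)."""
    xs = sorted(samples)
    n = len(xs)
    ncdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    s = 0.0
    for i in range(1, n + 1):
        s += (2 * i - 1) * (math.log(ncdf(xs[i - 1]))
                            + math.log(1.0 - ncdf(xs[n - i])))
    return -n - s / n

rng = random.Random(2)
a2 = anderson_darling([rng.gauss(0.0, 1.0) for _ in range(5000)])
print(a2)   # compare against ~2.49, the 5% critical value when the
            # distribution is fully specified
```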
Figures 6.10, 6.11 and 6.12 illustrate the effect on the PDF of different im-
plementation choices. Figure 6.10 shows the PDF obtained when 17 and 6 linear
approximations are used for f and g1 respectively. The figure (as well as the
others in this section) is based on a simulation of four million Gaussian random
variables. There are distinct error regions visible in the PDF, which occur when
there are large errors in the approximation of f and g1. These distinct errors
cause the χ2 and A-D tests to fail. Increasing the number of linear approxima-
tions to 59 and 21 respectively leads to the PDF shown in Figure 6.11. It is clear
that the error regions have decreased significantly. However, although this passes
the A-D test, it fails the χ2 test when the sample size is sufficiently large. When
the further enhancement of summing two successive samples as discussed earlier
is added, the PDF of Figure 6.12 results.
This implementation passes the statistical tests even with extremely large
numbers of samples. We have run a simulation of 10^10 samples to calculate the
p-values for the χ2 and A-D test. For the χ2 test, we use 100 bins for the x axis
over the range [-7,7]. The p-values for the χ2 and A-D tests are found to be 0.3842
and 0.9058 respectively, which are well above 0.05, indicating that the generated
noise samples are indeed normally distributed. To test the noise quality in the
high σ regions, we run a simulation of 10^7 samples over the range [-7,-4] and [4,7]
with 100 bins. This is equivalent to a simulation size of over 10^11 samples. The
p-values for the χ2 and A-D tests are found to be 0.6432 and 0.9143, showing
that the noise quality even in the high σ regions is high.
In order to explore the possibility of temporal statistical dependencies [154]
between the Gaussian variables, we generate scatter plots showing pairs yi and
yi+1. This is to test serial correlations between successive samples, which can
occur if the noise generator is improperly designed. If correlations exist, certain
patterns can be seen in the scatter plot [154]. An example based on 10000 Gaus-
sian variables is shown in Figure 6.13, which displays no obvious correlations.
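As a numeric companion to the visual check, the lag-1 sample correlation of the sequence should sit near zero for a generator free of serial correlation (the thesis itself relies on scatter-plot inspection; this statistic is an added sketch):

```python
import random

def lag1_correlation(ys):
    """Sample correlation between successive pairs (y_i, y_{i+1});
    values near zero indicate no serial correlation at lag 1."""
    n = len(ys) - 1
    a, b = ys[:-1], ys[1:]
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / (va * vb) ** 0.5

rng = random.Random(3)
ys = [rng.gauss(0.0, 1.0) for _ in range(10000)]
print(lag1_correlation(ys))   # near zero for uncorrelated samples
```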
Our hardware implementations, described in Section 6.6, have been compared
to several software implementations based on the polar method [78] and the
Ziggurat method [115], which are the fastest methods for generating Gaussian
noise for instruction processors. The software implementations are written in
C generating single precision floating-point numbers, and are compiled with the
GNU gcc 3.2.2 compiler. The uniform random number generator used is the
mrand48 C function in UNIX, which uses a linear congruential algorithm [78]
and 48-bit integer arithmetic (period of 2^48). This algorithm can generate one
billion 48-bit uniform random numbers on a Pentium 4 2.6GHz PC in just 23
seconds.
The results are shown in Table 6.2. The XC2V4000-6 FPGA belongs to
Figure 6.10: PDF of the generated noise with 17 approximations for f and 6 for g
for a population of four million. The p-values of the χ2 and A-D tests are 0.00002
and 0.0084 respectively.
Figure 6.11: PDF of the generated noise with 59 approximations for f and 21
for g for a population of four million. The p-values of the χ2 and A-D tests are
0.0012 and 0.3487 respectively.
Figure 6.12: PDF of the generated noise with 59 approximations for f and 21 for
g with two accumulated samples for a population of four million. The p-values
of the χ2 and A-D tests are 0.3842 and 0.9058 respectively.
Figure 6.13: Scatter plot of two successive accumulative noise samples for a
population of 10000. No obvious correlations can be seen.
Table 6.2: Performance comparison: time for producing one billion Gaussian
noise samples. All PCs are equipped with 1GB DDR-SDRAM.
platform               speed [MHz]   method      time [s]
XC2V4000-6 FPGA        105           96% usage   1
XC2V4000-6 FPGA        126           32% usage   2.6
XC2V4000-6 FPGA        133           10% usage   7.5
XC2S300E-7 FPGA        62            90% usage   16
Intel Pentium 4 PC     2600          Ziggurat    50
AMD Athlon PC          1400          Ziggurat    72
Intel Pentium 4 PC     2600          Polar       147
AMD Athlon PC          1400          Polar       214
the Xilinx Virtex-II family, while the XC2S300E-7 FPGA belongs to the Xilinx
Spartan-IIE family. It can be seen that our hardware designs are faster than
software implementations by 3–200 times, depending on the device used and the
resource utilization. Such speedups are mainly due to the ability to perform bit-
level and parallel operations in FPGAs, which result in a more efficient usage of
silicon area for a given design over general purpose microprocessors.
Figure 6.14 shows how the number of noise generator instances affects the
output rate. While ideally the output rate would scale linearly with the number
of noise generator instances (dotted line), in practice the output rate grows slower
than expected, because the clock speed of the design deteriorates as the number
of noise generators increases. This deterioration is probably due to the increased
routing congestion and delay. We are able to fit up to nine instances on the
Virtex-II XC2V4000-6, which can generate almost one billion noise samples per
second.
We have used our noise generator in LDPC decoding experiments [74]. Al-
though the output precision of our noise generator is 32 bits, 16 bits are found to
be sufficient for our LDPC decoding experiments (other applications such as fi-
nancial modeling [14] may require higher precisions). To obtain a benchmark, we
performed LDPC decoding using a full precision (64-bit floating-point represen-
tation) software implementation of belief propagation in which the noise samples
are also of full precision. We then performed decoding using the LDPC algorithm
but with noise samples created using the design presented in this chapter. Over
many simulations, we have found no distinguishable difference in code performance,
even in the high Eb/N0 (high SNR) regions where the error floor in BER
is as low as 10^(−9) (10^12 codewords are simulated). Generating 10^12 noise samples
on a 2.6GHz Pentium 4 PC takes over 11 hours, whereas a single instance of
our hardware noise generator takes just over two hours. On a PC where LDPC
encoding, noise generation and LDPC decoding are all performed, the simulation
time for 10^12 codeword samples will be far longer than 11 hours, since all three
modules must be executed. However, in our hardware simulation we have the
advantage of running all three modules in parallel. Although our hardware
LDPC decoder is currently at a preliminary stage
(implemented serially), it has a throughput of around 500Kbps, which is over 20
times faster than our PC based simulations. We are currently in the process of
implementing a fully parallel scalable decoder, which we predict will be several
orders of magnitude faster than traditional software simulations.
Comparing our implementation with other hardware Gaussian noise genera-
Figure 6.14: Variation of output rate against the number of noise generator
instances.
tors, the only implementation known on a Xilinx FPGA is the AWGN core [186]
from Xilinx. This implementation follows the ideas presented in [12]. Although
this core is around twice as fast as and four times smaller than our design, it is
only capable of a maximum σ value of 4.7 (whereas we can achieve 6.7 σ and
more). In addition, we have tested the design with our statistical tests, and found
that the noise samples fail the χ² test after around 200,000 samples. Hence,
we find the design to be inadequate for our low BER and high quality LDPC
decoding experiments.
6.8 Summary
We have presented a hardware Gaussian noise generator based on the Box-Muller
method, designed to facilitate simulations implemented in hardware which involve
large numbers of samples. A key aspect of the design is the use of non-uniform
piecewise linear approximations in computing trigonometric and logarithmic func-
tions, with the boundaries between each approximation chosen carefully to enable
rapid computation of coefficients from the inputs.
Our noise generator design occupies approximately 10% of a Xilinx Virtex-
II XC2V4000-6 FPGA and 90% of a Xilinx Spartan-IIE XC2S300E-7, and can
produce 133 million samples per second. The performance can be improved by
exploiting parallelism: an XC2V4000-6 FPGA with nine parallel instances of the
noise generator at 105MHz can run 50 times faster than a 2.6GHz Pentium 4 PC.
Statistical tests, including the χ2 test and the A-D test, as well as application
in LDPC decoding have been used to confirm the quality of the noise samples.
The output of the noise generator accurately models a true Gaussian PDF even
at very high σ values.
This noise generator has been integrated with the LDPC decoder presented
in [74], and is being used for exploring LDPC code behavior at UCLA and JPL
(Jet Propulsion Laboratory, NASA). It is also being used at the Chinese Univer-
sity of Hong Kong for Monte Carlo simulations of financial models [192]. In the
next chapter, we describe another hardware Gaussian noise generator based on a
recent method proposed by Wallace [180].
CHAPTER 7
Gaussian Noise Generator
using the Wallace Method
7.1 Introduction
Most methods, including the Box-Muller method described in the previous
chapter, produce normal variables by performing operations on uniform variables.
In contrast, in [180] Wallace proposes an algorithm that completely avoids
the use of uniform variables, operating instead using an evolving pool of normal
variables to generate additional normal variables. The approach draws its inspi-
ration from uniform random number generators that generate one or more new
uniform variables from a set of previously generated uniform variables. Given a
set of normally distributed random variables, a new set of normally distributed
random variables can be generated by applying a linear transformation.
Although the Wallace method is simple and fast, it can suffer from correlations
at the output due to its feedback nature. This issue will be discussed in detail in
Chapter 8.
The principal contribution of this chapter is a hardware Gaussian noise gen-
erator based on the Wallace method that offers quality suitable for simulations
involving very large numbers of noise samples. The noise generator occupies ap-
proximately 3% of the resources on a Xilinx Virtex-II XC2V4000-6 device, while
producing over 155 million samples per second. The key contributions of our
work include:
• a hardware architecture for the Wallace method;
• exploration of hardware implementations of the proposed architecture tar-
geting both advanced high-speed FPGAs and low-cost FPGAs;
• evaluation of the proposed approach using several different statistical tests,
including the chi-square test and the Anderson-Darling test, as well as
through application to a large communications simulation involving LDPC
codes.
The rest of this chapter is organized as follows. Section 7.2 provides an
overview of the Wallace method. Section 7.3 describes our Wallace implementa-
tion, and discusses how each of its steps can be handled in a hardware architec-
ture. Section 7.4 describes technology-specific implementation of the hardware
architecture. Section 7.5 discusses evaluation and results, and Section 7.6 offers
a summary.
7.2 The Wallace Method
The Wallace approach [180] draws its inspiration from uniform random number
generators that generate one or more new uniform variables from a set of previ-
ously generated uniform variables. Given a set of normally distributed random
variables, a new set of normally distributed random variables can be generated
by applying a linear transformation. Brent [16] implemented a fast vectorized
Gaussian random number generator using the Wallace method on the Fujitsu
VP2200 and VPP300 vector processors. In [17] and [157], Brent and Rub outline
Figure 7.1: Overview of the Wallace method.
the possible problems associated with the Wallace method and discuss ways of
avoiding them.
The Wallace method is a fast algorithm for generating normally distributed
pseudo-random numbers which generates the target distributions directly using
their maximal-entropy properties. This algorithm is particularly suitable for high
throughput hardware implementation since no transcendental functions such as
√x, log(x) or sin(x) are required. It maintains a pool of KL normally distributed
random numbers. These values are normalized so
that their average squared value is one. In L transformation steps, K numbers are
treated as a vector X, and transformed into K new numbers, the components
of the K-vector X′ = AX, where A is an orthogonal matrix. If the original K
values are normally distributed, then so are the K new values. Furthermore, this
transformation preserves the sum of squares. An overview of the Wallace method
is depicted in Figure 7.1.
The process of generating a new pool of normally distributed random numbers
is called a ‘pass’. After a pass, a pool of new Gaussian random numbers is formed.
As there are KL variables in the data pool, L transformation steps are performed
during each pass. A K-vector X is multiplied with the orthogonal matrix A in
177
performing a transformation step.
As stated by Wallace, it is desirable that any value in the pool should even-
tually contribute to every value in the pools formed after several passes. In
Wallace’s original method, the old pool is treated as an L-by-K array stored
in row-major order, and the new pass is treated as an L-by-K array stored in
column major order. Hence, each pass effectively transposes the values in the
pool. If L is odd, the transposition is sufficient to ensure eventual mixing of the
values. However, if L is even (which is desirable for hardware implementation),
transposition alone is not sufficient. We describe in Section 7.3 how we overcome
this problem to reduce correlation even further.
The initial values in the pool are normalized so that their average squared
value is one. Because A is orthogonal, the subsequent passes do not alter the
sum of the squares. This would be a defect, since if x_1, ..., x_N are independent
samples from the normal distribution, we would expect ∑_{i=1}^{N} x_i² to have a
chi-squared distribution χ²_N. In order to overcome this defect, a variate from the
previous pool is used to approximate a random sample S from the χ²_N distribution.
A scaling factor is introduced to ensure that the sum of the squares of the values
in the pool is S, the random sample.
7.3 Architecture
This section provides an overview of the hardware design for the Wallace method,
which involves a four-stage hardware architecture shown in Figure 7.2. The
implementation of this architecture in FPGA technology will be presented in
Section 7.4. In Figure 7.2, the select signals for the multiplexors and the clock
enable signals for the registers are omitted for simplicity.
178
[Figure: Stage 1 — 30 LFSRs with an “LFSR Seed ROM” producing the 10-bit values start, stride and mask; Stage 2 — computation of p_addr, q_addr, r_addr and s_addr; Stage 3 — dual-port “Pool RAM” with “init Pool ROM” and the transformation circuit on 24-bit data; Stage 4 — sum-of-squares correction (S × C2 + C1) producing the noise output]
Figure 7.2: Overview of our Gaussian noise generator architecture based on the
Wallace method. The triangle in Stage 4 is a constant coefficient multiplier.
179
In our design illustrated in Figure 7.2, we choose K = 4 and L = 256, giving
a pool size N of 1024. On-chip, true dual read/write port synchronous RAM is
used to implement the pool. The dual-port RAM allows two values to be read
and written simultaneously, improving the memory bandwidth.
As all the variables from the pool are used to generate the new pseudo random
numbers, the indices should cover all the numbers in the pool and at the same
time reduce the correlations between them. The addresses which index the pool
start from a random origin ‘start’, are stepped by a random odd ‘stride’, and are
XORed with a random ‘mask’.
In order to achieve better mixing of the Gaussian random number generator,
more pass types can be used during a pass by introducing different orthogonal
matrices. As in Wallace’s original implementation, two orthogonal matrices A0
and A1 are chosen for our design:
A0 = (1/2) ×
[  1  −1  −1  −1 ]
[  1  −1   1   1 ]
[  1   1  −1   1 ]
[  1   1   1  −1 ]

A1 = (1/2) ×
[ −1   1   1   1 ]
[ −1   1  −1  −1 ]
[ −1  −1   1  −1 ]
[ −1  −1  −1   1 ]
During a pass, A0 is used for transformation steps 0 to 127 and A1 for steps 128 to 255. As the
elements of the matrices A0 and A1 are only 1 or −1, only simple integer addition
and shift operations are required. The Gaussian random variables in the pool are
held as 24-bit two’s complement integers. For the given set of four values p, q, r, s
180
to be transformed, and with our choice of A0 and A1, the new values p′, q′, r′, s′
can be calculated from the old ones as follows:
p′ = p − t;   q′ = t − q;   r′ = t − r;   s′ = t − s;   (7.1)
and
p′ = t − p;   q′ = q − t;   r′ = r − t;   s′ = s − t;   (7.2)
where t = (1/2)(p + q + r + s).
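As a software sketch (not the thesis RTL), the two transformation steps of equations (7.1) and (7.2) can be written as:

```python
def wallace_transform(p, q, r, s, use_a1=False):
    # One K = 4 orthogonal transformation step (equations (7.1) and (7.2)).
    # Both matrices are orthogonal, so the sum of squares is preserved.
    t = 0.5 * (p + q + r + s)
    if use_a1:
        return t - p, q - t, r - t, s - t   # A1 step, equation (7.2)
    return p - t, t - q, t - r, t - s       # A0 step, equation (7.1)
```

Because only halving, addition and subtraction appear, the hardware needs no multipliers for this step.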
7.3.1 The First Stage
This stage involves generation of the uniformly distributed realizations start,
stride and mask. The implementation of this stage is straightforward, and can
be accomplished using well-known techniques based on Linear Feedback Shift
Registers (LFSRs) [24]. To ensure maximum randomness, we use an independent
shift register for each bit of start, stride and mask. The resources needed are
related to the periodicity desired in the shift registers. Since m-bit LFSRs with
irreducible polynomials can produce random numbers with periodicity of 2^m − 1,
the hardware required will be proportional to the number of bits of precision
needed. Since we use a pool size of 1024, 10 bits are needed for the three variables,
meaning that 30 LFSRs are needed. If the reset signal is set, we would like to
generate the same sequences again. The “LFSR Seed ROM” contains the initial
seeds for the 30 LFSRs, which are loaded when the reset signal is set. 52-bit
LFSRs are used in our architecture, which give a period of 2^52 − 1 (≈ 4.5 × 10^15).
Hence, the size of the “LFSR Seed ROM” is 30 × 52 = 1560 bits.
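The behaviour of one such LFSR can be sketched in software as follows. The feedback taps are illustrative: the thesis does not list the polynomial used, so the trinomial x^52 + x^3 + 1 below is an assumption, not the design's actual polynomial.

```python
def lfsr_step(state, nbits=52, taps=(52, 3)):
    # One step of a Fibonacci LFSR: the feedback bit is the XOR of the tap
    # bits.  taps=(52, 3) assumes the trinomial x^52 + x^3 + 1; the thesis
    # does not specify which degree-52 polynomial is used.
    fb = 0
    for t in taps:
        fb ^= (state >> (t - 1)) & 1
    return ((state << 1) | fb) & ((1 << nbits) - 1)
```

A maximal-length m-bit LFSR visits all 2^m − 1 non-zero states; for example, a 3-bit LFSR with the primitive polynomial x³ + x² + 1 (taps (3, 2)) cycles through all 7 non-zero states.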
181
7.3.2 The Second Stage
This stage follows the techniques used by Wallace in his FastNorm2 implemen-
tation [181]. It generates the addresses for the four values p, q, r, s from start,
stride and mask. To ensure the value of stride is odd, OR with one is performed.
The addresses are calculated as follows:
p_addr = start ⊕ mask   (7.3)
q_addr = (start + stride) ⊕ mask   (7.4)
r_addr = (start + stride × 2) ⊕ mask   (7.5)
s_addr = (start + stride × 3) ⊕ mask   (7.6)
The multiplication by two is implemented simply by a left shift, and the mul-
tiplication by three is implemented by a left shift followed by an adder. This
addressing scheme ensures that the correlations between variables are kept to a
minimum.
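A software sketch of this addressing scheme (equations (7.3)–(7.6)), assuming a 1024-entry pool so that addresses wrap modulo 2^10:

```python
def pool_addresses(start, stride, mask, pool_bits=10):
    # Equations (7.3)-(7.6): random origin, random odd stride, XOR mask.
    stride |= 1                      # force the stride to be odd
    m = (1 << pool_bits) - 1
    return [((start + i * stride) & m) ^ mask for i in range(4)]
```

Because the stride is odd and the pool size is a power of two, stepping by the stride modulo 2^10 never revisits an address within a pass, and the XOR mask permutes addresses without introducing duplicates.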
7.3.3 The Third Stage
This stage involves the most interesting challenge: efficiently performing the
actual transformation. This stage contains the “Pool RAM” which holds the
pool of 1024 Gaussian random variables. Dual-port RAM is used to implement
the pool. Since each variable in the pool is 24 bits, the total size of the pool
is 1024 × 24 = 24576 bits. The “init Pool ROM” and the counter are used to
initialize the pool with the original pool contents when the reset signal is set; this
ROM is single ported and has the same size as the pool. The contents of this
ROM is generated in software using the Box-Muller method, and the variables
are normalized so that their average squared value is one.
Figure 7.3 shows how we perform the transformation steps described in equa-
182
Figure 7.3: The transformation circuit of Stage 3. The square boxes are registers.
The select signals for the multiplexors and the clock enable signals for the registers
are omitted for simplicity.
tions (7.1) and (7.2). The timing diagram of this circuit and the “Pool RAM” is
illustrated in Figure 7.4. All ports and registers of the transformation circuit and
ports of the dual-port RAM are shown. We observe that the dual-port RAM is
fully utilized. t is calculated in three steps:
x = p + q (7.7)
y = r + s (7.8)
t = x + y. (7.9)
In principle, we could share a single adder in conjunction with multiplex-
ors to perform all the operations of the transformation circuit. However, high-
183
Figure 7.4: Detailed timing diagram of the transformation circuit and the dual-
port “Pool RAM”. A_z indicates the address of the data z and WE is the write
enable signal of the “Pool RAM”.
184
speed adders are efficiently implemented on FPGAs by fast-carry chains. In fact,
both a two-input 24-bit multiplexor and a 24-bit adder occupy 14 slices (user-
configurable elements on the FPGA) in a Xilinx Virtex-II FPGA. In addition, the
use of multiplexors would increase the delay significantly. For these reasons, we
decide to use separate adders/subtractors for each operation. For other devices
such as Application-Specific Integrated Circuits (ASICs), it can be more efficient
to adopt the former approach involving hardware sharing. The critical path of
the entire Wallace design is from Rp to Rp′ which is just a multiplexor followed
by a subtractor.
7.3.4 The Fourth Stage
This stage performs the sum of squares correction described in Section 7.2. It
follows the approach used by Wallace in his FastNorm2 implementation [181].
A random sample S with an approximate χ²_N distribution can be obtained as
S = (1/2)(C + A × x)²   (7.10)
where x has a unit normal distribution, A = 1 + 1/(8N) and C = √(2N − A²) for large
N. Hence, the scaling factor √(S/N) can be computed as
√(S/N) = √(1/(2N)) × A × (B + x)   (7.11)
where B = C/A. We set C2 = A × √(1/(2N)) and C1 = B × C2.
The noise sample C′, generated by the transformation circuit of Stage 3,
is multiplied by G to correct the sum of the squares, yielding the final noise
sample. G is obtained by
G = S × C2 + C1.   (7.12)
Since C1 and C2 are constants, they are precalculated in software and stored as
constants in the hardware design.
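Under one consistent reading of the constants (C2 = A·√(1/(2N)) and C1 = B·C2, which makes G²·N equal the approximate χ²_N sample of equation (7.10)), the correction gain can be sketched as follows. This grouping is an interpretation, not taken verbatim from the thesis:

```python
import math

def correction_gain(x, n=1024):
    # Scaling factor G applied to each output sample.  With x a unit normal
    # variate, g*g*n reproduces the approximate chi-square sample
    # 0.5*(C + A*x)**2 of equation (7.10).  The constant grouping is one
    # consistent reading of the text, not the thesis's literal constants.
    a = 1.0 + 1.0 / (8.0 * n)
    c = math.sqrt(2.0 * n - a * a)
    c2 = a * math.sqrt(1.0 / (2.0 * n))
    c1 = (c / a) * c2
    return x * c2 + c1
```

For x near zero the gain is close to one, so the correction is a small rescaling of the pool.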
185
Before a pass, S is assigned a variate from the previous pass, and G is
updated. For the very first pass, when the reset signal is set, G is initialized to
1/√(N/ts), where ts is the sum of squares of the initial pool. Note that we are
using a pool size of N = 1024.
7.4 Implementation
This section presents implementations of the four-stage architecture using FPGA
technology.
The 30 52-bit LFSRs in Stage 1 can be implemented in configurable hardware
using a small amount of resources. Recent FPGAs have many user-configurable
elements: for instance, the Xilinx Virtex-II XC2V4000-6 device has 23040 user-
configurable elements known as slices. A look-up table can be configured as a
16-bit shift register using the SRL16 primitive in Xilinx Virtex and Virtex-II
series FPGAs. A 52-bit LFSR using SRL16s instead of flipflops can be packed
into three slices instead of 32 [125]. Hence, our design contains 90 slices to
implement the 30 52-bit LFSRs. Note that all 30 LFSRs are initialized with
uniformly distributed random seeds.
Xilinx Virtex-II devices have embedded memory elements and multipliers,
which are known as block RAMs and MULT18X18s. Each block RAM can hold
18Kb of data and each embedded multiplier can implement an 18-bit by 18-bit
multiplication. If the data or the multiplication is larger than 18Kb or 18-bit by
18-bit, the Xilinx tools will use multiple block RAMs and embedded multipliers
to implement them. The Xilinx Virtex-II XC2V4000-6 device has 120 block
RAMs and 120 embedded multipliers in total. The “LFSR Seed ROM” and
the “init Pool ROM” are implemented using single-port block RAMs, while the
186
“Pool RAM” is implemented using dual-port block RAMs. The sizes of “LFSR
Seed ROM”, “init Pool ROM” and “Pool RAM” are 1560, 24576 and 24576 bits.
Hence they occupy one, two and two block RAMs respectively. The constant
coefficient multiplier in Stage 4 uses two block RAMs to implement part of the
multiplication. The 24-bit by 24-bit multiplier in Stage 4 occupies four embedded
multipliers.
Several FPGA implementations have been developed, using Xilinx System
Generator 6.2 [188]. All designs are heavily pipelined to maximize throughput.
Synplicity Synplify Pro 7.5.1 is used for synthesis with the retiming and pipelin-
ing options turned on. For place-and-route, Xilinx ISE 6.2.01i is used with the
maximum effort level and the clock constraints are carefully tuned to give the
fastest clock frequency. We have mapped and tested the Wallace design onto a
hardware platform with a Xilinx Virtex-II XC2V4000-6 FPGA. The design occu-
pies 895 slices, seven block RAMs and four embedded multipliers, which takes up
around 3% of the device. The pipelined design operates at 155MHz, and hence
our design produces 155 million Gaussian noise samples per second. The resource
usage of each of the four stages is shown in Table 7.1. It may be surprising to see
that Stage 1 occupies 281 slices, since the 30 LFSRs require just 90 slices. This is
due to extra components such as logic gates, registers and multiplexors required
to initialize the LFSRs with seeds. Xilinx System Generator design diagrams of
Stage 1 and Stage 2 are depicted in Figure 7.5 and Figure 7.6. Stage 3 and Stage
4 are shown in Figure 7.7.
The latency of our design is 1680 clock cycles (≈ 11µs at 155MHz). 1560 cycles
are used to initialize the 30 52-bit LFSRs. The LFSRs need to be initialized one
by one, since the “LFSR Seed ROM” is single ported. The other 120 cycles are
needed to fill up the pipelines of the design. Although the latency is very large,
187
Figure 7.5: Wallace architecture Stage 1 in Xilinx System Generator. The 30
LFSRs generate uniform random bits for Stage 2.
188
Figure 7.6: Wallace architecture Stage 2 in Xilinx System Generator. Pseudo
random addresses for p, q, r, s are generated.
189
Figure 7.7: Wallace architecture Stage 3 and Stage 4 in Xilinx System Generator.
Orthogonal transformation is performed and sum of squares corrected.
190
Table 7.1: Resource utilization for the four stages of the noise generator on a
Xilinx Virtex-II XC2V4000-6 FPGA.
stage slices block RAMs multipliers
1 281 1 -
2 180 - -
3 214 4 -
4 220 2 4
total 895 7 4
it is not important since we only care about the throughput in a hardware based
simulation. Figures 7.8 and 7.9 show the placed and routed Wallace designs on
a Xilinx Virtex-II XC2V4000-6 FPGA.
From a hardware designer’s point of view, it is interesting to explore the
tradeoffs between using different types of hardware resources. For instance, a
look-up table can be implemented using block RAM or distributed RAM with
slices. Table 7.2 shows our noise generator implemented using different FPGA
resources. We observe that the design using slices only requires more than four
times the number of slices and has a significantly lower clock speed than our original
design. Also, the area and speed penalty of using slices to implement tables
instead of block RAMs is especially high. Hence in our opinion, dedicated FPGA
resources such as block RAMs and embedded multipliers should be used whenever
applicable.
We have also implemented our design on a low-cost Xilinx Spartan-III XC3S200E-
5 FPGA. The design runs at 106MHz and takes up the same amount of resources
191
Figure 7.8: Our Wallace design placed on a Xilinx Virtex-II XC2V4000-6 FPGA.
Figure 7.9: Our Wallace design routed on a Xilinx Virtex-II XC2V4000-6 FPGA.
192
Table 7.2: Hardware implementation results of the noise generator using different
types of FPGA resources on a Xilinx Virtex-II XC2V4000-6 FPGA.
FPGA resources used slices block RAMs embedded multipliers speed [MHz]
slices + block RAMs + multipliers 895 7 4 155
slices + block RAMs 1215 7 - 152
slices + multipliers 3702 - 4 118
slices 4020 - - 112
as the Virtex-II design above, which requires around half of the resources in the
device.
The performance can be improved by concurrent execution. We have experi-
mented with placing multiple instances of our noise generator in an FPGA, and
discovered that there is a small reduction in clock speed due to increased routing
congestion. For example, eight instances of our noise generator on an XC2V4000-
6 FPGA run at 144MHz. They take up around 31% of the resources, producing
over one billion noise samples per second.
7.5 Evaluation and Results
This section describes the statistical tests that we use to analyze the properties
of the generated Gaussian noise.
To ensure the randomness of the uniform random numbers start, stride and
mask, we have tested the LFSRs with the Diehard tests [113]. The LFSRs pass all
193
the tests indicating that the uniform random samples generated are indeed uni-
formly randomly distributed. As in Chapter 6, we use two well-known goodness-
of-fit tests to check the normality of the random variables: the chi-square (χ2)
test and the Anderson-Darling (A-D) test [32].
Our hardware Wallace implementation passes the statistical tests even with
extremely large numbers of samples. We have run a simulation of 10^10 samples
to calculate the p-values for the χ2 and A-D tests. For the χ2 test, we use 100 bins
for the x axis over the range [−7, 7]. The p-values for the χ2 and A-D tests are
found to be 0.5385 and 0.7372 respectively, which are well above 0.05, indicating
that the generated noise samples are indeed normally distributed. To test the
noise quality in the high σ regions, we run a simulation of 10^7 samples over the
ranges [−7, −4] and [4, 7] with 100 bins. This is equivalent to a simulation size of
over 10^11 samples. The p-values for the χ2 and A-D tests are found to be 0.6839
and 0.7662, showing that the noise quality even in the high σ regions is high.
If (x, y) is a pair of random numbers with Gaussian distributions, then u =
e^{−(x²+y²)/2} should be uniform over [0, 1]. Six million Gaussian variables, randomly
picked from a population of 10^10 samples generated from our design, are trans-
formed using this identity, resulting in three million uniform random variables.
These uniform variables are tested with the Diehard tests [113] for uniformity.
They pass all tests, indicating that the transformed numbers are indeed uniformly
distributed.
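This Gaussian-to-uniform identity is easy to sketch and check directly:

```python
import math

def gauss_pair_to_uniform(x, y):
    # For independent x, y ~ N(0, 1), x*x + y*y is exponential with mean 2,
    # so u = exp(-(x*x + y*y) / 2) is uniformly distributed over (0, 1].
    return math.exp(-(x * x + y * y) / 2.0)
```

For instance, the median of x² + y² is 2 ln 2, which maps to u = 0.5, the median of the uniform distribution.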
To explore the possibility of temporal statistical dependencies [154] between
the Gaussian variables, we generate scatter plots showing pairs yi and yi+1. This
is to test serial correlations between successive samples, which can occur if the
noise generator is improperly designed. If undesirable correlations exist, certain
patterns can be seen in the scatter plot [154]. An example based on 10000 Gaus-
194
Figure 7.10: Scatter plot of two successive noise samples for a population of
10000. No obvious correlations can be seen.
sian variables is shown in Figure 7.10; there is no evidence of obvious correlations.
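A numerical counterpart to the scatter plot is the lag-1 sample correlation coefficient, which should be close to zero for a well-behaved generator; a plain-Python sketch:

```python
import math

def lag1_correlation(y):
    # Sample correlation between successive pairs (y[i], y[i+1]).
    a, b = y[:-1], y[1:]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)
```

A strongly banded scatter plot corresponds to a coefficient near ±1, while independent samples give a value near zero.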
Figures 7.11 and 7.12 show the PDF obtained from our Gaussian noise gen-
erator for populations of one and four million samples. Both sets of samples pass
the χ2 and the A-D test, and we observe that the PDF is very smooth when the
sample size is large.
We compare our design with two other designs: “White Gaussian Noise Gen-
erator” block available in Xilinx System Generator 6.2 [188] and the design de-
scribed in Chapter 6. The “White Gaussian Noise Generator” block is based on
the “Additive White Gaussian Noise (AWGN) Core 1.0” from Xilinx [186]. The
Xilinx core follows the architecture presented by Boutillon et al. in [12], which
uses the Box-Muller method in conjunction with the central limit theorem. The
195
Figure 7.11: PDF of the generated noise from our design for a population of one
million. The p-values of the χ2 and A-D tests are 0.9994 and 0.2332 respectively.
design in Chapter 6 is also based on the Box-Muller method and central limit
theorem, but we employ more sophisticated approximation techniques for the
mathematical functions in the Box-Muller method, resulting in significantly more
statistically accurate noise samples. We test the noise samples generated from
the Xilinx block with the χ2 and the A-D test. We find that the samples fail
the tests after just around 160,000 samples. Figure 7.13 shows the PDF for a
population of one million noise samples from the Xilinx block. The samples fail
both the χ2 and the A-D test, and we observe some undesirable spikes in the
PDF.
Table 7.3 compares the Xilinx block, our Box-Muller design in Chapter 6 and
our current Wallace design. We can see that the Xilinx block uses fewer resources
and is slightly faster than our Wallace design, but as mentioned above the block fails
the statistical tests after a very small number of samples. Both of our Box-Muller
196
Figure 7.12: PDF of the generated noise from our design for a population of four
million. The p-values of the χ2 and A-D tests are 0.7303 and 0.8763 respectively.
and Wallace designs pass the statistical tests, even with very large numbers of
samples. However, our Wallace design is almost three times smaller and
slightly faster.
Figure 7.14 shows the variation of the χ2 test p-value with sample size for
the Xilinx block and various Wallace implementations using different data path
bitwidths. The 0.05 p-value pass mark is shown as a dotted line. We observe
that the Xilinx block fails after a small number of samples. For the Wallace
implementations, bitwidths lower than 24 bits fail the test gradually as the sample
size increases. Using 24 bits, which is the bitwidth used in our design, does not
fail the test even at large numbers of samples, and does not show signs of the
quality degrading.
Table 7.4 shows the hardware implementation results when multiple instances
of the noise generator are implemented on the device. We are able to fit up to
197
Figure 7.13: PDF of the generated noise from the Xilinx block for a population
of one million. The p-values of the χ2 and A-D tests are 0.0000 and 0.0002
respectively.
16 instances on the XC2V4000-6 FPGA, the number of block RAMs available on
the device being the limit. Of course, using a bigger device such as the Virtex-4
XC4VFX140-11 device [189] (which has 62848 slices, 560 block RAMs and
192 embedded multipliers), we would be able to fit over 50 instances. Note that it is
perfectly valid to use multiple instances of the noise generator, as long as the
LFSRs and pool RAMs are initialized with different random seeds and noise
samples.
Figure 7.15 shows how the number of noise generator instances affects the
output rate. While ideally the output rate would scale linearly with the number
of noise generator instances (dotted line), in practice the output rate grows slower
than expected, because the clock speed of the design deteriorates as the number
of noise generators increases. This deterioration is probably due to the increased
198
Table 7.3: Comparisons of different hardware Gaussian noise generators imple-
mented on Xilinx Virtex-II XC2V4000-6 FPGAs. All designs generate a noise
sample every clock cycle.
Xilinx [188] Chapter 6 this design
slices 653 2514 895
block RAMs 4 2 7
multipliers 8 8 4
speed [MHz] 168 133 155
pass χ2 test no yes yes
pass A-D test no yes yes
routing congestion and delay.
We have used our noise generator in LDPC decoding experiments [74]. Al-
though the output precision of our noise generator is 24 bits, 16 bits are found
to be sufficient for our LDPC decoding experiments. If precisions higher than 24
bits are required, we can simply increase the size of the data paths and the noise
samples in the memories.
To obtain a benchmark, we performed LDPC decoding using a full precision
(64-bit floating-point representation) software implementation of belief propaga-
tion in which the noise samples are also of full precision. We then performed
decoding using the LDPC algorithm but with noise samples created using the
design presented in this chapter. Over many simulations, we have found no dis-
tinguishable difference in code performance, even in the high Eb/N0 regions where
199
Figure 7.14: Variation of the χ2 test p-value with sample size for the Xilinx block,
12-bit, 16-bit, 20-bit and 24-bit Wallace implementation.
the error floor in BER is as low as 10^−9 (10^12 codewords are simulated).
Our hardware implementations have been compared to several software imple-
mentations based on the Wallace, Ziggurat [115], polar and Box-Muller method [78],
which are known to be the fastest methods for generating Gaussian noise for
instruction processors. For the Wallace and Ziggurat methods, FastNorm2 avail-
able in [181] and rnorrexp available in [115] are used. In order to make a fair
comparison, we use the same uniform number generator for all implementations.
The mixed multiplicative congruential (Lehmer) generator [179] used in the Fast-
Norm2 implementation is chosen. Software implementations are run on an Intel
Pentium 4 2.6GHz PC equipped with 1GB DDR-SDRAM. They are written in
ANSI C and compiled with the GNU gcc 3.2.2 compiler with -O3 optimization,
generating double precision floating-point numbers. The results are shown in Table 7.5.
200
Table 7.4: Hardware implementation results on a Xilinx Virtex-II XC2V4000-6
FPGA for different numbers of noise generator instances. The device has
23040 slices, 120 block RAMs and 120 embedded multipliers in total.
inst slices block RAMs embedded multipliers speed [MHz] million samples / sec
1 895 7 4 155 155
4 3590 28 16 151 606
8 7178 56 32 144 1149
12 10776 84 48 140 1668
16 14359 112 64 115 1843
The XC2V4000-6 FPGA belongs to the Xilinx Virtex-II family, while the
XC3S200E-5 FPGA belongs to the Xilinx Spartan-III family. It can be seen that
our hardware designs are faster than software implementations by 2–491 times,
depending on the device used and the resource utilization. Looking at the PC
results, we can see that the Wallace method performs significantly better than
other methods.
201
Figure 7.15: Variation of output rate against the number of noise generator
instances.
Table 7.5: Performance comparison: time for producing one billion Gaussian
noise samples.
platform speed [MHz] method time [s] ratio
XC2V4000-6 FPGA 115 16 inst 0.54 1
XC2V4000-6 FPGA 155 1 inst 6.5 12
XC3S200E-5 FPGA 106 1 inst 9.4 17
Intel Pentium 4 PC 2600 Wallace 22 41
Intel Pentium 4 PC 2600 Ziggurat 63 117
Intel Pentium 4 PC 2600 Polar 164 304
Intel Pentium 4 PC 2600 Box-Muller 265 491
202
7.6 Summary
We have presented a hardware Gaussian noise generator using the Wallace method
to support simulations which involve very large numbers of samples.
Our noise generator architecture contains four stages. It takes up approxi-
mately 3% of a Xilinx Virtex-II XC2V4000-6 FPGA and half of a Xilinx Spartan-
III XC3S200E-5, and can produce 155 million samples per second. Further im-
provement in performance can be obtained by concurrent execution: 16 parallel
instances of the noise generator on an XC2V4000-6 FPGA at 115MHz can run 41
times faster than software on a 2.6GHz Pentium 4 PC. The quality of the noise
samples is confirmed by two statistical tests: the χ2 test and the A-D test, and
also by applications involving LDPC decoding. The output of the noise generator
accurately models a true Gaussian PDF even at very high σ values. Although the
Wallace design occupies a smaller area and is faster than the Box-Muller design in
Chapter 6, it has slight correlations between successive transformations, which
may be undesirable for certain types of applications. Strategies to reduce such
correlations are discussed in the next chapter.
203
CHAPTER 8
Design Parameter Optimization
for the Wallace Method
8.1 Introduction
The Wallace method [180], described in the previous chapter, creates new outputs
based on linear combinations of a continually refreshed pool of previous outputs.
Outputs are produced in blocks, each containing the same number of values as
the pool, and each of which then becomes the pool for generation of the next
block. This process of generating a new pool from the old pool is called a ‘pass’.
This method is simple and fast, but can suffer from correlations at the output
due to its feedback nature. The main contributions of this chapter are:
• Tests designed specifically to detect correlations in the Wallace method.
• Parameter optimizations to reduce correlations.
• Identification of parameters minimizing execution time and cache require-
ments, while keeping correlations at minimum.
• Detailed performance tradeoff analysis on Athlon XP and Pentium 4 plat-
forms and comparisons with other methods.
This chapter is organized as follows. Section 8.2 provides a brief overview of
the Wallace method. Section 8.3 analyzes correlations that can occur with the
204
Wallace method and proposes stringent tests designed specifically to be sensitive
to such correlations. Section 8.4 describes parameter optimizations which help
to keep correlations sufficiently low to pass statistical tests. Section 8.5 provides
performance tradeoffs with different parameter settings and compares the opti-
mized Wallace method with other methods. Section 8.6 examines modifications
needed to the hardware design in Chapter 7 if the optimized parameters are used,
and Section 8.7 offers a summary.
8.2 Overview of the Wallace Method
At startup, the Wallace method involves seeding a pool with samples drawn
from a zero-mean Gaussian probability density function (PDF), and all subse-
quent outputs are then produced by applying K-dimensional orthogonal trans-
formations in L transformation steps to the contents of the pool. Two key design
parameters are therefore the size of the pool N = KL and the dimension K of
the orthogonal transformation.
Wallace’s original description utilized a pool size N of 1024 and a Hadamard
transform [52] with dimension K = 4 requiring only additions, subtractions and
shifts. Since orthogonal transformations are energy-conserving, if no other rescal-
ing is performed on the pool then the sum of the squares (or variance) of all blocks
would be identical. In order to address this defect, a variate from the previous
pool is used to approximate a random sample from the χ²_N distribution. A scaling
factor is introduced to ensure that the sum of the squares of the values in the
pool is the random sample. We can control the output rate by a factor of R
(i.e. the number of passes performed before noise samples are output) to reduce
205
correlations. This parameter is discussed in detail in the subsequent sections.
A simplified pseudo code of the Wallace method is shown in Figure 8.1. The
generate_addr() function generates pseudo random addresses for the array hold-
ing the pool, and is discussed in Section 8.4. As seen in Figure 8.1, there are no
conditional operations involved in the Wallace method, meaning that the output
data rate is a deterministic function of the underlying clock rate of the system.
It is this attribute as well as the simplicity of the arithmetic that makes the
Wallace method particularly attractive for hardware implementations. Wallace
provides several generations of source code referred to as FastNorm1, FastNorm2
and FastNorm3 [181].
As observed by Brent [16], [17] and Rub [157], one concern of the Wallace
method is the issue of correlations given the use of previous outputs to generate
new outputs. This is particularly problematic in the case of realizations with very
large absolute values lying in the tails of the Gaussian. When such a large value
is created and output, it also enters the pool from where it contributes directly
to K values in the subsequent block, K² values in the next block, and so on with
diminishing influence. Similar correlations can be found in the reverse direction
as well. In other words, the presence of a very large output in a given block
conveys a higher likelihood of abnormally large values in the previous block, as
it is those values which, when linearly combined, led to the large output.
Given the computational advantages that the Wallace method offers, it is
reasonable to ask what design choices can be made in order to maintain extremely
high output noise quality. More specifically, given a requirement that the output
accurately model a Gaussian out to magnitudes of Mσ, what design options exist
to achieve this requirement? In what follows, we discuss the measurement of the
correlations, illustrate their impact in the form of the PDF, explore the extent of
01: for i = 1..R /* R = retention factor */
02: for j = 1..L /* L = N/K */
03: /* read K values from pool */
04: for z = 1..K /* K = matrix size */
05: addr = generate_addr();
06: x[z-1] = pool[addr];
07: end
08: /* apply transformation to the K values */
09: x’[0..(K-1)] = transform(x[0..(K-1)]);
10: /* write K values to pool */
11: for z = 1..K
12: addr = generate_addr();
13: pool[addr] = x’[z-1];
14: end
15: end
16: end
17: pool[0..(N-1)] = sum_of_sq_corr(pool[0..(N-1)]);
18: return pool[0..(N-1)];
Figure 8.1: Pseudo code of the Wallace method.
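The loop structure of Figure 8.1 can be expressed in C roughly as follows. This is an illustrative sketch, not the FastNorm source: the address generator is simplified (the XOR mask is fixed to zero so that one pass provably visits every pool entry exactly once), and the K = 4 transform detailed in Section 8.4 is inlined.

```c
#include <stddef.h>
#include <math.h>

#define POOL_N 1024              /* pool size, as in FastNorm2 */
#define POOL_K 4                 /* transform dimension */

/* Simplified odd-stride address walk (mask = 0 for clarity); FastNorm2
 * additionally XORs a mask and re-seeds addr/stride/mask every pass. */
static unsigned next_addr(unsigned *addr, unsigned stride)
{
    *addr = (*addr + stride) & (POOL_N - 1);
    return *addr;
}

/* One pass over the pool (Figure 8.1 with R = 1). */
void wallace_pass(double pool[POOL_N], unsigned stride /* must be odd */)
{
    unsigned addr = 0;
    for (size_t j = 0; j < POOL_N / POOL_K; j++) {
        unsigned a[POOL_K];
        double x[POOL_K];
        for (size_t z = 0; z < POOL_K; z++) {
            a[z] = next_addr(&addr, stride);
            x[z] = pool[a[z]];
        }
        /* K = 4 orthogonal Hadamard-based transform, equations (8.3) */
        double t = 0.5 * (x[0] + x[1] + x[2] + x[3]);
        pool[a[0]] = t - x[0];
        pool[a[1]] = t - x[1];
        pool[a[2]] = x[2] - t;
        pool[a[3]] = x[3] - t;
    }
}

/* Demo: since the transform is orthogonal and an odd stride visits every
 * address once per pass, the pool's sum of squares is conserved. */
int wallace_pass_conserves_energy(void)
{
    double pool[POOL_N];
    double before = 0.0, after = 0.0;
    for (size_t i = 0; i < POOL_N; i++)
        pool[i] = (double)(i % 7) - 3.0;
    for (size_t i = 0; i < POOL_N; i++) before += pool[i] * pool[i];
    wallace_pass(pool, 5);
    for (size_t i = 0; i < POOL_N; i++) after += pool[i] * pool[i];
    return fabs(before - after) < 1e-6 * before;
}
```

The conserved sum of squares is precisely the "defect" discussed in Section 8.2 that the sum-of-squares correction step exists to break.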
this impact as a function of design parameters, and provide information relating
to when the Wallace method can still produce noise of sufficient quality for a
given application despite the use of recycled outputs.
8.3 Measuring the Wallace Correlations
The chi-square (χ2) goodness-of-fit test [32] is used to check the normality of
the random variables. We focus on correlations due to outputs of high absolute
value relative to σ, as it is those occurrences that correspond to perturbations
of the PDF in subsequent (and previous) output blocks. The following simple
experiment illustrates the correlation problem. We first run FastNorm2, which
uses a pool size of 1024 and transform dimension of four, for a sufficiently long
time to generate 4000 realizations with absolute value exceeding 5σ. We then
extract the 1024 data values comprising the block immediately following each
block containing such a large output, and combine the data from all of these
blocks (a total of 4000 × 1024 ≈ 4 million noise outputs) into a single sequence.
Selecting data preferentially in the neighborhood of the high-value outputs is
fair from a testing standpoint, as an ideal Gaussian noise sequence is of course
independent and identically distributed, and this approach to testing is aimed
specifically at an area of potential weakness of this method.
The four million values are evaluated directly using the χ2 test based on 200
bins spaced uniformly over [−7, 7]. The chi-squared output χ2199 is 3081.6, which is
a strong failure as it is well above the typical upper limit of 232.9 that corresponds
to a 0.05 confidence level. As illustrated in Figure 8.2, the bins causing the failure
are centered in the region of 4σ ∼ 5σ, illustrating the expected effect that in a
method based on reusing outputs, large outputs will lead to more large outputs.
Figure 8.3 shows the result when the quality of the noise is evaluated as a
function of distance relative to a block containing a high-value (> 5σ) output.
The dotted horizontal line is the 0.05 confidence level, i.e. values below this
line pass the χ2 test. The block containing the high value output is indexed
by 0 in the horizontal axis of the figure. An index of 1 refers to the block
[Figure 8.2 plot: per-bin χ2199 contributions versus bin position over [−7, 7];
FastNorm2, N = 1024, R = 1, K = 4, χ2199 = 3081.6.]
Figure 8.2: Four million samples of blocks immediately following the block con-
taining a 5σ output, evaluated with the χ2 test with 200 bins over [−7, 7] for
FastNorm2. The χ2199 contributions of each of the bins are shown.
immediately following block 0; an index of −1 refers to the block immediately
preceding block 0. Block 0 is not shown in the figure, since its χ2199 output is
on the order of millions. Figure 8.3 illustrates two main points. First, it shows
that the correlations are approximately symmetric. In other words, the presence
of a very high output not only leads to poorer noise quality in the following
block, but also indicates statistically exceptional behavior in the previous block.
Second, the improved performance as a function of displacement means that one
way to improve the noise quality is simply to retain only some fraction 1/R of
the output blocks. For the set of parameter choices used in generating the data
in Figure 8.3, choosing R = 3 so that only every third block is delivered to the
output of the noise generator would eliminate the correlation issue, albeit at the
cost of dropping the throughput by a factor of three.
Interestingly, the approach of applying a Gaussian-to-uniform transformation
[Figure 8.3 plot: χ2199 versus block displacement (−6 to 6);
FastNorm2, N = 1024, R = 1, K = 4.]
Figure 8.3: The χ2199 values of blocks relative to a block containing a realization
with absolute value of 5σ or higher. Four million samples are compiled for each
block. The dotted horizontal line indicates the 0.05 confidence level.
followed by a test such as the Diehard suite [113] can fail to capture problems in
the tail regions. If (x, y) is a pair of independent standard Gaussian random
numbers, then u = e^(−(x² + y²)/2) should be uniform over [0, 1]. Indeed, using this
identity, in the specific case of the data used to generate Figures 8.2 and 8.3, all
18 Diehard tests are passed. Given the general scarcity of high absolute value
outputs, even significant deviations in their numbers from the expected amount
can be masked by the mixing that occurs in the Gaussian-to-uniform transfor-
mation. Additionally, failure to isolate and test those blocks in the immediate
neighborhood of high-value outputs can also mask the problem illustrated in
Figures 8.2 and 8.3. The most direct way to identify block-by-block perturbations
in the Wallace output is to use knowledge of the underlying Wallace algorithm
and the locations of block boundaries within the data stream.
8.4 Reducing the Wallace Correlations
There are three basic ways to reduce correlations in the Wallace outputs. First,
the dimension K of the transform can be increased. Higher values of K mean that
each new output is a linear combination of K previous outputs, and this greater
amount of mixing dilutes the impact of any individual member of the pool. While
there are many ways to generate orthogonal transforms of a given size, Hadamard
transforms are particularly attractive because they are trivially generated and,
apart from a scaling factor, can be implemented using only additions and
subtractions. For these reasons, we used Hadamard transforms in the experiments
described below. Second, the overall size of the pool N can be increased. Increasing N while holding
K constant does not directly reduce the correlation between each set of K in-
puts and the K outputs they produce, but distributing the K outputs within a
larger N has a randomizing effect on the output. Finally, as noted above, not
all of the blocks that are generated need to be output as noise samples. At the
cost of reducing the output rate by a factor of R, the correlation impact can be
made arbitrarily small. The most advanced software version provided by Wallace,
FastNorm3, implements this method with R selectable from 2, 4, 8 or 16.
For our experiments, when K = 4, we use the following two Hadamard ma-
trices A0 and A1 and interchange them for each transformation:
    A0 = (1/2) ⎡ −1   1   1   1 ⎤
               ⎢  1  −1   1   1 ⎥
               ⎢ −1  −1   1  −1 ⎥
               ⎣ −1  −1  −1   1 ⎦                          (8.1)
    A1 = (1/2) ⎡  1  −1  −1  −1 ⎤
               ⎢ −1   1  −1  −1 ⎥
               ⎢  1   1  −1   1 ⎥
               ⎣  1   1   1  −1 ⎦                          (8.2)
Note that A1 is simply the negated version of A0, which is a valid approach
to obtain a new Hadamard matrix. For a given set of four values x0, x1, x2, x3
to be transformed, and with our choice of A0 and A1, the new values x′0, x′1, x′2, x′3
can be calculated from the old ones as follows:
x′0 = t− x0; x′1 = t− x1; x′2 = x2 − t; x′3 = x3 − t; (8.3)
and
x′0 = x0 − t; x′1 = x1 − t; x′2 = t− x2; x′3 = t− x3; (8.4)
where t = (1/2)(x0 + x1 + x2 + x3). Rather than straightforward matrix-vector multi-
plication, this approach (as used in the FastNorm implementations) reduces the
number of additions/subtractions required. We perform similar optimizations for
larger transformation matrices. Orthogonal matrices of size 8 and 16 are obtained
by using the property: if H is a Hadamard matrix, then

    H′ = ⎡ H   H ⎤
         ⎣ H  −H ⎦

is also a Hadamard matrix [52].
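This doubling property can be checked mechanically. The following C sketch (illustrative names) builds Hadamard matrices of sizes 2, 4 and 8 by repeated application of the construction and verifies row orthogonality.

```c
#include <stddef.h>

/* Sylvester construction: given a K x K Hadamard matrix H (entries +-1),
 * build the 2K x 2K matrix H' = [ H  H ; H  -H ], which is again
 * Hadamard.  Matrices are stored row-major; 'dim' is K. */
void hadamard_double(const int *h, int *h2, size_t dim)
{
    for (size_t i = 0; i < dim; i++) {
        for (size_t j = 0; j < dim; j++) {
            int v = h[i * dim + j];
            h2[i * 2 * dim + j] = v;                  /* top-left */
            h2[i * 2 * dim + dim + j] = v;            /* top-right */
            h2[(dim + i) * 2 * dim + j] = v;          /* bottom-left */
            h2[(dim + i) * 2 * dim + dim + j] = -v;   /* bottom-right */
        }
    }
}

/* Check the defining property: distinct rows are orthogonal. */
int is_hadamard(const int *h, size_t dim)
{
    for (size_t i = 0; i < dim; i++)
        for (size_t j = i + 1; j < dim; j++) {
            int dot = 0;
            for (size_t k = 0; k < dim; k++)
                dot += h[i * dim + k] * h[j * dim + k];
            if (dot != 0)
                return 0;
        }
    return 1;
}
```

Starting from the trivial 1 × 1 matrix [1], three doublings give the order-8 matrix used for the K = 8 transform (up to row/column sign changes such as those relating A0 and A1).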
It is desirable that any value in the pool should eventually contribute to every
value in the pools formed after several passes. To achieve this mixing effect, we
need a pseudo-random address generator for the indices addr of x0, . . . , xK−1. As
in FastNorm2, we use permutations of the form addr = ((addr + stride) & w) ^ mask,
where w = N − 1. The initial values of addr, stride and mask are generated from
a uniform random number generator at the beginning of each pass, and stride
is ensured to be odd. The sizes of each of the three uniform random numbers are
log2 N bits. Such addresses are produced by the generate_addr() function in
lines 5 and 12 of Figure 8.1. From Figure 8.1, we observe that the number of
integer additions, AND and XOR operations required for address generation is
inversely proportional to K.
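As a sanity check on the update rule, a single application of addr = ((addr + stride) & w) ^ mask is a bijection on {0, …, N − 1}: modular addition and XOR with a constant are both invertible. The small C sketch below (illustrative names, toy N = 64) verifies this exhaustively.

```c
#include <stddef.h>

#define AN 64                     /* small pool size for illustration */

/* One step of the FastNorm2-style address permutation.  In the real
 * generator addr, stride and mask are drawn from a uniform generator at
 * the start of each pass, with stride forced odd so that repeated
 * addition of stride cycles through all N residues. */
unsigned perm_step(unsigned addr, unsigned stride, unsigned mask)
{
    return ((addr + stride) & (AN - 1)) ^ mask;
}

/* A single application of the map is a bijection on {0, ..., AN-1} for
 * any stride and mask, because each of its two component operations is
 * invertible.  Verify by checking all outputs are distinct. */
int perm_step_is_bijective(unsigned stride, unsigned mask)
{
    int seen[AN] = {0};
    for (unsigned a = 0; a < AN; a++) {
        unsigned b = perm_step(a, stride, mask);
        if (b >= AN || seen[b])
            return 0;
        seen[b] = 1;
    }
    return 1;
}
```

Bijectivity of one step does not by itself guarantee that iterating the map visits every address within a single pass; the per-pass re-seeding of addr, stride and mask is what supplies the long-run mixing described above.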
Figure 8.4 illustrates the impact of various design choices on noise quality.
To obtain each data point in the three graphs, four million samples are compiled
from the block immediately after each block containing a high value (absolute
value ≥ 5σ = 5) output. The number of instances necessary to accumulate the
four million samples is a function of pool size. For example, when a pool size of
2048 is used, the generator is run for a long enough time to generate 2000 high-
value outputs. For each such output, the data in the single block immediately
following the one containing the high value output are retained. χ2 tests are
performed using 200 bins spaced uniformly over [−7, 7]. The vertical axis gives
the χ2199 result plotted on a logarithmic scale, with the customary upper limit
of 232.9 indicated using a dotted horizontal line in each figure. The pool size
is plotted on the horizontal axis, also in logarithmic scale. Each figure contains
three curves, corresponding to transform sizes K of 4, 8 and 16 respectively. The
three graphs in Figure 8.4 present the cases of discard ratios R ranging from one
(all data is output) to three (only every third block is output).
We observe that when R = 1, the quality of the noise samples improves with
both the pool size N and the transformation size K. At this retention setting, for
the three transformation sizes 4, 8 and 16, the samples pass the χ2 tests at pool
sizes of 32768, 8192 and 4096 respectively. Comparing the three graphs, as R is
increased the quality improves dramatically for small N . However, there is little
quality difference for large N , suggesting that choosing R > 1 and paying the
associated penalty in speed does not provide significant benefit.
Table 8.1: Number of arithmetic operations per transform/sample for the trans-
formation at various sizes of K.

          Additions/Subtractions      Multiplications
     K    Transform     Sample        Transform     Sample
     4        7           1.8             1           0.3
     8       31           3.9             9           1.1
    16      115           7.3            17           1.1
8.5 Performance Comparisons
There are three factors that affect the execution speed of the Wallace method:
the arithmetic operations for address generation and for the transformation, and
the number of table accesses. Table 8.1 shows the number of arithmetic operations
per transform/sample for the transformations at various sizes of K. These
arithmetic operations comprise the additions/subtractions and multiplications
shown in (8.3) and (8.4) in Section 8.4. From the table, we observe that the
numbers of additions/subtractions and multiplications required per sample are
around K/2 and one, respectively. The number of multiplications for K = 4 is
lower than for the others, because the scaling factor of its orthogonal matrix
is 1/2. For hardware designs or integer
arithmetic, shifts could be used for K = 4 and K = 16 instead of multiplications,
because the scaling factor 1/√K is then a power of two. We use 64-bit double
precision floating-point representation for the arithmetic operations, but integer
arithmetic is also feasible.
Besides the arithmetic operations, another important factor is the number
of table accesses. For each transformation, we read K values from the table
holding the pool, apply the transformation and write the K values back to the
table. For each pass, we perform a total of N reads and N writes in order to
read/write values from the table holding the pool. The number of table accesses
is not affected by K, but is directly proportional to R. As mentioned earlier, R
determines the number of passes before noise samples are output, e.g. if R = 2,
two passes are performed with 2N reads and 2N writes before samples are output.
We use two PCs, one fitted with an AMD Athlon XP 2400+ (2GHz) processor
and the other with an Intel Pentium 4 2GHz processor, for our performance
measurements. These two platforms are arguably the most commonly used for
computer simulations. Both platforms are equipped with 1GB DDR-SDRAM and run
Mandrake Linux 9.1. Designs are written in ANSI C and compiled with the GNU
gcc 3.2.2 compiler with -O3 optimization, generating double precision floating-
point numbers. Processor specific instruction sets such as ‘3DNow!’ or ‘SSE2’ are
not used. The -O3 setting performs optimizations such as prefetching, scalar re-
placement, and loop and memory access transformations. It is recommended for
applications that have loops that heavily use floating-point calculations and pro-
cess large data sets, which is very much the case for the Wallace method. For the
experiments in this section, we measure the execution time: time taken to pro-
duce one noise sample. The specifications of interest of the two processors [1, 68]
are listed in Table 8.2. Details of the data caches of the two processors are ob-
tained using the RightMark Memory Analyzer [153] and are shown in Table 8.3.
Since our noise samples are 64-bit (8-byte) double precision values, in principle,
pools of sizes 32768 and 65536 could fit into the level 2 caches of the Athlon XP
(32768 × 8 = 256KB) and Pentium 4 (65536 × 8 = 512KB) respectively.
Figure 8.5 explores how the execution time of arithmetic operations and table
accesses behave with varying K at N = 4096 and R = 1. Results are obtained
Table 8.2: Specifications of the AMD Athlon XP and Intel Pentium 4 platforms
used in our experiments.

    Specification                 Athlon XP       Pentium 4
    Process                       0.13 micron     0.13 micron
    Processor Core                Thoroughbred    Northwood
    Clock Speed                   2GHz            2GHz
    Pipeline Stages               10              20
    Floating-Point Units          3               1
    Branch Predictor Entries      2048            4096
    Frontside Bus Speed           266MHz          400MHz
from the gprof profiler: for each experiment, one billion iterations are run in
a loop and the overhead of the loop construct is subtracted. The lower part
of the bars shows the time consumed by arithmetic operations, and the upper
part shows the time consumed by table accesses. Besides the transformation, the
arithmetic operation times include other overheads such as address calculation
and branches, but they are small compared to the transformations.
Looking at the Athlon XP results, we observe that the arithmetic operation
times increase with K. However, the table access times decrease. Although
we always read and write 4096 locations, for small K such as K = 4, we read/write
four locations consecutively in L = 1024 steps. For large K such as K = 16, we
perform 16 consecutive reads/writes in L = 256 steps. Such reads and writes,
which correspond to lines 4 and 11 in Figure 8.1, can cause a branch misprediction
when z = K + 1. This results in a pipeline stall where the whole pipeline needs
to be flushed and refilled, causing severe delay. This effect explains the reduction
Table 8.3: Details of the AMD Athlon XP and Intel Pentium 4 data caches.

                        Athlon XP               Pentium 4
    Specification       Level 1    Level 2      Level 1    Level 2
    Size                64KB       256KB        8KB        512KB
    Speed               2GHz       2GHz         2GHz       2GHz
    Latency             3 cycles   11 cycles    2 cycles   9 cycles
    Sets                512        256          32         1024
    Block Size          64 bytes   64 bytes     64 bytes   64 bytes
    Associativity       2 way      16 way       4 way      8 way
in table access times with increasing K. We have used the SimpleScalar x86
processor simulator [166] to confirm that the number of branch mispredictions
reduces with K. Moreover, when compared to the Athlon XP, we see a significant
performance loss in the Pentium 4 results. This loss may be because the
Pentium 4's arithmetic operations are slightly slower than the Athlon XP's,
since it has only one floating-point unit compared to the three available on the
Athlon XP. Table access occupies a large portion of the execution time, possibly
because of the Pentium 4's smaller level 1 cache and its deeper pipeline: its
20-stage pipeline has high branch misprediction penalties [62].
Figure 8.6 explores the execution time tradeoffs as a function of parameter
choice on the two platforms for N = 512 (size = 4KB) to N = 8192 (size =
64KB). Table 8.4 shows the numerical results at N = 4096. On both platforms,
as expected, execution time increases with R, since retaining only a fraction 1/R
of the blocks requires R passes per output block. Looking at the Athlon XP results
in Table 8.4, K = 4 is significantly faster than K = 8 and K = 16, especially
for large R. This observation is likely due to the small number of multiplications
Table 8.4: Execution time in nanoseconds for the AMD Athlon XP and Intel
Pentium 4 platforms at N = 4096.

    R                  1               2               3
    K             4    8   16     4    8   16     4    8   16
    Athlon XP     6    8    9    11   18   18    17   26   27
    Pentium 4    22   19   19    39   31   34    56   46   51
involved when K = 4 (Table 8.1). The Pentium 4 results are less linear, probably
due to the small level 1 cache and branch misprediction penalties. Figure 8.6 also
shows that increasing the pool size N causes no significant change in execution
time, though it does of course require more memory for the pool table. We
conclude that N has no significant execution time impact on either platform, K
has little effect on the Pentium 4 but a notable effect on the Athlon XP, and,
as one would expect, the consequences of increasing R are the most significant
in all cases.
Thus, for these implementations at least, the much improved noise quality
enabled by larger transforms represents an extremely good tradeoff to make.
Based on these observations and the results in Figure 8.4, it is better, in terms
of both noise quality and speed, to use large N and K but keep R = 1. Hence, for
example, choosing N = 4096, R = 1 and K = 16 on both platforms leads to
an optimized Wallace implementation that has low execution time and cache
requirements, while keeping the correlation effects to a minimum.
Figure 8.7 shows the execution time variation for pool sizes of 4KB (N = 512)
to 512KB (N = 65536) at R = 1 and K = 16. Looking at the Athlon XP curve,
the execution time stays roughly constant up to 64KB and then starts to increase
at 128KB. In principle, we could store the whole pool in the level 2 cache up to
256KB. However, it is not just our pool that is stored in the cache; other program
variables and operating system data are stored there as well. Most likely, the
entire pool is kept in the level 2 cache up to 64KB, but beyond this point (e.g.
at 128KB) the cache is saturated and cache misses occur, so that parts of the
pool must be fetched from main memory; hence the sudden increase in execution
time. The same applies to the Pentium 4 curve, except that the saturation effect
occurs at 256KB, due to the Pentium 4's level 2 cache being twice as large as the
Athlon XP's.
In order to investigate how the level 2 cache saturation effect varies with
different values of N , we again use the SimpleScalar x86 simulator. Figure 8.8
shows the level 2 cache miss rates for different level 2 cache sizes at various pool
sizes, at R = 1 and K = 16. The level 1 cache is fixed at 16KB throughout and 65536
noise samples are generated for each data point. The 256KB level 2 cache result
uses 1024 sets, 128 byte blocks, two way set associativity, and LRU (least recently
used) replacement policy. Smaller level 2 cache sizes are obtained by reducing
the number of sets by powers of two. As expected, we observe a rapid increase in
miss rate once the level 2 cache is saturated; this observation is consistent with
the trend of increasing execution time shown in Figure 8.7.
In Table 8.5, we compare the performance of our optimized Wallace imple-
mentation against the Ziggurat, Polar and Box-Muller methods on the Athlon XP
and Pentium 4 platforms. For the Ziggurat method, rnorrexp from [115] is used.
For the Polar and Box-Muller methods, we follow the algorithms described in [78].
In order to make a fair comparison, we use the same uniform number generator
for all implementations. The mixed multiplicative congruential (Lehmer) gen-
erator [179] used in the FastNorm implementations is chosen. We observe that
the optimized Wallace implementation is more than three times faster than the
Ziggurat method, which is widely regarded as the fastest Gaussian random num-
ber generator for instruction processors. These results suggest that the Wallace
method, with the optimizations proposed in this work, should be considered a
serious candidate when high-speed Gaussian noise generation is required.
Table 8.5: Performance comparison of different software Gaussian random num-
ber generators. The Wallace implementations use N = 4096, R = 1 and K = 16.

    Method         Platform     Execution Time [ns]    Ratio
    Wallace        Athlon XP            9               1
                   Pentium 4           19               2.1
    Ziggurat       Athlon XP           30               3.3
                   Pentium 4           62               6.9
    Polar          Athlon XP          117              13.0
                   Pentium 4          170              18.9
    Box-Muller     Athlon XP          158              17.6
                   Pentium 4          275              30.6
8.6 Hardware Design with Optimized Parameters
The Wallace hardware architecture presented in Chapter 7 uses N = 1024, R = 1
and K = 4, meaning that significant correlation could be detected with our tests
described in Section 8.3. Modifying the hardware architecture to reflect the
new optimized parameters N = 4096, R = 1 and K = 16 would mainly involve
additional addition/subtraction logic and memory. In summary, the following
architectural changes are needed:
• Since each noise variable in the pool is 24 bits, the size of the pool required
would be 4096 × 24 = 98304 bits. Hence, we would need six block RAMs
(each block RAM can hold 18Kb) each for “init Pool ROM” and “Pool
RAM”.
• A pool size of 4096 means that we need 12 bits each for the three random
numbers start, stride and mask. This means that six more LFSRs are
needed, and also six more entries are needed in the “LFSR Seed ROM”,
which will still fit into a single block RAM.
• Since adders are cheap on FPGAs, there will be just a slight increase in the
number of slices for the increased number of transformation operations.
• The scheduling of the transformation circuit (Figures 7.3 and 7.4 in Chap-
ter 7) will have to be modified to reflect the new set of arithmetic operations.
Given that one can find a good scheduling strategy for the transformation,
the optimized parameters will have little effect on the speed, since one can always
pipeline hardware designs. Like the implementation in Chapter 7, the design will
likely run in the 155MHz range on a Virtex-II FPGA, resulting in around
6.5ns per sample.
[Figure 8.4 plots: three panels for R = 1, R = 2 and R = 3, each showing χ2199
(log scale) versus pool size N = 512 to 32768, with curves for K = 4, 8 and 16.]
Figure 8.4: Impact of various design choices on the χ2199 value. Four million
samples are compiled from the block immediately after each block containing an
absolute value of 5σ or higher for each data point. The dotted horizontal line
indicates the 0.05 confidence level.
[Figure 8.5 plot: stacked-bar execution times in ns for K = 4, 8 and 16 at
N = 4096 and R = 1, with arithmetic and table-access percentages shown for the
Athlon XP and Pentium 4.]
Figure 8.5: Speed comparisons at various K at N = 4096 and R = 1. Lower
part: arithmetic operations. Upper part: table accesses.
[Figure 8.6 plots: execution time in ns versus N = 512 to 8192 for all
combinations of R = 1, 2, 3 and K = 4, 8, 16; left panel AMD Athlon XP 2GHz,
right panel Intel Pentium 4 2GHz.]
Figure 8.6: Speed comparisons for different parameter choices. The solid, dashed
and dotted lines are for R = 1, R = 2 and R = 3 respectively.
[Figure 8.7 plot: execution time in ns versus pool size 4KB to 512KB at R = 1
and K = 16, for the Athlon XP (256KB level 2 cache) and the Pentium 4 (512KB
level 2 cache).]
Figure 8.7: Execution times for different pool sizes at R = 1 and K = 16.
The solid and dotted lines are for the Athlon XP and the Pentium 4 processors
respectively.
[Figure 8.8 plot: level 2 cache miss rate in % versus pool size 4KB to 512KB on
the SimpleScalar x86 simulator (level 1 cache 16KB), for level 2 cache sizes of
16KB, 32KB, 64KB, 128KB and 256KB.]
Figure 8.8: Level 2 cache miss rates on the SimpleScalar x86 simulator for differ-
ent pool sizes at R = 1, K = 16 and various level 2 cache sizes. Level 1 cache is
fixed at 16KB and 65536 noise samples are generated for each data point.
8.7 Summary
We have explored the impact of parameter choice on noise quality for the Wallace
Gaussian random number generator. Using tests designed specifically to identify
the presence of correlations due to the use of previous outputs in generating new
outputs, we have identified specific combinations of pool size, transform size, and
retention factor (one example is N = 4096, R = 1 and K = 16) that deliver high
quality noise output at high speeds. Thorough performance tradeoff studies have
been conducted for AMD Athlon XP and Intel Pentium 4 based platforms. With
the aid of these studies, we have shown that the much improved noise quality
enabled by larger transforms and pool sizes represents an extremely good tradeoff
to make. Performance comparisons with other Gaussian random number generators
have been carried out, demonstrating that, given a careful choice of parameters,
the Wallace method is a serious competitor due to its speed advantages. We have
also examined the architectural changes needed if the optimized parameters are
used with the Wallace design presented in Chapter 7.
As noted earlier, the Wallace method is particularly attractive from an im-
plementation standpoint, because of its lack of conditional statements and its
reliance on simple mathematical computations. While the presence of some cor-
relation between data in nearby blocks is an unavoidable byproduct of any ap-
proach using feedback, the results presented here provide specific guidance on
how to create extremely high quality noise with no detectable correlation even
when highly targeted tests are used.
CHAPTER 9
Flexible Hardware Encoder for LDPC Codes
9.1 Introduction
In the past few years, LDPC codes [48], [49] have received much attention be-
cause of their excellent performance and the large degree of parallelism that can
be exploited in the decoder. LDPC codes are widely considered to be the most
promising candidate ECC scheme for many applications in telecommunications
and storage devices. Recently, LDPC codes have been selected over Turbo
codes [7] by Europe’s DVB standards group for next-generation digital satellite
broadcasting due to their superior performance. Provided that the information
block lengths are long enough, performance close to the Shannon limit can be
achieved with LDPC codes.
Although LDPC codes achieve better performance and have low decoding
complexity compared to Turbo codes, one of the major drawbacks of LDPC
codes lies in their apparently high encoding complexity. Whereas Turbo codes
can be encoded in linear time, a straightforward implementation for an LDPC
code has complexity quadratic in the block length. Note that the complexity
referred to here is measured in the number of mathematical operations required
per bit. In [152], Richardson and Urbanke (RU) show that linear time encoding
is achievable through careful linear manipulation of ‘good’ LDPC codes. In their
paper, they present methods to preprocess the parity-check matrix H and a set
of matrix operations to perform the actual encoding. We have implemented the
preprocessing in software since it needs to be performed only once for a given
H matrix. For the actual hardware encoder, we have identified the operations
that can be run in parallel and scheduled the tasks to maximize throughput.
In addition we have designed an efficient memory architecture for storing sparse
matrices.
The principal contribution of this chapter is a fast and efficient hardware
encoder for both irregular and regular LDPC codes based on the RU method.
The novelties of our work include:
• a software preprocessor that brings the parity-check matrix H into an approxi-
mate lower triangular form;
• a hardware architecture with an efficient memory organization for storing and
performing computations on sparse matrices;
• an implementation and evaluation of the encoder, which achieves an 80 times
speedup over a 2.4GHz PC; we also explore run-time reconfiguration op-
portunities.
The rest of this chapter is organized as follows. Section 9.2 presents an
overview of our approach. Section 9.3 describes how we preprocess the H matrix.
Section 9.4 presents our hardware encoder architecture. Section 9.5 describes the
main components used in our encoder. Section 9.6 discusses our implementation
results, and Section 9.7 offers a summary.
[Figure 9.1 diagram: the m × n matrix H partitioned into blocks A, B and T
(lower triangular, with the zero region above its diagonal) in the top m − g rows,
and C, D and E in the bottom g rows.]
Figure 9.1: The parity-check matrix H in ALT form. A, B, C, and E are sparse
matrices, D is a dense matrix, and T is a sparse lower triangular matrix.
9.2 Overview
The RU algorithm, as described in Section 2.6.3, consists of two steps: a prepro-
cessing step and the actual encoding step. In the preprocessing step, row and
column permutations are performed to bring the parity-check matrix H into an
approximate lower triangular (ALT) form (Figure 9.1). Since the transformation
is accomplished by permutations only, the sparseness of the matrix is preserved.
The actual encoding is carried out by matrix-multiplication, forward-substitution
and vector addition operations. Since the preprocessing needs to be performed
only once on a given H matrix, we execute this operation in software. The ac-
tual encoding step is done in hardware. The RU encoding algorithm is presented
in [152] as a set of matrix operations. We have examined the algorithm and
identified the operations that can be executed in parallel. The operations im-
plemented in our hardware encoder are scheduled to maximize concurrency and
throughput. Moreover, we employ an efficient memory architecture for storing
sparse matrices, which minimizes memory usage.
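As a generic illustration of why sparse storage pays off for a binary H: in GF(2) only the positions of the ones matter, so a compressed sparse row (CSR) layout stores just column indices and row boundaries. The sketch below is a standard textbook layout with illustrative names, not the specific memory organization described later in Section 9.5.

```c
#include <stddef.h>

/* Generic CSR layout for a binary matrix such as H over GF(2): since
 * every nonzero is 1, only the column indices and the per-row boundaries
 * need storing. */
typedef struct {
    size_t rows;
    const size_t *row_ptr;  /* rows + 1 entries; nonzeros of row i occupy */
    const size_t *col_idx;  /* col_idx[row_ptr[i] .. row_ptr[i+1] - 1]    */
} csr_binary;

/* Sparse matrix-vector product over GF(2): y = H * x, i.e. y[i] is the
 * XOR of the bits of x selected by row i's column indices.  This is the
 * basic operation behind the matrix-multiplication steps of the RU
 * encoding procedure. */
void csr_gf2_mul(const csr_binary *h, const unsigned char *x,
                 unsigned char *y)
{
    for (size_t i = 0; i < h->rows; i++) {
        unsigned char acc = 0;
        for (size_t k = h->row_ptr[i]; k < h->row_ptr[i + 1]; k++)
            acc ^= x[h->col_idx[k]];
        y[i] = acc;
    }
}
```

For a sparse H with a few ones per row, this costs memory and work proportional to the number of ones rather than to the full m × n dimensions.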
[Figure 9.2 diagram: H Matrix → Preprocessor (SW) → ALT H Matrix →
Encoder (HW); Message Blocks feed the Encoder, which outputs Codewords.]
Figure 9.2: LDPC encoding framework.
The basic framework of our encoder is shown in Figure 9.2. Our approach
for LDPC encoding consists of two steps: preprocessing and hardware encoding.
First, the original parity-check matrix H is preprocessed with the RU algorithm
to generate the appropriate look-up tables consisting of the six matrices needed
by the hardware encoder. These matrices are generated from the RU algorithm
and contain information on how the input message blocks are encoded to generate
codewords. This preprocessing step is implemented in software and needs to be
performed once for a given H matrix. The hardware encoder itself is implemented
on an FPGA and uses the look-up tables (ALT H matrix) generated from the
preprocessing step to encode the message blocks. Note that the preprocessing
step does not involve any data. Hence, during a normal encoding operation only
the hardware encoder is needed. Although our implementation is based on H
matrices that are binary, GF(2), it can be extended to matrices that belong to
higher order fields.
The RU algorithm and the hardware architecture proposed in this chapter
make no restrictions on the actual H matrix. This flexibility allows our hardware
Figure 9.3: An equivalent parity-check matrix in lower triangular form. Note that n = block length and m = block length × (1 − code rate).
architecture to be used in any application involving LDPC codes. Different appli-
cations require different H matrices. Applications requiring low latency typically
use shorter block lengths (less than 1000 bits), while applications requiring op-
eration near the channel capacity require longer block lengths (more than 10000
bits). Code rate r also influences the dimensions of the H matrix. Low code
rates offer more error protection at the expense of information throughput and
are often used when the SNR is very low (e.g. deep space communications). The
dimensions of the H matrix are (block length×(1−code rate)) by (block length)
as illustrated in Figure 9.3. Our hardware architecture is completely flexible in
regards to block length and code rate.
Another issue related to encoder flexibility is the specific location of ones in
the H matrix. Properly designed regular LDPC codes have performance that
continues to improve as the SNR is increased. Irregular LDPC codes do not have
this property; they have a so-called ‘error floor’, meaning that after a certain level
of performance is reached, the performance stops improving. For example, say a
code operates at a BER of 10^−5 at 2dB SNR. If an error floor exists, the BER will be the same when the SNR is increased to 5dB. If an error floor were not present, then the BER would improve to, say, 10^−6. While regular LDPC codes have no error floor, they do not perform as close to capacity as irregular codes. This
means that as the SNR is increased, the BER will decrease faster with irregular
codes than with regular codes. An ideal code performs close to capacity and
contains no error floor. We have designed high-performance LDPC codes in [174]
using special code construction techniques, which perform close to capacity and
have reduced error floors. Our hardware is completely flexible in regards to the location of ones in the H matrix; in other words, it can encode any LDPC code.
9.3 Preprocessing
In preprocessing, row and column permutations are performed to bring the H
matrix into an ALT form. Richardson and Urbanke [152] introduced three greedy algorithms, a, b and c, to perform this task. We choose greedy algorithm a for our software preprocessor due to its simplicity; the three algorithms are discussed in detail at the end of this section.
Preprocessing consists of two steps: triangulation and rank checking. Tri-
angulation is the process of row and column permutations that produces an H
matrix similar to the one shown in Figure 9.1, with the smallest gap g possible.
Multiplying

\begin{bmatrix} I & 0 \\ -ET^{-1} & I \end{bmatrix} \quad (9.1)
Table 9.1: Computation of p_1^T = −F^{−1}(−ET^{−1}A + C)s^T. Note that T^{−1}[As^T] = y^T ⇔ Ty^T = As^T.

index   operation                        comment                                 complexity
1       As^T                             multiplication by sparse matrix         O(n)
2       T^{−1}[As^T]                     forward-substitution by sparse matrix   O(n)
3       −E[T^{−1}As^T]                   multiplication by sparse matrix         O(n)
4       Cs^T                             multiplication by sparse matrix         O(n)
5       [−ET^{−1}As^T] + [Cs^T]          vector addition                         O(n)
6       −F^{−1}[−ET^{−1}As^T + Cs^T]     multiplication by dense g × g matrix    O(g^2)

Table 9.2: Computation of p_2^T = −T^{−1}(As^T + Bp_1^T).

index   operation                  comment                                 complexity
7       As^T                       multiplication by sparse matrix         O(n)
8       Bp_1^T                     multiplication by sparse matrix         O(n)
9       [As^T] + [Bp_1^T]          vector addition                         O(n)
10      −T^{−1}[As^T + Bp_1^T]     forward-substitution by sparse matrix   O(n)
from the left of Hx^T = 0, where

H = \begin{bmatrix} A & B & T \\ C & D & E \end{bmatrix},

we get

\begin{bmatrix} A & B & T \\ -ET^{-1}A + C & -ET^{-1}B + D & 0 \end{bmatrix}
\begin{bmatrix} s^T \\ p_1^T \\ p_2^T \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad (9.2)
This gives two equations in the two unknowns p_1 and p_2. Define F = −ET^{−1}B + D and assume for the moment that F is nonsingular. Solving for p_1 and p_2 yields

p_1^T = −F^{−1}(−ET^{−1}A + C)s^T \quad (9.3)

and

p_2^T = −T^{−1}(As^T + Bp_1^T). \quad (9.4)
From Table 9.1 and Table 9.2 we can see that the complexity of the operations required to obtain p_1 and p_2 is mostly linear; the only exception is the dense matrix multiplication −F^{−1}(−ET^{−1}A + C)s^T, which has complexity O(g^2). Since the gap g is small, we have achieved near linear encoding complexity.
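As a sanity check, the operations of Tables 9.1 and 9.2 can be exercised on a toy ALT-form matrix. The sketch below is our own illustration (the matrix values and function names are assumed, not taken from the thesis); over GF(2) the minus signs in (9.3) and (9.4) vanish, and the resulting codeword must satisfy Hx^T = 0:

```python
import numpy as np

# Toy ALT-form blocks over GF(2) with n = 6, m = 3, g = 1 (values are ours).
A = np.array([[1, 0, 1], [0, 1, 1]]); B = np.array([[1], [0]])
T = np.array([[1, 0], [1, 1]])                    # unit lower triangular
C = np.array([[1, 0, 1]]); D = np.array([[0]]); E = np.array([[0, 1]])

def fwd_sub(T, y):
    """Solve T z = y over GF(2) by forward-substitution."""
    z = np.zeros_like(y)
    for i in range(len(y)):
        z[i] = (y[i] + T[i, :i] @ z[:i]) % 2
    return z

# Apply T^{-1} to a matrix column by column via forward-substitution.
Tinv = lambda M: np.column_stack([fwd_sub(T, M[:, j]) for j in range(M.shape[1])])

s = np.array([1, 0, 1])
F = (E @ Tinv(B) + D) % 2                         # F = E T^{-1} B + D (mod 2)
assert F[0, 0] == 1                               # nonsingular 1x1, so F^{-1} = F
p1 = (F @ ((E @ Tinv(A) + C) @ s)) % 2            # Eq. (9.3), signs vanish mod 2
p2 = fwd_sub(T, (A @ s + B @ p1) % 2)             # Eq. (9.4)
x = np.concatenate([s, p1, p2])
H = np.block([[A, B, T], [C, D, E]])
assert not (H @ x % 2).any()                      # x is a valid codeword
```

Only sparse multiplications, one forward-substitution per solve, and a single dense (here 1 × 1) inversion are needed, mirroring the operation schedule of the tables.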
Since the equation for p1 depends on the inverse of F , the method will only
work when F is nonsingular (invertible). This requires additional rank checking
before the actual encoding. To obtain F , we perform Gaussian elimination on
the original H to bring it into the form

\begin{bmatrix} A & B & T \\ -ET^{-1}A + C & -ET^{-1}B + D & 0 \end{bmatrix} \quad (9.5)
If F is singular, we swap columns of F with columns to the left of F and keep
doing this until F becomes nonsingular.
So far, we have shown the encoding complexity to be linear, except for the
dense g×g matrix multiplication, where g is the gap of the preprocessed H matrix
(Figure 9.1). Thus for efficiency, we should make g as small as possible.
The greedy algorithm a is used to find the best possible lower triangularization
of the parity matrix H. The algorithm begins by assigning Q = HT . Then the
following steps are applied:
1. Find a vector of indices to degree one rows in Q and call this vector α. If
α is empty, remove the left most column of Q and repeat step 1 with the
modified Q matrix. Let l equal the length of the vector α.
2. Modify Q so that the degree one rows indicated by the elements of α are
moved to the top of Q (row numbers 1 to l).
3. Reorder the columns of the modified Q so that the rows that were moved
to the top of the matrix form a diagonal. This step is known as diagonal
extension.
4. Modify Q again by removing the first l rows and columns of modified Q.
5. Find a vector of indices to degree one rows in modified Q and call this
vector α. If α is empty, the algorithm terminates. Otherwise, go to step 2.
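The five steps above can be sketched in software. The toy model below is our own construction (not the thesis preprocessor): it tracks only the resulting gap g and omits the row/column permutation bookkeeping that a real preprocessor must record:

```python
import numpy as np

def greedy_a_gap(H):
    """Sketch of greedy algorithm a on Q = H^T: peel degree-one rows; every
    column dropped because none exist (step 1) adds one to the gap g."""
    Q = H.T.copy()
    gap = 0
    while Q.size:
        deg1 = np.where(Q.sum(axis=1) == 1)[0]     # steps 1/5: degree-one rows
        if len(deg1) == 0:
            Q = Q[:, 1:]                           # remove the left-most column
            gap += 1
            continue
        rows, cols, seen = [], [], set()
        for r in deg1:                             # steps 2-3: rows whose single
            c = int(np.argmax(Q[r]))               # ones sit in distinct columns
            if c not in seen:                      # form the next diagonal block
                seen.add(c); rows.append(r); cols.append(c)
        Q = np.delete(np.delete(Q, rows, axis=0), cols, axis=1)  # step 4
    return gap

# A lower triangular H peels completely (gap 0); an all-ones H forces one drop.
assert greedy_a_gap(np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1]])) == 0
assert greedy_a_gap(np.ones((2, 3), dtype=int)) == 1
```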
To summarize the above algorithm, through row and column operations a
small identity matrix is created at the top left corner of the main matrix, and
then the rows and columns participating in the identity matrix are deleted. This
is repeated until there are no more degree one rows in the remaining matrix.
Thus, through row and column swapping, we have produced the H matrix in
Figure 9.1.
The gap size is equal to the number of columns removed in step 1. Note
that the gap size is further reduced by applying the greedy algorithm a to HT
rather than H itself. This is due to the different starting columns when applying
step 1. Since H is not a square matrix, we can triangularize at most as many columns as there are rows. As a result, if we were to carry out step 1 on a ‘fat’ H matrix, we could only look at part of the columns. For
example, for a rate 1/2 H matrix, in order to achieve the result in Figure 9.1,
we can only start with the middle column and work with the right half of the
matrix to search for degree one rows. On the other hand, if we apply step 1 to a
‘skinny’ HT matrix, we can start with the very first column since the number of
columns now is much less than the number of rows. This enables us to look at the
entire matrix when searching for degree one rows. This extra degree of freedom
results in better triangulation. The two different approaches are illustrated and
compared in Figure 9.4.

Figure 9.4: Different starting columns for H and H^T.
In addition to the greedy algorithm a, Richardson and Urbanke also intro-
duced greedy algorithm b and greedy algorithm c. In algorithm b, rather than
choosing the starting columns (starting point of triangulation) independently of
one another, they are chosen based on the weights of the rows with which they
are connected. This may reduce gaps but also requires more complicated process-
ing. Greedy algorithm c is built upon algorithm b with a looser constraint on the
weight distributions of rows and hence the definition of the starting columns with
which they are connected. Both b and c offer slightly smaller gaps in some cases;
however, we have chosen a since it offers satisfactory results in all cases we have
examined. Triangulation can be time-consuming with large block sizes. Since
this step only needs to be performed once for a given H matrix, such overhead
is tolerable.
9.4 Encoder Architecture
The hardware encoder computes the two parity parts p1 and p2 according to the
operations described in Table 9.1 and Table 9.2. Operations that can be executed
Figure 9.5: Overview of our hardware encoder architecture. Double buffering is used between the stages for concurrent execution. Grey and white boxes indicate RAMs and operations, respectively.
in parallel are identified and are scheduled to maximize parallelism. An overview
of our hardware encoder architecture is shown in Figure 9.5. The operations are
grouped into four stages, and double buffering is used between the stages so that they can execute concurrently. Each stage generates a ‘finish’
signal once its computation is completed. Once all stages are completed (which is
when a codeword is generated), a ‘start’ signal is sent from the stage controller to
each stage for the next execution. The stages have been carefully partitioned to
balance the workloads between the stages, while minimizing the overall latency,
idle times and buffering requirements. This flexible architecture supports any
rate and block length, but has been specifically optimized for rate 1/2 codes.
The aim of dividing the encoding process into different stages is to balance
the execution times among the stages, so that the idle time of any of the stages is
minimized. Given that the rate is 1/2, the gap is small and the edges (ones) of the
H matrix are distributed in a random manner, the matrix A will contain nearly
half of the edges of the entire preprocessed H matrix. Also, since the matrix T is
lower-triangular, the number of its edges will be around half that of A. Therefore, the computation As^T (operation 1) will take the longest. This is because the number of clock cycles is proportional to the number of edges, as will be clarified
later. Since the gap is small, operations involving B, C, E and F will be very
fast.
In Stage 1, we simply write the message block to buffers. Since the message
block length is n − m, this stage will take n − m clock cycles. In Stage 2, we
perform operations 1 and 4 in parallel (the operations are listed in Table 9.1 and
Table 9.2). We do not do any other operations in this stage, since subsequent
operations are dependent on the result of operation 1 and operation 1 takes
the most time. In Stage 3, we perform all the remaining operations needed to
compute p1 as well as operations 8 and 9. In Stage 4, we perform operation 10 and
codeword generation. This segmentation into four stages balances the workload
across stages well for rate 1/2. In principle, we could parallelize some of the
matrix-vector multiplications and forward-substitutions to get higher throughput.
However, parallelizing those operations would involve duplicating the look-up
tables (since dual-port RAM is the best we can get from current FPGAs), which
would require significantly more area. Moreover, we can simply replicate many
instances of the encoder on the same chip to process several message blocks in
parallel without increasing RAM area needed for the look-up tables of the six
matrices. These look-up tables can be shared among the encoder instances.
Depending on the channel conditions, codes with different rates perform bet-
ter than others. For instance, when the SNR is low, lower rate codes are more appropriate. Therefore one could implement an adaptive LDPC encoder, which
changes rate or block length depending on the channel conditions. Of course the
LDPC encoder would have to be synchronized with an adaptive LDPC decoder.
Although the architecture shown in Figure 9.5 could be used for different rates, it
is optimized for rate 1/2 codes. Codes with different rates differ in the dimensions
of the H matrix, leading to different edge ratios for the six matrices. Therefore, different scheduling of the operations is needed for different rates to maximize
concurrency. Bit files of different designs optimized for different rates can be
stored in memory, and run-time reconfiguration of FPGAs can be exploited to
reconfigure the adaptive LDPC encoder at run-time for different channel con-
ditions. Since reconfiguration can be performed in a matter of milliseconds on
modern FPGAs, such adaptive LDPC encoders/decoders are viable options.
The main operations performed in the encoder are matrix-vector multiplica-
tion (MVM), forward-substitution (FS), vector addition (VA) and codeword gen-
eration (CWG). Codeword generation involves first constructing an intermediate
codeword by writing (s, p1, p2) into a memory. Then according to the permuta-
tion table, which contains the information on the row permutations performed
during the preprocessing step, the intermediate codeword is rearranged to gener-
ate the final codeword which is then valid with regards to the original H matrix.
The hardware architectures for vector addition, matrix-vector multiplication and
forward-substitution are described in the next section. Since we are dealing with
a binary system, multiplications can be performed with an AND gate and additions
with an XOR gate.
Figure 9.6: Circuit for vector addition (VA).
9.5 Components for the Encoder
9.5.1 Vector Addition
This involves the computation of X + Y = Z, where X, Y and Z are vectors,
and Z is what we are trying to compute. Since we are dealing with a binary
system, vector addition can be simply achieved by performing XOR operations on
the corresponding elements of the two vectors. The circuit for vector addition is
shown in Figure 9.6. The index calculator increments the index every clock cycle.
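In code, the unit reduces to an element-wise XOR; the two-line model below (our illustration) makes the GF(2) equivalence explicit:

```python
# Binary vector addition: X + Y over GF(2) equals element-wise XOR.
X, Y = [1, 0, 1, 1], [0, 0, 1, 0]
Z = [x ^ y for x, y in zip(X, Y)]                  # one element per clock cycle
assert Z == [(x + y) % 2 for x, y in zip(X, Y)]    # same as mod-2 addition
```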
9.5.2 Matrix-Vector Multiplication
This involves the computation of XY = Z, where X is a matrix, Y and Z
are vectors, and Z is what we are trying to compute. We shall illustrate our
approach with an example. Consider the multiplication of a 5 × 6 matrix X
by a vector Y to obtain a resulting vector Z. In this case, X is known from
the preprocessing step and is sparse. It would be inefficient to store this matrix
directly in a memory, since most of the locations will be zeroes. Instead, the
Table 9.3: Matrix X stored in memory. The location of the edges of each row and an extra bit indicating the end of a row are stored.

address   0  1  2  3  4  5  6  7  8
data      3  5  1  2  4  6  0  3  4
end row   0  1  1  0  0  1  1  0  1
location of the edges (ones) of each row is stored, with an extra bit indicating
the end of a row. For example, if
X = \begin{bmatrix}
0 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0
\end{bmatrix} \quad (9.6)
it would be stored in memory as shown in Table 9.3. Memory address 6 is a special case: a data value of 0 with the end-of-row bit set indicates that the fourth row of matrix X has no edges.
The locations of the edges of a row in X are used as bit selectors for the vector
Y . This bit selecting process has the same effect as performing AND operations
with the bits of a row in X and the bits in vector Y . XOR is performed on the
selected bits to calculate the resulting bits for Z. This operation is performed
for each row of X starting from the first one. Figure 9.7 shows our matrix-vector
multiplication circuit. The Z index calculator calculates the location of the Z vector to be written. The index is simply incremented every time there is an end
of a row. It can be seen that the number of clock cycles required to compute Z
is directly proportional to the number of edges in X.
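A software model of this storage scheme and circuit (our sketch; the function names are assumed) reproduces the memory contents of Table 9.3 from the matrix X of (9.6), then multiplies by bit-selecting Y and XOR-accumulating:

```python
import numpy as np

# Matrix X from Eq. (9.6); stored as edge positions + end-of-row bits (Table 9.3).
X = np.array([[0, 0, 1, 0, 1, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0]])

def to_edge_memory(X):
    """Encode X as 1-based edge positions; data 0 marks an empty row."""
    data, end_row = [], []
    for row in X:
        edges = list(np.flatnonzero(row) + 1) or [0]
        for k, e in enumerate(edges):
            data.append(int(e))
            end_row.append(int(k == len(edges) - 1))
    return data, end_row

def mvm(data, end_row, Y, n_rows):
    """Model of the MVM circuit: bit-select Y by edge position, XOR-accumulate."""
    Z, acc, r = np.zeros(n_rows, dtype=int), 0, 0
    for d, e in zip(data, end_row):
        if d:
            acc ^= Y[d - 1]
        if e:                                      # end of row: write Z, next row
            Z[r], acc, r = acc, 0, r + 1
    return Z

data, end_row = to_edge_memory(X)
Y = np.array([1, 0, 1, 1, 0, 1])
assert data == [3, 5, 1, 2, 4, 6, 0, 3, 4]         # matches Table 9.3
assert end_row == [0, 1, 1, 0, 0, 1, 1, 0, 1]
assert np.array_equal(mvm(data, end_row, Y, 5), X @ Y % 2)
```

The loop visits one stored edge per iteration, so its running time, like the circuit's cycle count, is proportional to the number of edges in X.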
Figure 9.7: Circuit for matrix-vector multiplication (MVM).
9.5.3 Forward-Substitution
Consider the equation XZ = Y , where X is a lower-triangular matrix, Y and Z
are vectors and Z is the vector we want to compute. X is given by
X = \begin{bmatrix}
1 & 0 & \cdots & \cdots & \cdots & 0 \\
x_{(2,1)} & 1 & 0 & \cdots & \cdots & 0 \\
x_{(3,1)} & x_{(3,2)} & 1 & 0 & \cdots & 0 \\
x_{(4,1)} & x_{(4,2)} & x_{(4,3)} & 1 & \ddots & \vdots \\
\vdots & \vdots & & & \ddots & 0 \\
x_{(n,1)} & x_{(n,2)} & \cdots & \cdots & x_{(n,n-1)} & 1
\end{bmatrix}
One way to approach this problem is to take the inverse of X and compute
Z = X−1Y . However, matrix inversion is a complex procedure and requires a
significant amount of processing time. Moreover, after inversion, X will be no
longer sparse. A better way is to use forward-substitution exploiting the fact that
X is lower triangular. The elements of the vector Z can be computed with the
following set of equations:

z_1 = y_1
z_2 = y_2 ⊕ x_{(2,1)}z_1
z_3 = y_3 ⊕ x_{(3,1)}z_1 ⊕ x_{(3,2)}z_2
...
z_n = y_n ⊕ x_{(n,1)}z_1 ⊕ x_{(n,2)}z_2 ⊕ · · · ⊕ x_{(n,n−1)}z_{n−1}

This can be generalized as:

z_i = y_i ⊕ \bigoplus_{j=1}^{i-1} x_{(i,j)}z_j,  1 ≤ i ≤ n \quad (9.7)
Just like the matrix-vector multiplication, to compute an element in Z, we need
elements from X and Y. However, we also require the elements of Z that have already been computed. Therefore, the circuit for forward-substitution is
similar to the one in Figure 9.7 with slight modifications as shown in Figure 9.8.
The index calculator computes the memory location of Y to be read and Z to be
written. From (9.7), these two addresses are identical. As in the matrix-vector
multiplication case, the index calculator is incremented every time there is an end
of a row and the clock cycles for the computation is proportional to the number
of edges in X.
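The FS circuit can be modelled like the MVM one, except that the edge selector reads back previously computed bits of Z. The sketch below is ours (the edge-list convention for the sub-diagonal ones of X is an assumption for illustration):

```python
def forward_substitution(data, end_row, Y):
    """Solve X Z = Y over GF(2); X is unit lower triangular, stored as an edge
    list of its sub-diagonal ones (data 0 marks a row with none)."""
    Z, acc, i = [], 0, 0
    for d, e in zip(data, end_row):
        if d:
            acc ^= Z[d - 1]            # select an already-computed bit of Z
        if e:                          # end of row: z_i = y_i XOR acc, Eq. (9.7)
            Z.append(Y[i] ^ acc)
            acc, i = 0, i + 1
    return Z

# X = [[1,0,0],[1,1,0],[0,1,1]]: sub-diagonal edges x_(2,1) and x_(3,2).
data, end_row = [0, 1, 2], [1, 1, 1]   # data 0 marks the empty first row
assert forward_substitution(data, end_row, [1, 1, 0]) == [1, 0, 0]
```

As in the MVM case, one stored edge is consumed per step, so the cycle count is proportional to the number of edges in X.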
9.6 Implementation and Results
The preprocessor has been implemented using MATLAB. Preprocessing times
for H matrices with rate 1/2 for various block lengths on a Pentium 4 2.4GHz
PC are shown in Table 9.4. A MATLAB tool we have developed that constructs high-performance irregular LDPC codes with low error floors [174] is used to generate the H matrices.
Figure 9.8: Circuit for forward-substitution (FS).
We can see that the preprocessing times for large block lengths can be long.
However, we are not too concerned about this since preprocessing needs to be
performed only once for any given H matrix. We also observe that the gap
remains small even for large block lengths. The primary reason for these small
gaps is the large number of rows in HT whose degrees are less than three in their
degree distributions. Low degree rows in HT lead to high probabilities of finding
degree one rows in the diagonal extension step of the greedy algorithm.
A scatter plot of a preprocessed irregular 500 × 1000 H matrix (i.e. block
length of 1000 bits and rate 1/2) is shown in Figure 9.9. The diagonal ones
of the matrix T can be clearly seen. Also, as expected since the gap is small (g = 2 in this case), the preprocessed H matrix consists mainly of A and T. The
blocky artifacts next to the diagonal of T are created by the diagonal extension
step of the greedy algorithm, during which an identity matrix is formed in every
Table 9.4: Preprocessing times and gaps for H matrices with rate 1/2 for various block lengths, performed on a Pentium 4 2.4GHz PC equipped with 512MB DDR-SDRAM.

block length   preprocessing time [s]   gap
500                               3       2
1000                             14       2
2000                             83       2
4000                            587       2
8000                           3124       2
iteration. In Table 9.5, we show the number of edges for the six matrices for a
preprocessed 1000 × 2000 irregular H matrix. We observe that the matrices A,
B and T contain most of the edges, indicating that operations involving them
will dominate the encoding times.
The actual hardware encoder has been implemented using Xilinx System Gen-
erator and is heavily pipelined for maximum throughput. The codewords gener-
ated from our hardware encoder have been verified against our MATLAB model
for correctness. The four stage architecture design in Xilinx System Generator is depicted in Figure 9.10. Stage 2 and the stage controller are shown in detail in
Figure 9.11. The MVM and FS circuits are shown in Figure 9.12 and Figure 9.13.
Let e(A) denote the number of edges for the matrix A, and c(S1) denote the
number of clock cycles taken by Stage 1 (see Figure 9.5). The number of clock
Figure 9.9: Scatter plot of a preprocessed irregular 500 × 1000 H matrix in ALT form with a gap of two. Ones appear as dots.
cycles taken by each stage is given by
c(S1) = n − m
c(S2) = max(e(A), e(C))
c(S3) = e(T) + e(E) + (n − m) + e(F) + e(B) + (m − g)
c(S4) = e(T) + 2((n − m) + g + (m − g)).
The number of clock cycles per codeword (CPC) is determined by the stage that
takes the longest, i.e.
CPC = max[c(S1), c(S2), c(S3), c(S4)].
For a given clock speed, the number of codewords per second (CPS) is given by
CPS = clock speed /CPC.
Therefore the codeword throughput (bits per second) of the encoder is
codeword bits throughput = CPS× block size
Figure 9.10: The four stage LDPC encoder architecture in Xilinx System Generator. Each stage contains multiple subsystems performing MVM, FS, VA or CWG.
Figure 9.11: LDPC encoder architecture Stage 2 and stage controller in Xilinx
System Generator.
Figure 9.12: The matrix-vector multiplication (MVM) circuit in Xilinx System Generator.
Figure 9.13: The forward-substitution (FS) circuit in Xilinx System Generator.
Table 9.5: Dimensions and number of edges for the matrices A, B, T, C, F and E generated from a 1000 × 2000 irregular H matrix.

matrix   dimension    edges
A        998 × 1000    6273
B        998 × 2        998
T        998 × 998     2398
C        2 × 1000        10
F        2 × 2            2
E        2 × 998          6
and the information throughput is given by
information bits throughput = codeword throughput× rate.
The latency of the encoder is the time taken for the four stages to fill up. This
is given by
latency = (4× CPC) / clock speed.
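Plugging the figures reported in this section for the rate 1/2, block length 2000 design (143MHz clock, CPC = 6398) into these formulas reproduces the quoted throughput and latency; the variable names below are ours:

```python
# Performance model for the rate 1/2, n = 2000 encoder (143 MHz, CPC = 6398).
clock_hz, cpc, n, rate = 143e6, 6398, 2000, 0.5
cps = clock_hz / cpc                         # codewords per second
codeword_mbps = cps * n / 1e6                # ~44.7 Mbps codeword throughput
info_mbps = codeword_mbps * rate             # ~22.4 Mbps information throughput
latency_ms = 4 * cpc / clock_hz * 1e3        # four stages to fill -> ~0.179 ms
assert round(codeword_mbps, 1) == 44.7 and round(latency_ms, 3) == 0.179
```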
An encoder for block length of 2000 bits and rate 1/2 has been synthesized on a
Xilinx Virtex-II XC2V4000-6 device. The design takes up 870 slices and 19 block
RAMs, which uses approximately 4% of the device. The clock cycles taken by
each of the four stages of this design are
c(S1) = 1000
c(S2) = 6273
c(S3) = 5402
c(S4) = 6398.
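These counts can be cross-checked against the edge counts of Table 9.5 (n = 2000, m = 1000, g = 2), reading the dense-matrix term in c(S3) as e(F) = 2; the sketch below is our own verification:

```python
# Stage cycle counts from the c(S*) formulas and the Table 9.5 edge counts.
n, m, g = 2000, 1000, 2
e = {'A': 6273, 'B': 998, 'T': 2398, 'C': 10, 'F': 2, 'E': 6}
c1 = n - m
c2 = max(e['A'], e['C'])
c3 = e['T'] + e['E'] + (n - m) + e['F'] + e['B'] + (m - g)
c4 = e['T'] + 2 * ((n - m) + g + (m - g))
assert (c1, c2, c3, c4) == (1000, 6273, 5402, 6398)
```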
We observe that the workloads across the stages are well balanced, which is the
case with our architecture for all rate 1/2 codes. The design is capable of running
at 143MHz with a CPC of 6398 cycles, resulting in a codeword throughput of 45Mbps (million bits per second) and a latency of 0.179ms. This throughput is sufficient to cover most applications, including wireless
networking and optical-link deep-space communications. Implementation results
for various encoders with block lengths ranging from 500 to 8000 bits for rate
1/2 codes are shown in Table 9.6. We see an increase in resources and latency
with block length due to the increase in size of the H matrix. This increase in
resources leads to reductions in clock speed and throughput due to routing delays.
Distributed RAMs and block RAMs have been allocated carefully to minimize the waste of the 18Kb Virtex-II block RAMs. In Table 9.7, we show
how the performance varies with different rates with a fixed block length of 2000
bits. The encoder is optimized for rate 1/2 codes (by the partitioning of the operations shown in Figure 9.5); therefore we see some performance loss for other rates.
Multiple instances of the encoder can be implemented on the same device to
encode multiple message blocks in parallel. Note that RAMs for the six matrices
describing the preprocessed H matrix can be shared among the encoders. This is
because the six matrices index the operands and are read sequentially. Synthesis results are shown in Table 9.8 for multiple instances
of an encoder with block length of 2000 bits and rate 1/2. The design with 16
instances consumes 73% of the device and is capable of a codeword throughput
of 410Mbps. Figure 9.14 shows how the number of encoder instances affects the
codeword throughput. The dotted line shows the linear relationship between the
output rate and the number of instances, if the clock speed does not deteriorate
with the increasing number of instances. While ideally the throughput would
scale linearly with the number of encoder instances, in practice the output rate
Table 9.6: Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for rate 1/2 for various block lengths.

block length   edges   slices   block RAMs   speed [MHz]   throughput [Mbps]   latency [ms]
500             2418      562           12           161                  50          0.040
1000            4859      682           13           152                  48          0.084
2000            9687      870           19           143                  45          0.179
4000           19452     1340           27           127                  40          0.405
8000           38905     2148           49           110                  34          0.937
grows slower than expected, because the clock speed of the design deteriorates as the number of encoder instances increases. This deterioration is probably due
to the increase in routing delays. Note that multiple FPGAs could be used to
speed up the encoding even further. For instance, an implementation of three
Xilinx Virtex-II XC2V4000-6 devices would be capable of a codeword throughput
of 1.2Gbps for block length of 2000 bits and rate 1/2 codes.
Our hardware implementation of the encoder for block length of 2000 bits and rate 1/2 has been compared to software implementations. The software
implementations are written in C and compiled with Microsoft Visual C++ 6.0.
The results are shown in Table 9.9. It can be seen that our hardware designs are
faster than software implementations by 10–300 times, depending on the device
used and the resource utilization.
Regarding the feasibility of an adaptive LDPC encoder, the XC2V4000-6
FPGA has 15 million configuration bits [187]. The configuration bits can be
Table 9.7: Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for block length of 2000 bits for various rates.

rate   edges   slices   block RAMs   speed [MHz]   throughput [Mbps]   latency [ms]
1/3     8896     1109           19           127                  34          0.232
1/2     9687      870           19           143                  44          0.179
2/3     9513     1065           18           125                  33          0.235
fed to the device with eight bits in parallel at 50MHz, which is 400Mbps. So the
entire device can be configured in around 35ms (smaller devices would take less
time). If an adaptive LDPC encoder reconfigures itself every few seconds or tens
of seconds, the overhead of the reconfiguration time would still be acceptable if
the adapted encoder improves throughput and minimizes retransmission time.
9.7 Summary
We have described a hardware design of an efficient LDPC encoder based on
the RU method. Whereas a straightforward implementation of an encoder has
complexity quadratic in the block length, the RU method admits linear time
encoding through careful linear manipulation of the parity matrix for both regular
and irregular LDPC codes.
A preprocessor is written to optimize the parity-check matrix through the row
and column permutations, generating the look-up tables and parameters needed
by the hardware encoder. An efficient architecture for storing and performing
Table 9.8: Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for block length of 2000 bits and rate 1/2 for different numbers of encoder instances.

instances   slices   block RAMs   speed [MHz]   throughput [Mbps]   latency [ms]
1              870           19           143                  44          0.179
4             3547           36            90                 112          0.284
8             6978           60            89                 222          0.288
12           12702           83            86                 322          0.298
16           16906          107            82                 410          0.312
computations on sparse matrices has been discussed. The encoding steps have
been scheduled into different stages optimizing concurrency while reducing idle
times. Run-time reconfiguration of FPGAs can be used to load different designs
optimized for various rates at run-time for an adaptive LDPC encoder.
Implementation results for encoders of various block lengths and rates have
been presented. An encoder for block length of 2000 bits and rate 1/2 takes
up 4% of resources on a Xilinx Virtex-II XC2V4000-6 device. It is capable of
running at 143MHz resulting in a codeword throughput of 45Mbps and latency
of 0.179ms. The performance can be improved by mapping several instances of
the encoder onto the same chip to encode multiple message blocks concurrently.
An implementation of 16 instances of the encoder on the same device at 82MHz
is capable of 410 million codeword bits per second, 80 times faster than an Intel
Pentium 4 2.4GHz PC. The LDPC encoder architecture we have proposed in this
Figure 9.14: Variation of codeword throughput (Mbps) with the number of encoder instances.
chapter has been chosen by JPL as a candidate for their future space missions.
Table 9.9: Performance comparison of block length of 2000 bits and rate 1/2 encoders: time for producing 410 million codeword bits.

platform                                    speed [MHz]   time [s]
XC2V4000-6 FPGA, 16 encoder instances                82          1
XC2V4000-6 FPGA, 1 encoder instance                 143          9
Intel Pentium 4 PC, 512MB DDR-SDRAM                2400         80
Intel Pentium-III PC, 256MB SDR-SDRAM               700        312
CHAPTER 10
Conclusions
10.1 Summary
Three main topics have been presented in this thesis: function evaluation, Gaus-
sian noise generation and LDPC encoding.
In Chapter 3 [95], we have presented a methodology for the automation of
function evaluation unit design, covering table look-up, table-with-polynomial
and polynomial-only methods. An implementation of a partially automated sys-
tem for design space exploration of function evaluation in hardware has been
demonstrated, including algorithmic design space exploration with MATLAB and
hardware design space exploration with ASC, A Stream Compiler, for FPGAs.
Method selection results for sin(x), log(1 + x) and 2^x have been shown. We have
concluded that the automation of function evaluation unit design is within reach,
even though there are many remaining issues for further study.
In Chapter 4 [83], [84], a framework for adaptive range reduction has been presented, based on a parametric function evaluation library, on function approximation by polynomials and tables, and on pre-computing all possible input/output ranges. We have demonstrated an implementation of design space exploration
for adaptive range reduction, using MATLAB for producing function evalua-
tion parameters for hardware designs targeting the ASC system. The proposed
approach has been evaluated by exploring various effects of range reduction of several arithmetic functions such as sin(x), log(x) and √x on throughput,
latency and area for FPGA designs. For a given function, its input/output
range/precision, and an optimization metric, we automate the decision about
whether range reduction helps to optimize the metric by pre-computing a large
library of function evaluation generators. Given the evaluation method, we auto-
mate the decision about which bitwidths and number of polynomial terms to use
by constructing the function evaluation generators via MATLAB simulation and
computation. In addition, we show the productivity which we obtain from com-
bining MATLAB with ASC, exploring over 40 million Xilinx equivalent circuit
gates in a relatively short amount of time.
In Chapter 5 [88], [90], [91], we have presented a novel method for evaluating
functions using piecewise polynomial approximations with an efficient hierarchical
segmentation scheme. Our method is illustrated using four non-linear compound
functions: √(−log(x)), x log(x), a high-order rational function, and cos(πx/2). An
algorithm that finds the optimum segments for a given function, input range,
maximum error and ulp (unit in the last place) has been presented. The four
hierarchical schemes P2S(US), P2SL(US), P2SR(US) and US(US) deal with the
frequently occurring non-linearities of functions. A simple cascade of AND and
OR gates can be used to rapidly calculate the P2S address for a given input.
Results show the advantages of using our hierarchical approach over the tradi-
tional uniform approach. We have also explored the effects of different polynomial
degrees on our hierarchical segmentation method. Compared to other popular
methods, our approach has longer latency and more operators, but the size of
the look-up tables and thus the total area are considerably smaller.
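The intuition behind the P2S addressing can be sketched in software: when segment sizes vary by powers of two, the segment address is essentially the position of the input's leading one, which the cascade of AND and OR gates computes in hardware. The layout below (segments shrinking towards zero, where functions such as √(−log(x)) vary fastest) is a simplified behavioural model, not the thesis's exact segmentation.

```python
def p2s_address(x, bits):
    """Segment address for an unsigned 'bits'-bit input under an
    illustrative powers-of-two segmentation: segment k covers
    [2**(bits-1-k), 2**(bits-k)), with x == 0 mapped to the last segment.
    In hardware the same index falls out of a leading-one detector."""
    for k in range(bits):
        if x & (1 << (bits - 1 - k)):
            return k
    return bits  # x == 0: innermost (smallest) segment
```

Each address then selects the polynomial coefficients for that segment from a look-up table.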
In Chapters 6 and 7, we have presented two hardware Gaussian noise gener-
ators designed to facilitate Monte Carlo simulations implemented in hardware,
which involve very large numbers of samples. The first design [86], [89] is based
on the Box-Muller method and the central limit theorem. This approach involves
the computation of two functions: √(−ln(x)) and cos(2πx). A key aspect of the
design is the use of non-uniform piecewise linear approximations [87] for comput-
ing trigonometric and logarithmic functions, with the boundaries between each
approximation chosen carefully to enable rapid computation of coefficients from
the inputs. The noise generator design occupies approximately 10% of a Xilinx
Virtex-II XC2V4000-6 FPGA and 90% of a Xilinx Spartan-IIE XC2S300E-7, and
can produce 133 million samples per second. The performance can be improved
by exploiting parallelism: an XC2V4000-6 FPGA with nine parallel instances of
the noise generator at 105MHz can run 50 times faster than a 2.6GHz Pentium
4 PC. This noise generator is currently being used for exploring LDPC code be-
havior at UCLA and JPL (Jet Propulsion Laboratory, NASA), and Monte Carlo
simulations of financial models at the Chinese University of Hong Kong.
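A software model of the Box-Muller step underlying this design is given below, using the standard Box-Muller constants (including the factor of 2 in the radical). In the hardware, the square root/logarithm and trigonometric functions are replaced by the non-uniform piecewise linear approximations described above; here they are exact library calls.

```python
import math

def box_muller_pair(u1, u2):
    """One Box-Muller step: uniforms u1 in (0, 1], u2 in [0, 1) become two
    independent N(0, 1) samples. Software sketch of the transform the
    hardware generator approximates."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)
```

In the hardware design, samples from several such transforms are further accumulated, invoking the central limit theorem to smooth residual approximation error.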
The second noise generator [82], [94] is based on the Wallace method, a
fast algorithm for generating normally distributed pseudo-random numbers which
generates the target distribution directly using its maximal-entropy properties.
The Wallace method takes a pool of normally distributed random numbers and,
through transformation steps, generates a new pool of normally distributed
random numbers. The noise generator design occupies
approximately 3% of a Xilinx Virtex-II XC2V4000-6 FPGA and half of a Xilinx
Spartan-3 XC3S200E-5, and can produce 155 million samples per second. An
XC2V4000-6 FPGA with 16 parallel instances of the noise generator at 115MHz
can run 98 times faster than a 2.6GHz Pentium 4 PC. The two noise generators are
used as a key component in hardware simulation systems, including the exploration
of LDPC code behavior at very low BERs in the range of 10^−9 to 10^−10, and
financial modeling [14], [192]. For both noise generators, statistical tests,
including the χ2 test and the A-D test, as well as application in LDPC decoding,
have been used to confirm the quality of the noise samples. The output of the
noise generators accurately models a true Gaussian PDF even at very high σ values.
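The core of a Wallace transformation pass can be sketched as follows. This simplified model applies an orthogonal, Hadamard-like transform (scaled by 1/2) to randomly addressed groups of four pool values; orthogonality maps normally distributed inputs to normally distributed outputs. The full method's sum-of-squares correction is omitted here, so this is an illustrative sketch rather than the thesis's hardware algorithm.

```python
import random

def wallace_step(pool, rng):
    """One simplified Wallace-style pass over a pool whose length is a
    multiple of 4. Each group of four values is mapped through the 4x4
    Hadamard matrix scaled by 1/2, which is orthogonal and therefore
    preserves the Gaussian distribution of the pool."""
    n = len(pool)
    idx = list(range(n))
    rng.shuffle(idx)  # random addressing, standing in for the pool permutation
    out = [0.0] * n
    for g in range(0, n, 4):
        a, b, c, d = (pool[idx[g + j]] for j in range(4))
        out[g] = 0.5 * (a + b + c + d)
        out[g + 1] = 0.5 * (a - b + c - d)
        out[g + 2] = 0.5 * (a + b - c - d)
        out[g + 3] = 0.5 * (a - b - c + d)
    return out
```

Because the transform is orthogonal, the pool's sum of squares is preserved exactly from pass to pass, which is what makes fixed normalization possible in hardware.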
In Chapter 8 [92], we have explored the impact of parameter choice on noise
quality of the Wallace method. Using tests designed specifically to identify the
presence of correlations due to the use of previous outputs in generating new
outputs, we have identified specific combinations of pool size, transform size,
and retention factor that deliver high quality noise output at high speeds (one
example is pool size = 4096, transform size = 16, and retention factor =
1). Detailed performance tradeoff studies have been conducted for AMD Athlon
XP and Intel Pentium 4 based platforms. Performance comparisons with other
software Gaussian random number generators have been carried out, demonstrat-
ing that given a careful choice of parameters, the Wallace method is a serious
competitor due to its speed advantages.
In Chapter 9 [93], we have described a hardware design of an efficient LDPC
encoder based on the RU method. A preprocessor optimizes the
parity-check matrix through row and column permutations, generating the
look-up tables and parameters needed by the hardware encoder. An efficient
architecture for storing and performing computations on sparse matrices has been
discussed. Implementation results for encoders of various block lengths and rates
have been presented. An encoder for a block length of 2000 bits and rate 1/2 takes
up 4% of the resources on a Xilinx Virtex-II XC2V4000-6 device. It is capable of
running at 143MHz, resulting in a codeword throughput of 45Mbps and a latency
of 0.179ms. The performance can be improved by mapping several instances of
the encoder onto the same chip to encode multiple message blocks concurrently.
An implementation of 16 instances of the encoder on the same device at 82MHz
is capable of 410 million codeword bits per second, 80 times faster than an Intel
Pentium 4 2.4GHz PC. Due to the increasing demand for high-speed deep space
communications, our LDPC encoder architecture has been chosen by JPL as a
candidate for NASA’s future space missions.
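The central computation in such an encoder is arithmetic on sparse matrices over GF(2). The sketch below shows this in software, storing each row only as the column indices of its 1-entries; the storage scheme is generic and illustrative, not necessarily the thesis's exact hardware layout.

```python
def gf2_spmv(rows, x):
    """Sparse GF(2) matrix-vector product. Each row is a list of column
    indices of its non-zero entries; a dot product over GF(2) is simply
    the XOR of the selected bits of x."""
    y = []
    for cols in rows:
        bit = 0
        for c in cols:
            bit ^= x[c]  # addition over GF(2) is XOR
        y.append(bit)
    return y
```

In hardware, each XOR chain collapses into a tree of XOR gates, and only the index lists, not the full matrix, need to be stored.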
10.2 Future Work
10.2.1 Function Evaluation
For the evaluation of elementary functions, we want to implement other elementary
functions and explore other evaluation methods such as rational approximation
and symmetric table addition methods. We also hope to utilize embedded RAMs
and multipliers available in modern FPGAs. Our designs will be optimized fur-
ther by employing non-uniform bitwidth minimization techniques such as Bit-
Size [47]. The final objective is to progress towards a fully automated library
that provides optimal function evaluation hardware units given input/output
range and precision.
One of the major problems we face with this objective is the fact that
we cannot verify a given approximation for all possible inputs. For instance,
if the input is 24 bits, the output errors for all 2^24 possible inputs need to
be computed to ensure correctness for every output. This can take days even
on the fastest PCs available today. Because of this performance bottleneck,
the present implementation takes a set of random samples from the input
domain. Hence ideally, we need to move towards a framework where the library
construction itself (e.g. calculating coefficients and minimizing bitwidths, see
Figure 4.1 in Chapter 4) is done in hardware. This would enable us to create a fully
automated/accurate library of all the elementary functions and approximation
methods of interest, and test for all possible input values. This would involve the
generation of a comprehensive matrix of precision/range for various combinations
of metrics based on the structure shown in Figures 4.10 and 4.11.
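The exhaustive verification loop in question is simple to state in software; the difficulty is purely its running time at 24-bit inputs. A sketch follows, with illustrative function and parameter names.

```python
import math

def max_ulp_error(approx, reference, in_bits, frac_bits):
    """Exhaustively evaluate an approximation against a reference over
    every 'in_bits'-bit fixed-point input in [0, 1), returning the worst
    output error in ulps, where one ulp is 2**-frac_bits. For in_bits = 24
    this is the loop the thesis would rather run in hardware."""
    ulp = 2.0 ** -frac_bits
    worst = 0.0
    for i in range(1 << in_bits):
        x = i * 2.0 ** -in_bits
        err = abs(approx(x) - reference(x)) / ulp
        worst = max(worst, err)
    return worst
```

A faithfully rounded unit must report a worst-case error of at most 1 ulp over this loop; random sampling can only bound the error probabilistically.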
In the work presented in Chapters 3 and 4, when a design is optimized for a
given metric (area, latency or throughput), ASC gives the best possible result
for that metric. But in many situations, the
user may want to specify a combination of metrics. For instance, when designing
a modulator for a mobile phone, the designer may want to set a constraint on
the maximum latency that can be tolerated, while meeting a certain throughput
and area requirement. Moreover, power consumption is a major factor in many
modern mobile devices, hence adding power optimization to ASC would also be
useful. We are also planning to explore the impact on power consumption [184]
across different bitwidths, methods and functions.
There are various extensions we want to make for the hierarchical segmen-
tation method (HSM) presented in Chapter 5. Many functions such as belief
propagation in LDPC decoding [74] involve two input variables [139], hence we
want to extend HSM to cover multivariate functions. The current implemen-
tation of HSM employs fixed-point arithmetic. However, it would be desirable
to support floating-point as well to address operations that have large dynamic
ranges. The bitwidths of various operations in the data paths have been minimized
by hand; however, this process is very time consuming and perhaps far
from optimal. Bitwidth minimization techniques such as those presented in [29]
and [47] are highly desirable. Also, we hope to explore how HSM can be used
to speed up addition and subtraction functions in logarithmic number systems
(LNS) [26] which are highly non-linear.
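LNS addition illustrates the kind of non-linearity HSM would need to approximate: for positive operands stored as base-2 logarithms, log2(a + b) reduces to the larger logarithm plus the highly non-linear function F(r) = log2(1 + 2^r) of the (non-positive) difference r. The sketch below uses exact library calls where a hardware LNS unit would use an approximation of F.

```python
import math

def lns_add(la, lb):
    """Addition in a base-2 logarithmic number system: given la = log2(a)
    and lb = log2(b) with a, b > 0, return log2(a + b). The term
    log2(1 + 2**r), r <= 0, is the non-linear function that would be
    approximated in hardware (software sketch only)."""
    hi, lo = max(la, lb), min(la, lb)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))
```

Subtraction uses the analogous function log2(1 − 2^r), which is even harder to approximate near r = 0.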
For all function evaluation units described in Chapters 3, 4 and 5, Horner’s
rule is used to reduce the number of operations in the polynomial. However,
more sophisticated methods exist which can reduce the number of operations
even further, such as those described by Knuth in [78]. We are planning to
investigate how these methods can be mapped efficiently into hardware. The
function evaluation units perform faithful rounding (accurate to 1 ulp, rounded
to the nearest or next nearest), however certain applications may require exact
rounding (accurate to 0.5 ulp, rounded to the nearest) [161]. We are investigating
how exact rounding can be achieved for our evaluation units, which would involve
using the right bitwidths for the operators in the data paths.
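Horner's rule itself is straightforward: a degree-n polynomial is evaluated with n multiplications and n additions by nesting the terms, which in hardware maps onto a chain of multiply-add stages.

```python
def horner(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x**n as
    (...((cn*x + c(n-1))*x + ...)*x + c0),
    using n multiplies and n adds. coeffs are ordered from the
    constant term upward."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc
```

The more sophisticated schemes mentioned above (e.g. Knuth's adaptation of coefficients) trade preprocessing of the coefficients for fewer run-time multiplications, at the cost of a less regular datapath.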
10.2.2 Gaussian Noise Generation
In Chapter 8, we have identified a set of design parameters for the Wallace method
to reduce correlations. We are planning to modify the Wallace hardware
architecture presented in Chapter 7 with this new set of parameters. This would
mainly involve additional addition/subtraction and memory requirements (discussed
in Section 8.6 in Chapter 8).
The statistical tests for the noise generators, including the χ2 test and the
A-D test, have been carried out in software using a hardware emulation model in
C. Hence, we are only able to test up to around 10^10 noise samples due to lack
of computational power. Ideally, these tests ought to be performed in hardware,
which would enable us to verify the noise samples for even larger numbers of
samples.
Recently, we have come across the inversion method [65], which uses the
inverse Gaussian CDF and uniform random samples to pick points on the CDF.
This approach requires the approximation of the inverse Gaussian CDF, which
is highly non-linear, but could be dealt with by a floating-point implementation of
HSM. This method has the advantage of having to approximate just one function
to generate a Gaussian random variate, and we are looking at implementing it
with the aid of a floating-point implementation of HSM.
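In software the inversion method is a single function application; the hardware challenge lies entirely in approximating the inverse CDF. The sketch below uses Python's statistics module as a reference inverse CDF in place of the HSM approximation.

```python
from statistics import NormalDist

def gaussian_by_inversion(u):
    """Inversion method: map a uniform sample u in (0, 1) through the
    inverse Gaussian CDF. NormalDist.inv_cdf stands in for the
    floating-point HSM approximation a hardware unit would use."""
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(u)
```

Because a single monotonic function is approximated, the tail accuracy of the generated samples is limited only by the accuracy of the approximation near u = 0 and u = 1, where the inverse CDF is steepest.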
Finally, we want to further refine our noise generator architectures for
various applications, for instance those involving different channels, such as
magnetic disc channels [22] and other communication channels [148] including
Rayleigh [36], Ricean and Nakagami-m [191], all of which are based on Gaussian
noise.
10.2.3 LDPC Coding
The four-stage architecture of the LDPC encoder presented in Chapter 9 is cur-
rently optimized for rate 1/2 codes. It would be desirable to develop a set of
architectures optimized for different code rates, which would result in maximum
throughput and minimum latency for the given rate.
The current LDPC decoder implementation [74] developed by our colleagues
at UCLA is still at a preliminary stage and has a throughput of several hundred
kilobits per second due to its serial nature. We have now identified interesting
decoder architectures that would lead to a more parallel and scalable design. We
hope to implement this new, improved design in the near future, which should
reach a throughput of several tens of megabits per second.
Finally, using our current LDPC encoder/decoder architecture, we want to
implement an adaptive LDPC codec. This would involve supporting different
H matrices at run-time and adaptively choosing the appropriate H matrix de-
pending on the channel conditions, such as the SNR. Adaptive architectures for
Viterbi [169] and Turbo codes [103] have been proposed in the literature, but not for
LDPC codes.
References
[1] Advanced Micro Devices Inc. AMD Athlon processor technical brief, 1999. Document number 22054.
[2] J.H. Ahrens and U. Dieter. An alias method for sampling from the normal distribution. Computing, 42(2-3):159–170, 1989.
[3] R. Andraka. A survey of CORDIC algorithms for FPGA based computers. In Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 191–200, 1998.
[4] D.R. Barr and N.L. Sezak. A comparison of multivariate normal generators. Communications of the ACM, 15(12):1048–1049, 1972.
[5] N.C. Beaulieu and C.C. Tan. An FFT method for generating bandlimited Gaussian noise variates. In Proceedings of IEEE Global Communications Conference, pages 684–688, 1997.
[6] A.R. Bergstrom. Gaussian estimation of mixed-order continuous-time dynamic models with unobservable stochastic trends from mixed stock and flow data. Econometric Theory, 13(4):467–505, 1997.
[7] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo-codes. In Proceedings of IEEE Conference on Communications, pages 1064–1070, 1993.
[8] V. Bhagavatula, H. Song, and J. Liu. Low-density parity-check (LDPC) codes for optical data storage. In Proceedings of IEEE International Symposium on Optical Memory and Optical Data Storage Topical Meeting, pages 371–373, 2002.
[9] T. Bhatt, K. Narayanan, and N. Kehtarnavaz. Fixed-point DSP implementation of low-density parity check codes. In Proceedings of IEEE DSP Workshop, 2000.
[10] A.J. Blanksby and C.J. Howland. A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder. IEEE Journal of Solid-State Circuits, 37(3):404–412, 2002.
[11] M. Bossert. Channel Coding for Telecommunications. John Wiley & Sons, 1999.
[12] E. Boutillon, J.L. Danger, and A. Gazel. Design of high speed AWGN communication channel emulator. Analog Integrated Circuits and Signal Processing, 34(2):133–142, 2003.
[13] G.E.P. Box and M.E. Muller. A note on the generation of random normal deviates. Annals of Mathematical Statistics, 29:610–611, 1958.
[14] A. Brace, D. Gatarek, and M. Musiela. The market model of interest rate dynamics. Mathematical Finance, 7(2):127–155, 1997.
[15] D.D. Braess. Chebyshev approximation by spline functions with free knots. Numerische Mathematik, 17:357–366, 1971.
[16] R.P. Brent. A fast vectorised implementation of Wallace’s normal random number generator. ANU Computer Science Technical Report TR-CS-97-07, The Australian National University, 1997.
[17] R.P. Brent. Some comments on C.S. Wallace’s random number generators. The Computer Journal, 2003. To appear.
[18] A. Cantoni. Optimal curve fitting with piecewise linear functions. IEEE Transactions on Computers, C-20(1):59–67, 1971.
[19] J. Cao, B.W.Y. Wei, and J. Cheng. High-performance architectures for elementary function generation. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 136–144, 2001.
[20] J. Cavallaro and M. Vaya. VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 497–500, 2003.
[21] Celoxica Limited. Handel-C language reference manual v3.1, 2002. http://www.celoxica.com.
[22] J. Chen, J. Moon, and K. Bazargan. Reconfigurable readback-signal generator based on a field-programmable gate array. IEEE Transactions on Magnetics, 40(3):1744–1750, 2004.
[23] P.L. Chu. Fast Gaussian noise generator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(10):1593–1597, 1989.
[24] P.P. Chu and R.E. Jones. Design techniques of FPGA based random number generator. In Proceedings of Military and Aerospace Applications of Programmable Devices and Technology Conference, 1999.
[25] W.J. Cody and W. Waite. Software Manual for the Elementary Functions. Prentice Hall, 1980.
[26] J.N. Coleman, E. Chester, C.I. Softley, and J. Kadlec. Arithmetic on the European logarithmic microprocessor. IEEE Transactions on Computers, 49(7):702–715, 2000.
[27] M. Combet, H. Van Zonneveld, and L. Verbeek. Computation of the base two logarithm of binary numbers. IEEE Transactions on Electronic Computers, EC-14(6):863–867, 1965.
[28] K. Compton and S. Hauck. Reconfigurable computing: a survey of systems and software. ACM Computing Surveys, 34(2):171–210, 2002.
[29] G.A. Constantinides, P.Y.K. Cheung, and W. Luk. Wordlength optimization for linear digital signal processing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(10):1432–1442, 2003.
[30] D.J. Costello, J. Hagenauer, H. Imai, and S.B. Wicker. Applications of error control and coding. IEEE Transactions on Information Theory, 44(6):2531–2560, 1998.
[31] C. Cousineau, F. Laperle, and Y. Savaria. Design of a JTAG based run time reconfigurable system. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 21–23, 1999.
[32] R.B. D’Agostino and M.A. Stephens. Goodness-of-Fit Techniques. Marcel Dekker Inc., 1986.
[33] J.L. Danger, A. Ghazel, E. Boutillon, and H. Laamari. Efficient FPGA implementation of Gaussian noise generator for communication channel emulation. In Proceedings of IEEE International Conference on Electronics, Circuits, and Systems, volume 1, pages 366–369, 2000.
[34] F. de Dinechin and A. Tisserand. Some improvements on multipartite table methods. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 128–135, 2001.
[35] D. Defour, P. Kornerup, J. Muller, and N. Revol. A new range reduction algorithm. In Proceedings of Asilomar Conference on Circuits, Systems, and Computers, volume 2, pages 1656–1660, 2001.
[36] D. Derrien and E. Boutillon. Quality measurement of a colored Gaussian noise generator hardware implementation based on statistical properties. In Proceedings of IEEE International Symposium on Signal Processing and Information Technology, 2002.
[37] R.O. Duda, D.G. Stork, and P.E. Hart. Pattern Classification and Scene Analysis: Pattern Classification. John Wiley & Sons, 2000.
[38] J. Duprat and J.M. Muller. The CORDIC algorithm: new results for fast VLSI implementation. IEEE Transactions on Computers, 42:168–178, 1993.
[39] J.J. Eggers, J.K. Su, and B. Girod. Robustness of a blind image watermarking scheme. In Proceedings of IEEE International Conference on Image Processing, volume 3, pages 17–20, 2000.
[40] M.D. Ercegovac. A general hardware-oriented method for evaluation of functions and computations in a digital computer. IEEE Transactions on Computers, 26(7):667–680, 1977.
[41] M.D. Ercegovac and T. Lang. Division and Square Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers, 1994.
[42] R.E. Esch and W.L. Eastman. Computational methods for best spline approximation. Journal of Approximation Theory, 2:85–96, 1969.
[43] Y. Fan, Z. Zilic, and M.W. Chiang. A versatile high speed bit error rate testing scheme. In Proceedings of IEEE International Symposium on Quality Electronic Design, pages 395–400, 2004.
[44] FastMath: software faster than a coprocessor. C User’s Journal, 9(7):12, 1991.
[45] Flarion Technologies Inc. Vector-low-density parity-check coding solution data sheet, 2002. http://www.flarion.com.
[46] M.J. Flynn and S.F. Oberman. Advanced Computer Arithmetic Design. John Wiley & Sons, 2001.
[47] A. Abdul Gaffar, O. Mencer, W. Luk, and P.Y.K. Cheung. Unifying bit-width optimisation for fixed-point and floating-point designs. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 79–88, 2004.
[48] R.G. Gallager. Low-density parity-check codes. IEEE Transactions on Information Theory, 8:21–28, 1962.
[49] R.G. Gallager. Low-Density Parity-Check Codes. MIT Press, 1963.
[50] J. Garcia-Frias and W. Zhong. Approaching Shannon performance by iterative decoding of linear codes with low-density generator matrix. IEEE Communications Letters, 7:266–268, 2003.
[51] C.W. Gardiner. Handbook of Stochastic Methods. Springer-Verlag, 1990.
[52] A.V. Geramita and J. Seberry. Orthogonal Designs: Quadratic Forms and Hadamard Matrices. Marcel Dekker Inc., 1979.
[53] A. Ghazel, E. Boutillon, J.L. Danger, G. Gulak, and H. Laamari. Design and performance analysis of a high speed AWGN communication channel emulator. In Proceedings of IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing, volume 2, pages 374–377, 2001.
[54] GNU Project. gcc 3.2 Manual, 2003. http://gcc.gnu.org.
[55] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.
[56] N. Golshan. A novel digital implementation of a Gaussian noise generator. In Proceedings of IEEE Instrumentation and Measurement Technology Conference, pages 256–257, 1989.
[57] B.D. Hart and D.P. Taylor. On the irreducible error floor in fast fading channels. IEEE Transactions on Vehicular Technology, 49(3):1044–1047, 2000.
[58] J.F. Hart. Computer Approximations. John Wiley & Sons, 1968.
[59] J.W. Hauser and C.N. Purdy. Approximating functions for embedded and ASIC applications. In Proceedings of IEEE Midwest Symposium on Circuits and Systems, pages 478–481, 2001.
[60] H. Hemmati. Overview of laser communication research at JPL. In Proceedings of SPIE The Search for Extraterrestrial Intelligence in the Optical Spectrum III, volume 4273, 2001.
[61] H. Henkel. Improved addition for the logarithmic number system. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(2):301–303, 1989.
[62] J.L. Hennessy, D.A. Patterson, and D. Goldberg. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, third edition, 2002.
[63] C.H. Ho, K.H. Tsoi, H.C. Yeung, Y.M. Lam, K.H. Lee, P.H.W. Leong, R. Ludewig, P. Zipf, A.G. Ortiz, and M. Glesner. Arbitrary function approximation in HDLs. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 110–117, 2003.
[64] S. Hong and W.E. Stark. Design and implementation of a low complexity VLSI Turbo-code decoder architecture for low energy mobile wireless communications. Journal of VLSI Signal Processing, pages 2350–2354, 2000.
[65] W. Hormann and J. Leydold. Continuous random variate generation by fast numerical inversion. ACM Transactions on Modeling and Computer Simulation, 13(4):347–362, 2003.
[66] C.J. Howland and A.J. Blanksby. A 220mW 1 Gb/s 1024-bit rate-1/2 low density parity check code decoder. In Proceedings of IEEE Custom Integrated Circuits Conference, pages 293–296, 2001.
[67] C.J. Howland and A.J. Blanksby. Parallel decoding architectures for low density parity check codes. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 4, pages 742–745, 2001.
[68] Intel Corp. Intel Pentium 4 processor with 512-KB L2 cache on 0.13 micron process and Intel Pentium 4 processor extreme edition supporting Hyper-Threading datasheet, 2004. Document number 298643-012.
[69] F.C. Ionescu. Theory and practice of a fully controllable white noise generator. In Proceedings of IEEE International Semiconductor Conference, volume 2, pages 319–322, 1996.
[70] V.K. Jain, S.A. Wadecar, and L. Lin. A universal nonlinear component and its application to WSI. IEEE Transactions on Components, Hybrids and Manufacturing Technology, 16(7):656–664, 1993.
[71] Jet Propulsion Laboratory. Basics of Space Flight, 2004. http://www2.jpl.nasa.gov/basics.
[72] J. Jiang, W. Luk, and D. Rueckert. FPGA-based computation of free-form deformations in medical image registration. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 234–241, 2003.
[73] S.J. Johnson and S.R. Weller. A family of irregular LDPC codes with low encoding complexity. IEEE Communications Letters, 7(2):79–81, 2003.
[74] C. Jones, E. Valles, M. Smith, and J.D. Villasenor. Approximate-min* constraint node updating for LDPC code decoding. In Proceedings of IEEE Military Communications Conference, volume 1, pages 157–162, 2003.
[75] J.N. Mitchell Jr. Computer multiplication and division using binary logarithms. IRE Transactions on Electronic Computers, EC-11:512–517, 1962.
[76] B. Jung, H. Lenhof, P. Muller, and C. Rub. Langevin dynamics simulations of macromolecules on parallel computers. Macromolecular Theory and Simulations, pages 507–521, 1997.
[77] K. Chadha and J. Cavallaro. A reconfigurable Viterbi decoder architecture. In Proceedings of Asilomar Conference on Circuits, Systems, and Computers, pages 66–71, 2001.
[78] D.E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming. Addison-Wesley, third edition, 1997.
[79] I. Koren and O. Zinaty. Evaluating elementary functions in a numerical coprocessor based on rational approximations. IEEE Transactions on Computers, 39(8):1030–1037, 1990.
[80] R.E. Ladner and M.J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831–838, 1980.
[81] C.L. Lawson. Characteristic properties of the segmented rational minimax approximation problem. Numerische Mathematik, 6:293–301, 1964.
[82] D. Lee. Gaussian noise generation for Monte Carlo simulations in hardware. In Proceedings of The Korean Scientists and Engineers Association in the UK 30th Anniversary Conference, pages 182–185, 2004.
[83] D. Lee, A. Abdul Gaffar, O. Mencer, and W. Luk. Adaptive range reduction for hardware function evaluation. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 169–176, 2004.
[84] D. Lee, A. Abdul Gaffar, O. Mencer, and W. Luk. Automating optimized hardware function evaluation. IEEE Transactions on Computers, 2004. Submitted.
[85] D. Lee, W. Luk, and P.Y.K. Cheung. Incremental programming for reconfigurable engines. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 411–415, 2002.
[86] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. A hardware Gaussian noise generator for channel code evaluation. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 69–78, 2003.
[87] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. Hardware function evaluation using non-linear segments. In Proceedings of International Conference on Field-Programmable Logic and its Applications, LNCS 2778, pages 796–807. Springer-Verlag, 2003.
[88] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. Hierarchical segmentation schemes for function evaluation. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 92–99, 2003.
[89] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. A Gaussian noise generator for hardware-based simulations. IEEE Transactions on Computers, 53(12):1523–1534, 2004.
[90] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. The effects of polynomial degrees on the hierarchical segmentation method. In W. Rosenstiel and P. Lysaght, editors, New Algorithms, Architectures, and Applications for Reconfigurable Computing. Kluwer Academic Publishers, 2004.
[91] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. The hierarchical segmentation method for function evaluation. IEEE Transactions on Circuits and Systems I, 2004. Submitted.
[92] D. Lee, W. Luk, J.D. Villasenor, and P.H.W. Leong. Design parameter optimization for the Wallace Gaussian random number generator. ACM Transactions on Modeling and Computer Simulation, 2004. Submitted.
[93] D. Lee, W. Luk, C. Wang, C. Jones, M. Smith, and J.D. Villasenor. A flexible hardware encoder for low-density parity-check codes. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, 2004.
[94] D. Lee, W. Luk, G. Zhang, P.H.W. Leong, and J.D. Villasenor. A hardware Gaussian noise generator using the Wallace method. IEEE Transactions on VLSI, 2004. Submitted.
[95] D. Lee, O. Mencer, D.J. Pearce, and W. Luk. Automating optimized table-with-polynomial function evaluation for FPGAs. In Proceedings of International Conference on Field-Programmable Logic and its Applications, LNCS 3203, pages 364–373. Springer-Verlag, 2004.
[96] T.K. Lee, S. Yusuf, W. Luk, M. Sloman, E. Lupu, and N. Dulay. Compiling policy descriptions into reconfigurable firewall processors. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 39–48, 2003.
[97] V. Lefevre and J.M. Muller. On-the-fly range reduction. Journal of VLSI Signal Processing, 33:31–35, 2003.
[98] J.L. Leva. A fast normal random number generator. ACM Transactions on Mathematical Software, 18(4):449–453, 1992.
[99] B. Levine, R.R. Taylor, and H. Schmit. Implementation of near Shannon limit error-correcting codes using reconfigurable hardware. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 217–226, 2000.
[100] D.M. Lewis. Interleaved memory function interpolators with application to an accurate LNS arithmetic unit. IEEE Transactions on Computers, 43(8):974–982, 1994.
[101] J. Leydold. Automatic sampling with the ratio-of-uniforms method. ACM Transactions on Mathematical Software, 26(1):78–98, 2000.
[102] R.C. Li, S. Boldo, and M. Daumas. Theorems on efficient argument reductions. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 129–136, 2003.
[103] J. Liang, R. Tessier, and D. Goeckel. A dynamically-reconfigurable, power-efficient Turbo decoder. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, 2004.
[104] J. Liang, R. Tessier, and O. Mencer. Floating point unit generation and evaluation for FPGAs. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 185–194, 2003.
[105] M. Luby, M. Mitzenmacher, A. Shokrollahi, and D. Spielman. Analysis of low density codes and improved designs using irregular graphs. In Proceedings of the ACM Symposium on the Theory of Computing, pages 249–258, 1998.
[106] M. Luby, M. Mitzenmacher, A. Shokrollahi, and D. Spielman. Improved low-density parity-check codes using irregular graphs and belief propagation. In Proceedings of IEEE Symposium on Information Theory, page 117, 1998.
[107] M. Luby, M. Mitzenmacher, A. Shokrollahi, and D. Spielman. Improvedlow-density parity-check codes using irregular graphs. IEEE Transactionson Information Theory, 47:585–598, 2001.
[108] M. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman, and V. Stemann.Practical loss-resilient codes. In Proceedings of the ACM Symposium on theTheory of Computing, pages 150–159, 1997.
[109] J.N. Lygouras, B.G. Mertzios, and N.C. Voulgaris. Design and constructionof a microcomputer controlled light-weight robot arm. In Proceedings of theIEEE International Workshop on Intelligent Motion Control, pages 551–555, 1990.
[110] D.J.C MacKay. Good error-correcting codes based on very sparse matrices.IEEE Transactions on Information Theory, 45:399–431, 1999.
[111] D.J.C MacKay, S. Wilson, and M. Davey. Comparison of constructions ofirregular Gallager codes. IEEE Transactions on Communications, 47:1449–1454, 1999.
[112] A. Madisetti, A.Y. Kwentus, and A.N. Willson. A 100-MHz, 16-b, directdigital frequency synthesizer with a 100-dBc spurious-free dynamic range.IEEE Journal of Solid-State Circuits, 34(8):1034–1042, 1999.
[113] G. Marsaglia. Diehard: a battery of tests of randomness, 1997. http:
//stat.fsu.edu/∼geo/diehard.html.
[114] G. Marsaglia, M.D. MacLaren, and T.A. Bray. A fast procedure for gen-erating normal random variables. Communications of the ACM, 7(1):4–10,1964.
[115] G. Marsaglia and W.W. Tsang. The Ziggurat method for generating ran-dom variables. Journal of Statistical Software, 5(8):1–7, 2000.
[116] G. Masera, G. Piccinini, M. Ruo Roch, and M. Zamboni. VLSI architec-tures for Turbo codes. IEEE Transactions on VLSI, 7(3):369–379, 1999.
[117] The MathWorks Inc. MATLAB Manual v6.5, 2002. http://www.
mathworks.com.
[118] C. Maxfield. The Design Warrior’s Guide to FPGAs. Newnes, 2004.
[119] M. McKee. Mars laser will beam super-fast data. New Scientist, Sep 2004.http://www.newscientist.com/news/news.jsp?id=ns99996409.
[120] G. Mehta and H. Lee. An FPGA implementation of the graph encoder-decoder for regular LDPC codes. CRL Technical Report 8-4-2002-1, Communications Research Laboratory, University of Pittsburgh, 2002.
[121] O. Mencer. PAM-Blox II: design and evaluation of C++ module generation for computing with FPGAs. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 67–76, 2002.
[122] O. Mencer and W. Luk. Parameterized high throughput function evaluation for FPGAs. Journal of VLSI Signal Processing, 36(1):17–25, 2004.
[123] O. Mencer, D.J. Pearce, L.W. Howes, and W. Luk. Design space exploration with A Stream Compiler. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 270–277, 2003.
[124] G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill, 1994.
[125] A. Miller and M. Gulotta. PN generators using the SRL macro. Xilinx Application Note XAPP211, 2001.
[126] R.H. Morelos-Zaragoza. The Art of Error Correcting Coding. John Wiley & Sons, 2002.
[127] G.S. Muller and C.K. Pauw. On the generation of a smooth Gaussian random variable to 5 standard deviations. In Proceedings of IEEE Southern African Conference on Communications and Signal Processing, pages 62–66, 1988.
[128] J.M. Muller. Elementary Functions: Algorithms and Implementation. Birkhäuser Verlag AG, 1997.
[129] J.M. Muller. A few results on table-based methods. Reliable Computing, 5(3):279–288, 1999.
[130] M.E. Muller. A comparison of methods for generating normal deviates on digital computers. Journal of the ACM, 6(3):376–383, 1959.
[131] Nallatech. BenONE User Guide, 2002. http://www.nallatech.com.
[132] T. Oenning and J. Moon. Low density parity check coding for magnetic recording channels with media noise. In Proceedings of IEEE Conference on Communications, volume 7, pages 2189–2193, 2001.
[133] E.P. O'Grady and C.H. Wang. Performance limitations in parallel processor simulations. Transactions of the Society for Computer Simulation, 4:311–330, 1987.
[134] I. Page and W. Luk. Compiling Occam into FPGAs. In FPGAs. Abingdon EE&CS Books, 1991.
[135] K. Page and E.M. Chau. A FPGA ASIC communication channel systems emulator. In Proceedings of IEEE ASIC Conference, pages 345–348, 1993.
[136] B. Pandita and S.K. Roy. Design and implementation of a Viterbi decoder using FPGAs. In Proceedings of IEEE International Conference on VLSI Design, pages 611–614, 1999.
[137] B. Patrice, R. Didier, and V. Jean. Programmable active memories: a performance assessment. In Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 1992.
[138] T. Pavlidis. Waveform segmentation through functional approximation. IEEE Transactions on Computers, C-22(7):689–697, 1973.
[139] T. Pavlidis. Optimal piecewise polynomial L2 approximation of functions of one and two variables. IEEE Transactions on Computers, C-24:98–102, 1975.
[140] T. Pavlidis. The use of algorithms of piecewise approximations for picture processing applications. ACM Transactions on Mathematical Software, 2(4):305–321, 1976.
[141] T. Pavlidis and S.L. Horowitz. Segmentation of plane curves. IEEE Transactions on Computers, C-23:860–870, 1974.
[142] T. Pavlidis and A.P. Maika. Uniform piecewise polynomial approximation with variable joints. Journal of Approximation Theory, 12:61–69, 1974.
[143] W.H. Payne. Normal random numbers: using machine analysis to choose the best algorithm. ACM Transactions on Mathematical Software, 3(4):346–358, 1977.
[144] C.S. Petrie and J.A. Connelly. The sampling of noise for random number generation. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 6, pages 26–29, 1999.
[145] S.S. Pietrobon. Implementation and performance of a Turbo/MAP decoder. International Journal of Satellite Communications, 16:23–46, 1998.
[146] J.A. Pineiro, J.D. Bruguera, and J.M. Muller. A Turbo/MAP decoder for use in satellite circuits. In IEEE International Conference on Information and Communications Security, volume 1, pages 427–431, 1997.
[147] J.A. Pineiro, J.D. Bruguera, and J.M. Muller. Faithful powering computation using table look-up and a fused accumulation tree. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 40–47, 2001.
[148] J. Proakis. Digital Communications. McGraw-Hill, fourth edition, 2000.
[149] E. Remez. Sur un procédé convergent d'approximations successives pour déterminer les polynômes d'approximation. C.R. Académie des Sciences, Paris, (198), 1934.
[150] J.R. Rice. The Approximation of Functions, volume 2. Addison-Wesley, 1969.
[151] T. Richardson, A. Shokrollahi, and R. Urbanke. Design of provably good low-density parity check codes. In IEEE International Symposium on Information Theory, pages 25–30, 2000.
[152] T. Richardson and R. Urbanke. Efficient encoding of low-density parity-check codes. IEEE Transactions on Information Theory, 47:638–656, 2001.
[153] RightMark Gathering. RightMark Memory Analyzer 3.4, 2004. http://www.rightmark.org.
[154] B.D. Ripley. Stochastic Simulation. John Wiley & Sons, 1987.
[155] S. Rocchi and V. Vignoli. A chaotic CMOS true-random analog/digital white noise generator. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 5, pages 463–466, 1999.
[156] C. Rose. A statistical identity linking folded and censored distributions. Journal of Economic Dynamics and Control, 19(8):1391–1403, 1995.
[157] C. Rüb. On Wallace's method for the generation of normal variates. MPI Informatik Research Report MPI-I-98-1-020, Max-Planck-Institut für Informatik, Germany, 1998.
[158] D. Rueckert, L.I. Sonoda, C. Hayes, D.L. Hill, M.O. Leach, and D.J. Hawkes. Nonrigid registration using free-form deformations: application to breast MR images. IEEE Transactions on Medical Imaging, 18(8):712–720, 1999.
[159] D. Das Sarma and D.W. Matula. Faithful bipartite ROM reciprocal tables. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 17–28, 1995.
[160] M.F. Schollmeyer and W.H. Tranter. Noise generators for the simulation of digital communication systems. In Proceedings of IEEE Annual Simulation Symposium, pages 264–275, 1991.
[161] M.J. Schulte and E.E. Swartzlander, Jr. Hardware designs for exactly rounded elementary functions. IEEE Transactions on Computers, 43(8):964–973, 1994.
[162] M.J. Schulte and J.E. Stine. Symmetric bipartite tables for accurate function approximation. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 175–183, 1997.
[163] M.J. Schulte and J.E. Stine. Approximating elementary functions with symmetric bipartite tables. IEEE Transactions on Computers, 48(9):842–847, 1999.
[164] C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 1948.
[165] N. Sidahao, G.A. Constantinides, and P.Y.K. Cheung. Architectures for function evaluation on FPGAs. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 2, pages 804–807, 2003.
[166] SimpleScalar LLC. SimpleScalar 4.0, 2004. http://www.simplescalar.com.
[167] J.E. Stine and M.J. Schulte. The symmetric table addition method for accurate function approximation. Journal of VLSI Signal Processing, 21(2):167–177, 1999.
[168] H. Styles and W. Luk. Customizing graphics applications: techniques and programming interface. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 77–90, 2000.
[169] S. Swaminathan, R. Tessier, D. Goeckel, and W. Burleson. A dynamically reconfigurable adaptive Viterbi decoder. In Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 227–236, 2002.
[170] K. Tae, J. Chung, and D. Kim. Noise generation system using DCT. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 4, pages 29–32, 2002.
[171] P.T.P. Tang. Table lookup algorithms for elementary functions and their error analysis. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 232–236, 1991.
[172] R.M. Tanner. A recursive approach to low complexity codes. IEEE Transactions on Information Theory, IT-27:533–547, 1981.
[173] N. Telle, R.C.C. Cheung, and W. Luk. Customising hardware designs for elliptic curve cryptography. In International Workshop on Computer Systems: Architectures, Modeling, and Simulation, LNCS 3133. Springer-Verlag, 2004.
[174] T. Tian, C. Jones, J. Villasenor, and R. Wesel. Construction of irregular LDPC codes with low error floors. In Proceedings of IEEE International Conference on Communications, volume 5, pages 3125–3129, 2003.
[175] T. Todman and W. Luk. Methods and tools for high-resolution imaging. In Proceedings of International Conference on Field-Programmable Logic and its Applications, LNCS 3203, pages 627–636. Springer-Verlag, 2004.
[176] J. Vedral and J. Holub. Oscilloscope testing by means of stochastic signal. Measurement Science Review, 1(1), 2001.
[177] F. Viglione, G. Masera, G. Piccinini, M. Ruo Roch, and M. Zamboni. A 50 Mbit/s iterative Turbo-decoder. In Proceedings of Design, Automation and Test in Europe Conference, pages 176–180, 2000.
[178] J.E. Volder. The CORDIC trigonometric computing technique. IRE Transactions on Electronic Computers, EC-8(3):330–334, 1959.
[179] C.S. Wallace. A long-period pseudo-random generator. Technical Report TR89/123, Monash University, Australia, 1989.
[180] C.S. Wallace. Fast pseudorandom generators for normal and exponential variates. ACM Transactions on Mathematical Software, 22(1):119–127, 1996.
[181] C.S. Wallace. MDMC Software - Random Number Generators, 2003. http://www.datamining.monash.edu.au/software/random.
[182] J.S. Walther. A unified algorithm for elementary functions. In Proceedings of AFIPS Spring Joint Computer Conference, pages 379–385, 1971.
[183] N. Wax. Noise and Stochastic Processes. Dover Publications Inc., 1954.
[184] S. Wilton, S. Ang, and W. Luk. The impact of pipelining on energy per operation in field-programmable gate arrays. In Proceedings of International Conference on Field-Programmable Logic and its Applications, LNCS 3203, pages 719–728. Springer-Verlag, 2004.
[185] W.F. Wong and E. Goto. Fast hardware-based algorithms for elementary function computations using rectangular multipliers. IEEE Transactions on Computers, 43:278–294, 1994.
[186] Xilinx Inc. Additive White Gaussian Noise (AWGN) Core v1.0, 2002. http://www.xilinx.com.
[187] Xilinx Inc. Virtex-II Platform FPGAs: Detailed Description, 2003. http://www.xilinx.com.
[188] Xilinx Inc. Xilinx System Generator User Guide v6.2, 2003. http://www.xilinx.com.
[189] Xilinx Inc. Virtex-4 Family Overview, 2004. http://www.xilinx.com.
[190] D. Yeh, G. Feygin, and P. Chow. RACER: A reconfigurable constraint-length 14 Viterbi decoder. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 60–69, 1996.
[191] K.W. Yip and T.S. Ng. A simulation model for Nakagami-m fading channels, m<1. IEEE Transactions on Communications, 48(2):214–221, 2000.
[192] G. Zhang, P.H.W. Leong, C.H. Ho, K.H. Tsoi, R.C.C. Cheung, D. Lee, and W. Luk. Monte Carlo simulation using FPGAs. IEEE Transactions on VLSI, 2004. Submitted.
[193] T. Zhang and K.K. Parhi. VLSI implementation-oriented (3,k)-regular low-density parity-check codes. In Proceedings of IEEE Workshop on Signal Processing Systems, pages 25–36, 2001.
[194] T. Zhang and K.K. Parhi. A 54 Mbps (3,6)-regular FPGA LDPC decoder. In Proceedings of IEEE Workshop on Signal Processing Systems, pages 127–132, 2002.
[195] T. Zhang, Z. Wang, and K.K. Parhi. On finite precision implementation of low-density parity-check codes decoder. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 6, pages 202–205, 2001.
[196] H. Zhun and H. Chen. A truly random number generator based on thermal noise. In Proceedings of IEEE International Conference on ASIC, pages 862–864, 2001.