Imperial College
London
Hardware Designs for
Function Evaluation and LDPC Coding
A thesis submitted in partial satisfaction
of the requirements for the degree
Doctor of Philosophy in Computing
by
Dong-U Lee
October 2004
© Copyright by
Dong-U Lee
October 2004
To my parents for their love and support,
and my country Korea...
Acknowledgments
I thank my supervisor Prof. Wayne Luk for his advice and direction on both
academic and non-academic issues. I would also like to thank Prof. John D. Villasenor
from UCLA, Prof. Philip H.W. Leong from the Chinese University of Hong Kong,
Prof. Peter Y.K. Cheung from the Department of EEE and Dr. Oskar Mencer
from the Department of Computing for their help on my research topics.
Many thanks to my colleagues Altaf Abdul Gaffar, Andreas Fidjeland, Anthony Ng,
Arran Derbyshire, Danny Lee, David Pearce, David Thomas, Henry Styles,
Jose Gabriel de Fiqueiredo Coutinho, Jun Jiang, Ray Cheung, Shay Ping Seng,
Sherif Yusuf, Tero Rissa and Tim Todman from Imperial College, Chris Jones,
Connie Wang, David Choi, Esteban Valles and Mike Smith from UCLA,
and Dr. Guanglie Zhang from the Chinese University of Hong Kong for their
assistance. I am especially thankful to Altaf Abdul Gaffar and Ray Cheung, who
helped me with numerous Linux programming tasks, and Tim Todman, who
proofread this thesis.
The financial support of Celoxica Limited, Xilinx Inc., the U.K. Engineering
and Physical Sciences Research Council PhD Studentship from the Department of
Computing, Imperial College, and the U.S. Office of Naval Research is gratefully
acknowledged.
Abstract of the Thesis
Hardware-based implementations are desirable, since they can be several orders
of magnitude faster than software-based methods. Reconfigurable devices such as
Field-Programmable Gate Arrays (FPGAs) are ideal candidates for this purpose
because of their speed and flexibility. Three main contributions are presented in
this thesis, in the areas of function evaluation, Gaussian noise generation, and
Low-Density Parity-Check (LDPC) encoding. First, our function evaluation
research covers both elementary functions and compound functions. For elementary
functions, we automate the design of function evaluation units covering table
look-up, table-with-polynomial and polynomial-only methods. We also illustrate
a framework for adaptive range reduction based on a parametric function
evaluation library. The proposed approach is evaluated by exploring the effects of
several arithmetic functions on the throughput, latency and area of FPGA designs.
For compound functions, which are often non-linear, we present an evaluation
method based on piecewise polynomial approximation with a novel hierarchical
segmentation scheme, which combines uniform segments with segments whose
sizes vary by powers of two. Second, our research on Gaussian noise generation
results in two hardware architectures, both of which can be used for Monte Carlo
simulations such as evaluating the performance of LDPC codes. The first design is
based on the Box-Muller method and the central limit theorem, while the second
design is based on the Wallace method. The quality of the noise produced by the
two noise generators is characterized with various statistical tests. We also
examine how design parameters affect the noise quality of the Wallace method.
Third, our research on LDPC encoding describes a flexible hardware encoder
for regular and irregular LDPC codes. Our architecture, based on an encoding
method proposed by Richardson and Urbanke, has linear encoding complexity.
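As context for the noise-generation work summarized above, the standard Box-Muller transform, on which the first noise generator is based, can be sketched in software. The following is an illustrative Python sketch of the textbook transform under our own naming (`box_muller_pair` is not from the thesis), not the hardware architecture itself:

```python
import math
import random

def box_muller_pair(rng=random.random):
    """Produce two independent standard Gaussian samples from
    two uniform samples in [0, 1) via the Box-Muller transform."""
    u1 = rng()
    u2 = rng()
    # Guard against log(0): resample until u1 is strictly positive.
    while u1 == 0.0:
        u1 = rng()
    # Radius term: f(u1) = sqrt(-2 ln u1); angle term: 2*pi*u2.
    r = math.sqrt(-2.0 * math.log(u1))
    x0 = r * math.cos(2.0 * math.pi * u2)
    x1 = r * math.sin(2.0 * math.pi * u2)
    return x0, x1
```

The hardware designs of Chapter 6 evaluate the logarithm, square root and trigonometric terms with piecewise linear approximations rather than library calls; the sketch only shows the mathematical structure being approximated.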
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1 Objectives and Contributions . . . . . . . . . . . . . . . . . . . . 5
1.2 Computer Arithmetic . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Error Correcting Coding and LDPC Codes . . . . . . . . . . . . . 9
1.4 Overview of our Approach . . . . . . . . . . . . . . . . . . . . . . 11
1.4.1 Function Evaluation . . . . . . . . . . . . . . . . . . . . . 11
1.4.2 Gaussian noise generation . . . . . . . . . . . . . . . . . . 13
1.4.3 LDPC Encoding . . . . . . . . . . . . . . . . . . . . . . . 18
2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 FPGAs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.2 Design Tools . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 Function Evaluation Methods . . . . . . . . . . . . . . . . . . . . 24
2.3.1 CORDIC . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.3.2 Digit-recurrence and On-line Algorithms . . . . . . . . . . 26
2.3.3 Bipartite and Multipartite Methods . . . . . . . . . . . . . 27
2.3.4 Polynomial Approximation . . . . . . . . . . . . . . . . . . 28
2.3.5 Polynomial Approximation with Non-uniform Segmentation 30
2.3.6 Rational Approximation . . . . . . . . . . . . . . . . . . . 31
2.4 Issues on Function Evaluation . . . . . . . . . . . . . . . . . . . . 31
2.4.1 Evaluation of Elementary and Compound Functions . . . . 32
2.4.2 Approximation Method Selection . . . . . . . . . . . . . . 32
2.4.3 Range Reduction . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.4 Types of Errors . . . . . . . . . . . . . . . . . . . . . . . . 35
2.5 Gaussian Noise Generation . . . . . . . . . . . . . . . . . . . . . . 36
2.6 LDPC Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.6.1 Basics of LDPC Codes . . . . . . . . . . . . . . . . . . . . 38
2.6.2 LDPC Encoding . . . . . . . . . . . . . . . . . . . . . . . 42
2.6.3 RU LDPC Encoding Method . . . . . . . . . . . . . . . . 43
2.6.4 Hardware Aspects of LDPC codes . . . . . . . . . . . . . . 49
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
3 Automating Optimized Table-with-Polynomial
Function Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
3.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.3 Algorithmic Design Space Exploration with MATLAB . . . . . . . 54
3.4 Hardware Design Space Exploration with ASC . . . . . . . . . . . 57
3.5 Verification with ASC . . . . . . . . . . . . . . . . . . . . . . . . 59
3.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4 Adaptive Range Reduction
for Function Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
4.3 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
4.3.1 Design Overview . . . . . . . . . . . . . . . . . . . . . . . 74
4.3.2 Degrees of Freedom . . . . . . . . . . . . . . . . . . . . . . 80
4.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
4.4.1 Algorithmic Design Space Exploration . . . . . . . . . . . 83
4.4.2 ASC Code Generation and Optimizations . . . . . . . . . 87
4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5 The Hierarchical Segmentation Method
for Function Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3 Optimum Placement of Segments . . . . . . . . . . . . . . . . . . 104
5.4 The Hierarchical Segmentation Method . . . . . . . . . . . . . . . 113
5.5 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.6 Error Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.7 The Effects of Polynomial Degrees . . . . . . . . . . . . . . . . . . 127
5.8 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . 133
5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6 Gaussian Noise Generator
using the Box-Muller Method . . . . . . . . . . . . . . . . . . . . . . 144
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.4 Function Evaluation for Non-uniform Segmentation . . . . . . . . 152
6.5 Function Evaluation for Noise Generator . . . . . . . . . . . . . . 156
6.6 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
6.7 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . 165
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7 Gaussian Noise Generator
using the Wallace Method . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.2 The Wallace Method . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.3 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.3.1 The First Stage . . . . . . . . . . . . . . . . . . . . . . . . 181
7.3.2 The Second Stage . . . . . . . . . . . . . . . . . . . . . . . 182
7.3.3 The Third Stage . . . . . . . . . . . . . . . . . . . . . . . 182
7.3.4 The Fourth Stage . . . . . . . . . . . . . . . . . . . . . . . 185
7.4 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
7.5 Evaluation and Results . . . . . . . . . . . . . . . . . . . . . . . . 193
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
8 Design Parameter Optimization
for the Wallace Method . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.2 Overview of the Wallace Method . . . . . . . . . . . . . . . . . . 205
8.3 Measuring the Wallace Correlations . . . . . . . . . . . . . . . . . 208
8.4 Reducing the Wallace Correlations . . . . . . . . . . . . . . . . . 211
8.5 Performance Comparisons . . . . . . . . . . . . . . . . . . . . . . 214
8.6 Hardware Design with Optimized Parameters . . . . . . . . . . . 220
8.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
9 Flexible Hardware Encoder for LDPC Codes . . . . . . . . . . . 226
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
9.3 Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
9.4 Encoder Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 235
9.5 Components for the Encoder . . . . . . . . . . . . . . . . . . . . . 239
9.5.1 Vector Addition . . . . . . . . . . . . . . . . . . . . . . . . 239
9.5.2 Matrix-Vector Multiplication . . . . . . . . . . . . . . . . . 239
9.5.3 Forward-Substitution . . . . . . . . . . . . . . . . . . . . . 241
9.6 Implementation and Results . . . . . . . . . . . . . . . . . . . . . 242
9.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 253
10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 261
10.2.1 Function Evaluation . . . . . . . . . . . . . . . . . . . . . 261
10.2.2 Gaussian Noise Generation . . . . . . . . . . . . . . . . . . 263
10.2.3 LDPC Coding . . . . . . . . . . . . . . . . . . . . . . . . . 264
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
List of Figures
1.1 Relations of the chapters in this thesis. . . . . . . . . . . . . . . . 7
1.2 Design flow for evaluating elementary functions. . . . . . . . . . . 13
1.3 Design flow for evaluating non-linear functions using the hierarchical segmentation method. . . . . . . . . . . . . . . . . 14
1.4 The BenONE board from Nallatech used to run our LDPC simulation experiments. . . . . . . . . . . . . . . . . . . . . . . 16
1.5 Our LDPC hardware simulation framework. . . . . . . . . . . . . 17
1.6 LDPC encoding framework. . . . . . . . . . . . . . . . . . . . . . 18
2.1 Simplified view of a Xilinx logic cell. A single slice contains 2.25
logic cells. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.2 Architecture of a typical FPGA. . . . . . . . . . . . . . . . . . . . 22
2.3 Certain approximation methods are better than others for a given
metric at different precisions. . . . . . . . . . . . . . . . . . . . . 33
2.4 Area comparison in terms of configurable logic blocks for different
methods with varying data widths [122]. . . . . . . . . . . . . . . 34
2.5 Comparison of (3,6)-regular LDPC code, Turbo code and optimized irregular LDPC code [151]. . . . . . . . . . . . . . . . 39
2.6 LDPC communication system model. . . . . . . . . . . . . . . . . 40
2.7 A bipartite graph of a (3,6)-regular LDPC code of length ten and rate 1/2. There are ten variable nodes and five check nodes. For each check node Ci, the sum (over GF(2)) of all adjacent variable nodes is equal to zero. . . . . . . . . . . . . . . . . . . 41
2.8 An equivalent parity-check matrix in lower triangular form. . . . . 43
2.9 The parity-check matrix in approximate lower triangular form . . 44
3.1 Block diagram of methodology for automation. . . . . . . . . . . . 55
3.2 Principles behind automatic design optimization with ASC. . . . 56
3.3 Accuracy graph: maximum error versus bitwidth for sin(x) with
the three methods. . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.4 Area versus bitwidth for sin(x) with TABLE+POLY. OPT indicates the metric for which the design is optimized. Lower part: LUTs for logic; small top part: LUTs for routing. . . . . 62
3.5 Latency versus bitwidth for sin(x) with TABLE+POLY. Shows
the impact of latency optimization. . . . . . . . . . . . . . . . . . 62
3.6 Throughput versus bitwidth for sin(x) with TABLE+POLY. Shows
the impact of throughput optimization. . . . . . . . . . . . . . . . 63
3.7 Latency versus area for 12-bit approximations to sin(x). The
Pareto-optimal points [124] in the latency-area space are shown. 63
3.8 Latency versus throughput for 12-bit approximations to sin(x).
The Pareto-optimal points in the latency-throughput space are
shown. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.9 Area versus throughput for 12-bit approximations to sin(x). The
Pareto-optimal points in the throughput-area space are shown. . 64
3.10 Area versus bitwidth for the three functions with TABLE+POLY.
Lower part: LUTs for logic; small top part: LUTs for routing. . . 67
3.11 Latency versus bitwidth for the three functions with TABLE+POLY. 67
3.12 Throughput versus bitwidth for the three functions with TABLE+POLY.
Throughput is similar across functions, as expected. . . . . . . . . 68
3.13 Area versus bitwidth for sin(x) with the three methods. Note that the TABLE method already becomes too large at 14 bits. . . . 68
3.14 Latency versus bitwidth for sin(x) with the three methods. . . . 69
3.15 Throughput versus bitwidth for sin(x) with the three methods. . . 69
4.1 Design flow: MATLAB generates all the ASC code for the library.
The user simply indexes into the library to obtain the specific
function approximation unit. . . . . . . . . . . . . . . . . . . . . . 73
4.2 Description of range reduction, evaluation method and range reconstruction for the three functions sin(x), log(x) and √x. . . . . 75
4.3 Circuit for evaluating sin(x). . . . . . . . . . . . . . . . . . . . . . 76
4.4 Circuit for evaluating log(x). . . . . . . . . . . . . . . . . . . . . . 77
4.5 Circuit for evaluating √x. . . . . . . . . . . . . . . . . . . . . 78
4.6 Plot of the three functions over the range reduced intervals. . . . 79
4.7 Segmentation for evaluating log(y) with eight uniform segments.
The leftmost three bits of the inputs are used as the segment index. 82
4.8 Architecture of table-with-polynomial unit for degree d polynomials. Horner's rule is used to evaluate the polynomials. . . . . 83
4.9 ASC code for evaluating sin(x) for range 8 bits and precision 8 bits
with tp2. This code is automatically generated from our MATLAB
tool. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.10 Area matrix, which tells us for each input range/precision combination which design to use for minimum area. . . . . . . . . 91
4.11 Latency matrix, which tells us for each input range/precision combination which design to use for minimum latency. . . . . . . 91
4.12 Area cost of range reduction (upper part) for sin(x) implemented
using po with the designs optimized for area. . . . . . . . . . . . . 92
4.13 Area cost of range reduction (upper part) for sin(x) implemented
using tp3 with the designs optimized for area. . . . . . . . . . . . 92
4.14 Area cost of range reduction (upper part) for log(x) implemented
using po with the designs optimized for area. . . . . . . . . . . . . 93
4.15 Area cost of range reduction (upper part) for log(x) implemented
using tp3 with the designs optimized for area. . . . . . . . . . . . 93
4.16 Area for sin(x) with precision of eight bits for different methods
with (WRR, solid line) and without (WOR, dashed line) range
reduction, with the designs optimized for area. . . . . . . . . . . . 94
4.17 Latency for sin(x) with precision of eight bits for different methods
with (WRR, solid line) and without (WOR, dashed line) range
reduction, with the designs optimized for latency. . . . . . . . . . 94
4.18 Area for log(x) with precision of eight bits for different methods
with (WRR, solid line) and without (WOR, dashed line) range
reduction, with the designs optimized for area. . . . . . . . . . . . 95
4.19 Latency for sin(x) with precision of eight bits for different methods
with (WRR, solid line) and without (WOR, dashed line) range
reduction, with the designs optimized for latency. . . . . . . . . . 95
4.20 Area versus precision for sin(x) using tp3 for different ranges and
optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.21 Latency versus precision for sin(x) using tp3 for different ranges
and optimization. . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
4.22 Area versus range for all three functions using different methods
with the precision fixed at eight bits optimized for area. . . . . . . 97
4.23 Latency versus range for all three functions using different methods
with the precision fixed at eight bits optimized for latency. . . . . 97
4.24 Area versus range for all three functions using po for different
precisions optimized for area. . . . . . . . . . . . . . . . . . . . . 98
4.25 Latency versus range for all three functions using po for different
precisions optimized for latency. . . . . . . . . . . . . . . . . . . . 98
4.26 Area versus range for all three functions using po for different
precisions optimized for area. . . . . . . . . . . . . . . . . . . . . 99
4.27 Latency versus range for all three functions using po for different
precisions optimized for latency. . . . . . . . . . . . . . . . . . . . 99
5.1 MATLAB code for finding the optimum boundaries. . . . . . . . . 109
5.2 Optimum locations of the segments for the four functions in Section 5.1 for 16-bit operands and second order approximation. . . . 110
5.3 Numbers of optimum segments for first order approximations to
the functions for various operand bitwidths. . . . . . . . . . . . . 111
5.4 Numbers of optimum segments for second order approximations to
the functions for various operand bitwidths. . . . . . . . . . . . . 111
5.5 Ratio of the number of optimum segments required for first and
second order approximations to the functions. . . . . . . . . . . . 112
5.6 Circuit to calculate the P2S address for a given input δi, where δi = a_(v−1) a_(v−2) ... a_0. The adder counts the number of ones in the output of the two prefix circuits. . . . . . . . . . . . . . 115
5.7 Main MATLAB code for finding the hierarchical boundaries and
their polynomial coefficients. . . . . . . . . . . . . . . . . . . . . . 119
5.8 Variation of total number of segments against v0 for a 16-bit second
order approximation to f3. . . . . . . . . . . . . . . . . . . . . . . 120
5.9 The segmented functions generated by HFS for 16-bit second order
approximations. f1, f2, f3 and f4 employ P2S(US), P2SL(US),
US(US) and US(US) respectively. The black and grey vertical lines
are the boundaries for the outer and inner segments respectively. . 121
5.10 Design flow of our approach. . . . . . . . . . . . . . . . . . . . . . 123
5.11 HSM function evaluator architecture for λ = 2 and degree d approximations. Note that ':' is a concatenation operator. . . . . . 130
5.12 Variations of the table sizes to the four functions with varying
polynomial degrees and operand bitwidths. . . . . . . . . . . . . . 131
5.13 Variations of the HSM/Optimum segment ratio with polynomial
degrees and operand bitwidths. . . . . . . . . . . . . . . . . . . . 132
5.14 Xilinx System Generator design template used for first order US(US). . . . 135
5.15 Xilinx System Generator design template used for second order
P2SL(US). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
5.16 Error in ulp for 16-bit second order approximation to f3. . . . . . 137
6.1 Gaussian noise generator architecture. The black boxes are buffers. 150
6.2 The f function. The asterisks indicate the boundaries of the linear
approximations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.3 Circuit to calculate the segment address for a given input x. The adder counts the number of ones in the output of the two prefix circuits. Note that the least-significant bit x_0 is not required. . . . 155
6.4 Function evaluator architecture based on non-uniform segmentation. . . . 157
6.5 Variation of function approximation error with number of bits for
the gradient of the f function. . . . . . . . . . . . . . . . . . . . . 158
6.6 The g functions. Only the thick line is approximated; see Figure
4. The most significant 2 bits of u2 are used to choose which of
the four regions to use; the remaining bits select a location within
Region 0. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
6.7 Approximation for g1 over [0, 1/4). The asterisks indicate the segment boundaries of the linear approximations. . . . . . . . . . 160
6.8 Approximation error to f . The worst case and average errors are
0.031 and 0.000048 respectively. . . . . . . . . . . . . . . . . . . . 161
6.9 Approximation error to g1. The worst case and average errors are
0.00079 and 0.0000012 respectively. . . . . . . . . . . . . . . . . . 162
6.10 PDF of the generated noise with 17 approximations for f and 6
for g for a population of four million. The p-values of the χ2 and
A-D tests are 0.00002 and 0.0084 respectively. . . . . . . . . . . . 169
6.11 PDF of the generated noise with 59 approximations for f and 21
for g for a population of four million. The p-values of the χ2 and
A-D tests are 0.0012 and 0.3487 respectively. . . . . . . . . . . . . 169
6.12 PDF of the generated noise with 59 approximations for f and
21 for g with two accumulated samples for a population of four
million. The p-values of the χ2 and A-D tests are 0.3842 and
0.9058 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 170
6.13 Scatter plot of two successive accumulative noise samples for a
population of 10000. No obvious correlations can be seen. . . . . . 170
6.14 Variation of output rate against the number of noise generator
instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.1 Overview of the Wallace method. . . . . . . . . . . . . . . . . . . 177
7.2 Overview of our Gaussian noise generator architecture based on the
Wallace method. The triangle in Stage 4 is a constant coefficient
multiplier. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
7.3 The transformation circuit of Stage 3. The square boxes are registers. The select signals for the multiplexors and the clock enable signals for the registers are omitted for simplicity. . . . . . . . 183
7.4 Detailed timing diagram of the transformation circuit and the dual-port "Pool RAM". A_z indicates the address of the data z, and WE is the write enable signal of the "Pool RAM". . . . . . . 184
7.5 Wallace architecture Stage 1 in Xilinx System Generator. The 30
LFSRs generate uniform random bits for Stage 2. . . . . . . . . . 188
7.6 Wallace architecture Stage 2 in Xilinx System Generator. Pseudo
random addresses for p, q, r, s are generated. . . . . . . . . . . . . 189
7.7 Wallace architecture Stage 3 and Stage 4 in Xilinx System Generator. Orthogonal transformation is performed and the sum of squares corrected. . . . . . . . . . . . . . . . . . . . . . . . . 190
7.8 Our Wallace design placed on a Xilinx Virtex-II XC2V4000-6 FPGA. . . . 192
7.9 Our Wallace design routed on a Xilinx Virtex-II XC2V4000-6 FPGA. . . . 192
7.10 Scatter plot of two successive noise samples for a population of
10000. No obvious correlations can be seen. . . . . . . . . . . . . 195
7.11 PDF of the generated noise from our design for a population of
one million. The p-values of the χ2 and A-D tests are 0.9994 and
0.2332 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 196
7.12 PDF of the generated noise from our design for a population of
four million. The p-values of the χ2 and A-D tests are 0.7303 and
0.8763 respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . 197
7.13 PDF of the generated noise from the Xilinx block for a population
of one million. The p-values of the χ2 and A-D tests are 0.0000
and 0.0002 respectively. . . . . . . . . . . . . . . . . . . . . . . . . 198
7.14 Variation of the χ2 test p-value with sample size for the Xilinx block and the 12-bit, 16-bit, 20-bit and 24-bit Wallace implementations. . . . 200
7.15 Variation of output rate against the number of noise generator
instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
8.1 Pseudo code of the Wallace method. . . . . . . . . . . . . . . . . 207
8.2 Four million samples of blocks immediately following the block containing a 5σ output, evaluated with the χ2 test with 200 bins over [−7, 7] for FastNorm2. The χ2_199 contributions of each of the bins are shown. . . . . . . . . . . . . . . . . . . . . . . . 209
8.3 The χ2_199 values of blocks relative to a block containing a realization with absolute value of 5σ or higher. Four million samples are compiled for each block. The dotted horizontal line indicates the 0.05 confidence level. . . . . . . . . . . . . . . . . . . . . 210
8.4 Impact of various design choices on the χ2_199 value. Four million samples are compiled from the block immediately after each block containing an absolute value of 5σ or higher for each data point. The dotted horizontal line indicates the 0.05 confidence level. . . . 222
8.5 Speed comparisons at various K at N = 4096 and R = 1. Lower
part: arithmetic operations. Upper part: table accesses. . . . . . . 223
8.6 Speed comparisons for different parameter choices. The solid,
dashed and dotted lines are for R = 1, R = 2 and R = 3 re-
spectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
8.7 Execution times for different pool sizes at R = 1 and K = 16. The
solid and dotted lines are for the Athlon XP and the Pentium 4
processors respectively. . . . . . . . . . . . . . . . . . . . . . . . . 224
8.8 Level 2 cache miss rates on the SimpleScalar x86 simulator for
different pool sizes at R = 1, K = 16 and various level 2 cache
sizes. Level 1 cache is fixed at 16KB and 65536 noise samples are
generated for each data point. . . . . . . . . . . . . . . . . . . . . 224
9.1 The parity-check matrix H in ALT form. A, B, C, and E are
sparse matrices, D is a dense matrix, and T is a sparse lower
triangular matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
9.2 LDPC encoding framework. . . . . . . . . . . . . . . . . . . . . . 229
9.3 An equivalent parity-check matrix in lower triangular form. Note
that n = block length and m = block length× (1− code rate). . . 230
9.4 Different starting columns for H and H^T. . . . . . . . . . . . . 235
9.5 Overview of our hardware encoder architecture. Double buffering
is used between the stages for concurrent execution. Grey and
white box indicate RAMs and operations respectively. . . . . . . . 236
9.6 Circuit for vector addition (VA). . . . . . . . . . . . . . . . . . . . 239
9.7 Circuit for matrix-vector multiplication (MVM). . . . . . . . . . . 241
9.8 Circuit for forward-substitution (FS). . . . . . . . . . . . . . . . . 243
9.9 Scatter plot of a preprocessed irregular 500 × 1000 H matrix in
ALT form with a gap of two. Ones appear as dots. . . . . . . . . 245
9.10 The four-stage LDPC encoder architecture in Xilinx System Generator. Each stage contains multiple subsystems performing MVM, FS, VA or CWG. . . . . . . . . . . . . . . . . . . . . . 246
9.11 LDPC encoder architecture Stage 2 and stage controller in Xilinx
System Generator. . . . . . . . . . . . . . . . . . . . . . . . . . . 247
9.12 The matrix-vector multiplication (MVM) circuit in Xilinx System
Generator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
9.13 The forward-substitution (FS) circuit in Xilinx System Generator. 249
9.14 Variation of throughput with the number of encoder instances. . . 255
List of Tables
2.1 Maximum absolute and average errors for various first order polynomial approximations to e^x over [−1, 1]. . . . . . . . . . . . 29
2.2 Efficient computation of p1^T = −φ^(−1)(−E T^(−1) A + C) s^T. . . . . . . 46
2.3 Efficient computation of p2^T = −T^(−1)(A s^T + B p1^T). . . . . . . . . 47
2.4 Summary of the RU encoding procedure. . . . . . . . . . . . . . . 48
3.1 Various place and route results of 12-bit approximations to sin(x). The logic-minimized LUT implementation of the tables minimizes latency and area, while keeping throughput comparable to that of the other methods, e.g. the block RAM (BRAM) based implementation. . . . 59
5.1 The ranges for P2S addresses for Λ1 = P2S, n = 8, v0 = 5 and
v1 = 3. The five P2S address bits δ0 are highlighted in bold. . . . 114
5.2 Number of segments for second order approximations to the four
functions. Results for uniform, HSM and optimum are shown. . . 122
5.3 Comparison of direct look-up, SBTM, STAM and HSM for 16 and 24-bit approximations to f2. The subscript for HSM denotes the polynomial degree, and the subscript for STAM denotes the number of multipartite tables used. Note that SBTM is equivalent to STAM2. . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.4 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA
for 16 and 24-bit, first and second order approximations to f2 and
f3. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.5 Widths of the data paths, number of segments, table size and
percentage of exactly rounded results for 16 and 24-bit second
order approximations to f2 and f3. . . . . . . . . . . . . . . . . . 141
5.6 Performance comparison: computation of f2 and f3 functions. The
Athlon and the Pentium 4 PCs are equipped with 512MB and 1GB
DDR-SDRAMs respectively. . . . . . . . . . . . . . . . . . . . . . 142
6.1 Comparing two segmentation methods. The second column compares the number of segments for non-uniform and uniform segmentation. The third column shows the number of bits used for the coefficients to approximate f and g1. . . . . . . . . . . . 163
6.2 Performance comparison: time for producing one billion Gaussian
noise samples. All PCs are equipped with 1GB DDR-SDRAM. . . 171
7.1 Resource utilization for the four stages of the noise generator on a
Xilinx Virtex-II XC2V4000-6 FPGA. . . . . . . . . . . . . . . . . 191
7.2 Hardware implementation results of the noise generator using dif-
ferent types of FPGA resources on a Xilinx Virtex-II XC2V4000-6
FPGA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
7.3 Comparisons of different hardware Gaussian noise generators im-
plemented on Xilinx Virtex-II XC2V4000-6 FPGAs. All designs
generate a noise sample every clock. . . . . . . . . . . . . . . . . . 199
7.4 Hardware implementation results on a Xilinx Virtex-II XC2V4000-
6 FPGA for different numbers of noise generator instances.
The device has 23040 slices, 120 block RAMs and 120 embedded
multipliers in total. . . . . . . . . . . . . . . . . . . . . . . . . . . 201
7.5 Performance comparison: time for producing one billion Gaussian
noise samples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
8.1 Number of arithmetic operations per transform/sample for the
transformation at various sizes of K. . . . . . . . . . . . . . . . . 214
8.2 Specifications of the AMD Athlon XP and Intel Pentium 4 plat-
forms used in our experiments. . . . . . . . . . . . . . . . . . . . . 216
8.3 Details of the AMD Athlon XP and Intel Pentium 4 data caches. 217
8.4 Execution time in nanoseconds for the AMD Athlon XP and Intel
Pentium 4 platforms at N = 4096. . . . . . . . . . . . . . . . . . . 218
8.5 Performance comparison of different software Gaussian random
number generators. The Wallace implementations use N = 4096,
R = 1 and K = 16. . . . . . . . . . . . . . . . . . . . . . . . . . . 220
9.1 Computation of p1^T = −F^−1(−ET^−1A + C)s^T. Note that T^−1[As^T] =
y^T ⇒ Ty^T = [As^T]. . . . . . . . . . . . . . . . . . . . . . . . . . 232
9.2 Computation of p2^T = −T^−1(As^T + Bp1^T). . . . . . . . . . . . . . 232
9.3 Matrix X stored in memory. The location of the edges of each row
and an extra bit indicating the end of a row are stored. . . . . . . 240
9.4 Preprocessing times and gaps for H matrices with rate 1/2 for var-
ious block lengths performed on a Pentium 4 2.4GHz PC equipped
with 512MB DDR-SDRAM. . . . . . . . . . . . . . . . . . . . . . 244
9.5 Dimensions and number of edges for the matrices A, B, T , C, F
and E generated from a 1000× 2000 irregular H matrix. . . . . . 250
9.6 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA
for rate 1/2 for various block lengths. . . . . . . . . . . . . . . . . 252
9.7 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA
for block length of 2000 bits for various rates. . . . . . . . . . . . 253
9.8 Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA
for block length of 2000 bits and rate 1/2 for different numbers of
encoder instances. . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
9.9 Performance comparison of block length of 2000 bits and rate 1/2
encoders: time for producing 410 million codeword bits. . . . . . . 256
Abbreviations
A-D Anderson-Darling
ALT Approximate Lower Triangular
ASC A Stream Compiler
ASIC Application-Specific Integrated Circuit
AWGN Additive White Gaussian Noise
BER Bit Error Rate
CDF Cumulative Distribution Function
CORDIC COordinate Rotations DIgital Computer
CPC Cycles Per Codeword
CPS Codewords Per Second
CWG CodeWord Generation
DDR Double Data Rate
DSP Digital Signal Processor
ECC Error Correcting Coding
FPGA Field-Programmable Gate Array
FS Forward-Substitution
GF Galois Field
HFS Hierarchical Function Segmenter
HSM Hierarchical Segmentation Method
K-S Kolmogorov-Smirnov
LDGM Low-Density Generator-Matrix
LDPC Low-Density Parity-Check
LFSR Linear Feedback Shift Register
LNS Logarithmic Number Systems
LRU Least Recently Used
LUT Look-Up Table
Mbps Megabits per second
MVM Matrix-Vector Multiplication
P2S Powers of 2 Segments
PDF Probability Density Function
po polynomial only
RAM Random Access Memory
ROM Read Only Memory
RU Richardson and Urbanke
S1 Stage 1
SBTM Symmetric Bipartite Table Method
SNR Signal to Noise Ratio
STAM Symmetric Table Addition Method
tp2 table-with-polynomial of degree 2
ulp unit in the last place
US Uniform Segments
VA Vector Addition
VHDL VHSIC (Very High Speed Integrated Circuit) Hardware Description Language
WOR WithOut Range reduction
WRR With Range Reduction
Publications
Journal Papers
D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “Automating optimized hard-
ware function evaluation”, submitted to IEEE Transactions on Computers, 2004.
P.H.W. Leong, G. Zhang, D. Lee, W. Luk and J.D. Villasenor, “A comment on
the implementation of the Ziggurat method”, submitted to Journal of Statistical
Software, 2004.
D. Lee, W. Luk, J.D. Villasenor and P.H.W. Leong, “Design parameter optimiza-
tion for the Wallace Gaussian random number generator”, submitted to ACM
Transactions on Modeling and Computer Simulation, 2004.
D. Lee, W. Luk, J.D. Villasenor, G. Zhang and P.H.W. Leong, “A hardware
Gaussian noise generator using the Wallace method”, submitted to IEEE Trans-
actions on VLSI, 2004.
G. Zhang, P.H.W. Leong, C.H. Ho, K.H. Tsoi, R.C.C. Cheung, D. Lee and
W. Luk, “Monte Carlo simulation using FPGAs”, submitted to IEEE Trans-
actions on VLSI, 2004.
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “The hierarchical segmen-
tation method for function evaluation”, submitted to IEEE Transactions on Cir-
cuits and Systems I, 2004.
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “A hardware Gaussian noise
generator for hardware-based simulations”, IEEE Transactions on Computers,
volume 53, number 12, pages 1523-1534, 2004.
Book Chapter
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “The effects of polynomial
degrees on the hierarchical segmentation method”, Chapter in New Algorithms,
Architectures, and Applications for Reconfigurable Computing, W. Rosenstiel and
P. Lysaght (Eds.), Kluwer Academic Publishers, 2004.
Conference Papers
D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “MiniBit: Bit-width opti-
mization via affine arithmetic”, submitted to ACM/IEEE Design Automation
Conference, 2005.
D. Lee, A. Abdul Gaffar, O. Mencer and W. Luk, “Adaptive range reduction for
hardware function evaluation”, In Proceedings of IEEE International Conference
on Field-Programmable Technology (FPT), pages 169-176, Brisbane, Australia,
Dec 2004.
D. Lee, “Gaussian noise generation for Monte Carlo simulations in hardware”, In
Proceedings of The Korean Scientists and Engineers Association in the UK 30th
Anniversary Conference, pages 182-185, London, UK, Sep 2004.
D. Lee, O. Mencer, D.J. Pearce and W. Luk, “Automating optimized table-
with-polynomial function evaluation for FPGAs”, In Proceedings of International
Conference on Field Programmable Logic and its Applications (FPL), pages 364-
373, LNCS 3203, Springer-Verlag, Antwerp, Belgium, Aug 2004.
D. Lee, W. Luk, C. Wang, C. Jones, M. Smith and J.D. Villasenor, “A flexible
hardware encoder for low-density parity-check codes”, In Proceedings of IEEE
Symposium on Field-Programmable Custom Computing Machines (FCCM), pages
101-111, Napa Valley, USA, Apr 2004.
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “Hierarchical segmentation
schemes for function evaluation”, In Proceedings of IEEE International Confer-
ence on Field-Programmable Technology (FPT), pages 92-99, Tokyo, Japan, Dec
2003.
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “Hardware function eval-
uation using non-linear segments”, In Proceedings of International Conference
on Field Programmable Logic and its Applications (FPL), pages 796-807, LNCS
2778, Springer-Verlag, Lisbon, Portugal, Sep 2003.
D. Lee, W. Luk, J.D. Villasenor and P.Y.K. Cheung, “A hardware Gaussian noise
generator for channel code evaluation”, In Proceedings of IEEE Symposium on
Field-Programmable Custom Computing Machines (FCCM), pages 69-78, Napa
Valley, USA, Apr 2003.
D. Lee, T.K. Lee, W. Luk and P.Y.K. Cheung, “Incremental programming for re-
configurable engines”, In Proceedings of IEEE International Conference on Field-
Programmable Technology (FPT), pages 411-415, Shatin, Hong Kong, Dec 2002.
CHAPTER 1
Introduction
1.1 Objectives and Contributions
The objective of this thesis is to explore hardware designs for function evaluation,
Gaussian noise generation and Low-Density Parity-Check (LDPC) code encoding.
Our main contributions are:
• Methodology for the automation of function evaluation unit design, cov-
ering table look-up, table-with-polynomial and polynomial-only methods
(Chapter 3).
• Framework for adaptive range reduction based on a parametric function
evaluation library, on function approximation by polynomials and tables,
and on pre-computing all possible input and output ranges (Chapter 4).
• Efficient hierarchical segmentation method based on piecewise polynomial
approximations suitable for non-linear compound functions, which involves
uniform segments and segments with size varying by powers of two (Chap-
ter 5).
• Hardware Gaussian noise generator based on the Box-Muller method and
the central limit theorem capable of producing 133 million samples per sec-
ond with 10% resource usage on a Xilinx XC2V4000-6 FPGA (Chapter 6).
• Hardware Gaussian noise generator based on the Wallace method capable
of producing 155 million samples per second with 3% resource usage on a
Xilinx XC2V4000-6 FPGA (Chapter 7).
• Design parameter optimization for software implementations of the Wallace
method to reduce correlations and execution time (Chapter 8).
• Linear complexity hardware encoder for regular and irregular LDPC codes
with an efficient architecture for storing and performing computation on
sparse matrices (Chapter 9).
The most exciting contribution of this thesis is perhaps the hierarchical seg-
mentation method presented in Chapter 5. It is a systematic method for pro-
ducing fast and efficient hardware function evaluators for both compound and
elementary functions using piecewise polynomial approximations with a novel
hierarchical segmentation scheme. This method is particularly useful for approximating
non-linear functions or curves, using significantly less memory than the
traditional uniform segmentation approach. Depending on the function and precision,
the memory requirements can be reduced by several orders of magnitude.
We believe that numerous applications can benefit from our approach, including
data compression, function evaluation, non-linear filtering, pattern recognition
and picture processing.
Although the designs in this thesis target FPGA technology, we believe that
our methods are generic enough to be applied across different implementation
technologies such as ASICs. FPGAs are simply used as a platform to demonstrate
that our ideas can be efficiently mapped into hardware.
Figure 1.1 illustrates how the various chapters in this thesis are related to
each other. The chapters on function evaluation are 3, 4 and 5. The chapters on
LDPC coding are 6, 7, 8 and 9. Within the LDPC coding framework, Chapters
6, 7 and 8 are on Gaussian noise generation, which is needed for exploring LDPC
[Diagram: Chapter 1 (Introduction) and Chapter 2 (Background) lead into two strands. Function Evaluation comprises Chapter 3 (Automating Function Evaluation), Chapter 4 (Range Reduction) and Chapter 5 (Hierarchical Segmentation). LDPC Coding comprises Gaussian Noise Generation, covering Chapter 6 (Box-Muller Method), Chapter 7 (Wallace Method) and Chapter 8 (Wallace Optimization), together with Chapter 9 (LDPC Encoding). All strands converge on Chapter 10 (Conclusions).]
Figure 1.1: Relations of the chapters in this thesis.
code behavior in hardware. The Box-Muller method in Chapter 6 requires the
evaluation of functions and uses a variant of the hierarchical segmentation method
presented in Chapter 5.
The rest of this chapter provides historical information and an overview of
the material in Chapters 3 to 9. Chapter 2 covers background material and
previous work. Chapter 3 describes a methodology for the automation of el-
ementary function evaluation unit design. Chapter 4 presents a framework for
adaptive range reduction based on a parametric elementary function evaluation li-
brary. Chapter 5 presents an efficient hierarchical segmentation method suitable
for non-linear compound functions. Chapter 6 describes a hardware Gaussian
noise generator based on the Box-Muller method and the central limit theorem.
Chapter 7 presents a hardware Gaussian noise generator based on the Wallace
method. Chapter 8 analyzes correlations that can occur in the Wallace method,
and examines parameters to reduce correlations and execution time for software
implementations. Chapter 9 describes an efficient hardware encoder with linear
encoding complexity for both regular and irregular LDPC codes, and Chapter 10
offers conclusions and future work.
1.2 Computer Arithmetic
Arithmetic has played important roles in human civilization, especially in the
areas of science, engineering and technology. Machine arithmetic can be traced
back as early as 500 BC in the form of the abacus used in China. Many numerically
intensive applications, such as signal processing, require rapid execution
of arithmetic operations. The evaluation of functions is often the performance
bottleneck of many compute-bound applications. Examples of these functions
include elementary functions such as log(x) and √x, and compound functions
such as √(−log(x)) and x log(x). Computing these functions quickly and accurately
is a major goal in computer arithmetic. For instance, over 60% of the total run
time is devoted to function evaluation operations in a simulation of a jet engine
reported by O’Grady and Wang [133].
Recent studies have shown the increasing importance of these mathematical
functions in a wide variety of applications, including 3D computer graphics,
animation, scientific computing, artificial neural networks, digital signal
processing and multimedia. Software implementations are often too slow for
numerically intensive or real-time applications. The increasing speed and perfor-
mance constraints of such applications have led to the development of new ded-
icated hardware for the computation of these operations, providing high-speed
solutions implemented in coprocessors, graphic cards, Digital Signal Processors
(DSPs), Application-Specific Integrated Circuits (ASICs), Field-Programmable
Gate Arrays (FPGAs) [122] and numerical processors in general.
1.3 Error Correcting Coding and LDPC Codes
Error correcting coding (ECC) is a critical part of modern communications sys-
tems, where it is used to detect and correct errors introduced during a transmis-
sion over a channel [11], [126]. It relies on transmitting the data in an encoded
form, such that the redundancy introduced by the coding allows a decoding de-
vice at the receiver to detect and correct errors. In this way, no request for
retransmission is required, unlike systems which only detect errors (usually by
means of a checksum transmitted with the data). In many applications, a sub-
stantial portion of the baseband signal processing is dedicated to ECC. The wide
range of ECC applications [30] include space and satellite communications, data
transmission, data storage and mobile communications.
NASA’s space missions including Galileo, Odyssey, Rovers and Voyager would
not have been possible without the use of ECC [71]. Odyssey, NASA’s Mars
spacecraft, currently boasts the highest data transmission rate at 128,000 bits per
second via a radio link. However, for future space missions NASA are planning to
use optical communications via laser beams [60]. The new laser will beam back
between one million and 30 million bits per second, depending on the distance
between Mars and Earth [119]. Projects like this provide great challenges to
implement high-speed and low-power ECC systems with good error correcting
performance in deep space.
In 1948, Claude Shannon founded the field of study “Information Theory”
which is the basis of modern ECC with his discovery of the noisy channel cod-
ing theorem [164]. The theoretical contribution of Shannon’s work was a useful
definition of “information” and several “channel coding theorems” which gave ex-
plicit upper bounds, called the channel capacity, on the rate at which information
could be transmitted reliably on a given communication channel. In the context
of our work, the result of primary interest is the “noisy channel coding theorem
for continuous channels with average power limitations”. This theorem states
that the capacity C (which is now known as the Shannon limit) of a bandlimited
additive white Gaussian noise (AWGN) channel with bandwidth W , a channel
model that approximately represents many practical digital communication and
storage systems, is given by
C = W log2(1 + Es/N0) bits per second (bps) (1.1)
where Es is the average signal energy in each signaling interval of duration
T = 1/W , and N0/2 is the two-sided noise power spectral density. Perfect
Nyquist signalling is assumed. The proof of this theorem demonstrates that for
any transmission rate R less than or equal to the channel capacity C, there exists
a coding scheme that achieves an arbitrarily small probability of error; conversely,
if R is greater than C, no coding scheme can achieve reliable performance. Since
this theorem was published, an entire field of study has grown out of attempts
to design coding schemes that approach the Shannon limit of various channels.
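As a quick numerical illustration of Equation (1.1) (a sketch only, not part of the thesis tooling), the capacity of a bandlimited AWGN channel can be computed directly:

```python
import math

def shannon_capacity(bandwidth_hz, es_over_n0):
    """Channel capacity C = W * log2(1 + Es/N0) of a bandlimited AWGN
    channel (Equation 1.1), in bits per second."""
    return bandwidth_hz * math.log2(1.0 + es_over_n0)

# A 1 MHz channel at Es/N0 = 3 (about 4.8 dB) supports reliable
# transmission at up to W * log2(4) = 2 Mbps.
print(shannon_capacity(1e6, 3.0))  # 2000000.0
```

Any rate R below this value is achievable by some coding scheme; no scheme can operate reliably above it.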
In the past few years, LDPC codes have received much attention because
of their excellent performance, and have been widely considered as the most
promising candidate ECC scheme for many applications in telecommunications
and storage devices [132], [8]. LDPC codes were first proposed by Gallager in
1962 [48], [49]. He defined an (n, dv, dc) LDPC code as a code of block length
n in which each column of the parity-check matrix contains dv ones and each
row contains dc ones. Due to the regular structure (uniform column and row
weight) of Gallager’s codes, they are now called regular LDPC codes. Gallager
provided simulation results for codes with block lengths of the order of hundreds
of bits. The results indicated that LDPC codes have very good potential for error
correction. However, the high storage and computation requirements stalled
the research on LDPC codes. After the discovery of Turbo codes by Berrou et
al. in 1993 [7], MacKay [110] re-established the interest in LDPC codes during
the mid to late 1990s.
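Gallager's regularity condition is easy to state in code. The following minimal sketch checks the uniform column and row weights that define an (n, dv, dc) regular LDPC code; the hand-made (6, 2, 3) matrix is a toy example, far smaller than any practical code:

```python
# Gallager's (n, dv, dc) regular LDPC code: every column of the
# parity-check matrix H contains dv ones and every row contains dc ones.
H = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
]

def is_regular(H, dv, dc):
    """Check the uniform row and column weights of a regular code."""
    rows_ok = all(sum(row) == dc for row in H)
    cols_ok = all(sum(col) == dv for col in zip(*H))
    return rows_ok and cols_ok

print(is_regular(H, dv=2, dc=3))  # True
```

Irregular LDPC codes relax exactly this constraint, allowing the column and row weights to follow a degree distribution instead.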
1.4 Overview of our Approach
1.4.1 Function Evaluation
The evaluation of elementary functions is at the core of many compute-intensive
applications [133] which perform well on reconfigurable platforms. Yet, in or-
der to implement function evaluation efficiently, the FPGA programmer has to
choose between many function evaluation methods such as table look-up, polyno-
mial approximation, or table look-up combined with polynomial approximation.
We present a methodology and a partially automated implementation to select
the best function evaluation hardware for a given function, accuracy require-
ment, technology mapping and optimization metrics, such as area, throughput
or latency. The automation of function evaluation unit design is combined with
ASC [123], A Stream Compiler, for FPGAs. On the algorithmic side, we use
MATLAB to design approximation algorithms with polynomial coefficients and
minimize bitwidths. On the hardware implementation side, ASC provides par-
tially automated design space exploration. We illustrate our approach for sin(x),
log(1 + x) and 2x, which are commonly used in a variety of applications. We
provide a selection of graphs that characterize the design space with various di-
mensions, including accuracy, precision and function evaluation method. We also
demonstrate design space exploration by implementing more than 400 distinct
designs.
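As a rough software analogue of the table-with-polynomial approach (a sketch only: the generated hardware uses fixed-point arithmetic with optimized bitwidths, and the coefficients come from MATLAB minimax fits rather than endpoint interpolation), a degree-1 segment table for sin(x) on [0, π/2) can be built as follows:

```python
import math

SEGMENTS = 16  # table depth; more segments -> lower error, larger table
WIDTH = (math.pi / 2) / SEGMENTS

# Precompute degree-1 coefficients per uniform segment:
# f(x) ~ c0 + c1 * (x - x0), via simple endpoint interpolation.
TABLE = []
for i in range(SEGMENTS):
    x0 = i * WIDTH
    c0 = math.sin(x0)
    c1 = (math.sin(x0 + WIDTH) - c0) / WIDTH
    TABLE.append((c0, c1))

def sin_approx(x):
    i = min(int(x / WIDTH), SEGMENTS - 1)  # table look-up (segment index)
    c0, c1 = TABLE[i]
    return c0 + c1 * (x - i * WIDTH)       # one multiply-add

err = max(abs(sin_approx(k / 1000) - math.sin(k / 1000))
          for k in range(0, 1571))
print(err < 1e-2)  # True
```

The design-space trade-off explored in Chapter 3 is visible even here: a pure table needs many entries for the same error, while a pure polynomial needs a higher degree.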
The evaluation of a function f(x) typically consists of range reduction which
transforms the input into a small interval, and the actual function evaluation
on the small interval. We investigate optimization of range reduction given the
range and precision of x and f(x). For every function evaluation there exists
a convenient interval such as [0, π/2) for sin(x). An example of the adaptive
range reduction method, which we propose in our work, introduces another larger
interval for which it makes sense to skip range reduction. The decision depends
on the function being evaluated, precision, and optimization metrics such as area,
latency and throughput. In addition, the input and output range has an impact
on the choice of function evaluation method such as polynomial, table based, or
combinations of the two. We explore this vast design space of adaptive range
reduction for fixed-point sin(x), log(x) and √x accurate to one unit in the last
place (ulp) using MATLAB and ASC. These tools enable us to study over 1000
designs resulting in over 40 million Xilinx equivalent circuit gates, in a few hours’
time. The final objective is to progress towards a fully automated library that
provides optimal function evaluation hardware units given input and output range
and precision. Our design flow for evaluating elementary functions is illustrated
in Figure 1.2.
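The idea of a convenient interval can be illustrated with a small sketch (floating point here, whereas the thesis targets fixed-point hardware): any non-negative argument of sin(x) is folded into [0, π/2), and the sign and reflection are recovered from the quadrant index:

```python
import math

def sin_range_reduced(x, sin_core=math.sin):
    """Evaluate sin(x) for x >= 0 using a core routine that only needs
    to handle the convenient interval [0, pi/2)."""
    k = int(x // (math.pi / 2))   # quadrant index
    r = x - k * (math.pi / 2)     # reduced argument in [0, pi/2)
    quadrant = k % 4
    if quadrant == 0:
        return sin_core(r)
    if quadrant == 1:
        return sin_core(math.pi / 2 - r)   # reflect
    if quadrant == 2:
        return -sin_core(r)                # negate
    return -sin_core(math.pi / 2 - r)      # negate and reflect

print(abs(sin_range_reduced(5.0) - math.sin(5.0)) < 1e-12)  # True
```

Adaptive range reduction asks when this folding step is worth its hardware cost: if the input range is already small, skipping it can save area and latency.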
Compound functions often have non-linear properties, hence sophisticated
approximation techniques are needed. We present a method for evaluating such
functions based on piecewise polynomial approximation with a novel hierarchical
segmentation scheme. The use of hierarchical schemes of uniform segments and
segments with size varying by powers of two enables us to approximate non-
linear regions of a function particularly well. This partitioning is automated:
efficient look-up tables and their coefficients are generated for a given function,
input range, degree of the polynomials, desired accuracy and finite precision
constraints. Parameterized reference design templates are provided for various
predefined hierarchical schemes. We describe an algorithm to find the optimum
[Diagram: during library construction, the user supplies the function f(x), input format and method; f(x) is approximated in MATLAB, a Perl-script library generator emits ASC code into the function evaluation library (ASCLib), and the ASC hardware compiler produces the FPGA implementations. During library usage, the user draws on the generated library directly.]
Figure 1.2: Design flow for evaluating elementary functions.
number of segments and the placement of their boundaries, which is used to an-
alyze the properties of a function and to benchmark our hierarchical approach.
Our method is illustrated using four non-linear compound and elementary
functions: √(−log(x)), x log(x), a high order rational function and cos(πx/2). We
present results for various operand sizes between 8 and 24 bits for first and sec-
ond order polynomial approximations. For 24-bit data, our method requires a
look-up table 12 times smaller than that of the symmetric table addition method.
Our framework for the hierarchical segmentation method is shown in Figure 1.3.
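A simplified illustration of powers-of-two segmentation (a sketch of the idea only; the actual HSM address calculation of Chapter 5 combines P2S and uniform levels): if segment widths shrink by powers of two toward zero, the segment address of an integer input is just the position of its most significant one bit, which a priority encoder computes in hardware in a single cycle:

```python
def p2s_address(x):
    """Powers-of-2 segment (P2S) address of a non-negative integer input:
    segments halve in width toward zero, so the address is the position
    of the most significant one bit (0 for x == 0)."""
    return x.bit_length()

# For 8-bit inputs: [128, 255] -> segment 8, [64, 127] -> segment 7,
# ..., [1, 1] -> segment 1, giving fine segments near zero where
# functions such as sqrt(-log(x)) are most non-linear.
print(p2s_address(200), p2s_address(70), p2s_address(1))  # 8 7 1
```

This is why the hierarchical scheme needs no stored segment-boundary table: the boundaries are implicit in the bit pattern of the input.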
1.4.2 Gaussian noise generation
Evaluations of LDPC codes are based on computer simulations which can be time
consuming, particularly when the behavior at low bit error rates (BERs) in the
error floor region is being studied [57]. Tremendous efforts have been devoted
[Diagram: user input drives the Hierarchical Function Segmenter, which produces a data file; a design generator combines this with a reference design library, and synthesis followed by place and route yields the hardware together with its reports.]
Figure 1.3: Design flow for evaluating non-linear functions using the hierarchical
segmentation method.
to analyzing and improving their error-correcting performance, but little consideration
has been given to practical LDPC codec hardware implementations. If
the binary Hamming distance [148] between all combinations of codewords (the
distance spectrum) is known, then analytic techniques for describing the performance
of the codes in the presence of noise are available. However, in the case of
capacity achieving random linear codes (such as LDPC codes), the problem of
finding the distance spectrum of the code is intractable and researchers resort to
the use of Monte Carlo simulation in order to characterize various code constructions
in terms of BER versus signal-to-noise ratio (SNR). At very low SNRs, errors
occur often and a sufficient statistic can be gathered readily within a PC. However
at higher SNRs where errors occur rarely, the situation is different. Thorough
characterization of a code in this region may require simulation of 1010−1012 code
symbols, and computer based simulations provide inadequate means of finding
statistically sufficient set of error events, which can take several weeks.
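The scale of the problem is visible even in a toy Monte Carlo BER experiment for uncoded BPSK over AWGN (a sketch, unrelated to the thesis's hardware framework): at a bit error rate of 10^-9, roughly 10^10 bits must be simulated to observe a handful of errors.

```python
import math
import random

def estimate_ber(snr_db, num_bits, seed=0):
    """Monte Carlo BER estimate for uncoded BPSK over an AWGN channel.
    At high SNR the error count per simulated bit collapses, which is
    why software simulation of low-BER regimes is so slow."""
    rng = random.Random(seed)
    # Noise standard deviation for unit-energy symbols at the given Es/N0.
    sigma = math.sqrt(1.0 / (2.0 * 10 ** (snr_db / 10.0)))
    errors = 0
    for _ in range(num_bits):
        bit = rng.randrange(2)
        tx = 1.0 if bit else -1.0
        rx = tx + rng.gauss(0.0, sigma)   # AWGN channel
        if (rx > 0.0) != bool(bit):       # hard-decision detector
            errors += 1
    return errors / num_bits

ber = estimate_ber(snr_db=4.0, num_bits=100_000)  # theory: ~1.2e-2
```

At 4 dB this loop finds errors easily; at the SNRs of interest for LDPC error floors, the same loop would run for weeks, which motivates the hardware simulation framework below.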
Hardware based simulation offers the potential of speeding up code evaluation
by several orders of magnitude [99]. Such a simulation framework consists of three
main blocks: encoder, noise channel and decoder, where the noise channel is
generally modeled by Gaussian noise. Our LDPC code simulations are run on
a reconfigurable engine, which consists of a PC and a reconfigurable hardware
platform [85]. The reconfigurable hardware platform we use is a Xilinx Virtex-II
FPGA prototyping board from Nallatech [131] shown in Figure 1.4. It consists
of two Xilinx Virtex-II XC2V4000-6 FPGAs and 4MB of SRAM. The board can
be connected to a PC via the PCI bus or USB. The grey wires are connected to a
logic analyzer for debugging purposes. A block diagram of our LDPC simulation
framework is provided in Figure 1.5. The LDPC encoder follows an algorithm
suggested in [152]. Our noise generator block improves the overall value of the
system as a Monte Carlo simulator, since noise quality at high SNRs (tails of the
Gaussian distribution) is essential. Since the LDPC decoding process is iterative
and the number of required iterations is non-deterministic, a flow control buffer
is used to greatly increase the throughput of the overall system.
We present two methods for generating Gaussian noise. The first is based on
the Box-Muller method [13] and the central limit theorem [78], which involve the
computation of two functions: √(−ln(x)) and cos(2πx). The accuracy and speed
in computing these functions are essential for generating high-quality Gaussian
noise samples rapidly. The use of non-uniform segments enables us to approxi-
mate non-linear regions of a function particularly well. The appropriate segment
address for a given function can be rapidly calculated at run time by a simple
combinatorial circuit. Scaling factors are used to deal with large polynomial coef-
ficients and to trade precision for range. Our function evaluator is based on first
order polynomials, and is suitable for applications requiring high performance
with small area, at the expense of accuracy. We exploit the central limit theo-
rem to overcome quantization and approximation errors. An implementation at
133MHz on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 10% of the device
and produces 133 million samples per second, which is seven times faster than a
2.6GHz Pentium 4 PC.
Figure 1.4: The BenONE board from Nallatech used to run our LDPC simulation
experiments.
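The Box-Muller transform and central-limit smoothing described above can be sketched in software as follows (this sketch uses the standard √(−2 ln u) form in floating point, whereas the hardware evaluates the functions with fixed-point piecewise polynomials):

```python
import math
import random

def box_muller(rng):
    """One Box-Muller output: two uniforms -> one Gaussian sample, via
    the two functions the hardware must evaluate, a square root of a
    logarithm and a cosine."""
    u1 = 1.0 - rng.random()   # shift into (0, 1] to avoid ln(0)
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def clt_sample(rng, k=4):
    """Central-limit smoothing: averaging k outputs (rescaled back to
    unit variance) masks small quantization/approximation errors."""
    return sum(box_muller(rng) for _ in range(k)) / math.sqrt(k)

rng = random.Random(1)
samples = [clt_sample(rng) for _ in range(20_000)]
mean = sum(samples) / len(samples)
var = sum(s * s for s in samples) / len(samples) - mean * mean
```

In the hardware, the accuracy of the two function evaluators directly limits how faithfully the tails of the distribution are reproduced, which is why the segmentation method of Chapter 5 matters here.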
The second method is based on the Wallace method [180]. Wallace proposed
a fast algorithm for generating normally distributed pseudo-random numbers
which generates the target distributions directly using their maximal-entropy
properties. This algorithm is particularly suitable for high throughput hardware
implementation since no transcendental functions such as √x, log(x) or sin(x)
are required. The Wallace method starts with a pool of normally distributed
random numbers; through a series of transformation steps, a new pool of
normally distributed random numbers is generated. An implementation
running at 155MHz on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 3% of the
device and produces 155 million samples per second.
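The pool-to-pool idea can be illustrated with a simplified Wallace-style pass (a sketch, not Wallace's exact algorithm: here a random permutation followed by normalized 4 × 4 Hadamard transforms, which are orthogonal and therefore preserve both the sum of squares and the normal distribution):

```python
import random

def wallace_step(pool, rng):
    """One simplified Wallace-style pass over a pool whose length is a
    multiple of 4: mix random 4-element groups with an orthogonal
    (Hadamard) transform, producing new normal variates without
    evaluating any transcendental function."""
    idx = list(range(len(pool)))
    rng.shuffle(idx)                      # random grouping of the pool
    out = [0.0] * len(pool)
    for g in range(0, len(idx), 4):
        a, b, c, d = (pool[idx[g + j]] for j in range(4))
        h = 0.5  # 1/2 makes the 4x4 Hadamard matrix orthonormal
        out[idx[g]]     = h * (a + b + c + d)
        out[idx[g + 1]] = h * (a - b + c - d)
        out[idx[g + 2]] = h * (a + b - c - d)
        out[idx[g + 3]] = h * (a - b - c + d)
    return out

rng = random.Random(0)
pool = [rng.gauss(0, 1) for _ in range(1024)]
new_pool = wallace_step(pool, rng)
# The orthogonal transform preserves the pool's energy (sum of squares):
print(abs(sum(x * x for x in pool) - sum(x * x for x in new_pool)) < 1e-6)
```

Because each output is a linear combination of recent pool values, successive pools are not fully independent, which is the source of the correlations examined in Chapter 8.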
[Diagram: a data source feeds the LDPC encoder, driven by a code definition; the Gaussian noise generator, driven by the SNR setting, corrupts the codewords, which pass through a flow control buffer to the LDPC decoder, driven by the same code definition; the decoder output is compared against the source data and errors are recorded.]
Figure 1.5: Our LDPC hardware simulation framework.
The outputs of the two noise generators accurately model a true Gaussian
PDF even at large multiples of σ (the tails of the Gaussian distribution). Their
properties are explored using: (a) several different statistical tests, including
the chi-square test and the Anderson-Darling test [32], and (b) an application for
decoding of LDPC codes. Although the Wallace design has smaller area and is
faster than the Box-Muller design, it has slight correlations between successive
transformations, which may be undesirable for certain types of simulations. We
examine design parameter optimizations to reduce such correlations.
[Diagram: the H matrix is brought into ALT form by a software preprocessor; the hardware encoder then turns message blocks into codewords.]
Figure 1.6: LDPC encoding framework.
1.4.3 LDPC Encoding
We describe a flexible hardware encoder for regular and irregular Low-Density
Parity-Check (LDPC) codes. Although LDPC codes achieve better performance
and lower decoding complexity than Turbo codes, a major drawback is their
apparently high encoding complexity: whereas Turbo codes can be encoded in
linear time, a straightforward implementation for an LDPC code has complexity
quadratic in the block length due to dense matrix-vector multiplication. Using
an efficient encoding method proposed by Richardson and Urbanke [152], we
present a hardware LDPC encoder with linear encoding complexity. The encoder
is flexible, supporting arbitrary H matrices, rates and block lengths. We develop
a software preprocessor to bring the parity-check matrix H into an approximate
lower triangular form. A hardware architecture with an efficient memory organi-
zation for storing and performing computations on sparse matrices is proposed.
An implementation of a rate-1/2, length-2000 irregular LDPC code encoder
on a Xilinx Virtex-II XC2V4000-6 FPGA takes up 4% of the device. It runs at
143MHz and has a throughput of 45 million codeword bits per second (or 22 mil-
lion information bits per second) with a latency of 0.18ms. An implementation
of 16 instances of the encoder on the same device at 82MHz is capable of 410 mil-
lion codeword bits per second, 80 times faster than an Intel Pentium 4 2.4GHz
PC. The design flow of our LDPC encoder is illustrated in Figure 1.6. This
block is placed in front of the noise generator in our LDPC simulation framework
(Figure 1.5).
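The core encoder operation, sparse matrix-vector multiplication over GF(2), can be sketched as follows (the edge-list row storage loosely mirrors the hardware's memory organization for sparse matrices; the small H is illustrative only):

```python
def gf2_sparse_mvm(rows, v):
    """Sparse matrix-vector product over GF(2). Each row is stored as
    the column positions of its ones, so multiplication reduces to
    XORing the selected bits of v -- the operation the encoder's
    memory organization is built around."""
    return [sum(v[j] for j in row) & 1 for row in rows]

# H * x^T for a small sparse H and a bit vector x:
H_rows = [[0, 2, 5], [1, 3, 4], [0, 1, 5], [2, 3, 4]]
x = [1, 0, 1, 1, 0, 1]
print(gf2_sparse_mvm(H_rows, x))  # [1, 1, 0, 0]
```

Because only the positions of the ones are stored, both the memory footprint and the work per row scale with the number of edges rather than the block length squared, which is what makes linear-complexity encoding possible.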
CHAPTER 2
Background
2.1 Introduction
The purpose of this chapter is to present the background material and related
work of this thesis. Section 2.2 introduces the basics of FPGAs and the design
tools used in this thesis. Section 2.3 introduces six of the most popular methods
for approximating functions and the existing work. Section 2.4 discusses various
issues related to function evaluation, such as range reduction. Section 2.5 presents
different ways of generating Gaussian noise and explores the existing work in this
area. Finally, Section 2.6 introduces the basics of LDPC codes, LDPC encoding,
describes Richardson and Urbanke’s (RU) method for efficiently encoding LDPC
codes and looks at previous work on hardware related issues on LDPC codes.
2.2 FPGAs
2.2.1 Introduction
Field-Programmable Gate Arrays (FPGAs) have long been used for glue logic
and prototyping. More recently, they are being used for many real-life appli-
cations including communications [93], encryption [173], video image process-
ing [168], [175], medical imaging [72], network security [96] and numerical
computations [104].
Figure 2.1: Simplified view of a Xilinx logic cell. A single slice contains 2.25 logic
cells.
FPGAs can potentially approach the execution speed of application specific
hardware with the rapid programming time of microprocessors. In recent years,
the size of FPGAs has followed Moore’s law: the number of logic gates doubles
every 18 months. FPGAs can exploit improvements following Moore’s law better
than microprocessors because of their simpler and more regular structure.
The fundamental building block of Xilinx FPGAs is the logic cell [118]. A logic
cell comprises a 4-input look-up table (which can also act as a 16× 1 RAM or a
16-bit shift register), a multiplexer and a register. A simplified view of a logic cell
is depicted in Figure 2.1. Two logic cells are paired together in an element called
a slice. A slice contains additional resources such as multiplexers and carry logic
to increase the efficiency of the architecture. These extra resources are equivalent
to having more logic cells, and therefore a slice is counted as being equivalent to
2.25 logic cells. Recent-generation reconfigurable hardware has a large number
of slices. For instance, the Xilinx Virtex-II XC2V4000-6 has 23040 slices.
The architecture of a typical FPGA is illustrated in Figure 2.2. In general,
Figure 2.2: Architecture of a typical FPGA.
an FPGA will have an array of configurable logic blocks (which contain two
or four slices depending on the FPGA family), programmable wires, and pro-
grammable switches to realize any function out of the logic blocks and implement
any interconnection topology. Programming is done using one of the many popular
technologies such as SRAM cells, Antifuses, EPROM transistors and EEPROM
transistors. In addition to logic blocks, state-of-the-art FPGAs such as the Xilinx
Virtex-II or Virtex-4 devices contain embedded hardware elements for memory,
multiplication, multiply-and-add and even a number of hard microprocessor cores
(such as the IBM PowerPC) [189].
The long IC fabrication time is completely eliminated for these devices and
design realization times are only a few hours. The idea of user-programmability is
very attractive; most ASIC vendors now prefer FPGAs for low-cost prototyping and
fine tuning of designs before fabrication. Also, from a marketing point of view, the
FPGA technology allows quick product announcements, which is commercially
attractive. The two major FPGA vendors are Altera and Xilinx. A good review
on configurable computing and FPGAs is given in [28].
2.2.2 Design Tools
The following three FPGA design tools are used for the implementations pre-
sented in this thesis:
• ASC [123], A Stream Compiler for FPGAs, adopts C++ custom types and
operators to provide a programming interface at the algorithmic, architectural,
arithmetic and gate levels. As a unique feature,
all levels of abstraction are accessible from C++. This enables the user
to program on the desired level for each part of the application. Semi-
automated design space exploration further increases design productivity,
while supporting optimization at all available levels of abstraction. Object-
oriented design enables efficient code-reuse; ASC includes an integrated
arithmetic unit generation library, PAM-Blox II [121], which in turn builds
upon the PamDC [137] gate library. The elementary function evaluation
units in Chapters 3 and 4 are implemented with this tool.
• Handel-C [21] is based on ANSI-C with extensions to support flexible
width variables, signals, parallel blocks, bit-manipulation operations and
channel communication. A distinctive feature is that timing of the com-
piled circuit is fixed at one cycle per C assignment. This makes it easy for
programmers to know in which cycle a statement will be executed at the
expense of reducing the scope for optimization. It gives application devel-
opers the ability to schedule hardware resources manually, and Handel-C
tools generate the resulting designs automatically. The ideas of Handel-C
are based on work by Page and Luk in compiling Occam into FPGAs [134].
The Gaussian noise generator using the Box-Muller method in Chapter 6
is implemented with this tool.
• Xilinx System Generator [188] is a plug-in to the MATLAB Simulink
software [117] and provides bit-accurate models of FPGA circuits. It au-
tomatically generates synthesizable VHDL or Verilog code including a
testbench. Other unique capabilities include MATLAB m-code compila-
tion, fast system-level resource estimation, and high-speed hardware co-
simulation interfaces, both a generic JTAG interface [31] and PCI based co-
simulation for FPGA hardware platforms. The Xilinx Blockset in Simulink
enables bit-true and cycle-true modeling, and includes common parameter
blocks such as finite impulse response (FIR) filter, fast Fourier transform
(FFT), logic gates, adders, multipliers, RAMs, etc. Moreover, most of these
blocks utilize the Xilinx cores, which are highly optimized for Xilinx devices.
The function evaluator using the hierarchical segmentation method (HSM)
in Chapter 5, the Gaussian noise generator using the Wallace method in
Chapter 7, and the LDPC encoder in Chapter 8 are implemented with this
tool.
ASC designs are synthesized with PAM-Blox II and all others with Synplicity
Synplify Pro (versions 7.3 ∼ 7.5). Place-and-route for all designs is performed
with Xilinx ISE (versions 6.0 ∼ 6.2).
2.3 Function Evaluation Methods
Many FPGA applications including digital signal processing, computer graphics
and scientific computing require the evaluation of elementary or special purpose
functions. For applications that require low precision approximation at high
speeds, full look-up tables are often employed. However, this becomes imprac-
tical for precisions higher than a few bits, because the size of the table grows
exponentially with respect to the input size. Six well-known methods are de-
scribed below, which are better suited to high precision.
2.3.1 CORDIC
CORDIC is an acronym for COordinate Rotations DIgital Computer and offers
the opportunity to calculate desired functions in a rather simple and elegant way.
The CORDIC algorithm was first introduced by Volder [178] for the computation
of trigonometric functions, multiplication, division and data type conversion, and
later generalized to hyperbolic functions by Walther [182]. It has found its way
into diverse applications including the 8087 math coprocessor [38], the HP-35
calculator, radar signal processors and robotics.
It is based on simple iterative equations, involving only shift and add opera-
tions and was developed in an effort to avoid the time consuming multiply and
divide operations. The general CORDIC algorithm consists of the following three
iterative equations:
\[
\begin{aligned}
x_{k+1} &= x_k - m\,\delta_k y_k 2^{-k}\\
y_{k+1} &= y_k + \delta_k x_k 2^{-k}\\
z_{k+1} &= z_k - \delta_k \sigma_k
\end{aligned}
\]
The constants m, δk and σk depend on the specific computation being performed,
as explained below.
• m is either 0, 1 or −1. m = 1 is used for trigonometric and inverse trigono-
metric functions. m = −1 is used for hyperbolic, inverse hyperbolic, expo-
nential and logarithmic functions, as well as square roots. Finally, m = 0
is used for multiplication and division.
• δk is one of the following two signum functions:
\[
\delta_k = \operatorname{sgn}(z_k) = \begin{cases} 1, & z_k \ge 0 \\ -1, & z_k < 0 \end{cases}
\quad\text{or}\quad
\delta_k = -\operatorname{sgn}(y_k) = \begin{cases} 1, & y_k < 0 \\ -1, & y_k \ge 0 \end{cases}
\]
The first is often called the rotation mode, in which the z values are driven
to zero, whereas the second is the vectoring mode, in which the y values
are driven to zero. Note that δk requires nothing more than a comparison.
• The numbers σk are constants which depend on the value of m and are
stored in a table. For m = 1, σk = tan^{−1} 2^{−k}; for m = 0, σk = 2^{−k}; and for
m = −1, σk = tanh^{−1} 2^{−k}.
To use these equations, appropriate starting values x1, y1 and z1 must be given.
One of these inputs, say z1, might be the number whose hyperbolic sine we wish
to approximate, sinh(z1). In all cases, the starting values must be restricted to a
certain interval about the origin in order to ensure convergence. As the iterations
proceed, one of the variables tends to zero while another variable approaches the
desired approximation.
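As an illustration, the rotation mode with m = 1 can be sketched in floating-point software as below (the function name and iteration count are our own choices; a hardware implementation would use fixed-point arithmetic, where the multiplications by 2^{−k} become shifts):

```python
import math

def cordic_sincos(theta, iterations=32):
    """Approximate (cos(theta), sin(theta)) with CORDIC rotation mode (m = 1).

    theta must lie inside the convergence interval (about +/-1.74 rad).
    """
    # Precomputed rotation angles sigma_k = atan(2^-k).
    sigmas = [math.atan(2.0 ** -k) for k in range(iterations)]
    # Scale factor K = prod_k 1/sqrt(1 + 2^-2k); starting from (K, 0)
    # absorbs the gain of the un-normalized micro-rotations.
    K = 1.0
    for k in range(iterations):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * k))
    x, y, z = K, 0.0, theta
    for k in range(iterations):
        d = 1.0 if z >= 0 else -1.0                    # delta_k = sgn(z_k)
        x, y = x - d * y * 2.0 ** -k, y + d * x * 2.0 ** -k
        z -= d * sigmas[k]                             # drive z toward zero
    return x, y
```

Each iteration adds roughly one bit of accuracy, which is the linear convergence discussed below.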
The major disadvantage of the CORDIC algorithm is its linear convergence
resulting in an execution time which is linearly proportional to the number of
bits in the operands. In addition, CORDIC is limited to a relatively small set of
elementary functions. A comprehensive study of CORDIC algorithms on FPGAs
can be found in [3].
2.3.2 Digit-recurrence and On-line Algorithms
Digit-recurrence [41] and on-line algorithms [40] belong to the same type of meth-
ods for the approximation of functions in hardware, usually known as digit-by-
digit iterative methods, due to their linear convergence, which means that a fixed
number of bits of the result is obtained in each iteration. Implementations of this
type of algorithms are typically of low complexity, utilize small area and have rel-
atively large latencies. The fundamental choices in the design of a digit-by-digit
algorithm are the radix, the allowed digit values and the representation
of the partial remainder.
2.3.3 Bipartite and Multipartite Methods
The bipartite method, meaning that the table is divided into two parts, was
originally introduced by Das Sarma and Matula [159] with the aim of getting
accurate reciprocals. Improvements were suggested by Schulte and Stine [162],
[163], Muller [129], and generalizations from the bipartite to the multipartite
method are discussed by de Dinechin and Tisserand [34].
Assume an n-bit binary fixed-point system, and assume that n is a multiple
of 3 and n = 3k. We wish to design a table based implementation of function f .
A full look-up table would lead to a table of size n × 2n. Instead, we split the
input word x into three k-bit words x0, x1, and x2, that is,
\[
x = x_0 + x_1 2^{-k} + x_2 2^{-2k} \tag{2.1}
\]
where x0, x1 and x2 are multiples of 2^{−k} that are less than 1. The original bipartite
method consists in approximating the first order Taylor expansion
\[
f(x) = f(x_0 + x_1 2^{-k}) + x_2 2^{-2k} f'(x_0 + x_1 2^{-k}) + \frac{x_2^2\, 2^{-4k}}{2} f''(\xi), \tag{2.2}
\]
\[
\xi \in [x_0 + x_1 2^{-k},\, x]
\]
by
\[
f(x) = f(x_0 + x_1 2^{-k}) + x_2 2^{-2k} f'(x_0). \tag{2.3}
\]
That is, f(x) is approximated by the sum of two functions α(x0, x1) and β(x0, x2),
where
\[
\begin{aligned}
\alpha(x_0, x_1) &= f(x_0 + x_1 2^{-k})\\
\beta(x_0, x_2) &= x_2 2^{-2k} f'(x_0)
\end{aligned}
\]
The error of this approximation is roughly proportional to 2^{−3k}. Instead of di-
rectly tabulating function f , functions α and β are tabulated. Since they are
functions of 2k bits only, each of these tables has 2^{2n/3} entries. This results in
a total table size of 2n × 2^{2n/3} bits, which is a significant improvement over the
full look-up table. These methods basically exploit the symmetry of the Taylor
approximations and leading zeros in the table coefficients to reduce the look-up
table size. Although these methods yield significant improvements in table
size over direct table look-up, they can be inefficient for functions that are highly
non-linear [88].
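A small software sketch of the bipartite idea follows; the choice k = 4 and the dictionary representation of the tables are illustrative conveniences, not details from the cited papers:

```python
import math

K = 4                       # k bits per sub-word; input width n = 3k = 12 bits
STEP = 2.0 ** -K

def build_tables(f, fprime):
    """Tabulate alpha(x0, x1) = f(x0 + x1*2^-k) and
    beta(x0, x2) = x2 * 2^-2k * f'(x0).

    Each table has 2^(2k) entries instead of the 2^(3k) a full table needs.
    """
    alpha = {(i0, i1): f(i0 * STEP + i1 * STEP ** 2)
             for i0 in range(2 ** K) for i1 in range(2 ** K)}
    # x2 = i2 * 2^-k, so the beta coefficient is i2 * 2^-3k.
    beta = {(i0, i2): i2 * STEP ** 3 * fprime(i0 * STEP)
            for i0 in range(2 ** K) for i2 in range(2 ** K)}
    return alpha, beta

def bipartite_eval(alpha, beta, i0, i1, i2):
    """Approximate f at x = x0 + x1*2^-k + x2*2^-2k: two look-ups, one add."""
    return alpha[(i0, i1)] + beta[(i0, i2)]
```

Note that the evaluation needs no multiplier, which is what makes the method attractive in hardware.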
2.3.4 Polynomial Approximation
Polynomial approximation [58], [150] involves approximating a continuous func-
tion f with one or more polynomials p of degree d on a closed interval [a, b]. The
polynomials are of the form
\[
p(x) = c_d x^d + c_{d-1} x^{d-1} + \dots + c_1 x + c_0 \tag{2.4}
\]
and with Horner’s rule, this becomes
\[
p(x) = ((c_d x + c_{d-1})x + \dots)x + c_0 \tag{2.5}
\]
where x is the input. The aim is to minimize the distance ‖p − f‖. In our
work, we use minimax polynomial approximations, which involve minimizing the
maximum absolute error [128]. The distance for minimax approximations is:
\[
\|p - f\|_\infty = \max_{a \le x \le b} |f(x) - p(x)| \tag{2.6}
\]
Table 2.1: Maximum absolute and average errors for various first order polynomial
approximations to e^x over [−1, 1].
            Taylor   Legendre   Chebyshev   Minimax
Max. Error  0.718    0.439      0.372       0.279
Avg. Error  0.246    0.162      0.184       0.190
where [a, b] is the approximation interval. Many researchers rely on methods such
as Taylor series rather than minimizing the maximum absolute error. Table 2.1
shows the maximum and average errors of various first order polynomial approxi-
mations to ex over [−1, 1]. It can be seen that generally minimax gives the lowest
maximum error and Legendre provides the lowest average error. Therefore, when
low maximum absolute error is desired, minimax approximation should be used
(unless the polynomial coefficients are computed at run-time from stored func-
tion values [100]). The minimax polynomial is found in an iterative manner using
the Remez exchange algorithm [149], which is often used for determining optimal
coefficients for digital filters.
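The Horner scheme of Eq. (2.5) can be sketched as follows (the function name is ours):

```python
def horner(coeffs, x):
    """Evaluate p(x) = c_d x^d + ... + c_1 x + c_0 by Horner's rule.

    coeffs is ordered [c_d, c_{d-1}, ..., c_0], matching Eq. (2.5); a degree-d
    polynomial costs d multiplications and d additions, with no explicit
    powers of x.
    """
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc
```

For example, `horner([2.0, 3.0, 4.0], 5.0)` evaluates 2x² + 3x + 4 at x = 5 and returns 69.0. In hardware, the same recurrence maps naturally onto a chain of multiply-add units.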
Sidahao et al. [165] approximate functions over the whole interval with high
order polynomials. This polynomial-only approach has the advantage of low
memory requirements, but suffers from long latencies. In addition, it will not
generate acceptable results when the function is highly non-linear. Pineiro et
al. [147] divide the interval into several uniform segments. For each segment,
they store the second order minimax polynomial approximation coefficients, and
accumulate the partial terms in a fused accumulation tree. This scheme performs
well for the evaluation of elementary functions for moderate precisions (less than
24 bits).
2.3.5 Polynomial Approximation with Non-uniform Segmentation
Approximations using uniform segments are suitable for functions with relatively
linear regions, but are inefficient for non-linear functions, especially when the
function varies exponentially. It is desirable to choose the boundaries of the
segments to cater for the non-linearities of the function. Highly non-linear regions
will need smaller segments than linear regions. This approach minimizes the
amount of storage required to approximate the function, leading to more compact
and efficient designs. A number of techniques that utilize non-uniform segment
sizes to cater for such non-linearities have been proposed in the literature. Cantoni [18]
uses optimally placed segments and presents an algorithm to find such segment
boundaries. However, although this approach minimizes the number of segments
required, such arbitrarily placed segments are impractical for actual hardware
implementation, since the hardware circuitry to find the right segment for a
given input would be too complex. Combet et al. [27] and Mitchell Jr. [75] use
segments that increase by powers of two to approximate the base two logarithm.
Henkel [61] divides the interval into four arbitrarily placed segments based on
the non-linearity of the function. The address for a given input is approximated
by another function that approximates the segment number for an input. This
method only works if the number of segments is small and the desired accuracy is
low. Also, the function for approximating the segment addresses is non-linear, so
in effect the problem has been moved into a different domain. Coleman et al. [26]
divide the input interval into seven P2S (powers of two segments: segments with
the size varying by increasing or decreasing powers of two) that decrease by
powers of two, and employ constant numbers of US (uniform segments: segments
with the same sizes) nested inside each P2S, which we call P2S(US). Lewis [100]
divides the interval into US that vary by multiples of three, and each US has
variable numbers of uniform segments nested inside, which we call US(US). However,
in both cases the choice of inner and outer segment numbers is left to the user, and
a more efficient segmentation could be achieved using a systematic segmentation
scheme.
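To illustrate why powers-of-two boundaries are hardware-friendly, the following hypothetical sketch decodes a segment address for a simple P2S(US) layout using nothing more than a leading-zero count and a bit slice; this is a simplified stand-in and not the segmentation scheme developed later in this thesis:

```python
def p2s_us_address(x, n_bits, us_bits):
    """Decode (outer, inner) segment indices for an n_bits-wide unsigned x.

    Outer P2S index = leading-zero count, so segments halve in size toward
    zero (cheap priority-encoder logic in hardware). Inner index = the
    us_bits immediately below the leading one, giving 2**us_bits uniform
    segments nested inside each P2S.
    """
    assert 0 <= x < (1 << n_bits)
    if x == 0:
        return n_bits, 0                  # the segment nearest zero
    top = x.bit_length() - 1              # position of the leading one
    outer = n_bits - 1 - top              # leading-zero count
    frac = x - (1 << top)                 # bits below the leading one
    inner = (frac << us_bits) >> top if top else 0
    return outer, inner
```

An arbitrary-boundary scheme would instead need a comparison against every stored boundary, which is exactly the circuit-complexity problem noted above.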
2.3.6 Rational Approximation
Rational approximation offers efficient evaluation of analytic functions repre-
sented by the ratio of two polynomials:
\[
f(x) = \frac{c_n x^n + c_{n-1} x^{n-1} + \dots + c_1 x + c_0}{d_m x^m + d_{m-1} x^{m-1} + \dots + d_1 x + d_0} \tag{2.7}
\]
In general, rational approximations are the most efficient method to evaluate
functions on a microprocessor. However, they are less attractive for FPGA im-
plementations due to the presence of the divider. Typical polynomials for
floating-point single precision have fewer than ten coefficients [122]. Hardware
implementations of rational approximation are studied in [79].
2.4 Issues on Function Evaluation
In this section, we describe various issues related to function evaluation. We first
describe approximation methods and applications for elementary and compound
functions. Second, we examine the dilemma designers face, when the optimum
function evaluation method for a given metric is required. Third, range reduction
is explained, which is a technique used to transform the inputs of elementary
functions into a smaller linear interval. Finally, we look at the types of errors
that can arise when attempting to evaluate functions in hardware.
2.4.1 Evaluation of Elementary and Compound Functions
The evaluation of elementary functions [128] such as sin(x) or log(x) has re-
ceived significant interest in the research community. They are typically com-
puted by CORDIC [178], table look-up and addition methods [162], [167], or
polynomial/rational approximations with one or more uniform segments [161].
For the evaluation of elementary functions, range reduction techniques [128] such
as those presented in [25] and [182] are used to bring the input within a linear
range. In contrast, there has been little attention on the efficient approximation
of compound functions such as √(−log(x)) or x log(x) for special purpose
applications. Examples of such applications include N-body simulation [63], channel
coding [74], Gaussian noise generation [86] and image registration [158]. In prin-
ciple, these compound functions can be evaluated by splitting them into several
elementary functions, but this approach would result in long delay, propaga-
tion of rounding errors and possibilities of catastrophic cancellations [55]. Range
reduction is not feasible for compound functions (unless the sub-functions are
computed one by one), so highly non-linear regions of a compound function need
to be handled as well.
2.4.2 Approximation Method Selection
We can use polynomials and/or look-up tables for approximating a function f(x)
over a fixed range [a, b]. At one extreme, the entire function approximation
can be implemented as a table look-up. At the other extreme, the function
approximation can be implemented as a polynomial approximation with function-
specific coefficients. Between these two extremes, we use a table look-up to obtain
the appropriate polynomial coefficients followed by polynomial evaluation. This
table-with-polynomial method partitions the total approximation into several
segments.
Figure 2.3: Certain approximation methods are better than others for a given
metric at different precisions.
In [122], Mencer and Luk show that for a given accuracy requirement it is
possible to plot the area, latency, and throughput tradeoff and thus identify
the optimal function evaluation method. The optimality depends on further
requirements such as available area, required latency and throughput. Looking
at Figure 2.3, if one desires the metric to be low (e.g. area, latency or power),
one should use method 1 for bitwidths lower than x1, method 2 for bitwidths
between x1 and x2, and method 3 for bitwidths higher than x2. Figure 2.4 shows
the results from [122], comparing area requirements of various approximation
methods with varying precision.
2.4.3 Range Reduction
We evaluate an elementary function f(x), where x has a given range [a, b] and f(x)
has a precision requirement. The evaluation typically consists of three steps [128]:
Figure 2.4: Area comparison in terms of configurable logic blocks for different
methods with varying data widths [122].
(1) range reduction, reducing x over the interval [a, b] to a more convenient y
over a smaller interval [a′, b′],
(2) function evaluation on the reduced interval, and
(3) range reconstruction: expansion of the result back to the original result
range.
There are two main types of range reduction:
• additive reduction: y is equal to x−mC;
• multiplicative reduction: y is equal to x/C^m
where integer m and a constant C are defined by the evaluated function.
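A software sketch of additive reduction for sin(x) with C = π/2 is shown below (the helper names are ours; production libraries use extended-precision representations of C so that the subtraction x − mC does not lose accuracy for large x):

```python
import math

def reduce_sin(x):
    """Additive range reduction y = x - m*C with C = pi/2.

    Returns (y, q) with y in roughly [-pi/4, pi/4] and q = m mod 4, the
    quadrant needed later for range reconstruction.
    """
    C = math.pi / 2.0
    m = round(x / C)
    return x - m * C, m % 4

def sin_via_reduction(x):
    """Steps (1)-(3): reduce, evaluate on the small interval, reconstruct."""
    y, q = reduce_sin(x)
    # sin(y + q*pi/2) for q = 0, 1, 2, 3:
    return (math.sin(y), math.cos(y), -math.sin(y), -math.cos(y))[q]
```

The reconstruction step is just a sign flip and a sin/cos swap, which is why the quadrant q must be carried alongside the reduced argument.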
Range reduction is widely studied, especially for CORDIC [182] and floating-
point systems on microprocessors [25]. Li et al. [102] present theorems that prove
the correctness and effectiveness of commonly used range reduction techniques.
Lefevre and Muller [97] suggest a method for performing range reduction on the
fly: overlapping the computation with the reception of the input bits for bit-serial
systems. Defour et al. [35] present an algorithm suitable for small and medium
sized arguments in IEEE double precision. Their method is significantly faster
than Payne and Hanek’s modular range reduction method [128], at the expense
of larger table sizes. In contrast, range reduction which adapts to different input
ranges and precisions has received little attention.
2.4.4 Types of Errors
Classically, there are three different kinds of error which affect the global error
when evaluating a function:
• The input quantization error measures the fact that an input number usu-
ally represents a small interval centered around this number.
• The approximation error measures the difference between the pure mathe-
matical function and the approximate mathematical function that will be
used to evaluate it.
• Output rounding errors measure the difference between the approximated
mathematical function and the closest machine-representable value.
All of these errors need to be taken into account when approximating a func-
tion for a given output error requirement.
2.5 Gaussian Noise Generation
Sequences of random numbers with Gaussian probability distribution functions
are needed to simulate a wide variety of natural phenomena [51], [183]. Applica-
tions of such sequences include channel code evaluation [74], watermarking [39],
oscilloscope testing [176], simulation of economic systems [6], [156], financial mod-
eling [14] and molecular dynamics simulations [76].
Previous work on Gaussian noise generation can be divided into two types:
the generation of Gaussian noise using a combination of analog components [144],
[155], [196], and the generation of pseudo random noise using purely digital com-
ponents [5], [12], [23], [33], [53], [56], [69], [98], [127], [135], [170], [180]. The
first method tends to be practical only in highly restricted circumstances, and
suffers from its own problems with noise accuracy. The second method is often
more desirable, because of its flexibility and high performance. In addition, when
simulating communication systems, we may wish to use noise sequences that are
pseudo-random so that the same noise can be adopted for different systems. Also,
if the system fails we may wish to know which noise samples cause the system to
fail. Comprehensive but rather dated comparisons of such digital methods can
be found in [4], [130] and [143].
Digital methods for generating random Gaussian variables are almost al-
ways based on transformations or operations on uniform random variables [160].
The most widely used methods are: various rejection-acceptance methods [2],
[98], [101], [114], [115], the use of the central limit theorem [78], the inversion
method [65] and the Box-Muller method [13]. The rejection-acceptance methods,
while popular in software implementations, contain conditional loops such that
the output rates are not constant, making them less amenable to a hardware
simulation environment. The central limit theorem can, in principle, be used
to produce Gaussian samples, if a suitable number of samples are involved. In
practice however, approximating a Gaussian probability density function (PDF)
to high accuracy using the central limit theorem alone would require an imprac-
tically large number of samples.
The Box-Muller method, either alone or in combination with the central limit
theorem, has been the focus of most efforts in hardware implementation. For
example, Boutillon et al. [12] present a hardware Gaussian noise generator on
an Altera FPGA based on the Box-Muller algorithm in conjunction with the
central limit theorem. Their design occupies 437 logic cells on an Altera Flex
10K100EQC240-1 FPGA, and outputs 24.5 million noise samples per second at
a clock speed of 98MHz. Recently, Xilinx have released the “Additive White
Gaussian Noise (AWGN) Core 1.0” [186], which is based on the Boutillon et
al. architecture. The drawbacks of these designs are revealed by statistical tests
applied to evaluate the noise samples produced. Designs which fail such statistical
tests are inadequate for high quality hardware communications simulations such
as Low-Density Parity-Check (LDPC) codes [48].
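The Box-Muller transform itself is short; the hardware cost lies in evaluating the compound functions √(−2 ln u₁) and sin/cos(2πu₂). A minimal software sketch (the uniform inputs must avoid zero):

```python
import math

def box_muller(u1, u2):
    """One Box-Muller step: two uniforms in (0, 1] -> two Gaussian samples.

    The pair (r*cos(t), r*sin(t)) is jointly Gaussian with zero mean and
    unit variance; r = sqrt(-2 ln u1) and t = 2*pi*u2 are the compound
    functions a hardware implementation must approximate.
    """
    r = math.sqrt(-2.0 * math.log(u1))
    t = 2.0 * math.pi * u2
    return r * math.cos(t), r * math.sin(t)
```

Because every pair of uniforms yields exactly two Gaussian samples, the output rate is constant, unlike rejection-based schemes.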
Chen et al. [22] use a cumulative distribution function (CDF) conversion table
to transform uniform random variables to Gaussian random variables. They have
implemented the Gaussian noise generator as part of a readback-signal generator
on a Xilinx Virtex-E XCV1000E FPGA at 70MHz. The method they employ is
basically the inversion method [65] implemented with a look-up table. Again, our
experiments show that the use of direct table look-up for the inversion method can
produce noise samples of insufficient quality for the communications applications
that we are targeting. Fan et al. [43] present a hardware Gaussian noise generator
based on the polar method [78] in conjunction with the central limit theorem.
Their design is implemented on an Altera Mercury EP1M120F484C7 FPGA; it
takes up 336 logic elements and has a clock speed of 73MHz, generating a sample
every clock. The polar method is a variant of the Box-Muller method and belongs to
the class of rejection-acceptance methods, hence the output rate is not constant.
In order to overcome this problem, they employ a FIFO buffer with the read
speed set to half of the write speed. However, detailed statistical analyses have
not been performed to confirm the quality of the noise samples produced.
All of the methods described above produce normal variables by performing
operations on uniform variables. In contrast, Wallace proposes an algorithm
that completely avoids the use of uniform variables, operating instead using an
evolving pool of normal variables to generate additional normal variables [180].
The approach draws its inspiration from uniform random number generators that
generate one or more new uniform variables from a set of previously generated
uniform variables. Given a set of normally distributed random variables, a new
set of normally distributed random variables can be generated by applying a
linear transformation. Brent [16] has implemented a fast vectorized Gaussian
random number generator using the Wallace method on the Fujitsu VP2200 and
VPP300 vector processors. In [17] and [157], Brent and Rub outline possible
correlation problems associated with the Wallace method and discuss ways of
avoiding them.
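A much-simplified software sketch of one Wallace-style pass is shown below; the pool size, the shuffle and the 2×2 transform are illustrative choices, and Wallace's actual algorithm uses a different mixing scheme:

```python
import math

def wallace_step(pool, rng):
    """One simplified Wallace-style pass over a pool of normal variables.

    The pool is shuffled and a 2x2 orthogonal transform is applied to each
    pair. No uniform-to-Gaussian conversion is involved; orthogonality
    preserves the sum of squares, so the pool stays approximately N(0, 1).
    Assumes an even-sized pool.
    """
    rng.shuffle(pool)
    s = 1.0 / math.sqrt(2.0)
    out = []
    for i in range(0, len(pool) - 1, 2):
        a, b = pool[i], pool[i + 1]
        out.append(s * (a + b))       # rows of an orthogonal matrix:
        out.append(s * (a - b))       # [s  s; s -s], norm-preserving
    return out
```

The shuffling is essential: without it, repeated application of the same fixed transform would introduce exactly the correlations discussed in [17] and [157].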
2.6 LDPC Codes
2.6.1 Basics of LDPC Codes
Low-Density Parity-Check (LDPC) codes [48], [49] enable performance extremely
close to the best possible as determined by the Shannon capacity formula. For
the additive white Gaussian noise (AWGN) channel, the best code of rate 1/2
Figure 2.5: Comparison of (3,6)-regular LDPC code, Turbo code and optimized
irregular LDPC code [151].
presented in [151] has a threshold within 0.06dB of capacity, and their simu-
lation results show that the best LDPC code of length 10^6 achieves a bit error
probability of 10^{−6} less than 0.13dB away from capacity, beating the best codes
previously known. A performance comparison of various codes in terms of BER versus
SNR is illustrated in Figure 2.5. All codes are of length 10^6 and rate 1/2. The
BER for the AWGN channel is shown as a function of Eb/N0 (SNR per bit in
dB).
The communication system model we consider comprises an LDPC encoder,
a decoder and an AWGN channel as shown in Figure 2.6. Message bits are given
as inputs to the LDPC encoder, which creates parity bits for each block of message
bits, generating codewords. A binary antipodal modulation such as binary phase shift
keying (BPSK) is assumed at the transmitter. The signal gets corrupted by
AWGN noise during the transmission over the channel. At the receiver end,
the demodulator demodulates the received signal, filters it and performs A/D
conversion on it. This is further fed to the LDPC decoder, which iteratively
decodes the received codeword block and provides decoded bits at the output
end.
Figure 2.6: LDPC communication system model.
As originally suggested by Tanner [172], LDPC codes are well represented
by bipartite graphs in which one set of nodes, the variable nodes, corresponds to
elements of the codeword and the other set of nodes, the check nodes, corresponds
to the set of parity-check constraints which define the code. Regular LDPC codes
are those for which all nodes of the same type have the same degree. For example,
a (3,6)-regular LDPC code has a graphical representation in which all variable
nodes have degree three and all check nodes have degree six. The bipartite graph
determining such a code is shown in Figure 2.7.
Irregular LDPC codes are introduced in [108] and [107] and are further stud-
ied in [105], [106] and [111]. For such an irregular LDPC code, the degrees of each
set of nodes are chosen according to some distribution. Thus, an irregular LDPC
code might have a graphical representation in which half the variable nodes have
degree three and half have degree five, while half the constraint nodes have degree
six and half have degree eight. Luby et al. [107] formally showed that properly
constructed irregular codes can approach the channel capacity more closely than
regular codes. LDPC codes exhibit an asymptotically better performance than
Figure 2.7: A bipartite graph of a (3,6)-regular LDPC code of length ten and
rate 1/2. There are ten variable nodes and five check nodes. For each check node
Ci the sum (over GF(2)) of all adjacent variable nodes is equal to zero.
Turbo codes and admit a wide range of tradeoffs between performance and com-
plexity.
LDPC codes are linear codes. Hence, they can be expressed as the null space
of a parity-check matrix H, i.e., x is a code word if and only if
\[
Hx^T = 0^T. \tag{2.8}
\]
The sparseness of H enables efficient (sub-optimal) decoding, while the random-
ness ensures (in the probabilistic sense) a good code. The H matrix corresponding
to the bipartite graph in Figure 2.7 is shown below. Note that in this example,
H is not sparse; it is just for illustration.
\[
H = \begin{pmatrix}
1 & 1 & 1 & 1 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 & 0 & 1 & 1 & 1 \\
1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 1 & 0 & 1 & 0 & 1 & 1
\end{pmatrix} \tag{2.9}
\]
2.6.2 LDPC Encoding
LDPC codes are linear block codes. Encoding of such codes uses the following
property:
\[
Hx^T = 0^T \tag{2.10}
\]
where x represents the codeword and H represents the parity-check matrix. A
straightforward encoding scheme requires three steps: a) Gaussian elimination
to transform the H matrix into a lower triangular form (Figure 2.8), b) split
x into information bits and parity bits, i.e., x = (s, p1, p2) where s is the vector of
information bits and p1, p2 are vectors of parity bits, c) solve the equation Hx^T = 0
using forward-substitution. It takes about O(n^3) operations to perform Gaussian elimina-
tion. Since afterwards the H matrix will no longer be sparse, it takes O(n^2),
or more precisely, n^2 r(1−r)/2 XOR operations for the actual encoding, where r is
the code rate [151]. The code rate is the ratio of information bits to codeword
bits and has a value between 0 and 1. In order to reduce the quadratic com-
plexity, Richardson and Urbanke [152] took advantage of the sparsity of the H
matrix. They found that in most cases the encoding complexity is either linear, or
quadratic with a small constant and thus quite manageable. For example, for a (3,6)-regular code of length
n, even though the complexity is still quadratic, the actual number of operations
required is only 0.0172n^2 + O(n). Since 0.0172 is a small number, the
complexity of the encoder is still manageable for large n.

Figure 2.8: An equivalent parity-check matrix in lower triangular form.
2.6.3 RU LDPC Encoding Method
In this section, we describe the Richardson and Urbanke (RU) algorithm for con-
structing efficient encoders for LDPC codes as presented in [152]. The efficiency
of the encoder arises from the sparseness of the parity-check matrix H and the
algorithm can be applied to any ‘sparse’ H. Although our example is binary,
the algorithm applies generally to matrices H whose entries belong to a field F .
We assume throughout that the rows of H are linearly independent. If the rows
are linearly dependent, then the algorithm which constructs the encoder will de-
tect the dependency and either one can choose a different matrix H, or one can
eliminate the redundant rows from H in the encoding process.
Assume we are given an m× n parity-check matrix H over F . By definition,
the associated code consists of the set of n-tuples x over F such that
Hx^T = 0^T. (2.11)
As briefly discussed in the previous section, the most straightforward way of
constructing an encoder for such a code is the following. By means of Gaussian
Figure 2.9: The parity-check matrix in approximate lower triangular form.
elimination, bring H into an equivalent lower triangular form as shown in Fig-
ure 2.8. Split the vector x into a systematic part s, s ∈ F^{n−m}, and a parity part
p, p ∈ F^m, such that x = (s, p). Construct a systematic encoder as follows: i) Fill
s with the (n − m) desired information symbols. ii) Determine the m parity-check
symbols using back-substitution. More precisely, for l ∈ [m] calculate

p_l = Σ_{j=1}^{n−m} H_{l,j} s_j + Σ_{j=1}^{l−1} H_{l,j+n−m} p_j.   (2.12)
Bringing the matrix H into the desired form requires O(n^3) operations of pre-
processing. The actual encoding then requires O(n^2) operations since, in general,
after the preprocessing the matrix will no longer be sparse.
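Equation (2.12) maps directly onto a short routine. The sketch below is illustrative (the 3 × 6 matrix is a made-up toy example, not from the thesis): it assumes H is already in lower triangular form with a unit diagonal in its parity part, and computes the parity bits over GF(2), where the minus signs implicit in (2.12) vanish:

```python
def encode_lower_triangular(H, s):
    """Parity bits from (2.12): H is m x n in lower triangular form
    (unit diagonal in its right m x m part); s holds the n-m info bits."""
    m, n = len(H), len(H[0])
    k = n - m
    p = []
    for l in range(m):
        bit = sum(H[l][j] * s[j] for j in range(k)) % 2               # info-bit part
        bit = (bit + sum(H[l][k + j] * p[j] for j in range(l))) % 2   # earlier parities
        p.append(bit)
    return s + p   # codeword x = (s, p)

# Toy example: the right-hand 3x3 block is lower triangular with unit diagonal.
H = [[1,1,0, 1,0,0],
     [0,1,1, 1,1,0],
     [1,0,1, 0,1,1]]
x = encode_lower_triangular(H, [1, 1, 0])
# Every check of H is satisfied by the resulting codeword.
assert all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in H)
```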
Given that the original parity-check matrix H is sparse, one might wonder if
encoding can be accomplished in O(n). As will be shown, typically for codes
which allow transmission at rates close to capacity, linear time encoding is indeed
possible.
Assume that by performing row and column permutations only we can bring
the parity-check matrix into the form indicated in Figure 2.9. We say that H is
in approximate lower triangular form. Note that since this transformation was
accomplished solely by permutations, the matrix is still sparse. More precisely,
assume that we bring the matrix into the form

H = ⎡ A  B  T ⎤
    ⎣ C  D  E ⎦   (2.13)

where A is of size (m − g) × (n − m), B is (m − g) × g, C is g × (n − m), D is
g × g, and finally, E is g × (m − g). Further, all these matrices are sparse and
T is lower triangular with ones along the diagonal. Multiplying this matrix from
the left by

⎡ I       0 ⎤
⎣ −ET⁻¹   I ⎦   (2.14)

we get

⎡ A             B             T ⎤
⎣ −ET⁻¹A + C    −ET⁻¹B + D    0 ⎦   (2.15)
Let x = (s, p1, p2), where s denotes the systematic part, p1 and p2 combined
denote the parity part, p1 has length g, and p2 has length (m − g). The defining
equation Hx^T = 0^T splits naturally into two equations, namely

As^T + Bp1^T + Tp2^T = 0   (2.16)

and

(−ET⁻¹A + C)s^T + (−ET⁻¹B + D)p1^T = 0.   (2.17)

Define φ := −ET⁻¹B + D and assume for the moment that φ is nonsingular.
The general case will be discussed shortly. Then from (2.17) we conclude that

p1^T = −φ⁻¹(−ET⁻¹A + C)s^T.   (2.18)

Hence, once the g × (n−m) matrix −φ⁻¹(−ET⁻¹A + C) has been precomputed,
the determination of p1 can be accomplished in complexity O(g × (n−m)) simply
by performing a multiplication with this matrix. This complexity can be further
reduced as shown in Table 2.2. Rather than precomputing −φ−1(−ET−1A + C)
and then multiplying with sT , we can determine p1 by breaking the computation
into several smaller steps, each of which is efficiently computable.
To this end, we first determine As^T, which has complexity O(n) since A is
sparse. Next, we multiply the result by T⁻¹. Since T⁻¹[As^T] = y^T is equivalent
to the system [As^T] = Ty^T, this can also be accomplished in O(n) by back-
substitution, since T is lower triangular and also sparse. The remaining steps are
fairly straightforward. It follows that the overall complexity of determining p1 is
O(n + g²). In a similar manner, noting from (2.16) that p2^T = −T⁻¹(As^T + Bp1^T),
we can accomplish the determination of p2 in complexity O(n), as shown step by
step in Table 2.3.
Table 2.2: Efficient computation of p1^T = −φ⁻¹(−ET⁻¹A + C)s^T.

Operation | Comment | Complexity
As^T | multiplication by sparse matrix | O(n)
T⁻¹[As^T] | T⁻¹[As^T] = y^T ⇔ [As^T] = Ty^T | O(n)
−E[T⁻¹As^T] | multiplication by sparse matrix | O(n)
Cs^T | multiplication by sparse matrix | O(n)
[−ET⁻¹As^T] + [Cs^T] | addition | O(n)
−φ⁻¹[−ET⁻¹As^T + Cs^T] | multiplication by dense g × g matrix | O(g²)
A summary of the RU encoding procedure is given in Table 2.4. It consists of
two steps: a preprocessing step and the actual encoding step. In the preprocess-
ing step, we first perform row and column permutations to bring the parity-check
Table 2.3: Efficient computation of p2^T = −T⁻¹(As^T + Bp1^T).

Operation | Comment | Complexity
As^T | multiplication by sparse matrix | O(n)
Bp1^T | multiplication by sparse matrix | O(n)
[As^T] + [Bp1^T] | addition | O(n)
−T⁻¹[As^T + Bp1^T] | −T⁻¹[As^T + Bp1^T] = y^T ⇔ −[As^T + Bp1^T] = Ty^T | O(n)
matrix into approximate lower triangular form with as small a gap g as possible.
We also need to check whether φ := −ET⁻¹B + D is nonsingular. Rather than
premultiplying by the matrix

⎡ I       0 ⎤
⎣ −ET⁻¹   I ⎦

this task can be accomplished efficiently by Gaussian elimination. If after
clearing the matrix E the resulting
matrix φ is seen to be singular, we can simply perform further column permu-
tations to remove this singularity. This is always possible when H is not rank
deficient, as assumed. The actual encoding then entails the steps listed in Tables
2.2 and 2.3.
Table 2.4: Summary of the RU encoding procedure.

Preprocessing: Input: non-singular parity-check matrix H. Output: an equivalent
parity-check matrix of the form
⎡ A  B  T ⎤
⎣ C  D  E ⎦
such that −ET⁻¹B + D is non-singular.

1. Perform row and column permutations to bring the parity-check matrix H into
approximate lower triangular form

H = ⎡ A  B  T ⎤
    ⎣ C  D  E ⎦   (2.19)

with as small a gap g as possible. We will see in subsequent sections how this can
be accomplished efficiently.

2. Use Gaussian elimination to effectively perform the pre-multiplication

⎡ I       0 ⎤ ⎡ A  B  T ⎤   ⎡ A             B             T ⎤
⎣ −ET⁻¹   I ⎦ ⎣ C  D  E ⎦ = ⎣ −ET⁻¹A + C    −ET⁻¹B + D    0 ⎦   (2.20)

in order to check that −ET⁻¹B + D is non-singular; if it is singular, perform
further column permutations to ensure this property. (Singularity of H can be
detected at this point.)

Encoding: Input: parity-check matrix of the form
⎡ A  B  T ⎤
⎣ C  D  E ⎦
such that −ET⁻¹B + D is non-singular, and a vector s ∈ F^{n−m}. Output: the
vector x = (s, p1, p2), p1 ∈ F^g, p2 ∈ F^{m−g}, such that Hx^T = 0^T.

1. Determine p1 as shown in Table 2.2.
2. Determine p2 as shown in Table 2.3.
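The encoding phase of Table 2.4 can be sketched as follows. The blocks A, B, T, C, D, E below form a made-up toy instance (n = 6, m = 3, gap g = 1), chosen so that φ = ET⁻¹B + D is the 1 × 1 identity and the φ⁻¹ step becomes trivial; over GF(2), subtraction equals addition, so the minus signs disappear:

```python
def matvec(M, v):
    """Matrix-vector product over GF(2)."""
    return [sum(a * b for a, b in zip(row, v)) % 2 for row in M]

def tri_solve(T, b):
    """Solve T y = b for lower triangular T with unit diagonal (GF(2))."""
    y = []
    for l, row in enumerate(T):
        y.append((b[l] + sum(row[j] * y[j] for j in range(l))) % 2)
    return y

def vadd(u, v):
    return [(a + b) % 2 for a, b in zip(u, v)]

# Toy instance: n = 6, m = 3, gap g = 1 (blocks as in equation (2.13)).
A = [[1,0,1], [0,1,1]]; B = [[1], [0]]; T = [[1,0], [1,1]]
C = [[1,1,0]]; D = [[0]]; E = [[0,1]]
# Here phi = E T^-1 B + D = [1] over GF(2), so phi^-1 is the identity
# (checked by hand); in general a dense g x g inverse would be applied.

def ru_encode(s):
    As = matvec(A, s)
    y  = tri_solve(T, As)                        # y = T^-1 A s^T (Table 2.2)
    p1 = vadd(matvec(E, y), matvec(C, s))        # phi^-1 (E T^-1 A + C) s^T
    p2 = tri_solve(T, vadd(As, matvec(B, p1)))   # p2 from (2.16) (Table 2.3)
    return s + p1 + p2                           # x = (s, p1, p2)

H = [[1,0,1, 1, 1,0],   # [A B T]
     [0,1,1, 0, 1,1],
     [1,1,0, 0, 0,1]]   # [C D E]
x = ru_encode([1, 0, 0])
assert all(sum(h * b for h, b in zip(row, x)) % 2 == 0 for row in H)
```

Every step except the φ⁻¹ multiplication touches only sparse matrices or a triangular solve, which is what gives the O(n + g²) encoding cost.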
2.6.4 Hardware Aspects of LDPC codes
Although there have been many publications on the hardware implementation
of Viterbi [20], [77], [136], [169], [190] and Turbo codes [64], [116], [146], [145],
[177], little attention has been given to the hardware implementation issues of
LDPC codes.
Levine and Schmidt [99] present a simple hardware architecture for the LDPC
codec, but it has not been implemented and is perhaps not practical for a real
design due to its large size and inefficiency. Zhang et al. investigate the finite pre-
cision effects in regular LDPC decoders in [195]. They introduce their hardware
architecture in [193] and present an FPGA based (3,6)-regular LDPC decoder
in [194]. Their design is capable of 54Mbps, but its error correcting performance
is rather poor due to the use of regular LDPC codes and the simplicity of their
design. In [9], Bhatt et al. present a regular LDPC implementation on a fixed-point
DSP, which achieves a bit rate of just 133.33Kbps. In [67], Howland et al. present
their parallel regular LDPC decoding architecture, and later published their ASIC
based LDPC decoder chip in [10] and [66]. Their chip is claimed to be capable of
1Gbps, but is again based on regular LDPC codes, which have lower error cor-
recting performance compared to irregular LDPC codes. Our closest competitors
are probably Metha et al. and Flarion Technologies Inc. Metha et al. are working
on an FPGA based regular LDPC simulation platform and recently published a
technical report on their preliminary architecture [120]. Flarion offer intellectual
property for LDPC encoders and decoders [45]. Their FPGA decoder is reported
to operate at up to 384Mbps and their ASIC decoder at 10Gbps. However, few
details are known, since these are commercial products.
To the best of our knowledge, existing hardware LDPC encoders in the lit-
erature [66], [120], [193] employ the straightforward encoding method where a
vector of information bits is multiplied by a dense generator matrix, which has
complexity quadratic in the block length.
Recently, Johnson and Weller [73] proposed a family of irregular LDPC codes
with low encoding complexity based on quasi-cyclic codes. The quasi-cyclic codes
can be encoded with a shift register circuit of size equal to the code dimension.
Low-Density Generator-Matrix (LDGM) codes [50] have also received consider-
able attention due to their linear encoding complexity. However, quasi-cyclic
and LDGM codes are subsets of LDPC codes and restrict the way the parity-
check matrix is constructed. In most cases, they are outperformed by properly
constructed irregular codes such as those in [174].
2.7 Summary
In this chapter, we have presented background material and related work of
this thesis. We have first introduced FPGAs: their architecture, applications
and design tools. Different methods for approximating functions have been
described, covering CORDIC, digit-recurrence and on-line algorithms, bipar-
tite/multipartite methods, polynomial approximation, polynomial approximation
with non-uniform segmentation, and rational approximation. Our implementa-
tions of some of these methods are presented in Chapters 3 to 5.
Various issues involved with function evaluation such as different approx-
imation requirements for elementary and compound functions, approximation
method selection, range reduction, and the types of errors that can occur during
the approximation process have been discussed. In Chapter 3, we address the automa-
tion of method selection when evaluating elementary functions, and in Chapter 4
we present a hardware architecture for range reduction. Chapter 5 presents a
hierarchical segmentation method suitable for approximating non-linear compound
functions.
Several ways of generating Gaussian noise have been discussed, which are used
for various applications including channel code simulations. In Chapters 6 and 7
we present two hardware architectures suitable for applications that require high
speed/quality noise generators.
Finally, we have introduced the basics of LDPC codes and LDPC encoding,
described the RU algorithm for efficient encoding of LDPC codes, and looked
at previous work that deals with the hardware related issues of LDPC codes.
We present a flexible hardware LDPC encoder based on the RU algorithm in
Chapter 8.
CHAPTER 3
Automating Optimized Table-with-Polynomial
Function Evaluation
3.1 Introduction
Hardware implementation of elementary functions is a widely studied field with
many research papers (e.g. [19], [163], [171], [185]) and books (e.g. [46], [128])
devoted to the topic. Even though many methods are available for evaluating
functions, it is difficult for designers to know which method to select for a given
implementation.
Advanced FPGAs enable the development of low-cost and high-speed function
evaluation units, customizable to particular applications. Such customization can
take place at run time by reconfiguring the FPGA, so that different functions,
function evaluation methods, or precision can be introduced according to run-
time conditions. Consequently, the automation of function evaluation design is
one of the key bottlenecks in the further application of function evaluation in
reconfigurable computing. The main contributions of this chapter are:
• A methodology for the automation of function evaluation unit design, cov-
ering table look-up, table-with-polynomial and polynomial-only methods.
• An implementation of a partially automated system for design space explo-
ration of function evaluation in hardware, including:
– Algorithmic design space exploration with MATLAB.
– Hardware design space exploration with ASC.
• Method selection results for three commonly used elementary functions:
sin(x), log(1 + x) and 2x.
The rest of this chapter is organized as follows. Section 3.2 provides an
overview of our approach. Section 3.3 presents the algorithmic design space
exploration with MATLAB. Section 3.4 describes the automation of the ASC
design space exploration process. Section 3.5 shows how ASC designs can be
verified. Section 3.6 discusses results, and Section 3.7 offers summary and future
work.
3.2 Overview
We can use polynomials and/or look-up tables for approximating a function f(x)
over a fixed range [a, b]. At one extreme, the entire function approximation
can be implemented as a table look-up. At the other extreme, the function
approximation can be implemented as a polynomial approximation with function-
specific coefficients. The polynomials are of the form

f(x) = c_d x^d + c_{d−1} x^{d−1} + ... + c_1 x + c_0.   (3.1)

We use Horner’s rule to reduce the number of multiplications:

f(x) = ((c_d x + c_{d−1})x + ...)x + c_0   (3.2)

where x is the input, d is the polynomial degree and the c_i are the coefficients.
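Evaluating (3.1) directly needs roughly twice as many multiplications as (3.2), which uses only d multiply-add steps, one per coefficient after the first. A minimal sketch (illustrative, in Python rather than hardware):

```python
def horner(coeffs, x):
    """Evaluate c_d*x^d + ... + c_1*x + c_0 via Horner's rule (3.2).
    coeffs are ordered highest degree first: [c_d, ..., c_1, c_0]."""
    result = 0.0
    for c in coeffs:
        result = result * x + c   # one multiply-add per coefficient
    return result

# 2x^2 + 3x + 1 at x = 2  ->  8 + 6 + 1 = 15
assert horner([2.0, 3.0, 1.0], 2.0) == 15.0
```

In hardware, each loop iteration maps naturally onto one multiply-and-add unit, which is why the polynomial degree directly sets the depth of the evaluation pipeline.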
Between these two extremes, we use a table followed by a polynomial. This
table-with-polynomial method partitions the total approximation into several
segments. In this work, we employ uniformly sized segments, which have been
widely studied in the literature [19], [70], [100].
As discussed in Section 2.4.2 in Chapter 2, for a given accuracy requirement
it is possible to plot the area, latency, and throughput tradeoff and thus identify
the optimal function evaluation method. The optimality depends on further
requirements such as available area, required latency and throughput. We shall
illustrate this approach using Figures 3.13 to 3.15, where several methods are
combined to provide the optimal implementations in area, latency or throughput
for different bit-widths for the function sin(x).
The contribution of this chapter is the design and implementation of a method-
ology to automate this process. Here, MATLAB automates the mathematical
side of function approximation (e.g. bitwidth and coefficient selection), while
ASC [123] automates the hardware design space exploration of area, latency
and throughput. Figure 3.1 shows the proposed methodology and Figure 3.2
illustrates how ASC optimizes designs automatically for the user specified met-
ric [123]. Area optimization time-shares common blocks, latency optimization
uses no registers in the intermediate data paths, and throughput optimization
inserts pipeline registers.
3.3 Algorithmic Design Space Exploration with MATLAB
Given a target accuracy, or number of output bits so that the required accu-
racy is one unit in the last place (ulp), it is straightforward to automate the
design of a sufficiently accurate table, and with help from MATLAB, also to
Figure 3.1: Block diagram of methodology for automation.
find the optimal coefficients for a polynomial-only implementation. The inter-
esting designs are between the table-only and polynomial-only designs – those
involving both a table and a polynomial. Three MATLAB programs have been
developed: TABLE (table look-up), TABLE+POLY (table-with-polynomial) and
POLY (polynomial-only). The programs take a set of parameters (e.g. function,
input range, operand bitwidth, required accuracy, bitwidths of the operations
and the coefficients and the polynomial degree) and generate function evaluation
units in ASC code.
TABLE produces a single table, holding results for all possible inputs; each
input is used to index the table. If the input is n bits and the precision of the
results is m bits, the size of the table would be 2n ×m. It can be seen that the
disadvantage of this approach is that the table size varies exponentially with the
input size.
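A sketch of this scheme (illustrative only; the real generator emits ASC code for a combinational table rather than a Python list):

```python
import math

def build_table(f, a, b, n_bits, m_bits):
    """Full look-up table: 2^n entries, each an m-bit fixed-point result."""
    size = 2 ** n_bits
    scale = 2 ** m_bits
    # Entry i holds f evaluated at the i-th of 2^n equally spaced inputs.
    return [round(f(a + (b - a) * i / size) * scale) for i in range(size)]

# An 8-bit input indexes 2^8 = 256 entries of 12-bit results for sin on [0, 1].
table = build_table(math.sin, 0.0, 1.0, n_bits=8, m_bits=12)
assert len(table) == 2 ** 8   # table size grows exponentially with n

# Look-up: the n input bits form the index directly.
x = 0.5
approx = table[int(x * 2 ** 8)] / 2 ** 12
assert abs(approx - math.sin(0.5)) < 2 ** -8
```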
Figure 3.2: Principles behind automatic design optimization with ASC.
TABLE+POLY implements the table-with-polynomial method. The input
interval [a, b] is split into N = 2I equally sized segments. The I leftmost bits
of the argument x serve as the index into the table, which holds the coefficients
for that particular interval. We use degree two polynomials for approximating
the segments, which are known to give good results for moderate precisions [91].
The program starts with I = 0 (i.e. one segment over the whole input range)
and finds the minimax polynomial coefficients, i.e. those that minimize the maximum
absolute error [128]. I is incremented until the maximum error over all segments
is lower than the requested error. The operations are performed in fixed-point
and in finite precision with the user supplied parameters, which are emulated by
MATLAB.
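The behaviour of TABLE+POLY can be mimicked in a few lines of Python. This is only a sketch: quadratic interpolation through three points per segment stands in for the minimax fit that the MATLAB program actually computes, but it shows how the leading bits of the argument select a segment and how the error shrinks as segments are added:

```python
import math

def seg_approx(f, a, b, I, x):
    """Approximate f(x) on [a, b] with 2^I uniform segments, degree 2 each.
    The top I bits of the normalized argument select the segment."""
    N = 2 ** I
    seg = min(int((x - a) / (b - a) * N), N - 1)    # segment index
    x0 = a + (b - a) * seg / N                      # segment endpoints
    x2 = a + (b - a) * (seg + 1) / N
    x1 = (x0 + x2) / 2
    f0, f1, f2 = f(x0), f(x1), f(x2)
    # Lagrange quadratic through three points -- a stand-in for the minimax
    # coefficients that would be stored in the segment's table entry.
    return (f0 * (x - x1) * (x - x2) / ((x0 - x1) * (x0 - x2))
          + f1 * (x - x0) * (x - x2) / ((x1 - x0) * (x1 - x2))
          + f2 * (x - x0) * (x - x1) / ((x2 - x0) * (x2 - x1)))

def max_err(I, samples=1000):
    return max(abs(seg_approx(math.sin, 0.0, 1.0, I, i / samples)
                   - math.sin(i / samples)) for i in range(samples))

# Halving the segment width shrinks the error of a degree-2 fit ~8x,
# so going from 1 to 16 segments improves accuracy by orders of magnitude.
assert max_err(4) < max_err(0) / 100
```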
POLY generates an implementation which approximates the function over
the whole input range with a single polynomial. It starts with a degree one
polynomial and finds the minimax polynomial coefficients. The polynomial degree
is incremented until the desired accuracy is met.
3.4 Hardware Design Space Exploration with ASC
While Mencer et al. [123] show the details of the design space exploration pro-
cess with ASC, we now utilize ASC (version 0.5) to automate this process. The
idea is to retain user-control over all features available on the gate level, whilst
automating many of the tedious tasks involved in exploring the design space.
Therefore ASC allows the user to specify the dimensions of design space explo-
ration, e.g. bitwidths of certain variables, optimization metrics such as area,
latency, or throughput, and in fact anything else that is accessible in ASC code,
which includes algorithm level, arithmetic unit level and gate level constructs. For
example, suppose we wish to explore how the bitwidth of a particular ASC vari-
able affects area and throughput. To do this we first parameterize the bitwidth
definition of this variable in the ASC code. Then we specify the detail of the
exploration in the following manner:
RUN0 = -XBITWIDTH = 8, 16, 24, 32 (3.3)
which states that we wish to investigate bitwidths of 8, 16, 24 and 32. At this
point, typing ‘make run0’ begins an automatic exploration of the design space,
generating a vast array of data (e.g. number of 4-input LUTs, total equivalent
gate count, throughput and latency) for each different bitwidth. ASC also au-
tomatically generates graphs for key pieces of this data, in an effort to further
reduce the time required to evaluate it.
The design space explorer, or ‘user’, in our case is of course the MATLAB
program that mathematically designs the arithmetic units on the algorithmic level
and provides ASC with a set of ASC programs, each of which results in a large
number of implementations. Each ASC implementation in return results in a
Figure 3.3: Accuracy graph: maximum error versus bitwidth for sin(x) with the
three methods.
number of design space exploration graphs and data files. The remaining manual
step, which is difficult to automate, involves inspecting the graphs and extracting
useful information about the variation of the metrics. It would be interesting to
see how such information from the hardware design space exploration can be used
to steer the algorithmic design space exploration.
One dimension of the design space is technology mapping on the FPGA side.
Should we use block RAMs, LUT memory or LUT logic implementations of
the mathematical look-up tables generated by MATLAB? Table 3.1 shows ASC
results which substantiate the view that logic minimization of tables containing
smooth functions is usually preferable over using block RAMs or LUT memory
to implement the table for the precisions used in this chapter. Therefore, in
this chapter we limit the exploration to combinational logic implementations of
tables.
Table 3.1: Various place and route results of 12-bit approximations to sin(x). The
logic minimized LUT implementation of the tables minimizes latency and area,
while keeping comparable throughput to the other methods, e.g. block RAM
(BRAM) based implementation.

ASC optimization | memory type | 4-input LUTs | clock speed [MHz] | latency [ns] | throughput [Mbps]
latency | block RAM | 919 + 1 BRAM | 17.89 | 111.81 | 250.41
latency | LUT memory | 1086 | 15.74 | 63.51 | 220.43
latency | LUT logic | 813 | 16.63 | 60.11 | 232.93
throughput | block RAM | 919 + 1 BRAM | 39.49 | 177.28 | 552.79
throughput | LUT memory | 1086 | 36.29 | 192.88 | 508.09
throughput | LUT logic | 967 | 39.26 | 178.29 | 549.67
3.5 Verification with ASC
One major problem of automated hardware design is the verification of the re-
sults, to make sure that the output circuit is actually correct. ASC offers two
mechanisms for this activity based on a software version of the implementation.
• Accuracy Graphs: graphs showing the accuracy of the gate-level simu-
lation result (SIM) compared to a software version using double precision
floating-point (SW), automatically generated by MATLAB, plotting:

max. error = max(|SW − SIM|), or
max. error = max(|SW − FPGA|)

when comparing to an actual FPGA output (FPGA).
Figure 3.3 shows an example graph. Here the precisions of the coefficients
and the operations are increased according to the bitwidth (e.g. when
bitwidth=16, all coefficients and operations are set to 16 bits), and the
output bitwidth is fixed at 24 bits.
• Regression Testing: same as the accuracy graph, but instead of plotting a
graph, ASC compares the result to a maximally tolerated error and reports
only ‘pass’ or ‘fail’ at the end.
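The same kind of check can be mocked up in software. The sketch below assumes (purely for illustration) that the finite-precision result SIM behaves like the double-precision reference rounded to m fractional bits, and evaluates max |SW − SIM| over a sweep of the input range, as the accuracy graphs do:

```python
import math

def fixed_point_sin(x, m_bits):
    """Stand-in for the gate-level result: sin rounded to m fractional bits."""
    return round(math.sin(x) * 2 ** m_bits) / 2 ** m_bits

def max_error(m_bits, samples=1000):
    """max |SW - SIM| over a sweep of [0, 1], as plotted in Figure 3.3."""
    return max(abs(math.sin(i / samples) - fixed_point_sin(i / samples, m_bits))
               for i in range(samples + 1))

# Regression testing: compare the worst case against a tolerated error.
tolerance = 2 ** -12                # one ulp at 12 fractional bits
print('pass' if max_error(12) <= tolerance else 'fail')
```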
3.6 Results
We demonstrate our approach with three elementary functions: sin(x), log(1 + x)
and 2x. Five bit sizes 8, 12, 16, 20 and 24 bits are considered for the bitwidth. In
this chapter, we implement designs with n-bit inputs and n-bit outputs. However,
the position of the decimal (or binary) point in the input and output formats can
be different in order to maximize the precision that can be described. All results
are post place-and-route, and are implemented on a Xilinx Virtex-II XC2V6000-6
device [187].
In the algorithmic space explored by MATLAB, there are three methods, three
functions and five bitwidths, resulting in 45 designs. These designs are generated
by the user with hand-optimized coefficient and operation bitwidths. ASC takes
the 45 algorithmic designs and generates a large number of implementations in the
hardware space with different optimization metrics. With the aid of the automatic
design exploration features of ASC (Section 3.4), we are able to generate all the
implementation results in one go with a single ‘make’ file. It takes around twelve
hours on a dual Athlon MP 2.13GHz PC with 2GB DDR-SDRAM.
The following graphs are a subset of the full design space exploration which
we show for demonstration purposes. Figures 3.4 to 3.15 show a set of FPGA
implementations resulting from a 2D cut of the multidimensional design space.
In Figures 3.4 to 3.6, we fix the function and approximation method to sin(x)
and TABLE+POLY, and obtain area, latency and throughput results for various
bitwidths and optimization methods. Degree two polynomials are used for all
TABLE+POLY experiments in this chapter.
Figure 3.4 shows how the area (in terms of the number of 4-input LUTs) varies
with bitwidth. The lower part shows LUTs used for logic while the small top part
of the bars shows LUTs used for routing. We observe that designs optimized
for area are significantly smaller than other designs. In addition, as one would
expect, the area increases with the bitwidth. Designs optimized for throughput
have the largest area; this is due to the registers used for pipelining. Figure 3.5
shows that designs optimized for latency have significantly less delay, and the
increase in delay with the bitwidth is lower than others. Designs optimized for
area have the longest delay, which is due to hardware being shared in a time-
multiplexed manner. Figure 3.6 shows that designs optimized for throughput
perform significantly better than others. Designs optimized for area perform
worst, which is again due to the hardware sharing. We note that the throughput
is rather unpredictable with increasing bitwidth. This is because the throughput
is solely determined by the critical path, which does not necessarily increase with
bitwidth (circuit area).
Figures 3.7 to 3.9 show various metric-against-metric scatter plots of 12-bit
approximations to sin(x) with different methods and optimizations. For TABLE,
only results with area optimization are shown, because the results for the other
optimizations are identical (no such optimizations are possible for a pure table look-up).
With the aid of such plots, one can decide rapidly what methods to use for
meeting specific requirements in area, latency or throughput.
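Extracting the Pareto-optimal points from such a scatter plot, as marked in Figures 3.7 to 3.9, is a simple filter over the design points; the (area, latency) pairs below are invented for illustration:

```python
def pareto_optimal(points):
    """Return the Pareto-optimal subset of (area, latency) pairs,
    where smaller is better in both dimensions."""
    optimal = []
    for p in points:
        # p survives unless some other point is no worse in both metrics.
        if not any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points):
            optimal.append(p)
    return optimal

# Toy design points: (area in 4-input LUTs, latency in ns).
designs = [(500, 900), (800, 300), (2500, 120), (900, 350), (2600, 700)]
assert pareto_optimal(designs) == [(500, 900), (800, 300), (2500, 120)]
```

The surviving points are exactly the designs worth considering once both metrics matter; all others are dominated.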
Figure 3.4: Area versus bitwidth for sin(x) with TABLE+POLY. OPT indicates
the metric for which the design is optimized. Lower part: LUTs for logic; small
top part: LUTs for routing.
Figure 3.5: Latency versus bitwidth for sin(x) with TABLE+POLY. Shows the
impact of latency optimization.
Figure 3.6: Throughput versus bitwidth for sin(x) with TABLE+POLY. Shows
the impact of throughput optimization.
Figure 3.7: Latency versus area for 12-bit approximations to sin(x). The
Pareto-optimal points [124] in the latency-area space are shown.
Figure 3.8: Latency versus throughput for 12-bit approximations to sin(x). The
Pareto-optimal points in the latency-throughput space are shown.
Figure 3.9: Area versus throughput for 12-bit approximations to sin(x). The
Pareto-optimal points in the throughput-area space are shown.
In Figures 3.10 to 3.12, we fix the approximation method to TABLE+POLY,
and obtain area, latency and throughput results for all three functions at various
bitwidths. Optimization methods are used for all three experiments (e.g. area is
optimized to get the area results).
From Figure 3.10, we observe that sin(x) requires the most and 2x requires the
least area. The difference gets more apparent as the bitwidth increases. This is
because 2x is the most linear of the three functions, and hence requires fewer
segments for the approximation. This leads to a reduction in the number of
entries in the coefficient table and hence less area on the device.
Figure 3.11 shows the variations of the latency with the bitwidth. We observe
that all three functions have similar behavior. In Figure 3.12, we observe that
again the three functions have similar behavior, with 2x performing slightly better
than others for bitwidths higher than 16 bits. We suspect that this is because of
the lower area requirement of 2x, which leads to less routing delay.
Figures 3.13 to 3.15 show the main emphasis and contribution of this chap-
ter, illustrating which approximation method to use for the best area, latency or
throughput performance. We fix the function to sin(x) and obtain results for all
three methods at various bitwidths. Again, area/latency/throughput optimiza-
tions are performed for a given experiment. For experiments involving TABLE,
we have managed to obtain results up to 12 bits only, due to memory limitations
of our PCs.
From Figure 3.13, we observe that TABLE has the least area at 8 bits, but the
area increases rapidly, making it less desirable at higher bitwidths. The reason for
this is the exponential growth of the table size with the input size for full look-up tables.
The TABLE+POLY approach yields the least area for precisions higher than
eight bits. This is due to the efficiency of using multiple segments with minimax
coefficients. We have observed that for POLY, roughly one more polynomial term
(i.e. one more multiply-and-add module) is needed every four bits. Hence, we
see a linear behavior with the POLY curve. We are unable to generate TABLE
results beyond 12 bits, due to the device size restrictions.
Figure 3.14 shows that TABLE has significantly smaller latency than others.
We expect that this will be the case for bitwidths higher than 12 bits as well.
POLY has the worst delay, which is due to computations involving high-degree
polynomials, and the terms of the polynomials increase with the bitwidth. The
latency for TABLE+POLY is relatively low across all bitwidths, because the
number of memory accesses and polynomial degree are fixed.
In Figure 3.15, we observe how the throughput varies with bitwidth. For
low bitwidths, TABLE designs result in the best throughput, which is due to
the short delay for a single memory access. However, the performance quickly
degrades and we predict that at bitwidths higher than 12 bits, it will perform
worse than the other two methods due to rapid increase in routing congestion.
The performance of TABLE+POLY is better than that of POLY below 15 bits and
worse above. This is due to the increase in the size of the table with precision,
which leads to longer delays for memory accesses.
Figure 3.10: Area versus bitwidth for the three functions with TABLE+POLY.
Lower part: LUTs for logic; small top part: LUTs for routing.
Figure 3.11: Latency versus bitwidth for the three functions with TABLE+POLY.
Figure 3.12: Throughput versus bitwidth for the three functions with
TABLE+POLY. Throughput is similar across functions, as expected.
Figure 3.13: Area versus bitwidth for sin(x) with the three methods. Note that
the TABLE method gets too large already for 14 bits.
Figure 3.14: Latency versus bitwidth for sin(x) with the three methods.
[Figure: throughput (Mbps) versus bitwidth (8-24 bits) for sin(x) with TABLE, POLY and TABLE+POLY, throughput-optimized designs.]
Figure 3.15: Throughput versus bitwidth for sin(x) with the three methods.
3.7 Summary
We have presented a methodology for the automation of function evaluation
unit design, covering table look-up, table-with-polynomial and polynomial-only
methods. An implementation of a partially automated system for design space
exploration of function evaluation in hardware has been demonstrated, including
algorithmic design space exploration with MATLAB and hardware design space
exploration with ASC. We have also compared block RAMs, LUT memory and
LUT logic implementations for storing mathematical look-up tables generated by
MATLAB. It is observed that the logic minimized LUT implementation of the
tables minimizes latency and area, while keeping comparable throughput to the
other methods.
Method selection results for sin(x), log(1 + x) and 2^x have been shown, and area and speed results for area-, latency- and throughput-optimized designs have been examined, demonstrating that an optimal method does indeed exist for a given function, precision and metric. We conclude that the automation of function
evaluation unit design is within reach, even though there are many remaining
issues for further study, which are discussed in Chapter 10.
CHAPTER 4
Adaptive Range Reduction
for Function Evaluation
4.1 Introduction
One of the main challenges in function evaluation is to provide a programming
tool or library that delivers the best function evaluation unit for a given function,
with the associated input and output range and precision. In Chapter 3, we have
shown the connection between precision and function evaluation methods. This
chapter focuses on adaptive range reduction, which transforms the input domain
into a smaller manageable range, such as the ranges used for the functions in
Chapter 3. The main contributions of this chapter are:
• Framework for adaptive range reduction, based on a parametric function evaluation library, on function approximation by polynomials and tables, and on pre-computing all possible input and output ranges.
• Implementation of design space exploration for adaptive range reduction, using MATLAB to produce function evaluation parameters for hardware designs targeting the ASC system.
• Evaluation of the proposed approach by exploring the effects of range reduction for several arithmetic functions, such as sin(x) and log(x), on throughput, latency and area for FPGA designs accurate to one ulp.
The rest of this chapter is organized as follows. Section 4.2 covers overview
and background material. Section 4.3 shows the design of the adaptive func-
tion evaluation library for ASC. Section 4.4 presents the implementation of the
algorithmic design space exploration with MATLAB, ASC library code genera-
tion, and the automation of the ASC design space exploration process optimizing
area, latency or throughput. Section 4.5 discusses results, and Section 4.6 offers
summary and thoughts on future work.
4.2 Overview
We evaluate an elementary function f(x), where x and f(x) have a given range
[a, b] and precision requirement. The evaluation typically consists of three steps [128]:
(1) range reduction, reducing x over the interval [a, b] to a more convenient y
over a smaller interval [a′, b′],
(2) function evaluation on the reduced interval, and
(3) range reconstruction: expansion of the result back to the original result
range.
There are two main types of range reduction:
• additive reduction: y = x − mC;
• multiplicative reduction: y = x/C^m,
where the integer m and the constant C are defined by the evaluated function.
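Both reduction types can be exercised in software. The following Python sketch (an illustration, not part of the ASC library) applies additive reduction with C = 2π, as used for sin(x), and multiplicative reduction with C = 2, as used for log(x), and checks the reconstruction identities:

```python
import math

def additive_reduce(x, C):
    """Additive reduction: y = x - m*C with integer m = floor(x / C)."""
    m = math.floor(x / C)
    return x - m * C, m

def multiplicative_reduce(x, C):
    """Multiplicative reduction: y = x / C**m, choosing m so that y lands in [1/C, 1)."""
    m, y = 0, x
    while y >= 1.0:
        y /= C
        m += 1
    while y < 1.0 / C:
        y *= C
        m -= 1
    return y, m

# sin: reduce modulo 2*pi, reconstruction uses periodicity
y, m = additive_reduce(10.0, 2 * math.pi)
assert 0 <= y < 2 * math.pi
assert math.isclose(math.sin(y), math.sin(10.0))

# log: x = y * 2**m with y in [0.5, 1), so log(x) = log(y) + m*log(2)
y, m = multiplicative_reduce(10.0, 2.0)
assert 0.5 <= y < 1.0
assert math.isclose(math.log(y) + m * math.log(2), math.log(10.0))
```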
Range reduction is widely studied, especially for CORDIC [182] and floating-point number systems on microprocessors [25]. In contrast, range reduction which adapts to different input ranges and precisions has received little attention. To the best of our knowledge, this is the first work that deals with this issue. Our design flow is illustrated in Figure 4.1.
[Figure: design flow. Library construction: the function f(x), input format and method are supplied to MATLAB, which approximates f(x); a Perl script (the Library Generator) assembles the resulting ASC code into the Function Evaluation Library (ASC Lib); the Hardware Compiler (ASC) produces the FPGA implementations. Library usage: the user indexes into the library.]
Figure 4.1: Design flow: MATLAB generates all the ASC code for the library. The user simply indexes into the library to obtain the specific function approximation unit.
4.3 Design
This section describes our approach for adaptive range reduction. Section 4.3.1
provides an overview. Section 4.3.2 describes the degrees of freedom for choosing
different parameters in our method.
4.3.1 Design Overview
Figure 4.1 shows the design flow of this research. The function of interest, its
input range and precision, and evaluation method are supplied to our MATLAB
program, which automatically designs the function approximator and produces
its hardware description. In our case, MATLAB produces code for ASC. This
large collection of ASC functions is then transformed by a Perl script into an
ASC function evaluation library (ASC lib). ASC then takes care of design space
exploration on the architecture level, the arithmetic level, and the gate level of
abstraction. The result is an optimized function evaluation library for computing
with FPGAs.
Given a function f(x) and an interval [a, b] we approximate the function with
polynomials and tables. Tasks in designing a function evaluation library include
automating the selection of range reduction, the selection and design of the func-
tion evaluation method, and area, latency and throughput optimizations on the
lower levels of abstraction. This section shows how we design a function evalua-
tion library that contains optimized implementations for a large number of range
and precision combinations.
The conventional way of implementing function evaluation is shown below for
the three functions evaluated in this chapter. We use ASC code notation [123]
in Figure 4.2 to show various methods of function evaluation including range
reduction and range reconstruction. Figures 4.3, 4.4 and 4.5 show the circuit
diagrams.
The code in Figure 4.2 shows, as an example, a different function evaluation method for each function; in reality we use many combinations of method and function. sin(x) is an instance of additive reduction, whereas log(x) and √x are instances of multiplicative reduction.
Evaluating f(x) = sin(x)
// Range Reduction
x1 = abs(x) % (2*pi);
x2 = IF(x1>pi, x1-pi, x1);
y = IF(x2>(pi/2), pi-x2, x2);
// Evaluation Method
// f(y) where y = [0,pi/2)
// e.g. polynomial-only (po)
f1 = (a*y+b)*y+c;
// Range Reconstruction
f = IF(x1>pi, -f1, f1);
Evaluating f(x) = log(x)
// Range Reduction
exp = LeadingOneDetect(x)-FracWidth(x);
y = x >> exp;
// Evaluation Method
// f(y) where y = [0.5,1)
// e.g. table+degree-1-polynomial (tp1)
f1 = Table1[y]*y+Table2[y];
// Range Reconstruction
f = f1+exp*log(2);
Evaluating f(x) = √x
// Range Reduction
exp = LeadingOneDetect(x)-FracWidth(x);
x1 = x >> exp;
y = IF(exp[0], x1 >> 1, x1);
// Evaluation Method
// f(y) where y = [0.25,1)
// e.g. table+degree-2-polynomial (tp2)
f1 = (Table1[y]*y+Table2[y])*y+Table3[y];
// Range Reconstruction
exp1 = IF(exp[0], exp+1 >> 1, exp >> 1);
f = f1 << exp1;
Figure 4.2: Description of range reduction, evaluation method and range reconstruction for the three functions sin(x), log(x) and √x.
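The sin(x) flow can be checked numerically. This Python sketch (illustrative only) mirrors the reduce / approximate / reconstruct steps for non-negative x, negating the result on the upper half-period, consistent with the sign logic of the generated code in Figure 4.9:

```python
import math

def sin_reduced(x):
    """Evaluate sin(x) for x >= 0 via reduce / approximate / reconstruct."""
    x1 = x % (2 * math.pi)                        # range reduction to [0, 2*pi)
    x2 = x1 - math.pi if x1 > math.pi else x1     # fold to [0, pi)
    y = math.pi - x2 if x2 > math.pi / 2 else x2  # fold to [0, pi/2]
    f1 = math.sin(y)                              # stand-in for the approximation unit
    return -f1 if x1 > math.pi else f1            # range reconstruction

for x in [0.0, 0.3, 1.7, 3.5, 5.0, 12.0]:
    assert math.isclose(sin_reduced(x), math.sin(x), abs_tol=1e-12)
```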
[Figure: circuit with range reduction (mod 2π, comparisons against π and π/2), an approximation unit computing g(y), and range reconstruction by conditional negation.]
Figure 4.3: Circuit for evaluating sin(x).
[Figure: circuit with range reduction (leading-one detector, shift x >> exp), an approximation unit computing g(y), and range reconstruction adding exp × log(2).]
Figure 4.4: Circuit for evaluating log(x).
[Figure: circuit with range reduction (leading-one detector, shift x >> exp, extra shift by one when exp[0] is set), an approximation unit computing g(y), and range reconstruction shifting g(y) left by exp1.]
Figure 4.5: Circuit for evaluating √x.
Figure 4.6 shows the functions over the range reduced intervals. We observe that the functions behave almost linearly over these intervals.
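The near-linearity can be quantified with a short script (an illustration; the bounds below are loose): the maximum deviation of each function from the straight line through its interval endpoints is small relative to the function's span.

```python
import math

def max_dev_from_chord(f, a, b, n=1000):
    """Maximum deviation of f from the straight line through (a, f(a)) and (b, f(b))."""
    fa, fb = f(a), f(b)
    return max(abs(f(a + (b - a) * i / n) - (fa + (fb - fa) * i / n))
               for i in range(n + 1))

assert max_dev_from_chord(math.sin, 0.0, math.pi / 2) < 0.25   # sin on [0, pi/2]
assert max_dev_from_chord(math.log, 0.5, 1.0) < 0.07           # log on [0.5, 1]
assert max_dev_from_chord(math.sqrt, 0.25, 1.0) < 0.05         # sqrt on [0.25, 1]
```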
[Figure: three panels plotting sin(y) on [0, π/2], log(y) on [0.5, 1] and √y on [0.25, 1].]
Figure 4.6: Plot of the three functions over the range reduced intervals.
The central contribution of this chapter lies in reconsidering the above struc-
ture for user-defined fixed-point bitwidths. When programming FPGAs one can
select any bitwidth for the integer part and the fractional part of the fixed-point
number. As a consequence, a function evaluation library obtains the range and
precision of the input and can use this information to produce an optimized function evaluation unit. Previous work [122] addresses the subproblem of how to select function evaluation methods based on precision. In this work we add the issues of input/output range and range reduction. Based on input range and precision
we now have the following degrees of freedom:
1. applicability of range reduction
2. evaluation method selection
3. evaluation method design
• find minimal bitwidths
• find minimal polynomial degree
(for polynomial-only method)
• find minimal segments
(for table-with-polynomial method)
4. optimize: area, latency or throughput
The ASC function evaluation library takes the range, precision and optimiza-
tion metric, and instantiates one of many instances of the corresponding function
evaluation unit.
4.3.2 Degrees of Freedom
Applicability of Range Reduction
Assume we require a hardware unit to compute sin(x) and x is a fixed-point
variable with four integer bits and eight fraction bits. Then the range of the
input is [0, 16) and the expected range of the output is [−1, 1]. The precision of
the input and output is 2−8 which also sets the ulp (unit in last place). Given a
particular function that we want to evaluate, we can decide whether it is necessary
to implement range reduction or not. In order to make the correct decision we
need to consider the optimization metric (area, latency or throughput), design a
function evaluation unit with and without range reduction, and select the best
one.
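The format arithmetic in this example is simple to state explicitly (Python, for illustration):

```python
# Fixed-point format from the sin(x) example: 4 integer bits and 8 fraction bits
int_bits, frac_bits = 4, 8
input_range = (0, 2 ** int_bits)      # representable input range [0, 16)
ulp = 2.0 ** -frac_bits               # precision: unit in the last place

assert input_range == (0, 16)
assert ulp == 0.00390625
```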
In practice, we actually pre-compute all possible input ranges and store for
each function a particular range r so that for all input ranges smaller than r we do
not use range reduction, and for all input ranges above r we use range reduction.
We obtain the graphs which determine r after place-and-route. Section 4.5 shows
the detailed graphs of this step.
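Conceptually, the pre-computed decision reduces to a per-function threshold test. The sketch below uses hypothetical threshold values of r consistent with the observations in Section 4.5 (about six bits for sin(x), about two bits for log(x)):

```python
# Hypothetical per-function thresholds r (example values suggested by Section 4.5)
THRESHOLD_R = {"sin": 6, "log": 2}

def use_range_reduction(func_name, input_range_bits):
    """Apply range reduction only when the input range reaches the threshold r."""
    return input_range_bits >= THRESHOLD_R[func_name]

assert use_range_reduction("log", 2) is True
assert use_range_reduction("sin", 4) is False
```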
Evaluation Method Selection
As discussed in Section 2.3 in Chapter 2, there are many possible methods for
evaluating functions. In this chapter we explore polynomial-only (po) and table-
with-polynomial methods with polynomials of degree two to four (tp2∼tp4 ). The
architecture for an approximation unit with a table-with-polynomial scheme is
shown in Figure 4.8. The polynomial coefficients are found in a minimax sense.
For the table-with-polynomial approach, the input interval is split into 2k equally
sized segments. The k leftmost bits of the argument y serve as the index into the
table, which holds the coefficients for that particular interval. Segmentation for
evaluating log(y) with eight uniform segments (k = 3) is illustrated in Figure 4.7.
Note that for the polynomial-only approach, there would be just one entry (coef-
ficient) in the table and no addressing bits. The table-with-polynomial methods
(tp) trade off table area versus polynomial area.
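A software model of the tp2 scheme makes the indexing concrete. This Python sketch (illustrative; it fits interpolating quadratics rather than true minimax polynomials) splits [0.5, 1) into eight uniform segments for log(y), indexes the coefficient table by segment, and evaluates with Horner's rule:

```python
import math

K = 3                       # 2**K = 8 uniform segments over [0.5, 1)
A, B = 0.5, 1.0

def fit_quadratic(f, x0, x2):
    """Interpolating quadratic through segment endpoints and midpoint
    (a stand-in for the minimax fit used in the actual tool)."""
    x1 = (x0 + x2) / 2
    y0, y1, y2 = f(x0), f(x1), f(x2)
    d1 = (y1 - y0) / (x1 - x0)
    d2 = ((y2 - y1) / (x2 - x1) - d1) / (x2 - x0)
    return d2, d1 - d2 * (x0 + x1), y0 - d1 * x0 + d2 * x0 * x1  # c2, c1, c0

width = (B - A) / 2**K
TABLE = [fit_quadratic(math.log, A + s * width, A + (s + 1) * width)
         for s in range(2**K)]

def log_tp2(y):
    seg = int((y - A) / width)           # the k leftmost fraction bits in hardware
    c2, c1, c0 = TABLE[min(seg, 2**K - 1)]
    return (c2 * y + c1) * y + c0        # Horner's rule, as in Figure 4.8

assert abs(log_tp2(0.7) - math.log(0.7)) < 1e-4
```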
Evaluation Method Design
Once we know which method to use, we need to design the optimized unit. For
the polynomial-only method we can find the minimal degree of the polynomial
that will satisfy the required output precision. Then we need to find the optimized bitwidths of the computation inside the function evaluation units for all the methods.
[Figure: log(y) on [0.5, 1) divided into eight uniform segments with boundaries at 0.5, 0.5625, 0.625, 0.6875, 0.75, 0.8125, 0.875, 0.9375 and 1.]
Figure 4.7: Segmentation for evaluating log(y) with eight uniform segments. The leftmost three bits of the inputs are used as the segment index.
Optimize: Area, Latency or Throughput
While the selections for the previous degrees of freedom are pre-computed with MATLAB, the area, latency and throughput optimizations at the arithmetic and gate levels can be left to the compiler (as discussed in Section 3.2 in Chapter 3). The next section, on the implementation, contains details on how this is achieved.
4.4 Implementation
This section presents the implementation of the algorithmic design space explo-
ration with MATLAB, ASC library code generation, and the automation of the
ASC design space exploration process optimizing area, latency or throughput.
[Figure: datapath with a table of 2^k coefficient entries (c_0 ... c_d, bitwidths w_0 ... w_d) addressed by the k leftmost bits of the j-bit input y; the remaining j − k bits feed the polynomial datapath that produces g(y).]
Figure 4.8: Architecture of table-with-polynomial unit for degree d polynomials. Horner’s rule is used to evaluate the polynomials.
4.4.1 Algorithmic Design Space Exploration
We use MATLAB to generate a large number of implementations for function
evaluation. We consider several function evaluation methods: polynomial-only
(po), and table-with-polynomial of degree two to four (tp2∼tp4 ). For a given
function and any range/precision pair, the MATLAB code generates polyno-
mial coefficients which form entries of the look-up tables based on the Remez
method [128]. The range and precision are represented by the integer and fraction
bitwidths respectively. The Remez method computes the minimax coefficients
that minimize the maximum absolute error over an interval. In this fashion we
also obtain minimal bitwidths and the minimal number of polynomial terms for
the po method. For tp methods, we find the minimal table size and the coefficient
bitwidths for the given range and precision.
The structure of the 2000 lines of MATLAB code, outlined below, provides this functionality.
// for a given function f, input format i,
// method m and polynomial degree d
if (m==‘po’) // for polynomial-only
// find minimum polynomial degree
min_degree = find_min_degree(f,i);
// find minimum internal fraction bitwidth
int_bw = find_min_int_bw(f,i,min_degree);
// generate polynomial coefficients
coeffs = gen_coeffs(f,i,min_degree,int_bw);
// generate ASC code
gen_ASC(f,i,min_degree,int_bw,coeffs);
elseif (m==‘tp’) // for table-with-polynomial
// find minimum number of segments
min_segs = find_min_segs(f,i,d);
// find minimum internal fraction bitwidth
int_bw = find_min_int_bw(f,i,d,min_segs);
// generate coefficient look-up table
table = gen_table(f,i,d,int_bw,min_segs);
// generate ASC code
gen_ASC(f,i,d,int_bw,table,min_segs);
end
For this implementation we use a uniform bitwidth for the internal datapath fractions. This minimum bitwidth is found using a binary search method.
In the future we hope to support non-uniform minimum bitwidth by using more
advanced bitwidth minimization techniques such as BitSize [47]. The ASC code
automatically generated for evaluating sin(x) for range 8 bits and precision 8 bits
with tp2 is shown in Figure 4.9.
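The binary search can be sketched as follows (Python; `meets_precision` is a hypothetical stand-in for the fixed-point datapath simulation performed in MATLAB, using illustrative coefficients):

```python
def quantize(v, frac_bits):
    """Round v onto a fixed-point grid with frac_bits fraction bits."""
    scale = 1 << frac_bits
    return round(v * scale) / scale

def meets_precision(frac_bits, target_frac_bits):
    """Hypothetical check: a quantized degree-2 Horner evaluation must stay
    within one ulp (2**-target_frac_bits) of the exact polynomial."""
    ulp = 2.0 ** -target_frac_bits
    c2, c1, c0 = -0.1181640625, 0.2744140625, 0.8408203125  # illustrative coefficients
    for i in range(100):
        y = i / 100.0
        acc = quantize(quantize(c2 * y, frac_bits) + c1, frac_bits)
        acc = quantize(quantize(acc * y, frac_bits) + c0, frac_bits)
        exact = (c2 * y + c1) * y + c0
        if abs(acc - exact) > ulp:
            return False
    return True

def min_fraction_bitwidth(target_frac_bits, lo=4, hi=32):
    """Binary search for the smallest internal fraction bitwidth that passes."""
    while lo < hi:
        mid = (lo + hi) // 2
        if meets_precision(mid, target_frac_bits):
            hi = mid
        else:
            lo = mid + 1
    return lo

bw = min_fraction_bitwidth(8)
assert meets_precision(bw, 8)
```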
// ASC code for evaluating sin(x) in fixed-point with range reduction

// Range Reduction
HWfix &HWreduce_HWsin_16_8_tp2_wrr(HWfix &x, HWint &sign_x, HWint &temp_sign) {
  const double pit2_const = 6.2832031250000000000000000e+000; // Pi * 2
  const double pi_const   = 3.1416015625000000000000000e+000; // Pi
  const double pio2_const = 1.5703125000000000000000000e+000; // Pi / 2

  HWfix &pit2 = *(new HWfix(TMP, 13, 10, UNSIGNED)); HWfix &pi = *(new HWfix(TMP, 12, 10, UNSIGNED));
  HWfix &pio2 = *(new HWfix(TMP, 11, 10, UNSIGNED)); HWfix &abs_x = *(new HWfix(TMP, 18, 8));
  HWfix &x1 = *(new HWfix(TMP, 14, 10)); HWfix &x2 = *(new HWfix(TMP, 13, 10));
  HWfix &x3 = *(new HWfix(TMP, 12, 10));

  pit2 = pit2_const; pi = pi_const; pio2 = pio2_const;

  sign_x = x[15]; abs_x = HWabs(x);
  x1 = abs_x % pit2;
  temp_sign = x1 > pi;
  x2 = IF(temp_sign, x1-pi, x1);
  x3 = IF(x2 > pio2, pi-x2, x2);
  return x3;
}

// Approximation
HWfix &HWapproximate_HWsin_16_8_tp2_wrr(HWfix &reduced_x) {
  double c2_init[4] = {-3.125000000e-002, -8.496093750e-002, -1.181640625e-001, -1.220703125e-001};
  double c1_init[4] = { 5.126953125e-001,  4.482421875e-001,  2.744140625e-001,  3.417968750e-002};
  double c0_init[4] = {-9.765625000e-004,  4.785156250e-001,  8.408203125e-001,  9.980468750e-001};

  HWfix &reduced_x_temp = *(new HWfix(TMP, 14, 10)); HWfix &temp1 = *(new HWfix(TMP, 11, 10));
  HWfix &temp2 = *(new HWfix(TMP, 11, 10)); HWfix &_x = *(new HWfix(TMP, 8, 7));
  HWfix &c2 = *(new HWfix(TMP, 12, 11)); HWfix &c1 = *(new HWfix(TMP, 12, 11));
  HWfix &c0 = *(new HWfix(TMP, 12, 11)); HWfix &dp1 = *(new HWfix(TMP, 11, 10));
  HWfix &dp2 = *(new HWfix(TMP, 11, 10)); HWfix &_dp2 = *(new HWfix(TMP, 12, 11));
  HWfix &dp3 = *(new HWfix(TMP, 11, 10)); HWfix &dp4 = *(new HWfix(TMP, 10, 8));
  HWint &coeff_addr = *(new HWint(TMP, 2, UNSIGNED));

  HWvector<HWfix> &c2_mem = *new HWvector<HWfix>(4, new HWfix(TMP, 11, 10), c2_init);
  HWvector<HWfix> &c1_mem = *new HWvector<HWfix>(4, new HWfix(TMP, 11, 10), c1_init);
  HWvector<HWfix> &c0_mem = *new HWvector<HWfix>(4, new HWfix(TMP, 11, 10), c0_init);

  reduced_x_temp = reduced_x;
  coeff_addr = reduced_x_temp << 1;            // table index
  temp1 = coeff_addr; temp2 = reduced_x_temp << 1;
  _x = temp2 - temp1;                          // offset within the segment
  c2 = c2_mem[coeff_addr]; c1 = c1_mem[coeff_addr]; c0 = c0_mem[coeff_addr];
  dp1 = _x * c2; dp2 = dp1 + c1; _dp2 = dp2;   // Horner's rule
  dp3 = _x * _dp2; dp4 = dp3 + c0;
  return dp4;
}

// Range Reconstruction
HWfix &HWreconstruct_HWsin_16_8_tp2_wrr(HWfix &approximated_x, HWint &sign_x, HWint &temp_sign) {
  HWfix &fx = *(new HWfix(TMP, 10, 8));
  fx = IF(temp_sign==sign_x, approximated_x, -approximated_x);
  return fx;
}

// Evaluation
HWfix &HWsin_16_8_tp2_wrr(HWfix &x) {
  HWfix &fx = *(new HWfix(TMP, 10, 8)); HWfix &reduced_x = *(new HWfix(TMP, 12, 10));
  HWfix &approximated_x = *(new HWfix(TMP, 10, 8)); HWint &sign_x = *(new HWint(TMP, 1, UNSIGNED));
  HWint &temp_sign = *(new HWint(TMP, 1, UNSIGNED));

  // Range Reduction
  reduced_x = HWreduce_HWsin_16_8_tp2_wrr(x, sign_x, temp_sign);
  // Approximation
  approximated_x = HWapproximate_HWsin_16_8_tp2_wrr(reduced_x);
  // Range Reconstruction
  fx = HWreconstruct_HWsin_16_8_tp2_wrr(approximated_x, sign_x, temp_sign);
  return fx;
}

Figure 4.9: ASC code for evaluating sin(x) for range 8 bits and precision 8 bits with tp2. This code is automatically generated from our MATLAB tool.
4.4.2 ASC Code Generation and Optimizations
ASC code makes use of C++ syntax and ASC semantics which allow the user
to program on the architecture-level, the arithmetic-level and the gate-level. As
a consequence ASC code provides the productivity of high-level hardware design
tools and the performance of low-level optimized hardware design. ASC pro-
vides types and operators to enable research on custom data representation and
arithmetic. Currently supported types are HWint, HWfix and HWfloat. For this
chapter we use the HWfix type which is defined as follows:
HWfix x(TMP,size,fract_size,sign_mode);
All results in this chapter are given for sign-magnitude representation which
makes most sense for range reduction. ASC provides operator-level optimizations
of area, latency, and throughput, which is referred to below as the optimization
mode.
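A minimal software model of such a type (an assumption for illustration, not the actual ASC implementation) shows the role of the size and fraction parameters under sign-magnitude representation:

```python
class Fix:
    """Toy model of a sign-magnitude fixed-point type: 'size' total bits
    (one of which is the sign), 'fract_size' fraction bits."""
    def __init__(self, size, fract_size):
        self.size, self.fract = size, fract_size

    def quantize(self, v):
        scale = 1 << self.fract
        mag = min(abs(round(v * scale)), (1 << (self.size - 1)) - 1)  # clamp magnitude
        return (mag / scale) * (1 if v >= 0 else -1)

x = Fix(size=12, fract_size=8)
assert x.quantize(3.14159265) == 3.140625   # rounded to the 2**-8 grid
assert x.quantize(-0.5) == -0.5
```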
As a result of this work, ASC provides a function evaluation library call of
the form
y = HWsin(x);
In order to create an optimizing function evaluation library, we utilize MATLAB
to generate a vast amount of ASC code. This ASC code forms a two-dimensional
matrix, which is indexed by range and precision of the argument to the function
evaluation call. Each matrix entry consists of a pointer to an ASC function which
is called for the particular input x.
Note that for each function we determine two design selection matrices: for
minimum area (Figure 4.10) and for minimum latency (Figure 4.11) as shown
in Section 4.5. The HWsin(x) call indexes into the matrix to find the optimized
ASC implementation. For instance, from Figure 4.10, for a √x design with 12-bit range and 16-bit precision, the smallest implementation would be tp3.
The function evaluation code, for example for log(x), then indexes into the
matrix of function pointers (HWlog_matrix) and accesses the correct function
based on input range and precision:
HWfix &HWlog( HWfix &x ) {
  return HWlog_matrix[x.range][x.precision](x);
}
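In software terms, this dispatch resembles a map from (range, precision) to a pre-built evaluator. The Python sketch below (the unit names and entries are hypothetical) illustrates the indexing:

```python
import math

def log_unit_8_8(x):    # hypothetical pre-generated unit (8-bit range, 8-bit precision)
    return math.log(x)

def log_unit_12_16(x):  # hypothetical pre-generated unit (12-bit range, 16-bit precision)
    return math.log(x)

# two-dimensional matrix indexed by (range, precision) of the argument
HWlog_matrix = {(8, 8): log_unit_8_8, (12, 16): log_unit_12_16}

def HWlog(x, range_bits, precision_bits):
    return HWlog_matrix[(range_bits, precision_bits)](x)

assert math.isclose(HWlog(2.0, 8, 8), math.log(2.0))
```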
Altogether, the 2000 lines of MATLAB code generate 300,000 lines of ASC code, resulting in over 1000 designs with a total of over 40 million Xilinx equivalent circuit gates. This process took a few days on two Intel Xeon 2.6GHz dual-processor PCs, each fitted with 4GB DDR-SDRAM.
4.5 Results
After applying the method in Section 4.4.2, 1000 distinct designs are placed and routed on a Xilinx Virtex-II XC2V6000-6 device. These result in over 150
graphs/figures. We summarize all the results in two matrices which show the
Pareto-optimal solutions in Figure 4.10 for area and Figure 4.11 for latency. In
essence, these matrices tell us for each combination of range and precision of the
input which method to use for the three functions. Note that we use the term
range reduction to also include range reconstruction.
The remaining result figures show a sample of the graphs that we used to
arrive at the decisions presented in the matrices above.
Figures 4.12, 4.13, 4.14 and 4.15 show the area cost of range reduction for
sin(x) and log(x) implemented using po and tp3 methods. The lower part of the
bars shows LUTs used for function evaluation, and the small upper part shows
the LUTs used for range reduction. These figures show that the percentage area
used by range reduction increases with precision and range. Comparing sin(x)
with log(x), the cost of range reduction with increasing range is large for sin(x),
due to the use of the modulus operation which incorporates a divider. In contrast,
log(x) uses a barrel shifter to do the range reduction.
To decide when to use range reduction, we consider Figures 4.16, 4.17, 4.18
and 4.19, which show the area and latency results for sin(x) and log(x) evaluated
using range reduction (WRR) and without range reduction (WOR). In the case of
evaluating with WOR, we approximate the function over the entire user defined
range with the given methods (tp2∼tp4 ).
Considering the area for sin(x), WOR has a lower LUT usage than WRR
when the range is less than six bits. In the case of log(x), we observe that even
for ranges as low as two bits, the LUT usage for WOR is significantly higher
than WRR and this gap increases with range. This is due to the non-linear
region of log(x) near zero which requires more segments to approximate with
WOR. Considering the latency results for sin(x) and log(x), WOR is always
faster than the corresponding WRR method. This is due to the absence of the
range reduction step.
Figures 4.20 and 4.21 highlight the area and latency tradeoffs where the area
increase with precision is smaller for area optimized designs, and the latency
increase is smaller for latency optimized designs. Figures 4.22 and 4.23 show a
similar tradeoff when we consider the range while keeping the precision fixed.
By looking at these figures along with other figures, we are able to create the
resulting matrices in Figures 4.10 and 4.11. From the two matrices, we observe
that tp2 is usually the most attractive solution. This result is not too surprising, since second-order polynomials are known to give good tradeoffs between table size and circuit complexity for the bitwidths we target in this chapter. But when the precision requirement is high (such as 16 bits in Figure 4.10), we see that tp3 gives the smallest area. This is because at low precision requirements, table sizes are manageable with low-order polynomials; however, table sizes increase rapidly with precision, at which point higher-order polynomials result in significantly smaller tables.
[Figure: 4×4 matrix indexed by input range (4-16 bits) and precision (4-16 bits); each cell lists the minimum-area method for sin, log and sqrt. Nearly all cells are tp2; po appears in a few cells, and sqrt uses tp3 in two cells.]
Figure 4.10: Area matrix which tells us for each input range/precision combination which design to use for minimum area.
[Figure: 4×4 matrix indexed by input range (4-16 bits) and precision (4-16 bits); each cell lists the minimum-latency method for sin, log and sqrt. Nearly all cells are tp2, with po in a few cells.]
Figure 4.11: Latency matrix which tells us for each input range/precision combination which design to use for minimum latency.
[Figure: area (4-input LUTs) versus range (4-16 bits) for sin(x) with po, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.12: Area cost of range reduction (upper part) for sin(x) implemented using po with the designs optimized for area.
[Figure: area (4-input LUTs) versus range (4-16 bits) for sin(x) with tp3, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.13: Area cost of range reduction (upper part) for sin(x) implemented using tp3 with the designs optimized for area.
[Figure: area (4-input LUTs) versus range (4-16 bits) for log(x) with po, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.14: Area cost of range reduction (upper part) for log(x) implemented using po with the designs optimized for area.
[Figure: area (4-input LUTs) versus range (4-16 bits) for log(x) with tp3, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.15: Area cost of range reduction (upper part) for log(x) implemented using tp3 with the designs optimized for area.
[Figure: area (4-input LUTs) versus range (4-8 bits) for sin(x), comparing tp2, tp3 and tp4 with and without range reduction.]
Figure 4.16: Area for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for area.
[Figure: latency (ns) versus range (4-8 bits) for sin(x), comparing tp2, tp3 and tp4 with and without range reduction.]
Figure 4.17: Latency for sin(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for latency.
[Figure: area (4-input LUTs) versus range (2-4 bits) for log(x), comparing tp2, tp3 and tp4 with and without range reduction.]
Figure 4.18: Area for log(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for area.
[Figure: latency (ns) versus range (2-4 bits) for log(x), comparing tp2, tp3 and tp4 with and without range reduction.]
Figure 4.19: Latency for log(x) with precision of eight bits for different methods with (WRR, solid line) and without (WOR, dashed line) range reduction, with the designs optimized for latency.
[Figure: area (4-input LUTs) versus precision (4-16 bits) for sin(x) with tp3, for ranges of 4, 8, 12 and 16 bits, area- and latency-optimized.]
Figure 4.20: Area versus precision for sin(x) using tp3 for different ranges and optimization.
[Figure: latency (ns) versus precision (4-16 bits) for sin(x) with tp3, for ranges of 4, 8, 12 and 16 bits, area- and latency-optimized.]
Figure 4.21: Latency versus precision for sin(x) using tp3 for different ranges and optimization.
[Figure: area (4-input LUTs) versus range (4-16 bits) for sin, sqrt and log with tp2, tp3, tp4 and po, precision fixed at eight bits.]
Figure 4.22: Area versus range for all three functions using different methods with the precision fixed at eight bits optimized for area.
[Figure: latency (ns) versus range (4-16 bits) for sin, sqrt and log with tp2, tp3, tp4 and po, precision fixed at eight bits.]
Figure 4.23: Latency versus range for all three functions using different methods with the precision fixed at eight bits optimized for latency.
[Figure: area (4-input LUTs) versus range (4-16 bits) for sin, sqrt and log with po, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.24: Area versus range for all three functions using po for different precisions optimized for area.
[Figure: latency (ns) versus range (4-16 bits) for sin, sqrt and log with po, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.25: Latency versus range for all three functions using po for different precisions optimized for latency.
[Figure: area (4-input LUTs) versus range (4-16 bits) for sin, sqrt and log with tp3, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.26: Area versus range for all three functions using tp3 for different precisions optimized for area.
[Figure: latency (ns) versus range (4-16 bits) for sin, sqrt and log with tp3, at precisions of 4, 8, 12 and 16 bits.]
Figure 4.27: Latency versus range for all three functions using tp3 for different precisions optimized for latency.
4.6 Summary
We have presented the design space exploration of function evaluation with cus-
tom range and precision. The result is an optimizing function evaluation library
for ASC. The novel aspect of this work is the method and range reduction se-
lection based on range and precision of the input/output variables. The detailed
research issues to which this chapter contributes are:
• exploration of the area and speed tradeoffs of function evaluation with and
without range reduction, using ASC;
• given a function, its input/output range/precision, and an optimization
metric, we automate the decision about whether range reduction helps to
optimize the metric by pre-computing a large library of function evaluation
generators;
• given the above and a decision regarding range reduction, we automate the decision about which evaluation method is best by examining the range/precision/method space and selecting the best method in each case;
• given the method, we automate the decision about which bitwidths and
number of polynomial terms to use by constructing the function evaluation
generators via MATLAB simulation and computation.
In addition, we show the productivity obtained from combining MATLAB with ASC, exploring over 40 million Xilinx equivalent circuit gates in a few days on two Intel Xeon 2.6GHz dual-processor PCs, each fitted with 4GB of DDR-SDRAM.
CHAPTER 5
The Hierarchical Segmentation Method
for Function Evaluation
5.1 Introduction
In Chapters 3 and 4, we presented the evaluation of elementary functions [128].
Range reduction techniques such as those presented in Chapter 4 are used to
bring the input within a linear range. In contrast, little attention has been paid to the efficient approximation of compound functions for special-purpose applications. Examples of such applications include N-body simulation [63], channel coding [74], Gaussian noise generation [86] and image registration [158]. In principle, these compound functions can be evaluated by splitting them into several elementary functions, but this approach results in long delays, propagation of rounding errors and the possibility of catastrophic cancellation [55]. Range reduction is not feasible for compound functions (unless the sub-functions are computed one by one), so highly non-linear regions of a function need to be handled
as well. Since we are looking at the entire function over a given input range,
the advantages of our method increase significantly as compound functions be-
come more complex. We present an efficient adaptive hierarchical segmentation
scheme based on piecewise polynomial approximations that caters well for these
non-linear regions. We illustrate our method with the following four functions:
f1 = √(−log(x))  (5.1)

f2 = x log(x)  (5.2)

f3 = (0.0004x + 0.0002) / (x^4 − 1.96x^3 + 1.348x^2 − 0.378x + 0.0373)  (5.3)

f4 = cos(πx/2)  (5.4)
where x is an n-bit number over [0, 1) of the form 0.x_{n−1}x_{n−2}..x_0. The function f1
is used in the Box-Muller algorithm for the generation of Gaussian noise (Chap-
ter 6), and f2 is commonly used for entropy calculation such as mutual informa-
tion computation in image registration [158]. The trigonometric function f4 is
widely used in many applications including Gaussian noise generation, robot arm
control [109] and direct digital frequency synthesizers [112].
Note that the functions f1 and f2 cannot be computed for x = 0, therefore we
approximate these functions over (0, 1) and generate an exception when x = 0. In
this chapter, we implement an n-bit in, n-bit out system. However, the position
of the decimal (or binary) point in the input and output formats can be different
in order to maximize the precision that can be described.
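For concreteness, the four functions can be written out in Python (a hypothetical software reference model, not part of the hardware design), with the x = 0 exception for f1 and f2 made explicit:

```python
import math

def f1(x):
    # f1 = sqrt(-log(x)); undefined at x = 0, so raise an exception
    if x == 0:
        raise ValueError("f1 is approximated over (0, 1) only")
    return math.sqrt(-math.log(x))

def f2(x):
    # f2 = x log(x); also excluded at x = 0
    if x == 0:
        raise ValueError("f2 is approximated over (0, 1) only")
    return x * math.log(x)

def f3(x):
    # f3 = (0.0004x + 0.0002) / (x^4 - 1.96x^3 + 1.348x^2 - 0.378x + 0.0373)
    num = 0.0004 * x + 0.0002
    den = ((x - 1.96) * x + 1.348) * x * x - 0.378 * x + 0.0373
    return num / den

def f4(x):
    # f4 = cos(pi * x / 2)
    return math.cos(math.pi * x / 2)
```

The denominator of f3 is evaluated in Horner form, mirroring the polynomial evaluation style used by the hardware architecture later in this chapter.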
The principal contribution of this chapter is a systematic method for pro-
ducing fast and efficient hardware function evaluators for both compound and
elementary functions using piecewise polynomial approximations with a hierar-
chical segmentation scheme. The novelties of our work include:
• an algorithm for locating optimum segment boundaries given a function,
input interval and maximum error;
• a scheme for piecewise polynomial approximations with a hierarchy of seg-
ments;
• evaluation of this method with four compound functions;
• hardware architecture and implementation of the proposed method.
The rest of this chapter is organized as follows: Section 5.2 covers related work.
Section 5.3 explains how our algorithm finds the optimum placement of the seg-
ments. Section 5.4 presents our hierarchical segmentation scheme. Section 5.5
describes our hardware architecture. Section 5.6 analyzes the various errors involved in our approximations. Section 5.7 examines the effect of polynomial degrees. Section 5.8 discusses evaluation and results, and Section 5.9 offers a summary.
5.2 Related Work
Approximations using uniform segments are suitable for functions with linear
regions, but are inefficient for non-linear functions, especially when the function
varies exponentially. It is desirable to choose the boundaries of the segments
to cater for the non-linearities of the function. Highly non-linear regions may
need smaller segments than linear regions. This approach minimizes the amount
of storage required to approximate the function, leading to more compact and
efficient designs. We use a hierarchy of uniform segments (US) and powers of two
segments (P2S), that is segments with the size varying by increasing or decreasing
powers of two.
Similar approaches to ours have been proposed for the approximation of the
non-linear functions in logarithmic number systems (LNS). Henkel [61] divides the interval into four arbitrarily placed segments based on the non-linearity of the function. The segment address for a given input is obtained from another function that approximates the segment number. This method only works if the number of segments is small and the desired accuracy is low. Also, the
function for approximating the segment addresses is non-linear, so in effect the
problem has been moved into a different domain. Coleman et al. [26] divide the
input interval into seven P2S that decrease by powers of two, and employ constant
numbers of US nested inside each P2S, which we call P2S(US). Lewis [100] divides
the interval into US that vary by multiples of three, and each US has variable
numbers of uniform segments nested inside, which we call US(US). However, in
both cases the choice of inner and outer segment numbers is done manually, and a
more efficient segmentation can be achieved using our segmentation scheme. We
generalize the idea of hierarchical segmentation and provide a systematic way of
partitioning a function.
5.3 Optimum Placement of Segments
The problem of piecewise approximation with variable segment boundaries has
received considerable attention in the mathematical literature, especially with the
theory of splines [15], [42]. To quote Rice [150]: “The key to the successful use
of splines is to have the location of knots as variables.” This section introduces a
method for computing the optimum placement of segments for function approxi-
mation. We shall use it as a reference in comparing the uniform segment method
and our proposed method, as shown in Table 5.2 (Section 5.4). Let f be a contin-
uous function on [a, b], and let an integer m ≥ 2 specify the number of contiguous
segments into which [a, b] has been partitioned: a = u0 ≤ u1 ≤ ... ≤ um = b.
Let d be a non-negative integer and let P_i denote the set of polynomials p_i of degree at most d. For i = 1, ..., m, define

h_i(u_{i−1}, u_i) = min_{p_i ∈ P_i} max_{u_{i−1} ≤ x ≤ u_i} |f(x) − p_i(x)|.  (5.5)
Let emax = emax(u) = max1≤i≤m hi(ui−1, ui). The segmented minimax approxi-
mation problem is that of minimizing emax over all partitions u of [a, b]. If the
error norm is a non-decreasing function of the length of the interval of approximation, the function to be approximated is continuous, and the goal is to minimize the maximum error norm on each interval, then a balanced error solution is locally optimal. The term “balanced error” means that the error norms
on each interval are equal [81]. One class of algorithms to tackle this problem is
based on the remainder formula [42] and assumes that the (d + 1)th derivative
of f is either of fixed sign or bounded away from zero [140]. However, in many
practical cases this assumption does not hold [138]. Often, the (d + 1)th deriva-
tive may be zero or very small over most of [a, b] except a few points where it has
very large values. This is precisely the case with the non-linear functions we are
approximating.
In Lawson's paper [81], an iterative technique for finding the balanced error solution is presented. However, his technique has a rather serious defect: if at some intermediate step of the algorithm an interval with zero error norm (or one much smaller than the others) is found, the method fails. This turns out to be a common occurrence in various practical applications [138]. Pavlidis and Maika present a better scheme in their paper [142] which results in a suboptimal balanced error solution. It is based on an iteration of the form

u_i^{k+1} = u_i^k + c(e_{i+1}^k − e_i^k),  i = 1, ..., m − 1.  (5.6)

Here u_i^k is the value of the i-th boundary at the k-th iteration, e_i^k is the error on (u_{i−1}^k, u_i^k] and c is an appropriate small positive number. It can be shown that for
sufficiently small c the scheme converges to a solution [142]. A reasonable choice for c is the inverse of the change in error norm divided by the size of the boundary motion that caused it. We have implemented this scheme in MATLAB,
where the function to be approximated f , the interval of approximation [a, b],
the degree d of the polynomial approximations and the number of segments m
are given as inputs. The program outputs the segment boundaries u1..m−1 and
the maximum absolute error emax. Our tests show that this scheme requires
large numbers of iterations for a reasonable value of m and balance criterion
(deviation of errors of the segments), and often fails to converge. In addition, for
our purposes we would like to give f , [a, b], d and emax as inputs and obtain m
and u1..m−1.
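The update (5.6) is straightforward to prototype. The sketch below (a hypothetical Python rendering; `seg_error` is a caller-supplied stand-in for the per-segment error norm) nudges each interior boundary by c times the error imbalance of its two neighboring segments:

```python
def balance_step(u, seg_error, c):
    """One iteration of update (5.6): u[0] and u[-1] are the fixed
    endpoints a and b; seg_error(x1, x2) is the error norm on [x1, x2]."""
    e = [seg_error(u[i], u[i + 1]) for i in range(len(u) - 1)]
    # each interior boundary moves toward the neighbor with the larger error
    return ([u[0]]
            + [u[i] + c * (e[i] - e[i - 1]) for i in range(1, len(u) - 1)]
            + [u[-1]])
```

For f(x) = x^2 with the chord error norm (x2 − x1)^2/4, the balanced solution has equal-length segments, and repeated application of balance_step converges to it for sufficiently small c, illustrating both the convergence behavior and the sensitivity to c described above.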
We have developed a novel algorithm to find the optimum boundaries for a
given f , [a, b], d, emax and unit in the last place (ulp). The ulp is the least
significant bit of the fraction of a number in its standard representation. For
instance, if a number has F fractional bits, the ulp of that number would be
2^−F. The ulp is required as an input, since the input is quantized to n bits.
The MATLAB code for the algorithm is shown in Figure 5.1. The algorithm is
based on binary search and finds the optimum boundaries over [a, b]. We first set
x1 = a and x2 = b and find the minimax approximation over the interval [x1, x2].
If the error e of this approximation is larger than emax, we set x2 = (x1 + x2)/2
and obtain the error for [a, x2]. We keep halving the interval of approximation
until e ≤ emax. At this point we increment x2 by a small amount and compute
the error again. This small amount is either abs(x2 − prev_x2)/2 or the ulp, whichever is smaller (prev_x2 is the value of x2 in the previous iteration). When this small amount is the ulp, in subsequent iterations x2 will keep oscillating around the ideal (un-quantized) boundary. We take the x2 whose error e is just below
emax as our boundary, set x1 = x2 and x2 = b, and move on to approximating
over [x1, x2]. This is performed until the error over [x1, x2] is less than or equal
to emax and x2 has the same value as b. We can see that the boundaries up to the last one are optimum for the given ulp (the last segment is always smaller than its optimum size, as can be seen in Figure 5.2 for f2). Although our segments are
not optimum in the sense that the errors of the segments are not fully balanced,
we can conclude that given the error constraint emax and the ulp, the placement
of our segment boundaries is optimum. This is because the maximum error we
obtain is less than or equal to emax and this is not achievable with fewer segments.
The results of our segmentation can be used for various other applications [138]
including pattern recognition [37], [141], data compression, non-linear filtering
and picture processing [140].
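The essence of the algorithm can be sketched as follows (a simplified Python version under stated assumptions: the chord-based `err` is only a stand-in for a true minimax fit, and the halve/grow refinement is condensed relative to Figure 5.1):

```python
def segment_boundaries(f, a, b, e_max, ulp):
    """Greedy left-to-right segmentation: each segment is grown to the
    largest width whose approximation error stays within e_max."""
    def err(x1, x2):
        # stand-in for the minimax error: max deviation of f from the
        # chord through (x1, f(x1)) and (x2, f(x2)), sampled at 65 points
        if x2 <= x1:
            return 0.0
        y1, y2 = f(x1), f(x2)
        slope = (y2 - y1) / (x2 - x1)
        return max(abs(f(x1 + t * (x2 - x1)) - (y1 + slope * t * (x2 - x1)))
                   for t in (i / 64.0 for i in range(65)))

    boundaries, x1 = [], a
    while x1 < b:
        x2 = b
        while err(x1, x2) > e_max and x2 - x1 > ulp:  # halve until feasible
            x2 = x1 + (x2 - x1) / 2.0
        step = (b - x2) / 2.0                          # then grow back up
        while step >= ulp:
            if err(x1, min(x2 + step, b)) <= e_max:
                x2 = min(x2 + step, b)
            step /= 2.0
        x2 = min(max(x2, x1 + ulp), b)
        boundaries.append(x2)
        x1 = x2
    return boundaries
```

On a linear function the routine returns a single segment, while on a curved function such as x^2 the segment widths adapt to the local curvature, mirroring the behavior of the MATLAB algorithm in Figure 5.1.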
In the ideal case, one would use these optimum boundaries to approximate
the functions. However, from a hardware implementation point of view, this can be impractical. The circuit to find the right segment for a given input could
be complex, hence large and slow. Nevertheless, the optimum segments give us
an indication of how well a given segmentation scheme matches the optimum
segmentation. Moreover, they provide information on the non-linearities of a
function. Figure 5.2 shows the optimum boundaries for the four functions in
Section 5.1 for 16-bit operands and second order approximations. We observe that
f1 needs more segments in the regions near 0 and 1, f2 requires more segments
near 0 and f3 requires more segments in the two regions in the lower and upper
half of the interval.
In Figure 5.3 and Figure 5.4, we observe how the optimum number of segments changes for first and second order approximations as the bit width increases. We can see that the numbers grow exponentially. The interesting observation is
that for all four functions, the optimum segment numbers vary by a factor of
around 4 per bit for first order and 1.6 per bit for second order approximations.
Therefore as the bitwidths get larger, the memory savings of using second order
approximations get larger (Figure 5.5).
Figure 5.5 compares the ratio of the number of optimum segments required by
first and second order approximations for 8, 12, 16, 20 and 24-bit approximations
to the four functions. We can see that savings of second order approximations
get larger as the bit width increases. However one should note that, whereas first
order approximations involve one multiply and one add, second order approxima-
tions involve two multiplies and two adds. Therefore, there is a tradeoff between
the look-up table size and the circuit complexity. For low latency and low accu-
racy applications, first order approximations may be appropriate. Second order
approximations may be suitable for applications that require small look-up tables
and high accuracies.
% Inputs: a, b, d, f, e_max, ulp
% Output: u()
x1 = a; x2 = b; m = 1; check_x2 = 0; prev_x2 = a;
oscillating = 0; done = 0;
while (~done)
e = minimax(f,d,x1,x2,ulp);
if (e <= e_max)
if (x2 == b)
u(m) = x2;
done = 1;
else
if (oscillating)
u(m) = x2;
prev_x2 = x2;
x1 = x2;
x2 = b;
m = m+1;
oscillating = 0;
else
change_x2 = abs(x2-prev_x2)/2;
prev_x2 = x2;
if (change_x2 > ulp)
x2 = x2 + change_x2;
else
x2 = x2 + ulp;
end
end
end
else
change_x2 = abs(x2-prev_x2)/2;
prev_x2 = x2;
if (change_x2 > ulp)
x2 = x2 - change_x2;
else
x2 = x2 - ulp;
if (check_x2 == x2)
oscillating = 1;
else
check_x2 = x2;
end
end
end
end
Figure 5.1: MATLAB code for finding the optimum boundaries.
[Figure 5.2 plots omitted: f1(x), f2(x), f3(x) and f4(x) over [0, 1) with their optimum segment boundaries marked.]
Figure 5.2: Optimum locations of the segments for the four functions in Section 5.1 for 16-bit operands and second order approximation.
[Figure 5.3 plot omitted: number of optimum segments (up to about 6,000) against operand bit width (8 to 24) for f1–f4.]
Figure 5.3: Numbers of optimum segments for first order approximations to the functions for various operand bitwidths.
[Figure 5.4 plot omitted: number of optimum segments (up to about 400) against operand bit width (8 to 24) for f1–f4.]
Figure 5.4: Numbers of optimum segments for second order approximations to the functions for various operand bitwidths.
[Figure 5.5 plot omitted: first order / second order segment-count ratio (up to about 25) against operand bit width (8 to 24) for f1–f4.]
Figure 5.5: Ratio of the number of optimum segments required for first and second order approximations to the functions.
5.4 The Hierarchical Segmentation Method
Let a segmentation scheme Λ ∈ {US, P2S}, where US denotes uniform segments and P2S denotes powers of two segments. The proposed segment hierarchy H is of the form Λ0(Λ1(...(Λλ−1))), where λ is the number of levels in the hierarchy. This structure can be implemented as a cascade of look-up tables, where the output of one table is used as the address of the next. Let i = 0..λ. The input x is split into λ + 1 partitions called δi. Let vi denote the bit width and si the number of segments of the ith partition δi. Therefore, n = Σ_{i=0}^{λ} vi, where n is the number of bits of the input x. Then si is defined by the following set of equations:

si = 2^vi, if Λi = US  (5.7)

si ≤ 2vi, if Λi = P2S  (5.8)
For US, it is clear that 2^vi segments can be formed, since uniform segments are addressed with vi bits. However for P2S, it is not so clear why up to 2vi (twice vi) segments can be formed. Consider the case when Λ0 = P2S, n = 8, v0 = 5 and v1 = 3. When v0 = 5 it is possible to construct 10 P2S as illustrated in Table 5.1. Notice
that the segment sizes increase by powers of two till “01111111” (end of location
4) and start decreasing by powers of two from “10000000” (beginning of location
5) until the end. It can be seen that the maximum number of P2S that can be
constructed with δi is 2vi. Fewer segments can be obtained by omitting parts
of the table. For example with locations 0-4, one can have segments that only
increase by powers of two. To compute the segment address for a given input
δ0, we need to detect the leading zeros for locations 0-4, and leading ones for
locations 5-9. A simple cascade of AND and OR gates and a 1-bit multi-operand
adder can be used to find the segment address for a given input δi as shown in
Table 5.1: The ranges for P2S addresses for Λ0 = P2S, n = 8, v0 = 5 and v1 = 3. In each row, the bits before the vertical bar determine the segment address.
P2S address range
0 0 0 0 0 0 | 0 0 0 ∼ 0 0 0 0 0 | 1 1 1
1 0 0 0 0 1 | 0 0 0 ∼ 0 0 0 0 1 | 1 1 1
2 0 0 0 1 | 0 0 0 0 ∼ 0 0 0 1 | 1 1 1 1
3 0 0 1 | 0 0 0 0 0 ∼ 0 0 1 | 1 1 1 1 1
4 0 1 | 0 0 0 0 0 0 ∼ 0 1 | 1 1 1 1 1 1
5 1 0 | 0 0 0 0 0 0 ∼ 1 0 | 1 1 1 1 1 1
6 1 1 0 | 0 0 0 0 0 ∼ 1 1 0 | 1 1 1 1 1
7 1 1 1 0 | 0 0 0 0 ∼ 1 1 1 0 | 1 1 1 1
8 1 1 1 1 0 | 0 0 0 ∼ 1 1 1 1 0 | 1 1 1
9 1 1 1 1 1 | 0 0 0 ∼ 1 1 1 1 1 | 1 1 1
Figure 5.6. The appropriate taps are taken from the cascades depending on the
choice of the segments and are added to work out the P2S address. For P2S that
increase and decrease by powers of two, the full circuit is used, and for P2S that
decrease only to the left side (P2SL), just the AND gates are used. Similarly for
P2S that decrease to the right side (P2SR), the cascade OR gates are used. These
circuits can be pipelined and a circuit with shorter critical path but requiring
more area can be used [80]. Note that in the last partition, δλ is not used as
an address. If Λi = US, then δi+1 uses the next set of bits vi+1. However if
Λi = P2S, then the location of δi+1 depends on the value of δi. Let j denote the
P2S address, where j = 0..si− 1. From the vertical lines in Table 5.1, we observe
that δi+1 should be placed after a0 for j = 0 and j = si − 1, after a_{j−1} for j = 1 to j = (si/2) − 1, and after a_{si−2−j} for j = si/2 to j = si − 2.

[Figure 5.6 diagram omitted: two prefix cascades over the bits a_{v−1}..a0 feeding a 1-bit multi-operand adder.]
Figure 5.6: Circuit to calculate the P2S address for a given input δi, where δi = a_{v−1}a_{v−2}..a0. The adder counts the number of ones in the output of the two prefix circuits.
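The prefix-cascade circuit of Figure 5.6 effectively counts leading zeros or leading ones. A behavioural model of the address mapping in Table 5.1 can be sketched in Python (a hypothetical software reference, not the hardware itself):

```python
def p2s_address(delta0, v):
    """Map the v-bit field delta0 to its P2S segment address (Table 5.1).
    Segment sizes grow by powers of two up to the midpoint of the range,
    then shrink again, giving at most 2*v segments."""
    bits = [(delta0 >> (v - 1 - i)) & 1 for i in range(v)]  # MSB first
    if bits[0] == 0:
        # count leading zeros: lower half, addresses 0 .. v-1
        clz = next((i for i, bit in enumerate(bits) if bit == 1), v)
        return v - clz
    # count leading ones: upper half, addresses v .. 2v-1
    clo = next((i for i, bit in enumerate(bits) if bit == 0), v)
    return (v - 1) + clo
```

For v = 5 this reproduces the ten rows of Table 5.1: for example, the field 00000 maps to address 0, 0001x to address 2, and 11111 to address 9.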
In principle it is possible to have any number of levels of nested Λ, as long as Σ_{i=0}^{λ} vi ≤ n. The more levels are used, the closer the total number of segments
m will be to the optimum. However as λ (the number of levels) increases the
partitioning problem becomes more complex, and the cascade of look-up tables
gets longer, increasing the delay to find the final segment. Therefore there is
a tradeoff between the partitioning complexity, delay and m. Our tests with
the functions we consider in this chapter show that the rate of reduction of m
decreases rapidly as λ increases. λ = 2 gives an m very close to the optimum with
acceptable partitioning complexity and delay. Moreover, λ > 2 gives diminishing
returns in terms of small improvement in m with high partitioning complexity
and long delays. Therefore in this work, we limit ourselves to λ = 2, which
consists of one outer segment Λ0 and one inner segment Λ1. P2S is used as the
outer segment if the function varies exponentially in the beginning and the end
of the interval. P2SL and P2SR are used as the outer segment when the function
varies exponentially at the beginning or at the end respectively. US is used if
the function is non-linear in arbitrary regions. Although we limit ourselves to λ = 2, higher levels of hierarchy could be useful for certain functions.
In Section 6.4 (Chapter 6), we approximate the functions √(−log(x)) with P2S and cos(πx/2) with US(P2S), which are needed by the Box-Muller algorithm. These two schemes are found to be sufficient to generate high quality noise samples. However, these schemes are perhaps inappropriate for applications that require high accuracies, since when P2S is used as the innermost segmentation, the segments in the middle regions are large, causing large errors. Moreover, the address calculation circuit is needed for P2S, so P2S should be avoided if the saving over US is small. US(P2S(US)) could be useful for cases when there are highly non-linear regions in the middle parts of the function. The hierarchy schemes we have chosen are H ∈ {P2S(US), P2SL(US), P2SR(US), US(US)}. These four schemes cover most of the non-linear functions of interest.
We have implemented the hierarchical segmentation method (HSM) in MAT-
LAB, which deals with the four schemes. The program called HFS (hierarchical
function segmenter) takes the following inputs: the function f to be approx-
imated, input range, operand size n, hierarchy scheme H, number of bits for
the outer segment v0, the requested output error emax, and the precision of the
polynomial coefficients and the data paths. HFS divides the input interval into
outer segments whose boundaries are determined by H and v0. HFS finds the
minimum number of bits v1 for the inner segments for each outer segment, which
meets the requested output error constraint. For each outer segment, HFS starts
with v1 = 0 and computes the error e of the approximation. If e > emax then v1 is
incremented and the error e for each inner segment is computed, i.e. the number
of inner segments is doubled in every iteration. If it detects that e > emax it incre-
ments v1 again. This process is repeated until e ≤ emax for all inner segments of
the current outer segment. This is the point at which HFS obtains the minimum
number of bits for the current outer segment. HFS performs this process for all
outer segments. The main MATLAB code for finding the hierarchical boundaries
and their polynomial coefficients is shown in Figure 5.7. Note that minimax2
takes the precisions of the polynomial coefficients and data paths into account.
The outer boundaries are determined by H and v0.
Experiments are carried out to find the minimum number of bits for v0. Fig-
ure 5.8 shows how the total number of segments varies with v0 for 16-bit second
order approximation to f3. We can observe a U-shaped curve: there is a value of v0 at which the number of segments is at a minimum, which is five bits in this particular case.
When v0 is too small, there are not enough outer segments to cater to local non-
linearities. When v0 is too large, there are too many unnecessary outer segments.
Note that when v0 = 0, it is equivalent to using standard uniform segmenta-
tion. Figure 5.9 shows the segmented functions obtained from HFS for 16-bit
second order approximations to the four functions. It can be seen that the seg-
ments produced by HFS closely resemble the optimum segments in Figure 5.2.
Table 5.2 shows a comparison in terms of numbers of segments for various second
order approximations for uniform, HSM, and the optimum number of segments.
Double precision is used for the data paths and the output for this comparison.
We can see that HSM is significantly more efficient for the first three functions
than using uniform segments, and the difference between the optimum ones are
around a factor of two. However, for f4, the improvements over uniform segments
are small due to the function being very linear. Looking at the results for 24-bit
approximation to f1, we can see that HSM performs worse than average. This is
due to the fact that insufficient bits are left for δ1 (19 bits are already used for δ0).
Figure 5.10 shows our design flow for approximating functions. First the
user supplies the following to the HFS: f , input range, H, n, v0, emax, and the
precision of the polynomial coefficients and the data paths. HFS computes the
segment boundaries and the polynomial coefficients and stores the data into a
file. It also provides the user with a report, which contains the total number of
segments m, maximum error, percentage of exactly rounded results, and the sizes
of the multipliers, adders and look-up tables. There is a parameterizable reference
design template library for the four hierarchy schemes defined by H for first
and second order approximations. A design generator instantiates the relevant
reference design templates with information from the data file and generates the
hardware design in VHDL.
An interesting aspect of our approach is that it could be used to accelerate applications that involve pure floating-point calculations, such as software applications. This is because our method computes compound functions at once
using polynomial approximations, instead of decomposing the compound func-
tions into sub-functions and computing the sub-functions one by one. Versions of
FastMath [44] used P2S to approximate the non-linear functions in logarithmic
number systems (LNS) to speed up software applications without the use of a
coprocessor.
% Inputs: d, f, e_max, ulp, v0, H, n, precisions
% Outputs: hier_boundaries_table, poly_coeffs_table (initialized empty)
for i=1:(length(outer_boundaries)-1)
  x1 = outer_boundaries(i);
  x2 = outer_boundaries(i+1);
  hier_boundaries = x1;
  [e, poly_coeffs] = minimax2(f,d,x1,x2,ulp);
  if (e > e_max)
    outer_seg_size = x2-x1;
    v1 = 1;
    while (e > e_max)
      inner_seg_size = outer_seg_size/(2^v1);
      hier_boundaries = [];
      poly_coeffs = [];
      e = 0; % track the worst error over all inner segments
      for j=1:2^v1
        x1 = outer_boundaries(i) ...
             + (inner_seg_size*j) - inner_seg_size;
        x2 = x1 + inner_seg_size;
        [e_j, pc] = minimax2(f,d,x1,x2,ulp);
        e = max(e, e_j);
        hier_boundaries(j,:) = x1;
        poly_coeffs(j,:) = pc;
      end
      v1 = v1 + 1; % double the number of inner segments
    end
  end
  hier_boundaries_table ...
      = [hier_boundaries_table; hier_boundaries];
  poly_coeffs_table ...
      = [poly_coeffs_table; poly_coeffs];
end
Figure 5.7: Main MATLAB code for finding the hierarchical boundaries and their
polynomial coefficients.
[Figure 5.8 plot omitted: total number of segments (about 100 to 550) against v0 (0 to 10), showing a U-shaped curve with a minimum at v0 = 5.]
Figure 5.8: Variation of total number of segments against v0 for a 16-bit second order approximation to f3.
[Figure 5.9 plots omitted: the four segmented functions over [0, 1).]
Figure 5.9: The segmented functions generated by HFS for 16-bit second order approximations. f1, f2, f3 and f4 employ P2S(US), P2SL(US), US(US) and US(US) respectively. The black and grey vertical lines are the boundaries for the outer and inner segments respectively.
Table 5.2: Number of segments for first and second order approximations to the four functions. Results for uniform, HSM and optimum are shown.
function order operand uniform HSM optimum HSM
width segments segments segments /optimum
f1 1 8 64 13 7 1.86
12 4,096 78 35 2.23
16 65,536 395 161 2.45
20 1,048,576 1,876 723 2.59
24 33,554,432 8,608 2,302 3.74
2 8 8 5 4 1.25
12 1,024 23 15 1.53
16 32,768 72 44 1.64
20 524,288 218 126 1.73
24 16,777,216 742 287 2.59
f2 1 8 32 19 11 1.73
12 512 93 45 2.07
16 8,192 381 181 2.10
20 131,072 1533 724 2.12
24 2,097,152 6,141 2,896 2.12
2 8 8 5 4 1.25
12 128 15 10 1.50
16 2,048 44 26 1.69
20 32,768 124 66 1.88
24 524,288 315 167 1.89
f3 1 8 256 36 20 1.80
12 1,024 172 81 2.12
16 4096 683 303 2.25
20 16,384 2,723 1,296 2.10
24 65,536 10,609 5,182 2.05
2 8 64 20 10 2.00
12 256 41 24 1.71
16 512 107 59 1.81
20 1,024 234 151 1.55
24 2,048 573 379 1.51
f4 1 8 8 7 5 1.40
12 32 27 20 1.35
16 128 110 77 1.43
20 512 435 307 1.42
24 2048 1,739 1,228 1.42
2 8 4 3 2 1.50
12 8 7 4 1.75
16 16 15 10 1.50
20 64 45 23 1.96
24 128 111 58 1.91
[Figure 5.10 diagram omitted: design flow from User Input through the Hierarchical Function Segmenter, Data File, Design Generator (with Reference Design Library), Synthesis, and Place and Route to the final Hardware and Report.]
Figure 5.10: Design flow of our approach.
5.5 Architecture
The architecture of our function evaluator for HSM is shown in Figure 5.11. The
P2S unit performs the P2S address calculation (Figure 5.6) on δ0 if δ0 is of type
P2S (Λ0 = P2S). If Λ0 = US, δ0 is bypassed. The bit selection unit selects the
appropriate bits from the input based on the values of v0 and v1. This variable
bit selection is implemented using a barrel shifter. There are two look-up tables:
one used for storing the v1 values and the offset (ROM0), and the other storing
the polynomial coefficients (ROM1). The offset in ROM0 stores the starting
address in ROM1 for the different δ0 values. The depth s0 of ROM0 is defined in
Equations (5.7) and (5.8), and the depth m of ROM1 is the total number of segments. The sizes of the two look-up tables are defined as follows:

ROM0 = (⌈log2(max(v1))⌉ + ⌈log2(max(offset))⌉) × s0  (5.9)

ROM1 = (Σ_{i=0}^{d} wi) × m.  (5.10)
In practice, ROM0 is significantly smaller than ROM1, since the depth is
bounded by v0 and the entries v1 and offset are small. There is an interesting
tradeoff factor for ROM1: the wider the widths of the coefficients w, the fewer
segments m are needed, since the approximations will be more accurate. However,
if w is over a certain threshold, it has negligible effect on m. It is desirable to
find the right widths that minimize the total ROM size.
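As a quick sanity check, equations (5.9) and (5.10) can be evaluated numerically (a hypothetical helper; the parameter names follow the text, and the example values below are illustrative only):

```python
import math

def rom_sizes(max_v1, max_offset, s0, coeff_widths, m):
    """Total bits of ROM0 and ROM1 per equations (5.9) and (5.10).
    coeff_widths holds w_0..w_d, the widths of the d+1 coefficients."""
    entry_bits = (math.ceil(math.log2(max_v1))
                  + math.ceil(math.log2(max_offset)))
    rom0 = entry_bits * s0          # (5.9): one (v1, offset) pair per outer segment
    rom1 = sum(coeff_widths) * m    # (5.10): one coefficient set per segment
    return rom0, rom1
```

For instance, with max(v1) = 8, max(offset) = 64, s0 = 32 outer segments, three 16-bit coefficients and m = 100 segments, ROM0 is (3 + 6) × 32 = 288 bits while ROM1 is 48 × 100 = 4800 bits, confirming that ROM0 is much the smaller of the two.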
Let bj denote the boundaries of the outer segments where j = 0..m − 1, and let θ = max(bj+1 − bj) be the maximum width of an outer segment. Instead of approximating each interval over [bj, bj+1), we perform the translation x' = x − bj, which translates the interval [bj, bj+1) to [0, θ). This reduces the widths of the data paths, since x' ∈ [0, θ) requires fewer bits to represent than x.
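The translated-argument evaluation can be illustrated with a small Horner-form sketch (hypothetical coefficient values; in the real datapath the translated argument is a narrow fixed-point quantity rather than a float):

```python
def eval_segment(coeffs, b_j, x):
    """Evaluate a segment polynomial at x using the translated argument
    t = x - b_j, in Horner form: ((c_d*t + c_{d-1})*t + ...) + c_0."""
    t = x - b_j  # t lies in [0, theta), so it needs fewer bits than x
    acc = 0.0
    for c in coeffs:  # coefficients listed highest degree first
        acc = acc * t + c
    return acc
```

For example, with coefficients [2, 3, 1] (the polynomial 2t^2 + 3t + 1) and boundary b_j = 0.5, evaluating at x = 0.75 uses only t = 0.25.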
Highly non-linear functions such as f1, which have regions varying exponentially toward infinity, have coefficients with a large dynamic range. For instance, the largest coefficient c2 of a 24-bit second order approximation to f1 is on the order of 10^12. In such cases, floating-point arithmetic is needed. For f2 and f3, where
the ranges of the coefficients are relatively small, standard fixed-point arithmetic
is used.
For high throughput applications, the P2S unit, the multipliers and the adders
can be pipelined. For typical applications targeting FPGAs, the two ROMs are small and can be implemented on-chip using distributed RAM or block RAM. Often the multiplier is the part taking up a significant portion of the area. The size of the multipliers depends on the width of v1 + v2 and the
coefficients. Recent FPGAs, such as Xilinx Virtex-II or Altera Stratix devices,
provide dedicated hardware resources for multiplication which can benefit the
proposed architecture.
5.6 Error Analysis
The error of the approximation etotal = f(x)− p(x) is the difference between the
ideal mathematical value and the approximation. In our work, we regard IEEE
double precision floating-point as the exact value. There are two ways the result can be rounded: exact rounding [161], where the result is rounded to the nearest, and faithful rounding [159], where it is rounded to the nearest or next nearest. Exact rounding requires etotal ≤ 0.5 ulp and faithful rounding requires etotal ≤ 1 ulp.
There is no known method for determining the accuracy required to guarantee
exactly rounded results for elementary functions, a problem known as the table
maker’s dilemma. The authors in [161] indicate that by computing an elementary
function to sufficiently high precision, it is possible to guarantee that all results
are exactly rounded. However, the degree of additional precision required is
almost doubled, greatly increasing the hardware complexity. Therefore, we have
opted for faithful rounding, which is good enough for most practical applications.
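The two rounding criteria are straightforward to state in code (a hypothetical model with F fractional bits, where 1 ulp = 2^−F):

```python
def ulp(frac_bits):
    # the unit in the last place for a number with frac_bits fractional bits
    return 2.0 ** -frac_bits

def round_nearest(x, frac_bits):
    """Round x to frac_bits fractional bits (round half to even, as
    Python's round does); the error is at most 0.5 ulp."""
    scale = 2 ** frac_bits
    return round(x * scale) / scale
```

Exact rounding demands that the returned value always equals round_nearest applied to the infinitely precise result (error at most 0.5 ulp), whereas faithful rounding also accepts the next-nearest representable value (error at most 1 ulp).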
The total error etotal consists of the following four types of errors:
• ein due to interpolating f(x) with a polynomial;
• eco for rounding polynomial coefficients to finite precision;
• edp for rounding results from multipliers and adders in various data paths;
• erd for rounding the final result to n bits.
Thus, our error requirement for faithful rounding is:
etotal = ein + eco + edp + erd ≤ 1 ulp. (5.11)
The final rounding step erd rounds the result to the nearest (as do the other rounding steps), and thus introduces a maximum error of 0.5 ulp. So our requirement
is
etotal = ein + eco + edp ≤ 0.5 ulp. (5.12)
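To make the requirement concrete, the check can be expressed in a few lines of software (a sketch of our own; the helper names are not from the thesis). For a result with n fractional bits, one ulp is 2^−n:

```python
def ulp_error(exact, approx, frac_bits):
    """Absolute error measured in units in the last place (ulp),
    for a fixed-point result with frac_bits fractional bits."""
    return abs(exact - approx) / 2.0 ** -frac_bits

def is_faithful(exact, approx, frac_bits):
    """Faithful rounding: the returned value is one of the two
    representable neighbours of the exact result, i.e. the total
    error is below 1 ulp."""
    return ulp_error(exact, approx, frac_bits) < 1.0
```

A result off by half an ulp (2^−17 for a 16-bit fraction) is still faithful; one off by several ulp is not.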
In HSM, when we find the minimax approximation over an interval, all four
types of errors are already taken into account. This is because the user supplies
the parameters for the finite precisions of the coefficients, data paths and the
final result. Rather than computing the segments with great precision first and
then applying finite approximations, it is better to take the finite precisions into
account when the actual approximations are being made.
It is desirable to minimize the bitwidths for both the coefficients and the
data paths, which leads to size reductions in look-up tables, multipliers and
adders. Currently, the precisions of the coefficients and the internal data paths are supplied by the user. Pineiro et al. [147], and Schulte and Swartzlander Jr. [161] present an exhaustive iterative technique to find suboptimal bitwidths for the polynomial coefficients. Hauser and Purdy [59] use genetic algorithms to find the optimal coefficients. Unfortunately, both approaches become inefficient when
the output accuracy requirement is high. We plan to automate the optimization
of the bitwidths for both coefficients and data paths using bit analysis techniques
such as those presented in [29] and [47].
5.7 The Effects of Polynomial Degrees
The degrees of the polynomials play an important role when approximating func-
tions with HSM: the higher the degree, the fewer segments are needed to meet the
same error requirement. However, higher degree polynomials require more mul-
tipliers and adders, leading to higher circuit complexity and more delay. Hence,
there is a tradeoff between table size, circuit complexity and delay.
Table size. As the polynomial degree d increases, the width of the coefficient
look-up table increases linearly. One needs to store d+1 coefficients per segment.
Circuit Complexity. As d increases, one needs more adders and multipliers
to perform the actual polynomial evaluation. These increase in a linear manner:
since we are using Horner’s rule, d adders and d multipliers are required.
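As a software model of this data path (our own sketch), Horner's rule evaluates a degree-d polynomial with exactly d multiply-add stages:

```python
def horner(coeffs, x):
    """Evaluate p(x) = c0 + c1*x + ... + cd*x^d using Horner's rule.
    coeffs is [c0, c1, ..., cd]; each loop iteration corresponds to
    one multiplier and one adder stage, executed d times in total."""
    result = coeffs[-1]
    for c in reversed(coeffs[:-1]):
        result = result * x + c
    return result
```

For a second order segment this computes p(x) = (c2·x + c1)·x + c0, i.e. two multipliers and two adders, matching the circuit complexity counts above.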
Delay. Note that the polynomial coefficients in the look-up table can be accessed
in parallel. Hence, the delay differences between the different polynomial degrees
occur when performing the actual polynomial evaluation. Due to the serial nature
of Horner’s rule, the increase in delay of the polynomial evaluation is again linear
(d adders and d multipliers).
As we vary the polynomial degree for a given approximation, the circuit complexity and delay are predictable. For the table size, however, although the number of coefficients per segment (the width of the table) is predictable, the look-up table size depends on the total number of segments (the depth of the table) as well. To explore the behavior of the table size, we have calculated table sizes at operand widths between 8 and 24 bits, and polynomial degrees from one to five. Double precision is used for the data paths to obtain these results.
The bitwidths of the polynomial coefficients are assumed to be the same as the
operand size. Mesh plots of these parameters for the four functions are shown in
Figure 5.12.
Interestingly, all four functions share the same behavior: the table size varies exponentially with both the operand width and the polynomial degree. Although first order approximations have the least
circuit complexity and delay (one adder and one multiplier), we can see that
they perform badly (in terms of table size) for operand widths of more than 16
bits. Second order approximations have reasonable circuit complexity and delay
(two adders and two multipliers) and for the bitwidths used in these experiments
(up to 24 bits), they yield reasonable table sizes for all four functions.
Another observation is that the improvements in table size of using third
or higher order polynomials are very small. For instance, looking at the 24-bit
results to f2, the table sizes of first, second and third order approximations are
294,768, 22,680 and 8,256 bits respectively. The difference between first and second order is a factor of 13.0, whereas the difference between second and third order is a factor of 2.7. Therefore, the overhead of an extra adder and multiplier stage for third order approximations may not be worthwhile for a table size reduction of just a factor of 2.7.
Hence, we conclude that for operand sizes of 16 bits or fewer, first order
approximations yield good results. For operand sizes between 16 and 24 bits,
second order approximations are perhaps more appropriate. We predict that for
operand sizes larger than 24 bits, the table size improvements of using third or
higher order polynomials will appear more dramatic.
Earlier, we compared the number of segments produced by HSM to the optimum segmentation. For first and second order approximations, the ratio of segments obtained by HSM to the optimum is around a factor of two. To explore how this ratio behaves with varying operand width and polynomial degree, we obtain the results shown in Figure 5.13. The ratios remain around a factor of two across the parameter range, with HSM showing no obvious signs of degradation.
Figure 5.11: HSM function evaluator architecture for λ = 2 and degree d approximations. Note that ':' is a concatenation operator.
Figure 5.12: Variations of the table sizes for the four functions with varying polynomial degrees and operand bitwidths.
Figure 5.13: Variations of the HSM/Optimum segment ratio with polynomial degrees and operand bitwidths.
5.8 Evaluation and Results
Table 5.3 compares HSM with direct table look-up, the symmetric bipartite table
method (SBTM) [162] and the symmetric table addition method (STAM) [63],
[167] for 16 and 24-bit approximations to f2. SBTM and STAM use bipartite
and multipartite tables to exploit the symmetry of the Taylor approximations
and leading zeros in the table coefficients to reduce the look-up table size, as
discussed in Section 2.3.3 in Chapter 2.
The uniform segmentation compared here is similar to those described in [147] and [70]. The polynomial-only method [165] is not considered here, since it requires impractically high order polynomials when the function is non-linear. For instance, in order to achieve just 8-bit accuracy for f2 with the polynomial-only method, one requires a polynomial of degree 12.
We observe that the table sizes for the direct look-up approach are infeasible when the accuracy requirement is high. SBTM/STAM significantly reduce the table sizes compared to direct table look-up, at the expense of some adders and control circuitry. In the 16-bit results, HSM4 has the smallest table size, being 546 and 8.5 times smaller than direct look-up and STAM respectively. The table size improvement of HSM is of course at the expense of more multipliers and adders, and hence higher latency. Generally, the higher the polynomial degree, the smaller the table size. However, in the 16-bit case, HSM5 actually has a larger table size than HSM4, because the extra overhead of storing one more polynomial coefficient per segment exceeds the reduction in the number of segments compared to HSM4. We observe that for the 24-bit results, the differences in table sizes between HSM and the other methods are even larger. Moreover, the reductions in table size from using higher order polynomials grow as the operand width increases (i.e. as the accuracy requirement increases). For applications that
require relatively low accuracy and latency, SBTM/STAM may be preferred. For high accuracy applications that can tolerate longer latencies, HSM would be more appropriate.
The reference design templates have been implemented using Xilinx System Generator. As mentioned in Section 5.4, these design templates are fully parameterizable: changes to the desired function, input interval, operand width or finite precision parameters automatically produce a new design. The Xilinx System Generator design templates used for first order US(US) and second order P2SL(US) are depicted in Figure 5.14 and Figure 5.15.
A variant [87] of our approximation scheme to f1 and f4, with one level of P2S
and US(P2S), has been implemented and successfully used for the generation of
Gaussian noise samples [86]. Table 5.4 contains implementation results for f2
and f3 with 16 and 24-bit operands and second order approximation, which are
mapped and tested on a Xilinx Virtex-II XC2V4000-6 FPGA. The precisions of the coefficients and the data paths have been optimized to minimize the size of the multipliers and look-up tables. The designs are fully pipelined, generating a result
every clock cycle. Designs with lower latency and clock speed can be obtained
by reducing the number of pipeline stages. The designs have been tested exhaus-
tively over all possible input values to verify that all outputs are indeed faithfully
rounded. There are many compelling arguments in the literature that if results are faithfully rounded, one should maximize the percentage that are exactly rounded. For the two designs, over 86% of the results are exactly rounded. A higher percentage can be achieved by tightening the precisions of the coefficients and data paths. We have observed that exact rounding is possible, but with a significant increase in the precisions. The data path widths, numbers of segments, table sizes and percentages of exactly rounded results for the implementations are
Figure 5.14: Xilinx System Generator design template used for first order US(US).
Figure 5.15: Xilinx System Generator design template used for second order P2SL(US).
Figure 5.16: Error in ulp for 16-bit second order approximation to f3.
presented in Table 5.5.
Although we have not synthesized designs for SBTM and STAM, we estimate that they would take significantly less area in terms of slices than HSM (since only adders and some control circuitry are required, and adders are efficiently implemented on Xilinx FPGAs using fast-carry chains), but at the expense of more block RAM usage. The difference in block RAM usage between HSM and SBTM/STAM becomes more significant as the accuracy requirement increases, as shown in Table 5.3.
Figure 5.16 shows how the error (in ulp) varies with the input for 16-bit second
order approximation to f3. We observe that most of the errors (in absolute terms)
are less than 0.5 ulp, i.e. most of the results are exactly rounded.
Our hardware implementations have been compared with software implemen-
tations (Table 5.6). The FPGA implementations compute the functions using
HSM with 24-bit operands and second order polynomials. Software implementa-
tions are written in C generating single precision floating-point numbers, and are
compiled with the GNU gcc 3.2.2 compiler [54]. This is a fair comparison in terms
of precision, since single precision floating-point has 24-bit mantissa accuracy.
For the f2 function, the FPGA implementation is 20 times faster than the
Athlon based PC in terms of throughput, and 1.3 times faster in terms of completion time. We suspect that the inferior results of the Pentium 4 PC are due to an inefficient implementation of the log function in the gcc math libraries for
the Pentium 4 CPU. Looking at the f3 function, the FPGA implementation is 90
times faster than the Athlon based PC in terms of throughput, and 7 times faster
in terms of completion time. This increase in performance gap is due to the f3
function being more 'compound' than the f2 function. Whereas a CPU computes each elementary operation of the function one by one, HSM approximates the entire function at once. Hence, the more compound a function is, the greater the advantage of HSM becomes.
Note that the FPGA implementations use only a fraction of the device (less than 2%); by instantiating multiple function evaluators on the same chip for parallel execution, we can expect even larger performance improvements.
Table 5.3: Comparison of direct look-up, SBTM, STAM and HSM for 16 and
24-bit approximations to f2. The subscript for HSM denotes the polynomial
degree, and the subscript for STAM denotes the number of multipartite tables
used. Note that SBTM is equivalent to STAM2.
operand width  method    table size [bits]  compression  multipliers  adders
16             direct        1,048,576         546.1          -          -
               SBTM             29,696          15.5          -          1
               STAM4            16,384           8.5          -          3
               HSM1             24,384          12.7          1          2
               HSM2              4,620           2.4          2          3
               HSM3              2,304           1.2          3          4
               HSM4              1,920           1.0          4          5
               HSM5              2,112           1.1          5          6
24             direct      402,653,184      77,672.3          -          -
               SBTM          2,293,760         442.5          -          1
               STAM6           491,520          94.8          -          5
               HSM1            393,024          75.8          1          2
               HSM2             40,446           7.8          2          3
               HSM3             11,008           2.1          3          4
               HSM4              6,720           1.3          4          5
               HSM5              5,184           1.0          5          6
Table 5.4: Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA
for 16 and 24-bit, first and second order approximations to f2 and f3.
function  order  operand width  speed [MHz]  latency [cycles]  slices  block RAMs  block multipliers
f2        1      16             202          11                332     2           2
                 24             160          12                897     44          4
          2      16             153          13                483     1           4
                 24             135          14                871     2           10
f3        1      16             232          8                 198     2           1
                 24             161          10                418     37          2
          2      16             198          12                234     1           3
                 24             157          13                409     3           4
Table 5.5: Widths of the data paths, number of segments, table size and percentage of exactly rounded results for 16 and 24-bit second order approximations to f2 and f3.

function  operand width  w(x)  w(C2)  w(C1)  w(C0)  w(xC2)  w(xC2C1)  w(xC2C1x)  w(px)  segments  table size  exactly
          [bits]                                                                                  [bits]      rounded [%]
f2        16             14    32     25     20     21      24        21         16     60        4,620       86
          24             22    48     31     28     31      34        31         24     378       40,446      87
f3        16             10    17     18     19     18      19        18         16     125       6,750       86
          24             16    26     27     27     27      27        27         24     764       61,120      87
Table 5.6: Performance comparison: computation of the f2 and f3 functions. The Athlon and Pentium 4 PCs are equipped with 512MB and 1GB of DDR SDRAM respectively.
function platform speed throughput completion
[MHz] [operations / second] time [ns]
f2 XC2V4000-6 FPGA 135 135 million 104
AMD Athlon PC 1400 7.14 million 140
Intel Pentium 4 PC 2600 0.48 million 2088
f3 XC2V4000-6 FPGA 157 157 million 83
AMD Athlon PC 1400 1.76 million 569
Intel Pentium 4 PC 2600 1.43 million 692
5.9 Summary
We have presented a novel method for evaluating functions using piecewise polynomial approximations with an efficient hierarchical segmentation scheme. Our method is illustrated using four non-linear compound functions: √(−log(x)), x log(x), a high order rational function and cos(πx/2). An algorithm that finds the optimum segments for a given function, input range, maximum error and ulp has been presented. The four hierarchical schemes P2S(US), P2SL(US), P2SR(US) and US(US) deal with the non-linearities that occur frequently in functions. A simple cascade of AND and OR gates can be used to rapidly calculate the P2S address for a given input. Results show the advantages of our hierarchical approach over the traditional uniform approach. Compared to other popular methods such as STAM, our approach has longer latencies due to the increased number of arithmetic operations. However, the look-up tables are considerably smaller (up to a factor of 94.8, depending on method and precision).
CHAPTER 6
Gaussian Noise Generator
using the Box-Muller Method
6.1 Introduction
The availability of high quality Gaussian random numbers is critical to many sim-
ulation, graphics and Monte Carlo applications. Numerical methods for Gaussian
random number generation have a long history in mathematics and communica-
tions. As described in [78] and the references cited therein, most methods involve
initially generating samples of a uniform random variable and then applying a
transformation to obtain samples drawn from a unit-variance, zero-mean Gaus-
sian PDF f(x) = (1/√(2π)) e^(−x²/2). In the overwhelming majority of cases, this
occurs in environments such as computer based simulation where functions such
as sine, cosine, and square roots are easily performed, and where there is sufficient
precision so that finite-word length effects are negligible.
There are many applications in which large simulations require Gaussian noise. These include financial modeling [14], simulation of economic systems [6] and molecular dynamics simulations [76]. For all of these applications, hardware
based simulation offers the potential to speed up simulation by several orders of
magnitude, but is feasible only if suitably fast and high-quality noise generators
can be implemented in environments with the limited word length, and the computational, memory and data flow properties typical of hardware systems. In
addition, while any deviation from an ideal Gaussian PDF creates the potential
for degrading the simulation results, very large simulations create particularly
stringent requirements on the quality of the PDF in the tails. Samples that lie at
large multiples of σ (standard deviations) away from the mean are by definition
extremely rare but are also exactly the noise realizations that are most likely to
induce events of high interest in understanding the behavior of the overall system.
Accurately obtaining good characteristics in the tails requires the combination of 1) an underlying method that creates high σ values with the proper frequency, and 2) a hardware implementation of the method that preserves the requisite precision at all stages to ensure that high σ behavior is not compromised.
There has been little attention focused on efficient hardware implementation
of Gaussian noise generators, as the noise in real hardware systems is of course
supplied by the environment and does not typically need to be generated inter-
nally. Recent advances in coding, however, have made the case for hardware
based simulation of channel codes much more compelling, and provide strong
motivation to examine the Gaussian noise generation problem in the framework
of limited word length, and limited computational and memory resources. For
example, computer simulations to examine LDPC code behavior can be time
consuming, particularly when the behavior at BERs in the error floor region is
being studied. Hardware based simulation offers the potential of speeding up
code evaluation by several orders of magnitude, but is feasible only if suitably
fast and high-quality noise generators can be implemented in hardware alongside
the channel decoder.
Probably the best known method for generating Gaussian noise is the Box-Muller transformation [13]. It allows us to transform uniformly distributed random variables into a new set of random variables with a Gaussian distribution. We start with two independent random numbers drawn from a uniform distribution (in the range from 0 to 1), then apply mathematical transformations to obtain two new independent random numbers which have a Gaussian distribution with zero mean and a standard deviation of one.
The principal contribution of this chapter is a hardware Gaussian noise generator based on the Box-Muller method that offers quality suitable for simulations
involving large numbers of noise samples. The noise generator occupies approx-
imately 10% of the resources on a Xilinx Virtex-II XC2V4000-6 device, while
producing over 133 million samples per second. In contrast to previous work, we
focus specific attention on the accuracy of the noise samples in the high σ regions
of the Gaussian PDF, which are particularly important in achieving accurate
results during large simulations. The key novelties of our work include:
• a hardware architecture which involves the use of non-uniform piecewise lin-
ear approximations in computing trigonometric and logarithmic functions;
• exploration of hardware implementations of the proposed architecture tar-
geting both advanced high-speed FPGAs and low-cost FPGAs;
• evaluation of the proposed approach using several different statistical tests,
including the chi-square test and the Anderson-Darling test, as well as
through application to decoding of LDPC codes.
The rest of this chapter is organized as follows. Section 6.2 covers related work.
Section 6.3 briefly reviews the Box-Muller algorithm, and discusses how each of its
steps can be handled in a hardware architecture. Section 6.4 presents a method
for function evaluation based on non-uniform segments. Section 6.5 explains
how the function evaluation method is used to compute the functions in the
Box-Muller algorithm. Section 6.6 describes the technology-specific implementation of the hardware architecture. Section 6.7 discusses evaluation and results, and Section 6.8 offers a summary.
6.2 Related Work
There is little previous work on high quality digital hardware Gaussian noise
generators. The most relevant publications are probably [12] and [186], which
discuss designs targeting FPGAs. We present a design with significantly improved
efficiency, which also passes statistical tests widely used for testing normality. In
addition, previous work produces noise samples that are targeted primarily for
the output region below about 4σ, and therefore does not specifically address the
high σ values of 4σ to 6σ and beyond; these are critical in the large simulations
motivating our work.
The Box-Muller algorithm requires the approximation of non-linear functions.
When using piecewise polynomials to approximate such functions, it is desirable
to choose the boundaries of the segments to cater for the non-linearities of the
function. Highly non-linear regions may need smaller segments than linear re-
gions. This approach minimizes the amount of storage required to approximate
the function, leading to more compact and efficient designs. We employ a novel
hierarchy of uniform segments and segments that vary by powers of two to cover
the non-linearities of different functions appropriately. Moreover, we present an architecture that is well suited to hardware implementation.
6.3 Architecture
This section provides an overview of the Box-Muller method and the associated
four-stage hardware architecture. The implementation of this architecture in
FPGA technology is presented in Section 6.6.
The Box-Muller method is conceptually straightforward. Given two indepen-
dent realizations u1 and u2 of a uniform random variable over the interval [0,1),
and a set of intermediate functions f , g1 and g2 such that
f(u1) = √(−ln(u1)) (6.1)
g1(u2) = √2 sin(2π u2) (6.2)
g2(u2) = √2 cos(2π u2) (6.3)
the products
x1 = f(u1) g1(u2) (6.4)
x2 = f(u1) g2(u2) (6.5)
provide two samples of a Gaussian distribution N(0, 1).
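These equations can be written directly as a software reference model (a sketch of our own; the hardware described below replaces the elementary-function calls with the piecewise linear approximations of Sections 6.4 and 6.5):

```python
import math

def box_muller(u1, u2):
    """Map two independent uniform samples u1, u2 in (0, 1) to two
    independent N(0, 1) samples, per equations (6.1)-(6.5)."""
    f  = math.sqrt(-math.log(u1))                       # f(u1), (6.1)
    g1 = math.sqrt(2.0) * math.sin(2.0 * math.pi * u2)  # g1(u2), (6.2)
    g2 = math.sqrt(2.0) * math.cos(2.0 * math.pi * u2)  # g2(u2), (6.3)
    return f * g1, f * g2                               # x1, x2: (6.4), (6.5)
```

For example, u2 = 0.25 makes g2 vanish, so the second output is (numerically) zero and the first is √(−ln u1) · √2.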
The above equations lead to an architecture that has four stages (Figure 6.1).
1. a shift register based uniform random number generator;
2. implementation of the functions f , g1, g2 and the subsequent multiplications;
3. a sample accumulation step that exploits the central limit theorem to overcome quantization and approximation errors; and
4. a simple multiplexor based circuit to support generation of one result per clock cycle.
A similar basic approach has been taken in other hardware Gaussian noise im-
plementations [12]; what distinguishes our work is the detail of the functional
implementation developed to deal with: (a) Gaussian noise with high σ values,
and (b) evaluations using commonly-used statistical tests.
In the following, each of the four stages in our architecture is described in
detail.
The first stage. This stage involves generation of the uniformly distributed
realizations u1 and u2. The implementation of this stage is straightforward, and
can be accomplished using well-known techniques based on Linear Feedback Shift
Registers (LFSRs) [24]. To ensure maximum randomness, we use an independent
shift register for each bit of u1 and u2. The resources needed are related to the
periodicity desired in the shift registers. Since m-bit LFSRs with irreducible
polynomials can produce random numbers with a periodicity of 2^m − 1, the hardware required will be proportional to the number of bits of precision needed in u1 and
u2.
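As an illustration, a maximal-length 16-bit LFSR in Galois form can be modelled as below. The feedback polynomial shown is one common maximal-length choice, an assumption of ours, not necessarily the taps used in the actual implementation:

```python
def lfsr16_step(state):
    """One step of a 16-bit Galois LFSR with feedback polynomial
    x^16 + x^14 + x^13 + x^11 + 1 (tap mask 0xB400). Starting from
    any nonzero state, it cycles through all 2**16 - 1 nonzero
    states before repeating."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= 0xB400
    return state
```

One such register is used per bit of u1 and u2, each seeded differently, so that the bits are generated independently.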
The necessary precisions of u1 and u2 are related to the maximum σ value
that the full system will produce. Since g1 and g2 are bounded by [−√2, √2], the maximum output is determined by f, which in turn takes on its largest values when u1 is smallest. For example, when 16 bits are used for u1, the maximum possible Gaussian sample has an absolute value of 4.7σ. With the 32 bits we use in this chapter, we can get up to 6.7σ. Using more bits for u1 means that we need to approximate non-linear parts of f closer to zero. In addition, the precisions of u1, u2, g1 and g2 should be large enough so that there is enough diversity in the outputs. Low precisions will cause the statistical tests to fail.
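These maxima follow from the formulas: the smallest nonzero u1 with m bits is 2^−m, and |g1|, |g2| ≤ √2, so the largest possible sample is √2 · √(−ln 2^−m) = √(2m ln 2). A quick check (our own arithmetic, not code from the thesis):

```python
import math

def max_sigma(m_bits):
    """Largest |x| the Box-Muller unit can produce when u1 has
    m_bits of precision: sqrt(2 * m_bits * ln 2)."""
    return math.sqrt(2.0 * m_bits * math.log(2.0))

print(round(max_sigma(16), 1))  # 4.7 sigma for a 16-bit u1
print(round(max_sigma(32), 1))  # 6.7 sigma for the 32-bit u1 used here
```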
The second stage. This stage involves the most interesting challenges: efficient
implementation of the functions f , g1 and g2. Direct computation of the functions
Figure 6.1: Gaussian noise generator architecture. The black boxes are buffers.
using methods such as CORDIC leads to prohibitively long computation times. A
direct look-up table would allow outputs to be obtained in only a few clock cycles,
but this leads to prohibitively large memory requirements. For example, a look-up table for f(u1) with sufficient resolution for u1 would require 2^32 entries. Instead, we use a two-step process based on non-uniform piecewise linear approximation. Our approach is described in Sections 6.4 and 6.5.
The third stage. This stage involves a sample accumulation step that exploits
the central limit theorem to overcome quantization and approximation errors.
As is well known, given a sequence of realizations of independent and identically
distributed random variables x1, x2, ..., xl with unit variance and zero mean, the
distribution of (x1 + x2 + ... + xl)/√l tends to be normally distributed as l → ∞. We find that l = 2 is sufficient
to overcome the effects of the approximation errors, so we use an accumulator
(the ACC(2) component shown in Figure 6.1) that sums two successive inputs
to produce an output every other cycle. The central limit theorem calls for a division by √2, which is potentially problematic in hardware. Fortunately, since the computation of g1 and g2 involves a multiplication by √2 (equations (6.2) and (6.3)), this multiplication is in effect cancelled by the subsequent division, so it can be dispensed with in both places in the implementation. This optimization also alters the range of g as implemented to [−1, 1].
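The effect of the cancellation can be checked numerically with a behavioural sketch (the names are ours): if the unscaled samples have variance 1/2, i.e. f(u1)·g(u2) with the √2 factor removed from g, then the pairwise sums produced by ACC(2) have unit variance with no explicit divider.

```python
import math, random

def acc2(samples):
    """Model of the ACC(2) stage: sum pairs of successive inputs,
    producing one output for every two inputs."""
    return [samples[i] + samples[i + 1]
            for i in range(0, len(samples) - 1, 2)]

# Samples of variance 1/2 stand in for f(u1)*g(u2) with the sqrt(2)
# factor omitted; summing two of them yields unit variance.
random.seed(1)
xs = [random.gauss(0.0, 1.0) / math.sqrt(2.0) for _ in range(20000)]
ys = acc2(xs)
variance = sum(y * y for y in ys) / len(ys)
```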
The fourth stage. This stage involves a multiplexor based circuit to select one of
the two ACC(2) component outputs in alternate clock cycles. The multiplexor
is controlled by a circuit that toggles its output. This enables producing an
output every clock cycle, rather than two outputs every other cycle. The buffer
after the second ACC(2) is needed to ensure one valid noise sample is fed to
the multiplexor every clock cycle, rather than two valid samples every two clock
cycles.
Two further remarks about this architecture can be made. First, it is pos-
sible to speed up the output rate further by having multiple noise generators
running in parallel, provided that the LFSRs are initialized with different ran-
dom seeds. Second, the periodicity can be increased by using larger LFSRs and
higher σ values can be obtained using more bits for u1, both with little increase
in complexity.
6.4 Function Evaluation for Non-uniform Segmentation
This section presents a method for function evaluation based on an innovative
technique involving non-uniform segmentation. This method is a variant of the
segmentation ideas presented in Chapter 5. The interval of approximation is
divided into a set of sub-intervals, called segments. The best-fit straight line to each segment, in the minimax sense (minimizing the worst-case error), is found. A
look-up table is used to store the coefficients for each line segment, and the
functions can then be evaluated using a multiplier and an adder to calculate the
linear approximation. Uniform segmentation methods have been proposed, which
involve similar hardware [122].
With well-known methods that compute elementary functions, such as CORDIC, the evaluation of compound functions is a multi-stage process. Consider the
evaluation of the f function as defined in Equation (6.1) over the interval (0, 1)
(Figure 6.2). Using CORDIC, the computation of this function is a two-stage
process: the logarithm of x followed by the square root. With our approach, we
Figure 6.2: The f function. The asterisks indicate the boundaries of the linear approximations.
look at the entire function over the given domain, and therefore we do not need
to have two stages. As shown in Figure 6.2, the greatest non-linearities of the f
function occur in the regions close to zero and one. If uniform segments are used,
a large number of small segments would be required to get accurate approxima-
tions in the non-linear regions. However, in the middle part of the curve where it
is relatively linear, accurate approximation can be obtained using relatively few
segments. It would be efficient to use small segments for the non-linear regions,
and large segments for linear regions. Arbitrary-sized segments would enable us
to have the least error for a given number of segments; however, the hardware to
calculate the segment address for a given input can be complex. Our objective is
to provide near arbitrary-sized segments with a simple circuit to find the segment
address for a given input.
We have developed a novel method which can construct piecewise linear approximations. The main features of our proposed method include: (a) the segment lengths used in a given region depend on the local linearity, with more segments
deployed for regions of higher non-linearity; and (b) the boundaries between seg-
ments are chosen such that the task of identifying which segment to use for a
given input can be rapidly performed. The method is based on early ideas behind the hierarchical segmentation method (HSM) described in Chapter 5. It is not
as sophisticated as HSM, but is sufficient to generate high quality Gaussian noise
samples.
As an example to illustrate our approach, consider approximating f with an
8-bit input. Using the traditional approach, the most-significant bits of u are
used to index the uniform segments. For instance, if the most-significant four bits are taken, 16 uniform segments are used to approximate the function. Using our
approach, it is possible to adopt small segments for non-linear regions (regions
near 0 and 1), and large segments for linear regions (regions around 0.5). The
idea is to use segments that grow by a factor of two from 0 to 0.5, and segments
that shrink by a factor of two from 0.5 to 1 in the horizontal axis of Figure 6.2.
We use segment boundaries at locations 2^(n−8) and 1−2^(−n), where 0 ≤ n < 8. Up to
14 segments can be formed this way. A circuit based on prefix computation can
be used for calculating segment addresses (Figure 6.3, same as the circuit used for
HSM in Chapter 5) for a given input x. It checks the number of leading zeros and
ones to work out the segment address. A cascade of OR gates is used for segments
that grow by factors of two, and a cascade of AND gates is used for segments that
shrink by factors of two; these circuits can be pipelined and a circuit with shorter
critical path but requiring more area can be used [80]. Note that the segments
do not have to grow or shrink by factors of two; larger factors can be used. The appropriate
taps are taken from the cascades depending on the choice of the segments and are
added to work out the segment address. In Figure 6.3, the maximum available
taps are taken, giving 14 segment addresses. Some taps would not be taken if the
Figure 6.3: Circuit to calculate the segment address for a given input x. The
adder counts the number of ones in the output of the two prefix circuits. Note
that the least-significant bit x0 is not required.
segments grow or shrink by more than a factor of two. It can be seen that the
critical path of this circuit is the path from x6 or x7 to the output of the adder.
By introducing pipeline registers between the gates, higher throughput can easily
be achieved.
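The addressing scheme can be modelled in software. The sketch below (a behavioural Python model only; the hardware uses the OR/AND cascades and adder of Figure 6.3) maps an 8-bit input to a segment address by counting leading zeros below 0.5 and leading ones above it. Dropping the unused LSB x0, as the figure notes, merges the two narrowest low-end segments, which is one way to arrive at the 14 addresses quoted above.

```python
def segment_address(x, bits=8):
    """Software model of the prefix-circuit segment address calculator
    (behaviour of Figure 6.3, not its gate-level structure).

    For u = x / 2**bits in [0, 1), boundaries sit at 2**(n - bits) and
    1 - 2**(-n): segments double in width from 0 to 0.5 and halve from
    0.5 to 1.
    """
    x >>= 1                      # the LSB x0 is not required (Figure 6.3)
    b = bits - 1
    if x < (1 << (b - 1)):       # u < 0.5: address from leading zeros
        return x.bit_length()    # segments widen as u grows
    ones = 0                     # u >= 0.5: address from leading ones
    for i in range(b - 1, -1, -1):
        if (x >> i) & 1:
            ones += 1
        else:
            break
    return (b - 1) + ones        # segments narrow as u approaches 1

print(segment_address(0))    # 0  (narrowest segment near u = 0)
print(segment_address(255))  # 13 (narrowest segment near u = 1)
```

The address is monotonic in x, so each of the 14 values selects one contiguous segment.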
When approximating f with 32-bit inputs based on polynomials of the form
p(u) = c1 × u + c0 (6.6)
the gradient of the steepest part of the curve is on the order of 10^8, thus large
multipliers would be required. To overcome this problem, we use scaling factors
of multiples of two to reduce the magnitude of the gradient, essentially trading
precision for range. This is appropriate since the larger the gradient, the less
important precision becomes. The use of scaling factors provides the user the
ability to control the precision for both c1 and c0, resulting in variation of the
size of the multiplier and adder. Hence for each segment, four coefficients are
stored: c1 and its scaling factor, c0 and its scaling factor. Note that the precision
of the approximation p(x) depends on the maximum error desired between p(x)
and the actual function.
It is also possible to divide the input interval into uniform or non-uniform
intervals, and have uniform or non-uniform segments inside each interval. In this
case, the most-significant bits are used to address the intervals, and the least-
significant bits are used to address the segments inside each interval. It can be
seen that one can have any number of nested combinations of uniform and non-
uniform segments. This hybrid combination of nested uniform and non-uniform
segments provides a flexible way to choose the segment boundaries.
The architecture of our function evaluator, shown in Figure 6.4, is based on
first order polynomials. The most-significant bits are used to select the interval,
and the least-significant bits are passed through the segment address calculator
which calculates the segment address within the interval. The ROM outputs the
four coefficients for the chosen interval and segment. c1 is multiplied by the input
x and c_s1 is used to scale the output. The scaling circuit involves shifters, which
increase or decrease the value by powers of two. This scaled multiplication value
is added to the scaled c0 coefficient to produce the final result.
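The datapath just described can be modelled in a few lines. A behavioural Python sketch follows, with illustrative coefficient values rather than the thesis' actual tables; the shifts stand in for the scaling circuits.

```python
def evaluate(u, c1, s1, c0, s0):
    """Behavioural sketch of the Figure 6.4 datapath: a first-order
    polynomial with per-segment power-of-two scaling,
        p(u) = (c1 * u) * 2**s1 + c0 * 2**s0.
    c1 and c0 are small signed integers; s1 and s0 are the stored scale
    factors (c_s1, c_s0).  Values here are illustrative only.
    """
    prod = c1 * u                                   # small multiplier
    prod = prod << s1 if s1 >= 0 else prod >> -s1   # scaling shifter
    offs = c0 << s0 if s0 >= 0 else c0 >> -s0       # scaled y-intercept
    return prod + offs

print(evaluate(4, c1=3, s1=1, c0=5, s0=0))   # (3*4)<<1 + 5 = 29
```

In hardware the two coefficients and their scale factors come from the coefficient ROM, indexed by the interval and segment address.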
6.5 Function Evaluation for Noise Generator
This section explains in detail how the function evaluation method based on non-
uniform segmentation is used to compute the f and g functions for Gaussian
noise generation (Equations (6.1)∼(6.3)). We first consider the f function. As
stated earlier, the greatest non-linearities of this function occur in the regions
close to zero and one. To be consistent with the change in linearity, we use line
segment locations to boundaries at locations 2n−32 for 0 < u ≤ 0.5, and 1−2−n for
Figure 6.4: Function evaluator architecture based on non-uniform segmentation.
Figure 6.5: Variation of function approximation error with number of bits for the
gradient of the f function.
0.5 < u ≤ 1, where 0 ≤ n < 32. A total of 59 segments are used to approximate
this function as shown in Figure 6.2. Since f approaches infinity for u values
close to zero, the smallest u value is 2^(−32), resulting in a maximum output value
of around 4.7.
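Equations (6.1)–(6.3) fall outside this excerpt, but the quoted maximum of around 4.7 at u = 2^(−32) is consistent with f taking the form sqrt(−ln u), with the factor of two of the classic Box-Muller r = sqrt(−2 ln u) recovered later when two samples are accumulated. A quick numeric check under that assumption:

```python
import math

def f(u):
    # assumed form of f, inferred from the quoted maximum of ~4.7;
    # the thesis' exact definition is in Equations (6.1)-(6.3)
    return math.sqrt(-math.log(u))

u_min = 2.0 ** -32          # smallest representable input with 32 bits
print(round(f(u_min), 2))   # 4.71 -- the quoted maximum output value
```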
The maximum absolute error of this approximation is 0.020 (compared against
IEEE double precision). However this is the case only if we have infinite preci-
sion for the coefficients and data paths, which is not realistic. Multipliers take
significant amount of resources on FPGAs, therefore the coefficients for the gra-
dient should be as small as possible. Tests are carried out to find the optimum
number of bits for the gradient coefficients that provides the least absolute er-
ror. Figure 6.5 shows how the maximum absolute error varies with the number
of bits used for the gradient of the f function. Similar tests are performed for
the y-intercept coefficients and various data paths. The figure indicates that six
bits are sufficient to give a maximum absolute error of 0.031. Our requirement
Figure 6.6: The g functions. Only the thick line is approximated; see Figure 4.
The most significant 2 bits of u2 are used to choose which of the four regions to
use; the remaining bits select a location within Region 0.
is faithful rounding [159] (results are rounded to the nearest or next-nearest
value), where the approximation differs from the true value by less than
one ulp. This error is sufficient to give an output accuracy of eight bits
(three bits for integer and five for fraction). If uniform segments are used, small
segment size would be needed in order to cope with the highly non-linear parts
of the curve. In fact, one would require around 617 million segments to get the
same maximum absolute error with uniform segments. This is a good example to
demonstrate the effectiveness of our non-uniform approach. It is clear that our
approach works well especially for functions with exponential behavior.
The computation of g1 and g2 is carried out in a similar way. Given the
symmetry of the sine and cosine functions, the axis can be considered in four
regions related by symmetry, labeled 0 to 3 in Figure 6.6. To evaluate the func-
tions g1 and g2, due to the symmetry of the sine and cosine functions, only the
input range [0, 1/4) for g1 needs to be approximated [128]. The specific axis-
Figure 6.7: Approximation for g1 over [0, 1/4). The asterisks indicate the segment
boundaries of the linear approximations.
partitioning technique for f is unsuitable for g1, since the non-linearities of the
two functions are different. If the same technique is used, there would be many
unnecessary segments near the beginning and end of the curve, and not enough
segments in the middle regions. As before we consider both the local linearity
of the curve, and the computational concerns with respect to choosing specific
segment boundary locations, leading to the approximations shown in Figure 6.7.
The curve is divided into four uniform intervals and within each interval, non-
uniform segmentation is applied. Note that for each interval, not all taps are
taken from the segment address calculator. The boundaries are chosen in a way
to minimize the approximation error. For the first three intervals, non-uniform
segments increasing and decreasing by powers of two with six segments each are
used. For the last interval, only three segments are used by omitting taps. Since
this interval is the most non-linear, sufficiently good accuracy can be achieved
with only a few segments. We use a total of 21 segments to approximate this
function.
Figure 6.8: Approximation error to f . The worst case and average errors are
0.031 and 0.000048 respectively.
With finite precision on the coefficients and data paths, the maximum absolute
error of this approximation is 0.00079, which is sufficient to give an output accuracy
of eight bits (all eight bits for fraction). Using uniform segments, the same error
can be obtained with a slightly larger number of segments; this is because the
curve does not have high non-linearities.
The maximum absolute errors to the two functions, 0.031 and 0.00079, may
seem to be rather high. However, the average errors to the two functions are
in fact 0.000048 and 0.0000012 respectively. Lower average approximation er-
rors to the functions ensure overall higher noise quality. The error plots for the
approximations to f and g1 are shown in Figures 6.8 and 6.9.
Table 6.1 shows a comparison of the number of segments for the two functions
for non-uniform and uniform segmentation in order to achieve the same worst-case
error. Note that for uniform segmentation, the number of segments needs to be a
power of two. This is because the most-significant n bits are used for addressing.
For instance, the actual number of uniform segments needed for the f function
Figure 6.9: Approximation error to g1. The worst case and average errors are
0.00079 and 0.0000012 respectively.
is 617 million, but one billion segments are used, which is the next power of two
(2^30). We do not have this kind of restriction with our non-uniform addressing
scheme. The table also shows the number of bits used for each coefficient in the
look-up tables. The look-up table sizes are 59× (6 + 5 + 32 + 5) = 2832 bits for
the f function and 21× (8 + 4 + 16 + 4) = 672 bits for the g1 function, giving a
total look-up table size of just 3504 bits for all three functions. With such small
look-up table size, all the coefficients can be stored on-chip for fast access. Note
that the g2 function shares the same look-up table with g1.
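The table-size arithmetic quoted above can be checked directly:

```python
# Coefficient ROM sizes: segments x (c1 + c_s1 + c0 + c_s0) bits
f_table  = 59 * (6 + 5 + 32 + 5)
g1_table = 21 * (8 + 4 + 16 + 4)   # g2 shares this table with g1
print(f_table, g1_table, f_table + g1_table)   # 2832 672 3504
```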
6.6 Implementation
This section presents implementations of the four-stage architecture using FPGA
technology.
We use 32 bits for u1, allowing a maximum output of 6.7σ. Higher values
of σ can be supported by increasing the number of bits for u1; for instance 46
Table 6.1: Comparing two segmentation methods. Second column shows the
comparison of the number of segments for non-uniform and uniform segmentation.
Third column shows the number of bits used for the coefficients to approximate
f and g1.
function   non-uniform   uniform     c1   c_s1   c0   c_s0
f          59            1 billion   6    5      32   5
g1         21            32          8    4      16   4
bits would yield a maximum output of 8σ. For u2, 18 bits are found to be
sufficient without loss of performance (lower bitwidths cause the statistical tests
to fail). This is because the trigonometric functions in g1 and g2 can be computed
over [0,1/4) instead of [0,1), with symmetry used to derive the remainder of the
[0,1) interval. In terms of hardware resources, the size of these uniform random
number inputs (u1, u2, g1 and g2) affects the size of the multipliers and adders
(see Figure 6.4). The more bits there are, the larger the multipliers and adders
must be.
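The quoted output ranges follow from the width of u1: the extreme sample corresponds to the smallest representable u1, and accumulating two samples scales the reachable extreme by sqrt(2). A back-of-envelope check (assuming, as above, that f takes the form sqrt(−ln u)):

```python
import math

def max_sigma(u1_bits):
    """Largest |output| reachable with the given u1 width:
    sqrt(2) * f(2**-u1_bits) = sqrt(2 * u1_bits * ln 2).
    A sanity check of the quoted figures, not the thesis' derivation."""
    return math.sqrt(2.0 * u1_bits * math.log(2.0))

print(round(max_sigma(32), 1))   # 6.7 sigma with 32 bits
print(round(max_sigma(46), 1))   # 8.0 sigma with 46 bits
```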
The combination of 32 bits for u1 and 18 bits for u2 means that 50 shift
registers are needed. We choose to target a period of about 10^18 for the noise
generator, which exceeds by several orders of magnitude even the most ambitious
simulation size that can be contemplated with current hardware. Since 10^18 is
approximately 2^60, we use 60-bit LFSRs. In order for the LFSRs to iterate
through this large period, they are configured with polynomials which will
produce maximum sequence lengths for a given LFSR size [125].
The fifty 60-bit LFSRs can be implemented in configurable hardware using
surprisingly few resources. Recent-generation reconfigurable hardware has a
large amount of user-configurable elements. For instance the Xilinx Virtex-II
XC2V4000-6 has 23040 user-configurable elements known as slices. The SRL16
primitive in Xilinx Virtex FPGAs enables a look-up table to be configured as
a 16-bit shift register. A 60-bit LFSR using SRL16s instead of flipflops can be
packed into three slices instead of 32 [125]. So we just need 150 slices for the 50
LFSRs. Note that all 50 LFSRs are initialized with random seeds.
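One step of such a generator can be sketched in software. The tap positions below model x^60 + x^59 + 1, a polynomial commonly tabulated as maximal-length for 60 bits; the thesis' actual polynomials come from [125], so treat these taps as illustrative.

```python
MASK60 = (1 << 60) - 1

def lfsr60_step(state, taps=(59, 58)):
    """One step of a 60-bit Fibonacci LFSR (software sketch of the
    SRL16-based generators).  Tap positions model x^60 + x^59 + 1,
    an assumed maximal-length polynomial -- see [125] for the real ones."""
    fb = ((state >> taps[0]) ^ (state >> taps[1])) & 1
    return ((state << 1) | fb) & MASK60

s = 0x123456789ABCDEF          # a non-zero seed, as in the hardware
for _ in range(1000):
    s = lfsr60_step(s)
    assert 0 < s <= MASK60     # a non-zero state never reaches zero
```

A maximal-length polynomial guarantees the state cycles through all 2^60 − 1 non-zero values before repeating.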
It could also be argued that application of the central limit theorem should be
unnecessary if f , g1 and g2 are implemented with sufficient accuracy. However,
there is a hardware tradeoff involved in increasing the accuracy of these functions.
We have found that application of the central limit theorem once (by summing
two values as described above) results in a net reduction in complexity when
the corresponding looser tolerances in the piecewise linear approximations are
exploited.
Having a larger number of terms in the central limit theorem step would fur-
ther simplify the linear approximations, but would slow the execution speed due
to the need for accumulating more terms. For instance, when 17 approximations
are used for f and 6 for g, eight values need to be summed in order to pass the
statistical tests. When 59 approximations are used for f and 21 for g, without
summing, the statistical tests fail after around 700 million samples. Therefore,
we sum two samples to pass the tests.
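The effect of this central limit step can be sketched numerically. The model below uses ideal floating-point Box-Muller samples in place of the fixed-point pipeline, and sums two successive samples with a 1/sqrt(2) rescale to keep unit variance; exactly which two values are summed, and where the sqrt(2) lives, is fixed by Equations (6.1)–(6.3), outside this excerpt.

```python
import math, random

def box_muller(rng):
    """Ideal floating-point Box-Muller sample, standing in for the
    fixed-point approximation pipeline."""
    u1, u2 = 1.0 - rng.random(), rng.random()   # u1 in (0, 1]
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

def accumulated(rng):
    """Central limit step: sum two successive samples, rescaled by
    1/sqrt(2) so the output keeps unit variance (behavioural sketch)."""
    return (box_muller(rng) + box_muller(rng)) / math.sqrt(2.0)

rng = random.Random(42)
xs = [accumulated(rng) for _ in range(100000)]
var = sum(x * x for x in xs) / len(xs)
print(var)   # close to 1.0: the accumulation preserves unit variance
```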
Several FPGA implementations have been developed, using the Handel-C
hardware compiler from Celoxica [21]. We have mapped and tested the design
onto a hardware platform with a Xilinx Virtex-II XC2V4000-6 device. This design
occupies 2514 slices, eight block multipliers and two block RAMs, which takes up
around 10% of the device. Stage two, the function evaluator, takes up 2137 slices
or 85% of the slices used. A pipelined version of our design operates at 133MHz,
and hence our design produces 133 million Gaussian noise samples per second.
We have also implemented our design on a Xilinx Spartan-IIE XC2S300E-7
FPGA. This design runs at 62MHz and has 2829 slices and eight block RAMs,
which requires over 90% of this device. This implementation can produce 133
million samples in around two seconds.
It is possible to increase the performance by exploiting parallelism. We have
experimented with placing multiple instances of our noise generator in an FPGA,
and find that there is a small reduction in clock speed probably due to the high
fan-out of the clock tree. For instance, a design with three instances of our noise
generator takes up around 32% of the resources in an XC2V4000-6 device; it runs
at 126MHz, producing 378 million noise samples per second.
In Section 6.7, the performance of the hardware designs presented above is
compared with those of software implementations.
6.7 Evaluation and Results
This section describes the statistical tests that we use to analyze the properties
of the generated Gaussian noise.
In order to ensure the randomness of the uniform random samples u1 and u2,
we have tested the LFSR with the Diehard test suite [113], which is a popular
tool among statisticians for testing uniformity. The LFSR passed all the Diehard
tests indicating that the uniform random samples generated are indeed uniformly
randomly distributed.
We use two well-known goodness-of-fit tests to check the normality of the ran-
dom variables: the chi-square (χ2) test and the Anderson-Darling (A-D) test [32].
The χ2 test involves quantizing the x axis into k bins, determining the actual
and expected number of samples appearing in each bin, and using the results
to derive a single number that serves as an overall quality metric. Let t be the
number of observations, p_i be the probability that an observation falls into
category i, and Y_i be the number of observations that actually do fall into category
i. The χ² statistic is given by
χ² = Σ_{i=1}^{k} (Y_i − t·p_i)² / (t·p_i)    (6.7)
This test, which is essentially a comparison between an experimentally deter-
mined histogram and the ideal PDF, is sensitive not only to the quality of the
noise generator itself, but also to the number and size of the k bins used on the
x axis. For example, a noise generator that models the true PDF accurately for
low absolute values of x but fails for large x could yield a good χ2 result if the
examined regions are too closely centered around the origin. Yet it is precisely
in these high |x| regions that a noise generator is critically important, and most
likely to be flawed.
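The χ² computation of Equation (6.7) can be sketched directly. The version below bins a finite interval and ignores the tail mass beyond it (the thesis bins over [−7, 7]); bin count and range here are illustrative.

```python
import math, random

def chi_square_stat(samples, k=20, lo=-4.0, hi=4.0):
    """Chi-square statistic of Equation (6.7): quantize [lo, hi] into k
    bins, count observed samples Y_i per bin, and compare with the t*p_i
    expected under the unit normal.  Tail mass beyond [lo, hi] is
    ignored in this sketch."""
    t = len(samples)
    ncdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    edges = [lo + (hi - lo) * i / k for i in range(k + 1)]
    stat = 0.0
    for i in range(k):
        p = ncdf(edges[i + 1]) - ncdf(edges[i])
        y = sum(edges[i] <= s < edges[i + 1] for s in samples)
        stat += (y - t * p) ** 2 / (t * p)
    return stat

rng = random.Random(7)
xs = [rng.gauss(0.0, 1.0) for _ in range(20000)]
print(chi_square_stat(xs))   # on the order of k for true normals
```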
Consider a simulation involving generation of 1012 noise samples, conducted
with the goal of exploring performance for a channel decoder in the range of BERs
from 10−9 to 10−10. In samples drawn from a true unit-variance Gaussian PDF, we
would expect that approximately half a million samples from the set of 1012 would
have absolute value greater than x = 5. These high σ noise values are precisely
the ones likely to cause problems in decoding, so a hardware implementation
that fails to produce them faithfully risks creating incorrect and
deceptively optimistic results in simulation. To counter this, we extend the tests
to specifically examine the expected versus actual production of high σ values.
While the χ2 test deals with quantized aspects of a design, the A-D test deals
with continuous properties. It is a modification of the Kolmogorov-Smirnov (K-
S) test [78] and gives more weight to the tails than the K-S test does. The K-S
test is distribution free in the sense that the critical values do not depend on
the specific distribution being tested. The A-D test makes use of the specific
distribution (normal in our case) in calculating critical values. For comparing a
data set to a known CDF F(x), the A-D statistic A² is defined by

A² = Σ_{i=1}^{N} ((1 − 2i)/N) · [ln F(x_i) + ln(1 − F(x_{N+1−i}))] − N    (6.8)

where x_i is the ith sorted and standardized sample value, and N is the sample
size.
A p-value [32] can be obtained from the tests, which is the probability that the
deviation of the observed distribution from the expected one is due to chance alone. A sample set
with a small p-value means that it is less likely to follow the target distribution.
The general convention is to reject the null hypothesis – that the samples are
normally distributed – if the p-value is less than 0.05.
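Equation (6.8) transcribes directly into code. The sketch below evaluates the statistic against the standard normal CDF, assuming the samples are already standardized:

```python
import math, random

def anderson_darling(samples):
    """A-D statistic of Equation (6.8) against the standard normal CDF
    (samples assumed standardized; a direct transcription of the
    formula, not an optimized implementation)."""
    xs = sorted(samples)
    n = len(xs)
    ncdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    s = 0.0
    for i in range(1, n + 1):
        s += (2 * i - 1) * (math.log(ncdf(xs[i - 1]))
                            + math.log(1.0 - ncdf(xs[n - i])))
    return -n - s / n

rng = random.Random(2)
a2 = anderson_darling([rng.gauss(0.0, 1.0) for _ in range(5000)])
print(a2)   # compare against ~2.49, the 5% critical value when the
            # distribution is fully specified
```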
Figures 6.10, 6.11 and 6.12 illustrate the effect on the PDF of different im-
plementation choices. Figure 6.10 shows the PDF obtained when 17 and 6 linear
approximations are used for f and g1 respectively. The figure (as well as the
others in this section) is based on a simulation of four million Gaussian random
variables. There are distinct error regions visible in the PDF, which occur when
there are large errors in the approximation of f and g1. These distinct errors
cause the χ2 and A-D tests to fail. Increasing the number of linear approxima-
tions to 59 and 21 respectively leads to the PDF shown in Figure 6.11. It is clear
that the error regions have decreased significantly. However, although this passes
the A-D test, it fails the χ2 test when the sample size is sufficiently large. When
the further enhancement of summing two successive samples as discussed earlier
is added, the PDF of Figure 6.12 results.
This implementation passes the statistical tests even with extremely large
numbers of samples. We have run a simulation of 10^10 samples to calculate the
p-values for the χ2 and A-D test. For the χ2 test, we use 100 bins for the x axis
over the range [-7,7]. The p-values for the χ2 and A-D tests are found to be 0.3842
and 0.9058 respectively, which are well above 0.05, indicating that the generated
noise samples are indeed normally distributed. To test the noise quality in the
high σ regions, we run a simulation of 10^7 samples over the range [-7,-4] and [4,7]
with 100 bins. This is equivalent to a simulation size of over 10^11 samples. The
p-values for the χ2 and A-D tests are found to be 0.6432 and 0.9143, showing
that the noise quality even in the high σ regions is high.
In order to explore the possibility of temporal statistical dependencies [154]
between the Gaussian variables, we generate scatter plots showing pairs yi and
yi+1. This is to test serial correlations between successive samples, which can
occur if the noise generator is improperly designed. If correlations exist, certain
patterns can be seen in the scatter plot [154]. An example based on 10000 Gaus-
sian variables is shown in Figure 6.13, which displays no obvious correlations.
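As a numeric companion to the visual check, the lag-1 sample correlation of the sequence should sit near zero for a generator free of serial correlation (the thesis itself relies on scatter-plot inspection; this statistic is an added sketch):

```python
import random

def lag1_correlation(ys):
    """Sample correlation between successive pairs (y_i, y_{i+1});
    values near zero indicate no serial correlation at lag 1."""
    n = len(ys) - 1
    a, b = ys[:-1], ys[1:]
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    va = sum((x - ma) ** 2 for x in a) / n
    vb = sum((y - mb) ** 2 for y in b) / n
    return cov / (va * vb) ** 0.5

rng = random.Random(3)
ys = [rng.gauss(0.0, 1.0) for _ in range(10000)]
print(lag1_correlation(ys))   # near zero for uncorrelated samples
```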
Our hardware implementations, described in Section 6.6, have been compared
to several software implementations based on the polar method [78] and the
Ziggurat method [115], which are the fastest methods for generating Gaussian
noise for instruction processors. The software implementations are written in
C generating single precision floating-point numbers, and are compiled with the
GNU gcc 3.2.2 compiler. The uniform random number generator used is the
mrand48 C function in UNIX, which uses a linear congruential algorithm [78]
and 48-bit integer arithmetic (period of 2^48). This algorithm can generate one
billion 48-bit uniform random numbers on a Pentium 4 2.6GHz PC in just 23
seconds.
The results are shown in Table 6.2. The XC2V4000-6 FPGA belongs to
Figure 6.10: PDF of the generated noise with 17 approximations for f and 6 for g
for a population of four million. The p-values of the χ2 and A-D tests are 0.00002
and 0.0084 respectively.
Figure 6.11: PDF of the generated noise with 59 approximations for f and 21
for g for a population of four million. The p-values of the χ2 and A-D tests are
0.0012 and 0.3487 respectively.
Figure 6.12: PDF of the generated noise with 59 approximations for f and 21 for
g with two accumulated samples for a population of four million. The p-values
of the χ2 and A-D tests are 0.3842 and 0.9058 respectively.
Figure 6.13: Scatter plot of two successive accumulative noise samples for a
population of 10000. No obvious correlations can be seen.
Table 6.2: Performance comparison: time for producing one billion Gaussian
noise samples. All PCs are equipped with 1GB DDR-SDRAM.
platform               speed [MHz]   method      time [s]
XC2V4000-6 FPGA        105           96% usage   1
XC2V4000-6 FPGA        126           32% usage   2.6
XC2V4000-6 FPGA        133           10% usage   7.5
XC2S300E-7 FPGA        62            90% usage   16
Intel Pentium 4 PC     2600          Ziggurat    50
AMD Athlon PC          1400          Ziggurat    72
Intel Pentium 4 PC     2600          Polar       147
AMD Athlon PC          1400          Polar       214
the Xilinx Virtex-II family, while the XC2S300E-7 FPGA belongs to the Xilinx
Spartan-IIE family. It can be seen that our hardware designs are faster than
software implementations by 3–200 times, depending on the device used and the
resource utilization. Such speedups are mainly due to the ability to perform bit-
level and parallel operations in FPGAs, which result in a more efficient usage of
silicon area for a given design over general purpose microprocessors.
Figure 6.14 shows how the number of noise generator instances affects the
output rate. While ideally the output rate would scale linearly with the number
of noise generator instances (dotted line), in practice the output rate grows slower
than expected, because the clock speed of the design deteriorates as the number
of noise generators increases. This deterioration is probably due to the increased
routing congestion and delay. We are able to fit up to nine instances on the
Virtex-II XC2V4000-6, which can generate almost one billion noise samples per
second.
We have used our noise generator in LDPC decoding experiments [74]. Al-
though the output precision of our noise generator is 32 bits, 16 bits are found to
be sufficient for our LDPC decoding experiments (other applications such as fi-
nancial modeling [14] may require higher precisions). To obtain a benchmark, we
performed LDPC decoding using a full precision (64-bit floating-point represen-
tation) software implementation of belief propagation in which the noise samples
are also of full precision. We then performed decoding using the LDPC algorithm
but with noise samples created using the design presented in this chapter. Over
many simulations, we have found no distinguishable difference in code performance,
even in the high Eb/N0 (high SNR) regions where the error floor in BER
is as low as 10^(−9) (10^12 codewords are simulated). Generating 10^12 noise samples
on a 2.6GHz Pentium 4 PC takes over 11 hours, whereas a single instance of
our hardware noise generator takes just over two hours. On a PC where LDPC
encoding, noise generation and LDPC decoding are all performed, the simulation
time for 10^12 codeword samples will be far longer than 11 hours, since all three
modules must be executed. However, in our hardware simulation we have the
advantage of running all three modules in parallel. Although our hardware
LDPC decoder is currently at a preliminary stage
(implemented serially), it has a throughput of around 500Kbps, which is over 20
times faster than our PC based simulations. We are currently in the process of
implementing a fully parallel scalable decoder, which we predict will be several
orders of magnitude faster than traditional software simulations.
Comparing our implementation with other hardware Gaussian noise genera-
Figure 6.14: Variation of output rate against the number of noise generator
instances.
tors, the only implementation known on a Xilinx FPGA is the AWGN core [186]
from Xilinx. This implementation follows the ideas presented in [12]. Although
this core is around twice as fast as and four times smaller than our design, it is
only capable of a maximum σ value of 4.7 (whereas we can achieve 6.7 σ and
more). In addition, we have tested the design with our statistical tests, and found
that the noise samples fail the χ² test after around 200,000 samples. Hence,
we find the design to be inadequate for our low BER and high quality LDPC
decoding experiments.
6.8 Summary
We have presented a hardware Gaussian noise generator based on the Box-Muller
method, designed to facilitate simulations implemented in hardware which involve
large numbers of samples. A key aspect of the design is the use of non-uniform
piecewise linear approximations in computing trigonometric and logarithmic func-
tions, with the boundaries between each approximation chosen carefully to enable
rapid computation of coefficients from the inputs.
Our noise generator design occupies approximately 10% of a Xilinx Virtex-
II XC2V4000-6 FPGA and 90% of a Xilinx Spartan-IIE XC2S300E-7, and can
produce 133 million samples per second. The performance can be improved by
exploiting parallelism: an XC2V4000-6 FPGA with nine parallel instances of the
noise generator at 105MHz can run 50 times faster than a 2.6GHz Pentium 4 PC.
Statistical tests, including the χ2 test and the A-D test, as well as application
in LDPC decoding have been used to confirm the quality of the noise samples.
The output of the noise generator accurately models a true Gaussian PDF even
at very high σ values.
This noise generator has been integrated with the LDPC decoder presented
in [74], and is being used for exploring LDPC code behavior at UCLA and JPL
(Jet Propulsion Laboratory, NASA). It is also being used at the Chinese Univer-
sity of Hong Kong for Monte Carlo simulations of financial models [192]. In the
next chapter, we describe another hardware Gaussian noise generator based on a
recent method proposed by Wallace [180].
CHAPTER 7
Gaussian Noise Generator
using the Wallace Method
7.1 Introduction
Most methods, including the Box-Muller method described in the previous
chapter, produce normal variables by performing operations on uniform variables.
In contrast, in [180] Wallace proposes an algorithm that completely avoids
the use of uniform variables, operating instead using an evolving pool of normal
variables to generate additional normal variables. The approach draws its inspi-
ration from uniform random number generators that generate one or more new
uniform variables from a set of previously generated uniform variables. Given a
set of normally distributed random variables, a new set of normally distributed
random variables can be generated by applying a linear transformation.
Although the Wallace method is simple and fast, it can suffer from correlations
at the output due to its feedback nature. This issue will be discussed in detail in
Chapter 8.
The principal contribution of this chapter is a hardware Gaussian noise gen-
erator based on the Wallace method that offers quality suitable for simulations
involving very large numbers of noise samples. The noise generator occupies ap-
proximately 3% of the resources on a Xilinx Virtex-II XC2V4000-6 device, while
producing over 155 million samples per second. The key contributions of our
work include:
• a hardware architecture for the Wallace method;
• exploration of hardware implementations of the proposed architecture tar-
geting both advanced high-speed FPGAs and low-cost FPGAs;
• evaluation of the proposed approach using several different statistical tests,
including the chi-square test and the Anderson-Darling test, as well as
through application to a large communications simulation involving LDPC
codes.
The rest of this chapter is organized as follows. Section 7.2 provides an
overview of the Wallace method. Section 7.3 describes our Wallace implementa-
tion, and discusses how each of its steps can be handled in a hardware architec-
ture. Section 7.4 describes technology-specific implementation of the hardware
architecture. Section 7.5 discusses evaluation and results, and Section 7.6 offers
a summary.
7.2 The Wallace Method
The Wallace approach [180] draws its inspiration from uniform random number
generators that generate one or more new uniform variables from a set of previ-
ously generated uniform variables. Given a set of normally distributed random
variables, a new set of normally distributed random variables can be generated
by applying a linear transformation. Brent [16] implemented a fast vectorized
Gaussian random number generator using the Wallace method on the Fujitsu
VP2200 and VPP300 vector processors. In [17] and [157], Brent and Rub outline
Figure 7.1: Overview of the Wallace method.
the possible problems associated with the Wallace method and discuss ways of
avoiding them.
The Wallace method is a fast algorithm for generating normally distributed
pseudo-random numbers which generates the target distributions directly using
their maximal-entropy properties. This algorithm is particularly suitable for high
throughput hardware implementation since no transcendental functions such as
√x, log(x) or sin(x) are required. It maintains a pool of KL normally distributed
random numbers. These values are normalized so
that their average squared value is one. In L transformation steps, K numbers are
treated as a vector X, and transformed into K new numbers, the components
of the K-vector X′ = AX, where A is an orthogonal matrix. If the original K
values are normally distributed, then so are the K new values. Furthermore, this
transformation preserves the sum of squares. An overview of the Wallace method
is depicted in Figure 7.1.
The process of generating a new pool of normally distributed random numbers
is called a ‘pass’. After a pass, a pool of new Gaussian random numbers is formed.
As there are KL variables in the data pool, L transformation steps are performed
during each pass. A K-vector X is multiplied with the orthogonal matrix A in
177
performing a transformation step.
As stated by Wallace, it is desirable that any value in the pool should even-
tually contribute to every value in the pools formed after several passes. In
Wallace’s original method, the old pool is treated as an L-by-K array stored
in row-major order, and the new pass is treated as an L-by-K array stored in
column major order. Hence, each pass effectively transposes the values in the
pool. If L is odd, the transposition is sufficient to ensure eventual mixing of the
values. However, if L is even (which is desirable for hardware implementation),
transposition alone is not sufficient. We describe in Section 7.3 how we overcome
this problem to reduce correlation even further.
The initial values in the pool are normalized so that their average squared
value is one. Because A is orthogonal, the subsequent passes do not alter the
sum of the squares. This would be a defect, since if x_1, ..., x_N are independent
samples from the normal distribution, we would expect ∑_{i=1}^{N} x_i² to have a
chi-squared distribution χ²_N. In order to overcome this defect, a variate from the
previous pool is used to approximate a random sample S from the χ²_N distribution.
A scaling factor is introduced to ensure that the sum of the squares of the values
in the pool is S, the random sample.
7.3 Architecture
This section provides an overview of the hardware design for the Wallace method,
which involves a four-stage hardware architecture shown in Figure 7.2. The
implementation of this architecture in FPGA technology will be presented in
Section 7.4. In Figure 7.2, the select signals for the multiplexors and the clock
enable signals for the registers are omitted for simplicity.
178
[Figure: Stage 1 — 30 LFSRs with an “LFSR Seed ROM” producing the 10-bit values start, stride and mask; Stage 2 — computation of p_addr, q_addr, r_addr and s_addr; Stage 3 — dual-port “Pool RAM” with “init Pool ROM” and the transformation circuit on 24-bit data; Stage 4 — sum-of-squares correction (S × C2 + C1) producing the noise output]
Figure 7.2: Overview of our Gaussian noise generator architecture based on the
Wallace method. The triangle in Stage 4 is a constant coefficient multiplier.
179
In our design illustrated in Figure 7.2, we choose K = 4 and L = 256, giving
a pool size N of 1024. On-chip, true dual read/write port synchronous RAM is
used to implement the pool. The dual-port RAM allows two values to be read
and written simultaneously, improving the memory bandwidth.
As all the variables from the pool are used to generate the new pseudo random
numbers, the indices should cover all the numbers in the pool and at the same
time reduce the correlations between them. The addresses which index the pool
start from a random origin ‘start’, are stepped by a random odd ‘stride’, and are
XORed with a random ‘mask’.
In order to achieve better mixing of the Gaussian random number generator,
more pass types can be used during a pass by introducing different orthogonal
matrices. As in Wallace’s original implementation, two orthogonal matrices A0
and A1 are chosen for our design:
A0 = (1/2) ×
[  1  −1  −1  −1 ]
[  1  −1   1   1 ]
[  1   1  −1   1 ]
[  1   1   1  −1 ]

A1 = (1/2) ×
[ −1   1   1   1 ]
[ −1   1  −1  −1 ]
[ −1  −1   1  −1 ]
[ −1  −1  −1   1 ]
During a pass, A0 is used for transformation steps 0 to 127 and A1 for steps 128 to 255. As the
elements of the matrices A0 and A1 are only 1 or −1, only simple integer addition
and shift operations are required. The Gaussian random variables in the pool are
held as 24-bit two’s complement integers. For the given set of four values p, q, r, s
180
to be transformed, and with our choice of A0 and A1, the new values p′, q′, r′, s′
can be calculated from the old ones as follows:
p′ = p − t;   q′ = t − q;   r′ = t − r;   s′ = t − s;   (7.1)
and
p′ = t − p;   q′ = q − t;   r′ = r − t;   s′ = s − t;   (7.2)
where t = (1/2)(p + q + r + s).
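As a software sketch (not the thesis RTL), the two transformation steps of equations (7.1) and (7.2) can be written as:

```python
def wallace_transform(p, q, r, s, use_a1=False):
    # One K = 4 orthogonal transformation step (equations (7.1) and (7.2)).
    # Both matrices are orthogonal, so the sum of squares is preserved.
    t = 0.5 * (p + q + r + s)
    if use_a1:
        return t - p, q - t, r - t, s - t   # A1 step, equation (7.2)
    return p - t, t - q, t - r, t - s       # A0 step, equation (7.1)
```

Because only halving, addition and subtraction appear, the hardware needs no multipliers for this step.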
7.3.1 The First Stage
This stage involves generation of the uniformly distributed realizations start,
stride and mask. The implementation of this stage is straightforward, and can
be accomplished using well-known techniques based on Linear Feedback Shift
Registers (LFSRs) [24]. To ensure maximum randomness, we use an independent
shift register for each bit of start, stride and mask. The resources needed are
related to the periodicity desired in the shift registers. Since m-bit LFSRs with
irreducible polynomials can produce random numbers with periodicity of 2^m − 1,
the hardware required will be proportional to the number of bits of precision
needed. Since we use a pool size of 1024, 10 bits are needed for the three variables,
meaning that 30 LFSRs are needed. If the reset signal is set, we would like to
generate the same sequences again. The “LFSR Seed ROM” contains the initial
seeds for the 30 LFSRs, which are loaded when the reset signal is set. 52-bit
LFSRs are used in our architecture, which give a period of 2^52 − 1 (≈ 4.5 × 10^15).
Hence, the size of the “LFSR Seed ROM” is 30 × 52 = 1560 bits.
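The behaviour of one such LFSR can be sketched in software as follows. The feedback taps are illustrative: the thesis does not list the polynomial used, so the trinomial x^52 + x^3 + 1 below is an assumption, not the design's actual polynomial.

```python
def lfsr_step(state, nbits=52, taps=(52, 3)):
    # One step of a Fibonacci LFSR: the feedback bit is the XOR of the tap
    # bits.  taps=(52, 3) assumes the trinomial x^52 + x^3 + 1; the thesis
    # does not specify which degree-52 polynomial is used.
    fb = 0
    for t in taps:
        fb ^= (state >> (t - 1)) & 1
    return ((state << 1) | fb) & ((1 << nbits) - 1)
```

A maximal-length m-bit LFSR visits all 2^m − 1 non-zero states; for example, a 3-bit LFSR with the primitive polynomial x³ + x² + 1 (taps (3, 2)) cycles through all 7 non-zero states.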
181
7.3.2 The Second Stage
This stage follows the techniques used by Wallace in his FastNorm2 implemen-
tation [181]. It generates the addresses for the four values p, q, r, s from start,
stride and mask. To ensure the value of stride is odd, OR with one is performed.
The addresses are calculated as follows:
p_addr = start ⊕ mask   (7.3)
q_addr = (start + stride) ⊕ mask   (7.4)
r_addr = (start + stride × 2) ⊕ mask   (7.5)
s_addr = (start + stride × 3) ⊕ mask   (7.6)
The multiplication by two is implemented simply by a left shift, and the mul-
tiplication by three is implemented by a left shift followed by an adder. This
addressing scheme ensures that the correlations between variables are kept to a
minimum.
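A software sketch of this addressing scheme (equations (7.3)–(7.6)), assuming a 1024-entry pool so that addresses wrap modulo 2^10:

```python
def pool_addresses(start, stride, mask, pool_bits=10):
    # Equations (7.3)-(7.6): random origin, random odd stride, XOR mask.
    stride |= 1                      # force the stride to be odd
    m = (1 << pool_bits) - 1
    return [((start + i * stride) & m) ^ mask for i in range(4)]
```

Because the stride is odd and the pool size is a power of two, stepping by the stride modulo 2^10 never revisits an address within a pass, and the XOR mask permutes addresses without introducing duplicates.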
7.3.3 The Third Stage
This stage involves the most interesting challenge: efficiently performing the
actual transformation. This stage contains the “Pool RAM” which holds the
pool of 1024 Gaussian random variables. Dual-port RAM is used to implement
the pool. Since each variable in the pool is 24 bits, the total size of the pool
is 1024 × 24 = 24576 bits. The “init Pool ROM” and the counter are used to
initialize the pool with the original pool contents when the reset signal is set; this
ROM is single ported and has the same size as the pool. The contents of this
ROM is generated in software using the Box-Muller method, and the variables
are normalized so that their average squared value is one.
Figure 7.3 shows how we perform the transformation steps described in equa-
182
Figure 7.3: The transformation circuit of Stage 3. The square boxes are registers.
The select signals for the multiplexors and the clock enable signals for the registers
are omitted for simplicity.
tions (7.1) and (7.2). The timing diagram of this circuit and the “Pool RAM” is
illustrated in Figure 7.4. All ports and registers of the transformation circuit and
ports of the dual-port RAM are shown. We observe that the dual-port RAM is
fully utilized. t is calculated in three steps:
x = p + q (7.7)
y = r + s (7.8)
t = x + y. (7.9)
In principle, we could share a single adder in conjunction with multiplex-
ors to perform all the operations of the transformation circuit. However, high-
183
Figure 7.4: Detailed timing diagram of the transformation circuit and the dual-
port “Pool RAM”. A_z indicates the address of the data z and WE is the write
enable signal of the “Pool RAM”.
184
speed adders are efficiently implemented on FPGAs by fast-carry chains. In fact,
both a two-input 24-bit multiplexor and a 24-bit adder occupy 14 slices (user-
configurable elements on the FPGA) in a Xilinx Virtex-II FPGA. In addition, the
use of multiplexors would increase the delay significantly. For these reasons, we
decide to use separate adders/subtractors for each operation. For other devices
such as Application-Specific Integrated Circuits (ASICs), it can be more efficient
to adopt the former approach involving hardware sharing. The critical path of
the entire Wallace design is from Rp to Rp′ which is just a multiplexor followed
by a subtractor.
7.3.4 The Fourth Stage
This stage performs the sum of squares correction described in Section 7.2. It
follows the approach used by Wallace in his FastNorm2 implementation [181].
A random sample S with an approximate χ²_N distribution can be obtained as
S = (1/2)(C + A × x)²   (7.10)
where x has a unit normal distribution, A = 1 + 1/(8N) and C = √(2N − A²) for large
N. Hence, the scaling factor √(S/N) can be computed as
√(S/N) = √(1/(2N)) × A × (B + x)   (7.11)
where B = C/A. We set C2 = A × √(1/(2N)) and C1 = B × C2.
The noise sample C′, generated by the transformation circuit of Stage 3,
is multiplied by G to correct the sum of the squares, yielding the final noise
sample. G is obtained by
G = S × C2 + C1.   (7.12)
Since C1 and C2 are constants, they are precalculated in software and stored as
constants in the hardware design.
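Under one consistent reading of the constants (C2 = A·√(1/(2N)) and C1 = B·C2, which makes G²·N equal the approximate χ²_N sample of equation (7.10)), the correction gain can be sketched as follows. This grouping is an interpretation, not taken verbatim from the thesis:

```python
import math

def correction_gain(x, n=1024):
    # Scaling factor G applied to each output sample.  With x a unit normal
    # variate, g*g*n reproduces the approximate chi-square sample
    # 0.5*(C + A*x)**2 of equation (7.10).  The constant grouping is one
    # consistent reading of the text, not the thesis's literal constants.
    a = 1.0 + 1.0 / (8.0 * n)
    c = math.sqrt(2.0 * n - a * a)
    c2 = a * math.sqrt(1.0 / (2.0 * n))
    c1 = (c / a) * c2
    return x * c2 + c1
```

For x near zero the gain is close to one, so the correction is a small rescaling of the pool.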
185
Before a pass, S is assigned a variate from the previous pass, and G is
updated. For the very first pass, when the reset signal is set, G is initialized to
1/√(N/ts), where ts is the sum of squares of the initial pool. Note that we are
using a pool size of N = 1024.
7.4 Implementation
This section presents implementations of the four-stage architecture using FPGA
technology.
The 30 52-bit LFSRs in Stage 1 can be implemented in configurable hardware
using a small amount of resources. Recent FPGAs have many user-configurable
elements: for instance, the Xilinx Virtex-II XC2V4000-6 device has 23040 user-
configurable elements known as slices. A look-up table can be configured as a
16-bit shift register using the SRL16 primitive in Xilinx Virtex and Virtex-II
series FPGAs. A 52-bit LFSR using SRL16s instead of flipflops can be packed
into three slices instead of 32 [125]. Hence, our design contains 90 slices to
implement the 30 52-bit LFSRs. Note that all 30 LFSRs are initialized with
uniformly distributed random seeds.
Xilinx Virtex-II devices have embedded memory elements and multipliers,
which are known as block RAMs and MULT18X18s. Each block RAM can hold
18Kb of data and each embedded multiplier can implement an 18-bit by 18-bit
multiplication. If the data or the multiplication is larger than 18Kb or 18-bit by
18-bit, the Xilinx tools will use multiple block RAMs and embedded multipliers
to implement them. The Xilinx Virtex-II XC2V4000-6 device has 120 block
RAMs and 120 embedded multipliers in total. The “LFSR Seed ROM” and
the “init Pool ROM” are implemented using single-port block RAMs, while the
186
“Pool RAM” is implemented using dual-port block RAMs. The sizes of “LFSR
Seed ROM”, “init Pool ROM” and “Pool RAM” are 1560, 24576 and 24576 bits.
Hence they occupy one, two and two block RAMs respectively. The constant
coefficient multiplier in Stage 4 uses two block RAMs to implement part of the
multiplication. The 24-bit by 24-bit multiplier in Stage 4 occupies four embedded
multipliers.
Several FPGA implementations have been developed, using Xilinx System
Generator 6.2 [188]. All designs are heavily pipelined to maximize throughput.
Synplicity Synplify Pro 7.5.1 is used for synthesis with the retiming and pipelin-
ing options turned on. For place-and-route, Xilinx ISE 6.2.01i is used with the
maximum effort level and the clock constraints are carefully tuned to give the
fastest clock frequency. We have mapped and tested the Wallace design onto a
hardware platform with a Xilinx Virtex-II XC2V4000-6 FPGA. The design occu-
pies 895 slices, seven block RAMs and four embedded multipliers, which takes up
around 3% of the device. The pipelined design operates at 155MHz, and hence
our design produces 155 million Gaussian noise samples per second. The resource
usage of each of the four stages is shown in Table 7.1. It may be surprising to see
that Stage 1 occupies 281 slices, since the 30 LFSRs require just 90 slices. This is
due to extra components such as logic gates, registers and multiplexors required
to initialize the LFSRs with seeds. Xilinx System Generator design diagrams of
Stage 1 and Stage 2 are depicted in Figure 7.5 and Figure 7.6. Stage 3 and Stage
4 are shown in Figure 7.7.
The latency of our design is 1680 clock cycles (≈ 11µs at 155MHz). 1560 cycles
are used to initialize the 30 52-bit LFSRs. The LFSRs need to be initialized one
by one, since the “LFSR Seed ROM” is single ported. The other 120 cycles are
needed to fill up the pipelines of the design. Although the latency is very large,
187
Figure 7.5: Wallace architecture Stage 1 in Xilinx System Generator. The 30
LFSRs generate uniform random bits for Stage 2.
188
Figure 7.6: Wallace architecture Stage 2 in Xilinx System Generator. Pseudo
random addresses for p, q, r, s are generated.
189
Figure 7.7: Wallace architecture Stage 3 and Stage 4 in Xilinx System Generator.
Orthogonal transformation is performed and sum of squares corrected.
190
Table 7.1: Resource utilization for the four stages of the noise generator on a
Xilinx Virtex-II XC2V4000-6 FPGA.
stage slices block RAMs multipliers
1 281 1 -
2 180 - -
3 214 4 -
4 220 2 4
total 895 7 4
it is not important since we only care about the throughput in a hardware based
simulation. Figures 7.8 and 7.9 show the placed and routed Wallace designs on
a Xilinx Virtex-II XC2V4000-6 FPGA.
From a hardware designer’s point of view, it is interesting to explore the
tradeoffs between using different types of hardware resources. For instance, a
look-up table can be implemented using block RAM or distributed RAM with
slices. Table 7.2 shows our noise generator implemented using different FPGA
resources. We observe that the design using slices only requires more than four
times the number of slices and has a significantly lower clock speed than our original
design. Also, the area and speed penalty of using slices to implement tables
instead of block RAMs is especially high. Hence in our opinion, dedicated FPGA
resources such as block RAMs and embedded multipliers should be used whenever
applicable.
We have also implemented our design on a low-cost Xilinx Spartan-III XC3S200E-
5 FPGA. The design runs at 106MHz and takes up the same amount of resources
191
Figure 7.8: Our Wallace design placed on a Xilinx Virtex-II XC2V4000-6 FPGA.
Figure 7.9: Our Wallace design routed on a Xilinx Virtex-II XC2V4000-6 FPGA.
192
Table 7.2: Hardware implementation results of the noise generator using different
types of FPGA resources on a Xilinx Virtex-II XC2V4000-6 FPGA.
FPGA resources used slices block RAMs embedded multipliers speed [MHz]
slices + block RAMs + multipliers 895 7 4 155
slices + block RAMs 1215 7 - 152
slices + multipliers 3702 - 4 118
slices 4020 - - 112
as the Virtex-II design above, which requires around half of the resources in the
device.
The performance can be improved by concurrent execution. We have experi-
mented with placing multiple instances of our noise generator in an FPGA, and
discovered that there is a small reduction in clock speed due to increased routing
congestion. For example, eight instances of our noise generator on an XC2V4000-
6 FPGA run at 144MHz. They take up around 31% of the resources, producing
over one billion noise samples per second.
7.5 Evaluation and Results
This section describes the statistical tests that we use to analyze the properties
of the generated Gaussian noise.
To ensure the randomness of the uniform random numbers start, stride and
mask, we have tested the LFSRs with the Diehard tests [113]. The LFSRs pass all
193
the tests indicating that the uniform random samples generated are indeed uni-
formly randomly distributed. As in Chapter 6, we use two well-known goodness-
of-fit tests to check the normality of the random variables: the chi-square (χ2)
test and the Anderson-Darling (A-D) test [32].
Our hardware Wallace implementation passes the statistical tests even with
extremely large numbers of samples. We have run a simulation of 10^10 samples
to calculate the p-values for the χ2 and A-D tests. For the χ2 test, we use 100 bins
for the x axis over the range [−7, 7]. The p-values for the χ2 and A-D tests are
found to be 0.5385 and 0.7372 respectively, which are well above 0.05, indicating
that the generated noise samples are indeed normally distributed. To test the
noise quality in the high σ regions, we run a simulation of 10^7 samples over the
ranges [−7, −4] and [4, 7] with 100 bins. This is equivalent to a simulation size of
over 10^11 samples. The p-values for the χ2 and A-D tests are found to be 0.6839
and 0.7662, showing that the noise quality even in the high σ regions is high.
If (x, y) is a pair of random numbers with Gaussian distributions, then u =
e^{−(x²+y²)/2} should be uniform over [0, 1]. Six million Gaussian variables, randomly
picked from a population of 10^10 samples generated from our design, are trans-
formed using this identity, resulting in three million uniform random variables.
These uniform variables are tested with the Diehard tests [113] for uniformity.
They pass all tests, indicating that the transformed numbers are indeed uniformly
distributed.
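This Gaussian-to-uniform identity is easy to sketch and check directly:

```python
import math

def gauss_pair_to_uniform(x, y):
    # For independent x, y ~ N(0, 1), x*x + y*y is exponential with mean 2,
    # so u = exp(-(x*x + y*y) / 2) is uniformly distributed over (0, 1].
    return math.exp(-(x * x + y * y) / 2.0)
```

For instance, the median of x² + y² is 2 ln 2, which maps to u = 0.5, the median of the uniform distribution.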
To explore the possibility of temporal statistical dependencies [154] between
the Gaussian variables, we generate scatter plots showing pairs yi and yi+1. This
is to test serial correlations between successive samples, which can occur if the
noise generator is improperly designed. If undesirable correlations exist, certain
patterns can be seen in the scatter plot [154]. An example based on 10000 Gaus-
194
Figure 7.10: Scatter plot of two successive noise samples for a population of
10000. No obvious correlations can be seen.
sian variables is shown in Figure 7.10; there is no evidence of obvious correlations.
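A numerical counterpart to the scatter plot is the lag-1 sample correlation coefficient, which should be close to zero for a well-behaved generator; a plain-Python sketch:

```python
import math

def lag1_correlation(y):
    # Sample correlation between successive pairs (y[i], y[i+1]).
    a, b = y[:-1], y[1:]
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)
```

A strongly banded scatter plot corresponds to a coefficient near ±1, while independent samples give a value near zero.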
Figures 7.11 and 7.12 show the PDF obtained from our Gaussian noise gen-
erator for populations of one and four million samples. Both sets of samples pass
the χ2 and the A-D test, and we observe that the PDF is very smooth when the
sample size is large.
We compare our design with two other designs: “White Gaussian Noise Gen-
erator” block available in Xilinx System Generator 6.2 [188] and the design de-
scribed in Chapter 6. The “White Gaussian Noise Generator” block is based on
the “Additive White Gaussian Noise (AWGN) Core 1.0” from Xilinx [186]. The
Xilinx core follows the architecture presented by Boutillon et al. in [12], which
uses the Box-Muller method in conjunction with the central limit theorem. The
195
Figure 7.11: PDF of the generated noise from our design for a population of one
million. The p-values of the χ2 and A-D tests are 0.9994 and 0.2332 respectively.
design in Chapter 6 is also based on the Box-Muller method and central limit
theorem, but we employ more sophisticated approximation techniques for the
mathematical functions in the Box-Muller method, resulting in significantly more
statistically accurate noise samples. We test the noise samples generated from
the Xilinx block with the χ2 and the A-D test. We find that the samples fail
the tests after just around 160,000 samples. Figure 7.13 shows the PDF for a
population of one million noise samples from the Xilinx block. The samples fail
both the χ2 and the A-D test, and we observe some undesirable spikes in the
PDF.
Table 7.3 compares the Xilinx block, our Box-Muller design in Chapter 6 and
our current Wallace design. We can see that the Xilinx block uses fewer resources
and is slightly faster than our Wallace design, but as mentioned above the block fails
the statistical tests after a very small number of samples. Both of our Box-Muller
196
Figure 7.12: PDF of the generated noise from our design for a population of four
million. The p-values of the χ2 and A-D tests are 0.7303 and 0.8763 respectively.
and Wallace designs pass the statistical tests, even with very large numbers of
samples. However, our Wallace design is almost three times smaller and
slightly faster.
Figure 7.14 shows the variation of the χ2 test p-value with sample size for
the Xilinx block and various Wallace implementations using different data path
bitwidths. The 0.05 p-value pass mark is shown as a dotted line. We observe
that the Xilinx block fails after a small number of samples. For the Wallace
implementations, bitwidths lower than 24 bits fail the test gradually as the sample
size increases. Using 24 bits, which is the bitwidth used in our design, does not
fail the test even at large numbers of samples, and does not show signs of the
quality degrading.
Table 7.4 shows the hardware implementation results when multiple instances
of the noise generator are implemented on the device. We are able to fit up to
197
Figure 7.13: PDF of the generated noise from the Xilinx block for a population
of one million. The p-values of the χ2 and A-D tests are 0.0000 and 0.0002
respectively.
16 instances on the XC2V4000-6 FPGA, the number of block RAMs available on
the device being the limit. Of course, using a bigger device such as the Virtex-4
XC4VFX140-11 device [189] (which has 62848 slices, 560 block RAMs and
192 embedded multipliers), we would be able to fit over 50 instances. Note that it is
perfectly valid to use multiple instances of the noise generator, as long as the
LFSRs and pool RAMs are initialized with different random seeds and noise
samples.
Figure 7.15 shows how the number of noise generator instances affects the
output rate. While ideally the output rate would scale linearly with the number
of noise generator instances (dotted line), in practice the output rate grows slower
than expected, because the clock speed of the design deteriorates as the number
of noise generators increases. This deterioration is probably due to the increased
198
Table 7.3: Comparisons of different hardware Gaussian noise generators imple-
mented on Xilinx Virtex-II XC2V4000-6 FPGAs. All designs generate a noise
sample every clock cycle.
Xilinx [188] Chapter 6 this design
slices 653 2514 895
block RAMs 4 2 7
multipliers 8 8 4
speed [MHz] 168 133 155
pass χ2 test no yes yes
pass A-D test no yes yes
routing congestion and delay.
We have used our noise generator in LDPC decoding experiments [74]. Al-
though the output precision of our noise generator is 24 bits, 16 bits are found
to be sufficient for our LDPC decoding experiments. If precisions higher than 24
bits are required, we can simply increase the size of the data paths and the noise
samples in the memories.
To obtain a benchmark, we performed LDPC decoding using a full precision
(64-bit floating-point representation) software implementation of belief propaga-
tion in which the noise samples are also of full precision. We then performed
decoding using the LDPC algorithm but with noise samples created using the
design presented in this chapter. Over many simulations, we have found no dis-
tinguishable difference in code performance, even in the high Eb/N0 regions where
199
Figure 7.14: Variation of the χ2 test p-value with sample size for the Xilinx block,
12-bit, 16-bit, 20-bit and 24-bit Wallace implementation.
the error floor in BER is as low as 10^−9 (10^12 codewords are simulated).
Our hardware implementations have been compared to several software imple-
mentations based on the Wallace, Ziggurat [115], polar and Box-Muller method [78],
which are known to be the fastest methods for generating Gaussian noise for
instruction processors. For the Wallace and Ziggurat methods, FastNorm2 avail-
able in [181] and rnorrexp available in [115] are used. In order to make a fair
comparison, we use the same uniform number generator for all implementations.
The mixed multiplicative congruential (Lehmer) generator [179] used in the Fast-
Norm2 implementation is chosen. Software implementations are run on an Intel
Pentium 4 2.6GHz PC equipped with 1GB DDR-SDRAM. They are written in
ANSI C and compiled with the GNU gcc 3.2.2 compiler with -O3 optimization,
generating double precision floating-point numbers. The results are shown in Table 7.5.
200
Table 7.4: Hardware implementation results on a Xilinx Virtex-II XC2V4000-6
FPGA for different numbers of noise generator instances. The device has
23040 slices, 120 block RAMs and 120 embedded multipliers in total.
inst slices block RAMs embedded multipliers speed [MHz] million samples / sec
1 895 7 4 155 155
4 3590 28 16 151 606
8 7178 56 32 144 1149
12 10776 84 48 140 1668
16 14359 112 64 115 1843
The XC2V4000-6 FPGA belongs to the Xilinx Virtex-II family, while the
XC3S200E-5 FPGA belongs to the Xilinx Spartan-III family. It can be seen that
our hardware designs are faster than software implementations by 2–491 times,
depending on the device used and the resource utilization. Looking at the PC
results, we can see that the Wallace method performs significantly better than
other methods.
201
Figure 7.15: Variation of output rate against the number of noise generator
instances.
Table 7.5: Performance comparison: time for producing one billion Gaussian
noise samples.
platform speed [MHz] method time [s] ratio
XC2V4000-6 FPGA 115 16 inst 0.54 1
XC2V4000-6 FPGA 155 1 inst 6.5 12
XC3S200E-5 FPGA 106 1 inst 9.4 17
Intel Pentium 4 PC 2600 Wallace 22 41
Intel Pentium 4 PC 2600 Ziggurat 63 117
Intel Pentium 4 PC 2600 Polar 164 304
Intel Pentium 4 PC 2600 Box-Muller 265 491
202
7.6 Summary
We have presented a hardware Gaussian noise generator using the Wallace method
to support simulations which involve very large numbers of samples.
Our noise generator architecture contains four stages. It takes up approxi-
mately 3% of a Xilinx Virtex-II XC2V4000-6 FPGA and half of a Xilinx Spartan-
III XC3S200E-5, and can produce 155 million samples per second. Further im-
provement in performance can be obtained by concurrent execution: 16 parallel
instances of the noise generator on an XC2V4000-6 FPGA at 115MHz can run 41
times faster than software on a 2.6GHz Pentium 4 PC. The quality of the noise
samples is confirmed by two statistical tests: the χ2 test and the A-D test, and
also by applications involving LDPC decoding. The output of the noise generator
accurately models a true Gaussian PDF even at very high σ values. Although the
Wallace design occupies a smaller area and is faster than the Box-Muller design in
Chapter 6, it has slight correlations between successive transformations, which
may be undesirable for certain types of applications. Strategies to reduce such
correlations are discussed in the next chapter.
203
CHAPTER 8
Design Parameter Optimization
for the Wallace Method
8.1 Introduction
The Wallace method [180], described in the previous chapter, creates new outputs
based on linear combinations of a continually refreshed pool of previous outputs.
Outputs are produced in blocks, each containing the same number of values as
the pool, and each of which then becomes the pool for generation of the next
block. This process of generating a new pool from the old pool is called a ‘pass’.
This method is simple and fast, but can suffer from correlations at the output
due to its feedback nature. The main contributions of this chapter are:
• Tests designed specifically to detect correlations in the Wallace method.
• Parameter optimizations to reduce correlations.
• Identification of parameters minimizing execution time and cache require-
ments, while keeping correlations at minimum.
• Detailed performance tradeoff analysis on Athlon XP and Pentium 4 plat-
forms and comparisons with other methods.
This chapter is organized as follows. Section 8.2 provides a brief overview of
the Wallace method. Section 8.3 analyzes correlations that can occur with the
204
Wallace method and proposes stringent tests designed specifically to be sensitive
to such correlations. Section 8.4 describes parameter optimizations which help
to keep correlations sufficiently low to pass statistical tests. Section 8.5 provides
performance tradeoffs with different parameter settings and compares the opti-
mized Wallace method with other methods. Section 8.6 examines modifications
needed to the hardware design in Chapter 7 if the optimized parameters are used,
and Section 8.7 offers a summary.
8.2 Overview of the Wallace Method
At startup, the Wallace method involves seeding a pool with samples drawn
from a zero-mean Gaussian probability density function (PDF), and all subse-
quent outputs are then produced by applying K-dimensional orthogonal trans-
formations in L transformation steps to the contents of the pool. Two key design
parameters are therefore the size of the pool N = KL and the dimension K of
the orthogonal transformation.
Wallace’s original description utilized a pool size N of 1024 and a Hadamard
transform [52] with dimension K = 4 requiring only additions, subtractions and
shifts. Since orthogonal transformations are energy-conserving, if no other rescal-
ing is performed on the pool then the sum of the squares (or variance) of all blocks
would be identical. In order to address this defect, a variate from the previous
pool is used to approximate a random sample from the χ²_N distribution. A scaling
factor is introduced to ensure that the sum of the squares of the values in the
pool is the random sample. We can control the output rate by a factor of R
(i.e. the number of passes performed before noise samples are output) to reduce
205
correlations. This parameter is discussed in detail in the subsequent sections.
A simplified pseudo code of the Wallace method is shown in Figure 8.1. The
generate_addr() function generates pseudo random addresses for the array hold-
ing the pool, and is discussed in Section 8.4. As seen in Figure 8.1, there are no
conditional operations involved in the Wallace method, meaning that the output
data rate is a deterministic function of the underlying clock rate of the system.
It is this attribute as well as the simplicity of the arithmetic that makes the
Wallace method particularly attractive for hardware implementations. Wallace
provides several generations of source code referred to as FastNorm1, FastNorm2
and FastNorm3 [181].
As observed by Brent [16], [17] and Rub [157], one concern of the Wallace
method is the issue of correlations given the use of previous outputs to generate
new outputs. This is particularly problematic in the case of realizations with very
large absolute values lying in the tails of the Gaussian. When such a large value
is created and output, it also enters the pool from where it contributes directly
to K values in the subsequent block, K² values in the next block, and so on with
diminishing influence. Similar correlations can be found in the reverse direction
as well. In other words, the presence of a very large output in a given block
conveys a higher likelihood of abnormally large values in the previous block, as
it is those values which, when linearly combined, led to the large output.
Given the computational advantages that the Wallace method offers, it is
reasonable to ask what design choices can be made in order to maintain extremely
high output noise quality. More specifically, given a requirement that the output
accurately model a Gaussian out to magnitudes of Mσ, what design options exist
to achieve this requirement? In what follows, we discuss the measurement of the
correlations, illustrate their impact in the form of the PDF, explore the extent of
01: for i = 1..R /* R = retention factor */
02: for j = 1..L /* L = N/K */
03: /* read K values from pool */
04: for z = 1..K /* K = matrix size */
05: addr = generate_addr();
06: x[z-1] = pool[addr];
07: end
08: /* apply transformation to the K values */
09: x’[0..(K-1)] = transform(x[0..(K-1)]);
10: /* write K values to pool */
11: for z = 1..K
12: addr = generate_addr();
13: pool[addr] = x’[z-1];
14: end
15: end
16: end
17: pool[0..(N-1)] = sum_of_sq_corr(pool[0..(N-1)]);
18: return pool[0..(N-1)];
Figure 8.1: Pseudo code of the Wallace method.
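The loop structure of Figure 8.1 can be expressed in C roughly as follows. This is an illustrative sketch, not the FastNorm source: the address generator is simplified (the XOR mask is fixed to zero so that one pass provably visits every pool entry exactly once), and the K = 4 transform detailed in Section 8.4 is inlined.

```c
#include <stddef.h>
#include <math.h>

#define POOL_N 1024              /* pool size, as in FastNorm2 */
#define POOL_K 4                 /* transform dimension */

/* Simplified odd-stride address walk (mask = 0 for clarity); FastNorm2
 * additionally XORs a mask and re-seeds addr/stride/mask every pass. */
static unsigned next_addr(unsigned *addr, unsigned stride)
{
    *addr = (*addr + stride) & (POOL_N - 1);
    return *addr;
}

/* One pass over the pool (Figure 8.1 with R = 1). */
void wallace_pass(double pool[POOL_N], unsigned stride /* must be odd */)
{
    unsigned addr = 0;
    for (size_t j = 0; j < POOL_N / POOL_K; j++) {
        unsigned a[POOL_K];
        double x[POOL_K];
        for (size_t z = 0; z < POOL_K; z++) {
            a[z] = next_addr(&addr, stride);
            x[z] = pool[a[z]];
        }
        /* K = 4 orthogonal Hadamard-based transform, equations (8.3) */
        double t = 0.5 * (x[0] + x[1] + x[2] + x[3]);
        pool[a[0]] = t - x[0];
        pool[a[1]] = t - x[1];
        pool[a[2]] = x[2] - t;
        pool[a[3]] = x[3] - t;
    }
}

/* Demo: since the transform is orthogonal and an odd stride visits every
 * address once per pass, the pool's sum of squares is conserved. */
int wallace_pass_conserves_energy(void)
{
    double pool[POOL_N];
    double before = 0.0, after = 0.0;
    for (size_t i = 0; i < POOL_N; i++)
        pool[i] = (double)(i % 7) - 3.0;
    for (size_t i = 0; i < POOL_N; i++) before += pool[i] * pool[i];
    wallace_pass(pool, 5);
    for (size_t i = 0; i < POOL_N; i++) after += pool[i] * pool[i];
    return fabs(before - after) < 1e-6 * before;
}
```

The conserved sum of squares is precisely the "defect" discussed in Section 8.2 that the sum-of-squares correction step exists to break.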
this impact as a function of design parameters, and provide information relating
to when the Wallace method can still produce noise of sufficient quality for a
given application despite the use of recycled outputs.
8.3 Measuring the Wallace Correlations
The chi-square (χ2) goodness-of-fit test [32] is used to check the normality of
the random variables. We focus on correlations due to outputs of high absolute
value relative to σ, as it is those occurrences that correspond to perturbations
of the PDF in subsequent (and previous) output blocks. The following simple
experiment illustrates the correlation problem. We first run FastNorm2, which
uses a pool size of 1024 and transform dimension of four, for a sufficiently long
time to generate 4000 realizations with absolute value exceeding 5σ. We then
extract the 1024 data values comprising the block immediately following each
block containing such a large output, and combine the data from all of these
blocks (a total of 4000 × 1024 ≈ 4 million noise outputs) into a single sequence.
Selecting data preferentially in the neighborhood of the high-value outputs is
fair from a testing standpoint, as an ideal Gaussian noise sequence is of course
independent and identically distributed, and this approach to testing is aimed
specifically at an area of potential weakness of this method.
The four million values are evaluated directly using the χ2 test based on 200
bins spaced uniformly over [−7, 7]. The chi-squared output χ2199 is 3081.6, which is
a strong failure as it is well above the typical upper limit of 232.9 that corresponds
to a 0.05 confidence level. As illustrated in Figure 8.2, the bins causing the failure
are centered in the region of 4σ ∼ 5σ, illustrating the expected effect that in a
method based on reusing outputs, large outputs will lead to more large outputs.
Figure 8.3 shows the result when the quality of the noise is evaluated as a
function of distance relative to a block containing a high-value (> 5σ) output.
The dotted horizontal line is the 0.05 confidence level, i.e. values below this
line pass the χ2 test. The block containing the high value output is indexed
by 0 in the horizontal axis of the figure. An index of 1 refers to the block
[Figure 8.2 plot: per-bin χ2199 contributions versus bin position over [−7, 7];
FastNorm2, N = 1024, R = 1, K = 4, χ2199 = 3081.6.]
Figure 8.2: Four million samples of blocks immediately following the block con-
taining a 5σ output, evaluated with the χ2 test with 200 bins over [−7, 7] for
FastNorm2. The χ2199 contributions of each of the bins are shown.
immediately following block 0; an index of −1 refers to the block immediately
preceding block 0. Block 0 is not shown in the figure, since its χ2199 output is
on the order of millions. Figure 8.3 illustrates two main points. First, it shows
that the correlations are approximately symmetric. In other words, the presence
of a very high output not only leads to poorer noise quality in the following
block, but also indicates statistically exceptional behavior in the previous block.
Second, the improved performance as a function of displacement means that one
way to improve the noise quality is simply to retain only some fraction 1/R of
the output blocks. For the set of parameter choices used in generating the data
in Figure 8.3, choosing R = 3 so that only every third block is delivered to the
output of the noise generator would eliminate the correlation issue, albeit at the
cost of dropping the throughput by a factor of three.
Interestingly, the approach of applying a Gaussian-to-uniform transformation
[Figure 8.3 plot: χ2199 versus block displacement (−6 to 6);
FastNorm2, N = 1024, R = 1, K = 4.]
Figure 8.3: The χ2199 values of blocks relative to a block containing a realization
with absolute value of 5σ or higher. Four million samples are compiled for each
block. The dotted horizontal line indicates the 0.05 confidence level.
followed by a test such as the Diehard suite [113] can fail to capture problems in
the tail regions. If (x, y) is a pair of independent standard Gaussian random
numbers, then u = e^(−(x² + y²)/2) should be uniform over [0, 1]. Indeed, using this
identity, in the specific case of the data used to generate Figures 8.2 and 8.3, all
18 Diehard tests are passed. Given the general scarcity of high absolute value
outputs, even significant deviations in their numbers from the expected amount
can be masked by the mixing that occurs in the Gaussian-to-uniform transfor-
mation. Additionally, failure to isolate and test those blocks in the immediate
neighborhood of high-value outputs can also mask the problem illustrated in
Figures 8.2 and 8.3. The most direct way to identify block-by-block perturbations
in the Wallace output is to use knowledge of the underlying Wallace algorithm
and the locations of block boundaries within the data stream.
8.4 Reducing the Wallace Correlations
There are three basic ways to reduce correlations in the Wallace outputs. First,
the dimension K of the transform can be increased. Higher values of K mean that
each new output is a linear combination of K previous outputs, and this greater
amount of mixing dilutes the impact of any individual member of the pool. While
there are many ways to generate orthogonal transforms of a given size, Hadamard
transforms are particularly attractive because they are trivially generated and,
apart from a scaling factor, can be implemented using only additions and
subtractions. For these reasons, we used Hadamard transforms in the experiments
described below. Second, the overall size of the pool N can be increased. Increasing N while holding
K constant does not directly reduce the correlation between each set of K in-
puts and the K outputs they produce, but distributing the K outputs within a
larger N has a randomizing effect on the output. Finally, as noted above, not
all of the blocks that are generated need to be output as noise samples. At the
cost of reducing the output rate by a factor of R, the correlation impact can be
made arbitrarily small. The most advanced software version provided by Wallace,
FastNorm3, implements this method with R selectable from 2, 4, 8 or 16.
For our experiments, when K = 4, we use the following two Hadamard ma-
trices A0 and A1 and interchange them for each transformation:
    A0 = (1/2) ⎡ −1   1   1   1 ⎤
               ⎢  1  −1   1   1 ⎥
               ⎢ −1  −1   1  −1 ⎥
               ⎣ −1  −1  −1   1 ⎦                          (8.1)
    A1 = (1/2) ⎡  1  −1  −1  −1 ⎤
               ⎢ −1   1  −1  −1 ⎥
               ⎢  1   1  −1   1 ⎥
               ⎣  1   1   1  −1 ⎦                          (8.2)
Note that A1 is simply the negated version of A0, which is a valid approach
to obtain a new Hadamard matrix. For a given set of four values x0, x1, x2, x3
to be transformed, and with our choice of A0 and A1, the new values x′0, x′1, x′2, x′3
can be calculated from the old ones as follows:
x′0 = t− x0; x′1 = t− x1; x′2 = x2 − t; x′3 = x3 − t; (8.3)
and
x′0 = x0 − t; x′1 = x1 − t; x′2 = t− x2; x′3 = t− x3; (8.4)
where t = (1/2)(x0 + x1 + x2 + x3). Rather than straightforward matrix-vector multi-
plication, this approach (as used in the FastNorm implementations) reduces the
number of additions/subtractions required. We perform similar optimizations for
larger transformation matrices. Orthogonal matrices of size 8 and 16 are obtained
by using the property: if H is a Hadamard matrix, then

    H′ = ⎡ H   H ⎤
         ⎣ H  −H ⎦

is also a Hadamard matrix [52].
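This doubling property can be checked mechanically. The following C sketch (illustrative names) builds Hadamard matrices of sizes 2, 4 and 8 by repeated application of the construction and verifies row orthogonality.

```c
#include <stddef.h>

/* Sylvester construction: given a K x K Hadamard matrix H (entries +-1),
 * build the 2K x 2K matrix H' = [ H  H ; H  -H ], which is again
 * Hadamard.  Matrices are stored row-major; 'dim' is K. */
void hadamard_double(const int *h, int *h2, size_t dim)
{
    for (size_t i = 0; i < dim; i++) {
        for (size_t j = 0; j < dim; j++) {
            int v = h[i * dim + j];
            h2[i * 2 * dim + j] = v;                  /* top-left */
            h2[i * 2 * dim + dim + j] = v;            /* top-right */
            h2[(dim + i) * 2 * dim + j] = v;          /* bottom-left */
            h2[(dim + i) * 2 * dim + dim + j] = -v;   /* bottom-right */
        }
    }
}

/* Check the defining property: distinct rows are orthogonal. */
int is_hadamard(const int *h, size_t dim)
{
    for (size_t i = 0; i < dim; i++)
        for (size_t j = i + 1; j < dim; j++) {
            int dot = 0;
            for (size_t k = 0; k < dim; k++)
                dot += h[i * dim + k] * h[j * dim + k];
            if (dot != 0)
                return 0;
        }
    return 1;
}
```

Starting from the trivial 1 × 1 matrix [1], three doublings give the order-8 matrix used for the K = 8 transform (up to row/column sign changes such as those relating A0 and A1).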
It is desirable that any value in the pool should eventually contribute to every
value in the pools formed after several passes. To achieve this mixing effect, we
need a pseudo-random address generator for the indices addr of x0, . . . , xK−1. As
in FastNorm2, we use permutations of the form addr = ((addr + stride) & w) ^ mask,
where w = N − 1. The initial values of addr, stride and mask are generated from
a uniform random number generator at the beginning of each pass, and stride
is ensured to be odd. The sizes of each of the three uniform random numbers are
log2 N bits. Such addresses are produced by the generate_addr() function in
lines 5 and 12 of Figure 8.1. From Figure 8.1, we observe that the number of
integer additions, AND and XOR operations required for address generation is
inversely proportional to K.
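As a sanity check on the update rule, a single application of addr = ((addr + stride) & w) ^ mask is a bijection on {0, …, N − 1}: modular addition and XOR with a constant are both invertible. The small C sketch below (illustrative names, toy N = 64) verifies this exhaustively.

```c
#include <stddef.h>

#define AN 64                     /* small pool size for illustration */

/* One step of the FastNorm2-style address permutation.  In the real
 * generator addr, stride and mask are drawn from a uniform generator at
 * the start of each pass, with stride forced odd so that repeated
 * addition of stride cycles through all N residues. */
unsigned perm_step(unsigned addr, unsigned stride, unsigned mask)
{
    return ((addr + stride) & (AN - 1)) ^ mask;
}

/* A single application of the map is a bijection on {0, ..., AN-1} for
 * any stride and mask, because each of its two component operations is
 * invertible.  Verify by checking all outputs are distinct. */
int perm_step_is_bijective(unsigned stride, unsigned mask)
{
    int seen[AN] = {0};
    for (unsigned a = 0; a < AN; a++) {
        unsigned b = perm_step(a, stride, mask);
        if (b >= AN || seen[b])
            return 0;
        seen[b] = 1;
    }
    return 1;
}
```

Bijectivity of one step does not by itself guarantee that iterating the map visits every address within a single pass; the per-pass re-seeding of addr, stride and mask is what supplies the long-run mixing described above.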
Figure 8.4 illustrates the impact of various design choices on noise quality.
To obtain each data point in the three graphs, four million samples are compiled
from the block immediately after each block containing a high value (absolute
value ≥ 5σ = 5) output. The number of instances necessary to accumulate the
four million samples is a function of pool size. For example, when a pool size of
2048 is used, the generator is run for a long enough time to generate 2000 high-
value outputs. For each such output, the data in the single block immediately
following the one containing the high value output are retained. χ2 tests are
performed using 200 bins spaced uniformly over [−7, 7]. The vertical axis gives
the χ2199 result plotted on a logarithmic scale, with the customary upper limit
of 232.9 indicated using a dotted horizontal line in each figure. The pool size
is plotted on the horizontal axis, also in logarithmic scale. Each figure contains
three curves, corresponding to transform sizes K of 4, 8 and 16 respectively. The
three graphs in Figure 8.4 present the cases of discard ratios R ranging from one
(all data is output) to three (only every third block is output).
We observe that when R = 1, the quality of the noise samples improves with
both the pool size N and the transformation size K. At this retention setting, for
the three transformation sizes 4, 8 and 16, the samples pass the χ2 tests at pool
sizes of 32768, 8192 and 4096 respectively. Comparing the three graphs, as R is
increased the quality improves dramatically for small N . However, there is little
quality difference for large N , suggesting that choosing R > 1 and paying the
associated penalty in speed does not provide significant benefit.
Table 8.1: Number of arithmetic operations per transform/sample for the trans-
formation at various sizes of K.

          Additions/Subtractions      Multiplications
     K    Transform     Sample        Transform     Sample
     4        7           1.8             1           0.3
     8       31           3.9             9           1.1
    16      115           7.3            17           1.1
8.5 Performance Comparisons
There are three factors that affect the execution speed of the Wallace method:
the arithmetic operations for address generation and for the transformation, and
the number of table accesses. Table 8.1 shows the number of arithmetic operations
per transform/sample for the transformations at various sizes of K. These
arithmetic operations comprise the additions/subtractions and multiplications
shown in (8.3) and (8.4) in Section 8.4. From the table, we observe that the
numbers of additions/subtractions and multiplications required per sample are
around K/2 and one, respectively. The number of multiplications for K = 4 is
lower than for the others, because the scaling factor of its orthogonal matrix
is 1/2. For hardware designs or integer
arithmetic, shifts could be used for K = 4 and K = 16 instead of multiplications,
because the scaling factor 1/√K is then a power of two. We use 64-bit double
precision floating-point representation for the arithmetic operations, but integer
arithmetic is also feasible.
Besides the arithmetic operations, another important factor is the number
of table accesses. For each transformation, we read K values from the table
holding the pool, apply the transformation and write the K values back to the
table. For each pass, we perform a total of N reads and N writes in order to
read/write values from the table holding the pool. The number of table accesses
is not affected by K, but is directly proportional to R. As mentioned earlier, R
determines the number of passes before noise samples are output, e.g. if R = 2,
two passes are performed with 2N reads and 2N writes before samples are output.
We use two PCs, one fitted with an AMD Athlon XP 2400+ (2GHz) processor
and the other with an Intel Pentium 4 2GHz processor, for our performance
measurements. These two platforms are arguably the most commonly used for
computer simulations. Both platforms are equipped with 1GB DDR-SDRAM and run
Mandrake Linux 9.1. Designs are written in ANSI C and compiled with the GNU
gcc 3.2.2 compiler with -O3 optimization, generating double precision floating-
point numbers. Processor specific instruction sets such as ‘3DNow!’ or ‘SSE2’ are
not used. The -O3 setting performs optimizations such as prefetching, scalar re-
placement, and loop and memory access transformations. It is recommended for
applications that have loops that heavily use floating-point calculations and pro-
cess large data sets, which is very much the case for the Wallace method. For the
experiments in this section, we measure the execution time: time taken to pro-
duce one noise sample. The specifications of interest of the two processors [1, 68]
are listed in Table 8.2. Details of the data caches of the two processors are ob-
tained using the RightMark Memory Analyzer [153] and are shown in Table 8.3.
Since our noise samples are 64-bit (8-byte) double precision values, in principle,
pools of sizes 32768 and 65536 could fit into the level 2 caches of the Athlon XP
(32768 × 8 = 256KB) and Pentium 4 (65536 × 8 = 512KB) respectively.
Figure 8.5 explores how the execution time of arithmetic operations and table
accesses behave with varying K at N = 4096 and R = 1. Results are obtained
Table 8.2: Specifications of the AMD Athlon XP and Intel Pentium 4 platforms
used in our experiments.

    Specification                 Athlon XP       Pentium 4
    Process                       0.13 micron     0.13 micron
    Processor Core                Thoroughbred    Northwood
    Clock Speed                   2GHz            2GHz
    Pipeline Stages               10              20
    Floating-Point Units          3               1
    Branch Predictor Entries      2048            4096
    Frontside Bus Speed           266MHz          400MHz
from the gprof profiler: for each experiment, one billion iterations are run in
a loop and the overhead of the loop construct is subtracted. The lower part
of the bars shows the time consumed by arithmetic operations, and the upper
part shows the time consumed by table accesses. Besides the transformation, the
arithmetic operation times include other overheads such as address calculation
and branches, but they are small compared to the transformations.
Looking at the Athlon XP results, we observe that the arithmetic operation
times increase with K. However, the table access times decrease. Although
we always read and write 4096 locations, for small K such as K = 4, we read/write
four locations consecutively in L = 1024 steps. For large K such as K = 16, we
perform 16 consecutive reads/writes in L = 256 steps. Such reads and writes,
which correspond to lines 4 and 11 in Figure 8.1, can cause a branch misprediction
when z = K + 1. This results in a pipeline stall where the whole pipeline needs
to be flushed and refilled, causing severe delay. This effect explains the reduction
Table 8.3: Details of the AMD Athlon XP and Intel Pentium 4 data caches.

                        Athlon XP               Pentium 4
    Specification       Level 1    Level 2      Level 1    Level 2
    Size                64KB       256KB        8KB        512KB
    Speed               2GHz       2GHz         2GHz       2GHz
    Latency             3 cycles   11 cycles    2 cycles   9 cycles
    Sets                512        256          32         1024
    Block Size          64 bytes   64 bytes     64 bytes   64 bytes
    Associativity       2 way      16 way       4 way      8 way
in table access times with increasing K. We have used the SimpleScalar x86
processor simulator [166] to confirm that the number of branch mispredictions
reduces with K. Moreover, when compared to the Athlon XP, we see a significant
performance loss in the Pentium 4 results. This loss may be because the
Pentium 4's arithmetic operations are slightly slower than the Athlon XP's,
since it has only one floating-point unit compared to the three available on the
Athlon XP. Table access occupies a large portion of the execution time, possibly
because of the Pentium 4's smaller level 1 cache and its deeper pipeline: its
20-stage pipeline has high branch misprediction penalties [62].
Figure 8.6 explores the execution time tradeoffs as a function of parameter
choice on the two platforms for N = 512 (size = 4KB) to N = 8192 (size =
64KB). Table 8.4 shows the numerical results at N = 4096. On both platforms,
as expected, execution time increases with R, since retaining only a fraction 1/R
of the blocks requires R passes per output block. Looking at the Athlon XP results
in Table 8.4, K = 4 is significantly faster than K = 8 and K = 16, especially
for large R. This observation is likely due to the small number of multiplications
Table 8.4: Execution time in nanoseconds for the AMD Athlon XP and Intel
Pentium 4 platforms at N = 4096.

    R                  1               2               3
    K             4    8   16     4    8   16     4    8   16
    Athlon XP     6    8    9    11   18   18    17   26   27
    Pentium 4    22   19   19    39   31   34    56   46   51
involved when K = 4 (Table 8.1). The Pentium 4 results are less linear, probably
due to the small level 1 cache and branch misprediction penalties. Figure 8.6 also
shows that increasing the pool size N causes no significant change in execution
time, though it does of course require more memory for the pool table. We
conclude that N has no significant execution time impact on either platform, K
has little effect on the Pentium 4 but a notable effect on the Athlon XP, and,
as one would expect, the consequences of increasing R are the most significant
in all cases.
Thus, for these implementations at least, the much improved noise quality
enabled by larger transforms represents an extremely good tradeoff to make.
Based on these observations and the results in Figure 8.4, it is better, in terms
of both noise quality and speed, to use large N and K but keep R = 1. Hence, for
example, choosing N = 4096, R = 1 and K = 16 on both platforms leads to
an optimized Wallace implementation that has low execution time and cache
requirements, while keeping the correlation effects to a minimum.
Figure 8.7 shows the execution time variation for pool sizes of 4KB (N = 512)
to 512KB (N = 65536) at R = 1 and K = 16. Looking at the Athlon XP curve,
the execution time stays roughly constant up to 64KB and then starts to increase
at 128KB. In principle, we could store the whole pool in the level 2 cache up to
256KB. However, it is not just our pool that is stored in the cache; other program
variables and operating system data are stored there as well. Most likely, the
entire pool is kept in the level 2 cache up to 64KB, but beyond this point (e.g.
at 128KB) the cache is saturated and cache misses occur, so that parts of the
pool must be fetched from main memory; hence the sudden increase in execution
time. The same applies to the Pentium 4 curve, except that the saturation effect
occurs at 256KB, due to the Pentium 4's level 2 cache being twice as large as the
Athlon XP's.
In order to investigate how the level 2 cache saturation effect varies with
different values of N , we again use the SimpleScalar x86 simulator. Figure 8.8
shows the level 2 cache miss rates for different level 2 cache sizes at various pool
sizes, at R = 1 and K = 16. The level 1 cache is fixed at 16KB throughout and 65536
noise samples are generated for each data point. The 256KB level 2 cache result
uses 1024 sets, 128 byte blocks, two way set associativity, and LRU (least recently
used) replacement policy. Smaller level 2 cache sizes are obtained by reducing
the number of sets by powers of two. As expected, we observe a rapid increase in
miss rate once the level 2 cache is saturated; this observation is consistent with
the trend of increasing execution time shown in Figure 8.7.
In Table 8.5, we compare the performance of our optimized Wallace imple-
mentation against the Ziggurat, Polar and Box-Muller methods on the Athlon XP
and Pentium 4 platforms. For the Ziggurat method, rnorrexp from [115] is used.
For the Polar and Box-Muller methods, we follow the algorithms described in [78].
In order to make a fair comparison, we use the same uniform number generator
for all implementations. The mixed multiplicative congruential (Lehmer) gen-
erator [179] used in the FastNorm implementations is chosen. We observe that
the optimized Wallace implementation is more than three times faster than the
Ziggurat method, which is widely regarded as the fastest Gaussian random num-
ber generator for instruction processors. These results suggest that the Wallace
method, with the optimizations proposed in this work, should be considered a
serious candidate when high-speed Gaussian noise generation is required.
Table 8.5: Performance comparison of different software Gaussian random num-
ber generators. The Wallace implementations use N = 4096, R = 1 and K = 16.

    Method         Platform     Execution Time [ns]    Ratio
    Wallace        Athlon XP            9               1
                   Pentium 4           19               2.1
    Ziggurat       Athlon XP           30               3.3
                   Pentium 4           62               6.9
    Polar          Athlon XP          117              13.0
                   Pentium 4          170              18.9
    Box-Muller     Athlon XP          158              17.6
                   Pentium 4          275              30.6
8.6 Hardware Design with Optimized Parameters
The Wallace hardware architecture presented in Chapter 7 uses N = 1024, R = 1
and K = 4, meaning that significant correlation could be detected with our tests
described in Section 8.3. Modifying the hardware architecture to reflect the
new optimized parameters N = 4096, R = 1 and K = 16 would mainly involve
additional addition/subtraction logic and memory. In summary, the following
architectural changes are needed:
• Since each noise variable in the pool is 24 bits, the size of the pool required
would be 4096 × 24 = 98304 bits. Hence, we would need six block RAMs
(each block RAM can hold 18Kb) each for “init Pool ROM” and “Pool
RAM”.
• A pool size of 4096 means that we need 12 bits each for the three random
numbers start, stride and mask. This means that six more LFSRs are
needed, and also six more entries are needed in the “LFSR Seed ROM”,
which will still fit into a single block RAM.
• Since adders are cheap on FPGAs, there will be just a slight increase in the
number of slices for the increased number of transformation operations.
• The scheduling of the transformation circuit (Figures 7.3 and 7.4 in Chap-
ter 7) will have to be modified to reflect the new set of arithmetic operations.
Given that one can find a good scheduling strategy for the transformation,
the optimized parameters will have little effect on the speed, since one can always
pipeline hardware designs. Like the implementation in Chapter 7, the design will
likely run in the 155MHz range on a Virtex-II FPGA, resulting in around
6.5ns per sample.
[Figure 8.4 plots: three panels for R = 1, R = 2 and R = 3, each showing χ2199
(log scale) versus pool size N = 512 to 32768, with curves for K = 4, 8 and 16.]
Figure 8.4: Impact of various design choices on the χ2199 value. Four million
samples are compiled from the block immediately after each block containing an
absolute value of 5σ or higher for each data point. The dotted horizontal line
indicates the 0.05 confidence level.
[Figure 8.5 plot: stacked-bar execution times in ns for K = 4, 8 and 16 at
N = 4096 and R = 1, with arithmetic and table-access percentages shown for the
Athlon XP and Pentium 4.]
Figure 8.5: Speed comparisons at various K at N = 4096 and R = 1. Lower
part: arithmetic operations. Upper part: table accesses.
[Figure 8.6 plots: execution time in ns versus N = 512 to 8192 for all
combinations of R = 1, 2, 3 and K = 4, 8, 16; left panel AMD Athlon XP 2GHz,
right panel Intel Pentium 4 2GHz.]
Figure 8.6: Speed comparisons for different parameter choices. The solid, dashed
and dotted lines are for R = 1, R = 2 and R = 3 respectively.
[Figure 8.7 plot: execution time in ns versus pool size 4KB to 512KB at R = 1
and K = 16, for the Athlon XP (256KB level 2 cache) and the Pentium 4 (512KB
level 2 cache).]
Figure 8.7: Execution times for different pool sizes at R = 1 and K = 16.
The solid and dotted lines are for the Athlon XP and the Pentium 4 processors
respectively.
[Figure 8.8 plot: level 2 cache miss rate in % versus pool size 4KB to 512KB on
the SimpleScalar x86 simulator (level 1 cache 16KB), for level 2 cache sizes of
16KB, 32KB, 64KB, 128KB and 256KB.]
Figure 8.8: Level 2 cache miss rates on the SimpleScalar x86 simulator for differ-
ent pool sizes at R = 1, K = 16 and various level 2 cache sizes. Level 1 cache is
fixed at 16KB and 65536 noise samples are generated for each data point.
8.7 Summary
We have explored the impact of parameter choice on noise quality for the Wallace
Gaussian random number generator. Using tests designed specifically to identify
the presence of correlations due to the use of previous outputs in generating new
outputs, we have identified specific combinations of pool size, transform size, and
retention factor (one example is N = 4096, R = 1 and K = 16) that deliver high
quality noise output at high speeds. Thorough performance tradeoff studies have
been conducted for AMD Athlon XP and Intel Pentium 4 based platforms. With
the aid of these studies, we have shown that the much improved noise quality
enabled by larger transforms and pool sizes represents an extremely good tradeoff
to make. Performance comparisons with other Gaussian random number generators
have been carried out, demonstrating that, given a careful choice of parameters,
the Wallace method is a serious competitor due to its speed advantages. We have
also examined the architectural changes needed if the optimized parameters are
used with the Wallace design presented in Chapter 7.
As noted earlier, the Wallace method is particularly attractive from an im-
plementation standpoint, because of its lack of conditional statements and its
reliance on simple mathematical computations. While the presence of some cor-
relation between data in nearby blocks is an unavoidable byproduct of any ap-
proach using feedback, the results presented here provide specific guidance on
how to create extremely high quality noise with no detectable correlation even
when highly targeted tests are used.
CHAPTER 9
Flexible Hardware Encoder for LDPC Codes
9.1 Introduction
In the past few years, LDPC codes [48], [49] have received much attention be-
cause of their excellent performance and the large degree of parallelism that can
be exploited in the decoder. LDPC codes are widely considered to be the most
promising candidate ECC scheme for many applications in telecommunications
and storage devices. Recently, LDPC codes have been selected over Turbo
codes [7] by Europe’s DVB standards group for next-generation digital satellite
broadcasting due to their superior performance. Provided that the information
block lengths are long enough, performance close to the Shannon limit can be
achieved with LDPC codes.
Although LDPC codes achieve better performance and have low decoding
complexity compared to Turbo codes, one of the major drawbacks of LDPC
codes lies in their apparently high encoding complexity. Whereas Turbo codes
can be encoded in linear time, a straightforward implementation for an LDPC
code has complexity quadratic in the block length. Note that the complexity
referred to here is measured in the number of mathematical operations required
per bit. In [152], Richardson and Urbanke (RU) show that linear time encoding
is achievable through careful linear manipulation of ‘good’ LDPC codes. In their
paper, they present methods to preprocess the parity-check matrix H and a set
of matrix operations to perform the actual encoding. We have implemented the
preprocessing in software since it needs to be performed only once for a given
H matrix. For the actual hardware encoder, we have identified the operations
that can be run in parallel and scheduled the tasks to maximize throughput.
In addition we have designed an efficient memory architecture for storing sparse
matrices.
The principal contribution of this chapter is a fast and efficient hardware
encoder for both irregular and regular LDPC codes based on the RU method.
The novelties of our work include:
• a software preprocessor that brings the parity-check matrix H into an approxi-
mate lower triangular form;
• a hardware architecture with an efficient memory organization for storing and
performing computations on sparse matrices;
• an implementation and evaluation of the encoder, which achieves an 80 times
speedup over a 2.4GHz PC; we also explore run-time reconfiguration op-
portunities.
The rest of this chapter is organized as follows. Section 9.2 presents an
overview of our approach. Section 9.3 describes how we preprocess the H matrix.
Section 9.4 presents our hardware encoder architecture. Section 9.5 describes the
main components used in our encoder. Section 9.6 discusses our implementation
results, and Section 9.7 offers a summary.
[Figure 9.1 diagram: the m × n matrix H partitioned into blocks A, B and T
(lower triangular, with the zero region above its diagonal) in the top m − g rows,
and C, D and E in the bottom g rows.]
Figure 9.1: The parity-check matrix H in ALT form. A, B, C, and E are sparse
matrices, D is a dense matrix, and T is a sparse lower triangular matrix.
9.2 Overview
The RU algorithm, as described in Section 2.6.3, consists of two steps: a prepro-
cessing step and the actual encoding step. In the preprocessing step, row and
column permutations are performed to bring the parity-check matrix H into an
approximate lower triangular (ALT) form (Figure 9.1). Since the transformation
is accomplished by permutations only, the sparseness of the matrix is preserved.
The actual encoding is carried out by matrix-multiplication, forward-substitution
and vector addition operations. Since the preprocessing needs to be performed
only once on a given H matrix, we execute this operation in software. The ac-
tual encoding step is done in hardware. The RU encoding algorithm is presented
in [152] as a set of matrix operations. We have examined the algorithm and
identified the operations that can be executed in parallel. The operations im-
plemented in our hardware encoder are scheduled to maximize concurrency and
throughput. Moreover, we employ an efficient memory architecture for storing
sparse matrices, which minimizes memory usage.
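As a generic illustration of why sparse storage pays off for a binary H: in GF(2) only the positions of the ones matter, so a compressed sparse row (CSR) layout stores just column indices and row boundaries. The sketch below is a standard textbook layout with illustrative names, not the specific memory organization described later in Section 9.5.

```c
#include <stddef.h>

/* Generic CSR layout for a binary matrix such as H over GF(2): since
 * every nonzero is 1, only the column indices and the per-row boundaries
 * need storing. */
typedef struct {
    size_t rows;
    const size_t *row_ptr;  /* rows + 1 entries; nonzeros of row i occupy */
    const size_t *col_idx;  /* col_idx[row_ptr[i] .. row_ptr[i+1] - 1]    */
} csr_binary;

/* Sparse matrix-vector product over GF(2): y = H * x, i.e. y[i] is the
 * XOR of the bits of x selected by row i's column indices.  This is the
 * basic operation behind the matrix-multiplication steps of the RU
 * encoding procedure. */
void csr_gf2_mul(const csr_binary *h, const unsigned char *x,
                 unsigned char *y)
{
    for (size_t i = 0; i < h->rows; i++) {
        unsigned char acc = 0;
        for (size_t k = h->row_ptr[i]; k < h->row_ptr[i + 1]; k++)
            acc ^= x[h->col_idx[k]];
        y[i] = acc;
    }
}
```

For a sparse H with a few ones per row, this costs memory and work proportional to the number of ones rather than to the full m × n dimensions.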
[Figure 9.2 diagram: H Matrix → Preprocessor (SW) → ALT H Matrix →
Encoder (HW); Message Blocks feed the Encoder, which outputs Codewords.]
Figure 9.2: LDPC encoding framework.
The basic framework of our encoder is shown in Figure 9.2. Our approach
for LDPC encoding consists of two steps: preprocessing and hardware encoding.
First, the original parity-check matrix H is preprocessed with the RU algorithm
to generate the appropriate look-up tables consisting of the six matrices needed
by the hardware encoder. These matrices are generated from the RU algorithm
and contain information on how the input message blocks are encoded to generate
codewords. This preprocessing step is implemented in software and needs to be
performed once for a given H matrix. The hardware encoder itself is implemented
on an FPGA and uses the look-up tables (ALT H matrix) generated from the
preprocessing step to encode the message blocks. Note that the preprocessing
step does not involve any data. Hence, during a normal encoding operation only
the hardware encoder is needed. Although our implementation is based on H
matrices that are binary, GF(2), it can be extended to matrices that belong to
higher order fields.
The RU algorithm and the hardware architecture proposed in this chapter
make no restrictions on the actual H matrix. This flexibility allows our hardware
Figure 9.3: An equivalent parity-check matrix in lower triangular form. Note that n = block length and m = block length × (1 − code rate).
architecture to be used in any application involving LDPC codes. Different appli-
cations require different H matrices. Applications requiring low latency typically
use shorter block lengths (less than 1000 bits), while applications requiring op-
eration near the channel capacity require longer block lengths (more than 10000
bits). Code rate r also influences the dimensions of the H matrix. Low code
rates offer more error protection at the expense of information throughput and
are often used when the SNR is very low (e.g. deep space communications). The
dimensions of the H matrix are (block length×(1−code rate)) by (block length)
as illustrated in Figure 9.3. Our hardware architecture is completely flexible in
regards to block length and code rate.
Another issue related to encoder flexibility is the specific location of ones in
the H matrix. Properly designed regular LDPC codes have performance that
continues to improve as the SNR is increased. Irregular LDPC codes do not have
this property; they have a so-called ‘error floor’, meaning that after a certain level
of performance is reached, the performance stops improving. For example, say a
code operates at a BER of 10^−5 at 2dB SNR. If an error floor exists, the BER will be the same when the SNR is increased to 5dB. If an error floor were not present, then the BER would improve to, say, 10^−6. While regular LDPC codes have no error floor, they do not perform as close to capacity as irregular codes. This
means that as the SNR is increased, the BER will decrease faster with irregular
codes than with regular codes. An ideal code performs close to capacity and
contains no error floor. We have designed high-performance LDPC codes in [174]
using special code construction techniques, which perform close to capacity and
have reduced error floors. Our hardware is completely flexible in regards to the location of ones in the H matrix; in other words, it can encode any LDPC code.
9.3 Preprocessing
In preprocessing, row and column permutations are performed to bring the H
matrix into an ALT form. Richardson and Urbanke [152] introduced three greedy algorithms, a, b and c, to perform this task. We choose greedy algorithm a for our software preprocessor due to its simplicity; the three algorithms are discussed in detail at the end of this section.
Preprocessing consists of two steps: triangulation and rank checking. Tri-
angulation is the process of row and column permutations that produces an H
matrix similar to the one shown in Figure 9.1, with the smallest gap g possible.
Multiplying

\begin{bmatrix} I & 0 \\ -ET^{-1} & I \end{bmatrix} \quad (9.1)
Table 9.1: Computation of p_1^T = −F^{−1}(−ET^{−1}A + C)s^T. Note that T^{−1}[As^T] = y^T ⇔ Ty^T = As^T.

index   operation                        comment                                 complexity
1       As^T                             multiplication by sparse matrix         O(n)
2       T^{−1}[As^T]                     forward-substitution by sparse matrix   O(n)
3       −E[T^{−1}As^T]                   multiplication by sparse matrix         O(n)
4       Cs^T                             multiplication by sparse matrix         O(n)
5       [−ET^{−1}As^T] + [Cs^T]          vector addition                         O(n)
6       −F^{−1}[−ET^{−1}As^T + Cs^T]     multiplication by dense g × g matrix    O(g^2)

Table 9.2: Computation of p_2^T = −T^{−1}(As^T + Bp_1^T).

index   operation                  comment                                 complexity
7       As^T                       multiplication by sparse matrix         O(n)
8       Bp_1^T                     multiplication by sparse matrix         O(n)
9       [As^T] + [Bp_1^T]          vector addition                         O(n)
10      −T^{−1}[As^T + Bp_1^T]     forward-substitution by sparse matrix   O(n)
from the left of Hx^T = 0, where

H = \begin{bmatrix} A & B & T \\ C & D & E \end{bmatrix},

we get

\begin{bmatrix} A & B & T \\ -ET^{-1}A + C & -ET^{-1}B + D & 0 \end{bmatrix}
\begin{bmatrix} s^T \\ p_1^T \\ p_2^T \end{bmatrix} =
\begin{bmatrix} 0 \\ 0 \end{bmatrix} \quad (9.2)
This gives two equations in the two unknowns p_1 and p_2. Define F = −ET^{−1}B + D and assume for the moment that F is nonsingular. Solving for p_1 and p_2 yields

p_1^T = −F^{−1}(−ET^{−1}A + C)s^T \quad (9.3)

and

p_2^T = −T^{−1}(As^T + Bp_1^T). \quad (9.4)
From Table 9.1 and Table 9.2 we can see that the complexity of the operations required to obtain p_1 and p_2 is mostly linear; the only exception is the dense matrix multiplication −F^{−1}(−ET^{−1}A + C)s^T, which has complexity O(g^2). Since the gap g is small, we have achieved near linear encoding complexity.
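As a sanity check, the operations of Tables 9.1 and 9.2 can be exercised on a toy ALT-form matrix. The sketch below is our own illustration (the matrix values and function names are assumed, not taken from the thesis); over GF(2) the minus signs in (9.3) and (9.4) vanish, and the resulting codeword must satisfy Hx^T = 0:

```python
import numpy as np

# Toy ALT-form blocks over GF(2) with n = 6, m = 3, g = 1 (values are ours).
A = np.array([[1, 0, 1], [0, 1, 1]]); B = np.array([[1], [0]])
T = np.array([[1, 0], [1, 1]])                    # unit lower triangular
C = np.array([[1, 0, 1]]); D = np.array([[0]]); E = np.array([[0, 1]])

def fwd_sub(T, y):
    """Solve T z = y over GF(2) by forward-substitution."""
    z = np.zeros_like(y)
    for i in range(len(y)):
        z[i] = (y[i] + T[i, :i] @ z[:i]) % 2
    return z

# Apply T^{-1} to a matrix column by column via forward-substitution.
Tinv = lambda M: np.column_stack([fwd_sub(T, M[:, j]) for j in range(M.shape[1])])

s = np.array([1, 0, 1])
F = (E @ Tinv(B) + D) % 2                         # F = E T^{-1} B + D (mod 2)
assert F[0, 0] == 1                               # nonsingular 1x1, so F^{-1} = F
p1 = (F @ ((E @ Tinv(A) + C) @ s)) % 2            # Eq. (9.3), signs vanish mod 2
p2 = fwd_sub(T, (A @ s + B @ p1) % 2)             # Eq. (9.4)
x = np.concatenate([s, p1, p2])
H = np.block([[A, B, T], [C, D, E]])
assert not (H @ x % 2).any()                      # x is a valid codeword
```

Only sparse multiplications, one forward-substitution per solve, and a single dense (here 1 × 1) inversion are needed, mirroring the operation schedule of the tables.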
Since the equation for p1 depends on the inverse of F , the method will only
work when F is nonsingular (invertible). This requires additional rank checking
before the actual encoding. To obtain F , we perform Gaussian elimination on
the original H to bring it into the form

\begin{bmatrix} A & B & T \\ -ET^{-1}A + C & -ET^{-1}B + D & 0 \end{bmatrix} \quad (9.5)
If F is singular, we swap columns of F with columns to the left of F and keep
doing this until F becomes nonsingular.
So far, we have shown the encoding complexity to be linear, except for the
dense g×g matrix multiplication, where g is the gap of the preprocessed H matrix
(Figure 9.1). Thus for efficiency, we should make g as small as possible.
The greedy algorithm a is used to find the best possible lower triangularization
of the parity matrix H. The algorithm begins by assigning Q = HT . Then the
following steps are applied:
1. Find a vector of indices to degree one rows in Q and call this vector α. If
α is empty, remove the left most column of Q and repeat step 1 with the
modified Q matrix. Let l equal the length of the vector α.
2. Modify Q so that the degree one rows indicated by the elements of α are
moved to the top of Q (row numbers 1 to l).
3. Reorder the columns of the modified Q so that the rows that were moved
to the top of the matrix form a diagonal. This step is known as diagonal
extension.
4. Modify Q again by removing the first l rows and columns of modified Q.
5. Find a vector of indices to degree one rows in modified Q and call this
vector α. If α is empty, the algorithm terminates. Otherwise, go to step 2.
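The five steps above can be sketched in software. The toy model below is our own construction (not the thesis preprocessor): it tracks only the resulting gap g and omits the row/column permutation bookkeeping that a real preprocessor must record:

```python
import numpy as np

def greedy_a_gap(H):
    """Sketch of greedy algorithm a on Q = H^T: peel degree-one rows; every
    column dropped because none exist (step 1) adds one to the gap g."""
    Q = H.T.copy()
    gap = 0
    while Q.size:
        deg1 = np.where(Q.sum(axis=1) == 1)[0]     # steps 1/5: degree-one rows
        if len(deg1) == 0:
            Q = Q[:, 1:]                           # remove the left-most column
            gap += 1
            continue
        rows, cols, seen = [], [], set()
        for r in deg1:                             # steps 2-3: rows whose single
            c = int(np.argmax(Q[r]))               # ones sit in distinct columns
            if c not in seen:                      # form the next diagonal block
                seen.add(c); rows.append(r); cols.append(c)
        Q = np.delete(np.delete(Q, rows, axis=0), cols, axis=1)  # step 4
    return gap

# A lower triangular H peels completely (gap 0); an all-ones H forces one drop.
assert greedy_a_gap(np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1]])) == 0
assert greedy_a_gap(np.ones((2, 3), dtype=int)) == 1
```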
To summarize the above algorithm, through row and column operations a
small identity matrix is created at the top left corner of the main matrix, and
then the rows and columns participating in the identity matrix are deleted. This
is repeated until there are no more degree one rows in the remaining matrix.
Thus, through row and column swapping, we have produced the H matrix in
Figure 9.1.
The gap size is equal to the number of columns removed in step 1. Note
that the gap size is further reduced by applying the greedy algorithm a to HT
rather than H itself. This is due to the different starting columns when applying
step 1. Since H is not a square matrix, we can triangularize at most as many columns as there are rows. As a result, if we were to carry out step 1 on a ‘fat’ H matrix, we could only look at part of the columns. For
example, for a rate 1/2 H matrix, in order to achieve the result in Figure 9.1,
we can only start with the middle column and work with the right half of the
matrix to search for degree one rows. On the other hand, if we apply step 1 to a
‘skinny’ HT matrix, we can start with the very first column since the number of
columns now is much less than the number of rows. This enables us to look at the
entire matrix when searching for degree one rows. This extra degree of freedom
results in better triangulation. The two different approaches are illustrated and
compared in Figure 9.4.

Figure 9.4: Different starting columns for H and H^T.
In addition to the greedy algorithm a, Richardson and Urbanke also intro-
duced greedy algorithm b and greedy algorithm c. In algorithm b, rather than
choosing the starting columns (starting point of triangulation) independently of
one another, they are chosen based on the weights of the rows with which they
are connected. This may reduce gaps but also requires more complicated process-
ing. Greedy algorithm c is built upon algorithm b with a looser constraint on the
weight distributions of rows and hence the definition of the starting columns with
which they are connected. Both b and c offer slightly smaller gaps in some cases;
however, we have chosen a since it offers satisfactory results in all cases we have
examined. Triangulation can be time-consuming with large block sizes. Since
this step only needs to be performed once for a given H matrix, such overhead
is tolerable.
9.4 Encoder Architecture
The hardware encoder computes the two parity parts p1 and p2 according to the
operations described in Table 9.1 and Table 9.2. Operations that can be executed
Figure 9.5: Overview of our hardware encoder architecture. Double buffering is used between the stages for concurrent execution. Grey and white boxes indicate RAMs and operations, respectively.
in parallel are identified and are scheduled to maximize parallelism. An overview
of our hardware encoder architecture is shown in Figure 9.5. The operations are
grouped into four stages, and double buffering is used between the stages so that they can execute concurrently. Each stage generates a ‘finish’
signal once its computation is completed. Once all stages are completed (which is
when a codeword is generated), a ‘start’ signal is sent from the stage controller to
each stage for the next execution. The stages have been carefully partitioned to
balance the workloads between the stages, while minimizing the overall latency,
idle times and buffering requirements. This flexible architecture supports any
rate and block length, but has been specifically optimized for rate 1/2 codes.
The aim of dividing the encoding process into different stages is to balance
the execution times among the stages, so that the idle time of any of the stages is
minimized. Given that the rate is 1/2, the gap is small and the edges (ones) of the
H matrix are distributed in a random manner, the matrix A will contain nearly
half of the edges of the entire preprocessed H matrix. Also, since the matrix T is
lower-triangular, the number of its edges will be around half that of A. Therefore, the computation As^T (operation 1) will take the longest. This is because the number of clock cycles is proportional to the number of edges, as will be clarified
later. Since the gap is small, operations involving B, C, E and F will be very
fast.
In Stage 1, we simply write the message block to buffers. Since the message
block length is n − m, this stage will take n − m clock cycles. In Stage 2, we
perform operations 1 and 4 in parallel (the operations are listed in Table 9.1 and
Table 9.2). We do not do any other operations in this stage, since subsequent
operations are dependent on the result of operation 1 and operation 1 takes
the most time. In Stage 3, we perform all the remaining operations needed to
compute p1 as well as operations 8 and 9. In Stage 4, we perform operation 10 and
codeword generation. This segmentation into four stages balances the workload
across stages well for rate 1/2. In principle, we could parallelize some of the
matrix-vector multiplications and forward-substitutions to get higher throughput.
However, parallelizing those operations would involve duplicating the look-up
tables (since dual-port RAM is the best we can get from current FPGAs), which
would require significantly more area. Moreover, we can simply replicate many
instances of the encoder on the same chip to process several message blocks in
parallel without increasing RAM area needed for the look-up tables of the six
matrices. These look-up tables can be shared among the encoder instances.
Depending on the channel conditions, codes with different rates perform bet-
ter than others. For instance, when the SNR is low, lower rate codes are more appropriate. Therefore one could implement an adaptive LDPC encoder, which
changes rate or block length depending on the channel conditions. Of course the
LDPC encoder would have to be synchronized with an adaptive LDPC decoder.
Although the architecture shown in Figure 9.5 could be used for different rates, it
is optimized for rate 1/2 codes. Codes with different rates differ in the dimensions
of the H matrix, leading to different edge ratios for the six matrices. Therefore, different scheduling of the operations is needed for different rates to maximize
concurrency. Bit files of different designs optimized for different rates can be
stored in memory, and run-time reconfiguration of FPGAs can be exploited to
reconfigure the adaptive LDPC encoder at run-time for different channel con-
ditions. Since reconfiguration can be performed in a matter of milliseconds on
modern FPGAs, such adaptive LDPC encoders/decoders are viable options.
The main operations performed in the encoder are matrix-vector multiplica-
tion (MVM), forward-substitution (FS), vector addition (VA) and codeword gen-
eration (CWG). Codeword generation involves first constructing an intermediate
codeword by writing (s, p1, p2) into a memory. Then according to the permuta-
tion table, which contains the information on the row permutations performed
during the preprocessing step, the intermediate codeword is rearranged to gener-
ate the final codeword which is then valid with regards to the original H matrix.
The hardware architectures for vector addition, matrix-vector multiplication and
forward-substitution are described in the next section. Since we are dealing with
a binary system, multiplications can be performed with an AND gate and additions
with an XOR gate.
Figure 9.6: Circuit for vector addition (VA).
9.5 Components for the Encoder
9.5.1 Vector Addition
This involves the computation of X + Y = Z, where X, Y and Z are vectors,
and Z is what we are trying to compute. Since we are dealing with a binary
system, vector addition can be simply achieved by performing XOR operations on
the corresponding elements of the two vectors. The circuit for vector addition is
shown in Figure 9.6. The index calculator increments the index every clock cycle.
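In code, the unit reduces to an element-wise XOR; the two-line model below (our illustration) makes the GF(2) equivalence explicit:

```python
# Binary vector addition: X + Y over GF(2) equals element-wise XOR.
X, Y = [1, 0, 1, 1], [0, 0, 1, 0]
Z = [x ^ y for x, y in zip(X, Y)]                  # one element per clock cycle
assert Z == [(x + y) % 2 for x, y in zip(X, Y)]    # same as mod-2 addition
```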
9.5.2 Matrix-Vector Multiplication
This involves the computation of XY = Z, where X is a matrix, Y and Z
are vectors, and Z is what we are trying to compute. We shall illustrate our
approach with an example. Consider the multiplication of a 5 × 6 matrix X
by a vector Y to obtain a resulting vector Z. In this case, X is known from
the preprocessing step and is sparse. It would be inefficient to store this matrix
directly in a memory, since most of the locations will be zeroes. Instead, the
Table 9.3: Matrix X stored in memory. The location of the edges of each row and an extra bit indicating the end of a row are stored.

address   0  1  2  3  4  5  6  7  8
data      3  5  1  2  4  6  0  3  4
end row   0  1  1  0  0  1  1  0  1
location of the edges (ones) of each row is stored, with an extra bit indicating
the end of a row. For example, if
X = \begin{bmatrix}
0 & 0 & 1 & 0 & 1 & 0 \\
1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 1 \\
0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 & 0
\end{bmatrix} \quad (9.6)
it would be stored in memory as shown in Table 9.3. Memory address 6 is a special case: a data value of 0 with the end-of-row bit set indicates that the fourth row of matrix X has no edges.
The locations of the edges of a row in X are used as bit selectors for the vector
Y . This bit selecting process has the same effect as performing AND operations
with the bits of a row in X and the bits in vector Y . XOR is performed on the
selected bits to calculate the resulting bits for Z. This operation is performed
for each row of X starting from the first one. Figure 9.7 shows our matrix-vector
multiplication circuit. The Z index calculator calculates the location of the Z vector to be written. The index is simply incremented every time there is an end
of a row. It can be seen that the number of clock cycles required to compute Z
is directly proportional to the number of edges in X.
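A software model of this storage scheme and circuit (our sketch; the function names are assumed) reproduces the memory contents of Table 9.3 from the matrix X of (9.6), then multiplies by bit-selecting Y and XOR-accumulating:

```python
import numpy as np

# Matrix X from Eq. (9.6); stored as edge positions + end-of-row bits (Table 9.3).
X = np.array([[0, 0, 1, 0, 1, 0],
              [1, 0, 0, 0, 0, 0],
              [0, 1, 0, 1, 0, 1],
              [0, 0, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 0]])

def to_edge_memory(X):
    """Encode X as 1-based edge positions; data 0 marks an empty row."""
    data, end_row = [], []
    for row in X:
        edges = list(np.flatnonzero(row) + 1) or [0]
        for k, e in enumerate(edges):
            data.append(int(e))
            end_row.append(int(k == len(edges) - 1))
    return data, end_row

def mvm(data, end_row, Y, n_rows):
    """Model of the MVM circuit: bit-select Y by edge position, XOR-accumulate."""
    Z, acc, r = np.zeros(n_rows, dtype=int), 0, 0
    for d, e in zip(data, end_row):
        if d:
            acc ^= Y[d - 1]
        if e:                                      # end of row: write Z, next row
            Z[r], acc, r = acc, 0, r + 1
    return Z

data, end_row = to_edge_memory(X)
Y = np.array([1, 0, 1, 1, 0, 1])
assert data == [3, 5, 1, 2, 4, 6, 0, 3, 4]         # matches Table 9.3
assert end_row == [0, 1, 1, 0, 0, 1, 1, 0, 1]
assert np.array_equal(mvm(data, end_row, Y, 5), X @ Y % 2)
```

The loop visits one stored edge per iteration, so its running time, like the circuit's cycle count, is proportional to the number of edges in X.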
Figure 9.7: Circuit for matrix-vector multiplication (MVM).
9.5.3 Forward-Substitution
Consider the equation XZ = Y , where X is a lower-triangular matrix, Y and Z
are vectors and Z is the vector we want to compute. X is given by
X = \begin{bmatrix}
1 & 0 & \cdots & \cdots & \cdots & 0 \\
x_{(2,1)} & 1 & 0 & \cdots & \cdots & 0 \\
x_{(3,1)} & x_{(3,2)} & 1 & 0 & \cdots & 0 \\
x_{(4,1)} & x_{(4,2)} & x_{(4,3)} & 1 & \ddots & \vdots \\
\vdots & \vdots & & & \ddots & 0 \\
x_{(n,1)} & x_{(n,2)} & \cdots & \cdots & x_{(n,n-1)} & 1
\end{bmatrix}
One way to approach this problem is to take the inverse of X and compute
Z = X−1Y . However, matrix inversion is a complex procedure and requires a
significant amount of processing time. Moreover, after inversion, X will be no
longer sparse. A better way is to use forward-substitution exploiting the fact that
X is lower triangular. The elements of the vector Z can be computed with the
following set of equations:

z_1 = y_1
z_2 = y_2 ⊕ x_{(2,1)}z_1
z_3 = y_3 ⊕ x_{(3,1)}z_1 ⊕ x_{(3,2)}z_2
...
z_n = y_n ⊕ x_{(n,1)}z_1 ⊕ x_{(n,2)}z_2 ⊕ · · · ⊕ x_{(n,n−1)}z_{n−1}

This can be generalized as:

z_i = y_i ⊕ \bigoplus_{j=1}^{i-1} x_{(i,j)}z_j,  1 ≤ i ≤ n \quad (9.7)
Just like the matrix-vector multiplication, to compute an element in Z, we need
elements from X and Y. However, we also require the elements of Z that have already been computed. Therefore, the circuit for forward-substitution is
similar to the one in Figure 9.7 with slight modifications as shown in Figure 9.8.
The index calculator computes the memory location of Y to be read and Z to be
written. From (9.7), these two addresses are identical. As in the matrix-vector
multiplication case, the index calculator is incremented every time there is an end
of a row and the clock cycles for the computation is proportional to the number
of edges in X.
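The FS circuit can be modelled like the MVM one, except that the edge selector reads back previously computed bits of Z. The sketch below is ours (the edge-list convention for the sub-diagonal ones of X is an assumption for illustration):

```python
def forward_substitution(data, end_row, Y):
    """Solve X Z = Y over GF(2); X is unit lower triangular, stored as an edge
    list of its sub-diagonal ones (data 0 marks a row with none)."""
    Z, acc, i = [], 0, 0
    for d, e in zip(data, end_row):
        if d:
            acc ^= Z[d - 1]            # select an already-computed bit of Z
        if e:                          # end of row: z_i = y_i XOR acc, Eq. (9.7)
            Z.append(Y[i] ^ acc)
            acc, i = 0, i + 1
    return Z

# X = [[1,0,0],[1,1,0],[0,1,1]]: sub-diagonal edges x_(2,1) and x_(3,2).
data, end_row = [0, 1, 2], [1, 1, 1]   # data 0 marks the empty first row
assert forward_substitution(data, end_row, [1, 1, 0]) == [1, 0, 0]
```

As in the MVM case, one stored edge is consumed per step, so the cycle count is proportional to the number of edges in X.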
9.6 Implementation and Results
The preprocessor has been implemented using MATLAB. Preprocessing times
for H matrices with rate 1/2 for various block lengths on a Pentium 4 2.4GHz
PC are shown in Table 9.4. A MATLAB tool we have developed that constructs high-performance irregular LDPC codes with low error floors [174] is used to generate the H matrices.
Figure 9.8: Circuit for forward-substitution (FS).
We can see that the preprocessing times for large block lengths can be long.
However, we are not too concerned about this since preprocessing needs to be
performed only once for any given H matrix. We also observe that the gap
remains small even for large block lengths. The primary reason for these small
gaps is the large number of rows in HT whose degrees are less than three in their
degree distributions. Low degree rows in HT lead to high probabilities of finding
degree one rows in the diagonal extension step of the greedy algorithm.
A scatter plot of a preprocessed irregular 500 × 1000 H matrix (i.e. block
length of 1000 bits and rate 1/2) is shown in Figure 9.9. The diagonal ones
of the matrix T can be clearly seen. Also, as expected since the gap is small (g = 2 in this case), the preprocessed H matrix consists mainly of A and T. The
blocky artifacts next to the diagonal of T are created by the diagonal extension
step of the greedy algorithm, during which an identity matrix is formed in every
Table 9.4: Preprocessing times and gaps for H matrices with rate 1/2 for various block lengths, performed on a Pentium 4 2.4GHz PC equipped with 512MB DDR-SDRAM.

block length   preprocessing time [s]   gap
500                               3       2
1000                             14       2
2000                             83       2
4000                            587       2
8000                           3124       2
iteration. In Table 9.5, we show the number of edges for the six matrices for a
preprocessed 1000 × 2000 irregular H matrix. We observe that the matrices A,
B and T contain most of the edges, indicating that operations involving them
will dominate the encoding times.
The actual hardware encoder has been implemented using Xilinx System Gen-
erator and is heavily pipelined for maximum throughput. The codewords gener-
ated from our hardware encoder have been verified against our MATLAB model
for correctness. The four stage architecture design in Xilinx System Generator is depicted in Figure 9.10. Stage 2 and the stage controller are shown in detail in
Figure 9.11. The MVM and FS circuits are shown in Figure 9.12 and Figure 9.13.
Let e(A) denote the number of edges for the matrix A, and c(S1) denote the
number of clock cycles taken by Stage 1 (see Figure 9.5). The number of clock
Figure 9.9: Scatter plot of a preprocessed irregular 500 × 1000 H matrix in ALT form with a gap of two. Ones appear as dots.
cycles taken by each stage is given by
c(S1) = n − m
c(S2) = max(e(A), e(C))
c(S3) = e(T) + e(E) + (n − m) + e(F) + e(B) + (m − g)
c(S4) = e(T) + 2((n − m) + g + (m − g)).
The number of clock cycles per codeword (CPC) is determined by the stage that
takes the longest, i.e.
CPC = max[c(S1), c(S2), c(S3), c(S4)].
For a given clock speed, the number of codewords per second (CPS) is given by
CPS = clock speed /CPC.
Therefore the codeword throughput (bits per second) of the encoder is
codeword bits throughput = CPS× block size
Figure 9.10: The four stage LDPC encoder architecture in Xilinx System Generator. Each stage contains multiple subsystems performing MVM, FS, VA or CWG.
Figure 9.11: LDPC encoder architecture Stage 2 and stage controller in Xilinx
System Generator.
Figure 9.12: The matrix-vector multiplication (MVM) circuit in Xilinx System Generator.
Figure 9.13: The forward-substitution (FS) circuit in Xilinx System Generator.
Table 9.5: Dimensions and number of edges for the matrices A, B, T, C, F and E generated from a 1000 × 2000 irregular H matrix.

matrix   dimension    edges
A        998 × 1000    6273
B        998 × 2        998
T        998 × 998     2398
C        2 × 1000        10
F        2 × 2            2
E        2 × 998          6
and the information throughput is given by
information bits throughput = codeword throughput× rate.
The latency of the encoder is the time taken for the four stages to fill up. This
is given by
latency = (4× CPC) / clock speed.
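Plugging the figures reported in this section for the rate 1/2, block length 2000 design (143MHz clock, CPC = 6398) into these formulas reproduces the quoted throughput and latency; the variable names below are ours:

```python
# Performance model for the rate 1/2, n = 2000 encoder (143 MHz, CPC = 6398).
clock_hz, cpc, n, rate = 143e6, 6398, 2000, 0.5
cps = clock_hz / cpc                         # codewords per second
codeword_mbps = cps * n / 1e6                # ~44.7 Mbps codeword throughput
info_mbps = codeword_mbps * rate             # ~22.4 Mbps information throughput
latency_ms = 4 * cpc / clock_hz * 1e3        # four stages to fill -> ~0.179 ms
assert round(codeword_mbps, 1) == 44.7 and round(latency_ms, 3) == 0.179
```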
An encoder for block length of 2000 bits and rate 1/2 has been synthesized on a
Xilinx Virtex-II XC2V4000-6 device. The design takes up 870 slices and 19 block
RAMs, which uses approximately 4% of the device. The clock cycles taken by
each of the four stages of this design are
c(S1) = 1000
c(S2) = 6273
c(S3) = 5402
c(S4) = 6398.
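These counts can be cross-checked against the edge counts of Table 9.5 (n = 2000, m = 1000, g = 2), reading the dense-matrix term in c(S3) as e(F) = 2; the sketch below is our own verification:

```python
# Stage cycle counts from the c(S*) formulas and the Table 9.5 edge counts.
n, m, g = 2000, 1000, 2
e = {'A': 6273, 'B': 998, 'T': 2398, 'C': 10, 'F': 2, 'E': 6}
c1 = n - m
c2 = max(e['A'], e['C'])
c3 = e['T'] + e['E'] + (n - m) + e['F'] + e['B'] + (m - g)
c4 = e['T'] + 2 * ((n - m) + g + (m - g))
assert (c1, c2, c3, c4) == (1000, 6273, 5402, 6398)
```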
We observe that the workloads across the stages are well balanced, which is the
case with our architecture for all rate 1/2 codes. The design is capable of running
at 143MHz with a CPC of 6398 cycles, resulting in a codeword throughput of 45Mbps (million bits per second) and a latency of 0.179ms. This throughput is sufficient to cover most applications, including wireless
networking and optical-link deep-space communications. Implementation results
for various encoders with block lengths ranging from 500 to 8000 bits for rate
1/2 codes are shown in Table 9.6. We see an increase in resources and latency
with block length due to the increase in size of the H matrix. This increase in
resources leads to reductions in clock speed and throughput due to routing delays.
Distributed RAMs and block RAMs have been allocated carefully to minimize the waste of the 18Kb Virtex-II block RAMs. In Table 9.7, we show
how the performance varies with different rates with a fixed block length of 2000
bits. The encoder is optimized for rate 1/2 codes (by the partitioning of the operations shown in Figure 9.5); therefore we see some performance loss for other rates.
Multiple instances of the encoder can be implemented on the same device to
encode multiple message blocks in parallel. Note that RAMs for the six matrices
describing the preprocessed H matrix can be shared among the encoders. This is
because the six matrices index the operands and are read sequentially. Synthesis results are shown in Table 9.8 for multiple instances
of an encoder with block length of 2000 bits and rate 1/2. The design with 16
instances consumes 73% of the device and is capable of a codeword throughput
of 410Mbps. Figure 9.14 shows how the number of encoder instances affects the
codeword throughput. The dotted line shows the linear relationship between the
output rate and the number of instances, if the clock speed does not deteriorate
with the increasing number of instances. While ideally the throughput would
scale linearly with the number of encoder instances, in practice the output rate
Table 9.6: Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for rate 1/2 for various block lengths.

block length   edges   slices   block RAMs   speed [MHz]   throughput [Mbps]   latency [ms]
500             2418      562           12           161                  50          0.040
1000            4859      682           13           152                  48          0.084
2000            9687      870           19           143                  45          0.179
4000           19452     1340           27           127                  40          0.405
8000           38905     2148           49           110                  34          0.937
grows slower than expected, because the clock speed of the design deteriorates as the number of encoder instances increases. This deterioration is probably due
to the increase in routing delays. Note that multiple FPGAs could be used to
speed up the encoding even further. For instance, an implementation of three
Xilinx Virtex-II XC2V4000-6 devices would be capable of a codeword throughput
of 1.2Gbps for block length of 2000 bits and rate 1/2 codes.
Our hardware implementation of the encoder for block length of 2000 bits and rate 1/2 has been compared to software implementations. The software
implementations are written in C and compiled with Microsoft Visual C++ 6.0.
The results are shown in Table 9.9. It can be seen that our hardware designs are
faster than software implementations by 10–300 times, depending on the device
used and the resource utilization.
Regarding the feasibility of an adaptive LDPC encoder, the XC2V4000-6
FPGA has 15 million configuration bits [187]. The configuration bits can be
Table 9.7: Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for block length of 2000 bits for various rates.

rate   edges   slices   block RAMs   speed [MHz]   throughput [Mbps]   latency [ms]
1/3     8896     1109           19           127                  34          0.232
1/2     9687      870           19           143                  44          0.179
2/3     9513     1065           18           125                  33          0.235
fed to the device with eight bits in parallel at 50MHz, which is 400Mbps. So the
entire device can be configured in around 35ms (smaller devices would take less
time). If an adaptive LDPC encoder reconfigures itself every few seconds or tens
of seconds, the overhead of the reconfiguration time would still be acceptable if
the adapted encoder improves throughput and minimizes retransmission time.
9.7 Summary
We have described a hardware design of an efficient LDPC encoder based on
the RU method. Whereas a straightforward implementation of an encoder has
complexity quadratic in the block length, the RU method admits linear time
encoding through careful linear manipulation of the parity matrix for both regular
and irregular LDPC codes.
A preprocessor is written to optimize the parity-check matrix through the row
and column permutations, generating the look-up tables and parameters needed
by the hardware encoder. An efficient architecture for storing and performing
Table 9.8: Hardware synthesis results on a Xilinx Virtex-II XC2V4000-6 FPGA for block length of 2000 bits and rate 1/2 for different numbers of encoder instances.

instances   slices   block RAMs   speed [MHz]   throughput [Mbps]   latency [ms]
1              870           19           143                  44          0.179
4             3547           36            90                 112          0.284
8             6978           60            89                 222          0.288
12           12702           83            86                 322          0.298
16           16906          107            82                 410          0.312
computations on sparse matrices has been discussed. The encoding steps have
been scheduled into different stages optimizing concurrency while reducing idle
times. Run-time reconfiguration of FPGAs can be used to load different designs
optimized for various rates at run-time for an adaptive LDPC encoder.
Implementation results for encoders of various block lengths and rates have
been presented. An encoder for block length of 2000 bits and rate 1/2 takes
up 4% of resources on a Xilinx Virtex-II XC2V4000-6 device. It is capable of
running at 143MHz resulting in a codeword throughput of 45Mbps and latency
of 0.179ms. The performance can be improved by mapping several instances of
the encoder onto the same chip to encode multiple message blocks concurrently.
An implementation of 16 instances of the encoder on the same device at 82MHz
is capable of 410 million codeword bits per second, 80 times faster than an Intel
Pentium 4 2.4GHz PC. The LDPC encoder architecture we have proposed in this
Figure 9.14: Variation of codeword throughput (Mbps) with the number of encoder instances.
chapter has been chosen by JPL as a candidate for their future space missions.
Table 9.9: Performance comparison of block length of 2000 bits and rate 1/2 encoders: time for producing 410 million codeword bits.

platform                                    speed [MHz]   time [s]
XC2V4000-6 FPGA, 16 encoder instances                82          1
XC2V4000-6 FPGA, 1 encoder instance                 143          9
Intel Pentium 4 PC, 512MB DDR-SDRAM                2400         80
Intel Pentium-III PC, 256MB SDR-SDRAM               700        312
CHAPTER 10
Conclusions
10.1 Summary
Three main topics have been presented in this thesis: function evaluation, Gaus-
sian noise generation and LDPC encoding.
In Chapter 3 [95], we have presented a methodology for the automation of
function evaluation unit design, covering table look-up, table-with-polynomial
and polynomial-only methods. An implementation of a partially automated sys-
tem for design space exploration of function evaluation in hardware has been
demonstrated, including algorithmic design space exploration with MATLAB and
hardware design space exploration with ASC, A Stream Compiler, for FPGAs.
Method selection results for sin(x), log(1 + x) and 2^x have been shown. We have
concluded that the automation of function evaluation unit design is within reach,
even though there are many remaining issues for further study.
In Chapter 4 [83], [84], a framework for adaptive range reduction has been presented, based on a parametric function evaluation library, on function approximation by polynomials and tables, and on pre-computing all possible input/output ranges. We have demonstrated an implementation of design space exploration
for adaptive range reduction, using MATLAB for producing function evalua-
tion parameters for hardware designs targeting the ASC system. The proposed
approach has been evaluated by exploring various effects of range reduction of several arithmetic functions such as sin(x), log(x) and √x on throughput,
latency and area for FPGA designs. For a given function, its input/output
range/precision, and an optimization metric, we automate the decision about
whether range reduction helps to optimize the metric by pre-computing a large
library of function evaluation generators. Given the evaluation method, we auto-
mate the decision about which bitwidths and number of polynomial terms to use
by constructing the function evaluation generators via MATLAB simulation and
computation. In addition, we show the productivity which we obtain from com-
bining MATLAB with ASC, exploring over 40 million Xilinx equivalent circuit
gates in a relatively short amount of time.
In Chapter 5 [88], [90], [91], we have presented a novel method for evaluating
functions using piecewise polynomial approximations with an efficient hierarchical
segmentation scheme. Our method is illustrated using four non-linear compound
functions: √(−log(x)), x log(x), a high-order rational function, and cos(πx/2). An
algorithm that finds the optimum segments for a given function, input range,
maximum error and ulp (unit in the last place) has been presented. The four
hierarchical schemes P2S(US), P2SL(US), P2SR(US) and US(US) deal with the
frequently occurring non-linearities of functions. A simple cascade of AND and
OR gates can be used to rapidly calculate the P2S address for a given input.
Results show the advantages of using our hierarchical approach over the tradi-
tional uniform approach. We have also explored the effects of different polynomial
degrees on our hierarchical segmentation method. Compared to other popular
methods, our approach has longer latency and more operators, but the size of
the look-up tables and thus the total area are considerably smaller.
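The intuition behind the P2S addressing can be sketched in software: when segment sizes vary by powers of two, the segment address is essentially the position of the input's leading one, which the cascade of AND and OR gates computes in hardware. The layout below (segments shrinking towards zero, where functions such as √(−log(x)) vary fastest) is a simplified behavioural model, not the thesis's exact segmentation.

```python
def p2s_address(x, bits):
    """Segment address for an unsigned 'bits'-bit input under an
    illustrative powers-of-two segmentation: segment k covers
    [2**(bits-1-k), 2**(bits-k)), with x == 0 mapped to the last segment.
    In hardware the same index falls out of a leading-one detector."""
    for k in range(bits):
        if x & (1 << (bits - 1 - k)):
            return k
    return bits  # x == 0: innermost (smallest) segment
```

Each address then selects the polynomial coefficients for that segment from a look-up table.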
In Chapters 6 and 7, we have presented two hardware Gaussian noise gener-
ators designed to facilitate Monte Carlo simulations implemented in hardware,
which involve very large numbers of samples. The first design [86], [89] is based
on the Box-Muller method and the central limit theorem. This approach involves
the computation of two functions: √(−ln(x)) and cos(2πx). A key aspect of the
design is the use of non-uniform piecewise linear approximations [87] for comput-
ing trigonometric and logarithmic functions, with the boundaries between each
approximation chosen carefully to enable rapid computation of coefficients from
the inputs. The noise generator design occupies approximately 10% of a Xilinx
Virtex-II XC2V4000-6 FPGA and 90% of a Xilinx Spartan-IIE XC2S300E-7, and
can produce 133 million samples per second. The performance can be improved
by exploiting parallelism: an XC2V4000-6 FPGA with nine parallel instances of
the noise generator at 105MHz can run 50 times faster than a 2.6GHz Pentium
4 PC. This noise generator is currently being used for exploring LDPC code be-
havior at UCLA and JPL (Jet Propulsion Laboratory, NASA), and Monte Carlo
simulations of financial models at the Chinese University of Hong Kong.
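A software model of the Box-Muller step underlying this design is given below, using the standard Box-Muller constants (including the factor of 2 in the radical). In the hardware, the square root/logarithm and trigonometric functions are replaced by the non-uniform piecewise linear approximations described above; here they are exact library calls.

```python
import math

def box_muller_pair(u1, u2):
    """One Box-Muller step: uniforms u1 in (0, 1], u2 in [0, 1) become two
    independent N(0, 1) samples. Software sketch of the transform the
    hardware generator approximates."""
    r = math.sqrt(-2.0 * math.log(u1))
    return r * math.cos(2.0 * math.pi * u2), r * math.sin(2.0 * math.pi * u2)
```

In the hardware design, samples from several such transforms are further accumulated, invoking the central limit theorem to smooth residual approximation error.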
The second noise generator [82], [94] is based on the Wallace method, a
fast algorithm for generating normally distributed pseudo-random numbers which
generates the target distribution directly using its maximal-entropy properties.
The Wallace method takes a pool of normally distributed random numbers and,
through transformation steps, generates a new pool of normally distributed
random numbers. The noise generator design occupies
approximately 3% of a Xilinx Virtex-II XC2V4000-6 FPGA and half of a Xilinx
Spartan-3 XC3S200E-5, and can produce 155 million samples per second. An
XC2V4000-6 FPGA with 16 parallel instances of the noise generator at 115MHz
can run 98 times faster than a 2.6GHz Pentium 4 PC. The two noise generators are
used as a key component in hardware simulation systems, including the exploration
of LDPC code behavior at very low BERs in the range of 10^−9 to 10^−10, and
financial modeling [14], [192]. For both noise generators, statistical tests,
including the χ2 test and the A-D test, as well as application in LDPC decoding,
have been used to confirm the quality of the noise samples. The output of the
noise generators accurately models a true Gaussian PDF even at very high σ values.
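The core of a Wallace transformation pass can be sketched as follows. This simplified model applies an orthogonal, Hadamard-like transform (scaled by 1/2) to randomly addressed groups of four pool values; orthogonality maps normally distributed inputs to normally distributed outputs. The full method's sum-of-squares correction is omitted here, so this is an illustrative sketch rather than the thesis's hardware algorithm.

```python
import random

def wallace_step(pool, rng):
    """One simplified Wallace-style pass over a pool whose length is a
    multiple of 4. Each group of four values is mapped through the 4x4
    Hadamard matrix scaled by 1/2, which is orthogonal and therefore
    preserves the Gaussian distribution of the pool."""
    n = len(pool)
    idx = list(range(n))
    rng.shuffle(idx)  # random addressing, standing in for the pool permutation
    out = [0.0] * n
    for g in range(0, n, 4):
        a, b, c, d = (pool[idx[g + j]] for j in range(4))
        out[g] = 0.5 * (a + b + c + d)
        out[g + 1] = 0.5 * (a - b + c - d)
        out[g + 2] = 0.5 * (a + b - c - d)
        out[g + 3] = 0.5 * (a - b - c + d)
    return out
```

Because the transform is orthogonal, the pool's sum of squares is preserved exactly from pass to pass, which is what makes fixed normalization possible in hardware.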
In Chapter 8 [92], we have explored the impact of parameter choice on noise
quality of the Wallace method. Using tests designed specifically to identify the
presence of correlations due to the use of previous outputs in generating new
outputs, we have identified specific combinations of pool size, transform size,
and retention factor that deliver high quality noise output at high speeds (one
example is pool size = 4096, transform size = 16, and retention factor =
1). Detailed performance tradeoff studies have been conducted for AMD Athlon
XP and Intel Pentium 4 based platforms. Performance comparisons with other
software Gaussian random number generators have been carried out, demonstrat-
ing that given a careful choice of parameters, the Wallace method is a serious
competitor due to its speed advantages.
In Chapter 9 [93], we have described a hardware design of an efficient LDPC
encoder based on the RU method. A preprocessor optimizes the
parity-check matrix through row and column permutations, generating the
look-up tables and parameters needed by the hardware encoder. An efficient
architecture for storing and performing computations on sparse matrices has been
discussed. Implementation results for encoders of various block lengths and rates
have been presented. An encoder for a block length of 2000 bits and rate 1/2 takes
up 4% of the resources on a Xilinx Virtex-II XC2V4000-6 device. It is capable of
running at 143MHz, resulting in a codeword throughput of 45Mbps and a latency
of 0.179ms. The performance can be improved by mapping several instances of
the encoder onto the same chip to encode multiple message blocks concurrently.
An implementation of 16 instances of the encoder on the same device at 82MHz
is capable of 410 million codeword bits per second, 80 times faster than an Intel
Pentium 4 2.4GHz PC. Due to the increasing demand for high-speed deep space
communications, our LDPC encoder architecture has been chosen by JPL as a
candidate for NASA’s future space missions.
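The central computation in such an encoder is arithmetic on sparse matrices over GF(2). The sketch below shows this in software, storing each row only as the column indices of its 1-entries; the storage scheme is generic and illustrative, not necessarily the thesis's exact hardware layout.

```python
def gf2_spmv(rows, x):
    """Sparse GF(2) matrix-vector product. Each row is a list of column
    indices of its non-zero entries; a dot product over GF(2) is simply
    the XOR of the selected bits of x."""
    y = []
    for cols in rows:
        bit = 0
        for c in cols:
            bit ^= x[c]  # addition over GF(2) is XOR
        y.append(bit)
    return y
```

In hardware, each XOR chain collapses into a tree of XOR gates, and only the index lists, not the full matrix, need to be stored.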
10.2 Future Work
10.2.1 Function Evaluation
For the evaluation of elementary functions, we want to implement other elementary
functions and explore other evaluation methods such as rational approximation
and symmetric table addition methods. We also hope to utilize embedded RAMs
and multipliers available in modern FPGAs. Our designs will be optimized fur-
ther by employing non-uniform bitwidth minimization techniques such as Bit-
Size [47]. The final objective is to progress towards a fully automated library
that provides optimal function evaluation hardware units given input/output
range and precision.
One of the major problems we face with this objective is the fact that
we cannot verify a given approximation for all possible inputs. For instance,
if the input is 24 bits, the output errors for all 2^24 possible inputs need to
be computed to ensure correctness for every output. This can take days even
on the fastest PCs available today. Because of this performance bottleneck,
the present implementation takes a set of random samples from the input
domain. Hence ideally, we need to move towards a framework where the library
construction itself (e.g. calculating coefficients and minimizing bitwidths, see
Figure 4.1 in Chapter 4) is done in hardware. This would enable us to create a fully
automated/accurate library of all the elementary functions and approximation
methods of interest, and test for all possible input values. This would involve the
generation of a comprehensive matrix of precision/range for various combinations
of metrics based on the structure shown in Figures 4.10 and 4.11.
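The exhaustive verification loop in question is simple to state in software; the difficulty is purely its running time at 24-bit inputs. A sketch follows, with illustrative function and parameter names.

```python
import math

def max_ulp_error(approx, reference, in_bits, frac_bits):
    """Exhaustively evaluate an approximation against a reference over
    every 'in_bits'-bit fixed-point input in [0, 1), returning the worst
    output error in ulps, where one ulp is 2**-frac_bits. For in_bits = 24
    this is the loop the thesis would rather run in hardware."""
    ulp = 2.0 ** -frac_bits
    worst = 0.0
    for i in range(1 << in_bits):
        x = i * 2.0 ** -in_bits
        err = abs(approx(x) - reference(x)) / ulp
        worst = max(worst, err)
    return worst
```

A faithfully rounded unit must report a worst-case error of at most 1 ulp over this loop; random sampling can only bound the error probabilistically.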
In the work presented in Chapters 3 and 4, when a design is optimized for a
given metric (area, latency or throughput), ASC gives the best possible result
for that metric. But in many situations, the
user may want to specify a combination of metrics. For instance, when designing
a modulator for a mobile phone, the designer may want to set a constraint on
the maximum latency that can be tolerated, while meeting a certain throughput
and area requirement. Moreover, power consumption is a major factor in many
modern mobile devices, hence adding power optimization to ASC would also be
useful. We are also planning to explore the impact on power consumption [184]
across different bitwidths, methods and functions.
There are various extensions we want to make for the hierarchical segmen-
tation method (HSM) presented in Chapter 5. Many functions such as belief
propagation in LDPC decoding [74] involve two input variables [139], hence we
want to extend HSM to cover multivariate functions. The current implemen-
tation of HSM employs fixed-point arithmetic. However, it would be desirable
to support floating-point as well to address operations that have large dynamic
ranges. The bitwidths of various operations in the data paths have been minimized
by hand; however, this process is very time consuming and perhaps far
from optimal. Bitwidth minimization techniques such as those presented in [29]
and [47] are highly desirable. Also, we hope to explore how HSM can be used
to speed up addition and subtraction functions in logarithmic number systems
(LNS) [26] which are highly non-linear.
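LNS addition illustrates the kind of non-linearity HSM would need to approximate: for positive operands stored as base-2 logarithms, log2(a + b) reduces to the larger logarithm plus the highly non-linear function F(r) = log2(1 + 2^r) of the (non-positive) difference r. The sketch below uses exact library calls where a hardware LNS unit would use an approximation of F.

```python
import math

def lns_add(la, lb):
    """Addition in a base-2 logarithmic number system: given la = log2(a)
    and lb = log2(b) with a, b > 0, return log2(a + b). The term
    log2(1 + 2**r), r <= 0, is the non-linear function that would be
    approximated in hardware (software sketch only)."""
    hi, lo = max(la, lb), min(la, lb)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))
```

Subtraction uses the analogous function log2(1 − 2^r), which is even harder to approximate near r = 0.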
For all function evaluation units described in Chapters 3, 4 and 5, Horner’s
rule is used to reduce the number of operations in the polynomial. However,
more sophisticated methods exist which can reduce the number of operations
even further, such as those described by Knuth in [78]. We are planning to
investigate how these methods can be mapped efficiently into hardware. The
function evaluation units perform faithful rounding (accurate to 1 ulp, rounded
to the nearest or next nearest), however certain applications may require exact
rounding (accurate to 0.5 ulp, rounded to the nearest) [161]. We are investigating
how exact rounding can be achieved for our evaluation units, which would involve
using the right bitwidths for the operators in the data paths.
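Horner's rule itself is straightforward: a degree-n polynomial is evaluated with n multiplications and n additions by nesting the terms, which in hardware maps onto a chain of multiply-add stages.

```python
def horner(coeffs, x):
    """Evaluate c0 + c1*x + ... + cn*x**n as
    (...((cn*x + c(n-1))*x + ...)*x + c0),
    using n multiplies and n adds. coeffs are ordered from the
    constant term upward."""
    acc = 0.0
    for c in reversed(coeffs):
        acc = acc * x + c
    return acc
```

The more sophisticated schemes mentioned above (e.g. Knuth's adaptation of coefficients) trade preprocessing of the coefficients for fewer run-time multiplications, at the cost of a less regular datapath.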
10.2.2 Gaussian Noise Generation
In Chapter 8, we have identified a set of design parameters for the Wallace method
to reduce correlations. We are planning to modify the Wallace hardware
architecture presented in Chapter 7 with this new set of parameters. This would
mainly involve additional addition/subtraction and memory requirements (discussed
in Section 8.6 in Chapter 8).
The statistical tests for the noise generators, including the χ2 test and the
A-D test, have been carried out in software using a hardware emulation model in
C. Hence, we are only able to test up to around 10^10 noise samples due to lack
of computational power. Ideally, these tests ought to be performed in hardware,
which would enable us to verify the noise samples for even larger numbers of
samples.
Recently, we have come across the inversion method [65], which uses the
inverse Gaussian CDF and uniform random samples to pick points on the CDF.
This approach requires the approximation of the inverse Gaussian CDF, which
is highly non-linear, but could be dealt with by a floating-point implementation of
HSM. This method has the advantage of having to approximate just one function
to generate a Gaussian random variate, and we are looking at implementing it
with the aid of a floating-point implementation of HSM.
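In software the inversion method is a single function application; the hardware challenge lies entirely in approximating the inverse CDF. The sketch below uses Python's statistics module as a reference inverse CDF in place of the HSM approximation.

```python
from statistics import NormalDist

def gaussian_by_inversion(u):
    """Inversion method: map a uniform sample u in (0, 1) through the
    inverse Gaussian CDF. NormalDist.inv_cdf stands in for the
    floating-point HSM approximation a hardware unit would use."""
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(u)
```

Because a single monotonic function is approximated, the tail accuracy of the generated samples is limited only by the accuracy of the approximation near u = 0 and u = 1, where the inverse CDF is steepest.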
Finally, we want to further refine our noise generator architectures for
various applications, for instance those involving different channels, such as
magnetic disc channels [22] and other communication channels [148] including
Rayleigh [36], Ricean and Nakagami-m [191], all of which are based on Gaussian
noise.
10.2.3 LDPC Coding
The four-stage architecture of the LDPC encoder presented in Chapter 9 is cur-
rently optimized for rate 1/2 codes. It would be desirable to develop a set of
architectures optimized for different code rates, which would result in maximum
throughput and minimum latency for the given rate.
The current LDPC decoder implementation [74] developed by our colleagues
at UCLA is still at a preliminary stage and has a throughput of several hundred
kilobits per second due to its serial nature. We have now identified interesting
decoder architectures that would lead to a more parallel and scalable design. We
hope to implement this new, improved design in the near future, which should
reach a throughput of several tens of megabits per second.
Finally, using our current LDPC encoder/decoder architecture, we want to
implement an adaptive LDPC codec. This would involve supporting different
H matrices at run-time and adaptively choosing the appropriate H matrix de-
pending on the channel conditions, such as the SNR. Adaptive architectures for
Viterbi [169] and Turbo codes [103] have been proposed in the literature, but not for
LDPC codes.
References
[1] Advanced Micro Devices Inc. AMD Athlon processor technical brief, 1999. Document number 22054.
[2] J.H. Ahrens and U. Dieter. An alias method for sampling from the normal distribution. Computing, 42(2-3):159–170, 1989.
[3] R. Andraka. A survey of CORDIC algorithms for FPGA based computers. In Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 191–200, 1998.
[4] D.R. Barr and N.L. Sezak. A comparison of multivariate normal generators. Communications of the ACM, 15(12):1048–1049, 1972.
[5] N.C. Beaulieu and C.C. Tan. An FFT method for generating bandlimited Gaussian noise variates. In Proceedings of IEEE Global Communications Conference, pages 684–688, 1997.
[6] A.R. Bergstrom. Gaussian estimation of mixed-order continuous-time dynamic models with unobservable stochastic trends from mixed stock and flow data. Econometric Theory, 13(4):467–505, 1997.
[7] C. Berrou, A. Glavieux, and P. Thitimajshima. Near Shannon limit error-correcting coding and decoding: Turbo-codes. In Proceedings of IEEE Conference on Communications, pages 1064–1070, 1993.
[8] V. Bhagavatula, H. Song, and J. Liu. Low-density parity-check (LDPC) codes for optical data storage. In Proceedings of IEEE International Symposium on Optical Memory and Optical Data Storage Topical Meeting, pages 371–373, 2002.
[9] T. Bhatt, K. Narayanan, and N. Kehtarnavaz. Fixed-point DSP implementation of low-density parity check codes. In Proceedings of IEEE DSP Workshop, 2000.
[10] A.J. Blanksby and C.J. Howland. A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check code decoder. IEEE Journal of Solid-State Circuits, 37(3):404–412, 2002.
[11] M. Bossert. Channel Coding for Telecommunications. John Wiley & Sons, 1999.
[12] E. Boutillon, J.L. Danger, and A. Gazel. Design of high speed AWGN communication channel emulator. Analog Integrated Circuits and Signal Processing, 34(2):133–142, 2003.
[13] G.E.P. Box and M.E. Muller. A note on the generation of random normal deviates. Annals of Mathematical Statistics, 29:610–611, 1958.
[14] A. Brace, D. Gatarek, and M. Musiela. The market model of interest rate dynamics. Mathematical Finance, 7(2):127–155, 1997.
[15] D.D. Braess. Chebyshev approximation by spline functions with free knots. Numerische Mathematik, 17:357–366, 1971.
[16] R.P. Brent. A fast vectorised implementation of Wallace’s normal random number generator. ANU Computer Science Technical Report TR-CS-97-07, The Australian National University, 1997.
[17] R.P. Brent. Some comments on C.S. Wallace’s random number generators. The Computer Journal, 2003. To appear.
[18] A. Cantoni. Optimal curve fitting with piecewise linear functions. IEEE Transactions on Computers, C-20(1):59–67, 1971.
[19] J. Cao, B.W.Y. Wei, and J. Cheng. High-performance architectures for elementary function generation. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 136–144, 2001.
[20] J. Cavallaro and M. Vaya. VITURBO: A reconfigurable architecture for Viterbi and Turbo decoding. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 497–500, 2003.
[21] Celoxica Limited. Handel-C language reference manual v3.1, 2002. http://www.celoxica.com.
[22] J. Chen, J. Moon, and K. Bazargan. Reconfigurable readback-signal generator based on a field-programmable gate array. IEEE Transactions on Magnetics, 40(3):1744–1750, 2004.
[23] P.L. Chu. Fast Gaussian noise generator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(10):1593–1597, 1989.
[24] P.P. Chu and R.E. Jones. Design techniques of FPGA based random number generator. In Proceedings of Military and Aerospace Applications of Programmable Devices and Technology Conference, 1999.
[25] W.J. Cody and W. Waite. Software Manual for the Elementary Functions. Prentice Hall, 1980.
[26] J.N. Coleman, E. Chester, C.I. Softley, and J. Kadlec. Arithmetic on the European logarithmic microprocessor. IEEE Transactions on Computers, 49(7):702–715, 2000.
[27] M. Combet, H. Van Zonneveld, and L. Verbeek. Computation of the base two logarithm of binary numbers. IEEE Transactions on Electronic Computers, EC-14(6):863–867, 1965.
[28] K. Compton and S. Hauck. Reconfigurable computing: a survey of systems and software. ACM Computing Surveys, 34(2):171–210, 2002.
[29] G.A. Constantinides, P.Y.K. Cheung, and W. Luk. Wordlength optimization for linear digital signal processing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22(10):1432–1442, 2003.
[30] D.J. Costello, J. Hagenauer, H. Imai, and S.B. Wicker. Applications of error control and coding. IEEE Transactions on Information Theory, 44(6):2531–2560, 1998.
[31] C. Cousineau, F. Laperle, and Y. Savaria. Design of a JTAG based run time reconfigurable system. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 21–23, 1999.
[32] R.B. D’Agostino and M.A. Stephens. Goodness-of-Fit Techniques. Marcel Dekker Inc., 1986.
[33] J.L. Danger, A. Ghazel, E. Boutillon, and H. Laamari. Efficient FPGA implementation of Gaussian noise generator for communication channel emulation. In Proceedings of IEEE International Conference on Electronics, Circuits, and Systems, volume 1, pages 366–369, 2000.
[34] F. de Dinechin and A. Tisserand. Some improvements on multipartite table methods. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 128–135, 2001.
[35] D. Defour, P. Kornerup, J. Muller, and N. Revol. A new range reduction algorithm. In Proceedings of Asilomar Conference on Circuits, Systems, and Computers, volume 2, pages 1656–1660, 2001.
[36] D. Derrien and E. Boutillon. Quality measurement of a colored Gaussian noise generator hardware implementation based on statistical properties. In Proceedings of IEEE International Symposium on Signal Processing and Information Technology, 2002.
[37] R.O. Duda, D.G. Stork, and P.E. Hart. Pattern Classification and Scene Analysis: Pattern Classification. John Wiley & Sons, 2000.
[38] J. Duprat and J.M. Muller. The CORDIC algorithm: new results for fast VLSI implementation. IEEE Transactions on Computers, 42:168–178, 1993.
[39] J.J. Eggers, J.K. Su, and B. Girod. Robustness of a blind image watermarking scheme. In Proceedings of IEEE International Conference on Image Processing, volume 3, pages 17–20, 2000.
[40] M.D. Ercegovac. A general hardware-oriented method for evaluation of functions and computations in a digital computer. IEEE Transactions on Computers, 26(7):667–680, 1977.
[41] M.D. Ercegovac and T. Lang. Division and Square Root: Digit-Recurrence Algorithms and Implementations. Kluwer Academic Publishers, 1994.
[42] R.E. Esch and W.L. Eastman. Computational methods for best spline approximation. Journal of Approximation Theory, 2:85–96, 1969.
[43] Y. Fan, Z. Zilic, and M.W. Chiang. A versatile high speed bit error rate testing scheme. In Proceedings of IEEE International Symposium on Quality Electronic Design, pages 395–400, 2004.
[44] FastMath: software faster than a coprocessor. C User’s Journal, 9(7):12, 1991.
[45] Flarion Technologies Inc. Vector-low-density parity-check coding solution data sheet, 2002. http://www.flarion.com.
[46] M.J. Flynn and S.F. Oberman. Advanced Computer Arithmetic Design. John Wiley & Sons, 2001.
[47] A. Abdul Gaffar, O. Mencer, W. Luk, and P.Y.K. Cheung. Unifying bit-width optimisation for fixed-point and floating-point designs. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 79–88, 2004.
[48] R.G. Gallager. Low-density parity-check codes. IEEE Transactions on Information Theory, 8:21–28, 1962.
[49] R.G. Gallager. Low-Density Parity-Check Codes. MIT Press, 1963.
[50] J. Garcia-Frias and W. Zhong. Approaching Shannon performance by iterative decoding of linear codes with low-density generator matrix. IEEE Communications Letters, 7:266–268, 2003.
[51] C.W. Gardiner. Handbook of Stochastic Methods. Springer-Verlag, 1990.
[52] A.V. Geramita and J. Seberry. Orthogonal Designs: Quadratic Forms and Hadamard Matrices. Marcel Dekker Inc., 1979.
[53] A. Ghazel, E. Boutillon, J.L. Danger, G. Gulak, and H. Laamari. Design and performance analysis of a high speed AWGN communication channel emulator. In Proceedings of IEEE Pacific Rim Conference on Communications, Computers, and Signal Processing, volume 2, pages 374–377, 2001.
[54] GNU Project. gcc 3.2 Manual, 2003. http://gcc.gnu.org.
[55] D. Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys, 23(1):5–48, 1991.
[56] N. Golshan. A novel digital implementation of a Gaussian noise generator. In Proceedings of IEEE Instrumentation and Measurement Technology Conference, pages 256–257, 1989.
[57] B.D. Hart and D.P. Taylor. On the irreducible error floor in fast fading channels. IEEE Transactions on Vehicular Technology, 49(3):1044–1047, 2000.
[58] J.F. Hart. Computer Approximations. John Wiley & Sons, 1968.
[59] J.W. Hauser and C.N. Purdy. Approximating functions for embedded and ASIC applications. In Proceedings of IEEE Midwest Symposium on Circuits and Systems, pages 478–481, 2001.
[60] H. Hemmati. Overview of laser communication research at JPL. In Proceedings of SPIE The Search for Extraterrestrial Intelligence in the Optical Spectrum III, volume 4273, 2001.
[61] H. Henkel. Improved addition for the logarithmic number system. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(2):301–303, 1989.
[62] J.L. Hennessy, D.A. Patterson, and D. Goldberg. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, third edition, 2002.
[63] C.H. Ho, K.H. Tsoi, H.C. Yeung, Y.M. Lam, K.H. Lee, P.H.W. Leong, R. Ludewig, P. Zipf, A.G. Ortiz, and M. Glesner. Arbitrary function approximation in HDLs. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 110–117, 2003.
[64] S. Hong and W.E. Stark. Design and implementation of a low complexity VLSI Turbo-code decoder architecture for low energy mobile wireless communications. Journal of VLSI Signal Processing, pages 2350–2354, 2000.
[65] W. Hormann and J. Leydold. Continuous random variate generation by fast numerical inversion. ACM Transactions on Modeling and Computer Simulation, 13(4):347–362, 2003.
[66] C.J. Howland and A.J. Blanksby. A 220mW 1 Gb/s 1024-bit rate-1/2 low density parity check code decoder. In Proceedings of IEEE Custom Integrated Circuits Conference, pages 293–296, 2001.
[67] C.J. Howland and A.J. Blanksby. Parallel decoding architectures for low density parity check codes. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 4, pages 742–745, 2001.
[68] Intel Corp. Intel Pentium 4 processor with 512-KB L2 cache on 0.13 micron process and Intel Pentium 4 processor extreme edition supporting Hyper-Threading datasheet, 2004. Document number 298643-012.
[69] F.C. Ionescu. Theory and practice of a fully controllable white noise generator. In Proceedings of IEEE International Semiconductor Conference, volume 2, pages 319–322, 1996.
[70] V.K. Jain, S.A. Wadecar, and L. Lin. A universal nonlinear component and its application to WSI. IEEE Transactions on Components, Hybrids and Manufacturing Technology, 16(7):656–664, 1993.
[71] Jet Propulsion Laboratory. Basics of Space Flight, 2004. http://www2.jpl.nasa.gov/basics.
[72] J. Jiang, W. Luk, and D. Rueckert. FPGA-based computation of free-form deformations in medical image registration. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 234–241, 2003.
[73] S.J. Johnson and S.R. Weller. A family of irregular LDPC codes with low encoding complexity. IEEE Communications Letters, 7(2):79–81, 2003.
[74] C. Jones, E. Valles, M. Smith, and J.D. Villasenor. Approximate-min* constraint node updating for LDPC code decoding. In Proceedings of IEEE Military Communications Conference, volume 1, pages 157–162, 2003.
[75] J.N. Mitchell Jr. Computer multiplication and division using binary logarithms. IRE Transactions on Electronic Computers, EC-11:512–517, 1962.
[76] B. Jung, H. Lenhof, P. Muller, and C. Rub. Langevin dynamics simulations of macromolecules on parallel computers. Macromolecular Theory and Simulations, pages 507–521, 1997.
[77] K. Chadha and J. Cavallaro. A reconfigurable Viterbi decoder architecture. In Proceedings of Asilomar Conference on Circuits, Systems, and Computers, pages 66–71, 2001.
[78] D.E. Knuth. Seminumerical Algorithms, volume 2 of The Art of Computer Programming. Addison-Wesley, third edition, 1997.
[79] I. Koren and O. Zinaty. Evaluating elementary functions in a numerical coprocessor based on rational approximations. IEEE Transactions on Computers, 39(8):1030–1037, 1990.
[80] R.E. Ladner and M.J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831–838, 1980.
[81] C.L. Lawson. Characteristic properties of the segmented rational minimax approximation problem. Numerische Mathematik, 6:293–301, 1964.
[82] D. Lee. Gaussian noise generation for Monte Carlo simulations in hardware. In Proceedings of The Korean Scientists and Engineers Association in the UK 30th Anniversary Conference, pages 182–185, 2004.
[83] D. Lee, A. Abdul Gaffar, O. Mencer, and W. Luk. Adaptive range reduction for hardware function evaluation. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 169–176, 2004.
[84] D. Lee, A. Abdul Gaffar, O. Mencer, and W. Luk. Automating optimized hardware function evaluation. IEEE Transactions on Computers, 2004. Submitted.
[85] D. Lee, W. Luk, and P.Y.K. Cheung. Incremental programming for reconfigurable engines. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 411–415, 2002.
[86] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. A hardware Gaussian noise generator for channel code evaluation. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 69–78, 2003.
[87] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. Hardware function evaluation using non-linear segments. In Proceedings of International Conference on Field-Programmable Logic and its Applications, LNCS 2778, pages 796–807. Springer-Verlag, 2003.
[88] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. Hierarchical segmentation schemes for function evaluation. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 92–99, 2003.
[89] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. A Gaussian noise generator for hardware-based simulations. IEEE Transactions on Computers, 53(12):1523–1534, 2004.
[90] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. The effects of polynomial degrees on the hierarchical segmentation method. In W. Rosenstiel and P. Lysaght, editors, New Algorithms, Architectures, and Applications for Reconfigurable Computing. Kluwer Academic Publishers, 2004.
[91] D. Lee, W. Luk, J.D. Villasenor, and P.Y.K. Cheung. The hierarchical segmentation method for function evaluation. IEEE Transactions on Circuits and Systems I, 2004. Submitted.
[92] D. Lee, W. Luk, J.D. Villasenor, and P.H.W. Leong. Design parameter optimization for the Wallace Gaussian random number generator. ACM Transactions on Modeling and Computer Simulation, 2004. Submitted.
[93] D. Lee, W. Luk, C. Wang, C. Jones, M. Smith, and J.D. Villasenor. A flexible hardware encoder for low-density parity-check codes. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, 2004.
[94] D. Lee, W. Luk, G. Zhang, P.H.W. Leong, and J.D. Villasenor. A hardware Gaussian noise generator using the Wallace method. IEEE Transactions on VLSI, 2004. Submitted.
[95] D. Lee, O. Mencer, D.J. Pearce, and W. Luk. Automating optimized table-with-polynomial function evaluation for FPGAs. In Proceedings of International Conference on Field-Programmable Logic and its Applications, LNCS 3203, pages 364–373. Springer-Verlag, 2004.
[96] T.K. Lee, S. Yusuf, W. Luk, M. Sloman, E. Lupu, and N. Dulay. Compiling policy descriptions into reconfigurable firewall processors. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 39–48, 2003.
[97] V. Lefevre and J.M. Muller. On-the-fly range reduction. Journal of VLSI Signal Processing, 33:31–35, 2003.
[98] J.L. Leva. A fast normal random number generator. ACM Transactions on Mathematical Software, 18(4):449–453, 1992.
[99] B. Levine, R.R. Taylor, and H. Schmit. Implementation of near Shannon limit error-correcting codes using reconfigurable hardware. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 217–226, 2000.
[100] D.M. Lewis. Interleaved memory function interpolators with application to an accurate LNS arithmetic unit. IEEE Transactions on Computers, 43(8):974–982, 1994.
[101] J. Leydold. Automatic sampling with the ratio-of-uniforms method. ACM Transactions on Mathematical Software, 26(1):78–98, 2000.
[102] R.C. Li, S. Boldo, and M. Daumas. Theorems on efficient argument reductions. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 129–136, 2003.
[103] J. Liang, R. Tessier, and D. Goeckel. A dynamically-reconfigurable, power-efficient Turbo decoder. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, 2004.
[104] J. Liang, R. Tessier, and O. Mencer. Floating point unit generation and evaluation for FPGAs. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 185–194, 2003.
[105] M. Luby, M. Mitzenmacher, A. Shokrollahi, and D. Spielman. Analysis of low density codes and improved designs using irregular graphs. In Proceedings of the ACM Symposium on the Theory of Computing, pages 249–258, 1998.
[106] M. Luby, M. Mitzenmacher, A. Shokrollahi, and D. Spielman. Improved low-density parity-check codes using irregular graphs and belief propagation. In Proceedings of IEEE Symposium on Information Theory, page 117, 1998.
[107] M. Luby, M. Mitzenmacher, A. Shokrollahi, and D. Spielman. Improvedlow-density parity-check codes using irregular graphs. IEEE Transactionson Information Theory, 47:585–598, 2001.
[108] M. Luby, M. Mitzenmacher, A. Shokrollahi, D. Spielman, and V. Stemann.Practical loss-resilient codes. In Proceedings of the ACM Symposium on theTheory of Computing, pages 150–159, 1997.
[109] J.N. Lygouras, B.G. Mertzios, and N.C. Voulgaris. Design and constructionof a microcomputer controlled light-weight robot arm. In Proceedings of theIEEE International Workshop on Intelligent Motion Control, pages 551–555, 1990.
[110] D.J.C MacKay. Good error-correcting codes based on very sparse matrices.IEEE Transactions on Information Theory, 45:399–431, 1999.
[111] D.J.C MacKay, S. Wilson, and M. Davey. Comparison of constructions ofirregular Gallager codes. IEEE Transactions on Communications, 47:1449–1454, 1999.
[112] A. Madisetti, A.Y. Kwentus, and A.N. Willson. A 100-MHz, 16-b, directdigital frequency synthesizer with a 100-dBc spurious-free dynamic range.IEEE Journal of Solid-State Circuits, 34(8):1034–1042, 1999.
[113] G. Marsaglia. Diehard: a battery of tests of randomness, 1997. http:
//stat.fsu.edu/∼geo/diehard.html.
[114] G. Marsaglia, M.D. MacLaren, and T.A. Bray. A fast procedure for gen-erating normal random variables. Communications of the ACM, 7(1):4–10,1964.
[115] G. Marsaglia and W.W. Tsang. The Ziggurat method for generating ran-dom variables. Journal of Statistical Software, 5(8):1–7, 2000.
[116] G. Masera, G. Piccinini, M. Ruo Roch, and M. Zamboni. VLSI architec-tures for Turbo codes. IEEE Transactions on VLSI, 7(3):369–379, 1999.
[117] The MathWorks Inc. MATLAB Manual v6.5, 2002. http://www.
mathworks.com.
[118] C. Maxfield. The Design Warrior’s Guide to FPGAs. Newnes, 2004.
[119] M. McKee. Mars laser will beam super-fast data. New Scientist, Sep 2004.http://www.newscientist.com/news/news.jsp?id=ns99996409.
[120] G. Mehta and H. Lee. An FPGA implementation of the graph encoder-decoder for regular LDPC codes. CRL Technical Report 8-4-2002-1, Communications Research Laboratory, University of Pittsburgh, 2002.
[121] O. Mencer. PAM-Blox II: design and evaluation of C++ module generation for computing with FPGAs. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 67–76, 2002.
[122] O. Mencer and W. Luk. Parameterized high throughput function evaluation for FPGAs. Journal of VLSI Signal Processing, 36(1):17–25, 2004.
[123] O. Mencer, D.J. Pearce, L.W. Howes, and W. Luk. Design space exploration with A Stream Compiler. In Proceedings of IEEE International Conference on Field-Programmable Technology, pages 270–277, 2003.
[124] G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill, 1994.
[125] A. Miller and M. Gulotta. PN generators using the SRL macro. Xilinx Application Note XAPP211, 2001.
[126] R.H. Morelos-Zaragoza. The Art of Error Correcting Coding. John Wiley & Sons, 2002.
[127] G.S. Muller and C.K. Pauw. On the generation of a smooth Gaussian random variable to 5 standard deviations. In Proceedings of IEEE Southern African Conference on Communications and Signal Processing, pages 62–66, 1988.
[128] J.M. Muller. Elementary Functions: Algorithms and Implementation. Birkhäuser Verlag AG, 1997.
[129] J.M. Muller. A few results on table-based methods. Reliable Computing, 5(3):279–288, 1999.
[130] M.E. Muller. A comparison of methods for generating normal deviates on digital computers. Journal of the ACM, 6(3):376–383, 1959.
[131] Nallatech. BenONE User Guide, 2002. http://www.nallatech.com.
[132] T. Oenning and J. Moon. Low density parity check coding for magnetic recording channels with media noise. In Proceedings of IEEE Conference on Communications, volume 7, pages 2189–2193, 2001.
[133] E.P. O'Grady and C.H. Wang. Performance limitations in parallel processor simulations. Transactions of the Society for Computer Simulation, 4:311–330, 1987.
[134] I. Page and W. Luk. Compiling Occam into FPGAs. In FPGAs. Abingdon EE&CS Books, 1991.
[135] K. Page and E.M. Chau. A FPGA ASIC communication channel systems emulator. In Proceedings of IEEE ASIC Conference, pages 345–348, 1993.
[136] B. Pandita and S.K. Roy. Design and implementation of a Viterbi decoder using FPGAs. In Proceedings of IEEE International Conference on VLSI Design, pages 611–614, 1999.
[137] B. Patrice, R. Didier, and V. Jean. Programmable active memories: a performance assessment. In Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 1992.
[138] T. Pavlidis. Waveform segmentation through functional approximation. IEEE Transactions on Computers, C-22(7):689–697, 1973.
[139] T. Pavlidis. Optimal piecewise polynomial L2 approximation of functions of one and two variables. IEEE Transactions on Computers, C-24:98–102, 1975.
[140] T. Pavlidis. The use of algorithms of piecewise approximations for picture processing applications. ACM Transactions on Mathematical Software, 2(4):305–321, 1976.
[141] T. Pavlidis and S.L. Horowitz. Segmentation of plane curves. IEEE Transactions on Computers, C-23:860–870, 1974.
[142] T. Pavlidis and A.P. Maika. Uniform piecewise polynomial approximation with variable joints. Journal of Approximation Theory, 12:61–69, 1974.
[143] W.H. Payne. Normal random numbers: using machine analysis to choose the best algorithm. ACM Transactions on Mathematical Software, 3(4):346–358, 1977.
[144] C.S. Petrie and J.A. Connelly. The sampling of noise for random number generation. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 6, pages 26–29, 1999.
[145] S.S. Pietrobon. Implementation and performance of a Turbo/MAP decoder. International Journal of Satellite Communications, 16:23–46, 1998.
[146] J.A. Pineiro, J.D. Bruguera, and J.M. Muller. A Turbo/MAP decoder for use in satellite circuits. In IEEE International Conference on Information and Communications Security, volume 1, pages 427–431, 1997.
[147] J.A. Pineiro, J.D. Bruguera, and J.M. Muller. Faithful powering computation using table look-up and a fused accumulation tree. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 40–47, 2001.
[148] J. Proakis. Digital Communications. McGraw-Hill, fourth edition, 2000.
[149] E. Remez. Sur un procédé convergent d'approximations successives pour déterminer les polynômes d'approximation. C.R. Académie des Sciences, Paris, (198), 1934.
[150] J.R. Rice. The Approximation of Functions, volume 2. Addison-Wesley, 1969.
[151] T. Richardson, A. Shokrollahi, and R. Urbanke. Design of provably good low-density parity check codes. In IEEE International Symposium on Information Theory, pages 25–30, 2000.
[152] T. Richardson and R. Urbanke. Efficient encoding of low-density parity-check codes. IEEE Transactions on Information Theory, 47:638–656, 2001.
[153] RightMark Gathering. RightMark Memory Analyzer 3.4, 2004. http://www.rightmark.org.
[154] B.D. Ripley. Stochastic Simulation. John Wiley & Sons, 1987.
[155] S. Rocchi and V. Vignoli. A chaotic CMOS true-random analog/digital white noise generator. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 5, pages 463–466, 1999.
[156] C. Rose. A statistical identity linking folded and censored distributions. Journal of Economic Dynamics and Control, 19(8):1391–1403, 1995.
[157] C. Rüb. On Wallace's method for the generation of normal variates. MPI Informatik Research Report MPI-I-98-1-020, Max-Planck-Institut für Informatik, Germany, 1998.
[158] D. Rueckert, L.I. Sonoda, C. Hayes, D.L. Hill, M.O. Leach, and D.J. Hawkes. Nonrigid registration using free-form deformations: application to breast MR images. IEEE Transactions on Medical Imaging, 18(8):712–720, 1999.
[159] D. Das Sarma and D.W. Matula. Faithful bipartite ROM reciprocal tables. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 17–28, 1995.
[160] M.F. Schollmeyer and W.H. Tranter. Noise generators for the simulation of digital communication systems. In Proceedings of IEEE Annual Simulation Symposium, pages 264–275, 1991.
[161] M.J. Schulte and E.E. Swartzlander, Jr. Hardware designs for exactly rounded elementary functions. IEEE Transactions on Computers, 43(8):964–973, 1994.
[162] M.J. Schulte and J.E. Stine. Symmetric bipartite tables for accurate function approximation. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 175–183, 1997.
[163] M.J. Schulte and J.E. Stine. Approximating elementary functions with symmetric bipartite tables. IEEE Transactions on Computers, 48(9):842–847, 1999.
[164] C.E. Shannon. A mathematical theory of communication. Bell System Technical Journal, 27:379–423, 1948.
[165] N. Sidahao, G.A. Constantinides, and P.Y.K. Cheung. Architectures for function evaluation on FPGAs. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 2, pages 804–807, 2003.
[166] SimpleScalar LLC. SimpleScalar 4.0, 2004. http://www.simplescalar.com.
[167] J.E. Stine and M.J. Schulte. The symmetric table addition method for accurate function approximation. Journal of VLSI Signal Processing, 21(2):167–177, 1999.
[168] H. Styles and W. Luk. Customizing graphics applications: techniques and programming interface. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 77–90, 2000.
[169] S. Swaminathan, R. Tessier, D. Goeckel, and W. Burleson. A dynamically reconfigurable adaptive Viterbi decoder. In Proceedings of ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pages 227–236, 2002.
[170] K. Tae, J. Chung, and D. Kim. Noise generation system using DCT. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 4, pages 29–32, 2002.
[171] P.T.P. Tang. Table lookup algorithms for elementary functions and their error analysis. In Proceedings of IEEE Symposium on Computer Arithmetic, pages 232–236, 1991.
[172] R.M. Tanner. A recursive approach to low complexity codes. IEEE Transactions on Information Theory, IT-27:533–547, 1981.
[173] N. Telle, R.C.C. Cheung, and W. Luk. Customising hardware designs for elliptic curve cryptography. In International Workshop on Computer Systems: Architectures, Modeling, and Simulation, LNCS 3133. Springer-Verlag, 2004.
[174] T. Tian, C. Jones, J. Villasenor, and R. Wesel. Construction of irregular LDPC codes with low error floors. In Proceedings of IEEE International Conference on Communications, volume 5, pages 3125–3129, 2003.
[175] T. Todman and W. Luk. Methods and tools for high-resolution imaging. In Proceedings of International Conference on Field-Programmable Logic and its Applications, LNCS 3203, pages 627–636. Springer-Verlag, 2004.
[176] J. Vedral and J. Holub. Oscilloscope testing by means of stochastic signal. Measurement Science Review, 1(1), 2001.
[177] F. Viglione, G. Masera, G. Piccinini, M. Ruo Roch, and M. Zamboni. A 50 Mbit/s iterative Turbo-decoder. In Proceedings of Design, Automation and Test in Europe Conference, pages 176–180, 2000.
[178] J.E. Volder. The CORDIC trigonometric computing technique. IRE Transactions on Electronic Computers, EC-8(3):330–334, 1959.
[179] C.S. Wallace. A long-period pseudo-random generator. Technical Report TR89/123, Monash University, Australia, 1989.
[180] C.S. Wallace. Fast pseudorandom generators for normal and exponential variates. ACM Transactions on Mathematical Software, 22(1):119–127, 1996.
[181] C.S. Wallace. MDMC Software - Random Number Generators, 2003. http://www.datamining.monash.edu.au/software/random.
[182] J.S. Walther. A unified algorithm for elementary functions. In Proceedings of AFIPS Spring Joint Computer Conference, pages 379–385, 1971.
[183] N. Wax. Noise and Stochastic Processes. Dover Publications Inc., 1954.
[184] S. Wilton, S. Ang, and W. Luk. The impact of pipelining on energy per operation in field-programmable gate arrays. In Proceedings of International Conference on Field-Programmable Logic and its Applications, LNCS 3203, pages 719–728. Springer-Verlag, 2004.
[185] W.F. Wong and E. Goto. Fast hardware-based algorithms for elementary function computations using rectangular multipliers. IEEE Transactions on Computers, 43:278–294, 1994.
[186] Xilinx Inc. Additive White Gaussian Noise (AWGN) Core v1.0, 2002. http://www.xilinx.com.
[187] Xilinx Inc. Virtex-II Platform FPGAs: Detailed Description, 2003. http://www.xilinx.com.
[188] Xilinx Inc. Xilinx System Generator User Guide v6.2, 2003. http://www.xilinx.com.
[189] Xilinx Inc. Virtex-4 Family Overview, 2004. http://www.xilinx.com.
[190] D. Yeh, G. Feygin, and P. Chow. RACER: A reconfigurable constraint-length 14 Viterbi decoder. In Proceedings of IEEE Symposium on Field-Programmable Custom Computing Machines, pages 60–69, 1996.
[191] K.W. Yip and T.S. Ng. A simulation model for Nakagami-m fading channels, m<1. IEEE Transactions on Communications, 48(2):214–221, 2000.
[192] G. Zhang, P.H.W. Leong, C.H. Ho, K.H. Tsoi, R.C.C. Cheung, D. Lee, and W. Luk. Monte Carlo simulation using FPGAs. IEEE Transactions on VLSI, 2004. Submitted.
[193] T. Zhang and K.K. Parhi. VLSI implementation-oriented (3,k)-regular low-density parity-check codes. In Proceedings of IEEE Workshop on Signal Processing Systems, pages 25–36, 2001.
[194] T. Zhang and K.K. Parhi. A 54 Mbps (3,6)-regular FPGA LDPC decoder. In Proceedings of IEEE Workshop on Signal Processing Systems, pages 127–132, 2002.
[195] T. Zhang, Z. Wang, and K.K. Parhi. On finite precision implementation of low-density parity-check codes decoder. In Proceedings of IEEE International Symposium on Circuits and Systems, volume 6, pages 202–205, 2001.
[196] H. Zhun and H. Chen. A truly random number generator based on thermal noise. In Proceedings of IEEE International Conference on ASIC, pages 862–864, 2001.