BILINEAR ALGORITHMS
AND
ASIC ARCHITECTURES
FOR
FAST SIGNAL PROCESSING
by
Xingdong Dai
Presented to the Graduate and Research Committee
of Lehigh University
in Candidacy for the Degree of
Doctor of Philosophy
in
Computer Engineering
Lehigh University
2008
UMI Number: 3314490
UMI Microform 3314490. Copyright 2008 by ProQuest LLC.
Approved and recommended for acceptance as a dissertation in partial fulfillment
of the requirements for the degree of Doctor of Philosophy.
April 25, 2008    Date
April 25, 2008    Accepted Date
Dissertation Director: Meghanad D. Wagh
Committee Members:
Meghanad D. Wagh
Bruce D. Fritchman
Zhiyuan Yan
Bruce A. Dodson
Sandeep Kumar
Acknowledgements
I wish to express appreciation for the members of my doctoral committee, the faculty and staff at Lehigh University, friends and colleagues at LSI Corporation, and family members here and abroad. This dissertation simply would not have been possible without your constant encouragement and unwavering support over many years. Thank you very much for a wonderful experience. I will cherish it forever as this research draws to a close.
I would like to thank my advisor, Dr. Meghanad D. Wagh, for his guidance during
my graduate study. Not only did he introduce me to the wonderful world of bilinear
algorithms on which this dissertation is focused, he has also shown me the virtue of
unselfishness. I remember many times how Dr. Wagh changed his busy schedule in
order to provide direction and support for this part-time student.
I would like to thank the members of my doctoral committee: Dr. Zhiyuan Yan, Dr. Bruce D. Fritchman, Dr. Bruce A. Dodson and Dr. Sandeep Kumar. Each contributed many stimulating ideas throughout this research. Their valuable feedback greatly enhanced my dissertation.
During this work, friends and colleagues at LSI Corporation also generously lent
their support. From time to time, they substituted for me in conference calls or
meetings.
I want to thank my wife, Mingwei, for her cooperation and extreme patience during
long work days and late school nights. Last but not least, thanks to my parents for
their unconditional love and for providing me with a good education. You both were
my inspiration to accomplish this doctoral research work.
I am forever in your debt. Xingdong
Contents
Acknowledgements iii
Abstract 1
1 Introduction 3
1.1 Motivation 3
1.2 Objective and contributions 7
1.3 Organization 11
2 ASIC methodology and mathematical background 15
2.1 Design methodology 16
2.1.1 General ASIC flow 16
2.1.2 ASIC for signal processing 18
2.2 Performance metric 20
2.3 Group theory 25
2.3.1 Group definition 25
2.3.2 Cyclic and Hankel matrix products 26
2.4 Bilinear algorithm 28
2.4.1 Recursive decomposition 29
2.4.2 Order of computation 30
3 Discrete Hartley transform 35
3.1 Background and prior work 35
3.2 Partitioning the DHT of prime power lengths 38
3.3 Bilinear algorithm for 2^n-point DHT 40
3.3.1 16-point DHT 47
3.3.2 More than 16-point DHT 50
3.4 Performance analysis for 2^n-point DHT 54
3.5 Bilinear algorithm for 3^n-point DHT 57
3.6 Performance analysis for 3^n-point DHT 60
3.7 Discussion and conclusion 63
4 Modified discrete cosine transform 65
4.1 Background and prior work 65
4.2 Transformation from N-point MDCT/IMDCT to N/2-point DCT-IV 68
4.2.1 The forward MDCT transformation 68
4.2.2 The inverse MDCT transformation 69
4.2.3 The advantage of DCT-IV transformation 70
4.3 Bilinear algorithms for 2^n-point MDCT/IMDCT 72
4.4 Bilinear algorithms for 4·3^n-point MDCT/IMDCT 78
4.4.1 The bilinear MDCT/IMDCT for MP3 audio short block length 84
4.4.2 The bilinear MDCT/IMDCT for MP3 audio long block length 88
4.4.3 The unified MDCT/IMDCT architecture for MP3 audio . . . 95
4.5 Discussion and conclusion 108
5 Modulated complex lapped transform 109
5.1 Background and prior work 109
5.2 Proposed algorithm 113
5.2.1 The real part of the MCLT 115
5.2.2 The imaginary part of the MCLT 116
5.2.3 The new MCLT algorithm 116
5.3 Discussion and Conclusion 118
6 Conclusions 121
6.1 Thesis summary 121
6.2 Future work 122
Bibliography 124
List of Tables
1.1 Research steps and tools for fast bilinear algorithm development. . . . 8
1.2 Complexities of various MDCT and IMDCT algorithms for MP3 audio.
Note that M and A refer to multiplication and addition respectively. 10
1.3 Complexities of unified 12 and 36 point MDCT and IMDCT architectures for MP3 audio application. Note that M and A refer to multiplication and addition respectively. 10
1.4 Complexities of various fast MCLT algorithms for block size N = 2^n with a sine window 11
2.1 Complexity and delay comparisons for the three implementations of (2.1). Note that M and A refer to multiplication and addition respectively. 23
2.2 Arithmetic complexity of different decomposition orders 32
2.3 Determining the order of combining some bilinear algorithms. Note
that matrices identified with * have the special property that the sum
of all elements in the first row is 0 33
3.1 Hardware complexity of various 2^n-point DHT algorithms 54
3.2 Time complexity of various 2^n-point DHT algorithms. Note that M and A stand for multiplier and adder delays respectively. 55
3.3 Hardware complexity of various 3^n-point DHT algorithms 62
3.4 Critical path delay of various 3^n-point DHT algorithms. Note that M and A stand for multipliers and adders respectively. 62
4.1 Multiplication coefficients used in Fig. 4.7 76
4.2 Complexities of various 8 and 16 point MDCT algorithms. Note that
M and A refer to multiplication and addition respectively 78
4.3 Multiplication coefficients used in Fig. 4.13 86
4.4 Complexities of various 12-point MDCT and IMDCT algorithms. Note
that M and A refer to multiplication and addition respectively. . . . 88
4.5 Multiplication coefficients used in Fig. 4.16 91
4.6 Complexities of various 36-point MDCT and IMDCT algorithms. Note
that M and A refer to multiplication and addition respectively. . . . 93
4.7 12 and 36 point MDCT and IMDCT input mapping for unified architecture A 97
4.8 12 and 36 point MDCT and IMDCT output mapping for unified architecture A 98
4.9 12 and 36 point MDCT and IMDCT input mapping for unified architecture B. Note that xA and xB refer to the two 6-point blocks whose MDCT is computed concurrently. Similarly XA and XB represent two independent 6-point transform blocks whose IMDCT is computed concurrently. 100
4.10 12 and 36 point MDCT and IMDCT output mapping for unified architecture B. Note that XA and XB refer to MDCTs of 6-point sequences xA and xB respectively and are computed concurrently. Similarly x'A and x'B refer to IMDCTs of 6-point transforms XA and XB respectively and are computed concurrently. 101
4.11 12 and 36 point MDCT and IMDCT input mapping for unified architecture C (pipeline) 104
4.12 12 and 36 point MDCT and IMDCT output mapping for unified architecture C (pipeline) 106
4.13 Complexities of unified 12 and 36 point MDCT and IMDCT architectures for MP3 application. Note that M and A refer to multiplication and addition respectively. 107
5.1 Complexities of various fast MCLT algorithms for block size N = 2^n. 119
List of Figures
1.1 2006 DSP markets by Forward Concepts 5
2.1 Digital ASIC abstraction levels 17
2.2 Critical path delay for an addition 21
2.3 Critical path delay for a multiplication 22
2.4 First implementation of (2.1) 22
2.5 Second implementation of (2.1) 23
2.6 Third implementation of (2.1) 23
2.7 A bilinear algorithm for the 4 point cyclic convolution 31
3.1 Flow graph for Ref. [2] implementation of 2^n-point DHT. Note that each rotation can also be implemented with 3 multiplications and 3 additions. The index and the coefficients are: i', k' = 0, 1, ..., N/2 − 1; C(j) = cos(2πj/N) and S(j) = sin(2πj/N), where j = 0, 1, ..., N/4 − 1. 36
3.2 Flow graph for Ref. [65] implementation of 3^n-point DHT. Note that each rotation can also be implemented with 3 multiplications and 3 additions. The index and the coefficients are: i', k' = 0, 1, ..., N/3 − 1; C(i') = cos(2πi'/N) and S(i') = sin(2πi'/N). 37
3.3 Flow chart for p^n-point bilinear DHT 38
3.4 Hartley group transform for p^n-point bilinear DHT 40
3.5 Kernel matrix group division of p^n-point DHT 41
3.6 Proposed bilinear algorithm for even-indexed components of a 16-point DHT. Note that multiplication constant c0 = √2. 48
3.7 Proposed bilinear algorithm for odd-indexed 16-point DHT. Note that the multiplication coefficients are: c0 = √2, c1 = 0.7654, c2 = 0.5412 and c3 = −1.8478. 49
3.8 Delay in nsec (on horizontal axis) and normalized area (on vertical
axis) for various implementations of 8, 16, 32 and 64 point DHTs. . . 56
3.9 Proposed bilinear algorithm for 9-point DHT. Note that the multiplication coefficients are: c0 = −0.5, c1 = 0.8660, c2 = −0.5924, c3 = −1.7057, c4 = −0.7660, c5 = −1.6276, c6 = −0.3008, and c7 = −0.6428. 61
3.10 Delay in nsec (on horizontal axis) and normalized area (on vertical
axis) for various implementations of 9 and 27 point DHTs 63
4.1 Flow graph for Ref. [9] implementation of N-point MDCT 67
4.2 Flow graph for Ref. [27] implementation of N point MDCT/IMDCT.
Note that SDCT is unnormalized discrete cosine transform 67
4.3 Flow graph for the DCT-IV implementation of N-point MDCT 69
4.4 Flow graph for the DCT-IV implementation of N-point IMDCT 70
4.5 Flow graph for the DCT-IV implementation of 2N-point unified MDCT and IMDCT. Note that IMODE = 0 for MDCT and IMODE = 1 for IMDCT. 71
4.6 Flow graph for the DCT-IV implementation of 2N-point unified MDCT and IMDCT with reduced IO requirement. Note that for MDCT, IMODE = 0 and in(i) = x(i), i = 0, 1, ..., 2N − 1. For IMDCT, IMODE = 1 and in(k) = X(k), k = 0, 1, ..., N − 1. 71
4.7 Proposed bilinear implementation of the 8-point DCT-IV. The multiplication coefficients are in Table 4.1. 75
4.8 The implementations of 16-point MDCT and IMDCT based on the
8-point DCT-IV 76
4.9 Unified implementation of the 16-point MDCT and IMDCT employing one 8-point DCT-IV. Note that for MDCT, IMODE = 0, in(i) = x(i), 0 ≤ i < 16. For IMDCT, IMODE = 1, in(k) = X(k), 0 ≤ k < 8. 77
4.10 Delay in nsec (on horizontal axis) and normalized area (on vertical
axis) for various implementations of 8 and 16 point MDCTs. Note
that Fig. 4.9 is a unified MDCT and IMDCT architecture, while all
others compute MDCT only. 78
4.11 Flow graph for 2·3^n-point bilinear DCT-IV 79
4.12 Flow graph for cosine group transform of 2·3^n-point DCT-IV 83
4.13 Proposed bilinear implementation of the 6-point DCT-IV 87
4.14 The implementations of 12-point MDCT and IMDCT based on the
6-point DCT-IV 87
4.15 Delay in nsec (on horizontal axis) and normalized area (on vertical
axis) for various implementations of 12-point MDCT and IMDCT. . . 89
4.16 Proposed bilinear implementation of multidimensional convolution involved in the 18-point DCT-IV 91
4.17 Proposed bilinear implementation of the 18-point DCT-IV 92
4.18 Proposed bilinear implementation of the 36-point MDCT 93
4.19 Proposed bilinear implementation of the 36-point IMDCT 94
4.20 Delay in nsec (on horizontal axis) and normalized area (on vertical
axis) for various implementations of 36-point MDCT and IMDCT. . . 95
4.21 Proposed bilinear implementation of the unified 12 and 36 point MDCT
and IMDCT (architecture A) 99
4.22 Proposed bilinear implementation of the unified 12 and 36 point MDCT
and IMDCT (architecture B) 102
4.23 Proposed bilinear implementation for the pipelined unified 12 and 36
point MDCT and IMDCT (architecture C) 105
4.24 Delay in nsec (on horizontal axis) and normalized area (on vertical
axis) for unified 12 and 36 point MDCT and IMDCT architectures (A,
B and pipeline), with comparison to the 36-point MDCT architectures
in literature 107
5.1 Flow graph for Ref. [34] implementation of the MCLT. Note that c(k) = W_8(2k + 1) W_4N(k). 111
5.2 Flow graph for Ref. [17] implementation of the MCLT. Note that c(i) = h(N − 1 − i)/h(i) and d(i) is a constant for a sine window. Each circle represents an addition. 112
5.3 Flow graph for the proposed MCLT algorithm. Note that DCT modules are scaled by a constant of 1/(2√N). 118
Abstract
This dissertation presents a formal hardware design approach using bilinear algorithms for fast digital signal processing (DSP) applications. In particular, we focus on the design of application-specific integrated circuits (ASICs), where dedicated algorithmic accelerators are implemented in fixed-point arithmetic.
Most signal processing algorithms involve a transform kernel with a known structure. Using concepts of group theory, the kernel matrix can be recursively partitioned into computations of small-length cyclic convolutions and Hankel matrix products. Bilinear algorithms for these smaller blocks are then combined to obtain the required bilinear algorithm of the transform. Bilinear algorithms have a high degree of concurrency, as all multiplication operations are independent of each other and can be computed at the same time. As a result, hardware realizations of bilinear algorithms are much faster than conventional implementations. The structural modularity also allows simple pipelining and greatly reduces the number of input and output (IO) pins.
In this dissertation, we develop new bilinear algorithms and implementations for the discrete Hartley transform (DHT), the modified discrete cosine transform (MDCT) and the modulated complex lapped transform (MCLT). In the case of bilinear DHT algorithms, we show that the kernel divisions are identical for all prime power lengths. Our implementations are 20%-60% faster than existing implementations. For the MPEG-1/2 audio layer III (MP3) application, our proposed MDCT algorithms have about 30% lower computational complexity as compared with other fast algorithms in the literature. The modularity of our algorithms also permits one to design, for the first time, a unified architecture for forward and inverse transforms using different MP3 block sizes. In the case of the MCLT, we achieve a bilinear algorithm by merging
the external sine window function with the main computation through trigonometric manipulations. As compared with most algorithms, our MCLT algorithm requires about N fewer multiplications, where the typical block size, N, for applications such as audio watermarking is 2048.
Chapter 1
Introduction
1.1 Motivation
The research presented in this dissertation is motivated by the need to develop fast signal processing algorithms and efficient hardware implementations. The demand for fast signal processing algorithms has been rising for many years, driven primarily by the spread of broadband communications and the wide usage of cellular technologies [16]. Fast and efficient audio and video processing and compression algorithms have been seen as enabling technologies for many emerging real-time multimedia-heavy applications [47]. This dissertation focuses on fast algorithms and architectures for the discrete Hartley transform (DHT), the modified discrete cosine transform (MDCT) and the modulated complex lapped transform (MCLT). Among many signal processing algorithms, these three transforms have attracted particular research interest from both academia and industry.
The discrete Hartley transform is a real-valued transform, and its forward and inverse transforms share the same kernel. The DHT is favored for fast convolution on real data sets, and is frequently used as an alternative to the real-valued discrete Fourier transform (DFT). A recent study finds that the fast Hartley transform (FHT) provides the fastest realization of the DFT when implemented across a variety of general-purpose processor platforms [20]. The DHT is such a fundamental signal processing tool that its fast implementations will continue to benefit many signal processing applications.
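As a concrete reference point, the N-point DHT is H(k) = Σ_n x(n) cas(2πnk/N), where cas(t) = cos(t) + sin(t); the same kernel applied twice recovers the input up to a factor of N. The sketch below is a direct O(N²) Python check of this self-inverse property, offered only as a baseline, not one of the fast bilinear implementations developed in this dissertation.

```python
import math

def dht(x):
    """Direct discrete Hartley transform: H[k] = sum_n x[n] * cas(2*pi*n*k/N),
    with cas(t) = cos(t) + sin(t).  O(N^2); for reference only."""
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * n * k / N) +
                        math.sin(2 * math.pi * n * k / N))
                for n in range(N))
            for k in range(N)]

# The DHT kernel is its own inverse up to a 1/N scale factor:
x = [1.0, 2.0, -0.5, 3.0, 0.25, -1.0, 4.0, 0.5]
y = [v / len(x) for v in dht(dht(x))]
assert all(abs(a - b) < 1e-9 for a, b in zip(x, y))
```

The shared forward/inverse kernel is exactly why a single DHT datapath can serve both directions in hardware.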
The modulated lapped transform (MLT) is a cosine-modulated filter bank based on concepts of time domain aliasing cancellation (TDAC), which permits perfect reconstruction. A key signal processing component within the framework of the MLT is the modified discrete cosine transform (MDCT). The MLT and the MDCT are essential parts of many state-of-the-art audio compression technologies, such as MPEG audio, Dolby AC-3, and Windows Media Audio (WMA). In complex extended form, the modulated complex lapped transform (MCLT) has been proposed for many exciting new applications such as audio watermarking and audio identification. As many of these functions are being integrated onto mobile devices, fast and efficient implementations are needed to free the host processor from these CPU-taxing computations.
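The TDAC property can be demonstrated with a direct O(N²) sketch of the windowed MDCT/IMDCT pair. The definitions and constants below are the common textbook ones with a sine window (which satisfies the Princen-Bradley condition), not necessarily those of the algorithms derived later in this dissertation: overlap-adding the inverse transforms of 50%-overlapped blocks cancels the time-domain aliasing exactly.

```python
import math

def sine_window(N):
    # Satisfies the Princen-Bradley condition w[n]^2 + w[n+N]^2 = 1.
    return [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]

def mdct(block, w):
    """N-point MDCT of a windowed 2N-sample block (direct O(N^2) form)."""
    N = len(block) // 2
    return [sum(w[n] * block[n] *
                math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]

def imdct(X, w):
    """Windowed 2N-sample inverse; overlap-add of halves cancels aliasing."""
    N = len(X)
    return [w[n] * (2.0 / N) *
            sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for k in range(N))
            for n in range(2 * N)]

# Perfect reconstruction via 50%-overlapped blocks (TDAC):
N = 8
x = [math.sin(0.3 * n) + 0.1 * n for n in range(4 * N)]
w = sine_window(N)
y = [0.0] * (4 * N)
for off in (0, N, 2 * N):
    rec = imdct(mdct(x[off:off + 2 * N], w), w)
    for n in range(2 * N):
        y[off + n] += rec[n]
# Interior samples (covered by two overlapping blocks) match the input.
assert all(abs(x[n] - y[n]) < 1e-9 for n in range(N, 3 * N))
```

Note the MDCT maps 2N inputs to only N outputs; individual blocks are not invertible, and it is the overlap-add of adjacent inverse transforms that restores the signal.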
This research starts out with a modest goal of designing fast MDCT algorithms and circuits for MPEG-1/2 layer III (MP3) audio applications. Since 2001, when Apple Computer first made the iPod a household name, the demand for portable MP3 players has been booming. In a report by the leading market research firm IDC, shipments of MP3 flash memory music players were predicted to surge to nearly 124 million units in 2009, a 370% increase from the 26.4 million units shipped worldwide in 2004. Including the MP3 player category, all compressed audio players were expected to reach 945.5 million units shipped and $145.4 billion in revenue worldwide by 2009. The forward and inverse MDCT algorithms involve intensive computations and are frequently seen as implementation bottlenecks for MP3 encoders and decoders. To further complicate the computational requirement, the sample sizes for MP3 audio data are not a power of two. As a result, only a few fast MDCT algorithms have been published for MP3 audio processing.
Recently we have also witnessed the coming of multimedia handsets. A multimedia handset is a handset capable of processing various kinds of data including text, graphics, animation, audio and video, in addition to voice services. In 2006, nearly all of the major manufacturers developed music handsets. A recent report from Market and Research has predicted that handsets will gradually transform from communication tools into multimedia terminals with the appearance of completely new consumer electronics such as DV handsets, TV handsets, game handsets, etc. A lot of computing power is required to make multimedia handsets a reality, since information must be digitally sampled, encoded, compressed and transmitted at the source, then received, decoded, decompressed and played back at the destination. In addition, all of these data transformations must take place either in real time for interactive communications and entertainment, or within a well-constrained time frame because of limited on-board memory. Today a significant amount of the computing load is shared by programmable digital signal processors (DSPs). According to the DSP analyst Forward Concepts, 70% of programmable DSPs are shipped for wireless applications including handsets (Fig. 1.1).
Figure 1.1: 2006 DSP markets by Forward Concepts ($8.3 billion worldwide).
Algorithm research on signal processing has a long history of concentrating on reducing computational complexity in terms of arithmetic operations. However, this approach often ignores the underlying hardware platform, making the algorithms most suitable for general-purpose processors where computations are executed sequentially in a shared arithmetic logic unit (ALU). Attention is paid to minimizing data transfers between the memory and the ALU: a fast algorithm is one that moves data through the ALU quickly and in the fewest passes. Most signal processing algorithms involve a transform kernel of matrix-vector multiplications; a programmable DSP takes advantage of this by offering single-cycle multiply-accumulate (MAC) operations and low-overhead loop structures. As a result, for many signal processing applications, better performance can be achieved using a programmable DSP than a general-purpose processor. In the past decade, in order to meet the rising computational requirements, many important advances in processor technology have been made. Among them, very long instruction word (VLIW), single instruction multiple data (SIMD), multi-threading (MT) and multi-core processors are just a few notable improvements. However, these advanced technologies require close interaction between the compiler and the silicon to achieve the much-touted performance, which can mean a prolonged development cycle, in addition to commanding a high cost and power premium.
An attractive approach to realizing a fast algorithm is to apply dedicated hardware acceleration, in particular through the use of application specific integrated circuits (ASIC) [40,59]. For an ASIC implementation, algorithm speed is intrinsically tied to the critical path delay. Improvements to the critical path can be explored at several abstraction levels, including architecture, logic, and circuit [60]. At the architecture level, concurrency available in an algorithm can be exploited by pipelining and parallel processing techniques. Pipelining reduces the effective critical path by performing independent computations in an interleaved manner. Parallel processing, on the other hand, performs these computations concurrently using duplicated hardware. At the logic level, fast logic styles are favored on the critical path, while area-efficient styles can be used to meet cost and power goals. For example, one can choose a carry look-ahead adder (CLA) for the critical path and use a carry propagation adder (CPA) on non-critical paths. At the circuit level, sub-micron CMOS offers the best combination of speed, capacity, and cost among commercially available options. From a vendor-supplied ASIC library, selection of cell type, size and drive strength can be made through synthesis to meet a set of pre-defined design objectives such as area and power.
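The critical path bookkeeping used throughout this work (delays quoted as sums of multiplier delays M and adder delays A, e.g. "M + 6A") amounts to a longest-path computation over the dataflow graph. The sketch below illustrates this on a small made-up graph (not a circuit from this dissertation); it assumes, for the lexicographic comparison, that a multiplier is slower than any adder chain appearing in the graph.

```python
# Longest-path delay through a small dataflow graph, counted in multiplier (M)
# and adder (A) delays -- the bookkeeping behind entries such as "M + 6A".
DELAY = {"mul": (1, 0), "add": (0, 1), "in": (0, 0)}  # (M, A) cost per node type

# node -> (type, predecessors): a two-stage adder tree feeding one multiplier,
# whose product is then added to an earlier partial sum.
graph = {
    "x0": ("in", []), "x1": ("in", []), "x2": ("in", []), "x3": ("in", []),
    "s0": ("add", ["x0", "x1"]),
    "s1": ("add", ["x2", "x3"]),
    "s2": ("add", ["s0", "s1"]),
    "p0": ("mul", ["s2"]),
    "y0": ("add", ["p0", "s0"]),
}

def critical_path(g):
    memo = {}
    def arrival(node):
        if node not in memo:
            kind, preds = g[node]
            m, a = DELAY[kind]
            # Latest-arriving predecessor dominates; inputs arrive at (0, 0).
            pm, pa = max((arrival(p) for p in preds), default=(0, 0))
            memo[node] = (pm + m, pa + a)
        return memo[node]
    return max(arrival(n) for n in g)

M, A = critical_path(graph)
print(f"critical path = {M}M + {A}A")  # -> critical path = 1M + 3A
```

A bilinear architecture keeps exactly one "mul" node on any input-to-output path, which is why the proposed designs report critical delays of the form M + kA rather than 2M or 3M plus additions.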
A frequently cited advantage of general purpose processors (GPP) and digital signal processors (DSP) is faster time-to-market than an ASIC implementation. This makes GPPs and DSPs good choices for initial prototyping when functional requirements are not completely specified. However, once the performance bottleneck is well understood, an ASIC can provide a targeted high-performance solution that is also low in cost and power. This is because for a given application, the computational requirements for the data flow can often be isolated from those for the control flow. Data flow is frequently point-to-point in nature, with few conditional branches. In addition, similar to many software projects, the hardware design cycle can be significantly reduced if a pre-verified signal processing library is used. Once the computation-intensive tasks are off-loaded, the rest of the processing can be carried out in a more flexible DSP or GPP with relaxed MIPS requirements. Consequently, this hybrid approach can generate better system performance and savings in cost and power.
This ASIC-driven approach, however, was neither physically possible nor economically feasible in micron-scale technologies, where large transistor sizes limited the amount of hardware that could be put on a die. In sub-micron technologies, ASIC design has increasingly crossed over from die-limited to pad-limited [48]. This means that for ASIC design, transistors are cheap while inputs and outputs (IO) are expensive. Therefore more diversified functions and larger-scale signal processing can be implemented with little cost penalty if the number of ASIC inputs and outputs can be capped. This change in ASIC technology permits a fresh look at the hardware acceleration approach to fast signal processing.
1.2 Objective and contributions
The objective of this research is to develop fast bilinear algorithms and efficient hardware architectures, particularly for three types of signal transforms: the discrete Hartley transform (DHT), the modified discrete cosine transform (MDCT) and the modulated complex lapped transform (MCLT). Early research [35] has shown that implementations based on bilinear algorithms can provide very fast and efficient architectures for the discrete cosine transform (DCT) of certain lengths. To obtain bilinear algorithms for the DHT, MDCT and MCLT, we employ a similar group theoretic approach, which allows us to partition a transform kernel into smaller cyclic convolutions and Hankel products. Applying bilinear algorithms to these smaller matrix products, we then obtain the desired architectures. For the transforms under consideration, we aim for a well-defined, scalable signal flow graph of the computation which is ideal for ASIC implementation. Table 1.1 summarizes the steps and tools utilized in this study.
Table 1.1: Research steps and tools for fast bilinear algorithm development.

Research step           Research tool
Algorithm design        Matlab
Hardware architecture   Verilog
Design synthesis        Synopsys Design Compiler
Design technology       TSMC 90nm CMOS
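To make the bilinear form concrete: a bilinear algorithm computes y = C(Ax ∘ Bh), i.e. one layer of input additions, a single layer of mutually independent multiplications, then output additions. The smallest classic example is the 2-point cyclic convolution, sketched below with two multiplications instead of the naive four (in practice the 1/2 factors would be folded into the precomputed coefficient vector; this toy version keeps them explicit).

```python
# Bilinear 2-point cyclic convolution: y = C (A x  .*  B h).
def cyclic2_bilinear(x, h):
    # Pre-additions (the A and B matrices applied to x and h)
    p = (x[0] + x[1]) * (h[0] + h[1]) * 0.5   # independent multiply 1
    q = (x[0] - x[1]) * (h[0] - h[1]) * 0.5   # independent multiply 2
    # Post-additions (the C matrix applied to the products)
    return [p + q, p - q]

def cyclic_direct(x, h):
    # Naive O(N^2) cyclic convolution, for comparison.
    N = len(x)
    return [sum(x[m] * h[(k - m) % N] for m in range(N)) for k in range(N)]

assert cyclic2_bilinear([3.0, 5.0], [2.0, 7.0]) == cyclic_direct([3.0, 5.0], [2.0, 7.0])
```

Because the two multiplies share no operands, a hardware realization can perform them in the same clock phase, leaving exactly one multiplier delay on the critical path.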
The major contributions of this dissertation include:
1. Discrete Hartley Transforms (DHT) of prime power lengths

• Theory: Using group theory, it is first shown that the transform kernel for lengths 2^n and p^n for odd prime p can be partitioned into two individual sub-matrices, a DHT of smaller size and a Hartley group transform. The Hartley group transform further decomposes into a cyclic matrix and a Hartley group transform of smaller size.

• Implementation: Hardware complexity is competitive with the referenced algorithms up to 256 points with a pipelined and folded design. For lengths less than 64, our proposed implementations are 20% to 60% faster on average. The pipelined design further reduces the output pins by 50%.
2. Modified Discrete Cosine Transforms (MDCT) of lengths 2^n

• Theory: Using group theory, it is first shown that the transform kernel matrix converts to a single Hankel matrix.

• Implementation: Algorithm complexity is competitive for 8 and 16 point MDCT. However, our proposed structured bilinear implementations are 30% to 45% faster than the referenced designs.
3. Modified Discrete Cosine Transforms (MDCT) of composite length 4·3^n
• Application: MPEG-1/2 audio layer III (MP3) employs two MDCTs. The long block (36-point) is normally used to provide better frequency resolution. The short block (12-point) is used whenever better time resolution is needed. The switch from the long block to the short block occurs if distortion is expected in the frequency domain coding of an audio signal.
• Theory: Using group theory, it is first shown that the transform kernel can be partitioned into two individual sub-matrices, a type-IV discrete cosine transform of smaller size and a Cosine Group Transform. The cosine group transform further decomposes into a block cyclic matrix and a cosine group transform of smaller size.

• Algorithm: We have designed the most efficient MDCT algorithms for MP3 audio processing (Table 1.2).

• Implementation: Based on a group theoretic division of transform kernels, we are the first to develop three versions of unified hardware architectures to compute transforms of different block sizes, in addition to the forward and inverse transforms. Not only do our single-function designs perform better than existing solutions, the unified architectures also show superior performance against non-unified existing designs (Table 1.3).
4. Modulated Complex Lapped Transforms (MCLT) of lengths divisible
by four
• Applications: Digital audio watermarking, audio and video denoising. The
transform sizes are frequently larger than 2048.
Table 1.2: Complexities of various MDCT and IMDCT algorithms for MP3 audio. Note that M and A refer to multiplication and addition respectively.

Transform        Algorithm   Arithmetic complexity   Critical delay
12-point MDCT    Proposed    9M + 29A                M + 6A
12-point MDCT    Ref. [9]    11M + 39A               2M + 6A
12-point MDCT    Ref. [37]   13M + 27A               2M + 5A
12-point MDCT    Ref. [27]   11M + 29A               3M + 7A
12-point IMDCT   Proposed    9M + 23A                M + 5A
12-point IMDCT   Ref. [9]    11M + 33A               2M + 5A
12-point IMDCT   Ref. [37]   13M + 21A               2M + 4A
12-point IMDCT   Ref. [27]   11M + 23A               3M + 6A
36-point MDCT    Proposed    36M + 150A              M + 10A
36-point MDCT    Ref. [9]    41M + 165A              2M + 9A
36-point MDCT    Ref. [37]   47M + 129A              2M + 8A
36-point MDCT    Ref. [27]   43M + 133A              3M + 22A
36-point IMDCT   Proposed    36M + 132A              M + 9A
36-point IMDCT   Ref. [9]    51M + 151A              2M + 8A
36-point IMDCT   Ref. [37]   51M + 115A              2M + 7A
36-point IMDCT   Ref. [27]   43M + 115A              3M + 21A
Table 1.3: Complexities of unified 12 and 36 point MDCT and IMDCT architectures for MP3 audio application. Note that M and A refer to multiplication and addition respectively.

Architecture   Arithmetic complexity   Critical delay (36-point)   Critical delay (12-point)
A              36M + 150A              M + 10A                     M + 6A
B              36M + 154A              M + 10A                     M + 6A
C (pipeline)   27M + 131A              2M + 12A                    M + 6A
• Theory: Using trigonometric identities, it is shown that the window function can be completely merged with the transform kernel so as to preserve the bilinearity of the algorithm. The kernel can then be decomposed into a discrete cosine transform and a discrete sine transform computed concurrently.

• Algorithm: We propose an efficient algorithm for the MCLT with a sine window function (Table 1.4). This solution also preserves the bilinearity of the design. Our algorithm first breaks down the MCLT into a linear combination of an evenly-stacked modified discrete cosine transform (E-MDCT) and an evenly-stacked modified discrete sine transform (E-MDST). We show that the E-MDCT can be converted to a discrete cosine transform of type II (DCT-II), and similarly the E-MDST to a discrete sine transform of type II (DST-II). With a known transformation technique, the DST-II can be computed through a DCT-II of the same size. Thus the entire MCLT can share a single DCT-II core module.
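The DST-II-to-DCT-II step can be checked numerically with one common mapping: negate every other input sample, then reverse the DCT-II output. This is a standard identity offered here as a plausibility check; the transformation technique used in Chapter 5 may differ in scaling or indexing.

```python
import math

def dct2(x):
    """Unnormalized DCT-II: C[k] = sum_n x[n] cos(pi*(2n+1)*k/(2N))."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N)) for k in range(N)]

def dst2(x):
    """Unnormalized DST-II: S[k] = sum_n x[n] sin(pi*(2n+1)*(k+1)/(2N))."""
    N = len(x)
    return [sum(x[n] * math.sin(math.pi * (2 * n + 1) * (k + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

# DST-II via DCT-II: alternate the input signs, then reverse the output.
x = [1.0, -2.0, 0.5, 3.0, 1.5, -0.25]
y = [(-1) ** n * v for n, v in enumerate(x)]
via_dct = dct2(y)[::-1]
assert all(abs(a - b) < 1e-9 for a, b in zip(dst2(x), via_dct))
```

Since sign flips and index reversal cost no arithmetic in hardware, this is what allows the cosine and sine branches of the MCLT to share one DCT-II core.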
Table 1.4: Complexities of various fast MCLT algorithms for block size N = 2^n with a sine window.

Algorithm   Window choice   Real multiplications   Real additions
Ref. [33]   any             N log N + 3N           3N log N + 2N
Ref. [17]   sine            N log N + N            3N log N + 4N
Ref. [34]   sine            N log N + N            3N log N + 3N − 2
Proposed    sine            N log N + 2            3N log N + 4N
1.3 Organization
This chapter has provided an introduction to our research into the design of bilinear hardware accelerators for signal processing applications. We stressed the importance of developing fast signal processing algorithms and the challenges posed by a converged multimedia wireless network. The requirements on hardware implementation were also discussed. We advocate an ASIC-driven approach to deliver a high-performance and yet cost-effective solution. Also presented in this chapter are our research objectives and a summary of contributions.
In Chapter 2, we first discuss the ASIC design flow. Then concepts of group theory and basic properties of the group are reviewed. This is followed by a detailed treatment of the bilinear algorithm. We show examples of bilinear algorithms for small-size cyclic convolutions and Hankel products. A bilinear algorithm for larger cyclic and Hankel matrix products can be derived by using the small-length algorithms as building blocks. Since a structured bilinear architecture has only one multiplication on the critical path, the resulting ASIC circuit can be very fast in fixed-point implementations. For a complex group structure, the order of computation can have a significant impact on the additive complexity. We discuss a procedure for determining this computation order.
Bilinear algorithms for the DHT of prime power lengths are developed in Chapter
3. We investigate the structure of the DHT kernel matrix using the group A(p^n), where p is
a prime number. This study leads to bilinear algorithms with a single multiplication
stage. Algorithm derivation and verification were carried out in MATLAB. The
new algorithms and two prior algorithms were implemented for a number of 2^n- and
3^n-point sizes, and their areas and critical path delays are compared.
In Chapter 4, we extend structured bilinear algorithms to 2^n- and 4·3^n-point
MDCT. Composite-length MDCT algorithms of 4·3^n points are the workhorse behind
the popularity of the MPEG-1/2 Layer III (MP3) audio format. The kernel matrix of
the MDCT is rectangular (2^n × 2^(n+1)), not square as in the case of the DHT. With front-end
data massaging, we first transform the MDCT into a type-IV DCT. Both forward
and inverse transforms can then be unified with a DCT-IV based algorithm. This has a
significant hardware implication, since the encoder and decoder can now time-share
a single hardware unit. By taking advantage of the recursive structure inside a bilinear
algorithm, we propose three variations of a unified hardware architecture covering not only
the forward and inverse transforms but also the short and long blocks. We show that
these proposed architectures offer superior performance over existing design solutions
and can be obtained with little area penalty.
A fast algorithm for the modulated complex lapped transform (MCLT) is developed in Chapter
5. We obtain a bilinear algorithm for the MCLT of lengths divisible by 4 with a sine
window. The real part of the MCLT is a modulated lapped transform (MLT). The MLT
is related to the MDCT by first applying a window function to the input data. A
commonly used window is the sine window, which preserves perfect reconstruction.
The imaginary part of the MCLT can be obtained by a modified discrete sine transform
(MDST) with windowed data. In most publications, this windowing was
performed separately from the MDCT flow. A handful of publications attempted to
merge the discrete computation steps, all with limited success. Using trigonometric
identities, we show that the window function can be completely merged with
the transform kernel so as to preserve the bilinearity of the algorithm. The hardware
complexity of the new algorithm is computed and compared with four prior algorithms.
It is shown that our proposed algorithm has the lowest overall complexity
and the lowest multiplication requirements.
In the final chapter, the methods and results of this research are summarized. A
group-theoretic approach has proven successful in obtaining bilinear algorithms, and
hence fast VLSI architectures, for the transforms under consideration. Group theory
allows us to partition a transform kernel of interest into smaller cyclic
and Hankel matrices. By applying bilinear algorithms to these matrix products, the
desired architectures are obtained. Future work is discussed on the topics
of structured bilinear algorithms and algorithmic hardware acceleration.
Chapter 2
ASIC methodology and
mathematical background
This chapter provides information about the implementation techniques used as well
as the mathematical background required for this research. We start with a review
of the ASIC design flow and the metrics used to evaluate an implementation. Then the
concepts of group theory and the basic properties of groups used here are reviewed. This
is followed by a detailed treatment of the bilinear algorithm. We illustrate examples
of bilinear algorithms for small-length cyclic convolutions and Hankel products. A
bilinear algorithm for larger-length cyclic and Hankel matrix products can be derived
using the small-length algorithms as building blocks. Since a structured bilinear
architecture has only one multiplication operation on the critical path, one can devise very
fast ASIC bilinear architectures in fixed-point implementations. For a complex group
structure, the ordering of computation can have a significant impact on the additive
complexity of the bilinear algorithm. We discuss a procedure for determining this
computation order.
2.1 Design methodology
2.1.1 General ASIC flow
Application-specific integrated circuits (ASICs) have been the technology of choice for
many large-volume applications. Many digital signal processing systems fit in this
category. For example, mobile phones are produced in very large numbers and require
high-performance circuits with respect to throughput and power consumption.
An ASIC often has performance advantages over digital signal processors (DSPs) in
size, power, and speed. These advantages are gained when area is optimized by
eliminating functions that are part of a standard DSP but are not needed, such as the
instruction fetch-and-decode logic and interrupt handling circuits. As a result,
ASIC circuits drive less capacitive load, resulting in less power dissipation and
better performance. In addition, the word precision and other design parameters can
be tailored to a specific application to provide optimum high-level performance.

ASICs, however, are not without challenges. The most notable are the lack of
programmability and ease of portability, and with that a longer design cycle. Since
ASIC implementations are literally set in stone, they cannot be easily field-upgraded
and thus must be designed reliably on the first try. To achieve this objective, a
design ripples through a sequence of refinement stages called abstraction levels before
making its way into silicon. The different design abstraction levels of a top-down
design methodology are illustrated in Fig. 2.1.
In a digital ASIC flow, the highest abstraction level is the specification level. It
defines the functionality, the performance, and the constraints of the target application.
Next is the behavioral level, where a design is often embodied by a system software
program that executes the function to be implemented in hardware. At the register
transfer level (RTL), the behavioral description is mapped to a well-defined hardware
micro-architecture. The behavior and structure of a design are specified by describing
the operations performed on data as it flows between circuit inputs, outputs,
and clocked registers. At the circuit level, different blocks of the hardware architecture
are committed to Boolean logic gates, latches, and flip-flops. For example, an adder
is committed to a specific implementation such as a carry propagation adder (CPA),
[Figure content: the five abstraction levels (Specification; Behavior; Register transfer (RTL); Circuit; Physical layout), each with an example (a specification such as "Function: MP3 decoder, Performance: 320 kbps"; behavioral code "if (s) x = a + b; else x = a + c;"; a gate-level CLA circuit; a mask layout) and representative tools (Word and spreadsheets; Matlab and C/C++; VHDL and Verilog; synthesis tools from Synopsys, Cadence, and Mentor; physical design tools from Cadence, Synopsys, and Magma).]

Figure 2.1: Digital ASIC abstraction levels.
a carry-lookahead adder (CLA), or a carry select adder (CSA), depending on the
availability of library cells. The circuit level is the stage where the design is synthesized
to a specific technology node, such as the TSMC 90 nm CMOS process used in
this research. The final abstraction level before design sign-off is the physical layout
level. Here the design is linked to a specific technology and the mask geometries are
derived. Simulation parameters such as wire resistance and parasitic capacitance are
extracted and fed back to a circuit simulator to ensure the design functions at the
specified performance level. Cross-talk and electromigration analyses are also performed to
determine whether the device lifetime is sufficient for an application. Most often a design
will iterate a few times between the last two levels before a satisfactory solution is obtained.
We have seen that these abstraction levels offer opportunities to divide a complicated
design process into smaller and more manageable tasks. A well-practiced ASIC
flow can significantly reduce the risk of introducing errors into the design. However, it
is also this elaborate and fine-tuned procedure that demands significant effort and
time, which in turn often prolongs the design process. To shorten the schedule
without negatively impacting the quality of the design, a frequently used
technique is designing for reuse. This requires that during the design process, attention
be paid to any iterative process within the scope. These iterative processes
can be extracted as a macro or a core; they can be verified once and confidently
applied many times. A systematic reuse of macros and cores provides the
quickest and most efficient approach to ASIC design [44]. In this dissertation, we will
show that our choice of design provides a natural path to design-for-reuse. As such,
a larger design can be quickly generated from existing pre-verified blocks, thereby
significantly shortening the design schedule.
2.1.2 ASIC for signal processing
Signal processing is fundamental to information processing. Generally the goal is
to reduce the information content in a signal to facilitate a decision about what
information the signal carries. In other instances, the aim is to retain the information
and to transform the signal into a form that is more suitable for transmission or
storage.
The major attributes of our ASIC design for signal processing are
• Standard cell based ASIC: A standard cell based ASIC, or cell-based IC (CBIC),
uses a collection of pre-designed logic cells to build a large, complex circuit.
These logic cells are known as standard cells and include Boolean gates, flip-flops,
and more complex modules such as multiplexers. The entire collection
is called a library. If building an ASIC is like building a house, the library is like
a Home Depot catalog, and the cells are equivalent to the lumber, bricks, nails,
and tiles listed in the catalog. The advantage of this design approach is that
designers save time and reduce risk by using a pre-designed, pre-tested, and
pre-characterized standard cell library. ASIC designers can then focus their efforts
on system functionality and high-level design tradeoffs.
• Fixed-point arithmetic: Early signal processors used fixed-point arithmetic and
often had far too short an internal data length and far too little on-chip memory
to be efficient. Recent processors use floating-point arithmetic, which is much
more expensive than fixed-point arithmetic in terms of power consumption,
execution time, and chip area. In fact, these processors are not exclusively
aimed at DSP applications; applications that typically require floating-point
arithmetic are three-dimensional (3D) graphics and mechanical computer-aided
design (CAD). Fixed-point arithmetic is better suited for DSP
applications than floating-point arithmetic, since good DSP algorithms require
high accuracy (a long mantissa) but not the large dynamic signal range provided
by floating-point arithmetic. Further, the performance degradation due to
nonlinearity (rounding of products) is less severe in fixed-point arithmetic.
• Direct mapping: DSP algorithms rarely have many data-dependent branching
operations. This makes them ideal candidates for the direct mapping technique,
which creates a hardware structure that exactly matches the signal
flow graph of the algorithm. This implementation technique is particularly
suitable for systems with a fixed function, for example, digital filters and signal
transforms. Since the direct mapping approach provides a perfect match between the
DSP algorithm and the circuit architecture, it allows algorithm-level design
parameter tradeoffs and tuning. It thus reduces the overall design time and leads
to a more reliable design.
2.2 Performance metric
The design space can be represented as a triplet:
(Function, Performance, Constraints).
That is to say, a design has to deliver a function at a performance level under certain
constraints. For example, in signal processing, the function can be a digital filter
or a signal transform. The performance objectives are often time related, such as
the signal sampling rate, or processor requirements such as clock frequency or MIPS.
The constraint targets, on the other hand, are frequently associated with cost, such
as chip area, power consumption, design schedule, or a combination of these. In general, a
successful design is about the trade-off between the performance and the constraints
in order to realize a particular function. In this dissertation, we are interested in
developing architectures for real-time DSP applications, where computations must be
completed within a given time frame (for example, the signal sample period). In such
applications, unacceptable errors occur if the time limit is exceeded. Consequently,
general purpose or DSP processors that rely on concepts such as memory management
units, cache etc., which may have a high throughput but not necessarily a guaranteed
time for each computation, are unacceptable. On the other hand, without these
features, general purpose and DSP processors cannot provide the necessary speed.
We therefore focus on ASIC solutions in this work.
Architecture speed is intrinsically tied to its critical path delay. Reducing the overall
arithmetic complexity helps reduce the circuit area; however, it does not address the
critical path delay directly. For an algorithmic circuit, the critical path delay is determined
by the arithmetic operation speeds and the number of operations on the signal path. This
implies that on the critical path, one needs to favor simple and fast operations such as
additions over more complex and slow operations such as multiplications. This
is especially important for fixed-point arithmetic, since the intrinsic delay of a multiplier
is much greater than that of an adder. Fig. 2.2 shows an addition of 3 operands
using carry propagation adders (CPA). The delay of an n-bit CPA is nΔ,
where Δ is the delay of a full adder (FA). However, when these adders are cascaded to
perform successive additions, the delay introduced by each additional adder is only
Δ. This additional delay does not depend on the fixed-point word size. This is
in contrast with a multiplication operation, shown in Fig. 2.3 using a Pezaris array
multiplier. Note that the critical path delay of this multiplier is 2nΔ, where the n-th bit
is the operand sign. In addition, cascading another operation increases the delay by
up to nΔ, since even the (useful) least significant bit of the product may take up to
nΔ time to compute. Further, in any hardware implementation, the size and value
of the variables involved in a multiplication play a significant role in determining both the
area and the delay.
Figure 2.2: Critical path delay for an addition.
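The delay arithmetic in this passage can be captured in a toy model; the constants follow the text (an n-bit CPA costs nΔ, each cascaded addition adds only Δ, and an n-bit array multiplier costs 2nΔ), while the function names are ours:

```python
D = 1  # delay of one full adder (the text's unit delta), arbitrary units

def cascaded_add_delay(n_bits, num_adds):
    """Critical path of num_adds chained CPA additions on n-bit words:
    the first addition costs n_bits * D, each extra addition only D."""
    return n_bits * D + (num_adds - 1) * D

def array_mult_delay(n_bits):
    """Critical path of a Pezaris-style n-bit array multiplier: 2 * n * D."""
    return 2 * n_bits * D
```

For 16-bit words, chaining a second or third addition adds 1 unit each, while a single multiplication already costs 32 units, which is why the critical path should carry additions rather than multiplications.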
Clearly there is a distinction between designing for area and designing for speed.
In a design for area, the total number of arithmetic operations is of utmost importance,
whereas in a design for speed, the placement of arithmetic operations along the
computational path matters. For a fixed-point arithmetic ASIC, the total number of
multiplications tends to dominate the chip area, while the number of multiplications on
the critical path tends to dominate the delay, i.e., the circuit speed. To reduce the
number of operations on the critical path, the concurrency available in an algorithm can
Figure 2.3: Critical path delay for a multiplication.
be exploited by parallel processing techniques. On the other hand, to reduce the circuit
area, existing hardware units can be shared, potentially at the expense of circuit
speed.
Consider the following matrix-vector product.

    [ a  b ] [ x0 ]
    [ b  a ] [ x1 ]                                        (2.1)

Equation (2.1) can be implemented by three different flow graphs, shown in Figs. 2.4-2.6.
The hardware and time complexities of these implementations are listed in Table 2.1.
Figure 2.4: First implementation of (2.1).
Figure 2.5: Second implementation of (2.1).

[Figure content: flow graph using the precomputed coefficients (a+b)/2 and (a-b)/2.]

Figure 2.6: Third implementation of (2.1).
Table 2.1: Complexity and delay comparisons for the three implementations of (2.1). Note that M and A refer to multiplication and addition respectively.
Implementation Fig. 2.4 Fig. 2.5 Fig. 2.6
Arithmetic Complexity 4M+2A 4M+2A 2M+4A
Critical Delay M+A 2M+A M+2A
23
CHAPTER 2. ASIC METHODOLOGY AND MATHEMATICAL BACKGROUND
Even though the implementations in Figs. 2.4 and 2.5 have the same complexity,
their critical path delays could not be more different. Therefore complexity cannot be
the sole indicator of an algorithm design. This is especially important for real-time
applications. In a fixed-point arithmetic implementation, Fig. 2.4 is better because
for the same area (same total operations), its delay is shorter due to fewer
operations on the critical path. However, when a/b is a power of 2, the multiplication
by a/b can be realized as a left or right shift with sign extension. This is such
a trivial operation that its impact on area and delay is negligible. In this
case, the implementation of Fig. 2.5 is better than that of Fig. 2.4, since it has the
same delay but a smaller area.

The implementation in Fig. 2.6 can be considered a compromise between performance
and area. It has a smaller area but a slightly longer delay than the one
in Fig. 2.4. As we will see later in this chapter, both of these implementations belong
to the family of bilinear algorithms. Bilinear algorithms exploit the parallelism among
multiplication operations to the fullest extent, i.e., no multiplication is dependent on
another multiplication operation. Since they minimize the number of multiplications
along the critical path, bilinear architectures are very fast.
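Two of the flow graphs can be sketched as straight-line code. The bilinear variant uses 2 multiplications (by the precomputed coefficients (a+b)/2 and (a-b)/2) and 4 additions; the function names and the reading of Fig. 2.4 are ours, since the figures are not reproduced here:

```python
def impl_fig24(a, b, x0, x1):
    # Four parallel multiplications followed by two additions (delay M + A).
    return (a * x0 + b * x1, b * x0 + a * x1)

def impl_fig26(a, b, x0, x1):
    # Bilinear form: pre-additions, two independent multiplications by
    # precomputed coefficients, then post-additions (delay M + 2A).
    p = (a + b) / 2 * (x0 + x1)
    q = (a - b) / 2 * (x0 - x1)
    return (p + q, p - q)
```

Expanding p + q and p - q shows both return (a*x0 + b*x1, b*x0 + a*x1); the bilinear version trades two multiplications for two extra additions.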
Doing so, however, requires sufficient hardware resources. Fortunately,
modern ASIC technology provides an opportunity to accomplish this goal.
As the manufacturing process continues marching down to finer design geometries,
more designs are limited by the number of inputs and outputs (IO) than by the number
of transistors a die can support. This shift from capacity limitation to IO limitation
permits the use of more processing elements for parallel computing. At the same
time, it also brings up a new design constraint on the number of inputs and outputs.
In later chapters, we will show the effects of this new constraint on the performance
of the design.
2.3 Group theory
2.3.1 Group definition
A group is a nonempty set G together with an operation ∘ satisfying the following
conditions.

• Closure: If a, b ∈ G, then a ∘ b ∈ G.

• Associativity: For all a, b, c ∈ G, (a ∘ b) ∘ c = a ∘ (b ∘ c).

• Identity: There exists an element e ∈ G such that for all a ∈ G, a ∘ e = a = e ∘ a.

• Inverse: For each a ∈ G, there exists an a⁻¹ ∈ G such that a ∘ a⁻¹ = e = a⁻¹ ∘ a.

In this dissertation, our groups will be sets of integers with the operation of multiplication
modulo N. This is because many signal processing algorithms use trigonometric
functions, which are inherently periodic, thus providing the reasoning for the modulo
base of our operation.
A group G is cyclic if it can be generated by one element g ∈ G. A cyclic group
of n elements is represented as Cn. Thus

    Cn = {1, g, g^2, ..., g^(n-1)}.

In most cases, the choice of a generator is not unique.
If G and H are both groups, then we can construct a new group F = G × H,
called the direct product of G and H. The elements of G × H are pairs (a, b) where
a ∈ G and b ∈ H. The operation between two pairs is done component-wise in the
two individual groups. But if G and H have the same group operation, then the elements
of F = G × H are {a ∘ b | a ∈ G, b ∈ H}. For example, if G = {1, g} with g^2 mod N = 1,
and H = {1, h, h^2} with h^3 mod N = 1, then F = G × H = {1, h, h^2, g, gh, gh^2},
where the products are computed by component-wise multiplication modulo N.
Two groups F and F' are isomorphic if there exists a one-to-one correspondence
ψ between the elements of the two groups and ψ preserves the group operation, i.e.,
ψ(g1 ∘ g2) = ψ(g1) ∘ ψ(g2). For the groups G and H defined in the previous example, choose
gh^2 as the generator. We can define a new group F' = {1, gh^2, h, g, h^2, gh}. One
can verify that the groups F and F' are isomorphic. Note that computations expressed
over one group can always be translated over to an equivalent group without extra
computational effort.
In an Abelian group, the operation ∘ is commutative, that is, for all a, b ∈ G, a ∘ b = b ∘ a.
The positive integers that are less than N and relatively prime to N form an Abelian
group under the operation of multiplication modulo N. We will denote this group by
A(N). From the fundamental theorem of finite Abelian groups, every finite Abelian group is
a product of cyclic groups. In particular, A(N) can be decomposed according to the
following rules.

• A(r1·r2) = A(r1) × A(r2) when gcd(r1, r2) = 1.

• A(p^n) = C_((p-1)·p^(n-1)) when p is an odd prime.

• A(2^n) = C_2 × C_(2^(n-2)) when n ≥ 3, A(4) = C_2, and A(2) = {1}.

In addition, for a cyclic group,

• C_(r1·r2) = C_(r1) × C_(r2) when gcd(r1, r2) = 1.
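The decomposition rules above are easy to check numerically; a minimal sketch (function names ours) builds A(N) and tests whether it is cyclic by looking for an element whose order equals the group size:

```python
from math import gcd

def A(N):
    """The multiplicative group A(N): integers in [1, N) coprime to N."""
    return [a for a in range(1, N) if gcd(a, N) == 1]

def order(a, N):
    """Multiplicative order of a modulo N."""
    k, x = 1, a % N
    while x != 1:
        x = (x * a) % N
        k += 1
    return k

def is_cyclic(N):
    """A(N) is cyclic iff some element generates it, i.e. has full order."""
    G = A(N)
    return any(order(a, N) == len(G) for a in G)
```

For example, A(9) has (3-1)·3 = 6 elements and is cyclic, as the odd-prime-power rule predicts, while A(16) has 8 elements but no generator, consistent with A(2^n) = C_2 × C_(2^(n-2)).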
2.3.2 Cyclic and Hankel matrix products
Many signal processing algorithms rely heavily on matrix multiplication. This is
especially true in the time-frequency analysis of signals. A two-dimensional kernel
matrix generally involves the time and frequency indices. We can reorder the indices
and transform the kernel (perhaps in parts) to certain desirable forms. Using efficient
algorithms for these parts, one can then obtain efficient implementations for DSP
applications. This section explains the desirable matrix forms we will use heavily in
this dissertation.
The most common matrix we use is a Hankel matrix. A general form of an N-point
cyclic matrix-vector product can be expressed as

    [ X0   ]   [ c0    c1    c2    ...  cN-1 ] [ x0   ]
    [ X1   ]   [ c1    c2    c3    ...  c0   ] [ x1   ]
    [ X2   ] = [ c2    c3    c4    ...  c1   ] [ x2   ]          (2.2)
    [ ...  ]   [ ...   ...   ...        ...  ] [ ...  ]
    [ XN-1 ]   [ cN-1  c0    c1    ...  cN-2 ] [ xN-1 ]

Note that each row of this Hankel matrix is a left-rotated version of its previous row.
A Hankel matrix is symmetric. Equation (2.2) can also be rewritten by permuting its
columns and the elements of the x vector as:

    [ X0   ]   [ c0    cN-1  cN-2  ...  c1 ] [ x0   ]
    [ X1   ]   [ c1    c0    cN-1  ...  c2 ] [ xN-1 ]
    [ X2   ] = [ c2    c1    c0    ...  c3 ] [ xN-2 ]            (2.3)
    [ ...  ]   [ ...   ...   ...        ...] [ ...  ]
    [ XN-1 ]   [ cN-1  cN-2  cN-3  ...  c0 ] [ x1   ]

The matrix obtained in (2.3) is a cyclic matrix. A cyclic matrix is, in general, not
symmetric. There are efficient bilinear algorithms available for computing (2.3). Clearly, all these
algorithms (with a trivial transformation) can be applied to computing (2.2) as well.
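The column permutation taking (2.2) to (2.3) is easy to verify numerically. A minimal sketch (function names ours, following this section's convention that the "Hankel" form has left-rotated rows and the cyclic form is c[(i - j) mod N]):

```python
def hankel_form(c):
    """Matrix of (2.2): row i is the coefficient vector c rotated left by i."""
    N = len(c)
    return [[c[(i + j) % N] for j in range(N)] for i in range(N)]

def cyclic_form(c):
    """Circulant matrix of (2.3): entry (i, j) is c[(i - j) mod N]."""
    N = len(c)
    return [[c[(i - j) % N] for j in range(N)] for i in range(N)]

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]
```

Permuting the input as in (2.3), i.e. x0 followed by the remaining elements in reverse, makes the two products identical, so any bilinear algorithm for one form serves the other.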
A kernel matrix (or part of it) can be turned into a Hankel matrix if the indices
involved form a cyclic group and if the matrix element M(i, j) = φ(i ∘ j) for a chosen
function φ. To illustrate this, consider a 7-point Fourier transform kernel with indices
restricted to the range 1 ≤ i, j ≤ 6. This set of indices forms a cyclic group A(7) = C6 =
{1, 3, 2, 6, 4, 5} under the operation of multiplication modulo 7, where the elements of
the group are ordered as powers of its generator 3. Further, M(i, j) = W^(ij) = φ(i ∘ j),
where φ(t) = W^t and ∘ is the group operation of multiplication modulo 7. Using the
index order dictated by the group element order, we can see that the partial Fourier
matrix can be written as

    [ X1 ]   [ W^1  W^3  W^2  W^6  W^4  W^5 ] [ x1 ]
    [ X3 ]   [ W^3  W^2  W^6  W^4  W^5  W^1 ] [ x3 ]
    [ X2 ] = [ W^2  W^6  W^4  W^5  W^1  W^3 ] [ x2 ]          (2.4)
    [ X6 ]   [ W^6  W^4  W^5  W^1  W^3  W^2 ] [ x6 ]
    [ X4 ]   [ W^4  W^5  W^1  W^3  W^2  W^6 ] [ x4 ]
    [ X5 ]   [ W^5  W^1  W^3  W^2  W^6  W^4 ] [ x5 ]
Clearly the matrix in (2.4) is a Hankel matrix. Further, since C6 = C2 × C3 =
{1, 3} × {1, 2, 4} = {1, 2, 4, 3, 6, 5}, one can reorder the indices to this new order to
see that the original Hankel matrix of (2.4) can be partitioned into 3 × 3 blocks such
that each block is a Hankel matrix and the structure formed by the blocks is a 2 × 2
Hankel matrix. This is shown in (2.5).
    [ X1 ]   [ W^1  W^2  W^4  W^3  W^6  W^5 ] [ x1 ]
    [ X2 ]   [ W^2  W^4  W^1  W^6  W^5  W^3 ] [ x2 ]
    [ X4 ] = [ W^4  W^1  W^2  W^5  W^3  W^6 ] [ x4 ]          (2.5)
    [ X3 ]   [ W^3  W^6  W^5  W^2  W^4  W^1 ] [ x3 ]
    [ X6 ]   [ W^6  W^5  W^3  W^4  W^1  W^2 ] [ x6 ]
    [ X5 ]   [ W^5  W^3  W^6  W^1  W^2  W^4 ] [ x5 ]
Both cyclic and Hankel matrices are important in this dissertation. We will review
efficient bilinear algorithms for cyclic and Hankel matrix products in the next section.
Once bilinear algorithms are established for arbitrarily sized cyclic convolutions and
Hankel products, we will demonstrate in later chapters that for the signal processing
applications of interest, indices forming groups can be extracted and the
computations thereby reduced to Hankel products.
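The index reordering behind (2.4) can be checked mechanically: ordering rows and columns as powers of the generator 3 in A(7) makes the exponent matrix (i·j mod 7) left-rotate from row to row. A small sketch (names ours):

```python
# Index both rows and columns of the 7-point Fourier kernel by powers of
# the generator 3 in A(7); the matrix of exponents (i*j mod 7) then has
# each row equal to a left rotation of the previous one, the "Hankel"
# structure used in this section.
idx = [pow(3, k, 7) for k in range(6)]            # [1, 3, 2, 6, 4, 5]
M = [[(i * j) % 7 for j in idx] for i in idx]     # exponents of W

def is_left_rotated(M):
    """True if every row is the previous row rotated left by one position."""
    return all(M[r] == M[r - 1][1:] + M[r - 1][:1] for r in range(1, len(M)))
```

This works for any prime p with a chosen generator, which is what lets the kernel's nontrivial indices be folded into a single Hankel product.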
2.4 Bilinear algorithm

A typical bilinear algorithm has three sequential computing stages: pre-addition,
multiplication, and post-addition. A key characteristic of the bilinear architecture is that
all the multiplications therein are independent and can be executed concurrently.
Since there is only one multiplication along the critical data path, structured
bilinear hardware has the potential to achieve ultra-fast VLSI implementations.

Bilinear algorithms were studied earlier by Rader [42] for prime-length DFTs
and by Winograd [61] for certain short composite-length DFTs. Recently, bilinear
algorithms have been proposed for the more general length-2^n DFT [55]. The process of
converting a transform to a bilinear algorithm consists of two steps. First, group
theory is used to identify convolution structures within the transform kernel matrix.
Typically one looks for cyclic and Hankel sub-matrices within the kernel. Then, using
bilinear algorithms for these cyclic and Hankel matrix-vector multiplications, a bilinear
algorithm is obtained for the complete transform.
2.4.1 Recursive decomposition

The basic building blocks for our targeted applications are the bilinear algorithms
for cyclic convolutions and Hankel products of prime power lengths. For small prime
lengths, these algorithms are well documented [3]. Bilinear algorithms for large prime
lengths can be derived as well [54]. A general cyclic matrix of any size can always be
decomposed into smaller cyclic matrices of relatively prime lengths. Similarly, a large
Hankel matrix can be decomposed into smaller Hankel matrices. In addition, a
cyclic or Hankel matrix of prime power length can be decomposed into smaller cyclic
and Hankel matrices for which bilinear algorithms are readily available.
Consider the following 4-point cyclic convolution as an example.

    [ y0 ]   [ c0  c1  c2  c3 ] [ x0 ]
    [ y1 ]   [ c1  c2  c3  c0 ] [ x1 ]
    [ y2 ] = [ c2  c3  c0  c1 ] [ x2 ]          (2.6)
    [ y3 ]   [ c3  c0  c1  c2 ] [ x3 ]

Let

    Y0 = [ y0 ],  Y1 = [ y2 ],  X0 = [ x0 ],  X1 = [ x2 ],
         [ y1 ]        [ y3 ]        [ x1 ]        [ x3 ]

and define the Hankel matrices

    A = [ c0  c1 ],   B = [ c2  c3 ].
        [ c1  c2 ]        [ c3  c0 ]

Equation (2.6) can then be rewritten as a 2-point block cyclic convolution:

    [ Y0 ]   [ A  B ] [ X0 ]
    [ Y1 ] = [ B  A ] [ X1 ]                    (2.7)

Applying the 2-point bilinear algorithm for cyclic convolution (Fig. 2.6) to (2.7), one
gets

    Y0 = ((A + B)(X0 + X1) + (A - B)(X0 - X1))/2
    Y1 = ((A + B)(X0 + X1) - (A - B)(X0 - X1))/2.          (2.8)

The block coefficient matrices are:

    A + B = [ c0 + c2   c1 + c3 ],
            [ c1 + c3   c0 + c2 ]
                                                           (2.9)
    A - B = [ c0 - c2     c1 - c3    ].
            [ c1 - c3   -(c0 - c2)   ]
Therefore the block multiplication with (A + B) is again a 2-point cyclic matrix
multiplication, requiring 2 multiplications and 4 additions in a bilinear algorithm.
The multiplication with (A - B), however, is a 2-point Hankel product, and its bilinear
algorithm needs 3 multiplications and 3 additions. Putting the blocks together, the
complete bilinear algorithm shown in Fig. 2.7 is obtained. The net arithmetic
complexity of a bilinear algorithm for (2.6) is therefore 5 multiplications and 15
additions, and the critical path delay is one multiplication and 4 additions.

In this dissertation, we employ this kind of recursive decomposition approach
repeatedly. A key aspect of the approach is that only a small number of bilinear
algorithms are needed in order to obtain the solution to a much larger problem size. The
small number of required algorithms means more efficient design reuse for both software
and hardware. When sizes are parameterized, it can reduce the overall code size
and shorten the design schedule, since the verification effort is reduced substantially.
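The decomposition of (2.6)-(2.9) can be written out directly. This sketch uses a standard 2-multiplication scheme for the 2-point cyclic block (A + B) and one common 3-multiplication scheme for the 2-point Hankel block (A - B), for 5 multiplications in total; the particular Hankel scheme is our choice and need not match Fig. 2.7 exactly:

```python
def cyclic2(p, q, u, v):
    """2-point cyclic product [[p,q],[q,p]] @ [u,v] with 2 multiplications."""
    m1 = (p + q) / 2 * (u + v)
    m2 = (p - q) / 2 * (u - v)
    return (m1 + m2, m1 - m2)

def hankel2(p, q, r, u, v):
    """2-point Hankel product [[p,q],[q,r]] @ [u,v] with 3 multiplications."""
    m1 = q * (u + v)
    m2 = (p - q) * u
    m3 = (r - q) * v
    return (m1 + m2, m1 + m3)

def cyclic4(c, x):
    """4-point product of (2.6) via the block decomposition (2.7)-(2.9)."""
    c0, c1, c2, c3 = c
    x0, x1, x2, x3 = x
    s = (x0 + x2, x1 + x3)                       # X0 + X1
    d = (x0 - x2, x1 - x3)                       # X0 - X1
    # (A+B)/2 applied to X0+X1: 2-point cyclic block
    p = cyclic2((c0 + c2) / 2, (c1 + c3) / 2, *s)
    # (A-B)/2 applied to X0-X1: 2-point Hankel block [[p,q],[q,-p]]
    q = hankel2((c0 - c2) / 2, (c1 - c3) / 2, -(c0 - c2) / 2, *d)
    # (2.8): Y0 = p + q, Y1 = p - q
    return (p[0] + q[0], p[1] + q[1], p[0] - q[0], p[1] - q[1])
```

The division by 2 in (2.8) is folded into the precomputed block coefficients, so no extra run-time operations are introduced.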
2.4.2 Order of computation

When a larger bilinear algorithm is decomposed into two or more smaller algorithms,
it is important to determine the order of decomposition which provides the minimum
complexity. A bilinear algorithm can be characterized as a triplet (n, a, m), where n
Figure 2.7: A bilinear algorithm for the 4-point cyclic convolution.
is the length of the input vector, a its additive complexity, and m its multiplicative
complexity.

Consider designing a bilinear algorithm of n1·n2 points using two bilinear algorithms
described as (n1, a1, m1) and (n2, a2, m2). We can partition the input and
output vectors into n1 sub-vectors, each having n2 points. The computation
resembles the algorithm (n1, a1, m1) at the block level. The a1 additions
are duplicated n2 times, once for each component of the sub-vectors, resulting in n2·a1
additions. There are m1 block multiplications, each resolved with the algorithm
(n2, a2, m2). The block multiplications in total require m1·a2 additions
and m1·m2 multiplications. Thus the arithmetic complexity of the algorithm for this
n1 × n2 decomposition is m1·m2 multiplications and n2·a1 + m1·a2 additions.

Alternately, using the decomposition n2 × n1, the computation resembles the
algorithm (n2, a2, m2) at the block level. The a2 additions are duplicated n1 times,
resulting in n1·a2 additions. There are m2 block multiplications, each resolved with
the algorithm (n1, a1, m1). The block multiplications in total require
m2·a1 additions and m1·m2 multiplications. Thus the arithmetic complexity of this
n2 × n1 decomposition is m1·m2 multiplications and n1·a2 + m2·a1 additions. A
comparison of the two decomposition strategies is shown in Table 2.2.
It is important to note that the multiplicative complexity is identical for both
Table 2.2: Arithmetic complexity of different decomposition orders.

Decomposition | Multiplication | Addition
n1 × n2       | m1·m2          | n2·a1 + m1·a2
n2 × n1       | m1·m2          | n1·a2 + m2·a1
decomposition orders. The difference is in the additive complexity. If the algorithm
decomposition order n1 × n2 has a lower additive complexity than n2 × n1 does, then
(n2·a1 + m1·a2) < (n1·a2 + m2·a1), which can be further simplified as

    (m1 - n1)/a1 < (m2 - n2)/a2.          (2.10)

Since an algorithm with a lower computational complexity is desirable, we can thus
use the value of (m - n)/a to determine the decomposition order of a bilinear algorithm
(n, a, m).
As an example, consider the computation of a 6-point cyclic convolution. This
may be expressed either as a 2-point block cyclic convolution where each block is a 3-
point cyclic convolution, or alternately as a 3-point block cyclic convolution with each
block being a 2-point cyclic convolution. The characteristic of the bilinear algorithm
for a 2-point cyclic matrix is (2, 4, 2) and that for the 3-point arbitrary cyclic matrix
is (3, 11, 4). Applying (2.10), the value of (m − n)/a for C₂ is 0 and that for C₃ is 1/11.
Therefore the decomposition order C₂ × C₃ results in a lower complexity than that
of C₃ × C₂.
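The bookkeeping behind this comparison can be sketched in a few lines of Python (the function name and tuple convention are ours, not from the text):

```python
# Cost of nesting two bilinear algorithms (n1, a1, m1) and (n2, a2, m2),
# with the first applied at the block level (decomposition n1 x n2):
# the a1 block additions are duplicated n2 times, and each of the m1
# block products costs a2 additions and m2 multiplications.
def combined_cost(outer, inner):
    n1, a1, m1 = outer
    n2, a2, m2 = inner
    return m1 * m2, n2 * a1 + m1 * a2   # (multiplications, additions)

C2 = (2, 4, 2)    # 2-point cyclic convolution, characteristic (n, a, m)
C3 = (3, 11, 4)   # 3-point cyclic convolution

print(combined_cost(C2, C3))   # C2 x C3 order
print(combined_cost(C3, C2))   # C3 x C2 order
```

Both orders use 8 multiplications, but C₂ × C₃ needs 34 additions against 38 for C₃ × C₂, matching the (m − n)/a rule.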
In Table 2.3, commonly used 2-, 3- and 5-point bilinear algorithms are listed
according to their (m − n)/a values. When two algorithms have the same complexity
value, their order of combining can be chosen arbitrarily.
2.4. BILINEAR ALGORITHM
Table 2.3: Determining the order of combining some bilinear algorithms. Note that matrices identified with * have the special property that the sum of all elements in the first row is 0.

  Matrix type | Size n | Multiplications m | Additions a | Value (m − n)/a
  ------------|--------|-------------------|-------------|----------------
  Circular    |   2    |         2         |      4      |       0
  Circular*   |   3    |         3         |      6      |       0
  Circular    |   3    |         4         |     11      |      1/11
  Hankel      |   3    |         5         |     16      |      1/8
  Circular*   |   5    |         9         |     22      |      2/11
  Hankel      |   2    |         3         |      3      |      1/3
  Hankel      |   5    |        14         |     27      |      1/3
Chapter 3
Discrete Hartley transform
This chapter develops bilinear algorithms for the discrete Hartley transform (DHT)
of pⁿ points for a prime p. Using a group theoretic approach, we show that the DHT
kernel matrix can be recursively transformed into cyclic and Hankel sub-matrices.
By using bilinear algorithms for cyclic convolution and the Hankel product, one can
then obtain bilinear algorithms for the DHT. Bilinear algorithms ensure the highest
computational speeds in dedicated hardware. We have implemented in VLSI our new
algorithms for 2ⁿ- and 3ⁿ-point DHTs as well as the ones available in the literature. We
find that our algorithms have a speed advantage of 20%-30% over the others.
3.1 Background and prior work
The N-point discrete Hartley transform (DHT) of a sequence {x(i)} is defined as [6]

X(k) = Σ_{i=0}^{N−1} x(i) cas(2πik/N),   k = 0, 1, ..., N − 1,   (3.1)

where cas(α) = cos(α) + sin(α). The DHT is a real-valued transform whose forward
and inverse transforms share the same kernel (except for a scaling factor), and it is
useful for obtaining convolutions of real sequences. It has also been used in many
applications in the fields of spectral analysis [57], error control coding [62], data
compression [64], and optics and microwave engineering [6,7]. Further, it has been shown that the fast
Hartley transform (FHT) has the fastest realization of the DFT when implemented
across a variety of general purpose processor platforms [20].
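For reference, the transform (3.1) can be computed directly from the definition; this O(N²) sketch (ours, for illustration only) is useful for checking the fast algorithms discussed later:

```python
import math

def dht(x):
    """Direct N-point discrete Hartley transform, per definition (3.1)."""
    N = len(x)
    cas = lambda a: math.cos(a) + math.sin(a)
    return [sum(x[i] * cas(2 * math.pi * i * k / N) for i in range(N))
            for k in range(N)]

def idht(X):
    """Inverse DHT: the same kernel scaled by 1/N."""
    return [v / len(X) for v in dht(X)]
```

Applying idht(dht(x)) returns x (up to rounding), reflecting the shared forward/inverse kernel noted above.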
Most of the algorithms for the DHT target the von Neumann architecture. Since all
the operations in such general purpose computers are sequential, the performance of
these algorithms is evaluated by means of their arithmetic complexity and the number
of memory accesses [2,4,21,28,58]. However, the recent progress of VLSI technology
has now made it possible to develop cost effective dedicated Application Specific
Integrated Circuits (ASICs) for signal processing applications. The ready availability of
low cost field programmable gate arrays (FPGAs) has made such hardware solutions
practical even for low volume applications. Unfortunately, efficient algorithms developed
for general purpose computers, such as the split-radix algorithm [2,4], separate a
2ⁿ-point DHT into n serial computation stages, each involving multiplications. Thus,
if converted to hardware, the critical path of these algorithms involves several
multiplications one after the other. Fig. 3.1 shows a typical computational flow graph of a
2ⁿ-point DHT using a split-radix algorithm.
Figure 3.1: Flow graph for the Ref. [2] implementation of the 2ⁿ-point DHT (rotation and re-ordering stages). Note that each rotation can also be implemented with 3 multiplications and 3 additions. The indices and coefficients are: i′, k′ = 0, 1, ..., N/2 − 1; C(j) = cos(2πj/N) and S(j) = sin(2πj/N), where j = 0, 1, ..., N/4 − 1.
Split-radix algorithms have also been proposed for the 3ⁿ-point DHT [1,29,65]. The
signal flow graph of a 3ⁿ-point DHT [65] is shown in Fig. 3.2. As can be seen from
the figure, these algorithms require additional computations to separate indices into
three or nine evenly divided bands. Hence these algorithms are quite different from
those of the 2ⁿ-point DHT. However, similar to the algorithms for the 2ⁿ-point DHT, these
3ⁿ-point DHT algorithms also have several multiplication stages on the critical path,
thus limiting the performance in hardware implementations.
Figure 3.2: Flow graph for the Ref. [65] implementation of the 3ⁿ-point DHT (three length-N/3 DHTs producing X(3k′), X(3k′+1) and X(3k′+2) after rearrangement). Note that each rotation can also be implemented with 3 multiplications and 3 additions. The indices and coefficients are: i′, k′ = 0, 1, ..., N/3 − 1; C(i′) = cos(2πi′/N) and S(i′) = sin(2πi′/N).
Previous hardware implementations of the DHT include algorithms using FFT-like
butterfly structures [19,26], CORDIC schemes [13], multiplier arrays [11], bit-serial
architectures [22], transversal filters [5] and distributed arithmetic (DA) [23]. All of
these implementations, except the last one, have more than one multiplication stage
along the critical path. The distributed arithmetic design does not use multipliers.
Instead, multiplications are converted into a memory-based look-up table. The DA
approach can theoretically compute a DHT of any length; however, both the transform
length and the algorithm speed are limited by memory size and speed.
The inherent delay of the pⁿ-point DHT algorithm developed here consists of only
one multiplication and a few additions. The primary focus of the algorithm
development is on partitioning the transform kernel into cyclic convolutions and Hankel
products and then realizing these with bilinear algorithms. Even though the method
is more general, we illustrate it here for lengths of the form 2ⁿ and 3ⁿ. The ASIC
designs were synthesized and compared with realizations of the most current DHT
algorithms.
The rest of this chapter is organized as follows. Section 3.2 describes a group
theoretic DHT kernel partitioning. Section 3.3 presents the details of the proposed
algorithms for the 2ⁿ-point DHT. Performance of the ASIC implementations is presented
in Section 3.4. Proposed algorithms for the 3ⁿ-point DHT are discussed in Section 3.5,
with a performance analysis following in Section 3.6. Finally, Section 3.7 provides our
concluding remarks.
3.2 Partitioning the DHT of prime power lengths

Let N = pⁿ, where p is a prime number. We will use the symbol Xₙ to indicate
a transform of length pⁿ. Define the set A(N) = {0 < i < N | gcd(i, N) = 1}. We
compute the transform components depending on whether or not the indices belong to A(N).
The proposed computation of the DHT based on A(N) is shown in Fig. 3.3.
Figure 3.3: Flow chart for the pⁿ-point bilinear DHT. A pre-addition stage forms z(i) = Σ_{j=0}^{p−1} x(i + jN/p), i = 0, 1, ..., N/p − 1, feeding a DHT of length N/p for k ∉ A(N); the Hartley group transform HGT_N produces the components with k ∈ A(N).
Consider first the computation of Xₙ(k), k ∉ A(N), i.e., when k is a multiple of
p. In this case, as definition (3.1) shows, each x(i + (N/p)j), 0 ≤ j < p, multiplies
the same cas(2πik/N). Define a sequence

z(i) = Σ_{j=0}^{p−1} x(i + (N/p)j),   i = 0, 1, ..., (N/p) − 1.

It is then obvious that

Xₙ(pk) = Zₙ₋₁(k).

Thus the DHT components with indices that are multiples of p can be computed
directly from the DHT of a sequence {z(i)} of smaller length N/p.
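This folding identity Xₙ(pk) = Zₙ₋₁(k) is easy to confirm numerically (direct-evaluation sketch, ours):

```python
import math

def dht(x):
    # Direct DHT per definition (3.1)
    N = len(x)
    cas = lambda a: math.cos(a) + math.sin(a)
    return [sum(x[i] * cas(2 * math.pi * i * k / N) for i in range(N))
            for k in range(N)]

N, p = 9, 3
x = [float(i * i % 7) for i in range(N)]    # arbitrary real test sequence
# z(i) = sum_j x(i + (N/p) j), the pre-addition of Fig. 3.3
z = [sum(x[i + (N // p) * j] for j in range(p)) for i in range(N // p)]

X, Z = dht(x), dht(z)
assert all(abs(X[p * k] - Z[k]) < 1e-9 for k in range(N // p))
```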
To compute Xₙ(k), k ∈ A(N), we use the fact that A(N) forms a group under
the operation of multiplication modulo N. We refer to this computation of the Hartley
transform with the transform indices restricted to a group as the pⁿ-point Hartley
Group Transform, HGT_N. Thus we have

HGT_N(k) = Σ_{i=0}^{N−1} x(i) cas(2πik/N),   k ∈ A(N).   (3.2)
We separate (3.2) into two summations depending on whether the signal index i belongs to
A(N) or not. The results of these are combined later using |A(N)| additions. When
i ∈ A(N), we can permute the signal and transform components to convert the partial
kernel to a cyclic matrix (when p is odd) or a direct product of cyclic groups (when
p = 2). This permutation and computation thus depends on the group structure and
is illustrated in Section 3.3 for N = 2ⁿ and in Section 3.5 for N = 3ⁿ.
When k ∈ A(N) but i ∉ A(N), i is a multiple of p. As a result, the contribution of
these components to each HGT_N(k + (N/p)j), 0 ≤ j < p, in (3.2) is identical. It is
therefore sufficient to compute this contribution to HGT_N(k) only for k ∈ A(N),
0 < k < N/p, i.e., k ∈ A(N/p). However,

Σ_{i ∉ A(N)} x(i) cas(2πik/N) = Σ_{i=0}^{N/p−1} x(pi) cas(2πik/(N/p)),   k ∈ A(N/p),
                              = HGT_{N/p}(k),   k ∈ A(N/p).   (3.3)

Note that HGT_{N/p} in (3.3) represents the Hartley group transform of the length-N/p
sequence {x(pi)}, 0 ≤ i < N/p.
This decomposition of HGT_N is shown in Fig. 3.4. It shows that the computation
of HGT_N breaks down into two independent computations, one involving a cyclic
convolution (or a multi-dimensional cyclic convolution) and the other, the transform
HGT_{N/p}. HGT_{N/p} can itself be similarly decomposed into a smaller sized convolution
and HGT_{N/p²}. Since all the resultant convolutions can be done concurrently, one can
obtain a bilinear algorithm for the DHT from bilinear algorithms for cyclic convolutions.
The proposed bilinear DHT flow can also be visualized from the kernel matrix
perspective. In Fig. 3.5, each solid box represents a circular matrix associated with
an Abelian group that needs to be computed. The circular matrices are identical if
they are of the same size. These matrices are the basic computing blocks for our
bilinear algorithms. Clearly there are plenty of redundant coefficients in the DHT kernel
Figure 3.4: Hartley group transform for the pⁿ-point bilinear DHT. Signal components x(i), i ∈ A(N), feed a multi-dimensional cyclic convolution producing Y(k), k ∈ A(N); components x(i), i ∉ A(N), feed HGT_{N/p}, producing X′(k′), k′ ∈ A(N/p); a post-addition then forms X(k) = Y(k) + X′(k mod N/p), k ∈ A(N).
matrix. The presented matrix division method recursively identifies and removes
this inherent redundancy, and thus achieves a higher computational efficiency. Only
100(p − 1)²/(p² − 1) percent of the original DHT kernel matrix needs to be computed.
This computational load can be further reduced using bilinear algorithms.
In a fully parallel ASIC architecture, the algorithms for small circular matrices
are deployed in many places. The DHT kernel matrix is therefore an ideal candidate
for folding, done in such a way that the computing module for circular matrices of
the same size is shared to reduce the overall hardware cost. At one extreme, an
implementation using the bilinear algorithm needs only one module instantiation per
circular-matrix size.
3.3 Bilinear algorithm for 2ⁿ-point DHT

As seen in Section 3.2, partitioning the signal and transform indices using the set
A(N) allows one to compute some parts of the N-point DHT recursively. However,
when both indices are restricted to A(N), the structure of the computation is
governed by the structure of the group A(N). Recall that N = pⁿ and that the group
A(N) uses the operation of multiplication modulo N. It is known that when p is an
odd prime, A(N) is a cyclic group C_{(p−1)N/p} of (p − 1)N/p elements, and if p = 2,
A(N) is isomorphic to C₂ × C_{N/4}. We will focus in this section on lengths N = 2ⁿ,
which have a greater applicability.
Figure 3.5: Kernel matrix group division of the pⁿ-point DHT. The rows and columns are partitioned according to whether k ∈ A(N) and i ∈ A(N); the i ∈ A(N) portion splits into identical circular blocks, and the remaining indices i = (p−1)N/p, ..., N−1 are handled recursively.
When N = 2ⁿ, the group A(N) has the structure C₂ × C_{N/4}. Let the group C_{N/4} =
{1, g, g², ..., g^{N/4−1}}, where g^{N/4} mod N = 1. Similarly, let C₂ = {1, h}, where
h ∉ C_{N/4} and h² mod N = 1. Each element of A(N) can then be expressed as a
product of an element of C_{N/4} and an element of C₂. It is known that it is always
possible to find the generators g and h of the cyclic groups described here. These
generators are not unique. In the following discussion of the 2ⁿ-point DHT, we choose
g = 3 and h = N/2 − 1. We first state and prove three lemmas which help in the
development of the algorithm. The first of these lemmas shows that 3 is in
fact a generator of C_{N/4} under multiplication modulo N.
Lemma 1 Let N = 2ⁿ, n ≥ 3. The order of 3 in A(N) is N/4.

Proof. When N = 2ⁿ, |A(N)| = 2ⁿ⁻¹. Thus the order of 3 in A(N) is 2^j for the smallest
j satisfying

3^{2^j} ≡ 1 mod 2ⁿ, i.e., 3^{2^j} = 1 + q 2ⁿ, for some integer q.   (3.4)

We now prove by mathematical induction that j = n − 2 and that the quotient q is odd.
These statements can be directly verified for n = 3. Now assume that they are true
for some n, i.e., the smallest j satisfying (3.4) is j = n − 2 and the q obtained in (3.4) is
odd. To prove the statements for n + 1, let the order of 3 in A(2ⁿ⁺¹) be denoted by
2^k, i.e., k is the smallest value of j such that

3^{2^j} = 1 + q′ 2ⁿ⁺¹, for some integer q′.   (3.5)
Clearly any j satisfying (3.5) cannot be smaller than n − 2 because, otherwise, one
could reduce both sides of (3.5) modulo 2ⁿ to contradict the assumption that the smallest
value of j satisfying (3.4) is n − 2. Further, j in (3.5) cannot equal n − 2 either, because
if it did, then (3.4) and (3.5) would give

3^{2^{n−2}} = 1 + q 2ⁿ = 1 + q′ 2ⁿ⁺¹,

contradicting the assumption that q is odd. Thus no j ≤ n − 2 satisfies (3.5).
Finally, setting j = n − 2 in (3.4) and squaring both sides gives

3^{2^{n−1}} = 1 + (q + q² 2ⁿ⁻¹) 2ⁿ⁺¹.   (3.6)

Equation (3.6) shows that j = n − 1 satisfies (3.5). From the preceding discussion, this
is the smallest such value of j. By comparing (3.5) and (3.6) one gets

q′ = q(1 + q 2ⁿ⁻¹),

showing that q′ is an odd integer. ∎
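Lemma 1 can be checked by brute force for moderate n (a quick sketch, ours):

```python
def order_mod(a, n):
    """Multiplicative order of a modulo n (assumes gcd(a, n) == 1)."""
    k, x = 1, a % n
    while x != 1:
        x = (x * a) % n
        k += 1
    return k

# The order of 3 in A(2^n) is 2^(n-2) = N/4 for every n >= 3.
for n in range(3, 16):
    N = 1 << n
    assert order_mod(3, N) == N // 4
```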
Lemma 2 Let N = 2ⁿ, n ≥ 4. Then

3^{N/8} = N/2 + 1 mod N.   (3.7)

Proof. We will prove this lemma by mathematical induction over n. For n = 4, one
can verify the result directly. Assume it is true for n. Equation (3.7) can be written
as

3^{N/8} = N/2 + 1 + qN,   (3.8)

where q is some integer. To prove the result for n + 1, we square both sides of (3.8)
to get

3^{2N/8} = N²/4 + 1 + q²N² + N + qN² + 2qN
         = (2N)/2 + 1 + (N/8 + q²N/2 + qN/2 + q)(2N).   (3.9)

Since (N/8 + q²N/2 + qN/2 + q) is an integer, (3.9) gives

3^{2N/8} = (2N)/2 + 1 mod (2N),

showing that the result is true for n + 1 if it is true for n. ∎
Lemma 1 shows that when N = 2ⁿ, and therefore A(N) = C₂ × C_{N/4}, 3 is a
generator of C_{N/4}. Clearly, there is only one element in this cyclic group which has
order 2. Lemma 2 shows that this order-2 element is (N/2 + 1).
Thus the generator h of the splitting subgroup C₂ of A(N) can be chosen to be N/2 − 1,
which lies outside C_{N/4}. It is obvious that N/2 − 1 has order 2 under multiplication
modulo N. For N = 8, A(N) = C₂ × C₂, with generators 3 and 5.
Lemma 3 Let N = 2ⁿ, n ≥ 5. Then

3^{N/16} = N/4 + 1 mod N.   (3.10)

Proof. We will prove this lemma by mathematical induction over n. For n = 5, one
can verify the result directly. Assume it is true for n. Equation (3.10) can be written
as

3^{N/16} = N/4 + 1 + qN,   (3.11)

where q is some integer. To prove the result for n + 1, we square both sides of (3.11)
to get

3^{2N/16} = N²/16 + 1 + q²N² + N/2 + qN²/2 + 2qN
          = (2N)/4 + 1 + (N/32 + q²N/2 + qN/4 + q)(2N).   (3.12)

Since (N/32 + q²N/2 + qN/4 + q) is an integer, (3.12) gives

3^{2N/16} = (2N)/4 + 1 mod (2N),

showing that the result is true for n + 1 if it is true for n. ∎
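Lemmas 2 and 3 can likewise be verified directly with modular exponentiation (sketch, ours):

```python
# Lemma 2: 3^(N/8) ≡ N/2 + 1 (mod N) for N = 2^n, n >= 4.
# Lemma 3: 3^(N/16) ≡ N/4 + 1 (mod N) for N = 2^n, n >= 5.
for n in range(4, 24):
    N = 1 << n
    assert pow(3, N // 8, N) == N // 2 + 1
    if n >= 5:
        assert pow(3, N // 16, N) == N // 4 + 1
```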
To obtain the N-point DHT algorithm, and in particular the transform components
X(k), k ∈ A(N), we first reorder the signal and transform components as follows.
Using the generators g = 3 and h = N/2 − 1 as explained earlier, permute the sequences
x(i), X(k), i, k ∈ A(N), to form sequences y(i), Y(k), 0 ≤ i, k < N/2, as

y(i) = { x(g^i mod N),    if 0 ≤ i < N/4,
       { x(h g^i mod N),  if N/4 ≤ i < N/2,   (3.13)

and similarly,

Y(k) = { X(g^k mod N),    if 0 ≤ k < N/4,
       { X(h g^k mod N),  if N/4 ≤ k < N/2.   (3.14)

Note that these permutations do not cost any computation. The relationship between
Y and y is given by a matrix obtained by choosing and permuting the rows and
columns of the DHT kernel consistently with the permutations of x and X. Denote
this N/2 × N/2 matrix by M, and let

Y = My.   (3.15)

Using (3.1), (3.13) and (3.14), one can see that when 0 ≤ i, k < N/4 or N/4 ≤ i, k <
N/2,

M(i, k) = cas(g^{i+k} 2π/N).   (3.16)

On the other hand, when one of i and k is in the range 0 to N/4 − 1 and the other
in the range N/4 to N/2 − 1, one has

M(i, k) = cas(h g^{i+k} 2π/N).   (3.17)
From (3.16) and (3.17) it follows that if the N/2 × N/2 matrix M is partitioned
into four equal submatrices, the diagonally opposite submatrices are equal, i.e.,

M = ( A  B )
    ( B  A ).   (3.18)

Now recall that the generator g has order N/4 under multiplication modulo N (Lemma
1). Therefore, both N/4 × N/4 submatrices A and B are cyclic, as their (i, k)-th
element depends only on (i + k) mod (N/4). The decomposition of M given by
(3.18) is a reflection of the fact that A(N) = C₂ × C_{N/4}.
When N ≥ 16, matrices A and B are at least 4 × 4. Lemma 2 can then be used
to show that they have additional redundancies. Consider the matrix A, whose
entries are given by (3.16). These entries can be simplified using Lemma 2 as follows:

A(i + N/8, k) = cas(g^{i+N/8+k} 2π/N)
             = cas(g^{i+k}(N/2 + 1) 2π/N)
             = −cas(g^{i+k} 2π/N)
             = −A(i, k),   0 ≤ i, k < N/8.   (3.19)

The negation step in (3.19) uses the fact that the generator g is odd; in fact, all elements
of A(2ⁿ) are odd. In a similar fashion, one can show that A(i, k + N/8) = −A(i, k)
and A(i + N/8, k + N/8) = A(i, k). Similar manipulation of (3.17) exposes identical
redundancies in the matrix B. As a result, we see that the N/4 × N/4 matrices A
and B in (3.18) have the structure

A = (  A′  −A′ )        B = (  B′  −B′ )
    ( −A′   A′ )  and       ( −B′   B′ ),   (3.20)

where each of the submatrices A′ and B′ is of size N/8 × N/8. Thus the matrix M becomes

M = (  A′  −A′   B′  −B′ )
    ( −A′   A′  −B′   B′ )
    (  B′  −B′   A′  −A′ )
    ( −B′   B′  −A′   A′ ).   (3.21)
Equation (3.21) shows that half the rows of M are the negatives of the other half, and
therefore it is sufficient to compute Y(k) only for 0 ≤ k < N/8 and for N/4 ≤ k <
3N/8. Thus, defining a sequence {Y′(k)}, 0 ≤ k < N/4, we have

Y′(k) = Y(k) = −Y(k + N/8),                 if 0 ≤ k < N/8,
Y′(k + N/8) = Y(k + N/4) = −Y(k + 3N/8),    if 0 ≤ k < N/8.   (3.22)

But even for these rows of M, the columns have a high degree of redundancy. This
redundancy may be exploited by combining the y components that multiply the
same value in M. By computing the sequence

y′(i) = { y(i) − y(i + N/8),          if 0 ≤ i < N/8,
        { y(i + N/8) − y(i + N/4),    if N/8 ≤ i < N/4,   (3.23)

one only has to compute

Y′ = ( A′  B′ ) y′.
     ( B′  A′ )   (3.24)
Since the submatrices A′ and B′ are carved out of the circular matrices A and B
respectively, they are structured as Hankel matrices. Partitioning each into four equal
submatrices, one can describe the structure as

A′ = ( P   Q )         B′ = ( U   V )
     ( Q  −P )  and         ( V  −U ).   (3.25)

Note that the submatrices P, Q, U and V are of size N/16 × N/16.
3.3.1 16-point DHT

When N = 16, the group underlying (3.21) is A(16) = C₂ × C₄ = {1, 7} × {1, 3, 9, 11} =
{1, 3, 9, 11, 7, 5, 15, 13}. It can be shown that P and Q have identical values, since

Q = cas(3 · 2π/16) = cas(π/2 − 2π/16) = cas(1 · 2π/16) = P.   (3.26)

Similarly, the relationship U = −V can be easily verified:

U = cas(7 · 2π/16) = cas(π − 2π/16) = −cos(2π/16) + sin(2π/16),   (3.27)

and

V = cas(5 · 2π/16) = cas(π/2 + 2π/16) = cos(2π/16) − sin(2π/16).
Combining (3.24)-(3.27), we have in this case

( Y′(0) )   (  P   P   U  −U ) ( y′(0) )
( Y′(1) ) = (  P  −P  −U  −U ) ( y′(1) )
( Y′(2) )   (  U  −U   P   P ) ( y′(2) )
( Y′(3) )   ( −U  −U   P  −P ) ( y′(3) ).   (3.28)

With four extra additions, (3.28) can be evaluated as two separate Hankel products:

( Y′(0) )   (  P  −U ) ( y′(0) + y′(1) )         ( Y′(2) )   (  P  −U ) ( y′(2) + y′(3) )
( Y′(3) ) = ( −U  −P ) ( y′(3) − y′(2) )   and   ( Y′(1) ) = ( −U  −P ) ( y′(1) − y′(0) ).   (3.29)
A bilinear algorithm can be used for each of these 2-point Hankel products. Since there
are only additions outside the Hankel matrices, the overall algorithm is bilinear as
well.
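Assuming the matrices reconstructed in (3.28) and (3.29), the equivalence of the direct evaluation and the two Hankel products can be confirmed numerically (sketch, ours; the y′ values are arbitrary):

```python
import math

cas = lambda a: math.cos(a) + math.sin(a)
P = cas(2 * math.pi / 16)                                      # = Q, by (3.26)
U = -math.cos(2 * math.pi / 16) + math.sin(2 * math.pi / 16)   # = -V, by (3.27)

y = [0.7, -1.3, 2.1, 0.4]                                      # arbitrary y'(0..3)

# Direct evaluation of (3.28)
M = [[ P,  P,  U, -U],
     [ P, -P, -U, -U],
     [ U, -U,  P,  P],
     [-U, -U,  P, -P]]
Y = [sum(M[r][c] * y[c] for c in range(4)) for r in range(4)]

# Evaluation via the two Hankel products of (3.29), common kernel (P -U; -U -P)
hankel2 = lambda v0, v1: (P * v0 - U * v1, -U * v0 - P * v1)
Y0, Y3 = hankel2(y[0] + y[1], y[3] - y[2])
Y2, Y1 = hankel2(y[2] + y[3], y[1] - y[0])

assert all(abs(a - b) < 1e-12 for a, b in zip(Y, [Y0, Y1, Y2, Y3]))
```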
We now illustrate the general procedure for the case of a 16-point DHT. The
computation of the even components of the DHT (X(k), k ∉ A(N)) directly follows
the discussion in Section 3.2 and is shown in Fig. 3.6.
Figure 3.6: Proposed bilinear algorithm for the even-indexed components of a 16-point DHT. Note that the multiplication constant is c₀ = √2.
The evaluation of the odd components X(k), k ∈ A(N), of the DHT requires
one to permute the signal and transform indices to realize the computation as cyclic
convolutions and Hankel products. Following the discussion in this section, the set
A(16) forms a group C₂ × C₄ under the operation of multiplication modulo 16. We
use the generator 3 for C₄ and 7 for C₂. Thus A(16) = {1, 7} × {1, 3, 9, 11}. The
permuted sequence y = {x(1), x(3), x(9), x(11), x(7), x(5), x(15), x(13)}. Similarly,
Y = {X(1), X(3), X(9), X(11), X(7), X(5), X(15), X(13)}. The matrix M which is
used to obtain Y from y as Y = My is given below (an entry q stands for the value cas(q · 2π/16) in
the matrix):
M =

( 1   3   9  11   7   5  15  13 )
( 3   9  11   1   5  15  13   7 )
( 9  11   1   3  15  13   7   5 )
(11   1   3   9  13   7   5  15 )
( 7   5  15  13   1   3   9  11 )
( 5  15  13   7   3   9  11   1 )
(15  13   7   5   9  11   1   3 )
(13   7   5  15  11   1   3   9 )

One can easily verify the properties (3.16)-(3.21) of the matrix M above.
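The index table of M follows mechanically from the generators (sketch, ours; `M_index` is a hypothetical helper name):

```python
# Index table of the 16-point matrix M from generators g = 3, h = 7:
# same-half entries are cas(g^(i+k) 2π/N), mixed-half entries cas(h g^(i+k) 2π/N).
N, g, h = 16, 3, 7

def M_index(i, k):
    e = pow(g, (i % 4) + (k % 4), N)
    return e if (i < 4) == (k < 4) else (h * e) % N

print([M_index(0, k) for k in range(8)])   # first row: [1, 3, 9, 11, 7, 5, 15, 13]
```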
The proposed bilinear algorithm for computing the odd-indexed DHT values for
N = 16 is shown in Fig. 3.7.

Figure 3.7: Proposed bilinear algorithm for the odd-indexed components of a 16-point DHT. Note that the multiplication coefficients are: c₀ = √2, c₁ = 0.7654, c₂ = 0.5412 and c₃ = −1.8478.

One should note that with a bilinear algorithm, the critical path includes only one
multiplication operation, resulting in a fast circuit implementation.
3.3.2 More than 16-point DHT

When N is greater than 16, we define, for indices i′, k′ ∈ {0, 1, 2, ..., N/16 − 1},

P(i′, k′) = cas(g^{i′} g^{k′} 2π/N),
Q(i′, k′) = cas(g^{i′+N/16} g^{k′} 2π/N),
U(i′, k′) = cas(h g^{i′} g^{k′} 2π/N),
V(i′, k′) = cas(h g^{i′+N/16} g^{k′} 2π/N).   (3.30)
We now show that P(i′, k′) and V(i′, k′) are related.
Substituting h = N/2 − 1 and using Lemma 3, one gets

V(i′, k′) = cas(g^{i′+k′}(N/2 − 1)(N/4 + 1) 2π/N)
         = cas(g^{i′+k′}(N²/8 + N/4 − 1) 2π/N)
         = cas(g^{i′+k′} π/2 − g^{i′+k′} 2π/N)   (3.31)
         = sin(g^{i′+k′} π/2) cas(g^{i′+k′} 2π/N)
         = sin(g^{i′+k′} π/2) P(i′, k′).

Since g = 3, sin(g^{i′+k′} π/2) equals 1 if (i′ + k′) is even and −1 if (i′ + k′) is odd.
Therefore P(i′, k′) and V(i′, k′) are related by

P(i′, k′) = {  V(i′, k′),   if i′ + k′ is even,
            { −V(i′, k′),   if i′ + k′ is odd.   (3.32)
Similarly, one can show that

U(i′, k′) = cas(g^{i′+k′}(N/2 − 1) 2π/N)
         = cas(g^{i′+k′} π − g^{i′+k′} 2π/N)
         = cos(g^{i′+k′} π) cas(−g^{i′+k′} 2π/N)
         = −cas(−g^{i′+k′} 2π/N),   (3.33)

and

Q(i′, k′) = cas(g^{i′+k′}(N/4 + 1) 2π/N)
         = cas(g^{i′+k′} π/2 + g^{i′+k′} 2π/N)
         = sin(g^{i′+k′} π/2) cas(−g^{i′+k′} 2π/N)
         = −sin(g^{i′+k′} π/2) U(i′, k′).   (3.34)
Since g = 3, sin(g^{i′+k′} π/2) = 1 if (i′ + k′) is even; otherwise it equals −1.
Therefore Q(i′, k′) and U(i′, k′) are related by

Q(i′, k′) = { −U(i′, k′),   if i′ + k′ is even,
            {  U(i′, k′),   if i′ + k′ is odd.   (3.35)
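Relations (3.32) and (3.35) can be spot-checked numerically, e.g. for N = 32 (sketch, ours):

```python
import math

cas = lambda a: math.cos(a) + math.sin(a)
N, g = 32, 3
h, m = N // 2 - 1, N // 16          # h = 15, block size N/16 = 2

# Entries of P, Q, U, V per definitions (3.30)
P = lambda i, k: cas(pow(g, i + k, N) * 2 * math.pi / N)
Q = lambda i, k: cas(pow(g, i + m + k, N) * 2 * math.pi / N)
U = lambda i, k: cas((h * pow(g, i + k, N) % N) * 2 * math.pi / N)
V = lambda i, k: cas((h * pow(g, i + m + k, N) % N) * 2 * math.pi / N)

for i in range(m):
    for k in range(m):
        s = 1 if (i + k) % 2 == 0 else -1
        assert abs(P(i, k) - s * V(i, k)) < 1e-12    # relation (3.32)
        assert abs(Q(i, k) + s * U(i, k)) < 1e-12    # relation (3.35)
```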
We now show that the matrix in (3.24) can be simplified to two N/8-point skew
circular matrices with identical kernels. A skew circular matrix is a special kind of
matrix closely resembling a circular matrix, except that the elements below the diagonal
running from top right to bottom left are negated. For example, from (3.25), the matrices A′ and
B′ can be seen to be skew circular. To compute Y′ through products with
two N/8 × N/8 skew circular matrices, first recall that

Y′ = ( A′  B′ ) y′ = (  P   Q   U   V ) y′,
     ( B′  A′ )      (  Q  −P   V  −U )
                     (  U   V   P   Q )
                     (  V  −U   Q  −P )   (3.36)

where P, Q, U and V are N/16 × N/16 matrices. Because of the relationships (3.32)
and (3.35), the elements of the matrix in (3.36) are related. To take advantage of these
relationships, we compute the even-indexed Y′ components in the top half together with the
odd-indexed components in the bottom half. For these Y′ components, (3.36) gives

Y′(k) = A y′(i)|_{i even} + B y′(i)|_{i odd},   0 ≤ k < N/8, k even;  N/8 ≤ k < N/4, k odd,   (3.37)
where, from (3.32) and (3.35), one gets

A(i, k) = (  P(i,k)   Q(i,k)  −Q(i,k)   P(i,k) )
          (  Q(i,k)  −P(i,k)   P(i,k)   Q(i,k) )
          (  Q(i,k)  −P(i,k)   P(i,k)   Q(i,k) )
          ( −P(i,k)  −Q(i,k)   Q(i,k)  −P(i,k) )

and

B(i, k) = (  P(i,k)   Q(i,k)   Q(i,k)  −P(i,k) )
          (  Q(i,k)  −P(i,k)  −P(i,k)  −Q(i,k) )
          ( −Q(i,k)   P(i,k)   P(i,k)   Q(i,k) )
          (  P(i,k)   Q(i,k)   Q(i,k)  −P(i,k) ).   (3.38)
One may note that the first N/16 columns of matrix A are identical to the last N/16
columns. Further, the columns from N/16 to 2N/16 are exactly the negatives of the
columns in positions 2N/16 to 3N/16. One can therefore add or subtract the
corresponding elements of the vector y′(i), i even, and then multiply the result by the same
elements P(i, k) or Q(i, k). In other words, one can shrink the number of columns in
matrix A by an appropriate folding of the vector y′(i), i even.
Similarly, one can use the fact that in matrix B, the first N/16 and last N/16
columns are the negatives of each other, while the columns N/16 to 2N/16 are identical
to the columns 2N/16 to 3N/16. Thus an appropriate folding of the vector y′(i), i odd,
allows one to reduce the number of columns in matrix B by a factor of 2 as well.
From this discussion, the algorithm to compute the even-indexed components in the first
half of Y′ and the odd-indexed components in the last half of Y′ can be designed as
follows.
Define a new sequence {y_a} of N/8 components as

y_a(2j)              = y′(2j + 3N/16) + y′(2j),
y_a(2j + 1)          = y′(2j + 1 + N/16) + y′(2j + 1 + N/8),
y_a(N/16 + 2j)       = y′(2j + N/16) − y′(2j + N/8),
y_a(N/16 + 2j + 1)   = y′(2j + 1 + 3N/16) − y′(2j + 1),   0 ≤ j < N/32.   (3.39)
Let Y_a = H × y_a, where H is the skew cyclic matrix

H = ( h_0         h_1   ...   h_{N/8−1} )
    ( h_1         h_2   ...  −h_0       )
    ( ...               ...             )
    ( h_{N/8−1}  −h_0   ...  −h_{N/8−2} ),   (3.40)

with

h_{2j}           = P(2j, 0),
h_{2j+1}         = Q(2j + 1, 0),
h_{N/16+2j}      = Q(2j, 0),
h_{N/16+2j+1}    = −P(2j + 1, 0).   (3.41)

Then for 0 ≤ j < N/16, one gets

Y′(2j) = Y_a(2j),
Y′(N/8 + 2j + 1) = Y_a(2j + 1).   (3.42)
Similarly, the odd-indexed components in the first half of Y′ and the even-indexed
components in the second half of Y′ can be computed together following
the same procedure. The resultant algorithm for these components is as follows.
First define a vector y_b of length N/8 as

y_b(2j)              = y′(2j + N/16) + y′(2j + N/8),
y_b(2j + 1)          = y′(2j + 1 + 3N/16) + y′(2j + 1),
y_b(N/16 + 2j)       = y′(2j + 3N/16) − y′(2j),
y_b(N/16 + 2j + 1)   = y′(2j + 1 + N/16) − y′(2j + 1 + N/8),   0 ≤ j < N/32.   (3.43)

Compute Y_b = H × y_b, where the matrix H is the same as before. Then one may
obtain the remaining components of Y′ as

Y′(2j + 1) = Y_b(2j + 1),
Y′(N/8 + 2j) = Y_b(2j),   0 ≤ j < N/16.   (3.44)
For bilinear algorithms of large lengths, one can trade speed for
hardware complexity. We have developed a pipelined (through folding) architecture
exploiting the recursive nature of the algorithm. First, for the 2ⁿ-point DHT, the
block circular matrix inside the Hartley group transform is computed with two Hankel
products with identical kernels. The hardware complexity of this computation can be cut
in half if only one copy of the hardware block is used and the computation is shared.
Second, as can be seen from Figs. 3.3 and 3.4, HGT_{N/p} is used in the computation
of X(k) both when k ∈ A(N) and when k ∉ A(N). This repeated computation
is also marked in boxes in Figs. 3.6 and 3.7. Using only one block of HGT_{N/p}
in the DHT and reusing it in multiple computations breaks the bilinear structure of
the algorithm and increases its inherent delay. However, it can have a significant
impact on area reduction, as is shown in Fig. 3.8 in Section 3.4. In addition, since
the even transform outputs X(k), k ∉ A(N), and the odd transform outputs
X(k), k ∈ A(N), are never produced concurrently, a circuit implementation with
multiplexed outputs may be used to reduce the output pin requirement by 50%.
3.4 Performance analysis for 2ⁿ-point DHT

The arithmetic complexity of any algorithm determines its hardware complexity. The
bilinear algorithm for the N = 2ⁿ-point DHT developed here uses (3ⁿ⁻¹ − 1)/2 − n + 1
multiplications and 2ⁿ⁺¹ + (3ⁿ − 81)/2 − 3n + 42 additions for n ≥ 3. Thus the
hardware cost is of order O(N^{1.6}). The actual numbers of arithmetic operations
for the proposed algorithm are compared in Table 3.1. This table also provides
Table 3.1: Hardware complexity of various 2ⁿ-point DHT algorithms.

    N | Bilinear      | Folding       | Ref. [4]      | Ref. [2]
      | mults   adds  | mults   adds  | mults   adds  | mults   adds
    2 |     0      2  |     -      -  |     -      -  |     -      -
    4 |     0      8  |     -      -  |     -      -  |     -      -
    8 |     2     22  |     -      -  |     2     26  |     -      -
   16 |    10     62  |     8     52  |    10     74  |     -      -
   32 |    36    172  |    19    121  |    34    194  |    34    174
   64 |   116    476  |    63    371  |    98    482  |    98    438
  128 |   358   1330  |   197    847  |   256   1154  |   258   1070
  256 |  1086   3770  |   601   2315  |   642   2690  |   642   2518
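The closed-form counts quoted above reproduce the Bilinear column of Table 3.1 (sketch, ours):

```python
# Operation counts for the N = 2^n bilinear DHT, n >= 3:
# mults = (3^(n-1) - 1)/2 - n + 1,  adds = 2^(n+1) + (3^n - 81)/2 - 3n + 42.
def bilinear_counts(n):
    mults = (3 ** (n - 1) - 1) // 2 - n + 1
    adds = 2 ** (n + 1) + (3 ** n - 81) // 2 - 3 * n + 42
    return mults, adds

print([bilinear_counts(n) for n in range(3, 9)])
# [(2, 22), (10, 62), (36, 172), (116, 476), (358, 1330), (1086, 3770)]
```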
complexities of two reference algorithms [2,4]. Both of these algorithms use a split-radix
2/4 method. It is worth noting that reference [2] uses the bilinear algorithm
for N < 32. Clearly its author recognizes the complexity advantage of the bilinear
algorithm at small lengths. In [4], the algorithm is further extended to split-radix
2/8. Though this improves the data transfer between the processor and the memory,
it does not reduce the arithmetic complexity. Therefore we have used only the split-radix
2/4 algorithm for reference [4]. The speed of an implementation can be estimated
from the inherent delay of the algorithm. Table 3.2 compares the speed of the new
algorithm with those of the reference algorithms.
Table 3.2: Time complexity of various 2ⁿ-point DHT algorithms. Note that M and A stand for multiplier and adder delays respectively.

    N | Bilinear | Folding  | Ref. [4] | Ref. [2]
    2 | A        | -        | -        | -
    4 | A        | -        | -        | -
    8 | M+2A     | -        | M+5A     | -
   16 | M+5A     | 2M+7A    | M+6A     | -
   32 | M+7A     | 2M+11A   | 2M+9A    | 2M+6A
   64 | M+9A     | 2M+15A   | 2M+10A   | 2M+9A
  128 | M+11A    | 2M+19A   | 3M+13A   | 3M+10A
  256 | M+13A    | 2M+23A   | 3M+14A   | 3M+13A
The proposed algorithm and the reference algorithms [2,4] were implemented with
16-bit fixed point arithmetic in TSMC 90nm CMOS technology. The normalized area and
critical path delay of these implementations are shown in Fig. 3.8. One can see that
the bilinear architecture always provides the highest speed. Our implementations are
faster by 61%, 15%, 42% and 40% at 8, 16, 32 and 64 points than those of [4], and faster
by 17% and 29% at 32 and 64 points than those of [2]. When the designs of the bilinear
architecture and [4] are operated at the same speed, the bilinear implementations can
be as much as 45% smaller at 8 points, 22% at 16 points, 18% at 32 points and 6%
at 64 points. Clearly, because the hardware cost of the N-point bilinear DHT grows as
O(N^{1.6}), the area advantage diminishes for larger sizes. However, with a pipelined
and folded design, both the speed and area advantages of bilinear circuits can be
Figure 3.8: Delay in nsec (on the horizontal axis) and normalized area (on the vertical axis) for various implementations of 8, 16, 32 and 64 point DHTs, comparing Ref. [4], Ref. [2], the bilinear designs and their pipelined versions.
extended to much larger transform lengths, as can be verified from Fig. 3.8.
3.5 Bilinear algorithm for 3ⁿ-point DHT

For a pⁿ-point DHT where p is an odd prime, the Abelian group is A(pⁿ) = C_{(p−1)pⁿ⁻¹}.
Since (p − 1) is relatively prime to pⁿ⁻¹, we have C_{(p−1)pⁿ⁻¹} = C_{p−1} × C_{pⁿ⁻¹}. This implies that
for odd p, the (p − 1)pⁿ⁻¹ transform samples X(k), k ∈ A(pⁿ), can be obtained through
several convolutions. By combining a (p − 1)-point cyclic convolution with a pⁿ⁻¹-point
cyclic convolution, one can incorporate the effect of all signal components x(i), i ∈ A(N).
Signal components x(i) with gcd(i, N) = p^k are involved through a two dimensional cyclic
convolution of (p − 1) × pⁿ⁻¹⁻ᵏ points. The remaining transform samples are obtained
from a DHT of length pⁿ⁻¹ as described in Section 3.2.
The complexity of these cyclic convolutions can be further reduced by using the
following lemma.

Lemma 4 Let g denote the generator of the cyclic group A(N), where N = pⁿ, p
being an odd prime and n ≥ 2. Then

Σ_{j=0}^{p−1} cas(2π g^{k+j|A(N)|/p} / N) = 0,   k = 0, 1, 2, ..., |A(N)|/p − 1.
Proof. Since g is the generator of the cyclic group of order |A(N)|, g^{|A(N)|/p} generates the unique subgroup of A(N) of order p. Each element of this subgroup (except the identity) has order p. We now show that the integers 1 + i(N/p) ∈ A(N), 0 ≤ i < p, have order p. This would imply that these are exactly the elements of the subgroup.

From the binomial theorem we have

(1 + i(N/p))^p = 1 + iN + \sum_{j=2}^{p} \binom{p}{j}(iN/p)^j = 1 + iN + i^2 N (N/p^2) \sum_{j=2}^{p} \binom{p}{j}(iN/p)^{j-2}. \qquad (3.45)

However, for n ≥ 2, N is divisible by p^2. Using this, (3.45) yields

(1 + i(N/p))^p = 1 \bmod N. \qquad (3.46)
Note that since the integers 1 + i(N/p), 0 ≤ i < p, are the same as the elements of the subgroup generated by g^{|A(N)|/p}, this set of integers is the same as the set of integers g^{i|A(N)|/p}, 0 ≤ i < p.

Let W represent the Nth primitive root of unity. Since W^N − 1 = 0 but W^{N/p} − 1 ≠ 0, we have

\sum_{i=0}^{p-1} W^{i(N/p)} = 0.

Hence

\sum_{i=0}^{p-1} W^{1+i(N/p)} = 0. \qquad (3.47)

Taking the real and imaginary parts of (3.47) one gets

\sum_{i=0}^{p-1} \cos(2\pi(1 + i(N/p))/N) = 0 \quad \text{and} \quad \sum_{i=0}^{p-1} \sin(2\pi(1 + i(N/p))/N) = 0. \qquad (3.48)

Using the equivalence of the set of integers 1 + i(N/p) with the subgroup elements, we can rewrite (3.48) as

\sum_{i=0}^{p-1} \cos(2\pi g^{i|A(N)|/p}/N) = 0 \quad \text{and} \quad \sum_{i=0}^{p-1} \sin(2\pi g^{i|A(N)|/p}/N) = 0. \qquad (3.49)

By multiplying the arguments of each cosine and sine function in (3.49) by a constant integer g^k mod N and applying trigonometric identities one gets

\sum_{i=0}^{p-1} \cos(2\pi g^{k+i|A(N)|/p}/N) = 0 \quad \text{and} \quad \sum_{i=0}^{p-1} \sin(2\pi g^{k+i|A(N)|/p}/N) = 0. \qquad (3.50)

By adding the two expressions in (3.50) one gets the required result. ∎
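Lemma 4 is easy to confirm numerically. The Python sketch below (an illustration, not part of the dissertation; the function names are ours) evaluates the cas sums for N = 9 and N = 27 using the generator g = 2 of A(N):

```python
import math

def cas(t):
    # Hartley kernel: cas(t) = cos(t) + sin(t)
    return math.cos(t) + math.sin(t)

def check_lemma4(p, n, g):
    """Check the Lemma 4 sums for N = p**n with g a generator of A(N)."""
    N = p ** n
    order = sum(1 for a in range(1, N) if math.gcd(a, N) == 1)  # |A(N)|
    for k in range(order // p):
        s = sum(cas(2 * math.pi * pow(g, k + i * (order // p), N) / N)
                for i in range(p))
        assert abs(s) < 1e-9, (k, s)
    return True

# A(9) and A(27) are cyclic with generator 2
print(check_lemma4(3, 2, 2), check_lemma4(3, 3, 2))   # prints: True True
```

Each sum runs over a multiplicative coset of the order-p subgroup, which is why both the cosine and sine parts cancel.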
This lemma implies that in every cyclic convolution in the DHT algorithm of p^n points, the components of the constant sequence always add to zero. This significantly reduces the number of additions and multiplications in the algorithm. For example, a 3-point cyclic convolution normally has a complexity of 4 multiplications and 11 additions. But if the constant sequence has the property that its components add to 0, the convolution complexity reduces to only 3 multiplications and 6 additions, a saving of over 25%.
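To make the saving concrete, here is a sketch (Python; the function name and the particular 3-multiplication split are ours, not the dissertation's) of a 3-point cyclic convolution that needs only 3 multiplications when the tap constants sum to zero. The z−1 factor of z³−1 then contributes nothing, so only the product modulo z²+z+1 must be computed; the addition count of this sketch is not tuned down to the 6 quoted above.

```python
def cyclic3_zero_sum(h, x):
    """3-point cyclic convolution y[n] = sum_k h[k]*x[(n-k) % 3],
    assuming h[0] + h[1] + h[2] == 0 (three multiplications only)."""
    h0, h1, h2 = h
    assert abs(h0 + h1 + h2) < 1e-12
    # design-time constants (the division by 3 is free: it is precomputed)
    A0, A1 = (h0 - h2) / 3.0, (h1 - h2) / 3.0
    # data pre-additions: x reduced modulo z^2 + z + 1
    d = x[0] - x[1]
    b1 = x[1] - x[2]
    b0 = x[0] - x[2]
    # the three multiplications by precomputed constants
    M1, M2, M3 = A0 * d, (A0 - A1) * b1, A1 * b0
    # post-additions (a doubling is an addition, not a multiplication)
    y0 = 2 * M1 + M2 - M3
    y1 = -M1 + M2 + 2 * M3
    return [y0, y1, -(y0 + y1)]   # outputs of a zero-sum kernel sum to 0

def cyclic3(h, x):
    # direct 3-point cyclic convolution, for comparison
    return [sum(h[k] * x[(n - k) % 3] for k in range(3)) for n in range(3)]

h, x = (1.5, -2.25, 0.75), (2.0, -1.0, 0.5)
assert max(abs(a - b) for a, b in
           zip(cyclic3_zero_sum(h, x), cyclic3(h, x))) < 1e-12
```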
One can also take advantage of the fact that the computation of the transform components X(k), k ∈ A(N), uses two-dimensional cyclic convolutions of (p−1) × p^{n−1−k} points. Each of these has length 2 in one dimension and therefore has identical post-multiplication processing in that dimension. One can therefore combine these identical computations so as to reduce the post-multiplication additions.
We now illustrate the above arguments through the example of a DHT of 9 points. Note that A(9) = C_6 = {1, 2, 4, 8, 7, 5} with generator g = 2. C_6 can always be expressed as C_3 × C_2. By choosing g^2 = 4 as the generator of C_3 and g^3 = 8 as the generator of C_2, A(9) can be reordered as A(9) = {1, 4, 7} × {1, 8} = {1, 4, 7, 8, 5, 2}. The computation of X(k), k ∈ A(9), can be expressed (with a matrix entry q representing the real value cas(q · 2π/9)) as:
\begin{pmatrix} X(1)\\ X(4)\\ X(7)\\ X(8)\\ X(5)\\ X(2) \end{pmatrix}
=
\begin{pmatrix}
1 & 4 & 7 & 8 & 5 & 2\\
4 & 7 & 1 & 5 & 2 & 8\\
7 & 1 & 4 & 2 & 8 & 5\\
8 & 5 & 2 & 1 & 4 & 7\\
5 & 2 & 8 & 4 & 7 & 1\\
2 & 8 & 5 & 7 & 1 & 4
\end{pmatrix}
\begin{pmatrix} x(1)\\ x(4)\\ x(7)\\ x(8)\\ x(5)\\ x(2) \end{pmatrix}
+
\begin{pmatrix}
3 & 6\\ 3 & 6\\ 3 & 6\\ 6 & 3\\ 6 & 3\\ 6 & 3
\end{pmatrix}
\begin{pmatrix} x(3)\\ x(6) \end{pmatrix}
+
\begin{pmatrix} 0\\ 0\\ 0\\ 0\\ 0\\ 0 \end{pmatrix} x(0).
\qquad (3.51)
For the 2-point block convolution within the larger 6-point convolution matrix, one can see that the matrix product with the vector x(i), i ∈ A(9), can be obtained as follows:

• Compute the output vector (Y(0), Y(1), Y(2)) from the input vector (x(1)+x(8), x(4)+x(5), x(7)+x(2)).

• Compute the output vector (Y'(0), Y'(1), Y'(2)) from the input vector (x(1)−x(8), x(4)−x(5), x(7)−x(2)).
• Obtain the first matrix product in (3.51) as

(Y(0)+Y'(0), Y(1)+Y'(1), Y(2)+Y'(2), Y(0)−Y'(0), Y(1)−Y'(1), Y(2)−Y'(2)).   (3.52)
One can see that because of Lemma 4 both the cyclic convolutions used in this part have a highly reduced complexity.
Similarly, the second matrix product in (3.51) can be seen to be a 2-point cyclic convolution obtained as follows:

• Compute Z from the input x(3) + x(6).

• Compute Z' from the input x(3) − x(6).

• Obtain the second matrix product in (3.51) as

(Z + Z', Z + Z', Z + Z', Z − Z', Z − Z', Z − Z').   (3.53)
One can combine the steps in (3.52) and (3.53) into two matrix products by first adding Z to Y(0), Y(1) and Y(2), and Z' to Y'(0), Y'(1) and Y'(2), before step (3.52). Thus when step (3.52) is carried out, one gets the sum of the first two matrix products of (3.51). This saves 2 additions of step (3.53). The third matrix product can similarly be incorporated in the sum merely by adding x(0) to Z before adding it to Y(0), Y(1) and Y(2). The complete algorithm of the 9-point DHT is shown in Fig. 3.9.
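The structure of (3.51) can be checked numerically. The Python sketch below (an illustration; it verifies the decomposition by index reordering rather than implementing the reduced-complexity flow graph of Fig. 3.9) computes X(k), k ∈ A(9), through the reordered kernel and the remaining samples through a 3-point DHT of the folded sequence:

```python
import math, random

def cas(t):
    return math.cos(t) + math.sin(t)

def dht(x):
    # direct N-point discrete Hartley transform
    N = len(x)
    return [sum(x[i] * cas(2 * math.pi * i * k / N) for i in range(N))
            for k in range(N)]

def dht9_structured(x):
    idx = [1, 4, 7, 8, 5, 2]               # A(9) reordered as {1,4,7} x {1,8}
    X = [0.0] * 9
    for k in idx:
        # 6x6 block of (3.51) plus the x(3), x(6) and x(0) columns
        s = sum(x[i] * cas(2 * math.pi * ((i * k) % 9) / 9) for i in idx)
        s += x[3] * cas(2 * math.pi * ((3 * k) % 9) / 9)
        s += x[6] * cas(2 * math.pi * ((6 * k) % 9) / 9)
        s += x[0]                          # the cas(0) = 1 column
        X[k] = s
    # X(0), X(3), X(6): 3-point DHT of the folded sequence
    U = dht([x[j] + x[j + 3] + x[j + 6] for j in range(3)])
    X[0], X[3], X[6] = U
    return X

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(9)]
assert max(abs(a - b) for a, b in zip(dht9_structured(x), dht(x))) < 1e-9
```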
The pipelining technique used for the 2^n-point DHT can also be applied to the 3^n-point bilinear DHT algorithm to trade off hardware complexity for speed. For the 3^n-point DHT, however, only the Hartley group transform is time shared.
3.6 Performance analysis for 3^n-point DHT
The computational complexity of the 3^n-point DHT is summarized in Table 3.3, and critical path delays are listed in Table 3.4. The bilinear algorithm for the 3^n-point DHT developed here uses (3·5^n − 4n − 3)/8 multiplications and (39·5^n − 28·3^n + 12n − 43)/16 additions for n ≥ 2. Our architectures and the reference designs in [65] were implemented in
Figure 3.9: Proposed bilinear algorithm for the 9-point DHT. Note that the multiplication coefficients are: c0 = −0.5, c1 = 0.8660, c2 = −0.5924, c3 = −1.7057, c4 = −0.7660, c5 = −1.6276, c6 = −0.3008, and c7 = −0.6428.
TSMC 90nm CMOS technology. We used 16-bit fixed point data representation. The
normalized area and delays of these implementations are shown in Fig. 3.10. Our
results show that the bilinear implementation has top speeds that are faster than
split-radix reference implementations by 61% at length 9 and 112% at length 27. For
length 27, even the slowest bilinear architecture is 12% faster than fastest reference
circuit, and at the same time is 21% to 34% smaller in size. For length 9 DHT, a
bilinear architecture achieves minimum 25% area saving when operating at the same
speeds as split radix design. For length 27, one level pipelined and folded design is
on average 40% smaller and still 12% faster than the fastest reference design.
Table 3.3: Hardware complexity of various 3^n-point DHT algorithms.

         Bilinear        Pipeline        Ref. [65]       Ref. [29]
  N    mults   adds    mults   adds    mults   adds    mults   adds
  3        1      6       --     --        1      6        1      7
  9        8     44       --     --       12     42        8     44
 27       45    257       38    230       69    204       53    227
 81      232   1382      195   1211      312    852      236    944
243     1169   7193      982   6200     1257   3282      977   3695
Table 3.4: Critical path delay of various 3^n-point DHT algorithms. Note that M and A stand for multipliers and adders respectively.

  N    Bilinear    Pipeline    Ref. [65]
  3    M+2A        --          M+2A
  9    M+7A        --          3M+6A
 27    M+12A       2M+13A      5M+10A
 81    M+17A       2M+23A      7M+14A
243    M+22A       2M+43A      9M+18A
(Plot panels: 9 and 27 points; curves: Ref. [65], Bilinear, and Pipelined.)
Figure 3.10: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 9 and 27 point DHTs.
3.7 Discussion and conclusion
Bilinear algorithms can generally provide the fastest ASIC implementations of DHT.
Even though bilinear algorithms for other transforms such as the Fourier [42, 61]
and discrete cosine [35,53] were known, we believe that this work, for the first time,
provides bilinear algorithms for the DHT. We have presented here lengths of type p^n only, but the methods of group theory used to extract the structure in the DHT kernel are quite general and can be extended to other lengths as well. This work shows that
the bilinear algorithm for DHT can perform 20% to 30% faster than other algorithms
implemented in the same technology.
Chapter 4
Modified discrete cosine transform
Forward and inverse modified discrete cosine transforms (MDCT/IMDCT) are widely used for subband coding in the analysis and synthesis filterbanks of time domain aliasing cancellation (TDAC). Many international audio coding standards rely heavily on fast algorithms for the MDCT/IMDCT. In this chapter we present hardware-efficient bilinear algorithms to compute the MDCT/IMDCT of 2^n and 4·3^n points. The algorithms for composite lengths have practical applications in MPEG-1/2 audio layer III (MP3) encoding and decoding. It is known that the MDCT/IMDCT can be converted to type-IV discrete cosine transforms (DCT-IV). Using group theory, our approach decomposes the DCT-IV transform kernel matrix into groups of cyclic and Hankel matrices. Bilinear algorithms are then applied to efficiently evaluate these groups. When implemented in VLSI, our algorithms greatly improve the critical path delay as compared with the existing solutions. This is due to the fact that bilinear algorithms employ only one multiplication along the critical path. For MP3 audio, we propose three different versions of the unified hardware architectures for both the short and long blocks, and the forward and inverse transforms.
4.1 Background and prior work
The forward and inverse modified discrete cosine transforms (MDCT/IMDCT) are used as analysis and synthesis filter banks in transform/subband coding schemes, such
as the time domain aliasing cancellation (TDAC) [41] and the modulated lapped transform (MLT) [31]. The MDCT/IMDCT are basic computing elements in many transform coding standards [38,39]. Since the MDCT and IMDCT require intensive computations, fast and efficient algorithms for these transforms are a key to the realization of high-quality audio and video compression schemes [50,51,63].
The N-point modified discrete cosine transform (MDCT) of a sequence {x(i)} is defined as

X(k) = \sum_{i=0}^{N-1} x(i) \cos\left(\frac{\pi(2i+1+\frac{N}{2})(2k+1)}{2N}\right), \quad k = 0, 1, \ldots, \frac{N}{2}-1. \qquad (4.1)
Note the similarity between the kernel of the MDCT and that of the discrete cosine transform (DCT). However, unlike a DCT, the MDCT converts N signal samples into only N/2 transform samples.
There have been many fast algorithms proposed for the MDCT and its inverse, the IMDCT. Based on the symmetry of the transform matrix, Malvar [30] converts an N-point MDCT into an N/2-point type-IV discrete sine transform (DST-IV). Duhamel et al. [18] compute the MDCT/IMDCT through the fast Fourier transform (FFT). An N-point DCT is reduced to an N/4-point complex-valued FFT. Though the overall arithmetic complexities of the two algorithms are similar, the FFT algorithm has the advantage of existing hardware realizations [24]. In [12,14,36], the MDCT and IMDCT are computed using recursive kernels. Recursive implementations require less hardware at the expense of extending the critical path.
Unfortunately, most MDCT algorithms are formulated for N = 2^n and do not directly apply to composite data lengths. Many existing applications of the MDCT/IMDCT, however, use composite data lengths. For example, the MPEG-1/2 layer III (MP3) audio format specifies two frames consisting of 1152 and 384 data samples. These frames are further partitioned into 32 subbands. A long block processes 36 data samples and a short block 12 data samples. If implemented directly as in the ISO standard, the arithmetic complexity of this composite N-point MDCT is N^2/2 multiplications and (N^2 − N)/2 additions. Britanak and Rao [8,9] have designed efficient MDCT algorithms for MP3 audio. Their algorithms are based on Givens rotations. Depending on block sizes, 3- or 9-point DCT and DST modules are then used to obtain the results. For the MDCT, the
DCT and DST used are of type-II. For the IMDCT, they are of type-III. Their approach is further refined by Nikolajevic and Fettweis [37], where the number of additions is reduced while the multiplication count remains the same. Fig. 4.1 shows the flow graph of the MDCT computation based on the Givens rotation method. In [27], Lee expresses
(Flow: x(i), i = 0..N−1 → combine and shuffle, rotations → N/4-point DCT-II and N/4-point DST-II → combine and shuffle → X(k).)
Figure 4.1: Flow graph for the Ref. [9] implementation of the N-point MDCT.
MDCT/IMDCT computations in the DCT-IV format, and successively transforms
the DCT-IV to scaled DCT-IIs. The un-normalized or scaled DCTs (SDCT) are used
for both the MDCT and IMDCT. Unfortunately, this algorithm has several long recursive computations. These contribute to a lower computational complexity, especially for the multiplications. However, in hardware implementations, they extend the critical path and the output timing is unbalanced. The flow graph for this approach is shown in Fig. 4.2. Recently, Cheng and Hsu [15] have applied matrix factorization schemes to
(Flow: the N-point forward and inverse MDCT share an N/2-point DCT-IV, which is in turn computed through N/2- and N/4-point SDCT-II and N/4-point DCT-IV modules.)
Figure 4.2: Flow graph for the Ref. [27] implementation of the N-point MDCT/IMDCT. Note that the SDCT is the unnormalized discrete cosine transform.
further explore the relationships between the DCT and the MDCT. Their algorithms
however, do not directly address the critical path delay.
In this chapter, we present bilinear algorithms to compute the MDCT/IMDCT
through DCT-IV. This allows us to minimize multiplications along the critical path.
Using group theory, we decompose the transform kernel into cyclic and Hankel matrix products. Bilinear algorithms are then used to efficiently evaluate these matrix products. We show that when implemented in VLSI with fixed-point arithmetic, our approach significantly reduces the critical path delay.
The rest of this chapter is organized as follows. Section 4.2 reviews the steps of transforming the MDCT/IMDCT to the DCT-IV. In Section 4.3, bilinear algorithms for 2^n-point MDCT/IMDCT are presented. Section 4.4 develops bilinear algorithms for MDCT/IMDCT with composite lengths of 4·3^n. In particular, a 12-point MDCT/IMDCT is used for MP3 short block processing as a 6-point DCT-IV. The MP3 long block 36-point MDCT/IMDCT is computed by an 18-point DCT-IV. For all DCT-IV algorithms, we discuss the group structures, arithmetic complexities, and critical path delays associated with the bilinear algorithm implementation. In particular, for the MP3 application, we explore three different versions of the unified hardware architecture for both the short and long blocks, and the forward and inverse transforms. Section 4.5 concludes the chapter.
4.2 Transformation from N-point MDCT/IMDCT to N/2-point DCT-IV
As pointed out earlier, an N-point MDCT uses N signal samples to create N/2 transform samples. The first step in the computation of the MDCT therefore involves converting this N × N/2 kernel into the kernel of a known square transform. It is known that an N-point MDCT/IMDCT can be transformed into an N/2-point type-IV DCT [12,27,30,31]. Our derivation here closely follows that of [27].
4.2.1 The forward MDCT transformation

The forward MDCT is defined, as in (4.1), by

X(k) = \sum_{i=0}^{N-1} x(i) \cos\left(\frac{\pi(2i+1+\frac{N}{2})(2k+1)}{2N}\right), \quad k = 0, 1, \ldots, \frac{N}{2}-1. \qquad (4.2)
Introduce a new data sequence

y(i) = \begin{cases} -x(i + 3N/4), & 0 \le i < N/4, \\ x(i - N/4), & N/4 \le i < N. \end{cases} \qquad (4.3)

Then (4.2) can be written as

X(k) = \sum_{i=0}^{N-1} y(i) \cos\left(\frac{\pi(2i+1)(2k+1)}{2N}\right), \quad k = 0, 1, \ldots, \frac{N}{2}-1. \qquad (4.4)

The cosine term in (4.4) satisfies the following relation:

\cos\left(\frac{\pi(2i+1)(2k+1)}{2N}\right) = -\cos\left(\frac{\pi(2N-1-2i)(2k+1)}{2N}\right). \qquad (4.5)

Then defining

z(i) = y(i) - y(N-1-i), \quad 0 \le i < N/2, \qquad (4.6)

an N-point MDCT can be expressed as an N/2-point DCT-IV as

X(k) = \sum_{i=0}^{N/2-1} z(i) \cos\left(\frac{\pi(2i+1)(2k+1)}{2N}\right), \quad k = 0, 1, \ldots, \frac{N}{2}-1. \qquad (4.7)
A general MDCT flow graph based on DCT-IV transformation is shown in Fig. 4.3.
(Flow: x(i), i = 0..N−1 → rearrange, Eq. (4.3) → y(i), i = 0..N−1 → Eq. (4.6) → z(i), i = 0..N/2−1 → DCT-IV, Eq. (4.7) → X(k), k = 0..N/2−1.)
Figure 4.3: Flow graph for the DCT-IV implementation of the N-point MDCT.
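The chain (4.3), (4.6), (4.7) can be checked numerically. The Python sketch below (an illustration under our reading of the reconstructed equations) computes a 16-point MDCT both directly from the definition and through the 8-point DCT-IV route, and verifies that the two agree:

```python
import math, random

def mdct(x):
    # direct N-point MDCT per (4.1)
    N = len(x)
    return [sum(x[i] * math.cos(math.pi * (2*i + 1 + N//2) * (2*k + 1) / (2*N))
                for i in range(N)) for k in range(N // 2)]

def mdct_via_dct4(x):
    # rearrange (4.3), fold (4.6), then the N/2-point DCT-IV (4.7)
    N = len(x)
    y = [-x[i + 3*N//4] if i < N//4 else x[i - N//4] for i in range(N)]
    z = [y[i] - y[N - 1 - i] for i in range(N // 2)]
    return [sum(z[i] * math.cos(math.pi * (2*i + 1) * (2*k + 1) / (2*N))
                for i in range(N // 2)) for k in range(N // 2)]

random.seed(0)
x = [random.uniform(-1, 1) for _ in range(16)]
assert max(abs(a - b) for a, b in zip(mdct(x), mdct_via_dct4(x))) < 1e-9
```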
4.2.2 The inverse MDCT transformation

The inverse MDCT (IMDCT) is defined as

x'(i) = \frac{2}{N} \sum_{k=0}^{N/2-1} X(k) \cos\left(\frac{\pi(2i+1+\frac{N}{2})(2k+1)}{2N}\right), \quad i = 0, 1, \ldots, N-1. \qquad (4.8)
To obtain the IMDCT, first compute the N/2-point type-IV DCT of X as

z'(i) = \frac{2}{N} \sum_{k=0}^{N/2-1} X(k) \cos\left(\frac{\pi(2i+1)(2k+1)}{2N}\right), \quad i = 0, 1, \ldots, \frac{N}{2}-1. \qquad (4.9)

Applying the symmetry property (4.5), and defining a new data sequence

y'(i) = \begin{cases} z'(i), & 0 \le i < N/2, \\ -z'(N-1-i), & N/2 \le i < N, \end{cases} \qquad (4.10)

the IMDCT output x'(i) can then be recovered as

x'(i) = \begin{cases} y'(i + N/4), & 0 \le i < 3N/4, \\ -y'(i - 3N/4), & 3N/4 \le i < N. \end{cases} \qquad (4.11)
An IMDCT flow graph based on DCT-IV transformation is shown in Fig. 4.4.
(Flow: X(k), k = 0..N/2−1 → DCT-IV, Eq. (4.9) → z'(i), i = 0..N/2−1 → expand, Eq. (4.10) → y'(i), i = 0..N−1 → re-order, Eq. (4.11) → x'(i), i = 0..N−1.)
Figure 4.4: Flow graph for the DCT-IV implementation of the N-point IMDCT.
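A corresponding numerical check of the inverse chain (4.9)–(4.11) is sketched below in Python (an illustration; the 2/N scale is applied identically on both paths, so the check is independent of the normalization convention):

```python
import math, random

def imdct_direct(X, N):
    # direct IMDCT per (4.8)
    return [(2.0 / N) * sum(X[k] * math.cos(math.pi * (2*i + 1 + N//2)
                                            * (2*k + 1) / (2*N))
                            for k in range(N // 2)) for i in range(N)]

def imdct_via_dct4(X, N):
    # N/2-point DCT-IV (4.9), then expand (4.10) and re-order (4.11)
    z = [(2.0 / N) * sum(X[k] * math.cos(math.pi * (2*i + 1) * (2*k + 1) / (2*N))
                         for k in range(N // 2)) for i in range(N // 2)]
    y = [z[i] if i < N // 2 else -z[N - 1 - i] for i in range(N)]
    return [y[i + N//4] if i < 3*N//4 else -y[i - 3*N//4] for i in range(N)]

random.seed(1)
N = 16
X = [random.uniform(-1, 1) for _ in range(N // 2)]
assert max(abs(a - b) for a, b in
           zip(imdct_direct(X, N), imdct_via_dct4(X, N))) < 1e-9
```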
4.2.3 The advantage of the DCT-IV transformation

The DCT-IV transformation has significant implications on implementations, especially for hardware. It is clear from Figs. 4.3 and 4.4 that a common DCT-IV module can be shared by both the forward and inverse transforms. A unified hardware architecture for the MDCT and IMDCT is shown in Fig. 4.5. Note that we purposely scale the data sample count to 2N points so that the core computation module becomes an N-point DCT-IV.
A key challenge to ASIC implementation is the requirement on the number of input and output (IO) pins. From a package point of view, the reduction of pad IO size has not kept pace with the development of transistor technology. From a macro
Figure 4.5: Flow graph for the DCT-IV implementation of the 2N-point unified MDCT and IMDCT. Note that IMODE = 0 for MDCT and IMODE = 1 for IMDCT.
perspective, all inputs and outputs must observe a minimum spacing requirement to reduce potential cross-talk issues. This constraint on inputs and outputs can be addressed with the improved architecture shown in Fig. 4.6. On the input side, the input pins of the IMDCT can be merged with the N input pins of the MDCT. For simplicity, we choose the first N input pins of the MDCT. On the output side, (4.10) shows that only N outputs of the IMDCT are truly unique. Therefore we can keep only the N outputs from the DCT-IV without sacrificing any information. Combining the input and output reduction techniques, our improved architecture can save up to 50% of the IOs compared to the implementation in Fig. 4.5.
Figure 4.6: Flow graph for the DCT-IV implementation of the 2N-point unified MDCT and IMDCT with reduced IO requirement. Note that for MDCT, IMODE = 0 and in(i) = x(i), i = 0, 1, ..., 2N − 1. For IMDCT, IMODE = 1 and in(k) = X(k), k = 0, 1, ..., N − 1.
4.3 Bilinear algorithms for 2^n-point MDCT/IMDCT
Section 4.2 shows that for N = 2^n, a 2N-point MDCT can be converted to an N-point DCT-IV with N pre-additions. For the IMDCT there is no extra computation involved.

To construct a bilinear algorithm for the MDCT/IMDCT, we need to explore the group structures within the DCT-IV transform kernel. From (4.7) and (4.9), the transform kernel indices take N odd values for (2i+1) and (2k+1), which belong to the Abelian group A(8N). From group theory, the Abelian group A(2^{n+3}) = C_2 × C_{2^{n+1}}, where N = 2^n. Thus there exists a cyclic subgroup of A(8N) of size 2N. As shown in Lemma 1, the integer 3 can be used as the generator g of this group. We now prove that the integers φ(i), i = 0, 1, ..., N − 1, defined in the following lemma provide the first N odd integers.
Lemma 5 Let N = 2^n and A(8N) = C_2 × C_{2N}. Using the generator g = 3 of C_{2N}, define the function φ(i), 0 ≤ i < N, as

\phi(i) = \begin{cases} g^i \bmod 4N, & \text{if } (g^i \bmod 4N) < 2N, \\ 4N - (g^i \bmod 4N), & \text{otherwise}. \end{cases} \qquad (4.12)

Then the values of φ(i), 0 ≤ i < N, give all the odd integers in the range 0 to 2N.
Proof. Since g ∈ A(8N), φ(i) in (4.12) is, for every i, 0 ≤ i < N, an odd integer in the range 0 to 2N. We now prove that every φ(i), 0 ≤ i < N, is distinct. It would then imply that these φ(i) give all the N odd integers in the range 0 to 2N.

We now prove the distinctness of each φ(i), 0 ≤ i < N. In particular, we show that if for some 0 ≤ i, j < N, φ(i) = φ(j), then i = j. Clearly if g^i mod 4N and g^j mod 4N are both smaller or both larger than 2N, then from (4.12), i = j. Assume that g^i mod 4N < 2N while g^j mod 4N > 2N. Then from (4.12),

g^i \bmod 4N = 4N - (g^j \bmod 4N), \quad \text{or} \quad g^i = -g^j \bmod 4N.

By squaring both sides, one gets

g^{2i} = g^{2j} \bmod 8N, \quad \text{or} \quad g^{2(i-j)} = 1 \bmod 8N. \qquad (4.13)

But since g is the generator of C_{2N}, a cyclic group under the operation of multiplication modulo 8N, the only way (4.13) can be true for 0 ≤ i, j < N is if i = j. ∎
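Lemma 5 is easy to confirm numerically; the Python sketch below (function name ours) also reproduces, for N = 8, exactly the permutation {1, 3, 9, 5, 15, 13, 7, 11} used in the example later in this section:

```python
def phi_values(N, g=3):
    # phi(i) of (4.12) for i = 0 .. N-1
    vals = []
    for i in range(N):
        r = pow(g, i, 4 * N)
        vals.append(r if r < 2 * N else 4 * N - r)
    return vals

for N in (4, 8, 16, 32):
    v = phi_values(N)
    assert sorted(v) == list(range(1, 2 * N, 2))   # all odd integers below 2N

print(phi_values(8))   # [1, 3, 9, 5, 15, 13, 7, 11]
```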
The fact that each odd integer (2i+1) for 0 ≤ i < N can be expressed through the φ function, which is based on a cyclic group, allows us to convert the MDCT computation into a cyclic convolution. Define the function ψ as follows:

\psi(i) = \begin{cases} +1, & \text{if } (g^i \bmod 4N) < 2N, \\ -1, & \text{otherwise}. \end{cases} \qquad (4.14)

We then can express the DCT-IV component

X(k) = \sum_{i=0}^{N-1} x(i)\cos\left(\frac{\pi(2i+1)(2k+1)}{4N}\right)

as

X\left(\frac{\phi(k)-1}{2}\right) = \sum_{i=0}^{N-1} x\left(\frac{\phi(i)-1}{2}\right)\cos\left(\frac{\pi\,\phi(i)\phi(k)}{4N}\right).

Thus

\psi(k)\,X\left(\frac{\phi(k)-1}{2}\right) = \sum_{i=0}^{N-1} \psi(i)\,x\left(\frac{\phi(i)-1}{2}\right)\cos\left(\frac{\pi(g^{i+k} \bmod 4N)}{4N}\right). \qquad (4.15)
Equation (4.15) shows that the permuted and sign adjusted input sequence ψ(i)x((φ(i)−1)/2) can be cyclically convolved with the constant sequence cos(π(g^t mod 4N)/(4N)) to get the permuted and sign adjusted transform sequence ψ(k)X((φ(k)−1)/2).
The bilinear complexity for the 2^n-point DCT-IV is 3^n multiplications and 3(3^n − 2^n) additions. The bilinear complexity for the 2^n-point MDCT is 3^{n−1} multiplications and 3^n − 2^n additions. The bilinear complexity for the 2^n-point IMDCT is 3^{n−1} multiplications and 3(3^{n−1} − 2^{n−1}) additions. Given these complexity requirements, our proposed bilinear algorithm works best at smaller transform sizes where the hardware implementation is possible.
We illustrate the above through an 8-point DCT-IV which is employed in a 16-point MDCT. Let x(i) and X(i), 0 ≤ i < 8, denote the input and output samples of the DCT. In this case, g being 3, the values of φ(i) for i = 0 through 7 are given by {1, 3, 9, 5, 15, 13, 7, 11}. The consecutive values of ψ(i) are {1, 1, 1, −1, −1, 1, −1, 1}. Using a shorthand notation p for the value cos(πp/4N) and \bar{p} for the value −cos(πp/4N), with N = 8, we can describe the transform matrix for the 8-point DCT-IV as
\begin{pmatrix} X(0)\\ X(1)\\ X(4)\\ -X(2)\\ -X(7)\\ X(6)\\ -X(3)\\ X(5) \end{pmatrix}
=
\begin{pmatrix}
1 & 3 & 9 & 5 & 15 & 13 & 7 & 11\\
3 & 9 & 5 & 15 & 13 & 7 & 11 & \bar{1}\\
9 & 5 & 15 & 13 & 7 & 11 & \bar{1} & \bar{3}\\
5 & 15 & 13 & 7 & 11 & \bar{1} & \bar{3} & \bar{9}\\
15 & 13 & 7 & 11 & \bar{1} & \bar{3} & \bar{9} & \bar{5}\\
13 & 7 & 11 & \bar{1} & \bar{3} & \bar{9} & \bar{5} & \overline{15}\\
7 & 11 & \bar{1} & \bar{3} & \bar{9} & \bar{5} & \overline{15} & \overline{13}\\
11 & \bar{1} & \bar{3} & \bar{9} & \bar{5} & \overline{15} & \overline{13} & \bar{7}
\end{pmatrix}
\begin{pmatrix} x(0)\\ x(1)\\ x(4)\\ -x(2)\\ -x(7)\\ x(6)\\ -x(3)\\ x(5) \end{pmatrix}
A Hankel matrix product is derived, and an efficient bilinear algorithm can then be applied to compute the transform. This algorithm is shown in Fig. 4.7. Individual architectures for the 16-point MDCT and IMDCT based on this 8-point DCT are shown in Fig. 4.8, whereas a unified architecture is shown in Fig. 4.9. A solid line means a transfer function of 1; a dashed line means a transfer function of −1. The multiplication coefficients are listed in Table 4.1.
For lengths 8 and 16, our proposed algorithms for the MDCT are compared to [9], which offers a regular structure based on Givens rotations. The complexities and critical path delays are shown in Table 4.2. The algorithms are implemented in 16-bit fixed-point arithmetic with the TSMC 90nm CMOS standard cell library. The normalized area and speed comparison of the resultant circuits is shown in Fig. 4.10. For the 8-point MDCT, the top speed of the proposed bilinear implementation is 30% higher than that of [9]. For the 16-point MDCT, our speed advantage is over 44%. In fact, the top speed of the 16-point bilinear implementation is even 15% faster than that of the 8-point implementation of [9]. Given the same speed, the area of the 8-point bilinear circuit can be as much as 32% smaller than that of [9]. For 16 points, the proposed circuit can be as much as 26% smaller. In addition, the MDCT bilinear implementations
Figure 4.7: Proposed bilinear implementation of the 8-point DCT-IV. The multiplication coefficients are in Table 4.1.
Figure 4.8: The implementations of the 16-point MDCT and IMDCT based on the 8-point DCT-IV.
Table 4.1: Multiplication coefficients used in Fig. 4.7.

Coefficient   Value     Coefficient   Value     Coefficient   Value
c1           -2.3342    c2            2.0200    c3           -1.5097
c4            2.7607    c5           -1.3533    c6            2.2505
c7           -3.1108    c8            2.6005    c9           -3.6363
c10           0.8561    c11          -0.1811    c12           0.4033
c13          -1.2444    c14           0.4714    c15          -1.4666
c16           1.2062    c17          -1.4283    c18           1.7891
c19           0.6220    c20          -1.6577    c21           0.7032
c22          -0.2719    c23           0.4105    c24           0.6827
c25           0.6985    c26           0.2561    c27           0.0581
Figure 4.9: Unified implementation of the 16-point MDCT and IMDCT employing one 8-point DCT-IV. Note that for MDCT, IMODE = 0 and in(i) = x(i), 0 ≤ i < 16. For IMDCT, IMODE = 1 and in(k) = X(k), 0 ≤ k < 8.
are based on the DCT-IV transform. This permits a simple unified architecture for both the forward and the inverse implementations. The speed and area of these unified implementations are close to those of the bilinear MDCT implementations.
Table 4.2: Complexities of various 8 and 16 point MDCT algorithms. Note that M and A refer to multiplication and addition respectively.

Transform         Algorithm   Arithmetic complexity   Critical delay
4-point DCT-IV    Proposed    9M + 15A                M + 4A
8-point MDCT      Proposed    9M + 19A                M + 5A
8-point MDCT      Ref. [9]    8M + 24A                2M + 5A
8-point DCT-IV    Proposed    27M + 55A               M + 6A
16-point MDCT     Proposed    27M + 63A               M + 7A
16-point MDCT     Ref. [9]    22M + 60A               2M + 7A
(Plot panels: 8 and 16 points; curves: Ref. [9], Bilinear, and Fig. 4.9.)
Figure 4.10: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 8 and 16 point MDCTs. Note that Fig. 4.9 is a unified MDCT and IMDCT architecture, while all others compute MDCT only.
4.4 Bilinear algorithms for 4·3^n-point MDCT/IMDCT

The MDCT/IMDCT algorithms for composite lengths of 4·3^n points, where n > 0, have found many practical applications in audio coding standards. In particular,
the 12-point MDCT/IMDCT is used for the short block and the 36-point MDCT/IMDCT is used for the long block of MPEG-1/2 layer III (MP3) audio processing.

The algorithm for the 4·3^n-point MDCT can be designed following an approach similar to the one in Section 4.2, i.e., a 2N-point MDCT is first converted to an N-point DCT-IV as

X(k) = \sum_{i=0}^{N-1} x(i)\cos\left(\frac{\pi(2i+1)(2k+1)}{4N}\right), \quad k = 0, 1, \ldots, N-1. \qquad (4.17)

An N-point IMDCT is computed directly from an N-point DCT-IV to obtain one half of the outputs. The other half is redundant and can be obtained with trivial sign changes.
As discussed in Section 4.2, the MDCT of any even length can be computed via a DCT-IV of half the length. Let N = 2·3^n where n > 0. We will use the symbol X_n to indicate a DCT-IV of length 2·3^n. Consider the group A(8N) = {0 < i < 8N | gcd(i, 8N) = 1}. The proposed computation shown in Fig. 4.11 uses a transform division of the DCT-IV kernel matrix based on A(8N). For the MDCT, it is a frequency division scheme; for the IMDCT, a time division scheme.
(Flow: outputs X(k) with 2k+1 ∈ A(8N), 0 ≤ k < N, are computed by the cosine group transform CGT_N; outputs X(k) with 2k+1 ∉ A(8N) are computed through a pre-addition stage and a DCT-IV of length N/3.)
Figure 4.11: Flow graph for the 2·3^n-point bilinear DCT-IV.
Consider first the computation of X_n(k), where (2k+1) ∉ A(8N), i.e., (2k+1) is a multiple of 3. In this case, it can be shown that the multiplication coefficients for
x(i), x(2N/3 − i − 1) and x(2N/3 + i) are related. In particular,

\cos\left(\frac{\pi(2i+1)(2k+1)}{4N}\right) = -\cos\left(\frac{\pi(2(2N/3-i-1)+1)(2k+1)}{4N}\right) = -\cos\left(\frac{\pi(2(2N/3+i)+1)(2k+1)}{4N}\right). \qquad (4.18)

To take advantage of (4.18), define

z(i) = x(i) - x(2N/3 - i - 1) - x(2N/3 + i), \quad i = 0, 1, \ldots, (N/3) - 1.

Then it is obvious that for (2k+1) ∉ A(8N),

\sum_{i=0}^{N/3-1} z(i)\cos\left(\frac{\pi(2i+1)(2k+1)}{4N/3}\right) = Z_{n-1}(k), \quad k = 0, 1, \ldots, N/3 - 1, \qquad (4.19)

where Z_{n−1}(k) is the 2·3^{n−1}-point DCT-IV of the sequence {z(i)}. Therefore the DCT-IV components whose index values (2k+1) are multiples of 3 can be computed directly from the DCT-IV of a sequence {z(i)} of a smaller length (N/3).
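This reduction can be checked numerically. The Python sketch below (an illustration; names ours) takes N = 6 and verifies that the components X(k) with 2k+1 divisible by 3 equal the 2-point DCT-IV of the pre-added sequence z:

```python
import math, random

def dct4(x):
    # direct N-point DCT-IV with kernel cos(pi*(2i+1)(2k+1)/(4N))
    N = len(x)
    return [sum(x[i] * math.cos(math.pi * (2*i + 1) * (2*k + 1) / (4*N))
                for i in range(N)) for k in range(N)]

random.seed(1)
N = 6
x = [random.uniform(-1, 1) for _ in range(N)]
z = [x[i] - x[2*N//3 - i - 1] - x[2*N//3 + i] for i in range(N // 3)]
X, Z = dct4(x), dct4(z)
for kp in range(N // 3):
    k = (3 * (2*kp + 1) - 1) // 2     # the index with 2k+1 = 3(2kp+1)
    assert abs(X[k] - Z[kp]) < 1e-9
```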
To compute X_n(k) where (2k+1) ∈ A(8N), we use the fact that A(8N) forms a group under the operation of multiplication modulo 8N. We refer to this computation of the cosine transform with the transform indices restricted to a group as the 2·3^n-point Cosine Group Transform, CGT_N. Thus we have

CGT_N(k) = \sum_{i=0}^{N-1} x(i)\cos\left(\frac{\pi(2i+1)(2k+1)}{4N}\right), \quad 0 \le k < N, \; 2k+1 \in A(8N). \qquad (4.20)
We separate the summation in (4.20) into two summations depending on whether (2i+1) belongs to A(8N) or not. The results of these are combined later using |CGT_N| additions. When (2i+1) ∈ A(8N), we can permute the signal and transform components to convert the partial kernel to a direct product of cyclic groups. This permutation and computation thus depends on the group structure and is illustrated later in this section.

When (2k+1) ∈ A(8N) but (2i+1) ∉ A(8N), (2i+1) is a multiple of 3. In this case, only the first N/3 components of the cosine group transform are independent
because

CGT_N(k) = -CGT_N(2N/3 - k - 1) = -CGT_N(2N/3 + k). \qquad (4.21)

It is therefore sufficient to compute CGT_N(k) only for (2k+1) ∈ A(8N), 0 ≤ k < N/3, i.e., (2k+1) ∈ A(8N/3). Also, since (2i+1) ∉ A(8N), one has 2i+1 = 3(2j+1) for some integer j. Thus

CGT_N(k) = \sum_{(2i+1)\notin A(8N)} x(i)\cos\left(\frac{\pi(2i+1)(2k+1)}{4N}\right)
= \sum_{j=0}^{N/3-1} x'(j)\cos\left(\frac{\pi(2j+1)(2k+1)}{4N/3}\right), \quad (2k+1)\in A(8N/3),
= CGT_{N/3}(k). \qquad (4.22)

Note that the sequence {x'(j)} in (4.22) is defined as x'(j) = x(3j+1), 0 ≤ j < N/3. Further, CGT_{N/3} in (4.22) represents the N/3-point cosine group transform of {x'(j)}.
The relationship (4.21) between transform components is essentially the analog of the signal domain relation (4.18) and is due to the symmetry of the kernel. It points to an alternative division scheme where transform components are first evaluated upon the signal index i with respect to A(N). For (2i+1) ∉ A(N), we then further separate the cosine group transform based on the relationship between A(N) and the transform index k. The motivation behind the signal division scheme is that some computations for (4.19) can be shared with those for the CGT_N where (2i+1) ∈ A(8N). We will show a reduced complexity for the 6-point DCT-IV in Section 4.4.1. The transform division, on the other hand, permits simpler pipelining and can also reduce the number of output pins for large transform sizes. In Sections 4.4.2 and 4.4.3, the advantages of the transform division architecture are discussed in detail.
When (2k+1) ∈ A(8N) and (2i+1) ∈ A(8N), the computation turns into a
multi-dimensional convolution. This convolution can be described by the structure
of A(8N) = A(16·3^n) = C_2 × C_4 × C_{2·3^{n−1}}. Let h and g denote the generators of C_4
CHAPTER 4. MODIFIED DISCRETE COSINE TRANSFORM
and C_{2·3^{n−1}} respectively. Define a function φ(a,b) as follows:
φ(a,b) = h^a g^b,        if 0 ≤ h^a g^b < 2N,
       = 4N − h^a g^b,   if 2N ≤ h^a g^b < 4N,
       = h^a g^b − 4N,   if 4N ≤ h^a g^b < 6N,        (4.23)
       = 8N − h^a g^b,   if 6N ≤ h^a g^b < 8N.
Note that in (4.23), the product h^a g^b is always computed modulo 8N. Defined as
above, the function φ(a,b) for 0 ≤ a < 2 and 0 ≤ b < 2·3^{n−1} produces all integers within
A(8N) which are less than 2N. Thus if A(8N) is considered to be made up of integers
of the type (2i+1), then the values of φ described above produce all (2i+1) ∈ A(8N)
corresponding to 0 ≤ i < N.
Define a sign function ^(a, 6) as
,, ,, f - 1 , if 2N < hagb < 6N, /dnt. i>(a,b) = \ ' . y (4.24)
[ +1, otherwise.
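A small sketch of (4.23)/(4.24): `fold` is our name for the pair (φ, ψ), and we assume A(8N) is the set of odd residues below 2N that are coprime to 8N. The n = 1 generators h = 13 and g = 7 and the resulting values {1, 11, 7, 5} with signs {1, −1, 1, 1} are those quoted in Section 4.4.1:

```python
import math

def fold(m, N):
    # phi(a,b) and psi(a,b) of (4.23)/(4.24): reduce h^a g^b (mod 8N)
    # to a magnitude below 2N together with a sign
    m %= 8 * N
    if m < 2 * N:
        return m, 1
    if m < 4 * N:
        return 4 * N - m, -1
    if m < 6 * N:
        return m - 4 * N, -1
    return 8 * N - m, 1

# n = 1 (N = 6): h = 2N + 1 = 13 generates C4, g = 7 generates C2
N, h, g = 6, 13, 7
table = [fold(h**a * g**b, N) for b in range(2) for a in range(2)]
assert [p for p, s in table] == [1, 11, 7, 5]   # phi(0,0), phi(1,0), phi(0,1), phi(1,1)
assert [s for p, s in table] == [1, -1, 1, 1]   # corresponding psi values

# The phi values enumerate the elements of A(8N) below 2N exactly once
assert sorted(p for p, s in table) == [m for m in range(1, 2 * N, 2)
                                       if math.gcd(m, 8 * N) == 1]
```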
With the functions φ(a,b) and ψ(a,b) defined in (4.23) and (4.24), one can express the
computation

Y(k) = Σ_{(2i+1) ∈ A(8N)} x(i) cos( π(2i+1)(2k+1) / 4N ),   0 ≤ k < N, (2k+1) ∈ A(8N),   (4.25)

as a convolution. Using the equivalence between the φ(a,b) values and the (2i+1), (2k+1)
ranges, one gets
Y(a',b') = Σ_{a=0}^{1} Σ_{b=0}^{2·3^{n−1}−1} x(a,b) cos( π φ(a,b) φ(a',b') / 4N ),   0 ≤ a' < 2, 0 ≤ b' < 2·3^{n−1}.   (4.26)
In (4.26), x(i) is relabeled as x(a,b) where φ(a,b) = 2i+1. Similarly Y(k) is relabeled
as Y(a',b') where φ(a',b') = 2k+1. Using the definitions of φ(a,b) and ψ(a,b), one
gets from (4.26),
Y(a',b') = Σ_{a=0}^{1} Σ_{b=0}^{2·3^{n−1}−1} x(a,b) ψ(a,b) ψ(a',b') cos( π h^{a+a'} g^{b+b'} / 4N ).   (4.27)
Equation (4.27) can be rewritten as

ψ(a',b') Y(a',b') = Σ_{a=0}^{1} Σ_{b=0}^{2·3^{n−1}−1} x(a,b) ψ(a,b) cos( π h^{a+a'} g^{b+b'} / 4N ).   (4.28)
Equation (4.28) shows that the permuted and sign-adjusted values of Y(k) are obtained
by a multi-dimensional operation of permuted and sign-adjusted signal samples
with a constant sequence made up of cosine terms. In one dimension, this operation
represents a 2·3^{n−1}-point cyclic convolution. In the other dimension, it is a 2-point
Hankel product.
One can verify that h can always be chosen as h = 2N+1. There are also other
values of h which would work as well. Similarly, one can choose g from amongst
many possible generators of the cyclic group C_{2·3^{n−1}} ⊂ A(8N). Finally, the 2·3^{n−1}-
point cyclic convolution can itself be carried out as a two-dimensional convolution
with lengths 2 and 3^{n−1} along the two dimensions. Since an algorithm with a lower
computational complexity is desirable, from Section 2.4.2 we can use the value of
(m − n)/a to determine the decomposition order of a bilinear algorithm (n, a, m), where
n is the length of the input vector, a its additive complexity and m its multiplicative
complexity.
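Equation (4.28) can be verified numerically for the smallest case N = 6, using the generators h = 13 and g = 7 that appear later in Section 4.4.1. This is a sketch in our own notation, not the dissertation's implementation:

```python
import math

N, h, g = 6, 13, 7     # h generates C4 and g generates C2 within A(48)

def fold(m):
    # (phi, psi) of (4.23)/(4.24)
    m %= 8 * N
    if m < 2 * N:
        return m, 1
    if m < 4 * N:
        return 4 * N - m, -1
    if m < 6 * N:
        return m - 4 * N, -1
    return 8 * N - m, 1

pairs = [(a, b) for b in range(2) for a in range(2)]
phi = {ab: fold(h**ab[0] * g**ab[1])[0] for ab in pairs}
psi = {ab: fold(h**ab[0] * g**ab[1])[1] for ab in pairs}
inA = set(phi.values())                  # the (2i+1) values in A(8N) below 2N

x = [0.3, -1.2, 0.7, 0.25, -0.6, 1.1]    # arbitrary test signal

def Y(k):
    # Partial DCT-IV sum of (4.25), restricted to (2i+1) in A(8N)
    return sum(x[i] * math.cos(math.pi * (2 * i + 1) * (2 * k + 1) / (4 * N))
               for i in range(N) if (2 * i + 1) in inA)

# (4.28): psi-adjusted outputs equal the structured sum with the constant
# kernel cos(pi h^(a+a') g^(b+b') / 4N) applied to psi-adjusted, phi-permuted inputs
for ap, bp in pairs:
    k = (phi[(ap, bp)] - 1) // 2
    lhs = psi[(ap, bp)] * Y(k)
    rhs = sum(x[(phi[(a, b)] - 1) // 2] * psi[(a, b)]
              * math.cos(math.pi * h**(a + ap) * g**(b + bp) / (4 * N))
              for a, b in pairs)
    assert abs(lhs - rhs) < 1e-12
```

The check passes because cos(π h^a g^b m / 4N) picks up exactly the sign ψ each time the folded representative of (4.23) replaces the raw group element.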
The decomposition of CGT_N is summarized in Fig. 4.12.

Figure 4.12: Flow graph for cosine group transform of the 2·3^n-point DCT-IV. (Samples x(i) with (2i+1) ∈ A(8N) feed a multidimensional cyclic convolution producing Y(k), (2k+1) ∈ A(8N); samples with (2i+1) ∉ A(8N) produce X'(k), (2k+1) ∈ A(8N/3); the two are combined into X(k), (2k+1) ∈ A(8N).)

It shows that the computation of CGT_N breaks down into two independent computations, one involving a
multi-dimensional cyclic convolution and the other, the transform CGT_{N/3}. CGT_{N/3}
can also be similarly decomposed into a smaller convolution and CGT_{N/9}. Since
all the resultant convolutions can be done concurrently, one can get a bilinear
algorithm for the DCT-IV from the bilinear algorithms for cyclic convolutions.
The above discussion results in a 2·3^n-point DCT-IV algorithm with a bilinear
complexity of 9(5^n + 4n − 1)/8 multiplications and (18·5^n − 3^{n+3} + 20n + 17)/2
additions. Thus the bilinear complexity of the 4·3^n-point MDCT is 9(5^n + 4n − 1)/8
multiplications and (18·5^n − 23·3^n + 20n + 17)/2 additions. The bilinear complexity of
the 4·3^n-point IMDCT is 9(5^n + 4n − 1)/8 multiplications and (18·5^n − 3^{n+3} + 20n + 17)/2
additions. Given these complexities, our proposed bilinear algorithm works
best at smaller transform sizes where a hardware implementation is feasible. This is
the case for MPEG-1/2 layer III (MP3) audio processing, which is discussed in the
next two sections.
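The counts above can be cross-checked against the 6- and 18-point entries quoted in Tables 4.4 and 4.6 (the n = 1 and n = 2 cases); the function names below are ours:

```python
def dct4_mults(n):
    # 2*3^n-point DCT-IV multiplications: 9(5^n + 4n - 1)/8
    return 9 * (5**n + 4 * n - 1) // 8

def dct4_adds(n):
    # 2*3^n-point DCT-IV (and 4*3^n-point IMDCT) additions:
    # (18*5^n - 3^(n+3) + 20n + 17)/2
    return (18 * 5**n - 3**(n + 3) + 20 * n + 17) // 2

def mdct_adds(n):
    # 4*3^n-point MDCT additions: (18*5^n - 23*3^n + 20n + 17)/2
    return (18 * 5**n - 23 * 3**n + 20 * n + 17) // 2

assert (dct4_mults(1), dct4_adds(1)) == (9, 23)    # 6-point DCT-IV: 9M + 23A
assert mdct_adds(1) == 29                          # 12-point MDCT: 9M + 29A
assert (dct4_mults(2), dct4_adds(2)) == (36, 132)  # 18-point DCT-IV: 36M + 132A
assert mdct_adds(2) == 150                         # 36-point MDCT: 36M + 150A
```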
4.4.1 The bilinear MDCT/IMDCT for the MP3 audio short block length
A 12-point MDCT/IMDCT is used for short blocks in MP3 audio processing. As discussed
in Section 4.2, these transforms can be converted to a 6-point DCT-IV. Bilinear
algorithms for the DCT-IV can then be applied to obtain a fast VLSI implementation.
For DCT-IV signal indices i = 1 and 4, where (2i+1) is divisible by 3, we can
compute a 2-point DCT-IV. Let its outputs be Xc(0) and Xc(1); using the same
shorthand notation as before (an entry p denotes cos(πp/4N) and −p denotes −cos(πp/4N), with N = 6), we have

[ Xc(0) ]   [  9   -3 ] [ x(1) ]
[ Xc(1) ] = [ -3   -9 ] [ x(4) ]      (4.29)
We will add Xc(0) to the rest of X(1) and subtract it from the rest of X(2) and X(5).
Similarly, we will subtract Xc(1) from the rest of X(0) and add it to the rest of X(3)
and X(4).
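The add/subtract pattern just described pins down the 2-point computation (4.29); a quick numerical check, using a 2×2 matrix consistent with that pattern (here p stands for cos(πp/24)):

```python
import math

N = 6
c = lambda p: math.cos(math.pi * p / (4 * N))   # shorthand: p -> cos(pi p / 24)
x1, x4 = 0.8, -0.35                             # arbitrary values for x(1), x(4)

Xc0 = c(9) * x1 - c(3) * x4                     # the 2-point computation of (4.29)
Xc1 = -c(3) * x1 - c(9) * x4

def contrib(k):
    # Contribution of signal indices i = 1, 4 to the 6-point DCT-IV output X(k)
    return sum(xi * math.cos(math.pi * (2 * i + 1) * (2 * k + 1) / (4 * N))
               for i, xi in ((1, x1), (4, x4)))

# +Xc(0) goes to X(1), -Xc(0) to X(2) and X(5);
# -Xc(1) goes to X(0), +Xc(1) to X(3) and X(4)
for k, want in ((1, Xc0), (2, -Xc0), (5, -Xc0), (0, -Xc1), (3, Xc1), (4, Xc1)):
    assert abs(contrib(k) - want) < 1e-12
```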
To compute DCT-IV transform indices k = 1 and 4, where (2k+1) is divisible by
3, using the same shorthand notation as before, we get

[ Xk(1) ]   [ 3    9 ] [  x(0) - x(3)   ]
[ Xk(4) ] = [ 9   -3 ] [ -(x(5) + x(2)) ]      (4.30)
One can notice that the computation (4.30) is a Hankel product. As we shall demonstrate
later, the advantage of the signal division approach is that (4.30) can be completely
obtained from the remaining matrix calculation with only sign changes.
For the remaining matrix, i.e., when (2i+1) ∈ A(8N) and (2k+1) ∈ A(8N), we
have h = 2N+1 = 13 as the generator of C_4 and g = 7 as the generator of
C_{2·3^{n−1}} = C_2. One therefore gets {φ(0,0), φ(1,0), φ(0,1), φ(1,1)} = {1, 11, 7, 5} and
the corresponding ψ values are {1, −1, 1, 1}. In addition, since φ(a,b) equals (2i+1)
or (2k+1), the signal or transform sample index needs to be permuted as {0, 5, 3, 2}.
The resultant matrix equation is given by:

[  Xk(0) ]   [   1  -11    7    5 ] [  x(0) ]
[ -Xk(5) ] = [ -11   -1    5   -7 ] [ -x(5) ]
[  Xk(3) ]   [   7    5    1  -11 ] [  x(3) ]      (4.31)
[  Xk(2) ]   [   5   -7  -11   -1 ] [  x(2) ]
One can notice that this computation corresponds to a two-dimensional convolution
with a 2-point cyclic convolution along one dimension and a 2-point Hankel product
along the other. Clearly, an efficient bilinear algorithm can be constructed for (4.31).
Applying the 2-point bilinear algorithm (Fig. 2.6) for cyclic convolution to (4.31), one
computes {Xk(0), −Xk(5)} with

[ (1 + 7)      (-11 + 5) ] [  x(0) + x(3) ]
[ (-11 + 5)    (-1 - 7)  ] [ -x(5) + x(2) ]      (4.32)
This is a 2-point Hankel product and a bilinear algorithm can be applied with 3
multiplications and 3 additions.
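A 2-point Hankel product [[p, q], [q, r]]·[x0, x1]^T can indeed be done with 3 multiplications and 3 additions; the scheme below is one standard arrangement, offered as a sketch (the exact product arrangement in Fig. 2.6 of the dissertation may differ):

```python
def hankel2(p, q, r, x0, x1):
    # y0 = p*x0 + q*x1, y1 = q*x0 + r*x1 using 3 multiplications
    m1 = q * (x0 + x1)          # 1 addition
    m2 = (p - q) * x0           # p - q and r - q are precomputable constants
    m3 = (r - q) * x1           # for a fixed cosine kernel
    return m1 + m2, m1 + m3     # 2 more additions

y0, y1 = hankel2(2.0, 3.0, 5.0, 0.5, -1.5)
assert abs(y0 - (2.0 * 0.5 + 3.0 * -1.5)) < 1e-12
assert abs(y1 - (3.0 * 0.5 + 5.0 * -1.5)) < 1e-12
```

Because the kernel entries are constants in (4.32) and (4.33), the differences p − q and r − q cost nothing at run time.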
Similarly, one can compute the other transform components {Xk(3), Xk(2)} with

[ Xk(3) ]   [ (1 - 7)      (-11 - 5) ] [  x(0) - x(3)   ]
[ Xk(2) ] = [ (-11 - 5)    (-1 + 7)  ] [ -(x(5) + x(2)) ]      (4.33)
This is again a 2-point Hankel product and a bilinear algorithm can be applied with
3 multiplications and 3 additions.
It can be easily verified that for N = 6,

cos(9π/4N) = cos(π/4N) − cos(7π/4N),
−cos(3π/4N) = −cos(11π/4N) − cos(5π/4N).      (4.34)
Therefore one can express (4.33) as

[ Xk(3) ]   [  9   -3 ] [  x(0) - x(3)   ]
[ Xk(2) ] = [ -3   -9 ] [ -(x(5) + x(2)) ]      (4.35)
Comparing (4.35) with (4.30), one gets

Xk(1) = −2 Xk(2),
Xk(4) = 2 Xk(3).      (4.36)
Therefore the computation for (4.30) can be absorbed into that for (4.31). The
operation of multiplying by 2 may be counted as one addition. Frequently in hardware
design, this scale-by-2 can be realized as a trivial left shift and thus its impact on
area and speed is negligible.
The complete flow graph of this computation is shown in Fig. 4.13. The multiplication
coefficients are listed in Table 4.3. The architecture of the 12-point MDCT/IMDCT
based on this DCT-IV is given in Fig. 4.14.
Table 4.3: Multiplication coefficients used in Fig. 4.13.

Coefficient | Value   | Coefficient | Value   | Coefficient | Value
c1          | 0.5412  | c2          | 0.3827  | c3          | -1.3066
c4          | 0.4687  | c5          | 0.3314  | c6          | -1.1315
c7          | 0.6533  | c8          | -0.4619 | c9          | -0.2706
Our proposed algorithms may be compared to those available in the literature
[9,27,37]. The complexities and critical path delays of these are listed in Table 4.4.
The bilinear algorithms improve both the arithmetic complexity and the critical path
delay compared with the referenced fast algorithms.
Figure 4.13: Proposed bilinear implementation of the 6-point DCT-IV.
Figure 4.14: The implementations of the 12-point MDCT and IMDCT based on the 6-point DCT-IV. (Left panel: forward MDCT; right panel: inverse MDCT, each built around the 6-point DCT-IV block.)
We have implemented these algorithms in 16-bit fixed-point arithmetic with a TSMC
90 nm CMOS standard cell library. The circuit speed and normalized area for various
12-point MDCT/IMDCT architectures are compared in Fig. 4.15. The top speed of
the proposed forward bilinear implementation is 41% and 37% faster than those of the
Givens-rotation-based forward transforms [9, 37] respectively, and is 68% faster than the recursive
approach in [27]. On the inverse transform, the top speed advantage is 50%
over [37] and 73% over [27]. At the same speed, the area of the bilinear circuits
can be as much as 41% and 33% smaller than those of [37] and [27] respectively for the
forward transform, and 38% and 37% smaller respectively for the inverse transform. Clearly, our
proposed bilinear algorithm provides the most efficient implementation of the 12-point
MDCT.
Table 4.4: Complexities of various 12-point MDCT and IMDCT algorithms. Note that M and A refer to multiplication and addition respectively.

Transform       | Algorithm | Arithmetic complexity | Critical delay
6-point DCT-IV  | Proposed  | 9M + 23A              | M + 5A
6-point CGT     | Proposed  | 9M + 19A              | M + 5A
12-point MDCT   | Proposed  | 9M + 29A              | M + 6A
12-point MDCT   | Ref. [9]  | 13M + 39A             | 2M + 6A
12-point MDCT   | Ref. [37] | 13M + 27A             | 2M + 5A
12-point MDCT   | Ref. [27] | 11M + 29A             | 3M + 7A
12-point IMDCT  | Proposed  | 9M + 23A              | M + 5A
12-point IMDCT  | Ref. [9]  | 13M + 33A             | 2M + 5A
12-point IMDCT  | Ref. [37] | 13M + 21A             | 2M + 4A
12-point IMDCT  | Ref. [27] | 11M + 23A             | 3M + 6A
4.4.2 The bilinear MDCT/IMDCT for the MP3 audio long block length
We now explain in detail the architecture for a 36-point MDCT/IMDCT via an N = 18-point
DCT-IV. The DCT-IV components X(1), X(4), X(7), X(10), X(13) and X(16)
can be computed by a 6-point DCT-IV. For the remaining components of CGT_18, we
Figure 4.15: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of the 12-point MDCT (left) and IMDCT (right). (Curves: Ref. [37], Ref. [27], and the proposed bilinear design.)
further divide the kernel matrix into two parts. A CGT_6 is computed for signal indices
i = 1, 4, 7, 10, 13, 16. The computation involving signal and transform indices i, k ∈
{0, 2, 3, 5, 6, 8, 9, 11, 12, 14, 15, 17}, i.e., those for which (2i+1), (2k+1) ∈ A(8N),
results in a multi-dimensional convolution. As explained earlier, this convolution
is based upon the group C_4 × C_{2·3^{n−1}}. Since the cyclic group C_{2·3^{n−1}} can be further
expressed as C_2 × C_{3^{n−1}}, in (4.23) and (4.24) we can substitute g^b = g_2^{b_2} g_3^{b_3}, where g_2
and g_3 are generators of C_2 and C_{3^{n−1}} respectively. By using the generator h = 19 of
C_4, the generator g_2 = 17 of C_2 and the generator g_3 = 49 of C_3, we get the values of the function
φ, a = 0, 1, from (4.23) as {1, 19, 23, 5, 25, 29, 17, 35, 31, 13, 7, 11}. The corresponding
values of the function ψ are obtained from (4.24) as {1, 1, −1, −1, −1, 1, 1, 1, 1, 1, −1, −1}.
Further, φ represents values of (2i+1) or (2k+1), where i and k are indices of signal
and transform samples. Thus the permutation of the signal and transform samples
can be derived from the values of φ. For the present set of φ values, this index order is
given by {0, 9, 11, 2, 12, 14, 8, 17, 15, 6, 3, 5}. The computation can thus be expressed
as the matrix product:

[  Xk(0)  ]   [   1   19  -23   -5  -25   29   17   35   31   13   -7  -11 ] [  x(0)  ]
[  Xk(9)  ]   [  19   -1   -5   23   29   25   35  -17   13  -31  -11    7 ] [  x(9)  ]
[ -Xk(11) ]   [ -23   -5  -25   29    1   19   31   13   -7  -11   17   35 ] [ -x(11) ]
[ -Xk(2)  ]   [  -5   23   29   25   19   -1   13  -31  -11    7   35  -17 ] [ -x(2)  ]
[ -Xk(12) ]   [ -25   29    1   19  -23   -5   -7  -11   17   35   31   13 ] [ -x(12) ]
[  Xk(14) ] = [  29   25   19   -1   -5   23  -11    7   35  -17   13  -31 ] [  x(14) ]
[  Xk(8)  ]   [  17   35   31   13   -7  -11    1   19  -23   -5  -25   29 ] [  x(8)  ]
[  Xk(17) ]   [  35  -17   13  -31  -11    7   19   -1   -5   23   29   25 ] [  x(17) ]
[  Xk(15) ]   [  31   13   -7  -11   17   35  -23   -5  -25   29    1   19 ] [  x(15) ]
[  Xk(6)  ]   [  13  -31  -11    7   35  -17   -5   23   29   25   19   -1 ] [  x(6)  ]
[ -Xk(3)  ]   [  -7  -11   17   35   31   13  -25   29    1   19  -23   -5 ] [ -x(3)  ]
[ -Xk(5)  ]   [ -11    7   35  -17   13  -31   29   25   19   -1   -5   23 ] [ -x(5)  ]

(4.37)
Note that an entry p in this matrix represents the value cos(πp/4N) and −p represents
the value −cos(πp/4N), where N = 18. The 6-point cyclic convolution can be obtained
by combining 2-point and 3-point algorithms.
Efficient bilinear algorithms exist for the 2-point and 3-point cyclic convolutions
and Hankel products. For the 3-point cyclic convolution, applying the trigonometric
identity cos(α) + cos(2π/3 + α) + cos(4π/3 + α) = 0, we can lower its complexity to 3
multiplications and 6 additions and reduce the critical path delay to 1 multiplication
and 4 additions.
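The zero-sum identity is what removes the fourth multiplication of a generic 3-point cyclic convolution: the kernel's residue modulo (u − 1) vanishes, leaving only a degree-1 product modulo u² + u + 1. The sketch below demonstrates the 3-multiplication count; its addition bookkeeping is not tuned to match the 6-addition figure quoted above:

```python
import math

alpha = 0.7    # any angle: cos(a) + cos(a + 2pi/3) + cos(a + 4pi/3) = 0
h = [math.cos(alpha + 2 * math.pi * j / 3) for j in range(3)]
assert abs(sum(h)) < 1e-12

def cyclic3_direct(h, x):
    return [sum(h[(k - i) % 3] * x[i] for i in range(3)) for k in range(3)]

def cyclic3_fast(h, x):
    # Since sum(h) = 0, the mod (u - 1) component of H(u)X(u) mod (u^3 - 1)
    # is zero; only the product mod (u^2 + u + 1) remains, done Karatsuba-style.
    a, b = h[0] - h[2], h[1] - h[2]                # precomputable kernel constants
    p, q = x[0] - x[2], x[1] - x[2]
    m1, m2, m3 = a * p, b * q, (a + b) * (p + q)   # the 3 multiplications
    z0, z1 = m1 - m2, m3 - m1 - 2 * m2
    y2 = -(z0 + z1) / 3                            # the 1/3 can be folded into a, b
    return [z0 + y2, z1 + y2, y2]

x = [0.4, -1.1, 0.9]
for d, f in zip(cyclic3_direct(h, x), cyclic3_fast(h, x)):
    assert abs(d - f) < 1e-12
```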
The flow graph for (4.37) is shown in Fig. 4.16. The complete implementation flow
graph for the 18-point DCT-IV is shown in Fig. 4.17. The multiplication coefficients
used therein are listed in Table 4.5. Figures 4.18 and 4.19 show the 36-point MDCT
and IMDCT respectively. The complexities and critical path delays of these and of
other algorithms available in the literature are listed in Table 4.6. One can see from the
table that the proposed bilinear algorithm improves the critical path delay and has
the lowest multiplication requirements. The addition counts, however, are higher
than those of [27, 37].
The proposed bilinear algorithms and the reference algorithms are implemented
Figure 4.16: Proposed bilinear implementation of the multidimensional convolution involved in the 18-point DCT-IV.
Table 4.5: Multiplication coefficients used in Fig. 4.16.

Coefficient | Value   | Coefficient | Value   | Coefficient | Value
c1          | -0.9231 | c2          | -0.6528 | c3          | 2.2287
c4          | -0.5086 | c5          | -0.3596 | c6          | 1.2278
c7          | -0.6025 | c8          | -0.4261 | c9          | 1.4546
c10         | -0.1628 | c11         | 0.2779  | c12         | -0.3930
c13         | 0.1851  | c14         | -0.3160 | c15         | 0.4469
c16         | 0.7181  | c17         | -1.2258 | c18         | 1.7336
Figure 4.17: Proposed bilinear implementation of the 18-point DCT-IV. (The multidimensional cyclic convolution of Fig. 4.16 and the CGT_6 blocks are combined into CGT_18.)
Figure 4.18: Proposed bilinear implementation of the 36-point MDCT. (The MDCT-to-DCT-IV stage of Fig. 4.3 feeds the multidimensional cyclic convolution of Fig. 4.16, the CGT_6 block, and a 6-point DCT-IV.)
Table 4.6: Complexities of various 36-point MDCT and IMDCT algorithms. Note that M and A refer to multiplication and addition respectively.

Transform       | Algorithm | Arithmetic complexity | Critical delay
18-point DCT-IV | Proposed  | 36M + 132A            | M + 9A
36-point MDCT   | Proposed  | 36M + 150A            | M + 10A
36-point MDCT   | Ref. [9]  | 47M + 165A            | 2M + 9A
36-point MDCT   | Ref. [37] | 47M + 129A            | 2M + 8A
36-point MDCT   | Ref. [27] | 43M + 133A            | 3M + 22A
36-point IMDCT  | Proposed  | 36M + 132A            | M + 9A
36-point IMDCT  | Ref. [9]  | 51M + 151A            | 2M + 8A
36-point IMDCT  | Ref. [37] | 51M + 115A            | 2M + 7A
36-point IMDCT  | Ref. [27] | 43M + 115A            | 3M + 21A
Figure 4.19: Proposed bilinear implementation of the 36-point IMDCT.
in 16-bit fixed-point arithmetic with a TSMC 90 nm CMOS standard cell library. The circuit
speed and normalized area are shown in Fig. 4.20. The top speed of the proposed
forward bilinear implementation is 19% and 14% faster than those of the Givens-rotation-based
forward transforms [9, 37] respectively, and is 61% faster than the successive
decomposition approach [27]. On the inverse transform, the top speed advantage is 25%
over [37] and 64% over [27]. At the same speed, the area of the bilinear circuits can
be as much as 19% smaller than those of [27, 37] for the forward implementation, and 30% and 17%
smaller for the inverse implementations respectively.
Figure 4.20: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of the 36-point MDCT (left) and IMDCT (right).
4.4.3 The unified MDCT/IMDCT architecture for MP3 audio
In Fig. 4.5, we have shown that the forward and inverse MDCT can be obtained
together on a DCT-IV based hardware architecture. This is accomplished with relatively
simple input and/or output data multiplexers. This unified implementation
allows the encoder and decoder to share the same hardware accelerator through time
multiplexing.
In MPEG-1/2 layer III (MP3) audio format, two different block sizes are defined.
The long block size is normally used to provide better frequency resolution, and the
short block is used where better time resolution is needed. The switch from the
long block to the short block occurs whenever pre-echo is expected. Pre-echo is a
distortion in the frequency domain coding of an audio signal. It is commonly dealt
with using a window switching technique, where short block sizes are used in place of
long block sizes. Therefore a truly unified algorithmic accelerator needs to process
not only the forward and inverse transforms (unified encoder and decoder), but also the short
and long block sizes (window switching).
Figs. 4.18 and 4.19 show that the 36-point MDCT/IMDCT consists of three major
processing modules: a 12-point block circular matrix, a 6-point CGT and a 6-point
DCT-IV. Both the 12-point MDCT and IMDCT rely on the 6-point DCT-IV. These
observations lead us to three different unified hardware architectures.
Shown in Fig. 4.21, architecture A is a straightforward enhancement of the unified
architecture of Fig. 4.6. We use the 6-point DCT-IV inside the 36-point MDCT/IMDCT
to process the 12-point MDCT/IMDCT. The data throughput is one 36-point or one
12-point MDCT/IMDCT per cycle. The pre-addition stage of the 12-point MDCT is
shared with that of the 36-point MDCT. Tables 4.7 and 4.8 show possible input and
output assignments for Fig. 4.21.
Architecture B, shown in Fig. 4.22, improves upon the simple enhancement of
Fig. 4.21. From Table 4.4, we realize that the difference between the 6-point CGT and the
6-point DCT-IV is small and only amounts to 4 additions (or 2 additions and 2 left
shifts). Therefore the 6-point CGT can be expanded into a 6-point DCT-IV to process
a second 12-point MDCT/IMDCT. The data throughput is one 36-point or two 12-point
MDCT/IMDCTs per cycle. The pre-addition stage for both 12-point MDCTs is
shared with that of the 36-point MDCT. The ability to process multiple short blocks
concurrently is important. During window switching, the 32 subbands can operate
in mixed block mode, where the two lower subbands process long blocks and the other
30 upper bands switch to short blocks. Tables 4.9 and 4.10 show possible input and
output assignments for Fig. 4.22.
Architecture C (pipeline) takes a different look at the relationship between the
6-point CGT and DCT-IV. Instead of doubling up CGT_6 to another DCT-IV in order to
Table 4.7: 12 and 36 point MDCT and IMDCT input mapping for unified architecture A.

Mode (SMODE, IMODE) | Input assignment
MDCT36 (0, 0)       | in(i) = x(i), 0 ≤ i ≤ 35
IMDCT36 (0, 1)      | in(i) = X(i), 0 ≤ i ≤ 17
MDCT12 (1, 0)       | in(i) = x(i), 0 ≤ i ≤ 11
IMDCT12 (1, 1)      | in(i) = X(i), 0 ≤ i ≤ 5
Table 4.8: 12 and 36 point MDCT and IMDCT output mapping for unified architecture A.

Output  | MDCT36 | IMDCT36          | MDCT12 | IMDCT12
out(0)  | X(0)   | -x'(27), -x'(26) | X(0)   | -x'(9), -x'(8)
out(1)  | X(1)   | -x'(28), -x'(25) | X(1)   | -x'(10), -x'(7)
out(2)  | X(2)   | -x'(29), -x'(24) | X(2)   | -x'(11), -x'(6)
out(3)  | X(3)   | -x'(30), -x'(23) | X(3)   | x'(0), -x'(5)
out(4)  | X(4)   | -x'(31), -x'(22) | X(4)   | x'(1), -x'(4)
out(5)  | X(5)   | -x'(32), -x'(21) | X(5)   | x'(2), -x'(3)
out(6)  | X(6)   | -x'(33), -x'(20) |        |
out(7)  | X(7)   | -x'(34), -x'(19) |        |
out(8)  | X(8)   | -x'(35), -x'(18) |        |
out(9)  | X(9)   | x'(0), -x'(17)   |        |
out(10) | X(10)  | x'(1), -x'(16)   |        |
out(11) | X(11)  | x'(2), -x'(15)   |        |
out(12) | X(12)  | x'(3), -x'(14)   |        |
out(13) | X(13)  | x'(4), -x'(13)   |        |
out(14) | X(14)  | x'(5), -x'(12)   |        |
out(15) | X(15)  | x'(6), -x'(11)   |        |
out(16) | X(16)  | x'(7), -x'(10)   |        |
out(17) | X(17)  | x'(8), -x'(9)    |        |
Figure 4.21: Proposed bilinear implementation of the unified 12 and 36 point MDCT and IMDCT (architecture A).
Table 4.9: 12 and 36 point MDCT and IMDCT input mapping for unified architecture B. Note that xA and xB refer to the two 6-point blocks whose MDCT is computed concurrently. Similarly, XA and XB represent two independent 6-point transform blocks whose IMDCT is computed concurrently.

MDCT36 (SMODE=0, IMODE=0): in(i) = x(i), 0 ≤ i ≤ 35.
IMDCT36 (0, 1): in(i) = X(i), 0 ≤ i ≤ 17.
MDCT12 (1, 0): in(0)-in(23) = xA(0), xB(0), xA(1), xB(1), xA(2), xB(2), xB(3), xA(3), xB(4), xA(4), xB(5), xA(5), xB(6), xA(6), xB(7), xA(7), xB(8), xA(8), xA(9), xB(9), xA(10), xB(10), xA(11), xB(11).
IMDCT12 (1, 1): in(0)-in(11) = XA(0), XB(0), XA(1), XB(1), XA(2), XB(2), XA(3), XB(3), XA(4), XB(4), XA(5), XB(5).
Table 4.10: 12 and 36 point MDCT and IMDCT output mapping for unified architecture B. Note that XA and XB refer to MDCTs of the 6-point sequences xA and xB respectively and are computed concurrently. Similarly, x'A and x'B refer to IMDCTs of the 6-point transforms XA and XB respectively and are computed concurrently.

Output  | MDCT36 | IMDCT36          | MDCT12 | IMDCT12
out(0)  | X(0)   | -x'(27), -x'(26) | XA(0)  | -x'A(9), -x'A(8)
out(1)  | X(1)   | -x'(28), -x'(25) | XA(1)  | -x'A(10), -x'A(7)
out(2)  | X(2)   | -x'(29), -x'(24) | XA(2)  | -x'A(11), -x'A(6)
out(3)  | X(3)   | -x'(30), -x'(23) | XA(3)  | x'A(0), -x'A(5)
out(4)  | X(4)   | -x'(31), -x'(22) | XA(4)  | x'A(1), -x'A(4)
out(5)  | X(5)   | -x'(32), -x'(21) | XA(5)  | x'A(2), -x'A(3)
out(6)  | X(6)   | -x'(33), -x'(20) | XB(0)  | -x'B(9), -x'B(8)
out(7)  | X(7)   | -x'(34), -x'(19) | XB(1)  | -x'B(10), -x'B(7)
out(8)  | X(8)   | -x'(35), -x'(18) | XB(2)  | -x'B(11), -x'B(6)
out(9)  | X(9)   | x'(0), -x'(17)   | XB(3)  | x'B(0), -x'B(5)
out(10) | X(10)  | x'(1), -x'(16)   | XB(4)  | x'B(1), -x'B(4)
out(11) | X(11)  | x'(2), -x'(15)   | XB(5)  | x'B(2), -x'B(3)
out(12) | X(12)  | x'(3), -x'(14)   |        |
out(13) | X(13)  | x'(4), -x'(13)   |        |
out(14) | X(14)  | x'(5), -x'(12)   |        |
out(15) | X(15)  | x'(6), -x'(11)   |        |
out(16) | X(16)  | x'(7), -x'(10)   |        |
out(17) | X(17)  | x'(8), -x'(9)    |        |
Figure 4.22: Proposed bilinear implementation of the unified 12 and 36 point MDCT and IMDCT (architecture B).
process a second short block, we fold the CGT_6 function into the existing 6-point DCT-IV.
This provides a natural way to pipeline the 36-point MDCT/IMDCT. In addition, a
constant focal point of hardware implementation is the number of required input and
output pins (IO). Many designs today are switching from die-limited to IO-limited.
Therefore it is important to cap the number of input and output pins for a design.
Our proposed pipelined architecture is shown in Fig. 4.23. With the 6-point DCT-IV,
6 outputs of an 18-point DCT-IV are ready upon the completion of the first clock
phase. During the second clock phase, we use the 6-point DCT-IV to compute CGT_6
and also complete the computation of the multi-dimensional cyclic convolution. The remaining
12 outputs of the 18-point DCT-IV are then available at the end of the second clock phase.
Thus we cut the required outputs from a maximum of 36 for the IMDCT to just 12 with
the unified architecture, a 66% reduction. The area savings come from two sources.
The major saving is from removing the CGT_6 computations of 9 multiplications and
21 additions. A secondary saving is due to the fact that the block circular matrix is
no longer on the critical path and thus can afford to use smaller, low-power logic
gates. The critical path for the 36-point MDCT/IMDCT roughly doubles compared to the
non-unified bilinear designs. However, in one clock cycle, two short blocks can be
processed, and MP3 window switching can be accomplished rather quickly. For the 36-point
MDCT/IMDCT, the inputs toggle only on the rising edge of the clock; 6 outputs
are obtained on the falling edge and the other 12 outputs are obtained on the rising
edge. For the 12-point transforms, new inputs are presented on both rising and falling clock edges
and outputs are generated on both rising and falling edges as well. Tables 4.11 and
4.12 show possible input and output assignments for the pipelined architecture of
Fig. 4.23.
The complexities of the proposed unified bilinear algorithms are listed in Table 4.13.
The proposed unified bilinear algorithms are implemented in 16-bit fixed-point arithmetic
with a TSMC 90 nm CMOS standard cell library. The circuit speed and normalized
area are shown in Fig. 4.24. Architecture A is 4.7% slower at top speed than our
proposed bilinear MDCT, and is 7% larger when the speed is the same. Architecture
B is 4.7% slower at top speed than our proposed bilinear MDCT, and is 8% larger
when the speed is the same. We have also compared the fast unified architectures
Table 4.11: 12 and 36 point MDCT and IMDCT input mapping for unified architecture C (pipeline).

Mode (clock edge; SMODE, IMODE) | Input assignment
MDCT36 (rise; 0, 0)             | in(i) = x(i), 0 ≤ i ≤ 35
IMDCT36 (rise; 0, 1)            | in(i) = X(i), 0 ≤ i ≤ 17
MDCT12 (rise/fall; 1, 0)        | in(i) = x(i), 0 ≤ i ≤ 11
IMDCT12 (rise/fall; 1, 1)       | in(i) = X(i), 0 ≤ i ≤ 5
Figure 4.23: Proposed bilinear implementation for the pipelined unified 12 and 36 point MDCT and IMDCT (architecture C).
Table 4.12: 12 and 36 point MDCT and IMDCT output mapping for unified architecture C (pipeline).

Output  | MDCT36 (fall) | MDCT36 (rise) | IMDCT36 (fall)   | IMDCT36 (rise)   | MDCT12 (fall/rise) | IMDCT12 (fall/rise)
out(0)  | X(1)          | X(0)          | -x'(28), -x'(25) | -x'(27), -x'(26) | X(0)               | -x'(9), -x'(8)
out(1)  | X(4)          | X(2)          | -x'(31), -x'(22) | -x'(29), -x'(24) | X(1)               | -x'(10), -x'(7)
out(2)  | X(7)          | X(3)          | -x'(34), -x'(19) | -x'(30), -x'(23) | X(2)               | -x'(11), -x'(6)
out(3)  | X(10)         | X(5)          | x'(1), -x'(16)   | -x'(32), -x'(21) | X(3)               | x'(0), -x'(5)
out(4)  | X(13)         | X(6)          | x'(4), -x'(13)   | -x'(33), -x'(20) | X(4)               | x'(1), -x'(4)
out(5)  | X(16)         | X(8)          | x'(7), -x'(10)   | -x'(35), -x'(18) | X(5)               | x'(2), -x'(3)
out(6)  |               | X(9)          |                  | x'(0), -x'(17)   |                    |
out(7)  |               | X(11)         |                  | x'(2), -x'(15)   |                    |
out(8)  |               | X(12)         |                  | x'(3), -x'(14)   |                    |
out(9)  |               | X(14)         |                  | x'(5), -x'(12)   |                    |
out(10) |               | X(15)         |                  | x'(6), -x'(11)   |                    |
out(11) |               | X(17)         |                  | x'(8), -x'(9)    |                    |
(A, B) with the bilinear 36-point MDCT and [9, 37], and separately compared the pipelined
architecture with a low complexity MDCT [27]. The fast unified architectures A and
B are 9% faster than that of [37] and 13% faster than that of [9] at top speed. At
higher speeds, the proposed designs can be more than 27% smaller. The pipelined
architecture is 26% faster than [27] at top speed and 19% smaller if the speed is the
same. The three proposed unified architectures can process not only the forward and
inverse transforms, but also the short and long blocks in the MP3 application, whereas
the compared reference designs can only compute the forward MDCT for the long block.
For architecture B and the pipelined architecture, two short blocks are processed in one
clock cycle.
4.4. BILINEAR ALGORITHMS FOR 4·3^n-POINT MDCT/IMDCT
Table 4.13: Complexities of unified 12 and 36 point MDCT and IMDCT architectures for MP3 application. Note that M and A refer to multiplication and addition respectively.

Architecture    Arithmetic complexity    Critical delay, 36-point    Critical delay, 12-point
A               36M + 154A               M + 10A                     M + 6A
B               36M + 154A               M + 10A                     M + 6A
C (pipeline)    27M + 131A               2M + 12A                    M + 6A
Figure 4.24: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for unified 12 and 36 point MDCT and IMDCT architectures (A, B and pipeline), with comparison to the 36-point MDCT architectures in the literature (Ref. [9], Ref. [37], and the bilinear designs of Fig. 4.21 and Fig. 4.22; left panel: fast implementation, right panel: pipelined implementation).
4.5 Discussion and conclusion
Forward and inverse modified discrete cosine transforms (MDCT/IMDCT) are widely used for subband coding in the analysis and synthesis filterbanks of time domain aliasing cancellation (TDAC). Many international audio coding standards rely heavily on fast algorithms for the MDCT/IMDCT. In this chapter we have presented hardware efficient bilinear algorithms to compute the MDCT/IMDCT of 2^n and 4·3^n points. The algorithms for composite lengths have practical applications in MP3 audio encoding and decoding. It is known that the MDCT/IMDCT can be converted to the type-IV discrete cosine transform (DCT-IV). Using group theory, our approach decomposes the DCT-IV transform kernel matrix into groups of cyclic and Hankel product matrices. Bilinear algorithms are then applied to efficiently evaluate these groups. When implemented in VLSI, bilinear algorithms improve the critical path delays over existing solutions. For MPEG-1/2 layer III (MP3) audio, we propose three different versions of unified hardware architectures handling both the short and long blocks and the forward and inverse transforms.
Chapter 5
Modulated complex lapped transform
This chapter presents a new algorithm for the modulated complex lapped transform (MCLT) with a sine window. It is shown that by merging the window with the main computation, both the real and the imaginary parts of the MCLT with 2N inputs can be obtained from two N-point discrete cosine transforms (DCT) of appropriate inputs. The resultant algorithm is computationally very efficient. The value of N can be, in general, any even number. When N is a power of 2, the proposed algorithm uses only N log N + 2 multiplications, none of which are outside the DCT blocks.
5.1 Background and prior work
The modulated complex lapped transform (MCLT) is structured as a cosine- and sine-modulated filter bank that maps overlapping blocks of a real-valued signal into blocks of complex-valued transform coefficients [33]. It is a special form of a two-times oversampled discrete Fourier transform (DFT) filter bank performing frequency decomposition. Since the reconstruction formula of the MCLT is not unique, the MCLT allows more flexible implementations of audio enhancement and encoding systems than the DFT. Recent MCLT applications include acoustic echo cancellation [33]
and audio watermarking [34], which uses the phase information from the imaginary coefficients.
The N-point MCLT of a 2N-point input sequence {x(n)} is defined as [33]

y(k) = Σ_{n=0}^{2N−1} [ p_c(n,k) − j p_s(n,k) ] x(n),  k = 0, 1, ..., N − 1,   (5.1)

where the real and imaginary parts of the MCLT kernel are defined as

p_c(n,k) = √(2/N) h(n) cos( (2n + 1 + N)(2k + 1)π / (4N) ),
p_s(n,k) = √(2/N) h(n) sin( (2n + 1 + N)(2k + 1)π / (4N) ).   (5.2)

Function h(n) in (5.2) is the window function. The most common choice for h(n) is the sine window specified as

h(n) = −sin( (2n + 1)π / (4N) ),   (5.3)

which is used in many applications such as MPEG-1 and MPEG-2 since it permits perfect reconstruction [31,41].
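As a concrete reference for these definitions, the sketch below evaluates (5.1)-(5.3) directly. This is an O(N^2) computation, useful only as ground truth for checking faster algorithms; NumPy and the function name `mclt_direct` are our own choices, not part of the original text.

```python
import numpy as np

def mclt_direct(x):
    """O(N^2) evaluation of the MCLT per (5.1)-(5.3) with the sine window."""
    N = len(x) // 2                                    # 2N inputs, N output bins
    n = np.arange(2 * N)
    h = -np.sin((2 * n + 1) * np.pi / (4 * N))         # sine window (5.3)
    k = np.arange(N)[:, None]
    arg = (2 * n + 1 + N) * (2 * k + 1) * np.pi / (4 * N)
    p_c = np.sqrt(2.0 / N) * h * np.cos(arg)           # real kernel of (5.2)
    p_s = np.sqrt(2.0 / N) * h * np.sin(arg)           # imaginary kernel of (5.2)
    return (p_c - 1j * p_s) @ x                        # y(k) of (5.1)
```

A call such as `mclt_direct(np.ones(16))` returns the 8 complex MCLT bins of a constant 16-sample block; the broadcasting over `k` simply materializes the full kernel matrices.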
The real part of the MCLT is the forward/direct modulated lapped transform (MLT) [31], which is used to implement transform coding in video and audio compression applications [46]. The MLT has also been referred to as time domain aliasing cancellation (TDAC) [41] and cosine modulated filter banks [52]. The calculation of the MLT involves scaling the input with a window function and then applying a modified discrete cosine transform (MDCT). Existing fast MDCT algorithms are either FFT-based or DCT-based with pre- and post-permutations [10,15,18]. The computational efficiency of the MLT can be improved by combining the window function with the MDCT [18,24,32,45].

The original sequence {x(n)} can be recovered from the MCLT by using either its real part, its imaginary part, or both [34]. If only the real part is used, the inverse transform is the same as the inverse MLT (IMLT), which has been studied in detail [32]. Here we focus on fast algorithms for the forward/direct MCLT, where both real and imaginary parts are required.
As a complex extension of the MLT, the MCLT shares many fast algorithms with the MLT. Malvar has shown that the real part of the MCLT with an arbitrary window function can be obtained from a discrete cosine transform (DCT) of type IV, and the imaginary part from a discrete sine transform (DST) of type IV [33]. However, the computational complexity is affected by applying the window function to the input before using the DCT or the DST. Later, FFT-based MCLT algorithms were developed that merge a sine window [34,49] with the main computation for improved computational efficiency. Fig. 5.1 shows a patented computational flow using an FFT-based algorithm. Another recently proposed MCLT algorithm applicable to an arbitrary
Figure 5.1: Flow graph for the Ref. [34] implementation of the MCLT: the 2N real inputs x(0), ..., x(2N−1) feed a 2N point FFT producing the N+1 complex coefficients u(0), ..., u(N), which N+1 complex rotations c(k) turn into the N complex outputs y(0), ..., y(N−1). Note that c(k) = W_8(2k + 1) W_4N(k).
but symmetric window uses two DCTs to compute the real and imaginary parts of the MCLT separately [17]. However, this algorithm employs permutation stages with non-constant multiplications outside the main computation module, limiting its efficiency. In addition, this algorithm involves coefficient division, which can be numerically unstable when implemented in fixed-point precision for large transform lengths. The flow graph of this algorithm is shown in Fig. 5.2.
In this chapter, we introduce a new MCLT algorithm with lower computational complexity than other algorithms available today. We use a sine window and merge its computation with the main computational modules. Our algorithm for the 2N-input
Figure 5.2: Flow graph for the Ref. [17] implementation of the MCLT. Note that c(i) = h(N − 1 − i)/h(i) and d(i) is a constant for a sine window. Each circle represents an addition.
MCLT is based on the evenly stacked modified discrete cosine transform (C^E) and the evenly stacked modified discrete sine transform (S^E). The N-point evenly stacked MDCT and MDST are, respectively, defined as [41]:

C_k^E = Σ_{n=0}^{N−1} x(n) cos( (π/N)(2n + 1 + N/2) k ),  0 ≤ k ≤ N − 1,   (5.4)

S_k^E = Σ_{n=0}^{N−1} x(n) sin( (π/N)(2n + 1 + N/2) k ),  1 ≤ k ≤ N.   (5.5)

We show that the 2N-point C^E and S^E may be computed from two N-point DCTs of appropriately folded input sequences. The real and imaginary parts of the MCLT can then be obtained by adding the outputs of scaled C^E and S^E with at most two extra multiplications.
5.2 Proposed algorithm
We now show that both the real and the imaginary parts of the MCLT may be obtained from two N-point DCTs. However, unlike the N-point DCT, the input sequence of an MCLT has 2N points. For mathematical convenience, we use two intermediate transforms ADCT and ADST with 2N input and N + 1 output points, defined as:

ADCT(k) = (1/(2√N)) Σ_{n=0}^{2N−1} cos( πk(2n + 1 + N)/(2N) ) x(n),  0 ≤ k ≤ N,   (5.6)

ADST(k) = (1/(2√N)) Σ_{n=0}^{2N−1} sin( πk(2n + 1 + N)/(2N) ) x(n),  0 ≤ k ≤ N.   (5.7)
It is clear from (5.4) and (5.6) that ADCT is equivalent to a 2N-point C^E scaled by the constant 1/(2√N). Similarly, from (5.5) and (5.7), ADST is equivalent to a 2N-point S^E scaled by the same constant.
One can show that the transforms ADCT and ADST (and equivalently C^E and S^E) are related to the DCT. To convert the ADCT into an N-point DCT, note that in the ADCT computation of (5.6), the sequence components x(n + 3N/2) and x(3N/2 − n − 1) for 0 ≤ n < N/2 multiply the same cosine term cos(πk(2n + 1)/(2N)). Similarly, the components x(n − N/2) and x(3N/2 − n − 1) for N/2 ≤ n < N also multiply this same cosine term. Further, as n ranges over the stated intervals, the x component indices span the entire input sequence from 0 to 2N − 1. To take advantage of this, define a sequence {xc(n)} as:

xc(n) = x(n + 3N/2) + x(3N/2 − n − 1),  if 0 ≤ n < N/2,
xc(n) = x(n − N/2) + x(3N/2 − n − 1),   if N/2 ≤ n < N.   (5.8)
It is then obvious that

ADCT(k) = (1/(2√N)) Σ_{n=0}^{N−1} cos( πk(2n + 1)/(2N) ) xc(n),  0 ≤ k ≤ N.   (5.9)
Thus for 0 ≤ k ≤ N − 1, the ADCT of the sequence {x(n)} is the same as the N-point DCT (type II) of the sequence {xc(n)} except for a constant multiplicative factor 1/(2√N). These ADCT outputs can therefore be computed using any available DCT algorithm [25,43] without incurring additional computational penalty except possibly one multiplication. Further, (5.9) also shows that

ADCT(N) = 0.   (5.10)
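The folding (5.8) and the resulting identity (5.9)-(5.10) can be checked numerically. The sketch below (NumPy, with function names of our choosing) computes ADCT(k) both from the definition (5.6) and from the DCT-II of {xc(n)}:

```python
import numpy as np

def adct_definition(x):
    """ADCT per definition (5.6): 2N inputs, N+1 outputs."""
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N + 1)[:, None]
    return (np.cos(np.pi * k * (2 * n + 1 + N) / (2 * N)) @ x) / (2 * np.sqrt(N))

def adct_via_dct(x):
    """ADCT via the folded sequence xc of (5.8) and an N-point DCT-II, per (5.9)."""
    N = len(x) // 2
    m = np.arange(N // 2)
    xc = np.concatenate([x[m + 3 * N // 2] + x[3 * N // 2 - m - 1],  # 0 <= n < N/2
                         x[m] + x[N - m - 1]])                       # N/2 <= n < N
    n = np.arange(N)
    k = np.arange(N + 1)[:, None]                  # rows 0..N; row N vanishes, (5.10)
    return (np.cos(np.pi * k * (2 * n + 1) / (2 * N)) @ xc) / (2 * np.sqrt(N))
```

Both functions agree to machine precision for any even N, and the last output of either is (numerically) zero, illustrating (5.10).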
Transform ADST can be related to the DCT in a similar way. In the definition (5.7) of the ADST, the sequence components x(n + 3N/2) and −x(3N/2 − n − 1) when 0 ≤ n < N/2, as well as x(n − N/2) and −x(3N/2 − n − 1) when N/2 ≤ n < N, all multiply the same sine term sin(πk(2n + 1)/(2N)). Again, the indices of these four x terms span the entire input index range, 0 to 2N − 1, as n ranges over the specified intervals. To take advantage of this, define the sequence {xs(n)} as

xs(n) = x(n + 3N/2) − x(3N/2 − n − 1),  if 0 ≤ n < N/2,
xs(n) = x(n − N/2) − x(3N/2 − n − 1),   if N/2 ≤ n < N.   (5.11)
One can then see that

ADST(k) = (1/(2√N)) Σ_{n=0}^{N−1} sin( πk(2n + 1)/(2N) ) xs(n),  0 ≤ k ≤ N.   (5.12)
Equation (5.12) shows that, except for the scaling factor 1/(2√N), ADST(k), 1 ≤ k ≤ N, is the N-point DST of type II and can be obtained using any available DST algorithm. But since we are already using the DCT for the ADCT computation, we prefer to use it for the ADST computation as well. For this, we employ an approach similar to that of [56]. Define a new sequence {x̃s(n)} as

x̃s(n) = (−1)^n xs(n).   (5.13)
Eq. (5.12) can then be rewritten as

ADST(k) = (1/(2√N)) Σ_{n=0}^{N−1} cos( (2n + 1)π/2 − πk(2n + 1)/(2N) ) (−1)^n xs(n)
        = (1/(2√N)) Σ_{n=0}^{N−1} cos( π(N − k)(2n + 1)/(2N) ) x̃s(n),  0 ≤ k ≤ N.   (5.14)
Equation (5.14) shows that when 1 ≤ k ≤ N, except for the constant multiplicative factor 1/(2√N), the k-th output of the ADST of {x(n)} is the same as the (N − k)-th output of the N-point DCT (type II) of {x̃s(n)}. As before, these ADST outputs can therefore be computed using any available DCT algorithm with at most one additional multiplication. Equation (5.14) also shows that

ADST(0) = 0.   (5.15)
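The companion identity for the ADST can be verified the same way; the sketch below (again with our own function names) compares the definition (5.7) against the sign-flipped, index-reversed DCT-II route of (5.11), (5.13) and (5.14):

```python
import numpy as np

def adst_definition(x):
    """ADST per definition (5.7): 2N inputs, N+1 outputs."""
    N = len(x) // 2
    n = np.arange(2 * N)
    k = np.arange(N + 1)[:, None]
    return (np.sin(np.pi * k * (2 * n + 1 + N) / (2 * N)) @ x) / (2 * np.sqrt(N))

def adst_via_dct(x):
    """ADST via xs of (5.11), the sign flip (5.13) and the reversed DCT-II (5.14)."""
    N = len(x) // 2
    m = np.arange(N // 2)
    xs = np.concatenate([x[m + 3 * N // 2] - x[3 * N // 2 - m - 1],
                         x[m] - x[N - m - 1]])
    xst = (-1.0) ** np.arange(N) * xs               # x-tilde_s of (5.13)
    n = np.arange(N)
    k = np.arange(N)[:, None]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * N)) @ xst   # N-point DCT-II
    # ADST(0) = 0 by (5.15); ADST(k) = DCT(N - k)/(2 sqrt(N)) for 1 <= k <= N
    return np.concatenate([[0.0], dct[::-1] / (2 * np.sqrt(N))])
```

The index reversal `dct[::-1]` is exactly the (N − k) re-indexing of (5.14); no extra arithmetic is introduced beyond the shared 1/(2√N) scale.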
In the following two subsections, we show that the real and imaginary parts of the
MCLT can be obtained directly from the ADCT and ADST. This would then allow
us to compute MCLT from the DCTs of {xc(n)} and {xs(n)}.
5.2.1 The real part of the MCLT
The real part of the MCLT kernel p_c(n,k) given by (5.2), with the sine window function of (5.3), can be simplified using the trigonometric identity sin α cos β = (1/2)(sin(α − β) + sin(α + β)) as:

p_c(n,k) = p'_c(n,k) + p''_c(n,k),   (5.16)

where

p'_c(n,k) = (1/√(2N)) sin( kπ(2n + 1 + N)/(2N) + π/4 ),
p''_c(n,k) = −(1/√(2N)) sin( (k + 1)π(2n + 1 + N)/(2N) − π/4 ).   (5.17)

The expressions for p'_c and p''_c in (5.17) can be further simplified by using sin(α ± π/4) = (1/√2)(sin α ± cos α) as

p'_c(n,k) = (1/(2√N)) [ sin( kπ(2n + 1 + N)/(2N) ) + cos( kπ(2n + 1 + N)/(2N) ) ],
p''_c(n,k) = (1/(2√N)) [ cos( (k + 1)π(2n + 1 + N)/(2N) ) − sin( (k + 1)π(2n + 1 + N)/(2N) ) ].

From these relations, one gets the real part of the MCLT of the sequence x as:

Re[y(k)] = Σ_{n=0}^{2N−1} p'_c(n,k) x(n) + Σ_{n=0}^{2N−1} p''_c(n,k) x(n)
         = ADST(k) + ADCT(k) − ADST(k+1) + ADCT(k+1),  0 ≤ k < N.   (5.18)
5.2.2 The imaginary part of the MCLT
The imaginary part of the MCLT, Im[y(k)], can also be computed through the ADCT and ADST. The imaginary part of the MCLT kernel p_s(n,k) given by (5.2), with the sine window function of (5.3), can be simplified using the identity sin α sin β = (1/2)(cos(α − β) − cos(α + β)) as:

p_s(n,k) = p'_s(n,k) + p''_s(n,k),   (5.19)

where

p'_s(n,k) = −(1/√(2N)) cos( kπ(2n + 1 + N)/(2N) + π/4 ),
p''_s(n,k) = (1/√(2N)) cos( (k + 1)π(2n + 1 + N)/(2N) − π/4 ).   (5.20)

These expressions for p'_s and p''_s can be further simplified by using the identity cos(α ± π/4) = (1/√2)(cos α ∓ sin α) to

p'_s(n,k) = (1/(2√N)) [ sin( kπ(2n + 1 + N)/(2N) ) − cos( kπ(2n + 1 + N)/(2N) ) ],
p''_s(n,k) = (1/(2√N)) [ cos( (k + 1)π(2n + 1 + N)/(2N) ) + sin( (k + 1)π(2n + 1 + N)/(2N) ) ].

From these relations, the imaginary part of the MCLT of the sequence x is obtained as:

Im[y(k)] = Σ_{n=0}^{2N−1} p'_s(n,k) x(n) + Σ_{n=0}^{2N−1} p''_s(n,k) x(n)
         = −ADCT(k) + ADST(k) + ADCT(k+1) + ADST(k+1),  0 ≤ k < N.   (5.21)
5.2.3 The new MCLT algorithm
The above discussion leads to the following MCLT algorithm.
• Create the sequences {xc(n)} and {x̃s(n)} using (5.8), (5.11) and (5.13). This step requires 2N additions.
• Compute the discrete cosine transforms (type II) of {xc(n)} and {x̃s(n)} using any fast DCT algorithm. For example, when N is a power of 2, one may employ the procedure of [25] to compute each DCT in (N/2) log N multiplications and (3N/2) log N − N + 1 additions. The algorithm in [25] recursively partitions the DCT into two DCTs of half the length. Denote them by DCT1 and DCT2. The input for DCT1 is obtained by folding the input sequence, and its output gives the even-indexed DCT components. The input for DCT2 is obtained by folding the input sequence and then multiplying each component by a proper cosine term. A linear combination of the DCT2 output gives the odd-indexed DCT components. The DCT output can be scaled by the constant factor 1/(2√N) (see (5.9) and (5.14)) by combining this factor with the multipliers applied to create the input of DCT2. This gives the scaled odd-indexed DCT components without increasing the multiplication count. Since the algorithm of [25] is recursive, the same process can be applied every time DCT1 is partitioned, scaling half of its outputs. Thus the only extra multiplication needed to scale all the components of the DCT is the one used to scale the 0-th DCT component. The two DCTs with the scaling factor required in (5.9) and (5.14) can therefore be computed in N log N + 2 multiplications and 3N log N − 2N + 2 additions.

The scaled DCTs of {xc(n)} and {x̃s(n)} provide the values of ADCT(k) and ADST(k) for 0 ≤ k ≤ N through relations (5.9), (5.10), (5.14) and (5.15).
• Finally, to obtain the MCLT, first compute

zc(k) = ADCT(k + 1) + ADST(k)  and
zs(k) = ADST(k + 1) − ADCT(k),  0 ≤ k < N.

The real and imaginary parts of the MCLT y are then given from (5.18) and (5.21) by

Re[y(k)] = zc(k) − zs(k)  and
Im[y(k)] = zc(k) + zs(k),  0 ≤ k < N.
This step requires 4N − 2 additions (ignoring the trivial additions of ADCT(N) and ADST(0), each of which is 0).
Thus we can obtain the MCLT in N log N + 2 real multiplications and 3N log N + 4N real additions. The signal flow graph of the proposed algorithm is shown in Fig. 5.3.
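Putting the three steps together, an end-to-end sketch of the proposed algorithm might look like the following. A plain matrix DCT-II stands in for whatever fast DCT module is actually used, so the operation count of this sketch is not the N log N + 2 of the text; the function names are ours.

```python
import numpy as np

def dct2(v):
    """Unnormalized N-point DCT-II: X(k) = sum_n v(n) cos(pi k (2n+1)/(2N))."""
    N = len(v)
    n = np.arange(N)
    k = np.arange(N)[:, None]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * N)) @ v

def mclt_proposed(x):
    """MCLT of 2N samples via two N-point DCT-IIs, following the three steps."""
    N = len(x) // 2
    m = np.arange(N // 2)
    hi, hr = x[m + 3 * N // 2], x[3 * N // 2 - m - 1]
    lo, lr = x[m], x[N - m - 1]
    xc = np.concatenate([hi + hr, lo + lr])                    # fold (5.8)
    xst = (-1.0) ** np.arange(N) * np.concatenate([hi - hr, lo - lr])  # (5.11), (5.13)
    s = 1.0 / (2.0 * np.sqrt(N))
    adct = np.append(s * dct2(xc), 0.0)                        # ADCT(N) = 0, (5.10)
    adst = np.append(0.0, s * dct2(xst)[::-1])                 # (5.14) and (5.15)
    zc = adct[1:] + adst[:-1]                                  # zc(k)
    zs = adst[1:] - adct[:-1]                                  # zs(k)
    return zc - zs, zc + zs                                    # (5.18), (5.21)
```

Comparing the two returned arrays against a brute-force evaluation of the kernels (5.2)-(5.3) confirms the derivation for random inputs.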
Figure 5.3: Flow graph for the proposed MCLT algorithm: pre-processing (2N additions) forms {xc(n)} and {x̃s(n)} from x(0), ..., x(2N−1); the main processing consists of two scaled DCT-II modules (the DST-II being computed from a DCT-II); post-processing (4N − 2 additions) forms zc(k) = ADCT(k+1) + ADST(k) and zs(k) = ADST(k+1) − ADCT(k), with ADCT(N) = 0 and ADST(0) = 0, to produce the real (MLT) and imaginary outputs. The DCT modules are scaled by the constant 1/(2√N).
5.3 Discussion and conclusion
This chapter has proposed a fast algorithm for the modulated complex lapped transform (MCLT) with a sine window. By merging the window with the MCLT computation, we have shown that both the real and imaginary parts of the MCLT can be computed from the same two N-point discrete cosine transforms. Table 5.1 compares the computational complexity of our algorithm to those of [17,24,33,34].

One can see from this table that our algorithm has the smallest number of multiplications (for N = 2^n) of any algorithm available in the literature. As this research drew to a close, it was brought to our attention during publication preparation that the authors of [17] have made improvements. Their paper, titled "A novel DCT-based algorithm for computing the modulated complex lapped transform", was published in
Table 5.1: Complexities of various fast MCLT algorithms for block size N = 2^n.

Algorithm   Window choice   Real multiplications   Real additions
Ref. [33]   any             N log N + 4N           3N log N + 2N
Ref. [17]   sine            N log N + N            3N log N + 4N
Ref. [34]   sine            N log N + N            3N log N + 3N − 2
Proposed    sine            N log N + 2            3N log N + 4N
the November 2006 issue of the IEEE Transactions on Signal Processing. Their new algorithm starts out by treating the window function separately, as before. It then applies a novel matrix factorization and obtains a term that coincidentally cancels the window functions, leaving two N-point DCTs as a result. The overall computational complexity matches ours exactly, though their paper does not count the two extra multiplications due to the scaling of the first row of the DCT modules. Our method, on the other hand, has a different signal flow graph and was proposed from the beginning with the intention of merging the window with the main computational module.
The proposed method is applicable to any even block size N. It does not assume any specific algorithm to compute the DCT; consequently, it may be adapted to any efficient DCT software or hardware module. For example, by using a bilinear algorithm for the DCT, one can obtain a bilinear algorithm for the MCLT, which can produce a very fast implementation in VLSI. Most other algorithms use multiplications outside the main computational blocks of the DCT and therefore cannot lead to a bilinear structure.
Chapter 6
Conclusions
In this final chapter, we summarize our research on fast algorithm derivations for the discrete Hartley transform (DHT), the modified discrete cosine transform (MDCT) and the modulated complex lapped transform (MCLT). To obtain bilinear algorithms for the transforms under consideration, a group theoretic approach proves successful in providing fast VLSI architectures. The use of group theory allows us to partition the transform kernel of interest into smaller cyclic and Hankel matrix products. Then, using bilinear algorithms for these matrix products, the desired architectures can be obtained. Future work on structured bilinear algorithms is discussed at the end.
6.1 Thesis summary
This dissertation presents a formal hardware design approach using bilinear algorithms for fast digital signal processing (DSP) applications. In particular, we focus on the design of application specific integrated circuits (ASICs), where dedicated algorithmic accelerators are implemented in fixed-point arithmetic.
Most signal processing algorithms involve a transform kernel with a known structure. Using concepts from group theory, the kernel matrix can be recursively partitioned into small-length cyclic convolutions and Hankel matrix products. Bilinear algorithms for these smaller blocks are then combined to obtain the required bilinear algorithm of the transform. Bilinear algorithms have a high degree of concurrency, as all multiplication operations are independent of each other and can be computed at the same time. As a result, hardware realizations of bilinear algorithms are much faster than other implementations. The structural modularity also allows simple pipelining and greatly reduces the number of input and output (I/O) pins.
In this dissertation, we develop new bilinear algorithms and implementations for the discrete Hartley transform (DHT), the modified discrete cosine transform (MDCT) and the modulated complex lapped transform (MCLT). In the case of bilinear DHT algorithms, we show that the kernel division based on group theory is identical for all prime power lengths. Our implementations are 20%-60% faster than existing implementations. For the MPEG-1/2 audio layer III (MP3) application, our proposed MDCT algorithms have about 30% lower computational complexity compared with other fast algorithms in the literature. The modularity of our algorithms also permits one to design, for the first time, a unified architecture for forward and inverse transforms using different MP3 block sizes. In the case of the MCLT, we achieve a bilinear algorithm by merging the external sine window function with the main computation through trigonometric manipulation. Compared with other algorithms, our MCLT algorithm requires about N fewer multiplications, where the typical block size N for applications such as audio watermarking is 2048.
6.2 Future work
This dissertation has concentrated on the development of fast algorithms and their hardware implementations. The next design step is taking these fast algorithms and implementations to the system level. For the discrete Hartley transform, much recent work has been published on Hartley domain equalization; a faster DHT implementation will enable faster convergence of the equalization parameters. For the modified discrete cosine transform, it will be of not only academic interest but also practical value to develop a complete MP3 audio encoder and decoder. The initial prototype can be a hybrid of a field programmable gate array and a low cost microcontroller. For the MCLT, one can further investigate computations beyond the most commonly used sine window function.
One should also note that the key computational elements in the fast algorithms developed in this dissertation are bilinear algorithms for 2-point cyclic and Hankel matrix products. Such a bilinear computation generally involves 2 multiplications and 4 additions, or 3 multiplications and 3 additions. It is thus of interest to profile DSP algorithms across a wide range of applications to determine how frequently these common matrix products are used. Most advanced programmable DSP chips already have more than one multiply-accumulate (MAC) unit. In existing forms, these often operate as independent computational elements. It is thus possible to link two or more MAC units together and perform a single-cycle computation of some commonly used cyclic and/or Hankel matrix products. Since the majority of the hardware is already in place, the cost of the additional support or glue logic should be reasonably low.
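As an illustration of the 2-multiplication, 4-addition case (a sketch of our own, not a circuit from the dissertation), the 2-point cyclic convolution y(0) = a(0)b(0) + a(1)b(1), y(1) = a(0)b(1) + a(1)b(0) can be computed bilinearly as:

```python
def cyclic2_bilinear(a0, a1, b0, b1):
    """2-point cyclic convolution via a bilinear algorithm: 2 multiplications
    and 4 additions.  When (b0, b1) is a fixed coefficient pair, the constants
    (b0 + b1)/2 and (b0 - b1)/2 are precomputed, leaving two pre-additions on
    the a inputs, two products, and two post-additions."""
    m0 = (a0 + a1) * ((b0 + b1) * 0.5)   # first pre-addition and product
    m1 = (a0 - a1) * ((b0 - b1) * 0.5)   # second pre-addition and product
    return m0 + m1, m0 - m1              # post-additions give y(0), y(1)
```

The two products are independent of each other, which is exactly the concurrency a linked pair of MAC units (or two parallel multipliers in an ASIC datapath) would exploit; the 3-multiplication, 3-addition variant avoids folding the halving into the constants when both operand pairs are variable.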
Bibliography
[1] N. Anupindi, S. Narayanan, and K. Prabhu. New radix-3 FHT algorithm. Electronics Letters, 26(18):1537-1538, Aug. 1990.
[2] G. Bi. New split-radix algorithm for the discrete Hartley transform. IEEE Trans. Signal Processing, 45(2):297-302, Feb. 1997.
[3] R. E. Blahut. Fast algorithms for digital signal processing. Addison Wesley, 1984.
[4] S. Bouguezel, M. Ahmed, and M. Swamy. A new split-radix FHT algorithm for length-q·2^m DHTs. IEEE Trans. Circuits Syst. I, 51(10):2031-2043, Oct. 2004.
[5] S. Boussakta and A. Holt. Prime factor Hartley and Hartley-like transform calculation using transversal filter-type structure. IEE Proceedings, 136(5):269-277, Oct. 1989.
[6] R. Bracewell. Discrete Hartley transform. J. Opt. Soc. Amer., 73:1832-1835, Dec. 1983.
[7] R. Bracewell. Aspects of the Hartley transform. Proc. IEEE, 82(3):381-387, Mar. 1994.
[8] V. Britanak and K. Rao. Correction to "An efficient implementation of the forward and inverse MDCT in MPEG audio coding". IEEE Signal Processing Letters, 8(10):279, Oct. 2001.
[9] V. Britanak and K. Rao. An efficient implementation of the forward and inverse MDCT in MPEG audio coding. IEEE Signal Processing Letters, 8(2):48-50, Feb. 2001.
[10] V. Britanak and K. Rao. A new fast algorithm for the unified forward and inverse MDCT/MDST computation. Signal Processing, 82(3):433-459, 2002.
[11] C. Chakrabarti and J. JaJa. Systolic architectures for the computation of the discrete Hartley and the discrete cosine transforms based on prime factor decomposition. IEEE Trans. Computers, 39(11):1359-1368, Nov. 1990.
[12] D. Chan, J. Yang, and C. Fang. Fast implementation of MPEG audio coder using recursive formula with fast discrete cosine transforms. IEEE Trans. Speech, Audio Processing, 4(2):144-148, Mar. 1996.
[13] L. Chang and S. Lee. Systolic arrays for the discrete Hartley transform. IEEE Trans. Signal Processing, 39(11):2411-2418, Nov. 1991.
[14] C. Chen, B. Liu, and J. Yang. Recursive architectures for realizing modified discrete cosine transform and its inverse. IEEE Trans. Circuits Syst. II, 50(1):38-45, Jan. 2003.
[15] M. Cheng and Y. Hsu. Fast IMDCT and MDCT algorithms - a matrix approach. IEEE Trans. Signal Processing, 51(1):221-229, Jan. 2003.
[16] Forward Concepts. DSP market bulletin. http://www.fwdconcepts.com/dsp8104.htm, Aug. 2004.
[17] Q. Dai and X. Chen. New algorithm for modulated complex lapped transform with symmetrical window function. IEEE Signal Processing Letters, 11(12):925-928, Dec. 2004.
[18] P. Duhamel, Y. Mahieux, and J. Petit. A fast algorithm for the implementation of filter banks based on time domain aliasing cancellation. Proc. ICASSP, 3:2209-2212, Apr. 1991.
[19] A. Erickson and B. Fagin. Calculating the FHT in hardware. IEEE Trans. Signal Processing, 40(6):1341-1353, June 1992.
[20] M. Balducci et al. Benchmarking of FFT algorithms. Proc. Eng. New Century, pages 328-330, Mar. 1997.
[21] A. Grigoryan. A novel algorithm for computing the 1-D discrete Hartley transform. IEEE Signal Processing Letters, 11(2):156-159, Feb. 2004.
[22] S. Gudvangen and A. Holt. Computation of prime factor DFT and DHT/DCCT algorithms using cyclic and skew-cyclic bit-serial semisystolic IC convolvers. IEE Proceedings, 137(5):373-389, Oct. 1990.
[23] J. Guo. An efficient design for one-dimensional discrete Hartley transform using parallel additions. IEEE Trans. Signal Processing, 48(10):2806-2813, Oct. 2000.
[24] C. Jing and H. Tai. Fast algorithm for computing modulated lapped transform. Electronics Letters, 37(12):796-797, June 2001.
[25] C. Kok. Fast algorithm for computing discrete cosine transform. IEEE Trans. Signal Processing, 45(3):757-760, Mar. 1997.
[26] C. Kwong and K. Shiu. Structured fast Hartley transform algorithms. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-34(4):1000-1002, Aug. 1986.
[27] S. Lee. Improved algorithm for efficient computation of the forward and inverse MDCT in MPEG audio coding. IEEE Trans. Circuits Syst. II, 48(10):990-994, Oct. 2001.
[28] D. Lun and W. Siu. On prime factor mapping for the discrete Hartley transform. IEEE Trans. Signal Processing, 40(6):1399-1411, June 1992.
[29] D. Lun and W. Siu. On prime factor mapping for the discrete Hartley transform. IEEE Trans. Signal Processing, 41(7):2494-2499, July 1993.
[30] H. Malvar. Lapped transforms for efficient transform/subband coding. IEEE Trans. Acoust., Speech, Signal Processing, 38(6):969-978, June 1990.
[31] H. Malvar. Signal processing with lapped transforms. Artech House, 1992.
[32] H. Malvar. Biorthogonal and nonuniform lapped transforms for transform coding with reduced blocking and ringing artifacts. IEEE Trans. Signal Processing, 46(4):1043-1053, Apr. 1998.
[33] H. Malvar. A modulated complex lapped transform and its applications to audio processing. Proc. ICASSP, pages 1421-1424, Mar. 1999.
[34] H. Malvar. Fast algorithm for the modulated complex lapped transform. IEEE Signal Processing Letters, 10(1):8-10, Jan. 2003.
[35] V. Muddhasani and M. D. Wagh. Bilinear algorithms for discrete cosine transforms of prime lengths. Signal Processing, 86:2393-2406, 2006.
[36] V. Nikolajevic and G. Fettweis. Computation of forward and inverse MDCT using Clenshaw's recurrence formula. IEEE Trans. Signal Processing, 51(5):1439-1444, May 2003.
[37] V. Nikolajevic and G. Fettweis. Improved implementation of MDCT in MP3 audio coding. 10th Asia-Pacific Conf. Comm. and 5th Intern. Symp. Multi-Dimen. Mobile Comm., 1:309-312, Aug. 2004.
[38] P. Noll. MPEG digital audio coding. IEEE Signal Processing Magazine, 14(5):59-81, Sept. 1997.
[39] D. Pan. A tutorial on MPEG audio compression. IEEE Multimedia, 2(2):60-74, Summer 1995.
[40] K. Parhi. VLSI digital signal processing systems: design and implementation. John Wiley, 1999.
[41] J. Princen and A. Bradley. Analysis/synthesis filter bank design based on time domain aliasing cancellation. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-34(5):1153-1161, Oct. 1986.
[42] C. Rader. Discrete Fourier transforms when the number of data samples is prime. Proc. IEEE, 56:1107-1108, June 1968.
[43] K. Rao and P. Yip. Discrete cosine transform: algorithms, advantages, applications. Academic Press, 1990.
[44] M. Romdhane, V. Madisetti, and J. Hines. Quick-turnaround ASIC design in VHDL. Kluwer Academic Publishers, 1996.
[45] D. Sevic and M. Popovic. A new efficient implementation of the oddly stacked Princen-Bradley filter bank. IEEE Signal Processing Letters, 1(11):166-168, Nov. 1994.
[46] S. Shlien. The modulated lapped transform, its time-varying forms, and its applications to audio coding standards. IEEE Trans. Speech and Audio Processing, 5(4):359-366, July 1997.
[47] International Consumer Electronics Show. Agenda for Enabling Technology Forums. http://www.enablingtechnologyforums.com/ces2005/index.htm, Jan. 2005.
[48] M. Smith. Application-specific integrated circuits. Addison Wesley, 1997.
[49] H. Tai and C. Jing. Design and efficient implementation of a modulated complex lapped transform processor using pipelining technique. IEICE Trans. Fundamentals, E84-A(5):1280-1286, May 2001.
[50] S. Tai, C. Wang, and C. Lin. FFT and IMDCT circuit sharing in DAB receiver. IEEE Trans. Broadcasting, 49(2):124-131, June 2003.
[51] T. Tsai, T. Chen, and L. Chen. An MPEG audio decoder chip. IEEE Trans. Consumer Electronics, 41(1):89-96, Feb. 1995.
[52] P. Vaidyanathan. Multirate systems and filter banks. Prentice Hall, 1993.
[53] M. Wagh. A new algorithm for the discrete cosine transform of arbitrary number of points. IEEE Trans. Computers, C-29(4):269-277, Apr. 1980.
[54] M. Wagh. Modular algorithms for cyclic convolution of arbitrary length. Lehigh University, Feb. 2005.
[55] M. Wagh. A structured bilinear algorithm for discrete Fourier transform. Lehigh University, Feb. 2005.
[56] Z. Wang. A fast algorithm for the discrete sine transform implemented by the fast cosine transform. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-30(5):814-815, Oct. 1982.
[57] Z. Wang. Fast algorithms for the discrete W transform and for the discrete Fourier transform. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-32:803-816, Aug. 1984.
[58] Z. Wang. A prime factor fast W transform algorithm. IEEE Trans. Signal Processing, 40(9):2361-2368, Sept. 1992.
[59] L. Wanhammar. DSP integrated circuits. Academic Press, 1999.
[60] N. Weste and K. Eshraghian. Principles of CMOS VLSI design: a systems perspective. Addison Wesley, 2nd edition, 1992.
[61] S. Winograd. On computing the discrete Fourier transform. Math. Comput., 32:175-199, Jan. 1978.
[62] J. Wu and J. Shiu. Discrete Hartley transform in error control coding. IEEE Trans. Signal Processing, 39(10):2356-2359, Oct. 1991.
[63] Y. Yao, Q. Yao, P. Liu, and Z. Xiao. Embedded software optimization for MP3 decoder implemented on RISC core. IEEE Trans. Consumer Electronics, 50(4):1244-1249, June 2005.
[64] P. Yeh. Data compression properties of the Hartley transform. IEEE Trans. Acoust., Speech, Signal Processing, 37(3):450-451, Mar. 1989.
[65] Z. Zhao. In-place radix-3 fast Hartley transform algorithm. Electronics Letters, 28(3):319-321, Jan. 1992.
Vita
Xingdong Dai received a Bachelor's degree from Southern Illinois University at Carbondale (SIUC) in 1994 and a Master's degree from Arizona State University (ASU) in 1996, both in electrical engineering. Xingdong was a research associate at ASU, where his study focused on quantum effects in nanometer-scale structures fabricated with a scanning tunneling microscope. In 1996, Xingdong joined the Lucent Technologies Microelectronics Group in Allentown, PA, where he was involved in integrated circuit design of modem and DSL chips. In 1998, he was recognized with the Bell Laboratory President's Silver Award for technical contributions to the soft modem development. From 2000 to 2001, Xingdong was with Spinnaker Networks (acquired by Network Appliance) in Pittsburgh, PA. He contributed to the design and validation of hardware acceleration FPGAs for the file system on high performance network attached storage servers. Since 2002, Xingdong has worked for Agere Systems developing high speed interface IP for telecom, enterprise networking and consumer applications. He is currently a staff engineer in the Mixed Signal R&D organization at LSI Corporation. Xingdong has one published journal paper and holds two issued U.S. patents with additional applications pending. Xingdong is a member of Tau Beta Pi, a national engineering honor society.