
BILINEAR ALGORITHMS

AND

ASIC ARCHITECTURES

FOR

FAST SIGNAL PROCESSING

by

Xingdong Dai

Presented to the Graduate and Research Committee

of Lehigh University

in Candidacy for the Degree of

Doctor of Philosophy

in

Computer Engineering

Lehigh University

2008

UMI Number: 3314490

INFORMATION TO USERS

The quality of this reproduction is dependent upon the quality of the copy submitted. Broken or indistinct print, colored or poor quality illustrations and photographs, print bleed-through, substandard margins, and improper alignment can adversely affect reproduction.

In the unlikely event that the author did not send a complete manuscript and there are missing pages, these will be noted. Also, if unauthorized copyright material had to be removed, a note will indicate the deletion.

UMI Microform 3314490
Copyright 2008 by ProQuest LLC.
All rights reserved. This microform edition is protected against unauthorized copying under Title 17, United States Code.

ProQuest LLC
789 E. Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346

Approved and recommended for acceptance as a dissertation in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

April 25, 2008
Date

April 25, 2008
Accepted Date

Dissertation Director: Meghanad D. Wagh

Committee Members:

Meghanad D. Wagh

Bruce D. Fritchman

Zhiyuan Yan


Bruce A. Dodson

Sandeep Kumar


Acknowledgements

I wish to express appreciation to the members of my doctoral committee, the faculty and staff at Lehigh University, friends and colleagues at LSI Corporation, and family members here and abroad. This dissertation simply would not have been possible without your constant encouragement and unwavering support over many years. Thank you very much for a wonderful experience. I will cherish it forever as this research draws to a close.

I would like to thank my advisor, Dr. Meghanad D. Wagh, for his guidance during my graduate study. Not only did he introduce me to the wonderful world of bilinear algorithms on which this dissertation is focused, he has also shown me the virtue of unselfishness. I remember many times how Dr. Wagh changed his busy schedule in order to provide direction and support for this part-time student.

I would like to thank the members of my doctoral committee: Dr. Zhiyuan Yan, Dr. Bruce D. Fritchman, Dr. Bruce A. Dodson and Dr. Sandeep Kumar. Each contributed many stimulating ideas throughout this research. Their valuable feedback greatly enhanced my dissertation.

During this work, friends and colleagues at LSI Corporation also generously lent their support. From time to time, they substituted for me in conference calls or meetings.

I want to thank my wife, Mingwei, for her cooperation and extreme patience during long work days and late school nights. Last but not least, thanks to my parents for their unconditional love and for providing me with a good education. You both were my inspiration to accomplish this doctoral research work.

I am forever in your debt. Xingdong


Contents

Acknowledgements

Abstract

1 Introduction
1.1 Motivation
1.2 Objective and contributions
1.3 Organization

2 ASIC methodology and mathematical background
2.1 Design methodology
2.1.1 General ASIC flow
2.1.2 ASIC for signal processing
2.2 Performance metric
2.3 Group theory
2.3.1 Group definition
2.3.2 Cyclic and Hankel matrix products
2.4 Bilinear algorithm
2.4.1 Recursive decomposition
2.4.2 Order of computation

3 Discrete Hartley transform
3.1 Background and prior work
3.2 Partitioning the DHT of prime power lengths
3.3 Bilinear algorithm for 2^n-point DHT
3.3.1 16-point DHT
3.3.2 More than 16-point DHT
3.4 Performance analysis for 2^n-point DHT
3.5 Bilinear algorithm for 3^n-point DHT
3.6 Performance analysis for 3^n-point DHT
3.7 Discussion and conclusion

4 Modified discrete cosine transform
4.1 Background and prior work
4.2 Transformation from N-point MDCT/IMDCT to N/2-point DCT-IV
4.2.1 The forward MDCT transformation
4.2.2 The inverse MDCT transformation
4.2.3 The advantage of DCT-IV transformation
4.3 Bilinear algorithms for 2^n-point MDCT/IMDCT
4.4 Bilinear algorithms for 4·3^n-point MDCT/IMDCT
4.4.1 The bilinear MDCT/IMDCT for MP3 audio short block length
4.4.2 The bilinear MDCT/IMDCT for MP3 audio long block length
4.4.3 The unified MDCT/IMDCT architecture for MP3 audio
4.5 Discussion and conclusion

5 Modulated complex lapped transform
5.1 Background and prior work
5.2 Proposed algorithm
5.2.1 The real part of the MCLT
5.2.2 The imaginary part of the MCLT
5.2.3 The new MCLT algorithm
5.3 Discussion and conclusion

6 Conclusions
6.1 Thesis summary
6.2 Future work

Bibliography

Vita

List of Tables

1.1 Research steps and tools for fast bilinear algorithm development.
1.2 Complexities of various MDCT and IMDCT algorithms for MP3 audio. Note that M and A refer to multiplication and addition respectively.
1.3 Complexities of unified 12 and 36 point MDCT and IMDCT architectures for MP3 audio application. Note that M and A refer to multiplication and addition respectively.
1.4 Complexities of various fast MCLT algorithms for block size N = 2^n with a sine window.
2.1 Complexity and delay comparisons for the three implementations of (2.1). Note that M and A refer to multiplication and addition respectively.
2.2 Arithmetic complexity of different decomposition orders.
2.3 Determining the order of combining some bilinear algorithms. Note that matrices identified with * have the special property that the sum of all elements in the first row is 0.
3.1 Hardware complexity of various 2^n-point DHT algorithms.
3.2 Time complexity of various 2^n-point DHT algorithms. Note that M and A stand for multiplier and adder delays respectively.
3.3 Hardware complexity of various 3^n-point DHT algorithms.
3.4 Critical path delay of various 3^n-point DHT algorithms. Note that M and A stand for multipliers and adders respectively.
4.1 Multiplication coefficients used in Fig. 4.7.
4.2 Complexities of various 8 and 16 point MDCT algorithms. Note that M and A refer to multiplication and addition respectively.
4.3 Multiplication coefficients used in Fig. 4.13.
4.4 Complexities of various 12-point MDCT and IMDCT algorithms. Note that M and A refer to multiplication and addition respectively.
4.5 Multiplication coefficients used in Fig. 4.16.
4.6 Complexities of various 36-point MDCT and IMDCT algorithms. Note that M and A refer to multiplication and addition respectively.
4.7 12 and 36 point MDCT and IMDCT input mapping for unified architecture A.
4.8 12 and 36 point MDCT and IMDCT output mapping for unified architecture A.
4.9 12 and 36 point MDCT and IMDCT input mapping for unified architecture B. Note that xA and xB refer to the two 6-point blocks whose MDCT is computed concurrently. Similarly XA and XB represent two independent 6-point transform blocks whose IMDCT is computed concurrently.
4.10 12 and 36 point MDCT and IMDCT output mapping for unified architecture B. Note that XA and XB refer to MDCTs of 6-point sequences xA and xB respectively and are computed concurrently. Similarly x'A and x'B refer to IMDCTs of 6-point transforms XA and XB respectively and are computed concurrently.
4.11 12 and 36 point MDCT and IMDCT input mapping for unified architecture C (pipeline).
4.12 12 and 36 point MDCT and IMDCT output mapping for unified architecture C (pipeline).
4.13 Complexities of unified 12 and 36 point MDCT and IMDCT architectures for MP3 application. Note that M and A refer to multiplication and addition respectively.
5.1 Complexities of various fast MCLT algorithms for block size N = 2^n.

List of Figures

1.1 2006 DSP markets by Forward Concepts.
2.1 Digital ASIC abstraction levels.
2.2 Critical path delay for an addition.
2.3 Critical path delay for a multiplication.
2.4 First implementation of (2.1).
2.5 Second implementation of (2.1).
2.6 Third implementation of (2.1).
2.7 A bilinear algorithm for the 4 point cyclic convolution.
3.1 Flow graph for Ref. [2] implementation of 2^n-point DHT. Note that each rotation can also be implemented with 3 multiplications and 3 additions. The index and the coefficients are: i', k' = 0, 1, ..., N/2 - 1; C(j) = cos(2πj/N) and S(j) = sin(2πj/N), where j = 0, 1, ..., N/4 - 1.
3.2 Flow graph for Ref. [65] implementation of 3^n-point DHT. Note that each rotation can also be implemented with 3 multiplications and 3 additions. The index and the coefficients are: i', k' = 0, 1, ..., N/3 - 1, C(i') = cos(2πi'/N) and S(i') = sin(2πi'/N).
3.3 Flow chart for p^n-point bilinear DHT.
3.4 Hartley group transform for p^n-point bilinear DHT.
3.5 Kernel matrix group division of p^n-point DHT.
3.6 Proposed bilinear algorithm for even-indexed components of a 16-point DHT. Note that multiplication constant c0 = √2.
3.7 Proposed bilinear algorithm for odd-indexed 16-point DHT. Note that the multiplication coefficients are: c0 = √2, c1 = 0.7654, c2 = 0.5412 and c3 = -1.8478.
3.8 Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 8, 16, 32 and 64 point DHTs.
3.9 Proposed bilinear algorithm for 9-point DHT. Note that the multiplication coefficients are: c0 = -0.5, c1 = 0.8660, c2 = -0.5924, c3 = -1.7057, c4 = -0.7660, c5 = -1.6276, c6 = -0.3008, and c7 = -0.6428.
3.10 Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 9 and 27 point DHTs.
4.1 Flow graph for Ref. [9] implementation of N-point MDCT.
4.2 Flow graph for Ref. [27] implementation of N-point MDCT/IMDCT. Note that SDCT is the unnormalized discrete cosine transform.
4.3 Flow graph for the DCT-IV implementation of N-point MDCT.
4.4 Flow graph for the DCT-IV implementation of N-point IMDCT.
4.5 Flow graph for the DCT-IV implementation of 2N-point unified MDCT and IMDCT. Note that IMODE = 0 for MDCT and IMODE = 1 for IMDCT.
4.6 Flow graph for the DCT-IV implementation of 2N-point unified MDCT and IMDCT with reduced IO requirement. Note that for MDCT, IMODE = 0 and in(i) = x(i), i = 0, 1, ..., 2N - 1. For IMDCT, IMODE = 1 and in(k) = X(k), k = 0, 1, ..., N - 1.
4.7 Proposed bilinear implementation of the 8-point DCT-IV. The multiplication coefficients are in Table 4.1.
4.8 The implementations of 16-point MDCT and IMDCT based on the 8-point DCT-IV.
4.9 Unified implementation of the 16-point MDCT and IMDCT employing one 8-point DCT-IV. Note that for MDCT, IMODE = 0, in(i) = x(i), 0 <= i < 16. For IMDCT, IMODE = 1, in(k) = X(k), 0 <= k < 8.
4.10 Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 8 and 16 point MDCTs. Note that Fig. 4.9 is a unified MDCT and IMDCT architecture, while all others compute MDCT only.
4.11 Flow graph for 2·3^n-point bilinear DCT-IV.
4.12 Flow graph for cosine group transform of 2·3^n-point DCT-IV.
4.13 Proposed bilinear implementation of the 6-point DCT-IV.
4.14 The implementations of 12-point MDCT and IMDCT based on the 6-point DCT-IV.
4.15 Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 12-point MDCT and IMDCT.
4.16 Proposed bilinear implementation of multidimensional convolution involved in the 18-point DCT-IV.
4.17 Proposed bilinear implementation of the 18-point DCT-IV.
4.18 Proposed bilinear implementation of the 36-point MDCT.
4.19 Proposed bilinear implementation of the 36-point IMDCT.
4.20 Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 36-point MDCT and IMDCT.
4.21 Proposed bilinear implementation of the unified 12 and 36 point MDCT and IMDCT (architecture A).
4.22 Proposed bilinear implementation of the unified 12 and 36 point MDCT and IMDCT (architecture B).
4.23 Proposed bilinear implementation for the pipelined unified 12 and 36 point MDCT and IMDCT (architecture C).
4.24 Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for unified 12 and 36 point MDCT and IMDCT architectures (A, B and pipeline), with comparison to the 36-point MDCT architectures in the literature.
5.1 Flow graph for Ref. [34] implementation of the MCLT. Note that c(k) = W_8(2k+1) W_{4N}(k).
5.2 Flow graph for Ref. [17] implementation of the MCLT. Note that c(i) = h(N - 1 - i)/h(i) and d(i) is a constant for a sine window. Each circle represents an addition.
5.3 Flow graph for the proposed MCLT algorithm. Note that DCT modules are scaled by a constant of 1/(2√N).

Abstract

This dissertation presents a formal hardware design approach using bilinear algorithms for fast digital signal processing (DSP) applications. In particular, we focus on the design of application specific integrated circuits (ASIC), where dedicated algorithmic accelerators are implemented in fixed-point arithmetic.

Most signal processing algorithms involve a transform kernel with a known structure. Using concepts of group theory, the kernel matrix can be recursively partitioned into computations of small-length cyclic convolutions and Hankel matrix products. Bilinear algorithms for these smaller blocks are then combined together to obtain the required bilinear algorithm of the transform. Bilinear algorithms have a high degree of concurrency, as all multiplication operations are independent of each other and can be computed at the same time. As a result, the hardware realizations of bilinear algorithms are much faster than other implementations. The structural modularity also allows simple pipelining and greatly reduces the number of input and output (IO) pins.

In this dissertation, we develop new bilinear algorithms and implementations for the discrete Hartley transform (DHT), the modified discrete cosine transform (MDCT) and the modulated complex lapped transform (MCLT). In the case of bilinear DHT algorithms, we show that the kernel divisions are identical for all prime power lengths. Our implementations are 20%-60% faster than existing implementations. For the MPEG-1/2 audio layer III (MP3) application, our proposed MDCT algorithms have about 30% lower computational complexity as compared with other fast algorithms in the literature. The modularity of our algorithms also permits one to design, for the first time, a unified architecture for forward and inverse transforms using different MP3 block sizes. In the case of the MCLT, we achieve a bilinear algorithm by merging the external sine window function with the main computation through trigonometric manipulations. As compared with most algorithms, our MCLT algorithm requires about N fewer multiplications, where the typical block size, N, for applications such as audio watermarking is 2048.

Chapter 1

Introduction

1.1 Motivation

The research presented in this dissertation is motivated by the need for fast signal processing algorithms and efficient hardware implementations. The demand for fast signal processing algorithms has been rising for many years, driven primarily by the spread of broadband communications and the wide usage of cellular technologies [16]. Fast and efficient audio and video processing and compression algorithms have been seen as enabling technologies for many emerging real-time multimedia-heavy applications [47]. This dissertation focuses on fast algorithms and architectures for the discrete Hartley transform (DHT), the modified discrete cosine transform (MDCT) and the modulated complex lapped transform (MCLT). Among many signal processing algorithms, these three transforms have attracted particular research interest from both academia and industry.

The discrete Hartley transform is a real-valued transform, and its forward and inverse transforms share the same kernel. The DHT is favored for fast convolution on real data sets, and is frequently used as an alternative to the real-valued discrete Fourier transform (DFT). A recent study finds that the fast Hartley transform (FHT) offers the fastest realization of the DFT when implemented across a variety of general purpose processor platforms [20]. The DHT is such a fundamental signal processing tool that its fast implementations will continue to benefit many signal processing applications.
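As a quick numerical illustration of this shared-kernel property (a direct O(N^2) sketch of the definition, not one of the fast algorithms developed in this dissertation), applying the cas-kernel DHT twice returns the input scaled by N:

```python
import math

def dht(x):
    """Direct discrete Hartley transform using the cas kernel,
    cas(t) = cos(t) + sin(t). The same routine serves as both the
    forward and (up to a 1/N scale) the inverse transform."""
    N = len(x)
    return [sum(x[n] * (math.cos(2 * math.pi * n * k / N) +
                        math.sin(2 * math.pi * n * k / N))
                for n in range(N))
            for k in range(N)]

x = [1.0, 2.0, -0.5, 3.0]
y = [v / len(x) for v in dht(dht(x))]  # forward, then "inverse" with the same kernel
assert all(abs(a - b) < 1e-9 for a, b in zip(x, y))
```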


The modulated lapped transform (MLT) is a cosine-modulated filter bank based on concepts of time domain aliasing cancellation (TDAC), which permits perfect reconstruction. A key signal processing component within the framework of the MLT is the modified discrete cosine transform (MDCT). The MLT and the MDCT are essential parts of many state-of-the-art audio compression technologies, such as MPEG audio, Dolby AC-3, and Windows Media Audio (WMA). In its complex extended form, the modulated complex lapped transform (MCLT) has been proposed for many exciting new applications such as audio watermarking and audio identification. As many of these functions are integrated onto mobile devices, fast and efficient implementations are needed to free the host processor from these CPU-taxing computations.
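The perfect reconstruction property of the TDAC framework can be checked directly. The sketch below uses a common textbook MDCT definition and a sine analysis/synthesis window satisfying the Princen-Bradley condition; the normalization (the 2/N factor in the inverse) is an assumption of this sketch and may differ from the conventions adopted later in this dissertation. Overlap-adding the windowed IMDCT outputs of two 50%-overlapped blocks recovers the shared samples exactly:

```python
import math

def mdct(x, w):
    """Direct MDCT: 2N windowed samples -> N coefficients."""
    N = len(x) // 2
    return [sum(w[n] * x[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N)) for k in range(N)]

def imdct(X, w):
    """Direct windowed IMDCT: N coefficients -> 2N samples."""
    N = len(X)
    return [w[n] * (2.0 / N) * sum(X[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                                   for k in range(N)) for n in range(2 * N)]

N = 6                                    # half-block size (12-point MDCT, as in MP3 short blocks)
w = [math.sin(math.pi / (2 * N) * (n + 0.5)) for n in range(2 * N)]  # Princen-Bradley sine window
x = [math.sin(0.7 * n) + 0.3 * n for n in range(3 * N)]              # arbitrary test signal

y1 = imdct(mdct(x[0:2 * N], w), w)             # block 1: samples 0 .. 2N-1
y2 = imdct(mdct(x[N:3 * N], w), w)             # block 2: samples N .. 3N-1 (50% overlap)
recon = [y1[N + j] + y2[j] for j in range(N)]  # overlap-add of the shared region
assert all(abs(recon[j] - x[N + j]) < 1e-9 for j in range(N))
```

The time-domain aliasing introduced by each N-coefficient block cancels between the second half of one block and the first half of the next, which is exactly the TDAC mechanism described above.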

This research starts out with a modest goal of designing fast MDCT algorithms and circuits for MPEG-1/2 layer III (MP3) audio applications. Since 2001, when Apple Computer first made the iPod a household name, the demand for portable MP3 players has been booming. In a report by the leading market research firm IDC, shipments of MP3 flash memory music players are predicted to surge to nearly 124 million units in 2009, a 370% increase from the 26.4 million units shipped worldwide in 2004. Combined with the MP3 player category, all compressed audio players are expected to reach 945.5 million units shipped and $145.4 billion in revenue worldwide by 2009. The forward and inverse MDCT algorithms involve intensive computations and are frequently seen as implementation bottlenecks for MP3 encoders and decoders. To further complicate the computational requirement, the sample sizes for MP3 audio data are not a power of two. As a result, today only a few fast MDCT algorithms have been published on MP3 audio processing.

Recently we have also witnessed the coming of multimedia handsets. A multimedia handset is a handset capable of processing various kinds of data including text, graphics, animation, audio and video, in addition to voice services. In 2006, nearly all of the major manufacturers have developed music handsets. A recent report from Market and Research has predicted that handsets will gradually transform from communication tools into multimedia terminals with the appearance of completely new consumer electronics such as DV handsets, TV handsets, game handsets, etc. A lot of computing power is required to make the multimedia handset a reality, since information will need to be digitally sampled, encoded, compressed and transmitted at the source, then received, decoded, decompressed and played back at the destination. In addition, all of these data transformations must take place either in real time for interactive communications and entertainment, or within a well constrained time frame because of limited on-board memory. Today a significant amount of computing load is shared by a programmable digital signal processor (DSP). According to the DSP analyst Forward Concepts, 70% of programmable DSPs are shipped for wireless applications including handsets (Fig. 1.1).

[Pie chart: $8.3 billion worldwide DSP market; wireless 73%, with wireline and computer segments making up the remainder.]

Figure 1.1: 2006 DSP markets by Forward Concepts.

Algorithm research on signal processing has a long history of concentrating on reducing computational complexity in terms of arithmetic operations. However, this approach often ignores the underlying hardware platform, making the algorithms most suitable for general purpose processors where computations are executed sequentially in a shared arithmetic logic unit (ALU). Attention is paid to minimizing data transfers between the memory and the ALU. A fast algorithm is one that moves data through the ALU quickly and does so in the fewest passes. Since most signal processing algorithms involve a transform kernel of matrix-vector multiplications, a programmable DSP takes advantage of this by offering single-cycle multiply-accumulate (MAC) instructions and low-overhead loop structures. As a result, for many signal processing applications, better performance can be achieved using a programmable DSP over a general purpose processor. In the past decade, in order to meet the rising demand of computational requirements, many important advances in processor technology have been made. Among them, very long instruction word (VLIW), single instruction multiple data (SIMD), multi-threading (MT) and multi-core processors are just a few notable improvements. However, these advanced technologies require close interactions between the compiler and the silicon to achieve the much touted performance. This can mean a prolonged development cycle, in addition to commanding a high cost and power premium.

An attractive approach to realizing a fast algorithm is to apply dedicated hardware acceleration, in particular through the use of application specific integrated circuits (ASIC) [40,59]. For an ASIC implementation, algorithm speed is intrinsically tied to critical path delay. Improvements to the critical path can be explored at several abstraction levels, including architecture, logic, and circuit [60]. At the architecture level, concurrency available in an algorithm can be exploited by pipelining and parallel processing techniques. Pipelining reduces the effective critical path by performing independent computations in an interleaved manner. Parallel processing, on the other hand, performs these computations concurrently using duplicated hardware. At the logic level, fast logic styles are favored on the critical path, while area-efficient styles can be used to meet cost and power goals. For example, one can choose a carry look-ahead adder (CLA) for the critical path and use a carry propagation adder (CPA) on non-critical paths. At the circuit level, sub-micron CMOS offers the best combination of speed, capacity, and cost among commercially available options. From a vendor supplied ASIC library, selection of cell type, size and drive strength can be made through synthesis to meet a set of pre-defined design objectives such as area and power.

A frequently cited advantage of general purpose processors (GPP) and digital signal processors (DSP) is faster time-to-market than an ASIC implementation. This makes GPPs and DSPs good choices for initial prototyping when functional requirements are not completely specified. However, once the performance bottleneck is well understood, an ASIC can provide a targeted high performance solution that is also low in cost and power. This is because for a given application, the computational requirements for the data flow can often be isolated from those for the control flow. Data flow is frequently point-to-point in nature and has fewer conditional branches. In addition, similar to many software project developments, the hardware design cycle can be significantly reduced if a pre-verified signal processing library is used. Once the computation intensive tasks are off-loaded, the rest of the processing can be carried out in a more flexible DSP or GPP with relaxed MIPS requirements. Consequently this hybrid approach can generate better system performance and savings in cost and power.

This ASIC-driven approach, however, was not physically possible or economically feasible in micron-width technologies. Large transistor sizes limited the amount of hardware that could be put on a die. In sub-micron technologies, ASIC design has increasingly crossed over from die-limited to pad-limited [48]. This means that for ASIC design, transistors are cheap while inputs and outputs (IO) are expensive. Therefore more diversified functions and larger scale signal processing can be implemented with little cost penalty, if the number of ASIC inputs and outputs can be capped. This change in ASIC technology permits a fresh look into the hardware acceleration approach to fast signal processing.

1.2 Objective and contributions

The objective of this research is to develop fast bilinear algorithms and efficient hardware architectures, particularly for three types of signal transforms: the discrete Hartley transform (DHT), the modified discrete cosine transform (MDCT) and the modulated complex lapped transform (MCLT). Early research [35] has shown that implementations based on bilinear algorithms can provide very fast and efficient architectures for the discrete cosine transform (DCT) of certain lengths. To obtain bilinear algorithms for the DHT, MDCT and MCLT, we plan to employ a similar group theoretic approach, which allows us to partition a transform kernel into smaller cyclic convolutions and Hankel products. Applying bilinear algorithms to these smaller matrix products, we will then be able to obtain the desired architectures. For the transforms under consideration, we aim for a well defined, scalable signal flow graph of the computation which will be ideal for ASIC implementation. Table 1.1 summarizes the steps and tools utilized in this study.

Table 1.1: Research steps and tools for fast bilinear algorithm development.

Research steps          Research tools
Algorithm design        Matlab
Hardware architecture   Verilog
Design synthesis        Synopsys Design Compiler
Design technology       TSMC 90nm CMOS

The major contributions of this dissertation include:

1. Discrete Hartley Transforms (DHT) of prime power lengths

• Theory: Using group theory, it is first shown that the transform kernel for lengths 2^n and p^n for odd prime p can be partitioned into two individual sub-matrices, a DHT of smaller size and a Hartley group transform. The Hartley group transform further decomposes into a cyclic matrix and a Hartley group transform of smaller size.

• Implementation: Hardware complexity is competitive with the referenced algorithms up to 256 points with a pipelined and folded design. For lengths less than 64, our proposed implementations are 20% to 60% faster on average. The pipelined design can further reduce the output pins by 50%.

2. Modified Discrete Cosine Transforms (MDCT) of lengths 2^n

• Theory: Using group theory, it is first shown that the transform kernel matrix converts to a single Hankel matrix.

• Implementation: Algorithm complexity is competitive for 8 and 16 point MDCT. However, our proposed structured bilinear implementations are 30% to 45% faster than the referenced designs.

3. Modified Discrete Cosine Transforms (MDCT) of composite lengths 4·3^n

• Application: MPEG-1/2 audio layer III (MP3) employs two MDCTs. The long block (36-point) is normally used to provide better frequency resolution. The short block (12-point) is used whenever better time resolution is needed. The switch from the long block to the short block occurs if a distortion is expected in the frequency domain coding of an audio signal.

• Theory: Using group theory, it is first shown that the transform kernel can be partitioned into two individual sub-matrices, a type-IV discrete cosine transform of smaller size and a cosine group transform. The cosine group transform further decomposes into a block cyclic matrix and a cosine group transform of smaller size.

• Algorithm: We have designed the most efficient MDCT algorithms for MP3 audio processing (Table 1.2).

• Implementation: Based on a group theoretic division of transform kernels, we are the first to develop three versions of unified hardware architectures that compute transforms of different block sizes, in addition to the forward and inverse transforms. Not only do our single function designs perform better than existing solutions, the unified architectures also show superior performance against non-unified existing designs (Table 1.3).

4. Modulated Complex Lapped Transforms (MCLT) of lengths divisible by four

• Applications: Digital audio watermarking, audio and video denoising. The transform sizes are frequently larger than 2048.


Table 1.2: Complexities of various MDCT and IMDCT algorithms for MP3 audio. Note that M and A refer to multiplication and addition respectively.

Transform        Algorithm   Arithmetic complexity   Critical delay
12-point MDCT    Proposed    9M + 29A                M + 6A
12-point MDCT    Ref. [9]    11M + 39A               2M + 6A
12-point MDCT    Ref. [37]   13M + 27A               2M + 5A
12-point MDCT    Ref. [27]   11M + 29A               3M + 7A
12-point IMDCT   Proposed    9M + 23A                M + 5A
12-point IMDCT   Ref. [9]    11M + 33A               2M + 5A
12-point IMDCT   Ref. [37]   13M + 21A               2M + 4A
12-point IMDCT   Ref. [27]   11M + 23A               3M + 6A
36-point MDCT    Proposed    36M + 150A              M + 10A
36-point MDCT    Ref. [9]    41M + 165A              2M + 9A
36-point MDCT    Ref. [37]   47M + 129A              2M + 8A
36-point MDCT    Ref. [27]   43M + 133A              3M + 22A
36-point IMDCT   Proposed    36M + 132A              M + 9A
36-point IMDCT   Ref. [9]    51M + 151A              2M + 8A
36-point IMDCT   Ref. [37]   51M + 115A              2M + 7A
36-point IMDCT   Ref. [27]   43M + 115A              3M + 21A

Table 1.3: Complexities of unified 12 and 36 point MDCT and IMDCT architectures for MP3 audio application. Note that M and A refer to multiplication and addition respectively.

Architecture   Arithmetic complexity   Critical delay (36-point)   Critical delay (12-point)
A              36M + 150A              M + 10A                     M + 6A
B              36M + 154A              M + 10A                     M + 6A
C (pipeline)   27M + 131A              2M + 12A                    M + 6A


• Theory: Using trigonometric identities, it is shown that the window function can be completely merged with the transform kernel so as to preserve the bilinearity of the algorithm. The kernel can then be decomposed into a discrete cosine transform and a discrete sine transform done concurrently.

• Algorithm: We propose an efficient algorithm for the MCLT with a sine window function (Table 1.4). This solution also preserves the bilinearity of the design. Our algorithm first breaks down the MCLT into a linear combination of an evenly-stacked modified discrete cosine transform (E-MDCT) and an evenly-stacked modified discrete sine transform (E-MDST). We show that the E-MDCT can be converted to a discrete cosine transform of type II (DCT-II), and similarly the E-MDST to a discrete sine transform of type II (DST-II). With a known transformation technique, the DST-II can be computed through a DCT-II of the same size. Thus the entire MCLT can share a single DCT-II core module.
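One such known transformation relating the DST-II to a DCT-II of the same size can be stated concretely: alternate the signs of the input sequence and reverse the order of the outputs. The brute-force check below uses standard unnormalized textbook definitions of both transforms (the normalization is an assumption of this sketch and may differ from the conventions used in Chapter 5):

```python
import math

def dct2(x):
    """Unnormalized DCT-II: C[k] = sum_n x[n] cos(pi (2n+1) k / (2N))."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi * (2 * n + 1) * k / (2 * N))
                for n in range(N)) for k in range(N)]

def dst2(x):
    """Unnormalized DST-II: S[k] = sum_n x[n] sin(pi (2n+1) (k+1) / (2N))."""
    N = len(x)
    return [sum(x[n] * math.sin(math.pi * (2 * n + 1) * (k + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

# DST-II via DCT-II: sign-alternate the input, then reverse the output order.
x = [1.0, 2.0, -1.0, 0.5, 3.0]
N = len(x)
via_dct = dct2([((-1) ** n) * x[n] for n in range(N)])[::-1]
assert all(abs(a - b) < 1e-9 for a, b in zip(dst2(x), via_dct))
```

Because only the sign flips and the index reversal differ, both halves of the MCLT can indeed be routed through one DCT-II core.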

Table 1.4: Complexities of various fast MCLT algorithms for block size N = 2^n with a sine window.

Algorithm              Ref. [33]        Ref. [17]        Ref. [34]            Proposed

Window choice          any              sine             sine                 sine
Real multiplications   N log N + N      N log N + N      N log N + N          N log N + 2
Real additions         3N log N + 2N    3N log N + 4N    3N log N + 3N - 2    3N log N + 4N
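The reduction of the DST-II to a DCT-II mentioned above can be checked numerically. The sketch below is our illustration, not the dissertation's code: it assumes unnormalized type-II kernels, and uses the known identity that sign-alternating the input of a DCT-II and reversing its output order yields the DST-II.

```python
import numpy as np

def dct2(x):
    # Unnormalized DCT-II: C[k] = sum_n x[n] * cos(pi*(2n+1)*k / (2N))
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.cos(np.pi * (2 * n + 1) * k / (2 * N)))
                     for k in range(N)])

def dst2(x):
    # Unnormalized DST-II: S[k] = sum_n x[n] * sin(pi*(2n+1)*(k+1) / (2N))
    N = len(x)
    n = np.arange(N)
    return np.array([np.sum(x * np.sin(np.pi * (2 * n + 1) * (k + 1) / (2 * N)))
                     for k in range(N)])

x = np.arange(1.0, 9.0)          # arbitrary 8-point test vector
signs = (-1.0) ** np.arange(8)
# DST-II computed through a DCT-II of the same size:
via_dct = dct2(signs * x)[::-1]  # sign-alternate the input, reverse the output
assert np.allclose(via_dct, dst2(x))
```

Since only the input signs and the output order change, a single DCT-II core can serve both transforms.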

1.3 Organization

This chapter has provided an introduction to our research into the design of bilinear

hardware accelerators for signal processing applications. We stressed the importance of

developing fast signal processing algorithms and the challenges posed by a converged

multimedia wireless network. The requirements on hardware implementation are also

discussed. We advocate an ASIC-driven approach to deliver a high performance and


yet cost effective solution. Also presented in this chapter are our research objectives

and a summary of contributions.

In Chapter 2, we first discuss the ASIC design flow. Then concepts of group theory

and basic properties of the group are reviewed. This is followed by a detailed treat­

ment of the bilinear algorithm. We show examples of bilinear algorithms for small

size cyclic convolutions and Hankel products. A bilinear algorithm for larger size

cyclic and Hankel matrix products can be derived by using small length algorithms

as building blocks. Since structured bilinear architecture has only one multiplication

on the critical path, the resulting ASIC circuit can be very fast in fixed point im­

plementations. For a complex group structure, the order of computation can have

significant impact on the additive complexity. We discuss a procedure for determining

this computation order.

Bilinear algorithms for the DHT of prime power lengths are developed in Chapter

3. We investigate the structure of the DHT kernel matrix using the group A(p^n), where p is

a prime number. This study leads to bilinear algorithms with a single multiplication

stage. Algorithm derivation and verification using MATLAB was carried out. The

new algorithms and two prior algorithms were implemented for a number of 2^n and

3^n point sizes, and their areas and critical path delays are compared.

In Chapter 4, we extend structured bilinear algorithms to 2^n and 4·3^n point

MDCT. Composite length MDCT algorithms of 4·3^n points are the workhorse behind

the popularity of MPEG-1/2 layer III (MP3) audio format. The kernel matrix of

MDCT is rectangular (2^n × 2^(n+1)) and not square as in the case of the DHT. With front-end

data massaging, we first transform the MDCT into a type-IV DCT. Both forward

and inverse transforms can be unified with a DCT-IV based algorithm. This has a

significant hardware implication since both encoder and decoder can now time-share

a single hardware unit. By taking advantage of the recursive structure inside a bilinear

algorithm, we propose three variations of unified hardware architecture for not only

the forward and inverse transforms, but also the short and long blocks. We show that

these proposed architectures offer superior performance over existing design solutions

and can be obtained with little area penalty.

A fast algorithm for the modulated complex lapped transform is developed in Chapter


5. We obtain a bilinear algorithm for the MCLT of lengths divisible by 4 and with a sine

window. The real part of the MCLT is a modulated lapped transform (MLT). The MLT

is related to the MDCT by first applying a window function to the input data. A

commonly used window is the sine window, which preserves perfect reconstruction.

The imaginary part of MCLT can be obtained by a modified discrete sine transform

(MDST) with windowed data. In most publications, this windowing function was

performed separately from the MDCT flow. A handful of publications attempted to

merge the discrete computation steps. All met with limited success. Using trigono­

metric identities, we show that the window function can be completely merged with

the transform kernel so as to preserve the bilinearity of the algorithm. The hardware

complexity of the new algorithm is computed and compared with four prior algo­

rithms. It is shown that our proposed algorithm has the lowest overall complexity

and lowest multiplication requirements.

In the final chapter, the method and results of this research are summarized. To

obtain bilinear algorithms for the transforms under consideration, a group theoretic

approach has proven successful in providing fast VLSI architectures. The use of

group theory allows us to partition a transform kernel of interest into smaller cyclic

and Hankel matrices. By applying bilinear algorithms to these matrix products, the

desired architectures are obtained. Future work is discussed on the topics

of structured bilinear algorithms and algorithmic hardware acceleration.


Chapter 2

ASIC methodology and

mathematical background

This chapter provides the information about implementation techniques used as well

as the mathematical background required for this research. We start out with a re­

view of ASIC design flow and metrics used to evaluate an implementation. Then the

concepts of group theory and basic properties of group used here are reviewed. This

is followed by a detailed treatment of the bilinear algorithm. We illustrate examples

of bilinear algorithms for small length cyclic convolutions and Hankel products. A

bilinear algorithm for larger length cyclic and Hankel matrix products can be derived

using small length algorithms as building blocks. Since structured bilinear architec­

ture has only one multiplication operation on the critical path, one can devise very

fast ASIC bilinear architectures in fixed-point implementations. For a complex group

structure, the ordering of computation can have a significant impact on the additive

complexity of the bilinear algorithm. We discuss a procedure for determining this

computation order.


2.1 Design methodology

2.1.1 General ASIC flow

Application specific integrated circuit (ASIC) has been the technology of choice for

many large volume applications. Many digital signal processing systems fit in this

category. For example, mobile phones are produced in very large numbers and re­

quire high-performance circuits with respect to throughput and power consumption.

ASICs often have performance advantages over digital signal processors (DSPs) in

size, power and speed. These advantages are gained when area is optimized by

eliminating functions that are part of a standard DSP but unnecessary for the target

application, such as the instruction fetch-and-decode logic and interrupt handling circuits. As

such, ASIC circuits drive less capacitive load, resulting in less power dissipation and

better performance. In addition, the word precision and other design parameters can

be tailored to a specific application to provide optimum high-level performance.

ASIC, however, is not without challenges. The most notable are the lack of

programmability and portability, and with that a longer design cycle. Since

ASIC implementations are literally set in stone, they cannot be easily field-upgraded

and thus must be designed reliably on the first try. To achieve this objective, a

design ripples through a sequence of refinement stages called abstraction levels before

making its way into the silicon. The different design abstraction levels of a top-down

design methodology are illustrated in Fig. 2.1.

In a digital ASIC flow, the highest abstraction level is the specification level. It

defines the functionality, the performance and the constraints of target application.

Next is the behavioral level, where a design is often embodied by a system software

program that executes the function to be implemented in hardware. At the register

transfer level (RTL), the behavioral description is mapped to a well defined hardware

micro-architecture. The behavior and structure of a design is specified by describing

the operations that are performed on data as it flows between circuit inputs, outputs,

and clocked registers. At circuit level, different blocks of the hardware architecture

are committed to Boolean logic gates, latches, and flip-flops. For example, an adder

is committed to a specific implementation such as a carry propagation adder (CPA),


[Figure content: each abstraction level with an example and typical tools.
 Specification (function: MP3 decoder; performance: 320 kbps): Word, spreadsheet.
 Behavior (e.g., if(s) x = a+b; else x = a+c;): Matlab, C/C++.
 Register transfer (RTL): VHDL, Verilog.
 Circuit (e.g., a CLA adder): synthesis with Synopsys, Cadence, Mentor.
 Physical layout: Cadence, Synopsys, Magma.]

Figure 2.1: Digital ASIC abstraction levels.


a carry-lookahead adder (CLA), or a carry select adder (CSA) depending on the

availability of library cells. The circuit level is the stage where design synthesis is

performed to a specific technology node, such as TSMC 90nm CMOS process used in

this research. The final abstraction level before design sign-off is the physical layout

level. Here the design is linked to a specific technology and mask geometries are

derived. Simulation parameters such as wire resistance and parasitic capacitance are

extracted and fed back to a circuit simulator to ensure the design can function at the

specified performance level. Cross-talk and electromigration analyses are also performed to

determine whether the device lifetime is sufficient for an application. Most often, a design will

iterate a few times between the last two levels before a satisfactory solution is obtained.

We have seen that these abstraction levels offer opportunities to divide a compli­

cated design process into smaller and more manageable tasks. A well-practiced ASIC

flow can significantly reduce the risk of introducing errors into the design. However it

is also this elaborated and fine-tuned procedure that demands significant effort and

time, which in turn often prolongs the design process. In order to shorten the schedule

without negatively impacting the quality of design, a frequently used improvement

technique is designing for reuse. This requires that during the design process, atten­

tion must be paid to any iterative process within the scope. These iterative processes

can be extracted out as a macro or a core. They can be verified once and confi­

dently applied many times. A systematic reuse of macros and cores provides the

quickest and most efficient approach to ASIC design [44]. In this dissertation, we will

show that our choice of design provides a natural path to design-for-reuse. As such,

a larger design can be quickly generated based on existing pre-verified blocks thereby

significantly shortening the design schedule.

2.1.2 ASIC for signal processing

Signal processing is fundamental to information processing. Generally the goal is

to reduce the information content in a signal to facilitate a decision about what

information the signal carries. In other instances, the aim is to retain the information

and to transform the signal into a form that is more suitable for transmission or

storage.


The major attributes of our ASIC design for signal processing are

• Standard cell based ASIC: Standard cell based ASIC, or cell-based IC (CBIC)

uses a collection of pre-designed logic cells to build a large complex circuit.

These logic cells are known as standard cells and include Boolean gates, flip-

flops and other complex modules such as multiplexers etc. The entire collection

is called a library. If building an ASIC is like building a house, the library is like

a Home Depot catalog, and the cells are equivalent to the lumber, bricks, nails

and tiles listed in the catalog. The advantage of this design approach is that

designers save time and reduce risk by using a pre-designed, pre-tested, and

pre-characterized standard cell library. ASIC designers can save their efforts to

focus on system functionality and high-level design tradeoffs.

• Fixed-point arithmetic: Early signal processors used fixed-point arithmetic and

often had far too short internal data length and far too small on-chip memory

to be efficient. Recent processors use floating-point arithmetic which is much

more expensive than fixed-point arithmetic in terms of power consumption,

execution time and chip area. In fact, these processors are not exclusively

aimed at DSP applications. Applications that typically require floating-point

arithmetic are three-dimension (3D) graphics and mechanical computer aided

design (CAD) applications. Fixed-point arithmetic is better suited for DSP

applications than floating-point arithmetic since good DSP algorithms require

high accuracy (long mantissa), but not the large dynamic signal range provided

by floating-point arithmetic. Further, the performance degradation due to non-

linearity (rounding of products) are less severe in fixed-point arithmetic.

• Direct mapping: DSP algorithms rarely have many data-dependent branching

operations. This makes them ideal candidates for the direct mapping technique

which involves creating a hardware structure that exactly matches the signal

flow graph of the algorithm. This implementation technique is particularly

suitable for systems with a fixed function, for example, digital filters and signal

transforms. Since direct mapping approach provides perfect match between the


DSP algorithm and the circuit architecture, it allows algorithm level design pa­

rameter tradeoffs and tuning. It thus reduces the overall design time and leads

to a more reliable design.

2.2 Performance metric

The design space can be represented as a triplet:

(Function, Performance, Constraints).

That is to say a design has to deliver a function at a performance level under certain

constraints. For example, in signal processing, the function can be a digital filter

or a signal transform. The performance objectives are often time related, such as

signal sampling rate, or processor requirements such as clock frequency or MIPS.

The constraint targets, on the other hand, are frequently associated with cost, such

as chip area, power consumption, design schedule, or a combination. In general, a

successful design is about the trade off between the performance and the constraints

in order to realize some particular functions. In this dissertation, we are interested in

developing architectures for real-time DSP applications, where computations must be

completed within a given time frame (for example, the signal sample period). In such

applications, unacceptable errors occur if the time limit is exceeded. Consequently,

general purpose or DSP processors that rely on concepts such as memory management

units, cache etc., which may have a high throughput but not necessarily a guaranteed

time for each computation, are unacceptable. On the other hand, without these

features, general purpose and DSP processors cannot provide the necessary speed.

We therefore focus on ASIC solutions in this work.

Architecture speed is intrinsically tied to its critical path delay. Reducing overall

arithmetic complexity helps on the circuit area. However it does not address the

critical path delay directly. For algorithmic circuit, the critical path delay is attributed

to arithmetic operation speed and the number of operations on the signal path. This

implies that on the critical path, one needs to favor simple and fast operations such as

additions rather than more complex and slow operations such as multiplications. This

is especially important for fixed-point arithmetic, since intrinsic delay of a multiplier


is much greater than that of an adder. Fig. 2.2 shows an addition of 3 operands

using carry propagation adders (CPA). The delay involved in an n-bit CPA is nA,

where A is the delay of a full adder (FA). However when these adders are cascaded to

perform successive additions, the delay introduced by each additional adder is only

A. This additional delay is not dependent on the fixed-point word size. This is

in contrast with a multiplication operation, shown in Fig. 2.3 using a Pezaris array

multiplier. Note that the critical path delay of this multiplier is 2nA, where the n-th bit

is the operand sign. In addition, cascading another operation increases the delay by

up to nA since even the (useful) least significant bit of the product may take up to

nA time to compute. Further, in any hardware implementation, the size and value

of variables involved in multiplication play a significant role in determining both the

area and the delay.


Figure 2.2: Critical path delay for an addition.

Clearly there is a distinction between the design for area and the design for speed.

In a design for area, the quantity of arithmetic operations is of utmost importance,

whereas in a design for speed, the placement of arithmetic operations along the

computational path matters most. For a fixed-point arithmetic ASIC, the total number of

multiplications tends to dominate the chip area, and the number of multiplications on

the critical path tends to dominate the delay, or the circuit speed. To reduce the

number of operations on the critical path, concurrency available in an algorithm can



Figure 2.3: Critical path delay for a multiplication.


be exploited by parallel processing techniques. On the other hand, to reduce the cir­

cuit area, existing hardware units can be shared, potentially at the expense of circuit

speed.

Consider the following matrix-vector product.

    [ y0 ]   [ a  b ] [ x0 ]
    [ y1 ] = [ b  a ] [ x1 ]                                          (2.1)

Equation (2.1) can be implemented by the three different flow graphs shown in Figs. 2.4-

2.6. The hardware and time complexities of these implementations are listed in Table

2.1.

Figure 2.4: First implementation of (2.1).


Figure 2.5: Second implementation of (2.1).


Figure 2.6: Third implementation of (2.1).

Table 2.1: Complexity and delay comparisons for the three implementations of (2.1). Note that M and A refer to multiplication and addition respectively.

Implementation          Fig. 2.4   Fig. 2.5   Fig. 2.6

Arithmetic complexity   4M + 2A    4M + 2A    2M + 4A
Critical delay          M + A      2M + A     M + 2A


Even though the implementations in Figs. 2.4 and 2.5 have the same complexity,

the critical path delays could not be more different. Therefore the complexity cannot be

the sole indicator of an algorithm design. This is especially important for real-time

applications. In fixed-point arithmetic implementation, Fig. 2.4 is better because

for the same area (same total operations), its delay is shorter due to the smaller number of

operations on the critical path. However when a/b is a power of 2, the multiplication

with a/b can be realized as a left or right shift with sign extension. It is considered

to be such a trivial operation that its impact to area and delay is negligible. In this

case, the implementation of Fig. 2.5 is better than that of Fig. 2.4 since it has the

same delay but smaller area.

Implementation in Fig. 2.6 can be considered as a compromise between the per­

formance and the area. It has smaller area but slightly longer delay than the one

in Fig. 2.4. As we will see later in this chapter, both these implementations belong

to the family of bilinear algorithm. Bilinear algorithms exploit the parallelism in

multiplication operations to the fullest extent, i.e. no multiplication is dependent on

another multiplication operation. Since they minimize the number of multiplications

along the critical path, bilinear architectures are very fast.
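As a concrete check of the two extremes in Table 2.1, the sketch below (our own illustration, not from the dissertation) computes (2.1) both directly, in the style of Fig. 2.4, and in the bilinear form of Fig. 2.6, where (a+b)/2 and (a-b)/2 are fixed coefficients prepared offline:

```python
# Two ways to compute y = [[a, b], [b, a]] @ [x0, x1] from (2.1)
a, b, x0, x1 = 3.0, 5.0, 2.0, 7.0

# Fig. 2.4: 4 multiplications, 2 additions, critical delay M + A
y_direct = (a * x0 + b * x1, b * x0 + a * x1)

# Fig. 2.6: bilinear form, 2 multiplications, 4 additions, delay M + 2A
# ((a + b)/2 and (a - b)/2 are constants, computed once at design time)
m0 = ((a + b) / 2) * (x0 + x1)
m1 = ((a - b) / 2) * (x0 - x1)
y_bilinear = (m0 + m1, m0 - m1)

assert y_direct == y_bilinear
```

The bilinear version trades two multiplications for two extra additions, which is exactly the area-versus-delay compromise discussed above.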

Doing so however, does require one to possess sufficient hardware resources. For­

tunately modern ASIC technology provides an opportunity to accomplish this goal.

As the manufacturing process continues marching down to finer design geometries,

more designs are limited by the number of inputs and outputs (IO) than the number

of transistors a die can support. This shift from capacity limitation to IO limitation,

permits the use of more processing elements for parallel computing. At the same

time, it also brings up a new design constraint on the number of inputs and outputs.

In later chapters, we will show the effects of this new constraint on the performance

of the design.


2.3 Group theory

2.3.1 Group definition

A group is a nonempty set G together with an operation o satisfying the following

conditions.

• Closure: If a, b ∈ G, then a o b ∈ G.

• Associativity: For all a, b, c ∈ G, (a o b) o c = a o (b o c).

• Identity: There exists an element e ∈ G such that for all a ∈ G, a o e = a = e o a.

• Inverse: For each a ∈ G, there exists an a^-1 ∈ G such that a o a^-1 = e = a^-1 o a.

In this dissertation, our groups will be sets of integers with operation of multipli­

cation modulo N. This is because many signal processing algorithms use trigonometric

functions, which are inherently periodic, thus providing the reasoning for the modulo

base of our operation.
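The group axioms above can be verified mechanically for the multiplicative groups used in this dissertation. The following sketch (our own helper names) checks closure, identity and inverses for the set of integers less than N and coprime to N under multiplication modulo N; associativity is inherited from ordinary integer multiplication.

```python
from math import gcd

def A(N):
    """Elements of A(N): positive integers less than N and coprime to N."""
    return [a for a in range(1, N) if gcd(a, N) == 1]

def is_group(elems, N):
    op = lambda a, b: (a * b) % N
    closure = all(op(a, b) in elems for a in elems for b in elems)
    identity = 1 in elems and all(op(a, 1) == a for a in elems)
    inverse = all(any(op(a, b) == 1 for b in elems) for a in elems)
    # associativity holds because integer multiplication is associative
    return closure and identity and inverse

assert is_group(A(7), 7)    # A(7) = {1, 2, 3, 4, 5, 6}
assert is_group(A(15), 15)  # A(15) = {1, 2, 4, 7, 8, 11, 13, 14}
```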

A group G is cyclic if it can be generated by one element g £ G. A cyclic group

of n elements is represented as Cn. Thus

Cn = {1, g, g^2, ..., g^(n-1)}.

In most cases, the choice of a generator is not unique.

If G and H both are groups, then we can construct a new group F = G × H,

called the direct product of G and H. The elements of G × H are pairs (a, b) where

a ∈ G and b ∈ H. The operation between two pairs is done component-wise in the

two individual groups. But if G and H have the same group operation, then the elements

of F = G × H = {a o b | a ∈ G, b ∈ H}. For example, if G = {1, g} with g^2 mod N = 1,

and H = {1, h, h^2} with h^3 mod N = 1, then F = G × H = {1, h, h^2, g, gh, gh^2},

where the products are computed by component-wise multiplication modulo N.

Two groups F and F' are isomorphic if there exists a one-to-one correspondence

ψ between the elements of the two groups and ψ preserves the group operation, i.e.,

ψ(g1 o g2) = ψ(g1) o ψ(g2). For the groups G and H defined in the previous example, choose

gh^2 as the generator. We can define a new group F' = {1, gh^2, h, g, h^2, gh}. One

25

CHAPTER 2. ASIC METHODOLOGY AND MATHEMATICAL BACKGROUND

can verify that groups F and F' are isomorphic. Note that computations expressed

over one group can always be translated over to an equivalent group without extra

computational effort.

In an Abelian group, the operation o is commutative, that is, for all a, b ∈ G, a o b = b o a.

Positive integers which are less than N and relatively prime to N, form an Abelian

group under the operation of multiplication modulo N. We will denote this group by

A(N). From the fundamental theorem of group theory, every finite Abelian group is

a product of cyclic groups. In particular, A(N) can be decomposed according to the

following rules.

• A(r1 r2) = A(r1) × A(r2) when gcd(r1, r2) = 1.

• A(p^n) = C_((p-1)p^(n-1)) when p is an odd prime.

• A(2^n) = C2 × C_(2^(n-2)) when n ≥ 3, A(4) = C2 and A(2) = {1}.

In addition, for a cyclic group,

• C_(r1 r2) = C_r1 × C_r2 when gcd(r1, r2) = 1.
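The decomposition rules above are easy to exercise numerically. The sketch below (our own helpers) tests whether A(N) is cyclic by searching for a generator; as the rules predict, A(p^n) is cyclic for an odd prime p, A(4) is cyclic, and A(2^n) for n ≥ 3 is not.

```python
from math import gcd

def A(N):
    return [a for a in range(1, N) if gcd(a, N) == 1]

def is_cyclic(elems, N):
    # A(N) is cyclic iff some element generates every element as a power
    def order(g):
        k, x = 1, g
        while x != 1:
            x = (x * g) % N
            k += 1
        return k
    return any(order(g) == len(elems) for g in elems)

assert is_cyclic(A(9), 9)        # A(3^2) = C6: cyclic, as the odd-prime rule predicts
assert is_cyclic(A(4), 4)        # A(4) = C2
assert not is_cyclic(A(8), 8)    # A(2^3) = C2 x C2: not cyclic
```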

2.3.2 Cyclic and Hankel matrix products

Many signal processing algorithms rely heavily on matrix multiplication. This is

especially true in the time-frequency analysis of the signal. A two-dimensional kernel

matrix generally involves the time and frequency indices. We can reorder the indices

and transform the kernel (maybe in parts) to certain desirable forms. Using efficient

algorithms for these parts, one can then obtain efficient implementations for the DSP

applications. This section explains the desirable matrix forms we will use heavily in

this dissertation.

The most common matrix we use is a Hankel matrix. A general form of the N-point


Hankel matrix-vector product can be expressed as

    [ X0   ]   [ c0    c1    c2    ...  cN-1 ] [ x0   ]
    [ X1   ]   [ c1    c2    c3    ...  c0   ] [ x1   ]
    [ X2   ] = [ c2    c3    c4    ...  c1   ] [ x2   ]            (2.2)
    [ ...  ]   [ ...   ...   ...        ...  ] [ ...  ]
    [ XN-1 ]   [ cN-1  c0    c1    ...  cN-2 ] [ xN-1 ]

Note that each row of a Hankel matrix is a left rotated version of its previous row.

A Hankel matrix is symmetric. Equation (2.2) can also be rewritten by permuting its

columns and the elements of the x vector as:

    [ X0   ]   [ c0    cN-1  cN-2  ...  c1 ] [ x0   ]
    [ X1   ]   [ c1    c0    cN-1  ...  c2 ] [ xN-1 ]
    [ X2   ] = [ c2    c1    c0    ...  c3 ] [ xN-2 ]              (2.3)
    [ ...  ]   [ ...   ...   ...        ...] [ ...  ]
    [ XN-1 ]   [ cN-1  cN-2  cN-3  ...  c0 ] [ x1   ]

The matrix obtained in (2.3) is a cyclic matrix. Unlike a Hankel matrix, a cyclic matrix is in general not symmetric.

There are efficient bilinear algorithms available for computing (2.3). Clearly, all these

algorithms (with trivial transformation) can be applied to computing (2.2) as well.

A kernel matrix (or part of it) can be turned into a Hankel matrix if the indices

involved form a cyclic group and if the matrix element M(i, j) = φ(i o j) for a chosen

function φ. To illustrate this, consider a 7-point Fourier transform kernel with indices

restricted to the range 1 ≤ i, j < 7. This set of indices forms a cyclic group A(7) = C6 =

{1, 3, 2, 6, 4, 5} under the operation of multiplication modulo 7, where the elements of

the group are ordered as powers of its generator 3. Further, M(i, j) = W^(ij) = φ(i o j),

where φ(t) = W^t and o is the group operation of multiplication modulo 7. Using the

index order dictated by the group element order, we can see that the partial Fourier


matrix can be written as

    [ X1 ]   [ W^1  W^3  W^2  W^6  W^4  W^5 ] [ x1 ]
    [ X3 ]   [ W^3  W^2  W^6  W^4  W^5  W^1 ] [ x3 ]
    [ X2 ] = [ W^2  W^6  W^4  W^5  W^1  W^3 ] [ x2 ]               (2.4)
    [ X6 ]   [ W^6  W^4  W^5  W^1  W^3  W^2 ] [ x6 ]
    [ X4 ]   [ W^4  W^5  W^1  W^3  W^2  W^6 ] [ x4 ]
    [ X5 ]   [ W^5  W^1  W^3  W^2  W^6  W^4 ] [ x5 ]
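The Hankel structure of (2.4) can be confirmed by building the kernel in the group order {1, 3, 2, 6, 4, 5}. The sketch below (our own check, assuming the usual DFT kernel W = e^(-2πj/7)) verifies that each row is the previous row rotated left by one position.

```python
import numpy as np

W = np.exp(-2j * np.pi / 7)          # assuming the usual 7-point DFT kernel
order = [1, 3, 2, 6, 4, 5]           # A(7) ordered as powers of the generator 3

M = np.array([[W ** ((i * j) % 7) for j in order] for i in order])

# Hankel structure: every row is the row above rotated left by one position
for r in range(1, 6):
    assert np.allclose(M[r], np.roll(M[r - 1], -1))
```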

Clearly the matrix in (2.4) is a Hankel matrix. Further, since C6 = C2 × C3 =

{1, 3} × {1, 2, 4} = {1, 2, 4, 3, 6, 5}, one can reorder the indices to this new order to

see that the original Hankel matrix of (2.4) can be partitioned into 3 × 3 blocks such

that each is a Hankel matrix and the structure formed by the blocks is a 2 × 2

Hankel matrix. This is shown in (2.5).

    [ X1 ]   [ W^1  W^2  W^4  W^3  W^6  W^5 ] [ x1 ]
    [ X2 ]   [ W^2  W^4  W^1  W^6  W^5  W^3 ] [ x2 ]
    [ X4 ] = [ W^4  W^1  W^2  W^5  W^3  W^6 ] [ x4 ]               (2.5)
    [ X3 ]   [ W^3  W^6  W^5  W^2  W^4  W^1 ] [ x3 ]
    [ X6 ]   [ W^6  W^5  W^3  W^4  W^1  W^2 ] [ x6 ]
    [ X5 ]   [ W^5  W^3  W^6  W^1  W^2  W^4 ] [ x5 ]

Both cyclic and Hankel matrices are important in this dissertation. We will review

efficient bilinear algorithms for cyclic and Hankel matrix products in the next section.

Once bilinear algorithms are established for arbitrary sized cyclic convolutions and

Hankel products, we will demonstrate in later chapters that for the signal processing

applications of interest, indices forming groups can be extracted and thereby the

computations reduced to Hankel products.

2.4 Bilinear algorithm

A typical bilinear algorithm has three sequential computing stages: pre-addition,

multiplication, and post-addition. A key characteristic of the bilinear architecture is that

all the multiplications therein are independent and can be executed concurrently.


Since there is only one multiplication along the critical data path, the structured

bilinear hardware has potential to achieve ultra-fast VLSI implementation.

Bilinear algorithms have been studied earlier by Rader [42] for prime length DFT

and by Winograd [61] for certain short composite length DFT. Recently bilinear al­

gorithms have been proposed for more general length 2n DFT [55]. The process of

converting a transform to a bilinear algorithm consists of two steps. First, group

theory is used to identify convolution structures within the transform kernel matrix.

Typically one looks for cyclic and Hankel sub-matrices within the kernel. Using bi­

linear algorithms for these cyclic and Hankel matrix-vector multiplications, a bilinear

algorithm is obtained for the complete transform.

2.4.1 Recursive decomposition

The basic building blocks for our targeted applications are the bilinear algorithms

for cyclic convolutions and Hankel products of prime power lengths. For small prime

lengths, these algorithms are well documented [3]. Bilinear algorithms for large prime

lengths can be derived as well [54]. A general cyclic matrix of any size can always be

decomposed into smaller cyclic matrices of relatively prime lengths. Similarly a large

size Hankel matrix can be decomposed into smaller Hankel matrices. In addition, a

cyclic or Hankel matrix of prime power length can be decomposed into smaller cyclic

and Hankel matrices for which bilinear algorithms are readily available.

Consider the following 4-point cyclic convolution as an example.

    [ y0 ]   [ c0  c1  c2  c3 ] [ x0 ]
    [ y1 ]   [ c1  c2  c3  c0 ] [ x1 ]
    [ y2 ] = [ c2  c3  c0  c1 ] [ x2 ]                             (2.6)
    [ y3 ]   [ c3  c0  c1  c2 ] [ x3 ]

Let Y0 = [y0, y1]^T, Y1 = [y2, y3]^T, X0 = [x0, x1]^T, X1 = [x2, x3]^T, and define the Hankel

matrices

    A = [ c0  c1 ]    and    B = [ c2  c3 ]
        [ c1  c2 ]               [ c3  c0 ].

Equation (2.6) can then be rewritten


as a 2-point block cyclic convolution as

    [ Y0 ]   [ A  B ] [ X0 ]
    [ Y1 ] = [ B  A ] [ X1 ]                                       (2.7)

Applying the 2-point bilinear algorithm for cyclic convolution (Fig. 2.6) to (2.7), one

gets

    Y0 = ((A + B)(X0 + X1) + (A - B)(X0 - X1))/2

    Y1 = ((A + B)(X0 + X1) - (A - B)(X0 - X1))/2.                  (2.8)

The block coefficient matrices are:

    A + B = [ c0 + c2   c1 + c3 ]
            [ c1 + c3   c0 + c2 ],

    A - B = [ c0 - c2      c1 - c3    ]
            [ c1 - c3   -(c0 - c2)    ].                           (2.9)

Therefore the block multiplication with (A + B) is again a 2-point cyclic matrix

multiplication requiring 2 multiplications and 4 additions in a bilinear algorithm.

The multiplication with (A - B), however, is a 2-point Hankel product and its bilinear

algorithm needs 3 multiplications and 3 additions. Putting together the blocks, the

complete bilinear algorithm shown in Fig. 2.7 is obtained. The net arithmetic

complexity of a bilinear algorithm for (2.6) is therefore 5 multiplications and 15

additions and the critical path delay is one multiplication and 4 additions.
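The complete 5-multiplication algorithm can be sketched as follows. This is our own transcription of (2.7)-(2.9), with the 2-multiplication cyclic and 3-multiplication Hankel sub-algorithms inlined; the helper name is ours.

```python
import numpy as np

def cyclic4_bilinear(c, x):
    """5-multiplication bilinear algorithm for the 4-point product (2.6),
    built from the 2-point block decomposition (2.7)-(2.9)."""
    c0, c1, c2, c3 = c
    X0, X1 = np.array(x[0:2]), np.array(x[2:4])
    u, v = X0 + X1, X0 - X1

    # (A + B) is a 2-point cyclic matrix: 2 multiplications (Fig. 2.6 form)
    p, q = c0 + c2, c1 + c3
    m0 = ((p + q) / 2) * (u[0] + u[1])
    m1 = ((p - q) / 2) * (u[0] - u[1])
    t = np.array([m0 + m1, m0 - m1])       # (A + B)(X0 + X1)

    # (A - B) is a 2-point Hankel matrix [[r, s], [s, -r]]: 3 multiplications
    r, s = c0 - c2, c1 - c3
    m2 = s * (v[0] + v[1])
    m3 = (r - s) * v[0]
    m4 = (-r - s) * v[1]
    h = np.array([m2 + m3, m2 + m4])       # (A - B)(X0 - X1)

    Y0, Y1 = (t + h) / 2, (t - h) / 2      # recombine per (2.8)
    return np.concatenate([Y0, Y1])

c = [1.0, 2.0, 3.0, 4.0]
x = [5.0, 6.0, 7.0, 8.0]
# Direct evaluation of (2.6): row i holds (c_i, c_{i+1}, c_{i+2}, c_{i+3}) mod 4
M = np.array([[c[(i + j) % 4] for j in range(4)] for i in range(4)])
assert np.allclose(cyclic4_bilinear(c, x), M @ x)
```

Note that the coefficient combinations such as (p + q)/2 and (r - s) involve only the c's and can be precomputed, so only the 5 products with data-dependent terms count as multiplications.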

In this dissertation, we employ this kind of recursive decomposition approach

repeatedly. A key aspect of the approach is that only a small number of bilinear algo­

rithms are needed in order to obtain the solution to a much larger problem size. The

small number of required algorithms means more efficient design reuse for both software

and hardware. When sizes are parameterized, it can reduce the overall code size

and shorten the design schedule since the verification effort is reduced substantially.

2.4.2 Order of computation

When a larger bilinear algorithm is decomposed into two or more smaller algorithms,

it is important to determine the order of decomposition which provides the minimum

complexity. A bilinear algorithm can be characterized as a triplet (n, a, m), where n



Figure 2.7: A bilinear algorithm for the 4 point cyclic convolution.

is the length of the input vector, a its additive complexity and m its multiplicative

complexity.

Consider designing a bilinear algorithm of n1 n2 points using two bilinear algorithms

described as (n1, a1, m1) and (n2, a2, m2). We can partition the input and

output vectors into n1 sub-vectors, with each sub-vector having n2 points. The

computation resembles the algorithm of (n1, a1, m1) at the block level. The a1 additions

are duplicated n2 times for each component of the sub-vectors, resulting in n2 a1

additions. There are m1 block multiplications, each resolved with the algorithm

(n2, a2, m2). In total, the block multiplications require m1 a2 additions

and m1 m2 multiplications. Thus the arithmetic complexity of the algorithm for this

n1 × n2 decomposition is m1 m2 multiplications and n2 a1 + m1 a2 additions.

Alternately, using the decomposition n2 × n1, the computation resembles the

algorithm of (n2, a2, m2) at the block level. The a2 additions are duplicated n1 times,

resulting in n1 a2 additions. There are m2 block multiplications, each resolved with

the algorithm (n1, a1, m1). In total, the block multiplications require

m2 a1 additions and m1 m2 multiplications. Thus the arithmetic complexity of this

n2 × n1 decomposition is m1 m2 multiplications and n1 a2 + m2 a1 additions. A

comparison of the two decomposition strategies is shown in Table 2.2.

CHAPTER 2. ASIC METHODOLOGY AND MATHEMATICAL BACKGROUND

Table 2.2: Arithmetic complexity of different decomposition orders.

  Decomposition   Multiplications   Additions
  n1 x n2         m1 m2             n2 a1 + m1 a2
  n2 x n1         m1 m2             n1 a2 + m2 a1

It is important to note that the multiplicative complexity is identical for both decomposition orders. The difference is in the additive complexity. If the decomposition order n1 x n2 has a lower additive complexity than n2 x n1, then (n2 a1 + m1 a2) < (n1 a2 + m2 a1), which can be simplified to

(m1 - n1)/a1 < (m2 - n2)/a2.   (2.10)

Since an algorithm with a lower computational complexity is desirable, we can use the value of (m - n)/a to determine the decomposition order of a bilinear algorithm (n, a, m): the algorithm with the smaller value of (m - n)/a should be used at the block level.

As an example, consider the computation of a 6-point cyclic convolution. This may be expressed either as a 2-point block cyclic convolution where each block is a 3-point cyclic convolution, or alternately as a 3-point block cyclic convolution with each block being a 2-point cyclic convolution. The characteristic of the bilinear algorithm for a 2-point cyclic matrix is (2, 4, 2) and that for the 3-point arbitrary cyclic matrix is (3, 11, 4). Applying (2.10), the value of (m - n)/a for C2 is 0 and that for C3 is 1/11. Therefore the decomposition order C2 x C3 results in a lower complexity than that of C3 x C2.
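The order-of-computation rule above can be checked with a short calculation. The following sketch (our illustration, not part of the original algorithm) computes the cost of combining two (n, a, m) triplets in either order:

```python
def combined_cost(outer, inner):
    """Cost of a bilinear algorithm built with `outer` used at the block
    level and each block multiplication resolved by `inner`."""
    n1, a1, m1 = outer
    n2, a2, m2 = inner
    return m1 * m2, n2 * a1 + m1 * a2   # (multiplications, additions)

C2 = (2, 4, 2)    # 2-point cyclic convolution: (n, a, m)
C3 = (3, 11, 4)   # 3-point cyclic convolution

# (m - n)/a is 0 for C2 and 1/11 for C3, so C2 x C3 is the cheaper order:
print(combined_cost(C2, C3))  # (8, 34)
print(combined_cost(C3, C2))  # (8, 38)
```

Both orders need the same 8 multiplications; the C2 x C3 order saves 4 additions, in agreement with (2.10).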

In Table 2.3, commonly used 2, 3 and 5 point bilinear algorithms are listed according to their (m - n)/a values. When two algorithms have the same complexity value, their order of combining can be chosen arbitrarily.

Table 2.3: Determining the order of combining some bilinear algorithms. Note that matrices identified with * have the special property that the sum of all elements in the first row is 0.

  Matrix type   Size n   Multiplications m   Additions a   Value (m - n)/a
  Circular      2        2                   4             0
  Circular*     3        3                   6             0
  Circular      3        4                   11            1/11
  Hankel        3        5                   16            1/8
  Circular*     5        9                   22            2/11
  Hankel        2        3                   3             1/3
  Hankel        5        14                  27            1/3


Chapter 3

Discrete Hartley transform

This chapter develops bilinear algorithms for the discrete Hartley transform (DHT)

of pn points for a prime p. Using a group theoretic approach, we show that the DHT

kernel matrix can be recursively transformed into cyclic and Hankel sub-matrices.

By using bilinear algorithms for cyclic convolution and Hankel product, one can

then obtain bilinear algorithms for the DHT. Bilinear algorithms ensure highest com­

putational speeds in dedicated hardware. We have implemented in VLSI our new

algorithms of 2^n and 3^n point DHTs as well as the ones available in the literature. We

find that our algorithms have a speed advantage of 20% - 30% over others.

3.1 Background and prior work

The N-point discrete Hartley transform (DHT) of a sequence {x(i)} is defined as [6]

X(k) = sum_{i=0}^{N-1} x(i) cas(2πik/N),   k = 0, 1, ..., N - 1,   (3.1)

where cas(a) = cos(a) + sin(a). The DHT is a real-valued transform with its forward

and inverse transforms sharing the same kernel (except for a scaling factor) and is

useful for obtaining convolutions of real sequences. It has also been used in many

applications in the fields of spectral analysis [57], error control coding [62], data

compression [64], and optics and microwave [6,7]. Further, it has been shown that the fast Hartley transform (FHT) provides the fastest realization of the DFT when implemented across a variety of general purpose processor platforms [20].
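Definition (3.1) translates directly into code. The sketch below (our illustration) evaluates the DHT by its definition and checks the self-inverse property mentioned above, i.e., that applying the same kernel twice returns N times the input:

```python
import math

def dht(x):
    """Direct N-point discrete Hartley transform per definition (3.1)."""
    N = len(x)
    cas = lambda a: math.cos(a) + math.sin(a)
    return [sum(x[i] * cas(2 * math.pi * i * k / N) for i in range(N))
            for k in range(N)]

# Forward and inverse share the same kernel up to a 1/N scaling factor:
x = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
x_back = [v / len(x) for v in dht(dht(x))]
assert all(abs(a - b) < 1e-9 for a, b in zip(x, x_back))
```

This direct evaluation costs O(N^2) operations; the fast algorithms developed in this chapter reduce that cost.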

Most of the algorithms for the DHT target the von Neumann architecture. Since all the operations in such general purpose computers are sequential, the performance of these algorithms is evaluated by means of their arithmetic complexity and the number of memory accesses [2,4,21,28,58]. However, the recent progress of VLSI technology has now made it possible to develop cost effective dedicated Application Specific Integrated Circuits (ASICs) for signal processing applications. The ready availability of low cost field programmable gate arrays (FPGAs) has made such hardware solutions practical even for low volume applications. Unfortunately, efficient algorithms developed for general purpose computers, such as the split-radix algorithm [2,4], separate a 2^n-point DHT into n serial computation stages, each involving multiplications. Thus, if converted to hardware, the critical path of these algorithms involves several multiplications one after the other. Fig. 3.1 shows a typical computational flow graph of a 2^n-point DHT using a split-radix algorithm.

Figure 3.1: Flow graph for the Ref. [2] implementation of a 2^n-point DHT. Note that each rotation can also be implemented with 3 multiplications and 3 additions. The indices and coefficients are: i', k' = 0, 1, ..., N/2 - 1; C(j) = cos(2πj/N) and S(j) = sin(2πj/N), where j = 0, 1, ..., N/4 - 1.

Split-radix algorithms have also been proposed for the 3^n-point DHT [1,29,65]. The signal flow graph of a 3^n-point DHT [65] is shown in Fig. 3.2. As can be seen from the figure, these algorithms require additional computations to separate indices into evenly divided three or nine bands. Hence these algorithms are quite different from those for the 2^n-point DHT. However, similar to the algorithms for the 2^n-point DHT, these 3^n-point DHT algorithms also have several multiplication stages on the critical path, thus limiting the performance of a hardware implementation.

Figure 3.2: Flow graph for the Ref. [65] implementation of a 3^n-point DHT. Note that each rotation can also be implemented with 3 multiplications and 3 additions. The indices and coefficients are: i', k' = 0, 1, ..., N/3 - 1, C(i') = cos(2πi'/N) and S(i') = sin(2πi'/N).

Previous hardware implementations of the DHT include algorithms using FFT-like butterfly structures [19,26], CORDIC schemes [13], multiplier arrays [11], bit-serial architectures [22], transversal filters [5] and distributed arithmetic (DA) [23]. All of these implementations except the last have more than one multiplication stage along the critical path. The distributed arithmetic design does not use multipliers. Instead, multiplications are converted into a memory-based look-up table. The DA approach can theoretically compute a DHT of any length; however, both the transform length and the algorithm speed are limited by memory size and speed.

The inherent delay of the p^n-point DHT algorithm developed here consists of only one multiplication and a few additions. The primary focus of the algorithm development is on partitioning the transform kernel into cyclic convolutions and Hankel products and then realizing these with bilinear algorithms. Even though the method is more general, we illustrate it here for lengths of the form 2^n and 3^n. The ASIC designs were synthesized and compared with realizations of the most recent DHT algorithms.

The rest of this chapter is organized as follows. Section 3.2 describes a group theoretic DHT kernel partitioning. Section 3.3 presents the details of the proposed algorithms for the 2^n-point DHT. Performance of the ASIC implementations is presented in Section 3.4. Proposed algorithms for the 3^n-point DHT are discussed in Section 3.5, with a performance analysis following in Section 3.6. Finally, Section 3.7 provides our concluding remarks.

3.2 Partitioning the DHT of prime power lengths

Let N = p^n where p is a prime number. We will use the symbol X_n to indicate a transform of length p^n. Define the set A(N) = {0 < i < N | gcd(i, N) = 1}. We compute the transform components depending on whether or not the indices belong to A(N). The proposed computation of the DHT based on A(N) is shown in Fig. 3.3.

Figure 3.3: Flow chart for the p^n-point bilinear DHT. The pre-addition computes z(i) = sum_{j=0}^{p-1} x(i + jN/p), i = 0, 1, ..., N/p - 1; outputs with k not in A(N) come from a DHT of length N/p, while outputs with k in A(N) come from the Hartley group transform HGT_N.

Consider first the computation of X_n(k), k ∉ A(N), i.e., when k is a multiple of p. In this case, as definition (3.1) shows, each x(i + (N/p)j), 0 <= j < p, multiplies the same cas(2πik/N). Define a sequence

z(i) = sum_{j=0}^{p-1} x(i + (N/p)j),   i = 0, 1, ..., (N/p) - 1.

It is then obvious that

X_n(pk) = Z_{n-1}(k).

Thus the DHT components with indices that are multiples of p can be computed directly from the DHT of the sequence {z(i)} of the smaller length N/p.
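This folding step is easy to validate numerically. The sketch below (our illustration, using a direct evaluation of definition (3.1)) checks that X(pk) equals the (N/p)-point DHT of {z(i)} for N = 27, p = 3:

```python
import math

def dht(x):
    """Direct DHT by definition (3.1)."""
    N = len(x)
    cas = lambda a: math.cos(a) + math.sin(a)
    return [sum(x[i] * cas(2 * math.pi * i * k / N) for i in range(N))
            for k in range(N)]

# X(pk) equals the (N/p)-point DHT of z(i) = sum_j x(i + (N/p) j):
N, p = 27, 3
x = [math.sin(0.7 * i) + 0.2 * i for i in range(N)]   # arbitrary signal
z = [sum(x[i + (N // p) * j] for j in range(p)) for i in range(N // p)]
X, Z = dht(x), dht(z)
assert all(abs(X[p * k] - Z[k]) < 1e-8 for k in range(N // p))
```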

To compute X_n(k), k ∈ A(N), we use the fact that A(N) forms a group under the operation of multiplication modulo N. We refer to this computation of the Hartley transform with the transform indices restricted to a group as the p^n-point Hartley Group Transform, HGT_N. Thus we have

HGT_N(k) = sum_{i=0}^{N-1} x(i) cas(2πik/N),   k ∈ A(N).   (3.2)

We separate (3.2) into two summations depending on whether or not the signal index i belongs to A(N). The results of these are later combined using |A(N)| additions. When i ∈ A(N), we can permute the signal and transform components to convert the partial kernel to a cyclic matrix (when p is odd) or a direct product of cyclic groups (when p = 2). This permutation and computation thus depends on the group structure and is illustrated in Section 3.3 for N = 2^n and in Section 3.5 for N = 3^n.

When k ∈ A(N) but i ∉ A(N), i is a multiple of p. As a result, each HGT_N(k + (N/p)j), 0 <= j < p, in (3.2) is identical. It is therefore sufficient to compute HGT_N(k) only for k ∈ A(N), 0 < k < N/p, i.e., k ∈ A(N/p). However,

HGT_N(k) = sum_{i ∉ A(N)} x(i) cas(2πik/N)
         = sum_{i=0}^{N/p-1} x(p i) cas(2πik/(N/p)),   k ∈ A(N/p),
         = HGT_{N/p}(k),   k ∈ A(N/p).   (3.3)

Note that HGT_{N/p} in (3.3) represents the Hartley group transform of the length N/p sequence {x(p i)}, 0 <= i < N/p.

This decomposition of HGT_N is shown in Fig. 3.4. It shows that the computation of HGT_N breaks down into two independent computations, one involving a cyclic convolution (or a multi-dimensional cyclic convolution) and the other, the transform HGT_{N/p}. HGT_{N/p} can itself be similarly decomposed into a smaller sized convolution and HGT_{N/p^2}. Since all the resultant convolutions can be done concurrently, one can obtain a bilinear algorithm for the DHT from bilinear algorithms for cyclic convolutions.

The proposed bilinear DHT flow can also be visualized from a kernel matrix perspective. In Fig. 3.5, each solid box represents a circular matrix, associated with an Abelian group, that needs to be computed. The circular matrices are identical if they are of the same size. These matrices are the basic computing blocks for our bilinear algorithms. Clearly there are plenty of redundant coefficients in the DHT kernel

Figure 3.4: Hartley group transform for the p^n-point bilinear DHT. Inputs x(i), i ∈ A(N), feed a multi-dimensional cyclic convolution producing Y(k), k ∈ A(N); inputs x(i), i ∉ A(N), feed HGT_{N/p} producing X'(k'), k' ∈ A(N/p); the post-addition forms X(k) = Y(k) + X'(k mod N/p), k ∈ A(N).

matrix. The presented matrix division method recursively identifies and removes this inherent redundancy, and thus achieves a higher computational efficiency. Only a fraction (p - 1)^2/(p^2 - 1) of the original DHT kernel matrix needs to be computed. This computational load can be further reduced using bilinear algorithms.

In a fully parallel ASIC architecture, it can be seen that algorithms for small circular matrices are deployed in many places. Therefore the DHT kernel matrix is an ideal candidate for folding, done in such a way that the computing module for same-size circular matrices is shared to reduce the overall hardware cost. At one extreme, an implementation using the bilinear algorithm needs only one module instantiation for each circular matrix size.

3.3 Bilinear algorithm for 2^n-point DHT

As seen in Section 3.2, partitioning the signal and transform indices using the set A(N) allows one to compute some parts of the N-point DHT recursively. However, when both indices are restricted to A(N), the structure of the computation is governed by the structure of the group A(N). Recall that N = p^n and that the group A(N) uses the operation of multiplication modulo N. It is known that when p is an odd prime, A(N) is a cyclic group C_{(p-1)N/p} of (p-1)N/p elements, and if p = 2, A(N) is isomorphic to C_2 x C_{N/4}. We will focus in this section on lengths N = 2^n, which have greater applicability.


Figure 3.5: Kernel matrix group division of the p^n-point DHT.

41

CHAPTER 3. DISCRETE HARTLEY TRANSFORM

When N = 2^n, the group A(N) has the structure C_2 x C_{N/4}. Let the group C_{N/4} = {1, g, g^2, ..., g^{N/4-1}}, where g^{N/4} mod N = 1. Similarly, let C_2 = {1, h}, where h ∉ C_{N/4} and h^2 mod N = 1. Each element of A(N) can then be expressed as a product of an element of C_{N/4} and an element of C_2. It is known that it is always possible to find the generators g and h of the cyclic groups described here. These generators are not unique. In the following discussion of the 2^n-point DHT, we choose g = 3 and h = N/2 - 1. We first state and prove three lemmas which will help in the clarification of the algorithm. The first of these lemmas shows that 3 is in fact a generator of C_{N/4} under the operation of multiplication modulo N.

Lemma 1 Let N = 2^n, n >= 3. The order of 3 in A(N) is N/4.

Proof. When N = 2^n, |A(N)| = 2^{n-1}. Thus the order of 3 in A(N) is 2^j for the smallest j satisfying

3^{2^j} = 1 mod 2^n
        = 1 + q 2^n, for some integer q.   (3.4)

We now prove by mathematical induction that j = n - 2 and that the quotient q is odd. These statements can be directly verified for n = 3. Now assume that they are true for some n, i.e., the smallest j satisfying (3.4) is j = n - 2 and the q obtained in (3.4) is odd. To prove the statements for n + 1, let the order of 3 in A(2^{n+1}) be denoted by 2^k, i.e., k is the smallest value of j such that

3^{2^j} = 1 + q' 2^{n+1}, for some integer q'.   (3.5)

Clearly any j satisfying (3.5) cannot be smaller than n - 2 because otherwise one could reduce (3.5) modulo 2^n to contradict the assumption that the smallest value of j satisfying (3.4) is n - 2. Further, j in (3.5) cannot equal n - 2 either, because if it did, then (3.4) and (3.5) would give

3^{2^{n-2}} = 1 + q 2^n = 1 + q' 2^{n+1},

contradicting the assumption that q is odd. Thus no j <= n - 2 satisfies (3.5).


Finally, setting j = n - 2 in (3.4) and squaring both sides gives

3^{2^{n-1}} = 1 + (q + q^2 2^{n-1}) 2^{n+1}.   (3.6)

Equation (3.6) shows that j = n - 1 satisfies (3.5). From the preceding discussion, this is the smallest such value of j. Comparing (3.5) and (3.6), one gets

q' = q(1 + q 2^{n-1}),

showing that q' is an odd integer. ∎
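Lemma 1 is easy to confirm by brute force for small n; the following sketch (our illustration) computes multiplicative orders directly:

```python
def mult_order(g, N):
    """Smallest t >= 1 with g^t = 1 (mod N)."""
    t, v = 1, g % N
    while v != 1:
        v = (v * g) % N
        t += 1
    return t

# Lemma 1: the order of 3 modulo 2^n is 2^(n-2) = N/4 for n >= 3.
for n in range(3, 13):
    N = 2 ** n
    assert mult_order(3, N) == N // 4
```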

Lemma 2 Let N = 2^n, n >= 4. Then

3^{N/8} = N/2 + 1 mod N.   (3.7)

Proof. We will prove this lemma by mathematical induction over n. For n = 4, one can verify the result directly. Assume it is true for n. Equation (3.7) can be written as

3^{N/8} = N/2 + 1 + qN,   (3.8)

where q is some integer. To prove the result for n + 1, we square both sides of (3.8) to get

3^{2N/8} = N^2/4 + 1 + q^2 N^2 + N + q N^2 + 2qN
         = (2N)/2 + 1 + (N/8 + q^2 N/2 + q N/2 + q)(2N).   (3.9)

Since (N/8 + q^2 N/2 + q N/2 + q) is an integer, (3.9) gives

3^{2N/8} = (2N)/2 + 1 mod (2N),

showing that the result is true for n + 1 if it is true for n. ∎

Lemma 1 shows that when N = 2^n, and therefore A(N) = C_2 x C_{N/4}, 3 is a generator of C_{N/4}. Clearly, there is only one element in this cyclic group which has order 2. Lemma 2 shows that this element is (N/2 + 1). Thus the generator h of the splitting subgroup C_2 of A(N) can be chosen to be N/2 - 1. It is obvious that N/2 - 1 has order 2 under multiplication modulo N. For N = 8, A(N) = C_2 x C_2 with generators 3 and 5.


Lemma 3 Let N = 2^n, n >= 5. Then

3^{N/16} = N/4 + 1 mod N.   (3.10)

Proof. We will prove this lemma by mathematical induction over n. For n = 5, one can verify the result directly. Assume it is true for n. Equation (3.10) can be written as

3^{N/16} = N/4 + 1 + qN,   (3.11)

where q is some integer. To prove the result for n + 1, we square both sides of (3.11) to get

3^{2N/16} = N^2/16 + 1 + q^2 N^2 + N/2 + q N^2/2 + 2qN
          = (2N)/4 + 1 + (N/32 + q^2 N/2 + q N/4 + q)(2N).   (3.12)

Since (N/32 + q^2 N/2 + q N/4 + q) is an integer, (3.12) gives

3^{2N/16} = (2N)/4 + 1 mod (2N),

showing that the result is true for n + 1 if it is true for n. ∎
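Lemmas 2 and 3 can likewise be confirmed numerically; this sketch (our illustration) uses Python's built-in modular exponentiation:

```python
# Lemma 2: 3^(N/8)  = N/2 + 1 (mod N) for N = 2^n, n >= 4.
# Lemma 3: 3^(N/16) = N/4 + 1 (mod N) for N = 2^n, n >= 5.
for n in range(4, 16):
    N = 2 ** n
    assert pow(3, N // 8, N) == N // 2 + 1
    if n >= 5:
        assert pow(3, N // 16, N) == N // 4 + 1
```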

To obtain the N-point DHT algorithm, and in particular the transform components X(k), k ∈ A(N), we first reorder the signal and transform components as follows. Using the generators g = 3 and h = N/2 - 1 as explained earlier, permute the sequences x(i), X(k), i, k ∈ A(N) to form sequences y(i), Y(k), 0 <= i, k < N/2, as

y(i) = x(g^i mod N),     if 0 <= i < N/4,
       x(h g^i mod N),   if N/4 <= i < N/2,   (3.13)

and similarly,

Y(k) = X(g^k mod N),     if 0 <= k < N/4,
       X(h g^k mod N),   if N/4 <= k < N/2.   (3.14)

Note that these permutations do not cost any computation. The relationship between Y and y is given by a matrix obtained by choosing and permuting the rows and columns of the DHT kernel consistent with the permutations of x and X. Denote this N/2 x N/2 matrix by M, and let

Y = My.   (3.15)

Using (3.1), (3.13) and (3.14), one can see that when 0 <= i, k < N/4 or N/4 <= i, k < N/2,

M(i,k) = cas(g^{i+k} 2π/N).   (3.16)

On the other hand, when one of i and k is in the range 0 to N/4 - 1 and the other in N/4 to N/2 - 1, one has

M(i,k) = cas(h g^{i+k} 2π/N).   (3.17)

From (3.16) and (3.17) it is obvious that the N/2 x N/2 matrix M can be partitioned into four equal submatrices with diagonally opposite submatrices being equal, i.e.,

M = | A  B |
    | B  A |.   (3.18)

Now recall that generator g has order N/4 under multiplication modulo N (Lemma

1). Therefore, both N/4 x N/4 submatrices A and B are cyclic as their (i, k)-th

element depends only on (i + k) mod (N/4). This decomposition of M given by (3.18) reflects the fact that A(N) = C_2 x C_{N/4}.
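These structural claims can be verified numerically. The sketch below (our illustration) builds the permuted kernel for N = 32 with g = 3 and h = N/2 - 1 and checks the block structure (3.18) together with the cyclic property of the blocks:

```python
import math

N, g = 32, 3
h = N // 2 - 1
cas = lambda a: math.cos(a) + math.sin(a)

# Index ordering of (3.13)-(3.14): first C_{N/4}, then its coset h*C_{N/4}.
idx = [pow(g, i, N) for i in range(N // 4)] + \
      [(h * pow(g, i, N)) % N for i in range(N // 4)]
M = [[cas(2 * math.pi * r * c / N) for c in idx] for r in idx]

q = N // 4
tol = 1e-9
for i in range(q):
    for k in range(q):
        # Diagonally opposite blocks are equal: M = [[A, B], [B, A]].
        assert abs(M[i][k] - M[q + i][q + k]) < tol
        assert abs(M[i][q + k] - M[q + i][k]) < tol
        # Each block is cyclic: entries depend only on (i + k) mod (N/4).
        assert abs(M[i][k] - M[(i + 1) % q][(k - 1) % q]) < tol
```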

When N >= 16, matrices A and B are at least 4 x 4. Lemma 2 can then be used to show that they have additional redundancies. Consider the matrix A, whose entries are given by (3.16). These entries can be simplified using Lemma 2 as follows:

A(i + N/8, k) = cas(g^{i+N/8+k} 2π/N)
              = cas(g^{i+k} (N/2 + 1) 2π/N)
              = -cas(g^{i+k} 2π/N)
              = -A(i,k),   0 <= i, k < N/8.   (3.19)

The last step in (3.19) uses the fact that the generator g is odd; in fact all elements of A(2^n) are odd. In a similar fashion, one can show that A(i, k + N/8) = -A(i,k) and A(i + N/8, k + N/8) = A(i,k). Similar manipulation of (3.17) exposes identical

45

CHAPTER 3. DISCRETE HARTLEY TRANSFORM

redundancies in the matrix B. As a result, we see that the AT/4 x N/4 matrices A

and B in (3.18) have the structure

A = A'

A' - A ) A' ) and B=[

\

B'

-B'

-B'

B' (3.20)

M (3.21)

where each of submatrices A' and B' is of size N/8 x AT/8. Thus matrix M becomes

/ A' -A' B' -B' \

-A' A' -B' B'

B' -B' A' -A'

\ -B' B' -A' A' )

Equation (3.21) shows that half the rows of M are the negatives of the other half, and therefore it is sufficient to compute Y(k) only for 0 <= k < N/8 and for N/4 <= k < 3N/8. Thus, defining a sequence {Y'(k)} for 0 <= k < N/4, we have

Y'(k) = Y(k) = -Y(k + N/8),                 if 0 <= k < N/8,
Y'(k + N/8) = Y(k + N/4) = -Y(k + 3N/8),    if 0 <= k < N/8.   (3.22)

But even for these rows of M, the columns have a high degree of redundancy. This redundancy may be exploited by combining the y components that multiply the same value in M. By computing the sequence

y'(i) = y(i) - y(i + N/8),          if 0 <= i < N/8,
        y(i + N/8) - y(i + N/4),    if N/8 <= i < N/4,   (3.23)

one only has to compute

Y' = | A'  B' | y'.   (3.24)
     | B'  A' |

Since the submatrices A' and B' are carved out of the circular matrices A and B respectively, they are structured as Hankel matrices. Partitioning each into four equal submatrices, one can describe the structure as:

A' = | P   Q |    and    B' = | U   V |
     | Q  -P |               | V  -U |.   (3.25)

Note that the submatrices P, Q, U and V are of size N/16 x N/16.


3.3.1 16-point DHT

When N = 16, the group ordering for (3.21) is A(16) = C_2 x C_4 = {1, 7} x {1, 3, 9, 11} = {1, 3, 9, 11, 7, 5, 15, 13}. It can be shown that P and Q have identical values, since

Q = cas(3 · 2π/16)
  = cas(π/2 - 2π/16)
  = cas(1 · 2π/16)
  = P.   (3.26)

Similarly, the relationship U = -V can easily be verified:

U = cas(7 · 2π/16)
  = cas(π - 2π/16)
  = -cos(2π/16) + sin(2π/16),

and

V = cas(5 · 2π/16)
  = cas(π/2 + 2π/16)
  = cos(2π/16) - sin(2π/16).   (3.27)

Combining (3.24) - (3.27), we have in this case

| Y'(0) |   |  P   P   U  -U | | y'(0) |
| Y'(1) | = |  P  -P  -U  -U | | y'(1) |
| Y'(2) |   |  U  -U   P   P | | y'(2) |
| Y'(3) |   | -U  -U   P  -P | | y'(3) |.   (3.28)

With four extra additions, (3.28) can be evaluated as two separate Hankel products:

| Y'(0) | = |  P  -U | | y'(0) + y'(1) |
| Y'(3) |   | -U  -P | | y'(3) - y'(2) |

and

| Y'(2) | = |  P  -U | | y'(2) + y'(3) |
| Y'(1) |   | -U  -P | | y'(1) - y'(0) |.   (3.29)

A bilinear algorithm can be used for each of these 2-point Hankel products. Since there are only additions outside the Hankel matrices, the overall algorithm is bilinear as well.


We now illustrate the general procedure for the case of a 16-point DHT. The computation of the even components of the DHT (X(k), k ∉ A(N)) directly follows the discussion in Section 3.2 and is shown in Fig. 3.6.

Figure 3.6: Proposed bilinear algorithm for the even-indexed components of a 16-point DHT. Note that the multiplication constant c0 = sqrt(2).

The evaluation of the odd components X(k), k ∈ A(N), of the DHT requires one to permute the signal and transform indices to realize the computation as cyclic convolutions and Hankel products. Following the discussion in this section, the set A(16) forms a group C_2 x C_4 under the operation of multiplication modulo 16. We use generator 3 for C_4 and 7 for C_2. Thus A(16) = {1, 7} x {1, 3, 9, 11}. The permuted sequence y = {x(1), x(3), x(9), x(11), x(7), x(5), x(15), x(13)}. Similarly, Y = {X(1), X(3), X(9), X(11), X(7), X(5), X(15), X(13)}. The matrix M which is used to get Y from y as Y = My is given by (an entry j stands for the value cas(j 2π/16) in


the matrix):

      |  1   3   9  11   7   5  15  13 |
      |  3   9  11   1   5  15  13   7 |
      |  9  11   1   3  15  13   7   5 |
M  =  | 11   1   3   9  13   7   5  15 |
      |  7   5  15  13   1   3   9  11 |
      |  5  15  13   7   3   9  11   1 |
      | 15  13   7   5   9  11   1   3 |
      | 13   7   5  15  11   1   3   9 |

One can easily verify properties (3.16) - (3.21) of the matrix M above.
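As a numerical sanity check (our illustration, not the hardware flow of Fig. 3.7), the folded Hankel products of (3.29) can be compared against a direct evaluation of Y = My:

```python
import math

N = 16
cas = lambda a: math.cos(a) + math.sin(a)
s = [1, 3, 9, 11, 7, 5, 15, 13]          # A(16) ordered as {1,7} x {1,3,9,11}

x = [0.3 * i * i - 2.0 * i + 1.0 for i in range(N)]   # arbitrary test signal
y = [x[j] for j in s]

# Direct evaluation of the permuted product Y = M y, M(k,i) = cas(s_k s_i 2*pi/N).
Y = [sum(cas(2 * math.pi * s[k] * s[i] / N) * y[i] for i in range(8))
     for k in range(8)]

# Folded inputs y' of (3.23) and the shared 2x2 Hankel kernel of (3.29).
yp = [y[0] - y[2], y[1] - y[3], y[4] - y[6], y[5] - y[7]]
P = cas(2 * math.pi / 16)
U = cas(7 * 2 * math.pi / 16)
Y0 = P * (yp[0] + yp[1]) - U * (yp[3] - yp[2])
Y3 = -U * (yp[0] + yp[1]) - P * (yp[3] - yp[2])
Y2 = P * (yp[2] + yp[3]) - U * (yp[1] - yp[0])
Y1 = -U * (yp[2] + yp[3]) - P * (yp[1] - yp[0])

# Compare against the sign pattern of (3.22): Y(2)=-Y'(0), Y(6)=-Y'(2), etc.
tol = 1e-9
assert abs(Y0 - Y[0]) < tol and abs(Y0 + Y[2]) < tol
assert abs(Y1 - Y[1]) < tol and abs(Y1 + Y[3]) < tol
assert abs(Y2 - Y[4]) < tol and abs(Y2 + Y[6]) < tol
assert abs(Y3 - Y[5]) < tol and abs(Y3 + Y[7]) < tol
```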

The proposed bilinear algorithm for computing the odd-indexed DHT values for N = 16 is shown in Fig. 3.7. One should note that with a bilinear algorithm, the critical path includes only one multiplication operation, resulting in a fast circuit implementation.

Figure 3.7: Proposed bilinear algorithm for the odd-indexed 16-point DHT. Note that the multiplication coefficients are: c0 = sqrt(2), c1 = 0.7654, c2 = 0.5412 and c3 = -1.8478.

3.3.2 More than 16-point DHT

When N is greater than 16, we define index groups for i', k' ∈ {0, 1, 2, ..., N/16 - 1} and

P(i',k') = cas(g^{i'} g^{k'} 2π/N),
Q(i',k') = cas(g^{i'+N/16} g^{k'} 2π/N),
U(i',k') = cas(h g^{i'} g^{k'} 2π/N),
V(i',k') = cas(h g^{i'+N/16} g^{k'} 2π/N).   (3.30)

We now show that P(i',k') and V(i',k') are related. Substituting h = N/2 - 1 and using Lemma 3, one gets

V(i',k') = cas(g^{i'+k'} (N/2 - 1)(N/4 + 1) 2π/N)
         = cas(g^{i'+k'} (N^2/8 + N/4 - 1) 2π/N)
         = cas(g^{i'+k'} π/2 - g^{i'+k'} 2π/N)   (3.31)
         = sin(g^{i'+k'} π/2) cas(g^{i'+k'} 2π/N)
         = sin(g^{i'+k'} π/2) P(i',k').

Since g = 3, sin(g^{i'+k'} π/2) equals 1 if (i' + k') is an even number; for odd (i' + k'), sin(g^{i'+k'} π/2) equals -1. Therefore P(i',k') and V(i',k') are related by

P(i',k') =  V(i',k'),   if indices i' and k' are both even or both odd,
           -V(i',k'),   if one index is even and the other is odd.   (3.32)

Similarly, one can show that

U(i',k') = cas(g^{i'+k'} (N/2 - 1) 2π/N)
         = cas(g^{i'+k'} π - g^{i'+k'} 2π/N)
         = cos(g^{i'+k'} π) cas(-g^{i'+k'} 2π/N)
         = -cas(-g^{i'+k'} 2π/N),   (3.33)

and

Q(i',k') = cas(g^{i'+k'} (N/4 + 1) 2π/N)
         = cas(g^{i'+k'} π/2 + g^{i'+k'} 2π/N)
         = sin(g^{i'+k'} π/2) cas(-g^{i'+k'} 2π/N)
         = -sin(g^{i'+k'} π/2) U(i',k').   (3.34)


Since g = 3, sin(g^{i'+k'} π/2) = 1 if (i' + k') is an even number; otherwise sin(g^{i'+k'} π/2) = -1. Therefore Q(i',k') and U(i',k') are related by

Q(i',k') = -U(i',k'),   if indices i' and k' are both even or both odd,
            U(i',k'),   if one index is even and the other is odd.   (3.35)

We now show that the matrix in (3.24) can be simplified to two N/8-point skew circular matrices with identical kernels. A skew circular matrix closely resembles a circular matrix, except that the elements below the diagonal running from top right to bottom left are negated. For example, from (3.25), matrices A' and B' can be seen to be skew circular matrices. To compute Y' through products with two N/8 x N/8 skew circular matrices, first recall that

Y' = | A'  B' | y' = | P   Q   U   V  | y',   (3.36)
     | B'  A' |      | Q  -P   V  -U  |
                     | U   V   P   Q  |
                     | V  -U   Q  -P  |

where P, Q, U and V are N/16 x N/16 matrices. Because of the relationships (3.32) and (3.35), the elements of the matrix in (3.36) are related. To take advantage of these relationships, we compute the even-indexed Y' components of the top half together with the odd-indexed components of the bottom half. For these Y' components, (3.36) gives

Y'(k) = A y'(i)|_{i even} + B y'(i)|_{i odd},   0 <= k < N/8, k even; N/8 <= k < N/4, k odd,   (3.37)


where, from (3.32) and (3.35), one gets

A(i,k) = |  P(i,k)   Q(i,k)  -Q(i,k)   P(i,k) |
         |  Q(i,k)  -P(i,k)   P(i,k)   Q(i,k) |
         |  Q(i,k)  -P(i,k)   P(i,k)   Q(i,k) |
         | -P(i,k)  -Q(i,k)   Q(i,k)  -P(i,k) |

and

B(i,k) = |  P(i,k)   Q(i,k)   Q(i,k)  -P(i,k) |
         |  Q(i,k)  -P(i,k)  -P(i,k)  -Q(i,k) |
         | -Q(i,k)   P(i,k)   P(i,k)   Q(i,k) |
         |  P(i,k)   Q(i,k)   Q(i,k)  -P(i,k) |.   (3.38)

One may note that the first N/16 columns of matrix A are identical to the last N/16 columns. Further, the columns from N/16 to 2N/16 are exactly the negatives of the columns in positions 2N/16 to 3N/16. One can therefore add or subtract the corresponding elements of the vector y'(i), i even, and then multiply the result by the same elements P(i,k) or Q(i,k). In other words, one can shrink the number of columns of matrix A by an appropriate folding of the vector y'(i), i even.

Similarly, one can use the fact that in matrix B, the first N/16 and the last N/16 columns are negatives of each other, while the columns N/16 to 2N/16 are identical to the columns 2N/16 to 3N/16. Thus an appropriate folding of the vector y'(i), i odd, allows one to reduce the number of columns of matrix B by a factor of 2 as well.

From this discussion, the algorithm to compute the even-indexed components in the first half of Y' and the odd-indexed components in the last half of Y' can be designed as follows.

Define a new sequence {ya} of N/8 components as

ya(2j) = y'(2j + 3N/16) + y'(2j),
ya(2j + 1) = y'(2j + 1 + N/16) + y'(2j + 1 + N/8),
ya(N/16 + 2j) = y'(2j + N/16) - y'(2j + N/8),
ya(N/16 + 2j + 1) = y'(2j + 1 + 3N/16) - y'(2j + 1),   0 <= j < N/32.   (3.39)


Let Ya = H x ya, where H is the skew cyclic matrix

H = | h_0         h_1   ...   h_{N/8-1}  |
    | h_1         h_2   ...  -h_0        |
    | ...                                |
    | h_{N/8-1}  -h_0   ...  -h_{N/8-2}  |   (3.40)

with h_{2j} = P(2j, 0),
     h_{2j+1} = Q(2j + 1, 0),
     h_{N/16+2j} = Q(2j, 0),
     h_{N/16+2j+1} = -P(2j + 1, 0).   (3.41)

Then for 0 <= j < N/16, one gets

Y'(2j) = Ya(2j),
Y'(N/8 + 2j + 1) = Ya(2j + 1).   (3.42)

Similarly, the computation of the odd-indexed components in the first half of Y' and the even-indexed components in the second half of Y' can be carried out together following the same procedure. The resultant algorithm for these components is as follows. First define a yb vector of length N/8 as

yb(2j) = y'(2j + N/16) + y'(2j + N/8),
yb(2j + 1) = y'(2j + 1 + 3N/16) + y'(2j + 1),
yb(N/16 + 2j) = y'(2j + 3N/16) - y'(2j),
yb(N/16 + 2j + 1) = y'(2j + 1 + N/16) - y'(2j + 1 + N/8),   0 <= j < N/32.   (3.43)

Compute Yb = H x yb, where the H matrix is the same as before. Then one may obtain the remaining components of Y' as

Y'(2j + 1) = Yb(2j + 1),
Y'(N/8 + 2j) = Yb(2j),   0 <= j < N/16.   (3.44)
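The skew cyclic product Y = H x y at the heart of (3.40)-(3.44) has a simple generic form; the sketch below (our illustration) implements it directly from the index pattern H(i,k) = h_{i+k}, with wrapped entries negated:

```python
def skew_cyclic_mul(h, y):
    """Product of an n x n skew cyclic matrix, given by its first row h,
    with a vector y: H(i, k) = h[i + k] for i + k < n, else -h[i + k - n]."""
    n = len(h)
    out = []
    for i in range(n):
        acc = 0.0
        for k in range(n):
            t = i + k
            acc += (h[t] if t < n else -h[t - n]) * y[k]
        out.append(acc)
    return out

# Example: n = 2 reproduces the 2x2 kernel | h0  h1 |
#                                          | h1 -h0 |
print(skew_cyclic_mul([2.0, 3.0], [1.0, 1.0]))  # [5.0, 1.0]
```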

In the case of large length bilinear algorithms, one can trade speed for hardware complexity. We have developed a pipelined (through folding) architecture exploiting the recursive nature of the algorithm. Firstly, for the 2^n-point DHT, the block circular matrix inside the Hartley group transform is computed with two Hankel products with identical kernels. The hardware complexity of this computation can be cut in half if only one copy of the hardware block is used and the computation is shared. Secondly, as can be seen from Figs. 3.3 and 3.4, HGT_{N/p} is used in the computation of X(k) both when k ∈ A(N) and when k ∉ A(N). This repeated computation is also marked in boxes in Figs. 3.6 and 3.7. Using only one block of HGT_{N/p} in the DHT and reusing it in multiple computations breaks the bilinear structure of the algorithm and increases its inherent delay. However, this can have a significant impact on the area reduction, as is shown in Fig. 3.8 in Section 3.4. In addition, since the even transform outputs X(k), k ∉ A(N), and the odd transform outputs X(k), k ∈ A(N), are never realized concurrently, a circuit implementation with multiplexed outputs may be used to reduce the output pin requirement by 50%.

3.4 Performance analysis for 2^n-point DHT

The arithmetic complexity of any algorithm determines its hardware complexity. The bilinear algorithm for the N = 2^n point DHT developed here uses (3^{n-1} - 1)/2 - n + 1 multiplications and 2^{n+1} + (3^n - 81)/2 - 3n + 42 additions for n >= 3. Thus the hardware costs are of order O(N^{1.6}). The actual numbers of arithmetic operations for the proposed algorithm are compared in Table 3.1. This table also provides

Table 3.1: Hardware complexity of various 2^n-point DHT algorithms.

        Bilinear       Folding        Ref. [4]       Ref. [2]
  N     mults  adds    mults  adds    mults  adds    mults  adds
  2     0      2
  4     0      8
  8     2      22                     2      26
  16    10     62      8      52      10     74
  32    36     172     19     121     34     194     34     174
  64    116    476     63     371     98     482     98     438
  128   358    1330    197    847     256    1154    258    1070
  256   1086   3770    601    2315    642    2690    642    2518
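The closed-form counts above can be cross-checked against the table; the following sketch (our illustration) evaluates the two formulas:

```python
def bilinear_dht_cost(n):
    """Operation counts of the proposed bilinear 2^n-point DHT, n >= 3."""
    mults = (3 ** (n - 1) - 1) // 2 - n + 1
    adds = 2 ** (n + 1) + (3 ** n - 81) // 2 - 3 * n + 42
    return mults, adds

# Reproduce the "Bilinear" column of Table 3.1 for N = 8 ... 256:
table = {8: (2, 22), 16: (10, 62), 32: (36, 172),
         64: (116, 476), 128: (358, 1330), 256: (1086, 3770)}
for N, cost in table.items():
    n = N.bit_length() - 1
    assert bilinear_dht_cost(n) == cost
```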


complexities of two reference algorithms [2,4]. Both of these algorithms use a split-radix 2/4 method. It is worth noting that reference [2] uses the bilinear algorithm for N < 32. Clearly the author recognizes the complexity advantage of the bilinear algorithm at small lengths. In [4], the algorithm is further extended to split-radix 2/8. Though this improves the data transfer between the processor and the memory, it does not reduce the arithmetic complexity. Therefore we have used only the split-radix 2/4 algorithm for reference [4]. The speed of an implementation can be estimated from the inherent delay of the algorithm. Table 3.2 compares the speed of the new algorithm with the reference algorithms.

algorithm with the reference algorithms.

Table 3.2: Time complexity of various 2^n-point DHT algorithms. Note that M and A stand for multiplier and adder delays respectively.

  N     Bilinear   Folding    Ref. [4]   Ref. [2]
  2     A
  4     A
  8     M+2A                  M+5A
  16    M+5A       2M+7A      M+6A
  32    M+7A       2M+11A     2M+9A      2M+6A
  64    M+9A       2M+15A     2M+10A     2M+9A
  128   M+11A      2M+19A     3M+13A     3M+10A
  256   M+13A      2M+23A     3M+14A     3M+13A

The proposed algorithm and the reference algorithms [2,4] were implemented with 16-bit fixed point arithmetic in TSMC 90nm CMOS technology. The normalized area and critical path delay of these implementations are shown in Fig. 3.8. One can see that the bilinear architecture always provides the highest speed. Our implementations are faster by 61%, 15%, 42% and 40% at 8, 16, 32 and 64 points than those of [4], and faster by 17% and 29% at 32 and 64 points than those of [2]. When designs of the bilinear architecture and [4] are operated at the same speed, the bilinear implementations can be as much as 45% smaller at 8 points, 22% at 16 points, 18% at 32 points and 6% at 64 points. Clearly, because the hardware cost of the N point bilinear DHT goes up as O(N^{1.6}), the area advantage diminishes for larger sizes. However, with a pipelined and folded design, both the speed and area advantages of bilinear circuits can be


CHAPTER 3. DISCRETE HARTLEY TRANSFORM

[Figure 3.8 appears here: four panels (8, 16, 32 and 64 points) plotting normalized area against delay for Ref. [4], Ref. [2] (32 and 64 points only), bilinear, and pipelined bilinear implementations.]

Figure 3.8: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 8, 16, 32 and 64 point DHTs.


extended to much larger transform lengths as can be verified from Fig. 3.8.

3.5 Bilinear algorithm for 3^n-point DHT

For a p^n-point DHT, where p is an odd prime, the Abelian group is A(p^n) = C_{(p-1)p^{n-1}}. Since (p - 1) is relatively prime to p^{n-1}, we have C_{(p-1)p^{n-1}} = C_{p-1} x C_{p^{n-1}}. This implies that for odd p, the (p - 1)p^{n-1} transform samples X(k), k in A(p^n), can be obtained through several convolutions. By combining a (p - 1)-point cyclic convolution with a p^{n-1}-point cyclic convolution, one can incorporate the effect of all signal components x(i), i in A(N). Signal components x(i) with gcd(i, N) = p^k are involved through a two-dimensional cyclic convolution of (p - 1) x p^{n-1-k} points. The remaining transform samples are obtained from a DHT of length p^{n-1} as described in Section 3.2.

The complexity of the cyclic convolution can be further reduced by using the

following lemma.

Lemma 4 Let g denote the generator of the cyclic group A(N), where N = p^n, p being an odd prime and n >= 2. Then

$$\sum_{j=0}^{p-1} \mathrm{cas}\!\left(2\pi g^{\,k+j|A(N)|/p}/N\right) = 0, \qquad k = 0, 1, 2, \ldots, |A(N)|/p - 1.$$

Proof. Since g is the generator of the cyclic group of order |A(N)|, the element g^{|A(N)|/p} generates the unique order-p subgroup of A(N). Each of the elements of this subgroup (except the identity) has order p. We now show that the integers 1 + i(N/p) in A(N), 0 <= i < p, have order p. This would imply that these are exactly the elements of the subgroup.

From the binomial theorem we have

$$\left(1 + i(N/p)\right)^p = 1 + iN + \sum_{j=2}^{p}\binom{p}{j}(iN/p)^j = 1 + iN + i^2 N (N/p^2)\sum_{j=2}^{p}\binom{p}{j}(iN/p)^{j-2}. \qquad (3.45)$$

However, for n >= 2, N is divisible by p^2. Using this, (3.45) yields

$$(1 + i(N/p))^p \equiv 1 \pmod{N}. \qquad (3.46)$$

Note that since the integers 1 + i(N/p), 0 <= i < p, are the same as the elements of the subgroup generated by g^{|A(N)|/p}, this set of integers is the same as the set of integers g^{i|A(N)|/p}, 0 <= i < p.

Let W represent a primitive N-th root of unity. Since W^N - 1 = 0 but W^{N/p} - 1 != 0, we have

$$\sum_{i=0}^{p-1} W^{i(N/p)} = 0.$$

Hence

$$\sum_{i=0}^{p-1} W^{1 + i(N/p)} = 0. \qquad (3.47)$$

Taking the real and imaginary parts of (3.47) one gets

$$\sum_{i=0}^{p-1} \cos\!\left(2\pi(1 + i(N/p))/N\right) = 0 \quad\text{and}\quad \sum_{i=0}^{p-1} \sin\!\left(2\pi(1 + i(N/p))/N\right) = 0. \qquad (3.48)$$

Using the equivalence of the set of integers 1 + i(N/p) with the subgroup elements, we can rewrite (3.48) as

$$\sum_{i=0}^{p-1} \cos\!\left(2\pi g^{i|A(N)|/p}/N\right) = 0 \quad\text{and}\quad \sum_{i=0}^{p-1} \sin\!\left(2\pi g^{i|A(N)|/p}/N\right) = 0. \qquad (3.49)$$

By multiplying the arguments of the cosine and sine functions in (3.49) by a constant integer g^k mod N and applying trigonometric identities one gets

$$\sum_{i=0}^{p-1} \cos\!\left(2\pi g^{k+i|A(N)|/p}/N\right) = 0 \quad\text{and}\quad \sum_{i=0}^{p-1} \sin\!\left(2\pi g^{k+i|A(N)|/p}/N\right) = 0. \qquad (3.50)$$

By adding the two expressions in (3.50) one gets the required result. ∎
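Lemma 4 is easy to check numerically for the smallest case N = 9 (p = 3, generator g = 2, |A(9)| = 6); a Python sketch:

```python
import math

def cas(t):
    """Hartley kernel: cas(t) = cos(t) + sin(t)."""
    return math.cos(t) + math.sin(t)

N, p, g, order = 9, 3, 2, 6          # N = 3^2, generator g = 2, |A(9)| = 6
for k in range(order // p):          # k = 0, 1, ..., |A(N)|/p - 1
    s = sum(cas(2 * math.pi * pow(g, k + j * (order // p), N) / N)
            for j in range(p))
    assert abs(s) < 1e-12            # the p coset samples of cas sum to zero
```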

This lemma implies that in every cyclic convolution in the DHT algorithm of p^n points, the components of the constant sequence always add to zero. This significantly reduces the number of additions and multiplications in the algorithm. For example, a 3-point cyclic convolution normally has a complexity of 4 multiplications and 11 additions. But if the constant sequence has the property that its components add to 0, the convolution complexity reduces to only 3 multiplications and 6 additions, a saving of over 25%.
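One way to realize the 3-multiplication convolution is to note that when the constants sum to zero, the (z - 1) channel of the CRT factorization z^3 - 1 = (z - 1)(z^2 + z + 1) vanishes, leaving only a product modulo z^2 + z + 1. The Python sketch below is our own illustration of this idea (the 1/3 constant is left explicit here; in hardware it would be folded into the precomputed coefficients):

```python
import numpy as np

def cyclic3_zero_sum(x, h):
    """3-point cyclic convolution of x with h, assuming h[0]+h[1]+h[2] == 0.
    The (z - 1) CRT channel of z^3 - 1 then vanishes, so only the product
    modulo z^2 + z + 1 is needed -- 3 general multiplications."""
    u0, u1 = x[0] - x[2], x[1] - x[2]            # x mod (z^2 + z + 1)
    v0, v1 = h[0] - h[2], h[1] - h[2]            # constants, precomputable
    m1, m2, m3 = u0 * v0, u1 * v1, (u0 + u1) * (v0 + v1)
    w0, w1 = m1 - m2, m3 - m1 - 2 * m2           # u*v mod (z^2 + z + 1)
    t = -(w0 + w1) / 3                           # lift back using sum(y) == 0
    return np.array([w0 + t, w1 + t, t])

rng = np.random.default_rng(4)
h = rng.standard_normal(3)
h[2] = -h[0] - h[1]                              # enforce the zero-sum property
x = rng.standard_normal(3)
ref = np.array([sum(h[(k - i) % 3] * x[i] for i in range(3)) for k in range(3)])
assert np.allclose(cyclic3_zero_sum(x, h), ref)
```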

One can also take advantage of the fact that the computation of the transform components X(k), k in A(N), uses cyclic convolutions of (p - 1) x p^{n-1-i} points, 0 <= i < n. Each of these is a two-dimensional convolution with length 2 in one dimension and therefore has identical post-multiplication processing in that dimension. One can therefore combine these identical computations so as to reduce the post-multiplication additions.

We now illustrate the above arguments through the example of a DHT of 9 points. Note that A(9) = C_6 = {1, 2, 4, 8, 7, 5} with generator g = 2. C_6 can always be expressed as C_3 x C_2. By choosing g^2 = 4 as the generator of C_3 and g^3 = 8 as the generator of C_2, A(9) can be reordered as A(9) = {1, 4, 7} x {1, 8} = {1, 4, 7, 8, 5, 2}. The computation of X(k), k in A(9), can be expressed (with a matrix entry q representing the real value cas(2 pi q / 9)) as:

$$
\begin{pmatrix} X(1)\\ X(4)\\ X(7)\\ X(8)\\ X(5)\\ X(2) \end{pmatrix}
=
\begin{pmatrix}
1 & 4 & 7 & 8 & 5 & 2\\
4 & 7 & 1 & 5 & 2 & 8\\
7 & 1 & 4 & 2 & 8 & 5\\
8 & 5 & 2 & 1 & 4 & 7\\
5 & 2 & 8 & 4 & 7 & 1\\
2 & 8 & 5 & 7 & 1 & 4
\end{pmatrix}
\begin{pmatrix} x(1)\\ x(4)\\ x(7)\\ x(8)\\ x(5)\\ x(2) \end{pmatrix}
+
\begin{pmatrix}
3 & 6\\ 3 & 6\\ 3 & 6\\ 6 & 3\\ 6 & 3\\ 6 & 3
\end{pmatrix}
\begin{pmatrix} x(3)\\ x(6) \end{pmatrix}
+
\begin{pmatrix} 0\\ 0\\ 0\\ 0\\ 0\\ 0 \end{pmatrix} x(0).
\qquad (3.51)
$$

For the 2-point block convolution within the larger 6-point convolution matrix, one can see that the matrix product with the vector x(i), i in A(9), can be obtained as follows:

• Compute output vector (Y(0), Y(1), Y(2)) from input vector (x(1)+x(8), x(4)+x(5), x(7)+x(2)).

• Compute output vector (Y'(0), Y'(1), Y'(2)) from input vector (x(1)-x(8), x(4)-x(5), x(7)-x(2)).

• Obtain the first matrix product in (3.51) as

(Y(0)+Y'(0), Y(1)+Y'(1), Y(2)+Y'(2), Y(0)-Y'(0), Y(1)-Y'(1), Y(2)-Y'(2)). (3.52)
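The identity behind these three steps is generic: for any block matrix [[A, B], [B, A]], the product with a stacked vector (u; v) can be assembled from the two half-size products Y = ((A+B)/2)(u+v) and Y' = ((A-B)/2)(u-v). A quick numeric check, with generic 3 x 3 blocks standing in for the two circulant blocks of (3.51):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
u, v = rng.standard_normal(3), rng.standard_normal(3)

Y = (A + B) / 2 @ (u + v)            # convolution against one constant set
Yp = (A - B) / 2 @ (u - v)           # convolution against the other

M = np.block([[A, B], [B, A]])
ref = M @ np.concatenate([u, v])
assert np.allclose(ref, np.concatenate([Y + Yp, Y - Yp]))
```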

One can see that because of Lemma 4 both the cyclic convolutions used in this part have a highly reduced complexity.

Similarly, the second matrix product in (3.51) can be seen to be a 2-point cyclic convolution obtained as follows:

• Compute Z from the input x(3) + x(6).

• Compute Z' from the input x(3) - x(6).

• Obtain the second matrix product in (3.51) as

(Z + Z', Z + Z', Z + Z', Z - Z', Z - Z', Z - Z'). (3.53)

One can combine the steps in (3.52) and (3.53) into two matrix products by first adding Z to Y(0), Y(1) and Y(2), and Z' to Y'(0), Y'(1) and Y'(2), before step (3.52). Thus when step (3.52) is carried out, one gets the sum of the first two matrix products of (3.51). This saves 2 additions of step (3.53). The third matrix product can similarly be incorporated in the sum merely by adding x(0) to Z before adding it to Y(0), Y(1) and Y(2). The complete algorithm of the 9-point DHT is shown in Fig. 3.9.

The pipelining technique used for the 2^n-point DHT can also be applied to the 3^n-point bilinear DHT algorithm to trade off hardware complexity against speed. For the 3^n-point DHT, however, only the Hartley group transform is to be time shared.

3.6 Performance analysis for 3^n-point DHT

The computational complexity of the 3^n-point DHT is summarized in Table 3.3, and critical path delays are listed in Table 3.4. The bilinear algorithm for the 3^n-point DHT developed here uses (3·5^n - 4n - 3)/8 multiplications and (39·5^n - 28·3^n + 12n - 43)/16 additions for n >= 2. Our architectures and the reference designs in [65] were implemented in



Figure 3.9: Proposed bilinear algorithm for the 9-point DHT. Note that the multiplication coefficients are: c0 = -0.5, c1 = 0.8660, c2 = -0.5924, c3 = -1.7057, c4 = -0.7660, c5 = -1.6276, c6 = -0.3008, and c7 = -0.6428.


TSMC 90nm CMOS technology. We used a 16-bit fixed point data representation. The normalized area and delays of these implementations are shown in Fig. 3.10. Our results show that the bilinear implementation has top speeds that are faster than the split-radix reference implementations by 61% at length 9 and 112% at length 27. For length 27, even the slowest bilinear architecture is 12% faster than the fastest reference circuit, and at the same time is 21% to 34% smaller in size. For the length 9 DHT, a bilinear architecture achieves a minimum 25% area saving when operating at the same speed as the split radix design. For length 27, a one-level pipelined and folded design is on average 40% smaller and still 12% faster than the fastest reference design.

Table 3.3: Hardware complexity of various 3^n-point DHT algorithms.

              N      3    9    27    81    243
  Bilinear  mults    1    8    45    232   1169
            adds     6    44   257   1382  7193
  Pipeline  mults    -    -    38    195   982
            adds     -    -    230   1211  6200
  Ref. [65] mults    1    12   69    312   1257
            adds     6    42   204   852   3282
  Ref. [29] mults    1    8    53    236   977
            adds     7    44   227   944   3695

Table 3.4: Critical path delay of various 3^n-point DHT algorithms. Note that M and A stand for multiplier and adder delays respectively.

  N          3      9      27      81      243
  Bilinear   M+2A   M+7A   M+12A   M+17A   M+22A
  Pipeline   -      -      2M+13A  2M+23A  2M+43A
  Ref. [65]  M+2A   3M+6A  5M+10A  7M+14A  9M+18A


[Figure 3.10 appears here: two panels (9 point and 27 point) plotting normalized area against delay for Ref. [65], bilinear, and pipelined implementations.]

Figure 3.10: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 9 and 27 point DHTs.

3.7 Discussion and conclusion

Bilinear algorithms can generally provide the fastest ASIC implementations of the DHT. Even though bilinear algorithms for other transforms such as the Fourier [42, 61] and discrete cosine [35, 53] were known, we believe that this work, for the first time, provides bilinear algorithms for the DHT. We have presented here lengths of type p^n only, but the methods of group theory used to extract the structure in the DHT kernel are quite general and can be extended to other lengths as well. This work shows that the bilinear algorithm for the DHT can perform 20% to 30% faster than other algorithms implemented in the same technology.


Chapter 4

Modified discrete cosine transform

Forward and inverse modified discrete cosine transforms (MDCT/IMDCT) are widely used for subband coding in the analysis and synthesis filterbanks of time domain aliasing cancellation (TDAC). Many international audio coding standards rely heavily on fast algorithms for the MDCT/IMDCT. In this chapter we present hardware efficient bilinear algorithms to compute the MDCT/IMDCT of 2^n and 4·3^n points. The algorithms for composite lengths have practical applications in MPEG-1/2 audio layer III (MP3) encoding and decoding. It is known that the MDCT/IMDCT can be converted to type-IV discrete cosine transforms (DCT-IV). Using group theory, our approach decomposes the DCT-IV transform kernel matrix into groups of cyclic and Hankel matrices. Bilinear algorithms are then applied to efficiently evaluate these groups. When implemented in VLSI, our algorithms greatly improve the critical path delay as compared with the existing solutions. This is due to the fact that bilinear algorithms employ only one multiplication along the critical path. For MP3 audio, we propose three different versions of the unified hardware architectures for both the short and long blocks, and the forward and inverse transforms.

4.1 Background and prior work

The forward and inverse modified discrete cosine transforms (MDCT/IMDCT) are used as analysis and synthesis filter banks in transform/subband coding schemes, such as the time domain aliasing cancellation (TDAC) [41] and the modulated lapped transform (MLT) [31]. The MDCT/IMDCT are basic computing elements in many transform coding standards [38, 39]. Since the MDCT and IMDCT require intensive computations, fast and efficient algorithms for these transforms are key to the realization of high quality audio and video compression schemes [50, 51, 63].

The N-point modified discrete cosine transform (MDCT) of a sequence {x(i)} is defined as

$$X(k) = \sum_{i=0}^{N-1} x(i)\cos\!\left(\frac{\pi(2i+1+\frac{N}{2})(2k+1)}{2N}\right), \qquad k = 0, 1, \ldots, \frac{N}{2}-1. \qquad (4.1)$$

Note the similarity between the kernel of the MDCT and that of the discrete cosine transform (DCT). However, unlike a DCT, the MDCT converts N signal samples into only N/2 transform samples.

There have been many fast algorithms proposed for the MDCT and its inverse, the IMDCT. Based on the symmetry of the transform matrix, Malvar [30] converts an N-point MDCT into an N/2-point type-IV discrete sine transform (DST-IV). Duhamel et al. [18] compute the MDCT/IMDCT through the fast Fourier transform (FFT). An N-point MDCT is reduced to an N/4-point complex-valued FFT. Though the overall arithmetic complexities of the two algorithms are similar, the FFT algorithm has the advantage of existing hardware realizations [24]. In [12, 14, 36], the MDCT and IMDCT are computed using recursive kernels. Recursive implementations require less hardware at the expense of extending the critical path.

Unfortunately most MDCT algorithms are formulated for N = 2^n and do not directly apply to composite data lengths. Many existing applications of the MDCT/IMDCT, however, use composite data lengths. For example, the MPEG-1/2 layer III (MP3) audio format specifies two frames consisting of 1152 and 384 data samples. These frames are further partitioned into 32 subbands. A long block processes 36 data samples and a short block 12 data samples. If implemented directly as in the ISO standard, the arithmetic complexity of this composite N-point MDCT is N^2/2 multiplications and (N^2 - N)/2 additions. Britanak and Rao [8, 9] have designed efficient MDCT algorithms for MP3 audio. Their algorithms are based on Givens rotations. Depending on the block size, 3 or 9 point DCT and DST modules are then used to obtain the results. For the MDCT, the


DCT and DST used are of type-II. For the IMDCT, they are of type-III. Their approach is further refined by Nikolajevic and Fettweis [37], where the number of additions is reduced while the multiplication count remains the same. Fig. 4.1 shows the flow graph of the MDCT computation based on the Givens rotation method. In [27], Lee expresses

[Figure 4.1 appears here: x(i), i = 0..N-1, passes through combine-and-shuffle and rotation stages into N/4-point DCT-II and DST-II blocks, followed by a final combine-and-shuffle stage producing X(k).]

Figure 4.1: Flow graph for Ref. [9] implementation of N point MDCT.

MDCT/IMDCT computations in the DCT-IV format, and successively transforms the DCT-IV into scaled DCT-IIs. The un-normalized or scaled DCTs (SDCT) are used for both the MDCT and IMDCT. Unfortunately, this algorithm has several long recursive computations. These contribute to lower computational complexity, especially for the multiplications. However, in hardware implementations they extend the critical path and the output timing is unbalanced. The flow graph for this approach is shown in Fig. 4.2. Recently Cheng and Hsu [15] have applied matrix factorization schemes to

[Figure 4.2 appears here: the N-point forward and inverse MDCTs realized through an N/2-point DCT-IV, which is recursively decomposed into N/2- and N/4-point SDCT-II and N/4-point DCT-IV blocks.]

Figure 4.2: Flow graph for Ref. [27] implementation of the N point MDCT/IMDCT. Note that SDCT is the unnormalized discrete cosine transform.

further explore the relationships between the DCT and the MDCT. Their algorithms, however, do not directly address the critical path delay.

In this chapter, we present bilinear algorithms to compute the MDCT/IMDCT through the DCT-IV. This allows us to minimize multiplications along the critical path.


Using group theory, we decompose the transform kernel into cyclic and Hankel matrix products. Bilinear algorithms are then used to efficiently evaluate these matrix products. We show that when implemented in VLSI with fixed-point arithmetic, our approach significantly reduces the critical path delay.

The rest of this chapter is organized as follows. Section 4.2 reviews the steps of transforming the MDCT/IMDCT to the DCT-IV. In Section 4.3, bilinear algorithms for 2^n-point MDCT/IMDCT are presented. Section 4.4 develops bilinear algorithms for MDCT/IMDCT with composite lengths of 4·3^n. In particular, a 12-point MDCT/IMDCT is used for MP3 short block processing as a 6-point DCT-IV. The MP3 long block of the 36-point MDCT/IMDCT is computed by an 18-point DCT-IV. For all DCT-IV algorithms, we discuss the group structures, arithmetic complexities, and critical path delays that are associated with the bilinear algorithm implementation. In particular, for the MP3 application, we explore three different versions of the unified hardware architecture for both the short and long blocks, and the forward and inverse transforms. Section 4.5 provides the conclusion of this chapter.

4.2 Transformation from N-point MDCT/IMDCT to N/2-point DCT-IV

As pointed out earlier, an N-point MDCT uses N signal samples to create N/2 transform samples. The first step in the computation of the MDCT therefore involves converting this N x N/2 kernel into the kernel of a known square transform. It is known that an N-point MDCT/IMDCT can be transformed into an N/2-point type-IV DCT [12, 27, 30, 31]. Our derivation here closely follows that of [27].

4.2.1 The forward MDCT transformation

The forward MDCT is defined as

$$X(k) = \sum_{i=0}^{N-1} x(i)\cos\!\left(\frac{\pi(2i+1+\frac{N}{2})(2k+1)}{2N}\right), \qquad k = 0, 1, \ldots, \frac{N}{2}-1. \qquad (4.2)$$

Introduce a new data sequence

$$y(i) = \begin{cases} -x(i + \tfrac{3N}{4}), & \text{if } 0 \le i < N/4,\\[2pt] x(i - \tfrac{N}{4}), & \text{if } N/4 \le i < N. \end{cases} \qquad (4.3)$$

Then (4.2) can be written as

$$X(k) = \sum_{i=0}^{N-1} y(i)\cos\!\left(\frac{\pi(2i+1)(2k+1)}{2N}\right), \qquad k = 0, 1, \ldots, \frac{N}{2}-1. \qquad (4.4)$$

The cosine term in (4.4) satisfies the following relation:

$$\cos\!\left(\frac{\pi(2i+1)(2k+1)}{2N}\right) = -\cos\!\left(\frac{\pi(2N-1-2i)(2k+1)}{2N}\right). \qquad (4.5)$$

Then, defining

$$z(i) = y(i) - y(N-1-i), \qquad 0 \le i < N/2, \qquad (4.6)$$

an N-point MDCT can be expressed as an N/2-point DCT-IV as

$$X(k) = \sum_{i=0}^{N/2-1} z(i)\cos\!\left(\frac{\pi(2i+1)(2k+1)}{2N}\right), \qquad k = 0, 1, \ldots, \frac{N}{2}-1. \qquad (4.7)$$
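The chain (4.3) → (4.6) → (4.7) can be verified numerically against the direct definition (4.2); below is a Python sketch (function names are ours) for N = 16:

```python
import numpy as np

def mdct_direct(x):                          # Eq. (4.2)
    N = len(x)
    i, k = np.arange(N), np.arange(N // 2)
    C = np.cos(np.pi * np.outer(2 * k + 1, 2 * i + 1 + N / 2) / (2 * N))
    return C @ x

def mdct_via_dct4(x):                        # Eqs. (4.3), (4.6), (4.7)
    N = len(x)
    y = np.empty(N)
    y[:N // 4] = -x[3 * N // 4:]             # y(i) = -x(i + 3N/4)
    y[N // 4:] = x[:3 * N // 4]              # y(i) =  x(i - N/4)
    z = y[:N // 2] - y[::-1][:N // 2]        # z(i) = y(i) - y(N - 1 - i)
    i, k = np.arange(N // 2), np.arange(N // 2)
    C = np.cos(np.pi * np.outer(2 * k + 1, 2 * i + 1) / (2 * N))
    return C @ z                             # the N/2-point DCT-IV

x = np.random.default_rng(1).standard_normal(16)
assert np.allclose(mdct_direct(x), mdct_via_dct4(x))
```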

A general MDCT flow graph based on DCT-IV transformation is shown in Fig. 4.3.

[Figure 4.3 appears here: x(i), i = 0..N-1, is rearranged per Eq. (4.3) into y(i), pre-added per Eq. (4.6) into z(i), i = 0..N/2-1, and passed through the DCT-IV of Eq. (4.7) to give X(k), k = 0..N/2-1.]

Figure 4.3: Flow graph for the DCT-IV implementation of the N-point MDCT.

4.2.2 The inverse MDCT transformation

The inverse MDCT (IMDCT) is defined as

$$x'(i) = \frac{2}{N}\sum_{k=0}^{N/2-1} X(k)\cos\!\left(\frac{\pi(2i+1+\frac{N}{2})(2k+1)}{2N}\right), \qquad i = 0, 1, \ldots, N-1. \qquad (4.8)$$

To obtain the IMDCT, first compute the N/2-point type-IV DCT of X as

$$z'(i) = \frac{2}{N}\sum_{k=0}^{N/2-1} X(k)\cos\!\left(\frac{\pi(2i+1)(2k+1)}{2N}\right), \qquad i = 0, 1, \ldots, \frac{N}{2}-1. \qquad (4.9)$$

Applying the symmetry property (4.5), and defining a new data sequence

$$y'(i) = \begin{cases} z'(i), & \text{if } 0 \le i \le N/2 - 1,\\[2pt] -z'(N-1-i), & \text{if } N/2 \le i < N, \end{cases} \qquad (4.10)$$

the IMDCT output x'(i) can then be recovered as

$$x'(i) = \begin{cases} y'(i + \tfrac{N}{4}), & \text{if } 0 \le i \le \tfrac{3N}{4} - 1,\\[2pt] -y'(i - \tfrac{3N}{4}), & \text{if } \tfrac{3N}{4} \le i \le N-1. \end{cases} \qquad (4.11)$$
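Steps (4.9)-(4.11) can likewise be checked against the direct definition (4.8). The Python sketch below assumes the 2/N normalization; since it is applied identically in both paths, the comparison is insensitive to the exact scale factor:

```python
import numpy as np

def imdct_direct(X):                         # Eq. (4.8); N output samples
    N = 2 * len(X)
    i, k = np.arange(N), np.arange(N // 2)
    C = np.cos(np.pi * np.outer(2 * i + 1 + N / 2, 2 * k + 1) / (2 * N))
    return (2.0 / N) * (C @ X)

def imdct_via_dct4(X):                       # Eqs. (4.9)-(4.11)
    N = 2 * len(X)
    i, k = np.arange(N // 2), np.arange(N // 2)
    C = np.cos(np.pi * np.outer(2 * i + 1, 2 * k + 1) / (2 * N))
    z = (2.0 / N) * (C @ X)                  # N/2-point DCT-IV, Eq. (4.9)
    y = np.concatenate([z, -z[::-1]])        # expansion, Eq. (4.10)
    return np.concatenate([y[N // 4:], -y[:N // 4]])   # re-order, Eq. (4.11)

X = np.random.default_rng(5).standard_normal(8)        # N = 16
assert np.allclose(imdct_direct(X), imdct_via_dct4(X))
```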

An IMDCT flow graph based on DCT-IV transformation is shown in Fig. 4.4.

[Figure 4.4 appears here: X(k), k = 0..N/2-1, passes through the DCT-IV of Eq. (4.9) to give z'(i), which is expanded per Eq. (4.10) into y'(i), i = 0..N-1, and re-ordered per Eq. (4.11) into x'(i).]

Figure 4.4: Flow graph for the DCT-IV implementation of N point IMDCT.

4.2.3 The advantage of DCT-IV transformation

The DCT-IV transformation has significant implications for implementations, especially in hardware. It is clear from Figs. 4.3 and 4.4 that a common DCT-IV module can be shared between the forward and inverse transforms. A unified hardware architecture for the MDCT and IMDCT is shown in Fig. 4.5. Note that we purposely scale the data sample count to 2N points so that the core computation module becomes an N-point DCT-IV.

A key challenge to ASIC implementation is the requirement on the number of input and output (I/O) pins. From a package point of view, the reduction of pad I/O size has not kept pace with the development of transistor technology. From a macro perspective, all inputs and outputs must observe a minimum spacing requirement to reduce potential cross-talk issues. This constraint on inputs and outputs can be addressed with the improved architecture shown in Fig. 4.6. On the input side, the input pins of the IMDCT can be merged with the N input pins of the MDCT. For simplicity, we choose the first N input pins of the MDCT. On the output side, (4.10) shows that only N outputs of the IMDCT are truly unique. Therefore we can keep only the N outputs from the DCT-IV without sacrificing any information. Combining the input and output reduction techniques, our improved architecture can save up to 50% of the I/Os compared to the implementation in Fig. 4.5.

Figure 4.5: Flow graph for the DCT-IV implementation of the 2N-point unified MDCT and IMDCT. Note that IMODE = 0 for MDCT and IMODE = 1 for IMDCT.

Figure 4.6: Flow graph for the DCT-IV implementation of the 2N-point unified MDCT and IMDCT with reduced I/O requirement. Note that for MDCT, IMODE = 0 and in(i) = x(i), i = 0, 1, ..., 2N - 1. For IMDCT, IMODE = 1 and in(k) = X(k), k = 0, 1, ..., N - 1.


4.3 Bilinear algorithms for 2^n-point MDCT/IMDCT

Section 4.2 shows that for N = 2^n, a 2N-point MDCT can be converted to an N-point DCT-IV with N pre-additions. For the IMDCT there is no extra computation involved.

To construct a bilinear algorithm for the MDCT/IMDCT, we need to explore the group structures within the DCT-IV transform kernel. From (4.7) and (4.9), the transform kernel indices take N odd values for (2i + 1) and (2k + 1), which belong to the Abelian group A(8N). From group theory, A(2^{n+3}) = C_2 x C_{2^{n+1}}, where N = 2^n. Thus there exists a cyclic subgroup of A(8N) of size 2N. As shown in Lemma 1, the integer 3 can be used as the generator g of this group. We now prove that the integers φ(i), i = 0, 1, ..., N - 1, defined in the following lemma provide the first N odd integers.

Lemma 5 Let N = 2^n and A(8N) = C_2 x C_{2N}. Using the generator g = 3 of C_{2N}, define the function φ(i), 0 <= i < N, as

$$\phi(i) = \begin{cases} g^i \bmod 4N, & \text{if } (g^i \bmod 4N) < 2N,\\[2pt] 4N - (g^i \bmod 4N), & \text{otherwise.} \end{cases} \qquad (4.12)$$

Then the values of φ(i), 0 <= i < N, give all the odd integers in the range 0 to 2N.

Proof. Since g ∈ A(8N), φ(i) in (4.12) is, for every i, 0 <= i < N, an odd integer in the range 0 to 2N. We now prove that every φ(i), 0 <= i < N, is distinct. It then follows that these φ(i) give all the N odd integers in the range 0 to 2N.

To prove the distinctness of each φ(i), 0 <= i < N, we show in particular that if φ(i) = φ(j) for some 0 <= i, j < N, then i = j. Clearly if g^i mod 4N and g^j mod 4N are both smaller or both larger than 2N, then from (4.12), i = j. Assume that g^i mod 4N < 2N while g^j mod 4N > 2N. Then from (4.12),

$$g^i \bmod 4N = 4N - (g^j \bmod 4N), \quad\text{or}\quad g^i \equiv -g^j \pmod{4N}.$$

By squaring both sides, one gets

$$g^{2i} \equiv g^{2j} \pmod{8N}, \quad\text{or}\quad g^{2(i-j)} \equiv 1 \pmod{8N}. \qquad (4.13)$$

(Here the modulus rises to 8N because g^{2i} - g^{2j} = (g^i - g^j)(g^i + g^j), where g^i + g^j is divisible by 4N and g^i - g^j is even.) But since g is the generator of C_{2N}, a cyclic group under the operation of multiplication modulo 8N, the only way (4.13) can be true for 0 <= i, j < N is if i = j. ∎
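For N = 8 (the case worked out below in this section), the construction of Lemma 5 can be checked directly in Python:

```python
N, g = 8, 3                        # N = 2^3, 4N = 32, generator g = 3 of C_2N

def phi(i):                        # Eq. (4.12)
    r = pow(g, i, 4 * N)
    return r if r < 2 * N else 4 * N - r

vals = [phi(i) for i in range(N)]
assert vals == [1, 3, 9, 5, 15, 13, 7, 11]        # the order used in Sec. 4.3
assert sorted(vals) == list(range(1, 2 * N, 2))   # all odd integers below 2N
```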

The fact that each odd integer (2i + 1), 0 <= i < N, can be expressed through the φ function, which is based on a cyclic group, allows us to convert the MDCT computation into a cyclic convolution. Define the sign function ψ as follows:

$$\psi(i) = \begin{cases} +1, & \text{if } (g^i \bmod 8N) < 2N \ \text{or}\ (g^i \bmod 8N) > 6N,\\[2pt] -1, & \text{otherwise.} \end{cases} \qquad (4.14)$$

We can then express the DCT-IV component

$$X\!\left(\frac{\phi(k)-1}{2}\right) = \sum_{i=0}^{N-1} x\!\left(\frac{\phi(i)-1}{2}\right)\cos\!\left(\frac{\pi\,\phi(i)\phi(k)}{4N}\right)$$

as

$$\psi(k)\,X\!\left(\frac{\phi(k)-1}{2}\right) = \sum_{i=0}^{N-1} \psi(i)\,x\!\left(\frac{\phi(i)-1}{2}\right)\cos\!\left(\frac{\pi(g^{i+k} \bmod 8N)}{4N}\right). \qquad (4.15)$$

Equation (4.15) shows that the permuted and sign-adjusted input sequence ψ(i)x((φ(i)-1)/2) can be cyclically convolved with the constant sequence cos(π(g^t mod 8N)/(4N)) to get the permuted and sign-adjusted transform sequence ψ(k)X((φ(k)-1)/2).
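This convolution form can be verified numerically for the 8-point DCT-IV of the example that follows; the Python sketch below builds the direct transform, permutes and sign-adjusts both sequences with φ and ψ, and checks the Hankel/convolution identity (4.15):

```python
import math
import numpy as np

N, g = 8, 3

def phi(t):                                  # Eq. (4.12)
    r = pow(g, t, 4 * N)
    return r if r < 2 * N else 4 * N - r

def psi(t):                                  # Eq. (4.14): sign of cos(pi*g^t/4N)
    r = pow(g, t, 8 * N)
    return 1 if (r < 2 * N or r > 6 * N) else -1

idx = np.arange(N)                           # direct N-point DCT-IV
C = np.cos(np.pi * np.outer(2 * idx + 1, 2 * idx + 1) / (4 * N))
x = np.random.default_rng(7).standard_normal(N)
X = C @ x

u = np.array([psi(t) * x[(phi(t) - 1) // 2] for t in range(N)])
v = np.array([psi(t) * X[(phi(t) - 1) // 2] for t in range(N)])
h = np.array([math.cos(math.pi * pow(g, t, 8 * N) / (4 * N))
              for t in range(2 * N - 1)])    # constant (Hankel) sequence
assert np.allclose(v, [sum(u[i] * h[i + k] for i in range(N))
                       for k in range(N)])
```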

The bilinear complexity of the 2^n-point DCT-IV is 3^n multiplications and 3(3^n - 2^n) additions. The bilinear complexity of the 2^n-point MDCT is 3^{n-1} multiplications and 3^n - 2^n additions. The bilinear complexity of the 2^n-point IMDCT is 3^{n-1} multiplications and 3(3^{n-1} - 2^{n-1}) additions. Given these complexity requirements, our proposed bilinear algorithm works best at smaller transform sizes, where hardware implementation is practical.


We illustrate the above through an 8-point DCT-IV, which is employed in a 16-point MDCT. Let x(i) and X(i), 0 <= i < 8, denote the input and output samples of the DCT. In this case, with g = 3, the values of φ(i) for i = 0 through 7 are given by {1, 3, 9, 5, 15, 13, 7, 11}. The consecutive values of ψ(i) are {1, 1, 1, -1, -1, 1, -1, 1}. Using the shorthand notation p for the value cos(πp/4N) and p̄ for the value -cos(πp/4N), with N = 8, we can describe the transform matrix for the 8-point DCT-IV as

$$
\begin{pmatrix} X(0)\\ X(1)\\ X(4)\\ -X(2)\\ -X(7)\\ X(6)\\ -X(3)\\ X(5) \end{pmatrix}
=
\begin{pmatrix}
1 & 3 & 9 & \overline{5} & \overline{15} & 13 & \overline{7} & 11\\
3 & 9 & \overline{5} & \overline{15} & 13 & \overline{7} & 11 & \overline{1}\\
9 & \overline{5} & \overline{15} & 13 & \overline{7} & 11 & \overline{1} & \overline{3}\\
\overline{5} & \overline{15} & 13 & \overline{7} & 11 & \overline{1} & \overline{3} & \overline{9}\\
\overline{15} & 13 & \overline{7} & 11 & \overline{1} & \overline{3} & \overline{9} & 5\\
13 & \overline{7} & 11 & \overline{1} & \overline{3} & \overline{9} & 5 & 15\\
\overline{7} & 11 & \overline{1} & \overline{3} & \overline{9} & 5 & 15 & \overline{13}\\
11 & \overline{1} & \overline{3} & \overline{9} & 5 & 15 & \overline{13} & 7
\end{pmatrix}
\begin{pmatrix} x(0)\\ x(1)\\ x(4)\\ -x(2)\\ -x(7)\\ x(6)\\ -x(3)\\ x(5) \end{pmatrix}.
\qquad (4.16)
$$

A Hankel matrix product is thus derived, and an efficient bilinear algorithm can then be applied to compute the transform. This algorithm is shown in Fig. 4.7. Individual architectures for the 16-point MDCT and IMDCT based on this 8-point DCT are shown in Fig. 4.8, whereas a unified architecture is shown in Fig. 4.9. A solid line denotes a transfer function of 1, and a dashed line a transfer function of -1. The multiplication coefficients are listed in Table 4.1.

For lengths 8 and 16, our proposed algorithms for the MDCT are compared to [9], which offers a regular structure based on Givens rotations. The complexities and critical path delays are shown in Table 4.2. The algorithms are implemented in 16-bit fixed arithmetic with the TSMC 90nm CMOS standard cell library. The normalized area and speed comparison of the resultant circuits is shown in Fig. 4.10. For the 8-point MDCT, the top speed of the proposed bilinear implementation is 30% higher than that of [9]. For the 16-point MDCT, our speed advantage is over 44%. In fact, the top speed of the 16-point bilinear implementation is even 15% faster than that of the 8-point implementation of [9]. Given the same speed, the area of the 8-point bilinear circuit can be as much as 32% smaller than that of [9]. For 16 points, the proposed circuit can be as much as 26% smaller. In addition, the MDCT bilinear implementations



Figure 4.7: Proposed bilinear implementation of the 8-point DCT-IV. The multiplication coefficients are in Table 4.1.



Figure 4.8: The implementations of 16-point MDCT and IMDCT based on the 8-point DCT-IV.

Table 4.1: Multiplication coefficients used in Fig. 4.7.

  c1  = -2.3342    c2  =  2.0200    c3  = -1.5097
  c4  =  2.7607    c5  = -1.3533    c6  =  2.2505
  c7  = -3.1108    c8  =  2.6005    c9  = -3.6363
  c10 =  0.8561    c11 = -0.1811    c12 =  0.4033
  c13 = -1.2444    c14 =  0.4714    c15 = -1.4666
  c16 =  1.2062    c17 = -1.4283    c18 =  1.7891
  c19 =  0.6220    c20 = -1.6577    c21 =  0.7032
  c22 = -0.2719    c23 =  0.4105    c24 =  0.6827
  c25 =  0.6985    c26 =  0.2561    c27 =  0.0581



Figure 4.9: Unified implementation of the 16-point MDCT and IMDCT employing one 8-point DCT-IV. Note that for MDCT, IMODE = 0 and in(i) = x(i), 0 <= i < 16. For IMDCT, IMODE = 1 and in(k) = X(k), 0 <= k < 8.


are based on the DCT-IV transform. This permits a simple unified architecture for both the forward and the inverse implementations. The speed and area of these unified implementations are close to those of the bilinear MDCT implementations.

Table 4.2: Complexities of various 8 and 16 point MDCT algorithms. Note that M and A refer to multiplication and addition respectively.

  Transform         Algorithm   Arithmetic complexity   Critical delay
  4-point DCT-IV    Proposed    9M + 15A                M + 4A
  8-point MDCT      Proposed    9M + 19A                M + 5A
  8-point MDCT      Ref. [9]    8M + 24A                2M + 5A
  8-point DCT-IV    Proposed    27M + 55A               M + 6A
  16-point MDCT     Proposed    27M + 63A               M + 7A
  16-point MDCT     Ref. [9]    22M + 60A               2M + 7A


Figure 4.10: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 8 and 16 point MDCTs. Note that Fig. 4.9 is a unified MDCT and IMDCT architecture, while all others compute MDCT only.

4.4 Bilinear algorithms for 4·3^n-point MDCT/IMDCT

The MDCT/IMDCT algorithms for composite lengths of 4·3^n points, where n > 0, have found many practical applications in audio coding standards. In particular,


12-point MDCT/IMDCT is used for the short block and 36-point MDCT/IMDCT is

used for the long block of MPEG-1/2 layer III (MP3) audio processing.

The algorithm for the 4·3^n-point MDCT can be designed following an approach similar to the one in Section 4.2; i.e., a 2N-point MDCT is first converted to an N-point DCT-IV as

$$X(k) = \sum_{i=0}^{N-1} x(i)\cos\!\left(\frac{\pi(2i+1)(2k+1)}{4N}\right), \qquad k = 0, 1, \ldots, N-1. \qquad (4.17)$$

An N-point IMDCT is computed directly from an N-point DCT-IV to obtain one half of the outputs. The other half is redundant and can be obtained with trivial sign changes.

As discussed in Section 4.2, the MDCT of any even length can be computed via a DCT-IV of half the length. Let N = 2·3^n, where n > 0. We will use the symbol X_n to indicate a DCT-IV of length 2·3^n. Consider the group A(8N) = {0 < i < 8N | gcd(i, 8N) = 1}. The proposed computation shown in Fig. 4.11 uses a transform division of the DCT-IV kernel matrix based on A(8N). For the MDCT, it is a frequency division scheme; for the IMDCT, a time division scheme.

[Figure 4.11 appears here: the input feeds a Cosine Group Transform CGT_N branch producing X(k) with 2k+1 in A(8N), 0 <= k < N, and a pre-addition stage y(i) = x(i) - x(2N/3 - i - 1) - x(2N/3 + i), i = 0..N/3-1, followed by a DCT-IV of length N/3 producing X(k) with 2k+1 not in A(8N).]

Figure 4.11: Flow graph for 2 • 3"-point bilinear DCT-IV.

Consider first the computation of X_n(k), where (2k+1) ∉ A(8N), i.e., (2k+1) is a multiple of 3. In this case, it can be shown that the multiplication coefficients for


CHAPTER 4. MODIFIED DISCRETE COSINE TRANSFORM

x(i), x(2N/3 − i − 1) and x(2N/3 + i) are related. In particular,

cos( π(2i+1)(2k+1) / 4N ) = −cos( π(2(2N/3 − i − 1)+1)(2k+1) / 4N )
                          = −cos( π(2(2N/3 + i)+1)(2k+1) / 4N ).        (4.18)

To take advantage of (4.18), define

z(i) = x(i) − x(2N/3 − i − 1) − x(2N/3 + i),  i = 0, 1, ..., (N/3) − 1.

Then it is obvious that for (2k+1) ∉ A(8N),

X_n(3k+1) = Σ_{i=0}^{N/3−1} z(i) cos( π(2i+1)(2k+1) / (4N/3) ) = Z_{n−1}(k),  k = 0, 1, ..., N/3 − 1,        (4.19)

where Z_{n−1}(k) is the 2·3^{n−1}-point DCT-IV of the sequence {z(i)}. Therefore the DCT-IV components whose index values (2k+1) are multiples of 3 can be computed directly from the DCT-IV of a sequence {z(i)} of the smaller length N/3.
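The reduction (4.19) is easy to check numerically for N = 6 (n = 1); the sketch below (dct4 is our own reference implementation, not the proposed hardware algorithm) compares the full DCT-IV at the indices 3k+1 with the 2-point DCT-IV of the folded sequence {z(i)}.

```python
import math
import random

def dct4(x):
    N = len(x)
    return [sum(x[i] * math.cos(math.pi * (2*i + 1) * (2*k + 1) / (4 * N))
                for i in range(N)) for k in range(N)]

random.seed(1)
N = 6
x = [random.uniform(-1, 1) for _ in range(N)]
# pre-addition of (4.18)-(4.19): z(i) = x(i) - x(2N/3 - i - 1) - x(2N/3 + i)
z = [x[i] - x[2*N//3 - i - 1] - x[2*N//3 + i] for i in range(N // 3)]
X = dct4(x)   # full 6-point DCT-IV
Z = dct4(z)   # 2-point DCT-IV of {z(i)}
# X(3k+1) = Z(k), i.e. the components whose 2k+1 is divisible by 3
assert all(abs(X[3*k + 1] - Z[k]) < 1e-9 for k in range(N // 3))
```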

To compute X_n(k) where (2k+1) ∈ A(8N), we use the fact that A(8N) forms a group under the operation of multiplication modulo 8N. We refer to this computation of the cosine transform with the transform indices restricted to a group as the 2·3^n-point Cosine Group Transform, CGT_N. Thus we have

CGT_N(k) = Σ_{i=0}^{N−1} x(i) cos( π(2i+1)(2k+1) / 4N ),  0 ≤ k < N, 2k+1 ∈ A(8N).        (4.20)

We separate the summation in (4.20) into two summations depending on whether (2i+1) belongs to A(8N) or not. The results of these are combined using |CGT_N| additions later. When (2i+1) ∈ A(8N), we can permute the signal and transform components to convert the partial kernel to a direct product of cyclic groups. This permutation and computation thus depend on the group structure and are illustrated later in this section.

When (2k+1) ∈ A(8N) but (2i+1) ∉ A(8N), (2i+1) is a multiple of 3. In this case, only the first N/3 components of the cosine group transform are independent


because

CGT_N(k) = −CGT_N(2N/3 − k − 1)
         = −CGT_N(2N/3 + k).        (4.21)

It is therefore sufficient to compute CGT_N(k) only for (2k+1) ∈ A(8N), 0 ≤ k < N/3, i.e., (2k+1) ∈ A(8N/3). Also, since (2i+1) ∉ A(8N), (2i+1) is a multiple of 3 and can be written as 3(2i′+1) with 0 ≤ i′ < N/3. Thus

CGT_N(k) = Σ_{(2i+1)∉A(8N)} x(i) cos( π(2i+1)(2k+1) / 4N )

         = Σ_{i=0}^{N/3−1} x′(i) cos( π(2i+1)(2k+1) / (4N/3) ),  (2k+1) ∈ A(8N/3),

         = CGT_{N/3}(k).        (4.22)

Note that the sequence {x′(i)} in (4.22) is defined as x′(i) = x(3i+1), 0 ≤ i < N/3. Further, CGT_{N/3} in (4.22) represents the N/3-point cosine group transform of {x′(i)}.
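Both the symmetry (4.21) and the reduction (4.22) can be verified numerically for N = 6, where the signal indices with 2i+1 ∉ A(48) are i = 1 and 4; the helper below is our own illustration, not part of the thesis algorithm.

```python
import math
import random

random.seed(2)
N = 6
x = [random.uniform(-1, 1) for _ in range(N)]

def partial(k):
    # contribution of signal indices with (2i+1) a multiple of 3, i.e. i in {1, 4}
    return sum(x[i] * math.cos(math.pi * (2*i + 1) * (2*k + 1) / (4 * N))
               for i in (1, 4))

# symmetry (4.21) with k = 0: T(0) = -T(2N/3 - 1) = -T(2N/3)
assert abs(partial(0) + partial(3)) < 1e-9
assert abs(partial(0) + partial(4)) < 1e-9
# reduction (4.22) with k = 0: the partial sum equals the N/3-point
# cosine group transform of x'(i) = x(3i + 1)
Np = N // 3
xp = [x[3*i + 1] for i in range(Np)]
small = sum(xp[i] * math.cos(math.pi * (2*i + 1) / (4 * Np)) for i in range(Np))
assert abs(partial(0) - small) < 1e-9
```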

The relationship (4.21) between transform components is essentially the analog of the signal domain relation (4.18) and is due to the symmetry of the kernel. It points to an alternative division scheme in which the components are first separated upon the signal index i with respect to A(8N). For (2i+1) ∉ A(8N), we then further separate the cosine group transform based on the relationship between A(8N) and the transform index k. The motivation behind this signal division scheme is that some computations for (4.19) can be shared with those for the CGT_N where (2i+1) ∈ A(8N). We will show a reduced complexity for the 6-point DCT-IV in Section 4.4.1. The transform division, on the other hand, permits simpler pipelining and can also reduce the number of output pins for large transform sizes. In Sections 4.4.2 and 4.4.3, the advantages of the transform division architecture are discussed in detail.

When (2k+1) ∈ A(8N) and (2i+1) ∈ A(8N), the computation turns into a multi-dimensional convolution. This convolution can be described by the structure of A(8N) = A(16·3^n) = C2 × C4 × C_{2·3^{n−1}}. Let h and g denote the generators of C4


and C_{2·3^{n−1}} respectively. Define a function φ(a,b) as follows:

φ(a,b) = { h^a g^b,       if 0 < h^a g^b < 2N,
         { 4N − h^a g^b,  if 2N < h^a g^b < 4N,
         { h^a g^b − 4N,  if 4N < h^a g^b < 6N,        (4.23)
         { 8N − h^a g^b,  if 6N < h^a g^b < 8N.

Note that in (4.23), the product h^a g^b is always computed modulo 8N. Defined as above, the function φ(a,b) for 0 ≤ a < 2 and 0 ≤ b < 2·3^{n−1} produces all integers within A(8N) which are less than 2N. Thus if A(8N) is considered to be made up of integers of the type (2i+1), then the values of φ described above produce all (2i+1) ∈ A(8N) corresponding to 0 ≤ i < N.

Define a sign function ^(a, 6) as

,, ,, f - 1 , if 2N < hagb < 6N, /dnt. i>(a,b) = \ ' . y (4.24)

[ +1, otherwise.
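The folding functions (4.23)–(4.24) are straightforward to implement; the sketch below (the function name phi_psi is ours) reproduces the N = 6 values φ = {1, 11, 7, 5} and ψ = {1, −1, 1, 1} quoted later in Section 4.4.1.

```python
def phi_psi(h, g, N, a, b):
    """Fold v = h^a g^b (mod 8N) into (phi, psi) per (4.23) and (4.24)."""
    v = pow(h, a, 8 * N) * pow(g, b, 8 * N) % (8 * N)
    if v < 2 * N:
        return v, +1
    if v < 4 * N:
        return 4 * N - v, -1
    if v < 6 * N:
        return v - 4 * N, -1
    return 8 * N - v, +1

# N = 6 (n = 1): h = 2N + 1 = 13 generates C4, g = 7 generates C_{2*3^(n-1)} = C2
N, h, g = 6, 13, 7
vals = [phi_psi(h, g, N, a, b) for b in range(2) for a in range(2)]
assert vals == [(1, 1), (11, -1), (7, 1), (5, 1)]
```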

With the functions φ(a,b) and ψ(a,b) defined in (4.23) and (4.24), one can express the computation

Y(k) = Σ_{(2i+1)∈A(8N)} x(i) cos( π(2i+1)(2k+1) / 4N ),  0 ≤ k < N, 2k+1 ∈ A(8N),        (4.25)

as a convolution. Using the equivalence between the φ(a,b) values and the (2i+1), (2k+1) ranges, one gets

Y(a′,b′) = Σ_{a=0}^{1} Σ_{b=0}^{2·3^{n−1}−1} x(a,b) cos( π φ(a,b) φ(a′,b′) / 4N ),  0 ≤ a′ < 2, 0 ≤ b′ < 2·3^{n−1}.        (4.26)

In (4.26), x(i) is relabeled as x(a,b) where φ(a,b) = 2i+1. Similarly, Y(k) is relabeled as Y(a′,b′) where φ(a′,b′) = 2k+1. Using the definitions of φ(a,b) and ψ(a,b), one gets from (4.26),

Y(a′,b′) = Σ_{a=0}^{1} Σ_{b=0}^{2·3^{n−1}−1} x(a,b) ψ(a,b) ψ(a′,b′) cos( π h^{a+a′} g^{b+b′} / 4N ).        (4.27)


Equation (4.27) can be rewritten as

ψ(a′,b′) Y(a′,b′) = Σ_{a=0}^{1} Σ_{b=0}^{2·3^{n−1}−1} x(a,b) ψ(a,b) cos( π h^{a+a′} g^{b+b′} / 4N ).        (4.28)

Equation (4.28) shows that the permuted and sign-adjusted values of Y(k) are obtained by a multi-dimensional operation of the permuted and sign-adjusted signal samples with a constant sequence made up of cosine terms. In one dimension, this operation represents a 2·3^{n−1}-point cyclic convolution. In the other dimension, it is a 2-point Hankel product.

One can verify that h can always be chosen as h = 2N+1. There are also other values of h which would work as well. Similarly, one can choose g from amongst many possible generators of the cyclic group C_{2·3^{n−1}} ⊂ A(8N). Finally, the 2·3^{n−1}-point cyclic convolution can itself be carried out as a two-dimensional convolution with lengths 2 and 3^{n−1} along the two dimensions. Since an algorithm with a lower computational complexity is desirable, from Section 2.4.2 we can use the value of (m − n)/a to determine the decomposition order of a bilinear algorithm (n, a, m), where n is the length of the input vector, a its additive complexity and m its multiplicative complexity.

The decomposition of CGT_N is summarized in Fig. 4.12.

Figure 4.12: Flow graph for the cosine group transform of the 2·3^n-point DCT-IV. Samples x(i) with 2i+1 ∈ A(8N) feed a multidimensional cyclic convolution producing Y(k), 2k+1 ∈ A(8N); samples with 2i+1 ∉ A(8N) feed CGT_{N/3}, producing X′(k), 2k+1 ∈ A(8N/3); together these form X(k), 2k+1 ∈ A(8N).

The figure shows that the computation of CGT_N breaks down into two independent computations, one involving a


multi-dimensional cyclic convolution and the other, the transform CGT_{N/3}. CGT_{N/3} can also be similarly decomposed into a smaller sized convolution and CGT_{N/3²}. Since

all the resultant convolutions can be done concurrently, one can get a bilinear algorithm for the DCT-IV from the bilinear algorithms for cyclic convolutions.

The above discussion results in a 2·3^n-point DCT-IV algorithm with the bilinear complexity of 9(5^n + 4n − 1)/8 multiplications and (18·5^n − 3^{n+3} + 20n + 17)/2 additions. Thus the bilinear complexity of the 4·3^n-point MDCT is 9(5^n + 4n − 1)/8 multiplications and (18·5^n − 23·3^n + 20n + 17)/2 additions. The bilinear complexity of the 4·3^n-point IMDCT is 9(5^n + 4n − 1)/8 multiplications and (18·5^n − 3^{n+3} + 20n + 17)/2 additions.

additions. Given the complexity requirements, our proposed bilinear algorithm works

best at smaller transform sizes where the hardware implementation is possible. This

is the case for MPEG-1/2 layer III (MP3) audio processing, which is discussed in the

next two sections.
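These closed-form counts can be cross-checked against Tables 4.4 and 4.6 for n = 1 and n = 2; a small sketch (function names ours):

```python
def bilinear_mults(n):
    # 9(5^n + 4n - 1)/8 multiplications (same for DCT-IV, MDCT and IMDCT)
    return 9 * (5**n + 4*n - 1) // 8

def dct4_adds(n):
    # (18*5^n - 3^(n+3) + 20n + 17)/2 additions for the 2*3^n-point DCT-IV
    return (18 * 5**n - 3**(n + 3) + 20*n + 17) // 2

def mdct_adds(n):
    # (18*5^n - 23*3^n + 20n + 17)/2 additions for the 4*3^n-point MDCT
    return (18 * 5**n - 23 * 3**n + 20*n + 17) // 2

# n = 1: 6-point DCT-IV (9M + 23A), 12-point MDCT (9M + 29A)
assert (bilinear_mults(1), dct4_adds(1), mdct_adds(1)) == (9, 23, 29)
# n = 2: 18-point DCT-IV (36M + 132A), 36-point MDCT (36M + 150A)
assert (bilinear_mults(2), dct4_adds(2), mdct_adds(2)) == (36, 132, 150)
```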

4.4.1 The bilinear MDCT/IMDCT for MP3 audio short block length

A 12-point MDCT/IMDCT is used for the short block in MP3 audio processing. As discussed in Section 4.2, these transforms can be converted to a 6-point DCT-IV. Bilinear algorithms for the DCT-IV can then be applied to obtain a fast VLSI implementation.

For the DCT-IV signal indices i = 1 and 4, where (2i+1) is divisible by 3, we can compute a 2-point DCT-IV. Let its outputs be Xc(0) and Xc(1); using the same shorthand notation as before (p denotes cos(πp/4N) and p̄ denotes −cos(πp/4N)), we have

( Xc(0) )   ( 9   3̄ ) ( x(1) )
( Xc(1) ) = ( 3̄   9̄ ) ( x(4) ).        (4.29)

We will add Xc(0) to the rest of X(1) and subtract it from the rest of X(2) and X(5). Similarly, we will subtract Xc(1) from the rest of X(0) and add it to the rest of X(3) and X(4).

To compute the DCT-IV transform indices k = 1 and 4, where (2k+1) is divisible by


3, using the same shorthand notation as before, we get

( Xk(1) )   ( 3   9 ) (   x(0) − x(3)   )
( Xk(4) ) = ( 9   3̄ ) ( −(x(5) + x(2)) ).        (4.30)

One can notice that the computation (4.30) is a Hankel product. As we shall demonstrate later, the advantage of the signal division approach is that (4.30) can be completely obtained from the remaining matrix calculation with only sign changes.

For the remaining matrix, i.e., when (2i+1) ∈ A(8N) and (2k+1) ∈ A(8N), we have h = 2N+1 = 13 as the generator of C4 and g = 7 as the generator of C_{2·3^{n−1}} = C2. One therefore gets {φ(0,0), φ(1,0), φ(0,1), φ(1,1)} = {1, 11, 7, 5} and the corresponding ψ values are {1, −1, 1, 1}. In addition, since φ(a,b) equals (2i+1) or (2k+1), the signal or transform sample index needs to be permuted as {0, 5, 3, 2}.

The resultant matrix equation is given by:

(  Xk(0) )   (  1   1̄1̄    7    5  ) (  x(0) )
( −Xk(5) )   ( 1̄1̄    1̄    5    7̄  ) ( −x(5) )
(  Xk(3) ) = (  7    5    1   1̄1̄  ) (  x(3) )        (4.31)
(  Xk(2) )   (  5    7̄   1̄1̄    1̄  ) (  x(2) )
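The matrix equation (4.31) can be checked against the raw DCT-IV kernel; the sketch below is our own verification code, with c(p) = cos(πp/4N) standing for the shorthand p (and −c(p) for p̄).

```python
import math
import random

random.seed(3)
N = 6
c = lambda p: math.cos(math.pi * p / (4 * N))   # shorthand: p -> cos(pi p / 4N)
x = [random.uniform(-1, 1) for _ in range(N)]

# permuted, sign-adjusted signal vector and the kernel matrix of (4.31)
s = [x[0], -x[5], x[3], x[2]]
M = [[ c(1), -c(11),  c(7),   c(5)],
     [-c(11), -c(1),  c(5),  -c(7)],
     [ c(7),   c(5),  c(1), -c(11)],
     [ c(5),  -c(7), -c(11), -c(1)]]
lhs = [sum(M[r][i] * s[i] for i in range(4)) for r in range(4)]

def part(k):
    # direct partial DCT-IV over the indices with 2i+1 in A(48): i in {0, 2, 3, 5}
    return sum(x[i] * math.cos(math.pi * (2*i + 1) * (2*k + 1) / (4 * N))
               for i in (0, 2, 3, 5))

assert all(abs(a - b) < 1e-9
           for a, b in zip(lhs, [part(0), -part(5), part(3), part(2)]))
```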

One can notice that this computation corresponds to a two-dimensional convolution, with a 2-point cyclic convolution along one dimension and a 2-point Hankel product along the other. Clearly, an efficient bilinear algorithm can be constructed for (4.31). Applying the 2-point bilinear algorithm (Fig. 2.6) for cyclic convolution to (4.31), one computes {Xk(0), −Xk(5)} with

(1/2) (  1+7    1̄1̄+5 ) ( x(0) + x(3) )
      ( 1̄1̄+5    1̄+7̄  ) ( x(2) − x(5) ).        (4.32)

This is a 2-point Hankel product and a bilinear algorithm can be applied with 3

multiplications and 3 additions.

Similarly, one can compute the other transform components {Xk(3), Xk(2)} with

( Xk(3) )         (  1−7    1̄1̄−5 ) (   x(0) − x(3)   )
( Xk(2) ) = (1/2) ( 1̄1̄−5    1̄−7̄  ) ( −(x(5) + x(2)) ).        (4.33)


This is again a 2-point Hankel product and a bilinear algorithm can be applied with

3 multiplications and 3 additions.

It can be easily verified that for N = 6,

 cos(9π/4N) =  cos(π/4N) − cos(7π/4N),
−cos(3π/4N) = −cos(11π/4N) − cos(5π/4N).        (4.34)
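A one-line numerical check of the identities in (4.34):

```python
import math

N = 6
c = lambda p: math.cos(math.pi * p / (4 * N))
assert abs(c(9) - (c(1) - c(7))) < 1e-12     # cos(9pi/4N) = cos(pi/4N) - cos(7pi/4N)
assert abs(-c(3) - (-c(11) - c(5))) < 1e-12  # -cos(3pi/4N) = -cos(11pi/4N) - cos(5pi/4N)
```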

Therefore one can express (4.33) as

( Xk(3) )         ( 9   3̄ ) (   x(0) − x(3)   )
( Xk(2) ) = (1/2) ( 3̄   9̄ ) ( −(x(5) + x(2)) ).        (4.35)

Comparing (4.35) with (4.30), one gets

Xk(1) = −2·Xk(2),
Xk(4) =  2·Xk(3).        (4.36)

Therefore the computation for (4.30) can be absorbed into that for (4.31). The operation of multiplying by 2 may be counted as one addition. Frequently in hardware design, this scale-by-2 can be realized as a trivial left shift and thus its impact on area and speed is negligible.

The complete flow graph of this computation is shown in Fig. 4.13. The multiplication coefficients are listed in Table 4.3. The architecture of the 12-point MDCT/IMDCT based on this DCT-IV is given in Fig. 4.14.

Table 4.3: Multiplication coefficients used in Fig. 4.13.

Coefficient   Value     Coefficient   Value     Coefficient   Value
c1            0.5412    c2            0.3827    c3            −1.3066
c4            0.4687    c5            0.3314    c6            −1.1315
c7            0.6533    c8            −0.4619   c9            −0.2706

Our proposed algorithms may be compared to those available in the literature

[9,27,37]. The complexities and critical path delays of these are listed in Table 4.4.

The bilinear algorithms improve both the arithmetic complexity and the critical path

delay compared with the referenced fast algorithms.


Figure 4.13: Proposed bilinear implementation of the 6-point DCT-IV.


Figure 4.14: The implementations of 12-point MDCT and IMDCT based on the 6-point DCT-IV.


We have implemented these algorithms in 16-bit fixed-point arithmetic with the TSMC 90nm CMOS standard cell library. The circuit speed and normalized area for various 12-point MDCT/IMDCT architectures are compared in Fig. 4.15. The top speed of the proposed forward bilinear implementation is 41% and 37% faster than those of the Givens-rotation-based forward transforms [9, 37] respectively, and is 68% faster than the recursive approach in [27]. On the inverse transform, the top speed advantage is 50% over [37] and 73% over [27]. Given the same speed, the area for the bilinear circuits can be as much as 41% and 33% smaller than those of [37] and [27] respectively for the forward, and 38% and 37% smaller respectively for the inverse transform. Clearly our proposed bilinear algorithm provides the most efficient implementation of the 12-point MDCT.

Table 4.4: Complexities of various 12-point MDCT and IMDCT algorithms. Note that M and A refer to multiplication and addition respectively.

Transform        Algorithm   Arithmetic complexity   Critical delay
6-point DCT-IV   Proposed    9M + 23A                M + 5A
6-point CGT      Proposed    9M + 19A                M + 5A
12-point MDCT    Proposed    9M + 29A                M + 6A
12-point MDCT    Ref. [9]    13M + 39A               2M + 6A
12-point MDCT    Ref. [37]   13M + 27A               2M + 5A
12-point MDCT    Ref. [27]   11M + 29A               3M + 7A
12-point IMDCT   Proposed    9M + 23A                M + 5A
12-point IMDCT   Ref. [9]    13M + 33A               2M + 5A
12-point IMDCT   Ref. [37]   13M + 21A               2M + 4A
12-point IMDCT   Ref. [27]   11M + 23A               3M + 6A

4.4.2 The bilinear MDCT/IMDCT for MP3 audio long block length

We now explain in detail the architecture for a 36-point MDCT/IMDCT via an N = 18 point DCT-IV. The DCT-IV components X(1), X(4), X(7), X(10), X(13) and X(16) can be computed by a 6-point DCT-IV. For the remaining components of CGT_18, we


Figure 4.15: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 12-point MDCT and IMDCT.

further divide the kernel matrix in two parts. A CGT_6 is computed for signal indices i = 1, 4, 7, 10, 13, 16. The computation involving signal and transform indices i, k ∈ {0, 2, 3, 5, 6, 8, 9, 11, 12, 14, 15, 17}, i.e., those for which (2i+1), (2k+1) ∈ A(8N), results in a multi-dimensional convolution. As explained earlier, this convolution is based upon the group C4 × C_{2·3^{n−1}}. Since the cyclic group C_{2·3^{n−1}} can be further expressed as C2 × C_{3^{n−1}}, in (4.23) and (4.24) we can substitute g^b = g2^{b2} g3^{b3}, where g2 and g3 are generators for C2 and C_{3^{n−1}} respectively. By using the generator h = 19 of C4, the generator g2 = 17 of C2 and the generator g3 = 49 of C3, we get the values of the function φ, a = 0, 1, from (4.23) as {1, 19, 23, 5, 25, 29, 17, 35, 31, 13, 7, 11}. The corresponding values of the function ψ are obtained from (4.24) as {1, 1, −1, −1, −1, 1, 1, 1, 1, 1, −1, −1}. Further, φ represents values of (2i+1) or (2k+1), where i and k are indices of the signal and transform samples. Thus the permutation of the signal and transform samples can be derived from the values of φ. For the present set of φ values, this index order is given by {0, 9, 11, 2, 12, 14, 8, 17, 15, 6, 3, 5}. The computation can thus be expressed


as the matrix product:

(  Xk(0)  )   (  1   19   2̄3̄   5̄   2̄5̄  29   17  35   31   13    7̄   1̄1̄ ) (  x(0)  )
(  Xk(9)  )   ( 19    1̄    5̄  23   29  25   35  1̄7̄   13   3̄1̄   1̄1̄    7 ) (  x(9)  )
( −Xk(11) )   ( 2̄3̄    5̄   2̄5̄  29    1  19   31  13    7̄   1̄1̄   17   35 ) ( −x(11) )
( −Xk(2)  )   (  5̄   23   29  25   19   1̄   13  3̄1̄   1̄1̄    7   35   1̄7̄ ) ( −x(2)  )
( −Xk(12) )   ( 2̄5̄   29    1  19   2̄3̄   5̄    7̄  1̄1̄   17   35   31   13 ) ( −x(12) )
(  Xk(14) ) = ( 29   25   19   1̄    5̄  23   1̄1̄   7   35   1̄7̄   13   3̄1̄ ) (  x(14) )
(  Xk(8)  )   ( 17   35   31  13    7̄  1̄1̄    1  19   2̄3̄    5̄   2̄5̄   29 ) (  x(8)  )
(  Xk(17) )   ( 35   1̄7̄   13  3̄1̄   1̄1̄   7   19   1̄    5̄   23   29   25 ) (  x(17) )
(  Xk(15) )   ( 31   13    7̄  1̄1̄   17  35   2̄3̄   5̄   2̄5̄   29    1   19 ) (  x(15) )
(  Xk(6)  )   ( 13   3̄1̄   1̄1̄   7   35  1̄7̄    5̄  23   29   25   19    1̄ ) (  x(6)  )
( −Xk(3)  )   (  7̄   1̄1̄   17  35   31  13   2̄5̄  29    1   19   2̄3̄    5̄ ) ( −x(3)  )
( −Xk(5)  )   ( 1̄1̄    7   35  1̄7̄   13  3̄1̄   29  25   19    1̄    5̄   23 ) ( −x(5)  )
                                                                               (4.37)

Note that we use p in this matrix to represent the value cos(πp/4N) and p̄ for the value −cos(πp/4N), where N = 18. The 6-point cyclic convolution can be obtained by combining the 2-point and 3-point algorithms.

Efficient bilinear algorithms exist for the 2-point and 3-point cyclic convolution and Hankel product. For the 3-point cyclic convolution, applying the trigonometric identity cos(α) + cos(2π/3 + α) + cos(4π/3 + α) = 0, we can lower its complexity to 3 multiplications and 6 additions and reduce the critical path delay to 1 multiplication and 4 additions.
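The zero-sum property exploited here is easy to confirm numerically (the helper name is ours):

```python
import math

def three_cos(alpha):
    # cos(a) + cos(2pi/3 + a) + cos(4pi/3 + a), which is identically zero
    return (math.cos(alpha)
            + math.cos(2 * math.pi / 3 + alpha)
            + math.cos(4 * math.pi / 3 + alpha))

for alpha in (0.0, 0.37, 1.2, 2.9):
    assert abs(three_cos(alpha)) < 1e-12
```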

The flow graph for (4.37) is shown in Fig. 4.16. The complete implementation flow graph for the 18-point DCT-IV is shown in Fig. 4.17. The multiplication coefficients used therein are listed in Table 4.5. Figures 4.18 and 4.19 show the 36-point MDCT

and IMDCT respectively. The complexities and critical path delays of these and of other algorithms available in the literature are listed in Table 4.6. One can see from the table that the proposed bilinear algorithm improves the critical path delay and has the lowest multiplication requirements. The addition operations, however, are higher than [27, 37].

The proposed bilinear algorithms and the reference algorithms are implemented


Figure 4.16: Proposed bilinear implementation of the multidimensional convolution involved in the 18-point DCT-IV.

Table 4.5: Multiplication coefficients used in Fig. 4.16.

Coefficient   Value     Coefficient   Value     Coefficient   Value
c1            −0.9231   c2            −0.6528   c3            2.2287
c4            −0.5086   c5            −0.3596   c6            1.2278
c7            −0.6025   c8            −0.4261   c9            1.4546
c10           −0.1628   c11           0.2779    c12           −0.3930
c13           0.1851    c14           −0.3160   c15           0.4469
c16           0.7181    c17           −1.2258   c18           1.7336


Figure 4.17: Proposed bilinear implementation of the 18-point DCT-IV.


Figure 4.18: Proposed bilinear implementation of the 36-point MDCT.

Table 4.6: Complexities of various 36-point MDCT and IMDCT algorithms. Note that M and A refer to multiplication and addition respectively.

Transform         Algorithm   Arithmetic complexity   Critical delay
18-point DCT-IV   Proposed    36M + 132A              M + 9A
36-point MDCT     Proposed    36M + 150A              M + 10A
36-point MDCT     Ref. [9]    47M + 165A              2M + 9A
36-point MDCT     Ref. [37]   47M + 129A              2M + 8A
36-point MDCT     Ref. [27]   43M + 133A              3M + 22A
36-point IMDCT    Proposed    36M + 132A              M + 9A
36-point IMDCT    Ref. [9]    51M + 151A              2M + 8A
36-point IMDCT    Ref. [37]   51M + 115A              2M + 7A
36-point IMDCT    Ref. [27]   43M + 115A              3M + 21A


Figure 4.19: Proposed bilinear implementation of the 36-point IMDCT.


in 16-bit fixed-point arithmetic with the TSMC 90nm CMOS standard cell library. The circuit speed and normalized area are shown in Fig. 4.20. The top speed of the proposed forward bilinear implementation is 19% and 14% faster than those of the Givens-rotation-based forward transforms [9, 37] respectively, and is 61% faster than the successive decomposition approach [27]. On the inverse transform, the top speed advantage is 25% over [37] and 64% over [27]. Given the same speed, the area for the bilinear circuits can be as much as 19% smaller than those of [27, 37] for the forward, and 30% and 17% smaller for the inverse implementations respectively.


Figure 4.20: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for various implementations of 36-point MDCT and IMDCT.

4.4.3 The unified MDCT/IMDCT architecture for MP3 audio

In Fig. 4.5, we have shown that the forward and inverse MDCT can be obtained together on a DCT-IV based hardware architecture. This is accomplished with relatively simple input and/or output data multiplexers. This unified implementation allows the encoder and decoder to share the same hardware accelerator through time multiplexing.

In MPEG-1/2 layer III (MP3) audio format, two different block sizes are defined.


The long block size is normally used to provide better frequency resolution, and the short block is used where better time resolution is needed. The switch from the long block to the short block occurs whenever pre-echo is expected. Pre-echo is a distortion in the frequency-domain coding of an audio signal. It is commonly dealt with using a window switching technique, where short block sizes are used in place of long block sizes. Therefore a truly unified algorithmic accelerator will need to process not only the forward and inverse transforms (unified encoder and decoder), but also the short and long block sizes (window switching).

Figs. 4.18 and 4.19 show that the 36-point MDCT/IMDCT consists of three major processing modules: a 12-point block circular matrix, a 6-point CGT and a 6-point DCT-IV. Both the 12-point MDCT and IMDCT rely on the 6-point DCT-IV. These observations lead us to three different unified hardware architectures.

Shown in Fig. 4.21, architecture A is a straightforward enhancement to the unified architecture of Fig. 4.6. We use the 6-point DCT-IV inside the 36-point MDCT/IMDCT to process the 12-point MDCT/IMDCT. The data throughput is one 36-point or one 12-point MDCT/IMDCT per cycle. The pre-addition stage of the 12-point MDCT is shared with that of the 36-point MDCT. Tables 4.7 and 4.8 show possible input and output assignments for Fig. 4.21.

Architecture B, shown in Fig. 4.22, improves upon the simple enhancement of Fig. 4.21. From Table 4.4, we realize that the difference between the 6-point CGT and the 6-point DCT-IV is small and only amounts to 4 additions (or 2 additions and 2 left shifts). Therefore the 6-point CGT can be expanded into a 6-point DCT-IV to process a second 12-point MDCT/IMDCT. The data throughput is one 36-point or two 12-point MDCT/IMDCTs per cycle. The pre-addition stage for both 12-point MDCTs is shared with that of the 36-point MDCT. The ability to process multiple short blocks concurrently is important. During window switching, the 32 subbands can operate in mixed block mode, where the two lower subbands process long blocks and all other 30 upper bands switch to short blocks. Tables 4.9 and 4.10 show possible input and output assignments for Fig. 4.22.

Architecture C (pipeline) takes a different look at the relationship between the 6-point CGT and DCT-IV. Instead of doubling up CGT_6 to another DCT-IV in order to


Table 4.7: 12 and 36 point MDCT and IMDCT input mapping for unified architecture A.

MDCT36  (SMODE = 0, IMODE = 0): in(i) = x(i), i = 0, ..., 35.
IMDCT36 (SMODE = 0, IMODE = 1): in(k) = X(k), k = 0, ..., 17.
MDCT12  (SMODE = 1, IMODE = 0): in(i) = x(i), i = 0, ..., 11.
IMDCT12 (SMODE = 1, IMODE = 1): in(k) = X(k), k = 0, ..., 5.


Table 4.8: 12 and 36 point MDCT and IMDCT output mapping for unified architecture A.

MDCT36:  out(k) = X(k), k = 0, ..., 17.
MDCT12:  out(k) = X(k), k = 0, ..., 5.
IMDCT36:
  out(0)-out(8):  −x'(27),−x'(26); −x'(28),−x'(25); −x'(29),−x'(24); −x'(30),−x'(23); −x'(31),−x'(22); −x'(32),−x'(21); −x'(33),−x'(20); −x'(34),−x'(19); −x'(35),−x'(18)
  out(9)-out(17): x'(0),−x'(17); x'(1),−x'(16); x'(2),−x'(15); x'(3),−x'(14); x'(4),−x'(13); x'(5),−x'(12); x'(6),−x'(11); x'(7),−x'(10); x'(8),−x'(9)
IMDCT12:
  out(0)-out(5): −x'(9),−x'(8); −x'(10),−x'(7); −x'(11),−x'(6); x'(0),−x'(5); x'(1),−x'(4); x'(2),−x'(3)


Figure 4.21: Proposed bilinear implementation of the unified 12 and 36 point MDCT and IMDCT (architecture A).


Table 4.9: 12 and 36 point MDCT and IMDCT input mapping for unified architecture B. Note that xA and xB refer to the two 6-point blocks whose MDCT is computed concurrently. Similarly XA and XB represent two independent 6-point transform blocks whose IMDCT is computed concurrently.

MDCT36  (SMODE = 0, IMODE = 0): in(i) = x(i), i = 0, ..., 35.
IMDCT36 (SMODE = 0, IMODE = 1): in(k) = X(k), k = 0, ..., 17.
MDCT12  (SMODE = 1, IMODE = 0):
  in(0)-in(5):   xA(0), xB(0), xA(1), xB(1), xA(2), xB(2)
  in(6)-in(11):  xB(3), xA(3), xB(4), xA(4), xB(5), xA(5)
  in(12)-in(17): xB(6), xA(6), xB(7), xA(7), xB(8), xA(8)
  in(18)-in(23): xA(9), xB(9), xA(10), xB(10), xA(11), xB(11)
IMDCT12 (SMODE = 1, IMODE = 1):
  in(0)-in(11): XA(0), XB(0), XA(1), XB(1), XA(2), XB(2), XA(3), XB(3), XA(4), XB(4), XA(5), XB(5)

Table 4.10: 12 and 36 point MDCT and IMDCT output mapping for unified architecture B. Note that XA and XB refer to MDCTs of 6-point sequences xA and xB respectively and are computed concurrently. Similarly x'A and x'B refer to IMDCTs of 6-point transforms XA and XB respectively and are computed concurrently.

MDCT36: out(k) = X(k), k = 0, ..., 17.
MDCT12: out(0)-out(5) = XA(0), ..., XA(5); out(6)-out(11) = XB(0), ..., XB(5).
IMDCT36:
  out(0)-out(8):  −x'(27),−x'(26); −x'(28),−x'(25); −x'(29),−x'(24); −x'(30),−x'(23); −x'(31),−x'(22); −x'(32),−x'(21); −x'(33),−x'(20); −x'(34),−x'(19); −x'(35),−x'(18)
  out(9)-out(17): x'(0),−x'(17); x'(1),−x'(16); x'(2),−x'(15); x'(3),−x'(14); x'(4),−x'(13); x'(5),−x'(12); x'(6),−x'(11); x'(7),−x'(10); x'(8),−x'(9)
IMDCT12:
  out(0)-out(5):  −x'A(9),−x'A(8); −x'A(10),−x'A(7); −x'A(11),−x'A(6); x'A(0),−x'A(5); x'A(1),−x'A(4); x'A(2),−x'A(3)
  out(6)-out(11): −x'B(9),−x'B(8); −x'B(10),−x'B(7); −x'B(11),−x'B(6); x'B(0),−x'B(5); x'B(1),−x'B(4); x'B(2),−x'B(3)


Figure 4.22: Proposed bilinear implementation of the unified 12 and 36 point MDCT and IMDCT (architecture B).


process a second short block, we fold the CGT_6 function into the existing 6-point DCT-IV. This provides a natural way to pipeline the 36-point MDCT/IMDCT. In addition, a constant focal point of hardware implementation is the number of required input and output pins (IO). Many designs today are switching from die-limited to IO-limited. Therefore it is important to cap the number of input and output pins for a design. Our proposed pipelined architecture is shown in Fig. 4.23. With the 6-point DCT-IV, 6 outputs of an 18-point DCT-IV are ready upon the completion of the first clock phase. During the second clock phase, we use the 6-point DCT-IV to compute CGT_6 and also complete the computation of the multi-dimensional cyclic convolution. These 12 outputs of the 18-point DCT-IV are then available at the end of the second clock phase. Thus we cut the required outputs from a maximum of 36 for the IMDCT to just 12 with the unified architecture, a 66% reduction. The area savings come from two sources. The major saving is from removing the CGT_6 computations of 9 multiplications and 21 additions. A secondary saving is due to the fact that the block circular matrix is no longer on the critical path and thus can afford using smaller and low-power logic gates. The critical path for the 36-point MDCT/IMDCT roughly doubles, compared to the non-unified bilinear designs. However, in one clock cycle, two short blocks can be processed and MP3 window switching can be accomplished rather fast. For the 36-point MDCT/IMDCT, the inputs only toggle on the rising edge of the clock; 6 outputs are obtained on the falling edge and the other 12 outputs are obtained on the rising edge. For the 12-point, new inputs are sent on both rising and falling clock edges and outputs are generated on both rising and falling edges as well. Tables 4.11 and 4.12 show possible input and output assignments for the pipelined architecture of Fig. 4.23.

The complexities of the proposed unified bilinear algorithms are listed in Table 4.13. The proposed unified bilinear algorithms are implemented in 16-bit fixed-point arithmetic with a TSMC 90nm CMOS standard cell library. The circuit speed and normalized area are shown in Fig. 4.24. Architecture A is 4.7% slower at top speed than our proposed bilinear MDCT, and is 7% larger when the speed is the same. Architecture B is 4.7% slower at top speed than our proposed bilinear MDCT, and is 8% larger when the speed is the same. We have also compared the fast unified architectures


CHAPTER 4. MODIFIED DISCRETE COSINE TRANSFORM

Table 4.11: 12 and 36 point MDCT and IMDCT input mapping for unified architecture C (pipeline).

Mode       SMODE IMODE    Clock edge    Input assignment
MDCT36     0     0        rise          in(0:35) = X(0:35)
IMDCT36    0     1        rise          in(0:17) = x(0:17)
MDCT12     1     0        rise/fall     in(0:11) = X(0:11)
IMDCT12    1     1        rise/fall     in(0:5)  = x(0:5)


[Figure 4.23 diagram: pipelined datapath in which the MDCT-to-DCT-IV stage (controlled by IMODE), the multi-dimensional cyclic convolution of Fig. 4.16, and the 6-point DCT-IV (time-shared with CGT6 under control of SMODE or the clock phase) feed an output multiplexer producing out(0:5) and out(6:11).]

Figure 4.23: Proposed bilinear implementation for the pipelined unified 12 and 36 point MDCT and IMDCT (architecture C).


Table 4.12: 12 and 36 point MDCT and IMDCT output mapping for unified architecture C (pipeline).

MDCT36, fall edge: out(0:5) = X(1), X(4), X(7), X(10), X(13), X(16).
MDCT36, rise edge: out(0:11) = X(0), X(2), X(3), X(5), X(6), X(8), X(9), X(11), X(12), X(14), X(15), X(17).
IMDCT36, fall edge: out(0:5) = {-x'(28), -x'(25)}, {-x'(31), -x'(22)}, {-x'(34), -x'(19)}, {x'(1), -x'(16)}, {x'(4), -x'(13)}, {x'(7), -x'(10)}.
IMDCT36, rise edge: out(0:11) = {-x'(27), -x'(26)}, {-x'(29), -x'(24)}, {-x'(30), -x'(23)}, {-x'(32), -x'(21)}, {-x'(33), -x'(20)}, {-x'(35), -x'(18)}, {x'(0), -x'(17)}, {x'(2), -x'(15)}, {x'(3), -x'(14)}, {x'(5), -x'(12)}, {x'(6), -x'(11)}, {x'(8), -x'(9)}.
MDCT12, fall/rise edges: out(0:5) = X(0), X(1), X(2), X(3), X(4), X(5).
IMDCT12, fall/rise edges: out(0:5) = {-x'(9), -x'(8)}, {-x'(10), -x'(7)}, {-x'(11), -x'(6)}, {x'(0), -x'(5)}, {x'(1), -x'(4)}, {x'(2), -x'(3)}.

(A, B) with the bilinear 36-point MDCT and with [9,37], and separately compared the pipelined architecture with a low-complexity MDCT [27]. The fast unified architectures A and B are 9% faster than that of [37] and 13% faster than that of [9] at top speed. At higher speeds, the proposed designs can be more than 27% smaller. The pipelined architecture is 26% faster than [27] at top speed and 19% smaller if the speed is the same. The three proposed unified architectures can process not only the forward and inverse transforms, but also the short and long blocks in the MP3 application, whereas the compared reference designs can only compute the forward MDCT for the long block. For architecture B and the pipelined architecture, two short blocks are processed in one clock cycle.


Table 4.13: Complexities of the unified 12 and 36 point MDCT and IMDCT architectures for the MP3 application. Note that M and A refer to multiplication and addition respectively.

Architecture    Arithmetic complexity    Critical delay, 36-point    Critical delay, 12-point
A               36M + 151A               M + 10A                     M + 6A
B               36M + 154A               M + 10A                     M + 6A
C (pipeline)    27M + 131A               2M + 12A                    M + 6A

[Figure 4.24 plots: two panels, "Fast Implementation" and "Pipelined Implementation", of normalized area versus delay; curve labels include Ref. [9], Ref. [37], Bilinear, Fig. 4.21 and Fig. 4.22.]

Figure 4.24: Delay in nsec (on horizontal axis) and normalized area (on vertical axis) for unified 12 and 36 point MDCT and IMDCT architectures (A, B and pipeline), with comparison to the 36-point MDCT architectures in literature.


4.5 Discussion and conclusion

Forward and inverse modified discrete cosine transforms (MDCT/IMDCT) are widely used for subband coding in the analysis and synthesis filterbanks of time domain aliasing cancellation (TDAC). Many international audio coding standards rely heavily on fast algorithms for the MDCT/IMDCT. In this chapter we have presented hardware-efficient bilinear algorithms to compute the MDCT/IMDCT of 2^n and 4·3^n points. The algorithms for composite lengths have practical applications in MP3 audio encoding and decoding. It is known that the MDCT/IMDCT can be converted to type-IV discrete cosine transforms (DCT-IV). Using group theory, our approach decomposes the DCT-IV transform kernel matrix into groups of cyclic and Hankel product matrices. Bilinear algorithms are then applied to efficiently evaluate these groups. When implemented in VLSI, the bilinear algorithms have improved the critical path delays over existing solutions. For MPEG-1/2 layer III (MP3) audio, we propose three different versions of unified hardware architectures for both the short and long blocks and the forward and inverse transforms.


Chapter 5

Modulated complex lapped

transform

This chapter presents a new algorithm for the modulated complex lapped transform (MCLT) with a sine window. It is shown that by merging the window with the main computation, both the real and the imaginary parts of the MCLT with 2N inputs can be obtained from two N-point discrete cosine transforms (DCT) of appropriate inputs. The resultant algorithm is computationally very efficient. In general, N can be any even number. When N is a power of 2, the proposed algorithm uses only N log N + 2 multiplications, with none of those being outside the DCT blocks.

5.1 Background and prior work

The modulated complex lapped transform (MCLT) is structured as a cosine- and

sine-modulated filter bank that maps overlapping blocks of a real-valued signal into

blocks of complex-valued transform coefficients [33]. It is a special form of a two-times oversampled discrete Fourier transform (DFT) filter bank that performs frequency decomposition. Since the reconstruction formula of the MCLT is not unique, the

MCLT allows more flexible implementations of audio enhancement and encoding systems than the DFT. Recent MCLT applications include acoustic echo cancellation [33]


and audio watermarking [34] by using the phase information from the imaginary coefficients.

The N-point MCLT of a 2N-point input sequence {x(n)} is defined as [33]

y(k) = \sum_{n=0}^{2N-1} \left[ p_c(n,k) - j\,p_s(n,k) \right] x(n), \qquad k = 0, 1, \ldots, N-1,   (5.1)

where the real and imaginary parts of the MCLT kernel are defined as

p_c(n,k) = \sqrt{\frac{2}{N}}\, h(n) \cos\!\left(\frac{(2n+1+N)(2k+1)\pi}{4N}\right), \qquad p_s(n,k) = \sqrt{\frac{2}{N}}\, h(n) \sin\!\left(\frac{(2n+1+N)(2k+1)\pi}{4N}\right).   (5.2)

Function h(n) in (5.2) is the window function. The most common choice for h(n) is

the sine window specified as

h(n) = -\sin\!\left(\frac{(2n+1)\pi}{4N}\right),   (5.3)

which is used in many applications such as MPEG-1 and MPEG-2 since it permits a

perfect reconstruction [31,41].
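Definitions (5.1)–(5.3) are easy to evaluate directly. The following Python sketch is our own illustrative reference code, not a fast algorithm; it assumes the kernel and window forms written out in (5.2) and (5.3) and computes the N complex MCLT coefficients of a 2N-point block by brute force:

```python
import math

def mclt_direct(x):
    """Direct O(N*2N) evaluation of the N-point MCLT (5.1) of a 2N-point
    block, with the sine window (5.3) built into the kernels p_c, p_s
    of (5.2).  Returns complex coefficients y(k) = y_c(k) - j*y_s(k)."""
    N = len(x) // 2
    scale = math.sqrt(2.0 / N)
    y = []
    for k in range(N):
        acc = 0j
        for n in range(2 * N):
            h = -math.sin((2 * n + 1) * math.pi / (4 * N))   # sine window (5.3)
            arg = (2 * n + 1 + N) * (2 * k + 1) * math.pi / (4 * N)
            p_c = scale * h * math.cos(arg)                  # real-part kernel
            p_s = scale * h * math.sin(arg)                  # imaginary-part kernel
            acc += (p_c - 1j * p_s) * x[n]
        y.append(acc)
    return y
```

Such a direct evaluation is useful only as a reference against which fast implementations can be validated.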

The real part of MCLT is the forward/direct modulated lapped transform (MLT)

[31], which is used to implement transform coding in video and audio compression

applications [46]. The MLT has also been referred to as time domain aliasing cancellation (TDAC) [41] and as cosine modulated filter banks [52]. The calculation of the MLT

involves scaling the input with a window function and then using a modified discrete

cosine transform (MDCT). Existing fast MDCT algorithms are either FFT-based or

DCT-based with pre- and post- permutations [10,15,18]. Computational efficiency

of the MLT can be improved by combining the window function with the MDCT

[18,24,32,45].

The original sequence {x(n)} can be recovered from MCLT by using either its real

or the imaginary part, or both [34]. If only the real part is used, the inverse transform

is the same as the inverse MLT (IMLT) which has been studied in detail [32]. Here we

focus on fast algorithm for the forward/direct MCLT where both real and imaginary

parts are required.


As a complex extension of the MLT, the MCLT shares many fast algorithms with

the MLT. Malvar has shown that the real part of MCLT with arbitrary window

function can be obtained from a discrete cosine transform (DCT) of type IV and

the imaginary, from a discrete sine transform (DST) of type IV [33]. However, the

computational complexity is affected by applying the window function to the input

before using the DCT or the DST. Later, FFT-based MCLT algorithms have been

developed that merge a sine window [34,49] with the main computation for improved

computational efficiency. Fig. 5.1 shows a patented computational flow using an FFT-based algorithm. Another recently proposed MCLT algorithm applicable to an arbitrary

[Figure 5.1 flow graph: 2N real inputs x(0:2N-1), a 2N-point FFT, N+1 complex rotation coefficients c(k), and complex outputs u(0:N).]

Figure 5.1: Flow graph for the Ref. [34] implementation of the MCLT. Note that c(k) = W_8(2k+1) W_{4N}(k).

but symmetric window uses two DCTs to compute the real and imaginary parts

of the MCLT separately [17]. However, this algorithm employs permutation stages

with non-constant multiplications outside the main computation module, limiting its

efficiency. In addition, this algorithm involves coefficient division, which could have

numerical instability when implemented in fixed-point precision for large transform

lengths. Flow graph of this algorithm is shown in Fig. 5.2.

In this chapter, we introduce a new MCLT algorithm with lower computational

complexity than other algorithms available today. We use a sine window and merge

its computation with the main computational modules. Our algorithm for 2N input



Figure 5.2: Flow graph for the Ref. [17] implementation of the MCLT. Note that c(i) = h(N-1-i)/h(i) and d(i) is a constant for a sine window. Each circle represents an addition.

MCLT is based on the evenly stacked modified discrete cosine transform (C_N^E) and the evenly stacked modified discrete sine transform (S_N^E). The N-point evenly stacked MDCT and MDST are, respectively, defined as [41]:

C_N^E(k) = \begin{cases} \displaystyle\sum_{n=0}^{2N-1} x(n)\cos\!\left(\frac{\pi k(2n+1+N)}{2N}\right), & 0 \le k \le N-1, \\ 0, & k = N, \end{cases}   (5.4)

S_N^E(k) = \begin{cases} \displaystyle\sum_{n=0}^{2N-1} x(n)\sin\!\left(\frac{\pi k(2n+1+N)}{2N}\right), & 1 \le k \le N, \\ 0, & k = 0. \end{cases}   (5.5)

We show that the 2N-point C^E and S^E may be computed from two N-point DCTs of appropriately folded input sequences. The real and imaginary parts of the MCLT can then be obtained by adding the outputs of the scaled C^E and S^E with at most two extra multiplications.


5.2 Proposed algorithm

We now show that both the real and the imaginary parts of the MCLT may be obtained from two N-point DCTs. However, unlike the N-point DCT, the input sequence of an MCLT has 2N points. For mathematical convenience, we use two intermediate transforms ADCT and ADST with 2N input and N + 1 output points, defined as:

ADCT(k) = \frac{1}{2\sqrt{N}} \sum_{n=0}^{2N-1} \cos\!\left(\frac{\pi k(2n+1+N)}{2N}\right) x(n), \qquad 0 \le k \le N,   (5.6)

ADST(k) = \frac{1}{2\sqrt{N}} \sum_{n=0}^{2N-1} \sin\!\left(\frac{\pi k(2n+1+N)}{2N}\right) x(n), \qquad 0 \le k \le N.   (5.7)

It is clear from (5.4) and (5.6) that the ADCT is equivalent to a 2N-point C^E scaled by a constant of 1/(2\sqrt{N}). Similarly, from (5.5) and (5.7), the ADST is equivalent to a 2N-point S^E scaled by the same constant.

One can show that the transforms ADCT and ADST (and equivalently C^E and S^E) are related to the DCT. To convert the ADCT into an N-point DCT, note that in the ADCT computation of (5.6), the sequence components x(n + 3N/2) and x(3N/2 - n - 1) for 0 \le n \le N/2 - 1 multiply with the same cosine term \cos(\pi k(2n+1)/(2N)). Similarly, the components x(n - N/2) and x(3N/2 - n - 1) for N/2 \le n < N also multiply with this same cosine term. Further, as n goes over the stated ranges, the x component indices span the entire range of the input sequence from 0 to 2N - 1. To take advantage of this, define a sequence {x_c(n)} as:

x_c(n) = \begin{cases} x(n + 3N/2) + x(3N/2 - n - 1), & 0 \le n < N/2, \\ x(n - N/2) + x(3N/2 - n - 1), & N/2 \le n < N. \end{cases}   (5.8)

It is then obvious that

ADCT(k) = \frac{1}{2\sqrt{N}} \sum_{n=0}^{N-1} \cos\!\left(\frac{\pi k(2n+1)}{2N}\right) x_c(n), \qquad 0 \le k \le N.   (5.9)

Thus for 0 \le k \le N - 1, the ADCT of the sequence {x(n)} is the same as the N-point DCT (type II) of the sequence {x_c(n)} except for a constant multiplication factor 1/(2\sqrt{N}). These ADCT outputs can therefore be computed using any available


DCT algorithms [25,43] without incurring additional computational penalty, except possibly one multiplication. Further, (5.9) also shows that

ADCT(N) = 0.   (5.10)
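The identities (5.9) and (5.10) can be checked numerically. The sketch below is illustrative Python (the function names are ours, and a naive O(N^2) DCT-II stands in for any fast algorithm); it verifies that the ADCT of a random 2N-point block equals the scaled DCT-II of the folded sequence {x_c(n)}, and that ADCT(N) vanishes:

```python
import math
import random

def adct(x):
    """ADCT of (5.6): 2N inputs, N+1 outputs, evaluated directly."""
    N = len(x) // 2
    return [sum(math.cos(math.pi * k * (2 * n + 1 + N) / (2 * N)) * x[n]
                for n in range(2 * N)) / (2 * math.sqrt(N))
            for k in range(N + 1)]

def fold_xc(x):
    """The folded sequence {x_c(n)} of (5.8)."""
    N = len(x) // 2
    return [x[n + 3 * N // 2] + x[3 * N // 2 - n - 1] if n < N // 2
            else x[n - N // 2] + x[3 * N // 2 - n - 1]
            for n in range(N)]

def dct2(u):
    """Unnormalized N-point DCT-II: C(k) = sum_n u(n) cos(pi k (2n+1)/(2N))."""
    N = len(u)
    return [sum(u[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

random.seed(0)
N = 8
x = [random.uniform(-1, 1) for _ in range(2 * N)]
a, d = adct(x), dct2(fold_xc(x))
# (5.9): ADCT(k) equals the DCT-II of {x_c} scaled by 1/(2*sqrt(N)) ...
assert all(abs(a[k] - d[k] / (2 * math.sqrt(N))) < 1e-9 for k in range(N))
# ... and (5.10): ADCT(N) = 0.
assert abs(a[N]) < 1e-9
```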

Transform ADST can also be related to the DCT in a similar way. In the definition (5.7) of the ADST, sequence components x(n + 3N/2) and -x(3N/2 - n - 1) when 0 \le n < N/2, as well as x(n - N/2) and -x(3N/2 - n - 1) when N/2 \le n < N, all multiply with the same sine term \sin(\pi k(2n+1)/(2N)). Again, the indices of these four x terms span the entire input index range, 0 to 2N - 1, when n goes over the specified ranges. To take advantage of this, define the sequence {x_s(n)} as

x_s(n) = \begin{cases} x(n + 3N/2) - x(3N/2 - n - 1), & 0 \le n < N/2, \\ x(n - N/2) - x(3N/2 - n - 1), & N/2 \le n < N. \end{cases}   (5.11)

One can then see that

ADST(k) = \frac{1}{2\sqrt{N}} \sum_{n=0}^{N-1} \sin\!\left(\frac{\pi k(2n+1)}{2N}\right) x_s(n), \qquad 0 \le k \le N.   (5.12)

Equation (5.12) shows that, except for the scaling factor 1/(2\sqrt{N}), the transform ADST(k), 1 \le k \le N, is the N-point DST of type II and can be obtained using any available algorithm for the DST. But since we are already using the DCT for the ADCT computation, we prefer to use the same for the ADST computation as well. For this, we employ an approach similar to that of [56]. Define a new sequence {x̃_s(n)} as

\tilde{x}_s(n) = (-1)^n x_s(n).   (5.13)

Eq. (5.12) can then be rewritten as

ADST(k) = \frac{1}{2\sqrt{N}} \sum_{n=0}^{N-1} \sin\!\left((2n+1)\left(\frac{\pi}{2} - \frac{\pi(N-k)}{2N}\right)\right) (-1)^n\,\tilde{x}_s(n) = \frac{1}{2\sqrt{N}} \sum_{n=0}^{N-1} \cos\!\left(\frac{\pi(N-k)(2n+1)}{2N}\right) \tilde{x}_s(n), \qquad 0 \le k \le N.   (5.14)

Equation (5.14) shows that when 1 \le k \le N, except for the constant multiplication factor 1/(2\sqrt{N}), the k-th output of the ADST of {x(n)} is the same as the (N - k)-th output of the N-point DCT (type II) of {x̃_s(n)}. As before, these ADST outputs can therefore be computed using any available DCT algorithm with at most one additional multiplication. Equation (5.14) also shows that

ADST(0) = 0.   (5.15)
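The same kind of numeric check applies to (5.11)–(5.15): after the sign-folding of (5.11) and the (-1)^n flip of (5.13), the ADST outputs appear, scaled, in the DCT-II of {x̃_s(n)} read in reverse index order. A hedged Python sketch (our own helper names, naive O(N^2) transforms):

```python
import math
import random

def adst(x):
    """ADST of (5.7): 2N inputs, outputs for k = 0..N, evaluated directly."""
    N = len(x) // 2
    return [sum(math.sin(math.pi * k * (2 * n + 1 + N) / (2 * N)) * x[n]
                for n in range(2 * N)) / (2 * math.sqrt(N))
            for k in range(N + 1)]

def fold_xs_tilde(x):
    """{x_s(n)} of (5.11) with the (-1)^n sign flip of (5.13)."""
    N = len(x) // 2
    xs = [x[n + 3 * N // 2] - x[3 * N // 2 - n - 1] if n < N // 2
          else x[n - N // 2] - x[3 * N // 2 - n - 1]
          for n in range(N)]
    return [(-1) ** n * v for n, v in enumerate(xs)]

def dct2(u):
    """Unnormalized N-point DCT-II."""
    N = len(u)
    return [sum(u[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

random.seed(1)
N = 8
x = [random.uniform(-1, 1) for _ in range(2 * N)]
s, d = adst(x), dct2(fold_xs_tilde(x))
assert abs(s[0]) < 1e-9                  # (5.15): ADST(0) = 0
# (5.14): ADST(k) is the (N-k)-th scaled DCT-II output of the flipped fold
assert all(abs(s[k] - d[N - k] / (2 * math.sqrt(N))) < 1e-9
           for k in range(1, N + 1))
```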

In the following two subsections, we show that the real and imaginary parts of the MCLT can be obtained directly from the ADCT and ADST. This then allows us to compute the MCLT from the DCTs of {x_c(n)} and {x̃_s(n)}.

5.2.1 The real part of the MCLT

The real part of the MCLT kernel p_c(n,k) given by (5.2), with the sine window function of (5.3), can be simplified using the trigonometric identity \sin\alpha \cos\beta = (1/2)(\sin(\alpha - \beta) + \sin(\alpha + \beta)) as:

p_c(n,k) = p_c'(n,k) + p_c''(n,k),   (5.16)

where

p_c'(n,k) = \frac{1}{\sqrt{2N}} \sin\!\left(\frac{k\pi(2n+1+N)}{2N} + \frac{\pi}{4}\right), \qquad p_c''(n,k) = -\frac{1}{\sqrt{2N}} \sin\!\left(\frac{(k+1)\pi(2n+1+N)}{2N} - \frac{\pi}{4}\right).   (5.17)

Expressions for p_c' and p_c'' in (5.17) can be further simplified by using \sin(\alpha \pm \pi/4) = (1/\sqrt{2})(\sin\alpha \pm \cos\alpha) as

p_c'(n,k) = \frac{1}{2\sqrt{N}} \left[ \sin\!\left(\frac{k\pi(2n+1+N)}{2N}\right) + \cos\!\left(\frac{k\pi(2n+1+N)}{2N}\right) \right],

p_c''(n,k) = \frac{1}{2\sqrt{N}} \left[ \cos\!\left(\frac{(k+1)\pi(2n+1+N)}{2N}\right) - \sin\!\left(\frac{(k+1)\pi(2n+1+N)}{2N}\right) \right].

From these relations, one can get the real part of the MCLT of the sequence x as:

\mathrm{Re}[y(k)] = \sum_{n=0}^{2N-1} p_c'(n,k)\,x(n) + \sum_{n=0}^{2N-1} p_c''(n,k)\,x(n) = ADST(k) + ADCT(k) - ADST(k+1) + ADCT(k+1), \qquad 0 \le k \le N-1.   (5.18)


5.2.2 The imaginary part of the MCLT

The imaginary part of the MCLT, y(k), can also be computed through the ADCT and ADST. The imaginary part of the MCLT kernel p_s(n,k) given by (5.2), with the sine window function of (5.3), can be simplified using the identity \sin\alpha \sin\beta = (1/2)(\cos(\alpha - \beta) - \cos(\alpha + \beta)) as:

p_s(n,k) = p_s'(n,k) + p_s''(n,k),   (5.19)

where,

p_s'(n,k) = -\frac{1}{\sqrt{2N}} \cos\!\left(\frac{k\pi(2n+1+N)}{2N} + \frac{\pi}{4}\right), \qquad p_s''(n,k) = \frac{1}{\sqrt{2N}} \cos\!\left(\frac{(k+1)\pi(2n+1+N)}{2N} - \frac{\pi}{4}\right).   (5.20)

These expressions for p_s' and p_s'' can be further simplified by using the identity \cos(\alpha \pm \pi/4) = (1/\sqrt{2})(\cos\alpha \mp \sin\alpha) to

p_s'(n,k) = \frac{1}{2\sqrt{N}} \left[ -\cos\!\left(\frac{k\pi(2n+1+N)}{2N}\right) + \sin\!\left(\frac{k\pi(2n+1+N)}{2N}\right) \right],

p_s''(n,k) = \frac{1}{2\sqrt{N}} \left[ \cos\!\left(\frac{(k+1)\pi(2n+1+N)}{2N}\right) + \sin\!\left(\frac{(k+1)\pi(2n+1+N)}{2N}\right) \right].

From these relations, the imaginary part of the MCLT of the sequence x is obtained

as:

\mathrm{Im}[y(k)] = \sum_{n=0}^{2N-1} p_s'(n,k)\,x(n) + \sum_{n=0}^{2N-1} p_s''(n,k)\,x(n) = -ADCT(k) + ADST(k) + ADCT(k+1) + ADST(k+1), \qquad 0 \le k \le N-1.   (5.21)

5.2.3 The new MCLT algorithm

The above discussion leads to the following MCLT algorithm.

• Create the sequences {x_c(n)} and {x̃_s(n)} using (5.8), (5.11) and (5.13). This step requires 2N additions.


• Compute the discrete cosine transforms (type II) of {x_c(n)} and {x̃_s(n)} using any of the fast DCT algorithms. For example, when N is a power of 2, one may employ the procedure of [25] to compute each DCT in (N/2) log N multiplications and (3N/2) log N - N + 1 additions. The algorithm in [25] recursively partitions the DCT into two DCTs of half the length. Denote them by DCT1 and DCT2. The input for DCT1 is obtained by folding the input sequence, and its output gives the even-indexed DCT components. The input for DCT2 is obtained by folding the input sequence and then multiplying each component by a proper cosine term. A linear combination of the DCT2 output gives the odd-indexed DCT components. The DCT output can be scaled by a constant factor 1/(2\sqrt{N}) (see (5.9) and (5.14)) by combining this factor with the multipliers applied to create the input of DCT2. This gives scaled odd-indexed DCT components without increasing the multiplication count. Since the algorithm of [25] is recursive, the same process can be used repeatedly every time DCT1 is partitioned, to scale half of its outputs. Thus the only extra multiplication needed to scale all the components of the DCT is the multiplication used to scale the 0-th DCT component. The two DCTs with the scaling factor as required in (5.9) and (5.14) can therefore be computed in N log N + 2 multiplications and 3N log N - 2N + 2 additions.

The scaled DCTs of {x_c(n)} and {x̃_s(n)} provide the values of ADCT(k) and ADST(k) for 0 \le k \le N through relations (5.9), (5.10), (5.14) and (5.15).

• Finally, to obtain the MCLT, first compute

z_c(k) = ADCT(k+1) + ADST(k) and
z_s(k) = ADST(k+1) - ADCT(k), \qquad 0 \le k \le N-1.

Then the real and imaginary parts of the MCLT y are given from (5.18) and (5.21) by

\mathrm{Re}[y(k)] = z_c(k) - z_s(k) and
\mathrm{Im}[y(k)] = z_c(k) + z_s(k), \qquad 0 \le k \le N-1.


This step requires 4N - 2 additions (ignoring the trivial additions of ADCT(N) and ADST(0), each of which is 0).
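The three steps above can be sketched end to end. The following self-contained Python illustration is our own code: a naive O(N^2) DCT-II stands in for the fast DCT of [25], so only the structure, not the operation count, is representative. It computes the MCLT by the proposed route and validates it against a direct evaluation of (5.1)–(5.3):

```python
import math
import random

def dct2(u):
    # Stand-in for any fast DCT-II; only its outputs matter here.
    N = len(u)
    return [sum(u[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                for n in range(N)) for k in range(N)]

def mclt_fast(x):
    """Proposed algorithm: fold (2N adds), two scaled N-point DCT-IIs,
    then the z_c / z_s combination (4N-2 adds).  Returns (y_c, y_s)
    pairs; the complex MCLT coefficient is y_c(k) - j*y_s(k)."""
    N = len(x) // 2
    s = 1.0 / (2.0 * math.sqrt(N))
    xc, xs = [], []
    for n in range(N):                         # pre-processing (5.8), (5.11), (5.13)
        a = x[n + 3 * N // 2] if n < N // 2 else x[n - N // 2]
        b = x[3 * N // 2 - n - 1]
        xc.append(a + b)
        xs.append((-1) ** n * (a - b))
    ADCT = [s * c for c in dct2(xc)] + [0.0]                 # (5.9), (5.10)
    Ds = dct2(xs)
    ADST = [0.0] + [s * Ds[N - k] for k in range(1, N + 1)]  # (5.14), (5.15)
    out = []
    for k in range(N):                         # post-processing, (5.18), (5.21)
        zc = ADCT[k + 1] + ADST[k]
        zs = ADST[k + 1] - ADCT[k]
        out.append((zc - zs, zc + zs))         # (real part y_c, imaginary part y_s)
    return out

def mclt_reference(x):
    # Direct O(N*2N) evaluation of (5.1)-(5.3), for validation only.
    N = len(x) // 2
    ref = []
    for k in range(N):
        yc = ys = 0.0
        for n in range(2 * N):
            h = -math.sin((2 * n + 1) * math.pi / (4 * N))
            arg = (2 * n + 1 + N) * (2 * k + 1) * math.pi / (4 * N)
            yc += math.sqrt(2.0 / N) * h * math.cos(arg) * x[n]
            ys += math.sqrt(2.0 / N) * h * math.sin(arg) * x[n]
        ref.append((yc, ys))
    return ref

random.seed(3)
N = 8
x = [random.uniform(-1, 1) for _ in range(2 * N)]
err = max(abs(p - q) for f, r in zip(mclt_fast(x), mclt_reference(x))
          for p, q in zip(f, r))
assert err < 1e-9
```

Replacing `dct2` with a fast, internally scaled DCT-II recovers the stated N log N + 2 multiplication count.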

Thus we can obtain the MCLT in N log N + 2 real multiplications and 3N log N + 4N real additions. The signal flow graph of the proposed algorithm is shown in Fig. 5.3.

[Figure 5.3 flow graph: pre-processing (2N additions) folds x(0:2N-1) into {x_c} and {x̃_s}; main processing applies two scaled DCT-II modules (the DST-II being computed from a DCT-II); post-processing (4N-2 additions) forms z_c(k) = ADCT(k+1) + ADST(k) and z_s(k) = ADST(k+1) - ADCT(k), using ADCT(N) = 0 and ADST(0) = 0, to produce y_c (the MCLT real part, i.e., the MLT) and y_s (the MCLT imaginary part).]

Figure 5.3: Flow graph for the proposed MCLT algorithm. Note that the DCT modules are scaled by a constant of 1/(2\sqrt{N}).

5.3 Discussion and Conclusion

This chapter has proposed a fast algorithm for the modulated complex lapped transform (MCLT) with a sine window. By merging the window with the MCLT computation, we show that both the real and imaginary parts of the MCLT can be computed from the same two N-point discrete cosine transforms. Table 5.1 compares the computational complexity of our algorithm to those of [17,24,33,34].

One can see from this table that our algorithm has the smallest number of multiplications (for N = 2^n) of any algorithm available in the literature. At the end of this research, during the preparation of a publication, it was brought to our attention that the authors of [17] have made improvements. Their paper, titled "A novel DCT-based algorithm for computing the modulated complex lapped transform", was published in


Table 5.1: Complexities of various fast MCLT algorithms for block size N = 2^n.

Algorithm    Window choice    Real multiplications    Real additions
Ref. [33]    any              N log N + 3N            3N log N + 2N
Ref. [17]    sine             N log N + N             3N log N + 4N
Ref. [34]    sine             N log N + N             3N log N + 3N - 2
Proposed     sine             N log N + 2             3N log N + 4N

the November 2006 issue of the IEEE Trans. on Signal Processing. Their new algorithm starts out by treating the window function separately, as before. It then applies a novel matrix factorization and obtains a term that coincidentally cancels out the window function, leaving two N-point DCTs as a result. The overall computational complexity matches ours exactly, though their paper does not include the two extra multiplications due to the scaling of the first row of the DCT modules. Our method, on the other hand, has a different signal flow graph and was designed from the beginning with the intention of merging the window with the main computational module.

The proposed method is applicable to any even block size N. It does not assume any specific algorithm to compute the DCT; consequently, it may be adapted to any efficient DCT software or hardware module. For example, by using a bilinear algorithm for the DCT, one can obtain a bilinear algorithm for the MCLT, which can produce a very fast implementation in VLSI. Most other algorithms use multiplications outside the main computational blocks of the DCT and therefore cannot lead to a bilinear structure.


Chapter 6

Conclusions

In this final chapter, we summarize our research on fast algorithm derivations for the discrete Hartley transform (DHT), the modified discrete cosine transform (MDCT) and the modulated complex lapped transform (MCLT). To obtain bilinear algorithms for the transforms under consideration, a group theoretic approach has proven successful in providing fast VLSI architectures. The use of group theory allows us to partition the transform kernel of interest into smaller cyclic and Hankel matrix products. Then, using bilinear algorithms for these matrix products, the desired architectures can be obtained. Future work on structured bilinear algorithms is discussed at the end.

6.1 Thesis summary

This dissertation presents a formal hardware design approach using bilinear algorithms for fast digital signal processing (DSP) applications. In particular, we focus on the design of application-specific integrated circuits (ASICs), where dedicated algorithmic accelerators are implemented in fixed-point arithmetic.

Most signal processing algorithms involve a transform kernel with a known structure. Using concepts of group theory, the kernel matrix can be recursively partitioned into small-length cyclic convolutions and Hankel matrix products. Bilinear algorithms for these smaller blocks are then combined together to obtain the required


bilinear algorithm of the transform. Bilinear algorithms have a high degree of concurrency, as all multiplication operations are independent of each other and can be computed at the same time. As a result, the hardware realizations of bilinear algorithms are much faster than other implementations. The structural modularity also allows simple pipelining and greatly reduces the number of input and output (IO) pins.

In this dissertation, we develop new bilinear algorithms and implementations for the discrete Hartley transform (DHT), the modified discrete cosine transform (MDCT) and the modulated complex lapped transform (MCLT). In the case of the bilinear DHT algorithms, we show that the kernel division based on group theory is identical for all prime power lengths. Our implementations are 20%-60% faster than existing implementations. For the MPEG-1/2 audio layer III (MP3) application, our proposed MDCT algorithms have about 30% lower computational complexity as compared with other fast algorithms in the literature. The modularity of our algorithms also permits one to design, for the first time, a unified architecture for forward and inverse transforms using different MP3 block sizes. In the case of the MCLT, we achieve a bilinear algorithm by merging the external sine window function with the main computation through trigonometric manipulation. As compared with other algorithms, our MCLT algorithm requires about N fewer multiplications, where the typical block size N for applications such as audio watermarking is 2048.

6.2 Future work

This dissertation has concentrated on the development of fast algorithms and their

hardware implementations. The next design step is to take these fast algorithms and implementations to the system level. For the discrete Hartley transform, there is much recent work published on Hartley-domain equalization. A faster DHT implementation will enable faster convergence of the equalization parameters. For the modified discrete cosine transform, it will be of not only academic interest but also practical value to develop a complete MP3 audio encoder and decoder. The initial


prototype can be a hybrid of a field programmable gate array and a low-cost microcontroller. For the MCLT, one can further investigate computations beyond the most commonly used sine window function.

One should also note that the key computational elements in the fast algorithms developed in this dissertation are bilinear algorithms for 2-point cyclic and Hankel matrix products. Such a bilinear computation generally involves 2 multiplications and 4 additions, or 3 multiplications and 3 additions. It is thus of interest to profile DSP algorithms for a wide range of applications to determine how frequently these common matrix products are used. Most advanced programmable DSP chips already have more than one multiply-accumulate (MAC) unit. In existing forms, these often exist as independent computational elements. It is thus possible to link two or more MAC units together and perform a single-cycle computation of some commonly used cyclic and/or Hankel matrix products. Since the majority of the hardware is already in place, the cost of the additional support or glue logic should be reasonably low.
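As an illustration of the two operation counts just mentioned, the 2-point bilinear kernels can be sketched as follows (our own hedged formulation; in an ASIC the coefficient-side sums and differences would be precomputed constants):

```python
def cyclic2_bilinear(x, h):
    """2-point cyclic convolution, y0 = x0*h0 + x1*h1, y1 = x0*h1 + x1*h0,
    via 2 multiplications and 4 run-time additions (a and b are
    precomputable when h is a fixed coefficient pair)."""
    a = (h[0] + h[1]) / 2
    b = (h[0] - h[1]) / 2
    m0 = (x[0] + x[1]) * a
    m1 = (x[0] - x[1]) * b
    return (m0 + m1, m0 - m1)

def hankel2_bilinear(x, h):
    """2-point Hankel matrix product [h0 h1; h1 h2] * [x0; x1] via
    3 multiplications and 3 run-time additions (the coefficient-side
    differences are precomputable constants)."""
    m1 = h[1] * (x[0] + x[1])
    m2 = (h[0] - h[1]) * x[0]
    m3 = (h[2] - h[1]) * x[1]
    return (m1 + m2, m1 + m3)
```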


Bibliography

[1] N. Anupindi, S. Narayanan, and K. Prabhu. New radix-3 FHT algorithm. Electronics Letters, 26(18):1537-1538, Aug. 1990.

[2] G. Bi. New split-radix algorithm for the discrete Hartley transform. IEEE Trans. Signal Processing, 45(2):297-302, Feb. 1997.

[3] R. E. Blahut. Fast algorithms for digital signal processing. Addison Wesley, 1984.

[4] S. Bouguezel, M. Ahmed, and M. Swamy. A new split-radix FHT algorithm for length-q·2^m DHTs. IEEE Trans. Circuits Syst. I, 51(10):2031-2043, Oct. 2004.

[5] S. Boussakta and A. Holt. Prime factor Hartley and Hartley-like transform calculation using transversal filter-type structure. IEE Proceedings, 136(5):269-277, Oct. 1989.

[6] R. Bracewell. Discrete Hartley transform. J. Opt. Soc. Amer., 73:1832-1835,

Dec. 1983.

[7] R. Bracewell. Aspects of the Hartley transform. Proc. IEEE, 82(3):381-387,

Mar. 1994.

[8] V. Britanak and K. Rao. Correction to "An efficient implementation of the forward and inverse MDCT in MPEG audio coding". IEEE Signal Processing Letters, 8(10):279, Oct. 2001.

[9] V. Britanak and K. Rao. An efficient implementation of the forward and inverse

MDCT in MPEG audio coding. IEEE Signal Processing Letters, 8(2):48-50, Feb.

2001.


[10] V. Britanak and K. Rao. A new fast algorithm for the unified forward and inverse

MDCT/MDST computation. Signal processing, 82(3):433-459, 2002.

[11] C. Chakrabarti and J. JaJa. Systolic architectures for the computation of the discrete Hartley and the discrete cosine transforms based on prime factor decomposition. IEEE Trans. Computers, 39(11):1359-1990, Nov. 1990.

[12] D. Chan, J. Yang, and C. Fang. Fast implementation of MPEG audio coder

using recursive formula with fast discrete cosine transforms. IEEE Trans. Speech,

Audio Processing, 4(2):144-148, Mar. 1996.

[13] L. Chang and S. Lee. Systolic arrays for the discrete Hartley transform. IEEE

Trans. Signal Processing, 39(11):2411-2418, Nov. 1991.

[14] C. Chen, B. Liu, and J. Yang. Recursive architectures for realizing modified discrete cosine transform and its inverse. IEEE Trans. Circuits Syst. II, 50(5):38-45, Jan. 2003.

[15] M. Cheng and Y. Hsu. Fast IMDCT and MDCT algorithms - a matrix approach.

IEEE Trans. Signal Processing, 51(l):221-229, Jan. 2003.

[16] Forward Concepts. DSP market bulletin, http://www.fwdconcepts.com/

dsp8104.htm, Aug. 2004.

[17] Q. Dai and X. Chen. New algorithm for modulated complex lapped transform

with symmetrical window function. IEEE Signal Processing Letters, 11(12):925-

928, Dec. 2004.

[18] P. Duhamel, Y. Mahieux, and J. Petit. A fast algorithm for the implementation of

filter banks based on time domain aliasing cancellation. Proc. ICASSP, 3:2209-

2212, Apr. 1991.

[19] A. Erickson and B. Fagin. Calculating the FHT in hardware. IEEE Trans. Signal

Processing, 40(6):1341-1353, June 1992.

[20] M. Balducci et al. Benchmarking of FFT algorithms. Proc. Eng. New Century, pages 328-330, Mar. 1997.


[21] A. Grigoryan. A novel algorithm for computing the 1-D discrete Hartley transform. IEEE Signal Processing Letters, 11(2):156-159, Feb. 2004.

[22] S. Gudvangen and A. Holt. Computation of prime factor DFT and DHT/DCCT

algorithms using cyclic and skew-cyclic bit-serial semisystolic IC convolvers. IEE

Proceedings, 137(5):373-389, Oct. 1990.

[23] J. Guo. An efficient design for one-dimensional discrete Hartley transform using

parallel additions. IEEE Trans. Signal Processing, 48(10) :2806-2813, Oct. 2000.

[24] C. Jing and H. Tai. Fast algorithm for computing modulated lapped transform.

Electronics Letters, 37(12):796-797, June 2001.

[25] C. Kok. Fast algorithm for computing discrete cosine transform. IEEE Trans.

Signal processing, 45(3):757-760, Mar. 1997.

[26] C. Kwong and K. Shiu. Structured fast Hartley transform algorithms. IEEE

Trans. Acoust, Speech, Signal Processing, ASSP-34(4):1000-1002, Aug. 1986.

[27] S. Lee. Improved algorithm for efficient computation of the forward and inverse

MDCT in MPEG audio coding. IEEE Trans. Circuits Syst. II, 48(10) :990-994,

Oct. 2001.

[28] D. Lun and W. Siu. On prime factor mapping for the discrete Hartley transform.

IEEE Trans. Signal Processing, 40(6):1399-1411, June 1992.

[29] D. Lun and W. Siu. On prime factor mapping for the discrete Hartley transform.

IEEE Trans. Signal Processing, 41(7):2494-2499, July 1993.

[30] H. Malvar. Lapped transforms for efficient transform/subband coding. IEEE

Trans. Acoust., Speech, Signal Processing, 38(6):969-978, June 1990.

[31] H. Malvar. Signal processing with lapped transforms. Artech House, 1992.

[32] H. Malvar. Biorthogonal and nonuniform lapped transforms for transform coding

with reduced blocking and ringing artifacts. IEEE Trans. Signal Processing,

46(4):1043-1053, Apr. 1998.


[33] H. Malvar. A modulated complex lapped transform and its applications to audio

processing. Proc. ICASSP, pages 1421-1424, Mar. 1999.

[34] H. Malvar. Fast algorithm for the modulated complex lapped transform. IEEE

Signal Processing Letters, 10(1):8-10, Jan. 2003.

[35] V. Muddhasani and M. D. Wagh. Bilinear algorithms for discrete cosine

transforms of prime lengths. Signal Processing, 86:2393-2406, 2006.

[36] V. Nikolajevic and G. Fettweis. Computation of forward and inverse MDCT using

Clenshaw's recurrence formula. IEEE Trans. Signal Processing, 51(5):1439-1444,

May 2003.

[37] V. Nikolajevic and G. Fettweis. Improved implementation of MDCT in MP3

audio coding. 10th Asia-Pacific Conf. Comm. and 5th Intern. Symp. Multi-

Dimen. Mobile Comm., 1:309-312, Aug. 2004.

[38] P. Noll. MPEG digital audio coding. IEEE Signal Processing Magazine, 14(5):59-

81, Sept. 1997.

[39] D. Pan. A tutorial on MPEG audio compression. IEEE Multimedia, 2(2):60-74,

Summer 1995.

[40] K. Parhi. VLSI digital signal processing systems: design and implementation.

John Wiley, 1999.

[41] J. Princen and A. Bradley. Analysis/synthesis filter bank design based on time

domain aliasing cancellation. IEEE Trans. Acoust. Speech Signal Processing,

ASSP-34(5):1153-1161, Oct. 1986.

[42] C. Rader. Discrete Fourier transforms when the number of data samples is prime.

Proc. IEEE, 56:104-105, June 1968.

[43] K. Rao and P. Yip. Discrete cosine transform: algorithms, advantages,

applications. Academic Press, 1990.


[44] M. Romdhane, V. Madisetti, and J. Hines. Quick-turnaround ASIC design in

VHDL. Kluwer Academic Publisher, 1996.

[45] D. Sevic and M. Popovic. A new efficient implementation of the oddly stacked

Princen-Bradley filter bank. IEEE Signal Processing Letters, 1(11):166-168, Nov.

1994.

[46] S. Shlien. The modulated lapped transform, its time-varying forms, and its

applications to audio coding standards. IEEE Trans. Speech and Audio Processing,

5(4):359-366, July 1997.

[47] International Consumer Electronics Show. Agenda for Enabling Technology

Forums, http://www.enablingtechnologyforums.com/ces2005/index.htm,

Jan. 2005.

[48] M. Smith. Application-specific integrated circuits. Addison Wesley, 1997.

[49] H. Tai and C. Jing. Design and efficient implementation of a modulated complex

lapped transform processor using pipelining technique. IEICE Trans.

Fundamentals, E84-A(5):1280-1286, May 2001.

[50] S. Tai, C. Wang, and C. Lin. FFT and IMDCT circuit sharing in DAB receiver.

IEEE Trans. Broadcasting, 49(2):124-131, June 2003.

[51] T. Tsai, T. Chen, and L. Chen. An MPEG audio decoder chip. IEEE Trans.

Consumer Electronics, 41(1):89-96, Feb. 1995.

[52] P. Vaidyanathan. Multirate systems and filter banks. Prentice Hall, 1993.

[53] M. Wagh. A new algorithm for the discrete cosine transform of arbitrary number

of points. IEEE Trans. Computers, C-29(4):269-277, Apr. 1980.

[54] M. Wagh. Modular algorithms for cyclic convolution of arbitrary length. Lehigh

University, Feb. 2005.

[55] M. Wagh. A structured bilinear algorithm for discrete Fourier transform. Lehigh

University, Feb. 2005.


[56] Z. Wang. A fast algorithm for the discrete sine transform implemented by the

fast cosine transform. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-

30(5):814-815, Oct. 1982.

[57] Z. Wang. Fast algorithms for discrete W transform and for the discrete Fourier

transform. IEEE Trans. Acoust., Speech, Signal Processing, ASSP-32:803-816,

Aug. 1984.

[58] Z. Wang. A prime factor fast W transform algorithm. IEEE Trans. Signal

Processing, 40(9):2361-2368, Sept. 1992.

[59] L. Wanhammar. DSP integrated circuits. Academic Press, 1999.

[60] N. Weste and K. Eshraghian. Principles of CMOS VLSI design: a systems

perspective. Addison Wesley, 2nd edition, 1992.

[61] S. Winograd. On computing the discrete Fourier transform. Math. Comp.,

32:175-199, Jan. 1978.

[62] J. Wu and J. Shiu. Discrete Hartley transform in error control coding. IEEE

Trans. Signal Processing, 39(10):2356-2359, Oct. 1991.

[63] Y. Yao, Q. Yao, P. Liu, and Z. Xiao. Embedded software optimization for

MP3 decoder implemented on RISC core. IEEE Trans. Consumer Electronics,

50(4):1244-1249, June 2005.

[64] P. Yeh. Data compression properties of the Hartley transform. IEEE Trans.

Acoust., Speech, Signal Processing, 37(3):450-451, Mar. 1989.

[65] Z. Zhao. In-place radix-3 fast Hartley transform algorithm. Electronics Letters,

28(3):319-321, Jan. 1992.


Vita

Xingdong Dai received a Bachelor's degree from Southern Illinois University at

Carbondale (SIUC) in 1994 and a Master's degree from Arizona State University (ASU)

in 1996, both in electrical engineering. Xingdong was a research associate at ASU and

his study focused on quantum effects in nanometer-scale structures fabricated with

a scanning tunneling microscope. In 1996, Xingdong joined Lucent Technologies

Microelectronics Group in Allentown, PA, where he was involved in integrated circuit

design of modem and DSL chips. In 1998, he was recognized with the Bell Laboratories

President's Silver Award for technical contributions to the soft modem development.

From 2000 to 2001, Xingdong was with Spinnaker Networks (acquired by Network

Appliance) in Pittsburgh, PA. He contributed to the design and validation of a

hardware acceleration FPGA for the file system on high-performance network-attached storage

servers. Since 2002, Xingdong has worked for Agere Systems developing high-speed

interface IP for telecom, enterprise networking, and consumer applications. He is currently

a staff engineer in the Mixed Signal R&D organization at LSI Corporation.

Xingdong has one published journal paper and holds two issued U.S. patents, with

additional applications pending. Xingdong is a member of Tau Beta Pi, a national engineering

honor society.
