VLSI SYNTHESIS OF DSP KERNELS
Algorithmic and Architectural Transformations

by

MAHESH MEHENDALE Texas Instruments (India), Ltd.

and

SUNIL D. SHERLEKAR Silicon Automation Systems Ltd.

Springer Science+Business Media, LLC


A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN 978-1-4419-4904-2    ISBN 978-1-4757-3355-6 (eBook)    DOI 10.1007/978-1-4757-3355-6

Printed on acid-free paper

All Rights Reserved © 2001 Springer Science+Business Media New York. Originally published by Kluwer Academic Publishers, Boston in 2001. Softcover reprint of the hardcover 1st edition 2001. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without written permission from the copyright owner.


Contents

List of Figures xi
List of Tables xv
Foreword xvii
Acknowledgments xix
Preface xxi

1. INTRODUCTION 1
1.1 An Example 1
1.2 The Design Process: Constraints and Alternatives 3
1.3 Organization of the Book 7
1.4 For the Reader 9

2. PROGRAMMABLE DSP BASED IMPLEMENTATION 11
2.1 Power Dissipation - Sources and Measures 13
2.1.1 Components Contributing to Power Dissipation 13
2.1.2 Measures of Power Dissipation in Busses 13
2.1.3 Measures of Power Dissipation in the Multiplier 13
2.2 Low Power Realization of DSP Algorithms 16
2.2.1 Allocation of Program, Coefficient and Data Memory 16
2.2.2 Bus Coding 17
2.2.2.1 Gray Coded Addressing 17
2.2.2.2 T0 Coding 18
2.2.2.3 Bus Invert Coding 20
2.2.3 Instruction Buffering 21
2.2.4 Memory Architectures for Low Power 22
2.2.5 Bus Bit Reordering 24
2.2.6 Generic Techniques for Power Reduction 26
2.3 Low Power Realization of Weighted-sum Computation 26
2.3.1 Selective Coefficient Negation 27
2.3.2 Coefficient Ordering 28
2.3.2.1 Coefficient Ordering Problem Formulation 29
2.3.2.2 Coefficient Ordering Algorithm 30
2.3.3 Adder Input Bit Swapping 31
2.3.4 Swapping Multiplier Inputs 33
2.3.5 Exploiting Coefficient Symmetry 34


2.4 Techniques for Low Power Realization of FIR Filters 35
2.4.1 Circular Buffer 36
2.4.2 Multirate Architectures 37
2.4.2.1 Computational Complexity of Multirate Architectures 37
2.4.2.2 Multirate Architecture on a Programmable DSP 38
2.4.3 Architecture to Support Transposed FIR Structure 41
2.4.4 Coefficient Scaling 42
2.4.5 Coefficient Optimization 43
2.4.5.1 Coefficient Optimization - Problem Definition 43
2.4.5.2 Coefficient Optimization - Problem Formulation 43
2.4.5.3 Coefficient Optimization Algorithm - Components 44
2.4.5.4 Coefficient Optimization Algorithm 45
2.4.5.5 Coefficient Optimization Using 0-1 Programming 50
2.5 Framework for Low Power Realization of FIR Filters on a Programmable DSP 51

3. IMPLEMENTATION USING HARDWARE MULTIPLIER(S) AND ADDER(S) 55
3.1 Architectural Transformations 55
3.2 Evaluating the Effectiveness of DFG Transformations 56
3.3 Low Energy vs Low Peak Power Tradeoff 61
3.4 Multirate Architectures 63
3.4.1 Computational Complexity of Multirate Architectures 64
3.4.1.1 Non-linear Phase FIR Filters 64
3.4.1.2 Linear Phase FIR Filters 65
3.5 Power Analysis of Multirate Architectures 68
3.5.1 Power Analysis for One Level Decimated Multirate Architectures 68
3.5.1.1 Power Analysis - an Example 70
3.5.1.2 Power Reduction Using Multirate Architectures 71

4. DISTRIBUTED ARITHMETIC BASED IMPLEMENTATION 75
4.1 DA Structures for Area-Delay Tradeoff 76
4.1.1 DA Based Implementation of Linear Phase FIR Filters 77
4.1.2 1-Bit-At-A-Time vs 2-Bits-At-A-Time Access 78
4.1.3 Multiple Coefficient Memory Banks 79
4.1.4 Multiple Memory Bank Implementation with 2BAAT Access 80
4.1.5 DA Based Implementation of Multirate Architectures 81
4.1.6 Multirate Architecture with a Decimation Factor of Three 82
4.1.7 Multirate Architectures with Two Level Decimation 84
4.1.8 Coefficient Memory vs Number of Additions Tradeoff 84
4.2 Improving Area Efficiency of Two LUT Based DA Structures 85
4.2.1 Minimum Area Partitions for Two ROM Implementation 87
4.2.2 Minimum Area Partitions for Hardwired Logic 88


4.2.2.1 CF2: Estimating Area from the Actual Truth-Table 89
4.2.2.2 CF1: Estimating Area from the Coefficients in Each Partition 91
4.2.3 Evaluating the Effectiveness of the Coefficient Partitioning Technique 92
4.3 Techniques for Low Power Implementation of DA Based FIR Filters 94
4.3.1 Toggle Reduction Using Data Coding 95
4.3.1.1 Nega-binary Coding 95
4.3.1.2 2's Complement vs Nega-binary Representation 96
4.3.1.3 Deriving an Optimum Nega-binary Scheme for a Given Data Distribution 99
4.3.1.4 Incorporating a Nega-binary Scheme into the DA Based FIR Filter Implementation 101
4.3.1.5 A Few Observations 103
4.3.1.6 Additional Power Saving with Nega-binary Architecture 104
4.3.2 Toggle Reduction in Memory Based Implementations by Gray Sequencing and Sequence Reordering 107

5. MULTIPLIER-LESS IMPLEMENTATION 113
5.1 Minimizing Additions in the Weighted-sum Computation 114
5.1.1 Minimizing Additions - an Example 114
5.1.2 2 Bit Common Subexpressions 116
5.1.3 Problem Formulation 116
5.1.4 Common Subexpression Elimination 118
5.1.5 The Algorithm 119
5.2 Minimizing Additions in MCM Computation 120
5.2.1 Minimizing Additions - an Example 120
5.2.2 2 Bit Common Subexpressions 122
5.2.3 Problem Formulation 123
5.2.4 Common Subexpression Elimination 124
5.2.5 The Algorithm 124
5.2.6 An Upper Bound on the Number of Additions for MCM Computation 126
5.3 Transformations for Minimizing Number of Additions 128
5.3.1 Number Theoretic Transforms 128
5.3.1.1 2's Complement Representation 128
5.3.1.2 Uni-sign Representation 129
5.3.1.3 Canonical Signed Digit (CSD) Representation 129
5.3.2 Signal Flow Graph Transformations 130
5.3.3 Evaluating Effectiveness of the Transformations 133
5.3.4 Transformations for Optimal Initial Solution 137
5.3.4.1 Coefficient Optimization 137
5.3.4.2 Efficient Pre-Filter Structures 138
5.4 High Level Synthesis of Multiprecision DFGs 138


5.4.1 Precision Sensitive Register Allocation 138
5.4.2 Precision Sensitive Functional Unit Binding 139
5.4.3 Precision Sensitive Scheduling 140

6. IMPLEMENTATION OF MULTIPLICATION-FREE LINEAR TRANSFORMS 141
6.1 Optimum Code Generation for Register-rich Architectures 142
6.1.1 Generic Register-rich Architecture Model 142
6.1.2 Sources and Measures of Power Dissipation 143
6.1.3 Optimum Code Generation for 1-D Transforms 144
6.1.4 Minimizing Number of Operations in Two Dimensional Transforms 146
6.1.5 Low Power Code Generation 148
6.2 Optimum Code Generation for Single Register, Accumulator Based Architectures 153
6.2.1 Single Register, Accumulator Based Architecture Model 153
6.2.2 Code Generation Rules 154
6.2.3 Computation Scheduling Algorithm 156
6.2.4 Impact of DAG Structure on the Optimality of Generated Code 158
6.2.5 DAG Optimizing Transformations 159
6.2.5.1 Transformation I - Tree to Chain Conversion 159
6.2.5.2 Transformation II - Serializing a Butterfly 159
6.2.5.3 Transformation III - Fanout Reduction 160
6.2.5.4 Transformation IV - Merging 161
6.2.6 Synthesis of Spill-free DAGs 162
6.2.7 Sources and Measures of Power Dissipation 168
6.2.8 Low Power Code Generation 168

7. RESIDUE NUMBER SYSTEM BASED IMPLEMENTATION 171
7.1 Optimizing RNS based Implementation of the Weighted-sum Computation 172
7.1.1 Parallel Processing 174
7.1.2 Residue Encoding for Low Power 174
7.1.3 Coefficient Ordering 175
7.1.4 Exploiting Redundancy 176
7.1.5 Residue Encoding for Minimizing LUT Area 177
7.2 Optimizing RNS based Implementation of FIR Filters 179
7.2.1 Coefficient Scaling 179
7.2.2 Coefficient Optimization for Low Power 180
7.2.3 RNS based Implementation of Transposed FIR Filter Structure 180
7.2.4 Coefficient Optimization for Area Reduction 180
7.3 RNS as an Optimizing Transformation for High Precision Signal Processing 183


8. A FRAMEWORK FOR ALGORITHMIC AND ARCHITECTURAL TRANSFORMATIONS 187
8.1 Classification of Algorithmic and Architectural Transformations 187
8.2 A Snapshot of the Framework 191

9. SUMMARY 195

References 199

Topic Index 207

About the Authors 209


List of Figures

1.1 Digital Still Camera System 2
1.2 DSC Image Pipeline 3
1.3 Hardware-Software Codesign Methodology for a System-on-a-chip 4
1.4 Solution Space for Weighted-Sum Computation 7
2.1 Generic DSP Architecture 12
2.2 4x4 Array Multiplier 14
2.3 Toggle Count as a Function of Number of Ones in the Multiplier Inputs 16
2.4 Toggle Count as a Function of Hamming Distance between Successive Inputs 16
2.5 Address Bus Power Dissipation as a Function of Start Address 17
2.6 Binary to Gray Code Conversion 18
2.7 Memory Reorganization to Support Gray Coded Addressing 19
2.8 Programmable Binary to Gray Code Converter 19
2.9 T0 Coding Scheme 20
2.10 T0 Coding Scheme 21
2.11 Instruction Buffering 22
2.12 Decoded Instruction Buffering 22
2.13 Memory Partitioning for Low Power 23
2.14 Prefetch Buffer 23
2.15 Bus Reordering Scheme for Power Reduction in PD Bus 24
2.16 %Reduction in the Number of Adjacent Signal Transitions in Opposite Directions as a Function of the Bus Reordering Span 26
2.17 Coefficients of a 32 Tap Linear Phase Low Pass FIR Filter 27
2.18 Scheme for Reducing Power in the Adder Input Busses 33
2.19 Data Flow Graph of a Weighted-sum Computation with Coefficient Symmetry 34
2.20 Suitable Abstraction of TMS320C54x Architecture for Exploiting Coefficient Symmetry 35
2.21 Signal Flow Graph of a Direct Form FIR Filter 36
2.22 One Level Decimated Multirate Architecture 38


2.23 Normalized Power Dissipation as a Function of Number of Taps for the Multirate FIR Filters Implemented on TMS320C2x 41
2.24 Signal Flow Graph of the Transposed FIR Filter 42
2.25 Architecture to Support Efficient Implementation of Transposed FIR Filter 42
2.26 Frequency Domain Characteristics of a 24 Tap FIR Filter Before and After Optimization 49
2.27 Low Pass Filter Specifications 50
2.28 Framework for Low Power Realization of FIR Filters on a Programmable DSP 53
3.1 Direct Form Structure of a 4 Tap FIR Filter 57
3.2 Scheduled DFG Using One Multiplier and One Adder 57
3.3 Scheduled DFG Using One Pipelined Multiplier and One Adder 58
3.4 Loop Unrolled DFG Using 1 Pipelined Multiplier and 1 Adder 59
3.5 Retimed 4 Tap FIR Filter 59
3.6 MCM DFG Using One Pipelined Multiplier and One Adder 60
3.7 Direct Form DFG Using Two Pipelined Multipliers and One Adder 60
3.8 MCM DFG Using Two Pipelined Multipliers and Two Adders 61
3.9 Energy and Peak Power Dissipation as a Function of Degree of Parallelism 62
3.10 Lower Limit of VDD/VT for Reduced Peak Power Dissipation as a Function of Degree of Parallelism 63
3.11 One Level Decimated Multirate Architecture: Topology-I 63
3.12 One Level Decimated Multirate Architecture: Topology-II 64
3.13 Signal Flow Graph of a Direct Form FIR Structure with Non-linear Phase 65
3.14 Signal Flow Graph of a Direct Form FIR Structure with Linear Phase 65
3.15 Signal Flow Graph of a Two Level Decimated Multirate Architecture 68
3.16 Normalized Delay vs Supply Voltage Relationship 69
3.17 Normalized Power Dissipation vs Number of Taps 71
4.1 DA Based 4 Tap FIR Filter 77
4.2 4 Tap Linear Phase FIR Filter 78
4.3 2 Tap FIR Filter with 2BAAT 79
4.4 Using Multiple Memory Banks 80
4.5 Multirate Architecture 81
4.6 DA Based 4 Tap Multirate FIR Filter 82
4.7 Area-Delay Curves for FIR Filters 85


4.8 Two Bank Implementation - Simple Coefficient Split 86
4.9 Two Bank Implementation - Generic Coefficient Split 86
4.10 Area vs Normalized CF2 Plot for 25 Different Partitions of a 16 Tap Filter 91
4.11 Range of Represented Values for N=4, 2's Complement and N+1=5, Nega-binary 96
4.12 Typical Audio Data Distribution for 25000 Samples Extracted from an Audio File 97
4.13 Difference in Toggles for N=6, 2's Complement and Nega-binary Scheme: + - - + - + + 98
4.14 Difference in Toggles for N=6, 2's Complement and Nega-binary Scheme: - + + - + - + 99
4.15 Gaussian Distributed Data with N=6, Mean=22, SD=6 100
4.16 Gaussian Distributed Data with N=6, Mean=-22, SD=6 101
4.17 DA Based FIR Architecture Incorporating the Nega-binary Scheme 102
4.18 Saving vs SD Plot for N=8, Gaussian Distributed Data with Mean = max/2 105
4.19 Narrow (SD=8) Gaussian Distribution 106
4.20 Broad (SD=44) Gaussian Distribution 107
4.21 Shiftless Implementation of DA Based FIR with Fixed Gray Sequencing 108
4.22 Shiftless Implementation of DA Based FIR with Any Sequencing Possible 109
5.1 Data Flow Graph for a 4-term Weighted-sum Computation 114
5.2 Coefficient Subexpression Graph for the 4-term Weighted-sum Computation 118
5.3 Data Flow Graph for 4 term MCM Computation 121
5.4 SFG Transformation - Computing Y[n] in Terms of Y[n-1] 131
5.5 SFG Transformation - Computing Y[n] in Terms of Y[n-1] 133
5.6 Average Reduction Factor Using Common Subexpression Elimination 134
5.7 Best Reduction Factors Using Coefficient Transforms Without Common Sub-expression Elimination 135
5.8 Best Reduction Factors Using Coefficient Transforms with Common Sub-expression Elimination 136
5.9 Frequency of Various Coefficient Transforms Resulting in the Best Reduction Factor with Common Sub-expression Elimination 137
5.10 Precision Sensitive Register Allocation 139


5.11 Precision Sensitive Register Allocation 139
5.12 Precision Sensitive Scheduling 140
6.1 Generic Register-rich Architecture 143
6.2 3x3 Pixel Window Transform 144
6.3 Prewitt Window Transform 145
6.4 Transformed DAG with All SUB Nodes 145
6.5 Chain-type DAG for Prewitt Window Transform 146
6.6 Optimized Code for Prewitt Window Transform 146
6.7 Optimized DAG for 4x4 Haar Transform 148
6.8 Scheduled Instructions for 4x4 Haar Transform 149
6.9 Data Flow Graph and Variable Lifetimes for 4x4 Haar Transform 150
6.10 Register-Conflict Graph 150
6.11 Consecutive-Variables Graph 150
6.12 Register Assignment for Low Power 151
6.13 Code Optimized for Low Power 151
6.14 3x3 Window Transforms 152
6.15 Single Register, Accumulator Based Architecture 153
6.16 Example DAG 154
6.17 DAG for 4x4 Walsh-Hadamard Transform 158
6.18 Optimized DAG for 4x4 Walsh-Hadamard Transform 159
6.19 Transformation I - Tree to Chain Conversion 160
6.20 Transformation II - Serializing a Butterfly 160
6.21 Transformations III and IV 161
6.22 Optimizing DAG Using Transformations 161
6.23 Spill-free DAG Synthesis 164
6.24 DAGs for 8x8 Walsh-Hadamard Transform 165
6.25 Spill-free DAGs for 8x8 Walsh-Hadamard Transform 166
6.26 DAGs for 8x8 Haar Transform 166
7.1 RNS Based Implementation of FIR Filters 173
7.2 Modulo MAC using look-up-tables 173
7.3 Modulo MAC using a single LUT 174
7.4 RNS Based Implementation of FIR Filters with Parallel Processing Transformation 175
7.5 Minimizing Look Up Table Area by Exploiting Redundancy 177
7.6 Modulo MAC structure for Transposed Form FIR Filter 181
8.1 A Framework for Area-Power Tradeoff 192
8.2 A Framework for Area-Power Tradeoff - continued 193


List of Tables

2.1 Adjacent Signal Transitions in Opposite Direction as a Function of the Bus-reordering Span 25
2.2 Impact of Selective Coefficient Negation on Total Number of 1s in the Coefficients 28
2.3 Impact of Coefficient Ordering on Hamming Distance and Adjacent Toggles 31
2.4 Power Optimization Results Using Input Bit Swapping for 1000 Random Number Pairs 33
2.5 TMS320C2x Code for Direct Form Architecture 38
2.6 TMS320C2x Code for the Multirate Architecture 40
2.7 Hamming Distance and Adjacent Signal Toggles After Coefficient Scaling Followed by Steepest Descent and First Improvement Optimization with No Linear Phase Constraint 47
2.8 Hamming Distance and Adjacent Signal Toggles After Coefficient Scaling Followed by Steepest Descent and First Improvement Optimization with Linear Phase Constraint 48
2.9 Hamming Distance and Adjacent Signal Toggles for Steepest Descent and First Improvement Optimization with and without Linear Phase Constraint (with No Coefficient Scaling) 48
3.1 Computational Complexity of Multirate Architectures 67
3.2 Comparison with Direct Form and Block FIR Implementations 72
4.1 Coefficient Memory and Number of Additions for DA based Implementations 85
4.2 A Few Functions and Their Corresponding Correlations with Actual Area 88
4.3 ROM Areas as a % of Maximum Theoretical Area 92
4.4 ROM vs Hardwired Area (Equivalent NA210 NAND Gates) Comparison 93
4.5 Area (Equivalent NA210 NAND Gates) Statistics for All Possible Coefficient Partitions 93
4.6 Toggle and No-toggle Power Dissipation in Some D FFs 94


4.7 Best Nega-binary Schemes for Gaussian Data Distribution (mean = max/2; SD = 0.17 max) 105
4.8 Toggle Reduction in LUT (for 10,000 Samples; Gaussian Distributed Data) 106
4.9 Comparison of Weighted Toggle Data for Different Gray Sequences 110
4.10 Toggle Reduction as a Percentage of 2's Complement Case for Two Different Gaussian Distributions 110
4.11 Toggle Reduction with Gray Sequencing for N = 8 and Some Typical Distributions 111
5.1 Number of Additions+Subtractions (Initial and After Minimization) 120
5.2 Number of Additions+Subtractions for Computing MCM Intermediate Outputs 126
6.1 Total Hamming Distance Between Successive Instructions 152
6.2 Code Dependence on the Scheduling of DAG Nodes 155
6.3 Comparison of Code Generator with 'C5x C Compiler 157
6.4 Number of Nodes (Ns) and Cycles (Cs) for Various DAG Transforms 167
6.5 Hamming Distance Measure for Accumulator based Architectures 169
7.1 Area estimates for PLA based modulo adder implementation 178
7.2 Area estimates for PLA based modulo multiplier implementation 179
7.3 Area estimates for PLA based modulo MAC implementation 179
7.4 Distribution of Residues across the Moduli Set 182
7.5 Impact of Coefficient Optimization on the Area of Modulo Multiplier and Modulo MAC 183
7.6 RNS based FIR filter with 24-bit precision on C5x 184
7.7 Number of Operations for RNS based FIR filter with 24-bit precision on C5x 184


Foreword

Technology is a driving force in society. At times it seems to be driving us faster than we want to go. At the same time it seems to patiently wait for us to adapt to it and, finally, adopt it as our own. Let me give a few examples.

The answering machine is a good example of us adopting a technology. Twenty years ago if you had called my home and I had an answering machine, rather than me, responding to your call, you would have thought, "Oh, how rude of him! I don't want to talk to a machine, I want to talk to Gene". Today, twenty years later, if you call my home and do not get my answering machine (or me), you will think, "Oh, how rude of him! He should have an answering machine so that I can at least leave a message". We have actually gone far beyond the answering machine in this respect. We now have cellular phones - with answering machines. Forget "Snail" mail, even Email is not fast enough; we have Instant Messaging. But although I have a videophone on my desk, no one else seems to have one. I guess we haven't adopted all of the technology that we are introduced to.

Another example of the advance of technology is seen in the definition of "portable". The term has changed over the last several decades as a result of advances in integrated circuit technology. Think of the Personal Computer. Not long ago, "portable" meant one (strong!) person could carry a Personal Computer on an airplane without putting it in the checked-in baggage. Now, "portable" means I can put my computer in my briefcase and still have room for other things. It is beginning to mean that I can put my computer in my pocket. In the future, it may very well be that each one of us wears multiple computers as a matter of daily life. We will have a communications computer, an entertainment computer, an information computer, a personal medical computer to name a few. They will all communicate with one another over a personal area network on our bodies. I like to call this personal area network the "last meter".

The definition of "portable" has also changed in the area of portable phones. We have graduated from car phones - where the electronics were hidden in the trunk of the car - to cellular phones so small that they can easily get lost in a shirt pocket.

There are many more examples of how the marriage of Digital Signal Processing to Integrated Circuit Technology has revolutionized our lives. But rather than continue in that direction, I would like to turn to a brief historical perspective of how successful this marriage has been. After looking at history, I would like to tie all of this to the value of this book.

Digital Signal Processing, depending on your view of history, has been around for only about forty years. It began as a university curiosity in the 1960s. This was about the same time that digital computers were becoming useful. In the 1970s, Digital Signal Processing became a military advantage for those nations who could afford it. It was in the late 1970s and early 1980s that Integrated Circuit Technology became mature enough to impact Digital Signal Processing with the introduction of a new device called the "Digital Signal Processor". With this new device, Digital Signal Processing moved from the laboratory and military advantage to being a commercial success. Telecommunications was the earliest to adopt Digital Signal Processing with many others to follow. It was in the decade of the 1990s that Digital Signal Processing moved from being a commercial success to being a consumer success. This was a direct result of the advances in Integrated Circuit Technology. These advances yielded four significant benefits: 1) lower cost, 2) higher performance, 3) lower power and 4) more transistors per device. The industry began to think in terms of a System on a Chip (SoC). This led us to where we are now and will lead us to where we will go in the coming decades.

What I see in our future is the opportunity to take advantage of these four benefits of Integrated Circuit Technology as it is applied to Digital Signal Processing. SoC technology will either complicate or simplify our decisions on how best to implement Digital Signal Processing solutions on VLSI. We will need to optimize on the best combination of Performance, Power dissipation and Price. We will not only continue to change the definition of "portable" but will begin to change the definitions of "personal", "good enough" and "programmable".

This book focuses on this very marriage of Digital Signal Processing to Integrated Circuit Technology. It addresses implementation options as we try to create new products which will impact society. These new products will need to have good enough performance, low enough power dissipation and a low enough price. At the same time they will need to be quick to market.

So, read this book! It will give you insights and arm you with techniques to make the inevitable tradeoffs necessary to implement Digital Signal Processing on Integrated Circuits to create new products.

One last thought on the marriage of Digital Signal Processing to Integrated Circuit technology. Over the last several years, I have observed that every time the performance of Digital Signal Processors increases significantly, the rules of how we apply Digital Signal Processing theory change. Isn't this a great time we live in?

GENE FRANTZ

Senior Fellow, Digital Signal Processing Texas Instruments Inc.

Houston, Texas April 2001


Acknowledgments

First and foremost, we would like to express our sincere gratitude to Milind Sohoni, Vikram Gadre and Supratim Biswas (all of IIT Bombay), G. Venkatesh (with Sasken Communication Technologies Ltd., earlier with IIT Bombay) and Rubin Parekhji of Texas Instruments (India) for their insightful comments, critical remarks and feedback which enriched the quality of this book.

We are thankful to Bobby Mitra and Sham Banerjee of Texas Instruments (India) for their help, support and guidance.

We are grateful to Texas Instruments (India) for sponsoring the doctoral studies of the first author. We deeply appreciate the support and encouragement of IIT Bombay and Sasken Communication Technologies Ltd.

We are thankful to Amit Sinha, Somdipta Basu Roy, M.N. Mahesh, Satrajit Gupta, Anand Pande, Sunil Kashide and Vikas Agrawal (all with Texas Instruments (India) when the work was done) for their assistance in implementing some of the techniques discussed in this book.

Our warm thanks to our children - Aarohi Mehendale and Aparna & Nachiket Sherlekar for putting up with our long hours at work. Finally, thanks are due to our wives - Archana Mehendale and Gowri Sherlekar for being there with us at all times.

MAHESH MEHENDALE

SUNIL D. SHERLEKAR


Preface

D. E. Knuth in his seminal paper "Structured Programming with Goto Statements" underlines the importance of optimizing the inner loop in a computer program. More than twenty-five years and a revolution in semiconductor technology have not diminished the importance of the inner loop.

This book is about synthesis of the 'inner loop' or the kernel of Digital Signal Processing (DSP) systems. These systems process - in real time - digital information in the form of text, data, speech, images, audio and video. The wide variety of these systems notwithstanding, their kernels or inner loops share a common class of computation. This is the weighted sum (Σ A[i]X[i]). It occurs in Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filters, in signal correlation and in computing signal transforms.
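To make the object of study concrete, here is a minimal sketch of this kernel (our own illustrative Python, not code from the book): the same inner loop serves filtering, correlation and transforms.

```python
def weighted_sum(a, x):
    """The kernel: sum(a[i] * x[i]) over N terms."""
    acc = 0
    for ai, xi in zip(a, x):
        acc += ai * xi
    return acc

def fir(a, samples):
    """Direct form FIR filtering: y[n] = sum_i a[i] * x[n - i]."""
    window = [0] * len(a)
    out = []
    for s in samples:
        window = [s] + window[:-1]   # shift the new sample in
        out.append(weighted_sum(a, window))
    return out
```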

Unlike general purpose computation which asks for computation to be 'as fast as possible', DSP systems require performance that is characterized by the arrival rate of a data stream which, in turn, is determined by the Nyquist sampling rate of the signal to be processed. The performance of the system is therefore a constraint within which one must optimize the area (cost) and power (battery life). This is usually a matter of tradeoff.

The area-power tradeoff is complicated by additional requirements of flexibility. Flexibility is important to track evolving standards, to cater to multiplicity of standards (such as air interfaces in mobile communication) and fast-paced innovation in algorithms. Flexibility is achieved by implementation in software, but a completely soft implementation is likely to be ruinous for power. It is therefore imperative that the requirements of flexibility be carefully predicted and the system be partitioned into hardware and software components.

In this book, we present several algorithmic and architectural transformations to optimize weighted-sum based DSP kernels over the area-delay-power space. These transformations address implementation technologies that offer varying degrees of programmability (and therefore flexibility) ranging from software programmable processors to customized hardwired solutions using standard-cell or gate-array based ASICs. We consider both the multiplier-less and the hardware multiplier-based implementations of the weighted-sum computation.

To start with, we present a comprehensive framework that encapsulates techniques for low power implementation of DSP algorithms on programmable DSPs. These techniques complement one another and address power reduction in various components such as the program and data memory busses and the multiplier-accumulator datapath of a Harvard architecture based digital signal processor. The techniques are then specialized for weighted sum computations and then for FIR filters.

Next we present architectural transforms for power optimization for hardwired implementation of FIR filters. Multirate architectures are presented as an important and interesting transform. A detailed analysis of the computational complexity of multirate architectures is presented with results that indicate significant power savings compared to other FIR filter structures.

Distributed Arithmetic (DA) has been presented in the literature as one of the approaches for multiplier-less implementation of weighted-sum computation. We present techniques for deriving multiple DA based structures that represent different data-points in the area-delay space. We look at improving area-efficiency of DA based implementations and specifically show how the flexibility in coefficient partitioning can be exploited to reduce the area of a DA structure using two look-up-tables. We also address the problem of reducing power dissipation in the input data shift-registers of DA based FIR filters. Our technique is based on a generic nega-binary representation scheme which is customized for a given distribution profile of input data values, so as to minimize toggles in the shift-registers.
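As a rough illustration of the representation idea (a sketch of ours, with a made-up sign pattern; the book derives the optimum pattern from the data distribution): each bit position is given a fixed sign, and toggles are counted between successive words in the shift-register.

```python
def value(bits, signs):
    """bits and signs are LSB-first; bits in {0, 1}, signs in {+1, -1}.
    The word represents sum_i signs[i] * bits[i] * 2**i; the all-plus
    pattern is plain binary, the alternating pattern is base -2."""
    return sum(s * b * (1 << i) for i, (b, s) in enumerate(zip(bits, signs)))

def toggles(word_a, word_b):
    """Bit flips between successive words in the shift-register."""
    return sum(a != b for a, b in zip(word_a, word_b))
```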

For non-adaptive signal processing applications in which the weight values are constant and known at design time, an area-efficient realization can be achieved by implementing the weighted sum computation using shift and add operations. We present techniques for minimizing additions in such multiplier-less implementations. These techniques are also useful for efficient implementation of weighted-sum computations on programmable processors that do not support a hardware multiplier.
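For a flavor of the shift-and-add approach, here is an illustrative sketch (ours, not the book's minimization algorithm) that recodes a constant coefficient into canonical signed digits and multiplies using shifts, additions and subtractions only:

```python
def csd(c):
    """Canonical Signed Digit recoding of a positive integer:
    digits in {-1, 0, +1}, LSB first, no two adjacent non-zeros."""
    digits = []
    while c:
        if c & 1:
            d = 2 - (c & 3)   # c % 4 == 1 -> +1, c % 4 == 3 -> -1
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits

def mul_const(x, c):
    """x * c using shifts and add/subtract only (no multiplier)."""
    acc = 0
    for i, d in enumerate(csd(c)):
        if d:
            acc += d * (x << i)   # one addition or subtraction
    return acc

# e.g. c = 7 is 111 in binary (two additions) but +1 0 0 -1 in CSD:
# a single subtraction, 8x - x.
```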

We address a special class of weighted-sum computation problem, where the weight-values are restricted to {0, 1, -1}. We present techniques for optimized code generation of one dimensional and two dimensional multiplication-free linear transforms. These are targeted to both register-rich and single-register, accumulator based architectures.
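A toy version of such a transform (an illustrative sketch of ours; the 4x4 Walsh-Hadamard matrix used here is one of the transforms treated in Chapter 6): every term is either skipped, added or subtracted, so no multiplier is needed.

```python
H4 = [[1,  1,  1,  1],
      [1, -1,  1, -1],
      [1,  1, -1, -1],
      [1, -1, -1,  1]]    # all weights in {+1, -1}

def mult_free(matrix, x):
    """Apply a {0, 1, -1}-weighted transform with adds/subtracts only."""
    out = []
    for row in matrix:
        acc = 0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi     # ADD
            elif w == -1:
                acc -= xi     # SUB
            # w == 0: term skipped entirely
        out.append(acc)
    return out
```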

Residue Number Systems (RNS) have been proposed for high-speed parallel implementation of addition, subtraction and multiplication operations. We explain how the power of RNS can be exploited for optimizing the implementation of weighted sum computations. In particular, RNS is proposed as a method to enhance the results of other techniques presented in this book. RNS is also proposed as a technique to enhance the precision of computations on a programmable DSP.
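The essence of the RNS idea, as a hedged sketch (the moduli set here is ours, chosen only for illustration): arithmetic decomposes into independent small modular channels, recombined at the end by the Chinese Remainder Theorem.

```python
MODULI = (7, 11, 13)      # pairwise coprime; dynamic range 7*11*13 = 1001

def to_rns(x):
    return tuple(x % m for m in MODULI)

def rns_mac(acc, a, x):
    """One multiply-accumulate step, performed per modulus in parallel."""
    return tuple((r + (a % m) * (x % m)) % m for r, m in zip(acc, MODULI))

def from_rns(res):
    """Chinese Remainder Theorem reconstruction (uses Python 3.8+ pow)."""
    M = 7 * 11 * 13
    x = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)   # modular inverse of Mi mod m
    return x % M
```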

To tie up all these techniques, a methodology is presented to systematically identify transformations that exploit the characteristics of a given DSP algorithm and of the implementation style, to achieve tradeoffs in the area-delay-power space.

This book is meant for practicing DSP system designers, who understand that optimal design can never be a push-button activity. We sincerely hope that they can benefit from the variety of techniques presented in this book. Each of the techniques has a potential benefit to offer. But actual benefit will accrue only from a proper selection from these techniques and their appropriate implementation: something that is in the realm of human expertise and judgement.

MAHESH MEHENDALE

SUNIL D. SHERLEKAR

Bangalore April 2001


Chapter 1

INTRODUCTION

Today's digitally networked society has seen the emergence of many applications that process and transceive information in the form of text, data, speech, images, audio and video. Digital Signal Processing (DSP) is the key technology enabling this digital revolution. With advances in semiconductor technology the number of devices that can be integrated on a single chip has been growing exponentially. Experts forecast that Moore's law of exponential growth in chip density will hold good at least till year 2010. By then, the minimum feature size of 0.07 micron will enable the integration of as many as 800 million transistors on a single chip [69]. As we move into the era of ULSI (Ultra Large Scale Integration), the electronic systems which required multi-chip solutions can now be implemented on a single chip. Single chip solutions are now available for applications such as Video Conferencing, DTADs (Digital Telephone Answering Devices), cellular phones, pagers, modems etc.

1.1. An Example

As an example, consider the electronics of a Digital Still Camera (DSC) [26] shown in figure 1.1. The system-level components are the CCD image sensor, the A/D conversion front-end, the DSP engine for image processing and compression and various interface and memory drivers.

Although there are no intrinsic real-time constraints for such a system, it has performance requirements dictated by the need to have as short a shot-to-shot delay as possible. Besides, many DSCs now have a provision of attaching an audio clip with each picture which requires real-time compression and storage. Of course, being a portable device, the most important constraint on the system design is the need for low power to ensure a long battery life.

Figure 1.2 shows the DSP pipeline of the DSC [26]. The following blocks are of particular interest:


[Figure 1.1 (diagram omitted): the DSC engine with image processing and image compression blocks, fed by a CCD sensor (with driver and timing generator), correlated double sampling, automatic gain control and an A/D converter, and interfaced to an LCD display, NTSC/PAL output, flash memory, Universal Serial Bus and RS232.]

Figure 1.1. Digital Still Camera System

• Fault pixel correction: Large pixel CCD-arrays may have defective pixels. During the normal operation of the DSC, the image values at the faulty pixel locations are computed using an interpolation technique.

• CFA interpolation: The nature of the front-end is such that only one of R, G or B values is available for each pixel. The other values need to be interpolated from the neighboring pixels.

• Color space conversion: While the CCD sensor produces RGB values, typical image compression techniques use YCrCb. These values are weighted sums of the RGB values.

• Edge enhancement: CFA interpolation introduces low-pass filtering in the image and this needs to be corrected to restore image sharpness.

• Compression: To reduce memory requirement, images are compressed typically using the JPEG compression scheme. This involves DCT computation.


[Figure 1.2 (diagram omitted): the DSC image pipeline from the optical lens and CCD (R/G/B color filter array) through analog processing and A/D, black clamp, lens distortion compensation, fault pixel correction, white balance, CFA color interpolation, gamma correction, auto exposure and auto focus, edge detection, false color suppression, RGB to YCrCb conversion, JPEG compression and write to flash, with scaling for monitor/LCD preview.]

Figure 1.2. DSC Image Pipeline

The common characteristic of each of these blocks is that they require computation of weighted sums of pixel values. The optimal computation of such weighted sums is the main concern of this book. Also note that the coefficients in most of the weighted-sum computations in a DSC are constant. Some of the techniques presented in this book are specifically targeted to such situations.
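The color space conversion step above, for instance, is nothing but a constant-coefficient weighted sum per output component. A sketch using the commonly used ITU-R BT.601 coefficients (the book does not prescribe a particular coefficient set; these rounded values are illustrative):

```python
def rgb_to_ycrcb(r, g, b):
    """Each output is a constant-coefficient weighted sum of R, G, B."""
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cr =  0.500 * r - 0.419 * g - 0.081 * b
    cb = -0.169 * r - 0.331 * g + 0.500 * b
    return y, cr, cb
```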

1.2. The Design Process: Constraints and Alternatives

The process of designing a System-on-a-Chip (SoC), such as for the preceding example, not only involves integrating processor cores with memory, peripherals and custom hardwired logic but also involves developing the software that implements the desired functionality. These hardware and software components are strongly interdependent and hence need to be co-developed.

[Figure 1.3 (diagram omitted): the codesign flow from System Specification through System Partitioning (guided by Estimators) into Hardware Synthesis and Software Synthesis (both drawing on a Library), followed by System Validation.]

Figure 1.3. Hardware-Software Codesign Methodology for a System-on-a-chip

Figure 1.3 shows the methodology for designing an embedded real-time digital signal processing SoC. The design process starts with a specification of the system in terms of its functionality and the design constraints/objectives. Various mechanisms exist to specify the system functionality. One approach is to use a high-level specification language such as Silage [25] or Lustre [24]. CAD systems such as Ptolemy [79], DSP-Station¹ from Mentor Graphics and COSSAP from Synopsys provide block diagram editors that support hierarchical system specification. The blocks represent various functions and their interconnections represent the data flow. These systems support both the synchronous dataflow (SDF) and the dynamic dataflow (DDF) models [10, 79] for capturing the specifications of DSP algorithms. These environments also provide a rich library of commonly used DSP functions such as filters, FFT, linear transforms, matrix multiplication etc. This can significantly reduce the time required to specify the functionality of an embedded DSP system.

Area, delay (performance) and power constitute the three important design constraints for most systems.

The area constraint is driven primarily by considerations of cost. Area efficient implementation results in a smaller die size and hence is more cost effective. It also enables integrating more functionality on a single chip.

¹ Product and company names appearing here and elsewhere in the book are trademarks owned by the respective companies.


The performance requirements of a system are driven by its data processing needs. For real-time embedded DSP systems, throughput is the primary performance criterion (although latency is also important for a two-way communication system). The performance constraint is thus dependent on the rate at which the input signals are sampled and on the complexity of processing to be performed.

Low power dissipation is a key requirement for portable, battery operated systems as it extends battery life. It also enables operation of the system with smaller batteries thus reducing the weight of the handheld device. Low power dissipation also helps reduce the packaging cost (plastic instead of ceramic) and eliminate/reduce cooling (heat sinks) overhead. Finally, lower power dissipation increases the reliability of the device. The generated heat can be a limiting factor for integrating a large number of devices on a single chip.

Low area, high performance and low power dissipation are often conflicting requirements. The system design process hence involves performing appropriate trade-offs so as to meet the desired constraints. In DSP systems, performance is dictated by the Nyquist sampling rate; the tradeoff is then between area and power.
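For instance (an illustrative calculation of ours, not an example from the book): speech sampled at 8 kHz and passed through a 64-tap FIR filter demands 64 multiply-accumulates every 125 microseconds, i.e. about 512,000 MACs per second. Any implementation that sustains this rate meets the performance constraint; going faster earns nothing, so the remaining slack is best spent reducing area and power.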

In addition to the area, delay and power constraints there are other design considerations that significantly influence the choice of the target implementation. These include:

• Shorter design cycle time for faster time-to-market.

• Flexibility of making changes to cater to changes in an evolving standard.

• Field upgradability for longer time to obsolescence.

• Re-use and customizability for amortization of design cost over several variants of a product.

All the above requirements imply some sort of programmability. The downside of a programmable implementation, however, is a penalty in terms of either area or power or both.

Fortunately, there is an interesting reason why programmable implementations are becoming increasingly feasible for DSP systems. With the advances in semiconductor technology, digital circuits can be fabricated with increasing chip density and can operate at increasing speeds. However, some of the fundamental properties of signals and systems in nature have remained more or less the same. These include parameters such as the frequency range of signals audible to the human ear, the frequency range of light visible to the human eye, the duration of persistence of human vision etc. Thus more and more real-time DSP functions which earlier required dedicated hardwired solutions can now be implemented using a programmable processor for the same cost. Increasing speeds mean that power can be reduced by dropping the clock rate and the operating voltage.

For any technology, however, a hardwired implementation is always more efficient in area and power than a software one. The system design methodology must trade programmability for area and power by considering implementation technologies with varying degrees of programmability. These range from software programmable solutions offered by programmable processors and hardware programmable solutions offered by FPGAs to dedicated hardwired functions implemented as standard cell or gate-array based ASICs.

An important step in the system design process, therefore, is to partition the system into various components and decide on the implementation approach for each component. This decision process involves determining whether to implement a component in hardware or software, and also assigning area, delay and power budgets so as to meet the system level design constraints. For a given function to be implemented in hardware or software, multiple alternatives exist, each representing a different data point in the area-delay-power space.

Most approaches to system partitioning [9, 21, 31] model it as a combinatorial optimization problem and use integer programming or other heuristic techniques to arrive at a solution. However, these approaches assume the availability of area, delay and power estimates for different implementation alternatives. The quality of partitioning thus depends on the accuracy of these estimates. A serious barrier to accurate estimation is that the deeper we move into submicron geometries, the lesser the correlation between high-level descriptions (or even optimized logic equations) and the size and speed of the circuit [69]. One approach to address this limitation is to actually perform hardware/software synthesis [71] and extract the design parameters. However, this method is time consuming and can limit the search space that can be explored. The other approach is to partition the system in terms of pre-characterized library functions. While it is virtually impossible to build a comprehensive library of functions which can realize any system behavior, such an approach can successfully be used for designing systems belonging to specific application domains such as DSP. Most DSP systems can be characterized in terms of the core algorithms or kernels they use. These include functions such as filtering (both Finite Impulse Response (FIR) and Infinite Impulse Response (IIR) filtering), correlation and linear transforms (matrix multiplication). All these perform weighted-sum (Σ A[i]X[i]) as the core computation. This class of DSP algorithms forms the focus of this book.

The optimality of a system partition can be greatly influenced by providing a rich set of implementation alternatives. This book covers the entire solution space as shown in figure 1.4 for realizing weighted-sum based DSP kernels. It represents implementation styles that offer varying degrees of programmability and perform weighted-sum computation with or without a hardware multiplier.


[Figure 1.4 (diagram omitted): the solution space arranged by degree of programmability (high, medium, low). With a hardware multiplier: programmable digital signal processors, and implementation using hardware multiplier(s) and adder(s). Without a hardware multiplier: processors with no dedicated hardware multiplier, distributed arithmetic (DA) based implementation, residue number system (RNS) based implementation, and implementation using adders and shifters.]

Figure 1.4. Solution Space for Weighted-Sum Computation

The implementation style marked 'X' (hardware multiplier with no programmability) has not been considered, as such an implementation does not offer any advantage and hence is not useful.

The synthesis process involves taking the system through various levels of design abstraction. These include behavioral/algorithmic level, architectural/register transfer level, logic level, circuit level and layout level abstractions for the hardwired functions and behavioral/algorithmic level, high-level language program level, assembly language program level and object code level abstractions for the functions implemented in software. Various studies have shown [60] that design decisions taken at higher levels of abstraction have a much bigger impact on the area-delay-power characteristics of a system. The book therefore focuses on behavioral level transformations to achieve the desired area-delay-power tradeoffs.

1.3. Organization of the Book

The rest of the book is organized as follows.

Chapter 2 presents a comprehensive framework that encapsulates techniques for low power implementation of DSP algorithms on programmable DSPs. The sources of power dissipation in various components of programmable DSPs (e.g. data paths, busses) are identified and their measures established. Mutually complementary techniques for power reduction in these components are then described. The techniques are then specialized for weighted sum computations and then for FIR filters.


Chapter 3 presents implementations using hardware multiplier(s) and adder(s). It evaluates the effectiveness of various DFG transformations such as parallel processing, pipelining, re-timing and loop-unrolling with respect to weighted sum computation. It shows that the parallel processing technique, while reducing energy dissipation, may result in increased peak power dissipation. Most of these transforms do not impact computational complexity and achieve power reduction at the expense of increased area. On the other hand, multirate architectures, which are presented next, achieve power reduction by reducing the computational complexity of FIR filters. A detailed analysis is presented to show how multirate architectures can realize low power FIR filters with minimal overhead in datapath area.

Distributed Arithmetic (DA) has been presented in the literature as one of the approaches for multiplier-less implementation of weighted-sum computation. In Chapter 4, various techniques are discussed for deriving multiple DA based structures that represent different data-points in the area-delay space. The chapter looks at improving area-efficiency of DA based implementations and specifically shows how the flexibility in coefficient partitioning can be exploited to reduce the area of a DA structure using two look-up-tables. The chapter also addresses the problem of reducing power dissipation in the input data shift-registers of DA based FIR filters. This is achieved using a technique that is based on a generic nega-binary representation scheme which is customized for a given distribution profile of input data values so as to minimize toggles in the shift-registers.

For non-adaptive signal processing applications in which the weight values are constant and known at design time, an area-efficient realization can be achieved by implementing the weighted sum computation using shift and add operations. Chapter 5 presents a technique based on common subexpression precomputation for reducing the number of additions. The chapter also presents techniques that use different number representation schemes (such as Canonical Signed Digit (CSD)) and transform signal flow graphs to further minimize the number of additions. Finally, the chapter discusses high-level synthesis of multi-precision DFGs - a useful technique to minimize computation at the bit-precision level.

Chapter 6 focuses on optimized code generation of multiplication-free linear transforms. These transforms perform weighted-sum computation with the weights being restricted to {0, 1, -1} and can be realized using add and subtract operations. Both one dimensional and two dimensional transforms are considered. The code generation is targeted to both a register-rich architecture and a single register, accumulator based architecture, and aims at reducing power dissipation. The chapter also presents DAG transformations to further improve performance and reduce power dissipation of the multiplication-free linear transforms.


Residue Number Systems (RNS) have been proposed for high-speed parallel implementation of addition, subtraction and multiplication operations. Chapter 7 describes RNS based implementation of the weighted-sum computation and presents transformations that aim at reducing area, delay and power dissipation of the implementation. Chapter 7 also presents RNS as a transformation to improve performance and reduce power dissipation of DSP algorithms which need the data and the coefficients to have a higher bit precision than what is supported by the target DSP architecture.

To tie up all these techniques, a methodology is presented in Chapter 8 to systematically identify transformations that exploit the characteristics of a given DSP algorithm and of the implementation style to achieve tradeoffs in the area-delay-power space.

Chapter 9 summarizes the key topics covered in this book to address the VLSI synthesis and optimization of DSP kernels - primarily the weighted-sum based kernels.

1.4. For the Reader

This book is meant for practicing DSP system designers, who understand that optimal design can never be a push-button activity. In summary, designers will find the following in this book:

• Description and analysis of several algorithmic and architectural transformations that help achieve the desired tradeoffs in the area-delay-power space for various implementation styles.

• Automated and semi-automated techniques for applying these transformations.

• Classification of the transformations based on the properties that they exploit and their encapsulation in a design framework. A methodology that uses the framework to systematically explore the application of these transforms depending on the characteristics of the algorithm and the target implementation style.

As a caveat, the designer needs to use his expertise and judgement to make a proper selection from these techniques and implement them properly in the context of the specific system being designed.


Chapter 2

PROGRAMMABLE DSP BASED IMPLEMENTATION

A programmable DSP is a processor customized to implement digital signal processing algorithms efficiently. The customization is based on the following characteristics of DSP algorithms:

• Compute Intensive: Most DSP kernels are compute intensive with weighted-sum being the core computation. A programmable DSP hence incorporates a dedicated hardwired multiplier and its datapath supports single cycle multiply-accumulate (MAC) operation.

• Data Intensive: In most DSP kernels, each multiply operation of the weighted-sum computation is performed on a new set of coefficient and data values. A programmable DSP is hence pipelined with an operand read stage before the execute stage, has an address generator unit that operates in parallel with the execute datapath and uses a Harvard architecture with multiple busses to program and data memory.

• Repetitive: DSP algorithms are repetitive both at micro-level (e.g. multiply-accumulate operation repeated N times during an N-term weighted-sum computation) and at macro-level (e.g. kernels such as filtering repeated every time a new data sample is read). A DSP architecture hence uses special instructions (such as RPT MAC) and control mechanisms to support zero overhead looping.

This chapter focuses on low power realization of DSP algorithms on a programmable DSP. The techniques are targeted to a generic DSP architecture shown in figure 2.1.

The architecture has two separate memory spaces (program and data) which can be accessed simultaneously. This is similar to the Harvard architecture employed in most of the programmable DSPs [95, 96]. During weighted-sum computation, one of the memories can be used to store coefficients (weights) and the other to store input data samples. The arithmetic unit performs fixed point computation on numbers represented in 2's complement form. It consists of a dedicated hardware multiplier and an adder/subtracter connected to the accumulator so as to be able to efficiently execute the multiply-accumulate (MAC) operation. This again is similar to the single-cycle MAC instruction supported by most of the programmable DSPs [34, 35, 95, 96]. An N-term weighted-sum computation can be performed using this architecture as a sequence of N multiply-accumulate operations.

[Figure 2.1 (diagram omitted): the generic DSP architecture, with a program/coefficient memory addressed by the program counter, a data memory addressed by data read and data write address registers, and the CPU connected to both memories.]

Figure 2.1. Generic DSP Architecture
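A behavioral sketch of how this architecture executes the N-term weighted sum (an illustrative model of ours, not a cycle-accurate one): each of the N steps pairs a coefficient-memory read with a data-memory read and one MAC.

```python
def weighted_sum_on_dsp(coeff_mem, data_mem, n):
    """N-term weighted sum as a sequence of N multiply-accumulates."""
    acc = 0
    for i in range(n):
        a = coeff_mem[i]   # program/coefficient memory read
        x = data_mem[i]    # data memory read (separate bus, same cycle)
        acc += a * x       # single-cycle MAC in the datapath
    return acc
```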

The sources of power dissipation in CMOS circuits can be classified into dynamic power, short circuit current and leakage current [15]. For most CMOS designs, dynamic power is the main source of power dissipation and is given by equation 2.1.

P_dynamic = C_switch · V² · f    (2.1)

where V is the supply voltage, f is the operating frequency and C_switch is the switching capacitance given by the product of the physical capacitance being charged/discharged and the corresponding switching activity. The low power transformations described in this chapter aim at reducing one or more of these factors while maintaining the throughput of the algorithm.
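A quick numerical illustration of equation 2.1 (the figures below are made up, purely to show the sensitivities):

```python
C_switch, V, f = 20e-12, 3.3, 100e6   # 20 pF switched at 3.3 V, 100 MHz
P = C_switch * V**2 * f               # equation 2.1: ~0.022 W (22 mW)
# The quadratic dependence on V is the key lever: at 2.4 V the same
# switching activity dissipates (2.4 / 3.3)**2, i.e. about 53%, of this.
```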


The rest of this chapter is organized as follows. Section 2.1 identifies the main sources of power dissipation and develops measures for estimating the power dissipated in each of the sources. Various techniques for low power realization of DSP algorithms are discussed in section 2.2. Section 2.3 presents algorithmic and architectural transformations which are specific to weighted-sum computation. Section 2.4 presents additional transformations for low power realization of FIR (Finite Impulse Response) filters - a DSP kernel which is based on weighted-sum computation. Finally, section 2.5 integrates various transformations into a comprehensive framework for low power realization of FIR filters on programmable DSPs.

2.1. Power Dissipation - Sources and Measures

2.1.1. Components Contributing to Power Dissipation

Each step in a weighted-sum computation involves getting the appropriate coefficient and data values and performing a multiply-accumulate computation. Thus the address and data busses of both the memories and the multiplier-adder datapath experience the highest signal activity during a weighted-sum computation. These hardware components therefore form the main sources of power dissipation.

2.1.2. Measures of Power Dissipation in Busses

For a typical embedded processor, address and data busses are networks with a large capacitive loading [89]. Hence signal switching in these networks has a significant impact on power consumption. In addition to the net capacitance of each signal (bit) of the bus, inter-signal cross-coupling capacitance also contributes to the bus power dissipation. The amount of this capacitance depends on the processing technology and on the spacing between adjacent metal lines carrying the signals. With deep submicron technology, there is a trend towards an increasing contribution of the cross-coupling capacitance to the total bus capacitance. The data presented in [11] indicates that the power dissipation due to the cross-coupling capacitance varies depending on the adjacent signal values. The current required for signals to switch between 5's (0101b) and A's (1010b) is about 25% more than the current required for the signals to switch between 0's (0000b) and F's (1111b) [11].

The Hamming distance between consecutive signal values and the number of adjacent signals toggling in opposite direction thus form the measures of power dissipation in the busses.
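These two measures translate directly into code. The following Python sketch (an illustration, with an assumed 8 bit bus width and arbitrary example values) computes the total cost of a bus trace as the sum of the Hamming distance and the adjacent opposite-direction toggle count between consecutive values.

def hamming(a, b):
    """Number of bit positions in which two consecutive bus values differ."""
    return bin(a ^ b).count("1")

def opposite_toggles(a, b, width=8):
    """Count adjacent bit pairs (i, i+1) that switch in opposite directions."""
    diff = a ^ b
    count = 0
    for i in range(width - 1):
        both_toggle = (diff >> i) & 1 and (diff >> (i + 1)) & 1
        if both_toggle and ((b >> i) & 1) != ((b >> (i + 1)) & 1):
            count += 1  # one bit rose while its neighbour fell
    return count

def bus_cost(values, width=8):
    """Total switching cost of a sequence of values on the bus."""
    return sum(hamming(x, y) + opposite_toggles(x, y, width)
               for x, y in zip(values, values[1:]))

# 0101b <-> 1010b is the worst case for cross-coupling mentioned above
print(bus_cost([0x05, 0x0A, 0x05]))   # prints 14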

2.1.3. Measures of Power Dissipation in the Multiplier

Due to high speed requirements, parallel array architectures are used for implementing dedicated multipliers in programmable DSPs [42, 91]. The logic diagram of a 4x4 bit parallel array multiplier is shown in figure 2.2. The multiplier consists of AND gates to compute partial inner products and an array of adders to compute the complete product. The power dissipation of a multiplier is directly proportional to the number of switchings at all the internal nodes of the multiplier. These are the outputs of the AND gates and of the 1 bit adders. The number of internal node switchings depends on the multiplier input values. This dependence can be analyzed using the 'Transition Density' measure of circuit activity.

Figure 2.2. 4x4 Array Multiplier

'Transition Density' [68] of a signal is the average number of transitions/toggles of the signal per unit time. Consider a combinational logic block with inputs x1, x2, ..., xn and the output Y. Let T_x1, T_x2, ..., T_xn be the transition densities at the inputs and let P_x1, P_x2, ..., P_xn be the probabilities of the input signal values being 1. Assuming the input values to be mutually independent, the transition density at the output Y is given by

T_Y = Σ_{i=1}^{n} P(Bd(Y, x_i)) · T_xi   (2.2)

where Bd(Y, x_i) is the Boolean difference of Y w.r.t. x_i.

Using equation 2.2, it can be shown that the transition density at the output of a two input AND gate (y = a & b) is given by T_y = (T_a · P_b + T_b · P_a).


The probability P_y of the AND gate output being 1 is given by (P_a · P_b). The transition density at the output of a two input XOR gate (y = a ⊕ b) is given by (T_y = T_a + T_b). These relationships indicate that the multiplier power is directly dependent on the transition densities and the probabilities of the multiplier inputs.
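These relations can be used for quick estimates of gate-level activity. The following Python sketch simply evaluates them for given input statistics; the numeric values are arbitrary and chosen only for illustration.

def and_gate_activity(Ta, Pa, Tb, Pb):
    """y = a & b: output transition density and probability of being 1,
    assuming mutually independent inputs (equation 2.2 applied to AND)."""
    return Ta * Pb + Tb * Pa, Pa * Pb

def xor_gate_activity(Ta, Tb):
    """y = a XOR b: the Boolean difference w.r.t. either input is 1."""
    return Ta + Tb

print(and_gate_activity(Ta=0.2, Pa=0.5, Tb=0.1, Pb=0.25))  # (0.1, 0.125)
print(xor_gate_activity(Ta=0.2, Tb=0.1))                   # ~0.3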

The transition densities of the multiplier inputs depend on the Hamming distance between successive input values. The input signal probabilities depend on the number of 1s in the input signal values of the multiplier. These two thus form the measures of multiplier power dissipation.

It can also be noted in figure 2.2 that the transitions in input bits B0 and B1 affect more internal nodes than the transitions in input bit B3. In general, transitions in lower order bits of the input signal contribute more to the multiplier power dissipation than the higher order bits. Thus while minimizing transition densities of all the input bits is important, higher gains can be achieved by focusing on lower order bits of the input signals.

These measures have been experimentally verified by simulating an 8x8 parallel array multiplier. One input of the multiplier was kept constant and 1000 random numbers were fed to the other input. The total toggle count at all the internal nodes and the inputs of the multiplier was measured. The toggle count measurement was carried out for all 256 (0 to 255) values of the constant. This data was then used to compute the average toggle count as a function of the number of 1s in the constant input. This relationship, shown in figure 2.3, confirms the analysis that the multiplier power is a direct function of the number of 1s in its inputs.

The second experiment used sets of 1000 random numbers such that the Hamming distance between consecutive numbers within a set was constant. Seven such sets of numbers were generated corresponding to seven Hamming distance values (1 to 7). The total toggle count was measured by applying these random numbers to one input of the multiplier while keeping the other input constant. The toggle count vs Hamming distance relationship for four different constants is shown in figure 2.4. It confirms the analysis that the multiplier power is a direct function of the Hamming distance between its successive inputs.
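The stimulus for such an experiment is easy to reproduce. The sketch below is one assumed way (not taken from the book) of building a sequence of 8 bit values in which every consecutive pair differs in exactly d bit positions, by flipping d distinct randomly chosen bits at each step.

import random

def fixed_hd_sequence(d, length=1000, width=8, seed=0):
    """Random values where consecutive pairs have Hamming distance d."""
    rng = random.Random(seed)
    x = rng.getrandbits(width)
    seq = [x]
    for _ in range(length - 1):
        for bit in rng.sample(range(width), d):  # flip d distinct bits
            x ^= 1 << bit
        seq.append(x)
    return seq

seq = fixed_hd_sequence(d=3)
assert all(bin(a ^ b).count("1") == 3 for a, b in zip(seq, seq[1:]))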

In addition to the array multiplier, other multiplier topologies based on Booth encoding and the Wallace tree are also common in programmable DSPs [42]. While the measure of Hamming distance between successive inputs applies to all these topologies, the measure based on input data pattern may vary across topologies. For example, power analysis of a Booth multiplier [36] shows that the power dissipation is directly dependent on the number of 1s in the Booth encoded input. This chapter focuses on techniques for reducing the Hamming distance in the successive inputs of the multiplier. These techniques are hence applicable to all multiplier topologies.

Figure 2.3. Toggle Count as a Function of Number of Ones in the Multiplier Inputs

Figure 2.4. Toggle Count as a Function of Hamming Distance between Successive Inputs

2.2. Low Power Realization of DSP Algorithms

This section presents techniques and architectural extensions that can be applied to most DSP applications for reducing power dissipation.

2.2.1. Allocation of Program, Coefficient and Data Memory

Since in most DSP kernels the code within the loops executes sequentially, the transitions in the program memory address bus can be minimized by appropriately selecting the start locations of the code segments.

Figure 2.5. Address Bus Power Dissipation as a Function of Start Address

The same technique can also be applied for storing coefficients and data values in the memory for weighted-sum computation, in which these are accessed sequentially. Figure 2.5 shows the sum of the total Hamming distance and the total number of adjacent signals toggling in opposite directions in the consecutive addresses, as a function of the start location for a 24 word memory block. The analysis shows that a start address of 0x14 results in 14% more power dissipation in the address busses compared to a start address of 0x00. The power dissipation in the address busses can thus be reduced by aligning the start addresses of the program, coefficient and data blocks with the beginning of a memory page. The capacitive loading for the address bus transitions, and hence the power dissipation, can also be reduced by storing the most frequently accessed program segments, coefficients and data in on-chip memory.
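The figure-2.5 style analysis can be reproduced with a few lines of code. The sketch below is an illustration (the 8 bit address width is an assumption) that sums the Hamming distance and adjacent opposite-direction toggles over 24 sequential addresses for two candidate start addresses.

def transition_cost(a, b, width=8):
    """Hamming distance plus adjacent opposite-direction toggles."""
    d = a ^ b
    cost = bin(d).count("1")
    for i in range(width - 1):
        # bits i and i+1 both toggled, and their new values differ
        if ((d >> i) & 3) == 3 and ((b >> i) & 1) != ((b >> (i + 1)) & 1):
            cost += 1
    return cost

def block_cost(start, size=24):
    addrs = list(range(start, start + size))
    return sum(transition_cost(a, b) for a, b in zip(addrs, addrs[1:]))

for start in (0x00, 0x14):
    print(hex(start), block_cost(start))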

2.2.2. Bus Coding

2.2.2.1 Gray Coded Addressing

The property of sequential memory access by most DSP algorithms can be further exploited by using gray coded addressing [61, 89, 54] to reduce power dissipation in the address busses. This can result in a power saving of close to 50% compared to binary sequential access. A binary address generator can be modified to perform gray coded addressing using a binary-to-gray converter as shown in figure 2.6.

Figure 2.6. Binary to Gray Code Conversion
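The binary-to-gray converter of figure 2.6 is a single layer of XOR gates: each gray bit is the XOR of two adjacent binary bits, with the MSB passed through. A minimal behavioral sketch:

def bin_to_gray(b):
    """G = B XOR (B >> 1): each gray bit is the XOR of adjacent binary bits."""
    return b ^ (b >> 1)

for b in range(8):
    print(f"{b:03b} -> {bin_to_gray(b):03b}")

# consecutive gray codes differ in exactly one bit, hence the power saving
assert all(bin(bin_to_gray(i) ^ bin_to_gray(i + 1)).count("1") == 1
           for i in range(255))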

Since the gray coded addressing changes the order in which memory locations are accessed, the memory contents such as program, coefficients and data need to be suitably re-arranged to ensure correct functionality. Figure 2.7 shows a simple scheme that can be used to reorganize the memory contents at design (compile) time.

Since the data memory access is sequential only during certain functions such as weighted-sum computation, the gray coded addressing on the data memory address bus needs to be selectively activated at run-time. Figure 2.8 shows a scheme that supports such programmable binary-to-gray code conversion.

It can be noted that for certain functions such as finding an average or finding a range (minimum and maximum) of N data values, the sequence of accessing the data memory need not be preserved and hence the data memory need not be reorganized.

2.2.2.2 T0 Coding

The power dissipation in the address busses during sequential access can be further reduced by using the asymptotic zero-transition encoding referred to as T0 coding in [8]. Figure 2.9 shows the memory access scheme based on T0 coding.

Figure 2.7. Memory Reorganization to Support Gray Coded Addressing

Figure 2.8. Programmable Binary to Gray Code Converter

At the beginning of the series of sequential accesses, the processor sends the start location on the address bus and the same is loaded in the counter which

Figure 2.9. T0 Coding Scheme

is part of the memory wrapper supporting T0 coding. During the sequential accesses, the processor holds the address bus at the same initial value (hence 0 toggles) and sets the 'increment' signal high to indicate sequential access. During each cycle the counter in the memory wrapper is incremented to generate the next sequential access. While the T0 coding scheme enables zero toggles in the address bus, it requires an additional signal and also results in a marginal area, delay and power overhead due to the counter and the multiplexer logic in the memory wrapper.

2.2.2.3 Bus Invert Coding

The power dissipation in the data busses depends on the Hamming distance between successive values. Thus for a B bit bus, there can be up to B toggles per transfer. The bus invert coding [88] technique selectively inverts the data value so as to limit the maximum number of toggles to B/2, thus reducing the power dissipation. As shown in figure 2.10, in this scheme, if the Hamming distance between the new and the current data values is higher than B/2 (for a B bit bus), the new data value is inverted before being sent to the destination. The source also asserts the 'invert' signal which is used by the destination to recover the correct value by inverting the received data. This technique has the overhead of an extra signal and the encoding and decoding circuitry at the source and the destination respectively.

Figure 2.10. Bus Invert Coding Scheme
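A behavioral model of bus invert coding is sketched below (an illustration; the 8 bit width is an assumption). The encoder tracks the value currently on the bus and inverts the outgoing word whenever more than B/2 bits would otherwise toggle.

WIDTH, MASK = 8, 0xFF

def bi_encode(prev_bus, data):
    """Return (bus_value, invert) so that at most WIDTH/2 bus bits toggle."""
    if bin((prev_bus ^ data) & MASK).count("1") > WIDTH // 2:
        return (~data) & MASK, 1
    return data, 0

def bi_decode(bus_value, invert):
    return (~bus_value) & MASK if invert else bus_value

bus = 0x00
for word in (0xFF, 0x0F, 0xF0):
    bus, inv = bi_encode(bus, word)
    assert bi_decode(bus, inv) == word  # destination recovers the data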

2.2.3. Instruction Buffering

DSP applications typically involve repeated execution of DSP kernels. These kernels are smaller in code size but take up a significant portion of the run time. Instruction buffering has been proposed [7] as an architectural extension to reduce power dissipation during the iterative execution of DSP kernels. The scheme, as shown in figure 2.11, involves adding an instruction buffer to the processor CPU. During the first iteration, as the code is executed, the fetched instructions are also stored in the instruction buffer. For the following iterations, the instructions are fetched from the instruction buffer, thus requiring no access to the program memory and thereby reducing power dissipation.

This technique can be extended further to eliminate power dissipation in the instruction decoder. As has been shown in [7], since DSPs have many complex instructions that are executed in a single cycle, the decoder power dissipation forms a significant portion of the total power dissipation of the CPU. In this scheme (figure 2.12), during the first iteration, the output of the instruction decoder is stored in the decoded instruction buffer which is part of the CPU. During the following iterations the decoder output is fetched from the decoded instruction buffer, thus eliminating the power dissipated for program fetch and decode. Since the decoder output is typically much wider than the instruction width, the area penalty due to the decoded instruction buffer is much higher than that of the instruction buffer.

Figure 2.11. Instruction Buffering

Figure 2.12. Decoded Instruction Buffering

2.2.4. Memory Architectures for Low Power

The sequential nature of memory accesses by DSP algorithms can be used to organize the memory for low power without incurring any performance penalty. One approach is to partition the memory into two halves corresponding to data at odd and even addresses respectively. During sequential access no two consecutive addresses can be both odd or both even. The two memory halves thus get accessed on alternate cycles and can hence be clocked at half the CPU clock, resulting in power reduction. Figure 2.13 shows such a memory architecture.

Figure 2.13. Memory Partitioning for Low Power

Figure 2.14. Prefetch Buffer

The property of sequential access can also be exploited by using a wider memory and reading two words per memory access. The data can be stored in a pre-fetch buffer such that while the memory is accessed at half the CPU clock rate, the CPU gets the data on every cycle during sequential access. Figure 2.14 shows such a memory architecture.

It can be noted that this scheme can be generalized such that for B bit data, the memory width can be set to N*B to read N words per memory access and consequently clock the memory at 1/N times the CPU clock. The prefetch buffer scheme can also be used in conjunction with memory partitioning to further reduce power dissipation.

Figure 2.15. Bus Reordering Scheme for Power Reduction in the PD Bus

2.2.5. Bus Bit Reordering

In case of a system-on-a-chip design, the designer integrates the DSP CPU core with the memories and the application specific glue logic on a single chip. In such a design methodology the designer has control over the placement of the various memories and the routing of the memory-CPU busses. This flexibility can be exploited to develop a layout level technique for reducing power dissipation in the program/coefficient memory data bus. The technique aims at reducing power dissipation due to cross-coupling capacitance. One approach to achieve this is to increase the spacing between the bus lines. This however results in increased area. A better approach is to reorder the bus bits in such a way that the number of adjacent signals toggling in opposite directions is minimized. Figure 2.15 illustrates this approach and shows how the bus signals A0 to A7 can be reordered in the sequence A2-A0-A4-A1-A3-A5-A7-A6.

For a given DSP application the program execution can be traced (using simulation) to get signal transition information on the program memory data bus. This data can then be used to arrive at an optimum bus order that minimizes the number of adjacent signal toggles.

For an N bit bus, an N node fully connected graph is constructed. The edges are assigned weights Wi,j given by the number of times bit 'i' and bit 'j' transition in opposite directions. The problem of finding an optimum bit order can then be mapped onto the problem of finding the lowest cost Hamiltonian Path in an edge-weighted graph, i.e. the traveling salesman problem.

Table 2.1. Adjacent Signal Transitions in Opposite Direction as a Function of the Bus-reordering Span

#taps  Initial  ±1  ±2  ±3  ±4  ±5  ±6  ±7  ±8
24     38       32  18  18  16  12   8   8   8
27     38       30  24  16   4   4   4   4   4
32     26       14  12  10  10  10  10  10  10
36     32       20  14  10   8   8   8   8   8
40     40       24  24  18  18  18  18  18  16
64     62       38  38  34  34  28  22  22  22
72     68       54  40  40  40  40  40  40  40
96     84       74  64  60  58  54  54  54  54
128    112      94  84  84  78  78  78  78  78

As can be noted from figure 2.15, the bus bit reordering scheme has the downside of increasing the bus netlength and hence the interconnect capacitance. This overhead can be minimized if the reordering span for each bus bit is kept within a limit. For example, the bus reordering scheme shown in figure 2.15 uses a reordering span of ±2. The optimum bit order thus needs to satisfy the constraint on the maximum reordering span. This is achieved by suitably modifying the edge weights such that all edge weights Wi,j are made infinite if |i - j| > MaxSpan.

The algorithm starts with the normal order as the initial order. It uses a hill-climbing based iterative improvement approach to arrive at the optimum bit ordering. During each iteration a new feasible order is derived and is accepted as the new solution if it results in a lower cost function (i.e. a lower number of adjacent signal transitions in opposite directions).
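A sketch of this heuristic is given below. It is an illustration, not the book's implementation: it interprets the span constraint as limiting how far each bit may move from its original position (consistent with the figure 2.15 example) and uses random pairwise swaps as the perturbation.

import random

def weight_matrix(trace, width):
    """w[i][j] = number of times bits i and j toggle in opposite directions."""
    w = [[0] * width for _ in range(width)]
    for a, b in zip(trace, trace[1:]):
        d = a ^ b
        for i in range(width):
            for j in range(i + 1, width):
                if (d >> i) & 1 and (d >> j) & 1 and \
                   ((b >> i) & 1) != ((b >> j) & 1):
                    w[i][j] += 1
                    w[j][i] += 1
    return w

def cost(order, w):
    """Opposite-direction toggles between physically adjacent bus lines."""
    return sum(w[order[k]][order[k + 1]] for k in range(len(order) - 1))

def feasible(order, max_span):
    return all(abs(bit - pos) <= max_span for pos, bit in enumerate(order))

def reorder_bits(trace, width=8, max_span=2, iters=5000, seed=0):
    rng = random.Random(seed)
    w = weight_matrix(trace, width)
    order = list(range(width))              # start from the normal order
    best = cost(order, w)
    for _ in range(iters):                  # hill climbing
        i, j = rng.sample(range(width), 2)
        cand = order[:]
        cand[i], cand[j] = cand[j], cand[i]
        if feasible(cand, max_span) and cost(cand, w) < best:
            order, best = cand, cost(cand, w)
    return order, best

rng = random.Random(1)
trace = [rng.getrandbits(8) for _ in range(1000)]
print(reorder_bits(trace))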

The impact of bit reordering on the power reduction was analyzed in the context of a DSP code that performs FIR filtering. Nine filters with the number of taps ranging from 24 to 128 were used. For each case, the algorithm was applied with the reordering span constraint ranging from ±1 to ±8.

The results in table 2.1 show a significant reduction in the number of adjacent signal transitions in opposite directions. It can also be noted that the reduction increases with the bus reordering span, which is expected. However, as mentioned earlier, a higher reordering span implies a higher interconnect length.

Figure 2.16 plots the average percentage reduction as a function of the bus reordering span. As can be seen from the plot, the incremental saving in the number of adjacent signal transitions gets smaller beyond the reordering span of ±4. For the span of ±4, the cross-coupling related power dissipation in the program memory data bus reduces on the average by 54%. A span of ±4 is hence the optimal choice for getting the most power reduction.

Figure 2.16. %Reduction in the Number of Adjacent Signal Transitions in Opposite Directions as a Function of the Bus Reordering Span

2.2.6. Generic Techniques for Power Reduction

In addition to the techniques discussed above, there are a few other low power techniques which can be applied in the context of programmable DSPs. These are listed below in bullet form with appropriate references which give a more detailed description.

• Cold scheduling [89]

• Opcode assignment [102]

• Clock gating [80]

• Guarded evaluation [94]

• Pre-computation logic [65]

• Instruction subsetting [20]

• Code Compression [37]

2.3. Low Power Realization of Weighted-sum Computation

While the techniques described in the earlier section can be applied to the low power realization of weighted-sum computation, this section presents additional low power techniques specific to weighted-sum implementation.

2.3.1. Selective Coefficient Negation

Figure 2.17. Coefficients of a 32 Tap Linear Phase Low Pass FIR Filter

During a weighted-sum computation, the coefficients (weights) are stored in the coefficient memory in 2's complement form. For a given number N and the number of bits B used to represent it, the number of 1s in the 2's complement representations of +N and -N can differ significantly. For example, both the 8 and 16 bit representations of 9 (00001001b, 00000000 00001001b) have two 1s, while the 8 bit representation of -9 (11110111b) has seven 1s and the 16 bit representation of -9 (11111111 11110111b) has fifteen 1s. Both the 8 and 16 bit representations of 63 (00111111b) have six 1s, while the 8 bit representation of -63 (11000001b) has three 1s and the 16 bit representation (11111111 11000001b) has eleven 1s.

For each coefficient A[i], either A[i] or -A[i] can be stored in the coefficient memory, depending on which value has fewer 1s in its 2's complement binary representation. If -A[i] is stored in the memory, the corresponding product (-A[i] · X[n - i]) needs to be subtracted from the accumulator so as to get the correct weighted-sum result. This technique of selective coefficient negation reduces the number of 1s in the coefficient input of the multiplier. It also reduces the Hamming distance between consecutive coefficients, especially in cases where a small positive coefficient is followed by a small negative coefficient. This technique can thus result in a significant reduction in the multiplier power and also the coefficient data bus power.
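The selection step is a simple population-count comparison. A sketch, assuming 16 bit 2's complement coefficients:

def ones16(v):
    """Number of 1s in the 16 bit 2's complement representation."""
    return bin(v & 0xFFFF).count("1")

def negate_selectively(coeffs):
    """Store A[i] or -A[i], whichever has fewer 1s; flag subtractions."""
    stored, subtract = [], []
    for a in coeffs:
        if ones16(-a) < ones16(a):
            stored.append(-a)
            subtract.append(True)    # this product must be subtracted
        else:
            stored.append(a)
            subtract.append(False)
    return stored, subtract

# -9 (fifteen 1s) and -63 (eleven 1s) are replaced by 9 and 63:
print(negate_selectively([9, -9, 63, -63]))
# ([9, 9, 63, 63], [False, True, False, True])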

If the coefficients to be negated follow a regular pattern, a modified MAC instruction that alternates between multiply-add and multiply-subtract can support selective coefficient negation. For example, analysis of many low pass filter coefficients shows that the coefficients to be negated follow a regular alternating pattern. As an example consider the coefficients of a 32 tap linear phase FIR filter shown in figure 2.17. It can be noted that the coefficients follow a repetitive pattern of two positive coefficients followed by two negative coefficients.


Table 2.2. Impact of Selective Coefficient Negation on Total Number of 1s in the Coefficients

#taps  #1s  #negated  %negated  #1s after selective  %reduction
             coeffs    coeffs    negation
16     120    6        37.5%      88                  26.7%
24     212   12        50.0%     144                  32.1%
32     230   12        37.5%     160                  30.4%
36     282   16        44.4%     142                  49.7%
40     312   18        45.0%     202                  35.3%
48     366   20        41.7%     200                  45.4%
64     498   30        46.8%     276                  44.6%
72     550   34        47.2%     336                  38.9%
96     782   48        50.0%     358                  54.2%
128    984   66        51.6%     426                  56.7%

In cases where the coefficients to be negated follow a random pattern, all such coefficients can be grouped together and the filtering performed using two loops - the first one performing repeated multiply-add and the second one performing repeated multiply-subtract.

Table 2.2 presents the impact of selective coefficient negation in case of 10 low pass FIR filters synthesized using the Parks-McClellan algorithm [73]. The results show that selective coefficient negation selects 37% to 51% of the coefficients for negation, and results in a 26% to 56% reduction in the total number of 1s in the coefficient values. As mentioned earlier, this reduction translates into power reduction in the multiplier.

2.3.2. Coefficient Ordering

Since the summation operation is both commutative and associative, the weighted-sum output is independent of the order of computing the coefficient products. Thus for a four term weighted-sum computation, the output can be computed as

Y[n] = A[0]·X[0] + A[1]·X[1] + A[2]·X[2] + A[3]·X[3]   (2.3)

or as

Y[n] = A[1]·X[1] + A[3]·X[3] + A[0]·X[0] + A[2]·X[2]   (2.4)

The weighted-sum computation also does not impose any restriction on how the coefficient and data values are stored. The address generator needs to comprehend the locations and generate the correct pair of addresses (to access the coefficient and the corresponding data sample value) for each product computation. The order of coefficient-data product computation directly affects the sequence of coefficients appearing on the coefficient memory data bus. Thus this order determines the power dissipation in the bus.

The following subsection formulates the problem of finding an optimum order of the coefficients such that the total Hamming distance between consecutive coefficients is minimized.

2.3.2.1 Coefficient Ordering Problem Formulation

For an N term weighted-sum computation, N! different coefficient orders are possible. The problem of finding the optimum order can be reduced to the problem of finding the lowest cost Hamiltonian Circuit in an edge-weighted graph, i.e. the traveling salesman problem. Since this problem is NP-complete, heuristics need to be developed to obtain a near-optimal solution in polynomial time.

The coefficient ordering problem can thus be formulated as a traveling sales­man problem. The coefficients map onto the cities and the Hamming distances between the coefficients map onto the distances between the cities. The optimal coefficient order thus becomes the optimal tour of the cities, where each city is visited only once and the total distance traveled is minimum.

Several heuristics have been proposed [33] to solve the general traveling salesman problem and also a special case of the problem where the distances between the cities satisfy the triangular inequality. For any three cities Ci, Cj and Ck, the triangular inequality property requires that

D(Ci, Ck) ≤ D(Ci, Cj) + D(Cj, Ck)   (2.5)

The coefficient ordering problem satisfies the triangular inequality as follows: Let Ai, Aj and Ak be any three coefficients and let Hij, Hjk and Hik be the Hamming distances between the (Ai, Aj), (Aj, Ak) and (Ai, Ak) pairs of coefficients respectively. Let Bij, Bjk and Bik be the sets of bit locations in which the coefficients (Ai, Aj), (Aj, Ak) and (Ai, Ak) differ respectively. The Hamming distances Hij, Hjk and Hik thus give the cardinality of the sets Bij, Bjk and Bik respectively. These sets satisfy the following relationship:

Bik = (Bij ∪ Bjk) − (Bij ∩ Bjk)   (2.6)

The cardinality of Bik is maximum when the set (Bij ∩ Bjk) is empty, and is then given by Hij + Hjk. Thus the Hamming distances satisfy the following relationship:

Hik ≤ (Hij + Hjk)   (2.7)

which is the triangular inequality.


The algorithms proposed [33] to solve this class of traveling salesman problems include nearest neighbor, nearest insertion, farthest insertion, cheapest insertion, nearest merger etc. Experiments with various low pass FIR filters show that in almost all cases, the nearest neighbor algorithm performs the best.

2.3.2.2 Coefficient Ordering Algorithm

Here is the algorithm for finding the optimum coefficient order.

Procedure Order-Coefficients-for-Low-Power
Inputs: N coefficients A[0] to A[N-1]
Output: A coefficient order which results in minimum total Hamming distance between successive coefficient values

/* build Hamming distance matrix */
for each coefficient A[i] (i = 0 to N-1) {
    for each coefficient A[j] (j = 0 to N-1) {
        Hd[i][j] = Count_no_of_Ones(A[i] ⊕ A[j])
    }
}
/* initialization */
Coefficient-Order-List = {A[0]}
Latest-Coefficient-Index = 0
/* build the coefficient order */
for (i = 1 to N-1) {
    Find A[j] such that (A[j] ∉ Coefficient-Order-List) and
        Hd[j][Latest-Coefficient-Index] is minimum
    Coefficient-Order-List = Coefficient-Order-List + A[j]
    Latest-Coefficient-Index = j
}

The 'Coefficient-Order-List' gives the desired sequence of coefficient-data product computations. Once the sequence of accessing the coefficients is identified, the coefficients and the corresponding data values can be reordered and appropriately stored in the memory, such that the desired sequence of coefficient-data product computations is achieved when the memory is accessed sequentially.

Table 2.3 shows the impact of the coefficient ordering technique on the total Hamming distance and the total number of adjacent signals toggling in opposite directions between successive coefficient values. This reduction directly translates into power savings in both the coefficient data bus and the multiplier.

The results show that the total Hamming distance can be reduced by 54% to 83% using coefficient ordering. This directly translates into a 54% to 83% saving in the coefficient memory data bus power. The total number of adjacent toggles in opposite directions is also reduced by 62% to 85% using coefficient ordering.

Table 2.3. Impact of Coefficient Ordering on Hamming Distance and Adjacent Toggles

#taps  H.d.     H.d.       %          Adj. Toggles  Adj. Toggles  %
       Initial  Optimized  reduction  Initial       Optimized     reduction
16     102       46        54.9%       8             3            62.5%
24     158       64        59.5%      20             4            80.0%
32     204       68        66.7%      22             7            68.2%
36     242       82        66.1%      28             7            75.0%
40     280       94        66.4%      32             8            75.0%
48     350      108        69.1%      50            12            76.0%
64     452      118        73.9%      54             8            85.2%
72     510      110        78.4%      52             6            88.5%
96     700      138        80.3%      64            11            82.8%
128    952      156        83.6%      84            12            85.7%

Since selective coefficient negation also helps in reducing the total Hamming distance between successive coefficient values, it can be applied in conjunction with coefficient ordering to achieve further power reduction.

2.3.3. Adder Input Bit Swapping

The bit-wise commutativity property of the ADD operation can be exploited to develop a technique that reduces the number of toggles in the busses that feed the inputs to the adder. This not only reduces power dissipation in these busses, it also reduces the power dissipated in the adder and the accumulator that drives one of the busses.

Bitwise commutativity implies that the result of an ADD operation is not affected even when one or more bits from one input are swapped with the corresponding bits in the second input. Consider two 4 bit numbers A = (a3,a2,a1,a0) and B = (b3,b2,b1,b0), where a3 and b3 are the MSBs of A and B respectively. It can be easily shown that

(a3,a2,a1,a0) + (b3,b2,b1,b0) = (b3,a2,a1,b0) + (a3,b2,b1,a0)
                              = (a3,b2,a1,a0) + (b3,a2,b1,b0)
                              = (a3,a2,b1,b0) + (b3,b2,a1,a0)
and so on.

This property can be used as follows. Consider the following 4 bit input data sequence for addition:

        (in1)     (in2)
Y1      0011   +  1100
Y2      0100   +  1011

Computation of Y2 as shown above results in three signal toggles on databus in1 (bit a2: 0 → 1, bit a1: 1 → 0 and bit a0: 1 → 0) and one pair of adjacent signals (a2-a1) toggling in opposite directions. It also results in three signal toggles on databus in2 (bit b2: 1 → 0, bit b1: 0 → 1 and bit b0: 0 → 1) and one pair of adjacent signals (b2-b1) toggling in opposite directions. In total, six signal toggles and two adjacent signal toggles in opposite directions contribute to the power dissipation during the Y2 computation.

By using bitwise commutativity, Y2 can be calculated after swapping bits a2,a1,a0 with b2,b1,b0 respectively. With this bit swapping the computation sequence looks as follows:

        (in1)     (in2)
Y1      0011   +  1100
Y2      0011   +  1100

This computation results in zero toggles in both the databusses and consequently has no pairs of adjacent signals toggling in opposite direction. As can be seen from this example, appropriate bit swapping can significantly reduce power dissipation.

Figure 2.18 shows a scheme to perform the bit swapping so as to minimize the toggle count. The scheme compares, for every bit, the new value with the current value, and performs bit swapping if the two values are different.
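The sketch below models this per-bit decision in software; it is an illustration only, since the real scheme of figure 2.18 does this with multiplexers and exclusive-or gates. Whenever a bit pair differs, the bits are routed so that the line feeding in1 keeps its previous value where possible.

def swap_bits(bus1, in1, in2, width=4):
    """Route each bit of (in1, in2) to minimize toggles on the in1 bus."""
    out1, out2 = 0, 0
    for i in range(width):
        a, b = (in1 >> i) & 1, (in2 >> i) & 1
        if a != b and ((bus1 >> i) & 1) == b:
            a, b = b, a                  # swap this bit pair
        out1 |= a << i
        out2 |= b << i
    assert out1 + out2 == in1 + in2      # bitwise commutativity of ADD
    return out1, out2

# Y1 then Y2 from the worked example above:
o1, o2 = swap_bits(0b0011, 0b0100, 0b1011)
print(f"{o1:04b} {o2:04b}")              # 0011 1100 -> zero toggles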

As can be seen from figure 2.18, the reduction in the toggles in the adder inputs is achieved at the expense of additional logic, i.e. the multiplexers and the exclusive-or gates. The power dissipated in this logic offsets the power savings in the adder and its input busses. The final savings depend on the data values being accumulated and also on the relative capacitance of the adder input busses and the multiplexer inputs.

To evaluate the effectiveness of the input bit swapping technique for power reduction in the adder and its input busses, 1000 random number pairs were generated with bit widths of 8, 12 and 16. Table 2.4 gives the results in terms of total Hamming distance between consecutive data values and total number of adjacent signals toggling in opposite direction, in both the busses. As can be seen from the results the proposed scheme saves more than 25% power in the two input data busses of the adder and also results in power savings in the adder itself.

Figure 2.18. Scheme for Reducing Power in the Adder Input Busses

Table 2.4. Power Optimization Results Using Input Bit Swapping for 1000 Random Number Pairs

             Hamming Distance               Adjacent Signal Toggles
             Initial  Final   %reduction    Initial  Final  %reduction
8 bit data    7953     5937   25.3%          1836     1090   40.6%
12 bit data  11979     8925   25.5%          2766     1791   35.2%
16 bit data  15945    11865   25.6%          3545     2170   38.8%

2.3.4. Swapping Multiplier Inputs

Since the power dissipation in a Booth multiplier depends on the number of 1s in the Booth encoded input, the coefficient and data inputs to the multiplier can be appropriately swapped so as to reduce power dissipation in the multiplier. The results presented in [36] indicate that the amount of reduction is dependent on the data values, and in the worst case swapping can result in an increase in the power dissipation. If the coefficients to be swapped follow a regular pattern, such selective input swapping can be supported as an enhancement to the repeat MAC instruction.

Figure 2.19. Data Flow Graph of a Weighted-sum Computation with Coefficient Symmetry

2.3.5. Exploiting Coefficient Symmetry

In case of some DSP kernels such as linear phase FIR filters, the coefficients of the weighted-sum computation are symmetric. This property can be used to halve the number of multiplications per output. For N even, the weighted-sum equation given by:

Y = Σ_{i=0}^{N-1} A[i] · X[i]   (2.8)

can be written as:

Y = Σ_{i=0}^{N/2-1} A[i] · (X[i] + X[N-1-i])   (2.9)

The corresponding data flow graph is shown in figure 2.19. While the core computation in equation 2.9 is also a multiply-accumulate, the coefficient is multiplied with the sum of two input samples. Architectures such as the one shown in figure 2.1 do not support single cycle execution of this computation. While it is possible to compute the data sum and use it to perform the MAC, the resultant code would require more cycles and more data memory accesses than the direct implementation of equation 2.8, which ignores coefficient symmetry.
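The equivalence of equations 2.8 and 2.9 for symmetric coefficients is easy to check with a short sketch (the example coefficients below are arbitrary, chosen only to satisfy A[i] = A[N-1-i]):

import random

def weighted_sum_direct(a, x):                    # equation 2.8
    return sum(a[i] * x[i] for i in range(len(a)))

def weighted_sum_symmetric(a, x):                 # equation 2.9
    n = len(a)
    return sum(a[i] * (x[i] + x[n - 1 - i]) for i in range(n // 2))

a = [1, 3, -2, -2, 3, 1]                          # symmetric coefficients
x = [random.randint(-8, 8) for _ in range(6)]
assert weighted_sum_direct(a, x) == weighted_sum_symmetric(a, x)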

Figure 2.20 shows a suitable abstraction of the datapath of the TMS320C54x DSP [97] that supports single-cycle execution (FIRS instruction) of the multiply-accumulate computation of equation 2.9.

This architecture has an additional data read bus which enables fetching the coefficient and the two data values in a single cycle. Its datapath has an adder and a MAC unit, so that the sum of the input data samples and the multiply-accumulate operation can be performed simultaneously in a single cycle. Since the computational complexity of equation 2.9 is less than that of equation 2.8, the corresponding implementation of equation 2.9 is significantly more power efficient.

Figure 2.20. Suitable Abstraction of TMS320C54x Architecture for Exploiting Coefficient Symmetry

2.4. Techniques for Low Power Realization of FIR Filters

FIR filtering is achieved by convolving the input data samples with the desired unit impulse response of the filter. The output Y[n] of an N tap FIR filter is given by the weighted sum of the latest N input data samples (equation 2.10).

Y[n] = Σ_{i=0}^{N-1} A[i] · X[n-i]   (2.10)

The corresponding signal flow graph is shown in figure 2.21.

Figure 2.21. Signal Flow Graph of a Direct Form FIR Filter

The weights (A[i]) in the expression are the filter coefficients. The number of taps (N) and the coefficient values are derived so as to satisfy the desired filter response in terms of passband ripple and stopband attenuation. Unlike IIR filters, FIR filters are all-zero filters and are inherently stable [73]. FIR filters with symmetric coefficients (A[i] = A[N-1-i]) have a linear phase response [73] and are hence an ideal choice for applications requiring minimal phase distortion.

While the techniques described in the earlier two sections can be applied in the context of FIR filters, this section describes additional low power techniques specific to FIR filters.

2.4.1. Circular Buffer

In addition to the weighted sum computation, FIR filtering also involves updating the input data samples. For an N tap filter, the latest N data samples are required and hence need to be stored in the data memory. After every output computation, a new data sample is read and stored in the data memory and the oldest data sample is removed. A data sample X[k] for the current computation becomes data sample X[k-1] for the next FIR computation. Thus in addition to accepting the new data sample, the existing data samples need to be shifted by one position for every output. One way of achieving this is to read each data sample in the order X[n-N+2] to X[n] and write it in the next memory location. Thus for an N tap filter, this approach of data movement requires N-1 memory writes.

The power related to data movement can be minimized by eliminating these memory writes. This can be achieved by configuring the data memory as a circular buffer [96] where instead of moving the data, the pointer to the data is moved.
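A behavioral sketch of such a circular buffer is shown below; it is an illustration only, since programmable DSPs implement this with modulo address arithmetic in the address generator.

class CircularBuffer:
    """Holds the latest N samples; only the pointer moves on a new sample."""
    def __init__(self, n):
        self.buf = [0] * n
        self.head = 0                     # index of the newest sample

    def push(self, sample):               # one write per output, not N-1
        self.head = (self.head - 1) % len(self.buf)
        self.buf[self.head] = sample

    def __getitem__(self, i):             # returns X[n-i]
        return self.buf[(self.head + i) % len(self.buf)]

cb = CircularBuffer(4)
for s in (10, 20, 30, 40, 50):
    cb.push(s)
print([cb[i] for i in range(4)])          # [50, 40, 30, 20]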


2.4.2. Multirate Architectures

Multirate architectures involve implementing an FIR filter in terms of its decimated sub-filters [66]. These architectures can be derived using Winograd's algorithms for reducing the computational complexity of polynomial multiplications. Consider the Z-domain representation of the FIR filtering algorithm: Y(Z) = H(Z) · X(Z), where H(Z) is the filter transfer function, X(Z) is the input and Y(Z) is the output.

For an N tap filter, H(Z) = Σ_{i=0}^{N-1} A[i] · Z^{-i}, where the A[i]'s are the filter coefficients. H(Z) can be decimated into two interleaved sequences by grouping odd and even coefficients as follows:

H(Z) = Σ_{k=0}^{N/2-1} A[2k] · Z^{-2k} + Σ_{k=0}^{N/2-1} A[2k+1] · Z^{-(2k+1)}
     = Σ_{k=0}^{N/2-1} A[2k] · (Z^2)^{-k} + Z^{-1} · Σ_{k=0}^{N/2-1} A[2k+1] · (Z^2)^{-k}
     = H0(Z^2) + H1(Z^2) · Z^{-1}   (2.11)

The input X(Z) can be similarly decimated into two sequences X0 (all even samples) and X1 (all odd samples) such that X(Z) = X0(Z^2) + X1(Z^2) · Z^{-1}. The filtering operation can now be represented as:

(H0 + H1 · Z^{-1}) · (X0 + X1 · Z^{-1}) ⇒ C0 + C1 · Z^{-1} + C2 · Z^{-2}   (2.12)

Using Winograd's polynomial multiplication algorithm, C0, C1 and C2 can be computed as

C0 = H0 · X0,
C2 = H1 · X1,
C1 = (H0 + H1) · (X0 + X1) − C0 − C2   (2.13)

C1 gives the output sub-sequence Y1. Since the C2 samples 'overlap' with C0, they need to be added, with appropriate shift, to C0 to get the output sub-sequence Y0. It can be noted that the C0, C1, C2 computation involves filtering of the decimated input sequences using the decimated sub-filters. An N tap FIR filtering is thus achieved using three (N/2) tap FIR filters. The signal flow graph of the resultant FIR architecture is shown in figure 2.22. The architecture processes two input samples simultaneously to produce the corresponding two outputs.
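The decomposition can be verified with a short sketch (an illustration assuming an even number of taps and an even-length input; fir() is plain convolution):

def fir(h, x):
    """Plain convolution of coefficients h with input x."""
    y = [0] * (len(x) + len(h) - 1)
    for i, hi in enumerate(h):
        for j, xj in enumerate(x):
            y[i + j] += hi * xj
    return y

def multirate_fir(h, x):
    """One-level decimated structure of figure 2.22 (even-length h, x)."""
    h0, h1 = h[0::2], h[1::2]             # decimated sub-filters
    x0, x1 = x[0::2], x[1::2]             # decimated input sequences
    c0 = fir(h0, x0)
    c2 = fir(h1, x1)
    c01 = fir([a + b for a, b in zip(h0, h1)],
              [a + b for a, b in zip(x0, x1)])
    c1 = [a - b - c for a, b, c in zip(c01, c0, c2)]
    y = [0] * (len(x) + len(h) - 1)
    for k, v in enumerate(c0): y[2 * k] += v          # even outputs Y0
    for k, v in enumerate(c1): y[2 * k + 1] += v      # odd outputs Y1
    for k, v in enumerate(c2): y[2 * k + 2] += v      # C2 overlaps C0
    return y

h = [1, 2, 3, 4]
x = [5, -1, 2, 0, 3, 7]
assert multirate_fir(h, x) == fir(h, x)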

2.4.2.1 Computational Complexity of Multirate Architectures

From the signal flow graph shown in figure 2.21, an N tap direct form FIR filter requires N multiplications and (N-1) additions per output.

Figure 2.22. One Level Decimated Multirate Architecture

Table 2.5. TMS320C2x Code for Direct Form Architecture

NXTPT   IN XN,PA0        ; bring in the new sample
        LRLK AR1,XNLAST  ; point to X(n-(N-1))
        LARP AR1
        MPYK 0           ; clear product register
        ZAC              ; clear accumulator
        RPTK NM1         ; loop N times
        MACD HNM1,*-     ; multiply, accumulate
        APAC
        SACH YN,1
        OUT YN,PA1       ; output the filter response y(n)
        B NXTPT          ; branch to get the next sample

Consider the multirate architecture shown in figure 2.22. Assuming an even number of taps, each of the sub-filters is of length (N/2) and hence requires N/2 multiplications and (N/2)-1 additions. There are four more additions required to compute the two outputs Y0 and Y1. This architecture hence requires 3N/4 multiplications per output, which is less than the direct form architecture for all values of N, and requires (3N+2)/4 additions per output, which is less than the direct form architecture for ((N - 1) > (3N + 2)/4), i.e. (N > 6).

2.4.2.2 Multirate Architecture on a Programmable DSP

While the multirate architecture is not as regular as the direct form structure, it has significantly reduced computational complexity and partial regularity in terms of the decimated sub-filters.

Table 2.5 shows the implementation of a direct form FIR filter on the TMS320C2x [40, 95]. The coefficients are stored in the program memory and the data is stored in the data memory. This implementation requires N+19 cycles to compute one output of an N tap FIR filter.

stored in the data memory. This imp\ementation requires N+ 19 cycles to com­pute one output of an N tap FlR filter.

The TMS320C5x [96], which is the next generation fixed point DSP, has some additional instructions that help optimize the FIR implementation further. Its features, such as the RPTZ instruction that clears the P register and accumulator before RPTing, the delayed-branch instruction and memory mapped output, can be used to implement the FIR filtering algorithm in N+14 cycles per output.

Table 2.6 shows the implementation of the multirate architecture using the 'C2x. It can be noted that this code is actually 'C5x code that uses 'C2x compatible instructions. The only exception is the accumulator buffer related instructions SACB and SBB, which are not supported in the 'C2x. This implementation requires (3N+82)/4 cycles per output computation of an N tap FIR filter. The corresponding 'C5x implementation requires (3N+70)/4 cycles per output computation.

Thus in case of the TMS320C2x based implementation, the multirate architecture needs fewer cycles for FIR filters with (N > 6). In case of the TMS320C5x based implementation, the multirate architecture needs fewer cycles for FIR filters with (N > 14).

The power reduction due to the multirate architecture based FIR filter implementation can be analyzed as follows. Since the multirate architecture requires fewer cycles, the frequency can be lowered using the following relationship:

f_multirate / f_direct = (3N + 82) / (4 · (N + 19))   (2.14)

With the lowered frequency, the processor gets more time to execute each instruction. This time-slack can be used to appropriately lower the supply voltage, using the relationship given in the following equation:

delay ∝ Vdd / (Vdd − VT)^2   (2.15)

Since most programmable DSPs [95, 96, 97] are implemented using a fully static CMOS technology, such a voltage scaling is indeed possible.

In terms of capacitance, the main computation loop in the direct form realization requires N multiplications, N additions and N memory reads. The multirate implementation has three computation loops corresponding to the three sub-filters. These loops require 3N/4 multiplications, 3N/4 additions and 3N/4 memory reads per output. Based on this observation,

C_total_multirate / C_total_direct ≈ 0.75
C_multirate / C_direct ≈ (0.75 · 4 · (N + 19)) / (3N + 82)

Based on this analysis, for a 32 tap FIR filter,

f_multirate / f_direct = (3 · 32 + 82) / (4 · (32 + 19)) = 0.87

Table 2.6. TMS320C2x Code for the Multirate Architecture

NXTPT   IN X0N,PA0        ; read new even and odd samples
        IN X1N,PA0
        LACL X0N
        ADD X1N
        SACL X01N         ; compute X0+X1
        LRLK AR1,X0NLAST  ; point to X0(n-(N/2-1))
        LARP AR1
        MPYK 0            ; clear product register
        LACL YH1          ; load X1*H1 from previous iteration
        RPTK NBY2         ; loop (N/2+1) times
        MACD H0N,*-       ; compute X0*H0
        SACH Y0N,1
        OUT Y0N,PA1       ; output Y0 sample of filter response
        SUB YH1           ; YH1 stores X1*H1 output from earlier iteration
        SACB              ; accumulator -> accumulator buffer
        ZAC
        RPTK NBY2M1       ; loop N/2 times
        MACD H1N,*-       ; multiply, accumulate to compute X1*H1
        SACH YH1          ; result stored in YH1 for use in the next iteration
        NEG               ; negate the accumulator
        RPTK NBY2M2       ; loop (N/2-1) times
        MACD H01,*-       ; compute (X0+X1)*(H0+H1)
        APAC
        SBB               ; subtract accumulator buffer from accumulator
        SACH Y1N,1
        OUT Y1N,PA1       ; output Y1 sample of filter response
        B NXTPT

For this lowering of frequency, based on equation 2.15, the voltage can be reduced from 5 volts to 4.55 volts.

P_multirate / P_direct = (C_multirate / C_direct) · (V_multirate / V_direct)^2 · (f_multirate / f_direct)
                       = (0.75 / 0.87) · (0.91)^2 · 0.87 = 0.62

Thus using the multirate architecture, the power dissipation of a 32 tap FIR filter implemented on the TMS320C2x processor can be reduced by 38%. A similar analysis for the TMS320C5x based implementation shows a power reduction of 35%.

Figure 2.23 shows the power dissipation as a function of the number of taps for the multirate FIR filters implemented on the TMS320C2x. The power dissipation is normalized with respect to the direct form FIR structure. As can be seen from the figure, the power dissipation reduces with increasing filter order. The power savings can be as much as 40% for filters with more than 42 taps.

Figure 2.23. Normalized Power Dissipation as a Function of Number of Taps for the Multirate FIR Filters Implemented on TMS320C2x

2.4.3. Architecture to Support Transposed FIR Structure

The signal flow graph shown in figure 2.21 can be transformed using the transposition theorem [73] into the signal flow graph shown in figure 2.24. While the computational complexity of the transposed structure is the same as that of the direct form structure (figure 2.21), it involves multiplying all the coefficients by the same input data (X[n]). This structure is hence called the Multiple Constant Multiplication (MCM) structure. Thus throughout the filter computation, one of the multiplier inputs is fixed, resulting in significant power saving in the multiplier compared to the direct form implementation. While the transposed form does not require movement of old data samples, it needs to store the results of the intermediate computations for future use. Since the intermediate results have higher precision (precision of the coefficient + precision of the data), wider busses are required to support single-cycle multiply-add operations.

It can be noted that the architecture shown in figure 2.1 cannot support efficient implementation of the MCM based FIR filters. The architecture shown in figure 2.25 has been proposed to support an N cycle implementation of an N tap transposed filter. The increased capacitance due to the wider read/write busses and an extra write performed every cycle results in increased power dissipation compared to the direct form implementation. This increase typically outweighs the saving in the multiplier power, thus making such an architecture overall less power efficient.

Figure 2.24. Signal Flow Graph of the Transposed FIR Filter

Figure 2.25. Architecture to Support Efficient Implementation of Transposed FIR Filter

All the approaches presented so far assume a given set of coefficient values that meet the desired filter response. The following two techniques optimally modify the filter coefficients such that they result in power reduction while still meeting the desired filter response.

2.4.4. Coefficient Scaling

For an N tap filter the output Y[n] is given by equation 2.10. Scaling the output preserves the filter characteristics in terms of passband ripple and stopband attenuation, but results in an overall magnitude gain equal to the scale factor. For a scale factor K, equation 2.10 translates to the following:

K · Y[n] = K · Σ_{i=0}^{N-1} (A[i] · X[n-i]) = Σ_{i=0}^{N-1} ((K · A[i]) · X[n-i])   (2.16)

Thus the coefficients of the scaled filter are given by (K · A[i]). Given the allowable range of scaling (e.g. ±3 dB), an optimal scaling factor K can be found such that the total Hamming distance between consecutive coefficient values is minimized. This technique thus reduces the power dissipation in the coefficient memory data bus and also in the multiplier.

Due to finite precision effects, the scaled coefficients may in some cases violate the filter characteristics. This can be avoided by scaling the full precision coefficients and then quantizing them to the desired number of bits. It is verified that the scaled coefficients satisfy the desired filter characteristics before accepting them.
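A sketch of the scale-factor search is given below. It is an illustration under stated assumptions: 16 bit Q15 quantization, a ±3 dB range approximated as [0.707, 1.414], and 1000 search steps; the response check described above must still be run before a candidate K is accepted.

def q15(v):
    """Quantize to a 16 bit 2's complement word (Q15 format assumed)."""
    return int(round(v * (1 << 15))) & 0xFFFF

def total_hd(coeffs):
    q = [q15(c) for c in coeffs]
    return sum(bin(a ^ b).count("1") for a, b in zip(q, q[1:]))

def best_scale(coeffs, lo=0.707, hi=1.414, steps=1000):
    """Sweep K and keep the value minimizing total Hamming distance."""
    best_k, best_cost = 1.0, total_hd(coeffs)
    for s in range(steps + 1):
        k = lo + (hi - lo) * s / steps
        cost = total_hd([k * c for c in coeffs])
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost

print(best_scale([0.05, -0.12, 0.4, -0.12, 0.05]))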

2.4.5. Coefficient Optimization

Coefficient optimization has been discussed in the literature primarily in the context of designing finite wordlength FIR filters [32]. The algorithms presented in [105] and [83] address the design of FIR filters with powers-of-two coefficients. The algorithm presented in [28] minimizes the total number of 1s in the 2's complement representation of the coefficients. These techniques aim at efficient multiplierless implementation of FIR filters.

2.4.5.1 Coefficient Optimization - Problem Definition

The coefficient optimization problem can be stated as follows: Given an N tap FIR filter with coefficients (A[i], i=0,N-1) that satisfy the response in terms of passband ripple, stopband attenuation and phase characteristics (linear or non-linear), find a new set of coefficients (A'[i], i=0,N-1) such that the total Hamming distance between successive coefficients is minimized while still satisfying the desired filter characteristics in terms of passband ripple and stopband attenuation. Also retain the linear phase characteristics if such an additional constraint is specified.

2.4.5.2 Coefficient Optimization - Problem Formulation

The coefficient optimization problem can be formulated as a local search problem, where the optimum coefficient values are searched in their neighborhood. This is done via an iterative improvement process. During each iteration one or more coefficients are suitably modified so as to reduce the total Hamming distance while still satisfying the desired filter characteristics. The optimization process continues till no further reduction is possible.

The coefficient optimization can be performed either on the initial coefficient values or on coefficient values that are uniformly scaled using the approach mentioned earlier. One coefficient is perturbed in each iteration of the optimization process. In case of an additional requirement to retain the linear phase characteristics, the coefficients are perturbed in pairs (A[i] and A[N-1-i]) so as to preserve the coefficient symmetry.


The selection of a coefficient for perturbation and the amount of perturbation have a direct impact on the overall optimization quality. Various strategies can be adopted for coefficient perturbation [32]. These include the 'steepest descent' and 'first improvement' strategies. While in the steepest descent approach the best coefficient perturbation is selected at every stage, in the first improvement approach the first coefficient perturbation that minimizes the cost is selected. The quality of the first improvement approach depends on the order in which the coefficients are selected.

The following subsections present the details of various components of the algorithm and follow it up with the overall algorithm.

2.4.5.3 Coefficient Optimization Algorithm - Components

Computing filter response

During each iteration, the filter characteristics need to be computed for every coefficient perturbation. The overall computational efficiency of the algorithm thus depends on how fast the filter characteristics (passband ripple and stopband attenuation) can be derived. The frequency response of a filter can be computed by taking the Fourier transform of its unit impulse response [73]. In case of FIR filters, the filter coefficient values form the unit impulse response. The frequency response of an FIR filter can thus be computed by taking the Fourier transform of a sequence consisting of the filter coefficients padded with 0s. The Fourier transform can be efficiently computed using the radix-2 FFT (Fast Fourier Transform) technique. The number of 0s to be padded is decided based on the desired resolution (no. of frequency points) and also to make the total number of points a power of 2. As a thumb rule, the number of coefficients can be multiplied by 8 and the nearest higher power of 2 picked to decide on the total number of points for the FFT computation.

Given the passband and stopband frequency ranges, the passband ripple and stopband attenuation can be computed by analyzing the frequency response.
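As an illustration, the response computation described above can be sketched in a few lines of Python (using NumPy). The function names, the default padding rule and the measurement of ripple as the peak-to-peak passband deviation in dB are assumptions of this sketch, not part of the original formulation.

import numpy as np

def filter_response_db(coeffs, n_points=None):
    # The coefficients form the unit impulse response; zero padding sets the
    # frequency resolution. Per the rule of thumb above, the default FFT size
    # is the nearest power of 2 above 8x the number of coefficients.
    n = len(coeffs)
    if n_points is None:
        n_points = 1 << int(np.ceil(np.log2(8 * n)))
    h = np.fft.rfft(coeffs, n_points)            # radix-2 FFT of padded sequence
    return 20 * np.log10(np.abs(h) + 1e-12)      # small offset avoids log(0)

def ripple_and_attenuation(coeffs, fs, f_pass, f_stop):
    # Passband ripple and stopband attenuation (both in dB) of a low pass filter
    mag_db = filter_response_db(coeffs)
    freqs = np.linspace(0, fs / 2, len(mag_db))
    passband = mag_db[freqs <= f_pass]
    stopband = mag_db[freqs >= f_stop]
    ripple = passband.max() - passband.min()     # peak-to-peak deviation
    attenuation = -stopband.max()                # worst-case stopband level
    return ripple, attenuation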

Coefficient perturbation

During coefficient perturbation, the selected coefficient (A[i]) is modified in such a way that the Hamming distance between the new A[i] value and its adjacent coefficients (A[i-1], A[i+1]) is less than the Hamming distance between the current A[i] value and the adjacent coefficients. The change in the coefficient value needs to be as small as possible, so as to minimally impact the filter characteristics. The neighborhood of the selected coefficient (A[i]) is searched so as to find the nearest higher and the nearest lower coefficients that reduce the Hamming distance. The maximum allowable difference between the perturbed coefficient and the original coefficient can be controlled so as to focus on lower significant bits during the optimization.
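The perturbation search can be sketched as follows, assuming the coefficients are held as integers in a B-bit 2's complement representation; the helper names and the boundary handling (treating a missing neighbor as 0) are assumptions of the sketch.

def hamming(a, b, bits=16):
    # Hamming distance between two values in bits-bit 2's complement
    mask = (1 << bits) - 1
    return bin((a & mask) ^ (b & mask)).count("1")

def nearest_perturbations(coeffs, i, max_delta=64, bits=16):
    # Nearest higher (A[i+]) and nearest lower (A[i-]) values that reduce the
    # Hamming distance to the adjacent coefficients; max_delta bounds the
    # search so that only the lower significant bits are disturbed.
    cur = coeffs[i]
    left = coeffs[i - 1] if i > 0 else 0
    right = coeffs[i + 1] if i < len(coeffs) - 1 else 0
    hd = lambda v: hamming(v, left, bits) + hamming(v, right, bits)
    cur_hd = hd(cur)
    hi = lo = None
    for d in range(1, max_delta + 1):
        if hi is None and hd(cur + d) < cur_hd:
            hi = cur + d
        if lo is None and hd(cur - d) < cur_hd:
            lo = cur - d
        if hi is not None and lo is not None:
            break
    return hi, lo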

Coefficient selection


In case of the steepest descent strategy, for every coefficient its nearest higher and nearest lower coefficient values are identified. A new set of coefficients can be formed by replacing one of the coefficients with its nearest higher or nearest lower value. This approach is used to generate 2N sets of coefficients for an N tap filter during each iteration of the optimization process. From the 2N sets of coefficients, the coefficient set that maximizes the gain function γ is selected and is used as the current set of coefficients for the next iteration. The gain function γ is computed as follows:

γ = Tolerance · HD_red

Tolerance = (Pdb_req − Pdb)/Pdb_req + (Sdb − Sdb_req)/Sdb_req

where HD_red is the reduction in the total Hamming distance for the new set of coefficients compared to the total Hamming distance for the current set of coefficients, Pdb_req is the desired passband ripple, Sdb_req is the desired stopband attenuation, Pdb is the passband ripple of the new set of coefficients, and Sdb is the stopband attenuation for the new set of coefficients.

In case of the additional requirement of retaining the linear phase response, the filter coefficients are perturbed in pairs (A[i], A[N-i-1]) to maintain symmetry. Thus for an N tap filter, N different sets of coefficients are generated during each iteration and the set that maximizes the gain function γ is selected.

In case of the first improvement strategy, the optimization quality depends on the order in which the coefficients are perturbed. The coefficient order is randomized, and for a selected coefficient whether to search for the nearest higher or the nearest lower value is also selected randomly. During each iteration, the first perturbation that reduces the Hamming distance and satisfies the filter characteristics is accepted and is used to form the current set of coefficients for the next iteration. The dependence on the coefficient order is minimized by generating 5 or 10 different coefficient orders and selecting the one that results in the least total Hamming distance.

2.4.5.4 Coefficient Optimization Algorithm

Here is the overall coefficient optimization algorithm that uses the steepest descent strategy and assumes no linear phase constraint.

Procedure Optimize-Coefficients-for-Low-Power
Inputs: Low pass filter characteristics in terms of passband ripple Pdb_req and stopband attenuation Sdb_req. An initial set of N filter coefficients A[0] to A[N-1] that meet the specified filter response.
Output: An updated set of filter coefficients A[0] to A[N-1] which minimize the total Hamming distance between successive coefficient values and still meet the desired filter characteristics.

Page 65: VLSI Synthesis of DSP Kernels: Algorithmic and Architectural Transformations

46 VLSI SYNTHESIS OF DSP KERNELS

repeat {
    for each coefficient A[i] (i = 0, ..., N-1) {
        Find a coefficient value A[i+] such that:
            (HD(A[i],A[i-1]) + HD(A[i],A[i+1])) > (HD(A[i+],A[i-1]) + HD(A[i+],A[i+1]))
            and (A[i+] - A[i]) is minimum
        Generate a new set of coefficients by replacing A[i] with A[i+]
        Compute the passband ripple (Pdb_i+) and the stopband attenuation (Sdb_i+)
        if (Pdb_i+ < Pdb_req) and (Sdb_i+ > Sdb_req) {
            Find the tolerance given by
                Tol_i+ = (Pdb_req - Pdb_i+)/Pdb_req + (Sdb_i+ - Sdb_req)/Sdb_req
        } else { Tol_i+ = 0 }
        Find a coefficient value A[i-] such that:
            (HD(A[i],A[i-1]) + HD(A[i],A[i+1])) > (HD(A[i-],A[i-1]) + HD(A[i-],A[i+1]))
            and (A[i] - A[i-]) is minimum
        Generate a new set of coefficients by replacing A[i] with A[i-]
        Compute the passband ripple (Pdb_i-) and the stopband attenuation (Sdb_i-)
        if (Pdb_i- < Pdb_req) and (Sdb_i- > Sdb_req) {
            Find the tolerance given by
                Tol_i- = (Pdb_req - Pdb_i-)/Pdb_req + (Sdb_i- - Sdb_req)/Sdb_req
        } else { Tol_i- = 0 }
    }
    Find the coefficient value among the A[i+]'s and A[i-]'s for which the gain
    function γ given by (Tolerance · HD_reduction) is maximum.
    if (γ > 0) {
        Replace the original coefficient with the new value
    } else { Optimization_possible = FALSE }
} until (!Optimization_possible)
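A compact Python rendering of this steepest descent loop, built on the hypothetical helpers sketched earlier (hamming, nearest_perturbations and ripple_and_attenuation), could look as follows; the fixed point scaling and the stopping test are assumptions of the sketch.

def optimize_coefficients(coeffs, fs, f_pass, f_stop, rip_req, att_req, bits=16):
    # coeffs: integers in bits-bit 2's complement; rip_req/att_req in dB
    coeffs = list(coeffs)
    scale = float(1 << (bits - 1))
    total_hd = lambda c: sum(hamming(a, b, bits) for a, b in zip(c, c[1:]))
    while True:
        base_hd = total_hd(coeffs)
        best_gain, best_move = 0.0, None
        for i in range(len(coeffs)):
            for cand in nearest_perturbations(coeffs, i):
                if cand is None:
                    continue
                trial = coeffs[:i] + [cand] + coeffs[i + 1:]
                rip, att = ripple_and_attenuation(
                    [c / scale for c in trial], fs, f_pass, f_stop)
                if rip < rip_req and att > att_req:        # specs still met
                    tol = (rip_req - rip) / rip_req + (att - att_req) / att_req
                    gain = tol * (base_hd - total_hd(trial))
                    if gain > best_gain:
                        best_gain, best_move = gain, (i, cand)
        if best_move is None:
            return coeffs           # no perturbation yields a positive gain
        coeffs[best_move[0]] = best_move[1]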

The above algorithm can be easily modified to handle the additional requirement of retaining the linear phase characteristics. This can be achieved by modifying both A[i] and A[N-1-i] with A[i+] (and later with A[i-]) to generate the new set of coefficients, and searching only the first (N+1)/2 coefficients during each iteration.

The 'first improvement' based version of the algorithm uses a random number generator to pick a coefficient (A[i]) for perturbation and also to decide whether the A[i+] or the A[i-] value needs to be considered. The new coefficient value is accepted if the new values of passband ripple and stopband attenuation are within the allowable limits. The optimization process stops when no coefficient is perturbed for a specified number of iterations.
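The first improvement variant can be sketched along the same lines; the idle-move stopping criterion and the single-seed driver below are assumptions of the sketch (in practice the run would be repeated with several seeds, keeping the best of 10, as described above).

import random

def first_improvement(coeffs, fs, f_pass, f_stop, rip_req, att_req,
                      bits=16, max_idle=200, seed=0):
    rng = random.Random(seed)
    coeffs = list(coeffs)
    scale = float(1 << (bits - 1))
    idle = 0
    while idle < max_idle:
        i = rng.randrange(len(coeffs))                       # random coefficient
        cand = rng.choice(nearest_perturbations(coeffs, i))  # A[i+] or A[i-]
        if cand is None:
            idle += 1
            continue
        trial = coeffs[:i] + [cand] + coeffs[i + 1:]
        rip, att = ripple_and_attenuation(
            [c / scale for c in trial], fs, f_pass, f_stop)
        if rip < rip_req and att > att_req:
            coeffs, idle = trial, 0          # accept the first improving move
        else:
            idle += 1
    return coeffs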

The techniques of coefficient scaling and coefficient optimization were ap­plied to the following six low pass FIR filters.


Table 2.7. Hamming Distance and Adjacent Signal Toggles After Coefficient Scaling Followed by Steepest Descent and First Improvement Optimization with No Linear Phase Constraint

                   Initial      Scaling +          Scaling +               %red
                                Steepest descent   First improvement       w.r.t. initial
                                                   (best of 10 random)
FIR Filter         HD   Togs    HD    Togs         HD    Togs              HD      Togs
FIR1               372  58      298   19           298   19                19.9%   67.2%
FIR2               180  50      118   12           114   8                 36.7%   84.0%
FIR3               292  44      258   25           256   24                12.3%   45.5%
FIR4               214  44      138   6            138   5                 35.5%   88.6%
FIR5               258  36      168   14           168   16                34.9%   61.1%
FIR6               220  16      156   12           156   12                29.1%   25.0%

FIR1: lp_16K_3K_4K_.1_62_50
FIR2: lp_16K_3K_4.5K_.2_42_24
FIR3: lp_10K_1.8K_2.5K_.15_60_41
FIR4: lp_12K_2K_3K_.12_45_28
FIR5: lp_12K_2.2K_3.1K_.16_49_34
FIR6: lp_10K_2K_3K_0.05_40_29

These filters vary in terms of the desired filter characteristics and consequently in the number of coefficients. These filters have been synthesized using the Parks-McClellan algorithm for the minimum number of taps. The coefficient values quantized to 16 bit 2's complement fixed point representation form the initial sets of coefficients for optimization. Tables 2.7, 2.8 and 2.9 give results in terms of total Hamming distance (HD) and total number of adjacent signals toggling in opposite directions (Togs) for different optimization strategies. The names of the filters indicate the filter characteristics. For example, the FIR2 filter lp_16K_3K_4.5K_.2_42_24 is a low pass filter with the following characteristics:
Sampling frequency = 16000 Hz
Passband frequency = 3000 Hz
Stopband frequency = 4500 Hz
Passband ripple = 0.2 dB
Stopband attenuation = 42 dB
Number of coefficients = 24

Figure 2.26 shows the frequency domain characteristics of the 24 tap FIR filter for three sets of coefficients corresponding to the initial solution, optimization with no linear phase constraint, and optimization with the linear phase constraint.


Table 2.8. Hamming Distance and Adjacent Signal Toggles After Coefficient Scaling Followed by Steepest Descent and First Improvement Optimization with Linear Phase Constraint

                              Scaling +          Scaling +               %red
                              Steepest descent   First improvement       w.r.t. initial
                                                 (best of 10 random)
FIR Filter                    HD    Togs         HD    Togs              HD      Togs
lp_16K_3K_4K_.1_62_50         302   20           302   20                18.8%   65.5%
lp_16K_3K_4.5K_.2_42_24       118   14           114   10                36.7%   80.0%
lp_10K_1.8K_2.5K_.15_60_41    264   28           264   20                9.6%    54.5%
lp_12K_2K_3K_.12_45_28        140   8            136   6                 36.4%   86.4%
lp_12K_2.2K_3.1K_.16_49_34    178   16           170   14                34.1%   61.1%
lp_10K_2K_3K_0.05_40_29       154   10           154   14                30.0%   37.5%

Table 2.9. Hamming Distance and Adjacent Signal Toggles for Steepest Descent and First Improvement Optimization with and without Linear Phase Constraint (with No Coefficient Scaling)

                              No Scaling +       No Scaling +            %red
                              Steepest descent   First improvement       w.r.t. initial
                                                 (best of 10 random)
FIR Filter                    HD    Togs         HD    Togs              HD      Togs

No linear phase constraint
lp_16K_3K_4K_.1_62_50         308   21           312   33                17.2%   63.8%
lp_16K_3K_4.5K_.2_42_24       126   14           126   12                30.0%   76.0%
lp_10K_1.8K_2.5K_.15_60_41    252   28           252   30                13.7%   36.4%
lp_12K_2K_3K_.12_45_28        154   20           152   19                29.0%   56.8%
lp_12K_2.2K_3.1K_.16_49_34    204   26           198   24                23.6%   33.3%
lp_10K_2K_3K_0.05_40_29       170   7            170   5                 22.7%   68.8%

With linear phase constraint
lp_16K_3K_4K_.1_62_50         316   30           312   26                16.1%   55.2%
lp_16K_3K_4.5K_.2_42_24       136   20           136   22                24.4%   60.0%
lp_10K_1.8K_2.5K_.15_60_41    260   30           260   30                10.9%   31.8%
lp_12K_2K_3K_.12_45_28        154   24           154   22                28.0%   50.0%
lp_12K_2.2K_3.1K_.16_49_34    210   30           210   28                18.6%   22.2%
lp_10K_2K_3K_0.05_40_29       180   6            176   6                 20.0%   62.5%

The results show that the algorithm using both scaling and coefficient optimization with no linear phase constraint results in up to 36% reduction in the total Hamming distance and up to 88% reduction in the total number of adjacent signal toggles. Similar savings are achieved even with the linear phase constraint.


D "0

" "<ij

"

-20

-40

-60

1000 2000 3000 4000 5000 6000 7000 8000 Frequency (Hz)

-40 .----,--------,----,-------,-----,----,

-45

-50

-55

-60

-65

-70 L-J._"----___ ---"-____ -'---___ ---"-____ ...LL--'---'

4600 4800 5000 Frequency (Hz)

5200 5400

49

Figure 2_26 Frequency Domain Characteristics of a 24 Tap F1R Filter Before and After Opti­mization

The coefficient optimization algorithm alone (i.e. with no coefficient scaling) with no linear phase constraints results in up to 29% reduction in the total Hamming distance and up to 68% reduction in the total number of adjacent signal toggles. Similar savings are achieved with the linear phase constraints.

Further analysis of the results shows that the combination of scaling and coefficient optimization results in a higher reduction (except in one case) in the power dissipation measures than using coefficient optimization alone. The additional constraint of retaining linear phase marginally impacts the power savings. The impact however is noticeable in case of coefficient optimization with no scaling.

Figure 2.27. Low Pass Filter Specifications (passband response bounded by 1+δ1 and 1−δ1, stopband response bounded by δ2)

In terms of optimization strategies, the first improvement approach with the best of 10 random solutions performs as well or marginally better in most of the cases. Experiments also indicate that the number of coefficient perturbations evaluated in the first improvement approach is significantly smaller than in the steepest descent approach. This is reflected in faster runtimes for the first improvement optimization strategy.

2.4.5.5 Coefficient Optimization Using 0-1 Programming

Instead of the two phase approach of arriving at an initial solution and refining it for low power, the coefficient optimization problem can also be solved as a coefficient synthesis problem with an additional constraint for power reduction. Here is a 0-1 programming formulation for synthesizing coefficients that meet the desired characteristics of a linear phase FIR filter and have minimum total Hamming distance in the 2's complement fixed point representation of successive coefficients.

Given
Filter characteristics (figure 2.27) in terms of
Passband frequency: ω_p
Stopband frequency: ω_s
Passband ripple: 2δ1, i.e. the passband response lies between (1 − δ1) and (1 + δ1)
Stopband attenuation: δ2
N, the number of coefficients
B, the number of bits of precision for the fixed point representation of the coefficients
H_n, the value of the n'th coefficient

Variables
The coefficient bit values form the variables, where C_{n,b} is the value of the b'th bit of the n'th coefficient and C_{n,b} ∈ {0, 1}

Objective Function
Let HD_{n,b} be the Hamming distance between bits C_{n,b} and C_{n+1,b}, given by
HD_{n,b} = C_{n,b} ⊕ C_{n+1,b}
The same can be represented as
−HD_{n,b} ≤ (C_{n,b} − C_{n+1,b}) ≤ HD_{n,b}
The objective function then is to
Minimize \sum_{n=0}^{N-2} \sum_{b=0}^{B-1} HD_{n,b}

Constraints
The coefficient values can be computed as follows:
H_n = −C_{n,0} + \sum_{b=1}^{B-1} C_{n,b} · 2^{−b}
Given the filter coefficients, the magnitude response at a given frequency ω can be computed using the following equations:
For N odd,
F_ω = \sum_{k=0}^{(N-1)/2} a[k] cos(ωk)
where a[0] = H[(N−1)/2] and a[k] = 2H[(N−1)/2 − k], k = 1, 2, ..., (N−1)/2
For N even,
F_ω = \sum_{k=1}^{N/2} b[k] cos(ω(k − 1/2))
where b[k] = 2H[N/2 − k], k = 1, 2, ..., N/2

The frequency response should meet the following constraints:
For all frequencies ω : 0 ≤ ω ≤ ω_p → (1 − δ1) ≤ F_ω ≤ (1 + δ1)
For all frequencies ω : ω_s ≤ ω ≤ π → |F_ω| ≤ δ2

Any of the available 0-1 programming packages can be used to arrive at C_{n,b} values that satisfy the filter characteristics and minimize the total Hamming distance between successive coefficients.
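As a sketch of how such a formulation can be handed to a solver, the following uses the PuLP front-end (any 0-1 programming package would do). The continuous frequency constraints are enforced on a discrete grid of frequencies, a practical approximation introduced here, and N is assumed odd for brevity.

import math
import pulp

def synthesize_coefficients(N, B, wp, ws, d1, d2, n_grid=64):
    prob = pulp.LpProblem("fir_min_hamming", pulp.LpMinimize)
    C = [[pulp.LpVariable("c_%d_%d" % (n, b), cat="Binary") for b in range(B)]
         for n in range(N)]
    M = (N - 1) // 2
    for n in range(M):                       # linear phase: A[i] = A[N-1-i]
        for b in range(B):
            prob += C[n][b] == C[N - 1 - n][b]
    # H_n = -C_{n,0} + sum_{b=1}^{B-1} C_{n,b} * 2^-b
    H = [-C[n][0] + pulp.lpSum(C[n][b] * 2.0 ** -b for b in range(1, B))
         for n in range(N)]
    # Linearized Hamming distance: HD_{n,b} >= |C_{n,b} - C_{n+1,b}|
    HD = [[pulp.LpVariable("hd_%d_%d" % (n, b), cat="Binary") for b in range(B)]
          for n in range(N - 1)]
    for n in range(N - 1):
        for b in range(B):
            prob += C[n][b] - C[n + 1][b] <= HD[n][b]
            prob += C[n + 1][b] - C[n][b] <= HD[n][b]
    prob += pulp.lpSum(hd for row in HD for hd in row)     # objective
    # F_w for N odd: a[0] + sum_k 2 H[M-k] cos(wk)
    def F(w):
        return H[M] + pulp.lpSum(2 * H[M - k] * math.cos(w * k)
                                 for k in range(1, M + 1))
    for i in range(n_grid + 1):
        w = math.pi * i / n_grid
        if w <= wp:                          # passband bounds
            prob += F(w) >= 1 - d1
            prob += F(w) <= 1 + d1
        elif w >= ws:                        # stopband bounds
            prob += F(w) <= d2
            prob += F(w) >= -d2
    prob.solve()
    return [pulp.value(h) for h in H]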

2.5. Framework for Low Power Realization of FIR Filters on a Programmable DSP

Various transformations presented in this chapter complement each other and together operate at the algorithmic, architectural, logic and layout levels of design abstraction. Figure 2.28 presents a framework that encapsulates these transformations in the form of a flow-chart. The figure also shows the amount of power savings that each of the transformations achieves. As has been presented in this chapter, the multirate architecture results in 40% overall reduction in the power dissipation. Further reduction of 50% in the address busses, up to 80% reduction in the coefficient data bus and up to 25% reduction in the adder input busses can be achieved using these transformations. The reduction in the power dissipated in the busses also has a direct impact on the reduction in the power dissipated in the multiplier and the adder. The framework thus presents a comprehensive approach to low power realization of FIR filters on programmable DSPs.


Figure 2.28. Framework for Low Power Realization of FIR Filters on a Programmable DSP
[Flowchart summary: starting from the FIR filter specifications, coefficients are synthesized using the Parks-McClellan algorithm; if the filter order is high enough (> 14) and the increase in program and data memory is acceptable, the multirate architecture is applied, reducing computational complexity and giving about 40% overall power reduction; coefficient scaling followed by coefficient optimization gives up to 35% reduction in coefficient data bus power; depending on the architectural support available, selective coefficient negation or coefficient ordering gives up to 88% reduction in coefficient data bus power, with corresponding savings in the multiplier; Gray/T0 coded addressing and bus-invert coding give up to 50% reduction in the address busses plus power reduction in the data busses; adder input bit swapping gives up to 25% reduction in the adder input busses and power reduction in the adder; and, for an embedded DSP with control over the routing of memory-CPU busses, bus bit re-ordering gives an average 54% reduction in the cross-coupling related power in the coefficient data bus. The result is a low power FIR filter on a programmable DSP.]


Chapter 3

IMPLEMENTATION USING HARDWARE MULTIPLIER(S) AND ADDER(S)

If an application's performance requirements cannot be met by a programmable DSP, one option is to use multiple processors. A more area-delay-power efficient solution is to build a processor that is customized for the application. Such a solution typically offers limited capabilities and limited programmability. However, if these are adequate for the application and are of lesser importance than the area-delay-power requirements, such a dedicated solution is the preferred option. The problem of hardware realization from an algorithmic description of an application has been extensively discussed in the literature. Many silicon compilers such as HYPER [14], CATHEDRAL [46] and MARS [103] address this problem of high level synthesis. The domain of high level transformations has also been well studied and many such techniques have been presented in the literature [14, 60]. The problem of high level synthesis for low power has been addressed in [81] and the technique of using multiple supply voltages for minimizing energy has been discussed in [16]. This chapter first presents architectural transforms from the high level synthesis domain and evaluates the applicability of some of these techniques in the context of FIR filters (weighted-sum computation in general). It then presents a detailed analysis of multirate architectures as a transform that reduces the computational complexity of FIR filtering, thereby providing a performance/power advantage.

3.1. Architectural Transformations

The architectural transforms can be classified into two categories - one that alter the Data Flow Graph (DFG) but do not impact the computational complexity, and the other that are focused on reducing the computational complexity. The first type of transformations include pipelining, parallel processing, loop unrolling and retiming [15, 76]. These techniques improve the performance of the DFG. Alternatively, they enable reducing the supply voltage while maintaining the same throughput, thus reducing the power, which is proportional to the square of the supply voltage. This however is achieved at the expense of significant silicon area overhead. Thus for an implementation with a fixed and limited number of hardware resources, these techniques do not offer significant advantage.

The architectures that reduce the computational complexity of FIR filters include block FIR implementations [41, 77] and multirate architectures [66]. The algorithm for block FIR filters presented in [77] performs transformations on the direct form state space structure. It reduces the number of multiplications at the expense of an increased number of additions. Since the multiplier area and delay are significantly higher than the adder area and delay, these transformations result in low power FIR implementation. Block FIR filters are typically used for filters with lower order. Their structures are not as regular as the direct form structure. This results in control logic overhead in their implementations. The multirate architectures [66] reduce the computational complexity of the FIR filter while partially retaining the direct form structure. These architectures can hence enable low power FIR realization on a programmable DSP and also as a dedicated ASIC implementation. The basic two level decimated multirate architecture was presented in the previous chapter; this chapter provides a more detailed analysis of the computational complexity of various multirate architectures and also evaluates their effectiveness in reducing the power dissipation of linear phase FIR filters.

The Differential Coefficients Method [84] is another approach for reducing computational complexity and hence the power dissipation in hardwired FIR filters. The filter structure transformed using this method requires multiplication with coefficient differences having lesser precision than the coefficients themselves. Since the coefficient differences are stored for use in future iterations, this method results in significant memory overhead.

3.2. Evaluating the Effectiveness of DFG Transformations

This section evaluates the effectiveness of various DFG transformations in the context of the 4 tap FIR filter shown in figure 3.1.

Figure 3.2 shows the data flow graph implementing the filter using one multiplier and one adder. The delay of the multiplier is assumed to be 2T and the delay of the adder is assumed to be T. As can be seen from the data flow graph, the filter requires a delay of 9T per output computation.

The delay per output computation can be reduced by replacing the single stage multiplier (with a delay of 2T) by a two stage pipelined multiplier, with each stage requiring a delay of T. The scheduled data flow graph using one pipelined multiplier and one adder is shown in figure 3.3.

As can be seen from the figure, with the pipelined multiplier the delay per output reduces to 6T. If the throughput requirement is 9T per output, the clock frequency of this implementation can be reduced by a factor of 1.5. With the increased clock period, the supply voltage can be lowered, resulting in power reduction.

Figure 3.1. Direct Form Structure of a 4 Tap FIR Filter

Figure 3.2. Scheduled DFG Using One Multiplier and One Adder

The loop-unrolling transform unrolls the FIR computation loop so as to compute multiple outputs in one iteration. The effect of loop unrolling is similar to the parallel processing FIR structure proposed in [60]. Figure 3.4 shows the scheduled data-flow graph of the filter that has been unrolled once.

Figure 3.3. Scheduled DFG Using One Pipelined Multiplier and One Adder

As can be seen from the figure, with one level loop unrolling the delay per output computation reduces to 5T, thus enabling further lowering of supply voltage and hence further power reduction to achieve the throughput of 9T per output.

Retiming has been presented in the literature as a transform that reduces the critical path delay and hence the power dissipation. The direct form structure shown in figure 3.1 has a critical path delay of 5T (three adders and one multiplier). In general, a direct form structure of an N tap filter has a critical path delay of one multiplier and (N-1) adders. The re-timing transform has the same effect as applying the transposition theorem and results in the multiple constant multiplication (MCM) structure shown in figure 3.5.

As can be seen from the figure this structure has a critical path delay of one multiplier and one adder. While this critical path is significantly smaller than the direct form structure, it can be truly exploited only if the filter is to be implemented using many multipliers and adders.

Figure 3.6 shows the scheduled data flow graph of the re-timed filter using one pipelined multiplier and one adder. As can be seen from the figure, this structure has a delay of 5T, which is marginally less than the delay of 6T for the pipelined direct form structure shown in figure 3.3.


Figure 3.4. Loop Unrolled DFG Using One Pipelined Multiplier and One Adder

Figure 3.5. Retimed 4 Tap FIR Filter

The delay per FIR filter output computation can also be reduced by using multiple functional units. This can be considered as parallel processing at a micro level. Figure 3.7 shows the scheduled data flow graph of the direct form structure that uses two pipelined multipliers and one adder.

Figure 3.6. MCM DFG Using One Pipelined Multiplier and One Adder

Figure 3.7. Direct Form DFG Using Two Pipelined Multipliers and One Adder

Figure 3.8. MCM DFG Using Two Pipelined Multipliers and Two Adders

As can be seen from the figure, with one more multiplier the delay per output does reduce to 5T. It is also interesting to note that for the re-timed, MCM based structure, the delay continues to be 5T even if two pipelined multipliers are available. The parallelism inherent to this structure can be truly exploited by using multiple multiplier-adder pairs. Figure 3.8 shows the scheduled data flow graph for the MCM based structure using two pipelined multipliers and two adders.

As can be seen from the figure, using two multiplier-adder pairs reduces the delay to 4T. This analysis shows that the delay per output can be reduced by using multiple functional units. This can be used to lower the supply voltage and hence reduce the power dissipation, if the throughput requirement is the same as that achieved using one multiplier and one adder.

3.3. Low Energy vs Low Peak Power Tradeoff

Many of the transforms discussed above do not impact the computational complexity of the filter. Thus the total capacitance switched per computation is the same across all the transformations. While the impact of these transforms on energy per output and average power dissipation has been presented in the literature, the impact on peak power has not been considered. Since peak power dissipation is also an important design consideration, the impact of these transforms on this factor needs to be looked at carefully.

Figure 3.9. Energy and Peak Power Dissipation as a Function of Degree of Parallelism

This section specifically looks at the parallel processing transform. With a degree of parallelism N, the number of clock cycles required to perform a computation is reduced by a factor of N. To achieve the same delay per output as with degree one, the clock period can be increased by a factor of N. This can be used to lower the supply voltage using the relationship

delay ∝ VDD/(VDD − VT)^2    (3.1)

With the degree of parallelism N, the amount of capacitance switched per cycle goes up by a factor of N. Since the power is proportional to V^2, the peak power dissipation can be reduced only if the supply voltage is reduced by a factor of √N. Figure 3.9 plots both the energy (or average power) and the peak power as a function of the degree of parallelism N for VDD = 3V and VT = 0.7V. As can be seen from the figure, while the energy per output or the average power dissipation reduces with increasing degree of parallelism, the peak power dissipation increases beyond N = 4.
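The trend of figure 3.9 can be reproduced numerically with the short sketch below. The grid search for the lowered supply voltage and the normalizations (energy per output proportional to V^2, peak power proportional to N·V^2) restate the assumptions in the text.

import numpy as np

VDD, VT = 3.0, 0.7
delay = lambda v: v / (v - VT) ** 2          # equation 3.1 (up to a constant)

for N in range(1, 10):
    vs = np.linspace(VT + 0.01, VDD, 10000)
    v_n = vs[delay(vs) <= N * delay(VDD)].min()   # lowest supply meeting N x delay
    energy = (v_n / VDD) ** 2                     # energy per output, normalized
    peak = N * (v_n / VDD) ** 2                   # peak power, normalized
    print(N, round(energy, 2), round(peak, 2))    # peak rises again beyond N = 4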

For a given degree of parallelism N, the following condition should be satisfied for the peak power dissipation to be less than with degree one:

N · VDD/(VDD − VT)^2 ≥ (VDD/√N)/((VDD/√N) − VT)^2    (3.2)

This gives the following relationship between VDD, VT and N:

VDD/VT ≥ (N^{3/4} − 1)/(N^{1/4} − 1)    (3.3)

Figure 3.10 plots this relationship as the lower limit on VDD/VT for no increase in the peak power dissipation with the given degree of parallelism N.

Figure 3.10. Lower Limit of VDD/VT for Reduced Peak Power Dissipation as a Function of Degree of Parallelism

Figure 3.11. One Level Decimated Multirate Architecture: Topology-I

3.4. Multirate Architectures

Section 2.4.2 presented the derivation of a multirate architecture using Winograd's algorithm for reducing the computational complexity of polynomial multiplications. Figure 3.11 shows the signal flow graph of the multirate architecture that uses a decimation factor of two. The architecture processes two input samples simultaneously to produce the corresponding two outputs.

Since the input sequence is decimated by two, each of the sub-filters operates at half the frequency of the input samples. Since different sections of the filter operate at different frequencies, these architectures are called multirate architectures.

Figure 3.12. One Level Decimated Multirate Architecture: Topology-II

The multirate architecture shown in figure 3.11 decimates the input by a factor of two and also decimates the filter by a factor of two. The decimation factors for the input and the filter can be the same or different and can be integers higher than two [66]. Each such combination results in a different multirate architecture. The discussion in this chapter is restricted to the multirate architectures with a decimation factor of two for both the input and the filter. For a given decimation factor, different architecture topologies are possible. Figure 3.12 shows one such architecture that has the same decimation factor of two as the topology-I architecture shown in figure 3.11. It can be seen that the topology-II architecture (figure 3.12) requires two more additions than the topology-I architecture but has sub-filters with different transfer functions.

3.4.1. Computational Complexity of Multirate Architectures

3.4.1.1 Non-linear Phase FIR Filters

Figure 3.13 shows the signal flow graph (SFG) of a direct form FIR filter with non-linear phase. As can be seen from the SFG, an N tap filter requires N multiplications and (N-1) additions per output.

Consider the topology-I multirate architecture. In case of an even number of taps, each of the sub-filters is of length (N/2) and hence requires N/2 multiplications and (N/2)-1 additions. There are four more additions required to compute the two outputs Y0 and Y1. This architecture hence requires 3N/4 multiplications per output, which is less than the direct form architecture for all values of N, and requires (3N+2)/4 additions per output, which is less than the direct form architecture for ((N-1) > (3N+2)/4), i.e. (N > 6).

In case of an odd number of taps, the filter can be converted to an equivalent even tap filter by adding a coefficient of value 0. This coefficient can then be dropped from the decimated sub-filters. This results in two sub-filters (H0 and H0 + H1) of length (N+1)/2 and the third sub-filter (H1) of length (N-1)/2. The multirate architecture thus requires (3N+1)/4 multiplications per output, which is less than the direct form architecture for all values of N, and requires (3N+3)/4 additions per output, which is less than the direct form architecture for ((N-1) > (3N+3)/4), i.e. (N > 7).

Figure 3.13. Signal Flow Graph of a Direct Form FIR Structure with Non-linear Phase

Figure 3.14. Signal Flow Graph of a Direct Form FIR Structure with Linear Phase

3.4.1.2 Linear Phase FIR Filters

For linear phase FIR filters, the coefficient symmetry can be exploited to reduce the number of multiplications in the direct form structure (figure 3.14) by 50%. The direct form structure for a linear phase FIR filter requires N/2 multiplications ((N+1)/2 if N is odd) and N-1 additions.

Phase characteristics of decimated sub-filters

This subsection analyzes the phase characteristics of the decimated sub-filters of the topology-I multirate architecture for the linear phase FIR filter with an even number of taps.

For the filter H(Z) = \sum_{i=0}^{N-1} A[i] · Z^{-i}, linear phase characteristics imply

A[i] = A[N − 1 − i]    (3.4)

If N is even, the three decimated sub-filters have the following Z-domain transfer functions:

H_0(Z) = \sum_{k=0}^{N/2-1} A[2k] · (Z^2)^{-k}    (3.5)

H_1(Z) = \sum_{k=0}^{N/2-1} A[2k+1] · (Z^2)^{-k}    (3.6)

(H_0 + H_1)(Z) = \sum_{k=0}^{N/2-1} (A[2k] + A[2k+1]) · (Z^2)^{-k}    (3.7)

The coefficient symmetry of the sub-filters can be analyzed using the relationship in equation 3.4 to show that the sub-filters H_0 and H_1 do not have linear phase while the sub-filter (H_0 + H_1) does have linear phase characteristics.

Computational complexity - linear phase FIR filters with even number of taps

Since H_0 and H_1 have non-linear phase, they require (N/2) multiplications and (N/2)-1 additions each. Since the H_0 + H_1 sub-filter has linear phase, it requires N/4 multiplications and (N/2)-1 additions if N/2 is even, and requires (N+2)/4 multiplications and (N/2)-1 additions if N/2 is odd.

Thus the topology-I multirate architecture requires per output 5N/8 multiplications and (3N+2)/4 additions if N/2 is even, and (5N+2)/8 multiplications and (3N+2)/4 additions if N/2 is odd. In both cases, the number of multiplications required is more than for the direct form structure. The primary reason for the multirate architecture requiring a higher number of multiplications is the fact that two of the three sub-filters have non-linear phase characteristics.

The topology-II multirate architecture has sub-filters with transfer functions (H_0 + H_1)/2, (H_0 − H_1)/2 and H_1. Since H_0 + H_1 has linear phase, the sub-filter (H_0 + H_1)/2 also has linear phase characteristics. It can be shown that the coefficients of (H_0 − H_1)/2 are anti-symmetric (i.e. A[i] = −A[N−1−i]). This sub-filter hence has the same computational complexity as (H_0 + H_1)/2. This multirate architecture hence requires N/2 multiplications and (3N+6)/4 additions per output if N/2 is even, and (N+1)/2 multiplications and (3N+6)/4 additions if N/2 is odd. While this multirate architecture requires fewer multiplications than the topology-I architecture, it is still not less than the number of multiplications required by the direct form structure.


Table 3.1. Computational Complexity of Multirate Architectures

Filter            Implementation                                         Mults per o/p  Adds per o/p
Non-linear phase  Direct Form                                            N              N-1
                  Multirate-1 level  N even                              3N/4           (3N+2)/4
                                     N odd                               (3N+1)/4       (3N+3)/4
                  Multirate-2 level  N even  N/2 even                    9N/16          (9N+44)/16
                                             N/2 odd                     (9N+6)/16      (9N+50)/16
                                     N odd   (N+1)/2 even                (9N+5)/16      (9N+49)/16
                                             (N+1)/2 odd                 (9N+7)/16      (9N+51)/16
Linear phase      Direct Form        N even                              N/2            N-1
                                     N odd                               (N+1)/2        N-1
                  Multirate-1 level  N even  N/2 even                    N/2            (3N+6)/4
                                             N/2 odd                     (N+1)/2        (3N+6)/4
                                     N odd                               (N+1)/2        (3N+3)/4
                  Multirate-2 level  N even  N/2 even    N/4 even        7N/16          (9N+76)/16
                                                         N/4 odd         (7N+8)/16      (9N+76)/16
                                             N/2 odd                     (7N+10)/16     (9N+70)/16
                                     N odd   (N+1)/2 even  (N+1)/4 even  (7N+7)/16      (9N+57)/16
                                                           (N+1)/4 odd   (7N+8)/16      (9N+57)/16
                                             (N+1)/2 odd   (N-1)/4 odd   (7N+13)/16     (9N+59)/16
                                                           (N-1)/4 even  (7N+9)/16      (9N+59)/16

Thus in case of linear phase FIR filters, one level decimated multirate architectures can at best require the same number of multiplications as the direct form structure, when N/2 is even. They require fewer additions for ((N-1) > (3N+6)/4), i.e. (N > 10).

Computational complexity - linear phase FIR with odd number of taps

In case of a linear phase filter with an odd number of taps, it can be shown that the sub-filters H_0 and H_1 both have linear phase but the sub-filter H_0 + H_1 has non-linear phase characteristics. Since H_0 is of length (N+1)/2 and H_1 is of length (N-1)/2, the two sub-filters together require (N+1)/2 multiplications. The topology-I multirate architecture hence requires (N+1)/2 multiplications and (3N+3)/4 additions per output. Thus for the linear phase FIR filter with an odd number of taps, the one level decimated multirate architecture can at best require the same number of multiplications as the direct form structure. It requires fewer additions for ((N-1) > (3N+3)/4), i.e. (N > 7).

The above analysis (summarized in table 3.1) demonstrates how the multirate architectures can reduce the computational complexity of FIR filters. Each of the sub-filters in the one level decimated architectures (shown in figure 3.11) can be further decimated to reduce the computational complexity even further. Figure 3.15 shows the signal flow graph of a two level decimated multirate architecture. Table 3.1 also presents the computational complexity for two level decimated multirate architectures. It can be noted that the doubly-decimated multirate architectures further reduce the computational complexity of FIR filters.

Figure 3.15. Signal Flow Graph of a Two Level Decimated Multirate Architecture

3.5. Power Analysis of Multirate Architectures

3.5.1. Power Analysis for One Level Decimated Multirate Architectures

The reduction in the computational complexity of multirate architectures can be exploited to achieve low power realization of FIR filters while maintaining the same throughput. In case of CMOS designs, capacitance switching forms the main source of power dissipation. The switching power is given by P_switching = C · V^2 · f, where C is the capacitance charged/discharged per clock cycle, V is the supply voltage and f is the clock frequency [15].

f multi rate

f direct

3N /4 x 15m + (3N + 2)/4 x t5a

N x 15m + (N - 1) x t5a (3.8)

Figure 3.16. Normalized Delay vs Supply Voltage Relationship

The reduced frequency for the multirate architecture directly translates into its lower power dissipation.

The lowering of the frequency has another important advantage. Since the clock period is increased, the logic delays can be correspondingly higher without affecting the overall throughput. In CMOS logic, the supply voltage is one of the factors that affects the delays. The delay dependence on supply voltage is given by the following relationship:

delay ∝ VDD/(VDD − VT)^2    (3.9)

where VDD is the supply voltage and VT is the threshold voltage of the transistor.

Figure 3.16 shows this delay vs Vdd relationship for VT = 0.8V. The delay values are normalized with respect to the delay at Vdd = 5V. Since the multirate architectures allow higher logic delays, the supply voltage can be appropriately lowered. This reduces the power in proportion to the square of the reduction in the supply voltage.

The analysis shown below assumes that the total capacitance charged/discharged per output is proportional to the total area of the multipliers and the adders required to compute each output. Let A_m be the area of a multiplier and A_a be the area of an adder. For an N tap FIR filter with non-linear phase, the total capacitance for the direct form structure is given by:

C_total_direct ∝ (N × A_m + (N−1) × A_a)    (3.10)


The total capacitance for the multirate architecture is given by:

C_total_multirate ∝ (3N/4 × A_m + (3N+2)/4 × A_a)    (3.11)

The capacitance per cycle C_direct for the direct form realization is hence given by

C_direct ∝ (N × A_m + (N−1) × A_a)/f_direct    (3.12)

The capacitance per cycle C_multirate for the multirate architecture is given by

C_multirate ∝ (3N/4 × A_m + (3N+2)/4 × A_a)/f_multirate    (3.13)

It can be noted that if the area ratio A_m/A_a is the same as the delay ratio δ_m/δ_a, and f_multirate is appropriately scaled to maintain the same throughput, the two capacitance values C_direct and C_multirate are the same.

3.5.1.1 Power Analysis - an Example

This subsection demonstrates the above analysis for a 32 tap FIR filter with non-linear phase. Assuming A_m/A_a = δ_m/δ_a = 8 and the direct form FIR filter running at Vdd = 5V, the frequency ratio can be calculated as shown below:

f_multirate/f_direct = (32 × 3/4 × 8 + (3 × 32 + 2)/4)/(32 × 8 + 32 − 1) = 0.75    (3.14)

This implies that the delay can be increased by a factor of 1.33, which translates into lowering of the voltage by a factor of 0.82 (from 5V to 4.1V). Since the area and delay ratios between the multiplier and the adder are the same, C_multirate remains the same. Thus the power reduction using the one-level decimated multirate architecture is given by:

P_multirate/P_direct = (C_multirate/C_direct) × (V_multirate/V_direct)^2 × (f_multirate/f_direct) = 1 × (0.82)^2 × 0.75 = 0.5    (3.15)

The above analysis shows that for a non-linear phase 32 tap FIR filter, the one-level decimated multirate architecture (figure 3.11) results in 50% reduction in the power dissipation.
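The same calculation can be parameterized over the number of taps, as in the Python sketch below; the bisection for the lowered supply voltage and the VT = 0.8V value (taken from figure 3.16) are assumptions of the sketch.

def multirate_power_ratio(N, ratio=8.0, vdd=5.0, vt=0.8):
    # Normalized power of the one level multirate architecture vs direct form
    # for an N tap non-linear phase filter (equations 3.8 to 3.15),
    # with Am/Aa = delta_m/delta_a = ratio so that C_multirate = C_direct.
    f_ratio = (3 * N / 4 * ratio + (3 * N + 2) / 4) / (N * ratio + (N - 1))
    delay = lambda v: v / (v - vt) ** 2
    target = delay(vdd) / f_ratio          # allowed (longer) normalized delay
    lo, hi = vt + 1e-6, vdd                # bisect: delay() decreases with v
    for _ in range(60):
        mid = (lo + hi) / 2
        if delay(mid) > target:
            lo = mid
        else:
            hi = mid
    return (hi / vdd) ** 2 * f_ratio       # (voltage ratio)^2 x (frequency ratio)

print(multirate_power_ratio(32))           # about 0.5, matching equation 3.15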

The amount of power reduction using the multirate architecture is mainly dependent on the amount by which the frequency can be lowered. The lowered frequency not only reduces power directly, but also enables reducing the voltage, which has a bigger impact on power reduction. The frequency ratio relationship presented above indicates that the amount of frequency reduction is dependent on the number of taps and also on the delay ratio δ_m/δ_a. Using this relationship, it can be shown that frequency lowering is possible if (δ_m/δ_a) > (6/N − 1). This relationship indicates that for N > 6 the frequency of the multirate architecture can always be lowered, independent of the (δ_m/δ_a) ratio.


Figure 3.17. Normalized Power Dissipation vs Number of Taps (graphs labeled NL-1D, NL-2D and L-2D; normalized power decreases as the number of taps increases)

3.5.1.2 Power Reduction Using Multirate Architectures

This section presents results in terms of power reduction using the multirate architectures for a dedicated ASIC implementation using a hardware multiplier and an adder. The power analysis uses (A_m/A_a) = (δ_m/δ_a) = 8. The area ratio has been selected to be the same as the delay ratio so as to have C_direct = C_multirate during the power analysis. The actual values of the ratios vary depending on the implementation styles used for the multiplier (e.g. Array, Booth, Wallace Tree etc.) and the adder (e.g. Ripple Carry, Carry Look Ahead, Carry Select etc.). While the amount of power saving differs with different values of the area and delay ratios, the overall trend of reduced power dissipation very much holds good.

Figure 3.17 shows the power dissipation as a function of the number of taps for the following implementations: (i) non-linear phase FIR filter implementation using the one level decimated multirate architecture (graph labeled NL-1D), (ii) non-linear phase FIR filter implementation using the two level decimated multirate architecture (graph labeled NL-2D), and (iii) linear phase FIR filter implemented using the two level decimated multirate architecture (graph labeled L-2D). The power dissipation is normalized with respect to the power dissipation of the direct form realization.


Table 3.2. Comparison with Direct Form and Block FIR Implementations

      Direct Form             Block FIR impl.          Multirate-2 level
N     *s   +s   ops   area    *s    +s   ops    area   *s     +s    ops     area
4     4    3    7     35      2.5   6    8.5    26     2.25   5     7.25    23
5     5    4    9     44      3     8    11     32     3.25   6     9.25    32
6     6    5    11    53      3.5   10   13.5   38     3.75   6.5   10.25   36.5
7     7    6    13    62      4     12   16     44     4.25   7     11.25   41

(*s = multiplications per output, +s = additions per output, ops = total operations per output, area in equivalent adders with A_m/A_a = 8)

The results show that the power dissipation reduces with increasing number of taps in all the 3 cases.

For non-linear phase FIR implementation, the one level decimated multirate architecture results in a power saving of up to 50%. The two level decimated multirate architecture results in a power saving of up to 73%. This reduction is more than the 64% power reduction achieved using the parallel processing technique. The significant point to note is that the power reduction using the multirate architecture requires no datapath area overhead, compared to the 240% area overhead [15] in the parallel processing approach. It can be noted that the multirate architectures however do result in coefficient and data storage overhead. For a non-linear phase N tap FIR filter, the multirate architecture shown in figure 3.11 requires (N/2) more coefficient memory locations and (N/2) more data memory locations, compared to the direct form implementation.

In case of linear phase FIR filters, since the one level decimated multirate architectures do not reduce the number of multiplications per output, the power saving is primarily due to the reduction in the number of additions. For the filter with N (number of taps) = 20, the frequency can be lowered by a factor of 1.03, which translates into a 7% power reduction. For higher values of N, power reduction of up to 9% is achieved. Depending on the number of taps, the two level multirate architectures use an appropriate combination of topology-I and topology-II to minimize the number of multiplications. The two level decimated multirate architectures result in up to 35% power reduction.

Power Reduction - Comparison with Other Architectures

Table 3.2 compares the doubly-decimated multirate architecture with the direct form and the block FIR implementation presented in [77], in terms of the number of operations and the area in equivalent number of adders. The area has been computed assuming A_m/A_a = 8.


As can be seen from the results, the doubly-decimated multirate architecture requires fewer operations and less area than the block FIR implementation. This shows the effectiveness of multirate architectures in reducing power dissipation with minimal area overhead.


Chapter 4

DISTRIBUTED ARITHMETIC BASED IMPLEMENTATION

High speed digital filtering applications generally require dedicated hardwired implementation of the filters. Programmable processors can provide high sample rates only with an excessive amount of parallelism which may not be cost effective.

One approach to meet the performance requirements is to use dedicated hardwired solutions based on hardware multipliers, such as those discussed in the previous chapter. The other approach is to use Distributed Arithmetic [13, 70, 104] based structures which enable high-speed multiplier-less implementations of FIR filters. Distributed Arithmetic (DA) is a bit-serial computational operation that forms an inner (dot) product of a pair of vectors in a single direct step. This is achieved by storing all possible intermediate computations in a look-up-table memory and using it to select an appropriate value for a specified bit vector. The DA structure supports coefficient programmability, i.e. the same filter structure can be used for a different set of coefficients by appropriately programming the look-up-table memory.

In a typical system level design process the overall area and performance goals are partitioned to arrive at area-delay constraints for the various components. If a component can be realized in multiple ways, each representing a different point on the area-delay curve, the partitioning process can explore a much wider search space resulting in an efficient system implementation.

The first part of this chapter focuses on algorithmic transforms that generate different structures for DA based implementation of FIR filters. It evaluates the area-delay tradeoff associated with these transforms in the context of single Adder-Shifter-Accumulator (minimum area) implementations.

This is followed by a look at area-efficient implementations of DA based structures for filters whose coefficient values are known at design time. An implementation using two memory modules is considered. Since the coefficient values are known at design time, the look-up-tables stored in these memory modules can be implemented as hardwired logic blocks. The chapter proposes a coefficient partitioning technique so as to minimize the area of these logic blocks.

The chapter also presents techniques for reducing the power dissipation of the DA based structure. With the primary focus on the power dissipated in the input data shift registers, it proposes a data coding technique to minimize the number of toggles in these registers. For a given profile of input data distribution an optimum coding scheme can be derived so as to minimize power dissipation.

4.1. DA Structures for Area-Delay Tradeoff

Consider the following sum of products:

Y = \sum_{n=1}^{N} A[n] · X[n]    (4.1)

where the A[n]'s are the fixed coefficients and the X[n]'s are K-bit input data words. If each X[n] is a 2's-complement binary number scaled such that |X[n]| < 1, then X[n] can be represented as

X[n] = −b_{n0} + \sum_{k=1}^{K-1} (b_{nk} · 2^{-k})    (4.2)

where the b_{nk} are the bits 0 or 1, b_{n0} is the sign bit and b_{n,K-1} is the LSB. Combining equations 4.1 and 4.2 gives

Y = \sum_{n=1}^{N} A[n] · (−b_{n0} + \sum_{k=1}^{K-1} b_{nk} · 2^{-k})    (4.3)

Interchanging the order of the summations gives

Y = \sum_{k=1}^{K-1} (\sum_{n=1}^{N} A[n] · b_{nk}) · 2^{-k} + \sum_{n=1}^{N} A[n] · (−b_{n0})    (4.4)

Consider the bracketed term in the above expression:

\sum_{n=1}^{N} A[n] · b_{nk}    (4.5)

Since each b_{nk} may take on values of 0 and 1 only, expression 4.5 may have 2^N possible values. Instead of computing these values on-line, they can be precomputed and stored in a look-up-table memory. The input data can then be used to directly access the memory and the result can be added to the accumulator. Y can thus be obtained after K such cycles using K-1 additions.

Figure 4.1. DA Based 4 Tap FIR Filter (one bit from each of X(n), X(n-1), X(n-2) and X(n-3) addresses a 16-word memory holding every partial sum of the coefficients: 0000 -> 0, 0001 -> A3, 0010 -> A2, 0011 -> A2+A3, ..., 1111 -> A0+A1+A2+A3)

Figure 4.1 shows the DA based implementation of a 4 tap FIR filter. The input data values X[n] to X[n-3] are stored in input shift registers. During each cycle the last bits of the registers are used as an address to look up into the coefficient memory and the read value is added to the right shifted accumulator. The shift register chain is then right shifted. Since the input values are stored in 2's complement form, the value read from the coefficient memory during the K'th iteration is subtracted from the right shifted accumulator. The output Y[n] is thus available in the accumulator after every K cycles. Figure 4.1 also shows the coefficient memory map for a 4 tap filter with coefficients A[0] to A[3].
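The cycle-by-cycle behaviour can be captured in a short Python model. The address bit ordering below (bit i of the address taken from tap i) is a convention of this sketch rather than the ordering drawn in figure 4.1, and the inputs are assumed to fit in K-bit 2's complement.

def da_fir_output(x_words, coeffs, K):
    # Behavioural model of equation 4.4: LUT look-up per bit position,
    # shift-accumulate, with the sign-bit cycle subtracted.
    N = len(x_words)
    lut = [sum(coeffs[i] for i in range(N) if (addr >> i) & 1)
           for addr in range(1 << N)]     # all 2^N partial coefficient sums
    address = lambda pos: sum(((x_words[i] >> pos) & 1) << i for i in range(N))
    acc = 0
    for k in range(K - 1, 0, -1):         # bits b_{n,k}, LSB (k = K-1) first
        acc += lut[address(K - 1 - k)] << (K - 1 - k)
    acc -= lut[address(K - 1)] << (K - 1) # K'th cycle: sign bits, subtract
    return acc

# For example, da_fir_output([3, -1, 2, 0], [5, 1, -2, 4], K=8) returns
# 5*3 + 1*(-1) + (-2)*2 + 4*0 = 10, using only look-ups, shifts and adds.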

4.1.1. DA Based Implementation of Linear Phase FIR Filters

Linear phase FIR filters have symmetric coefficients [73], i.e. for an N tap filter A[i] = A[N − 1 − i] for i = 0, 1, ..., N-1. The coefficient symmetry can be exploited to reduce the number of multiplications by half in the direct form realization of FIR filters. This is achieved by first adding the input data values to be multiplied by the same filter coefficient. This mechanism can also be used to reduce the coefficient memory in the DA based implementation of linear phase FIR filters. Since the input data is accessed bit serially, the addition of the data values before the coefficient memory look-up can be achieved using a bit-serial adder.

Figure 4.2 shows the DA based implementation of a 4 tap linear phase FIR filter. Linear phase FIR filters can thus be implemented with a minimal performance overhead of a 1-bit serial addition but with a significantly smaller coefficient memory given by 2^{N/2} if N is even and 2^{(N+1)/2} if N is odd.

Figure 4.2. 4 Tap Linear Phase FIR Filter (a bit-serial adder combines the bits of the input pairs sharing a coefficient; the 4-word memory holds 00 -> 0, 01 -> A1, 10 -> A0, 11 -> A0+A1)

The following subsections present transformations [104] to achieve the area-delay tradeoff. These transforms can be applied to both the non-linear phase and the linear phase FIR filters. Since the area-delay tradeoff is evaluated in the context of single Adder-Shifter-Accumulator based realizations, coefficient memory size is used as the measure of area. With the assumption that the memory output is latched so that memory read and addition can be performed in parallel, the number of additions is used as the measure of delay.

4.1.2. 1-Bit-At-A-Time vs 2-Bits-At-A-Time Access

The DA based implementation shown in figure 4.1 uses one bit at a time (1BAAT) from each input data word. The number of additions can be reduced to K/2 − 1 if two bits at a time (2BAAT) are used to access the coefficient memory. This however results in an exponential increase in the coefficient memory requirement, given by memory(2BAAT) = memory(1BAAT)^2. Figure 4.3 shows the DA based implementation of a 2 tap FIR filter which uses 2BAAT coefficient memory look-up. Figure 4.3 also shows the corresponding coefficient memory map.

This implementation assumes the two most significant bits of the input data to be sign bits. Input data in 2's-complement form can be easily converted to such a representation by sign-extending the data to an odd number of bits and prepending a 0 as the MSB.

The coefficient memory requirement of the DA based implementations shown in figures 4.1 and 4.2 grows exponentially with the filter order (2^N for non-linear phase and 2^(N/2) for linear phase). For example, a 16 tap FIR filter needs 64K (2^16) words of coefficient memory, making it highly inefficient in terms of area and almost impractical to implement. The coefficient memory requirement gets even worse with the 2BAAT scheme shown in figure 4.3.


16 Word Memory
0000    0           1000    2*A0
0001    A1          1001    2*A0+A1
0010    2*A1        1010    2*A0+2*A1
0011    3*A1        1011    2*A0+3*A1
0100    A0          1100    3*A0
0101    A0+A1       1101    3*A0+A1
0110    A0+2*A1     1110    3*A0+2*A1
0111    A0+3*A1     1111    3*A0+3*A1

Figure 4.3. 2 Tap FIR Filter with 2BAAT

The coefficient memory requirements can be reduced by using a technique that employs multiple memory banks.

4.1.3. Multiple Coefficient Memory Banks

For an N tap filter, instead of using all N bits to address a single coefficient memory, the N address bits can be partitioned into two or more groups, each addressing its own coefficient memory.

Let the N address bits be partitioned into M disjoint groups such that \sum_{i=1}^{M} N_i = N, where N_i is the number of address bits in group i. The expression 4.5 can then be written as:

\sum_{n=1}^{N_1} (A[n] \cdot b_{nk}) + \sum_{n=N_1+1}^{N_1+N_2} (A[n] \cdot b_{nk}) + \ldots + \sum_{n=N-N_M+1}^{N} (A[n] \cdot b_{nk})     (4.6)

The expression has M terms corresponding to the M partitions of the address bits. A term corresponding to the i-th group can take 2^Ni different values and can hence be implemented using a memory with 2^Ni words. The resultant DA based implementation, shown in figure 4.4, has M coefficient memory banks with a total memory size of \sum_{i=1}^{M} 2^{N_i}. This number is less than 2^N for all values of M >= 2. This implementation however takes (M·K - 1) additions per computation, which is higher than the (K - 1) required for M = 1.

For a given number of address bits N and a number of groups M into which they need to be partitioned, the optimum partition that results in minimum total memory is given by assigning an equal number of address bits (N/M) to all the groups. The total memory required by the resultant implementation is given by M·2^(N/M).


[Figure: the N address bits are split into groups of N1, N2, ..., Nm bits, each group addressing its own ROM of 2^N1, 2^N2, ..., 2^Nm words]

Figure 4.4. Using Multiple Memory Banks

The above partitioning scheme can be applied when N is an integer multiple of M. If N/M is a non-integer, the optimum partitioning can be achieved by assigning to each group a number of bits N_i such that |N/M - N_i| < 1.
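As a quick illustration, the following C sketch (the filter size and precision are arbitrary assumptions) tabulates the total memory and the additions per output for different numbers of banks M, using the near-equal bit assignment just described:

#include <stdio.h>

int main(void)
{
    const int N = 12;   /* filter taps (address bits) */
    const int K = 16;   /* input data precision       */
    for (int M = 1; M <= 4; M++) {
        /* Assign the N address bits as equally as possible:
           'extra' groups receive one additional bit.         */
        int base = N / M, extra = N % M;
        long words = 0;
        for (int i = 0; i < M; i++)
            words += 1L << (base + (i < extra ? 1 : 0));
        printf("M=%d: %ld words, %d additions per output\n",
               M, words, M * K - 1);
    }
    return 0;
}

For N=12 and K=16 this reproduces the 1B,1Mem through 1B,4Mem rows of Table 4.1.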

4.1.4. Multiple Memory Bank Implementation with 2BAAT Access

Since the use of multiple memory banks increases the number of additions, it is worthwhile to look at the area-delay tradeoff achievable using 2BAAT access. Consider a two memory bank implementation of an N tap FIR filter using 1BAAT access. It requires (2K - 1) additions (where K is the number of bits in the input data) and 2^(N/2+1) words of coefficient memory. With 2BAAT data access, the number of additions can be reduced to (K - 1), which is the same as the number of additions required with a single memory bank. This however increases the total coefficient memory to 2^(N+1), which is in fact more than the memory required for the DA implementation with a single memory bank. Thus the two memory bank implementation with 2BAAT data access does not provide a useful area-delay tradeoff.


Figure 4.5. Multirate Architecture

However, for a three memory bank implementation with 2BAAT data access, the number of additions required (3K/2 - 1) is less than that of the two bank implementation with 1BAAT, and the coefficient memory required (3·2^(2N/3)) is less than that of the single bank implementation with 1BAAT data access. The three bank implementation with 2BAAT data access thus represents a data point on the area-delay curve between the single bank 1BAAT and the two bank 1BAAT DA implementations.

4.1.5. DA Based Implementation of Multirate Architectures

In section 2.4.2, multirate architectures have been presented as computationally efficient structures for implementing FIR filters. Figure 4.5 shows the signal flow graph of a multirate architecture that uses a decimation factor of two.

The multirate architecture (figure 4.5) for an N tap filter uses 3 sub-filters of length N/2. Each of these sub-filters can be implemented using the DA scheme shown in figure 4.1. With such an implementation, each sub-filter requires a coefficient memory of 2^(N/2) words. The total memory requirement is hence 3·2^(N/2). This requirement is 50% more than the memory required (2·2^(N/2)) for the DA based implementation using 2 memory banks. In terms of the number of additions, each sub-filter computation requires (K-1) additions, and four more additions are required to compute the outputs Y0 and Y1. Out of these four additions, the input addition (X0+X1) can be performed bit-serially using the same approach as used in the case of linear phase FIR filters. Thus the multirate architecture requires (3·(K-1) + 3)/2 additions per output. This number of additions is less than the (2·K - 1) additions required for the two memory bank implementation, for all values of K > 2. Figure 4.6 shows the DA based implementation of a 4 tap multirate FIR filter.
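For concreteness, take N = 8 and K = 16: the three length-4 sub-filters need 3·2^4 = 48 words of coefficient memory and (3·15 + 3)/2 = 24 additions per output, against 2·2^4 = 32 words and 2·16 - 1 = 31 additions for the two memory bank implementation. These are exactly the 1B,/2 and 1B,2Mem entries of Table 4.1.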

The number of additions can be further reduced (at the expense of increased coefficient memory) by using 2BAAT access. For an N tap filter, such an implementation requires 3·2^N words of coefficient memory and 3K/4 additions. It can be noted that this memory requirement is higher than that of the single memory DA based implementation with 1-bit-at-a-time access for all values of N, but the number of additions is less for K > 4. Thus the DA implementations of the multirate architecture with both 1BAAT and 2BAAT access result in a meaningful area-delay tradeoff.


[Figure: DA based implementation of the three length-2 sub-filters with bit-serial input addition, producing the outputs Y(n) and Y(n-1)]

Figure 4.6. DA Based 4 Tap Multirate FIR Filter


Here is a look at the DA based implementation of the multirate FIR filter with linear phase and 1-bit-at-a-time data access. Consider an 8 tap filter with coefficients A0, A1, A2, A3, A3, A2, A1 and A0. In the corresponding multirate architecture, the coefficients of the three sub-filters are given by H0: [A0,A2,A3,A1], H1: [A1,A3,A2,A0] and H0+H1: [A0+A1, A2+A3, A2+A3, A0+A1]. It can be noted that both H0 and H1 have the same set of coefficients and can hence share the same coefficient memory of size 2^4. The coefficients of H0+H1 are symmetric and hence need a coefficient memory of size 2^2.

In general, it can be shown that for an N tap filter (with N even), the sub-filters H0 and H1 can share the same coefficient memory of size 2^(N/2) and the sub-filter H0+H1 requires a coefficient memory of size 2^(N/4). The total coefficient memory is thus more than the memory required for the single bank implementation. Since the number of additions is also higher, the DA based implementation of a multirate linear phase FIR filter with 1BAAT data access does not result in a useful area-delay tradeoff.

4.1.6. Multirate Architecture with a Decimation Factor of Three

The multirate architecture described in the earlier section decimates the inputs and the filter coefficients by two. Decimation factors higher than two can also be used to derive other multirate architectures, such as a multirate architecture [66] that uses a decimation factor of three.


In this architecture, the decimated sub-filters H0, H1 and H2 are derived by grouping every third filter coefficient, as shown below:

H(Z) = \sum_{k=0}^{N/3-1} A[3k] \cdot Z^{-3k} + \sum_{k=0}^{N/3-1} A[3k+1] \cdot Z^{-(3k+1)} + \sum_{k=0}^{N/3-1} A[3k+2] \cdot Z^{-(3k+2)}     (4.7)

The input data is also decimated into X0, X1 and X2 in a similar way. The multirate architecture takes three inputs at a time and computes three outputs at a time using the following computations:

a0 = X2 - X1                            b0 = H0
a1 = (X0 - X2·Z^-3) - (X1 - X0)         b1 = H1
a2 = -a0·Z^-3                           b2 = H2
a3 = (X1 - X0)                          b3 = H0+H1
a4 = (X0 - X2·Z^-3)                     b4 = H1+H2
a5 = X0                                 b5 = H0+H1+H2

mi = ai * bi,  i = 0, 1, 2, 3, 4, 5

Y0 = m2 + (m4 + m5)
Y1 = m1 + m3 + (m4 + m5)
Y2 = m0 + m3 + m5

This multirate architecture has six sub-filters of length N/3. Each of these filters can be implemented using the DA based approach, thus requiring a total coefficient memory of 6·2^(N/3) words. These sub-filters require 6(K - 1) additions. There are 10 more additions required, four of which are at the input and can be implemented bit-serially. Thus this architecture requires a total of (6(K - 1) + 6)/3 = 2K additions per output.

The area-delay tradeoff of this architecture with 2BAAT data access can be analyzed in much the same way as for the earlier multirate architecture. It can be shown that with 2BAAT data access this architecture requires K additions per output and 6·2^(2N/3) words of coefficient memory.

For an N tap linear phase filter, where N is an integer multiple of three, it can be shown that the sub-filters H0 and H2 have the same set of coefficients and can hence share the same coefficient memory of size 2^(N/3). Similarly, the sub-filters H0+H1 and H1+H2 have the same set of coefficients and can hence share the same coefficient memory of size 2^(N/3). The sub-filters H1 and H0+H1+H2 have symmetric coefficients and hence require a total of 2·2^(N/6) words of coefficient memory. Thus the total coefficient memory required for the linear phase filter is given by (2^(N/3+1) + 2^(N/6+1)).


4.1.7. Multirate Architectures with Two Level Decimation

Each of the sub-filters in the multirate architectures discussed above can be further decimated to realize multirate architectures with two level decimation. For example, the sub-filters of the architecture shown in figure 4.5 can be further decimated by a factor of two. The resultant architecture has nine sub-filters with N/4 taps each. Each of these sub-filters can be implemented using the DA based approach. The architecture reads in four inputs and computes four outputs. It needs 15 additions in addition to those required to implement the sub-filters. Thus this architecture requires a total coefficient memory given by 9·2^(N/4) and a total number of additions per output given by (9(K - 1) + 15)/4.

Some other two level decimated multirate architectures can also be derived and analyzed for the associated area-delay tradeoff. One such architecture can be obtained by decimating by two the sub-filters of the multirate architecture obtained with a first level decimation factor of three. The resultant multirate architecture has 18 sub-filters with N/6 taps each. Each of these sub-filters can be implemented using the DA based approach. This architecture reads in 6 inputs and computes 6 outputs. It needs 21 additions in addition to those required to implement the sub-filters. Thus this architecture requires a total coefficient memory given by 18·2^(N/6) and a total number of additions per output given by (18(K - 1) + 21)/6.

Further area-delay tradeoff can be achieved by implementing the above two level decimated architectures using 2BAAT data access.

4.1.8. Coefficient Memory vs Number of Additions Tradeoff

Table 4.1 gives the coefficient memory size for 3 non-linear phase FIR filters (with 8, 12 and 18 taps) implemented using the 16 different DA based approaches discussed in this chapter. The table also gives the number of additions required by these 16 approaches for two values of input data precision (K=12 and K=16).

As can be seen from the results, the techniques discussed in this chapter enable achieving different points in the area-delay space for the DA based implementation of FIR filters. For a given filter, some of these points can be eliminated because their memory requirements are very high, or because they require more memory for the same number of additions compared to another implementation. Even with these eliminations, as many as eight meaningful data points can be achieved on the area-delay curve. Figure 4.7 shows the memory vs number-of-additions plots for the 8 tap and 12 tap FIR filters with 16 bits of input data precision.

The following section looks at the DA based implementation of FIR filters whose coefficients are known at design time. It presents a technique to improve the area efficiency of a DA structure that uses two LUTs.


Table 4.1. Coefficient Memory and Number of Additions for DA based Implementations

Implementation    Memory size                          Number of +s
                  N=8      N=12        N=18            K=12     K=16
1B,1Mem           256      2^12        2^18            11       15
2B,1Mem           2^16     2^24        2^36            5        7
1B,2Mem           32       128         1024            23       31
2B,2Mem           512      2^13        2^19            11       15
1B,3Mem           20       48          192             35       47
2B,3Mem           144      768         3 x 2^12        17       23
1B,4Mem           16       32          96              47       63
2B,4Mem           64       256         3072            23       31
1B,/2             48       192         1536            18       24
2B,/2             768      3 x 2^12    3 x 2^18        9        12
1B,/3             44       96          384             24       32
2B,/3             336      1536        6 x 2^12        12       16
1B,/2/2           36       72          240             28.5     37.5
2B,/2/2           144      576         6912            15       19.5
1B,/3/2           56       72          144             36.5     48.5
2B,/3/2           192      288         1152            18.5     24.5

[Plot: Coefficient Memory Size vs Number of Additions for the 8 tap and 12 tap filters]

Figure 4.7. Area-Delay Curves for FIR Filters

It can be noted that the proposed technique is generic and can be extended to other DA structures discussed earlier.

4.2. Improving Area Efficiency of Two LUT Based DA Structures

For a two bank implementation, equal partitions are the most efficient in terms of the total number of rows of the two look-up tables. The total look-up table size for such an implementation has an upper bound given by 2·2^(N/2)·⌈m + log2(N/2)⌉, where N is the number of taps and m is the number of bits of precision of the coefficients.


[Figure: X(n) and X(n-1) address the first memory bank; X(n-2) and X(n-3) address the second]

4 word Memory        4 word Memory
00   0               00   0
01   A1              01   A3
10   A0              10   A2
11   A0+A1           11   A2+A3

Figure 4.8. Two Bank Implementation - Simple Coefficient Split

[Figure: X(n) and X(n-2) address the first memory bank; X(n-1) and X(n-3) address the second]

4 word Memory        4 word Memory
00   0               00   0
01   A2              01   A3
10   A0              10   A1
11   A0+A2           11   A1+A3

Figure 4.9. Two Bank Implementation - Generic Coefficient Split

The number of columns/outputs of the memory modules is a worst-case upper limit and can be reduced depending upon the values of the coefficients.

Coefficient partitioning need not be restricted to a simple split as shown in Figure 4.8 but can be extended to a more generic case as shown in Figure 4.9. If the dynamic range of the filter coefficients in any given partition is small, the overall precision required in the LUTs is less and the implementation area can be reduced.

For filters with fixed coefficient values, the required area can be drastically reduced by removing the redundancy inherent in a memory structure, using a two level PLA implementation or the more efficient multi-level logic optimization. In a two LUT implementation, the functionality of the LUTs depends on the coefficient partitioning. Experiments indicate [86] that 20% to 25% swings in implementation area occur based on the type of partition. Hence this flexibility needs to be explored.

In general, a 2N tap filter can be partitioned in C(2N, N)/2 ways. Clearly, even for a modestly sized 16 tap filter this implies a search space with 6435 partitions. The infeasibility of performing an optimized area mapping for the exhaustive set and then choosing the most efficient partition is at once apparent. A set of heuristics is hence required for estimating the area of different partitions, so as to speed up the search for the most area efficient partition.

4.2.1. Minimum Area Partitions for Two ROM Implementation

Once the number of taps is given, the number of words in the look-up ROM is fixed. Area optimization must therefore seek to reduce the number of columns (i.e. the number of outputs) in it. The ROM size is directly proportional to the number of output columns in the truth-table. Therefore, minimum area results from that partition pair where the total number of output columns is minimum. It can be observed that the maximum bit precision required is determined by the magnitude of the largest positive or negative sum, whichever is larger. The largest positive number to be stored in the look-up ROM is given by the sum of all positive numbers in the partition; the same applies for the largest negative number. The problem of finding the most column-efficient partition can hence be formulated as follows:

Coefficient_set = {partition1_i, partition2_i} | sizeof(partition1_i) = sizeof(partition2_i)

Define,
psum1_i = \sum All_Positives(partition1_i)
psum2_i = \sum All_Positives(partition2_i)
nsum1_i = \sum All_Negatives(partition1_i)
nsum2_i = \sum All_Negatives(partition2_i)
precision1_i = MAX(⌈log2 |psum1_i|⌉, ⌈log2 |nsum1_i|⌉)
precision2_i = MAX(⌈log2 |psum2_i|⌉, ⌈log2 |nsum2_i|⌉)

Minimize (precision1_i + precision2_i) ∀i (i runs over all possible partitions)

For small coefficient sets it is possible to run the algorithm exhaustively and obtain the most efficient partition. However, for larger sets a few heuristic rules need to be used to choose a close-to-most-efficient partition.


Table 4.2. A Few Functions and Their Corresponding Correlations with Actual Area

No.   Function Used                                                      Correlation
1.    Number of '1's in the truth-table                                  27%
2.    ΣΣ (Number of '1's common to the i-th, j-th rows in the
      truth-table) x (Hamming distance between i-th, j-th min-terms)     -4%
3.    Σ MIN (Hamming distance between i-th, j-th columns ∀ (j ≠ i))      28%
4.    Modified Row Hamming Distance based cost function (ROF)            81%
5.    Modified Column Hamming Distance based cost function (COF)         66%

Based on the analysis of the coefficients of various low pass filters with taps ranging from 16 to 40, the following heuristic rule [86] can be used to choose an efficient partition (a sketch of these steps appears below):

Step1: Separate the coefficients into positive and negative sets.
Step2: Sort each set by magnitude.
Step3: Group the top half of each set as the first partition and the remaining as the second partition.
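A minimal C sketch of these three steps; the coefficient values are made-up placeholders, and a real flow would read them from the filter design:

#include <stdio.h>
#include <stdlib.h>

/* Sort helper: descending order of magnitude. */
static int by_magnitude_desc(const void *a, const void *b)
{
    long x = labs(*(const long *)a), y = labs(*(const long *)b);
    return (x < y) - (x > y);
}

int main(void)
{
    long coeff[] = { 37, -12, 55, -3, 21, -44, 9, -18 };
    const int n = sizeof coeff / sizeof coeff[0];
    long pos[8], neg[8], part1[8], part2[8];
    int np = 0, nn = 0, n1 = 0, n2 = 0;

    /* Step 1: separate into positive and negative sets.        */
    for (int i = 0; i < n; i++) {
        if (coeff[i] >= 0) pos[np++] = coeff[i];
        else               neg[nn++] = coeff[i];
    }
    /* Step 2: sort each set by magnitude.                      */
    qsort(pos, np, sizeof pos[0], by_magnitude_desc);
    qsort(neg, nn, sizeof neg[0], by_magnitude_desc);

    /* Step 3: top half of each set forms the first partition.  */
    for (int i = 0; i < np; i++) {
        if (i < np / 2) part1[n1++] = pos[i]; else part2[n2++] = pos[i];
    }
    for (int i = 0; i < nn; i++) {
        if (i < nn / 2) part1[n1++] = neg[i]; else part2[n2++] = neg[i];
    }
    printf("partition 1:");
    for (int i = 0; i < n1; i++) printf(" %ld", part1[i]);
    printf("\npartition 2:");
    for (int i = 0; i < n2; i++) printf(" %ld", part2[i]);
    printf("\n");
    return 0;
}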

4.2.2. Minimum Area Partitions for Hardwired Logic

The infeasibility of an exhaustive search for the best case partition by synthesizing all of them was mentioned earlier. Most synthesis tools [12] like SIS take hours to ruggedly optimize even a ten coefficient truth-table. Estimating the area of a multi-level logic implementation from the truth-table description is a non-trivial problem. A technique for estimating the complexity of synthesized designs from FSM specifications has been proposed in [64]. The variables used by this scheme, such as the number of inputs (which is fixed in this case), the number of state transitions etc., are not applicable in this context. Experiments were conducted [86] with many cost functions, using 40 random partitions each of 8 to 20 tap filters. Table 4.2 lists some of the functions and the average correlation between the expected and actual areas.

Functions 4 and 5 form the basis of the area comparison procedure and will be explained in detail later. Function 1 gives a very naive estimate, assuming that the number of ones is a measure of the number of min-terms that need to be implemented. It does not consider the minimizations that occur because of particular groupings of 1's. However, it could be used effectively for filters with sparsely populated truth-tables. Function 2 is similar to a fan-in type algorithm used for FSMs [19, 63]. It reflects the fact that additional area results when two particular outputs have a larger Hamming distance between their corresponding min-terms. However, the fact that it sums up over all possible combinations of rows results in favorable pairs being overshadowed by area expensive ones. Function 3 tries to group outputs with maximum overlap between them and adds the extra non-overlap cost. However, it does not account for simplifications that could arise from row overlaps. Further, it pairwise sums up all best case column groupings without accounting for the fact that one favorable grouping might exclude the possibility of another one.

4.2.2.1 CF2 : Estimating Area from the Actual Truth-Table

CF2 extracts area information from the truth-table itself. It comprises two factors: the row overlap factor and the column overlap factor.

Row Overlap Factor (ROF)

For an m-bit, n tap filter truth-table, any particular input combination can have a maximum of m + ⌈log2 n⌉ outputs. ROF accounts for the area optimization that results if two column entries in a truth-table can be combined. The ROF computation is as follows:

Step1: Arrange the truth-table with inputs in gray code format (i.e. where successive inputs differ from the previous and subsequent ones in only one bit position).

Step2: Assuming an n-coefficient partition, its N = 2^n truth-table entries are sorted in gray format and labeled from [0..N-1]. The symmetric Hamming distance is then computed as follows:

ROF = \sum_{i=0}^{n-1} \sum_{j=1}^{2^{n-i}-1} \sum_{k=0}^{2^i-1} hd(j·2^i + k, j·2^i - (k+1))     (4.8)

where hd(p, q) represents the Hamming distance between the p-th and the q-th row entries in the modified truth-table. It can be observed that when the Hamming distance between two output rows is larger, the number of input min-terms that can be combined is smaller, and hence the added cost. ROF gives a very high correlation with actual implementation area, but its correlation deteriorates as the column overlap factor begins to dominate. Consider the following simple example:

Case 1, ROF = 16:          Case 2, ROF = 16:
1111111111111111           1010101010101010
0000000000000000           0101010101010101


Clearly, Case 1 would require lesser area because of the greater column overlap, which in turn implies that only one min-term needs to be implemented. To account for this, the Column Overlap Factor (COF) is computed.

Column Overlap Factor (COF)

COF computation is based on the minimum-spanning-tree algorithm [18]. It begins with one output column, tries to locate another one which is closest to it (in terms of maximum '1' overlap), then a third one which is closest to either of them, and so on. In each case it adds to the cost function the amount of non-overlap. Assuming m outputs in the truth-table, COF is computed using Prim's technique [18] for minimum-spanning-tree computation as follows:

The graph G consists of columns as nodes. The edge-weight (ew) is the extra non-overlap cost between a pair of columns, and COF is the sum of the edge-weights of the minimum-spanning-tree. Define,

G = {C_k | C_k -> k-th output column, k = [0, m-1]}
ew_ij = ones(C_j) - overlap(C_i, C_j)

where overlap(C_i, C_j) gives the number of positions where both C_i and C_j have '1' entries in corresponding rows.

Step1: Initialize count = 0; COF = ones(C_0); and the span set as Spantree = {C_0}.
Step2: Repeat Steps 3-4 while count <= m - 1.
Step3: Find C_k ∉ Spantree such that ew_ik is minimum over all C_i ∈ Spantree.
Step4: Increment count; add C_k to Spantree and the edge-weight (the extra non-overlap cost) to COF: COF = COF + ew_ik.

CF2 is computed using a linear combination of COF and ROF. It was observed that CF2 values had as much as 90% correlation with actual areas.

Computation of CF2

A linear weighted combination of normalized COF and ROF (cost function CF2) was tested on truth-tables of filter coefficients generated using the Parks-McClellan algorithm, with taps ranging from 16 to 36.

CF2 = k3 · ROF_norm + k4 · COF_norm     (4.9)

where ROF_norm and COF_norm represent the normalized values of ROF and COF computed for the different partitions. The values of k3 and k4 for maximum correlation were found to be 0.92 and 0.08 respectively. 80% to 90% correlation to actual area was observed. The values of CF2 obtained for the various partitions can therefore be used to obtain a minimized search space of desired size. Figure 4.10 shows a typical correlation trend obtained using CF2 for 25 different 16 tap filter partitions. As can be seen, the two most area efficient partitions could be isolated. The 'kinks' in between correspond to the 'transition zone' where neither the row nor the column overlap factor dominates; there occurs a complex interdependency in row/column simplifications.


[Plot: Normalized CF2 (0.7 to 0.95) vs Actual Area (200 to 800 equivalent NA210 NAND gates)]

Figure 4.10. Area vs Normalized CF2 Plot for 25 Different Partitions of a 16 Tap Filter

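To tie the pieces together, here is a C sketch of the CF2 computation. The truth-table contents and sizes are assumptions for illustration, and the normalization by a fixed worst case is a stand-in, since the book normalizes across the candidate partitions:

#include <stdio.h>

#define NIN  4               /* n: truth-table address bits           */
#define ROWS (1 << NIN)      /* 2^n rows, assumed gray-ordered        */
#define COLS 8               /* m: output columns                     */

static unsigned row[ROWS];   /* row[i]: the COLS output bits of row i */

static int popcount(unsigned v)
{
    int c = 0;
    for (; v; v >>= 1) c += v & 1;
    return c;
}

/* hd(p, q): Hamming distance between two rows. */
static int hd(int p, int q) { return popcount(row[p] ^ row[q]); }

/* ROF per equation 4.8. */
static long rof(void)
{
    long s = 0;
    for (int i = 0; i < NIN; i++)
        for (int j = 1; j <= (1 << (NIN - i)) - 1; j++)
            for (int k = 0; k <= (1 << i) - 1; k++)
                s += hd(j * (1 << i) + k, j * (1 << i) - (k + 1));
    return s;
}

/* COF via a Prim-style growth of the column spanning set. */
static long cof(void)
{
    int col_ones[COLS], in_tree[COLS] = { 0 };
    for (int c = 0; c < COLS; c++) {
        col_ones[c] = 0;
        for (int r = 0; r < ROWS; r++) col_ones[c] += (row[r] >> c) & 1;
    }
    long total = col_ones[0];
    in_tree[0] = 1;
    for (int added = 1; added < COLS; added++) {
        long best = -1;
        int best_c = -1;
        for (int c = 0; c < COLS; c++) {
            if (in_tree[c]) continue;
            for (int t = 0; t < COLS; t++) {
                if (!in_tree[t]) continue;
                int overlap = 0;
                for (int r = 0; r < ROWS; r++)
                    overlap += ((row[r] >> c) & 1) & ((row[r] >> t) & 1);
                long ew = col_ones[c] - overlap;  /* extra non-overlap cost */
                if (best < 0 || ew < best) { best = ew; best_c = c; }
            }
        }
        in_tree[best_c] = 1;
        total += best;
    }
    return total;
}

int main(void)
{
    /* Arbitrary contents standing in for a coefficient LUT. */
    for (int r = 0; r < ROWS; r++)
        row[r] = (r * 37u + 11u) & ((1u << COLS) - 1);
    long R = rof(), C = cof();
    double cf2 = 0.92 * (double)R / (ROWS * COLS)    /* normalized ROF */
               + 0.08 * (double)C / (ROWS * COLS);   /* normalized COF */
    printf("ROF=%ld COF=%ld CF2=%.4f\n", R, C, cf2);
    return 0;
}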

4.2.2.2 CF1: Estimating Area from the Coefficients in Each Partition

Computation of CF2 involves the additional overhead of generating the truth-table of all possible sum combinations for a given partition. For large coefficient sets it is desirable to be able to predict area efficiencies from the coefficient values themselves. The cost function CF1 described below uses the coefficient values themselves to get those partitions from the exhaustive set that are highly likely to give optimum area. Hence, a hierarchical reduction in the search space can be performed by using CF1 on the exhaustive set and CF2 on those partitions screened by CF1.

CF1 is based on the Hamming distance between pairs of coefficients and the total number of ones in all the coefficients. Statistical data relating actual area after SIS simulation to the corresponding truth-table has shown correlation as high as 50% to 60% between the number of ones in the truth-table and the corresponding area. CF1 exploits this basis for estimation. Since the entries in the truth-table are the sum combinations of the coefficients, the Hamming distance between any pair indicates the number of ones that will result from their addition. It was observed that a similar correlation exists between the number of ones in the coefficients themselves and the area. Further, the Hamming distance estimate and the number-of-ones estimate were complementary. A linear combination of the two produced a good primary estimate for reducing the search space to a manageable size. Therefore,

CF1 = k1 · \sum_i \sum_{j ≠ i} hd(c_i, c_j) + k2 · \sum_i ones(c_i)     (4.10)


Table 4.3. ROM Areas as a % of Maximum Theoretical Area

No. of    No. of    Best    Worst    Max.      Base
taps      Coeffs    Area    Area     Saving    Case

24        12        61%     74%      13%       63%
28        14        61%     74%      13%       66%
32        16        66%     76%      10%       68%
32        16        66%     74%      8%        66%
32        16        63%     74%      11%       66%
36        18        65%     70%      5%        65%
40        20        67%     72%      5%        67%

where k1 and k2 are the corresponding weights and c_i represents the i-th coefficient in the partition; hd is the simple Hamming distance between the two input vectors and ones is the number of '1' entries in c_i.

The function was implemented on all possible uniform twin partitions of filters with the number of coefficients ranging from 8 to 20. The values of k1 and k2 that resulted in the highest correlation between the value of CF1 and the actual area were found to be 0.83 and 0.17 respectively. Experiments indicated that the correlation values remained almost the same after the CF1 values obtained for the individual coefficient sets in a given partition were added up and compared with the sum of the individual implementation areas.
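A small C sketch of the CF1 computation of equation 4.10; the 16-bit coefficient images are placeholders, and the weights are the reported k1 = 0.83 and k2 = 0.17:

#include <stdio.h>

static int popcount16(unsigned v)
{
    int c = 0;
    for (v &= 0xFFFFu; v; v >>= 1) c += v & 1;
    return c;
}

int main(void)
{
    /* 16-bit images of the coefficients in one partition. */
    unsigned c[] = { 0x0123, 0xFF40, 0x0A5C, 0xF210, 0x0033, 0x7C01 };
    const int n = sizeof c / sizeof c[0];
    long hd_sum = 0, ones_sum = 0;
    for (int i = 0; i < n; i++) {
        ones_sum += popcount16(c[i]);
        for (int j = 0; j < n; j++)
            if (j != i) hd_sum += popcount16(c[i] ^ c[j]);
    }
    double cf1 = 0.83 * hd_sum + 0.17 * ones_sum;   /* equation 4.10 */
    printf("CF1 = %.2f\n", cf1);
    return 0;
}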

4.2.3. Evaluating the Effectiveness of the Coefficient Partitioning Technique

Here are the results that highlight the effectiveness of the coefficient partitioning technique. Table 4.3 compares the best case, worst case and base case (partitioning by a simple split of the coefficient set) areas for some linear phase filters.

As can be seen, an 8% to 10% area saving can easily be obtained from a good partition. Table 4.4 compares the area required for a ROM implementation with that of a hardwired implementation (using SIS) for different numbers of 16-bit precision coefficients. The area mapping was performed using a library of TSC4000 0.35 µm CMOS standard gates from Texas Instruments [100]. It can be seen that more savings result for a smaller number of coefficients, as the decoder overhead does not decrease proportionately for the ROM even though the memory size decreases. Table 4.5 illustrates the kind of area variations that occur depending on the partitioning of coefficients for some typical filters.


Table 4.4. ROM vs Hardwired Area (Equivalent NA210 NAND Gates) Comparison

No. of     ROM     Hardwired    %
Coeffs.    Area    Area         Saving

4          310     64.25        79%
5          355     96.00        73%
6          447     214.00       52%
7          626     345.50       45%
8          985     650.25       34%

Table 4.5. Area (Equivalent NA210 NAND Gates) Statistics for All Possible Coefficient Partitions

No. of    Best       Worst      Mean       Range    Std.
Coeffs    Area       Area       Area       (%)      Dev.

8         91.25      115.25     100.49     23.88    26.41
14        340.00     420.75     382.58     21.11    118.64
16        759.00     934.00     844.01     20.73    200.00
16        895.50     1127.50    1000.75    23.18    291.88
16        884.50     1041.00    970.37     16.13    174.19
16        689.00     1005.00    795.50     39.72    332.02
18        1285.50    1619.25    1393.12    23.96    369.20

Clearly, the correct choice of partition results in a 20% to 25% area saving, and so a proper algorithm for choosing the ideal partition is altogether justified. CF1 and CF2 were implemented on filters with taps ranging from 8 to 40. For filters with fewer than 20 coefficients, all possible partitions were generated, while for larger ones a comparable number of random partitions were generated. In each case the actual area mapping of the simplified circuit was obtained through SIS simulation.

The following results were obtained [86]:

• 82% to 90% probability of choosing the most area optimal partition using CF1 and CF2.

• Over 95% probability of having the most optimal partition in the search space reduced to 2% of its original size.

• All cases yielded partitions close to minimal area in the reduced search space.


• The CF1, CF2 estimates had greater correlation as the size of the search space increased; a larger sized domain is where CF1 and CF2 have their real application.

• For an 8 input truth-table with 256 rows and 16 output columns, SIS required a CPU time of 350.53s on a Sun SPARC 5 station while the CF2 computation required only 0.15s, a speed advantage of around 2400. Further, this speed advantage increased sharply with filters of higher order.

Table 4.6. Toggle and No-toggle Power Dissipation in Some D Flip-Flops

D Flip-Flop (TI standard cell [100])    Cpd,toggle (pF)    Cpd,no-toggle (pF)    % extra
DTP10                                   0.157              0.070                 124%
DTP20                                   0.180              0.071                 154%
DTP1A                                   0.155              0.069                 138%
DTP10                                   0.167              0.070                 139%

The next section presents techniques for reducing the power dissipation of DA based FIR filters. While these techniques have been discussed primarily in the context of the basic DA based filter structure shown in figure 4.1, they can be extended to other DA based filter structures as well.

4.3. Techniques for Low Power Implementation of DA Based FIR Filters

For the DA based structure shown in figure 4.1, the rightmost bits in the shift registers constitute the address for the LUT. Data is shifted every clock cycle and the LUT outputs are shifted and accumulated. This is done N times, where N is the precision of the input data (and hence the length of the shift registers). At the end of every N clock cycles, the output is tapped at Y. For a 2's complement representation, the Sign Control is always positive except for the MSB, i.e. for the Nth clock cycle.

Substantial power consumption occurs as a result of toggles occurring in the shift registers every clock cycle. Table 4.6 compares the power dissipation for the toggle and the no-toggle (in the data values) cases for four D flip-flops based on a 0.35 µm CMOS technology [100]. From the table it is clear that a technique which reduces the number of toggles in the shift registers would significantly reduce the power dissipation in the design.

For applications where the distribution of data values is known, a data coding scheme can be derived which, for a given distribution profile of data values, results in a smaller number of toggles in the shift registers. The main constraint is to have a scheme which results in toggle reduction with minimal hardware overhead (implying power dissipation overhead as well). The second important constraint is that the coding scheme should be programmable, so that the same hardware can be used for different distribution profiles of data values. The nega-binary scheme discussed here satisfies these two constraints and can be directly incorporated into the structure of the DA based FIR shown in Figure 4.1.

4.3.1. Toggle Reduction Using Data Coding

Any coding scheme that seeks to reduce toggling must meet the following criteria:
1. It should add minimum additional hardware overhead.
2. It should represent the entire range of values of the source data being coded.

The generic nega-binary scheme proposed here meets the above two requirements. It has the added flexibility of choosing, from the several possible nega-binary schemes that meet the above criteria, the one that results in maximum toggle reduction.

4.3.1.1 Nega-binary Coding

Nega-binary numbers [106, 107] are a more generic case of the 2's complement representation. Consider an N bit 2's complement number. Only the most significant bit (MSB) has a weight of -1, while all others have a weight of +1. An N-bit nega-binary number is a weighted sum of ±2^i. As a special case, consider a weighted (-2)^i series, where nb_i denotes the i-th bit in the nega-binary representation of the number.

number = \sum_{i=0}^{N-1} nb_i · (-2)^i     (4.11)

In the above case, powers of 2 alternate in sign. While the 2's complement representation has the range [-2^(N-1), 2^(N-1) - 1], this nega-binary scheme has the range [-(4^⌊N/2⌋ - 1)/3, (4^⌈N/2⌉ - 1)/3]. It can be noted that in general the nega-binary scheme results in a different range of numbers than the 2's complement representation. Thus there can be a number that has an N bit 2's complement representation but does not have an N bit nega-binary representation. This issue is addressed in section 4.3.1.2. Here is a simple example that demonstrates how the nega-binary scheme can result in a reduced number of toggles. Consider the 2's complement number 01010101_B. Using a nega-binary scheme with alternating positive and negative signs (weights -(-2)^i), the corresponding representation will be 11111111_NB. Clearly, while the first case has the maximum possible toggles, the second one has the minimum. If instead the number was 10101010_B, this nega-binary scheme would result in a representation with the same number of toggles as the 2's complement.


[Figure: total range -31 to 31 for the 32 (2^(4+1)) different nega-binary representations; the 4-bit 2's complement range -8 to 7 lies inside it; the all-negative scheme covers -31 to 0 and the all-positive scheme covers 0 to 31]

Figure 4.11. Range of Represented Values for N=4, 2's Complement and N+1=5, Nega-binary

However, a different nega-binary scheme (weights (-2)^i) will have the representation 11111110_NB, with just 1 toggle. Thus it can be noted that different nega-binary schemes have different 'regions' in their entire range which have fewer toggles, and hence, depending on the data distribution, the flexibility exists of choosing a scheme which minimizes toggling without altering the basic DA based FIR filter structure.

In the existing literature [106, 107], the term nega-binary is used specifically for binary representations with radix -2. In this chapter, the definition of the term has been extended to encompass all possible representations obtained by using ±2^i as weights for the i-th bit. Hence for an N-bit precision there exist 2^N different nega-binary schemes.
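A short C sketch of this generalized definition; the sign pattern and the test value are arbitrary examples. It evaluates a bit pattern under a chosen ±2^i weight assignment and counts its toggles, the two quantities a scheme search trades off:

#include <stdio.h>

/* Value of an nbits-wide pattern when bit i carries weight
   sign[i] * 2^i (sign[i] = +1 or -1), generalizing equation 4.11. */
static long nb_value(unsigned bits, const int sign[], int nbits)
{
    long v = 0;
    for (int i = 0; i < nbits; i++)
        if ((bits >> i) & 1u) v += (long)sign[i] << i;
    return v;
}

/* Toggles: adjacent bit positions holding different values. */
static int toggles(unsigned bits, int nbits)
{
    int t = 0;
    for (int i = 0; i + 1 < nbits; i++)
        t += ((bits >> i) & 1u) != ((bits >> (i + 1)) & 1u);
    return t;
}

int main(void)
{
    /* Radix -2 special case: weight of bit i is (-2)^i. */
    int sign[8];
    for (int i = 0; i < 8; i++) sign[i] = (i % 2 == 0) ? +1 : -1;
    unsigned pattern = 0xFE;   /* 11111110, the example from the text */
    printf("value = %ld, toggles = %d\n",
           nb_value(pattern, sign, 8), toggles(pattern, 8));  /* -86, 1 */
    return 0;
}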

4.3.1.2 2's Complement vs Nega-binary Representation

Since the ranges of values for the two representations are different, the bit precision for the nega-binary scheme needs to be increased to N+1. With N+1 bits of precision, when all sign bits are negative, the corresponding nega-binary range is [-2^(N+1) + 1, 0], and likewise when all the sign bits are positive, the range is [0, 2^(N+1) - 1]. All intermediate sign combinations have a range lying between [-2^(N+1) + 1, 2^(N+1) - 1], and each combination represents 2^(N+1) consecutive numbers. The N-bit 2's complement range, being [-2^(N-1), 2^(N-1) - 1], overlaps and completely lies within the N+1 bit nega-binary range for exactly 2^N + 1 different nega-binary representations out of the possible 2^(N+1) total cases. Figure 4.11 illustrates this point for an N=4, 2's complement and N+1=5, nega-binary representation. From a total of 32 different 5-bit nega-binary schemes, 17 schemes cover the 4-bit 2's complement range.

The advantage of using such a scheme is that it enables selecting a nega-binary representation that minimizes the number of toggles in the data values while covering the entire range spanned by its 2's complement counterpart. For a given profile of input data distribution, a nega-binary scheme can be selected, out of the ones which overlap with the 2's complement representation, such that it minimizes the total weighted toggles, i.e. the sum over all data values of the product of the number of toggles in a data value and the corresponding probability of its occurrence.


[Plot: Density Function vs VALUE (-600 to 600)]

Figure 4.12. Typical Audio Data Distribution for 25000 Samples Extracted from an Audio File

Figure 4.12 illustrates the distribution profile of typical audio data extracted from an audio file. The non-uniform nature of the distribution is at once apparent. A nega-binary scheme which has the minimum number of toggles in the data values with very high probability of occurrence will substantially reduce power consumption. Further, each of the 2^N + 1 overlap cases has a different 'region' of minimum toggle over the range, which implies that there exists a nega-binary representation which minimizes the total weighted toggles for a data distribution peaking at any given 'region' in the range. While the relative data distribution of typical audio data is similar to that shown in figure 4.12, its mean can shift depending on factors such as volume control. The flexibility of selecting a coding scheme depending on the 'mean' values is hence very critical for such applications. Section 4.3.1.4 shows that the binary to nega-binary conversion can be made programmable, so that the desired nega-binary representation can be selected (even at run-time) by simply programming a register.

It can be noted that the toggle reduction using nega-binary coding comes at the cost of an extra bit of precision. The amount of saving hence reduces as the distribution becomes more and more uniform. This is to be expected, as any exhaustive N-bit code (i.e. one that comprises all possible combinations of 1s and 0s) will necessarily have the same total number of toggles (summed over all its representations) as any other similar code. Therefore, as the data distribution becomes more and more uniform, i.e. all possible values tend to occur with equal probability, the toggle reduction decreases.

Figure 4.13. Difference in Toggles for N=6, 2's Complement and Nega-binary Scheme: + - - + - + +

Figures 4.13 and 4.14 illustrate the difference in the number of toggles for a 6-bit 2's complement representation and two different 7-bit nega-binary representations, for each data value. Figures 4.15 and 4.16 show two profiles for 6-bit Gaussian distributed data. As can be seen, the nega-binary scheme of figure 4.13 can be used effectively for a distribution like the one shown in figure 4.15, resulting in a 34.2% toggle reduction. Similarly, the nega-binary scheme of figure 4.14 can be used for a distribution like the one shown in figure 4.16, resulting in a 34.6% toggle reduction. Figures 4.13 and 4.14 depict two out of a total of 65 possibilities. Each of these peaks (i.e. has fewer toggles compared to the 2's complement case) differently, and hence, for a given distribution, a nega-binary scheme can be selected to reduce power dissipation.


[Plot: difference in toggles vs VALUE (-40 to 40)]

Figure 4.14. Difference in Toggles for N=6, 2's Complement and Nega-binary Scheme: - + + - + - +

4.3.1.3 Deriving an Optimum Nega-binary Scheme for a Given Data Distribution

The first step is to find a data value which contributes most to the power dissipation, using (Number of Toggles x Probability of Occurrence) as the cost function. The analysis of various Gaussian data distributions indicates that a nega-binary scheme that minimizes the toggles in such a data value is the optimum nega-binary scheme for the given data distribution. For example, for the profile shown in figure 4.15, the data value 22 (6 bit binary representation 010110) contributes most to the power dissipation. The nega-binary representation of 22 using the 7 bit scheme (+ - - + - + +) shown in figure 4.13 is (1111110), which has the least number of toggles. As has been presented earlier, this nega-binary scheme is the most power efficient for the distribution shown in figure 4.15. Here is an algorithm for deriving an (N+1) bit nega-binary representation that minimizes toggles (i.e. the number of adjacent bits with different values) for a given N bit 2's complement number.

Figure 4.15. Gaussian Distributed Data with N=6, Mean=22, SD=6

Procedure Optimum-Nega-binary-Scheme
Input: Array Bit[N] - N bit 2's complement representation of a number, with Bit[0] being the LSB and Bit[N-1] being the MSB.
Output: Array Sign[N+1] - N+1 bit nega-binary sign scheme that minimizes toggles in the number.

for (i = 0 to N-2) {
    if (Bit[i+1] == 1) {
        Sign[i] = '+'
    } else {
        Sign[i] = '-'
    }
}
if (Bit[N-1] == 1) {
    Sign[N-1] = '+'
    Sign[N] = '-'
} else {
    Sign[N-1] = '-'
    Sign[N] = '+'
}


[Plot: Density Function vs VALUE (-40 to 40)]

Figure 4.16. Gaussian Distributed Data with N=6, Mean=-22, SD=6

4.3.1.4 Incorporating a Nega-binary Scheme into the DA Based FIR Filter Implementation

Conversion of a 2's complement number to a nega-binary representation can be done bit-serially. Here is an algorithm for such a bit-serial conversion.

Procedure Binary-to-Nega-binary
Inputs: Array Bit[N] - N bit 2's complement representation of a number, with Bit[0] being the LSB and Bit[N-1] being the MSB. Array Sign[N] - N bit nega-binary sign scheme, with Sign[0] being the sign for the LSB.
Output: Array NegaBit[N] - N bit nega-binary representation of the number.

C[0] = 0;  /* Array C[N+1] holds the intermediate carry variables */
for (i = 0 to N-1) {
    NegaBit[i] = Bit[i] .XOR. C[i]
    if (Sign[i] == '+') {
        C[i+1] = Bit[i] .AND. C[i]
    } else {
        C[i+1] = Bit[i] .OR. C[i]
    }
}
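For reference, here is a direct C transcription of the above procedure, applied to the running example of section 4.3.1.3 (the array sizes and the test value are illustrative):

#include <stdio.h>

#define NB 7   /* bits in the nega-binary representation */

/* Bit-serial 2's complement to nega-binary conversion, following
   the procedure above; bit[0] is the LSB.                        */
static void to_negabinary(const int bit[NB], const char sign[NB], int nega[NB])
{
    int c = 0;   /* C[0] = 0 */
    for (int i = 0; i < NB; i++) {
        nega[i] = bit[i] ^ c;
        c = (sign[i] == '+') ? (bit[i] & c) : (bit[i] | c);
    }
}

int main(void)
{
    /* 22 sign-extended to 7 bits, LSB first: 0 1 1 0 1 0 0.   */
    int bit[NB] = { 0, 1, 1, 0, 1, 0, 0 };
    /* Scheme + - - + - + + (MSB to LSB), stored LSB first.    */
    char sign[NB] = { '+', '+', '-', '+', '-', '-', '+' };
    int nega[NB];
    to_negabinary(bit, sign, nega);
    for (int i = NB - 1; i >= 0; i--) printf("%d", nega[i]);
    printf("\n");   /* prints 1111110, the minimum-toggle form */
    return 0;
}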


[Figure: a 2's complement to nega-binary converter feeds serial data into the shift registers of the DA based FIR structure; a programmable N+1 bit Sign Register drives both the converter and the Sign Control]

Figure 4.17. DA Based FIR Architecture Incorporating the Nega-binary Scheme

The above algorithm can be directly implemented in hardware, resulting in a small area overhead. Data values can be bit-serially converted from a 2's complement representation to a nega-binary representation and loaded into the shift registers. The sign bits can be directly coupled to the Sign Control of the adder shown in Figure 4.1. Figure 4.17 illustrates the complete nega-binary DA based FIR architecture. The sign register is a programmable register which holds the sign combination for the chosen nega-binary scheme. The bit-serial nega-binary computation logic requires just 5 gates and has a worst-case power dissipation equivalent to 3 flip-flops, which is negligible compared to the number of flip-flops in the shift register chain.

It is important to note that a simple difference of the weighted toggle sums obtained for the 2's complement and the nega-binary representation does not give the actual toggle reduction occurring in the concatenated registers. Since the nega-binary registers have N+1 bits of precision, each data value contributes its toggles (N+1)/N times more than the corresponding 2's complement value. Therefore, the nega-binary weighted toggle sum needs to be multiplied by a factor equal to (N+1)/N.

Hence, the power saving can be estimated as follows:

saving = \sum_i p(i) · togs(i) - ((N+1)/N) · \sum_i p(i) · negatogs(i)     (4.12)

where p(i) is the probability of occurrence of a data value i, N is the 2's complement bit-precision used, and togs(i) and negatogs(i) are the number of toggles in the representation of i for the 2's complement case and the nega-binary case respectively.

The above saving computation does not account for 'inter-data' toggles that result from two data values being placed adjacent to each other in the shift register sequence. It may be observed that for a T tap filter with N-bit precision registers an architecture similar to Figure 4.1 would imply a virtual shift register (obtained through concatenating all the individual registers) of length TxN.

Actual shift simulations were performed sample by sample for different data profiles and different numbers of samples, to find the nega-binary scheme that results in maximum saving. These simulations showed that in all cases, the nega-binary scheme that resulted in the best saving was the same as the scheme that resulted in the maximum estimate of power saving. This can be attributed to the observation (based on the simulations) that the contribution due to inter-data toggles is almost identical across the various nega-binary schemes. Hence the power saving estimate, given in equation 4.12, can be used to arrive at the optimum nega-binary scheme. There are two advantages of choosing a nega-binary scheme this way. One, it does not require actual sample by sample data; only an overall distribution profile is sufficient. Two, the run times for computing the best nega-binary scheme are orders of magnitude smaller.
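The following C sketch illustrates this style of estimation. The distribution is a made-up peaked profile, and the radix -2 conversion stands in for whichever of the overlapping (N+1)-bit schemes the search described above would actually select:

#include <stdio.h>

#define N 4   /* 2's complement precision; nega-binary uses N+1 bits */

static int toggles(unsigned bits, int nbits)
{
    int t = 0;
    for (int i = 0; i + 1 < nbits; i++)
        t += ((bits >> i) & 1u) != ((bits >> (i + 1)) & 1u);
    return t;
}

/* (N+1)-digit radix -2 image of v - one candidate scheme. */
static unsigned radix_minus2(int v)
{
    unsigned bits = 0;
    for (int i = 0; i <= N; i++) {
        int d = v & 1;
        bits |= (unsigned)d << i;
        v = (v - d) / -2;
    }
    return bits;
}

int main(void)
{
    double num = 0.0, den = 0.0;
    for (int v = -(1 << (N - 1)); v < (1 << (N - 1)); v++) {
        /* Placeholder distribution peaked at +5; unnormalized
           weights cancel in the final ratio.                   */
        double p = 1.0 / (1 + (v > 5 ? v - 5 : 5 - v));
        int t2c = toggles((unsigned)v & ((1u << N) - 1u), N); /* togs(i)     */
        int tnb = toggles(radix_minus2(v), N + 1);            /* negatogs(i) */
        den += p * t2c;
        num += p * t2c - ((N + 1.0) / N) * p * tnb;           /* eq. 4.12    */
    }
    printf("estimated saving: %.1f%% of the 2's complement toggles\n",
           100.0 * num / den);
    return 0;
}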

4.3.1.5 A Few Observations

• It was observed that for a given type of distribution (e.g. Gaussian, bimodal etc.) there was a fixed trend in the best nega-binary representation for different precisions (i.e. N values). In fact, from a knowledge of the best nega-binary representation for lower values of N, the scheme for higher values could be inductively obtained. Table 4.7 shows the best nega-binary schemes for 5 to 10 bit precision data having a non-zero mean Gaussian distribution. The strong trend in the nega-binary scheme is at once apparent. A similar exercise for different distributions showed such a predictable trend in every case.

• The results given in table 4.7 also indicate that the amount of power saving reduces with increasing bit precision. This trend can be explained as follows. The dynamic range of data with B bits of precision is given by 2^B. With one extra bit of precision (i.e. B+1), the dynamic range doubles (i.e. 2^(B+1)). Since the mean and the standard deviation of the data distribution are specified as a fraction of max, these also scale (double) with an additional bit of precision. Thus a data value D with B bits of precision gets mapped onto two data values (2D) and (2D+1) with (B+1) bits of precision. As can be seen from table 4.7, the nega-binary representation for (B+1) bits of precision is derived by appending '+' to the nega-binary representation for B bits of precision. Thus the nega-binary representations of (2D) and (2D+1) are given by appending 0 and 1 respectively to the B bit nega-binary representation of D. Depending upon the LSB of the B bit nega-binary number, either (2D) or (2D+1) results in an extra toggle. By the same argument, depending on the LSB of the corresponding (B-1) bit 2's complement number, either (2D) or (2D+1) results in an extra toggle. Thus, with one additional bit of precision, the absolute value of the reduction in toggles remains more or less the same. However, since the total number of toggles in the 2's complement representation increases, the percentage power reduction decreases with each additional bit of precision.

• In an actual simulation of the shifting sequence, each data value contributes its total toggle count a number of times equal to the number of shifts for which the entire data value remains in the virtual shift register. This explains the need for the (N+1)/N factor in the saving. This, however, does not account for the toggle contribution during the stages when a data value is partially in or partially out of the shift sequence.

• If the number of taps, T, is increased, the 'inter-data' (i.e. between two consecutive data values in the virtual shift register) toggling contribution increases as a fraction of the 'intra-data' toggling. As a result, the actual saving obtained through a shift register simulation is less than that computed by the saving formula. Similarly, as the bit-precision is reduced, keeping the number of taps constant, once again the 'inter-data' toggling contribution increases as a fraction of its 'intra-data' counterpart. This is consistent with the experimental result for a 16 tap filter, which shows a toggle reduction of 61% for 4-bit precision compared to a reduction of 73% for 8-bit precision.

• As pointed out before, the nega-binary scheme performs well only with peaked distributions. For a symmetrical, uniform distribution the 2's complement scheme is better. This is apparent since the nega-binary scheme is implemented with N+1 bits of precision to take care of the entire range. It is only in a few regions that the toggle count is lower compared to its 2's complement counterpart. A uniform distribution nullifies this effect. Figure 4.18 shows a plot of saving versus the Standard Deviation (SD) expressed as a percentage of the entire span (256 in this case), for N=8, Gaussian distributed data with mean = max/2 (max being the largest positive number represented, which is 127 in this case). The Gaussian distributions for SD=8 and SD=44 are shown in figures 4.19 and 4.20 respectively.

4.3.1.6 Additional Power Saving with Nega-binary Architecture

In addition to the toggle reduction in the shift register, and consequently in the address lines driving the LUT, the nega-binary architecture results in toggle reduction in the LUT outputs as well. Such a reduction, apart from saving power in the adder, also results in substantial power savings in the LUT itself. Table 4.8 shows the number of Repeated Consecutive Addresses (RCAs) to the LUT for the 2's complement and the nega-binary case. It is easy to observe that the number of repeated consecutive addresses in the shift register outputs gives the number of times no toggles occur in the LUT outputs (since the same contents are being read). This toggle reduction is, therefore, independent of the filter coefficients.


Table 4.7. Best Nega-binary Schemes for Gaussian Data Distribution (mean = max/2; SD = 0.17 max)

N             Best Nega-binary Scheme (N+1 bit precision)            Saving
(precision)   2^10 2^9 2^8 2^7 2^6 2^5 2^4 2^3 2^2 2^1 2^0

5                                      +   -   -   -   +   +         25.41%
6                                  +   -   -   -   +   +   +         17.87%
7                              +   -   -   -   +   +   +   +         13.73%
8                          +   -   -   -   +   +   +   +   +         11.16%
9                      +   -   -   -   +   +   +   +   +   +         9.42%
10                 +   -   -   -   +   +   +   +   +   +   +         8.15%

[Plot: Saving (%) vs SD (% of range), 2 to 18]

Figure 4.18. Saving vs SD Plot for N=8, Gaussian Distributed Data with Mean = max/2


[Plot: Density Function vs VALUE (-150 to 150)]

Figure 4.19. Narrow (SD=8) Gaussian Distribution

Table 4.8. Toggle Reduction in LUT (for 10,000 Samples; Gaussian Distributed Data)

TAPS   N   Nega-binary          2's Complement RCAs   Nega-binary RCAs   Toggle Reduction
           Scheme               (% of total)          (% of total)       (% of 2's Comp.)

8      4   + - - + +            1.94%                 41.41%             25.32%
8      6   + - - + - - -        11.60%                38.55%             18.90%
8      8   + - - + - - - - -    2.80%                 24.04%             12.08%
4      4   + - - + +            7.21%                 45.62%             26.75%
4      6   + - - + - - -        9.15%                 36.09%             17.93%
4      8   + - - + - - - - -    8.89%                 29.65%             13.14%

Average 12% to 25% additional toggle reduction in the LUT with the nega-binary architecture.

A few comments need to be made about these numbers.

• 2's complement RCAs were obtained by counting the number of cases (out of a possible 10000×N times the LUT is addressed) where two consecutive addresses were identical. A similar computation was performed for the best nega-binary scheme obtained using the techniques presented in the previous sections (the total number of cases here is obviously 10000×(N+1)).


[Plot: Density Function vs VALUE (-150 to 150)]

Figure 4.20. Broad (SD=44) Gaussian Distribution

• Toggle reduction was computed by finding the difference between the number of times at least one toggle occurred at the LUT output for the two schemes.

• For all the three different precisions a Gaussian distribution with mean = max/2 and an SD = 0.2 max was used (max being the largest positive 2's complement number represented).

4.3.2. Toggle Reduction in Memory Based Implementations by Gray Sequencing and Sequence Reordering

Techniques have been proposed [70, 93] to eliminate shifting in registers to reduce power by storing data as memory bits and using a column decoder to access bits at the desired location. In other words, instead of shifting the data, the pointer is moved. While such a technique reduces power dissipation due to data shifting, it results in additional power dissipation in the column decoder. The following techniques can be used for reducing power in such a shift-less

Figure 4.21. Shiftless Implementation of DA Based FIR with Fixed Gray Sequencing

Gray coded addressing has been presented in the literature [61] as a technique for significantly reducing power dissipation in an address bus, especially in the case of sequential access. Figure 4.21 illustrates a DA based FIR with a fixed gray sequencing scheme. This results in the theoretically minimum possible toggles occurring in the counter output. As can be seen, such an implementation requires no additional hardware in the basic DA structure.

An N bit gray code can be obtained in N! ways (in a gray code any two columns can be swapped to obtain another gray code). This freedom can be exploited to obtain a data specific gray code which minimizes the toggle count as successive bits are selected within the register. This gives a dual power saving: one, in the counter output lines themselves and two, in the multiplexer output which drives the LUT (i.e. the LUT address bus). There is, of course, an additional overhead. Since the register is not scanned sequentially, a simple shift and accumulate cannot be used. Instead, a barrel shifter is required, as shown in Figure 4.22, to shift and accumulate as per the counter sequence.

Figure 4.22. Shiftless Implementation of DA Based FIR with Any Sequencing Possible

As with the best nega-binary scheme, an optimal gray code can be chosen which minimizes the weighted toggle sum using a saving formula very similar to the one used in section 4.3.1.4. However, no extra bit precisions are required.

The limitation of the above scheme is that a complete gray code sequencing is not possible with bit precisions which are not powers of 2. In such cases, a partial gray code can be used, i.e. one where ⌊log2 N⌋ bits are gray coded, N being the data precision. This is so because if a complete gray code were used, values greater than N would occur in the counter sequence, which are meaningless. Partial gray coding obviously reduces the number of different codes that are available, and therefore the address bus power saving, in this case, is smaller.

Table 4.9 shows the weighted toggle data for a 3-bit gray sequencing of N=8 Gaussian distributed data with mean = -max/2 and SD = 0.16 max. As can be seen, the best case toggle reduction is 9.51%. The toggle reduction varies depending on the type of distribution. One important observation is that savings are obtained for symmetric Gaussian as well as symmetric bimodal data, where the nega-binary performance deteriorates substantially. A nega-binary scheme could also be used in conjunction with gray sequencing to obtain additional power savings. It can be noted that inter-data toggles need not be considered in this case.


Table 4.9. Comparison of Weighted Toggle Data for Different Gray Sequences

Weighted Toggles for Simple Consecutive Sequencing = 3.786 (Base Case)

No.   Gray Sequence Used   Wt. Toggles   % Saving
1.    01326754             3.439         9.17%
2.    02315764             3.652         3.54%
3.    01546732             3.556         6.08%
4.    02645731             3.619         4.14%
5.    04513762             3.426         9.51%
6.    04623751             3.702         2.22%

Best Case Gray Sequence Saving = 9.51%
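The six candidate sequences of Table 4.9 are exactly the bit-column permutations of the 3-bit reflected Gray code. A minimal sketch (ours, illustrative; the data-dependent toggle-weighting step is omitted) of how such candidates can be enumerated:

from itertools import permutations

def gray_sequence(n_bits, col_order):
    # Reflected Gray code with its bit columns permuted by col_order;
    # any column permutation of a Gray code is again a Gray code.
    seq = []
    for i in range(1 << n_bits):
        g = i ^ (i >> 1)                   # standard reflected Gray code
        seq.append(sum(((g >> b) & 1) << p
                       for b, p in enumerate(col_order)))
    return seq

# All 3! = 6 Gray sequences obtainable by column swaps; gray_sequence(3,
# (0, 1, 2)) yields 0 1 3 2 6 7 5 4, i.e. sequence no. 1 in Table 4.9.
candidates = [gray_sequence(3, p) for p in permutations(range(3))]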

Tables 4.7, 4.8 and 4.9 highlight the effectiveness of the proposed techniques in reducing power dissipation in the DA based implementation of FIR filters. Here are some more results on power savings obtained for different numbers of bits of data precision and different distribution profiles of data values. Table 4.10 shows the percentage reduction in the number of toggles for two different Gaussian distributions.

Table 4.10. Toggle Reduction as a Percentage of the 2's Complement Case for Two Different Gaussian Distributions

Dist.   Best Nega-binary Scheme   N    TR1 (%)   TR2 (%)
1       +--++                     4    49.75     41.96
1       +--+++                    5    35.63     28.95
1       +--+++-                   6    28.34     24.04
1       +--++++--                 8    20.88     16.94
1       +--++++----               10   16.59     12.80
1       +--++++------             12   13.52     7.17
2       -++--                     4    42.41     32.51
2       -++--+                    5    34.07     26.74
2       -++--++                   6    27.71     22.06
2       -++--++++                 8    19.85     16.12
2       -++--++++++               10   15.39     12.63
2       -++--++++++++             12   12.55     8.68

TR1 is the weighted toggle reduction as computed using the saving formula; TR2 is the percentage toggle reduction obtained by using 25,000 actual samples (i.e. it accounts for the 'inter-data' toggles as well as the other factors mentioned in section 4.3.1.5) in an 8 tap filter. The predictable trend in the best case nega-binary scheme for different precisions is at once apparent. Further, it can be observed that as the precision increases, the TR1 and TR2 values approach each other, for the reasons mentioned in section 4.3.1.5.


Table 4.11. Toggle Reduction with Gray Sequencing for N = 8 and Some Typical Distributions

Data Distribution Used (range = [-128, 127])
Type               Mean       SD   Best Gray Sequence   Toggle Reduction (%)
Gaussian           64         20   04513762             9.46%
Gaussian           -64        20   04513762             9.52%
Gaussian           0          20   02315764             3.46%
Gaussian           64         56   01326754             1.92%
Gaussian Bimodal   -64, +64   20   04513762             9.49%


Table 4.11 shows the best case gray sequencing toggle reduction, in the LUT address bus, obtained for 8-bit precision data with five different distributions. The first four are Gaussian, and the last one is a Gaussian bimodal distribution. As was pointed out before, the toggle reduction decreases as the distribution becomes more and more uniform (i.e. as SD increases).

With gray sequencing, toggle reductions are obtained even with bimodal distributions. Nega-binary representations with low toggle regions symmetrically distributed about the origin do not exist, and therefore in the case of such distributions the nega-binary architecture does not give good results.


Chapter 5

MULTIPLIER-LESS IMPLEMENTATION

Many DSP applications involve linear transforms whose coefficients are fixed at design time. Examples of such transforms include DCT, IDCT and color space conversion kernels such as RGB-to-YUV. Since the coefficients are fixed, the flexibility of a multiplier is not necessary and an efficient implementation of such transforms can be obtained using adders and shifters. This chapter presents techniques for area efficient implementation of fixed coefficient 1-D and 2-D linear transforms.

A 2-D linear transform that transforms N inputs to generate M outputs can be performed using matrix multiplication as shown below:

[ Y[1] ]   [ A[1,1]  A[1,2]  ...  A[1,N] ]   [ X[1] ]
[ Y[2] ] = [ A[2,1]  A[2,2]  ...  A[2,N] ] · [ X[2] ]
[ ...  ]   [  ...     ...    ...   ...   ]   [ ...  ]
[ Y[M] ]   [ A[M,1]  A[M,2]  ...  A[M,N] ]   [ X[N] ]

A 1-D transform such as an FIR filter can be treated as a special case of a 2-D transform with M = 1.

It can be noted that a 2-D linear transform involves two types of computation.

1 Weighted-sum: An MxN 2-D transform can be computed as M 1-D transforms, each of length N. A 1-D transform is computed as the weighted-sum of N inputs with the weights given by the rows of the transformation matrix.

2 Multiple Constant Multiplication (MCM): An MxN 2-D transform involves each of the N inputs being multiplied by M constants (columns of the matrix). It can be noted that the transposed FIR filter structure also performs MCM type computation where, during each output computation, the latest data sample is multiplied by all the filter coefficients. The two views are sketched below.
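The two computation types correspond to the two natural evaluation orders of the matrix product. The following sketch (ours, with illustrative names) contrasts them:

def transform_rowwise(A, X):
    # Weighted-sum view: each output Y[i] is the weighted sum of all inputs
    # using row i of the matrix.
    return [sum(a * x for a, x in zip(row, X)) for row in A]

def transform_colwise(A, X):
    # MCM view: each input X[j] is multiplied by all M constants in column j
    # and the partial products are accumulated into the outputs.
    Y = [0] * len(A)
    for j, x in enumerate(X):
        for i in range(len(A)):
            Y[i] += A[i][j] * x
    return Y

# Both return the same Y; they differ in which constants share a multiplier.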


Figure 5.1. Data Flow Graph for a 4-term Weighted-sum Computation

This chapter presents techniques for minimizing additions in both these forms of computation using a common subexpression elimination approach. Such a technique for MCM based computation has been proposed in [78]. The technique presented in this chapter is different in that it extracts only 2-bit common subexpressions during each iteration. Such an approach provides higher flexibility, enabling a more area efficient implementation.

While this chapter focuses on the two types of computations separately, the two techniques can be combined in the context of 2-D transforms to achieve an area efficient implementation. Such a combined optimization strategy has been presented in [75].

The output from the common subexpression elimination based optimization is a data flow graph with add and shift operators. The precision of the variables in the data flow graph varies significantly, especially for a higher number of inputs. This chapter also looks at the high level synthesis of such multi-precision data flow graphs.

5.1. Minimizing Additions in the Weighted-sum Computation

5.1.1. Minimizing Additions - an Example

Consider a 4 term weighted-sum computation (figure 5.1) with the coefficients A0 (= 0.0111011), A1 (= 0.0101011), A2 (= 1.0110011) and A3 (= 1.1001010) represented in 2's complement 8 bit fixed point format. The output Y can be computed as

Y = A0 · X3 + A1 · X2 + A2 · X1 + A3 · X0    (5.1)

Replacing the multiplications by additions and shifts, gives

Y = X3 + X3<<1 + X3<<3 + X3<<4 + X3<<5 +
    X2 + X2<<1 + X2<<3 + X2<<5 +
    X1 + X1<<1 + X1<<4 + X1<<5 - X1<<7 +
    X0<<1 + X0<<3 + X0<<6 - X0<<7    (5.2)

The above computation requires 15 additions, 2 subtractions and 15 shifts. However, Y can instead be computed in terms of X23 (= X2 + X3) and X01 (= X0 + X1) as shown below:

Y = X3<<4 + X1 + X1<<4 + X1<<5 + X0<<3 + X0<<6 +
    X23 + X23<<1 + X23<<3 + X23<<5 + X01<<1 - X01<<7    (5.3)

The above computation requires 12 additions (including 2 required to compute X23 and X01), 1 subtraction and 10 shifts. The number of additions can be further reduced by precomputing X123 = X1 + X23. Y can then be computed as

Y = X3<<4 + X1<<4 + X0<<3 + X0<<6 +
    X23<<1 + X23<<3 + X01<<1 - X01<<7 + X123 + X123<<5    (5.4)

The above computation requires 11 additions (including 3 required to compute X01, X23 and X123), 1 subtraction and 9 shifts. Y can also be computed in terms of X13 (= X1 + X3) and X02 (= X0 + X2) as follows:

Y = X3<<3 + X2 + X2<<5 - X1<<7 + X0<<6 - X0<<7 +
    X13 + X13<<1 + X13<<4 + X13<<5 + X02<<1 + X02<<3    (5.5)

The above computation requires 11 additions (including 2 required to compute X13 and X02), 2 subtractions and 10 shifts. It can be noted that (X13<<4 + X13<<5) can be computed as ((X13 + X13<<1)<<4). The number of additions can thus be further reduced by precomputing X13_01 = X13 + X13<<1. Y can then be computed as

Y = X3<<3 + X2 + X2<<5 - X1<<7 + X0<<6 - X0<<7 +
    X13_01 + X13_01<<4 + X02<<1 + X02<<3    (5.6)

The above computation requires 10 additions (including 3 required to compute X13, X02 and X13_01), 2 subtractions and 9 shifts (including 1 required to compute X13_01).

The above example shows techniques for reducing the number of additions+subtractions (17 to 12, a 29% reduction in this case) for implementing the weighted-sum computation.
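A quick numeric check (our sketch, not from the book) that the final shift-add form of equation 5.6 matches the direct weighted sum; the decimal coefficient values follow from the 8 bit 2's complement encodings given above.

import random

A0, A1 = 0b0111011, 0b0101011            # 0.0111011 and 0.0101011
A2 = -(1 << 7) + 0b0110011               # 1.0110011: MSB carries -2^7
A3 = -(1 << 7) + 0b1001010               # 1.1001010

for _ in range(1000):
    X0, X1, X2, X3 = (random.randrange(-128, 128) for _ in range(4))
    direct = A0 * X3 + A1 * X2 + A2 * X1 + A3 * X0
    X13, X02 = X1 + X3, X0 + X2
    X13_01 = X13 + (X13 << 1)
    y = ((X3 << 3) + X2 + (X2 << 5) - (X1 << 7) + (X0 << 6) - (X0 << 7)
         + X13_01 + (X13_01 << 4) + (X02 << 1) + (X02 << 3))
    assert y == direct                   # equation 5.6 reproduces equation 5.1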


5.1.2. 2 Bit Common Subexpressions

The minimization techniques presented in the above example are based on finding common subexpressions (2 bit patterns), computing the subexpressions first and using the result to compute the filter output. X01, X23, X02, X13, X123 and X13_01 in the equations above are examples of these 2 bit subexpressions. The common subexpressions used in the minimization of additions are of two types.

(i). Common Subexpressions Across Coefficients (CSACs), which are identified between 2 coefficients both having 1s in more than one bit location. For example, the coefficients A0 and A1 both have 1s in the bit locations 0, 1, 3 and 5 (bit location 0 being the LSB), resulting in the common subexpression X23. It can be noted that the number of bit locations in which the common subexpression appears directly decides the reduction in the number of additions/subtractions and shifts. For example, the subexpression X23, which appears at four bit locations, results in reducing the number of additions by three and also reducing the number of shifts by three.

(ii). Common Subexpressions Within a Coefficient (CSWCs), which are identified in a coefficient bit representation having multiple instances of a 2 bit pattern. For example, the 2 bit pattern '11' appears twice in the coefficient A2, at locations 0,1 and 4,5 (bit location 0 being the LSB), resulting in a common subexpression. The multiplication (A2 · X1), given by (X1 + X1<<1 + X1<<4 + X1<<5 - X1<<7), can be implemented using X1_01 (= X1 + X1<<1) as (X1_01 + X1_01<<4 - X1<<7), resulting in a reduction in the number of additions by one and the number of shifts by one. It can be noted that the amount of reduction in the number of additions and shifts depends directly on the number of instances of the 2 bit pattern in the coefficient bit representation.

The above mentioned subexpression types are further divided into sub-types so as to handle different representation schemes, such as CSD, in which the coefficient bits can take values 0, 1 and -1. The CSACs have two subtypes: (i). CSAC++, in which the two coefficients have non-zero values in more than one bit location, and the values are either both 1s or both -1s, and (ii). CSAC+-, in which the two coefficients have non-zero values in more than one bit location, and the values are either 1 and -1 or -1 and 1. The CSWC subtypes CSWC++ and CSWC+- can be defined similarly.

5.1.3. Problem Formulation

Each coefficient multiplication in the weighted-sum computation is represented as a row of an NxB matrix, where N is the number of coefficients and B is the number of bits used to represent the coefficients. The total number of operations (additions + subtractions) required to compute the output is given by the total number of non-zero bits in the matrix less one. The coefficient matrix for the 4 term weighted-sum mentioned above is shown below.

        7    6    5    4    3    2    1    0
A0      0    0    1    1    1    0    1    1
A1      0    0    1    0    1    0    1    1
A2     -1    0    1    1    0    0    1    1
A3     -1    1    0    0    1    0    1    0

The iterative 2 bit common subexpression elimination algorithm works on this matrix in two phases. In the first phase, common subexpressions across coefficients (CSACs) are searched. The matrix is updated at the end of every iteration to reflect the common subexpression elimination in that iteration. The matrix is updated by adding a new row representing the 2 bit common subexpression. The bit values in the new row are set depending on the locations in which the subexpression was identified and the coefficient bit values in these locations. The two coefficient rows are also updated so as to set to zero the bit values at the locations in which the common subexpression was identified.

For example, consider the common subexpression XA23 between coefficients A2 and A3 at bit locations 7 and 1 (0 being the LSB). This subexpression can be eliminated so as to reduce the number of additions required to compute the multiplication of coefficients A2 and A3 with the corresponding input data values. The updated coefficient matrix after eliminating this subexpression is shown below.

        7    6    5    4    3    2    1    0
A0      0    0    1    1    1    0    1    1
A1      0    0    1    0    1    0    1    1
A2      0    0    1    1    0    0    0    1
A3      0    1    0    0    1    0    0    0
XA23   -1    0    0    0    0    0    1    0

Considering that each 2 bit common subexpression requires 1 addition, the total number of additions + subtractions for a coefficient matrix at any stage in the iterative minimization process is given by the total number of non-zero bits plus the number of subexpressions, less 1. For example, the computation represented by the above coefficient matrix requires 16 additions + subtractions (16 + 1 - 1).

The first phase of the minimization process terminates when no more common subexpressions across coefficients are possible. The updated coefficient matrix is then searched for common subexpressions (CSWCs) within each row. These subexpressions are then eliminated to complete the second phase of the minimization algorithm.


Figure 5.2. Coefficient Subexpression Graph for the 4-term Weighted-sum Computation

5.1.4. Common Subexpression Elimination

The coefficient matrix at any stage in the iterative minimization process typically has more than one common subexpression. For example, the coefficient matrix (shown in section 5.1.3) has a common subexpression across every pair of coefficients. Elimination of each of these subexpressions results in a different amount of reduction in the number of additions+subtractions and also affects the common subexpressions for the subsequent iterations. For example, while the number of additions+subtractions is reduced by one when the XA23 common subexpression is eliminated, it is reduced by three when the XA01 or XA02 common subexpressions are eliminated. The choice of a common subexpression for elimination during each iteration thus affects the overall reduction in the number of additions+subtractions.

The steepest descent approach can be adopted to select the common subexpression for elimination. In this approach, during every iteration the common subexpression that results in the maximum reduction in the number of additions+subtractions is selected. Such a common subexpression is identified by first constructing a fully connected graph with its nodes representing the rows in the coefficient matrix. Each pair of nodes in the graph has two edges (E++ and E+-) representing CSAC++ and CSAC+- between the rows represented by the nodes. These edges are assigned weights to indicate the number of times the subexpression (represented by the edge) appears between the two coefficient rows represented by the two end nodes of the edge. Figure 5.2 shows such a common subexpression graph for the coefficient matrix of the 4 tap filter example. Since all E+- edges have 0 weight, only E++ type edges are shown in the graph.

The subexpression corresponding to the edge with the highest weight is selected for elimination during each iteration. In case there is more than one edge with the same highest weight, a one level lookahead is used to decide on the subexpression to be eliminated. The lookahead mechanism works as follows:


For each edge Eij, its end nodes i and j and all the other edges connecting to the end nodes are deleted from the graph. The modified graph is searched to find the edge with the highest weight. This weight is assigned as the one level lookahead weight for the edge Eij. The subexpression corresponding to the edge with the highest one level lookahead weight is selected for elimination.

5.1.5. The Algorithm

Here is the algorithm for minimizing the number of additions using the common subexpression elimination technique.

Procedure Minimize-Additions-for-Weighted-sum
Input: N filter coefficients represented using B bits of precision
Output: Data flow graph representation of the weighted-sum computation, with the nodes of the flow graph restricted to add, subtract and shift operations.

/* Process the given set of coefficients */
Eliminate coefficients with no non-zero bits (i.e. value 0)
Merge the coefficients with the same value. This applies to transforms such as linear phase FIR filters with symmetric coefficients.
/* Phase I */
Construct the initial coefficient matrix of size NxB, where N is the number of coefficients after the processing and B is the number of bits used to represent the coefficients.
repeat {
    Construct the subexpression graph
    Find the edge with the highest weight and the highest one level lookahead weight
    Update the coefficient matrix so as to eliminate the common subexpression
} until (highest weight < 2)
/* Phase II */
for the first B bits of each row {
    Extract all subexpressions of type CSWC++ and CSWC+-
    for each subexpression {
        Find the bit distance - given by the distance between the bit locations of the subexpression
    }
    Find pairs of subexpressions of the same type and with equal bit distances
    Eliminate the common subexpressions by representing one of the subexpressions as a shifted version of the other subexpression
    Update the coefficient matrix to reflect this elimination
}
Output the data flow graph in terms of shifts, additions and subtractions.
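A compact executable sketch of the Phase I loop is shown below. It is our illustration under simplifying assumptions (ties broken by scan order; the one level lookahead is omitted), not the book's implementation, and the function names are hypothetical.

def best_csac(matrix):
    # Heaviest CSAC edge: for each row pair, count locations where both
    # rows are non-zero with equal values (CSAC++) or opposite (CSAC+-).
    best = (0, 0, 0, 0, [])
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            for sign in (1, -1):         # 1: CSAC++, -1: CSAC+-
                locs = [b for b in range(len(matrix[i]))
                        if matrix[i][b] != 0
                        and matrix[j][b] == sign * matrix[i][b]]
                if len(locs) > best[0]:
                    best = (len(locs), i, j, sign, locs)
    return best

def eliminate_csacs(matrix):
    subexprs = 0
    while True:
        weight, i, j, sign, locs = best_csac(matrix)
        if weight < 2:                   # termination test of Phase I
            break
        new_row = [0] * len(matrix[0])
        for b in locs:                   # record the pattern, clear the rows
            new_row[b] = matrix[i][b]
            matrix[i][b] = matrix[j][b] = 0
        matrix.append(new_row)
        subexprs += 1
    ops = sum(1 for row in matrix for v in row if v != 0) + subexprs - 1
    return matrix, ops                   # additions+subtractions remaining

# For the 4-term example (bit 7 leftmost), eliminate_csacs reports 12
# operations before Phase II, matching the count reached in section 5.1.1:
# M = [[0,0,1,1,1,0,1,1], [0,0,1,0,1,0,1,1],
#      [-1,0,1,1,0,0,1,1], [-1,1,0,0,1,0,1,0]]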


Table 5.1. Number of Additions+Subtractions (Initial and After Minimization)

Filter # taps   Initial # +/-s   Weighted-sum # +/-s   % reduction
16              51               37                    27.5%
24              83               58                    30.1%
32              95               65                    31.6%
36              88               67                    23.9%
40              120              84                    30.0%
48              123              92                    25.2%
64              169              116                   31.4%
72              201              130                   35.3%
96              225              157                   30.2%
128             270              191                   29.6%

Table 5.1 gives the results in terms of the number of additions + subtractions for the weighted-sum computation performed in the context of 10 low pass FIR filters. The number of taps of these filters ranges from 16 to 128. These filters have been synthesized using the Parks-McClellan algorithm and the coefficients are quantized to 16 bit fixed-point format. The initial number of operations for these filters corresponds to a coefficient representation scheme in which the non-zero bits of a coefficient are either all 1s or all -1s. Table 5.1 shows that the common subexpression elimination algorithm reduces the number of additions+subtractions by as much as 35%.

5.2. Minimizing Additions in MCM Computation

5.2.1. Minimizing Additions - an Example

Consider the 4-term MCM computation shown in figure 5.3 with four coefficients A0 (= 0.0111011), A1 (= 0.0101011), A2 (= 1.0110011) and A3 (= 1.1001010) represented in 2's complement 8 bit fixed point format.

The computation of the outputs Y0, Y1, Y2 and Y3 is given by

Y0 = A0 · X    (5.7)
Y1 = A1 · X    (5.8)
Y2 = A2 · X    (5.9)
Y3 = A3 · X    (5.10)

Replacing the multiplications by additions and shifts, gives

Y0 = X + X<<1 + X<<3 + X<<4 + X<<5    (5.11)
Y1 = X + X<<1 + X<<3 + X<<5    (5.12)
Y2 = X + X<<1 + X<<4 + X<<5 - X<<7    (5.13)
Y3 = X<<1 + X<<3 + X<<6 - X<<7    (5.14)

Figure 5.3. Data Flow Graph for 4 term MCM Computation

The above computation requires 12 additions, 2 subtractions and 15 shifts. As can be seen from figure 5.3, these intermediate values are fed to the 3 adders to complete the MCM computation. Thus the output computation requires 15 additions, 2 subtractions and 15 shifts, which is the same computational complexity as required for the output computation shown in equation 5.2.

The number of additions/subtractions in the MCM computation can be minimized using the techniques given below. This can be achieved by precomputing X_01 (= X + X<<1) and using it to compute the 4 intermediate outputs as follows:

Y0 = X_01 + X<<3 + X<<4 + X<<5    (5.15)
Y1 = X_01 + X<<3 + X<<5    (5.16)
Y2 = X_01 + X<<4 + X<<5 - X<<7    (5.17)
Y3 = X<<1 + X<<3 + X<<6 - X<<7    (5.18)

The above computation requires 10 additions (including 1 addition required to compute X_01), 2 subtractions and 13 shifts (including 1 shift required to compute X_01). The number of additions can be further reduced by precomputing X_45 (= X<<4 + X<<5), and computing Y0 to Y3 as follows:

Y0 = X_01 + X<<3 + X_45    (5.19)
Y1 = X_01 + X<<3 + X<<5    (5.20)
Y2 = X_01 + X_45 - X<<7    (5.21)
Y3 = X<<1 + X<<3 + X<<6 - X<<7    (5.22)


The above computation requires 9 additions (including 2 additions required to compute X_01 and X_45), 2 subtractions and 11 shifts (including 1 shift required to compute X_01 and 2 shifts required to compute X_45). The number of additions can be further reduced by precomputing X_013 (= X_01 + X<<3), and computing Y0 to Y3 as follows:

Y0 = X_013 + X_45    (5.23)
Y1 = X_013 + X<<5    (5.24)
Y2 = X_01 + X_45 - X<<7    (5.25)
Y3 = X<<1 + X<<3 + X<<6 - X<<7    (5.26)

The above computation requires 8 additions (including 3 additions required to compute X_01, X_45 and X_013), 2 subtractions and 10 shifts (including 1 shift required to compute X_01, 2 shifts required to compute X_45 and 1 shift required to compute X_013). It can be noted that X_45 can be computed using X_01 as X_45 = X_01<<4, thus further reducing the number of additions by 1 and also reducing the number of shifts by 1.

The above example shows techniques for reducing the number of additions+subtractions (17 to 12, a 29% reduction in this case) for implementing MCM based structures. This reduction is similar (the same in this case) to that achieved for the weighted-sum computation, and uses similar techniques of finding 2 bit common subexpressions, computing the subexpressions first and using the result to compute the intermediate filter outputs. X_01, X_45 and X_013 in the equations above are examples of these 2 bit subexpressions.
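A numeric check (ours) that the precomputation chain of equations 5.23 to 5.26 reproduces the four products, with X_45 derived from X_01 by a shift as noted above; the decimal values 59, 43, -77 and -54 are the 2's complement readings of A0 to A3.

for X in range(-128, 128):
    X_01 = X + (X << 1)                        # X_01 = X + X<<1
    X_45 = X_01 << 4                           # X_45 realized as X_01<<4
    X_013 = X_01 + (X << 3)                    # X_013 = X_01 + X<<3
    assert X_013 + X_45 == 59 * X              # Y0 = A0 . X  (5.23)
    assert X_013 + (X << 5) == 43 * X          # Y1 = A1 . X  (5.24)
    assert X_01 + X_45 - (X << 7) == -77 * X   # Y2 = A2 . X  (5.25)
    assert ((X << 1) + (X << 3) + (X << 6)
            - (X << 7)) == -54 * X             # Y3 = A3 . X  (5.26)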

5.2.2. 2 Bit Common Subexpressions

In addition to the CSWC type of subexpression, the minimization process for MCM based structures uses Common Subexpressions across Bit locations (CSABs). The CSABs are identified between two bit locations both having 1s for more than one coefficient. For example, the bit values at bit locations 0 and 1 (0 being the LSB) are both 1 in case of coefficients A0, A1 and A2, resulting in the common subexpression X_01. It can also be noted that the number of common coefficients directly decides the reduction in the number of additions/subtractions and shifts. For example, the subexpression X_01, which is common for three coefficients, results in reducing the number of additions by two and the number of shifts also by two.

The CSABs can be easily generalized to handle different representation schemes, such as CSD, in which the coefficient bits can take values 0, 1 and -1. This is achieved by defining two subtypes:
i. CSAB++, in which the bit values at the two bit locations are non-zero for more than one coefficient, and the values are either both 1s or both -1s, and
ii. CSAB+-, in which the bit values at the two bit locations are non-zero for more than one coefficient, and the values are either 1 and -1 or -1 and 1.

5.2.3. Problem Formulation

Each coefficient multiplication in the MCM computation is represented as a row of an NxB matrix, where N is the number of coefficients and B is the number of bits used to represent the coefficients. The total number of additions+subtractions required to compute the intermediate outputs is given by (the total number of non-zero bits in the matrix - N). The coefficient matrix for the 4 tap filter mentioned above is shown in section 5.1.3.

The iterative 2 bit common subexpression elimination algorithm works on this matrix in four phases. In the first phase, CSABs are searched. The coefficient matrix is updated at the end of every iteration to reflect the common subexpression elimination in that iteration. This is done by adding a new column to the coefficient matrix, which represents the 2 bit common subexpression. The bit values in the new column are set depending on the locations in which the subexpression was identified and the coefficient bit values in these locations. The two bit columns are also updated so as to set to zero the bit values at the locations in which the common subexpression was identified.

For example, consider the common subexpression X_45 between coefficients A0 and A2 at bit locations 4 and 5. The updated coefficient matrix after eliminating this subexpression is shown below.

        7    6    5    4    3    2    1    0    X_45
A0      0    0    0    0    1    0    1    1    1
A1      0    0    1    0    1    0    1    1    0
A2     -1    0    0    0    0    0    1    1    1
A3     -1    1    0    0    1    0    1    0    0

Considering that each 2 bit common subexpression requires 1 addition, the total number of additions+subtractions to compute the intermediate outputs at any stage in the iterative minimization process is given by (the total number of non-zero bits + the number of subexpressions - N). For example, the computation represented by the above coefficient matrix requires 13 additions+subtractions (16 + 1 - 4). The first phase of the minimization process terminates when no more common subexpressions across bit locations are possible.

In the second phase of the minimization process, CSWCs are searched in the updated coefficient matrix and are eliminated.

In the third phase, all the identified subexpressions (CSABs and CSWCs) are searched to check for 'shift' relationships among them, i.e. whether a subexpression can be realized by left shifting another subexpression by some amount. In the example shown above, the subexpressions X_01 and X_45 share such a 'shift' relationship, given by (X_45 = X_01<<4). X_45 can hence be realized in terms of X_01 so as to reduce the number of additions by one and reduce the number of shifts by one.

In the fourth and final phase of the optimization process, the coefficient matrix (first B columns) is searched for two 2 bit subexpressions with a 'shift' relationship among them. Such expressions can also be eliminated so as to reduce the number of additions. For example, consider two coefficients A0 = 0.0101010 and A1 = 0.1000101, with the corresponding Y0 and Y1 computations given by:

Y0 = X<<1 + X<<3 + X<<5    (5.27)
Y1 = X + X<<2 + X<<6    (5.28)

While no CSABs can be found for these coefficients, there exist subexpressions X_13 (in A0) and X_02 (in A1) that are related by a 'shift' relationship. Y0 and Y1 can hence be recomputed in terms of X_02 (= X + X<<2) as follows

Y0 = X_02<<1 + X<<5    (5.29)
Y1 = X_02 + X<<6    (5.30)

thus reducing the number of additions by one.

5.2.4. Common Subexpression Elimination

Just as in the case of the Σ A[i]·X[n-i] based structure, the coefficient matrix at every stage in the iterative minimization process typically has more than one common subexpression. The choice of a common subexpression for elimination during each iteration thus affects the overall reduction in the number of additions+subtractions.

The steepest descent approach, similar to that used for the Σ A[i]·X[n-i] based structure, is adopted to select a common subexpression for elimination. However, since CSABs are to be searched, the subexpression graph is constructed with the columns of the coefficient matrix being its nodes. Each pair of nodes in the graph has two edges (E++ and E+-), and weights are assigned to these edges to indicate the number of times CSAB++ and CSAB+- appear respectively between the columns represented by these nodes.

The subexpression corresponding to the edge with the highest weight is selected for elimination during each iteration. In case there is more than one edge with the same highest weight, a one level lookahead is used to decide on the subexpression to be eliminated. The one level lookahead weight for an edge is computed in the same way as presented in section 5.1.4.

5.2.5. The Algorithm

Here is an algorithm for minimizing the number of operations (additions + subtractions) using the common subexpression precomputation technique.


Procedure Minimize-Additions-for-MCM
Input: N filter coefficients represented using B bits of precision
Output: Data flow graph representation of the MCM computation, with the nodes of the flow graph restricted to add, subtract and shift operations.

/* Process the given set of coefficients */
Eliminate coefficients with less than 2 non-zero bits
Merge the coefficients with the same value. This applies to transforms such as linear phase FIR filters with symmetric coefficients.
/* Phase I */
Construct the initial coefficient matrix of size NxB, where N is the number of coefficients after the processing and B is the number of bits used to represent the coefficients.
repeat {
    Construct the subexpression graph
    Assign weights to the edges based on the number of CSABs
    Find the edge with the highest weight and the highest one level lookahead weight
    Update the coefficient matrix so as to eliminate the common subexpression
} until (highest weight < 2)
/* Phase II */
.. same as in section 5.1.5
/* Phase III */
Find bit distances for all the common subexpressions
Find pairs of subexpressions of the same type and with equal bit distances
for each pair {
    Eliminate one of the subexpressions by representing it as a shifted version of the other subexpression
}
/* Phase IV */
Extract all 2 bit patterns in the first B columns
Find bit distances between these 2 bit subexpressions
Find pairs of subexpressions with equal bit distances
Replace one of the subexpressions as a shifted version of the other subexpression
Output the signal flow graph in terms of shifts, additions and subtractions.

Table 5.2 gives the results in terms of the number of additions+subtractions required for the MCM computation as part of the transposed FIR filter structure. The same 10 FIR filters (as used in table 5.1) with the number of taps ranging from 16 to 128 have been used. The results show that the number of additions+subtractions can be reduced by an average factor of 2.2. This is much higher than the factor of 1.43 (avg.) for the FIR filters presented in [78].


Table 5.2. Number of Additions+Subtractions for Computing MCM Intermediate Outputs

# taps initial +/-s final +/-s initial/final 16 36 21 1.7 24 60 29 2.1 32 64 29 2.2 36 53 27 2.0 40 81 38 2.1 48 76 34 2.2 64 106 44 2.4 72 134 57 2.4 96 132 51 2.6 128 155 60 2.6


5.2.6. An Upper Bound on the Number of Additions for MCM Computation

Consider an MCM computation where a variable X is being multiplied by N constants using B bits of precision. Since 2^B distinct constants can be represented using B bits of precision, the N constants can have at most 2^B unique values. A B bit constant can have at most B 1s, and multiplication by a B bit constant can be performed using at most B-1 additions. Thus the number of additions for the MCM computation has an upper bound given by (2^B · (B-1)) [78]. This upper bound however is pessimistic due to the following reasons:

1 One of the 2^B constants is a '0' and multiplication by a '0' does not require any addition.

2 There are B constants whose binary representations have only one '1'. Multiplication by such a constant can be performed using a shift operation and hence does not require any addition.

3 Only one of the 2^B constants has B 1s. The average number of 1s per constant can be shown to be B/2.

4 Consider constants N1 and N2 such that N2 = N1<<K, where K is the amount of left shift. N2·X can then be computed as ((N1·X)<<K), thus requiring no addition. For example, once the multiplication (00000111 · X) is computed, multiplications by 00001110, 00011100, 00111000, 01110000 and 11100000 can be computed by appropriately left shifting (00000111 · X).


5 The above mentioned upper bound does not account for the reduction achieved using the common subexpression precomputation technique. This point has also been highlighted in [78].

Based on the above observations, a tighter upper bound on the number of additions can be obtained by first coming up with a subset of constants which have more than one 1. This subset can be further reduced by eliminating those constants that can be obtained by left-shifting other constants in the subset. In other words, in the reduced subset no two constants are related by just a shift operation.

For a given constant N1 (with more than one 1 in its binary representation), another constant N2 can always be found such that N2 has one less 1 and the Hamming distance between N1 and N2 is one. The multiplication N1·X can hence be computed as one addition of N2·X with an appropriately left shifted X.

Based on the above analysis, the multiplication by each member of the reduced subset can be computed using just one addition. The upper bound on the number of additions is thus given by the cardinality of the reduced subset.

It can be noted that no two constants with '1' as their LSBs can be related by just a shift operation. It can also be noted that for a constant N1 whose LSB is '0', there always exists a constant N2 with '1' as its LSB, such that N1 = N2<<K, where the amount of left shift K is given by the number of continuous 0s at the LSB end of N1. For example, for N1 = 00101000, there exists N2 = 00000101, such that N1 = N2<<3. Based on these observations, the reduced subset consists of those constants which have more than one 1 and have '1' as their LSBs. It can easily be shown that for a B bit number, the cardinality of such a reduced subset is given by 2^(B-1) - 1. This hence is an upper bound on the number of additions for the MCM computation.
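The cardinality argument is easy to confirm by enumeration. A small check (ours) over unsigned B bit constants:

def reduced_subset(B):
    # Constants with more than one 1 and with '1' as the LSB; no two such
    # constants are related by a pure shift.
    return [c for c in range(1, 1 << B)
            if (c & 1) == 1 and bin(c).count('1') > 1]

for B in range(2, 12):
    assert len(reduced_subset(B)) == 2 ** (B - 1) - 1   # the upper bound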

The analysis presented above assumes constants with a B bit unsigned representation. A similar analysis can be performed in the case of constants with a B bit 2's complement representation. These constants (except for -2^(B-1)) can also be represented using a B bit signed-magnitude representation. It can be noted that multiplying a variable X with a negative constant can be achieved by multiplying -X with the corresponding positive constant. Thus once -X is computed, multiplication by all negative constants which have only one '1' in their magnitude part can be implemented using only a shift operation. The multiplication by the constant -2^(B-1) can also be handled the same way. It can also be noted that a multiplication of a negative constant (say -N1) with a variable X can be computed using one subtraction as (-N1 · X) = 0 - (N1 · X).

The B bit constants can be divided into a set of (B-1) bit positive constants and a set of (B-1) bit negative constants. Since the reduced subset of (B-1) bit constants has a cardinality of (2^(B-2) - 1), the MCM computation using the positive constants can be achieved using (2^(B-2) - 1) additions. Similarly, the MCM computation using the negative constants can be achieved using (2^(B-2) - 1) additions. Considering the extra subtraction required to compute -X, the upper bound on the number of additions for MCM computation with B bit 2's complement constants is given by (2^(B-2) - 1) + (2^(B-2) - 1) + 1 = (2^(B-1) - 1). It can be noted that this upper bound is the same as that for unsigned constants.

5.3. Transformations for Minimizing Number of Additions

While the common subexpression precomputation technique helps in reducing the number of additions, the optimality of the final solution also depends on the initial coefficient representation. This section presents two types of coefficient transforms to generate different coefficient representations as the initial solution. It also shows how these transforms, along with the common subexpression precomputation technique, result in an area efficient implementation.

5.3.1. Number Theoretic Transforms

Consider an N bit binary representation in terms of bits b0 to b(N-1), bit b0 being the LSB. Within this generic framework, various representation schemes are possible that differ in terms of the weights associated with each bit and the value that each bit can take. For example, the weight associated with an i-th bit can be 2^i or -2^i and the bit values can be either from the set {0,1} or from the set {0, 1, -1}. The bit pattern for a number can thus vary depending on the representation scheme used. Since the number of non-zero bits directly impacts the number of additions/subtractions, the choice of the representation scheme can significantly impact the total number of additions required to compute the filter output.

This section presents three binary representation schemes that result in different numbers of non-zero elements and consequently impact the number of additions.

5.3.1.1 2's Complement Representation

2's complement is the most common representation scheme for signed numbers. In this scheme, the bits can take values 0 or 1. For an N bit number, the weight associated with the MSB is -2^(N-1). The weight associated with any other bit location i is 2^i. The value of a number represented in 2's complement is hence given by (-b(N-1) · 2^(N-1)) + Σ(i=0 to N-2) b(i) · 2^i.

Consider the 8 bit 2's complement representations of 3 and -3, given by 00000011 and 11111101 respectively. It can be noted that in terms of the number of non-zero bits, 2's complement is not the best representation for -3. In general, 2's complement representations of small negative numbers have a higher number of non-zero (1s) entries. Thus for transforms having coefficients with small negative values, 2's complement may not be the optimal coefficient representation scheme.


5.3.1.2 Uni-sign Representation

The limitation of the 2's complement representation requiring a higher number of 1s to represent small negative numbers can be overcome by employing a scheme in which the bit values can be either from the set {0,1} or from the set {0,-1}. Since all non-zero bits in such a representation have the same sign, it is called the uni-sign representation. It can be noted that this representation is similar to the sign-magnitude representation, except that the overhead of the sign bit is eliminated by embedding the sign information in the non-zero bits.

In the uni-sign representation scheme the 8 bit representations of 3 and -3 are given by 00000011 and 000000NN (N indicates bit value -1) respectively. This scheme thus enables representation of small negative numbers with fewer non-zero bits. It can be noted that the number of non-zero bits in the uni-sign representation is the same as that obtained by the selective coefficient negation technique presented in 2.3.1. As per the results in 2.2, using this scheme the number of non-zero bits can be reduced by as much as 56% compared to the 2's complement representation of FIR filter coefficients.

Since the range of numbers that can be represented using this scheme (-(2^N - 1) to (2^N - 1)) is larger than that of the 2's complement representation (-2^(N-1) to (2^(N-1) - 1)), any coefficient value (except the most negative value) in N bit 2's complement representation can be converted to this scheme using one less bit (i.e. N-1) of precision.

5.3.1.3 Canonical Signed Digit (CSD) Representation

The schemes discussed in 5.3.1.1 and 5.3.1.2 are not efficient in terms of representing numbers such as 15, -15, 31, -31, which result in continuous streams of 1s or Ns (-1s). Consider the 8 bit representation of 31, which in both the schemes is given by 00011111. The number of non-zero bits can be reduced to 2 by representing 31 as (32-1). This can be achieved in two ways: (i) using the nega-binary representation scheme [106], in which the bit bi has the weight -(-2)^i; the number 31 can then be represented as 00100001. (ii) using a scheme such as CSD [101], which allows the bit values to be 0, 1 or -1; the number 31 can then be represented as 0010000N. It can be noted that the CSD representation is more flexible than the nega-binary representation in terms of the values that can be represented at each bit location. The bit-value flexibility enables CSD to provide a representation that is guaranteed to have the least number of non-zero bits.
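CSD digits can be produced by the standard non-adjacent-form recoding sketched below (our illustration, not the book's procedure); digits are returned LSB first and no two adjacent digits are non-zero.

def to_csd(n, bits):
    digits = []
    while n != 0 or len(digits) < bits:
        if n & 1:
            d = 2 - (n & 3)      # remainder 1 -> digit +1, remainder 3 -> -1
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

# to_csd(31, 8) gives 0010000N read MSB first, i.e. 31 = 32 - 1 with only
# two non-zero digits instead of the five in 00011111.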

While CSD gives a locally optimal solution resulting in the minimum number of additions to implement a coefficient multiplication, it does not always result in a globally optimal solution for implementing MCM computation. Here are two examples that demonstrate this. Consider the following computation using the 8 bit 2's complement or uni-sign representation scheme:

Y0 = 17 · X = 00010001 · X
Y1 = 19 · X = 00010011 · X

Using the common subexpression precomputation technique, the above computation can be performed using two additions as follows:

Y0 = X + X<<4
Y1 = Y0 + X<<1

Using the 8 bit CSD scheme no common subexpression exists. This computation thus requires two additions and one subtraction (which is one extra compared to the uni-sign representation) as shown below:

Y0 = 00010001 · X = X + X<<4
Y1 = 0001010N · X = X<<4 + X<<2 - X

Here is another example, where CSD reduces the total number of non-zero bits in the coefficients but does not minimize the total number of additions+subtractions across the coefficient multiplications. Consider the following computation with coefficients represented in 8 bit 2's complement form:

Y = 00010101 · X1 + 10011101 · X2 + 10011001 · X3

Using the techniques presented in section 5.1, the above computation can be performed using six additions and one subtraction as follows:

T1 = X1 + X2
T2 = X2 + X3
T3 = T1 + X3
Y = T3 + T3<<4 + T1<<2 + T2<<3 - T2<<7

Using the CSD representation the computation to be performed is

Y = 00010101 · X1 + N0100N01 · X2 + N010N001 · X3

Using the techniques presented in section 5.1, this computation can be performed using five additions and three subtractions as follows:

T2 = X2 + X3
Y = X1 + X1<<2 + X1<<4 + T2 + T2<<5 - T2<<7 - X2<<2 - X3<<3

While the total number of non-zero bits is reduced by 1 (12 to 11) using the CSD representation, it results in an extra computation.

These examples highlight the role of different number theoretic transforms in reducing the number of additions required to implement multiplier-less FIR filters.

5.3.2. Signal Flow Graph Transformations

The transformations discussed in section 5.3.1 do not alter the coefficient values. This section presents two transformations which are applicable specifically to FIR filters. These transforms, used in conjunction with the number theoretic transforms, help in further reducing the number of computations.

The FIR signal flow graph can be transformed so as to compute the output Y[n] in terms of input data values and the previously computed output Y[n-1].

Figure 5.4. SFG Transformation - Computing Y[n] in Terms of Y[n-1]

This can be done as follows:

Y[n-1] = Σ(i=0 to N-1) A[i] · X[n-1-i]    (5.31)

Y[n] = Σ(i=0 to N-1) A[i] · X[n-i]    (5.32)

Adding the LHS of equation 5.31 to the RHS of equation 5.32 and subtracting the RHS of equation 5.31 from it gives:

Y[n] = Σ(i=0 to N-1) A[i] · X[n-i] - Σ(i=0 to N-1) A[i] · X[n-1-i] + Y[n-1]    (5.33)

Y[n] = A[0] · X[n] + Σ(i=1 to N-1) A[i] · X[n-i] - Σ(i=0 to N-2) A[i] · X[n-1-i]
       - A[N-1] · X[n-N] + Y[n-1]    (5.34)

Y[n] = A[0] · X[n] + Σ(k=1 to N-1) (A[k] - A[k-1]) · X[n-k]
       - A[N-1] · X[n-N] + Y[n-1]    (5.35)
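A numeric sanity check (ours) of equation 5.35: for a random 4 tap filter the recursive form reproduces the direct form, assuming X[k] = 0 for k < 0.

import random

A = [random.randrange(-8, 8) for _ in range(4)]
X = [random.randrange(-100, 100) for _ in range(20)]
N = len(A)

def x(k):                       # zero-padded input
    return X[k] if 0 <= k < len(X) else 0

def direct(n):                  # Y[n] per equation 5.32
    return sum(A[i] * x(n - i) for i in range(N))

for n in range(len(X)):
    recursive = (A[0] * x(n)
                 + sum((A[k] - A[k - 1]) * x(n - k) for k in range(1, N))
                 - A[N - 1] * x(n - N)
                 + direct(n - 1))       # Y[n-1]
    assert recursive == direct(n)       # equation 5.35 holds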

Figure 5.4 shows the signal flow graph of a 4 tap FIR filter transformed using the above mentioned approach.

The direct form structure of an N tap FIR filter requires N multiplications and N-1 additions. With the above mentioned SFG transformation, the resultant structure (figure 5.4) requires (N+1) multiplications and (N+1) additions. While this transform results in more computation, it also modifies the filter coefficients. If the saving in the number of additions due to the modified filter coefficients is more than the overhead of the additional computation, this transformation can result in an area-efficient multiplier-less FIR implementation. Such a possibility is higher in the case of linear phase FIR filters, because for such filters this SFG transformation retains the number of multiplications required to compute the output. This can be proved by analyzing the coefficient symmetry property of the transformed SFG.

The coefficient symmetry property (stated below) of the linear phase filters

A[i] = A[N - 1 - i]    (5.36)

can be used to reduce the number of multiplications by half in the direct form FIR implementation.

For an N tap FIR filter, the corresponding transformed structure (equation 5.35) has N+1 coefficients C[0] to C[N]. If the original filter has symmetric coefficients (linear phase), the coefficients of the transformed structure are anti-symmetric, i.e. C[i] = -C[N-i], as shown below.

From 5.35, C[0] = A[0] and C[N] = -A[N-1]
From 5.36, A[0] = A[N-1]
From the above two equations, C[0] = -C[N] ... proved for i=0
From 5.35, C[j] = A[j] - A[j-1] and C[N-j] = A[N-j] - A[N-j-1]
From 5.36, A[N-j] = A[j-1] and A[N-j-1] = A[j]
From the above two equations, C[j] = A[j] - A[j-1] and C[N-j] = A[j-1] - A[j], hence C[j] = -C[N-j] ... proved.

An N tap linear phase filter requires N/2 multiplications if the number of coefficients is even and requires (N+1)/2 multiplications if the number of coefficients is odd. If N is odd, the transformed filter has an even number (N+1) of coefficients which are anti-symmetric and hence require (N+1)/2 multiplications. For N even, the transformed filter has an odd number (N+1) of coefficients and hence requires (N+2)/2 multiplications. However, since from (5.36) A[N/2] = A[N/2-1], the coefficient C[N/2] = A[N/2] - A[N/2-1] = 0. Thus for N even, the transformed filter requires N/2 multiplications. For example, consider the SFG shown in figure 5.4. If the original filter has linear phase, the coefficient values A[1] and A[2] are the same, hence the coefficient (A[2]-A[1]) in this SFG is 0. This SFG thus requires two multiplications and four additions, as against the two multiplications and three additions required by the direct form 4 tap linear phase filter.

The above analysis shows that this signal flow graph transformation retains the number of multiplications required in the case of linear phase FIR filters, and provides an opportunity to reduce the number of additions by altering the coefficient values. As an example, consider the case of A[2] = 19 = 00010011 and A[3] = -13 = 0000NN0N. The transformed structure will have a coefficient C[3] = A[3] - A[2] = -32 = 00N00000, which has just one non-zero bit.

The computation of Y[n] in terms of Y[n-1] can also be achieved by subtracting the LHS of equation 5.31 from, and adding the RHS of equation 5.31 to, the RHS of equation 5.32. Y[n] is thus computed as:

Y[n] = A[0] · X[n] + Σ(k=1 to N-1) (A[k] + A[k-1]) · X[n-k]
       + A[N-1] · X[n-N] - Y[n-1]    (5.37)

Figure 5.5. SFG Transformation - Computing Y[n] in Terms of Y[n-1]

The resultant signal flow graph is shown in figure 5.5. It can be shown that for a linear phase FIR filter, the coefficients of the modified SFG are also symmetric. Thus for an odd number of taps, this transformation requires the same number of multiplications, (N+1)/2, as the original structure. In the case of an even number of taps, unlike the above mentioned transform, no coefficient cancellation is possible. This transformation hence results in a signal flow graph that requires (N+2)/2 (i.e. one more) multiplications.

It can be noted from figures 5.4 and 5.5 that this SFG transformation converts the FIR structure to an IIR structure. The resultant IIR structure has a pole and a zero on the unit circle in the Z plane. The pole-zero cancellation is essential for the filter response to be stable. This can be achieved by retaining full numerical precision while performing the IIR computations.

5.3.3. Evaluating Effectiveness of the Transformations

This subsection presents results that evaluate the effectiveness of various coefficient transforms in minimizing the number of additions required for the MCM computation as part of the transposed form FIR filter structure. For each filter, the initial number of additions (without common subexpression elimination) and the final (minimized) number of additions (after common subexpression elimination) for the following coefficient transforms is looked at:
I. 2's: Direct form structure with coefficients in 2's complement form.
II. allp: Direct form structure with coefficients in uni-sign form.
III. diff-allp: Transformed SFG (as in figure 5.4) with coefficient differences represented in uni-sign form.
IV. sum-allp: Transformed SFG (as in figure 5.5) with coefficient sums represented in uni-sign form.
V. csd: Direct form structure with coefficients in CSD form.



VI. diff-csd: Transformed SFG (as in figure 5.4) with coefficient differences represented in CSD form.
VII. sum-csd: Transformed SFG (as in figure 5.5) with coefficient sums represented in CSD form.

Figure 5.6. Average Reduction Factor Using Common Subexpression Elimination

The bar chart in figure 5.6 shows the average reduction factor for the various transforms. This can be used to analyze the impact of coefficient transforms on the amount of minimization achieved using common subexpression elimination. As can be seen from the bar chart, the common subexpression elimination results in the maximum reduction when the coefficients are represented in 2's complement form. It can be noted that among all the coefficient transforms, the 2's complement coefficient representation results in the maximum number of additions in the initial solution (i.e. without common subexpression elimination). The trend in figure 5.6 shows that for a given filter, the higher the total number of non-zero bits a transform leaves in the coefficient representations, the higher the reduction achieved using common subexpression elimination. This also indicates that if a coefficient transform results in the best initial solution, it may not always give the best final solution. Hence it is important to explore across coefficient transforms to get the optimal implementation.


Figure 5.7. Best Reduction Factors Using Coefficient Transforms Without Common Sub-expression Elimination

The bar chart in figure 5.7 gives the best reduction factors w.r.t. the initial solutions (i.e. 2's complement representation) using coefficient transforms alone (i.e. without applying common subexpression elimination). It can be noted that a reduction by a factor as high as 4.9 can be achieved. Figure 5.7 also shows for each filter the coefficient transform that results in the most reduction. As expected, the CSD representation results in the maximum reduction for all the filters. It can also be noted that CSD used in conjunction with the SFG transformations that compute Y[n] in terms of Y[n-1] results in the maximum reduction in 50% of the cases.

The bar chart in figure 5.8 gives the best reduction factors using the coefficient transforms in conjunction with common subexpression elimination. The reduction factors are computed with respect to an initial solution given by applying common subexpression elimination on coefficients represented in 2's complement form. It can be noted that reduction by a factor of as high as 2 can be achieved. Figure 5.8 also shows, for each filter, the coefficient transforms that result in the most reduction. It can be noted that for a filter, multiple coefficient transforms can result in the best final solution. As an example, for LP64 three coefficient transforms (II. allp, V. csd and VI. diff-csd) result in the best final solution. It can also be noted that a coefficient transform that results in the best


[Bar chart: best reduction factor (between 1 and 2) obtained using coefficient transforms with common subexpression elimination for each of the filters LP16 to LP128; each bar is labeled with the transform(s) giving the best final solution (allp, diff-allp, csd, diff-csd or sum-csd).]

Figure 5.8. Best Reduction Factors Using Coefficient Transforms with Common Sub-expression Elimination

initial solution may not always result in the best final solution. As an example, consider LP48, for which the csd (V) transform results in the best initial solution but the allp (II) transform results in the best final solution.

The bar chart in figure 5.9 gives the number of times each coefficient transform results in the best final solution. As can be seen from the figure, while CSD gives the best solution in most of the cases, the uni-sign representation can in some cases perform better than CSD. The figure also highlights the role of the SFG transformations shown in figures 5.4 and 5.5, which result in the best final solution in 8 out of the 14 filters.

This data can also be used to compare the two signal flow graph transformations shown in figures 5.4 and 5.5, which result in SFG structures with coefficient-differences and coefficient-sums respectively. As discussed earlier, the transformation in figure 5.5 always results in one additional coefficient multiplication in case of even order filters. It hence results in a relatively higher number of additions for even order filters. Overall, the SFG transformation based on coefficient differences (figure 5.4) provides higher reduction than the transform based on coefficient sums (figure 5.5).


[Bar chart: for each coefficient transform (2's, allp, diff-allp, sum-allp, csd, diff-csd, sum-csd), the number of filters (out of 14) for which it results in the best final solution; csd scores highest.]

Figure 5.9. Frequency of Various Coefficient Transforms Resulting in the Best Reduction Factor with Common Sub-expression Elimination

The results presented in this section thus demonstrate the role of various transforms in minimizing the number of computations in the multiplier-less implementations of FIR filters. The various coefficient transforms discussed in this chapter enable exploration of a wider search space, resulting in the best final solution.

5.3.4. Transformations for Optimal Initial Solution

In addition to the SFG restructuring transformations presented in the earlier section, the following two algorithmic transformations also enable obtaining an optimal initial solution for the multiplier-less implementation of FIR filters.

5.3.4.1 Coefficient Optimization

The coefficients of an FIR filter can be suitably modified so as to reduce the number of 1s in the coefficients while satisfying the desired filter characteristics such as passband ripple and stopband attenuation. These techniques target both the 2's complement binary representation [28] and the CSD representation [83, 105] for the coefficients. The coefficient optimization algorithm presented in


section 2.4.5 can be adapted by appropriately modifying the cost function to perform this transformation.
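As an illustration, here is a minimal sketch (not the algorithm of section 2.4.5 itself) of such a cost function: it counts the nonzero digits in the CSD representation of each quantized coefficient. An optimizer would perturb coefficients by one LSB at a time and accept a move only if this count drops while the frequency response still meets the passband/stopband specifications.

def csd_nonzero_digits(c):
    """Number of nonzero digits in the CSD (non-adjacent) form of integer c."""
    count = 0
    c = abs(c)
    while c:
        if c & 1:                           # odd: place a nonzero digit here
            count += 1
            c -= 1 if (c & 3) == 1 else -1  # digit +1 if c mod 4 == 1, else -1
        c >>= 1
    return count

def cost(coeffs):
    """Total nonzero CSD digits over all filter coefficients."""
    return sum(csd_nonzero_digits(c) for c in coeffs)

# 119 (binary 1110111) has six 1s, but its CSD form +128 -8 -1
# has only three nonzero digits
print(cost([119]))   # -> 3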

5.3.4.2 Efficient Pre-Filter Structures

Instead of realizing the filter in the direct form as weighted-sum computation, it can be implemented as a cascade of a 'pre-filter' with an 'equalizer'. The pre-filter structures are computationally efficient as they use coefficients with values 1 or -1. Use of Cyclotomic Polynomial Filters as pre-filter structures has been proposed in [72]. With the correct choice of a pre-filter structure, the equalizer filter can be implemented with fewer taps than required for the direct form realization [1]. The cascade of 'pre-filter' - 'equalizer' thus requires fewer multiplications and hence fewer additions in case of a multiplier-less implementation.

5.4. High Level Synthesis of Multiprecision DFGs

The output from the common subexpression elimination based optimization is a data flow graph with add and shift operators. The precision of these operations and their input/output variables varies significantly across the data flow graph. As an example, consider a 32 term weighted-sum computation with 12 bit data and 12 bit coefficients. The output of this computation needs 29 bits of precision to guarantee no overflow. Thus the data flow graph has variables of precision ranging from 12 bits to 29 bits.
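The 29-bit figure follows from a simple worst-case bound, sketched below: each 12x12 bit product can need up to 24 bits, and accumulating 32 such products adds ceil(log2(32)) = 5 further bits.

from math import ceil, log2

def weighted_sum_bits(b_data, b_coef, n_terms):
    """Worst-case output precision of an n_terms weighted sum: each
    product can need b_data + b_coef bits, and the accumulation can
    grow by ceil(log2(n_terms)) further bits."""
    return b_data + b_coef + ceil(log2(n_terms))

print(weighted_sum_bits(12, 12, 32))   # -> 29, as quoted above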

While conventional high level synthesis techniques can be applied to implement such data flow graphs, the implementation is not optimal as the variable precision is not exploited during the synthesis process. The following subsections show, with the help of examples, how the variable precision can be comprehended during the three key components of high level synthesis - register allocation, functional unit binding and scheduling.

5.4.1. Precision Sensitive Register Allocation

The input to the register allocation problem is the variable lifetime graph, which is derived from the scheduled data flow graph. During register allocation [22], variables with non-overlapping lifetimes are assigned to the same register, so as to minimize the total number of registers. The conventional algorithms do not aim at minimizing the total number of register bits. Consider the example shown in figure 5.10.

While both the register allocations require the same number of registers, the precision sensitive allocation requires significantly fewer register bits and is hence area efficient.


Precision insensitive allocation: V1 (8 bits) -> Reg 1 and V2 (15 bits) -> Reg 2 in control step C1; V3 (16 bits) -> Reg 1 and V4 (9 bits) -> Reg 2 in control step C2. Register widths: Reg 1: 16 bits, Reg 2: 15 bits.

Precision sensitive allocation: V1 (8 bits) -> Reg 1 and V2 (15 bits) -> Reg 2 in control step C1; V4 (9 bits) -> Reg 1 and V3 (16 bits) -> Reg 2 in control step C2. Register widths: Reg 1: 9 bits, Reg 2: 16 bits.

Figure 5.10. Precision Sensitive Register Allocation
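A minimal greedy sketch of such an allocation is shown below (illustrative only, not the algorithm of [3]): variables are placed widest-first, and each goes to a compatible register (no lifetime overlap) whose width grows the least. The control-step lifetimes used for the figure 5.10 example are assumptions.

def allocate(variables):
    """variables: list of (name, width, (start, end)) lifetime tuples."""
    registers = []   # each register: {'width': int, 'vars': [(name, s, e)]}

    def overlaps(reg, s, e):
        return any(not (e <= s2 or e2 <= s) for _, s2, e2 in reg['vars'])

    # widest variables first, so wide ones share registers among themselves
    for name, width, (s, e) in sorted(variables, key=lambda v: -v[1]):
        # pick the compatible register whose width needs the least growth
        best = min((r for r in registers if not overlaps(r, s, e)),
                   key=lambda r: max(0, width - r['width']), default=None)
        if best is None:
            best = {'width': 0, 'vars': []}
            registers.append(best)
        best['vars'].append((name, s, e))
        best['width'] = max(best['width'], width)
    return registers

# Figure 5.10 example: V1, V2 live in step C1; V3, V4 live in step C2
regs = allocate([('V1', 8, (0, 1)), ('V2', 15, (0, 1)),
                 ('V3', 16, (1, 2)), ('V4', 9, (1, 2))])
print(sorted(r['width'] for r in regs))   # -> [9, 16], the sensitive result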

Precision insensitive binding: Op1 -> Adder 1 and Op2 -> Adder 2 in control step C1; Op3 -> Adder 1 and Op4 -> Adder 2 in control step C2. Adder widths: Adder 1: 19 bits, Adder 2: 18 bits.

Precision sensitive binding: Op1 -> Adder 1 and Op2 -> Adder 2 in control step C1; Op4 -> Adder 1 and Op3 -> Adder 2 in control step C2. Adder widths: Adder 1: 13 bits, Adder 2: 19 bits.

Figure 5.11. Precision Sensitive Functional Unit Binding

5.4.2. Precision Sensitive Functional Unit Binding


Functional unit binding [22] aims at assigning functional units to the operators scheduled in each control step. This can be viewed as a special case of the register allocation problem where the lifetime of the computation is always one control step (it can be extended to handle multi-cycle operations). Consider the four add operations scheduled across two control steps as shown in figure 5.11.

While both the bindings require the same number of adders, the precision sensitive binding requires adders with significantly smaller precision and is hence area efficient.
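A simple way to realize precision sensitive binding, sketched below with illustrative widths, is to sort the operations of every control step by required width and bind the k-th widest operation to the k-th widest adder, so that narrow operations never inflate a wide adder.

def bind_adders(steps, n_adders):
    """steps: list of control steps, each a list of operation bit-widths.
    Binds the k-th widest operation of every step to adder k and returns
    the resulting adder widths."""
    widths = [0] * n_adders
    for ops in steps:
        for k, w in enumerate(sorted(ops, reverse=True)):
            widths[k] = max(widths[k], w)
    return widths

# two control steps, two adders, illustrative operation widths
print(bind_adders([[19, 12], [13, 11]], 2))   # -> [19, 12]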


[Figure: four add operations scheduled over control steps C1 and C2. Precision insensitive scheduling needs Adder 1: 16 bits, Adder 2: 17 bits; precision sensitive scheduling needs Adder 1: 12 bits, Adder 2: 17 bits.]

Figure 5.12. Precision Sensitive Scheduling

5.4.3. Precision Sensitive Scheduling

Scheduling [22] aims at assigning the operations of the data flow graph to various control steps with the aim of minimizing area for the given number of control steps (time constrained scheduling) or minimizing the number of control steps for the fixed number of resources (resource constrained scheduling). Consider the data flow graph (figure 5.12) with four add operations to be scheduled over two control steps.

While both the schedules require the same number of adders, the precision sensitive schedule can be implemented using adders with significantly smaller precision and is hence area efficient.

Since register allocation, functional unit binding and scheduling are interdependent, an approach that unifies these three steps in one synthesis algorithm is necessary to get an optimal implementation. Such an integrated and precision sensitive approach to the synthesis of multi-precision data flow graphs has been presented in [3].


Chapter 6

IMPLEMENTATION OF MULTIPLICATION-FREE LINEAR TRANSFORMS ON A PROGRAMMABLE PROCESSOR

Many signal processing applications such as image transforms [27] and error correction/detection involve matrix multiplication of the form Y = A * X, where X and Y are the input and the output vectors and A is the transformation matrix whose elements are 1, -1 and 0. This chapter presents optimized code generation of these transforms targeted to both register rich RISC architectures such as ARM7TDMI [6] and single register, accumulator based DSP architectures such as TMS320C2x [95] and TMS320C5x [96].

Code optimization techniques discussed in the literature [4, 5, 38, 39, 90] address the problems of instruction selection and scheduling, register allocation and storage assignment to minimize code size and/or number of cycles. These techniques operate on a Directed Acyclic Graph (DAG) representation of the code being optimized and can be applied to implement the multiplication-free linear transforms. However, the amount of optimization achieved using these techniques is limited by the initial DAG representation. Much higher gains are possible by optimizing the DAG itself. The DAG optimization techniques presented in this chapter are targeted to both the register-rich and the single register architectures.

With the increasing trend towards portable computing and wireless communication, low power has become an important design consideration. These systems are typically built around programmable processors. The amount of code running on such embedded processors has been growing exponentially over the last few years. Low power is thus becoming an increasingly important consideration for code generation.

Techniques for low power memory mapping for data intensive algorithms have been presented in [67, 74, 36]. Instruction scheduling for low power has been presented in [89, 36] (called cold scheduling in [89]). Both these approaches [89, 36] perform instruction selection and register assignment first


and then use a list scheduling based algorithm for instruction scheduling. In case of multiplication-free linear transforms, since primarily ADD and SUB instructions are used, the operand part of the instructions dominates the overall power dissipation. Instead of assigning registers to the variables before instruction scheduling, the approach presented in this chapter first performs instruction scheduling using DAG variables and then does register assignment for low power code generation. The chapter also presents a technique that reorders the nodes of the DAG so as to minimize the power dissipation in case of single register architectures.

This chapter is organized as two sections, 6.1 and 6.2, which present code generation techniques targeted to register-rich and single register architectures, respectively. Each section also gives results that highlight the effectiveness of these techniques in terms of reducing the number of cycles and also the power dissipation for various multiplication-free linear transforms.

6.1. Optimum Code Generation for Register-rich Architectures

6.1.1. Generic Register-rich Architecture Model

The code generation techniques presented in this section are targeted to the architecture model shown in figure 6.1. This model is a suitable abstraction of a generic RISC architecture. The datapath of this architecture has a register file connected to an ADD/SUBTRACT module which supports the following two instructions:

1 ADD Sr1 Sr2,Shift Dr: (Sr1) + (Sr2)<<Shift -> (Dr)

2 SUB Sr1 Sr2,Shift Dr: (Sr1) - (Sr2)<<Shift -> (Dr)

where Sr1 and Sr2 are the source registers, Dr is the destination register and 'Shift' is the amount by which Sr2 is left shifted before being added to/subtracted from Sr1. In addition to these instructions, the architecture also supports the load and store instructions for movement of data between the registers and the data memory.

It is assumed that a transform is implemented as a function and all the input data values are loaded in the registers before calling the function. Within the main body of the function, the transform is performed as a series of ADD and SUB instructions that operate on the data stored in the registers and produce outputs which are stored in the registers.

The code generation phase takes the DAG as the input and performs instruction scheduling and register assignment aimed at producing code that is smallest in terms of program size, executes in the minimum number of cycles, uses the minimum number of registers and dissipates the least amount of power.


[Figure: generic register-rich architecture. Program memory feeds an instruction register; the instruction is decoded and executed under execute control. The datapath has a register file connected to data memory, with source operand Sr2 passing through a left shifter into an adder/subtracter whose other input is Sr1 and whose result is written to Dr.]

Figure 6.1. Generic Register-rich Architecture

6.1.2. Sources and Measures of Power Dissipation

Consider the architecture shown in figure 6.1 and three stages of the pipeline performing instruction fetch, decode and execute respectively. During each cycle, a new instruction In is fetched from the program memory, the previous instruction In-1 is decoded and the instruction previous to that, In-2, is executed. Thus during each cycle, there is activity on the program memory busses, in the decoder, in the register file and in the ADD/SUB function unit. These thus form the sources of power dissipation.

The power dissipated in the program memory data bus depends on the total Hamming distance between successive instructions fetched from the memory and also on the number of adjacent signals of the bus toggling in opposite directions. Since the decoder is a combinational block, its power dissipation is dependent on the switching activity of its inputs. Since the fetched instruction forms the main input to the decoder block, the power dissipated in the decoder is dependent on the Hamming distance between the successive instructions being decoded. The power dissipated in the ADD/SUB function unit primarily depends on the data values being added/subtracted. Since these data values are inputs to the program, the program generator has minimal control over this component of power dissipation. The execute pipeline stage also sees activity in the register file. The power dissipated in the register file decoders is dependent on the sequence of register addresses (Sr1, Sr2 and Dr) decoded every cycle.


W1 W2 W3     X1 X2 X3
W4 W5 W6     X4 X5 X6
W7 W8 W9     X7 X8 X9

Figure 6.2. 3x3 Pixel Window Transform

Since this sequence is primarily decided by the sequence in which the instructions are fetched, this component of the power dissipation is also dependent on the Hamming distance between successive instructions. Thus it can be noted that minimizing the Hamming distance between consecutive instructions reduces power dissipation in all the three stages of the pipeline.
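The measure itself is straightforward to compute; the sketch below sums bit flips between successive instruction words (the word values are illustrative, not actual ARM7TDMI encodings).

def hamming(a, b):
    """Number of bit positions in which words a and b differ."""
    return bin(a ^ b).count('1')

def total_hd(words):
    """Total Hamming distance between successive instruction words."""
    return sum(hamming(p, q) for p, q in zip(words, words[1:]))

# three hypothetical 32-bit instruction encodings
print(total_hd([0x00A10002, 0x00A10003, 0x04A10003]))   # -> 2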

6.1.3. Optimum Code Generation for 1-D Transforms

A multiplication-free one dimensional transform can be represented as

Y = Σ(i=1 to N) A[i] · X[i], where A[i] ∈ {0, 1, -1} for i = 1, 2, ..., N   (6.1)

Examples of such transforms include the 3x3 pixel window transforms [27] used in image processing. Consider a 3x3 pixel window of an image (figure 6.2) with values X[1] to X[9] and the corresponding transform window with weights W[1] to W[9]. The transform is then computed as:

Y = Σ(i=1 to 9) W[i] · X[i]   (6.2)

Figure 6.3 shows the Prewitt window transform [27] which is used for edge detection. The corresponding DAG is also shown in figure 6.3.

This subsection looks at low power code generation for such transforms. As discussed earlier, the code generation should aim at reducing the Hamming distance between successive instructions. An instruction has two parts - the first part gives the operator (ADD or SUB) and the second part gives the operands (source and destination registers). The Hamming distance in the operator part of the instruction can be reduced during instruction scheduling by maximizing the sequences of consecutive ADD and consecutive SUB operations. The other technique is to modify the DAG itself so as to maximize nodes/operations of the same type. For example, the DAG in figure 6.3 can be transformed to


[Figure: the Prewitt window with weights

 1  0 -1
 1  0 -1
 1  0 -1

and the corresponding DAG, which combines the differences (X1 - X3), (X4 - X6) and (X7 - X9) using two ADD nodes to produce Y.]

Figure 6.3. Prewitt Window Transform

[Figure: transformed DAG in which the differences (X1 - X3), (X6 - X4) and (X9 - X7) are combined by two further SUB nodes to produce Y.]

Figure 6.4. Transformed DAG with All SUB Nodes

the DAG shown in figure 6.4. While the initial DAG in figure 6.3 has three SUB and two ADD nodes, the transformed graph has all five nodes of SUB type. Consequently, the code generated from the transformed DAG has zero Hamming distance in the operator part of the instructions.

The reduction in the Hamming distance in the operand part of the instructions results in power reduction in the register file as well and thus has a bigger impact on the overall power reduction. Here is a technique that suitably transforms the DAG and generates code with minimum total Hamming distance between successive instructions.


[Figure: chain-type DAG in which X1 feeds a chain of five nodes whose second inputs are X4, X7, X3, X6 and X9 in that order, producing Y.]

Figure 6.5. Chain-type DAG for Prewitt Window Transform

ADD X1 X4 T1        ADD R0 R1,0 R0
ADD T1 X7 T2        ADD R0 R3,0 R0
SUB T2 X3 T3        SUB R0 R2,0 R0
SUB T3 X6 T4        SUB R0 R6,0 R0
SUB T4 X9 Y         SUB R0 R7,0 R0

Figure 6.6. Optimized Code for Prewitt Window Transform

Step 1: Convert the DAG to a 'Chain' structure and reorder the nodes so as to group all ADD nodes together. The reordering minimizes the Hamming distance in the operator part of the instructions. Figure 6.5 shows such a DAG for the Prewitt Window transform.

Step 2: The chain structure fixes the scheduling of the operations, which is given by the order of nodes in the DAG. Generate the instruction sequence using variables as operands. It can be noted that in such a sequence, the first source and the destination variables of all instructions can be assigned to the same register. Such an assignment results in zero Hamming distance in the Sr1 and Dr operands of the instructions. For the variables in Sr2, the registers are assigned in a gray code sequence so as to minimize the Hamming distance between successive Sr2 operands of the instructions. Figure 6.6 shows the code for the Prewitt window transform before and after the register assignment.

It can be noted that this two-step algorithm generates code that is optimal in terms of minimum program size, minimum number of cycles, minimum number of registers and also minimum power dissipation.
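The sketch below reproduces the register assignment of figure 6.6: X1 and the running result stay in R0, while the successive Sr2 operands get registers whose indices follow a Gray code (1, 3, 2, 6, 7), so consecutive Sr2 fields differ in exactly one bit.

def gray(n):
    """n-th Gray code value: successive values differ in one bit."""
    return n ^ (n >> 1)

def assign_chain(ops):
    """ops: (opcode, Sr2 variable) pairs of a chain DAG in schedule
    order; Sr1 and Dr are fixed to R0 (which holds X1 initially)."""
    return [f'{op} R0 R{gray(k + 1)},0 R0   ; Sr2 = {var}'
            for k, (op, var) in enumerate(ops)]

for line in assign_chain([('ADD', 'X4'), ('ADD', 'X7'),
                          ('SUB', 'X3'), ('SUB', 'X6'), ('SUB', 'X9')]):
    print(line)
# Sr2 registers come out as R1, R3, R2, R6, R7, as in figure 6.6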

6.1.4. Minimizing Number of Operations in Two Dimensional Transforms

A two dimensional transform with an MxN transformation matrix can be implemented as M one dimensional transforms whose transformation matrices (of size 1xN) are given by the rows of the MxN matrix. Such an approach however cannot exploit the computational redundancy across these one


dimensional transforms and hence results in an increased number of Add/Sub operations. The number of Add/Sub operations required to perform an MxN transform can be minimized by extracting common subexpressions and precomputing them. This technique is identical to the technique discussed in the earlier chapter for minimizing the number of additions in the multiplier-less implementation of the MCM computation.

Consider the 4x4 Haar transform [27] shown below.

[Y1]   [ 1  1  1  1 ] [X1]
[Y2] = [ 1 -1  0  0 ] [X2]
[Y3]   [ 1  1 -1 -1 ] [X3]
[Y4]   [ 0  0  1 -1 ] [X4]

From the transformation matrix, it can be observed that the subexpression (X1+X2) is used to compute both Y1 and Y3. Similarly, the subexpression (X3+X4) is also used in computing both Y1 and Y3. The total number of additions can be reduced by precomputing such common subexpressions. For example, the number of additions + subtractions to compute the above equations can be reduced by two, if (X1+X2) and (X3+X4) are precomputed. Thus the number of additions can be minimized by iteratively identifying and precomputing such common subexpressions.

Two types of common subexpressions are used:

1. CS++ in which the elements in two columns of the matrix are both 1 or both -1 for more than one row (e.g. X12+, columns 1, 2 for rows 1 and 3).

2. CS+- in which the elements in two columns of the matrix are (+1,-1) or (-1,+1) for more than one row.

Every iteration involves selecting a common subexpression that results in maximum reduction in the number of operations used to perform the transform. This is the same heuristic that is used for minimizing additions in the multiplier-less implementation of weighted-sum and MCM computations.

Once the subexpression is identified, the transformation matrix is updated to reflect the precomputation. This is done by adding a new column to the transformation matrix and suitably updating the matrix elements. For example, consider the X12+ common subexpression. The modified transformation matrix is shown below:

[Y1]   [ 0  0  1  1  1 ] [X1]
[Y2] = [ 1 -1  0  0  0 ] [X2]
[Y3]   [ 0  0 -1 -1  1 ] [X3]
[Y4]   [ 0  0  1 -1  0 ] [X4]
                         [T1]

where T1 = X1 + X2.


[Figure: optimized DAG with T1 = X1 + X2 and T2 = X3 + X4 precomputed; Y1 = T1 + T2, Y3 = T1 - T2, Y2 = X1 - X2 and Y4 = X3 - X4.]

Figure 6.7. Optimized DAG for 4x4 Haar Transform

Figure 6.7 shows the optimized DAG for the 4x4 Haar transform. It requires six computations compared to the eight computations required if the transform is computed as four 1x4 transforms.
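A compact sketch of this greedy search (columns indexed from 0) is given below; extracting a subexpression that occurs in k rows saves k-1 add/sub operations, and the matrix is the 4x4 Haar example above.

from itertools import combinations

def best_subexpression(matrix):
    """Greedy pick: the column pair/kind whose extraction saves most ops."""
    best = None   # (savings, col_i, col_j, kind)
    for i, j in combinations(range(len(matrix[0])), 2):
        same = sum(1 for row in matrix if row[i] and row[i] == row[j])
        diff = sum(1 for row in matrix if row[i] and row[i] == -row[j])
        for count, kind in ((same, 'CS++'), (diff, 'CS+-')):
            if count > 1 and (best is None or count - 1 > best[0]):
                best = (count - 1, i, j, kind)
    return best

haar = [[1,  1,  1,  1],
        [1, -1,  0,  0],
        [1,  1, -1, -1],
        [0,  0,  1, -1]]
print(best_subexpression(haar))   # -> (1, 0, 1, 'CS++'), i.e. X12+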

6.1.5. Low Power Code Generation

The DAG optimization technique discussed in the earlier section results in code that is optimal in terms of the number of cycles. This subsection presents an approach for instruction scheduling and register assignment to generate code with minimum total Hamming distance between successive instructions.

Step 1: Instruction Scheduling

Generate an initial list of ready-to-be-scheduled nodes by selecting nodes for which both the inputs are primary inputs. Schedule a node Ni from this list as the first node.

For example, for the DAG shown in figure 6.7 the ready-to-be-scheduled list is given by {T1, T2, Y2, Y4}. T1 is scheduled as the first node from this list.

Repeat {

Include the node Ni in the already-scheduled node list. Update the ready-to-be-scheduled list, which has nodes whose inputs are either primary inputs or are in the already-scheduled list.

For the example being considered, during the first iteration, the updated already-scheduled list will be {T1} and the updated ready-to-be-scheduled list will be {T2, Y2, Y4}.

Select a node from the ready-to-be-scheduled list with minimum difference from the latest scheduled node. The difference is computed by comparing the operator and the variables assigned to the operand fields.

For the example being considered, during the first iteration, node Y2 will be selected as it differs with T1 in two fields (operator and destination) as against


ADD X1 X2 T1
SUB X1 X2 Y2
SUB X3 X4 Y4
ADD X3 X4 T2
ADD T1 T2 Y1
SUB T1 T2 Y3

Figure 6.8. Scheduled Instructions for 4x4 Haar Transform

T2, which differs in three fields (all the three operands), and Y4, which differs in all the four fields.

While computing the difference between the current node and the latest scheduled node, if the current node is of type ADD, use commutativity to swap the source operands and check whether the difference reduces with swapped operands.

For example, if the latest scheduled node corresponds to SUB X2 X1 Y1, then for a node with operation ADD X1 X2 Y2 the difference will be 4; however, with inputs swapped, the same operation ADD X2 X1 Y2 will have a difference of two.

} Until ready-to-be-scheduled list is empty

Figure 6.8 gives the output of instruction scheduling for the DAG shown in figure 6.7.

Step 2: Register Assignment

From the schedule derived in step 1, find the lifetimes of all the variables. Figure 6.9 shows the data flow graph for the scheduled DAG and the lifetime spans for all the variables.

Construct a register-conflict graph as follows. Each node in the graph represents a variable in the data flow graph. Connect two nodes of the graph if the lifetimes of the corresponding variables overlap. Figure 6.10 shows the register-conflict graph for the data flow graph shown in figure 6.9.

The register assignment approaches discussed in the literature [4] solve the problem as a graph coloring problem where no two nodes which are connected by an edge are assigned the same color, and the graph is thus colored using the minimum number of colors.

In this approach, the number of registers is minimized only to the extent of eliminating register-spills, and the focus is more on low power considerations. The instruction schedule is analyzed to build a consecutive-variables graph in which each node represents a variable in the data flow graph. Two nodes of the graph are connected if the corresponding variables appear in


[Figure: data flow graph for the scheduled DAG and the lifetime spans of variables X1-X4, T1, T2 and Y1-Y4 across control steps C1 to C6.]

Figure 6.9. Data Flow Graph and Variable Lifetimes for 4x4 Haar Transform

[Figures: the register-conflict graph and the consecutive-variables graph over the variables X1, X4, T2, Y2, Y3 and Y4.]

Figure 6.10. Register-Conflict Graph

Figure 6.11. Consecutive-Variables Graph

consecutive cycles at the same operand location. Each edge E[i,j] in the graph is assigned a weight W[i,j] given by the number of times variables i and j appear consecutively in the instruction sequence.

Figure 6.11 shows the consecutive-variables graph for the DFG shown in figure 6.9. It can be noted that for this graph, all the edges have the same weight (= 1).


[Figure: modified consecutive-variables graph in which X4 and T2 share a register; nodes are annotated with register codes such as Y1 -> R5 (0101), R1 (0001) and R7 (0111).]

HD
     ADD R0 R15,0 R3
2    SUB R0 R15,0 R2
3    SUB R1 R7,0 R6
2    ADD R1 R7,0 R7
2    ADD R3 R7,0 R5
2    SUB R3 R7,0 R4

Figure 6.12. Register Assignment for Low Power

Figure 6.13. Code Optimized for Low Power

The low power register assignment can be formulated as a graph coloring problem, with the cost function to be minimized given by:

CF = Σi Σj HD[i,j] · W[i,j]   (6.3)

This cost function is the same as that used for FSM state encoding. Many techniques have been proposed to solve this problem, including approaches such as simulated annealing [63] and stochastic evolution [47] based optimization.

Since the objective is to minimize the Hamming distance, register sharing is performed only if it helps in reducing the Hamming distance. In general, if two variables are connected in the consecutive-variables graph but are not connected in the register-conflict graph, they are assigned to the same register. From the graphs shown in figures 6.10 and 6.11, it can be noted that variables X4 and T2 satisfy this criterion and hence can be assigned the same register. Figure 6.12 shows the modified consecutive-variables graph and the corresponding register assignment. The code thus generated is shown in figure 6.13. The total Hamming distance between successive instructions for this code is 11 (assuming a Hamming distance of one between the opcodes of the ADD and SUB operations).
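The sketch below evaluates the cost function (6.3) for a candidate assignment; the variables, edge weights and register codes here are illustrative.

def hd(a, b):
    return bin(a ^ b).count('1')

def cf(assignment, edges):
    """assignment: variable -> register code; edges: {(u, v): W[u,v]}."""
    return sum(w * hd(assignment[u], assignment[v])
               for (u, v), w in edges.items())

# three variables of a consecutive-variables graph, all weights 1
edges = {('X4', 'T2'): 1, ('T2', 'Y1'): 1, ('X4', 'Y1'): 1}
codes = {'X4': 0b0111, 'T2': 0b0111, 'Y1': 0b0101}
print(cf(codes, edges))   # -> 2: sharing R7 for X4 and T2 zeroes one edge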

The code generated by this algorithm was compared with that generated using the optimizing C compiler for the TMS470R1x [99], which is based on the ARM7TDMI core of Advanced RISC Machines Ltd.

Table 6.1 gives the total Hamming distance between successive instructions for 6 different transforms (figures 6.3, 6.14, 6.7 and 6.18). The second column gives the measure for the code generated from C code that directly implements


Sobel Window Transform    Spatial High-Pass Filter    Spatial Low-Pass (Averaging) Filter

 1  2  1                  -1 -1 -1                     1  1  1
 0  0  0                  -1  8 -1                     1  1  1
-1 -2 -1                  -1 -1 -1                     1  1  1

Figure 6.14. 3x3 Window Transforms

Table 6.1. Total Hamming Distance Between Successive Instructions

Transform                   TMS470R1x     Instruction Scheduling +    Low Power Code    % red. w.r.t.
                            C Compiler    TMS470R1x C Compiler        Generator         C Compiler

Prewitt Window                  17                 10                       5                71%
Sobel Window                    13                 13                       7                46%
Spatial High Pass Filter        22                 17                      10                55%
Spatial Low Pass Filter         14                 14                       7                50%
4x4 Haar                        29                 20                      11                62%
4x4 Walsh-Hadamard              42                 35                      15                64%

these DAGs. The third column presents results for the C code that represents the reordered DAG and consequently re-scheduled instructions. The fourth column gives the Hamming distance measure for the code generated by the low power code generator. The results assume that the Hamming distance between the ADD and the SUB opcodes is one.

As can be seen from the results, significant power reduction can be achieved by using a low power driven code generation approach. To compare this approach with the approach that first does register assignment and then performs cold scheduling, the register assignment done by the TMS470R1x C compiler was used to cold schedule the Prewitt window transform. The total Hamming distance for the resultant code was eight, compared to the measure of five for the low power code generator. This justifies the approach of first scheduling the instructions and then performing low power register assignment.


[Figure: single register, accumulator based architecture. The address generator drives the program memory address bus (PAB) and the data memory read/write address busses (DRAB, DWAB); instructions arrive on the program memory data bus (PDB) into the instruction register; data read on DRDB passes through a left shifter into the add/subtract unit feeding the accumulator (ACC), whose (optionally shifted) contents are written back on DWDB.]

Figure 6.15. Single Register, Accumulator Based Architecture

6.2. Optimum Code Generation for Single Register, Accumulator Based Architectures


6.2.1. Single Register, Accumulator Based Architecture Model

The code generation techniques presented in this section are targeted to the architecture model shown in figure 6.15. This model is a suitable abstraction of the TMS320C2x [95] and TMS320C5x [96] processors and shows the datapath of interest to the multiplication-free linear transforms.

The architecture has six busses. The program memory address bus (PAB) gives the location of the instruction to be fetched. The instruction is fetched on the program memory data bus (PDB). The location of the data to be read is specified on the data memory read address bus (DRAB) and the data is read from the memory on the data memory read data bus (DRDB). The location for writing a data value is specified on the data memory write address bus (DWAB) and the data to be written is put on the data memory write data bus (DWDB).

The architecture supports the following instructions:

1 ADD Mem,Shift: (Acc) + (Mem)<<Shift -> (Acc)
For example, the instruction 'ADD X,1' gets the value from the data memory location addressed by X, left shifts it by 1 and adds it to the accumulator in one clock cycle (assuming 0 wait-state memory access).

2 SUB Mem,Shift: (Acc) - (Mem)<<Shift -> (Acc)


[Figure: example DAG with inputs X1-X5, intermediate nodes T1, T2, T3 and outputs Y1, Y2.]

Figure 6.16. Example DAG

3 LAC Mem,Shift: (Mem)<<Shift -> (Acc)
For example, the instruction 'LAC X,1' gets the value from the data memory location addressed by X, left shifts it by 1 and loads it into the accumulator in one clock cycle (assuming 0 wait-state memory access).

4 SAC Mem,Shift: (Acc)<<Shift -> (Mem)

5 NEG: -(Acc) -> (Acc)

The code generator uses these instructions to implement a multiplication-free linear transform.

6.2.2. Code Generation Rules

One of the inputs to the code generator is a DAG representation of the desired computation. Figure 6.16 shows a DAG representation of a five input, two output computation. A DAG has three types of nodes - input, output and intermediate nodes. For example, nodes X1, X2, X3, X4, X5 in figure 6.16 are the input nodes, T1, T2, T3 are the intermediate nodes and Y1, Y2 are the output nodes. The output and the intermediate nodes represent either an ADD or a SUBTRACT operation and have a fanin of two.

The other input to the code generator is a sequence in which the nodes of the DAG need to be evaluated. Given the sequence and the DAG, the following rules are used to generate the code:

Let 'current' node be the latest evaluated node and 'new' node be the new node for which the code is being generated.


Table 6.2. Code Dependence on the Scheduling of DAG Nodes

T1,T2,T3,Y1,Y2       T2,T3,T1,Y1,Y2       T2,T1,Y1,T3,Y2

LAC X1  rule 1       LAC X3  rule 1       LAC X3  rule 1
ADD X2  rule 1       ADD X4  rule 1       ADD X4  rule 1
SAC T1  rule 1       SAC T2  rule 4       SAC T2  rule 4
LAC X3  rule 1       ADD X5  rule 2       LAC X1  rule 1
ADD X4  rule 1       SAC T3  rule 1       ADD X2  rule 1
SAC T2  rule 4       LAC X1  rule 1       SUB T2  rule 2
ADD X5  rule 2       ADD X2  rule 1       SAC Y1  rule 4
SAC T3  rule 1       SUB T2  rule 2       LAC T2  rule 1
LAC T1  rule 1       SAC Y1  rule 4       ADD X5  rule 1
SUB T2  rule 1       ADD T3  rule 2       ADD Y1  rule 2
SAC Y1  rule 4       SAC Y2  rule 4       SAC Y2  rule 4
ADD T3  rule 2
SAC Y2  rule 4

13 cycles            11 cycles            11 cycles

1 If the 'current' node is not one of the fanin nodes of the 'new' node, save the 'current' node (SAC instruction), load the left fanin node of the 'new' node (LAC instruction) and ADD/SUBTRACT the right fanin node of the 'new' node.

2 If the 'current' node is a left fanin node of the 'new' node, ADD/SUBTRACT the right fanin node of the 'new' node.

3 If the 'current' node is a right fanin node of the 'new' node and the 'new' node function is SUBTRACT, negate the 'current' node (NEG instruction) and ADD the left fanin node of the 'new' node.

4 If the 'new' node is an output node or an intermediate node with a fanout of two or more, store the new node (SAC instruction) before proceeding with the next node.

Consider the sequences {T1, T2, T3, Y1, Y2}, {T2, T3, T1, Y1, Y2} and {T2, T1, Y1, T3, Y2} for the DAG shown in figure 6.16. The corresponding code is shown in table 6.2.

As can be seen from this example, for a given DAG, the code size and consequently the number of cycles depend on the sequence in which the nodes are evaluated. The code optimization problem thus maps onto the problem of finding an optimum sequence of DAG node evaluations.
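These rules are mechanical enough to state directly in code. The sketch below transcribes them; the example DAG is that of figure 6.16 (with node definitions as implied by table 6.2), and the schedule is the second one from the table.

def generate(dag, schedule, outputs, fanout):
    """dag: node -> (op, left, right); emits LAC/ADD/SUB/SAC per rules 1-4."""
    code, current = [], None
    for name in schedule:
        op, left, right = dag[name]
        if current == left:                        # rule 2
            code.append(f'{op} {right}')
        elif current == right and op == 'ADD':     # rule 2 (commutative)
            code.append(f'{op} {left}')
        elif current == right and op == 'SUB':     # rule 3
            code += ['NEG', f'ADD {left}']
        else:                                      # rule 1
            if current is not None and code[-1] != f'SAC {current}':
                code.append(f'SAC {current}')      # accumulator spill
            code += [f'LAC {left}', f'{op} {right}']
        if name in outputs or fanout.get(name, 0) >= 2:   # rule 4
            code.append(f'SAC {name}')
        current = name
    return code

dag = {'T1': ('ADD', 'X1', 'X2'), 'T2': ('ADD', 'X3', 'X4'),
       'T3': ('ADD', 'T2', 'X5'), 'Y1': ('SUB', 'T1', 'T2'),
       'Y2': ('ADD', 'Y1', 'T3')}
code = generate(dag, ['T2', 'T3', 'T1', 'Y1', 'Y2'], {'Y1', 'Y2'}, {'T2': 2})
print(len(code), code)   # -> 11 instructions, as in table 6.2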


6.2.3. Computation Scheduling Algorithm

This subsection presents an algorithm for scheduling the DAG computations for the minimum number of cycles. The algorithm uses the following knowledgebase derived from the code generation rules presented earlier.

1 A node can be scheduled for computation only if both its fanin nodes are already computed or are input nodes.

2 The computation of output nodes and of intermediate nodes with a fanout of two or more always needs to be stored, irrespective of the next computation node.

3 If the 'current' node is one of the fanin nodes of the 'new' node, it avoids accumulator spill and hence reduces the 'store' and 'load' overhead.

The algorithm performs the optimization in two phases. The first phase uses the above mentioned knowledgebase to generate an initial schedule. The second phase performs iterative refinement to further optimize the schedule.

Procedure DAG-Schedule
Input: DAG representation of the computation to be implemented on a single register, accumulator based machine
Output: An order in which the DAG nodes need to be scheduled so as to generate code with minimum accumulator spills

scheduled-node-list = {}
current-node = 0
while (no.-of-scheduled-nodes < total-no.-of-intermediate+output-nodes) {
    /* build candidate-node-list */
    candidate-node-list = {}
    for all (nodei ∉ scheduled-node-list) {
        if ((nodei.left-fanin ∈ (input-node-list + scheduled-node-list)) .and.
            (nodei.right-fanin ∈ (input-node-list + scheduled-node-list)))
            candidate-node-list += nodei
    }
    /* assign weights to the candidate nodes */
    for all (nodei ∈ candidate-node-list) {
        nodei.weight = 1
        if ((nodei ∈ output-node-list) .or. (nodei.fanout >= 2))
            nodei.weight++
        if ((nodei.left-fanin = current-node) .or.
            ((nodei.right-fanin = current-node) .and. (nodei.op = ADD)))
            nodei.weight += 2
        if (nodei.fanout-node.right-fanin ∈ scheduled-node-list)
            nodei.weight += 2
    }
    /* select the node with the highest weight for scheduling */
    Find (nodem ∈ candidate-node-list) such that nodem.weight is maximum
    scheduled-node-list += nodem
    current-node = nodem
}

In the above algorithm, at each stage of node selection, there can be more than one node with the same weight. During the first phase of the algorithm a node is selected randomly. In the iterative refinement phase, a node selected at each stage is replaced by another node (if available) with the same weight. The resultant schedule is compared with the initial schedule and accepted if it results in fewer cycles.

The scheduling algorithm, when applied to the DAG shown in figure 6.16, generates the schedule T2, T3, T1, Y1, Y2, with no further improvement possible in the iterative refinement phase. During the first iteration of the algorithm, the candidate-node-list consists of nodes T1 and T2, with weights 1 and 4 respectively. Node T2 is hence scheduled first. In the second iteration, the candidate list consists of nodes T1 and T3, both having weight 3. Selecting T3 results in the schedule T2, T3, T1, Y1, Y2, which requires 11 cycles. During the iterative refinement phase of the algorithm, T1 is selected instead of T3, resulting in the schedule T2, T1, Y1, T3, Y2, which also requires 11 cycles.

The code generated by the algorithm presented in this chapter was compared with that generated using the optimizing C compiler for the TMS320C5x. The DAGs for the 4x4 Walsh-Hadamard transform shown in figures 6.17, 6.18 and 6.23 were converted to an equivalent C program and compiled with the highest optimization level. The generated code, which used indirect addressing, was converted to use direct addressing, thus reducing the number of cycles. Table 6.3 shows the comparison in terms of number of cycles, assuming that the program and data are available in on-chip memories.

Table 6.3. Comparison of Code Generator with 'C5x C Compiler

DAG          'C5x C compiler    Code Generator
             no. of cycles      no. of cycles

Fig. 6.17          20                 20
Fig. 6.18          22                 22
Fig. 6.23          19                 14

The results show that the code generator generates as compact code as the 'C5x C compiler for the first two DAGs. It does better in case of the DAG in fig-


[Figure: DAG computing each of Y1-Y4 as a separate chain over X1, X2, X3, X4.]

Figure 6.17. DAG for 4x4 Walsh-Hadamard Transform

ure 6.23. The main reason for this is that the C compiler, during its optimization phase, modifies the DAG and in the process generates code with more cycles.

6.2.4. Impact of DAG Structure on the Optimality of Generated Code

This section shows how the structure of a DAG impacts the optimality of the code. For a one dimensional transform, the DAG structure can be either tree-type or chain-type. It can be shown that a chain-type structure results in 0 accumulator spills and hence results in more optimal code than the tree-type structure.

Consider the 4x4 Walsh-Hadamard transform as an example to analyze the relationship between the structure of a DAG and the optimality of the generated code.

The 4x4 Walsh-Hadamard transform [27] is shown below.

[Y1]   [ 1  1  1  1 ] [X1]
[Y2] = [ 1 -1  1 -1 ] [X2]
[Y3]   [ 1  1 -1 -1 ] [X3]
[Y4]   [ 1 -1 -1  1 ] [X4]

One approach to realize this transform is to implement it as four 1x4 one dimensional transforms. The DAG for such an implementation, shown in figure 6.17, has 12 nodes (i.e. 12 additions + subtractions).

The number of nodes can be minimized using the technique, discussed earlier, of precomputing the common subexpressions. Figure 6.18 shows the DAG thus optimized for the minimum number of nodes. It has 8 nodes (8 additions + subtractions) compared to the 12 nodes of the DAG shown in figure 6.17.

The scheduling algorithm discussed earlier was used to schedule the DAGs shown in figures 6.17 and 6.18. While the DAG shown in figure 6.17 requires


[Figure: optimized DAG for the 4x4 Walsh-Hadamard transform, computing Y1-Y4 from X1-X4 via shared intermediate butterfly nodes.]

Figure 6.18. Optimized DAG for 4x4 Walsh-Hadamard Transform

20 cycles, the DAG in figure 6.18 requires 22 cycles to compute the transform, even though it has four fewer nodes. Clearly, fewer nodes do not always translate into fewer cycles. The main reason for the DAG in figure 6.18 requiring more cycles is that all its intermediate nodes have a fanout of two. For single register, accumulator based architectures, such intermediate nodes result in accumulator spills, and consequently in 'store' and 'load' overhead.

6.2.5. DAG Optimizing Transformations

This section presents four DAG transformations that minimize the accumulator spills and hence the number of execution cycles.

6.2.5.1 Transformation I - Tree to Chain Conversion

This transformation converts a 'tree' structure in a DAG to a 'chain' structure. This eliminates the need to store the intermediate computations and hence reduces the number of cycles. Figure 6.19 shows an example of this transformation. While the DAG with a 'tree' structure requires seven cycles to compute the output, the transformed 'chain' structure performs the computation in five cycles.

6.2.5.2 Transformation II - Serializing a Butterfly

Many image transform DAGs have 'butterfly' structures that perform computations of the type (Y1 = X1 + X2, Y2 = X1 - X2). Such butterfly structures can be serialized by computing one of the butterfly outputs in terms of the other output, using a SHIFT operation which, when performed along with ADD or SUBTRACT, does not require an additional cycle. Figure 6.20


TREE STRUCTURE        CHAIN STRUCTURE

LAC X1                LAC X1
ADD X2                ADD X2
SAC T1                ADD X3
LAC X3                ADD X4
ADD X4                SAC Y1
ADD T1
SAC Y1

Figure 6.19. Transformation I - Tree to Chain Conversion

BUTTERFLY             SERIALIZED (Y2 from Y1)        SERIALIZED (Y1 from Y2)

LAC X1                LAC X1                         LAC X1
ADD X2                ADD X2                         SUB X2
SAC Y1                SAC Y1                         SAC Y2
LAC X1                SUB X2,1              OR       ADD X2,1
SUB X2                SAC Y2                         SAC Y1
SAC Y2

Figure 6.20. Transformation II - Serializing a Butterfly

shows the serialized DAGs which require five cycles compared to six cycles required for the butterfly computation.

As can be seen from the figure, there are two ways of serializing a butterfly, depending on whether Y1 is computed in terms of Y2 (Y1 = Y2 + 2*X2) or Y2 is computed in terms of Y1 (Y2 = Y1 - 2*X2). The choice of the transformation depends on the context in which the butterfly appears in the overall DAG.
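The identity behind the transformation is easy to verify; the sketch below checks that a single shift-and-subtract recovers the second butterfly output.

def butterfly(x1, x2):
    return x1 + x2, x1 - x2

def serialized(x1, x2):
    y1 = x1 + x2
    y2 = y1 - (x2 << 1)   # 'SUB X2,1': subtract X2 left-shifted by one
    return y1, y2

assert all(butterfly(a, b) == serialized(a, b)
           for a in range(-8, 8) for b in range(-8, 8))
print("Y2 = Y1 - 2*X2 reproduces the butterfly")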

6.2.5.3 Transformation III - Fanout Reduction

Since intermediate nodes with a fanout of two or more result in accumulator spilling, this transformation reduces the fanout of an intermediate node in a DAG. Unlike the first two transformations, this transformation increases the number of nodes in the DAG by one. Figure 6.21 shows an example of this transformation applied to a four input, two output DAG. It can be noted that the fanout of the intermediate node T1 in the transformed DAG is one (i.e. one less than in the original DAG). While the original DAG has three nodes and requires eight cycles, the transformed DAG has four nodes but requires seven cycles.

For the DAG shown in figure 6.21, the fanout reduction transformation can also be applied to eliminate T1's fanout to Y2, instead of eliminating T1's fanout to Y1. In general, the choice of which fanout edge to eliminate depends on the context in which the node appears in the overall DAG.


[Figure: the fanout reduction and merging transformations applied to a four input, two output DAG in which T1 = X1 + X2 feeds both Y1 = T1 + X3 and Y2 = T1 + X4; the transformed versions are labeled FANOUT REDUCTION and MERGING, each shown with its code sequence.]

Figure 6.21. Transformations III and IV

[Figure: the DAG of figure 6.18 optimized in three steps - serializing a butterfly, merging, and tree to chain conversion.]

Figure 6.22. Optimizing DAG Using Transformations

6.2.5.4 Transformation IV - Merging

Merging is another transformation that reduces the fanout of intermediate nodes. Unlike the earlier transformations, this transformation does not reduce the number of cycles. However, it transforms the DAG so that other transformations can be applied to the modified DAG. Figure 6.21 also shows an example of the 'merging' transformation applied to the four input, two output DAG.

Figure 6.22 shows how these transformations can be applied to the DAG in figure 6.18. The resultant DAG requires 16 cycles (six cycles less) to compute the 4x4 Walsh-Hadamard transform.

The amount of optimization possible using these transformations depends on the sequence in which the nodes are selected and the choice of transformations applied. One approach is to search the DAG for potential nodes for transformation and, for a selected node, apply the transformation that results in the most saving. This greedy approach often does not give the optimum solution. Instead of applying transformations to eliminate accumulator spills, a spill-free DAG can be directly synthesized from the transformation matrix.


The following subsection presents a technique for the synthesis of spill-free DAGs that are optimal in terms of the number of cycles.

6.2.6. Synthesis of Spill-free DAGs

A DAG that can be scheduled without any accumulator spills provides certain advantages. Firstly, it simplifies code generation. Secondly, since there are no accumulator spills, no intermediate storage is required, thus reducing the memory requirements to implement the transform. The DAG in figure 6.17 is an example of such a DAG. It however requires 20 cycles to compute the transform. It can be noted that the final optimized DAG in figure 6.22 is also a spill-free DAG. This DAG however requires just 16 cycles. The main reason for the reduced cycles is the fact that this DAG uses precomputed outputs along with the inputs. For example, Y3 is computed in terms of Y1 and the primary inputs, and Y4 is computed in terms of Y2 and the primary inputs. Instead of generating a DAG with the minimum number of additions and then applying transformations, this DAG can be directly generated from the transformation matrix, if the sequence of output computation is known.

Here is an algorithm that arrives at an optimum sequence of output computations such that the resultant spill-free DAG requires the fewest cycles. The algorithm operates on a graph whose nodes represent the outputs. The nodes are of three types, corresponding to:

1. the 'most-recently-computed' output,
2. other outputs that are 'already-computed', and
3. outputs that are 'yet-to-be-computed'.

Each node in the graph has an edge (self loop) that starts and ends in itself. These self-loops are assigned costs given by the number of cycles required to compute the output independently (i.e. without using any of the precomputed outputs). There are also edges from every 'already-computed' output node to all the 'yet-to-be-computed' nodes. Each edge is assigned a cost given by the number of cycles required to compute the 'yet-to-be-computed' output in terms of the 'already-computed' output. The algorithm uses the steepest descent approach, which at every stage selects an output that results in minimum incremental cost. In case of more than one output having the same lowest incremental cost, one output is selected randomly. Once an output is selected, it is marked as the 'most-recently-computed' output. All the edges between this node and the 'already-computed' nodes are deleted, and new edges are added between this node and the other 'yet-to-be-computed' nodes. The newly added edges are then assigned appropriate costs. This process is repeated to cover all the outputs.


Procedure Synthesize-Spill-free-DAG
Input: Two dimensional matrix representing the multiplication-free linear transform
Output: Spill-free DAG representation of the computation with the DAG having minimum number of nodes

already-computed-output-list = {}
most-recently-computed-output = φ
/* Construct initial graph and compute edge costs */
for (i=0; i<no-of-outputs; i++) {
    edge[i,i].cost = number of non-zero entries in row 'i' + 1
}
repeat {
    Find the edge E(M,N) with the lowest cost.
    if (M == N) { /* self loop */
        Generate the DAG to compute output(N) in terms of only the inputs
    } else {
        Generate the DAG to compute output(N) in terms of inputs and output(M)
    }
    /* Update the graph */
    Delete edge E(N,N)
    for each node (i ∈ already-computed-output-list) {
        Delete edge E(i,N)
    }
    already-computed-output-list += N
    for each node (i ∈ yet-to-be-computed-output-list) {
        E(most-recently-computed-output,i).cost++
    }
    most-recently-computed-output = N
    for each node (i ∈ yet-to-be-computed-output-list) {
        Add edge E(N,i)
        E(N,i).cost = number of mismatches between row N and row 'i' of
                      the transformation matrix
    }
} until (yet-to-be-computed-output-list == {})

Figure 6.23 shows each iteration of the algorithm applied to the 4x4 Walsh-Hadamard transform matrix, and the resultant DAG. It can be noted that the resultant DAG is spill-free and requires just 14 cycles to compute the transform.
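The cost bookkeeping of the procedure is easy to reproduce; the sketch below computes the initial self-loop costs and a sample edge cost for the 4x4 Walsh-Hadamard matrix.

def self_cost(row):
    """Cycles to compute an output on its own: LAC + (k-1) ADD/SUBs + SAC."""
    return sum(1 for a in row if a) + 1

def edge_cost(row_m, row_n):
    """Cycles to compute output N from output M: one ADD/SUB per mismatch."""
    return sum(1 for a, b in zip(row_m, row_n) if a != b)

wh4 = [[1,  1,  1,  1],
       [1, -1,  1, -1],
       [1,  1, -1, -1],
       [1, -1, -1,  1]]

print([self_cost(r) for r in wh4])    # -> [5, 5, 5, 5]
print(edge_cost(wh4[0], wh4[2]))      # -> 2: Y3 from Y1 needs two ops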

Here are the results of applying these transformations to the 8x8 Walsh-Hadamard transform, the 8x8 Haar transform and the 4x4 Slant transform.


[Figure: the iterations of the spill-free DAG synthesis on the 4x4 Walsh-Hadamard matrix and the resulting DAG, which computes Y1 from the inputs and then Y3, Y2 and Y4 using previously computed outputs.]

Figure 6.23. Spill-free DAG Synthesis

The 8x8 Walsh-Hadamard transform [27] is given by:

[Y1]   [ 1  1  1  1  1  1  1  1 ] [X1]
[Y2]   [ 1 -1  1 -1  1 -1  1 -1 ] [X2]
[Y3]   [ 1  1 -1 -1  1  1 -1 -1 ] [X3]
[Y4] = [ 1 -1 -1  1  1 -1 -1  1 ] [X4]
[Y5]   [ 1  1  1  1 -1 -1 -1 -1 ] [X5]
[Y6]   [ 1 -1  1 -1 -1  1 -1  1 ] [X6]
[Y7]   [ 1  1 -1 -1 -1 -1  1  1 ] [X7]
[Y8]   [ 1 -1 -1  1 -1  1  1 -1 ] [X8]

The direct computation of this transform requires 56 additions + subtractions and the corresponding code executes in 72 cycles. The number of additions + subtractions can be minimized to 24 using the common subexpression precomputation algorithm. The resultant DAG is shown in figure 6.24. The code corresponding to this DAG requires 64 cycles. The DAG can be optimized by applying transformations to serialize all the butterflies. The resultant DAG is also shown in figure 6.24. This DAG also has 24 nodes, but the corresponding code requires 52 cycles.

Figure 6.25 shows a spill-free DAG synthesized for the 8x8 Walsh-Hadamard transform. The DAG has 35 nodes and the corresponding code requires 44 cycles. The results so far indicate that for both the 4x4 and 8x8 Walsh-Hadamard transforms, the spill-free DAGs result in the most efficient code. To take this analysis further, the DAG for the 8x8 Walsh-Hadamard transform was modified to


[Figure: the common subexpression optimized DAG for the 8x8 Walsh-Hadamard transform (left) and the same DAG with all butterflies serialized (right), each computing Y1-Y8 from X1-X8.]

Figure 6.24. DAGs for 8x8 Walsh-Hadamard Transform

extract a common sub-computation (X5 + X6 + X7 + X8). The resultant DAG is also shown in figure 6.25. This DAG has 32 nodes and it does result in one accumulator spill. The code corresponding to this DAG requires 42 cycles (2 less than the spill-free DAG).

The 8x8 Haar transform [27] is given by:

[Y1]   [ 1  1  1  1  1  1  1  1 ] [X1/√8]
[Y2]   [ 1  1  1  1 -1 -1 -1 -1 ] [X2/√8]
[Y3]   [ 1  1 -1 -1  0  0  0  0 ] [X3/2]
[Y4] = [ 0  0  0  0  1  1 -1 -1 ] [X4/2]
[Y5]   [ 1 -1  0  0  0  0  0  0 ] [X5/√2]
[Y6]   [ 0  0  1 -1  0  0  0  0 ] [X6/√2]
[Y7]   [ 0  0  0  0  1 -1  0  0 ] [X7/√2]
[Y8]   [ 0  0  0  0  0  0  1 -1 ] [X8/√2]

The direct computation of this transform requires 24 additions + subtractions and the corresponding code executes in 40 cycles. The number of additions + subtractions can be minimized to 14 using the common subexpression precomputation algorithm. The resultant DAG is shown in figure 6.26. The code corresponding to this DAG requires 39 cycles. The DAG was optimized by applying transformations to serialize all the butterflies. The resultant DAG is also shown in figure 6.26. This DAG also has 14 nodes, but the corresponding code requires 30 cycles.

The spill-free DAG for the 8x8 Haar transform has 20 nodes and the corresponding code requires 32 cycles.


[Figure: spill-free DAGs for the 8x8 Walsh-Hadamard transform - the synthesized spill-free DAG and its variant with the common sub-computation (X5 + X6 + X7 + X8) extracted.]

Figure 6.25. Spill-free DAGs for 8x8 Walsh-Hadamard Transform

[Figure: the common subexpression optimized DAG for the 8x8 Haar transform (left) and the same DAG with all butterflies serialized (right).]

Figure 6.26. DAGs for 8x8 Haar Transform


The 4x4 Slant transform [27] can be transformed into a 4x8 multiplication-free transform as shown below:

[Y1]   [ 1  1  1  1  0  0  0  0 ]   [X1/2]
[Y2] = [ 1  1 -1 -1  1  0  0 -1 ] · [X2/2√5]
[Y3]   [ 1 -1 -1  1  0  0  0  0 ]   [X3/2]
[Y4]   [ 1 -1  1 -1  0 -1  1  0 ]   [X4/2√5]
                                    [X1]
                                    [X2/√5]
                                    [X3]
                                    [X4/√5]

It can be noted that the left half of the 4x8 matrix is the same as the Walsh-Hadamard transform. The direct computation of the 4x8 transform requires 16 additions + subtractions and the corresponding code executes in 24 cycles. The number of additions + subtractions can be minimized to 12 using the common subexpression precomputation algorithm. The code corresponding to the resultant DAG requires 26 cycles.

Interestingly, the spill-free DAG can be synthesized directly from the 4x4 matrix with elements 1, -1, 3 and -3. The four outputs can be computed as

Y1 = X1 + X2 + X3 + X4,
Y2 = Y1 + X1<<1 - X3<<1 - X4<<2,
Y3 = Y2 - X1<<1 - X2<<1 + X4<<2,
Y4 = Y3 - X2<<1 + X3<<2 - X4<<1

The DAG for the above computation has 12 nodes and requires 17 cycles.

The results presented so far are summarized in table 6.4.

Table 6.4. Number of Nodes (Ns) and Cycles (Cs) for Various DAG Transforms

                      Original     Minimum      Serialized    Spill-free
Transform             adds+subs    adds+subs    butterflies   DAGs
                      Ns    Cs     Ns    Cs     Ns    Cs      Ns    Cs
4x4 Walsh-Hadamard    12    20     8     22     8     19      9     14
8x8 Walsh-Hadamard    56    72     24    64     24    52      35    44
8x8 Haar              24    40     14    39     14    30      20    32
4x4 Slant             16    24     12    26     12    23      12    17

The following subsections present a technique that reorders the nodes of the DAG to achieve low power realization of multiplication-free linear transforms on a single register, accumulator based architecture.


6.2.7. Sources and Measures of Power Dissipation

The six busses of the architecture shown in figure 6.15 are networks with a large capacitive loading. Hence signal switching in these networks has a significant impact on power consumption. Since during each cycle an instruction is fetched from the program memory, the power dissipated in the PDB bus is directly dependent on the total Hamming distance between successive instructions. The power dissipation in the PAB bus is dependent on the addresses of the instructions being fetched. Techniques such as gray coded addressing have been proposed to reduce power dissipation in the PAB bus. The DRAB and DRDB busses see activity during the execution of LAC, ADD and SUB instructions. While the amount of switching and hence the power dissipation in the DRDB bus is dependent on the data values, the power dissipation in the DRAB bus is dependent on the total Hamming distance between successive data read addresses. Since the code for the multiplication-free linear transforms is executed sequentially, static analysis of the code is adequate to compute these measures of power dissipation. The following measure is used for the power dissipated due to the execution of the i-th instruction:

P[i] = HD(Opcode part of Instruction[i-1], Opcode part of Instruction[i]) +
       2 · HD(Memory address part of Instruction[i-1], Memory address part of Instruction[i])
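A minimal Python sketch of this measure is shown below (not from the book; the instruction model and the 1-hot opcode values are illustrative assumptions):

    def hamming(a, b):
        """Number of bit positions in which integers a and b differ."""
        return bin(a ^ b).count("1")

    def power_measure(code):
        """Sum of P[i] over a straight-line code sequence.

        code: list of (opcode, address) integer pairs, one per instruction.
        P[i] = HD(opcode[i-1], opcode[i]) + 2 * HD(address[i-1], address[i])
        """
        total = 0
        for (op0, ad0), (op1, ad1) in zip(code, code[1:]):
            total += hamming(op0, op1) + 2 * hamming(ad0, ad1)
        return total

    # Example with 1-hot opcodes (LAC=0b0001, ADD=0b0010, SAC=0b1000):
    print(power_measure([(0b0001, 0x00), (0b0010, 0x01), (0b1000, 0x0F)]))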

6.2.8. Low Power Code Generation

For one dimensional transforms, the chain structure results in a code that causes zero accumulator spills and is hence optimal in terms of the number of cycles. The nodes of the chain structured DAG can be reordered so as to reduce the power dissipation in the PDB and DRAB busses. To arrive at an optimal order, a fully connected graph is constructed with the nodes of the graph representing the nodes of the DAG (and thus the corresponding instructions). The edges of the graph are assigned weights in the following way:

W[i,j] = HD(Opcode part of Instruction[i], Opcode part of Instruction[j]) +
         2 · HD(Memory address part of Instruction[i], Memory address part of Instruction[j])

The problem of finding an optimum node order can be reduced to the problem of finding the lowest cost Hamiltonian path in an edge-weighted graph, i.e. the traveling salesman problem. Since the last instruction has to be SAC (i.e. store the accumulator contents into the specified data memory location), the algorithm uses the corresponding node as the starting point and works backwards to get the lowest cost Hamiltonian path. It also comprehends the constraint that the first instruction has to be LAC (load the accumulator with the contents of the specified data memory location) and that only those variables that are to be multiplied by a weight of +1 can be used for the LAC instruction.
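A hedged sketch of this reordering follows. It greedily grows the path backwards from the SAC node, always picking the not-yet-placed node with the smallest edge weight W[i,j] (a nearest-neighbour approximation of the lowest cost Hamiltonian path; an exact TSP solver could be substituted). The matrix W and the list lac_candidates are assumed inputs, not the book's data structures:

    def reorder_nodes(W, sac_node, lac_candidates):
        n = len(W)
        order = [sac_node]                      # build backwards from SAC
        remaining = set(range(n)) - {sac_node}
        while remaining:
            last = order[-1]
            lac_left = remaining & set(lac_candidates)
            # Keep at least one LAC-eligible node available for the final slot,
            # since the first instruction of the reversed order must be a LAC.
            if len(remaining) > 1 and len(lac_left) == 1:
                pool = remaining - lac_left
            else:
                pool = remaining
            nxt = min(pool, key=lambda j: W[last][j])
            order.append(nxt)
            remaining.remove(nxt)
        order.reverse()                         # path was built backwards
        return order

    # Tiny example: node 2 is the SAC node, node 0 is LAC-eligible.
    W = [[0, 2, 4], [2, 0, 3], [4, 3, 0]]
    print(reorder_nodes(W, sac_node=2, lac_candidates=[0]))   # [0, 1, 2]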


Table 6.5. Hamming Distance Measure for Accumulator based Architectures

Transform                  Initial DAG   Reordered DAG   %reduction
Prewitt Window             36            28              22%
Sobel Window               40            24              40%
Spatial High Pass Filter   42            28              33%
Spatial Low Pass Filter    34            25              26%
4x4 Haar                   68            62              9%
4x4 Walsh-Hadamard         64            58              9%

For a two dimensional transform, the spill-free DAG structure is used as the starting point and is partitioned into sub-DAGs bounded by primary output computations (i.e. the SAC instructions). For example, the spill-free DAG of the Walsh-Hadamard transform shown in figure 6.23 is partitioned into four sub-DAGs bounded by the SAC Y1, SAC Y3, SAC Y2 and SAC Y4 instructions. The nodes within each of the sub-DAGs can then be reordered without affecting the overall functionality, thus resulting in code with reduced power dissipation.

Table 6.5 gives the Hamming distance based measure (described in section 6.2.7) for six multiplication-free linear transforms. For the 3x3 window transforms, it is assumed that the variables X1 to X9 are stored at locations 0x00 to 0x08 and the output Y is stored at location 0x0F. For the 4x4 Haar and Walsh-Hadamard transforms, it is assumed that the inputs X1 to X4 are stored at locations 0x00 to 0x03 and the outputs Y1 to Y4 are stored at locations 0x08 to 0x0B. It is also assumed that the opcodes for LAC, ADD, SUB and SAC are 1-hot encoded and hence have a Hamming distance of two between any two of them. As can be seen from the results, the proposed node reordering technique results in significant power reduction.


Chapter 7

RESIDUE NUMBER SYSTEM BASED IMPLEMENTATION

Residue Number System (RNS) based implementations of DSP algorithms have been presented in the literature [29, 30, 92] as a technique for high speed realization. In a Residue Number System (RNS), an integer is represented as a set of residues with respect to a set of integers called the Moduli. Let (m1, m2, m3, ..., mn) be a set of relatively prime integers called the Moduli set. An integer X can be represented as X = (X1, X2, X3, ..., Xn) where

Xi = (X) modulo mi, for i = 1, 2, ..., n    (7.1)

We use the notation Xi to represent |X|mi, the residue of X w.r.t. mi. Given the moduli set, the dynamic range (M) is given by the LCM of all the moduli. If the elements are pair-wise relatively prime, the dynamic range is equal to the product of all the moduli [92]. The bit-precision of a given moduli set is

Bit-precision = ⌊log2(M)⌋    (7.2)

where M is the dynamic range of the given moduli set. So, the moduli set is determined based on the bit-precision needed for the computation. For example, for 19-bit precision the moduli set {5, 7, 9, 11, 13, 16} can be used [87].

Let X, Y and Z have the residue representations X = (X1, X2, X3, ..., Xn), Y = (Y1, Y2, Y3, ..., Yn) and Z = (Z1, Z2, Z3, ..., Zn) respectively, and let Z = (X Op Y) where Op is any of addition, multiplication or subtraction. Thus we have in RNS,

Zi = |Xi Op Yi|mi, for i = 1, 2, ..., n    (7.3)

Since Xi's and Yi's require less precision than X and Y, the computation of Zi's can be performed faster than the computation of Z. Moreover, since the


computations are independent of each other, they can be executed in parallel resulting in significant performance improvement. For example, consider the moduli set (5, 7, 11).

Let X = 47 = (|47|5, |47|7, |47|11) = (2, 5, 3)
Y = 31 = (|31|5, |31|7, |31|11) = (1, 3, 9)
and Z = X + Y = 47 + 31 = 78 = (|78|5, |78|7, |78|11) = (3, 1, 1)

Zi's can be computed, using equation 7.3, as Z = (|2 + 1|5, |5 + 3|7, |3 + 9|11) = (3, 1, 1)
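The worked example can be reproduced with a few lines of Python (an illustration added here, not the book's code):

    MODULI = (5, 7, 11)

    def to_rns(x, moduli=MODULI):
        """Forward conversion: integer -> tuple of residues."""
        return tuple(x % m for m in moduli)

    def rns_add(a, b, moduli=MODULI):
        """Component-wise modulo addition, per equation 7.3."""
        return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

    X, Y = to_rns(47), to_rns(31)
    print(X)              # (2, 5, 3)
    print(Y)              # (1, 3, 9)
    print(rns_add(X, Y))  # (3, 1, 1) == to_rns(78)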

While in RNS basic operations like multiplication, addition and subtraction can be performed with high speed, operations like division and magnitude comparison require several basic operations and hence are slower and more complicated to implement. Since most DSP kernels (such as FIR and IIR filtering, FFT, correlation, DCT etc.) do not need these operations to be performed, this limitation does not apply.

Figure 7.1 shows the RNS based implementation of an N-term weighted-sum computation. The implementation is generic and assumes K moduli (M1 to MK) selected so as to meet the desired precision requirements. The coefficient memory stores the pre-computed residues of the coefficients A[0] to A[N-1] for each of the moduli M1 to MK. The data memory stores the residues of the data values for each of the moduli M1 to MK. The weighted-sum computation is performed as a series of modulo MAC operations. During each cycle the Coefficient Address Generator and the Data Address Generator provide the appropriate coefficient-data pairs to the K modulo MAC units which perform the modulo computations in parallel.

Figure 7.2 shows an implementation of a modulo MAC unit. It consists of a modulo multiplier and a modulo adder connected via an accumulator so as to perform repeated modulo MAC and make the result available to the RNS-to-Binary converter every N cycles.

The modulo multiplication and modulo addition are typically implemented as look-up-tables (LUTs) for small moduli. Figure 7.2 also shows the look-up-table based implementation of a modulo 3 multiplier. The modulo MAC unit as shown in figure 7.2 requires two look-ups. The performance can be further improved by merging the two LUTs into a modulo Multiply-Accumulate (MAC) LUT as shown in figure 7.3.
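A behavioural sketch of the merged LUT of figure 7.3 is shown below (an illustration under the assumption that the table is indexed by the coefficient residue, the data residue and the fed-back accumulator residue; the hardware versions in the figures keep smaller multiply and add tables):

    def build_mac_lut(M):
        """Table mapping (a, x, c) -> |a*x + c| mod M."""
        return {(a, x, c): (a * x + c) % M
                for a in range(M) for x in range(M) for c in range(M)}

    lut3 = build_mac_lut(3)
    acc = 0
    for a, x in [(2, 2), (1, 2)]:   # two modulo-3 MAC steps
        acc = lut3[(a, x, acc)]
    print(acc)                       # |2*2 + 1*2| mod 3 = 0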

7.1. Optimizing RNS based Implementation of the Weighted-sum Computation

This section presents techniques to optimize the RNS based implementation of weighted-sum computations in the area-delay-power space.


Figure 7.1. RNS Based Implementation of FIR Filters

Figure 7.2. Modulo MAC using look-up-tables


Figure 7.3. Modulo MAC using a single LUT

7.1.1. Parallel Processing

The RNS based implementation shown in figure 7.1 can be modified so as to read, for each of the moduli, two data-coefficient pairs during every cycle. The RNS structure can be modified, as shown in figure 7.4, so as to have two modulo MAC units per modulus. With such parallel processing of degree two, an N-term weighted-sum can be computed in N/2 cycles. If the same throughput is to be maintained, the clock can be slowed down appropriately and the supply voltage lowered, resulting in power savings.
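A functional sketch of the degree-two transformation for one modulus M is shown below (not the book's code; it assumes an even number of taps for brevity). Two accumulators consume two coefficient-data pairs per "cycle" and their partial sums are merged by one final modulo addition, which preserves functionality because modulo addition is commutative and associative:

    def weighted_sum_mod(coeffs, data, M):
        acc0 = acc1 = 0
        for i in range(0, len(coeffs), 2):            # one cycle per iteration
            acc0 = (acc0 + coeffs[i] * data[i]) % M
            acc1 = (acc1 + coeffs[i + 1] * data[i + 1]) % M
        return (acc0 + acc1) % M                       # merge the two partials

    print(weighted_sum_mod([1, 2, 3, 4], [5, 6, 7, 8], 5))  # (1*5+2*6+3*7+4*8) % 5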

7.1.2. Residue Encoding for Low Power

The modulo multiplier of the modulo MAC unit is typically implemented as a look-up-table. The look-up-table for the modulo 3 multiplier assumes that the coefficient residues are binary coded (i.e. residue 0 as 00, residue 1 as 01 and residue 2 as 10). However, since the coefficient residues are fed to the modulo MAC units only, they need not necessarily be binary coded. Secondly, since there is a separate MAC unit for each modulus, the coefficient residue coding can be different across moduli. For example, residue 0 for modulus 5 may be coded as 010 and for modulus 7 may be coded as 100. This flexibility in the coefficient coding can be exploited to minimize the switching activity in the coefficient memory output busses that feed the modulo MAC units. Such a coding can thus reduce the power dissipation in the coefficient memory and the modulo MAC units.

Here is the formulation of the coefficient residue encoding problem. For a given modulus Mi, a fully connected graph is constructed with the nodes of the graph representing the residues. The edges (E[i,j]) of the graph are then assigned weights to indicate the number of times the corresponding residues ('i' and 'j') appear sequentially in the coefficient memory.


Figure 7.4. RNS Based Implementation of FIR Filters with Parallel Processing Transformation

The residues can then be coded so as to minimize the following cost function:

CF = Σi Σj HD[i,j] · W[i,j]    (7.4)

where HD[i,j] is the Hamming distance between the coding of residues 'i' and 'j', and W[i, j] is the edge weight. It can be noted that the cost function CF is similar to that used for FSM state encoding. The stochastic evolution based optimization strategy described in [47] can thus be used to perform the coefficient residue encoding.
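A small sketch for evaluating CF for a candidate encoding of one modulus is shown below (added for illustration; 'sequence' is the residue stream as stored in the coefficient memory and 'enc' is a hypothetical mapping from residue values to code words):

    from collections import Counter

    def encoding_cost(sequence, enc):
        W = Counter()
        for r0, r1 in zip(sequence, sequence[1:]):
            W[(r0, r1)] += 1                             # edge weight W[i,j]
        return sum(w * bin(enc[i] ^ enc[j]).count("1")   # HD[i,j] * W[i,j]
                   for (i, j), w in W.items())

    # Binary coding vs. an alternative coding of residues {0,1,2} of modulus 3:
    seq = [0, 2, 1, 2, 0, 2]
    print(encoding_cost(seq, {0: 0b00, 1: 0b01, 2: 0b10}))  # 7
    print(encoding_cost(seq, {0: 0b00, 1: 0b11, 2: 0b01}))  # 5 (lower switching)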

7.1.3. Coefficient Ordering

Since modulo addition is both commutative and associative, the order in which the coefficient-data pairs are fed to the modulo MAC units can be changed without impacting the functionality. The residues in the coefficient memory can be reordered so as to minimize the total Hamming distance between successive


values of the coefficient residues. Such a minimization reduces the switching activity in the coefficient memory output busses and hence reduces power dissipation in the coefficient memory and the modulo MAC units.

The problem of finding an optimum coefficient order can be mapped onto the problem of finding the lowest cost Hamiltonian Circuit in an edge-weighted graph or the traveling salesman problem. The coefficients map onto the cities and the Hamming distances between the residues of the coefficients map onto the distances between cities. The optimal coefficient order thus becomes the optimal tour of the cities, where each city is visited only once and the total distance traveled is minimized.

It can be noted that this technique is similar to the technique discussed in chapter 2 (section 2.3.2) except that the coefficient residues are used for calculating the Hamming distance. The coefficient order can either be kept consistent across moduli or optimized for each modulus. The latter results in higher savings but increases the complexity of address generation.
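A nearest-neighbour sketch of this ordering is shown below (an approximation added for illustration; an exact TSP solver would replace the greedy loop in a production flow, and binary coding of the residues is assumed):

    def residue_hd(c0, c1, moduli):
        """Hamming distance between the residue tuples of two coefficients."""
        return sum(bin((c0 % m) ^ (c1 % m)).count("1") for m in moduli)

    def order_coefficients(coeffs, moduli):
        remaining = list(coeffs)
        order = [remaining.pop(0)]           # arbitrary starting coefficient
        while remaining:
            nxt = min(remaining, key=lambda c: residue_hd(order[-1], c, moduli))
            order.append(nxt)
            remaining.remove(nxt)
        return order

    print(order_coefficients([12, 7, 33, 5, 20], (5, 7, 11)))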

7.1.4. Exploiting Redundancy

The modulo multiplication and the modulo addition for small moduli are typically performed using look-up-tables. One approach to implement a look-up-table is to use a memory such as a ROM or a RAM. However, since the number of entries in a look-up-table is typically less than the number of rows of the memory block, such an implementation is not area efficient. For example, a look-up-table for modulo 5 multiplication, which has 25 entries, needs a memory that has 64 rows (6 address lines). Thus a more area-efficient way to implement the look-up-tables is to realize them as PLAs.

The look-up-table area can be further reduced by exploiting the commutativity of modulo multiplication and modulo addition. For example, a look-up-table for modulo 5 multiplication has an entry corresponding to inputs 2 and 3 and also has an entry corresponding to inputs 3 and 2. This redundancy can be exploited to reduce the number of entries in the look-up-table and correspondingly reduce the PLA area. For example, for modulo 5 multiplication the number of entries in the look-up-table can be reduced from 25 to 15. In general, for a modulo Mi multiplication (or addition), the number of entries in the look-up-table can be reduced from Mi^2 to Mi(Mi + 1)/2. While the look-up-table area is reduced, there is an associated overhead of the logic that needs to optionally swap the inputs to the look-up-table. Figure 7.5 shows a scheme (SCHEME I) that swaps the inputs if the first input is greater than the second input. It can be noted that the reduction in the look-up-table area is thus achieved at the expense of multiplexers and a comparator.

The comparator overhead can be eliminated by using one of the bits (say the LSB) of one of the inputs as the control signal to the multiplexers for swapping the inputs. Figure 7.5 also shows an implementation of this scheme (SCHEME II).


Figure 7.5. Minimizing Look Up Table Area by Exploiting Redundancy

For example, consider the look-up-table for modulo 3 multiplication and an input swapping scheme in which the inputs to the look-up-table are swapped if the LSB of the first input is 1. It can be noted that with such a scheme, the input pairs 01 00 and 01 10 will never be fed to the LUT. Thus the number of entries of the look-up-table can be reduced from 9 to 7. While this reduction is less than the reduction from 9 to 6 entries using the scheme mentioned earlier, overall it may be more area efficient and faster (no comparator delay).
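The entry counts for the two schemes can be enumerated directly, as in the short Python sketch below (added for illustration, reproducing the 9 -> 6 and 9 -> 7 reductions discussed above for modulo 3):

    def lut_entries(M, scheme):
        entries = set()
        for a in range(M):
            for x in range(M):
                if scheme == "I" and a > x:        # swap if first input larger
                    a_, x_ = x, a
                elif scheme == "II" and (a & 1):   # swap if LSB of first is 1
                    a_, x_ = x, a
                else:
                    a_, x_ = a, x
                entries.add((a_, x_))
        return len(entries)

    print(lut_entries(3, "I"))    # 6 = 3*(3+1)/2
    print(lut_entries(3, "II"))   # 7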

7.1.5. Residue Encoding for minimizing LUT area

Coefficient residue encoding was presented in section 7.1.2 as a technique to minimize switching in the coefficient memory output busses. The coefficient residue coding alters the modulo multiplication truth-table and hence impacts the area of the PLA. The area of the PLA implementing the modulo addition can be impacted by coding the residue values of the input data. Such a coding also impacts the area of the PLA implementing the modulo multiplication. The binary-to-RNS and the RNS-to-Binary converters can be suitably modified to comprehend the residue coding of the data. Techniques based on symbolic input-output coding can be used to appropriately encode the coefficient residues and the data residues so as to minimize the PLAs implementing the modulo multiplication and the modulo addition across all moduli. If this technique is to be applied in conjunction with the technique shown in figure 7.5 (SCHEME I), the impact of coding on the compare logic area also needs to be comprehended. Such a coding algorithm has been presented in [43].

The area improvements obtained by exploiting the redundancy in the computation and encoding the residues are presented in tables 7.1, 7.2, and 7.3 for the modulo adder, modulo multiplier, and modulo MAC respectively. The results are presented for RNS based implementations based on the moduli set {5, 7, 9, 11, 13}. The metric chosen for PLA area is the product of the number of rows and the number of columns in the PLA. The number of columns in the PLA is equal to 2*(number of input bits) + number of output bits. The number of rows in the PLA is obtained from the residue encoded truth table minimized using Espresso [62].


Table 7.1. Area estimates for PLA based modulo adder implementation

Xform      M=5        M=7        M=9         M=11        M=13
Conv       270        540        1060        1580        2200
Xform1     217(19%)   363(32%)   736(30%)    1050(33%)   1364(38%)
Xform2     225(16%)   420(22%)   900(15%)    1320(17%)   1820(17%)
Xform3     240(11%)   420(22%)   1220        1340(15%)   1740(21%)
Xform1+3   172(36%)   368(31%)   673(36%)    1172(25%)   1443(34%)
Xform3+1   217(19%)   355(34%)   804(24%)    1066(32%)   1198(45%)
Xform3+2   195(27%)   315(42%)   980(7.5%)   1280(19%)   1480(33%)


It can be noted that the redundancy elimination technique can be used in conjunction with residue encoding. Tables 7.1, 7.2 and 7.3 show the area reduction for such combinations of transformations as well. Here are the cases for which the results are presented:

Conv: Area with conventional implementation.
Xform1: Area improvement with redundancy elimination - Scheme I.
Xform2: Area improvement with redundancy elimination - Scheme II.
Xform3: Area improvement with coefficient residue encoding.
Xform1+3: Area improvement with Xform1 followed by Xform3.
Xform3+1: Area improvement with Xform3 followed by Xform1.
Xform3+2: Area improvement with Xform3 followed by Xform2.

As can be seen from the results, the PLA area can be reduced by as much as 45% (corresponding to modulo-13 addition using Xform3+1). The results also show that no one combination of techniques results in the most area reduction across all moduli. While the (Xform3+1) combination gives maximum area reduction in most cases, it has an associated delay overhead. For modulo-7 addition, the (Xform3+2) combination gives minimum area and it comes with minimal delay overhead.

In the modulo multiplication implementation, the PLA area can be reduced by as much as 52% (corresponding to modulo-13 multiplication using Xform3+1). The results also show that for modulo-5 multiplication, Xform3 alone gives the most area efficient PLA.


Table 7.2. Area estimates for PLA based modulo multiplier implementation

Xform      M=5        M=7        M=9        M=11        M=13
Conv       225        285        960        1440        1860
Xform1     202(10%)   258(9%)    656(31%)   990(31%)    1324(28%)
Xform2     165(27%)   225(21%)   760(21%)   1200(16%)   1540(17%)
Xform3     135(40%)   330        540(43%)   1000(31%)   920(51%)
Xform1+3   170(24%)   325        644(32%)   1046(27%)   1448(22%)
Xform3+1   157(30%)   314        547(43%)   884(38%)    886(52%)
Xform3+2   165(27%)   330        720(25%)   980(32%)    1160(38%)

Table 7.3. Area estimates for PLA based modulo MAC implementation

Xform      M=5         M=7         M=9         M=11         M=13
Conv       1428        4284        10696       20636        31892
Xform1     892(38%)    2346(45%)   6240(42%)   11454(45%)   17670(45%)
Xform2     1197(16%)   3381(21%)   8876(17%)   16744(18%)   26320(17%)
Xform3     1113(22%)   2604(39%)   5124(52%)   17388(16%)   32088
Xform1+3   884(38%)    2133(50%)   3838(64%)   10748(48%)   18415(42%)
Xform3+1   766(47%)    1781(58%)   3648(66%)   10150(51%)   17264(46%)
Xform3+2   1029(27%)   2667(37%)   7224(32%)   14728(28%)   25228(20%)

In the case of MAC computation, the results show that the PLA area can be reduced by as much as 66%. In this case Xform3+1 gives the minimum area for all the moduli (i.e. 5, 7, 9, 11 and 13).

7.2. Optimizing RNS based Implementation of FIR Filters

While the techniques described in the earlier section can be applied to FIR filters, this section presents additional transformations specific to FIR filters.

7.2.1. Coefficient Scaling

For a scale factor K, the FIR filtering equation 2.10 translates to:

Y[n] = (K · Σ_{i=0}^{N-1} A[i] · X[n-i]) · (1/K) = (Σ_{i=0}^{N-1} (K · A[i]) · X[n-i]) · (1/K)    (7.5)

Thus the FIR filter output can be calculated using coefficients scaled by a factor K and the weighted-sum result scaled by 1/K.

Since the residue values of the scaled coefficients are different from the residue values of the original coefficients, scaling can be used as a transformation


to minimize switching in the coefficient memory output busses. For a given set of filter coefficients, an optimum scale factor within the specified range (e.g. ±3db) can be found such that the total Hamming distance between residue values of successive coefficients, across all moduli, is minimized. Such a transformation can thus reduce the power dissipation in the coefficient memory and the modulo MAC units.

The RNS based implementation shown in figure 7.1 can be modified to include 'modulo multiply by 1/K' LUTs in front of the RNS-to-Binary converter.

It can be noted that instead of using a common scale factor across all moduli, a different scale factor which is optimal for each of the moduli can be chosen so as to further reduce the power dissipation. In such a case, the 'modulo multiply by 1/K' LUTs can be replaced by LUTs that use different scale factors for different moduli.
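A search sketch for this transformation is shown below (added for illustration; integer candidate scale factors and the candidate range are simplifying assumptions, whereas the book's search is over a range such as ±3db):

    def total_residue_hd(coeffs, moduli):
        """Total Hamming distance between residues of successive coefficients."""
        return sum(bin((a % m) ^ (b % m)).count("1")
                   for a, b in zip(coeffs, coeffs[1:]) for m in moduli)

    def best_scale(coeffs, moduli, candidates=range(1, 16)):
        return min(candidates,
                   key=lambda k: total_residue_hd([k * c for c in coeffs], moduli))

    coeffs = [113, 97, 64, 97, 113]
    K = best_scale(coeffs, (5, 7, 11))
    print(K, total_residue_hd(coeffs, (5, 7, 11)),
          total_residue_hd([K * c for c in coeffs], (5, 7, 11)))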

7.2.2. Coefficient Optimization for Low Power

Coefficient optimization to minimize the total Hamming distance between successive coefficient values has been presented in section 2.4.5. The algorithm can be suitably adapted to minimize the total Hamming distance between residues of successive coefficient values across all moduli. Coefficients can thus be optimized so as to minimize the switching activity in the coefficient memory output busses and hence reduce power dissipation in the coefficient memory and the modulo MAC units.

7.2.3. RNS based Implementation of Transposed FIR Filter Structure

The implementation shown in figure 7.1 corresponds to the direct form FIR filter structure. The RNS based implementation based on the transposed form FIR filter has been presented in [87] and shown to be efficient for very high order filters and smaller moduli. Figure 7.6 shows the implementation for one modulus.

7.2.4. Coefficient Optimization for Area Reduction

As shown in figures 7.2 and 7.3, the filter coefficients and data form the inputs to the modulo multiplier and the modulo MAC units. The area of the PLA based modulo multiplier and the modulo MAC depends on the number of combinations of the coefficients and data for which the PLA needs to be realized. For an area efficient implementation, the coefficients can be optimized such that the number of unique residues across the moduli set is reduced, thereby reducing the entries in the PLA based modulo multiplier and the modulo MAC unit. Such a coefficient optimization can be performed by suitably adapting the algorithm presented in chapter 2 (section 2.4.5).


Figure 7.6. Modulo MAC structure for Transposed Form FIR Filter

It can be noted that the reduction in the number of unique residues across the moduli set results in an area efficient implementation of the transposed FIR filter structure shown in Figure 7.6 because of the reduction in the number of modulo multipliers needed in the structure.

The impact of this transformation can be appreciated from the results for 6 low pass filters shown below. These filters vary in terms of desired filter characteristics and consequently in the number of coefficients. These filters have been synthesized using the Parks-McClellan algorithm [73] for the minimum number of taps. The optimization has been performed using the first improvement and the steepest descent strategies. The two strategies differ in their optimization approach: in each iteration, the steepest descent approach selects the move which gives the maximum gain, while the first improvement approach selects the first move which gives a gain.
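The two strategies can be contrasted with the generic sketch below (added for illustration; 'moves' and 'gain' are placeholders for the coefficient perturbations and cost function of the book's algorithm in section 2.4.5, and the demo cost |x - 7| is purely synthetic):

    def steepest_descent(state, moves, gain):
        while True:
            best = max(moves(state), key=lambda m: gain(state, m), default=None)
            if best is None or gain(state, best) <= 0:
                return state
            state = best                      # take the highest-gain move

    def first_improvement(state, moves, gain):
        improved = True
        while improved:
            improved = False
            for m in moves(state):
                if gain(state, m) > 0:        # take the first improving move
                    state, improved = m, True
                    break
        return state

    def moves(x):                             # neighbouring perturbations
        return [x - 1, x + 1]

    def gain(x, m):                           # reduction in the cost |x - 7|
        return abs(x - 7) - abs(m - 7)

    print(steepest_descent(0, moves, gain), first_improvement(0, moves, gain))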

The coefficient values quantized to 16-bit 2's complement fixed point representation form the initial set of coefficients for optimization. The coefficient optimization algorithm has been applied across the moduli set {5, 7, 9, 11, 13, 17} for the PLA based implementation of the modulo multiplier and modulo MAC. The area improvements and the total number of unique residues for the different optimization strategies for the modulo multiplier and modulo MAC modules, relative to the conventional implementation, are shown in table 7.5. The metric chosen for PLA area is the product of the number of rows and columns in the PLA. The number


of columns in a PLA is equal to {2*(number of input bits) + number of output bits}. The number of rows is obtained by minimizing the resulting truth table using Espresso [62]. The values under the Mult and MAC columns in table 7.5 give the sum of areas of the modules across the whole moduli set.

Table 7.4. Distribution of Residues across the Moduli Set

Filter / Strategy      5   7   9   11  13  17  Total
LP1 - lp_16K_3K_4.5K_.2_42_24
  Conventional         5   6   6   9   7   7   40
  Steepest             4   4   4   6   5   5   28
  1st Impr.            4   6   3   6   5   3   27
LP2 - lp_12K_2K_3K_.12_45_28
  Conventional         4   7   9   8   10  9   47
  Steepest             4   5   5   5   6   3   28
  1st Impr.            5   5   4   5   3   5   27
LP3 - lp_10K_2K_3K_0.05_40_29
  Conventional         5   6   8   8   10  10  47
  Steepest             5   5   5   4   4   4   27
  1st Impr.            4   6   5   3   7   4   29
LP4 - lp_12K_2.2K_3.1K_.16_49_34
  Conventional         5   7   5   9   8   11  45
  Steepest             3   6   6   6   6   5   32
  1st Impr.            4   5   7   4   7   4   31
LP5 - lp_10K_1.8K_2.5K_.15_60_41
  Conventional         5   7   9   9   9   14  53
  Steepest             5   7   8   8   8   9   45
  1st Impr.            5   7   7   8   6   9   42
LP6 - lp_10K_1.8K_2.5K_.03_70_55
  Conventional         5   7   9   11  10  14  56
  Steepest             5   7   9   10  9   11  51
  1st Impr.            5   7   8   9   8   13  50

The results show that with coefficient optimization, area improvements of up to 52% in the modulo multiplier block and 54% in the modulo MAC block are achieved. Table 7.4 shows the distribution of residues across the moduli set for the different optimization strategies compared to the distribution in the conventional case. For example, for filter LP1 and modulo 17, the conventional implementation has the value 7 and the first improvement strategy has the value 3. This means that out of the possible 17 residues for modulo 17, the coefficients map to 7 residues in the case of the conventional implementation, while the coefficients map to only 3 residues after optimization by the first improvement strategy.

The names of the filters mentioned in the tables indicate the filter characteristics. For example, the FIR filter lp_16K_3K_4.5K_.2_42_24 is a low-pass filter


with the following characteristics: Sampling freq. = 16KHz, Passband freq. = 3KHz, Stopband freq. = 4.5KHz, Passband ripple = 0.2db, Stopband atten. = 42db and Number of filter coeffs = 24.

Table 7.5. Impact of Coefficient Optimization on the Area of Modulo Multiplier and Modulo MAC

         Conventional      Steepest          1st Impr.         Best % Impr.
Filter   Mult    MAC       Mult    MAC       Mult    MAC       Mult   MAC
LP1      5565    62060     4100    44780     3500    35600     37%    43%
LP2      6775    75195     3375    36215     3215    34880     52%    54%
LP3      6485    71260     3365    33515     3610    38060     48%    53%
LP4      6230    69910     3805    40295     3645    36005     42%    49%
LP5      7465    85840     5880    64965     5330    59890     29%    30%
LP6      7940    90470     7150    81065     7070    80115     11%    11%

In terms of the optimization strategies, the first improvement approach performs marginally better in most of the cases. It is to be noted that the runtime for the first improvement strategy is far less than that of the steepest descent strategy, because of the exhaustive search involved in selecting the best move in each iteration of the steepest descent approach.

It can be noted that this transformation can be applied in conjunction with residue encoding and redundancy elimination schemes to achieve further area reduction.

7.3. RNS as an Optimizing Transformation for High Precision Signal Processing

In applications, such as high quality audio, that need more than 16 bits of precision, the processing of signals on a 16-bit fixed point DSP requires double precision computation and is hence time consuming. Since in a Residue Number System (RNS) the high precision data is decomposed to lower precision for its processing, RNS can be used as an optimizing transformation [44] for improving the performance of such applications implemented on a single processor.

The selection of moduli is an important consideration in the RNS based implementation as it determines the computation time. The issues involved are:

1 The moduli set should be pairwise relatively prime to enable a high dynamic range.

Page 199: VLSI Synthesis of DSP Kernels: Algorithmic and Architectural Transformations

184 VLSI SYNTHESIS OF DSP KERNELS

2 The moduli set has to be selected such that residue computations are easy (e.g. 2^n, 2^n - 1) [92].

3 Since the implementation is on a single processor and the computations w.r.t. each modulus have to be done sequentially, it is desirable to have as few moduli as possible.

4 The moduli should have simple multiplicative inverses. This ensures conversion from the residue to the binary domain with fewer computations.

Based on the above mentioned considerations, for a programmable DSP based implementation a moduli set of the form (2^n, 2^n - 1, 2^(n-1) - 1, 2^(n-2) - 1) can be selected. It has been shown by Szabo and Tanaka [92] that the four moduli are pairwise relatively prime if and only if n is odd. So, for n = 15, this moduli set provides a high dynamic range of more than 56 bits and also offers the advantage of simplicity in determining the additive and multiplicative inverses. The additive inverse of a number w.r.t. a modulus of the form 2^k - 1 is just the 1's complement representation of the number.
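The convenience of 2^k - 1 moduli on a 16-bit DSP can be sketched in a few lines (an illustration, not the book's code): the residue of a double-precision value can be folded with shifts, masks and adds, and the additive inverse is the 1's complement.

    def mod_2k_minus_1(x, k):
        """|x| mod (2^k - 1) by end-around folding of k-bit chunks."""
        m = (1 << k) - 1
        while x > m:
            x = (x & m) + (x >> k)   # low k bits + high part (end-around carry)
        return 0 if x == m else x

    def additive_inverse(x, k):
        """1's complement of a k-bit residue is its additive inverse mod 2^k - 1."""
        return (~x) & ((1 << k) - 1)

    k = 15
    r = mod_2k_minus_1(0x5A5A5A, k)                         # a 24-bit operand
    print((r + additive_inverse(r, k)) % ((1 << k) - 1))    # 0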

The results of the RNS based implementation of an N-tap FIR filter with 24-bit data and coefficient precision on the TMS320C5x DSP are shown in tables 7.6 and 7.7.

Table 7.6. RNS based FIR filter with 24-bit precision on C5x

           Pgm. Mem.       Data Mem.       Exec. time
           size (words)    size (words)    (cycles)
BIN_FIR    86              4N+8            32N+20
RNS_FIR    N+269           7N+19           16N+261

Table 7.7. Number of Operations for RNS based FIR filter with 24-bit precision on C5x

           Loads & Stores   Adds     Mults
BIN_FIR    11N+4            10N      4N
RNS_FIR    4N+48            3N+60    4N+1

As can be seen from table 7.6, the program and data memory requirements of the RNS based implementation (RNS_FIR) are much higher than those of the implementation in the binary domain (BIN_FIR). In terms of execution time however, the RNS based implementation requires fewer cycles for filters with more than


15 taps. As can be seen from table 7.7, the RNS based implementation requires fewer loads and stores and performs fewer additions than the binary domain implementation for filters with more than 8 taps. Thus the RNS based implementation is also power efficient for higher filter orders.

It can be noted that the transformation based on the multirate architectures presented in chapter 2 (section 2.4.2) can be applied in conjunction with the RNS based implementation [44] to further improve performance and reduce power dissipation.


Chapter 8

A FRAMEWORK FOR ALGORITHMIC AND ARCHITECTURAL TRANSFORMATIONS

Chapters 2 to 7 have presented many algorithmic and architectural transformations targeted to the weighted-sum and the MCM computations realized using different implementation styles. This chapter proposes a framework that encapsulates these transformations and enables a systematic exploration of the area-delay-power solution space. The framework is based on a classification of the transformations into seven categories which exploit unique properties of the DSP algorithms and the implementation styles.

8.1. Classification of Algorithmic and Architectural Transformations

1 Implementing data movement by moving the pointer to the data

Most DSP algorithms are data intensive and perform sequential shifting of data between two output computations. The power dissipation for such a data shift can be minimized by moving the pointer to the data instead of moving the data itself. The power reduction is thus achieved at the expense of the decoder overhead and increased control complexity. The transformations of this type include (a pointer-movement sketch follows these items):

• Circular Buffers used in programmable DSPs to achieve data movement after every FIR filter computation (section 2.4.1).

• Shiftless Implementation of DA based FIR Filters described in section 4.3.2.
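A minimal sketch of the pointer-movement idea, using the circular buffer as the example (an assumed interface added for illustration, not the book's code): instead of shifting N data words after every FIR output, only a single pointer is advanced and addresses wrap around modulo the buffer size.

    class CircularBuffer:
        def __init__(self, size):
            self.buf = [0] * size
            self.head = 0                      # points at the newest sample

        def push(self, sample):                # one pointer update, no data moves
            self.head = (self.head - 1) % len(self.buf)
            self.buf[self.head] = sample

        def __getitem__(self, i):              # x[n-i] for i = 0..size-1
            return self.buf[(self.head + i) % len(self.buf)]

    cb = CircularBuffer(4)
    for s in [1, 2, 3, 4]:
        cb.push(s)
    print([cb[i] for i in range(4)])           # [4, 3, 2, 1]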

2 Data Flow Graph Restructuring Transformations

Most DSP algorithms are repetitive in nature. This property can be exploited to re-structure the DFGs so as to achieve the desired area-delay or area-power tradeoffs. The transformations of this type include:


• Parallel Processing [15, 76]

• Pipelining [15]

• Re-timing [60]: An application of this transform is the MCM based FIR filter structure (figure 3.5) derived from the direct form structure.

• Loop Unrolling [22,60]

3 Data Coding

The area, delay and power parameters of most implementation styles are impacted by the bit pattern of the processed data. Data coding techniques use different number representation schemes so as to appropriately alter the bit pattern and hence impact the area, delay and power parameters. Data coding has the associated overhead of encoding and decoding. The transformations of this type include:

• Gray Coding: used to minimize power dissipation in the busses with sequential data values (section 2.2.2.1). It is also used to minimize the decoder power in the shiftless DA based implementation of FIR filters (section 4.3.2).

• TO coding: also used to minimize power dissipation in the busses with sequential data values (section 2.2.2.2).

• Bus Invert Coding: used to minimize power dissipation in the busses with data values having no specific pattern (section 2.2.2.3).

• Nega-binary Coding: used to minimize power dissipation in the input data shift register of the DA based implementation of FIR filters (section 4.3.1).

• Uni-sign representation: used to improve area efficiency of multiplier-less implementation of weighted-sum/MCM computations (section 5.3.1.2).

• CSD representation: also used to improve area efficiency of multiplier-less implementation of weighted-sum/MCM computations (section 5.3.1.3).

• Coefficient Residue Encoding: for low power realization of RNS based weighted-sum computation (section 7.1.2). Also used to minimize look-up-table area in the RNS based implementation of weighted-sum computation (section 7.1.5).

4 Transformations that Exploit Redundancy in the Computation

The transformations of this type reduce the computational complexity of an algorithm by exploiting redundancy in the computation. This is achieved by either reducing the number of operations or replacing complex operations (such as multiplication) by simpler operations (such as addition and shift). These transformations typically destroy the regularity of the computation


and also increase the control complexity. The transformations of this type include:

• Multirate Architectures for FIR filter implementation [66].

• Block FIR Filters [77]

• Common Subexpression Precomputation: used to improve the area efficiency of a multiplier-less implementation of a weighted-sum computation (sections 5.1 and 5.2) and also to reduce the number of operations required to perform two dimensional multiplication-free linear transforms (section 6.1.4).

• Fast Fourier Transform (FFT) [73]

• Fast Discrete Cosine Transform (DCT) [82]

• Redundancy elimination schemes (section 7.1.4) to reduce the areas of PLAs implementing modulo-add, modulo-multiply and modulo-MAC look-up-tables in the RNS based implementation of weighted-sum computation.

5 DFG Transformations based on Mathematical Properties

These transformations exploit properties such as commutativity, associativity and distributivity of mathematical operators so as to suitably restructure the data flow graph. Here are some examples of the transformations of this type:

• Linear Phase FIR Filters which exploit the coefficient symmetry and use the distributivity property to reduce by half the number of multiplications (section 3.4.1.2).

• Coefficient Scaling discussed in sections 2.4.4 and 7.2.1.

• Selective Coefficient Negation discussed in section 2.3.1.

• Coefficient Ordering discussed in sections 2.3.2 and 7.1.3.

• Selective Bit Swapping of ALU Inputs discussed in section 2.3.3.

• Computing output Y[n] in terms of Y[n-1] of an FIR Filter discussed in section 5.3.2.

• DAG Node Reordering for Low Power discussed in section 6.1.3.

• DAG Transformation - Tree to Chain Conversion discussed in section 6.2.5.1.

• DAG Transformation - Serializing a Butterfly discussed in section 6.2.5.2.

• DAG Transformation - Fanout Reduction discussed in section 6.2.5.3.

• DAG Transformation - Merging discussed in section 6.2.5.4.


6 Exploiting Relationship between the Real Value Domain and the Binary Domain

While the frequency response of an FIR filter depends on the real values of the coefficients, the area-delay-power parameters are affected by the properties of the binary representation of the coefficients. A small change in the value of a coefficient has minimal impact on the filter characteristics but can impact the binary representation significantly. For example, numbers 31 and 32 have a small difference in the real value domain, but in the binary domain 31 has five 1s while 32 has just one 1. This relationship can be exploited to suitably alter the filter coefficients while still meeting the desired filter characteristics in terms of passband ripple and stopband attenuation. The transformations of this type include:

• Coefficient Optimization to reduce the power dissipation in the coefficient memory data bus for the FIR filters implemented on a programmable DSP (section 2.4.5) and also for RNS based implementation (section 7.2.2).

• Coefficient Optimization to minimize the number of non-zero bits in the CSD representation of filter coefficients. This helps to reduce the number of additions in the multiplier-less implementation of FIR filters [83, 105].

• Coefficient Optimization to reduce the area of the look-up-tables in RNS based implementation of FIR filters (section 7.2.4).

7 Transformations that Exploit Available Degrees of Freedom

For a given implementation style, multiple alternatives may exist to realize some of its functionality. As an example, consider the loading of coefficient and data values in the program and data memories respectively, for a programmable DSP based implementation of an FIR filter. Typically multiple options exist for the start address of these data blocks. If parameters such as memory size and number of cycles are not affected by the start address, any of the available options can be arbitrarily chosen for loading the data. However, if a new design constraint such as power dissipation is to be comprehended and is affected by the start address, this degree of freedom presents an opportunity to optimize the design constraint. The transformations that exploit such degrees of freedom include:

• Allocation of Coefficient and Data Memory to reduce power dissipation in the address busses for the programmable DSP based implementation of FIR filters (section 2.2.1).

• Bus Bit Reordering to reduce cross-coupling related power dissipation in the busses (section 2.2.5).


• Coefficient Partitioning to improve area efficiency of a DA based implementation of FIR filters, that uses two LUTs (section 4.2).

• Register Assignment for low power code generation of multiplication-free linear transforms (section 6.1.5).

8.2. A Snapshot of the Framework

This section proposes a framework for systematically applying the transformations of each type. The framework captures, for each of the transforms, the parameters it optimizes and the desired characteristics of the algorithm and of the implementation style for the transform to be effective. For example, consider the gray coding transform. This transform can be used to minimize power dissipation in a bus. The transform is most effective if the algorithm results in sequential data values on the bus and if the bus represents a large capacitive load in the target implementation style. The framework also captures the overheads associated with the transformation. For example, the gray coding transform results in area and delay overhead due to the encoder and the decoder logic at the source and the destination of the bus respectively.

Figures 8.1 and 8.2 show a snapshot of the framework that can be used to achieve the desired area-power tradeoff for a DSP algorithm.

It can be noted that figures 8.1 and 8.2 depict a representative sample of the complete framework. The complete framework not only comprehends all existing transformations, but also suggests a methodology for identifying new transformations that could be specific to a new algorithm or a new implementation style. This is achieved by systematically exploring the different classes of transformations.

The framework can be further enhanced by incorporating estimators that can quantify the gains and overheads associated with each of the transformations.


[Flowchart: starting from the baseline mapping of the algorithm on the target implementation style and the main sources of power dissipation, a series of yes/no checks selects pointer-based data movement, parallel processing, pipelining, loop unrolling and common subexpression precomputation.]

Figure 8.1. A Framework for Area-Power Tradeoff


[Flowchart, continued: further checks select Gray or TO coding for busses with sequential accesses, bus-invert coding for busses with data-dependent values, bus bit-reordering during routing, and selective bit-swapping for ALU inputs, leading to an implementation with the desired area-power tradeoff.]

Figure 8.2. A Framework for Area-Power Tradeoff - continued


Chapter 9

SUMMARY

As the seemingly simple and straightforward problem of weighted-sum computation was approached, its multiple facets started unfolding. Several optimization opportunities were identified in the area-delay-power space targeted to technologies ranging from programmable processors to hardwired implementations with or without a hardware multiplier. The algorithmic and architectural transformations presented in this book cover this entire solution space.

Programmable DSP based Implementation

For programmable DSP based implementations, the book has focussed on the issue of low power realization. Generic extensions to DSP architectures for low power were presented, followed by techniques specific to weighted-sum computation and finally transformations which are specific to FIR filters. These were captured in a comprehensive framework for low power realization of FIR filters on a programmable DSP.

Implementation using Hardware Multiplier(s) and Adder(s)

The effectiveness of many high level synthesis transformations such as parallel processing, pipelining, loop unrolling and re-timing was evaluated in the context of weighted-sum computation. This led to an important observation that the actual gains from these transformations are heavily dependent on the hardware resources available for implementation. The effect of these transformations - specifically parallel processing - on the peak power dissipation was also analyzed. While the average power dissipation continues to reduce with an increasing degree of parallelism, the peak power dissipation can, in fact, start increasing beyond a point. The relationship between the supply voltage VDD, the threshold voltage VT and the degree of parallelism N can be analyzed such that for a given value of the VDD/VT ratio, a limit on N can be derived


beyond which the peak power dissipation starts increasing. The DFG transformations mentioned above do not impact the computational complexity; they achieve power reduction at the expense of increased area. In the context of FIR filters, multirate architectures were presented as structures which reduce computational complexity and thus achieve power reduction with minimal datapath area overhead.

Distributed Arithmetic Based Implementation

For Distributed Arithmetic based implementations, the transformations presented in this book focused on three important aspects. Firstly, various DA based architectures were analyzed as techniques to arrive at multiple solutions in the area-delay space for weighted-sum computation. Secondly, the flexibility in coefficient partitioning was exploited to improve the area efficiency of DA structures which use two look-up-tables. Finally, a data encoding scheme based on the nega-binary number representation was presented to reduce power dissipation in DA based FIR filters.

Multiplier-less Implementation

The problem of minimizing the number of additions in the multiplier-less implementations of 1-D and 2-D linear transforms was presented. A common subexpression precomputation technique that can be applied to both weighted-sum computation and the multiple constant multiplication (MCM) computation was presented. In the context of FIR filters, coefficient transformations were identified which, in conjunction with the common subexpression precomputation technique, realize area-efficient multiplier-less FIR filters. Since the resultant data flow graphs have operators and variables with varying bit precision, an approach to precision-sensitive high level synthesis was discussed.

Code Generation of Multiplication-free Linear Transforms

Many image processing algorithms perform weighted-sum computation where the weight values are restricted to {0, 1, -1}. The common subexpression precomputation technique can be applied to minimize the number of additions required to implement such two dimensional multiplication-free linear transforms. Depending on the target architecture (register-rich, or single register accumulator based), an appropriate code generation strategy and DAG optimization transformations can be used to improve performance and reduce power dissipation of the implementation of these transforms.

Residue Number System Based Implementation

In RNS based implementation, the weighted-sum computation problem can be solved as multiple weighted-sum computations of smaller precision executed in parallel. For smaller moduli, the weighted-sum is performed using


look-up tables (LUTs), thus enabling a multiplier-less implementation. Techniques such as residue encoding and LUT redundancy elimination enable better area efficiency and reduce power dissipation of such implementations. These parameters can be further optimized in the context of FIR filters using additional transformations such as coefficient optimization.

A Framework for Algorithmic and Architectural Transformations

The various transformations discussed in the book were classified into seven categories. The classification is based on the properties that the transformations exploit to achieve optimization in the area-delay-power space.

A framework was proposed for a systematic application of these transformations. For each of the transformations, the framework captures the parameters it optimizes, and the desired characteristics of the algorithm and of the implementation style for the transformation to be effective.

There are two more types of transformations that can be incorporated in the framework so as to make it more comprehensive. The first type includes techniques that perform area-delay-power tradeoffs dynamically. Techniques for dynamic power management and system-level adaptive voltage scaling [23] are examples of such transformations. The second type of transformations use 'quality' as a variable to trade area, delay and power with one another. One example of such a transformation is the setting of the word length for storing the data and coefficient values so as to achieve the desired area vs. quality trade-off for FIR filtering.


References

[1] J. W. Adams and A. N. Willson, "Some Efficient Digital Prefilter Structures", IEEE Transactions on Circuits and Systems, May 1984, pp. 260-265

[2] M. Agarwala and P. T. Balsara, "An Architecture for a DSP Field Programmable Gate Array", IEEE Transactions on VLSI Systems, March 1995, pp. 136-141

[3] Vikas Agrawal, Anand Pande, Mahesh Mehendale, "High Level Synthesis of Multi-precision Data Flow Graphs", 14th International Conference on VLSI Design, January 2001, pp. 411-416

[4] A. Aho, R. Sethi and J. Ullman, Compilers: Principles, Techniques and Tools, Addison-Wesley, 1986

[5] G. Araujo, S. Malik, M. T-C. Lee, "Using Register-Transfer Paths in Code Generation for Heterogeneous Memory-Register Architectures", ACM/IEEE 33rd Design Automation Conference, 1996, pp. 591-596

[6] ARM7TDMI Data Sheet, Advanced RISC Machines Ltd. (ARM), 1995

[7] R. S. Bajwa, M. Hiraki, H. Kojima, D. J. Gorny, K. Nitta, A. Sridhar, K. Seki and K. Sasaki, "Instruction Buffering to Reduce Power in Processors for Signal Processing", IEEE Transactions on VLSI Systems, December 1997, pp. 417-424

[8] Luca Benini, G. D. Micheli, E. Macii, D. Sciuto, C. Silvano, "Asymptotic Zero-Transition Activity Encoding for Address Busses in Low-Power Microprocessor-Based Systems", GLS-VLSI'97, 7th Great Lakes Symposium on VLSI, 1997, pp. 77-82

[9] N. Binh, M. Imai, A. Shiomi and N. Hikichi, "A Hardware/Software Partitioning Algorithm for Designing Pipelined ASIPs with Least Gate Counts", 33rd ACM/IEEE Design Automation Conference, DAC-1996, pp. 527-532

[10] Ivo Bolsens, Hugo J. De Man, Bill Lin, Karl Van Rompaey, Steven Vercauteren and Diederik Verkest, "Hardware/Software Co-Design of Digital Telecommunication Systems", Proceedings of the IEEE, March 1997, pp. 391-418

[11] Jon Bradley, Calculation of TMS320C5x Power Dissipation, Application Report, Texas Instruments, 1993


[12] R. K. Brayton, C. T. McMullen, G. D. Hachtel and A. Sangiovanni-Vincentelli, Logic Minimization Algorithms for VLSI Synthesis, New York: Kluwer Academic, 1984

[13] C. S. Burrus, "Digital Filter Structures Described by Distributed Arithmetic", IEEE Transactions on Circuits and Systems, December 1977, pp. 674-680

[14] A. Chandrakasan, M. Potkonjak, J. Rabaey and R. Brodersen, "HYPER-LP: A System for Power Minimization using Architectural Transformations", ICCAD-1992, pp. 300-303

[15] A. Chandrakasan and R. Brodersen, "Minimizing Power Consumption in Digital CMOS Circuits", Proceedings of the IEEE, April 1995, pp. 498-523

[16] Jui-Ming Chang and Massoud Pedram, "Energy Minimization Using Multiple Supply Voltages", IEEE Transactions on VLSI Systems, December 1997, pp. 436-443

[17] Pallab Chatterjee and Graydon Larrabee, "Gigabit Age Microelectronics and Their Manufacture", IEEE Transactions on VLSI Systems, March 1993, pp. 7-21

[18] N. Deo, Graph Theory with Applications to Engineering and Computer Science, Prentice Hall India, 1989

[19] S. Devadas, H. T. Ma, A. R. Newton and A. Sangiovanni-Vincentelli, "MUSTANG: State Assignments of Finite State Machines Targeting Multi-Level Logic Implementations", IEEE Transactions on Computer-Aided Design, vol. CAD-7, December 1988, pp. 1290-1300

[20] W.E. Dougherty, D.J. Pursley, D.E. Thomas, "Instruction subsetting: Trading power for programmability", IEEE Computer Society Workshop on VLSI'98, 1998, pp. 42-47

[21] R. Ernst, J. Henkel and T. Benner, "Hardware-Software Cosynthesis for Microcontrollers", IEEE Design and Test of Computers, Vol. 12, 1993, pp. 64-75

[22] Daniel Gajski, Nikil Dutt, A. C-H Wu, S. Y-L Lin, High-Level Synthesis: Introduction to Chip and System Design, Kluwer Academic Publishers, 1992

[23] V. Gutnik and A. P. Chandrakasan, "Embedded Power Supply for Low-Power DSP", IEEE Transactions on VLSI Systems, December 1997, pp. 425-435

[24] N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud, "The Synchronous dataflow programming language Lustre", Proceedings of the IEEE, September 1991, pp. 1305-1320

[25] P. N. Hilfinger, J. Rabaey, D. Genin, C. Scheers and H. De Man, "DSP Specification using the SILAGE Language", International Conference on Acoustics, Speech and Signal Processing, ICASSP-1990, pp. 1057-1060

[26] K. Illgner, H-G. Gruber, P. Gelabert, J. Liang, Y. Yoo, W. Rabadi and Raj Talluri, "Programmable DSP Platform for Digital Still Cameras", International Conference on Acoustics, Speech and Signal Processing, ICASSP-1999, pp. 2235-2238

[27] Anil K. Jain, Fundamentals of Digital Image Processing, Prentice Hall Inc., 1989

[28] R. Jain et al., "Efficient CAD Tools for Coefficient Optimization of Arbitrary Integrated Digital Filters", IEEE International Conference on Acoustics, Speech and Signal Processing, 1984

[29] W. K. Jenkins and B. Leon, "The Use of Residue Number System in the Design of Finite Impulse Response Filters", IEEE Transactions on Circuits and Systems, April 1977, pp. 191-201

[30] W. K. Jenkins, "A Highly Efficient Residue-combinatorial Architecture for Digital Filters", Proceedings of the IEEE, June 1978, pp. 700-702

[31] I. Karkowski and R.H.J.M. Otten, "An Automatic Hardware-Software Partitioner Based on the Possibilistic Programming", European Design and Test Conference, 1996, pp. 467-472

[32] D. Kodek and K. Steiglitz, "Comparison of Optimal and Local Search Methods for Designing Finite Wordlength FIR Digital Filters", IEEE Transactions on Circuits and Systems, January 1981, pp. 28-32

[33] E.L. Lawler, J.K. Lenstra, A.H.G. Rinnooy Kan, and D.B. Shmoys (eds.), The Travelling Salesman Problem, John Wiley & Sons Ltd, 1985

[34] Edward A. Lee, "Programmable DSP Architectures: Part I", IEEE ASSP Magazine, October 1988, pp. 4-19

[35] Edward A. Lee, "Programmable DSP Architectures: Part II", IEEE ASSP Magazine, January 1989, pp. 4-14

[36] M. T.-C. Lee, V. Tiwari, S. Malik and M. Fujita, "Power Analysis and Minimization Techniques for Embedded DSP Software", IEEE Transactions on VLSI Systems, March 1997, pp. 123-135

[37] H. Lekatsas, J. Henkel, W. Wolf, "Code Compression for Low Power Embedded System Design", ACM/IEEE Design Automation Conference, DAC-2000, pp. 294-299

[38] Stan Liao, Srinivas Devadas, Kurt Keutzer, Steve Tjiang, "Instruction Selection Using Binate Covering for Code Size Optimization", IEEE International Conference on CAD, ICCAD-1995, pp. 393-399

[39] Stan Liao, Srinivas Devadas, Kurt Keutzer, Steve Tjiang, Albert Wang, "Storage Assignment to Decrease Code Size", ACM Conference on Programming Language Design and Implementation, 1995

[40] Kun-Shan Lin (ed.), Digital Signal Processing Applications with the TMS320 Family - Theory, Algorithms and Implementations - Vol. 1, Texas Instruments, 1989

[41] E. Lueder, "Generation of Equivalent Block Parallel Digital Filters and Algorithms by a Linear Transformation", IEEE International Symposium on Circuits and Systems, 1993, pp. 495-498

[42] Gin-Kou Ma and Fred J. Taylor, "Multiplier Policies For Digital Signal Processing", IEEE ASSP Magazine, January 1990, pp. 6-19

[43] M. N. Mahesh, Satrajit Gupta and Mahesh Mehendale, "Improving Area Efficiency of Residue Number System based Implementation of DSP Algorithms", 12th International Conference on VLSI Design, January 1999, pp. 340-345

[44] M. N. Mahesh, Mahesh Mehendale, "Improving Performance of High Precision Signal Processing Algorithms on Programmable DSPs", ISCAS-99, 1999, pp. 488-491

[45] M. N. Mahesh, Mahesh Mehendale, "Low Power Realization of Residue Number System based Implementation of FIR Filters", 13th International Conference on VLSI Design, January 2000, pp. 30-33

[46] H. De Man, J. Rabaey, P. Six and L. Claesen, "Cathedral-II: A Silicon Compiler for Digital Signal Processing", IEEE Design and Test of Computers Magazine, December 1986, pp. 13-25

[47] Mahesh Mehendale, B. Mitra, "An Integrated Approach to State Assignment and Sequential Element Selection for FSM Synthesis", 7th International Conference on VLSI Design, 1994, pp. 369-372

[48] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Techniques for Low Power Realization of FIR Filters", Asia and South Pacific Design Automation Conference, ASP-DAC'95, pp. 447-450

[49] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Coefficient Optimization for Low Power Realization of FIR Filters", IEEE Workshop on VLSI Signal Processing, 1995, pp. 352-361

[50] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Synthesis of Multiplier-less FIR Filters with Minimum Number of Additions", IEEE International Conference on Computer Aided Design, ICCAD'95, pp. 668-671

[51] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Low Power Realization of FIR Filters using Multirate Architectures", International Conference on VLSI Design, VLSI Design'96, pp. 370-375

[52] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Optimized Code Generation of Multiplication-free Linear Transforms", ACM/IEEE Design Automation Conference, DAC'96, pp. 41-46

[53] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Area-Delay Tradeoff in Distributed Arithmetic based Implementation of FIR Filters", International Conference on VLSI Design, VLSI Design'97, pp. 124-129

[54] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Extensions to Programmable DSP Architectures for Reduced Power Dissipation", International Conference on VLSI Design, VLSI Design'98, pp. 37-42

[55] Mahesh Mehendale, Somdipta Basu Roy, S. D. Sherlekar, G. Venkatesh, "Coefficient Transformations for Area-Efficient Implementation of Multiplier-less FIR Filters", International Conference on VLSI Design, VLSI Design'98, pp. 110-115

[56] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Algorithmic and Architectural Transformations for Low Power Realization of FIR Filters", International Conference on VLSI Design, VLSI Design'98, pp. 12-17

[57] Mahesh Mehendale, Amit Sinha, S. D. Sherlekar, "Low Power Realization of FIR Filters Implemented Using Distributed Arithmetic", Asia and South Pacific Design Automation Conference, ASP-DAC'98, pp. 151-156

[58] Mahesh Mehendale, S. D. Sherlekar, G. Venkatesh, "Low Power Realization of FIR Filters on Programmable DSPs", IEEE Transactions on VLSI Systems, Special Issue on Low Power Design, December 1998, pp. 546-553

[59] Mahesh Mehendale, S. D. Sherlekar, "Low Power Code Generation of Multiplication-free Linear Transforms", 12th International Conference on VLSI Design, January 1999, pp. 42-47

[60] R. Mehra, D. B. Lidsky, A. Abnous, P. E. Landman and J. M. Rabaey, "Algorithms and Architectural Level Methodologies for Low Power", Chapter 11 in Low Power Design Methodologies, Jan Rabaey and Massoud Pedram, Eds., Kluwer Academic Publishers, September 1995

[61] Huzefa Mehta, R. M. Owens, M. J. Irwin, "Some Issues in Gray Code Addressing", GLS-VLSI'96, 6th Great Lakes Symposium on VLSI, 1996, pp. 178-181

[62] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994

[63] Biswadip Mitra, Shantanu Jha and P. Pal Chaudhuri, "A Simulated Annealing Based State Assignment Approach for Control Synthesis", 4th CSI/IEEE International Symposium on VLSI Design, 1991, pp. 45-50

[64] Biswadip Mitra, P. R. Panda and P. Pal Chaudhuri, "Estimating the Complexity of Synthesized Designs from FSM Specifications", 5th International Conference on VLSI Design, 1992, pp. 175-180

[65] J. Monteiro, S. Devadas, A. Ghosh, "Sequential logic optimization for low power using input-disabling precomputation architectures", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, No. 3, March 1998, pp. 279-284

[66] Z.J. Mou and P. Duhamel, "Short-Length FIR Filters and Their Use in Fast Nonrecursive Filtering", IEEE Transactions on Signal Processing, June 1991, pp. 1322-1332

[67] L. Nachtergaele, F. Catthoor, B. Kapoor, S. Janssens, "Low Power Storage Exploration for H.263 Video Decoder", IEEE International Workshop on VLSI Signal Processing, 1996, pp. 115-124

[68] Farid Najm, "Transition Density: A New Measure of Activity in Digital Circuits", IEEE Transactions on CAD, February 1993, pp. 310-323

[69] The National Technology Roadmap for Semiconductors, SIA - Semiconductor Industry Association, 1994

[70] B. New, "A Distributed Arithmetic Approach to Designing Scalable DSP Chips", Electronic Design News, August 17, 1995

[71] Ralf Niemann, Peter Marwedel, "Hardware/Software Partitioning using Integer Programming", European Design and Test Conference, 1996, pp. 473-479

[72] W. J. Oh and Y. H. Lee, "Cascade/Parallel Form FIR Filters with Powers-of-Two Coefficients", IEEE International Symposium on Circuits and Systems, ISCAS-1994, Vol. II, pp. 545-548

[73] A.V. Oppenheim and R.W. Schafer, Discrete-Time Signal Processing, Prentice Hall, 1989

[74] Preeti Panda and Nikil Dutt, "Reducing Address Bus Transitions for Low Power Memory Mapping", European Design and Test Conference, 1996, pp. 63-68

[75] Anand Pande, Sunil Kashide, Hardware/Software Codesign of DSP Algorithms, ME Thesis, Centre for Electronics Design and Technology, Indian Institute of Science, Bangalore, India, January 2000

[76] K.K. Parhi, "Algorithms and Architectures for High-Speed and Low-Power Digital Signal Processing", 4th International Conference on Advances in Communications and Control, 1993, pp. 259-270

[77] D.N. Pearson and K.K. Parhi, "Low-Power FIR Digital Filter Architectures", IEEE International Symposium on Circuits and Systems, Vol. I, 1995, pp. 231-234

[78] M. Potkonjak, Mani Srivastava and Anantha Chandrakasan, "Multiple Constant Multiplications: Efficient and Versatile Framework and Algorithms for Exploring Common Subexpression Elimination", IEEE Transactions on Computer Aided Design, February 1996, pp. 151-165

[79] Ptolemy Project Homepage, Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, URL: http://ptolemy.eecs.berkeley.edu

[80] Wu Qing, M. Pedram, Wu Xunwei, "Clock-gating and its application to low power design of sequential circuits", IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, Vol. 47, No. 3, March 2000, pp. 415-420

[81] Anand Raghunathan and Niraj Jha, "Behavioral Synthesis for Low Power", Proceedings of International Conference on Computer Design, ICCD-1994, pp. 318-322

[82] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages and Applications, Academic Press, 1990

[83] H. Samueli, "An Improved Search Algorithm for the Design of Multiplierless FIR Filters with Powers-of-Two Coefficients", IEEE Transactions on Circuits and Systems, July 1989, pp. 1044-1047

[84] N. Sankarayya and K. Roy, "Algorithms for Low Power FIR Filter Realization Using Differential Coefficients", International Conference on VLSI Design, 1997, pp. 174-178

[85] H. Schroder, "High Word-Rate Digital Filters with Programmable Table Look-Up", IEEE Transactions on Circuits and Systems, May 1977, pp. 277-279

[86] Amit Sinha, Mahesh Mehendale, "Improving Area Efficiency of FIR Filters Implemented Using Distributed Arithmetic", International Conference on VLSI Design, VLSI Design'98, pp. 104-109

[87] M. A. Soderstrand and K. Al-Marayati, "VLSI Implementation of Very-High-Order FIR Filters", IEEE International Symposium on Circuits and Systems, 1995, pp. 1436-1439

[88] M. R. Stan and W. P. Burleson, "Bus Invert Coding for Low Power I/O", IEEE Transactions on VLSI Systems, March 1995, pp. 49-58

[89] Ching-Long Su, Chi-Ying Tsui and Alvin M. Despain, "Saving Power in the Control Path of Embedded Processors", IEEE Design and Test of Computers, Winter 1994, pp. 24-30

[90] Ashok Sudarsanam, Sharad Malik, "Memory Bank and Register Allocation in Software Synthesis for ASIPs", IEEE International Conference on CAD, ICCAD-1995, pp. 388-392

[91] Earl Swartzlander Jr., VLSI Signal Processing Systems, Kluwer Academic Publishers, 1985

[92] N. S. Szabo and R. I. Tanaka, Residue Arithmetic and its Applications to Computer Technology, McGraw-Hill, 1967

[93] N. Tan, S. Eriksson and L. Wanhammar, "A Power-Saving Technique for Bit-Serial DSP ASICs", ISCAS 94, Vol. IV, pp. 51-54

[94] V. Tiwari, S. Malik, P. Ashar, "Guarded evaluation: pushing power management to logic synthesis/design", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 17, No. 10, October 1998, pp. 1051-1060

[95] TMS320C2x User's Guide, Texas Instruments, 1993

[96] TMS320C5x User's Guide, Texas Instruments, 1993

[97] TMS320C54x User's Guide, Texas Instruments, 1995

[98] TMS320C6x User's Guide, Texas Instruments, 1998

[99] TMS470R1x Code Generation Tools Guide, Texas Instruments, 1997

[100] TSC4000ULV 0.35µm CMOS Standard Cell, Macro Library Summary, Application Specific Integrated Circuits, Texas Instruments, 1996

[101] Y. Tsividis and P. Antognetti, editors, Design of MOS VLSI Circuits for Telecommunications, Prentice-Hall, 1985

[102] C.-Y. Wang and K. Roy, "Control Unit Synthesis Targeting Low-Power Processors", IEEE International Conference on Computer Design, October 1995, pp. 454-459

[103] Ching-Yi Wang and K. K. Parhi, "High-Level DSP Synthesis Using Concurrent Transformations, Scheduling and Allocation", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, March 1995, pp. 274-295

[104] Stanley White, "Applications of Distributed Arithmetic to Digital Signal Processing: A Tutorial Review", IEEE ASSP Magazine, July 1989, pp. 4-19

[105] Q. Zhao and Y. Tadokoro, "A Simple Design of FIR Filters with Powers-of-Two Coefficients", IEEE Transactions on Circuits and Systems, May 1988, pp. 566-570

[106] S. Zohar, "Negative Radix Conversion", IEEE Transactions on Computers, Vol. C-19, March 1970, pp. 222-226

[107] S. Zohar, "A VLSI Implementation of a Correlator/Digital Filter Based on Distributed Arithmetic", IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. 37, No. 1, January 1989, pp. 156-160

Topic Index

Accumulator based Architecture, 153
Adder Input Bit Swapping, 31
Array Multiplier, 14
Binary to Nega-Binary Conversion, 101
Bitwise Commutativity of ADD Operation, 31
Block FIR Filters, 56
Bus Bit Reordering, 24
Bus Coding, 17
Bus Invert Coding, 20
CSD Representation, 129
Characteristics of DSP Algorithms, 11
Circular Buffer, 36
Classification of Transformations, 187
Code Generation of 1-D Transform, 144
Coefficient Optimization, 43, 137, 180
Coefficient Ordering, 28, 175
Coefficient Partitioning, 86
Coefficient Scaling, 42, 179
Color Space Conversion, 2, 113
Common Subexpression Elimination, 116, 122, 147
Consecutive-Variables Graph, 150
DAG Optimizing Transformations, 159
DAG Transform - Fanout Reduction, 160
DAG Transform - Merging, 161
DAG Transform - Serializing a Butterfly, 159
DAG Transform - Tree to Chain Conversion, 159
DCT, 2, 113, 172, 189
DFG Transformations, 56
DSC Image Pipeline, 1
DSP Architecture, 11
Decoded Instruction Buffer, 21
Digital Still Camera, 1
Distributed Arithmetic, 76
Dynamic/Switching Power in CMOS, 12
Energy vs Peak Power Tradeoff, 61
FFT, 44, 172, 189
FIR Filter, 35
Gaussian Data Distribution, 104
Generic Techniques for Low Power, 26
Graph Coloring Problem, 151
Gray Coded Addressing, 17, 108
Haar Transform, 147, 165
Hardware-Software Partitioning, 6
High Level Synthesis of Multiprecision DFGs, 138
High Precision Signal Processing, 183
Instruction Buffering, 21
Instruction Scheduling, 148
LUT Redundancy Elimination, 176
Linear Phase FIR Filters, 65
Loop Unrolling, 59

Low Power Code Generation, 148, 168
MAC (Multiply-Accumulate) Instruction, 12
Memory Architectures for Low Power, 22
Memory Partitioning for Low Power, 23
Memory Prefetch Buffer, 23
Modulo MAC, 173
Multiple Constant Multiplication (MCM), 113, 120
Multiplication-free Linear Transform, 141
Multirate Architectures, 37, 63, 81
Nega-binary Coding, 95
Optimization using 0-1 Programming, 50
Parallel Processing, 59, 174
Pixel Window Transform, 144
Power Analysis of Multirate Architectures, 68
Power Dissipation due to Cross Coupling, 13
Power Dissipation in CMOS, 12
Power Dissipation in a Bus, 13
Power Dissipation in a Multiplier, 13
Pre-Filter Structures, 138
Precision Sensitive Binding, 139
Precision Sensitive Register Allocation, 138
Precision Sensitive Scheduling, 140
Prewitt Window Transform, 144
RNS Moduli Selection, 183
Register Assignment, 149
Register-Conflict Graph, 150
Register-rich Architecture, 143
Residue Encoding, 174, 177
Residue Number System, 171
Retiming, 59
Selective Coefficient Negation, 27
Shiftless DA Implementation, 107
Signal Flow Graph Transformations, 130
Slant Transform, 167
SoC Design Methodology, 4
Sobel Window Transform, 152
Solution Space for DSP Implementation, 6
Spatial High Pass Filter, 152
Spatial Low Pass Filter, 152
Spill-free DAG, 162
Stochastic Evolution, 151
T0 coding, 18
TMS320C2x/C5x, 39, 153
TMS320C54x - FIRS Instruction, 34
Transformation Framework, 51, 191
Transition Density, 14
Transposed FIR Filter, 41, 180
Traveling Salesman Problem, 29
Truth Table Area Estimation, 88
Two Dimensional Linear Transform, 113
Uni-sign Representation, 129
Walsh-Hadamard Transform, 158, 164

About the Authors

Mahesh Mehendale is a Distinguished Member of Technical Staff and the Director, DSP Design at Texas Instruments (India) Ltd. He leads the DSP design group which recently designed TI's next generation TMS320C27x processor. This design received EDN Asia magazine's component of the year award for 1999. Mahesh received his B.Tech (EE), M.Tech (CS&E) and Ph.D. degrees from IIT Bombay in 1984, 1986 and 1999 respectively. He has published more than 35 papers in refereed international conferences and has received two best paper awards, at EDIF World 1991 and the VLSI Design 1996 conferences. He was a co-author of the paper that received the best student paper award at the VLSI Design 1999 conference. Mahesh has co-presented full-day and half-day tutorials in the areas of FPGAs, low power DSP, VLSI signal processing and programmable DSP verification at many national and international conferences. He has served as a member of the technical program committee for many conferences and also as the Tutorial Chair for the VLSI Design 1996 conference and the Design Contest Chair for the VLSI Design 2000 conference. He is the program co-chair for the joint VLSI Design-ASPDAC 2002 conference. Mahesh holds five US patents with one more application pending. He is a senior member of IEEE. He can be reached at mahesh@ti.com.

Sunil D. Sherlekar is currently the Vice-President of R&D at Sasken Communication Technologies Ltd. (formerly known as Silicon Automation Systems Ltd.), a company engaged in developing leading edge solutions for the telecom access market. Prior to joining this company, he was on the faculty of Computer Science and Engg. at IIT Bombay for about 12 years. He has published several papers in refereed conferences and journals and has supervised several doctoral and masters' students. He has served as a member and chair of the technical program committee of several conferences and also as the General Chair for the Asian Test Symposium 1995 and the Asia-Pacific Conference on HDLs (APCHDL) 1996. He is the General Chair for the joint VLSI Design-ASPDAC 2002 conference. He is a member of the editorial board of JETTA and a member of the Steering Committee of ASPDAC. For APCHDL 1994, he gave the keynote speech at Toyohashi in Japan. In 1996, his paper with Mahesh Mehendale received the best paper award at the VLSI Design Conference. In 1997, he was awarded by IEEE for his contribution to the Asian Test Subcommittee activities. He can be reached at sunils@sasken.com.
