THE PERCEPTION PROCESSOR
by
Binu K. Mathew
A dissertation submitted to the faculty of The University of Utah
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Computer Science
School of Computing
The University of Utah
August 2004
Copyright © Binu K. Mathew 2004
All Rights Reserved
THE UNIVERSITY OF UTAH GRADUATE SCHOOL
SUPERVISORY COMMITTEE APPROVAL
of a dissertation submitted by
Binu K. Mathew
This dissertation has been read by each member of the following supervisory committee and by majority vote has been found to be satisfactory.
Chair: Al Davis
John B. Carter
Ganesh Gopalakrishnan
Erik Brunvand
William C. Athas
THE UNIVERSITY OF UTAH GRADUATE SCHOOL
FINAL READING APPROVAL
To the Graduate Council of the University of Utah:
I have read the dissertation of Binu K. Mathew in its final form and have found that (1) its format, citations, and bibliographic style are consistent and acceptable; (2) its illustrative materials including figures, tables, and charts are in place; and (3) the final manuscript is satisfactory to the Supervisory Committee and is ready for submission to The Graduate School.
Al Davis, Chair, Supervisory Committee
Approved for the Major Department
Christopher R. Johnson, Chair/Director
Approved for the Graduate Council
David S. Chapman, Dean of The Graduate School
ABSTRACT
Recognizing speech, gestures, and visual features is an important interface capability for future embedded mobile systems. Unfortunately, the real-time performance requirements of complex perception applications cannot be met by current embedded processors and often exceed even the capability of high performance microprocessors. At the same time, the energy budget of current high performance processors is infeasible in the embedded space. The normal approach is to resort to a custom ASIC to meet performance and energy constraints. However, ASICs incur expensive and lengthy design cycles. They are so specialized that they are unable to support multiple applications or even evolutionary improvements in a single application. This dissertation introduces a VLIW perception processor that uses a combination of clustered function units, compiler controlled dataflow and compiler controlled clock gating, in conjunction with hardware support for modulo scheduling, address generation units and a scratch-pad memory system, to achieve very high performance for perceptual algorithms at low energy consumption. The architecture is evaluated using benchmark algorithms taken from the complex speech and visual feature recognition, security, and signal processing domains. Since energy and delay are common design trade-offs, the energy-delay product of a CMOS implementation of the perception processor is compared against ASICs and general purpose processors. Using a combination of Spice simulations, real processor power measurements and architecture simulation, it is shown that the perception processor running at a 1 GHz clock frequency outperforms a 2.4 GHz Pentium 4 by a factor of 1.75. While delivering this performance, it simultaneously achieves a 159 times better energy-delay product than a low power Intel XScale embedded processor.
The perception processor makes sophisticated real-time perception applications possible within an energy budget that is commensurate with the embedded space, a task that is impossible with current embedded processors.
This dissertation is dedicated to
A, T, C, G, 1 and 0, the building blocks of intelligence.
And to the pioneers uncovering the foundations of intelligence.
CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
ACKNOWLEDGMENTS
CHAPTERS
1. INTRODUCTION
   1.1 The Problem
   1.2 The Solution
   1.3 Road Map
2. RELATED WORK
   2.1 Optimization and Characterization of Perception Applications
   2.2 Image and Neural Processors
   2.3 High ILP Processors for Perception
   2.4 Custom Hardware for Perception
   2.5 Balancing Performance and Power Consumption
   2.6 Distinguishing Features
3. PRINCIPLES BEHIND DYNAMIC POWER REDUCTION
   3.1 Dynamic Power Consumption
   3.2 Power Reduction Strategies
   3.3 Process Normalization
      3.3.1 Constant Field Scaling
      3.3.2 Voltage Scaling
      3.3.3 Frequency Scaling
   3.4 The ETⁿ Metric
   3.5 Energy Delay Squared Product
4. SPEECH RECOGNITION
   4.1 Front End
   4.2 Acoustic Model
   4.3 Language Model
   4.4 Overall Operation
   4.5 Architectural Implications
5. CHARACTERIZATION AND OPTIMIZATION OF SPHINX 3
   5.1 Memory System Behavior
   5.2 ILP in Sphinx
   5.3 Results of Software Optimizations
      5.3.1 Cache Optimizations
      5.3.2 Parallelization
   5.4 The HMM Phase
6. A CUSTOM GAUSSIAN ACCELERATOR
   6.1 Top Level Organization
   6.2 Coprocessor Datapath
   6.3 Implementation
   6.4 Applications
   6.5 Accelerator Evaluation
      6.5.1 Energy Savings
      6.5.2 Scalability
      6.5.3 Bandwidth Savings
7. VISUAL FEATURE RECOGNITION ALGORITHMS
   7.1 Flesh Toning
   7.2 Segmentation
   7.3 Rowley Face Detector
   7.4 Viola and Jones' Detector
   7.5 Eigen Faces
   7.6 Architectural Implications
8. CHARACTERIZATION OF VISUAL FEATURE RECOGNITION
   8.1 Application Characteristics
   8.2 Optimization Opportunities
9. PERCEPTION PROCESSOR ARCHITECTURE
   9.1 Pipeline Structure
   9.2 Instruction Format
   9.3 Function Units
   9.4 Compiler Controlled Dataflow
   9.5 Interconnect
   9.6 Memory System Architecture
      9.6.1 Loop Unit
      9.6.2 Stream Address Generators
      9.6.3 Array Variable Renaming
      9.6.4 Addressing Modes
   9.7 Compiler Controlled Clock Gating
   9.8 Design Flow
   9.9 Programming Example
10. EVALUATION
   10.1 Benchmarks
   10.2 Metrics
   10.3 Experimental Method
   10.4 Results
      10.4.1 Instruction Level Parallelism
      10.4.2 Power Consumption
      10.4.3 Throughput
      10.4.4 Energy Consumption
      10.4.5 Energy Delay Product
      10.4.6 Energy Delay Squared Product
      10.4.7 Clock Gating
      10.4.8 The Cost of Generality
   10.5 Summary
11. CONCLUSIONS
12. FUTURE RESEARCH
REFERENCES
LIST OF FIGURES
1.1 Perception Performance
1.2 High Level Architecture
4.1 Signal Processing Front End
4.2 Triphone HMM
5.1 L1 Dcache Miss Rate
5.2 L2 Cache Miss Rate
5.3 L2 to Memory Bandwidth
5.4 GAU and GAU OPT IPC
5.5 HMM IPC
5.6 Measured Speedup on R12K
5.7 Cache Optimized Gaussian Algorithm
6.1 Top Level Organization of Gaussian Estimator
6.2 Gaussian Coprocessor
6.3 Channel Scaling
7.1 Algorithmic Stages of a Face Recognizer
8.1 Execution Time Breakdown of Viola/Jones Detector Based Face Recognizer
8.2 Execution Time Breakdown of Rowley Detector Based Face Recognizer
8.3 L1 Dcache Miss Rate
8.4 L2 Cache Hit Rate
8.5 IPC
8.6 Speedup or Slowdown Over Real Time
9.1 Perception Processor Organization
9.2 Pipeline Structure
9.3 Microinstruction Format
9.4 Function Unit Architecture
9.5 Interconnect Architecture
9.6 Loop Unit
9.7 Stream Address Generator
9.8 Loop Acceleration Example
9.9 Matrix Multiply Algorithm
9.10 Inner Product Accelerator
9.11 Assembly Code for Interleaved Inner Product
10.1 IPC
10.2 Power Consumption
10.3 Throughput Normalized to Pentium 4 Throughput
10.4 Process Normalized Energy Consumption
10.5 Process Normalized Energy Delay Product
10.6 Process Normalized Energy Delay Squared Product (ET²)
10.7 Impact of Clock Gating
10.8 Energy Consumption of PP+
10.9 Energy Delay Product of PP+
12.1 Generic Stream Function
12.2 Stream Processor
LIST OF TABLES
5.1 Experiment Parameters
8.1 Experiment Parameters
LIST OF ACRONYMS
ALU arithmetic logic unit
ANN artificial neural network
ASIC application-specific integrated circuit
CAD computer-aided design
CMOS complementary metal-oxide semiconductor
CMU Carnegie Mellon University
CPU central processing unit
DMA direct memory access
DRAM dynamic random-access memory
DSP digital signal processor
DTLB data TLB
Dcache data cache
EDP energy delay product
FFT fast Fourier transform
FIR finite-length impulse response
FPGA field-programmable gate array
FPU floating point unit
FU function unit
GCC GNU compiler collection
GNU GNU’s not Unix
HDL hardware description language
HMM hidden Markov model
HSV hue saturation value
Hub, Hub-4 a speech recognition benchmark created by the National Institute of Standards and Technology; also a speech model developed for this benchmark by CMU
IEEE Institute of Electrical and Electronics Engineers
ILP instruction level parallelism
IPC instructions per cycle
ISA instruction set architecture
ITLB instruction TLB
Icache instruction cache
L1 level 1
L2 level 2
MIPS million instructions per second; also, the company MIPS Inc.
MLP multilayer perceptron
MTCMOS multithreshold CMOS
NCC normalized color coordinates
NOP null/no operation
RAM random-access memory
RGB red green blue
RISC reduced instruction set computer
ROM read-only memory
SGI a company formerly known as Silicon Graphics Inc.
SIMD single instruction multiple data
SRAM static random-access memory
TLB translation look-aside buffer
TSMC Taiwan Semiconductor Manufacturing Company
VLIW very large instruction word
ACKNOWLEDGMENTS
Over the years I have discussed numerous half baked ideas with my advisor Prof.
Al Davis who has always been encouraging of new concepts. This dissertation is a
testament to Al’s willingness to let me explore territory that was totally outside the
realm of his previous research. In spite of my naive protests, he gently but unrelentingly insisted for several years, until I finally grasped that reducing power consumption was
a more important problem than high performance. I am very grateful for Al’s flexible but
insistent advising style. Mike Parker has contributed greatly to my research through nu-
merous hours of consultation and help with power measurements. Ali Ibrahim deserves
thanks for his help with a prototype that eventually led to the perception processor. Profs.
John Carter and Sally McKee extensively schooled me in the technique of writing papers
during my first two years of graduate school. They deserve all the credit for improving
my writing skill. Any deficiencies are due to my stubborn adherence to my own style.
Finally, my family and my friends, particularly BJP, Shilu, Asha Amal and Yea teasingly
convinced me over the last several years that a Ph.D. meant a lot more than anyone really
believed. Thank you!
CHAPTER 1
INTRODUCTION
The term Perception Processing encompasses processor support for technologies that
can enable computers to perceive the world the way we humans do with our sensory
faculties. It targets areas like object detection, recognition and tracking, speech and ges-
ture recognition and multimodal abilities like lip reading to support speech recognition.
The applications for perception processing are both numerous and diverse. More and
more computing devices are being invisibly embedded into our living environment, and
we notice their existence only when they cease to serve us. For this fledgling comput-
ing fabric to develop into tomorrow’s ubiquitous computing environment, the primary
means of interacting with it should be human friendly ones like speech and gesture.
Future mobile embedded environments need to support sophisticated applications such as
speech recognition, visual feature recognition, secure wireless networking, and general
media processing. Work environments from the board room to the garage will eventually
feature human friendly and hands free interfaces to the computers embedded into those
environments. Perception prosthetics are an important application too. Devices that
listen to speech and then project text on a heads-up display worn by a deaf person, or an intelligent camera that gives audio cues like “Vehicle approaching” or “Stairs 10 feet ahead” to a blind wearer, are of particular interest. Another important application area
is robotics – the chance to outfit both manned and autonomous vehicles, industrial and household robots, and even machine tools with tireless vision presents boundless
opportunity. Other areas that could benefit from perception processing include automated
surveillance, translation of speech and a variety of assistive technologies.
1.1 The Problem
By their very nature, perception applications are likely to be most useful in mobile
embedded systems. A fundamental problem that plagues these applications is that they
require significantly more performance than current embedded processors can deliver.
Most embedded and low-power processors, such as the Intel XScale, do not have the
hardware resources and performance necessary to support a full featured speech rec-
ognizer. Even modern high performance microprocessors are barely able to keep up
with the real-time requirements of sophisticated perception applications. The energy
consumption that accompanies the required performance level is often orders of mag-
nitude beyond typical embedded power budgets. This dissertation attempts to develop
a specialized processor architecture that can provide high performance for perception
applications in an energy-efficient manner.
Figure 1.1 shows actual measured performance of two perception applications: CMU
Sphinx 3, a speech recognition system, and FaceRec, a face recognition application. The
applications were run on Intel Pentium III and later processors with clock speeds varying
from 900 MHz to 3 GHz. Details of these applications are presented in Chapters 4 and
7. The horizontal lines show the performance level required to achieve real-time targets.
For the speech recognizer, this involves recognizing a 29.2 second long speech recording
in the same interval of time. The workload for the face recognizer consists of processing
25 image frames in 5 seconds corresponding to the real-time target of handling 5 frames
of 320 × 200 pixel images every second.
Each of the smooth curves in the figure corresponds to the hyperbola obtained by
assuming ideal scaling of performance with frequency. They are derived by starting with
the data point corresponding to 900 MHz and assuming that run time varies inversely
with frequency. It is evident that for speech recognition, the performance of the processor
does not scale ideally. In theory, a 2.4 GHz processor should achieve real-time performance; in practice, a processor frequency of approximately 2.9 GHz is required to satisfy
real-time requirements. This performance gap suggests that when moving to more com-
plex future speech recognition workloads, higher frequencies alone are not the solution.
Fundamental architectural improvements are called for. The face recognizer demands
[Figure 1.1 plot: run time in seconds versus CPU frequency in MHz, showing measured Speech Rec and Face Rec run times, their real-time targets, and their theoretical ideal-scaling curves.]
Figure 1.1. Perception Performance
a higher level of performance than is currently available; its real-time requirements call for a 4.8 GHz or faster processor. The complexity of both workloads is likely to
increase significantly in the future. The results clearly show that perception applications
stress the performance limits of high end processors and that low power embedded processors
may never have the compute power required for perception applications.
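The ideal-scaling model behind the smooth curves can be stated directly: run time is assumed to vary inversely with clock frequency, anchored at the measured 900 MHz data point. The following back-of-the-envelope sketch illustrates the arithmetic; the 78-second runtime used in the example is a hypothetical placeholder, not a measurement from Figure 1.1.

    def ideal_runtime(t_at_900mhz: float, freq_mhz: float) -> float:
        """Predicted run time at freq_mhz under ideal 1/f scaling."""
        return t_at_900mhz * 900.0 / freq_mhz

    def freq_for_real_time(t_at_900mhz: float, t_target: float) -> float:
        """Clock frequency (MHz) needed to meet a real-time target."""
        return 900.0 * t_at_900mhz / t_target

    # Example: a 29.2 s recording that hypothetically takes 78 s at 900 MHz
    # would need 900 * 78 / 29.2, i.e., about 2400 MHz, under ideal scaling,
    # matching the theoretical 2.4 GHz figure; measurements instead show
    # that roughly 2.9 GHz is needed in practice.
    print(freq_for_real_time(78.0, 29.2))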
Given Moore’s law performance scaling, the performance issue is not by itself a
critical problem. However, two significant problems remain. First, the energy expended in high performance processors is intractable in the embedded space. Furthermore, the power requirements of new processors are increasing. The conclusion is that technol-
ogy scaling alone cannot solve the problems faced by perception applications. Second,
perception and security interfaces are by nature always operational. This limits the pro-
cessor’s availability for other compute tasks such as understanding what was perceived.
The usual solution to reducing power consumption while increasing performance
is to use an Application Specific Integrated Circuit (ASIC). Given the complexity and
the always-on nature of perception tasks, a more relevant approach would be to use
the ASIC as a coprocessor in conjunction with a low power host processor. As a part
of this research, an ASIC coprocessor for one of the dominant phases of the CMU
Sphinx speech recognition system was investigated. Details may be found in Chapter
6. This effort led to the usual realization that ASICs are costly and inflexible. Their
high fabrication cost coupled with the costs associated with a lengthy design cycle are
difficult to amortize. The inherent level of specialization in an ASIC makes it extremely
difficult to support multiple applications, new methods, or even evolutionary algorithmic
improvements. Given that embedded applications evolve rapidly and that embedded
systems are extremely cost sensitive, these problems provide significant motivation to
explore a more general purpose approach. The use of reconfigurable logic and FPGA
devices is another common approach [31]. The inherent reconfigurability of FPGAs
provides a level of specialization while retaining significant generality. However the
reconfiguration time is relatively long, and FPGAs have a significant disadvantage both
in performance and power when compared to either ASIC or CPU logic functions.
1.2 The Solution
This dissertation addresses the design of programmable processors that can handle
sophisticated perception workloads in real time at power budgets suitable for embedded
devices. Programmable processors optimized for the perception domain are intended
to be used as coprocessors for general purpose host processors. A high level view
of the architecture is shown in Figure 1.2. A number of function units are organized
as a cluster and embedded in a rich interconnection network that provides connection
between function units in the cluster and four memories. The host processor moves
data into or out of the coprocessor via double buffered input and output SRAMs. Local
storage for the cluster is provided by the scratch SRAM, and the microcode program
that controls the operation of the cluster is held in the u-Code SRAM. The execution
cluster can be customized for a particular application by the selection of function units.
Figure 1.2. High Level Architecture
In fact, the type and number of function units, SRAMs, address generators, bit widths
and interconnect topology are specified using a configuration file. The hardware design
(Verilog netlist) and a customized simulator are automatically generated by a cluster
generator. Henceforth the term perception processor refers to the generic architecture
behind any domain-specific processor created using the cluster generator tool.
Perception algorithms tend to be stream oriented, i.e., they process a sequence of
similar data records where the data records may be packets or blocks of speech signals,
video frames or the output of other stream processing routines. Each input packet is
processed by a relatively simple and regular algorithm that often refers to some limited
local state tables or history to generate an output packet. The packets have fixed or
variable but bounded sizes. The algorithms are typically loop oriented with dominant
components being nested for loops with flow-dependent bodies. Flow dependence im-
plies that loop-carried dependences have constant distances in the iteration space of the
nested loop structure. Processors that are optimized for this style of computation are
called stream processors. While there are subtle differences in nuance, the notion of
streams and algorithm kernels described here is essentially the same as that developed by
Dally et al. for the Imagine Stream Processor [82]. The perception processor developed
in this research is a specialized stream processor optimized for speech recognition and
vision. However, attempts will be made to show its generality to other stream oriented
algorithms in Chapters 10 and 12.
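To make the loop style concrete, the following generic kernel (illustrative only, not one of the dissertation's benchmarks) has the property just described: its only loop-carried dependence sits at a constant, compile-time-known distance in the iteration space.

    def smooth_stream(x: list[float], a: float) -> list[float]:
        """First-order recursive smoothing filter over one input packet.
        y[i] depends on y[i - 1], a flow dependence at constant distance 1,
        which is what lets a modulo scheduler overlap iterations at a
        fixed initiation interval."""
        y = [0.0] * len(x)
        for i in range(len(x)):  # the dominant nested-for-loop pattern
            prev = y[i - 1] if i > 0 else 0.0
            y[i] = a * prev + (1.0 - a) * x[i]
        return y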
Fine-grained control of physical resources is provided by a horizontal microcode
program. The architecture and the fine-grained control mechanism support data flows
that resemble the custom computational pipelines found in ASICs. Software based
control provides a significant level of generality. Any algorithm can be mapped onto
the cluster, albeit with varying levels of efficiency. The result is a cluster that can be
tailored to a particular domain and can support multiple applications or applications
phases. The approach includes a specialized microcode compiler that maps applications
onto the perception processor. Currently, the input to the compiler is a tiny specialized
language implemented on top of the Python scripting language. It supports constructs for
various types of for loops, array access patterns, opcode mnemonics, loop unrolling and
processor reconfiguration requests. Compilers for more general languages like C or C++
are definitely possible, but have not been implemented. The compiler uses hardware sup-
port for modulo-scheduled loops in conjunction with array address generators to deliver
high throughput for flow dependent loops [81]. The microcode provides fine-grained
control over data steering, clock gating and function unit utilization and it permits single
cycle reconfiguration of address generators.
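For flavor, a kernel in a Python-hosted language of this style might be written roughly as follows. This is purely a hypothetical sketch: none of the names (Cluster, configure, loop, mac) are taken from the actual tool, whose syntax is not reproduced here.

    from dataclasses import dataclass, field

    @dataclass
    class Cluster:
        ops: list = field(default_factory=list)

        def configure(self, **units):          # processor reconfiguration request
            self.ops.append(("config", units))

        def loop(self, trip_count, unroll=1):  # modulo-scheduled for loop
            self.ops.append(("loop", trip_count, unroll))

        def mac(self, dst, src_a, src_b):      # opcode mnemonic
            self.ops.append(("mac", dst, src_a, src_b))

    c = Cluster()
    c.configure(multipliers=2, adders=2)
    c.loop(trip_count=39, unroll=2)            # e.g., one 39-element vector
    c.mac("acc", "stream0[i]", "stream1[i]")   # array access pattern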
Energy efficiency is primarily the result of minimized communication and activity.
The compiler uses fine-grained clock gating to ensure that each function unit is active
only when required. Compiler-controlled dataflow permits software to explicitly address
output and input stage pipeline registers of function units and orchestrate data transfer
between them over software-controlled bypass paths. Data values are transported only
if necessary, and the compiler takes care to ensure that value changes are visible on
heavily loaded wires and forwarding paths only if a unit connected to that path needs
the data value. By explicitly enabling pipeline registers the compiler is able to control
the lifetime of function unit outputs and directly route data to other function units,
avoiding unnecessary access to a register file. The resulting dataflows or active datapaths
resemble custom computational pipelines found in ASICs, but have the advantage of
flexibility offered by software control. This may be thought of as a means of exploiting
the natural register renaming that occurs when a multistage pipeline shifts and each
individual pipeline register gets a new value. However the active datapath in the cluster
will utilize multiplexer circuits that provide generality at the cost of power, area and
performance. These muxes and the associated penalties will not be present in a custom
ASIC design.
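One way to picture this control regime is as a horizontal microcode word that carries, for every function unit and every cycle, a clock enable bit plus explicit routing fields naming which pipeline register feeds each operand. The model below is schematic and all field names are invented; the real microinstruction format is described in Chapter 9.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class FUControl:
        clock_enable: bool  # compiler controlled clock gating
        opcode: str         # operation issued this cycle, or "nop"
        src_a: str          # pipeline register driving operand A
        src_b: str          # pipeline register driving operand B
        latch_out: bool     # hold the result in the output stage register

    # One cycle of a hypothetical two-unit schedule: the multiplier runs, and
    # its explicitly addressed output register will feed another unit next
    # cycle over a software controlled bypass path, with no register file
    # involved. The idle divider stays clock gated the whole time.
    cycle_7 = {
        "mul0": FUControl(True, "mul", "mem0.out", "mem1.out", True),
        "div0": FUControl(False, "nop", "-", "-", False),
    }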
The resulting architecture is powerful enough to support complex perception algo-
rithms at energy consumption levels commensurate with mobile device requirements.
The approach represents a middle ground between general purpose embedded proces-
sors and ASICs. It possesses a level of generality that cannot be achieved by a highly
specialized ASIC, while delivering performance and energy efficiency that cannot be
matched by general purpose processor architectures.
1.3 Road Map
Chapter 2 will provide a brief introduction to previous research pertaining to the
optimization and acceleration of perception applications. Chapter 3 describes the basic
principles behind power reduction in CMOS circuits and introduces metrics that will be
used later for evaluating the perception processor. This is followed by Chapters 4 and 5, which provide an introduction to the foundations of speech recognition and a performance analysis of the CMU Sphinx 3.2 speech recognition system, respectively.
Chapter 6 presents the design and evaluation of an ASIC coprocessor for a dominant
phase of Sphinx. Computer vision algorithms used in the FaceRec application mentioned
previously are introduced in Chapter 7 and the application itself is characterized in
Chapter 8. The architecture of the perception processor is elaborated in Chapter 9, and
its performance and energy efficiency are analyzed in Chapter 10. Chapter 11 draws
conclusions and highlights important results. Finally, Chapter 12 points out avenues for
future research.
CHAPTER 2
RELATED WORK
While there has been a considerable body of research targeted at accelerating artificial
neural networks (ANN) in general, very little work has been directed towards the archi-
tectural needs of perception processing and low power implementations of perception
functions. Related areas of research can be classified broadly into optimization and
characterization of perception workloads and various special purpose, parallel and low
power coprocessor architectures. The following sections highlight representative work
in each of these areas. In addition, general research in processor and reconfigurable logic
not specifically targeted at perception yet capable of sustained high performance at low
power budgets will also be discussed.
2.1 Optimization and Characterization of Perception Applications
Perception processing, which encompasses a wide range of topics like computer
vision, speech recognition and gesture recognition, is currently the focus of vigorous
research. While it is common in the literature to see the relative merits and performance
of algorithms compared, architecture level analysis of whole perception applications is
extremely rare. Traditional research in perception has been geared towards improving ac-
curacy. Performance is a secondary goal, and power efficiency has been largely ignored.
For instance, the yearly Hub speech recognition evaluation reports typically emphasize
improvements in recognition accuracy and mention improvements in performance as a
multiple of “slow down over real time” [30, 92].
Ravishanker’s research improved the performance of the Sphinx speech recognition
system by trading off accuracy in a computationally intensive phase for faster run time
and then recovered the lost accuracy by doing additional processing in a computationally
cheaper phase of the application [74]. This research also reduced the memory footprint
of speech recognition by using a disk based language model cached in memory by the
software.
Agaram, Burger, and Keckler characterized the Sphinx II speech recognition system in
a manner useful for computer architects [6]. They focused on ILP, as well as memory
system characteristics such as cache hit rates and block sizes, and concluded that avail-
able ILP was low. They compared the characteristics of the Sphinx II system with those
of Spec benchmarks and also hinted at the possibilities and problems associated with
exploiting thread level parallelism.
Researchers at the Intel ICRC labs published a performance analysis of a speech
recognition system for Mandarin Chinese [59]. This study focused on the run time and
the size of the working set while executing the Intel speech recognition system on several
different versions of the x86 processor. They reported a decrease in ILP with increased
clock rate. IPC decreased from between 1 and 1.2 at 500 MHz to approximately 0.4 at
1.5 GHz – a clear indication that increasing clock rate is not the solution to improving
speech recognition performance. The decrease in ILP was attributed to memory system
behavior, but a detailed explanation was not provided. The ICRC speech system is not
publicly available, but the underlying semicontinuous HMM technique is the same as
that used by Sphinx. An experiment reported by the Intel researchers claimed to achieve
faster than real time recognition – 1.14 times faster than real time on a 1 GHz processor
and 1.33 times faster than real time on a 1.5 GHz Pentium 4 processor. The results
from Figure 1.1 show that Sphinx¹ is 2.5 times and 1.5 times slower than real time
on 1 GHz and 1.8 GHz Intel Pentium processors respectively. It is possible that the
workload and vocabulary used by the Intel researchers was considerably simpler than
the one used with Sphinx. Ravishanker reported that for Sphinx II, the language model
search consumed about 40% of the recognition time [74]. For the Intel researchers, the
language model search is a very small fraction of the execution time. Details of the ICRC
speech model are not available. The huge gap in performance between Sphinx and the
1This is not the CMU version of Sphinx, but a heavily optimized version described in Chapter 5.
10
numbers published by ICRC is possibly because the ICRC speech model is simpler than
the Hub-4 speech model used to evaluate Sphinx.
Rabiner and Huang provide data on historical trends in the compute requirement of
continuous speech recognition. They predict that in the post-2000 time frame it will require the compute power of 20 to 60 DSP processors, each delivering 1000 MIPS [79].
No published work on the power consumption characteristics of speech recognition is
known to exist at this time.
Compared to speech recognition, the algorithms used for perceptual computer vision
are far more diverse, and workload characterization results are almost nonexistent. The
problem is exacerbated by the fact that research is split into image understanding applica-
tions like automatic navigation and nonunderstanding applications like face recognition
and detection. A large volume of existing research emphasizes the parallelization and
hardware acceleration of early vision primitives like convolution, thresholding, segmen-
tation and connected component labeling [9, 105, 107]. Toolkits like Xvision and the In-
tel computer vision library provide optimized versions of such vision primitives [43, 52].
While there seems to be a consensus on early vision primitives for image understanding, there is very little agreement and commonality in the higher level aspects of
computer vision. Specialized systems for inspection of manufacturing defects, robot and
vehicle navigation exist, but seem to be highly domain specific. Representative examples
are commercial offerings by companies such as Cognex and Coreco, which provide
application specific software for industrial applications such as visual inspection, security
monitoring, motion detection, etc. [1, 2]. In contrast, nonunderstanding computer vision
applications seem to have more in common with each other, and complete applications
are more readily available. These also possess a synergistic nature – face detection and
lip tracking can augment speech recognition and improve recognition accuracy [102].
Rowley described an optimization for his neural network based face detector that
can process a single 320 × 200 image in 7.2 seconds on a 200 MHz R4400 [83]. He
reported that combined with flesh tone detection, it might be possible to reduce this time
to two to four seconds. Viola and Jones published a method of detecting faces that can
perform at a rate of 15 frames per second on a 700 MHz Pentium [103]. Their rapid
rate of detection depends on three fundamental advances. They propose a new image
representation called integral image that allows features used by their detector to be
computed rapidly. This image representation can be coupled with a learning algorithm,
which can select a small number of critical features from a large set and thus reduce
computation. They also describe a method to cascade increasingly complex classifiers
that prunes away uninteresting background regions so that the algorithm can spend more
time on the promising part of an image. Together, these optimizations claim a factor
of 15 speedup over the Rowley detector. Connell of the ECVG research group at IBM
reported being able to perform face detection at 90 frames per second on a 400 MHz
Pentium II by correlating the output of a variety of inaccurate and computationally cheap
face detectors [25]. Details of this system are currently not available.
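The integral image representation at the heart of Viola and Jones' speedup is simple to state: each entry holds the sum of all pixels above and to the left of it, so the sum over any rectangular feature costs four table lookups regardless of the rectangle's size. The sketch below is a generic restatement of that idea, not their implementation.

    def integral_image(img):
        """ii[y][x] = sum of img[j][i] for all j <= y and i <= x."""
        h, w = len(img), len(img[0])
        ii = [[0] * w for _ in range(h)]
        for y in range(h):
            row = 0
            for x in range(w):
                row += img[y][x]  # running sum of the current row
                ii[y][x] = row + (ii[y - 1][x] if y else 0)
        return ii

    def box_sum(ii, x0, y0, x1, y1):
        """Sum of img over [x0..x1] x [y0..y1] in O(1) via inclusion-exclusion."""
        total = ii[y1][x1]
        if x0: total -= ii[y1][x0 - 1]
        if y0: total -= ii[y0 - 1][x1]
        if x0 and y0: total += ii[y0 - 1][x0 - 1]
        return total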
There is a serious dearth of research characterizing the performance of face detectors.
This lack of published analysis can be mainly attributed to two different factors. First,
there is a wide variety of nonneural net based face detection techniques. The promi-
nent examples are support-vector based methods, naive Bayesian classifiers, template
matching and Eigen vector based techniques [110, 75]. Though each of these techniques
has its ardent proponents, the field as a whole is too fractured. Anyone undertaking an
architecture study is perplexed about which method is important. Second, most neural
net face detectors are based on multilayer perceptrons (MLP). Because of their regular
structure, it is simple to estimate the number of operations, bandwidth requirements,
etc. of an MLP network. While performance is easy to estimate, the degree of numerical
precision required, power consumption, die area, etc. are much more difficult to quantify.
Face recognition shares the same problem as face detection in that no performance and
power analysis studies are known to exist.
2.2 Image and Neural Processors
Neuromorphic system design pioneered by Carver Mead is a method of building
electronic circuits inspired by biological systems. For example, Boahen and colleagues at
the University of Pennsylvania designed the Visio 1, a chip that models photo-receptors
and the four major ganglion cell types found in a retina [15]. This low power chip
uses networks of ganglion cells to detect edges and distinguish directions of motion.
Harrison and Koch at the California Institute of Technology built a chip that integrated
photo-detectors and analog motion detectors to model the first three layers of the visual
system of a house fly [45]. They successfully used this chip to steer the direction of an
autonomous mobile robot in real time. While there are distinct power and performance
advantages to such neuromorphic chips, their analog nature, limited reconfigurability and
tight integration with photo-detectors make them unlikely candidates for integration into
low power digital computers for perception.
The Xetal processor developed by researchers at Philips Research labs takes the
approach of providing a low power programmable linear array of processors designed
to accept digital video data [57]. Xetal consists of an array of 320 programmable pro-
cessing elements that are laid out with communication channels and optimized to process
640×480 images at 30 frames per second. This processor is optimized for low power high
performance computations like convolution, color conversion, noise reduction, template
matching and image compression. No information is currently available on applying
Xetal to perception processing.
Fang, a researcher from NASA JPL, describes a low power system on chip design that
combines an on-chip camera with a neural net processor and a control microprocessor
[34]. This system developed for real-time vision applications in space exploration was
reported to be capable of functions like edge detection, connected component detection,
motion estimation, etc. Actual power and performance results are not available.
The Simpil processor designed at Georgia Tech is a focal plane SIMD architecture for
early vision applications like edge detection, image convolution and compression [24].
In Simpil, up to 16 pixels may be sampled by a SIMD node using A/D conversion and
processed locally. Arrays of nodes perform localized computations over the entire focal
plane. Estimated total power consumption for a 64 × 64 array of SIMD nodes fabricated
in a 0.35µ process as four separate chips was 5.1 W while operating at 20 MHz.
2.3 High ILP Processors for Perception
The high performance microprocessor industry has devoted a lot of attention to de-
veloping short vector (SIMD) extensions like MMX, SSE, MDMX and VIS that cater to
the needs of multimedia applications [26, 37]. An Intel publication described the use of
SSE II instructions for Viterbi decoding of hidden Markov models [50]. Significant per-
formance improvement is claimed, but not quantified. The Intel computer vision library
provides SIMD optimized versions of commonly used vision algorithms [52]. Though
vector machines have long been the workhorse of scientific computing, the relevance of
short vector or SIMD optimizations to perception codes had not been appreciated fully
until recently. These techniques have been shown to improve performance by up to an
order of magnitude on DSP style algorithms and even on small speech processing codes
[55]. The trend has in general been to use short vectors to utilize SIMD parallelism and
to use the super-scalar scheduling infrastructure already available in modern out of order
processors to keep the SIMD units occupied rather than using real vector issue and long
vectors [11]. Shifting the task of identifying dependences and scheduling instructions
from a vectorizing compiler to dynamic issue logic has the distinct disadvantage of
increasing processor complexity as well as power consumption. Vector chaining has
been traditionally used as a performance enhancement mechanism [85]. The compiler
controlled dataflow approach developed in this dissertation can mimic vector chaining in
a more general manner and with low hardware overhead.
There have been numerous attempts to implement digital neural network processors
as vector or SIMD machines. CNAPS from Adaptive Systems and the NeuroMatrix
DSP from Module Research Center are representative examples [44, 72]. While neural
network algorithms have been a mainstay of perception research, the evaluation of such
architectures for well defined perception tasks or whole perception applications is rarely
found in the literature. A well known example is SPERT, a neural network and signal
processing accelerator board for workstations, based on the Torrent 0 vector micropro-
cessor jointly designed by the International Computer Science Institute and UC Berkeley
[106]. Evaluation of SPERT focused on training of forward and back-propagation neural
networks for tasks like probability estimation for a hidden Markov model based speech
recognizer. Both processor speed and the complexity of the recognition task have in-
creased greatly since the time of SPERT.
The performance of Multi-SPERT, a later design consisting of multiple SPERT boards,
was measured to be over 530 million connection updates per second for a five node
configuration performing neural network training for speech recognition [36]. Moreto
analyzed SPERT’s performance on a partial implementation of RASTA-PLP, a speech
front-end signal processing program [73]. An implementation of RASTA for SPERT had
a significant impact in its day. A recent study reported that RASTA-PLP computation
took only 6.7% of the run time of a recognition task [6]. Clearly, the performance
bottlenecks have shifted with advances in speech recognition technology.
2.4 Custom Hardware for Perception
Pihl at the Norwegian University of Science and Technology designed the PDF co-
processor, a custom coprocessor in a 0.8µ CMOS process to accelerate the computation
of Gaussian observation probabilities in a hidden Markov model based speech recognizer
[77]. This research concluded that memory bandwidth was a limiting factor for Gaussian
computation. Pihl approached the memory bandwidth problem by using a new fixed
point representation called the dynamical circular fixed-point format, which reduced the
memory bandwidth requirement by half. The PDF coprocessor could evaluate 40,000
39-element Gaussian components in real time using this format at 154 MHz consuming
853 mW of power. The work was based on an early version of Sphinx. In the current
Sphinx 3.2 version, the workload has worsened by a factor of 15.3. This number, as well
as the bandwidth requirement, is expected to increase further in the future.
An earlier attempt to accelerate speech recognition may be found in the work of
Anantharaman and Bisiani [10]. They present a custom architecture as well as a multipro-
cessor architecture for improving the performance of the beam search algorithm used by
the CMU distributed speech recognition system.
Benedetti and Perona describe an FPGA based system that exploits memory locality
for real-time low level vision [13]. Their system targeted the fast prototyping of low level
vision techniques using observations about locality in pixel neighborhoods to achieve
2.8 GBytes/second bandwidth between SRAM components and FPGA based compute
elements.
2.5 Balancing Performance and Power Consumption
Given the rising interest in mobile devices and the widespread use of embedded
processors in control and monitoring applications, a large body of existing work has
been devoted to achieving high computational performance while also improving power
efficiency. The approach taken in this dissertation is to control a clock gated VLIW pro-
cessor consisting of a cluster of execution units and a special purpose scratch-pad memory
system at a very fine granularity using horizontal microcode. All communication within
the cluster is scheduled under software control – a technique that will be referred to as
compiler controlled dataflow. In addition, the clock signal to each function unit is con-
trolled by the software on a cycle by cycle basis. This is called compiler controlled clock
gating. The details appear later in Chapter 9, but this synopsis is useful in considering
the relevance of preexisting approaches.
There are many vendors of high performance power efficient embedded processors
such as the Philips Trimedia, TI C62xx, and Lucent DSP16000 that can be effectively
scheduled to achieve reasonably low power operation [47, 100, 3]. Increasing perfor-
mance via VLIW instruction scheduling and instruction width reduction techniques is a
common theme in modern embedded systems [63, 108, 16, 8]. Efforts have demonstrated
the benefit of VLIW architectures for customization and power management [89]. Opti-
mization techniques for clustered VLIW architectures can also be found in the literature
[56]. However, these efforts do not address low-level communication issues. Caliber
uses an interesting software pipelining strategy that is targeted at reducing memory
pressure in VLIW systems. The primary mechanism is to distribute the register file
[8, 7]. In contrast, in this dissertation, the output stage pipeline registers of function
units and the associated forwarding paths will be managed as if they constituted a small
distributed register file. Tiwari et al. have explored scheduling algorithms for less
flexible architectures, which split an application between a general purpose processor
and an ASIC [95]. Lee investigated the power benefits of instruction scheduling for
DSP processors [61]. Eckstein and Krall focus on minimizing the cost of local vari-
able access to reduce power consumption in DSP processors [33]. Application-specific
VLIW clusters have been investigated by many researchers [60, 35]. Customizing a
VLIW processor to minimize power and maximize performance by only including the
necessary function units and specializing function units via operator fusion has been
studied and utilized by the Tensilica Corporation in their Xtensa architecture [40]. The
fine grain horizontal microcode approach taken in this dissertation can be viewed as
a fine-grained extension of the VLIW concept. However the addition of sophisticated
address generators, multiple address contexts per address generator, the removal of the
register file, and the fine-grained steering of data are aspects presented in Chapter 9 that
are not evident in these other efforts.
The MOVE family of architectures explored the concept of transport triggering,
where computation is done by transferring values to the operand registers of a function
unit and starting an operation implicitly via a move targeting a trigger register associated
with the function unit [48]. As in the MOVE architecture, the concept of compiler
directed data transfer between function units is used in this dissertation too, but the
resultant architecture is a traditional operation triggered one and transport triggering is
not used.
The RAW machine has demonstrated the advantages of low level scheduling of data
movement and processing in function units spread over a two-dimensional space [104,
62]. The RAW work is similar to the research presented in this dissertation in many ways.
Low-level architectural resources are fully exposed to the compiler. Custom data flows
are scheduled by the compiler on resources that are inherently somewhat general purpose.
The primary differences arise from the basic design target. The RAW effort is directed
at demonstrating that high levels of performance can be achieved on an architecture
consisting of many fine-grained tiles. This dissertation is directed at demonstrating that
somewhat general purpose structures can be scheduled to achieve power efficiency that
competes with special purpose ASIC designs.
The Imagine architecture is organized to exploit high levels of internal bandwidth in
order to achieve high performance levels on stream based data [82]. Scheduling issues
are similar, but the target is performance rather than low power. Given the poor wire
scaling properties of deep submicron CMOS processes, it is somewhat inevitable that
function unit clusters will need to be considered in order to manage communication
delays in high performance wide issue super-scalar processors. Current DSP processors
like the TMS320C6000 already have clustered datapaths and register files [94]. These
approaches however are all focused on providing increased performance. The approach
taken in this dissertation is to improve both power and performance while retaining a
large degree of programmability.
One popular approach to specialization is the use of reconfigurable logic to provide
customization. Techniques vary from power aware mapping of designs onto commer-
cially available FPGA devices to hybrid methods where specialized function blocks are
embedded into a reconfigurable logic array [20, 18, 69, 31]. Of particular relevance
are compiler directed approaches that are similar to the compiler-controlled dataflow
approach used in this research [70]. However, this dissertation targets custom silicon
implementations rather than the higher level FPGA domain. FPGA based approaches
have a significant advantage when the phases of an application persist long enough to
amortize the relatively long reconfiguration times. The generality of the FPGA approach
also leads to excessive energy loss. The approach taken here is commensurate with more
rapid reconfiguration and exhibits significantly better energy efficiency.
A number of researchers have tried to predict the energy consumption of an ap-
plication running on a particular processor [84]. Wattch is a well known example of
high level simulation based power estimation [17]. Such high level approaches have
a number of benefits. They are useful early in the design flow, and the simulations
are several orders of magnitude faster than low level estimation using tools like Spice.
The disadvantage is that Wattch-like systems need to be calibrated to use high level
power models that take into account all the implementation specific details. When the
actual implementation differs from the power model provided to the tool, the power
estimate will be meaningless. Since the perception processor architecture described later
in this dissertation is significantly different from general purpose architectures modeled
by Wattch, the power estimates reported in this work will be based on low-level Spice
simulation of actual circuits.
Clock power is often the largest energy culprit in a complex design such as a modern
microprocessor [41, 96]. This is primarily because the clock signal potentially goes
everywhere on the chip. Clock gating is a popular technique that selectively turns off the
clock to portions of the chip that are not used at a particular time. Krashinsky studied
the benefits of clock gating applied at various levels of aggression on a microprocessor
design [58]. Tseng and Asanovic describe a technique that conserves register file power
when the value will be supplied from a bypass path [98]. This is similar in spirit to
compiler-controlled dataflow used in this dissertation except that the architecture de-
scribed in Chapter 9 eliminates the register file altogether and uses the bypass paths
to forward all values. There are two disadvantages to clock gating: the enable signal
must arrive sufficiently ahead of the clock signal, and the use of additional gates in the
signal path will increase clock skew. Both effects reduce the maximum achievable clock
frequency. For low-power design objectives, this is seldom a serious issue.
Modulo scheduling is a well known software pipelining approach for VLIW proces-
sors [81]. It permits multiple loop bodies to be simultaneously in flight within a clustered
VLIW processor. The perception processor discussed in this dissertation relies heavily
on modulo scheduling to achieve high performance. The regular nature of modulo sched-
uled loops makes them amenable to algorithmic level power analysis and optimization.
While the compiler controlled clock-gating explored in this dissertation has been free of problems, such fine grain management of power could lead to excessive power line noise, otherwise known as the di/dt effect. In such cases it is possible for a compiler to
introduce additional dummy operations into a modulo scheduled loop to reduce power
line disturbance. Yun and Kim present a power aware modulo scheduling algorithm that
could limit power fluctuations [112].
While using custom coprocessors to accelerate applications is a well established idea,
recently researchers have started emphasizing it as a means of reducing power consump-
tion. PipeRench is one such programmable datapath developed at CMU. PipeRench
uses self-controlled runtime reconfiguration and virtualization of hardware to execute
a 40 tap 16 bit FIR filter processing 41.8 million samples per second and the IDEA
encryption algorithm at 450 Mbps while operating at 120 MHz [87]. Power consumption for 15-20 filter taps at 33.3 MHz is in the 600-700 mW range. Pleiades
is a reconfigurable DSP architecture developed at UC Berkeley. It is a domain specific
processor that trades off the flexibility of a general purpose processor for higher energy
efficiency. The Pleiades designers report that their architecture consumes only 14 nJ per stage of an FFT computation, while the Intel StrongARM and the Texas Instruments TMS320C2xx consume 36 nJ and 15 nJ respectively after normalizing for
CMOS process parameters [4]. The opportunities for special purpose architectures to
improve on the power consumption and the performance of general purpose devices are
numerous. Direct comparison against such systems is often impossible because they are not commercially available and because they target different domains. For this
reason, the approach described in this dissertation will be compared against commer-
cially available general purpose processors and ASIC implementation of algorithms, not
against domain specific accelerators in the literature.
2.6 Distinguishing Features
The perception processor is unique in its use of semiautonomous distributed address
generators and scratch-pad memories to efficiently deliver data to a cluster of execution
units. Like the perception processor, the Imagine Stream Processor and its successor,
the Streaming Supercomputer, are targeted at stream computing. However, they target
high performance optimizations for multimedia and scientific calculations and are less
concerned with power efficiency. The perception processor, on the other hand, targets
power efficient acceleration of speech recognition and vision applications. Compiler
controlled dataflow is used as a means to mimic custom ASICs, but unlike the transport
triggered MOVE architecture, the perception processor is operation triggered. Unlike
prior research, the fine-grain compiler controlled clock gating described in this disserta-
tion is used not only as a power saving method but also as a means to let software control
the lifetime of values held in pipeline registers. This leads to the ability to schedule
variables in both time and space and harvest the natural register renaming that happens
when a pipeline shifts. Traditional VLIW processors like the Intel Itanium use a rotating
register file to accelerate loops. In contrast to traditional architectures, the perception
processor uses a mechanism called array variable rotation to create the equivalent of
multiple virtual rotating registers, one per array variable accessed in a loop body. Most
importantly, the distinguishing marks of this dissertation are an architecture level analysis and optimization of perception applications and a power efficient, yet programmable architecture designed for a variety of stream oriented perception and DSP algorithms.
CHAPTER 3
PRINCIPLES BEHIND DYNAMIC POWER
REDUCTION
Power consumption in CMOS circuits consists primarily of static power dissipated
by leakage currents and dynamic power, which in turn comprises short circuit
dissipation and the switching power consumed while charging and discharging load
capacitances. Though subthreshold leakage current was a small component of power
consumption in past processes with 0.25µ and larger feature sizes, it is fast becoming a
large component in processes with smaller feature sizes. Architectures that expose power
management to the operating system and application software can play an important role
in reducing leakage power. A combination of software and hardware mechanisms can
intelligently power down parts of a system that are not in active use. The most effective
solutions to the leakage current problem are at the circuit and process level. Circuit
design styles that use gated Vdd and stacked transistors have been shown to greatly
reduce the magnitude of the problem, but they also decrease performance [78, 53].
CMOS processes with multiple threshold voltages (MTCMOS) provide another solution
to the leakage power problem. They also contribute to design flexibility since fast leaky
transistors can be used in critical paths to enhance performance and slow energy efficient
transistors can be used in noncritical parts of the circuit. How to take advantage of this
flexibility in large circuits synthesized from a hardware description language (HDL) is
an area of active research [93].
While a CMOS gate is switching state, there is a short period of time during which
the N and P transistors are simultaneously on, which leads to short-circuit current flowing
between the power and ground terminals. The magnitude of this current increases with
reductions in Vt. It also increases when the rise and fall times of the input waveform
are slow [109]. As in the case of leakage current, there is very little that can be done at the architecture level to solve the problem. The process level solution of using high Vt devices, together with circuit design styles that ensure rapid rise and fall times, alleviates the severity of the problem.
The architectural options developed in this research are evaluated using transistor
level circuit simulations. These Spice simulations consider both the short circuit and the
leakage components of power consumption. However, since there is not much that can
be done at the architecture level, this research is focused entirely on CMOS dynamic
power dissipated by repeated charging and discharging of load capacitances – a problem
for which architecture level solutions are possible.
3.1 Dynamic Power Consumption

To understand how architectural strategies can provide high performance for perception applications at low power levels, it is necessary to look at the CMOS circuit dynamic power consumption equation:
P = ACV^2F    (3.1)
P is the power consumed, A is the activity factor, i.e., the fraction of the circuit that is
switching, C is the switched capacitance, V is the supply voltage, and F is the clock
frequency [109]. If a capacitance of C is charged and discharged by a clock signal of frequency F and peak voltage V, then the charge moved per cycle is CV and the charge moved per second is CVF. Since the charge packet is delivered at voltage V, the energy dissipated per cycle is CV^2 and the power is CV^2F. The data power for a clocked flip-flop, which can toggle at most once per cycle, will be (1/2)CV^2F. When capacitances are clock gated or when flip-flops do not toggle every cycle, their power consumption will be lower. Hence, a constant called the activity factor (0 ≤ A ≤ 1) is used to model the average switching activity in the circuit. Equation 3.1 is derived by incorporating
this term into the power consumption. Custom ASICs can drastically reduce the power
consumption by using specialized circuit structures and concurrency to lower C and F
respectively. The drawback is that custom ASICs are inflexible and once fabricated, they
cannot be reprogrammed. Also, their high production costs and long design times often
make them an unattractive choice. While programmable perception processors are more
desirable than ASICs, ASICs still represent the “gold standard” against which perception
processors should be compared. This is because the specialized nature of an ASIC gives
it significant power, performance and die area advantages when compared to a general
purpose processor. So they represent the best possible implementation of a particular
algorithm for a given CMOS technology.
Assume that an application is required to perform N operations every t seconds to keep up with real time. Then it should be the case that:

N / (IPC_avg × F) ≤ t

Here, IPC_avg refers to the average number of instructions issued per cycle across the whole application. Further, when N / (IPC_avg × F) < t, the processor has too much performance, i.e., its frequency is too high and it wastes power. When handling constant rate
till the next real-time deadline. The overhead of reloading state holding data memories
and the instruction memory may be in the range of several thousand cycles. It is better
to slow down the processor to have just enough performance to meet real-time deadlines
rather than paying the reload penalty tens or hundreds of times per second depending on
the nature of the constant rate workload. Thus the ideal frequency of operation is:
F_ideal = N / (IPC_avg × t)    (3.2)

Substituting this back into the power equation we get:

P = ACV^2 × N / (IPC_avg × t)    (3.3)
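To make Equations 3.2 and 3.3 concrete, the following C sketch computes the ideal frequency and the resulting dynamic power for a hypothetical real-time workload; every parameter value below is an illustrative assumption, not a measurement from this dissertation.

#include <stdio.h>

int main(void)
{
    /* Hypothetical workload and circuit parameters (assumptions). */
    double N   = 200e6;  /* operations per real-time interval      */
    double t   = 1.0;    /* real-time interval in seconds          */
    double ipc = 2.0;    /* IPC_avg: average instructions per cycle */
    double A   = 0.5;    /* activity factor                        */
    double C   = 1e-9;   /* switched capacitance in farads         */
    double V   = 1.2;    /* supply voltage in volts                */

    double f_ideal = N / (ipc * t);            /* Equation 3.2 */
    double power   = A * C * V * V * f_ideal;  /* Equation 3.3 */

    printf("F_ideal = %.0f MHz, P = %.3f W\n", f_ideal / 1e6, power);
    return 0;
}

For these numbers the ideal frequency is 100 MHz and the dynamic power is 72 mW; running any faster would simply waste energy on a constant rate workload.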
3.2 Power Reduction Strategies

Equations 3.1 and 3.3 point to several power reduction strategies. For instance,
power consumption can be reduced by increasing IPC. However, modern dynamically
scheduled processors also increase the value of C when they increase IPC due to the
introduction of large reorder buffers, complex cache structures, register renaming and
support for speculative execution. Architectures that can provide high IPC without
an inordinate rise in the value of C will lead to low power consumption. This can
be achieved at the cost of generality by using simple application domain specific ILP
enhancing mechanisms as well as by taking advantage of compiler driven static ILP
improvements. Increasing the issue width causes some increase in power consumption
because of the wider structures used to support multiple issue. Since most of the ILP
extraction is done at compile time, and because the additional logic can be tailored to
take advantage of domain specific optimizations, the strategy leads to a net power savings
in the end.
Another architectural means of reducing power consumption is to decrease the ac-
tivity factor A. Clock gating provides one method of reducing the activity factor [96].
Designing structures that isolate activity happening in one part from being visible in
other parts is another useful technique. A typical example is the forwarding paths of a
super-scalar microprocessor. A forwarding mux connected to the output of a function
unit makes the value changes occurring in the final stage of that unit visible at the inputs
of other function units even when the receiving units do not need the forwarded value.
This leads to unnecessary switching activity and power dissipation at the receiving side.
When the forwarding path is not needed, the mux select signals can be manipulated
so that unnecessary value changes are not visible at the receiving side. This strategy, called operand isolation, was utilized in the IBM PowerPC 4xx embedded controllers
[27]. Operand isolation under compiler control is used as a power saving strategy for the
perception processor described in Chapter 9.
Lowering the ideal operating frequency also permits the use of a lower supply volt-
age, which results in power savings. If frequency is directly proportional to supply
voltage, Equation 3.1 predicts cubic power reduction. In reality, however,

f ∝ (V − V_t)^K_ds / V

where K_ds is a device saturation constant whose value ranges from zero to two when velocity saturation is not explicitly modeled [12]. Considering this relationship, quadratic
or linear power savings may be obtained by lowering the supply voltage and operating
frequency. This strategy capitalizes on the results produced by researchers exploring
ideal voltage selection and voltage scaling [76]. Equation 3.1 applies only within a
narrow, process specific, supply voltage range.
Ultimately, the average IPC available in an application is limited by the dependences
between instructions. Further improvements may be obtained by multithreading the
application, in which case IPCavg in Equation 3.3 corresponds to the aggregate IPCs
of the individual threads. Traditional high performance multiprocessors exact a high
energy price because of the complexities of memory system coherence and interthread
communication. By tailoring a multiprocessor system to the information flow and syn-
chronization patterns found in perception applications, it is possible to design simple
architectures that provide sufficient generality for the perception domain.
Perception applications are usually stream oriented. They consist of a pipeline of
algorithms, most of which are compute and memory intensive. Each phase typically
touches and discards a large data set in a block oriented manner, i.e., several input blocks
and a few blocks of local state are consulted to compute a block of output. There is little
or no reuse of the high bandwidth input data, which is comprised of both input signals
and massive knowledge bases that are too large to cache on-chip. One or more phases
may be executed on a processor, and multiple processors may be connected in a pipeline
fashion for efficient interphase communication while harvesting thread level parallelism.
3.3 Process Normalization
Comparing the power and performance advantages of any perception-optimized ar-
chitecture to its competition presents some problems. Typically, the competition is a
commercial general purpose processor that is implemented in a different CMOS process
than the one used to implement the perception processor. To make a fair comparison
possible, it is necessary to normalize power and delay of circuits for the minimum feature
size of the CMOS process. Three different scaling regimes will be used to evaluate
the different architectures in Chapter 10: constant field scaling, voltage scaling and
frequency scaling.
3.3.1 Constant Field Scaling
In constant field scaling, when the minimum feature size is scaled from λ to sλ, where s is a scale factor, the length and width of the channel, the oxide thickness, the substrate concentration density and the operating voltage are all scaled by the same factor s so that the electric field in the transistor remains constant. The net result is that the dynamic power consumption P is scaled to s^2·P, circuit delay T is scaled to s·T and operating frequency F is changed to F/s [109]. Correspondingly, energy consumption scales as s^3 and the energy delay product scales as s^4. Both the horizontal and vertical electric fields within a transistor must scale by the same factor for this analysis to hold.
3.3.2 Voltage Scaling
The operating speed of a circuit depends on the supply voltage. It is common practice
to optimize the power performance ratio by adjusting the supply voltage. Noise margins,
transistor threshold voltage and punch through limit the allowable range of supply volt-
age. Within the allowable range, Equation 3.1 predicts that when the supply voltage is
scaled by a factor of s_v, the dynamic power consumption scales by s_v^2.
3.3.3 Frequency Scaling
Equation 3.1 predicts that when the supply voltage is held constant and the frequency is scaled by a factor of s_f, the dynamic power consumption also scales by s_f.
In reality, most commercial systems do not undergo pure constant field scaling. To
obtain higher performance than promised by constant field scaling, slightly higher supply
voltages and correspondingly higher frequencies are used. This situation can be modeled
as voltage/frequency scaling layered on top of constant field scaling. When the supply voltage is scaled by a factor of s_v and the operating frequency is scaled by s_f, the dynamic power consumption scales by s_v^2 · s_f.
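As an illustration, the following C helper projects a measured dynamic power number from one process to another using these rules. It is a minimal sketch that assumes pure constant field scaling with voltage and frequency scaling layered on top; the function name is invented for this example.

/* Normalize dynamic power across processes.  Constant field scaling
 * from feature size l_from to l_to gives s = l_to / l_from and a
 * power factor of s^2; layered voltage scaling s_v contributes s_v^2
 * and frequency scaling s_f contributes s_f. */
double normalize_power(double P, double l_from, double l_to,
                       double s_v, double s_f)
{
    double s = l_to / l_from;
    return P * s * s * s_v * s_v * s_f;
}

For example, normalize_power(P, 0.25, 0.13, 1.0, 1.0) projects a 0.25µ measurement to 0.13µ under pure constant field scaling.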
In the interest of obtaining satisfactory noise margin, circuits are typically designed
to operate at voltages that are several times higher than the threshold voltage of tran-
sistors. Since the threshold voltage does not scale as rapidly as transistor feature size,
supply voltage cannot be reduced considerably in the future if noise margins are to be
maintained. Combined with the issue of increasing leakage current, these factors indicate
that technology scaling alone may not be adequate to alleviate the power consumption
problems of future systems.
3.4 The ET^n Metric
Power consumption, delay, throughput and energy consumption are metrics com-
monly used to compare systems. Considering each of these metrics in isolation does not
permit a fair comparison of systems because of the ability of CMOS circuits to trade
performance for energy. When multiple criteria need to be optimized simultaneously,
it is common to optimize their weighted product. In the case of energy and time, this
product may be represented as the metric M for a circuit configuration C such that:
M(C) = ET^n
Here n is a weight that represents the relative importance of the two criteria. The ET^n metric was first proposed by Martin, Nystroem and Penzes [64]. Since energy and time can be traded off for each other, consider the infinitesimally small quantity of energy ∆E that needs to be expended to reduce the time for a computation by an infinitesimally small amount ∆T. Using Newton's binomial expansion and ignoring products and higher powers of ∆E and ∆T we get:

M(C′) = (E + ∆E)(T − ∆T)^n ≈ ET^n − nET^(n−1)∆T + T^n∆E
If this new operating point is equivalent to the old operating point under the metric M:

ET^n − nET^(n−1)∆T + T^n∆E = ET^n

Rearranging this equation yields:

∆E / E = n ∆T / T    (3.4)
Intuitively, this means that a small reduction in time is considered n times more
valuable than a corresponding reduction in energy. For example, if n = 1, a 1% reduction
in time is considered worth paying a 1% increase in energy. If n = 2, then it is acceptable
to pay for a 1% increase in performance with a 2% increase in energy consumption. In
general, when n = 1, energy and delay are equally important, when n > 1 performance
is valued more than energy and when 0 < n < 1 energy savings are considered more
important than performance. The case of n = 0 optimizes just for energy and n = −1
optimizes for power. Other negative values of n are not useful for optimization since
ET^n changes in opposite directions for improvements in energy and delay.
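The metric itself is trivial to compute. The C sketch below compares two design points under ET^n for a chosen weight n; the function names are invented for this illustration.

#include <math.h>

/* ET^n metric for a design point with energy E (joules) and delay T
 * (seconds).  n = 0 optimizes energy alone, n = 1 weights energy and
 * delay equally, and n = 2 stresses performance. */
double etn_metric(double E, double T, double n)
{
    return E * pow(T, n);
}

/* Returns 1 if design point a is better (smaller metric) than b. */
int etn_better(double Ea, double Ta, double Eb, double Tb, double n)
{
    return etn_metric(Ea, Ta, n) < etn_metric(Eb, Tb, n);
}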
3.5 Energy Delay Squared Product
Martin, Nystroem and Penzes proposed ET^2 as a special case of the ET^n metric that is voltage independent [64]. They proved mathematically that an ET^n optimal design
is optimal irrespective of the value of n. There are a few caveats to this result. It
applies only when the circuit is operating within its normal range, i.e., supply voltage
is not close to the threshold voltage or to the velocity saturated region of a transistor.
The intuition behind their formulation is that two circuits with different supply voltages,
power consumptions and performance may be compared by voltage/frequency scaling
the systems until either their supply voltage or their frequency matches. Then the system
with the better power consumption or performance may be picked. Unfortunately, if the
initial difference in performance is too large as in the case of a 2.4 GHz Pentium 4 and a
400 MHz XScale described in Chapter 10, the scaled voltage will be outside the operating
range. For example, if the Pentium operating at 1.6 volts has 10 times the performance of
the XScale, to equalize their performance the Pentium’s voltage needs to be to be scaled
down to approximately 0.16 volts. This assumes that operating frequency scales linearly
with supply voltage, an approximation that applies only in an extremely narrow voltage
range. The new supply voltage of 0.16 volts is bound to be smaller than the threshold
voltage of the 0.13µ CMOS process in which the Pentium is fabricated. So the Pentium
will not operate correctly at that voltage. Since the scaled supply voltage is not within
the normal voltage range, the metric equivalent optimality promised by Martin et al. will
not apply.
Results presented in Chapter 10 use E, ET and ET^2 as metrics. The choice of E gives an advantage to systems like the XScale processor that stress energy efficiency over performance. The choice of ET favors systems like the perception processor that value both performance and energy efficiency. ET^2 favors high performance processors like the Pentium, whose design allocates a large expenditure of energy in return for small improvements in performance. Since the range of supply voltage required to equalize the performance of the XScale and Pentium systems is outside the operating range for transistors in the 0.13µ technology in which the Pentium 4 is implemented, this dissertation uses ET^2 merely as a metric that stresses performance over energy savings. No claims are made about metric equivalent optimality of the circuits for values of n other than two.
CHAPTER 4
SPEECH RECOGNITION
Modern approaches to large vocabulary continuous speech recognition are surpris-
ingly similar in terms of their high-level structure [111]. The work described herein is
based on the CMU Sphinx 3.2 system, but the general approach is applicable to other
speech recognizers [49, 74]. The explanation of large vocabulary continuous speech
recognition (LVCSR) in this chapter is based on a simple probabilistic model presented
in [80, 111]. The human vocal apparatus has mechanical limitations that prevent rapid
changes to sound generated by the vocal tract. As a result, speech signals may be
considered stationary, i.e., their spectral characteristics remain relatively unchanged for
several milliseconds at a time. DSP techniques may be used to summarize the spec-
tral characteristics of a speech signal into a sequence of acoustic observation vectors.
Typically, 100 such vectors will be used to represent one second of speech. Speech
recognition then becomes a statistical problem of deriving the word sequence that has
the highest likelihood of corresponding to the observed sequence of acoustic vectors.
This notion is captured by the equation:
Ŵ = argmax_W P(W|Y)    (4.1)
Here, W = w_1, w_2, ..., w_n is a sequence of n words and Y = y_1, y_2, ..., y_T is a sequence of T acoustic observation vectors. Equation 4.1 may be read as: Ŵ is the particular word sequence W which has maximum a posteriori probability given the observation sequence Y. Using Bayes' rule, this equation may be rewritten as:
Ŵ = argmax_W [ P(Y|W) P(W) / P(Y) ]    (4.2)
P(Y|W) denotes the probability of the acoustic vector sequence Y given the word sequence W. P(W) denotes the probability with which the word sequence W occurs in the language. P(Y) denotes the probability with which the acoustic vector sequence Y occurs in the spoken language. P(Y) is independent of the word sequence, therefore Ŵ can be computed without knowing P(Y). Thus Equation 4.2 may be rewritten as:
Ŵ = argmax_W P(Y|W) P(W)    (4.3)
The set of DSP algorithms that convert the speech signal into the acoustic vector se-
quence Y is commonly referred to as the front end. The quantity P(Y|W) is generated by evaluating an acoustic model. The term P(W) is generated from a language model.
4.1 Front End
The signal processing front end summarizes the spectral characteristics of the speech
waveform into a sequence of acoustic vectors that are suitable for processing by the
acoustic model. Figure 4.1 shows the stages of this transformation.
Frame Blocking: The digitized speech signal is blocked into overlapping frames. It
is common to have 100 frames per second, so a new frame is started every 10 ms. A new
frame contains the last 7.5 ms of the previous frame’s data and the first 7.5 ms of the
next frame’s data. Thus, even though a new frame is made every 10 ms, each frame is 25
ms in duration. The overlap decreases problems that might otherwise occur due to signal
data discontinuity.
Preemphasis: This stage spectrally flattens the frame using a first order filter. The
transformation may be described as:
Y_0[n] = x[n] − α × x[n−1],  0.9 ≤ α ≤ 1,  0 < n < Samples per frame
Here, x[n] refers to the nth speech sample in the frame. Sphinx uses α = 0.97 and the
sampling rate is typically 8K or 16K 16-bit samples per second.
Figure 4.1. Signal Processing Front End
Hamming Window: In this stage a Hamming window is applied to the frame to
minimize the effect of discontinuities at the edges of the frame during FFT. The transfor-
mation is:
Y_1[n] = x[n] × H[n],  0 < n < Frame size

The vector H[n] is computed using the following equation:

H[n] = 0.54 − 0.46 × cos(2πn / (Frame size − 1))
The constants used in the H[n] transform were obtained from the Sphinx source code.
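The first two arithmetic stages reduce to simple elementwise loops. The following C sketch implements preemphasis followed by Hamming windowing for one frame; the frame size and the way the sample preceding the frame is supplied are illustrative assumptions, not Sphinx's actual code.

#include <math.h>

#define FRAME_SIZE 400    /* 25 ms at a 16 KHz sampling rate (assumption) */
#define ALPHA      0.97f  /* preemphasis coefficient used by Sphinx       */
#define PI         3.14159265358979f

void preemphasize_and_window(const float x[FRAME_SIZE], float prev_sample,
                             float y[FRAME_SIZE])
{
    /* Preemphasis: Y0[n] = x[n] - alpha * x[n-1]; prev_sample stands in
     * for the last sample of the preceding frame, i.e., x[-1]. */
    y[0] = x[0] - ALPHA * prev_sample;
    for (int n = 1; n < FRAME_SIZE; n++)
        y[n] = x[n] - ALPHA * x[n - 1];

    /* Hamming window: H[n] = 0.54 - 0.46 * cos(2*pi*n / (FRAME_SIZE - 1)). */
    for (int n = 0; n < FRAME_SIZE; n++)
        y[n] *= 0.54f - 0.46f * cosf(2.0f * PI * n / (FRAME_SIZE - 1));
}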
FFT: The frame is padded with enough zeroes to make the frame size a power of two
(call this N) and a Fourier transform is used to convert the frame from the time domain
to the frequency domain.
Y_2 = DFT(Y_1)
The square of the magnitude is then computed for each frequency component. Thus the
results are real numbers rather than the complex output produced by a discrete Fourier
transform.
Y_3[n] = real(Y_2[n])^2 + imag(Y_2[n])^2,  0 < n ≤ N/2
Mel Filter Bank: A set of triangular filter banks is used to approximate the frequency
resolution of the human ear. The Mel frequency scale is linear up to 1000 Hz and
logarithmic thereafter. A set of overlapping Mel filters are made such that their center
frequencies are equidistant on the Mel scale. The transformation is:

Y_4[n] = Σ_{i=0}^{N/2} Y_3[i] × MelWeight[n][i],  0 < n < Number of filters
For a 16 KHz sampling rate, Sphinx uses a set of 40 Mel filters.
Log Compression: The range of the values generated by the Mel filter bank is
reduced by replacing each value by its natural logarithm. This is done to make the
statistical distribution of the spectrum approximately Gaussian – a requirement for the
subsequent acoustic model. The transformation is:
Y_5[n] = ln(Y_4[n]),  0 < n < Number of filters
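Both of these stages reduce to a dense dot product followed by a logarithm. A minimal C sketch, assuming precomputed triangular filter weights and an illustrative FFT size of N = 512:

#include <math.h>

#define NUM_FILTERS 40    /* Sphinx's filter count at 16 KHz */
#define NFFT_HALF   256   /* N/2 spectral bins for N = 512   */

void mel_filter_and_log(const float Y3[NFFT_HALF],
                        const float MelWeight[NUM_FILTERS][NFFT_HALF],
                        float Y5[NUM_FILTERS])
{
    for (int n = 0; n < NUM_FILTERS; n++) {
        /* Y4[n] = sum over i of Y3[i] * MelWeight[n][i] */
        float acc = 0.0f;
        for (int i = 0; i < NFFT_HALF; i++)
            acc += Y3[i] * MelWeight[n][i];
        /* Y5[n] = ln(Y4[n]) */
        Y5[n] = logf(acc);
    }
}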
DCT: The discrete cosine transform is used to compress the spectral information into
a set of low order coefficients. This representation is called the Mel-cepstrum. Currently
Sphinx compresses the 40 element vector Y5 into a 13 element cepstral vector. The
transformation is:
Y_6 = DCT(Y_5)
Numerical differentiation: Acoustic modeling assumes that each acoustic vector is
uncorrelated with its predecessors and successors. Since speech signals are continuous,
this assumption is problematic. The traditional solution is to augment the cepstral vector
with its first and second differentials. Since the Mel cepstral vector is 13 elements long in Sphinx, the final acoustic vector, after the differentials are appended, is 39 elements in length.
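A sketch of the differencing step in C, assuming a simple one-frame window on either side; Sphinx's actual delta computation uses a wider window, so this is illustrative only.

#define NCEP 13   /* cepstral coefficients per frame */

/* Build the 39-element acoustic vector from three consecutive cepstral
 * vectors: the cepstrum itself, its first differential (delta), and
 * its second differential (delta-delta). */
void append_deltas(const float prev[NCEP], const float cur[NCEP],
                   const float next[NCEP], float out[3 * NCEP])
{
    for (int i = 0; i < NCEP; i++) {
        out[i]            = cur[i];
        out[NCEP + i]     = next[i] - prev[i];                       /* delta       */
        out[2 * NCEP + i] = (next[i] - cur[i]) - (cur[i] - prev[i]); /* delta-delta */
    }
}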
Summary: The Sphinx front end transforms a 25 ms speech sample into a 39 element
vector of real numbers that represents the spectral characteristics of the waveform in a
compact form. The speech signal is blocked into overlapping frames spaced 10 ms apart.
Thus the front end transforms one second of speech into a series of 100 acoustic vectors.
Even though the front end occupies less than 1% of the compute cycles of Sphinx 3.2, it is very important for two reasons.
1. Understanding acoustic vectors is a crucial prerequisite to illustrate the operation
of the acoustic model.
2. The front end is dominated by floating point computations that make it very prob-
lematic to run on embedded processors without floating point hardware. Fixed
point versions are difficult to create and analyze, but have been studied in the
literature. Delaney described a fixed point speech front end for Sphinx which
performed 34 times better on an embedded processor than a floating point front
end that uses software emulated floating point operations [32].
4.2 Acoustic Model
Equation 4.3 needs the quantity P(Y|W), the probability of an acoustic vector sequence Y given a word sequence W, to find the most probable word sequence. A
simplistic approach to achieve this would be to obtain several samples of each possible
word sequence, convert each sample to the corresponding acoustic vector sequence and
compute a statistical similarity metric for the given acoustic vector sequence Y to the set
of known samples. For large vocabulary speech recognition this is not feasible because
the set of possible word sequences is very large. Instead words may be represented as
sequences of basic sounds. Knowing the statistical correspondence between the basic
sounds and acoustic vectors, the required probability can be computed.
The basic sounds from which word pronunciations can be composed are known as
phones or phonemes. Approximately 50 phones may be used to pronounce any word
in the English language. For example the CMU dictionary enlists the pronunciation for
dissertation as:
DISSERTATION D IH S ER T EY SH AH N
While phones are an excellent means of encoding word pronunciation, they are less than
ideal for recognizing speech. The mechanical limits of the human vocal apparatus lead
to co-articulation effects where the beginning and end of a phone are modified by the
preceding and succeeding phones. Recognizing multiple phone units in context tends
to be more accurate than recognizing individual phones. Current speech recognition
systems deal with three-tuples of phones called triphones. It is customary to denote
triphones as left context−current phone+right context. For example SH-AH+N is
a triphone that represents the context of the AH phone in the word dissertation. The final
N phone in “dissertation” can be modeled with a cross-word triphone whose right context
is the first phone in the next word or by the triphone AH-N+SIL where SIL is a special
phone that denotes silence. Although there are approximately 50 × 50 × 50 = 125,000
possible triphones, only about 60,000 actually occur in English.
The probability that an acoustic vector sequence corresponds to a particular triphone
may be estimated using a Hidden Markov Model (HMM). Current speech recognizers use
an HMM model with three internal states and an entry and an exit state. The topology
of the HMM is shown in Figure 4.2.

Figure 4.2. Triphone HMM

An HMM is a probabilistic finite state machine that generates observation sequences. If the model is in state S_i at time step t, then it has a probability B_i(Y_t) of producing the acoustic vector Y_t and it switches to state S_j with probability A_ij. The problem of computing P(Y|W) now becomes what is known
as the evaluation problem for HMMs – the problem of estimating the probability with
which a given HMM could have generated the observation sequence Y . The evaluation
problem can be solved using the Forward/Backward algorithm for HMMs, but since the
optimal state sequence is needed at a later stage, it is common to do a more expensive
Viterbi search which can compute the probability and uncover the optimal state sequence
simultaneously [80].
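To make the search step concrete, the following C sketch performs one Viterbi time step for a single three-state left-to-right HMM in the log domain. The data layout, and the assumption that any earlier state can reach any later state, are simplifications for illustration; Sphinx's HMM evaluation code differs.

#define NSTATES 3

/* One Viterbi update.  score[s] holds the best log probability of
 * reaching state s, logA[i][j] the log transition probability A_ij,
 * and logB[s] the log output probability B_s(Y_t) for the current
 * acoustic vector Y_t. */
void hmm_viterbi_step(float score[NSTATES],
                      const float logA[NSTATES][NSTATES],
                      const float logB[NSTATES])
{
    float next[NSTATES];
    for (int j = 0; j < NSTATES; j++) {
        float best = -1e30f;             /* stand-in for log(0)    */
        for (int i = 0; i <= j; i++) {   /* left-to-right topology */
            float v = score[i] + logA[i][j];
            if (v > best)
                best = v;
        }
        next[j] = best + logB[j];
    }
    for (int j = 0; j < NSTATES; j++)
        score[j] = next[j];
}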
4.3 Language Model
The accuracy of the recognition hypotheses produced by the acoustic model can be further enhanced using a language model. The acoustic model might produce several similar alternative words that the language model helps to disambiguate. Language models
are also useful in limiting search time for beam search based acoustic models. N-gram
models which predict the probability of a word based on the previous N − 1 words are
a common and effective approach. Current systems like Sphinx and HTK favor models
with N=3, which are called trigrams. While there are alternatives to N-gram models that
rely on grammar, syntax, subject verb agreement and trigger words, N-gram models have
the distinct advantage of being easy to train since N-gram probabilities can be easily
estimated from a large corpus of text automatically. A trigram model may be trained
simply by using the equation:
P(w3 | w1, w2) = F(w1, w2, w3) / F(w1, w2)
Here, F(w1, w2, w3) refers to the frequency of occurrence of the trigram (w1, w2, w3) in the training text and F(w1, w2) refers to the frequency of occurrence of the bigram (w1, w2). In practice, for a large vocabulary all possible trigrams will not be present in
the training corpus. In that case bigram or unigram probabilities are used in the place of
trigram probabilities after reducing the probability by a back-off weight, which accounts
for the fact that the next higher n-gram has not been seen and therefore has a lower chance
of occurring.
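In code, the back-off scheme is a cascade of lookups. The C sketch below shows the control flow with stub model accessors; the function names and the log-domain convention are assumptions made for this illustration and do not reflect Sphinx's interfaces.

/* Stub model accessors standing in for the recognizer's real
 * hash-table lookups; they are hard-wired here so the example links. */
static int lookup_trigram(int w1, int w2, int w3, float *lp)
{ (void)w1; (void)w2; (void)w3; (void)lp; return 0; }
static int lookup_bigram(int w2, int w3, float *lp)
{ (void)w2; (void)w3; (void)lp; return 0; }
static float backoff2(int w1, int w2) { (void)w1; (void)w2; return -0.5f; }
static float backoff1(int w2)         { (void)w2; return -0.7f; }
static float unigram_logprob(int w3)  { (void)w3; return -4.0f; }

/* P(w3 | w1, w2) in the log domain with back-off. */
float trigram_logprob(int w1, int w2, int w3)
{
    float lp;
    if (lookup_trigram(w1, w2, w3, &lp))
        return lp;                             /* trigram seen in training */
    if (lookup_bigram(w2, w3, &lp))
        return backoff2(w1, w2) + lp;          /* back off to bigram  */
    return backoff1(w2) + unigram_logprob(w3); /* back off to unigram */
}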
4.4 Overall Operation

HMMs are constructed for all known triphones. A pronunciation dictionary is used
to convert words into triphone sequences with overlapping contexts. For example the
isolated word dissertation whose pronunciation is the phone sequence D IH S ER T
EY SH AH N is expanded to SIL-D+IH, IH-S+ER, S-ER+T, ER-T+EY, T-EY+SH, EY-
SH+AH, SH-AH+N, AH-N+SIL. There are many more expansions corresponding to all
words that could possibly precede or succeed this word in a sentence. These are words
that could end in D+IH or start with AH-N. A data structure known as a lexical tree
(Sphinx terminology) is constructed, and all words in the dictionary are entered in the
lexical tree. The roots of the tree correspond to the set of all triphones that start any
word in the dictionary. Each node in the tree points to the next triphone in the expanded
pronunciation of a word. Common triphone sequences may be shared within the tree.
The overall effect is that of combining all the triphone HMMs by adding null transitions
between the final states of one triphone HMM to the initial state of its successor. To
model continuous speech, null transitions are added from the final state of each word
to the initial state of all words. Triphones that occur at the end of a word are specially
marked so that a language model may be consulted at those points. Thus the lexical
tree is a multirooted tree where each node points to an HMM and a successor node. In
the case of word exit triphones there are multiple successors. Given an acoustic vector
sequence Y , each vector in the sequence is applied successively to the HMMs and the
probability that the HMM generated that vector is noted. Transitions are made in each
step to successor nodes. On reaching a word exit triphone, the state sequence history is
consulted to find the word that has been recognized. The last n words (usually n=3) are
checked against a language model for further analysis. The search is done by means of
a well known dynamic programming algorithm known as Viterbi beam search [74]. The
acoustic and language models are strongly coupled, though language model evaluation
may be deferred until the acoustic model has been evaluated. Together, they consume
almost 99% of the run time of Sphinx.
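A sketch of what one lexical tree node might look like in C; the field names are invented for illustration and do not match Sphinx's source.

struct hmm;   /* triphone HMM instance, defined elsewhere */

/* One node of the multirooted lexical tree.  Each node points to the
 * HMM for one triphone and to its successor triphones.  Word exit
 * nodes record the word so the language model can be consulted. */
struct lextree_node {
    struct hmm           *hmm;          /* HMM evaluated at this node        */
    struct lextree_node **succ;         /* successor nodes                   */
    int                   num_succ;     /* multiple successors at word exits */
    int                   word_id;      /* valid only at word exits          */
    unsigned              is_word_exit; /* consult language model here       */
};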
4.5 Architectural Implications
A basic understanding of the acoustic and language models is necessary to understand
the architectural implications and scaling characteristics of speech recognition. The
lexical tree is a complex data structure that results in considerable pointer chasing at
run time. The nodes that will be accessed depend very much on the sentences being
spoken. The size of the tree depends on the vocabulary size. However there is scope for
architectural optimization. The opportunity stems from the fact that acoustic vectors are evaluated successively: if, on evaluating an HMM for the current vector, the HMM generates a probability above a certain threshold, the successors of that HMM will be evaluated in the next time step. Thus there is always a list of currently active
HMMs/lextree nodes and a list of nodes that will be active next. Evaluating each HMM
takes a deterministic number of operations and thus a fixed number of clock cycles. This
information can be used to prefetch nodes ahead of when they are evaluated.
Given that the numbers of triphones and words in a language are relatively stable, it might appear that the workload will never grow. In reality this is not the case, due to the probability density function B_i(Y_t). In the past, speech recognizers
used subvector quantized models, which are easy to compute. These methods use a
code book to store reference acoustic vectors. Acoustic vectors obtained from the front
end are compared against the code book to find the index c of the closest match. The
probability density function then reduces to a table lookup of the form B[i][c]. While
this is computationally efficient, the discretization of observation probability leads to
excessive quantization error and thereby poor recognition accuracy.
To obtain better accuracy, modern systems use a continuous probability density func-
tion and the common choice is a multivariate mixture Gaussian in which case the com-
putation may be represented as:
B_i(Y_t) = Σ_{m=1}^{M} c_im Σ_{n=1}^{N} (Y_t[n] − µ_im[n])^2 × V_im[n]    (4.4)
Here, µ_im is the mean and V_im the variance of the Gaussian mixture, and c_im is the weight of the mixture. The Hub-4 speech model used for this research was obtained from CMU; it uses M = 8 and N = 39. Note that the outer summation over m denotes an addition in the logarithmic domain. Normally the inner term involves exponentiation to compute a weighted Mahalanobis-like distance, but it is reduced to simple arithmetic operators by keeping all the parameters in the logarithmic domain [91, 111]. Therefore the outer summation needs to be done in the logarithmic domain. This may be implemented using table lookup based extrapolation. This strategy is troublesome if the processor's L1 D-cache is not large enough to contain the lookup table.
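A common realization of the logarithmic addition uses the identity ln(e^x + e^y) = max(x, y) + ln(1 + e^-|x-y|), with the second term taken from a precomputed table indexed by the quantized difference. The C sketch below illustrates the idea; the table size and quantization step are assumptions, not the values used by Sphinx.

#include <math.h>

#define LOGADD_TABLE_SIZE 20000
#define LOGADD_SCALE      1000.0f   /* table entries per unit of |x - y| */

static float logadd_table[LOGADD_TABLE_SIZE];

void logadd_init(void)
{
    for (int i = 0; i < LOGADD_TABLE_SIZE; i++)
        logadd_table[i] = log1pf(expf(-(float)i / LOGADD_SCALE));
}

/* ln(e^x + e^y) via table lookup on the difference d = |x - y|. */
float log_add(float x, float y)
{
    float m = (x > y) ? x : y;
    float d = (x > y) ? x - y : y - x;
    int idx = (int)(d * LOGADD_SCALE);
    return (idx < LOGADD_TABLE_SIZE) ? m + logadd_table[idx] : m;
}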
If each HMM state uses a separate probability density function, then the system is
said to be fully continuous. Thus the peak workload for an English speech recognizer
would correspond to the evaluation of about 60,000 probability density functions and
HMMs, as well as an associated lextree traversal that is proportional to the number of
words in the vocabulary. Fully continuous models are not popular for two reasons:
1. Their computational complexity makes them orders of magnitude slower than real
time on current processors.
2. Their parameter estimation problem and sparse training sets lead to low recognition
accuracy.
The parameter estimation problem is particularly difficult. For M = 8 and N = 39, Equation 4.4 needs 39 × 2 × 8 + 8 = 632 parameters per state for the values of µ_im, V_im and c_im. With three emitting states per HMM and a total of 60,000 triphones, this adds up to 113.7 million parameters. The training data
is often insufficient to estimate that many parameters, so the use of continuous models
leads to increased word error rate. The usual solution is to cluster together HMM states
and share a probability density function among several states. Such clustering methods
are an area of active research. A speech recognition system that uses clustered probability
density functions is called a semicontinuous or tied-mixture system. Almost all advanced
large vocabulary speech recognizers currently fall in this category. The Hub-4 speech
model used to evaluate Sphinx 3.2 contains approximately 6000 probability density
functions representing an average of 30 HMM states sharing a single function. This
ratio could change when a model is trained on a larger data set leading to proportionately
increased compute complexity. Another possibility is an increase in M , the number of
mixtures per function, which will again proportionately increase the compute cycles. A
third possibility is increasing the size of the context from triphones to quinphones (five
phones, one current phone and two left and two right neighbors). The use of quinphones
will lead to an increase in the number of probability density functions that need to be
evaluated. This will be further multiplied by the number of quinphones in the language
vs the number of triphones.
Though traditional speech recognizers couple the evaluation of HMMs and Gaussians
tightly, in the interest of extracting greater levels of thread parallelism, it is possible to
decouple HMM and Gaussian evaluation, an approach that will be further investigated in
Chapter 5.
CHAPTER 5
CHARACTERIZATION AND OPTIMIZATION OF
SPHINX 3
Chapter 4 described the Front end (FE), Gaussian (GAU) and Search (HMM) phases
of the Sphinx 3.2 speech recognition system. To fully characterize the complex behavior
of Sphinx, it is necessary to study the individual phases separately. In addition to the
FE, GAU and HMM phases, Sphinx has a lengthy startup phase and extremely large
data structures which could cause high TLB miss rates on embedded platforms with
limited TLB reach. To avoid performance characteristics being aliased by startup cost
and the TLB miss rate, Sphinx was modified to support check-pointing and fast restart.
For embedded platforms, the check-pointed data structures may be moved to ROM in
a physically mapped segment similar to kseg0 in MIPS processors [71]. Results in this
chapter are based on this low startup cost version of Sphinx, referred to as original.
Previous studies have not characterized the three phases separately [6, 59]. To capture
the phase characteristics and to separate optimizations for embedded architectures, a
phased version of Sphinx was developed so that each of the FE, GAU and HMM phases
can be run independently with input and output data redirected to intermediate files.
In the rest of this chapter, FE, GAU and HMM refer to the corresponding phase run in isolation, while phased refers to all three chained sequentially with no feedback. In phased, FE and HMM are identical to original, while the workload of GAU is increased
by the lack of dynamic feedback from HMM. Breaking this feedback path exposes
parallelism in each phase and allows the phases to be pipelined. GAU OPT refers to
a cache optimized version of the GAU phase alone. PAR runs each of the FE, GAU OPT
and HMM phases on separate processors. It also uses the same cache optimizations as
GAU OPT.
Both simulation and native profiling tools were used to analyze Sphinx 3. Simulations
provide flexibility and a high degree of observability, while profiled execution on a real
platform provides realistic performance measures and serves as a way to validate the
accuracy of the simulator. The configurations used to analyze Sphinx 3 are shown in
Table 5.1.
A multi-GHz processor is required to operate Sphinx in real time. Parameters like L1
cache hit time, memory access time and floating point latency were measured on a 1.7
GHz AMD Athlon processor using the lmbench hardware performance analysis bench-
mark [68]. Numbers that could not be directly measured were obtained from vendor
microarchitecture references [51, 5]. The Simplescalar simulator was then configured to
reflect these parameters [19]. Unless mentioned otherwise, the remainder of this chapter
uses the default configuration from Table 5.1.
Native profiling indicates that the original Sphinx spends approximately 0.89%, 49.8%
and 49.3% of its compute cycles in the FE, GAU and HMM phases respectively. Another
recent study found that as much as 70% of another speech recognizer's execution time was
Table 5.1. Experiment Parameters

Native Execution:
  SGI Onyx3, 32 R12K processors at 400 MHz
  32 KB 2-way IL1, 32 KB 2-way DL1, 8 MB L2
  Software: IRIX 64, MIPS Pro compiler, Perfex, Speedshop

Simulator (default configuration):
  SimpleScalar 3.0, out of order CPU model, PISA ISA
  8 KB 2-way IL1, 2 cycle latency; 32 KB 2-way DL1, 4 cycle latency
  2 MB 2-way L2, 20 cycle latency; 228 cycle DRAM latency
  L1 line size 64 bytes, L2 line size 128 bytes
  Software: gcc 2.6.3

ILP Experiment Configurations:
  Reasonable: 32 KB DL1, 4 cycle latency; 2 MB L2, 20 cycle latency; 2 memory ports
  Aggressive: 32 KB DL1, 2 cycle latency; 8 MB L2, 20 cycle latency; 4 memory ports
spent in Gaussian probability computation [59]. In the phased version approximately
0.74%, 55.5% and 41.3% of time was spent in FE, GAU and HMM respectively. Since
FE is such a small component of the execution time, the rest of this work excludes it and
concentrates on the analysis of the GAU and HMM phases.
5.1 Memory System Behavior
Figures 5.1 and 5.2 show the L1 Dcache and L2 cache miss rates for original, phased,
FE, HMM and GAU for a variety of configurations. Since earlier studies showed that
larger line sizes benefit Sphinx II, 64 byte L1 and 128 byte L2 cache line sizes were
chosen [6]. In addition, the L2 cache experiments assume a 32 KB L1 Dcache. Both
figures assume an 8 KB Icache. Since Sphinx has an extremely low instruction cache
miss rate of 0.08% for an 8 KB Icache, no other Icache experiments were done. The
(Bar chart: L1 Dcache miss rate in percent versus L1 data cache size (8 KB to 64 KB, plus the measured 32 KB SGI point) for the Original, Phased, Phased OPT, FE, GAU, GAU OPT and HMM configurations.)
Figure 5.1. L1 Dcache Miss Rate
(Bar chart: L2 cache miss rate in percent versus L2 cache size (256 KB to 8 MB, plus the measured SGI 8 MB point) for the Original, Phased, Phased Opt, FE, GAU, GAU OPT and HMM configurations.)
Figure 5.2. L2 Cache Miss Rate
The SGI data provide a reality check since they represent results obtained using hardware
performance counters. The SGI L2 results are very similar in character to the 8 MB
simulation results in spite of the effects of out of order execution, memory system latency
and differences in cache replacement policy. The L1 results are not directly comparable
since the R12000 uses a 32 byte L1 line size and suffers from cache pollution induced
by abundant DTLB misses.
Figure 5.3 shows the average bandwidth required to process the workload in real
time. This is obtained by dividing the total L2 to memory traffic while Sphinx operates
on a speech file by the duration in seconds of the speech signal. The evidence suggests
that bandwidth starvation leading to stalls on L2 misses is the reason this application
is not able to meet real-time requirements. The memory bandwidth required for this
application is several times higher than what is available in practice. Note that available
bandwidth is always significantly less than the theoretical peak on most architectures. A
16-fold improvement in L2 size from 256 KB (the L2 size of a 1.7 GHz Athlon) to 8 MB
(SGI Onyx) produces only a very small decrease in the bandwidth requirement of GAU.
(Bar chart: L2-to-memory bandwidth in MB/s versus L2 cache size (256 KB to 8 MB, plus SGI 8 MB) for the Original, Phased, Phased Opt, GAU, GAU Opt and HMM configurations.)
Figure 5.3. L2 to Memory Bandwidth
This phase essentially works in stream mode, making 100 sequential passes per second over a 14 MB Gaussian table, which would amount to roughly 1.4 GB/s if every table entry were touched. The speech signal itself contributes only 16 KB/s to the total bandwidth requirements. Some computation saving heuristics in Sphinx also have
the beneficial side effect of helping to save bandwidth by not touching blocks that are
deemed improbable. Until the L2 size reaches 8 MB, long term reuse of Gaussian table
entries in the L2 is infrequent. It should be noted that the bandwidth requirement of GAU
in isolation is more severe than if it were operating inside original, since feedback driven
heuristics cannot be applied.
5.2 ILP in Sphinx
Before exploring special-purpose architecture extensions for speech, it is worthwhile
to investigate the limits of modern architectures. GAU is a floating point dominant code
while HMM is dominated by integer computations. GAU also appears to be easily vec-
torizable. Two simulation studies were undertaken to explore possibilities for extracting
ILP. For GAU, a surplus of integer ALUs was provided and the number of floating point
units was varied. Since this algorithm uses an equal number of multiplies and adds, the
number of floating point adders and multipliers were increased in equal numbers from
one to four, which corresponds to the X axis varying from two to eight FPUs in Figure
5.4. Two different memory system hierarchies were considered: a reasonable one for
a multi-GHz processor and an aggressive memory system with lower latencies. Both
configurations are summarized in Table 5.1.
The SGI-2+2f entry describes the measured total IPC on the R12000, which has two
integer and two floating point units. The SGI-2f entry is the measured floating point IPC
alone. In the case of GAU, IPC remains low because of insufficient memory bandwidth
to keep the FPUs active. In the case of the R12000, which can issue two floating
point operations per cycle, the IPC for this loop is an underwhelming 0.37. GAU OPT uncovers opportunities for ILP by virtue of its cache optimizations, thereby improving IPC greatly. However, the IPC saturates at 1.2 in spite of available function units.
(Bar charts: GAU and GAU OPT IPC versus number of FPUs (2 to 8), plus the measured SGI 2+2f and SGI 2f points, for the Reasonable and Aggressive memory configurations.)
Figure 5.4. GAU and GAU OPT IPC
A recently published study also indicated IPC in the range of 0.4 to 1.2 for another speech
recognizer [59]. Clearly, the architecture and compiler are unable to automatically extract
the available ILP, which again argues for custom acceleration strategies.
Figure 5.5 shows the corresponding experiment for the HMM phase. In this experi-
ment, the number of integer adders and multipliers is varied equally from one to four.
In spite of available execution resources, IPC remains low. It should be noted that in both
experiments, the SGI results are indicative of cases where the CPU to memory clock ratio
is low. This ratio will undoubtedly increase in the future.
The observations from sections 5.1 and 5.2 have several implications:
1. If speech is an “always on” background application, it could cause significant L2
cache pollution and memory bandwidth degradation to the foreground application.
To guarantee real-time processing, it might be better to stream data around the L2
rather than pollute it.
(Bar chart: HMM IPC versus number of ALUs (2 to 8), plus the measured SGI-2 point, for the Reasonable and Aggressive memory configurations.)
Figure 5.5. HMM IPC
2. Since the L2 cache is one of the largest sources of capacitance on the chip, ac-
cessing it for stream data incurs a large power overhead. Low power embedded
platforms may not need any L2 cache at all, since dramatic increases in L2 size are not accompanied by corresponding reductions in DRAM bandwidth requirements or improvements in performance.
3. Bandwidth reduction is important for its own sake as well as to reduce power
consumption. Bandwidth partitioning so that each phase has independent access
to its data set is important.
5.3 Results of Software Optimizations
Since Sphinx was shown to have poor cache behavior, cache optimizations were investigated. To extract greater levels of parallelism, the application was multithreaded.
5.3.1 Cache Optimizations
In Section 5.1, GAU was shown to be bandwidth starved. The GAU code in phased
was instrumented and found to require approximately twice the amount of computation
as in original. However, Figure 5.6 shows that phased runs at 0.85 times the speed of original on an R12000. Clearly, a large fraction of the excess computation is hidden
by memory latency. With processor to memory speed ratios increasing in the future, an
out of order processor can hide an even larger amount of compute overhead. The key is
to improve the memory system behavior without an unreasonable increase in compute
requirements.
To achieve this goal, two transformations were performed on phased. First, a block-
ing optimization similar in spirit to loop tiling was performed, which delays the initial
speech signal by 100 ms or 10 frames. The Gaussian probabilities for all 10 frames are
computed by making a single pass over the Gaussian tables. This effectively reduces the
number of passes to 10 per second where original would have done 100. The blocking
factor is limited to 10 to avoid a perceptible real-time lag at the decoder output.
It should be noted that this is not a blocking or tiling transformation that a compiler
could perform. The software had to be restructured to accumulate 10 frames of the
(Bar chart of measured speedup relative to Original: Original 1.00, Phased 0.85, Phased Opt 1.05, Par 1.67, Amdahl 1.97, Real time 2.79.)
Figure 5.6. Measured Speedup on R12K
speech signal and to process 10 frames in one pass. Further, this became possible only
because the feedback between HMM and GAU was eliminated. Speech researchers ad-
vancing the state of their art are unlikely to be interested in or aware of architectural level
implications. Thus, it is imperative that architecture researchers analyze the performance
implications of important perception applications like speech recognition.
Sphinx allocates the mean and variance vectors used for Gaussian computation de-
scribed in Section 4.5 separately. Every component evaluation consumes one mean and
one variance vector. Since Sphinx originally allocated each table of vectors separately
and each is more than 7 MB, they potentially conflict with each other in the cache. To
avoid this, corresponding mean and variance vectors were interleaved and padded with
an additional 64 bytes to be exactly three L2 cache lines long. This padding strategy
consumes bandwidth but simplifies DMA transfers for the coprocessor architecture de-
scribed later. The optimized version appears in Figure 5.7. Note the interleaving of
vectors and a blocking loop that is not present in Equation 4.4. The optimized version
appears in Figures 5.1, 5.2, 5.3 and 5.6 as the data point GAU OPT.
for (senone = 0; senone < N; senone++)          // Loop 0
  for (block = 0; block < 10; block++)          // Loop 1
    for (c = 0; c < 8; c++) {                   // Loop 2
      for (i = 0, sum = 0.0; i < 39; i++) {     // Loop 3
        t = X[block][i] - Gautable[senone][c].vector[i].Mean;
        sum += t * t * Gautable[senone][c].vector[i].Var;
      }
      sum = max(sum, MINIMUM_VALUE);
      sum = sum * Gautable[senone][c].FinalScale +
            Gautable[senone][c].FinalWeight;
      score[senone][block] = log_add(score[senone][block], sum);
    }
Figure 5.7. Cache Optimized Gaussian Algorithm
GAU OPT demonstrates the true streaming nature of GAU. Figure 5.3 shows that
GAU OPT uses a factor of 3.9 to 4.7 less bandwidth than GAU in simulation, with a
factor of 4.2 improvement obtained on a real machine. This supports the claim that GAU
processing can be done without an L2 cache. With a 256 KB L2 cache, the GAU OPT
bandwidth is 174 MB/s. Calculations show that without a heuristic, and without an L2
cache, GAU OPT can meet its real-time requirements with 180 MB/s of main memory
bandwidth. This has important implications for the scalability of servers that process
speech.
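As a sanity check, the 180 MB/s figure follows directly from the model size: the interleaved and padded Mean/Var table occupies 49,152 × 384 bytes ≈ 18 MB (see Section 6.5.3), and the blocked code traverses it 10 times per second, giving 18 MB × 10/s = 180 MB/s.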
Figures 5.1 and 5.2 show a dramatic reduction in the cache miss rates in both simulation
and native execution. The L2 native execution results are better than the simulation results.
The large variation in the L1 results is due to the 32 byte L1 line size on the R12000 and
possibly also to an extremely large number of TLB misses. The software TLB
miss handler could easily pollute the L1 cache. The important point is that Figure 5.6
shows that OPT, a version of phased with the GAU OPT blocking optimization, achieves
a slight speedup over original despite performing a larger number of computations.
In summary, to be able to extract parallelism, the feedback loop was broken, which
approximately doubled the GAU workload. With cache optimizations (which are not
possible with feedback), the loss due to the extra GAU workload is recovered and the
exposed parallelism is now open for further optimization.
5.3.2 Parallelization
Based on the percentage of execution time, Amdahl’s law predicts a factor of 1.97
speedup if GAU and HMM processing could be entirely overlapped. It is clear that a
special-purpose architecture for GAU can have significant speedup, as well as power and
scaling benefits. Sphinx was multithreaded to see if there were any practical impediments
to achieving good speedup. The parallel version of Sphinx, called PAR, runs each of
the FE, GAU OPT and HMM phases on separate processors. In effect, this models an
SMP version of Sphinx 3 as well as the case where each processor could be replaced by
a special-purpose accelerator. As shown in Figure 5.6, the parallel version achieves a
speedup of 1.67 over the original sequential version. A custom accelerator will likely be
even better. The HMM phase was further multithreaded to use four processors instead of
one, but the resulting five processor version was slower than the two processor version
due to high synchronization overhead.
5.4 The HMM Phase
The HMM related data structure used in Sphinx consists of two components, the
actual Markov model data and lexical tree information attached to each node. While the
data layout itself seems to be well suited for a Dcache, separating out the lexical and
Markov model information could possibly lead to better cache behavior. Since such a
change would entail major restructuring of the application, it was not studied. HMM
evaluation can also benefit from special-purpose acceleration. To avoid having to
rewrite Sphinx entirely, the HMM related data was transcribed to a new database and the
HMM routine was accelerated in isolation. The results may be seen in Chapter 10.
CHAPTER 6
A CUSTOM GAUSSIAN ACCELERATOR
Chapter 4 introduced the use of multivariate mixture Gaussians in the acoustic model
evaluation of Sphinx 3.2 and indicated that this computation is common to other speech
recognition systems like HTK and the ICRC recognizer [59, 111]. Chapter 5 showed
that 55.5% of the execution time of Sphinx 3.2 was spent in Gaussian computation
when using the Hub-4 speech model. The high percentage of execution time spent
in this computation together with its applicability to a variety of speech recognizers
argues for special acceleration hardware for mixture Gaussians. Accelerators may be
implemented as custom nonprogrammable circuits or as domain specific programmable
processors. The custom circuit option will represent a practical upper bound on achiev-
able performance and energy efficiency. The programmable option which sacrifices some
performance and energy to gain generality will be explored in Chapter 9. This chapter
describes how a high throughput custom datapath is able to achieve area, power and
bandwidth efficiency as well as scalability by means of:
1. Reducing floating point precision.
2. Restructuring the computation.
3. Sharing memory bandwidth.
The Sphinx source code uses floating point computation sparingly, favoring scaled in-
teger arithmetic wherever possible. GAU and FE are the only floating point dominant
computations in Sphinx. An attempt was made to convert GAU to use fixed point integer
arithmetic. This failed because GAU requires a high dynamic range, which cannot
be provided by 32-bit scaled integer arithmetic. Fortunately, the scores of the highly
probable states are typically several orders of magnitude higher than those of the less
likely ones, indicating that a wide range is more important than precision.
Earlier work by Pihl explored the use of special-purpose floating point formats in
Gaussian estimation to save memory bandwidth [77]. Special floating point formats
should be almost invisible to the application so that speech models may be developed
without access to any special hardware. A custom software floating point emulation
library was developed to conduct an empirical search for the precision requirements of
the GAU phase. The library supported multiplication, addition, MAC, and (a − b)² oper-
ations on IEEE 754 format floating point numbers. The approach was to experimentally
reduce mantissa and exponent sizes without changing the output results of the Sphinx 3
recognizer. The result was a reduced precision floating point format similar to the IEEE
754 format which has a sign-bit, an 8-bit excess 127 exponent and a hidden one-bit in its
normalized mantissa. Unlike IEEE 754, which has 23 explicit-bits in the mantissa, the
new format used only 12 bits. Conversion between the reduced precision representation
and IEEE 754 was done by truncating the extra mantissa bits when converting from
IEEE 754 to the new format and concatenating additional 0 bits when converting from
the new format to IEEE 754. Such a transformation can be done within a floating point
unit without any changes being visible to the application. Though this work was done
independently, it is worthwhile to note that a previous study arrived at similar conclusions
based on an earlier version of Sphinx [97]. However that research used digit serial
multipliers, which cannot provide the kind of throughput required for GAU computation.
Hence the accelerator discussed here uses fully pipelined reduced precision multipliers
instead.
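For concreteness, the truncation and zero-extension described above amount to simple bit manipulations on the IEEE 754 single precision encoding. The following C sketch (the emulation library itself is not reproduced here; function names are illustrative) keeps the sign bit, the 8-bit exponent and the top 12 mantissa bits:

    #include <stdint.h>
    #include <string.h>

    /* Truncate an IEEE 754 single to the reduced 1 + 8 + 12 bit format by
       dropping the low 11 mantissa bits. */
    static uint32_t to_reduced(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);  /* reinterpret the float's bit pattern */
        return bits >> 11;               /* sign, exponent and 12 mantissa bits */
    }

    /* Extend back to IEEE 754 by concatenating 11 zero bits. */
    static float from_reduced(uint32_t r)
    {
        uint32_t bits = r << 11;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }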
Another key insight is that current high performance microprocessors provide a fused
multiply add operation that would benefit GAU. However, GAU also needs an add mul-
tiply (subtract-square) operation. There is scope for floating point circuit improvements
relying on the fact that (a − b)² always returns a positive number. Further gains can
be obtained in area, latency, power and the magnitude of the numerical error by fusing
the operations (a − b)² × c. This is the approach used in this research.
6.1 Top Level Organization
Figure 6.1 illustrates the system context for the GAU accelerator. Figure 6.2 shows
the details of the accelerator itself. Loops 1, 2 and 3 from the optimized GAU algorithm
in Figure 5.7 are implemented in hardware. The outer loop and the log add step, which
consists of integer subtract, table lookup and integer add, are implemented in software.
The max operation can be folded into the de-normal floating point number handling
section of the floating point adder without additional latency, but empirically it can be
discarded without sacrificing recognition accuracy. The organization in Figure 6.1 is
essentially a decoupled access/execute architecture [88]. The outer loop runs on a host
processor and instructs a DMA engine to transfer X, Mean and Var vectors into the
accelerator’s input memory. A set of 10 input blocks are transferred into the accelerator
memory and retained for the duration of a pass over the entire interleaved Mean/Var
table. The Mean/Var memory is double buffered for simultaneous access by the DMA
Figure 6.1. Top Level Organization of Gaussian Estimator
Figure 6.2. Gaussian Coprocessor
engine and the accelerator. The accelerator sends results to an output queue where they
are read by the host processor using its coprocessor access interface.
6.2 Coprocessor Datapath
Figure 6.2 shows the architecture of the accelerator. The datapath consists of an
(a − b)² × c floating point unit, followed by an adder that accumulates the sum, as well as a
fused multiply add (a × b + c) unit that performs the final scaling. Given that X, Mean, and
Var are 39-element vectors, a vector style architecture is suggested. The problem comes
in the accumulation step, since this operation depends on the sum from the previous
cycle, and floating point adders have multicycle latencies. For a vector length of N and
an addition latency of M, a straightforward implementation takes (N − 1) ×M cycles.
Binary tree reduction (similar to an optimal merge algorithm) is possible, but even then
the whole loop cannot be pipelined with unit initiation interval.
This problem is solved by reordering Loops 1,2,3 to a 2,3,1 order. This cal-
culates an (X − M)² × V term for each input block while reading out the mean and
variance values just once from the SRAM. Effectively this is an interleaved execution of
10 separate vectors on a single function unit, which leaves enough time to do a floating
point addition of a partial sum term before the next term arrives for that vector. The cost is
10 internal registers to maintain partial sums. Loops 2,3,1 can now be pipelined with unit
initiation interval. In the original algorithm, the Mean/Var SRAM is accessed every cycle
whereas with the loop interchanged version this 64-bit wide SRAM is accessed only once
every 10 cycles. Since SRAM read current is comparable to function unit current in the
CMOS technology used for this design, the loop interchange also contributes significant
savings in power consumption.
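In the style of Figure 5.7, the interchanged loop nest can be sketched in C as follows (declarations are elided as in Figure 5.7; partial[] models the 10 internal partial-sum registers). Each Mean/Var element is read once and applied to all 10 blocks, so consecutive terms for any one block arrive 10 cycles apart, hiding the adder latency:

    for (c = 0; c < 8; c++)                      // Loop 2: mixture components
    {
        for (block = 0; block < 10; block++)     // reset the partial-sum registers
            partial[block] = 0.0;
        for (i = 0; i < 39; i++)                 // Loop 3: vector elements
        {
            m = Gautable[senone][c].vector[i].Mean;  // one Mean/Var SRAM read...
            v = Gautable[senone][c].vector[i].Var;
            for (block = 0; block < 10; block++)     // Loop 1: ...shared by 10 blocks
            {
                t = X[block][i] - m;
                partial[block] += t * t * v;         // interleaved accumulation
            }
        }
        for (block = 0; block < 10; block++)         // final scaling per block
        {                                            // (the max() step is omitted; see above)
            sum = partial[block] * Gautable[senone][c].FinalScale
                  + Gautable[senone][c].FinalWeight;
            score[senone][block] = log_add(score[senone][block], sum);
        }
    }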
The Final Sigma unit in Figure 6.2 works in a similar manner, except that instead of a
floating point adder, it uses a fused multiply add unit. It scales the sum and adds the final
weight. This unit has a fairly low utilization since it receives only 8 × 10 inputs every
39 × 10 × 8 cycles. To save power this unit is disabled when it is idle. In a multichannel
configuration it is possible to share this unit between multiple channels. To reduce the
number of reads the processor needs to perform to fetch results from the accelerator, this
unit may be made to accumulate the final score. This also reduces the outgoing result
bandwidth from the accelerator to the processor by a factor of eight. In that case, due to the interleaved
execution this unit also requires 10 intermediate sum registers. Log domain addition can
be implemented using an integer subtract, table lookup and an integer add operation.
The state machine needs to be adapted to recirculate the results through the integer
add/subtract unit within the floating point adder. The lookup table used for extrapolation
is constant and can therefore be implemented as optimized logic within the state machine.
In this design, log domain addition is implemented in software.
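The software log-add on scaled-integer log scores can be sketched as follows; the table size and contents are illustrative (ADD_TABLE_SIZE and add_table are placeholder names), and only the subtract, lookup and add structure is the point:

    #include <stdint.h>

    #define ADD_TABLE_SIZE 1024                    /* illustrative size          */
    extern const int32_t add_table[ADD_TABLE_SIZE]; /* precomputed correction terms */

    /* log_add(a, b) approximates log(exp(a) + exp(b)) on log domain scores:
       an integer subtract, a table lookup and an integer add. */
    static int32_t log_add(int32_t a, int32_t b)
    {
        int32_t hi = (a > b) ? a : b;
        int32_t d  = (a > b) ? a - b : b - a;      /* integer subtract           */
        if (d >= ADD_TABLE_SIZE)
            return hi;                             /* smaller term is negligible */
        return hi + add_table[d];                  /* table lookup + integer add */
    }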
6.3 Implementation
The datapath shown in Figure 6.2 was implemented using a datapath description
language (Synopsys Module Compiler Language) and subsequently synthesized for
a 0.25µ CMOS process. The control sections were written in Verilog and synthesized
using the Synopsys Design Compiler. The gate level netlist was then annotated with worst
case wire loads calculated using the same wire load model used for synthesis. The netlist
was then simulated at the Spice level using Synopsys Nanosim with transistor parameters
extracted for the same 0.25µ MOSIS process. Energy consumption was estimated from
the RMS supply current computed by Spice. The unoptimized fully pipelined design can
operate above 300 MHz at the nominal voltage of 2.5 volts with unit initiation interval. At
this frequency the performance exceeds the real-time requirements for GAU, so a lower
clock frequency and supply voltage could be used to further reduce power.
A low power processor similar to a MIPS R4600 was designed for use as a control
processor. The MIPS was chosen because it is commonly used in embedded systems
and also because high performance implementations of the MIPS ISA, like the R12K,
were readily available for experiments. The design of this processor was done in such a
way that it could be easily modified for tight integration with ASIC coprocessors. The
Gaussian accelerator was designed and attached to the control processor as a custom
coprocessor, and the combination was then simulated. The control processor is a sim-
ple in-order design that uses a blocking L1 Dcache and has no L2 cache. To support
the equivalent of multiple outstanding loads, it uses the MIPS coprocessor interface to
directly submit DMA requests to a low priority queue in the on-chip memory controller.
The queue supports 16 outstanding low priority block read requests with block sizes that
are multiples of 128 bytes. A load request specifies a ROM address and a destination –
one of the Feat, Mean or Var SRAMs. The memory controller initiates a queued memory
read and transfers the data directly to the requested SRAM index. A more capable out
of order processor could initiate the loads directly. Software running on the processor
core does the equivalent of the GAU OPT phase. It accumulates 100 ms or 10 frames
of speech feature vectors (1560 bytes) into the Feat SRAM whenever the accelerator
has finished processing the previous block of input. Currently, the accelerator functions
faster than its real-time requirement. It is possible to slow down the accelerator so that
it completes the processing of each block just by the time the next block of input is
ready, but this has not been attempted. The data transfer uses the memory controller
queue interface. Next, it loads two interleaved Mean/Var vectors from ROM into the
corresponding SRAM using the queue interface. A single transfer in this case is 640
bytes. The Mean/Var SRAM is double buffered to hide the memory latency. Initially,
the software fills both the buffers. It then queues up a series of vector execute commands
to the control logic of the Gaussian accelerator. A single command corresponds to
executing the interchanged loops 2,3,1. The processor then proceeds to read results from
the output queue of the Gaussian accelerator. When 10 results have been read, it is time
to switch to the next Mean/Var vector and refill the used up half of the Mean/Var SRAM.
This process continues until the end of the Gaussian ROM is reached. When one cache
line of results has been accumulated, they are written to the output queue where another
phase or an I/O interface can read them.
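The control flow described above can be summarized with the following hypothetical C fragment; dma_to_sram(), cop_execute() and cop_read() stand in for the coprocessor and DMA-queue primitives and are not the actual interface names:

    dma_to_sram(FEAT, feat_frames, 1560);           /* 10 frames of features      */
    dma_to_sram(MEANVAR_HALF(0), rom_vec(0), 640);  /* prefill both halves of the */
    dma_to_sram(MEANVAR_HALF(1), rom_vec(1), 640);  /* double-buffered SRAM       */
    for (v = 0; v < NUM_VECTORS; v++) {
        cop_execute(v & 1);                         /* run loops 2,3,1 on one half */
        for (block = 0; block < 10; block++)        /* drain the output queue      */
            score[v][block] = cop_read();
        if (v + 2 < NUM_VECTORS)                    /* refill the half just used   */
            dma_to_sram(MEANVAR_HALF(v & 1), rom_vec(v + 2), 640);
    }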
Calculations based on the throughput of the accelerator showed that it needed to
operate at 202 MHz to achieve real-time speech processing. To simplify the electrical
interface between the processor and the coprocessor, both circuits need to operate at the
same clock frequency. Since the processor runs a general purpose operating system,
events like clock ticks and background tasks sometimes interrupt the main program that
transfers data between main memory and the input and output queues. Additional head-
room is required so that these interruptions do not prevent real-time processing of the
speech data. The extra performance required from the processor depends on the mix
of control tasks running on the processor. When the accelerator is scaled to process
multiple channels the processor needs to have commensurate processing ability too. So
the operating frequency of the system was chosen to be as high as possible subject to
the limitations of the 0.25µ process. The maximum frequency at which the circuits were
stable was 300 MHz. A cycle accurate simulator was developed and validated by running
it in lock step with the processor’s HDL model. The simulator was detailed enough to
boot the SGI Linux 2.5 operating system and run user applications in multitasking mode.
The resulting system accurately models the architecture depicted in Figures 6.2 and 6.1.
The GAU OPT application for this system is a simple 250 line C program with fewer than
10 lines of assembly language for the coprocessor interface. Loop unrolling and double
buffering were done by hand in C. The application was compiled using MIPS GCC 3.1
and run as a user application under Linux inside the simulator. It was able to process 100
ms samples of a single channel in 67.3 ms and scale up to 10 channels in real time. The
actual data may be seen in Section 6.5.2.
6.4 Applications
Though the Gaussian estimator was designed for Sphinx 3 and the MIPS-like embed-
ded processor, the results are widely applicable to other architectures and recognizers.
There are several levels at which this system may be integrated into a speech recognition
task pipeline similar to Phased. For example, an intelligent microphone may be created
by using a simple low power DSP to handle the A/D conversion and FE phase, and then
a GAU coprocessor attached to the DSP may be used for probability estimation. The
probability estimates can then be sent to a high-end processor or custom accelerator that
does language model computation. The GAU coprocessor can then hide more than 50%
of the compute effort required for speech recognition. On desktop systems, the Gaussian
accelerator may be part of a sound card or the Gaussian accelerator may be directly
attached to the main processor. On commercial voice servers, the Gaussian estimator
may be directly built into the line cards that interface to the telephone network thereby
freeing up server resources for language model and application processing. This also has
important implications for server scalability, discussed in Section 6.5.2.
6.5 Accelerator Evaluation
The main contributions of the coprocessor architecture are energy savings, server
scalability and bandwidth savings. Each of these advantages is elaborated in the follow-
ing sections.
6.5.1 Energy Savings
The Spice simulation results from the fully synthesized coprocessor architecture
were compared against an actual 2.4 GHz Pentium 4 system that was modified to allow
accurate measurement of processor power. Without considering the power consumed by
main memory, the GAU accelerator consumed 1.8 watts while the Pentium 4 consumed
52.3 watts during Gaussian computation, representing a 29-fold improvement. The
performance of the Pentium 4 system exceeded real-time demands by a factor of 1.6
while the coprocessor approach exceeded real time by 1.55. However the Pentium
4 is implemented in a highly tuned 0.13µ process whereas the GAU accelerator was
automatically synthesized for a generally available TSMC 0.25µ process. When normal-
izing for process differences, the advantage of the GAU coprocessor approach increases
significantly. After normalizing for the process, the coprocessor delivers 1.87 times the
throughput of the Pentium 4 while consuming 271 times less energy. It is important to
note that energy consumption vs. performance is a common design trade-off. A more
valid comparison is the energy-delay product. The GAU coprocessor improves upon the
energy-delay product of the Pentium 4 processor by a factor of 507 (≈ 1.87 × 271, the
product of the delay and energy advantages).
However the processor is only part of any system. Main memory is an impor-
tant consideration as well. This includes the power dissipated by a memory controller,
DRAM chips and the memory bus. It is difficult to estimate this accurately. Since
the XScale processor has an on-chip memory controller, the power consumption on an
XScale system accessing DRAM at peak bandwidth was measured. The main memory
component of power consumed by Gaussian computation was calculated based on that
measurement at the rate of 0.47 W per 64 MB/s of DRAM bandwidth. When the memory
is included the GAU coprocessor approach improves upon the Pentium’s energy delay
product by a factor of 196 and has an energy advantage of a factor of 104, and the
throughput performance stays the same as the processor-only results.
A Pentium 4 was used as the comparison because embedded processors like the
XScale do not have either the floating point instructions or the performance required
for the benchmarks. Software emulated floating point could possibly bloat the energy
delay product of the XScale and make a meaningful comparison impossible. Another
reason for the choice was simply the technical feasibility of measuring processor power.
For example, the Intel XScale development platform used in this research had a pro-
cessor module board with FPGA, Flash memory, etc., integrated on it, and isolating the
processor power was difficult. The particular Pentium 4 system was chosen because
the layout of its printed circuit board permitted modifications that allowed the energy
consumption of the processor core alone to be measured.
6.5.2 Scalability
As natural human interfaces become more common, scalability of servers that pro-
cess speech will become an important issue. This will be particularly important for
systems like call centers and collaborative work environments. In addition to having
energy advantages, the design is also scalable. Figure 6.3 shows that the system can be
scaled to process five independent speech channels in real time. The main limitation is
the in-order processor with its simple blocking cache. This is evident from the
difference in performance between the first and second bars in each data set. At six
channels, the system is seen to be slightly slower than real time. However, an ideal L1
D-cache which always reports a cache-hit and never writes data back to memory is seen
to scale up to 10 channels or more. A Final Sigma stage that implements log domain
addition enables the design to scale even with blocking caches due to the removal of
destructive interference between the cache and the DMA engine. The Final Sigma stage
reduces the number of results that need to be stored in the cache by a factor of eight. With
this optimization the system is able to process 10 or more channels of speech signals.
Figure 6.3. Channel Scaling. Processing time per 10 frames (ms) vs. number of channels (1–10). No Sigma, Real DL1: 67.3, 67.3, 67.3, 77.7, 93.4, 109.1 for channels 1–6; No Sigma, Ideal DL1: 67.3 for channels 1–9, 70.9 at 10 channels; Sigma, Real DL1: 64.8–65.5 across channels 1–10.
For embedded designs, the power required to support multiple speech channels may be
excessive, but such an organization is likely in a server. One channel of speech feature
vectors contributes about 16 KB/s to the memory bandwidth. The outgoing probabilities
consume 2.3 MB/s.
By setting a threshold on acceptable Gaussian scores and selectively sending out the
scores, this can be significantly reduced. The dominant bandwidth component is still
the Gaussian table. Additional Feat SRAMs and Gaussian accelerator datapaths may be
included. Since the Gaussian tables are common for all channels, all datapaths can share
the same Var and Mean SRAMs and thereby reuse the same 180 MB/s vector stream.
With a higher frequency implementation of the Gaussian datapath, multiple channels can
also be multiplexed on the same datapath. In a server, the Gaussian estimation of several
channels can be delegated to a line card, which operates out of its own 18 MB Gaussian
ROM. The partitioning of bandwidth, a 50% reduction in server workload per channel as
well as reduced cache pollution leads to improved server scalability.
6.5.3 Bandwidth Savings
The Hub-4 speech model used in this study has 49,152 interleaved and padded Mean/
Var vectors each occupying three L2 cache lines of 128 bytes or a total of 384 bytes per
pair of vectors. Thus the total size of the Gaussian table is 18 MB. Sphinx processes this
table 100 times every second, but uses a subvector quantization heuristic to cut down the
processing requirement, which in turn leads to lower DRAM bandwidth utilization. To
guarantee real-time processing, the Gaussian accelerator may be used at low power for
brute force evaluation. Because of the blocking optimization GAU OPT, the data needs
to be processed only 10 times per second with a peak bandwidth of 180 MB/s, which can
be further reduced by applying the subvector quantization (nonfeedback) heuristics in
Sphinx. Not only does this design bring the bandwidth requirements within limits feasible on
embedded systems, it also drastically reduces power consumption. On a 400 MHz
Intel XScale development system where the processor itself consumes less than 1 W,
peak memory bandwidth of 64 MB/s was obtained. Achieving this bandwidth consumed
an additional 0.47 W. The factor of four or more bandwidth savings is significant for
the embedded space since it indicates that a 52-watt server can be replaced by a 1-watt
embedded processor.
The Gaussian coprocessor takes advantage of the simple loop structure and the lim-
ited precision requirements of the GAU algorithm to make real-time processing of speech
signals possible at greatly reduced power budget. However, its design is quite inflexible
and difficult to adapt to other algorithms like neural net evaluation which involve similar
loops and summation operations. The experience underscores the potential benefits
of programmable accelerators which can use domain specific optimizations to provide
power and performance advantages similar to ASICs.
CHAPTER 7
VISUAL FEATURE RECOGNITION
ALGORITHMS
Visual feature recognition systems vary significantly based on the type of feature that
is being recognized. Relatively simple recognizers are regularly employed in industrial
visual inspection systems. On the other hand, human face recognition is an extremely
complex task given the huge possibility space of facial features and skin tones. Facial
recognition systems clearly have utility in security and surveillance domains, and other
visual recognizers play key roles in gesture interfaces, lip reading to support speech
recognition, and robotics. Interest in face recognition is motivated by the difficulty of
the problem, which cannot be currently supported by embedded systems. This is evident
from Figure 1.1, which showed that a high performance 4.8 GHz processor was required
to satisfy the real-time requirements of the FaceRec application. Furthermore the face
detection algorithms like the neural network based Rowley detector and the rectangle
feature based Viola/Jones detector used in this study are generic approaches for object
detection [83, 103]. They appear to be easily adapted to address other visual feature
recognition tasks. The main differences for these other tasks are a different training
regimen and different frame rate requirements. For example, the Rowley method of
face detection described in Section 7.3 has been applied to license plate detection [83].
Thus, research in accelerating face detection and recognition also helps the detection and
recognition of other objects.
The FaceRec application studied here can be viewed as a pipeline of three major
functional components. A flesh tone detector is used to isolate areas of a frame where
a face is likely to be present. The next stage is a face detector that determines whether
a face is present or not in each area of interest. The final phase is a face recognizer.
Each of these components is based on well known algorithms that have been adapted
or reimplemented to fit into a unified framework. Some algorithmic optimization and
restructuring has been done to suit benchmarking purposes, but the basic approach has
been developed by other researchers.
Interestingly, the face recognition system, when viewed from a structural perspective,
comprises a series of increasingly discriminating filters. Early stages of the sequence
must inherently filter the entire image. As the process proceeds downstream, each stage
needs to examine less image data since previous stages have eliminated certain areas from
the probable candidate list. The result is an interesting balance of simple algorithms that
analyze lots of data early in the sequence and more sophisticated algorithms that only
need to analyze limited amounts of data late in the process. The resulting structure is
amenable to implementation as an embedded system.
Figure 7.1 shows the major steps in face recognition. The input is a low-resolution
video stream such as 320 × 200 pixel images at 10 frames per second. The stream
is processed one frame at a time, and sufficient state is maintained to perform history
sensitive tasks like motion tracking. The process is essentially a pipeline of filters that
reduce the data and attach attributes to frames for the use of downstream components.
Typically each filter is invoked at the frame rate. This underlines the soft real-time nature
of this application. Additional data is required since filters may access large databases
or internal tables. These additional data streams add to the aggregate bandwidth require-
ment of the system. The periodic nature of the application domain often makes it possible
to easily estimate the worst case requirements.
Object recognition typically proceeds in two steps: object detection and the actual
object identification. Most approaches to object identification require a clearly marked
area, normalized to a particular size, and the location of key features. Object detectors
find the area where the desired feature is likely to reside, scale the area to meet the
normalization requirement, and then create a location and boundary description for that
area. False positives and negatives occur, but the algorithms try to minimize their
occurrence.
Figure 7.1. Algorithmic Stages of a Face Recognizer
Object detectors also often work at a fixed scale. The detector is swept across the
image recording all positions at which a detection was reported. The image is then
subsampled or scaled down by a small factor (typically 0.8), and the process is repeated
until the frame is below the size of the detector. A decision procedure is then applied
to all the predicted hits to decide which ones are the most likely. Detectors often have
much lower compute cost per subwindow than their corresponding identifying routines.
Since they are swept across the entire image, a significant portion of the application’s
execution time might be spent in the detector. In contrast, even though identifying filters
are more compute intensive, they are applied only to the high probability regions of the
frame, so their contribution to the overall execution time might be low. Though object
detectors are less compute intensive, they are much more difficult to design due to their
generality. For example a face identifier chooses from one of N known faces, but a face
detector has to distinguish between the infinite sets of faces and nonfaces.
Since detection is time consuming, it is common to structure an object detector as a
cascade of filters with cheaper heuristics upstream identifying potential regions for more
expensive heuristics downstream. An extreme case of this is the Viola/Jones method,
which trains a sequence of about 200 increasingly discriminate filters [103]. A more
common approach when dealing with faces and gestures is to identify the flesh colored
regions of an image and apply a more sophisticated detector to those regions.
The identifier receives candidate regions from the detector along with other infor-
mation like probability, scale and feature locations. It typically employs some type of
distance metric from known references to provide a positive identification. In the face
recognizer, the first level of detection is provided by flesh toning which is followed by
an image segmenting algorithm. These are followed in turn by a more complex detector,
voting for high probability regions, an eye locater and finally a face identifier.
7.1 Flesh Toning
Flesh toning identifies flesh colored pixels in an image. The commonly used RGB
color space is not well suited for flesh toning because skin color occupies a wide range
in primary color space. Variations due to lighting and ethnicity are hard to deal with and
skin-like colors on walls and clothing are harder to discriminate. However, skin colors
are tightly clustered in color spaces like HSV. Flesh toning can be done by converting
pixels from sample images into the chosen color space and making a scatter plot with
two colors, one for flesh pixels and one for nonflesh pixels. A boundary is then drawn
around flesh tone clusters. This boundary is then approximated by curves, which can
be described by simple geometric equations. In the image under test, any pixel that lies
inside this new approximated but easily described boundary is considered to be a flesh
pixel.
The base algorithm involves transforming the RGB color space into the NCC (Nor-
malized Color Coordinates) space using the simple equation r = R/(R + G + B),
g = G/(R + G + B). In this space flesh pixels occupy a space bounded by two
parabolas and maximum and minimum x-axis values. Applying two inequalities of the
form ax² + bx + c to the color coordinates will predict if the pixel is flesh colored or not
[90]. While this algorithm is simple and achieves good discrimination, it was observed
that it tends to classify certain shades of blue found in clothing as a skin color. A second
algorithm was used to transform the RGB value of a pixel to an HSV (Hue, Saturation,
Value/Luminance) value. In the HSV space, flesh color is tightly clustered allowing the
use of four simple inequalities for flesh tone [14]. In practice the HSV based algorithm
generates too many false positives. However, the consensus of the HSV and NCC space
algorithms produces good results. The output of this phase is a bit mask of the same size
as the image where a bit is set if the corresponding pixel is flesh colored.
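A minimal C sketch of the per-pixel NCC test follows; the parabola coefficients and the r bounds are trained constants from [90] and are passed in as parameters rather than reproduced here:

    #include <stdint.h>

    /* Returns nonzero if the pixel lies between the two parabolas and within
       the r bounds in normalized color space. */
    static int is_flesh_ncc(uint8_t R, uint8_t G, uint8_t B,
                            const float up[3], const float lo[3],
                            float r_min, float r_max)
    {
        float s = (float)R + (float)G + (float)B;
        if (s == 0.0f)
            return 0;                       /* black pixel: not flesh */
        float r = R / s, g = G / s;         /* NCC coordinates        */
        float g_hi = up[0] * r * r + up[1] * r + up[2];
        float g_lo = lo[0] * r * r + lo[1] * r + lo[2];
        return (g < g_hi) && (g > g_lo) && (r > r_min) && (r < r_max);
    }

The HSV test has the same per-pixel shape, four simple inequalities, and the consensus of the two classifiers sets the output bit.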
7.2 Segmentation
Segmentation is the process of clumping together individual pixels into regions where
an object might be found. A common approach is to do a connected component analysis,
which typically forms irregular regions. Since the Viola and Rowley algorithms used
for face detection need rectangular regions, a simple algorithm that cuts the flesh tone
bit mask apart into rectangles was used instead of connected component analysis
[103, 83].
Two operators from mathematical morphology are applied to the bit mask: a 3 × 3
erosion operator followed by a 5 × 5 dilation operator. This has the effect of cutting away
small connections and regions that are likely to be false positives and then smoothing the
bit mask by filling in any small holes in the middle of an otherwise acceptable sized
region. A logical OR of all the rows in the image is then performed to make a single row.
This step is called vertical separation. Runs of “1” values in the single row represent
vertical stripes of the image that contain objects of interest. Runs of “0” values represent
vertical stripes that may be discarded. For each vertical stripe, the columns are logically
OR-ed to create a single column. This is called horizontal separation. Runs of “1”
represent the region of interest. This algorithm can be recursively applied to isolate the
rectangular regions of interest. In the actual implementation, the horizontal separation
steps for all the vertical stripes are done together in an interleaved manner. This has the
effect of converting the column walk across the bitmap into a row walk giving better
cache performance. Recursion is stopped after two levels since this has empirically
provided adequate results. The flesh tone bitmap is discarded at this stage. The output
of this stage is a list of coordinates of the top left and bottom right corners of rectangular
regions of interest and a gray scale version of the image.
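A sketch of the vertical separation step follows, using one byte per mask bit for clarity (the real implementation packs bits):

    #include <stdint.h>
    #include <string.h>

    /* OR all rows of the W x H flesh-tone mask into a single row. Runs of 1s
       in row[] mark vertical stripes that contain objects of interest; the
       same reduction applied per stripe along columns gives the horizontal
       separation. */
    static void vertical_separate(const uint8_t *mask, int W, int H, uint8_t *row)
    {
        memset(row, 0, (size_t)W);
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                row[x] |= mask[y * W + x];
    }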
7.3 Rowley Face Detector
Henry Rowley’s neural net based face detector is well known as a pioneering con-
tribution [83]. Its implementation was provided by the Robotics Institute at CMU. This
detector is designed to determine if a 30 × 30 pixel image contains a face or not. Face
detection is done by sweeping the detector over the image and computing the decision
at each pixel location. Then the image is scaled and reduced in size by a factor of 0.8
and the procedure is repeated. The resulting series of images and detection locations is
called an image pyramid. In the case of real faces, a detection will be reported at several
nearby pixel locations at one scale and at corresponding locations in nearby scales. False
positives do not usually happen with this regularity. Hence a voting algorithm can be
applied to the image pyramid to decide the site of any true detections.
In each window the detector first applies a correction for varying lighting conditions
followed by histogram equalization to expand the range of intensity values. The prepro-
cessed window is then applied to a multilayer neural network where the input layer has
retinal connections to the image window. Neural net evaluation can be represented as:
Y = tanh( ∑_{i=1}^{N} W[i] × Image[Connection[i]] )

W[] is a set of weights associated with each neural connection and Connection[]
represents the image locations to which the neuron is connected. In practice, Image
contains additional storage following the actual stored image, and the outputs of neurons
are stored to the additional locations. Thus a multilayer network can be evaluated as
if it is a flat retinally connected array of neurons if it is ensured that neurons in deeper
layers follow neurons closer to the retinal layer. The tanh function acts as a sigmoid
shaped nonlinearity and computing it is expensive. Rowley’s original implementation
uses the tanh() implementation provided by the C-library. In the version developed for
this dissertation, it was replaced with an 800 entry lookup table which has produced
identical output to the original for the test images. This simple optimization improved
the performance of the algorithm by a factor of 2.5 on a 2.4 GHz Pentium processor.
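The evaluation of a single neuron with the table-based tanh can be sketched as follows; the table range and quantization step are illustrative assumptions (an 800 entry table covering roughly [−4, 4) in steps of 0.01), not the constants of the actual implementation:

    /* Evaluate one neuron over the flat image/activation array. */
    static float neuron_eval(const float *Image, const int *Connection,
                             const float *W, int N, const float tanh_table[800])
    {
        float sum = 0.0f;
        for (int i = 0; i < N; i++)                  /* indirect vector access  */
            sum += W[i] * Image[Connection[i]];
        int idx = (int)((sum + 4.0f) * 100.0f);      /* quantize into the table */
        if (idx < 0)   idx = 0;                      /* clamp: tanh saturates   */
        if (idx > 799) idx = 799;
        return tanh_table[idx];                      /* sigmoid nonlinearity    */
    }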
The retinal layer is followed by a hidden layer comprised of three classes of units.
Four units look at 10 × 10 subwindows, 16 units look at 5 × 5 subwindows and 6 units
look at overlapping 30 × 5 horizontal stripes. The final output of the network indicates
if the 30 × 30 window contains a face or not.
The voting algorithm notes the location and scale of each detection in an image
pyramid. The next step called spreading replaces each location in the pyramid with the
count of the number of detections in a neighborhood. The neighborhood of a location
extends an equal number of pixels along the position and scale axes. The values are
then thresholded and the centroids of all remaining locations are found. Centroids are
examined in descending order of the number of detections per centroid and other cen-
troids that represent a face overlapping the current face are eliminated. The remaining
centroids represent the location of faces found in the image. To further reduce false
positives, multiple neural nets each trained separately may be applied to the image and
their consensus can represent a more accurate detection.
7.4 Viola and Jones’ Detector
Viola and Jones present a new and radically faster approach to face detection based
on the AdaBoost algorithm from machine learning [103]. They claim a factor of 15
speedup over the Rowley detector for their implementation, but they run their detector
directly on entire images without using flesh toning to cut down the search area. Since
their source code is proprietary, their algorithm was reimplemented by the author and
Robert Evans based on example code obtained from Peter Carbonetto at the University
of British Columbia. The visual feature recognizer used for this dissertation research uses
flesh-toning for both the Rowley and the Viola/Jones detectors. Flesh-toning cuts down
the region of the image that the detector needs to process. Under these circumstances a
visual feature recognition system using Rowley’s method performs as well as a system
that uses Viola and Jones’ method. To understand the Viola/Jones detector, the concept
of boosting needs to be explained first.
A random guess to a yes or no question stands the chance of being correct 50% of the
time. If a heuristic can improve the odds by a very small amount then it is called a weak
learner. It is possible to generate weak learners for several tasks in a semiautomated
manner by enumerating a huge set of heuristics generated on the basis of combinations
of simple rules and evaluating their performance on a set of samples. A heuristic that can
improve the odds of a guess by a significant amount is called a strong learner. Boosting
is a method of combining several weak learners to generate a strong learner. AdaBoost is
a well known algorithm to generate strong learners from weak learners, while providing
statistical bounds on the training and generalization error of the algorithm [86].
The weak learners in the Viola/Jones algorithm are based on features of three kinds.
A two-rectangle feature is the difference between the sums of the pixel values of two adjacent
rectangular windows. A three-rectangle feature considers three adjacent rectangles and
computes the difference between the sum of the pixels in the extreme rectangles and the
sum of the pixels in the middle rectangle. A four-rectangle feature considers a 2 × 2 set
of rectangles and computes the difference between the sums of pixels in the rectangles that
constitute the main and off diagonals. For a 24 × 24 subwindow there could be more
than 180,000 such features. The task of the AdaBoost algorithm is to pick a few hundred
features and assign weights to each using a set of training images. Face detection is
reduced to computing the weighted sum of the chosen rectangle-features and applying a
threshold. As in the case of the Rowley algorithm, a 30 × 30 detector is swept over every
pixel location in the image, and the image is rescaled. Rowley’s voting algorithm is used
to decide the final detection locations.
Computing rectangle features is a simple but slow operation based on the sum or
difference of pixels in adjacent rectangular regions. Recomputing these sums for each
pixel location is very expensive. A major contribution of the Viola/Jones approach is
an intermediate image representation called the integral image. The sum of the pixels
in a rectangular window can be computed easily using the intermediate representation.
The integral image value at pixel location (x,y) in an image is defined as the sum of
all pixels to the left of and above (x,y). Computing this sum naively for every pixel is
prohibitively expensive; by expressing the same relationship as a pair of recurrences, it
is possible to compute the integral image with just one pass over the image. Given the
integral image, computing a
feature F reduces to:
S = ∑_{i=1}^{9} W[i] × IntegralImage[F.Index[i]]

F.score = abs(S − F.mean_face) < abs(S − F.mean_nonface)
W[] is a set of weights that depends only on the type of the feature being computed.
These are known constants, unlike the trained weights of a neural network. As with
neural networks, F.Index[] is a set of indices denoting connections to specific locations
within the integral image. These are trained for each selected feature. F.mean_face
represents the average distance from feature F to a set of rectangles known to contain
faces. Similarly, F.mean_nonface represents the average distance from feature F to a
set of rectangles known to be devoid of faces. Thus the score for the feature depends
on whether the feature is closer to the population of faces or nonfaces. The decision of
whether an image window contains a face or not is based on the computation:
IsFace = ( ∑_{i=1}^{N} F[i].score ) > threshold
N is the number of features used for recognition and threshold is determined by the
AdaBoost algorithm.
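To make the one-pass recurrence concrete, a C sketch of the integral image computation follows (row-major indexing; types are illustrative). With ii[] in hand, each rectangle sum, and hence each feature, needs only the handful of indexed reads shown in the equations above:

    #include <stdint.h>

    /* ii(x,y) = ii(x,y-1) + rowsum(x,y), where rowsum is the running sum of
       the current row: one pass over the image. */
    static void integral_image(const uint8_t *img, uint32_t *ii, int W, int H)
    {
        for (int y = 0; y < H; y++) {
            uint32_t rowsum = 0;
            for (int x = 0; x < W; x++) {
                rowsum += img[y * W + x];
                ii[y * W + x] = (y ? ii[(y - 1) * W + x] : 0) + rowsum;
            }
        }
    }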
The original slow approach described in the Viola/Jones paper uses 200 features.
They then go on to describe a faster approach where they cascade many such detectors
with more complex detectors following simpler ones. A window is passed to a detector
only if it was not rejected by the preceding detector. Since training this cascade is a
laborious process, the workload characteristics of this algorithm are modeled with a 100
feature detector.
7.5 Eigenfaces
Eigenfaces is a well known Principal Component Analysis (PCA) based face recog-
nition algorithm developed by researchers at MIT [99]. A reimplementation of the
Eigenfaces algorithm from researchers at Colorado State University was used in this
research [28]. Though the mathematical underpinnings of Eigenfaces are complex, the
entire algorithm is simple and has a structure quite amenable to streaming and a high
degree of statically schedulable ILP. Training images are represented as a set of flattened vectors
and assembled together into a single matrix. The Eigen vectors of the matrix are then
extracted and stored in a database. The training face images are projected onto a feature
space, called face space, defined by the Eigen vectors. This captures the variation
between the set of faces without emphasis on any one facial region like the eyes or
nose. The projected face space representation of each training image is also saved to a
database. To identify a face, the test image is projected to face space using the saved
Eigen vectors. The projected test image is then compared against each saved projected
training image for similarity. The identity of the person in the test image is assumed to be
the same as the person depicted in the most similar training image. The actual algorithm
that defines the face space is:
Make Eigen Vectors(ImageList, N, M): ImageList is a set of N training images,
where each image is W × H pixels. M is the number of Eigen vectors that needs to be
generated.
1. Flatten each image into a WH element vector by concatenating all the rows. Let
ImageMatrix be the N × WH matrix containing all the flattened images.
2. Sum up all the rows of ImageMatrix and divide by N to get an average flattened
image. Call this WH element vector ψ.
3. Subtract the average image ψ from the flattened images in ImageMatrix. Let the
new N × WH matrix be φ.
4. Compute dot products of all possible image pairs. Let L be the new N × N matrix
where L[i][j] = dot product of φ[i] and φ[j].
5. Compute the N Eigen values and corresponding Eigen vectors of L. Pick the M
Eigen vectors corresponding to the highest Eigen values. Each Eigen Vector is N
elements long.
6. Do a matrix multiplication of each of the selected M Eigen vectors against φ and
save the resulting set of 1 × WH sized matrices as a combined M × WH element
EigenMatrix in a database. Save the average image ψ to the database as well.
The projection algorithm follows:
Project to Face Space(Image): Image is W × H pixels in size.
1. Let img be the flattened WH element vector form of Image.
2. Load the average image ψ and the EigenMatrix from the database.
3. Subtract the average image ψ from img to create a new image img ′.
4. Take the dot product of img′ against each row of EigenMatrix to obtain an M
element vector img′′.

5. Let norm = √( ∑_{i=1}^{M} img′′[i] × img′′[i] ). Divide each element of img′′ by
norm. This is the face space representation of Image.
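Steps 3–5 reduce to the following C sketch (row-major M × WH EigenMatrix; identifiers are illustrative):

    #include <math.h>

    /* Project a flattened W*H image onto face space (steps 3-5 above). */
    static void project_to_face_space(const float *img, const float *psi,
                                      const float *eigen, /* M x WH, row major */
                                      int WH, int M, float *out)
    {
        float norm = 0.0f;
        for (int j = 0; j < M; j++) {          /* step 4: M dot products       */
            float dot = 0.0f;
            for (int i = 0; i < WH; i++)
                dot += (img[i] - psi[i]) * eigen[j * WH + i];  /* step 3 fused */
            out[j] = dot;
            norm += dot * dot;
        }
        norm = sqrtf(norm);                    /* step 5: normalize            */
        for (int j = 0; j < M; j++)
            out[j] /= norm;
    }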
Learning is a matter of projecting all the known faces to the face space and saving the
projected representations and the identity of each person.
Learn Faces(ImageList, N, M): ImageList is a set of N training images, where each
image is W × H pixels. A person’s name is attached to each image. M is the number of
Eigen vectors needed.
1. Call Make Eigen Vectors(ImageList, N, M)
2. For each image in ImageList, call Project to Face Space(image) and save the
resulting projected faces to a database.
Identification is then a simple matter of projecting the test image to face space and
computing a similarity score.
Identify(Image): Image is W × H pixels in size.
1. Load the saved known projected faces from the database.
2. proj = Project to Face Space(Image)
3. Take the dot product of proj against each known projected face. Call this the score.
4. The known projected face that gets the highest score is considered the identity of
the test image.
7.6 Architectural Implications
To report the algorithmic complexity of the various phases of the visual feature
recognizer, an n × n pixel square image is assumed in this section. The flesh tone detector
applies algebraic transformations and inequalities to each pixel, so its complexity
is O(n²). The face detectors sweep a basic detector across all pixel locations in the
image and then rescale the image, so their complexity grows as O(n² log(n)). This is
compounded by the complexity of the base detector itself. For the Rowley method
using N neurons of length L, the base detector complexity is O(LN), giving an overall
complexity of O(n² log(n) LN). The complexity of a Viola/Jones style base detector with
N rectangle features is O(N)¹, which yields an overall complexity of O(n² log(n) N). For
each region where a face is likely to be present, EigenFaces performs O(Mn² + KM)
operations, where M is the number of Eigen vectors and K is the number of known
faces. This complexity is due to the M dot products on vectors of length n² done while
projecting the test image and the K dot products on M element vectors done while
finding the most similar known face.
In all cases, the workload scales faster than n² as the image size is increased, so high
performance architectures are necessary for larger images. For increased accuracy, the
Rowley and Viola detectors need larger numbers of neurons and features respectively,
leading to a linear increase in compute requirements. For EigenFaces, increasing the
discrimination by using a larger number of Eigen vectors leads to a linear increase in the
compute requirements, as does increasing the number of known faces to check against.

¹Features not cascaded; the integral image computation is ignored.
Each of the phases is a natural fit for a streaming architecture. Since the flesh toner
works on one pixel at a time, the image may be streamed pixel or raster line at a time
through the processor. The face detectors work on rectangular regions of an image, thus
the ability to hold a 30 × 30 image window on chip and stream the neurons or features
through the processor is important. For a modest increase in on-chip storage to about
16 KB, both the neuron and feature descriptions can be held within the processor and
image windows may be streamed through the processor. Since both detectors sweep
their image windows row by row and column by column, the ability to hold 30 raster
lines on chip will greatly reduce the number of image window fetch operations. Since
they both work on gray scale images, the additional SRAM required is merely 9.3 KB
for 320 × 200 sized images. While they have very different conceptual backgrounds,
the base detectors of the Viola and Rowley algorithms are remarkably similar and both
involve indirect vector access and dot product operations. The Viola algorithm uses
a short vector length of nine and integer multiply accumulate operations, while
the Rowley method uses longer vectors with lengths ranging from 11 to 151; the
sizes 101, 151 and 26 cover 87.8% of all evaluated neurons. Currently, neural net
evaluation involves floating point multiply accumulate operations. Given the limited
range of weights and histogram equalized image pixels, this could possibly be converted
to scaled integer arithmetic.
Similarly, EigenFaces is dominated by floating point dot products which in turn
depend on floating point multiply accumulate operations. Each test image needs to be
projected to face space based on the stored Eigen vectors. This is a series of dot product
operations, and each stored Eigen vector may be simply streamed through the processor
while holding the flattened image within the processor. The vector length is equivalent to
the number of Eigen vectors. Values of 50 or more are required in practice. Identification
can be done by holding the projected test image constant in the processor and streaming
the known projected images for computing dot products. Thus it can be seen that, on
the whole, visual feature algorithms lend themselves to efficient stream
processor implementations.
CHAPTER 8
CHARACTERIZATION OF VISUAL FEATURE
RECOGNITION
This chapter provides a detailed characterization of the visual feature recognition
system described in Chapter 7. Native execution, profiling using processor performance
counters, and simulation were used to characterize the application. The native execution
results were obtained using SGI SpeedShop on a 666 MHz R14K processor. Unlike
the results presented in Chapter 5, which used the SimpleScalar 3.0 simulator, results in
this chapter are based on ML-RSIM, an out of order processor simulator derived from
the Rice University RSIM simulator. This change was motivated by two reasons. First,
the visual feature recognition application is implemented in C++, but the compiler used
by SimpleScalar does not support C++. Since ML-RSIM accepts binaries compiled for
SunOS, it was possible to generate the application binary on a Sun workstation. Second,
a stable version of ML-RSIM was not available at the time the experiments in Chapter 5
were conducted.
A derivative of the NetBSD operating system was run within the simulator. An
application binary compiled for SunOS was used without any modification since the OS
emulates the SunOS system call interface. Two configurations were simulated: a
multi-GHz processor whose parameters, such as L1 cache hit time, memory access time
and floating point latencies, were measured on a 1.7 GHz AMD Athlon processor using
the lmbench hardware performance analysis benchmark, and an embedded configuration
modeled after a 400 MHz Intel XScale processor except that it uses a Sparc ISA and has
a floating point unit [68].
configured without an L2 cache, an inclusive L2 cache equivalent in size to the combined
L1 instruction and data caches was added. Since the cache is inclusive and the same size
as the sum of the L1 caches, this configuration behaves similar to a machine with no
L2 cache. Numbers that could not be directly measured were obtained from vendor
microarchitecture references. ML-RSIM was configured to reflect the parameters shown
in Table 8.1. Unless mentioned otherwise, the remainder of this chapter uses the default
configuration.
The application is studied in five configurations: a) full pipeline using the Rowley
face detector, b) full pipeline using the Viola/Jones face detector, c) only the Rowley face
detector with flesh toning and image segmentation, d) only the Viola/Jones face detector
with flesh toning and image segmentation, e) only the Eigenfaces recognizer. The last
three configurations are important from an energy savings perspective since running the
individual algorithms on separate low frequency processors or hardware accelerators can
lead to significant energy savings.
Table 8.1. Experiment Parameters
Native Execution:
  SGI Onyx3, R14K processors at 666 MHz
  32 KB 2-way IL1, 32 KB 2-way DL1, 8 MB L2
  Software: IRIX 64, MIPS Pro compiler, Perfex, Speedshop
Simulator (default configuration):
  Sparc V8 ISA, out of order CPU model, 2 GHz
  16 KB 2-way IL1, 2 cycle latency; 16 KB 2-way DL1, 2 cycle latency
  2 MB 2-way L2, 20 cycle latency; 228 cycle DRAM latency
  L1 line size 64 bytes, L2 line size 128 bytes
  Issue width: 4 integer + 4 floating point, max 4 graduations/cycle
  DRAM interface: 600 MHz, 64 bits wide
  Software: gcc 2.6.3
Embedded Configuration:
  Sparc V8 ISA, 400 MHz
  32 KB 32-way IL1, 1 cycle latency; 32 KB 32-way DL1, 1 cycle latency
  64 KB inclusive L2 cache
  L1 line size 64 bytes, L2 line size 128 bytes
  Issue width: 1 integer or 1 floating point, max 1 graduation/cycle
  DRAM interface: 100 MHz, 32 bits wide
  Software: gcc 2.6.3
8.1 Application Characteristics
Figures 8.1 and 8.2 show the relative execution times of each algorithm when the
application is run using the Rowley detector and the Viola/Jones detector. In both cases,
the face detector is the dominant component. Since the detectors are heuristic in nature,
the face regions identified by them may differ. This in turn leads to differences in
the runtime of other algorithms that depend on the detector’s output. Figures 8.3 and
8.4 show the L1 Dcache miss rate and the L2 cache hit rates for all five application
configurations. Since the caches are inclusive, the L2 hit rate is defined as the L1 misses
that hit in the L2 cache divided by the total number of accesses made by the application.
Since this application achieves 99.8% Icache hit rate with a 16 KB Icache, no other
Icache configurations were studied. Figure 8.5 shows IPC for a variety of execution unit
configurations, and Figure 8.6 shows the run times normalized to real time.
For the entire application there is consistently greater than 92% L1 cache hit rate
for Dcaches of 16 KB and above. This indicates that the streaming pipelined model for
composing the algorithms is a good fit for the problem. Each 320 × 200 pixel color
image is 187.5 KB long and the corresponding gray scale versions are about 64 KB.
The images clearly will not fit in the L1 cache. The explanation is that the color image
is accessed in streaming mode, i.e., each pixel is touched exactly once for flesh toning.
Image segmentation works on the flesh tone bitmap (approximately 64 KB) making at
most two passes over it.

Figure 8.1. Execution Time Breakdown of Viola/Jones Detector Based Face Recognizer. (Flesh tone 3.9%, Viola 59.7%, Eye locator 17.1%, Eigenfaces 19.4%.)

Figure 8.2. Execution Time Breakdown of Rowley Detector Based Face Recognizer. (Flesh tone 6.2%, Rowley 64.6%, Eye locator 10.4%, Eigenfaces 18.8%.)

Since these accesses touch at most two image rows at a time,
good cache utilization is ensured. Subsequently, only small windows into the image are
used. Since objects in these images are typically smaller than 50 × 50 pixels, each object
is only about 2.5 KB in size. The downstream algorithms make several passes over each
object, but only a small part of each object needs to be cache resident at each time.
For example, the integral image computation in the Viola/Jones algorithm is based on a
recurrence that involves two adjacent image rows and an additional row for intermediate
storage and has an L1 cache footprint of about 4.4 KB. The Rowley algorithm touches at
most 30 rows of the object at the same time. However, as it sweeps across the image left
to right and top to bottom only a 30 × 30 pixel window needs to be cache resident at a
time. Since it shifts its position one pixel at a time, a 29 × 29 region of this window will
be reused by the next iteration contributing to high L1 cache hit rate. A similar pattern
occurs in the later phase of the Viola/Jones algorithm on a 30×30 region. The Eigenfaces
algorithm uses a projected image of the object to be recognized as well as basis, mean
and projected image matrices corresponding to each reference object. The target object is
reused while it is compared against each candidate. Each candidate, however, is accessed
only once per target object.
The objects and their attributes from each stage are typically touched again by the
next stage. The auxiliary information used by the algorithms is somewhat small.

Figure 8.3. L1 Dcache Miss Rate. (Miss rate by L1 Dcache size, 8/16/32/64 KB: Viola App 9.22/6.40/4.16/2.82%; Rowley App 9.97/7.11/5.11/3.12%; Viola 9.52/6.19/3.83/2.24%; Rowley 10.77/7.68/5.54/3.16%; Eigenfaces 6.62/3.90/2.50/1.93%.)

Both
detector algorithms use fixed size data structures. The worst case is the Viola/Jones
algorithm, which needs a weight and a type for each feature, corresponding to 100 × 2 ×
4 = 800 bytes of L1 cache. The data set for the Eigenfaces algorithm on the other hand
is linear in the number of the reference faces. Since these could potentially be streamed
into the L1 Dcache once per target object (or once per frame), the footprint is small. Only
the projected target object and a small part of the basis/mean/projected reference images
need to be resident in the L1 Dcache. From Figure 8.4 it is seen that the L2 cache is
largely ineffective since it is accessed infrequently due to the low L1 miss rate.
From a cache footprint perspective, both the detector algorithms and the entire appli-
cation appear to be a good match for embedded processors with limited cache resources.
Since images are accessed left to right, multiple rows at a time, sequential prefetch (or
strided prefetch) would hide memory access latencies even when the L1 Dcache is small.
Figure 8.4. L2 Cache Hit Rate. (Hit rate by L2 cache size, 256/512/1024/2048 KB: Viola App 0.82/1.08/1.30/1.41%; Rowley App 0.82/1.10/1.38/1.52%; Viola 0.84/1.13/1.37/1.46%; Rowley 0.86/1.23/1.46/1.56%; Eigenfaces 0.66/0.82/0.96/0.82%.)
Quite a different view unfolds on examination of the IPC and speedup graphs (Figures
8.5 and 8.6). IPC is seen to saturate early on for two main reasons. The first is caused by
dependences in the loop bodies. For example, neural net evaluation involves computing
Σ_{i=0..n} Weight[i] × Image[Connection[i]]. In
addition to the loop carried dependence on the sum, each of the inputs is accessed
indirectly via a pointer since an input to one neuron could be the output of another
neuron. Second, the high ratio of array variable accesses to arithmetic operations causes
saturation of the Dcache ports.
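The loop in question has roughly the following shape (a minimal Python sketch; the variable names are illustrative rather than taken from the application source):

    def neuron_output(weight, image, connection):
        # acc carries a loop-carried dependence across iterations, and
        # each input is fetched indirectly: image[connection[i]] cannot
        # issue until the load of connection[i] completes.
        acc = 0.0
        for i in range(len(weight)):
            acc += weight[i] * image[connection[i]]
        return acc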
Figure 8.5. IPC. (IPC by execution unit configuration, Embedded/1+1/2+2/3+3/4+4 ALUs + FPUs: Viola App 0.48/0.65/0.69/0.72/0.72; Rowley App 0.49/0.65/0.67/0.70/0.70; Viola 0.56/0.71/0.76/0.78/0.78; Rowley 0.52/0.67/0.69/0.71/0.71; Eigenfaces 0.52/0.74/0.80/0.89/0.89.)

Figure 8.6 shows the run times normalized to real time. Here, 1.0 represents minimum
real-time performance corresponding to 5 frames per second. For example, in the
1 ALU + 1 FPU configuration, the Rowley algorithm is 1.13 times
slower than real time while the Eigenfaces algorithm processes 5 frames in 0.69 seconds.
The graph clearly shows that embedded processors are inadequate to handle the work
load in real time. In this case instruction throughput is the culprit. Even when function
units are available, dependences and contention for the Dcache ports cause low IPC.
The power budgets required for real-time performance are beyond what is available
on normal low power embedded platforms. Thermal dissipation is a problem even on
high performance processors and energy saving solutions are important for real-time
workloads like visual feature recognition. Hardware accelerators that use specialized
data paths and stream array operands out of multiple SRAM buffers stand a good chance
of accelerating these algorithms at embedded power budgets.
Figure 8.6. Speedup or Slow Down Over Real Time. (Run time normalized to real time, where 1.0 is real time at 5 frames per second; Embedded 400 MHz / Embedded 2 GHz / 1+1 / 2+2 / 3+3 / 4+4 ALUs + FPUs: Viola App 19.69/3.94/2.88/2.72/2.59/2.59; Rowley App 9.93/1.99/1.52/1.47/1.40/1.40; Viola 10.89/2.18/1.71/1.59/1.54/1.54; Rowley 7.29/1.46/1.13/1.09/1.06/1.06; Eigenfaces 4.96/0.99/0.69/0.64/0.58/0.58.)
8.2 Optimization Opportunities
One recurring theme in image processing is computing a kernel that operates on an
M × N subwindow of a larger W × H image. The kernel is recomputed for every possible
pixel location within the larger image. This resembles sliding the M × N subwindow
over the W × H image. There is significant scope for compiler based reordering of
computations in such kernels. Here are two concrete examples.
As described in Section 7.4, the heuristics used by the Viola/Jones algorithm are
based on the sum/difference of pixels in adjacent rectangular regions. Recomputing these
sums for each pixel location is very expensive. A major contribution of their approach
is an intermediate image representation called the integral image. The sum of the pixels
in a rectangular window can be computed easily using the intermediate representation.
The integral image value at pixel location (x,y) in an image is defined as the sum of
all pixels to the left of and above (x,y). Computing this sum directly at every pixel
location is computationally prohibitive. By
expressing the same relationship as a pair of recurrences, it is possible to compute the
integral image with just one pass over the image. This transformation required careful
study and insight from the originators of the algorithm. Given the fact that the sums
of rectangular subwindows of the larger image are recomputed at each pixel location, a
compiler based tool aware of the access pattern and rules of arithmetic may be designed
to deduce the recurrences.
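As an illustration, the recurrence pair can be sketched in Python as follows (a minimal rendering of the transformation, assuming 0-based indexing; this is not the application's actual code):

    def integral_image(img, W, H):
        # s holds cumulative column sums, ii the integral image:
        #   s(x, y)  = s(x, y-1)  + img(x, y)
        #   ii(x, y) = ii(x-1, y) + s(x, y)
        s = [[0] * H for _ in range(W)]
        ii = [[0] * H for _ in range(W)]
        for x in range(W):
            for y in range(H):
                s[x][y] = (s[x][y - 1] if y > 0 else 0) + img[x][y]
                ii[x][y] = (ii[x - 1][y] if x > 0 else 0) + s[x][y]
        return ii

    def rect_sum(ii, x0, y0, x1, y1):
        # Sum over the rectangle (x0, y0)..(x1, y1) using four lookups.
        a = ii[x1][y1]
        b = ii[x0 - 1][y1] if x0 > 0 else 0
        c = ii[x1][y0 - 1] if y0 > 0 else 0
        d = ii[x0 - 1][y0 - 1] if x0 > 0 and y0 > 0 else 0
        return a - b - c + d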
The standard deviation of pixel values within a 30× 30 pixel window starting at each
pixel location within the probable face rectangle is required during the computation of
Viola/Jones heuristics. An initial implementation that simply recomputed the standard
deviation function at each pixel location was seen to occupy between 10% and 15% of the
compute time of the whole application. When going from one pixel to the next, the
windows overlap by 29 × 30 pixels, and the mean and sum of squares for one pixel can
be easily calculated from its predecessor's values by adjusting for the nonoverlapping
pixels alone. By defining a set of recurrences for the mean and mean square for 30 × 30
subwindows over a wider region, it is possible to compute the standard deviations in one
pass over the image thereby reducing the execution time of this component to less than
1%. Currently, such transformations require a lot of attention from the programmer and
insight into the algorithm and are error prone because of corner cases. This bolsters the
argument in favor of compiler based loop restructuring that can apply axioms of algebra
to deduce the right set of recurrences.
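The one-dimensional analogue of this transformation can be sketched as follows (an illustrative Python rendering; the 2D version used in the application maintains per-column sums in the same incremental fashion):

    import math

    def sliding_stddev(row, w=30):
        # Maintain the window sum and sum of squares incrementally; each
        # one-pixel shift adjusts only for the entering and leaving
        # (nonoverlapping) pixels instead of recomputing the whole window.
        s = float(sum(row[:w]))
        sq = float(sum(p * p for p in row[:w]))
        out = []
        for x in range(len(row) - w + 1):
            if x > 0:
                enter, leave = row[x + w - 1], row[x - 1]
                s += enter - leave
                sq += enter * enter - leave * leave
            mean = s / w
            out.append(math.sqrt(max(sq / w - mean * mean, 0.0)))
        return out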
Another possible optimization is to reorder the computation so that data may be
streamed through a set of execution units and results computed in the minimum number
of passes while observing limits on the amount of intermediate storage used. Compiler
based tools that consider parameters like the size of the image and the subwindow, and
the size of the intermediate storage and automatically transform algorithmic kernels for
optimum stream performance would be desirable.
As seen in Figures 8.5 and 8.6, wide issue clearly helps performance. In traditional
architectures, wide issue usually comes at the cost of increased hardware complexity and
could potentially limit the clock frequency as well as exceed a limited energy budget.
This application is embarrassingly parallel in most sections due to the intrinsic data
parallelism in the pixel processing. One way of achieving good performance at low
power is to use a cluster of function units operating in parallel with a very small quantity
of SRAM for local storage and no cache. This approach is investigated in the next
chapter.
CHAPTER 9
PERCEPTION PROCESSOR ARCHITECTURE
Chapter 3 explained that achieving high IPC was critical to realizing high-performance,
low-power perception processors. Chapters 4 and 7 described the structure of typical
perception algorithms, which are characterized by simple multilevel nested loops where
the majority of arithmetic and floating point operators have array and vector operands.
Operand availability is therefore critical to achieving high IPC. It was also seen that per-
ception applications may be expressed as a pipeline of algorithms. These facts motivate
the choice of architectures that embody function unit clusters for high ILP and simple
communication mechanisms that permit chaining multiple processors to implement a
pipeline of algorithms. Perception processors that are general enough to be able to
execute multiple algorithms yet are small enough to conserve energy and die area would
be ideal. An empirical search for a processor architecture that satisfies the generality,
high IPC, and low resource utilization criteria led to an initial architecture [67] that was
successively refined. The end result of this evolutionary process is depicted in Figure
9.1.
The perception processor architecture consists of a set of clock gated function units,
a loop unit, three dual ported SRAMs, six address generators (one for each SRAM
port), local bypass paths between neighboring function units as well as a cluster wide
interconnect. A register file is conspicuously absent because the combination of compiler
controlled dataflow and a technique called array variable renaming makes a register file
unnecessary. Though none of the clusters described here need a register file, it is possible
to incorporate one into a function unit slot. Clusters can be configured to maximize the
performance of any particular application or set of applications. Typically there will be a
minimum number of integer ALUs as well as additional units that are more specialized.
Figure 9.1. Perception Processor Organization
Hardware descriptions for the cluster and the interconnect are automatically generated
by a cluster generator tool from a configuration description. Details may be found in
Section 9.8.
To understand the rationale behind this organization it is important to know that
typical stream oriented loop kernels found in perception algorithms may be split into
three components. They consist of control patterns, access patterns and compute patterns.
The control pattern is typically a set of nested for loops. Access patterns seen in these
algorithms are row and column walks of 2D arrays, vector accesses and more complex
patterns produced when simple array accesses are interleaved or software pipelined.
Compute patterns correspond to the dataflow between operators within the loop body.
For example, the compute pattern of a vector dot product is a multiply-accumulate flow
where a multiplier and an adder are cascaded and the adder's output is fed back as one of
its inputs.
The perception processor has programmable hardware resources that accelerate each
of the three patterns found in loops. The loop unit accelerates control patterns while
the address generators cover access patterns. The interconnect and the function units
together implement compute patterns. The execution cluster operates in a VLIW manner
under the control of horizontal microcode stored in the microcode SRAM. The mi-
crocode provides the opportunity to clock gate each resource individually on a cycle by
cycle basis leading to low energy consumption. Together, these features provide the mix
of high performance and hardware minimality that is crucial to perception applications.
9.1 Pipeline Structure
The perception processor architecture was designed to be able to emulate dataflows
that typically occur within custom ASIC accelerators. To this end, it has a simple
and rather different pipeline structure from a traditional processor. In sharp contrast to
the typical five-stage Instruction Fetch/Instruction Decode/Execute/Memory/Write Back
(IF/ID/EX/MEM/WB) pipeline of a MIPS like RISC processor, the perception processor
pipeline consists of just three stages: Fetch/Decode/Execute [46]. The number of actual
stages in the final execute phase depends on the function unit. The pipeline structure is
shown in Figure 9.2. Conspicuous departures from the RISC model include the absence
of register lookups in the decode stage and the lack of memory and write back stages.
In the perception processor, the microinstructions are fetched from a very wide in-
struction memory which is more than 200 bits wide. The decode stage is minimal – it is
limited to performing sign or zero extensions to constants, generating NOPs for function
units while the memory system is being reconfigured, and generating clock enable signals
for active function units. The wide instruction is then dispatched to a set of function units,
a loop unit, and a set of address generators. All resources, including the actual function
units and SRAM ports, appear as peers in the EX stage. The final output of all these peer
units can be transferred back to the input of the units by an interconnect network. The
latency of transfers depends on proximity. Nearest neighbors can be reached in the same
cycle while reaching a nonneighboring unit incurs an additional cycle of latency.
Figure 9.2. Pipeline Structure
In the MIPS RISC execution model, every single instruction implicitly encodes a
path through the pipeline. An integer instruction takes the IF/ID/EX/MEM/WB while a
floating point instruction takes a detour through the FPU in the EX stage. There is also an
implicit hardware controlled timing regime that dictates the relative cycle time at which
an instruction reaches each stage subject to dependences checked by interlocks.
In the perception processor, instructions do not encode any such implicit paths. The
instructions are called microcode because they serve the traditional horizontal microcode
function where individual bits directly control hardware functions like mux selects and
register write enables. To get the functionality implied by a MIPS instruction, the stage
by stage functionality of the MIPS instruction must be identified and the equivalent
microinstruction bits set in several successive microinstruction words. The advantage
of this lower level approach is that the hardware can be controlled in a fine grained
fashion, which is impossible in the MIPS case. For example, interconnect muxes may
be set to route data between selected function units and memory in a manner which
directly represents the dataflow graph of an algorithm, and data may be streamed through
the dynamically configured structure. The ability to reconfigure the structure through
microcode on a cycle by cycle basis means that the function units may be virtualized to
map flow-graphs that are too large to fit on the processor. This manifests itself as higher
initiation intervals and a larger number of temporary results that need to be saved or
rerouted when compared to a processor that has enough physical resources to allocate
to the entire flow-graph. Performance degrades gracefully under virtualization. The
perception processor supplants the instruction centric RISC execution model with a data
centric execution model, which lends it the flexibility to efficiently mimic the styles of
computation found in VLIW and vector processors as well as custom ASIC datapaths.
9.2 Instruction Format
To understand the following discussion on the internals of the perception processor
a quick introduction to the microinstruction format and the instruction fetch mechanism
is necessary. Figure 9.3 shows the constitution of a typical instruction word.

Figure 9.3. Microinstruction Format

While the instruction word width and format are fixed for a given configuration, they will
vary between configurations depending on the type and number of function units and
interconnect paths. The type field specifies whether the instruction is a normal VLIW
style instruction bundle or a reconfiguration command. Reconfiguration commands are
used to dynamically modify the working of the address generators and the loop unit.
The type field is followed by instruction packets for each function unit. If the type
field specifies a reconfiguration command, the instruction packet fields have alternate
interpretations. In that case, the decoder makes NOP packets for all the function units.
Each instruction packet consists of an opcode, mux selects for the A and B operands
selection muxes of a function unit and enable signals for the A and B input registers. The
registers latch new values only when their enable signals are asserted. These FU opcode
packets are followed by address generator operations each of which specify a load, store
or NOP and the address context register to be used for the load or store operation. These
are in turn followed by mux select signals for the interconnect muxes. Finally, there
are a set of constant fields to support constants used in the code. The constant fields
have different interpretations (e.g., one 16-bit constant, two 8-bit constants, four 4-bit
constants, etc.) depending on the context. The decoder can perform modifications like
sign or zero extension before the constants are presented to the function units. The
instruction memory has 1 cycle latency. The decoder adds another cycle of latency. This
2 cycle fetch delay is accounted for in branch instructions and the loop unit logic. Since
the actual bit positions of various fields depend on the configuration, the instruction
fetch logic and the decoder are automatically generated by a netlist generator tool based
on the processor configuration and bundling constraints.
9.3 Function Units
Function units follow the generic organization shown in Figure 9.4. Their operands
may be the output of their own final stage or the output of their left or right neighbor.
Forwarding the output of the unit to its input allows efficient execution of reduction
operators like Σ and Π and polynomial terms like Ax^n. Nearest neighbor connections
capitalize on the short delay of local wires to implement chained operations in a manner
similar to vector chaining.

Figure 9.4. Function Unit Architecture

In addition, an operand may also arrive over the interconnect,
in which case the transferred value is first latched in a register. The interconnect register
can also hold semistatic operands like constants used for scaling an operand stream.
Several types of function units are used in this study.
Integer ALUs perform common operations like add, subtract, xor, etc. ALUs also
have compare instructions, which not only return a value, but also set condition codes
local to the particular ALU. Conditional move operations may be predicated on the
condition codes set by previous compare instructions to route one of the two ALU inputs
to the output. This makes if-conversion and conditional data flows possible. All ALU
operations have single cycle latency.
FPUs support floating point add, subtract, multiply, compare and integer to floating
point convert operations. While the FPU is IEEE 754 compatible at its interfaces, for
multiply operations it internally uses a reduced precision of 13 bits of mantissa since the
target applications work well with this precision [66]. Reduced precision in the multiplier
contributes significant area and energy savings. All FPU operations have 7 cycle latency.
Multiply units support 32-bit integer multiply operations with 3 cycle latency.
In order to illustrate the advantages of fine grain pipeline control and modulo support
and to demonstrate the generality claims, no application specific instructions have been
added to the function units with two exceptions: the reduced precision of floating point
multiplies and byte select/merge instructions, which select an individual byte from a
word. The latter is similar to the pack/unpack instruction in Intel’s IA-64 architecture or
the AL/AH register fields in the IA-32 architecture. These instructions significantly ease
dealing with RGB images.
9.4 Compiler Controlled Dataflow
As CMOS technology scales, wires become slower relative to logic. The
cluster interconnect reflects the belief that future architectures will need to explicitly
address communication at the ISA level. Traditional architectures are based on implicit
communication. For example the MIPS instruction addi r1, r2, 10 depends on the
hardware to keep track of the last location where the operand r2 was present and transfer
it to where it is consumed. The location could be a renamed register or a pipeline stage.
In a wide issue clustered processor, it is advantageous to have operands to a function unit
be sourced from nearby function units to hide the effects of long wire delays. This is
possible if communication is explicitly orchestrated by the compiler. In the perception
processor all communication is explicitly orchestrated by the compiler. In the example
above, the compiler would pick a function unit to execute the addi instruction, transfer
the output of the function unit that last produced the value corresponding to the r2
operand to the A input of the chosen function unit, transfer the constant 10 to the B input
and schedule the actual addition to happen the cycle when both inputs are available.
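For the addi example above, the resulting schedule might look as follows (a hypothetical, Python-flavored rendering; the field names are illustrative and do not correspond to actual microinstruction bit fields):

    # Hypothetical cycle-by-cycle schedule for "addi r1, r2, 10" under
    # compiler controlled dataflow; every name below is illustrative.
    schedule = [
        # cycle 0: route the producing unit's output (the value of r2) to
        # ALU2's A input and the sign-extended constant 10 to its B input
        {"alu2.a_mux": "alu0.out", "alu2.a_en": True,
         "alu2.b_mux": "const16", "alu2.b_en": True, "const16": 10},
        # cycle 1: both operands are latched; issue the add on ALU2
        {"alu2.op": "add"},
        # cycle 2: the result in alu2.out may now be routed to consumers
        {"out_mux0": "alu2.out"},
    ]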
In the perception processor, pipeline registers at the interfaces of every unit including
function units and SRAM ports are named and accessible to software. Data is explicitly
transferred from the output pipeline register of a producer to the input registers of its
consumers. Unlike traditional architectures where pipelines shift under hardware control,
a compiler for the perception processor can use clock gating to control pipeline shifting
and thereby control the lifetime of values held in pipeline registers. This ensures that a
result will be alive till all its consumers have received a copy. This explicit management
of result lifetime and communication is called compiler controlled data flow.
Explicit communication leads to the ability to overlap communication with com-
putation with almost no hardware overhead. A significant number of bits in the wide
microinstruction word are devoted to controlling the interconnect. While the interconnect
can be controlled on a cycle by cycle basis, the compiler may elect to dedicate certain
interconnect muxes to flows on a longer term basis. For example, while adding two
vectors it is possible to dedicate separate interconnect muxes for the two operands for
the duration of the vector addition. The compiler also attempts operand isolation, i.e., it
tries to set unused muxes to states that reduce the amount of activity visible to the rest of
the circuitry leading to lower power consumption.
9.5 Interconnect
The local bypass muxes in each function unit are intended for fast, frequent com-
munication with the immediate function unit neighbors. The interconnect supports com-
munication with nonneighbor function units and SRAMs. Such communications have
a latency of one cycle. In a multicluster configuration, intercluster communication will
incur even larger delays. Values transferred via the interconnect to the input registers of
a function unit may be held indefinitely which is useful for caching common constants.
In modulo scheduled loops, each resource may be used only during one modulo
period. Reusing a resource later will render the loop body unschedulable. It is common
to find a lot of data reads early in the loop body and a few stores toward the end that
correspond to computed values graduating. Conflicts in the interconnect often make
modulo scheduling difficult. Partitioning the interconnect muxes by direction has the
potential to reduce scheduling conflicts. Incoming muxes transfer data between function
units and from SRAM ports to function units while outgoing muxes are dedicated to
transferring function unit outputs to SRAM write ports.
The high level architecture of the interconnect is remarkably simple. Assume an
organization with N incoming muxes and M outgoing muxes as shown in Figure 9.5.
Each incoming mux is logically a 16-to-1 mux which selects the output of one of the eight
function units, six SRAM ports or constant fields within the microinstruction. There is
some hierarchy in the actual circuit to optimize size and delay. There is currently an
unused port in the 16-to-1 mux which is reserved for inter cluster communication in
future multicluster configurations. As seen in Figure 9.4, there are two interconnect
pipeline registers at the input of each function unit. Half of the N muxes feed the A input
registers of function units. The muxes are connected to the input registers in round robin
manner. The other N/2 muxes serve the B input registers. The muxes are partitioned by
input register so that both operands of a function unit may be delivered from elsewhere
in the cluster without conflict. The M outgoing muxes are 8-to-1 muxes that connect
the function unit outputs to the SRAM write ports. Again, the muxes are connected in a
round robin manner to the SRAM data inputs. Upon specifying values for N and M, a
netlist generator tool developed as a part of this research generates Verilog HDL for the
processor and the interconnect.

Figure 9.5. Interconnect Architecture

While the simple round robin connections have worked well for the benchmarks used in this research, it is possible to manually specify any
custom topology for the interconnect. The choice of interconnect parameters depends on
the dataflow within the algorithm kernels and the number of computed results that need
to be retired per cycle. It is possible to implement compiler based instruction scheduling
algorithms that are topology neutral by describing communication paths as a weighted
graph structure, an approach which was used in an earlier version of this architecture
[67]. The actual processor configurations that are evaluated in Chapter 10 use four
incoming muxes and one outgoing mux.
It is possible that two operands need to be made available at a function unit as part of
a dataflow but interconnect conflicts make such a transfer impossible. In such cases it is
possible to transfer one operand in an earlier cycle and freeze its destination interconnect
register using clock gate control till both operands arrive and can be consumed. The
conflict can thus be resolved and a feasible schedule attained, but latency and loop
initiation interval increase somewhat as congestion increases. This method of staging
logically simultaneous transfers across separate cycles is called interconnect borrowing.
9.6 Memory System Architecture
Perception applications are stream oriented with a large number of 2D array and
vector accesses per elementary operation. These accesses typically occur within tight
loops with known bounds. Traditional processors have a limited number of load/store
ports, and this limits overall performance because of the high number of array accesses,
which is the reason DSPs traditionally partition their memory resources. A large number
of SRAM ports are required to efficiently feed data to function units. Increasing the
number of ports on a single SRAM or cache increases access time and power consump-
tion. This motivates the choice of multiple small software managed scratch SRAMs. It
is also possible to power down SRAMs that are not required. For low leakage processes
a large fraction of the energy consumption is in the sense amplifiers of the SRAM ports.
They consume approximately 50% of the processor energy in the 0.25µ implementation.
Mechanisms to efficiently use these expensive resources are important for both perfor-
mance and energy conservation.
Hardware performance counter based measurements on a MIPS R14K processor
showed that 32.5% (Geometric mean) of the executed instructions were loads/stores
for a set of perception benchmarks described later in Section 10.1. The high rate of
load/store operations combined with the regular array access patterns makes it possible
to overlap computation and SRAM access using hardware accelerators. A
large fraction of the remaining 67.5% execution component is array address calculations
that support load/store operations. Significant optimizations are possible by associating
each SRAM port with an address generator that deals with common access patterns of
streaming applications. The access patterns include 2D array and vector accesses in
modulo scheduled or software pipelines loops. Details may be found in Section 9.6.4.
Four new instructions are required to take advantage of the optimizations:
write context context index, src:
Reconfigure an address generator by transferring a description of an access pattern
into a context register within the memory system. This instruction when applied to the
loop unit similarly transfers the parameters of a loop into a loop context register.
load.context dest, context index and
store.context context index, src:
These are loads/stores that use the address generation mechanism. The context index
encoded into the immediate constant field of the instruction specifies the address gener-
ator to be used and the index of a context register within it.
push loop context index:
Let the memory system know that a new loop is starting.
9.6.1 Loop Unit
The index expressions of array accesses in a multilevel nested loop will depend
on some subset of the loop variables. The purpose of the loop unit is to compute and
maintain the loop variables required for address generation in the memory system while
the loop body itself is executed in the function units. Figure 9.6 shows a simplified
organization of the loop unit.

Figure 9.6. Loop Unit

The loop unit offers hardware support for modulo scheduling,
a software pipelining technique that offers high levels of loop performance in VLIW
architectures [81].
A brief introduction to some modulo scheduling terminology is necessary to under-
stand the functioning of the loop unit. Assume a loop body which takes N cycles to
execute. Modulo scheduling allows starting the execution of a new instance of this loop
body every II (Initiation Interval) cycles where II is less than N . A normal loop that
is not modulo scheduled may be considered a modulo scheduled loop with II = N. How II
is determined and the conditions that must be satisfied by the loop body are described
in [81]. The original loop body may be converted to a modulo scheduled loop body by
replicating instructions such that every instruction that was originally scheduled in cycle
n is replicated so that it also appears in all possible cycles (n + i × II) mod N, where i
is an integer. This has the effect of pasting a new copy of the loop body at intervals of
II cycles over the original loop body and wrapping around all instructions that appear
after cycle N. If a particular instruction is scheduled for cycle n, then ⌊n/II⌋ is called its
modulo period.
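The replication rule can be stated compactly (a Python sketch, where ops maps each cycle of the original N-cycle body to its instructions):

    def replicate(ops, N, II):
        # ops: {cycle: [instructions]} for the original N-cycle loop body.
        # Every instruction scheduled at cycle n also appears at
        # (n + i*II) mod N, so a new loop instance starts every II cycles.
        # An instruction originally at cycle n has modulo period n // II.
        kernel = {c: [] for c in range(N)}
        copies = (N + II - 1) // II
        for n, insts in ops.items():
            for i in range(copies):
                kernel[(n + i * II) % N].extend(insts)
        return kernel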
The compiler configures static parameters including II and loop count limits into
loop context registers. The corresponding dynamic values of the loop variables are
held in the loop counter register file. The only other piece of information required is
which loop body is currently pointed to by the program counter. A four-entry loop stack
captures this information. In this implementation, the loop unit can keep track of four
levels of loop nest at a time, which is sufficient for the benchmarks used in this research.
For larger loop nests the address expressions that depend on additional outer loops may
be computed in software as in a traditional processor. A four-entry loop context register
file holds the encoded start and end counts and the increment of up to four innermost
for loops. Loops are a resource that can be allocated and managed just like one would
allocate memory on a traditional architecture. The loop unit maintains a counter for
each loop nest and updates it periodically. It also modifies the program counter and
admits new loop bodies into the pipeline in the case of modulo loops. In that case it also
does additional manipulation of the loop counter to drain the pipeline correctly on loop
termination. On entering a new loop any previous loop is pushed on a stack, though its
counter value is still available for use by address generators. Loop parameters may be
loaded from memory. This permits modulo scheduling of loops whose loop counts are
not known at compile time. Appropriate loop parameters may be loaded from SRAM at
run time depending on the size of input data.
Just before starting a loop intensive section of code, loop parameters (perhaps dynam-
ically computed) are written into the context registers using write context instructions.
On entry into each loop body, a push loop instruction pushes the index of the context
register for that loop onto the stack. At any given moment, the top of the stack represents
the innermost loop that is being executed at that time. An II counter repeatedly counts
up to the initiation interval and then resets itself. Every II cycles, the loop increment
is added to the loop variable that is held in the loop counter register file. This is done
automatically. No loop increment instructions are required. When the end count of the
loop is reached, the innermost loop will have completed. The top entry is automatically
popped off the stack, and the process is repeated for the enclosing loop. Note from
Figure 9.6 that the registers and datapaths have small widths of 4 and 9 bits that cover
most common loops. These widths are parameters specified in the perception processor
configuration. The netlist generator tool can generate perception processors which use
any user specified widths. The choices in Figure 9.6 were sufficient to cover benchmarks
used in this research. Loops that are incompatible with a particular perception processor
configuration can always be done in software, so the reduced bit-widths save energy in
the common case.
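The runtime behavior described above can be summarized by a small behavioral model (an illustrative Python sketch, not the actual hardware description):

    class LoopUnit:
        def __init__(self):
            self.contexts = {}   # loop context register file
            self.counters = {}   # loop counter register file
            self.stack = []      # top of stack = innermost loop

        def write_context(self, idx, start, end, incr):
            self.contexts[idx] = (start, end, incr)

        def push_loop(self, idx):
            start, _, _ = self.contexts[idx]
            self.counters[idx] = start
            self.stack.append(idx)

        def on_ii_boundary(self):
            # Called every II cycles: increment the innermost loop
            # variable; on reaching the end count, pop the loop and
            # resume the enclosing one automatically.
            idx = self.stack[-1]
            start, end, incr = self.contexts[idx]
            self.counters[idx] += incr
            if self.counters[idx] >= end:
                self.stack.pop()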
9.6.2 Stream Address Generators
Most perception algorithms have a high ratio of array variable accesses to operators.
Multiple SRAM ports are essential for high throughput. The three dual ported SRAMs
in Figure 9.1 together have a read/write power consumption approximately equal to the
total function unit power consumption. Since each additional SRAM port introduces area
and energy overhead, utilizing them effectively is essential for performance. A previous
version of the architecture which used generic integer ALUs for address generation was
unable to maximize SRAM port utilization [67]. This is because generating an address
for a 2D array access involves multiply/shift and add operations which incurs multiple
cycles of latency in a traditional processor. When a tight loop body involves several
array accesses a significant fraction of the function units and registers will need to be
allocated for address calculation rather than computing results. Since address calculation
for arrays is a stylized operation, it is possible to design distributed semiautonomous
address generation hardware that frees up function unit resources for the actual result
computation and improves data delivery and throughput. In the perception processor,
dedicated address generators are attached to each SRAM port. They handle commonly
occurring address sequences like vector and strided access as well as 2D array accesses
including row and column walks. They can handle address generation under regular,
modulo and unrolled loops and can deal with special situations that occur when multiple
loop bodies are in flight simultaneously.
Before entering into a loop intensive section of code, the compiler uses write context
instructions to write descriptions of array access patterns into the address context register
files of address generators. For increased throughput the same access pattern may be
written into multiple address generators. Each address context includes the row and
element sizes, the base address as well as the loop counter indices that correspond to the
array’s loop variables. The loop counter indices may be used to retrieve the value of loop
count variables generated by the loop unit in Figure 9.6. In the current implementation
there are four context entries in each address generator corresponding to a total of 24
access patterns simultaneously. Since write context is a single cycle operation, dynamic
reconfiguration has very low overhead. The parameters for an array access pattern are
packed into a single 32-bit word with the base address at the least significant bit. So
arithmetic can be done on the packed word to update the base address dynamically.
Address computation for array and vector references issued to an SRAM port are
handled by its attached stream address generator. The operation of the address generator
depends on loop counters from the loop unit and array parameters like base address
and row size that are stored in its address context register file. Figure 9.7 shows the
internal structure of an address generator. To understand how this simple structure can
accomplish a variety of address calculations, it is essential to understand how a compiler
generates addresses for array references. Consider the 2D arrays declared and used in C
as shown in Figure 9.8.
Figure 9.7. Stream Address Generator

    struct Complex A[N][M];
    struct Complex B[N][K];
    ...
    for (i = 0; i < N; i++) { ...
        for (j = 0; j < M; j++) { ...
            t1 = A[i][j].imag; ...
            for (k = 0; k < K; k++) { ...
                t2 = B[i][k].real; ...

Figure 9.8. Loop Acceleration Example

To simplify the discussion, assume word oriented addressing. Let the size of the
Complex struct be denoted elem_size. Then the size of one row of A is row_size =
elem_size × M. If the offset of imag within the struct is 1 and the base address of A is
Base_A, then the base address of the imag field will be Base_imag = Base_A + 1. So
the address expression corresponding to the load into t1 is Base_imag + i × row_size +
j × elem_size, since C stores arrays in row major order. A vector is a single-dimensional
array, so its address expression is just a special case where row_size = 0. For more
complex index expressions of the form P × i + Q, the factors P and Q may be absorbed
into the row size and base address respectively. A column-walk of the form A[j][i] can
be evaluated similarly. By constraining the row and element sizes to be powers of two,
the address expression reduces to the form address = Base + ((i << x)|(j << y)).
For cases where row size cannot be a power of two, to help pack more data into the
scratch memory, row size may be picked as the sum of two powers of two and separate
expressions may be used to access the bulk of the array and the residue. For arrays
with n > 2 dimensions, the base address is repeatedly recalculated to account for n − 2
dimensions and the last two levels of loop nest are left to the hardware to deal with. Not
all array accesses need to use the same loop variables. In the example, the access of
B depends on i, k unlike A which depends on i, j. The address generator is capable of
picking the correct loop variables and plugging them into the address expression.
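The resulting computation can be modeled as follows (an illustrative Python sketch of the datapath in Figure 9.7; the context field names are chosen for exposition):

    def gen_address(ctx, counters):
        # ctx models one address context register: the base address, shift
        # amounts encoding the row and element sizes, and the indices of
        # the loop variables that this array access depends on.
        i = counters[ctx["row_var"]]
        j = counters[ctx["col_var"]]
        # address = Base + ((i << x) | (j << y)); the OR is exact because
        # the row and element sizes are constrained to powers of two.
        return ctx["base"] + ((i << ctx["x"]) | (j << ctx["y"]))

    # Example for A[i][j].imag with M = 16 and a two-word element size:
    # row_size = 32 words (x = 5), elem_size = 2 words (y = 1), and the
    # base address is Base_A + 1 (the offset of the imag field).
    ctx_A = {"base": 0x1001, "x": 5, "y": 1, "row_var": 0, "col_var": 1}
    addr = gen_address(ctx_A, [2, 3])   # Base_A + 1 + 2*32 + 3*2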
Each address generator has a designated partner ALU in the cluster with several
address generators possibly sharing the same partner. In cases where the address gener-
ator is not equipped to compute the array index function, it is possible to directly issue
an address computed by its partner ALU. The partner ALU can also compute address
contexts on the fly and reconfigure an address generator. The combination of an address
generator and its partner ALU can effectively deal with indirect access streams of the
type A[B[i]]. Address generation adds 1 cycle latency to load/store operations.
When the compiler emits code for an array access, the index of an address generator
and the index of an address context register within that generator are encoded into the
context index field of the load/store instruction. The selected address generator then uses
the context index field to retrieve the array parameters from the context register file as
shown in Figure 9.7. The retrieved context entry specifies the loop variables to be used
for calculating the address. The muxes at the top right of the figure use this information
to select the appropriate loop counters. The mux inputs are connected to the loop count
registers of the loop unit shown in Figure 9.6. The shifters then shift the selected loop
variables, and the result is OR-ed and added to the base address to generate an address.
To improve the processor’s cycle time, pipeline registers have been inserted just before
the final add operation.
Several special cases are handled in the address generator. It is common to unroll
loops by a small factor and software pipeline them for performance. In that case, instead
of using two loop variables, it is possible to use one loop variable and one unroll factor
to compute the address. The unroll factor is packed into the immediate field of the
instruction and selected in lieu of the loop variable using the upper 2-to-1 mux in Figure
9.7. When the access pattern is too complex to be handled by the address generator, the
lower 2-to-1 mux selects an address that is computed by an ALU. To handle vectors and
ALU generated addresses with one or zero loop variables respectively, the loop unit has
a special loop counter which is always zero.
9.6.3 Array Variable Renaming
Setting the modulo period field in load.context/store.context instructions to a nonzero
value unlocks a performance enhancing feature called Array Variable Renaming. Modulo
scheduling makes it possible to overlap the execution of multiple instances of the inner
loop body. Assume that the k loop from Figure 9.8 has a latency of 30 cycles and that
after satisfying resource conflicts and data dependences it is possible to start a new copy
of the loop body every 5 cycles. Then, up to 6 copies of the loop body could be in flight
through the execution pipeline. To get data dependences correct for new loop bodies, the
loop variable should be incremented every 5 cycles. However, when it is incremented,
old instances of the loop body that are in flight will get the wrong value and violate
dependences for load/store instructions that happen close to the end of the loop body.
The traditional solution is to use multiple copies of the loop variable in conjunction
with the VLIW equivalent of register-renaming – a rotating register file. Multiple address
calculations are performed, the appropriate values loaded into the register file and the
register file is rotated. For long latency loop bodies with short initiation intervals, this
leads to increased register pressure. The solution to this problem is to increment a single
copy of the loop variable every initiation interval and compensate for the increment in
older copies of the loop body which are in flight. The compensation factor, which is
really the modulo period, is encoded into the immediate field of load/store instructions.
It is subtracted from the loop variable’s value to cause dependences to resolve correctly.
This has the effect of rotating the array variable and letting a generic expression
like A[i][j] be rebound to separate addresses. Array variable renaming effectively
converts the entire scratch pad memory into a rotating register file with separate virtual
rotating registers for each array accessed in a loop. Array variable renaming is much
more powerful than register rotation, but it can also be used in conjunction with a rotating
register file. This could be useful in cases in which it is possible to custom design rotating
register files that have lower latency than the SRAM and address generator combination
used to implement array renaming. Such a combination of array renaming and register
rotation can capitalize on the flexibility provided by array renaming and the low latency
provided by a custom designed rotating register file. The perception processor does not
have an architected register file at all – it merely uses array variable renaming in the place
of register-renaming to achieve very high throughput at low power.
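The compensation itself is a single subtraction (Python sketch; the modulo period is a compile-time constant encoded in each load/store instruction):

    def effective_index(loop_var, modulo_period):
        # loop_var is the single shared copy, incremented every II cycles.
        # An instruction belonging to an older in-flight loop body
        # subtracts its modulo period to recover the value that was
        # current when its own iteration was initiated.
        return loop_var - modulo_period

For the k loop above, with II = 5, a load or store scheduled at cycle 27 of the loop body carries modulo period 27 // 5 = 5, so its index expression is evaluated with k - 5.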
9.6.4 Addressing Modes
The address generator can directly compute array references of the form A[i×P+Q][j×R+S].field and vector accesses when both loop variables come from nested loops, when
one loop has been unrolled, and more importantly when the inner loop has been modulo-
scheduled. For higher dimensional arrays, the base address is repeatedly recomputed
using an ALU, and the last two dimensions are handled by the address generator.
Another important access pattern is indirect access of the form A[B[i]]. This is a
common ingredient of neural network evaluation and can be used to implement bit-
reversed addressing for FFT. It is also a generic access pattern – any complex access
pattern can be precomputed and stored in B[] and used at runtime to access the data
in A[ ]. Vector indirect style accesses may be done by passing an ALU generated B[i]
address through the adder in Figure 9.7 thereby offsetting it with the base address of A[ ].
The ALU address can be computed, or it can be streamed into the ALU from SRAM
by another address generator. Using two address generators and an ALU, complicated
access patterns may be realized with high throughput. If the cost in terms of SRAM and
function unit usage becomes too high, the address generator may be extended for other
application specific access patterns. The stream address generator effectively converts
the scratch-pad memory into a vector register file that can operate over complex access
patterns and even interleave vectors for higher throughput. From an operational per-
spective, associating stream address generators with small scratch-pad memories unifies
vector and VLIW architectures.
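As a concrete instance of a precomputed pattern, the bit-reversed order used by FFTs fits the A[B[i]] template directly (an illustrative Python sketch):

    def bit_reversed(n_bits):
        # Precompute B[]: the bit-reversed index permutation for a
        # 2**n_bits point FFT.
        n = 1 << n_bits
        return [int(format(i, "0%db" % n_bits)[::-1], 2) for i in range(n)]

    B = bit_reversed(3)                  # [0, 4, 2, 6, 1, 5, 3, 7]
    A = [10, 11, 12, 13, 14, 15, 16, 17]
    # One address generator streams B[i]; the partner ALU offsets each
    # value with the base address of A; the resulting addresses drive a
    # second SRAM port to fetch A[B[i]].
    reordered = [A[B[i]] for i in range(len(B))]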
9.7 Compiler Controlled Clock Gating
In a traditional architecture, a function unit pipeline always shifts unless a stall
situation happens. Operands enter the pipeline, and results exit it under hardware control.
A distinguishing feature of the perception processor architecture is that a compiler can
manage pipeline activity on a cycle by cycle basis. Microinstructions contain an opcode
field for each function unit in the cluster. The fetch logic enables the pipeline shift
and clock signals of a function unit only if the corresponding field is not a NOP. It can
also generate a NOP when the opcode field is used for another purpose. The net result
is that a function unit pipeline makes progress only during cycles when operations are
issued to it and stalls by default. The scheme provides fine grain software control over
clock gating while not requiring additional bits in the instruction to enable or disable a
function unit. When the result of an N-cycle operation is required, but the function unit
is not used after that operation, dummy instructions are inserted by the compiler into
following instruction slots to flush out the required value. To avoid excessive power-line
noise a compiler may keep a function unit active even when it has nothing to compute.
The regular nature of modulo scheduled loops makes them good candidates for analytical
modeling and reduction of power-line noise [112].
Fine grain compiler directed pipeline control has two main purposes. First, the
compiler has explicit control over the lifetimes of values held in a pipeline unlike a
traditional architecture where values enter and exit the pipeline under hardware control
and only quantities held in architected registers may be explicitly managed. In the
perception processor, pipeline registers and the associated bypass paths may be managed
as if they were a small register file, and dataflows found in custom hardware can be
easily mimicked. Second, it lets the compiler control the amount of activity within a
cluster. Software control of dynamic energy consumption makes energy vs ILP trade-offs
possible. The resulting activity pattern can approximate the ideal condition where each
function unit has its own clock domain and runs with just the right frequency.
9.8 Design Flow
The hardware netlist for a perception processor is automatically generated from a
configuration description using a specially developed netlist compiler tool. The configu-
ration description is created manually based on an analysis of benchmarks. Of particular
importance to the analysis is the relative importance of various types of operators within
an algorithm. This determines the mix of function units incorporated into a perception
processor. Also important is the dataflow within loop bodies, which determines the
interconnect topology and size and number of SRAMs. It may be possible to perform this
analysis in a semiautomated manner in the future. Based on benchmark analysis an archi-
tect creates a configuration description expressed as a Python script. The configuration
script selects a set of function units from a library of components like ALUs, multipliers
and floating point units implemented using Verilog HDL and Synopsys module compiler
languages. Each function unit in the library is annotated with attributes like latency,
opcode width and names of input and output ports. Each function unit is provided a
name and a position in the eight slots available for function units. The architect also
selects the number of input and output muxes used to create the interconnect. Depending
on the type and number of function units and SRAMs the actual HDL code for the muxes
will be generated by the netlist compiler. The architect then specifies the topology of the
interconnect. This is done by specifying the names of the function units connected to
each of the input and output muxes.
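A configuration description of this kind might resemble the following (a hypothetical sketch; every name and field here is illustrative and is not the actual tool's input format):

    # Hypothetical cluster configuration of the kind the netlist
    # compiler consumes; all names and fields are illustrative.
    config = {
        "function_units": [              # eight slots are available
            {"slot": 0, "type": "ialu", "name": "alu0"},
            {"slot": 1, "type": "ialu", "name": "alu1"},
            {"slot": 2, "type": "imul", "name": "mul0", "latency": 3},
            {"slot": 3, "type": "fpu",  "name": "fpu0", "latency": 7},
        ],
        "srams": [{"name": "sram%d" % i, "ports": 2} for i in range(3)],
        "interconnect": {
            "incoming": 4,               # number of incoming muxes
            "outgoing": 1,               # number of outgoing muxes
            "in0": ["alu0", "alu1", "fpu0", "sram0.rd"],  # topology
        },
    }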
The architect also describes an instruction format in symbolic form. This is a tree
structure that defines the relative position of opcode bits for each function unit and
interconnect mux within a wide instruction word. Each field is then recursively split
into subfields. It is possible to define alternate interpretations for bitfields. For example,
the opcode slots of several function units may also be used to contain reconfiguration
information for the loop unit. A shared instruction type field in each instruction word
determines which of the interpretations should be used. The netlist compiler tool converts
the configuration description into the top level HDL description of a perception proces-
sor. It generates a small instruction decoder based on the instruction format specified by
the architect. It also creates the interconnect and its constituent muxes and connects the
ports of various hardware modules together to create a complete perception processor
implementation.
The generated processor netlist along with HDL descriptions of various components
is processed by a series of commercial ASIC design tools. The Synopsys design compiler
maps the HDL description into a gate level netlist. A suite of specially developed
gate level netlist processing scripts analyze the input and output connectivity of each
gate in the netlist to derive heuristic estimates for wire lengths. These scripts also
modify the netlist and insert an RC component on each wire. Each RC component
is named uniquely, and the wire length associated with each component is saved to a
text database. The modified netlist and a wrapper HDL design which instantiates the
processor, SRAMs, clock generator, self-checking routines, etc., are simulated using
Synopsys Nanosim, a transistor level Spice simulator. Spice transistor models for a
0.13µ CMOS process are also provided to Nanosim. Based on the saved wire lengths
and the resistance and capacitance of the lowest level metal layer, the resistance and
capacitance of each wire in the design are computed. A script then instructs the Nanosim
simulator at run time to annotate these computed values onto the RC elements that were
inserted previously. A test-bench then loads a microprogram binary into the instruction-
SRAM. Nanosim then performs a low level simulation of the entire circuit. It periodically
samples and records the supply current to a text database. The simulation repeatedly
executes the same microprogram. At the end of each execution, self-checking routines in
the test bench verify that the results present in the output SRAM match results that were
precomputed by running a C or Python implementation of the algorithm. Simultaneously,
a specially developed numerical integration program uses the supply current database to
compute power and energy consumption. When the average power consumption result
converges, the Nanosim simulation is terminated.
The configuration description written by the architect is also used as an input to the
microcode compiler so that the compiler knows the actual configuration of the pro-
cessor it is generating code for. The compiler translates a microprogram expressed
in a limited subset of Python into a microcode binary. It then configures a generic
perception processor simulator to represent the parameters specified in the configuration
description. Each microprogram file also includes an additional pure Python reference
implementation of the algorithm and some test data. The microcode binary is simulated
using the test data, and output vectors are generated and saved. The simulator then runs
the reference implementation of the algorithm and verifies that the simulation results
match the reference implementation. It then saves the output vectors in a form suitable
for use with the Verilog self-checking routines described previously. Another result of the
simulation is a log of read, write and idle cycles of each SRAM. The simulator uses this
log along with SRAM power consumption information provided by the CAD tool which
generated the SRAM macrocell to compute the energy consumption of each SRAM. The
SRAM power consumption is then added to the processor power consumption computed
using numerical integration of the Nanosim output database to arrive at the overall power
consumption.
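The SRAM portion of that computation is a weighted sum over the cycle log; a minimal sketch, with placeholder per-cycle energies standing in for the values the SRAM generator would report:

# Placeholder per-cycle energies (joules) for a dual-ported SRAM macrocell.
E_READ, E_WRITE, E_IDLE = 12e-12, 14e-12, 1e-12

def sram_energy(log):
    # log: counts of read, write and idle cycles from the simulator.
    return (log["read"] * E_READ +
            log["write"] * E_WRITE +
            log["idle"] * E_IDLE)

print("%.3e J" % sram_energy({"read": 5000, "write": 1200, "idle": 3800}))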
9.9 Programming Example
This section illustrates the operation of the perception processor using a simple kernel
that is mapped into microcode. The algorithm to multiply two 16 × 16 floating point
matrices is shown in Figure 9.9. The control pattern consists of three nested for
loops. Assuming that the matrices are stored in row major order, the inner product
computation will access array A along its row while B will be accessed along its column,
causing a base stride access pattern. The compute pattern consists of multiply accumulate
operations, which form the core of the inner product function.
Figure 9.10 outlines a simple custom hardware accelerator for this algorithm. Ad-
dress generator A fetches the rows of matrix A. Address generator B generates the base
stride pattern for the columns of matrix B. Corresponding rows and columns are fetched
and applied to the floating point multiplier. The output of the multiplier is accumulated
in a scratch register by the floating point adder. When an inner product sum is ready, it
is written to a result SRAM, which is not shown in the figure.
In theory, this simple pipeline could compute one inner product every 16 cycles.
However, the final accumulation of the inner product value creates a pipeline problem.
The floating point add takes 7 cycles and since the output is accumulated, a new product
def inner_product(A, B, row, col):
    sum = 0.0
    for i in range(0, 16):
        sum = sum + A[row][i] * B[i][col]
    return sum

def matrix_multiply(A, B, C):
    # C is the result matrix
    for i in range(0, 16):
        for j in range(0, 16):
            C[i][j] = inner_product(A, B, i, j)
Figure 9.9. Matrix Multiply Algorithm
Figure 9.10. Inner Product Accelerator
value can only be handled every 7 cycles. Hence each inner product takes 16 × 7
cycles. Interleaving the computation of 7 or more inner products relieves this bottleneck.
However, this interleaving complicates address generation. The additional functionality
required to fix this problem includes: a) address generator B needs to generate
multiple interleaved base-stride patterns; b) address generator A needs to hold each row
element long enough for all the interleaved inner products; and c) several scratch registers
are required to hold the intermediate sums. If the interleave factor is the same as the
latency of the floating point adder, no scratch registers are required. The output of the
adder may be fed back as an input and the intermediate sums will circulate through the
pipeline registers of the adder.
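This circulation can be checked with a small behavioral model. The sketch below is plain Python rather than microcode: a 7-entry list stands in for the adder's pipeline registers, and the assertion confirms that 7 interleaved inner products complete correctly with no scratch registers.

LATENCY = 7   # adder latency in cycles == interleave factor

def interleaved_inner_products(a_row, b_cols):
    # pipe stands in for the adder's pipeline registers; slot k always
    # holds the circulating partial sum of interleaved product k.
    pipe = [0.0] * LATENCY
    for i in range(16):               # walk along the row of A
        for k in range(LATENCY):      # one interleaved column of B per cycle
            pipe[k] = pipe[k] + a_row[i] * b_cols[k][i]
    return pipe                       # the 7 finished inner products

a_row = [float(i) for i in range(16)]
b_cols = [[float(i + k) for i in range(16)] for k in range(7)]
assert (interleaved_inner_products(a_row, b_cols) ==
        [sum(a * b for a, b in zip(a_row, col)) for col in b_cols])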
Compilers for high performance architectures attempt to approximate the dataflow in
the custom accelerator. In vector processors, vector chaining creates a similar dataflow
and reduction operators help alleviate some of the performance penalty caused by the
floating point accumulate operation. By selecting independent adds and multiplies, which
are ready for issue from its instruction window, an out of order processor will work some-
what like a vector processor that can be time sliced across several interleaved vectors. In
addition, a combination of software pipelining and branch prediction ensures that the
pipeline has as few wasted cycles as possible. Address generation will be handled by
generic ALUs which send computed addresses to available load/store ports. Some form
of register renaming will also be required to enable software pipelining to work well in
nontrivial kernels.
Figure 9.11 shows the cleaned up perception processor assembly code for the inter-
leaved inner product. For brevity the outer loops, which invoke the interleaved inner
product, are not shown. This code is capable of sustaining the same throughput (7 inner
products every 16 × 7 cycles) as the refined custom hardware accelerator. Performance
and energy efficiency are achieved by a combination of techniques.
The inner product loop i_loop is marked for hardware modulo loop acceleration, and
its parameters are configured into a free context in the loop unit. Two address contexts
A_ri and B_ic are allocated and the address generators attached to the input SRAM ports
are reconfigured. Both contexts are tied to i_loop. B_ic is set to generate a
column walk indexed by i_loop, with the starting offset specified in a constant field in
the load opcode. A_ri is set to access the matrix row by row in conjunction with an outer
loop. The address contexts effectively implement array variable renaming functions, a
fact which is not evident in the code.
On entering i_loop the previous loop is pushed on a stack, though its counter value is
still available for use by the address contexts, particularly A_ri. The new loop updates its
counter every 7 cycles and admits new loop bodies into the pipeline. This is not a branch
in a traditional sense and there is no branch penalty.
i_loop = LoopContext(start_count=0, end_count=15,
                     increment=1, II=7)

A_ri = AddressContext(port=inq.a_port,
                      loop0=row_loop, rowsize=16,
                      loop1=i_loop, base=0)

B_ic = AddressContext(port=inq.b_port,
                      loop0=i_loop, rowsize=16,
                      loop1=Constant, base=256)

for i in LOOP(i_loop):
    t0 = LOAD(fpu0.a_reg, A_ri)
    for k in range(0, 7):   # Will be unrolled 7x
        AT(t0 + k)
        t1 = LOAD(fpu0.b_reg, B_ic, loop1_constant=k)
        AT(t1)
        t2 = fpu0.mult(fpu0.a_reg, fpu0.b_reg)
        AT(t2)
        t3 = TRANSFER(fpu1.b_reg, fpu0)
        AT(t3)
        fpu1.add(fpu1, fpu1.b_reg)
Figure 9.11. Assembly Code for Interleaved Inner Product
Communication is explicit and happens via load/store instructions or via interfunc-
tion unit data transfers, both of which explicitly address pipeline registers. In the ex-
ample, A[r][i] and B[i][c] are allocated to pipeline registers fpu0.a_reg and fpu0.b_reg
respectively. In fact, it is more appropriate to say that B[i][c + k], where k refers to the
kth interleaved inner product, resides in fpu0.b_reg at time t0 + k. No scratch registers
are required for the sum. The intermediate sums are merely circulated through the long
latency FPU adder. This notion of allocating variables both in time and space is central
to programming the perception processor.
The return value of each opcode mnemonic is the relative time at which its result
is available. The AT pseudo op is a compile time directive that controls the relative
time step in which following instructions are executed. Dataflow is arranged by referring
to the producer of a value and the time step it is produced in. Such a reference will
be translated by the compiler into commands for the forwarding logic. More complex
programs are written as several independent execution streams. The streams are then
made to rendezvous at a particular cycle by adjusting the starting time of each stream.
The example shows that compile time pseudo ops can perform arithmetic on relative
times to ensure correct dataflow without the programmer needing to be aware of the
latencies of the actual hardware implementation.
The loop body for i_loop will consist of 7 inner loop bodies created by loop unrolling.
Each inner loop body before unrolling takes 18 cycles to execute. Since i_loop has been
specified to have an initiation interval of 7 cycles, a total of 3 i_loop bodies corresponding
to 21 of the original loop bodies will be in flight within the cluster at a time. It is the
modulo aware nature of the address generators that permits each of these loop bodies
to refer to array variables in a generic manner like A[r][i] and get the reference that is
appropriate for the value of r and i which were current at the time that loop body was
started. Without special purpose address generation, such high levels of ILP will not be
possible. A previous version of the architecture without modulo address generators had
limited ILP because generic function units and registers were used for address generation
[67].
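The in-flight count quoted above is just the loop body latency rounded up to a whole number of initiation intervals; a quick check of the arithmetic:

import math

body_latency, ii = 18, 7                 # cycles per inner body, initiation interval
bodies = math.ceil(body_latency / ii)    # i_loop bodies in flight
print(bodies, 7 * bodies)                # 3 bodies, 21 original loop bodies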
For this example, interleaving 7 inner products at a time results in two leftover
columns. They are handled by a loop similar to the one shown in Figure 9.11, except that
it will have more idle slots. The adder needs to be active all the time, but the multiplier
needs to work only 2 out of every 7 cycles. Since the multiplier pipeline will not shift
5 out of 7 cycles, the dynamic energy consumption resembles an ideal circuit where
the adder runs at full frequency and the multiplier runs at 2/7 of the frequency thereby
consuming less energy.
The overall effect is that the dataflow and throughput of the perception processor
matches the custom hardware but in a more programmable manner. The address gen-
erators transfer data between the SRAMs and execution units in a distributed and au-
tonomous manner similar to the custom accelerator in Figure 9.10. The output of the
multiplier is directly forwarded to the input of the adder. As in the case of the accelerator,
no scratch registers are used. The intermediate sums are circulated through the pipeline
registers in the adder. Altogether, the microcode and the interconnect provide a level of
programmability while retaining a level of hardware economy close to that of the ASIC.
CHAPTER 10
EVALUATION
The benefits of the perception processor architecture are tested on 10 benchmarks
that were chosen both for their perceived importance in future embedded systems and
for their algorithmic variety. In order to compare the approach to the competition,
four different implementations of the benchmarks are considered:
1. Software running on a 400 MHz Intel XScale processor. The XScale represents an
energy efficient embedded processor.
2. Software running on a 2.4 GHz Intel Pentium 4 processor. The Pentium 4 is
optimized for performance rather than energy efficiency since more efficient pro-
cessors cannot currently support real-time perception tasks such as speech recog-
nition.
3. A microcode implementation running on the perception processor.
4. Four of the benchmarks have been implemented as custom ASICs since ASICs
represent a high level of performance and energy efficiency that general purpose
processors are seldom able to match.
10.1 Benchmarks

The first two algorithms, called GAU and HMM and described in Chapter 4, are dominant
components of the Sphinx 3.2 speech recognizer. The next five algorithms named Row-
ley, Fleshtone, Erode, Dilate and Viola are components of the visual feature recognition
system described in Chapter 7. The last three algorithms, FFT, FIR and Rijndael,
are taken from the DSP and encryption domains. The DSP algorithms were
added to test the generality of our approach. DSP functions like FFT and FIR are
important components of speech recognition front ends and image processing algorithms.
Encryption is of increasing importance to secure embedded systems. Rowley, GAU, FFT
and Fleshtone are floating point intensive. The remaining benchmarks are integer only
computations. Some components of GAU, Rowley and Fleshtone may be vectorized
while the rest of the algorithms cannot. HMM is intensive in data dependent branches
which may be if-converted.
Several source level optimizations have been made to the software versions that run
on the Pentium and XScale to boost their performance as much as possible [66]. The
optimizations included hand unrolled loops, partial specialization of functions when
some arguments are known statically, replacing expensive functions with table lookups,
reshaping data structures for better cache locality and a variety of algorithm optimiza-
tions discussed in Chapters 5 and 7. No SIMD optimizations were made in order to
keep the comparison fair. The perception processor could use SIMD floating point units,
just like SSE on the Pentium, but widening datapaths makes isolating the impact of
architectural options like compiler controlled dataflow impossible. A brief description of
the benchmarks follows.
GAU and HMM represent Gaussian probability density evaluation and hidden Markov
model evaluation respectively. GAU occupies 57.5% and HMM consumes 41.5% of the
execution time of the Sphinx 3.2 speech recognition system. Both Gaussian distribu-
tions and hidden Markov models are components of most mature speech recognizers
[59, 111, 91]. GAU computes how closely a 10 ms frame of speech matches a known
Gaussian probability distribution. One input packet corresponds to evaluating a single
acoustic model state over 10 frames of a speech signal. A real-time recognizer needs to
process 600,000 invocations of the GAU algorithm every second. The HMM algorithm
performs a Viterbi search over a hidden Markov model corresponding to one model state.
One input packet to the HMM implementation consists of 32 five-state hidden Markov
models. While the GAU algorithm is entirely floating point, the HMM algorithm is
dominated by integer compare and select operations. Its average rate of invocation varies
significantly with context, but to guarantee real-time performance it is assumed in this
research that all HMM models are evaluated, thereby brute forcing a large component of
speech processing.
Rowley represents a neural network based visual feature detector [83]. In the face
recognizer a multilayer neural network is swept over 30 × 30 rectangular regions of
an image. Each individual neuron is evaluated by the function tanh(∑_{i=1}^{n} Weight_i ×
Image[Connection_i]). Neurons have multiple sizes for their fan-in (n), and each layer
depends on the preceding layer’s output. The software implementation developed for
this dissertation used hand unrolled, specialized versions of neuron evaluation functions
for each input size. Also, tanh() was implemented via table lookup whereas Rowley’s
original implementation used the tanh() function in the C library. This optimization
boosted the Pentium’s performance by a factor of 2.5. A 30 × 30 image as well as the
outputs of all the neurons are maintained within the perception processor. Depending on
the sizes of the neurons an input packet consisting of the weights and connections of 7
to 64 neurons is streamed through the perception processor. All computations involve
single precision floating point numbers.
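A sketch of the per-neuron computation, with a table-lookup tanh() of the kind described above; the table size and clamping range are assumptions of this sketch:

import math

# Hypothetical tanh lookup table: 1024 entries over [-4, 4], clamped outside.
TBL_SIZE, TBL_RANGE = 1024, 4.0
TANH_TBL = [math.tanh(-TBL_RANGE + i * 2 * TBL_RANGE / (TBL_SIZE - 1))
            for i in range(TBL_SIZE)]

def tanh_lut(x):
    x = max(-TBL_RANGE, min(TBL_RANGE, x))
    idx = int((x + TBL_RANGE) * (TBL_SIZE - 1) / (2 * TBL_RANGE))
    return TANH_TBL[idx]

def neuron(weights, connections, image):
    # tanh(sum_i Weight_i * Image[Connection_i]) for one neuron.
    acc = 0.0
    for w, c in zip(weights, connections):
        acc += w * image[c]
    return tanh_lut(acc)

image = [0.01 * i for i in range(900)]        # flattened 30 x 30 window
print(neuron([0.5, -0.25, 0.125], [10, 450, 899], image))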
Fleshtone represents a skin toning algorithm typically used as a preprocessing step
to find skin colored regions of an image so that a more sophisticated object detector like
the Rowley detector may be applied to it. The benchmark converts RGB pixels to another
color space and checks if the projected pixel falls in between two parabolic curves [90].
This algorithm represents a case that is difficult to vectorize since there are far more
floating point operators per pixel than the number of FPUs present in the cluster. This
necessitates multiple passes and saving of intermediate results. It also contains multiple
if statements in the body. Each input packet consists of a single raster line of a 320×200
24-bit color image. The output is a 320-entry bitmap whose elements are set where flesh
color is found.
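The per-pixel test has the following shape; the chromaticity projection and the parabola coefficients below are placeholders, not the published constants of the model [90]:

def project(r, g, b):
    # A toy projection into a two-dimensional color space.
    s = (r + g + b) or 1
    return r / s, g / s

def between_parabolas(x, y):
    upper = -4.0 * (x - 0.45) ** 2 + 0.40     # hypothetical upper curve
    lower = -2.0 * (x - 0.45) ** 2 + 0.30     # hypothetical lower curve
    return lower < y < upper

def fleshtone_row(pixels):
    # One raster line of (r, g, b) tuples -> 0/1 flags, one per pixel.
    return [1 if between_parabolas(*project(r, g, b)) else 0
            for (r, g, b) in pixels]

print(fleshtone_row([(200, 140, 120)] * 4 + [(20, 200, 30)] * 4))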
Erode and Dilate represent two operators from mathematical morphology that help
in image segmentation. Erode sweeps a 3 × 3 pixel filter over the bitmap produced by
Fleshtone and cuts away weakly connected regions, i.e., it blacks out pixels if all pixels
within the filter are not set. Dilate does the opposite: it sweeps a 5 × 5 pixel filter over
a bitmap and fills in pixels if any of the pixels are set. Fleshtone, Erode and Dilate are
used for image segmentation in a visual feature recognition system [65]. Erode works on
three raster lines and dilate works on five raster lines of a 320 × 200 image.
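A direct sketch of Erode on a flattened bitmap; Dilate is the dual, with a 5 × 5 window and an any-pixel-set test:

def erode(bitmap, w, h):
    # 3x3 erode: keep a pixel only if every pixel in its 3x3
    # neighbourhood is set; border pixels are cleared for simplicity.
    out = [0] * (w * h)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            if all(bitmap[(y + dy) * w + (x + dx)]
                   for dy in (-1, 0, 1) for dx in (-1, 0, 1)):
                out[y * w + x] = 1
    return out

# A 6x4 bitmap with one solid 3x3 block: only its centre survives.
w, h = 6, 4
bm = [0] * (w * h)
for y in range(1, 4):
    for x in range(1, 4):
        bm[y * w + x] = 1
print(erode(bm, w, h))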
Viola is a reimplementation of Viola and Jones' method of object detection based
on a well known machine learning algorithm known as AdaBoost [103]. The algorithm
relies on computing features or wavelets which are the weighted sum or difference of
rectangular regions within a 30 × 30 window into an image. The coordinate and weight
information for 100 features are maintained within the perception processor. Each input
packet contains a 30 × 30 pixel image. The output contains the evaluation of all 100
features over the 30 × 30 image.
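Each feature evaluation reduces to a weighted sum of rectangle sums; the tuple encoding of a feature used here is hypothetical:

def rect_sum(img, w, x0, y0, rw, rh):
    # Sum of the pixels in one rectangle of a flattened image of width w.
    return sum(img[(y0 + y) * w + (x0 + x)]
               for y in range(rh) for x in range(rw))

def feature(img, w, rects):
    # One feature: a weighted sum/difference of rectangular regions.
    # rects: list of (weight, x0, y0, rw, rh) tuples.
    return sum(wt * rect_sum(img, w, x0, y0, rw, rh)
               for (wt, x0, y0, rw, rh) in rects)

W = 30
img = [1] * (W * W)                          # constant 30 x 30 test image
# A two-rectangle edge feature: right half minus left half of a patch.
f = [(+1.0, 15, 10, 5, 10), (-1.0, 10, 10, 5, 10)]
print(feature(img, W, f))                    # 0 on a constant image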
FFT implements a 128 point complex to complex Fourier transform on floating point
data. The Fourier coefficients are maintained within the perception processor. Input and
output packets consist of 128 complex numbers where each complex number consists
of two single-precision floating point numbers. FFT represents a common algorithm for
which many DSP processors implement ISA extensions. FFT also represents a case that
causes bad interconnect conflicts on our architecture. Good performance depends on
the interconnect borrowing technique described in Section 9.5. The software version on
the Pentium is based on FFTW, a highly tuned FFT implementation which uses dynamic
programming techniques to adapt itself to the processor architecture [38]. The microcode
implementation on the other hand uses a simple radix-2 algorithm and no ISA extensions.
Since FFTW cannot be used on the XScale, the simple radix-2 algorithm is used instead.
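Radix-2 FFTs consume their inputs in bit-reversed order; as described in Section 10.4.3, the perception processor handles this with an indirection vector of bit-reversed indices kept in the scratch SRAM. A sketch of how such a vector can be generated:

def bit_reverse_vector(n):
    # Indirection vector of bit-reversed indices for an n-point FFT.
    bits = n.bit_length() - 1
    return [int(format(i, "0%db" % bits)[::-1], 2) for i in range(n)]

print(bit_reverse_vector(128)[:8])   # [0, 64, 32, 96, 16, 80, 48, 112]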
FIR is a 32 tap finite impulse response filter, a common primitive in DSP appli-
cations. Impulse response coefficients are maintained inside the perception processor.
Input packets of various sizes may be applied to the filter, which successively evaluates
each input and outputs one integer corresponding to every input word.
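A reference sketch of the filter in plain Python; the all-ones coefficients and the zero-initialized sample history are assumptions of the sketch, not properties of the benchmark:

TAPS = 32
coeffs = [1] * TAPS        # impulse response, resident in the processor

def fir(samples):
    # One integer output per input word; samples before the stream
    # are taken to be zero.
    out, history = [], [0] * TAPS
    for s in samples:
        history = [s] + history[:-1]         # shift in the new sample
        out.append(sum(c * h for c, h in zip(coeffs, history)))
    return out

print(fir([1, 2, 3, 4]))   # [1, 3, 6, 10] with all-ones coefficients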
Rijndael is the Advanced Encryption Standard. The particular version implemented
here uses 128 bit keys and works on 16 byte blocks [29]. Input blocks are 576 bytes long
to simulate network level encryption. The default maximum size of Internet packets
is 576 bytes. The key as well as the encryption S-boxes are maintained within the
perception processor.
10.2 Metrics
The trade-off between energy consumption and performance is a common modern
design choice. Increasing performance almost always involves increasing the energy
requirements. As a result, it is misleading to compare solely on the basis of either energy
or performance. This dilemma is even more meaningful for the real-time embedded
perception applications that are the driving force for this work. The ability to process
faster than real time simply means that power is being wasted. Therefore a common
tactic in such cases is to either reduce clock frequency, supply voltage, or both. The
fine grain scheduling capability of the perception processor also enables the work rate to
be scheduled, which is a more intuitive mechanism and achieves results similar to clock
frequency scaling.
An attractive and intuitive metric is to compare designs based on the energy expended
to perform work at some rate [21]. Gonzalez and Horowitz showed that Spec²/Watt, or
its inverse, the energy delay product, is a good metric of architectural merit [39]. Both
architecture and semiconductor process influence the energy delay product. Since the
feature size of the process, λ, has such a large impact it is necessary to normalize any
design comparison to the same process. The normalization techniques applied to the
results were described in Section 3.3.
The perception processor and the Pentium 4 are both implemented in 0.13µ CMOS
technology and their results need not be normalized. The XScale and the custom ASICs
are implemented using 0.18µ and 0.25µ technologies respectively, and their results are
normalized using this method to a 0.13µ technology. The metrics used for evaluating
the perception processor are: IPC, power, throughput, energy consumed to process each
input packet, energy delay product and ET².
10.3 Experimental Method
Hardware netlists for two different perception processor configurations were gener-
ated for this evaluation. They will henceforth be referred to as the integer cluster and
the floating point cluster. The integer cluster consists of four ALUs and two multiply
units; the remaining two slots are unused. The floating point cluster contains four ALUs
and four FPUs. All of the integer benchmarks except FIR and Viola would run equally
well on the floating point cluster. FIR and Viola require integer multiply operations.
The hardware for each configuration (the entire organization shown in Figure 9.1) was
generated. The input and scratch SRAMs are sized at 8 KB each and the output SRAM
is 2 KB in size. The design is simulated at the transistor level using Spice while running
the microcode for the benchmarks. The Spice simulation provides a supply current
waveform with one sample per 100 picoseconds. This information along with the
supply voltage is used to compute instantaneous power consumption. Then numerical
integration of power over time is performed to compute energy consumption.
The dual-ported SRAMs are macrocells generated by an SRAM generator tool and
simulating the entire SRAM array using Spice is not feasible. For the SRAMs each
read, write and idle cycle were logged. The normalized energy consumption was then
computed based on the read, write and idle current reported by the SRAM generator.
Each benchmark is run for several thousand cycles until the energy estimate converges.
This chapter assumes a framework similar to Figure 1.2 where a host processor and
memory controller combination transfers data into and out of the perception proces-
sor's local SRAMs. The perception processor operates only on data present in local
SRAM and has no means of accessing main memory. To isolate main memory system
power consumption and compare the merits of the processors in a fair manner, both the
perception processor and the general purpose processors are forced to repeatedly reuse
data which has already been transferred into on-chip memory. The host processor is not
simulated.
The function units are described in Verilog and the Synopsys module compiler lan-
guage. The overall cluster organization and interconnection between function units is
automatically generated by the compiler. The whole design is then synthesized to the
gate level and a clock tree is generated. The netlist is then annotated with heuristic worst
case RC wire loads assuming all routing happened on the lowest metal layer. The energy
measurements are therefore likely to be pessimistic. Exact measurements are extremely
sensitive to wire routing decisions, and as a result wire capacitance calculations were
based on the worst-case wiring layer. The microcode corresponding to the benchmark
is loaded into program memory and the Spice model is simulated in NanoSim, a com-
mercial VLSI tool with Spice-like accuracy. The circuits were originally designed for a
0.25µ CMOS process but were subsequently retargeted to a 0.13µ process [22, 23]. Only
the 0.13µ results are reported here.
The software version of each benchmark was compiled with the GNU GCC com-
piler using the O3 optimization level and run on a 2.4 GHz Intel Pentium 4 processor.
This system has been modified at the board level to permit measuring average current
consumed by the processor module using a digital oscilloscope and nonintrusive current
probe. Several million iterations of each benchmark algorithm were run with the same
input data to ensure that the input data always hits in the L1 Cache. So the L2 Cache and
memory system effects are isolated as much as possible and the measurement represents
core power. For the XScale system a similar approach is used except that software control
is used to turn off unnecessary activity and the difference between the quiescent state and
the computation is measured. This method could slightly inflate the processor power,
but measuring the core power alone is not technically feasible on this system due to
packaging constraints. The choice of both systems was based on the technical feasibility
of PCB modifications to permit measuring energy consumption.
Embedded processors like the XScale do not have floating point instructions that
are required for some of the benchmarks. Software emulated floating point will bloat
the energy delay product of the XScale and make a meaningful comparison impossible.
Therefore the comparison is done against an ideal XScale, which has FPUs that have the
same latency and energy consumption as an integer ALU. This is done by replacing
each floating point operator in the code with a corresponding integer operator. The
code is then run on a real XScale processor. Henceforth, the name XScale refers to the
idealized XScale implementation. Floating point units typically incur several times the
latency and power overheads of their integer counterparts. The results computed by the
algorithm after replacing floating point operators with integer operators are meaningless,
but the performance and energy consumption represent a lower bound for any real XScale
implementation with FPUs. This makes the XScale results look better than they really
are.
10.4 Results
The design goal of the perception processor was to achieve high performance for
perceptual algorithms at low power. For stream computations, a very important consid-
eration is whether a system has sufficient throughput to be able to process the data rate
in real time. Since dynamic energy consumption is directly proportional to operating
frequency, one method for achieving this goal is to exploit high levels of instruction level
parallelism for stylized applications without paying a high price in terms of hardware
complexity. The details of this approach were discussed in Chapter 3. Before the results
are presented it is important to note a few points.
1. With the exception of Figures 10.1 and 10.7, the Y axes of all graphs use a loga-
rithmic scale on account of the large range of data.
2. Energy, energy delay product, energy delay squared product and power numbers
in all the graphs in this chapter are normalized to a 0.13µ process. Since the
perception processor and the Pentium are both implemented in 0.13µ processes,
their normalized results correspond to the actual results. Only the XScale and
ASIC numbers are actually scaled.
3. In this chapter the terms average and mean refer to the geometric mean.
10.4.1 Instruction Level Parallelism
Figure 10.1 shows the IPC of the perception processor compared against the IPC
measured using native performance counters on an SGI R14K processor. The bench-
marks were compiled for the R14K using the highly optimizing SGI MIPSpro compiler
suite. The perception processor achieved a mean improvement in IPC of 3.3 times
over the sophisticated super-scalar out of order processor. Figure 10.1 also shows the
breakdown of IPC between execution units and the memory system. It may be seen
that a large fraction of the IPC improvement may be directly attributed to the memory
system, which can transfer data at a high rate into and out of the function units. This leads
to high function unit utilization and high IPC. Since each load/store instruction triggers
an address calculation operation, the two are counted as separate instructions.

Figure 10.1. IPC (bar chart: per-benchmark IPC of the R14K versus the perception processor, with the latter split into memory-system and execution-unit IPC)

Though
an address calculation is counted as a single instruction, it should be understood that it
does the equivalent of several shift, mask, and add operations on a regular processor as
explained in Section 9.6.2. The results clearly demonstrate that the design goal of high
throughput through ILP has been achieved.
10.4.2 Power Consumption
Figure 10.2 shows the process normalized steady state power consumption of the
different implementations.

Figure 10.2. Power Consumption (bar chart: per-benchmark power in watts for the XScale, Pentium 4, perception processor, and ASIC implementations)

It is seen that even though the perception processor harvests
high levels of ILP, its power consumption in the integer configuration is lower than the
single issue XScale embedded processor, and the power consumption of the floating
point configuration exceeds that of the XScale by at most 14.4%. It should be noted that
in reality, the XScale’s power consumption for the floating point benchmarks can never
be as low as the values shown in Figure 10.2. As mentioned in Section 10.3, for the
floating point applications, the experiments represent an ideal XScale processor where
a floating point operation consumes only as much power as its integer counterpart. An
XScale implementation with a floating point unit would likely consume more power for
the floating point benchmarks. To be fair to the competition, it is worth noting that the
XScale is significantly more general than the perception processor since it has a TLB,
caches and a memory controller. The benchmarks do not exercise the memory controller.
The perception processor lacks that level of generality, but possesses eight function units,
address generators, loop accelerators and scratch-pad memory, which are not present in
the XScale.
Both the Pentium and the perception processor exhibit significant variability in power
consumption depending on the application whereas the power consumed by the XScale
is relatively independent of the application. For example, among the floating point
algorithms run on the perception processor, GAU has the highest power consumption
of 0.757 W while Fleshtone has the lowest at 0.670 W. This corresponds to an 11.5% power
saving achieved through compiler controlled data flow and compiler controlled
clock gating. For the integer configuration the application dependent power variation is
even larger: there is a 27.9% power saving between HMM and FIR. In contrast,
the maximum application dependent power variation in the XScale occurs between
Rijndael and FIR, corresponding to a 4.6% power saving. The Pentium achieves a 17.1%
power difference between Fleshtone and HMM. The perception processor thus possesses
a superior ability to capitalize on application dependent power saving opportunities.
10.4.3 Throughput
Figure 10.3 shows the throughput of the perception processor, the Pentium 4 and
the XScale processors as well as ASIC implementations. Throughput is defined as the
number of input packets processed per second and the results shown in Figure 10.3 are
normalized to the throughput of the Pentium 4. The perception processor operating at
1 GHz outperforms the 2.4 GHz Pentium 4 by a factor of 1.75 (Geometric Mean). The
perception processor’s mean throughput is 41.4% of that of the ASIC implementations
(GAU, Rowley, FIR, Rijndael). This is severely skewed by the fact that the ASIC
implementations, particularly Rijndael, expend vastly more hardware resources than the
perception processor. This is evident from Figure 10.2, which shows that in the case of
Rijndael, the ASIC consumes more than twice the power of the perception processor.
For the set GAU, Rowley and FIR, the perception processor in fact achieves on average
84.6% of the throughput of the ASIC implementation.

Figure 10.3. Throughput Normalized to Pentium 4 Throughput (bar chart: per-benchmark throughput of the XScale, perception processor, and ASIC implementations relative to the Pentium 4)

These results clearly demonstrate
the benefit of the perception architecture to the problems posed by perceptual algorithms.
Two of the benchmarks demand further explanation. FFT is the only benchmark
where the Pentium outperforms the perception processor. This is due to the fact that
the version of FFT used on the Pentium is based on FFTW, one of the fastest FFT
libraries in existence. It uses a mixture of processor specific measurements and dy-
namic programming optimizations to adapt itself to the specific system it is run on.
The perception processor on the other hand uses a simple radix-2 algorithm as does
the XScale implementation. This is on account of the fact that FFTW is implemented
as a large C library and is difficult to reimplement manually in microcode without the
aid of a C compiler that targets the perception processor. XScale lacks the floating
point hardware to support FFTW. The radix-2 algorithm is not particularly well suited
for the perception processor since it causes bad interconnect conflicts that lead to too
high an initiation interval for the main loop. In spite of these adversities the perception
processor implementation achieves 64% of the performance of the Pentium at less than
half its clock frequency. DSP processors typically implement a bit-reversed address
space to improve the performance of FFT [42]. The main reason for the reasonable
FFT performance of the perception processor is that it uses hardware support for vector
indirect accesses to implement a bit-reversed addressing mode for this application. An
indirection vector that corresponds to bit-reversed array indices is kept and used from the
scratch SRAM.
The other outlier is Fleshtone, the benchmark on which the perception processor
performs the best. Though this is a relatively simple algorithm, it involves numerous
floating point operations. Since the number of operators far exceeds the number of
function units available on the perception processor, the dataflow graph of the algorithm
was split into several small subgraphs, and multiple passes were made over an input
packet (320 pixel raster line) to fully evaluate the algorithm. Numerous temporary values
are generated in the process, and these are stored in the SRAMs between successive
passes. The Pentium version on the other hand fully evaluates the algorithm on each pixel
before moving on to the next pixel in the input packet. The floating point register stack
in the x86 architecture is inadequate to capture the number of temporary results created.
This results in several unnecessary moves, exchanges, loads and stores of intermediate
values. The main loop body generated by GCC contains over 80 instructions and takes
more than 208 cycles on average per iteration. In the case of the perception processor,
compiler controlled dataflow reduces the number of temporaries and the SRAM memory
permits storage of a very large number of intermediate results – over 1600 values in six
passes. Ultimately, this leads to the perception processor outperforming the Pentium by
a factor of 6.4.
10.4.4 Energy Consumption
In battery powered systems, the energy consumed to complete a task is often a more
relevant metric than power. Circuit designers often have the ability to trade off power
for performance. Thus it is possible for a high power system, which rapidly completes a
task, to consume less energy than a low power system that steadily draws power for
an extended period to complete the same task. Battery life for mobile systems can
be extended by being energy efficient, not necessarily by being low power. Figures
10.2 and 10.3 showed that the perception processor has low power consumption and
high performance. This in turn translates to a high degree of energy efficiency. Fig-
ure 10.4 shows the per packet energy consumption of the perception processor and its
competition. While delivering 11.8 times the performance of the XScale processor, the
perception processor consumes 13.5 times less energy than the XScale for each input
packet. General purpose processors exact a high energy cost for their generality and
programmability when compared to ASICs. From the results in Figure 10.4 it is possible
to compute that on average the XScale consumes 79.3 times more energy per input packet
when compared to the ASIC implementations (GAU, Rowley, FIR, Rijndael). In sharp
contrast, the perception processor’s energy consumption is only five times larger than
that of the ASIC. The perception processor thus radically improves energy efficiency
while retaining a high level of generality and programmability.
10.4.5 Energy Delay Product
Though CMOS circuits often have the ability to trade energy for performance, it is
quite difficult to improve both energy and performance simultaneously. Gonzalez and
Horowitz argue that the process normalized energy delay product (EDP) or, alternately,
Spec²/Wattλ², which corresponds to the inverse of EDP, is a relatively implementa-
tion neutral metric [39]. They demonstrate that this metric causes the architectural
improvements that contribute the most to both performance and energy efficiency to
stand out. For example, their results demonstrate that pipelining is of fundamental
importance to processor performance and energy efficiency, but superscalar issue makes
a lesser contribution.

Figure 10.4. Process Normalized Energy Consumption (bar chart: per-benchmark energy in mJ per input packet for the XScale, Pentium 4, perception processor, and ASIC implementations)

Figure 10.5 shows the process normalized energy delay product
(EDP) of the four different designs. It may be seen that in spite of their radically different
architectures, the XScale’s EDP is within 31.4% of the EDP of the Pentium if we ignore
the outliers FFT and Fleshtone. The FFT result is different because the XScale uses a
simple radix-2 algorithm instead of the optimized FFTW library used on the Pentium.
The Fleshtone result reflects the fact that for this floating point benchmark, the XScale
is modeled as an ideal implementation. The floating point version of this algorithm has
a performance problem on the Pentium as explained in Section 10.4.3.
It is evident from Figure 10.5 that the perception processor has a radically better
EDP, which is often one or two orders of magnitude better than that of its competition.

Figure 10.5. Process Normalized Energy Delay Product (bar chart: per-benchmark EDP in J·10⁻⁹ s for the XScale, Pentium 4, perception processor, and ASIC implementations)

It is
particularly noteworthy that in the case of FFT where the perception processor achieved
only 64% of the throughput of the Pentium, it improves EDP by a factor of 24.5. This
may be largely attributed to the higher energy efficiency of the perception processor. The
perception processor on average improves on the EDP of the XScale by a factor of 159
and is only 12 times worse than the ASIC. The perception processor is thus able to bridge
the wide gap in EDP between CPUs and ASICs.
10.4.6 Energy Delay Squared Product
Martin, Nystroem and Penzes argue that ET² is a voltage independent metric that
is better than the energy delay product [64]. For reasons explained in Section 3.5 this
research uses ET² merely as a metric that favors performance at the cost of energy.
Figure 10.6 compares the ET² efficiency of the perception processor against its
competition. Since this metric favors performance over energy, in most cases the Pentium
outperforms the XScale, unlike the situation in Figure 10.5. The perception processor
outperforms the Pentium on average by a factor of 405 while it is 1869 times better than
the XScale. The ASIC is only 29 times better than the perception processor.
Figure 10.6. Process Normalized Energy Delay Squared Product (ET²) (bar chart: per-benchmark ET² in J·10⁻¹⁸ s² for the XScale, Pentium 4, perception processor, and ASIC implementations)
10.4.7 Clock Gating
Figure 10.7 shows the synergistic effect of applying clock gating to a cluster that
supports compiler controlled datapaths. Compiler controlled datapaths provide energy
reduction by decreasing datapath activity and avoiding register file and SRAM accesses.
To implement it, the load enable signal of each pipeline register should be controlled by
software. Since compiler controlled data flow demands circuits with software controlled
pipeline register enable signals, it is a trivial extension to clock gate pipeline registers
using the same signals. It is seen in the graph that on average this saves 39.5% power
when compared to the implementation without clock gating. These results are affected
by two factors: a) SRAM power adds a large constant factor in both cases, and b)
multicycle datapaths like the FPUs are not clock gated because of limitations of the CAD
tools. Further reduction is possible by clock gating the multicycle datapaths.
Figure 10.7. Impact of Clock Gating (bar chart: per-benchmark power in watts with and without clock gating)
10.4.8 The Cost of Generality
It could be argued that the perception processor achieves impressive power sav-
ings because it lacks the level of generality possessed by the Pentium or the XScale.
The perception processor is believed to be Turing complete since it has instructions
for integer arithmetic, comparisons, conditional moves, loads, stores and direct and
indirect branches. However, Turing completeness is no measure of the ability to execute
arbitrary programs efficiently. While it is possible to modify the perception processor
for efficiency in the general case by traditional means like adding caches and branch
prediction, consider the simpler alternative of using a perception processor to augment
a general purpose processor. The generic sections of perception applications run on a
host processor, and the perception specific algorithms run on the perception processor
attached to the host processor. How efficient could such an organization be?
Consider the case where the host processor is an XScale. This scenario represents a
complete system since the XScale contains its own memory controller. It is true that
additional interface circuits will be required between the XScale processor core, the
memory controller and the perception processor. However, such additional circuitry is
likely to be a very small portion of the hardware of the complete system and should not
affect the results presented here significantly. It is also the case that the XScale is ill
suited for this application since it consumes too much power for its performance level
and possesses too much generality. A low power DSP might be a better choice for a host
processor. But choosing an inefficient host processor makes the results presented in this
section very conservative.
Figure 10.2 shows that the process normalized peak power consumptions of the
XScale and the perception processor are 0.675 W and 0.757 W respectively. Consider a
chip multiprocessor called PP+ consisting of an XScale core and a perception processor
on the same die. PP+ will then have a peak power consumption of 1.4 W. To make the
results conservative assume that PP+ consumes 1.4 W of power for all the benchmarks
even though in reality the application specific power savings will be significant. Figure
10.8 shows the energy consumed by PP+ to process each input packet.

Figure 10.8. Energy Consumption of PP+ (bar chart: per-benchmark energy in mJ per input packet for the XScale, Pentium 4, PP+, and ASIC implementations)

It may be seen
that in spite of the addition of a host processor, PP+ has a significantly lower energy
consumption than the XScale and the Pentium. This is on account of the fact that energy
is the integral of power over time. Even though PP+ has a higher power consumption
than the XScale, because of its superior performance it is able to complete tasks faster
and thus consumes less energy. In particular PP+ consumes 5.5 and 53.6 times less
energy per packet than the XScale and the Pentium respectively. It is only a factor of
12.4 worse than the ASIC implementations.
Figure 10.9 shows the energy delay product of the PP+. Since the power consumption
of the PP+ is slightly larger than twice the power consumed by the perception processor,
the energy delay product is expected to be a scaled down version of Figure 10.5.

Figure 10.9. Energy Delay Product of PP+ (bar chart: per-benchmark EDP in J·10⁻⁹ s for the XScale, Pentium 4, PP+, and ASIC implementations)

This
is indeed the case with PP+ outperforming the XScale and the Pentium by factors of
64.1 and 93.6 respectively, while it underperforms the ASIC implementations by a factor
of 30. The results clearly demonstrate the benefit of using perception processors as
coprocessors to general purpose processors.
10.5 Summary
The architectural features of the perception processor enable it to provide 1.75 times
the throughput of a Pentium 4 while consuming 13.5 times less energy than an XScale
embedded processor. Its architectural efficiency allows it to reach 41.4% of the through-
put of the ASIC at five times the energy consumption of the ASIC – a small price for its
generality and programmability. Since the processor circuits were evaluated at the netlist
level and not laid out, rigorous area estimates were not made. Approximate estimates
show that the die area is dominated by the amount of SRAM used in the design, and
the function units and interconnect occupy only a small fraction of the overall area. For
typical high performance embedded systems, having adequate compute ability at a low
energy budget is the critical factor, not area. The microprograms for the benchmarks
discussed in this chapter took approximately 10 to 20 man-hours each to develop. The
effort required can be drastically reduced if a high level language compiler is developed.
In contrast, ASIC implementations of benchmarks like FFT and Fleshtone might take
several man-months of effort. Altogether these radical improvements suggest that in
cases where high performance, low design time and low energy consumption need to be
addressed simultaneously, the perception processor could be an attractive alternative.
CHAPTER 11
CONCLUSIONS
Natural human interfaces built on technologies like speech recognition, gesture recog-
nition, object detection and tracking are central to the widespread acceptance of future
embedded systems. The chances for today’s isolated embedded devices to develop
into tomorrow's ubiquitous computing environment also depend on services like secure
wireless networking, media processing and integration with visual and audio interfaces.
The levels of performance and power efficiency required to achieve these goals are orders
of magnitude beyond the ability of current embedded processors. Application specific
processor architectures can effectively solve some of these challenges.
The performance characteristics of a face recognition system based on well-known
algorithms and a leading research speech recognition system were analyzed. By recasting
these perception algorithms as well as DSP and encryption algorithms on to an archi-
tecture optimized for stream processing, high levels of ILP and energy efficiency were
demonstrated. The perception processor uses a combination of VLIW execution clusters,
compiler directed dataflow and clock gating, hardware support for modulo scheduling
and special purpose address generators to achieve high performance at low power for
perception algorithms. Operationally, the combination of stream address generators and
scratch-pad memories represents a unification of VLIW and vector styles of execution.
The perception processor is a fairly minimal, yet programmable hardware substrate that
can mimic the dataflow found in ASICs. It exceeds the throughput of a Pentium
4 by a factor of 1.75 with an energy delay product that is 159 times better than an XScale
embedded processor. Its energy delay product is just 12 times worse than that of an
ASIC implementation. This approach has a number of advantages:
1. Its energy-delay efficiency is close to what can be achieved by a custom ASIC.
2. The design cycle is extremely short when compared to an ASIC since it substitutes
circuit design with interconnect topology selection and microcode programming.
3. The perception processor architecture is simple and regular. Hardware netlists for
perception processor configurations are automatically generated. Once the netlist
generator and the basic architectural components are proven to be correct, percep-
tion processor configurations should be easier to implement correctly compared to
ASICs. The perception processor architecture provides very fine grain control over
hardware resources making work arounds for hardware problems and software bug
fixes easy.
4. Since applications are implemented in microcode, post deployment bug fixes are
trivial.
5. It retains a large amount of generality compared to an ASIC.
6. It is well suited for rapid automated generation of domain specific processors.
A larger set of applications needs to be analyzed in the future to ensure that the ar-
chitectural primitives of the perception processor have sufficient generality to cover the
perception domain comprehensively. Automated architecture exploration and application
analysis, programming language support for perceptual primitives and streaming, and
formal methods to ensure real-time response will be important directions for future
research.
It has been shown that fine-grained management of communication and storage re-
sources can improve performance and reduce energy consumption whereas simultane-
ously improving on both these axes using a traditional microprocessor approach has been
problematic. The perception processor is an attractive choice when performance, power
efficiency, programmability and rapid design cycles are important. For the first time,
sophisticated real-time perception applications appear to be possible within an energy
budget that is commensurate with the embedded space.
CHAPTER 12
FUTURE RESEARCH
The architecture of the perception processor presented in this dissertation gradually
evolved from analyzing and observing the characteristics of speech recognition and vi-
sion algorithms and trying to design ASICs and traditional processors to accelerate these
tasks. The design process has led to the realization that it may be possible to systemati-
cally derive power efficient high performance processors for a wider class of algorithms.
This chapter outlines possible directions for future extensions to the perception processor
architecture. In this chapter, the term stream processor refers to the extended version of
the architecture so as to clearly distinguish it from the perception processor presented in
Chapter 9.
The term stream processing refers to real-time computations on high bandwidth data
streams. Examples include link-level encryption in networks, video trans-coding and
compression of video streams. Perceptual algorithms tend to be stream oriented. An im-
portant direction for future research is the architecture of generic, high performance, low
power, stream processors that can accelerate both perception algorithms and streaming
algorithms from other domains.
Figure 12.1 shows an abstract representation of a stream function. It is a generaliza-
tion of the map(), reduce() and filter() list processing functions and list comprehensions
found in the Python and Haskell languages [101, 54]. Analogues exist in Lisp and similar
languages. It applies a side effect free function lambda_func() to arguments gathered
from a set of input variables and stores the result to a set of output variables. The input
and output variables may be scalars, vectors, multidimensional arrays or more complex
aggregates. The procedure input_iterator() is history sensitive. Each time it is invoked,
it returns a tuple consisting of input data gathered from the various input variables. The
StreamFunc(input_iterator, input_predicate,
           output_iterator, output_predicate,
           lambda_func) -> output_data

input_iterator() -> input_tuple
input_predicate(input_tuple) -> true|false
lambda_func(input_tuple) -> output_tuple
output_predicate(output_tuple) -> true|false
output_iterator(output_tuple)   /* Stores output_tuple */

Figure 12.1. Generic Stream Function
input_predicate() function examines the input tuple gathered by the iterator and decides
whether further processing is required. If further processing is required, lambda_func()
is used to transform the input tuple to an output tuple. The function output_predicate()
examines an output tuple and decides whether it needs to be saved. If the result needs to
be saved, the history sensitive output_iterator() procedure scatters the output tuple over
the output variables. Complex streaming algorithms may be expressed as the composi-
tion of several StreamFunc() instantiations with the outputs of earlier instances used as
the input of later instances. Some restrictions like constant dependence distance or flow
dependence may need to be imposed to map such functions onto stream processors with
limited on-chip memory.
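Because the abstraction generalizes Python's own list machinery, it can be stated directly as executable Python. The sketch below is a functional model of Figure 12.1, not a hardware description; the predicates and lambda in the usage example are purely illustrative.

def stream_func(input_iterator, input_predicate,
                output_iterator, output_predicate, lambda_func):
    # Functional model of the generic stream function of Figure 12.1.
    for input_tuple in input_iterator:
        if not input_predicate(input_tuple):
            continue                        # dropped at the input predicate
        output_tuple = lambda_func(input_tuple)
        if output_predicate(output_tuple):
            output_iterator(output_tuple)   # scattered to output variables

# Toy use: square the even elements of a vector, keeping results under 50.
src = [(i,) for i in range(10)]             # tuples gathered from a vector
out = []
stream_func(iter(src),
            lambda t: t[0] % 2 == 0,        # input predicate
            out.append,                     # history sensitive output iterator
            lambda t: t[0] < 50,            # output predicate
            lambda t: (t[0] * t[0],))       # side effect free lambda
print(out)                                  # [(0,), (4,), (16,), (36,)]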
The structure of StreamFunc() lends itself to a highly parallel hardware implemen-
tation. Figure 12.2 shows the logical organization of a generic stream processor. Its
architecture is reminiscent of a hydraulic system and fluid flow analogies apply to the
throughput of the system. The input iterator unit pumps or gathers data from a set of
SRAMs. The input predicate examines the data and either passes it to the execution
cluster or drops it. The execution cluster constantly transforms the data being pumped
into it. The output predicate then examines the transformed results and either drops it
or passes it on to the output iterator which saves it to output memory. The structure is
highly parallel and capable of sustaining high throughput. The gathering, transformation
and scattering of data are staged under the control of microcode.
Figure 12.2. Stream Processor
The perception processor described in Chapter 9 is less generic when compared to
this stream processor. The input and output iterator functionality is provided by the
Loop Unit and the Address Generators, but they are limited to accelerating simple nested
for loops as well as array and vector accesses. A stream processor needs high perfor-
mance but generic mechanisms for implementing more complex loop nests and data
access patterns. The perception processor does not implement input or output predicates
though conditional moves in the execution cluster permit selection of alternative results.
In the perception processor hardware acceleration is limited to lambda functions that
correspond to the loop bodies of modulo schedulable loops. Other types of code may
be used, but with no significant advantage over what a normal VLIW processor might
provide. A generic stream processor may need to support complex lambda functions that
involve conditional execution and hardware acceleration for scheduling regimes other
than modulo scheduling. Like the perception processor, the stream processor will also
need to behave like a normal processor when operating outside the stream function so as
to efficiently implement loop prologues, epilogues and assorted processing that does not
fit the stream function model.
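As a hypothetical illustration of this restriction, the following Python loop emulates an
output predicate in the style of a branch-free, modulo-schedulable loop body: both
outcomes of every iteration are computed and a conditional-move-like selection picks
one, so the loop body contains no control flow.

    def clipped_accumulate(values, limit):
        # Hypothetical example; both outcomes are computed each iteration
        # and the conditional expression selects one, as a conditional
        # move would in the execution cluster.
        acc = 0
        for v in values:
            candidate = acc + v
            acc = candidate if candidate <= limit else acc
        return acc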
Research into scheduling algorithms that can produce good mappings of stream
functions onto stream processors with a specified configuration will be important both
from a code generation perspective and for automated architecture exploration. Such
algorithms will need to optimize for both power and performance while ensuring that
parameters like supply current variation meet design constraints. Algorithms that split
and compose complex stream functions, expressed as combinations of basic stream
functions, to make the best use of the limited function units and storage resources of a
particular stream processor configuration will also be important.
The structure of perception applications is suitable for a pipeline of perception pro-
cessors. Stream processors should support more complex communication and synchro-
nization modes. Chapters 5 and 6 indicated that DRAM bandwidth reservation or in-
dependent DRAM buses for individual algorithmic phases may be required to ensure
adequate bandwidth for perception applications. Chapter 3 explained that the IPC im-
provement provided by thread level parallelism can be an important source of power
savings. Together these factors indicate that research into chip multiprocessors con-
sisting of clusters of stream and RISC processors, a stream optimized interconnect and
multiple DRAM buses could be extremely beneficial. Finally, tools that characterize the
global dataflow within complex applications, tools that refactor applications to ease their
mapping onto heterogeneous chip multiprocessors, and programming language support
for streams could all be important directions for future research.
REFERENCES
[1] Cognex Inc. http://www.cognex.com/, 2004.
[2] Coreco Inc. http://www.coreco.com/, 2004.
[3] AARTS, B., BARRETEAU, M., BODIN, F., BRINKHAUS, P., CHAMSKI, Z., CHARLES, H.-P., EISENBEIS, C., GURD, J. R., HOOGERBRUGGE, J., HU, P., JALBY, W., KNIJNENBURG, P. M. W., O'BOYLE, M. F. P., ROHOU, E., SAKELLARIOU, R., SCHEPERS, H., SEZNEC, A., STOHR, E., VERHOEVEN, M., AND WIJSHOFF, H. A. G. OCEANS: Optimizing compilers for embedded applications. In European Conference on Parallel Processing (1997), pp. 1351–1356.
[4] ABNOUS, A., SENO, K., ICHIKAWA, Y., WAN, M., AND RABAEY, J. M. Evaluation of a low-power reconfigurable DSP architecture. In IPPS/SPDP Workshops (1998), pp. 55–60.
[5] ADVANCED MICRO DEVICES, INC. AMD Athlon Processor x86 Code Optimization Guide, k ed., Feb. 2002.
[6] AGARAM, K., KECKLER, S. W., AND BURGER, D. A characterization of speech recognition on modern computer systems. In Proceedings of the 4th IEEE Workshop on Workload Characterization (Dec. 2001).
[7] AKTURAN, C., AND JACOME, M. F. FDRA: A software-pipelining algorithm for embedded VLIW processors. In Proceedings of the 13th International Symposium on System Synthesis (2000), pp. 34–40.
[8] AKTURAN, C., AND JACOME, M. F. CALiBeR: A software pipelining algorithm for clustered embedded VLIW processors. In Proceedings of the IEEE/ACM International Conference on Computer Aided Design (2001), pp. 112–118.
[9] ALNUWEIRI, H. M., AND PRASANNA, V. K. Parallel architectures and algorithms for image component labelling. IEEE Transactions on Pattern Analysis and Machine Intelligence 14, 10 (Oct. 1992), 1014–1034.
[10] ANANTHARAMAN, T., AND BISIANI, R. A hardware accelerator for speech recognition algorithms. In Proceedings of the 13th International Symposium on Computer Architecture (June 1986).
[11] ASANOVIC, K. The Computer Engineering Handbook. CRC Press, Dec. 2001, ch. Vector Processors.
[12] ATHAS, W., YOUNGS, L., AND REINHART, A. Compact models for estimating microprocessor frequency and power. In Proceedings of the 2002 International Symposium on Low Power Electronics and Design (2002), ACM Press, pp. 313–318.
[13] BENEDETTI, A., AND PERONA, P. A novel system architecture for real-time low-level vision. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS) (1999), pp. 500–503.
[14] BERTRAN, A., YU, H., AND SACCHETTO, P. Face detection project report. http://ise.stanford.edu/2002projects/ee368/Project/reports/ee368group17.pdf, 2002.
[15] BOAHEN, K. Retinomorphic chips that see quadruple images. In Microelectronics for Neural, Fuzzy and Bio-Inspired Systems, 1999. MicroNeuro '99 (1999), pp. 12–20.
[16] BONA, A., SAMI, M., SCIUTO, D., SILVANO, C., ZACCARIA, V., AND ZAFALON, R. Energy estimation and optimization of embedded VLIW processors based on instruction clustering.
[17] BROOKS, D., TIWARI, V., AND MARTONOSI, M. Wattch: A framework for architectural-level power analysis and optimizations. In ISCA (2000), pp. 83–94.
[18] BUDIU, M., AND GOLDSTEIN, S. C. Fast compilation for pipelined reconfigurable fabrics. In ACM/SIGDA International Symposium on Field Programmable Gate Arrays (Monterey, CA, 1999), S. Kaptanoglu and S. Trimberger, Eds., ACM Press, pp. 195–205.
[19] BURGER, D., AND AUSTIN, T. M. The SimpleScalar tool set, version 2.0. SIGARCH Computer Architecture News 25, 3 (1997), 13–25.
[20] CALLAHAN, T., AND WAWRZYNEK, J. Adapting software pipelining for reconfigurable computing. In Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES) (San Jose, CA, 2000), ACM.
[21] CAMPBELL, M. Evaluating ASIC, DSP, and RISC architectures for embedded applications. In Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Embedded Systems (1998), Springer-Verlag, pp. 261–265.
[22] CAO, Y., SATO, T., SYLVESTER, D., ORSHANSKY, M., AND HU, C. New paradigm of predictive MOSFET and interconnect modeling for early circuit design. In Proceedings of the IEEE Custom Integrated Circuits Conference (CICC) (June 2000), pp. 201–204.
[23] CAO, Y., SATO, T., SYLVESTER, D., ORSHANSKY, M., AND HU, C. Predictive technology model. http://www-device.eecs.berkeley.edu/~ptm, 2002.
[24] CAT, H. H., EBLE, J. C., WILLS, D. S., DE, V. K., BROOKE, M., AND JOKERST, N. M. Low power opportunities for a SIMD VLSI architecture incorporating integrated optoelectronic devices. In Proceedings of GoMAC (Mar. 1996).
[25] CONNELL, J. Face finding. http://www.research.ibm.com/ecvg/jhc_proj/faces.html, June 2002.
[26] CONTE, T. M., DUBEY, P. K., JENNINGS, M. D., LEE, R. B., PELEG, A., RATHNAM, S., SCHLANSKER, M. S., SONG, P., AND WOLFE, A. Challenges to combining general-purpose and multimedia processors. IEEE Computer 30, 12 (1997), 33–37.
[27] CORREALE, JR., A. Overview of the power minimization techniques employed in the IBM PowerPC 4xx embedded controllers. In Proceedings of the 1995 International Symposium on Low Power Design (1995), ACM Press, pp. 75–80.
[28] BOLME, D., BEVERIDGE, R., TEIXEIRA, M., AND DRAPER, B. The CSU face identification evaluation system: Its purpose, features and structure. In International Conference on Vision Systems (April 2003), pp. 304–311.
[29] DAEMEN, J., AND RIJMEN, V. The block cipher Rijndael. Smart Card Research and Applications, LNCS 1820 (2000), 288–296.
[30] PALLETT, D., FISCUS, J. G., AND PRZYBOCKI, M. A. 1996 preliminary broadcast news benchmark tests. In Proceedings of the 1997 DARPA Speech Recognition Workshop (Feb. 1997).
[31] DEHON, A. DPGA-coupled microprocessors: Commodity ICs for the early 21st century. In IEEE Workshop on FPGAs for Custom Computing Machines (Los Alamitos, CA, 1994), D. A. Buell and K. L. Pocek, Eds., IEEE Computer Society Press, pp. 31–39.
[32] DELANEY, B., JAYANT, N., HANS, M., SIMUNIC, T., AND ACQUAVIVA, A. A low-power, fixed-point front-end feature extraction for a distributed speech recognition system. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (2002).
[33] ECKSTEIN, E., AND KRALL, A. Minimizing cost of local variables access for DSP-processors. In LCTES'99 Workshop on Languages, Compilers and Tools for Embedded Systems (Atlanta, 1999), Y. A. Liu and R. Wilhelm, Eds., vol. 34(7), pp. 20–27.
[34] FANG, W.-C. A system-on-chip design of a low-power smart vision system. In Proceedings of the IEEE Workshop on Signal Processing Systems (1998), pp. 63–72.
[35] FARABOSCHI, P., BROWN, G., FISHER, J. A., DESOLI, G., AND HOMEWOOD, F. Lx: A technology platform for customizable VLIW embedded processing. In The 27th Annual International Symposium on Computer Architecture (New York, NY, USA, 2000), ACM Press, pp. 203–213.
[36] FARBER, P., AND ASANOVIC, K. Parallel neural network training on Multi-Spert. In Proceedings of the Third IEEE International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP) (Dec. 1997).
[37] FERRETTI, M. Multimedia extensions in super-pipelined microarchitectures. A new case for SIMD processing? In Fifth IEEE International Workshop on Computer Architectures for Machine Perception (2000), pp. 249–258.
[38] FRIGO, M., AND JOHNSON, S. G. FFTW: An adaptive software architecture for the FFT. In Proc. IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing (Seattle, WA, May 1998), vol. 3, pp. 1381–1384.
[39] GONZALEZ, R., AND HOROWITZ, M. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits 31, 9 (September 1996), 1277–1284.
[40] GONZALEZ, R. E. Xtensa: A configurable and extensible processor. IEEE Micro 20, 2 (March 2000), 60–70.
[41] GOWAN, M. K., BIRO, L. L., AND JACKSON, D. B. Power considerations in the design of the Alpha 21264 microprocessor. In Design Automation Conference (1998), pp. 726–731.
[42] GRADY, T. Bit-reversed addressing in C on the C3x. In TMS320 DSP Designer's Notebook, vol. SPRA204. Texas Instruments, 1992.
[43] HAGER, G. D., AND TOYAMA, K. X vision: A portable substrate for real-time vision applications. Computer Vision and Image Understanding: CVIU 69, 1 (1998), 23–37.
[44] HAMMERSTROM, D. A VLSI architecture for high-performance, low-cost, on-chip learning. In International Joint Conference on Neural Networks (1990), pp. 537–544.
[45] HARRISON, R. R. An Analog VLSI Motion Sensor Based on the Fly Visual System. PhD thesis, California Institute of Technology, May 2000.
[46] HENNESSY, J., AND PATTERSON, D. Computer Architecture: A Quantitative Approach, 3rd ed. Morgan Kaufmann, 2002.
[47] HOOGERBRUGGE, J., AND AUGUSTEIJN, L. Instruction scheduling for TriMedia. Journal of Instruction-Level Parallelism 1, 1 (Feb. 1999).
[48] HOOGERBRUGGE, J., CORPORAAL, H., AND MULDER, H. MOVE: A framework for high-performance processor design. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (1991), ACM Press, pp. 692–701.
[49] HUANG, X., ALLEVA, F., HON, H.-W., HWANG, M.-Y., LEE, K.-F., AND ROSENFELD, R. The SPHINX-II speech recognition system: An overview. Computer Speech and Language 7, 2 (1993), 137–148.
[50] INTEL CORPORATION. Using streaming SIMD extensions 2 (SSE2) to evaluate hidden Markov model with Viterbi decoding. Tech. Rep. AP-946, Intel Corporation, 2000.
[51] INTEL CORPORATION. Intel Pentium 4 Processor Optimization Reference Manual, 2002.
[52] INTEL CORPORATION. Open source computer vision library. http://www.intel.com/research/mrl/research/opencv/, 2002.
[53] JOHNSON, M. C., SOMASEKHAR, D., AND ROY, K. Leakage control with efficient use of transistor stacks in single threshold CMOS. In Proceedings of the 36th ACM/IEEE Design Automation Conference (1999), ACM Press, pp. 442–445.
[54] JONES, S. P. Haskell 98 Language and Libraries. Cambridge University Press, Cambridge, UK, 2003.
[55] JOSHI, S. M. Some fast speech processing algorithms using Altivec technology. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (Mar. 1999), pp. 2135–2138.
[56] KARL, W. Some design aspects for VLIW architectures exploiting fine-grained parallelism. In Parallel Architectures and Languages Europe (1993), pp. 582–599.
[57] KLEIHORST, R., ABBO, A., VAN DER AVOIRD, A., OP DE BEECK, M., SEVAT, L., WIELAGE, P., VAN VEEN, R., AND VAN HERTEN, H. Xetal: A low-power high-performance smart camera processor. In The IEEE International Symposium on Circuits and Systems (ISCAS) (2001), pp. 215–218.
[58] KRASHINSKY, R. Microprocessor energy characterization and optimization through fast, accurate, and flexible simulation. Master's thesis, Massachusetts Institute of Technology, May 2001.
[59] LAI, C., LU, S.-L., AND ZHAO, Q. Performance analysis of speech recognition software. In Proceedings of the Fifth Workshop on Computer Architecture Evaluation using Commercial Workloads (Feb. 2002).
[60] LAPINSKII, V., JACOME, M., AND DE VECIANA, G. Application-specific clustered VLIW datapaths: Early exploration on a parameterized design space. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 21, 8 (Aug. 2002), 889–903.
[61] LEE, C., LEE, J. K., HWANG, T., AND TSAI, S.-C. Compiler optimization on instruction scheduling for low power. In Proceedings of the 13th International Symposium on System Synthesis (ISSS'00) (2000), IEEE Computer Society, p. 55.
[62] LEE, W., BARUA, R., FRANK, M., SRIKRISHNA, D., BABB, J., SARKAR, V., AND AMARASINGHE, S. Space-time scheduling of instruction-level parallelism on a Raw machine. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems (1998), ACM Press, pp. 46–57.
[63] LEUPERS, R. Instruction scheduling for clustered VLIW DSPs. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT) (Oct. 2000), pp. 291–300.
[64] MARTIN, A. J., NYSTROEM, M., AND PENZES, P. ET2: A metric for time and energy efficiency of computation. Tech. Rep. CaltechCSTR:2001.007, Caltech Computer Science, 2001.
[65] MATHEW, B., DAVIS, A., AND EVANS, R. A characterization of visual feature recognition. In Proceedings of the IEEE 6th Annual Workshop on Workload Characterization (WWC-6) (October 2003), pp. 3–11.
[66] MATHEW, B., DAVIS, A., AND FANG, Z. A low-power accelerator for the Sphinx 3 speech recognition system. In Proceedings of the International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES '03) (October 2003), pp. 210–219.
[67] MATHEW, B., DAVIS, A., AND IBRAHIM, A. Perception coprocessors for embedded systems. In Proceedings of the Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia) (October 2003), pp. 109–116.
[68] MCVOY, L. W., AND STAELIN, C. lmbench: Portable tools for performance analysis. In USENIX Annual Technical Conference (1996), pp. 279–294.
[69] MEMIK, S., BOZORGZADEH, E., KASTNER, R., AND SARRAFZADEH, M. SPS: A strategically programmable system. In Proceedings of the Reconfigurable Architectures Workshop (RAW) (Apr. 2001).
[70] MEMIK, S. O., BOZORGZADEH, E., KASTNER, R., AND SARRAFZADEH, M. A super-scheduler for embedded reconfigurable systems. In Proceedings of the International Conference on Computer-Aided Design (ICCAD) (Nov. 2001), p. 391.
[71] MIPS TECHNOLOGIES, INC. MIPS R4000 Microprocessor User's Manual, second ed., April 1993.
[72] MODULE RESEARCH CENTER. NeuroMatrix NM6403 digital signal processor. Tech. Rep. 431282.001D2, Module Research Center, 2000.
[73] MORETTO, P. Mapping of speech front-end signal processing to high performance vector architectures. Tech. Rep. TR-95-063, International Computer Science Institute, University of California at Berkeley, 1995.
[74] MOSUR, R. Efficient Algorithms for Speech Recognition. PhD thesis, Carnegie Mellon University, May 1996. CMU-CS-96-143.
[75] PENTLAND, A. Looking at people: Sensing for ubiquitous and wearable computing. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22, 1 (Jan. 2000), 107–118.
[76] PERING, T., AND BRODERSEN, R. Dynamic voltage scaling and the design of a low-power microprocessor system. In Proceedings of the International Symposium on Computer Architecture (ISCA '98) (June 1998).
[77] PIHL, J., SVENDSEN, T., AND JOHNSEN, M. H. A VLSI implementation of PDF computations in HMM based speech recognition. In Proceedings of the IEEE Region Ten Conference on Digital Signal Processing Applications (TENCON'96) (Nov. 1996).
[78] POWELL, M., YANG, S.-H., FALSAFI, B., ROY, K., AND VIJAYKUMAR, T. N. Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories. In Proceedings of the 2000 International Symposium on Low Power Electronics and Design (2000), ACM Press, pp. 90–95.
[79] RABINER, L., AND JUANG, B.-H. Fundamentals of Speech Recognition. Prentice Hall, 1993, ch. 9, p. 494.
[80] RABINER, L. R. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77, 2 (Feb. 1989), 257–286.
[81] RAU, B. R. Iterative modulo scheduling: An algorithm for software pipelining loops. In Proceedings of the 27th Annual International Symposium on Microarchitecture (1994), ACM Press, pp. 63–74.
[82] RIXNER, S., DALLY, W. J., KAPASI, U. J., KHAILANY, B., LOPEZ-LAGUNAS, A., MATTSON, P. R., AND OWENS, J. D. A bandwidth-efficient architecture for media processing. In Proceedings of the 31st Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-31) (Nov. 1998), pp. 3–13.
[83] ROWLEY, H. A., BALUJA, S., AND KANADE, T. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 1 (1998), 23–38.
[84] RUSSELL, J., AND JACOME, M. Software power estimation and optimization for high performance, 32-bit embedded processors.
[85] RUSSELL, R. M. The CRAY-1 computer system. Communications of the ACM 21, 1 (1978), 63–72.
[86] SCHAPIRE, R. E. The boosting approach to machine learning: An overview. In MSRI Workshop on Nonlinear Estimation and Classification (2002).
[87] SCHMIT, H., WHELIHAN, D., TSAI, A., MOE, M., LEVINE, B., AND TAYLOR, R. PipeRench: A virtualized programmable datapath in 0.18 micron technology. In Proceedings of the IEEE Custom Integrated Circuits Conference (2002), pp. 63–66.
[88] SMITH, J. E. Decoupled access/execute computer architectures. In Proceedings of the 9th Annual Symposium on Computer Architecture (1982), IEEE Computer Society Press, pp. 112–119.
[89] SMITH, M. D., LAM, M., AND HOROWITZ, M. A. Boosting beyond static scheduling in a superscalar processor. In Proceedings of the 17th Annual Symposium on Computer Architecture (1990), pp. 344–354.
[90] SORIANO, M., MARTINKAUPPI, B., HUOVINEN, S., AND LAAKSONEN, M. Using the skin locus to cope with changing illumination conditions in color-based face tracking. In Proceedings of the IEEE Nordic Signal Processing Symposium (2000), pp. 383–386.
[91] SRIVASTAVA, S. Fast Gaussian evaluations in large vocabulary continuous speech recognition. M.S. thesis, Department of Electrical and Computer Engineering, Mississippi State University, Oct. 2002.
[92] STERN, R. M. Specification of the 1996 HUB 4 broadcast news evaluation. http://www.nist.gov/speech/publications/darpa97/pdf/stern1.pdf, 1996.
[93] SUNDARARAJAN, V., AND PARHI, K. K. Low power synthesis of dual threshold voltage CMOS VLSI circuits. In Proceedings of the 1999 International Symposium on Low Power Electronics and Design (1999), ACM Press, pp. 139–144.
[94] TEXAS INSTRUMENTS. TMS320C6000 CPU and Instruction Set Reference Guide, spru189f ed., Oct. 2000.
[95] TIWARI, V., MALIK, S., WOLFE, A., AND LEE, M. Instruction level power analysis and optimization of software. In Proceedings of the Ninth International Conference on VLSI Design (Jan. 1996), pp. 326–328.
[96] TIWARI, V., SINGH, D., RAJGOPAL, S., MEHTA, G., PATEL, R., AND BAEZ, F. Reducing power in high-performance microprocessors. In Proceedings of the 35th Annual Design Automation Conference (1998), ACM Press, pp. 732–737.
[97] TONG, Y. F., RUTENBAR, R., AND NAGLE, D. Minimizing floating-point power dissipation via bit-width reduction. In Proceedings of the 1998 International Symposium on Computer Architecture Power Driven Microarchitecture Workshop (1998).
[98] TSENG, J. H., AND ASANOVIC, K. Energy-efficient register access. In Proceedings of the 13th Symposium on Integrated Circuits and Systems Design (SBCCI'00) (2000), IEEE Computer Society, p. 377.
[99] TURK, M., AND PENTLAND, A. Face recognition using Eigenfaces. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (June 1991), pp. 586–591.
[100] UNGER, S., AND MUELLER, F. Handling irreducible loops: Optimized node splitting vs. DJ-graphs. Lecture Notes in Computer Science 2150 (2001), 207+.
[101] VAN ROSSUM, G. Python Reference Manual, 2.3.3 ed., Dec. 2003.
[102] VERMA, A., FARUQUIE, T., NETI, C., BASU, S., AND SENIOR, A. Late integration in audio-visual continuous speech recognition. In Proceedings of the Automatic Speech Recognition and Understanding Workshop (ASRU) (1999).
[103] VIOLA, P., AND JONES, M. Rapid object detection using a boosted cascade of simple features. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Dec. 2001).
[104] WAINGOLD, E., TAYLOR, M., SRIKRISHNA, D., SARKAR, V., LEE, W., LEE, V., KIM, J., FRANK, M., FINCH, P., BARUA, R., BABB, J., AMARASINGHE, S., AND AGARWAL, A. Baring it all to software: Raw machines. IEEE Computer 30, 9 (1997), 86–93.
[105] WANG, C.-L., BHAT, P. B., AND PRASANNA, V. K. High performance computing for vision. Proceedings of the IEEE 84, 7 (July 1996), 931–946.
[106] WAWRZYNEK, J., ASANOVIC, K., KINGSBURY, B., BECK, J., JOHNSON, D., AND MORGAN, N. SPERT-II: A vector microprocessor system and its application to large problems in backpropagation training. In Advances in Neural Information Processing Systems (1996), D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., vol. 8, The MIT Press, pp. 619–625.
[107] WEEMS, C. C. The second generation image understanding architecture and beyond. In Proceedings of Computer Architectures for Machine Perception (Nov. 1993), pp. 276–285.
[108] WEISS, M., AND FETTWEIS, G. Dynamic codewidth reduction for VLIW instruction set architectures in digital signal processors, 1996.
[109] WESTE, N. H. E., AND ESHRAGHIAN, K. Principles of CMOS VLSI Design: A Systems Perspective, second ed. Addison Wesley, 1993.
[110] YANG, M.-H., KRIEGMAN, D., AND AHUJA, N. Detecting faces in images: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 24, 1 (2002), 34–58.
[111] YOUNG, S. Large vocabulary continuous speech recognition: A review. In Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (Dec. 1995), pp. 3–28.
[112] YUN, H.-S., AND KIM, J. Power-aware modulo scheduling for high-performance VLIW processors. In Proceedings of the 2001 International Symposium on Low Power Electronics and Design (2001), ACM Press, pp. 40–45.