T-61.184 Automatic Speech Recognition: From Theory to Practice
http://www.cis.hut.fi/Opinnot/T-61.184/
September 20, 2004
Prof. Bryan Pellom
Department of Computer Science
Center for Spoken Language Research
University of Colorado
Today
Introduction to Speech Production and Phonetics
Short Video on Spectrogram Reading
Quick Review of Probability and Statistics
Formulation of the Speech Recognition Problem
Talk about Homework #1
Speech Production and Phonetics
Peter Ladefoged, “A Course In Phonetics,” Harcourt Brace Jovanovich, ISBN 0-15-500173-6
Excellent introductory reference to this material
Speech Production Anatomy
Figure from Spoken Language Processing (Huang, Acero, Hon)
Speech Production Anatomy
Vocal Tract
  Consists of the pharyngeal and oral cavities
Articulators
  Components of the vocal tract which move to produce various speech sounds
  Include: vocal folds, velum, lips, tongue, teeth
Source-Filter Representation of Speech Production
Production viewed as an acoustic filtering operation
Larynx and lungs provide input or source excitation
Vocal and nasal tracts act as a filter, shaping the spectrum of the resulting signal
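The source-filter view can be illustrated with a toy simulation. The sketch below is only an illustration under assumed values: the source is an impulse train at 100 Hz, the "vocal tract" is a single two-pole resonator at 500 Hz with 100 Hz bandwidth, and the sampling rate is 8 kHz. None of these numbers come from the lecture, and a real vocal tract has several resonances.

```python
import cmath
import math

def impulse_train(n_samples, period):
    """Idealized voiced source: one unit impulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(x, freq_hz, bandwidth_hz, fs):
    """Two-pole resonator standing in for one vocal-tract resonance."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius from bandwidth
    theta = 2 * math.pi * freq_hz / fs           # pole angle from center frequency
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for xn in x:
        yn = xn + a1 * y1 + a2 * y2              # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        y.append(yn)
        y2, y1 = y1, yn
    return y

def magnitude_at(signal, freq_hz, fs):
    """Magnitude of the discrete-time Fourier transform at one frequency."""
    return abs(sum(s * cmath.exp(-2j * math.pi * freq_hz * n / fs)
                   for n, s in enumerate(signal)))

FS = 8000                                        # assumed sampling rate (Hz)
source = impulse_train(800, FS // 100)           # assumed 100 Hz excitation
speech_like = resonator(source, 500.0, 100.0, FS)

# The source has equal energy at every harmonic; the filter reshapes it.
harmonics = range(100, 1100, 100)
strongest = max(harmonics, key=lambda f: magnitude_at(speech_like, f, FS))
print("strongest harmonic (Hz):", strongest)
```

The excitation alone has a flat harmonic spectrum, so any peak in the output spectrum is contributed by the filter.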
Describing Sounds
The study of speech sounds and their production, classification and transcription is known as phonetics
A phoneme is an abstract unit that can be used for writing a language down in a systematic or unambiguous way
Sub-classifications of phonemes
  Vowels: air passes freely through the resonators
  Consonants: air is partially or totally obstructed in one or more places as it passes through the resonators
Time-Domain Waveform Example
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
“two plus seven is less than ten”
Wide-Band Spectrogram
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
“two plus seven is less than ten”
(Spectrogram axes: frequency on the vertical axis, time on the horizontal axis)
Phonetic Alphabets
Allow us to describe the primitive sounds that make up a language
Each language will have a unique set of phonemes
Useful for speech recognition since words can be represented by sequences of phonemes as described by a phonetic alphabet.
International Phonetic Alphabet (IPA)
Phonetic representation standard which describes sounds in most/all world languages
IPA last published in 1993 and updated in 1996
Issue: character set difficult to manipulate on a computer…
http://www2.arts.gla.ac.uk/IPA/ipa.html
IPA Consonants
IPA Vowels
American English Phonemes (IPA)
Table from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
Alternative Phonetic Alphabets
ARPAbet
  English-only ASCII representation
  Phoneme units represented by 1-2 letters
  Similar representation used by the CMU Sphinx-II recognizer, http://www.speech.cs.cmu.edu/cgi-bin/cmudict
SAMPA (Speech Assessment Methods Phonetic Alphabet)
  Computer-readable representation
  Maps symbols of the IPA into ASCII codes
CMU Sphinx-II Phonetic Symbols
Phone  Example    Phone  Example    Phone  Example
AA     odd        EY     ate        P      pee
AE     at         F      fee        PD     lip
AH     hut        G      green      R      read
AO     ought      GD     bag        S      sea
AW     cow        HH     he         SH     she
AX     abide      IH     it         T      tea
AXR    user       IX     acid       TD     lit
AY     hide       IY     eat        TH     theta
B      be         JH     gee        TS     bits
BD     dub        K      key        UH     hood
CH     cheese     KD     lick       UW     two
D      dee        L      lee        V      vee
DD     dud        M      me         W      we
DH     thee       N      note       Y      yield
DX     matter     NG     ping       Z      zee
EH     ed         OW     oat        ZH     seizure
ER     hurt       OY     toy        SIL    (silence)
Example Words and Corresponding CMU Dictionary Transcriptions
basement B EY S M AX N TD
Bryan B R AY AX N
perfect P AXR F EH KD TD
speech S P IY CH
recognize R EH K AX G N AY Z
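As a rough illustration of how a recognizer might use such transcriptions, the sketch below stores the five entries above in a dictionary and expands a word sequence into a phone sequence. The names `LEXICON` and `transcribe` are hypothetical, and real pronunciation dictionaries also handle multiple pronunciations per word, which this sketch does not.

```python
# Hypothetical mini-lexicon; entries are copied from the transcriptions above.
LEXICON = {
    "basement": ["B", "EY", "S", "M", "AX", "N", "TD"],
    "bryan": ["B", "R", "AY", "AX", "N"],
    "perfect": ["P", "AXR", "F", "EH", "KD", "TD"],
    "speech": ["S", "P", "IY", "CH"],
    "recognize": ["R", "EH", "K", "AX", "G", "N", "AY", "Z"],
}

def transcribe(words, lexicon=LEXICON):
    """Concatenate the phone sequences of the given words."""
    phones = []
    for word in words:
        key = word.lower()
        if key not in lexicon:
            raise KeyError(f"word not in lexicon: {word!r}")
        phones.extend(lexicon[key])
    return phones

print(" ".join(transcribe(["recognize", "speech"])))
```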
Classifications of Speech Sounds
Voiced vs. Voiceless
  Voiced if the vocal cords vibrate
Nasal vs. Oral
  Nasal if air travels through the nasal cavity and the oral cavity is closed
Consonant vs. Vowel
  Consonants: obstruction in the air stream above the glottis. The glottis is defined as the space between the vocal cords.
Lateral vs. Non-lateral
  Non-lateral if the air stream passes through the middle of the oral cavity (as opposed to along the sides of the oral cavity)
Consonants and Vowels
Consonants are characterized by:
  Place of articulation
  Manner of articulation
  Voicing
Vowels are characterized by:
  Lip position
  Tongue height
  Tongue advancement
Places of Articulation
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
Places of Articulation
Bilabial: made with the two lips {P,B,M}
Labio-dental: lower lip & upper front teeth {F,V}
Dental: tongue tip or blade & upper front teeth {TH,DH}
Alveolar: tongue tip or blade & alveolar ridge {T,D,N,…}
Retroflex: tongue tip & back of the alveolar ridge {R}
Palato-Alveolar: tongue blade & back of the alveolar ridge {SH}
Palatal: front of the tongue & hard palate {Y,ZH}
Velar: back of the tongue & soft palate {K,G,NG}
Manners of Articulation
Stop
  Complete obstruction with sudden (explosive) release
Nasal
  Airflow stopped in the oral cavity, soft palate down, airflow is through the nasal tract
Fricative
  Articulators close together, turbulent airflow produced
Manners of Articulation
Retroflex (Liquid)
  Tip of the tongue is curled back slightly (/r/)
Lateral (Liquid)
  Obstruction of the air stream at a point along the center of the oral tract, with incomplete closure between one or both sides of the tongue and the roof of the mouth (/l/)
Glide
  Vowel-like, but initial position within a syllable (/y/, /w/)
American English Consonants by Place and Manner of Articulation
(Table: columns give the place of articulation, rows give the manner of articulation)
American English Unvoiced Fricatives
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
Voiced vs. Unvoiced Fricatives
Voiced fricatives tend to be shorter than unvoiced fricatives
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
American English Unvoiced Stops
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
Voiced vs. Unvoiced Stops
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
Stop Consonant Formant Transitions
Figure from http://hitchcock.dlt.asu.edu/media5/a_spanias/speech-recognition/real-lectures/PDF/LECT04-5.PDF
Nasal vs. Oral Articulation
(Figure labels: velum; nasal consonants /M/, /N/, /NG/)
Spectrogram of Nasals
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
American English Semivowels
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
Semivowels
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
Affricates
Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003
Describing Vowels
Velum position
  Nasal vs. non-nasal
Lip shape
  Rounded vs. unrounded
Tongue height
  High, mid, low
  Correlated to first formant position
Tongue advancement
  Front, central, back
  Correlated to second formant position
American English Vowels
“heed” IY    “hid” IH      “head” EH    “had” AE
“hod” AA     “hawed” AO    “hood” UH    “who'd” UW
Vowel Chart
(Figure: vowel chart. Horizontal axis: tongue advancement (front, central, back). Vertical axis: tongue height (high, mid, low).)
Vowels plotted: IY, IH, EH, AE, AX, AA, AO, UH, UW
Diphthongs (shown in green in the original figure): AY, EY, OY, OW, AW
The Vowel Space
(Figure: numbered vowels plotted with first formant frequency, F1 (Hz), on the horizontal axis and second formant frequency, F2 (Hz), on the vertical axis)
1. IY
2. IH
3. EH
4. AE
5. UH
6. AA
7. AO
8. UW
9. (non-English “u”)
10. AX
Coarticulation
http://hctv.humnet.ucla.edu/departments/linguistics/VowelsandConsonants
Notice the position of the 2nd formant onset for these words: “fie” (F AY), “thigh” (TH AY), “sigh” (S AY), “shy” (SH AY)
Sound Classification Summary
Vowels Semivowels Consonants
Nasals
Stops Fricatives
Affricates
Spectrogram Reading Video
“Speech as Eyes See It” (12 minute video)
1977-1978 video by Ron Cole and Victor Zue
After 2000-3000 hours of training: phonemes and words can be transcribed from a spectrogram alone
80-90% agreement on segments
Provided insight into the speech recognition problem during the 1970’s
Review: Probability & Statistics for Speech Recognition
Relative-Frequency and Probability
Relative frequency of “A”:
  Experiment is performed N times
  With 4 possible outcomes: A, B, C, D
  N_A is the number of times event A occurs:
$$N_A + N_B + N_C + N_D = N, \qquad r(A) = \frac{N_A}{N}$$
P(A)
  Defined as “the probability of event A”
  Relative frequency with which event A would occur if the experiment were repeated many times
Probability
Example (cont'd)
$$N_A + N_B + N_C + N_D = N$$
$$r(A) + r(B) + r(C) + r(D) = 1$$
$$P(A) = \lim_{N \to \infty} r(A)$$
$$P(A) + P(B) + P(C) + P(D) = 1$$
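The relative-frequency definition can be checked numerically. The sketch below simulates an experiment with four outcomes; the outcomes are assumed to be equally likely, which is an assumption of this example and not something the lecture states, and `relative_frequencies` is a hypothetical helper name.

```python
import random

def relative_frequencies(n_trials, outcomes=("A", "B", "C", "D"), seed=0):
    """Run the experiment n_trials times and return r(X) = N_X / N for each outcome."""
    rng = random.Random(seed)
    counts = {outcome: 0 for outcome in outcomes}
    for _ in range(n_trials):
        counts[rng.choice(outcomes)] += 1       # equally likely outcomes (assumed)
    return {outcome: counts[outcome] / n_trials for outcome in outcomes}

for n in (100, 10_000, 100_000):
    r = relative_frequencies(n)
    print(n, {k: round(v, 4) for k, v in r.items()}, "sum =", round(sum(r.values()), 6))
```

For any N the relative frequencies sum to 1, and as N grows r(A) settles near the underlying probability of A.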
Mutually Exclusive Events
For mutually exclusive events A, B, C, …, M:
$$0 \le P(A) \le 1$$
$$P(A) + P(B) + P(C) + \cdots + P(M) = 1$$
P(A) = 0 represents an impossible event
P(A) = 1 represents a certain event
Joint and Conditional Probability
P(AB) is the joint probability of event A and B both occurring
$$P(AB) = \lim_{N \to \infty} \frac{N_{AB}}{N}$$
P(A|B) is the conditional probability of event A given that event B has occurred:
$$P(A \mid B) = \frac{P(AB)}{P(B)} = \lim_{N \to \infty} \frac{N_{AB}/N}{N_B/N}$$
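The count-based definitions translate directly into code. The sketch below estimates P(AB), P(B), and P(A|B) from a list of trials; the die example and the function name `joint_and_conditional` are assumptions made for illustration.

```python
def joint_and_conditional(trials):
    """trials: list of (a_occurred, b_occurred) pairs of booleans.

    Returns (P(AB), P(B), P(A|B)) estimated from counts.
    """
    n = len(trials)
    n_b = sum(1 for a, b in trials if b)
    n_ab = sum(1 for a, b in trials if a and b)
    if n == 0 or n_b == 0:
        raise ValueError("need at least one trial in which B occurred")
    p_ab = n_ab / n
    p_b = n_b / n
    p_a_given_b = p_ab / p_b                 # (N_AB / N) / (N_B / N)
    return p_ab, p_b, p_a_given_b

# Toy data: each face of a die seen once; A = "face is even", B = "face > 3".
trials = [(face % 2 == 0, face > 3) for face in range(1, 7)]
p_ab, p_b, p_a_given_b = joint_and_conditional(trials)
print(p_ab, p_b, p_a_given_b)
```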
Marginal Probability
Probability of an event occurring across all conditions
$$P(B) = \sum_{k=1}^{n} P(A_k B) = \sum_{k=1}^{n} P(B \mid A_k) P(A_k)$$
(Figure: event B overlapping the events A_1, A_2, A_3, A_4, A_5)
Bayes’ Theorem
$$P(A_i \mid B) = \frac{P(A_i B)}{P(B)} = \frac{P(B \mid A_i) P(A_i)}{P(B)}$$
$$P(B) = \sum_{k=1}^{n} P(A_k B) = \sum_{k=1}^{n} P(B \mid A_k) P(A_k)$$
$$P(AB) = P(B \mid A) P(A)$$
Bayes’ Rule
Bayes’ Rule allows us to update prior beliefs with new evidence,
$$P(W_i \mid O) = \frac{P(O \mid W_i) P(W_i)}{\sum_j P(O \mid W_j) P(W_j)}$$
Prior probability: P(W_i)
Likelihood: P(O | W_i)
Evidence: P(O), the denominator
Posterior probability: P(W_i | O)
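A small numeric sketch of Bayes' Rule follows. The three word hypotheses and all probabilities are invented for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Apply Bayes' Rule.

    priors:      {W_i: P(W_i)}
    likelihoods: {W_i: P(O | W_i)}
    returns:     {W_i: P(W_i | O)}
    """
    # Evidence P(O) is the marginal: sum_j P(O | W_j) P(W_j).
    evidence = sum(likelihoods[w] * priors[w] for w in priors)
    if evidence == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {w: likelihoods[w] * priors[w] / evidence for w in priors}

# Made-up numbers for three competing word hypotheses.
priors = {"yes": 0.5, "no": 0.3, "maybe": 0.2}
likelihoods = {"yes": 0.02, "no": 0.10, "maybe": 0.05}
post = posterior(priors, likelihoods)
print({w: round(p, 3) for w, p in post.items()})
```

Here the hypothesis with the largest prior ("yes") does not end up with the largest posterior, because the evidence favors "no" strongly enough to outweigh the prior.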
Statistically Independent Events
Occurrence of event A does not influence occurrence of event B:
$$P(AB) = P(A) P(B)$$
$$P(A \mid B) = P(A)$$
$$P(B \mid A) = P(B)$$
Random Variables
Used to describe events in which the number of possible outcomes is infinite
Values of the outcomes can not be predicted with certainty
Distribution of outcome values is known
Probability Distribution Functions
The probability of the event that the random variable X is less than or equal to the allowed value x:
$$F_X(x) = P(X \le x)$$
(Figure: plot of F_X(x) versus x, rising from 0 to 1)
Probability Distribution Functions
Properties:
$$0 \le F_X(x) \le 1, \qquad -\infty < x < \infty$$
$$F_X(-\infty) = 0 \ \text{and} \ F_X(\infty) = 1$$
$$F_X(x) \ \text{is nondecreasing as} \ x \ \text{increases}$$
$$P(x_1 < X \le x_2) = F_X(x_2) - F_X(x_1)$$
Probability Density Functions (PDF)
Derivative of the probability distribution function,
$$f_X(x) = \frac{dF_X(x)}{dx}$$
Interpretation: for a small increment dx,
$$f_X(x)\,dx \approx P(x < X \le x + dx)$$
Probability Density Functions (PDF)
Properties
$$f_X(x) \ge 0, \qquad -\infty < x < \infty$$
$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$$
$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$$
$$\int_{x_1}^{x_2} f_X(x)\,dx = P(x_1 < X \le x_2)$$
Mean and Variance
Expectation or Mean of a random variable X
$$E(X) = \int_{-\infty}^{\infty} x f_X(x)\,dx, \qquad \mu = E(X)$$
Variance of a random variable X
$$Var(X) = \sigma^2 = E\left[ (X - \mu)^2 \right]$$
$$Var(X) = \sigma^2 = E(X^2) - [E(X)]^2$$
Mean and Variance
Variance properties (if X and Y are independent)
$$Var(X + Y) = Var(X) + Var(Y)$$
$$Var(aX) = a^2 Var(X)$$
$$Var(a_1 X_1 + \cdots + a_n X_n + b) = a_1^2 Var(X_1) + \cdots + a_n^2 Var(X_n)$$
Covariance and Correlation
Covariance of X and Y:
$$Cov(X, Y) = E\left[ (X - \mu_x)(Y - \mu_y) \right]$$
$$Cov(X, Y) = Cov(Y, X)$$
Correlation of X and Y:
$$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}, \qquad -1 \le \rho_{XY} \le 1$$
Gaussian (Normal) Distribution
$$f(X = x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$
$$\mu = E[x] = \frac{1}{N} \sum_{n=1}^{N} x_n$$
$$\sigma^2 = E(X^2) - [E(X)]^2 = \frac{1}{N} \sum_{n=1}^{N} x_n^2 - \left[ \frac{1}{N} \sum_{n=1}^{N} x_n \right]^2$$
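The density and the two sample estimates above can be sketched in a few lines. The true mean 2.0 and standard deviation 3.0 used to draw the samples are arbitrary values chosen for this example, and the function names are hypothetical.

```python
import math
import random

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density f(X = x | mu, sigma^2), with var = sigma^2."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def estimate_mean_var(samples):
    """Sample mean and variance, using Var = E(X^2) - [E(X)]^2."""
    n = len(samples)
    mean = sum(samples) / n
    mean_of_squares = sum(x * x for x in samples) / n
    return mean, mean_of_squares - mean ** 2

rng = random.Random(0)
samples = [rng.gauss(2.0, 3.0) for _ in range(50_000)]   # assumed mu = 2, sigma = 3
mu_hat, var_hat = estimate_mean_var(samples)
print("estimated mean:", round(mu_hat, 3), "estimated variance:", round(var_hat, 3))
print("density at the estimated mean:", gaussian_pdf(mu_hat, mu_hat, var_hat))
```

The E(X^2) - [E(X)]^2 form needs only running sums of x and x^2, so it can be accumulated in a single pass over the data; it can lose precision when the mean is large relative to the spread.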
Gaussian (Normal) Distribution
(Figure: plot of f(X = x | μ, σ²) versus x for μ = 0 and σ² = 1, with x from -3 to 3)
Multivariate Distributions
Characterize more than one random variable at a time
Example: Features for speech recognition
  In speech recognition we typically compute 39-dimensional feature vectors (more about that later)
  100 feature vectors per second of audio
Often want to compute the likelihood of the observed features given a known (estimated) distribution which is being used to model some part of a phoneme (more about that later!).
Multivariate Gaussian Distribution
$$f(X = x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right)$$
x: observed vector of random variables (features)
μ: distribution mean vector
Σ: distribution covariance matrix
|Σ|: determinant of covariance matrix
Diagonal Covariance Assumption
Most speech recognition systems assume diagonal covariance matrices
Data sparseness issue:
$$\Sigma = \begin{bmatrix} \sigma_{11}^2 & 0 & 0 & 0 \\ 0 & \sigma_{22}^2 & 0 & 0 \\ 0 & 0 & \sigma_{33}^2 & 0 \\ 0 & 0 & 0 & \sigma_{44}^2 \end{bmatrix}$$
$$|\Sigma| = \prod_{n=1}^{d} \sigma_{nn}^2$$
Diagonal Covariance Assumption
Inverting a diagonal matrix involves simply inverting the elements along the diagonal:
$$\Sigma^{-1} = \begin{bmatrix} \frac{1}{\sigma_{11}^2} & 0 & 0 & 0 \\ 0 & \frac{1}{\sigma_{22}^2} & 0 & 0 \\ 0 & 0 & \frac{1}{\sigma_{33}^2} & 0 \\ 0 & 0 & 0 & \frac{1}{\sigma_{44}^2} \end{bmatrix}$$
Multivariate Gaussians with Diagonal Covariance Matrices
Assuming a diagonal covariance matrix,
$$f(X = x \mid \mu, \Sigma) = C \exp\left( -\frac{1}{2} \sum_{d=1}^{n} \frac{(x[d] - \mu[d])^2}{\sigma^2[d]} \right)$$
$$C = \frac{1}{(2\pi)^{n/2} \left( \prod_{d=1}^{n} \sigma^2[d] \right)^{1/2}}, \qquad \sigma^2[d] = \sigma_{dd}^2$$
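With a diagonal covariance the density needs no matrix inverse or determinant routine, only per-dimension variances. The sketch below evaluates the logarithm of the expression above; working in the log domain is a choice made for this example (products of many small densities underflow quickly) and is not something the slide prescribes. The function name is hypothetical.

```python
import math

def diag_gaussian_log_likelihood(x, mean, var):
    """log f(X = x | mu, Sigma) for a diagonal Sigma.

    x, mean, var are equal-length sequences; var[d] is sigma^2[d].
    """
    if not (len(x) == len(mean) == len(var)):
        raise ValueError("x, mean and var must have the same length")
    n = len(x)
    # log C = -(n/2) log(2 pi) - (1/2) sum_d log sigma^2[d]
    log_c = -0.5 * (n * math.log(2.0 * math.pi) + sum(math.log(v) for v in var))
    # Quadratic term: sum_d (x[d] - mu[d])^2 / sigma^2[d]
    quad = sum((xd - md) ** 2 / vd for xd, md, vd in zip(x, mean, var))
    return log_c - 0.5 * quad

# Two-dimensional example with made-up parameters.
print(diag_gaussian_log_likelihood([1.0, -2.0], [0.0, 0.0], [1.0, 4.0]))
```

Because the dimensions decouple, the result equals the sum of the one-dimensional log densities, which is a convenient sanity check.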
Multivariate Mixture Gaussians
Distribution is governed by several Gaussian density functions
Sum of Gaussians (w_m = mixture weight)
$$f(x) = \sum_{m=1}^{M} w_m \, N(x; \mu_m, \Sigma_m)$$
$$f(x) = \sum_{m=1}^{M} \frac{w_m}{(2\pi)^{n/2} |\Sigma_m|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_m)^T \Sigma_m^{-1} (x - \mu_m) \right)$$
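A mixture density is a weighted sum of component densities. The sketch below assumes diagonal covariances for every component (an assumption of this example; the formula above allows full covariances) and uses made-up weights, means, and variances. Both function names are hypothetical.

```python
import math

def diag_gaussian_pdf(x, mean, var):
    """Gaussian density with diagonal covariance; var[d] is the variance of dimension d."""
    norm = 1.0
    quad = 0.0
    for xd, md, vd in zip(x, mean, var):
        norm *= 2.0 * math.pi * vd          # builds (2 pi)^n * prod_d sigma^2[d]
        quad += (xd - md) ** 2 / vd
    return math.exp(-0.5 * quad) / math.sqrt(norm)

def mixture_pdf(x, weights, means, variances):
    """f(x) = sum_m w_m N(x; mu_m, Sigma_m), with the weights summing to 1."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * diag_gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

# Three 1-D components with made-up parameters.
weights = [0.3, 0.5, 0.2]
means = [[-4.0], [0.0], [3.0]]
variances = [[1.0], [2.0], [0.5]]
for x in (-4.0, 0.0, 3.0):
    print(x, round(mixture_pdf([x], weights, means, variances), 4))
```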
Multiple Mixtures (1-D case)
Example: 3 mixtures used to model underlying random process of 3 Gaussians
(Figure: plot of the mixture density f(x) versus x, for x from about -7.5 to 6.5)
Speech Recognition Problem Formulation
Problem Description
Given a sequence of observations (evidence) from an audio signal,
$$O = o_1 o_2 \cdots o_T$$
Determine the underlying word sequence,
$$W = w_1 w_2 \cdots w_m$$
Number of words (m) unknown, observation sequence is variable length (T)
Problem Formulation
Goal: Minimize the classification error rate
Solution: Maximize the Posterior Probability
$$\hat{W} = \arg\max_{W} P(W \mid O)$$
Solution requires optimization over all possible word strings!
Problem Formulation
Using Bayes Rule,
$$P(W \mid O) = \frac{P(O \mid W) P(W)}{P(O)}$$
Since P(O) does not impact optimization,
$$\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} P(O \mid W) P(W)$$
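The decision rule can be sketched for a tiny, fixed list of candidate word strings. The two candidates and all scores below are invented for illustration; a real recognizer cannot enumerate candidates this way. Scores are kept as log probabilities, which is a choice made for this sketch, and `decode` is a hypothetical name.

```python
import math

def decode(acoustic_log_likelihood, language_log_prob):
    """Return argmax over W of log P(O | W) + log P(W) for a finite candidate set."""
    return max(language_log_prob,
               key=lambda w: acoustic_log_likelihood[w] + language_log_prob[w])

# Hypothetical scores: the acoustics slightly prefer the second candidate,
# but the language model strongly prefers the first.
acoustic = {
    "recognize speech": math.log(1e-5),      # log P(O | W)
    "wreck a nice beach": math.log(2e-5),
}
language = {
    "recognize speech": math.log(1e-3),      # log P(W)
    "wreck a nice beach": math.log(1e-5),
}
uniform = {w: math.log(0.5) for w in acoustic}

print("with language model:", decode(acoustic, language))
print("with uniform prior: ", decode(acoustic, uniform))
```

P(O) never appears because it is the same for every candidate W and so cannot change which candidate wins.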
Problem Formulation
Let’s assume words can be represented by a sequence of states, S,
$$\hat{W} = \arg\max_{W} P(O \mid W) P(W) = \arg\max_{W} \sum_{S} P(O \mid S) P(S \mid W) P(W)$$
Words → Phonemes → States
States represent smaller pieces of phonemes
Problem Formulation
Optimize:
$$\hat{W} = \arg\max_{W} \sum_{S} P(O \mid S) P(S \mid W) P(W)$$
Practical Realization,
O: Observation (feature) sequence
P(O | S): Acoustic Model
P(S | W): Lexicon / Pronunciation Model
P(W): Language Model
Problem Formulation
Optimization desires most likely word sequence given observations (evidence)
Cannot evaluate all possible word / state sequences (too many possibilities!)
We need:
  To define a representation for modeling states (HMMs…)
  A means for “approximately” searching for the best word / state sequence given the evidence (Viterbi Algorithm)
  And a few other tricks up our sleeves to make it FASTER!
Hidden Markov Models (HMMs)
Observation vectors are assumed to be “generated” by a Markov Model
HMM: a finite-state machine in which, each time t that a state j is entered, an observation o_t is emitted with probability density b_j(o_t)
Transition from state i to state j is modeled with probability a_ij
(Figure: three-state left-to-right HMM with states S0, S1, S2, self-transitions a_00, a_11, a_22 and forward transitions a_01, a_12, generating the observation sequence o_1, o_2, o_3, o_4, …, o_t, o_{t+1}, o_{t+2}, …, o_T)
“Beads-on-a-String” HMM Representation
(Figure: the word SPEECH represented as a chain of phoneme models S, P, IY, CH over the observations o_1, o_2, …, o_7)
Modeled Probability Distribution
(Figure: three-state HMM S0, S1, S2 with transitions a_00, a_01, a_11, a_12, a_22; each state has its own output distribution b_0(o_t), b_1(o_t), b_2(o_t))
HMMs with Mixtures of Gaussians
Multivariate Mixture-Gaussian distribution,
$$b_j(o_t) = \sum_{m=1}^{M} \frac{w_{jm}}{(2\pi)^{n/2} |\Sigma_{jm}|^{1/2}} \exp\left( -\frac{1}{2} (o_t - \mu_{jm})^T \Sigma_{jm}^{-1} (o_t - \mu_{jm}) \right)$$
Model parameters are (1) means, (2) variances, and (3) mixture weights
Sum of Gaussians can model complex distributions
Hidden Markov Model Based ASR
Observation sequence assumed to be known
Probability of a particular state sequence can be computed
Underlying state sequence is unknown, “hidden”
(Figure: three-state HMM S0, S1, S2 with transitions a_00, a_01, a_11, a_12, a_22 above the observation sequence o_1, o_2, …, o_T)
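The following sketch makes these points concrete for a three-state left-to-right HMM. It scores one given state sequence, finds the best sequence by brute-force enumeration, and finds the same sequence with the Viterbi algorithm named earlier in the lecture. Everything numeric is an assumption of this example: the transition probabilities, the single 1-D Gaussian per state (rather than a mixture over feature vectors), and the rule that the model always starts in S0.

```python
import itertools
import math

NEG_INF = float("-inf")

# Hypothetical model parameters (all values are made up).
A = [[0.6, 0.4, 0.0],      # a_00, a_01
     [0.0, 0.7, 0.3],      # a_11, a_12
     [0.0, 0.0, 1.0]]      # a_22
MEANS = [0.0, 3.0, 6.0]    # one 1-D Gaussian per state
VARS = [1.0, 1.0, 1.0]

def safe_log(p):
    return math.log(p) if p > 0 else NEG_INF

def log_b(j, o):
    """log b_j(o): log density of observation o in state j."""
    return -0.5 * (math.log(2 * math.pi * VARS[j]) + (o - MEANS[j]) ** 2 / VARS[j])

def log_joint(obs, states):
    """log P(O, S) for one particular state sequence (assumed to start in S0)."""
    if states[0] != 0:
        return NEG_INF
    total = log_b(0, obs[0])
    for t in range(1, len(obs)):
        total += safe_log(A[states[t - 1]][states[t]]) + log_b(states[t], obs[t])
    return total

def brute_force_best(obs):
    """Score every state sequence: 3**T of them, feasible only for tiny T."""
    return max(itertools.product(range(len(A)), repeat=len(obs)),
               key=lambda s: log_joint(obs, s))

def viterbi(obs):
    """Best state sequence and its log score, in time linear in T."""
    n = len(A)
    score = [log_b(0, obs[0])] + [NEG_INF] * (n - 1)
    backpointers = []
    for o in obs[1:]:
        new_score, pointers = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + safe_log(A[i][j]))
            new_score.append(score[best_i] + safe_log(A[best_i][j]) + log_b(j, o))
            pointers.append(best_i)
        score = new_score
        backpointers.append(pointers)
    state = max(range(n), key=lambda j: score[j])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backpointers):    # trace the best path backwards
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score

obs = [0.1, -0.2, 2.9, 3.2, 6.1]
path, best_score = viterbi(obs)
print("Viterbi path:     ", path)
print("brute-force path: ", list(brute_force_best(obs)))
```

Brute force scores 3^T sequences, while Viterbi keeps only the best partial path into each state at each time step, which is what makes longer observation sequences tractable.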
Components of a Speech Recognizer
(Block diagram: Speech → Feature Extraction → Optimization / Search → Text + Timing + Confidence; the Optimization / Search block draws on the Acoustic Model, the Lexicon, and the Language Model)
Topics for Next Time
An introduction to hearing
Speech Detection
Frame-based Speech Analysis
Feature Extraction Methods for Speech Recognition