CS 545 Lecture XI: Speech
Benjamin Snyder
(some slides courtesy Jurafsky&Martin)
[email protected][email protected]
Announcements
Office hours change for today and next week: 1pm - 1:45pm
or by appointment -- but please schedule ahead
HW 4 / 5? will be out soon
Speech in a Slide
Frequency gives pitch; amplitude gives volume
Frequencies at each time slice processed into observation vectors
[Figure: spectrogram of "s p ee ch l a b" (frequency vs. time, amplitude waveform above) mapped to an observation vector sequence ...a12 a13 a12 a14 a14...]
The Noisy-Channel Model
w ~ P(w) (language model); o ~ P(o|w) (acoustic model)
ASR System Components
[Figure: source P(w) generates w; channel P(o|w) produces the observed o; the decoder recovers the best w]
argmax_w P(w|o) = argmax_w P(o|w)P(w)
Language Model: P(w); Acoustic Model: P(o|w)
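The argmax above can be sketched as scoring a handful of candidate transcriptions by acoustic score plus language-model score in log space. All candidate phrases and log-probabilities below are made up for illustration.

```python
# Minimal sketch of noisy-channel decoding over a toy candidate set.
def noisy_channel_decode(observation, candidates, acoustic_logp, lm_logp):
    """Pick the word sequence w maximizing log P(o|w) + log P(w)."""
    return max(candidates, key=lambda w: acoustic_logp(observation, w) + lm_logp(w))

# Made-up log-probabilities for two candidate transcriptions:
acoustic = {("recognize", "speech"): -10.0, ("wreck", "a", "nice", "beach"): -9.5}
lm = {("recognize", "speech"): -3.0, ("wreck", "a", "nice", "beach"): -8.0}

best = noisy_channel_decode(
    "o",  # placeholder for the acoustic observation sequence
    list(acoustic),
    lambda o, w: acoustic[w],
    lambda w: lm[w],
)
# The language model overrules the slightly better acoustic score of the wrong candidate.
```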
Phoneme Inventories
Phoneme: sound used as a building block in words
Some phonemes occur in most languages (b, p, m, n, s)
But substantial variation occurs in size and scope of phoneme inventories across languages
Consonants characterized by
1. place of articulation
2. manner of articulation
3. voicing
Vowels
Phonotactics
Languages exhibit phonotactics
Some phoneme sequences are favored, others are forbidden
Phonotactics are largely language-specific... But, often shared within language families
And some sound sequences are anatomically difficult for everyone: “kgvrsatr”
Speech Recognition
Applications of Automatic Speech Recognition (ASR)
Dictation
Telephone-based Information (directions, air travel, banking, etc)
Hands-free (in car)
Speaker Identification
Language Identification
Second language ('L2') (accent reduction)
Audio archive searching
7/30/08 3 Speech and Language Processing Jurafsky and Martin
LVCSR
Large Vocabulary Continuous Speech Recognition
~20,000-64,000 words
Speaker independent (vs. speaker-dependent)
Continuous speech (vs isolated-word)
Current error rates
Task                       Vocabulary   Error Rate (%)
Digits                     11           0.5
WSJ read speech            5K           3
WSJ read speech            20K          3
Broadcast news             64,000+      10
Conversational Telephone   64,000+      20
Ballpark numbers; exact numbers depend very much on the specific corpus
HSR versus ASR
Conclusions: Machines about 5 times worse than humans
Gap increases with noisy speech
These numbers are rough; take them with a grain of salt
Task                Vocab   ASR   Human SR
Continuous digits   11      0.5   0.009
WSJ 1995 clean      5K      3     0.9
WSJ 1995 w/noise    5K      9     1.1
SWBD 2004           65K     20    4
Issues
Pronunciation: error 3-4 times higher for native Spanish and Japanese speakers
Car noise: error 2-4 times higher
Multiple speakers
LVCSR Design Intuition
• Build a statistical model of the speech-to-words process
• Collect lots and lots of speech, and transcribe all the words.
• Train the model on the labeled speech
• Paradigm: Supervised Machine Learning + Search
Speech Recognition Architecture
Architecture: Five easy pieces (only 3-4 for today)
HMMs, Lexicons, and Pronunciation
Feature extraction
Acoustic Modeling
Decoding
Language Modeling (seen this already)
Noisy Channel Part I: Words to Phonemes
(transitions in HMM)
Lexicon
A list of words
Each one with a pronunciation in terms of phones
We get these from an on-line pronunciation dictionary
CMU dictionary: 127K words, http://www.speech.cs.cmu.edu/cgi-bin/cmudict
We’ll represent the lexicon as an HMM
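A lexicon lookup can be sketched as a plain dictionary from words to phone sequences. The entries below are illustrative, written in CMUdict-style ARPAbet without stress marks.

```python
# Toy lexicon: word -> list of phones (illustrative, CMUdict-style entries).
lexicon = {
    "six": ["S", "IH", "K", "S"],
    "speech": ["S", "P", "IY", "CH"],
    "lab": ["L", "AE", "B"],
}

def phones_for(word):
    """Look up a word's pronunciation as a list of phones."""
    return lexicon[word.lower()]
```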
HMMs for speech: the word “six”
Phones are not homogeneous!
[Figure: spectrogram of "ay k" (0.48-0.94 s, 0-5000 Hz); the spectrum changes within each phone]
Each phone has 3 subphones
Resulting HMM word model for “six” with its subphones
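Expanding each phone of "six" into beginning/middle/end subphone states can be sketched as follows; the state-naming convention here is my own, not the textbook's.

```python
def word_hmm_states(phones):
    """Expand each phone into beginning/middle/end subphone states,
    giving the left-to-right emitting states of the word HMM."""
    return [f"{p}_{part}" for p in phones for part in ("b", "m", "e")]

# "six" = S IH K S: 4 phones x 3 subphones = 12 emitting states
states = word_hmm_states(["S", "IH", "K", "S"])
```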
Noisy Channel Part II: Phonemes to Sounds
(emissions in HMM)
George Miller figure
And also, human acoustic perception....
We care about the filter, not the source
Most characteristics of the source don't matter for phone detection:
F0
Details of glottal pulse
What we care about is the filter:
The exact position of the articulators in the oral tract
So we want a way to separate these and use only the filter function
Mel-scale
Human hearing is not equally sensitive to all frequency bands
Less sensitive at higher frequencies, roughly > 1000 Hz
I.e. human perception of frequency is non-linear:
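One common mel-scale formula (the O'Shaughnessy variant; the equivalent 1127·ln form also appears in textbooks) can be computed directly; note it is roughly linear below 1000 Hz and compressive above.

```python
import math

def hz_to_mel(f_hz):
    """Convert a frequency in Hz to the mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Equal steps in Hz shrink on the mel scale as frequency grows:
low_step = hz_to_mel(1000.0) - hz_to_mel(0.0)
high_step = hz_to_mel(2000.0) - hz_to_mel(1000.0)
```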
MFCC: Mel-Frequency Cepstral Coefficients
Final Feature Vector
39 Features per 10 ms frame:
12 MFCC features
12 Delta MFCC features
12 Delta-Delta MFCC features
1 (log) frame energy
1 Delta (log) frame energy
1 Delta-Delta (log) frame energy
So each frame represented by a 39D vector
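The delta (velocity) and delta-delta (acceleration) features can be sketched with a simple two-point slope; real systems typically use a wider regression window, so this is only a minimal illustration on one feature stream.

```python
def deltas(frames):
    """Two-point delta: d[t] = (c[t+1] - c[t-1]) / 2, with edge frames repeated."""
    padded = [frames[0]] + frames + [frames[-1]]
    return [(padded[t + 2] - padded[t]) / 2.0 for t in range(len(frames))]

# Toy per-frame log energy; deltas of deltas give the delta-delta stream.
energy = [1.0, 2.0, 4.0, 4.0]
d = deltas(energy)    # velocity
dd = deltas(d)        # acceleration
```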
Acoustic Modeling (= Phone detection)
Given a 39-dimensional vector corresponding to the observation of one frame o_i
And given a phone q we want to detect
Compute p(o_i|q)
Most popular method:
GMM (Gaussian mixture models)
Other methods
Neural nets, CRFs, SVM, etc
Gaussian Mixture Models
Also called “fully-continuous HMMs”
P(o|q) computed by a Gaussian:
p(o|q) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(o-\mu)^2}{2\sigma^2}\right)
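Evaluated directly, this density looks like the following minimal 1-D sketch.

```python
import math

def gaussian_pdf(o, mu, sigma):
    """p(o|q) for a single 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-((o - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Peak value at the mean for a unit Gaussian is 1/sqrt(2*pi) ~ 0.3989
peak = gaussian_pdf(0.0, 0.0, 1.0)
```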
Gaussians for Acoustic Modeling
P(o|q):
[Figure: 1-D Gaussian densities over o; P(o|q) is highest at the mean and low far from the mean. Curves with different means are shifted copies.]
A Gaussian is parameterized by a mean and a variance
Training Gaussians
A (single) Gaussian is characterized by a mean and a variance
Imagine that we had some training data in which each phone was labeled
And imagine that we were just computing 1 single spectral value (real valued number) as our acoustic observation
We could just compute the mean and variance from the data:
\mu_i = \frac{1}{T}\sum_{t=1}^{T} o_t \quad \text{s.t. } o_t \text{ is phone } i

\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T} (o_t - \mu_i)^2 \quad \text{s.t. } o_t \text{ is phone } i
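The maximum-likelihood estimates above are just the sample mean and (biased) variance over the frames labeled with one phone; a minimal sketch on a toy 1-D stream:

```python
def fit_gaussian(observations):
    """ML estimates: sample mean and (biased, divide-by-T) variance
    over the frames labeled with a single phone."""
    T = len(observations)
    mu = sum(observations) / T
    var = sum((o - mu) ** 2 for o in observations) / T
    return mu, var

# Toy spectral values for frames labeled with one phone:
mu, var = fit_gaussian([2.0, 4.0, 6.0])
```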
But we need 39 gaussians, not 1!
The observation o is really a vector of length 39
So need a vector of Gaussians:
p(\vec{o}\,|q) = \frac{1}{(2\pi)^{D/2}\sqrt{\prod_{d=1}^{D}\sigma^2[d]}} \exp\left(-\frac{1}{2}\sum_{d=1}^{D}\frac{(o[d]-\mu[d])^2}{\sigma^2[d]}\right)
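In log space this diagonal-covariance density splits into per-dimension terms, which is how it is usually computed; a minimal sketch (D would be 39 in practice):

```python
import math

def diag_gaussian_logpdf(o, mu, var):
    """log p(o|q) for a diagonal-covariance Gaussian.
    o, mu, var are equal-length lists (per-dimension values)."""
    D = len(o)
    ll = -0.5 * D * math.log(2 * math.pi)
    for d in range(D):
        ll -= 0.5 * math.log(var[d])
        ll -= (o[d] - mu[d]) ** 2 / (2 * var[d])
    return ll

# 1-D unit Gaussian at its mean: log(1/sqrt(2*pi)) ~ -0.9189
v1 = diag_gaussian_logpdf([0.0], [0.0], [1.0])
v2 = diag_gaussian_logpdf([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
```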
Gaussian Intuitions: Size of Σ
µ = [0 0] µ = [0 0] µ = [0 0]
Σ = I Σ = 0.6I Σ = 2I
As Σ becomes larger, Gaussian becomes
more spread out; as Σ becomes smaller, Gaussian more compressed
Text and figures from Andrew Ng’s lecture notes for CS229
Actually, a mixture of Gaussians
Each phone is modeled by a weighted sum of Gaussians
Hence able to model complex facts about the data
[Figure: example mixture densities for Phone A and Phone B]
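A mixture density is just a weighted sum of component Gaussians; a 1-D sketch, assuming the mixture weights sum to 1:

```python
import math

def gmm_pdf(o, weights, means, sigmas):
    """p(o|q) as a weighted sum of 1-D Gaussians (weights assumed to sum to 1)."""
    total = 0.0
    for w, mu, s in zip(weights, means, sigmas):
        total += w * math.exp(-((o - mu) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))
    return total

# A one-component "mixture" reduces to a single Gaussian:
single = gmm_pdf(0.0, [1.0], [0.0], [1.0])
# Two identical components with half weight each give the same density:
split = gmm_pdf(0.0, [0.5, 0.5], [0.0, 0.0], [1.0, 1.0])
```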
Gaussian acoustic modeling
Summary: each phone is represented by a GMM parameterized by
M mixture weights
M mean vectors
M covariance matrices
Usually assume covariance matrix is diagonal
I.e. just keep a separate variance for each cepstral feature
HMMs for speech
HMM for digit recognition task
Training and Decoding
Training
Would be easy if phones were observed (Maximum Likelihood)
But they are not... and neither are mixture weights
Use EM algorithm (Expectation Maximization)
Decoding
Basic idea: Viterbi algorithm from last time
But many little details...
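The Viterbi recursion can be sketched on a toy two-state HMM. This uses raw probabilities rather than log-probabilities to keep the sketch short; real decoders work in log space, and the state/observation names below are invented for illustration.

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely state path for an observation sequence (toy, non-log version)."""
    # V[t][s] = (best probability of any path ending in s at time t, best predecessor)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            best_prev = max(states, key=lambda r: V[-1][r][0] * trans_p[r][s])
            row[s] = (V[-1][best_prev][0] * trans_p[best_prev][s] * emit_p[s][o], best_prev)
        V.append(row)
    # Backtrace from the best final state
    last = max(states, key=lambda s: V[-1][s][0])
    path = [last]
    for row in reversed(V[1:]):
        path.append(row[path[-1]][1])
    return list(reversed(path))

# Toy HMM: state q1 tends to emit "x", state q2 tends to emit "y".
states = ["q1", "q2"]
start = {"q1": 1.0, "q2": 0.0}
trans = {"q1": {"q1": 0.5, "q2": 0.5}, "q2": {"q1": 0.0, "q2": 1.0}}
emit = {"q1": {"x": 0.9, "y": 0.1}, "q2": {"x": 0.1, "y": 0.9}}
path = viterbi(["x", "y"], states, start, trans, emit)
```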
Summary: ASR Architecture
Five easy pieces: ASR Noisy Channel architecture
1) Feature Extraction: 39 “MFCC” features
2) Acoustic Model: Gaussians for computing p(o|q)
3) Lexicon/Pronunciation Model: HMM of what phones can follow each other
4) Language Model: N-grams for computing p(wi|wi-1)
5) Decoder: Viterbi algorithm, dynamic programming for combining all these to get a word sequence from speech!