Automatic Speech Recognition (ASR): A Brief Overview
Radio Rex – 1920’s ASR
Statistical ASR
• i_best = argmax_i P(M_i | X)
  = argmax_i P(X | M_i) P(M_i)
  (1st term, acoustic model; 2nd term, language model)
• P(X | M_i) ≈ P(X | Q_i) [Viterbi approx.]
  where Q_i is the best state sequence in M_i,
  approximated by a product of local likelihoods
  (Markov, conditional independence assumptions)
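The argmax decision above can be sketched in a few lines. This is a minimal illustration, not a real recognizer: the per-model scores below are made-up numbers standing in for log acoustic and log language-model scores.

```python
import math

# Hypothetical scores for one utterance X (log domain):
# log P(X | M_i) from the acoustic model, log P(M_i) from the language model.
log_acoustic = {"yes": -12.0, "no": -15.0, "maybe": -13.5}
log_lm       = {"yes": math.log(0.5), "no": math.log(0.4), "maybe": math.log(0.1)}

def best_hypothesis(log_acoustic, log_lm):
    """argmax_i  log P(X|M_i) + log P(M_i)  (Bayes rule; P(X) is constant)."""
    return max(log_acoustic, key=lambda m: log_acoustic[m] + log_lm[m])

print(best_hypothesis(log_acoustic, log_lm))   # "yes"
```

Working in the log domain turns the product of acoustic and language scores into a sum, which is how real decoders avoid underflow.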
Automatic Speech Recognition
Speech Production/Collection
Pre-processing
Feature Extraction
Hypothesis Generation
Cost Estimator
Decoding
Simplified Model of Speech Production
• Periodic or random source: vocal vibration or turbulence (fine spectral structure)
• Filters: vocal tract, nasal tract, radiation (spectral envelope)
Pre-processing
Speech → Room Acoustics → Microphone → Linear Filtering → Sampling & Digitization
Issues: noise and reverberation, and their effect on modeling
Framewise Analysis of Speech
Frame 1 → Feature Vector X1
Frame 2 → Feature Vector X2
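Framewise analysis can be sketched as slicing the waveform into overlapping windows. The frame length and hop below (25 ms windows every 10 ms at 16 kHz) are typical choices, not values taken from these slides.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a waveform into overlapping frames (25 ms windows every
    10 ms at a 16 kHz sampling rate). Each frame later yields one
    feature vector (X1, X2, ...)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.random.randn(16000)   # one second of stand-in "speech" at 16 kHz
frames = frame_signal(x)
print(frames.shape)          # (98, 400)
```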
Feature Extraction
Spectral Analysis → Auditory Model / Orthogonalize (cepstrum)
Issues: design for discrimination, insensitivity to scaling and simple distortions
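A minimal sketch of the orthogonalize-to-cepstrum step: take the log magnitude spectrum of a windowed frame and invert the transform. This is the plain real cepstrum, a simplification of the auditory-model front ends (e.g., PLP) mentioned elsewhere in the slides; the frame length and coefficient count are illustrative.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """Real cepstrum of one windowed frame: inverse DFT of the log
    magnitude spectrum. Low-order coefficients capture the spectral
    envelope (vocal tract); higher-order ones the fine structure."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    log_spec = np.log(spectrum + 1e-10)   # floor to avoid log(0)
    return np.fft.irfft(log_spec)[:n_coeffs]

frame = np.random.randn(400)
print(real_cepstrum(frame).shape)   # (13,)
```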
Representations are Important
Speech waveform → Network: 23% frame correct
PLP features → Network: 70% frame correct
Mel Frequency Scale
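The mel scale warps frequency to follow auditory pitch perception: roughly linear below 1 kHz, logarithmic above. A common closed form (O'Shaughnessy's formula, one of several in use) is:

```python
import math

def hz_to_mel(f):
    """Mel scale: roughly linear below 1 kHz, logarithmic above,
    mirroring how the ear resolves pitch."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))   # ~1000 mel at 1 kHz by construction
```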
Spectral vs Temporal Processing
• Spectral processing: analysis across frequency at a given time (e.g., cepstral analysis)
• Temporal processing: processing along time at a given frequency (e.g., mean removal)
Hypothesis Generation
Issue: models of language and task
cat
dog
a dog is not a cat
a cat not is a dog
Cost Estimation
• Distances
• −Log probabilities, from discrete distributions, Gaussians, Gaussian mixtures, or neural networks
Nonlinear Time Normalization
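The classic mechanism for nonlinear time normalization is dynamic time warping: align two feature sequences of different lengths by the minimum-cost monotonic warp. A minimal sketch (local Euclidean cost, standard step pattern; real systems add slope constraints):

```python
import numpy as np

def dtw_cost(a, b):
    """Dynamic time warping: minimum cumulative cost of aligning
    sequence a to sequence b with a monotonic, continuous warp."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])
print(dtw_cost(a, b))   # 0.0 — b is a time-stretched copy of a
```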
Decoding
Pronunciation Models
Language Models
Most likely words give the largest product
P(acoustics | words) P(words)
P(words) = ∏ P(word | history)
• bigram: history is the previous word
• trigram: history is the previous 2 words
• n-gram: history is the previous n−1 words
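A maximum-likelihood bigram model can be sketched from counts. The toy corpus reuses the hypothesis-generation example sentences; real systems smooth these counts to handle unseen bigrams.

```python
from collections import Counter

def bigram_probs(corpus):
    """ML bigram model: P(w | h) = count(h, w) / count(h),
    with <s> marking the start of each sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {hw: c / unigrams[hw[0]] for hw, c in bigrams.items()}

lm = bigram_probs(["a dog is not a cat", "a cat is not a dog"])
print(lm[("is", "not")])   # 1.0 — "is" is always followed by "not" here
print(lm[("a", "dog")])    # 0.5
```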
ASR System Architecture
Speech Signal → Signal Processing (cepstrum) → Acoustic Probability Estimator (HMM state likelihoods, e.g., “z” = 0.81, “th” = 0.15, “t” = 0.03) → Decoder → Recognized Words (“zero” “three” “two”)
The decoder consults the Pronunciation Lexicon and the Language Model.
HMMs for Speech
• Math from Baum and others, 1966-1972
• Applied to speech by Baker in the
original CMU Dragon System (1974)
• Developed by IBM (Baker, Jelinek, Bahl, Mercer, …) (1970-1993)
• Extended by others in the mid-1980’s
Hidden Markov Model (graphical form)
States: q1 → q2 → q3 → q4
Observations: x1, x2, x3, x4 (each x_t emitted by the corresponding q_t)
Hidden Markov Model (state machine form)
States q1, q2, q3 with emission probabilities P(x | q1), P(x | q2), P(x | q3)
and transition probabilities P(q2 | q1), P(q3 | q2), P(q4 | q3)
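The Viterbi approximation from the statistical-ASR slide finds the single best state sequence through such a model. A small sketch with made-up transition, emission, and initial probabilities (here the "emissions" are precomputed log likelihoods per frame):

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """Best state sequence through an HMM: P(X|M) ≈ P(X|Q).
    log_trans[i, j] = log P(q_j | q_i);  log_emit[t, j] = log P(x_t | q_j)."""
    T, N = log_emit.shape
    delta = log_init + log_emit[0]          # best score ending in each state
    back = np.zeros((T, N), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # N x N predecessor scores
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]            # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

log_trans = np.log(np.array([[0.7, 0.3], [0.1, 0.9]]))
log_init  = np.log(np.array([0.9, 0.1]))
log_emit  = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]))
print(viterbi(log_trans, log_emit, log_init))   # [0, 0, 1]
```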
Markov Model
States: q1 → q2
P(x1, x2 | q1, q2) = P(q1) P(x1 | q1) P(q2 | q1) P(x2 | q2)
HMM Training Steps
• Initialize estimators and models
• Estimate “hidden” variable probabilities
• Choose estimator parameters to maximize
model likelihoods
• Assess and repeat steps as necessary
• A special case of Expectation
Maximization (EM)
Progress in 3 Decades
• From digits to 60,000 words
• From single speakers to many
• From isolated words to continuous
speech
• From no products to many products,
some systems actually saving LOTS
of money
Real Uses
• Telephone: phone company services
(collect versus credit card)
• Telephone: call centers for query
information (e.g., stock quotes,
parcel tracking)
• Dictation products: continuous
recognition, speaker dependent/adaptive
But:
• Still <97% on “yes” for telephone
• Unexpected rate of speech causes doubling
or tripling of error rate
• Unexpected accent hurts badly
• Performance on unrestricted speech at 70%
(with good acoustics)
• Don’t know when we know
• Few advances in basic understanding
Why is ASR Hard?
• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over:
global rate, local rate, pronunciation
within speaker, pronunciation across
speakers, phonemes in different
contexts
Why is ASR Hard? (continued)
• Large vocabularies are confusable
• Out-of-vocabulary words are inevitable
• Recorded speech is variable over: room acoustics, channel characteristics, background noise
• Large training times are not practical
• User expectations are for equal to or greater than “human performance”
ASR Dimensions
• Speaker dependent, independent
• Isolated, continuous, keywords
• Lexicon size and difficulty
• Task constraints, perplexity
• Adverse or easy conditions
• Natural or read speech
Telephone Speech
• Limited bandwidth (F vs S)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and handsfree acoustics
Hot Research Problems
• Speech in noise
• Multilingual conversational speech (EARS)
• Portable (e.g., cellular) ASR
• Question answering
• Understanding meetings – or at least browsing them
Hot Research Approaches
• New (multiple) features and models
• New statistical dependencies
• Multiple time scales
• Multiple (larger) sound units
• Dynamic/robust pronunciation models
• Long-range language models
• Incorporating prosody
• Incorporating meaning
• Non-speech modalities
• Understanding confidence
Multi-frame analysis
• Incorporate multiple frames as
a single observation
• LDA the most common approach
• Neural networks
• Bayesian networks (graphical models,
including Buried Markov Models)
Linear Discriminant Analysis (LDA)
All variables for several frames are stacked into one vector x (x1, …, x5) and projected to a lower-dimensional y (y1, y2): y = Wx
The transformation W maximizes the ratio: between-class variance / within-class variance
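A sketch of Fisher's criterion, maximizing between-class over within-class variance via the eigenvectors of S_w⁻¹ S_b. The data here are synthetic (two Gaussian classes standing in for stacked multi-frame feature vectors), and the pseudo-inverse is a simplification of the regularized solvers used in practice.

```python
import numpy as np

def lda_directions(X, y, n_dims=2):
    """Fisher LDA: directions maximizing the ratio of between-class
    to within-class variance, via the top eigenvectors of pinv(Sw) @ Sb."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:n_dims]]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
W = lda_directions(X, y)            # 5-dim input -> 2-dim output, as in the figure
print((X @ W).shape)                # (100, 2)
```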
Multi-layer perceptron
Buried Markov Models
Multi-stream analysis
• Multi-band systems
• Multiple temporal properties
• Multiple data-driven temporal filters
Multi-band analysis
Temporally distinct features
Combining streams
Another novel approach: Articulator dynamics
• Natural representation of context
• Production apparatus has mass, inertia
• Difficult to accurately model
• Can approximate with simple dynamics
Hidden Dynamic Models
“We hold these truths to be self-evident: that speech is produced by an underlying dynamic system, that it is endowed by its production system with certain inherent dynamic qualities, among these are compactness, continuity, and the pursuit of target values for each phone class, that to exploit these characteristics Hidden Dynamic Models are instituted among men. We … solemnly publish and declare, that these phone classes are and of right ought to be free and context-independent states … And for the support of this declaration, with a firm reliance on the acoustic theory of speech production, we mutually pledge our lives, our fortunes, and our sacred honor.”
John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop, with apologies to Thomas Jefferson ...
(See http://www.clsp.jhu.edu/ws98/projects/dynamic/)
Hidden Dynamic Models
Block diagram: target values → target switch (driven by the segmentation) → filter → neural network → speech pattern
Sources of Optimism
• Comparatively new research lines
• Many examples of improvements
• Moore’s Law → much more processing
• Points toward joint development of
front end and statistical components
Summary
• 2002 ASR based on 50+ years of research
• Core algorithms → mature systems, 10-30 yrs
• Deeply difficult, but tasks can be chosen
that are easier in SOME dimension
• Much more yet to do