Automatic Speech Recognition (ASR): A Brief Overview
Page 1: Automatic Speech Recognition (ASR): A Brief Overview.

Automatic Speech Recognition (ASR): A Brief Overview

Page 2: Automatic Speech Recognition (ASR): A Brief Overview.

Radio Rex – 1920’s ASR

Page 3: Automatic Speech Recognition (ASR): A Brief Overview.

Statistical ASR

• i_best = argmax_i P(M_i | X) = argmax_i P(X | M_i) P(M_i)

(1st term: acoustic model; 2nd term: language model)

• P(X | M_i) ≈ P(X | Q_i)  [Viterbi approx.]

where Q_i is the best state sequence in M_i, approximated by a product of local likelihoods (Markov, conditional independence assumptions)
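
As an illustration of the decision rule above, here is a minimal Python sketch; `acoustic_logprob` and `language_logprob` are hypothetical stand-ins for real acoustic and language models, not functions from the slides.

```python
import math

def recognize(X, models, acoustic_logprob, language_logprob):
    """Pick the model M_i maximizing log P(X | M_i) + log P(M_i).

    X                -- sequence of feature vectors for one utterance
    models           -- candidate word/sentence models M_i
    acoustic_logprob -- callable giving log P(X | M_i) (Viterbi-approximated)
    language_logprob -- callable giving log P(M_i)
    """
    best_model, best_score = None, -math.inf
    for M in models:
        score = acoustic_logprob(X, M) + language_logprob(M)
        if score > best_score:
            best_model, best_score = M, score
    return best_model
```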

Page 4: Automatic Speech Recognition (ASR): A Brief Overview.

Automatic Speech Recognition

Speech Production/Collection

Pre-processing

Feature Extraction

Hypothesis Generation

Cost Estimator

Decoding

Page 5: Automatic Speech Recognition (ASR): A Brief Overview.

Simplified Model of Speech Production

Periodic source / random source: vocal vibration or turbulence (fine spectral structure)

Filters: vocal tract, nasal tract, radiation (spectral envelope)

Page 6: Automatic Speech Recognition (ASR): A Brief Overview.

Pre-processing

Speech → Room Acoustics → Microphone → Linear Filtering → Sampling & Digitization

Issues: Noise and reverb, effect on modeling

Page 7: Automatic Speech Recognition (ASR): A Brief Overview.

Framewise Analysis of Speech

Frame 1 → Feature Vector X1
Frame 2 → Feature Vector X2
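
A minimal sketch of the framewise analysis shown above, assuming a typical 25 ms window with a 10 ms hop (these values are an assumption, not taken from the slides):

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Split a 1-D waveform into overlapping frames; one feature vector is computed per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop_len)
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

# Example: 1 s of audio at 8 kHz -> 98 frames of 200 samples each.
frames = frame_signal(np.zeros(8000), sample_rate=8000)
```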

Page 8: Automatic Speech Recognition (ASR): A Brief Overview.

Feature Extraction

Spectral Analysis → Auditory Model → Orthogonalize (cepstrum)

Issues: design for discrimination, insensitivity to scaling and simple distortions
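
A minimal sketch of the spectral analysis and orthogonalization (cepstrum) steps for a single frame, assuming NumPy/SciPy; the auditory-model stage (e.g., mel or PLP weighting) is omitted here:

```python
import numpy as np
from scipy.fftpack import dct

def frame_to_cepstrum(frame, n_ceps=13):
    """Windowed frame -> log power spectrum -> DCT, giving roughly decorrelated cepstra."""
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    log_spec = np.log(power + 1e-10)               # small floor avoids log(0)
    return dct(log_spec, type=2, norm='ortho')[:n_ceps]
```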

Page 9: Automatic Speech Recognition (ASR): A Brief Overview.

Representations are Important

Speech waveform → network: 23% frame correct

PLP features → network: 70% frame correct

Page 10: Automatic Speech Recognition (ASR): A Brief Overview.

Mel Frequency Scale
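
The slide does not spell out the mapping; a commonly used formula (an assumption here, not taken from the slide) is mel(f) = 2595 · log10(1 + f/700):

```python
import math

def hz_to_mel(f_hz):
    """Map frequency in Hz to the (approximately perceptual) mel scale."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used when laying out mel-spaced filterbank centers."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000.0)))   # ~1000 mel at 1 kHz, by construction
```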

Page 11: Automatic Speech Recognition (ASR): A Brief Overview.

Spectral vs Temporal Processing

Spectral processing: analysis across frequency at each time frame (e.g., cepstral analysis)

Temporal processing: processing along time within each frequency band (e.g., mean removal)
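
A minimal sketch of the temporal-processing example (mean removal along time, per coefficient), assuming features are stored as a frames × coefficients NumPy array:

```python
import numpy as np

def cepstral_mean_removal(features):
    """Subtract each coefficient's mean over time; removes stationary channel effects."""
    return features - features.mean(axis=0, keepdims=True)

# features: (n_frames, n_ceps); each column (one coefficient's trajectory) becomes zero-mean.
```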

Page 12: Automatic Speech Recognition (ASR): A Brief Overview.

Hypothesis Generation

Issue: models of language and task

cat

dog

a dog is not a cat

a cat not is a dog

Page 13: Automatic Speech Recognition (ASR): A Brief Overview.

Cost Estimation

• Distances

• -Log probabilities, e.g., from discrete distributions, Gaussians, mixtures, or neural networks (see the sketch below)
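
As one concrete cost, the sketch below computes the negative log likelihood of a diagonal-covariance Gaussian; this is an illustrative choice among the estimator types listed above:

```python
import numpy as np

def neg_log_gaussian(x, mean, var):
    """-log N(x; mean, diag(var)) for one feature vector: a local acoustic cost."""
    x, mean, var = map(np.asarray, (x, mean, var))
    return 0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)
```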

Page 14: Automatic Speech Recognition (ASR): A Brief Overview.

Nonlinear Time Normalization
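
Nonlinear time normalization is classically done with dynamic time warping (DTW); a minimal sketch, assuming a Euclidean local distance between frames:

```python
import numpy as np

def dtw_cost(ref, test):
    """Total cost of the best nonlinear time alignment between two frame sequences."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = np.linalg.norm(np.asarray(ref[i - 1]) - np.asarray(test[j - 1]))
            D[i, j] = local + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```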

Page 15: Automatic Speech Recognition (ASR): A Brief Overview.

Decoding
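
For a single HMM, decoding can be illustrated with the Viterbi algorithm (the approximation already mentioned on the Statistical ASR slide); a minimal sketch in log probabilities, assuming NumPy arrays and discrete observation symbols:

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit, obs):
    """Most likely state sequence for an observation sequence.

    log_init  -- (S,)   log P(q_0 = s)
    log_trans -- (S, S) log P(q_t = j | q_{t-1} = i)
    log_emit  -- (S, V) log P(x_t = v | q_t = s)
    obs       -- list of observation symbol indices
    """
    S, T = len(log_init), len(obs)
    delta = np.zeros((T, S))
    back = np.zeros((T, S), dtype=int)
    delta[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans   # [i, j]: best score in state i, then i -> j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```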

Page 16: Automatic Speech Recognition (ASR): A Brief Overview.

Pronunciation Models

Page 17: Automatic Speech Recognition (ASR): A Brief Overview.

Language Models

Most likely words = those giving the largest product:

P(acoustics | words) P(words)

P(words) = ∏ P(word | history)

• bigram: history is previous word

• trigram: history is previous 2 words

• n-gram: history is previous n-1 words
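
A minimal sketch of a bigram model (history = previous word), estimated by relative frequency from a toy corpus; smoothing is omitted, so unseen bigrams get probability zero:

```python
from collections import Counter

def train_bigram(sentences):
    """Estimate P(word | previous word) by relative frequency."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words
        unigrams.update(padded[:-1])
        bigrams.update(zip(padded[:-1], padded[1:]))
    return lambda w, h: bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0

p = train_bigram([["a", "dog", "is", "not", "a", "cat"]])
print(p("dog", "a"))   # 0.5 -- "a" is followed once by "dog" and once by "cat"
```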

Page 18: Automatic Speech Recognition (ASR): A Brief Overview.

ASR System Architecture

Block diagram: Speech Signal → Signal Processing → Cepstrum → Acoustic Probability Estimator (HMM state likelihoods, e.g., “z” = 0.81, “th” = 0.15, “t” = 0.03) → Decoder → Recognized Words (“zero”, “three”, “two”); the Pronunciation Lexicon and Language Model also feed the Decoder.

Page 19: Automatic Speech Recognition (ASR): A Brief Overview.

HMMs for Speech

• Math from Baum and others, 1966-1972

• Applied to speech by Baker in the

original CMU Dragon System (1974)

• Developed by IBM (Baker, Jelinek, Bahl,

Mercer,….) (1970-1993)

• Extended by others in the mid-1980’s

Page 20: Automatic Speech Recognition (ASR): A Brief Overview.

Hidden Markov Model (graphical form)

Hidden states q1 → q2 → q3 → q4, each emitting an observation x1, x2, x3, x4.

Page 21: Automatic Speech Recognition (ASR): A Brief Overview.

Hidden Markov Model (state machine form)

States q1, q2, q3 in a left-to-right chain, with emission probabilities P(x | q1), P(x | q2), P(x | q3) and transition probabilities P(q2 | q1), P(q3 | q2), P(q4 | q3).

Page 22: Automatic Speech Recognition (ASR): A Brief Overview.

Markov model

For a two-state path q1, q2 with observations x1, x2:

P(x1, x2, q1, q2) = P(q1) P(x1 | q1) P(q2 | q1) P(x2 | q2)
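
The two-frame factorization can be evaluated directly; the numbers below are made up purely for illustration:

```python
def joint_two_frames(p_q1, p_x1_given_q1, p_q2_given_q1, p_x2_given_q2):
    """P(x1, x2, q1, q2) = P(q1) * P(x1|q1) * P(q2|q1) * P(x2|q2)."""
    return p_q1 * p_x1_given_q1 * p_q2_given_q1 * p_x2_given_q2

print(joint_two_frames(1.0, 0.4, 0.6, 0.3))   # 0.072
```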

Page 23: Automatic Speech Recognition (ASR): A Brief Overview.

HMM Training Steps

• Initialize estimators and models

• Estimate “hidden” variable probabilities

• Choose estimator parameters to maximize

model likelihoods

• Assess and repeat steps as necessary

• A special case of Expectation

Maximization (EM)

Page 24: Automatic Speech Recognition (ASR): A Brief Overview.

Progress in 3 Decades

• From digits to 60,000 words

• From single speakers to many

• From isolated words to continuous

speech

• From no products to many products,

some systems actually saving LOTS

of money

Page 25: Automatic Speech Recognition (ASR): A Brief Overview.

Real Uses

• Telephone: phone company services

(collect versus credit card)

• Telephone: call centers for query

information (e.g., stock quotes,

parcel tracking)

• Dictation products: continuous

recognition, speaker dependent/adaptive

Page 26: Automatic Speech Recognition (ASR): A Brief Overview.

But:

• Still <97% on “yes” for telephone

• Unexpected rate of speech causes doubling

or tripling of error rate

• Unexpected accent hurts badly

• Performance on unrestricted speech at 70%

(with good acoustics)

• Don’t know when we know

• Few advances in basic understanding

Page 27: Automatic Speech Recognition (ASR): A Brief Overview.

Why is ASR Hard?

• Natural speech is continuous

• Natural speech has disfluencies

• Natural speech is variable over:

global rate, local rate, pronunciation

within speaker, pronunciation across

speakers, phonemes in different

contexts

Page 28: Automatic Speech Recognition (ASR): A Brief Overview.

Why is ASR Hard? (continued)

• Large vocabularies are confusable

• Out-of-vocabulary words inevitable

• Recorded speech is variable over: room acoustics, channel characteristics, background noise

• Large training times are not practical

• User expectations are for equal to or greater than “human performance”

Page 29: Automatic Speech Recognition (ASR): A Brief Overview.

ASR Dimensions

• Speaker dependent, independent

• Isolated, continuous, keywords

• Lexicon size and difficulty

• Task constraints, perplexity

• Adverse or easy conditions

• Natural or read speech

Page 30: Automatic Speech Recognition (ASR): A Brief Overview.

Telephone Speech

• Limited bandwidth (F vs S)

• Large speaker variability

• Large noise variability

• Channel distortion

• Different handset microphones

• Mobile and handsfree acoustics

Page 31: Automatic Speech Recognition (ASR): A Brief Overview.

Hot Research Problems

• Speech in noise

• Multilingual conversational speech (EARS)

• Portable (e.g., cellular) ASR

• Question answering

• Understanding meetings – or at least browsing them

Page 32: Automatic Speech Recognition (ASR): A Brief Overview.

Hot Research Approaches

• New (multiple) features and models

• New statistical dependencies

• Multiple time scales

• Multiple (larger) sound units

• Dynamic/robust pronunciation models

• Long-range language models

• Incorporating prosody

• Incorporating meaning

• Non-speech modalities

• Understanding confidence

Page 33: Automatic Speech Recognition (ASR): A Brief Overview.

Multi-frame analysis

• Incorporate multiple frames as a single observation (see the stacking sketch below)

• LDA the most common approach

• Neural networks

• Bayesian networks (graphical models,

including Buried Markov Models)
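
A minimal sketch of turning multiple frames into a single observation by stacking a context window around each frame (the kind of input typically handed to LDA or a neural network); the ±2-frame window is an assumption:

```python
import numpy as np

def stack_frames(features, context=2):
    """Concatenate each frame with its +/- `context` neighbours (edges are repeated)."""
    padded = np.pad(features, ((context, context), (0, 0)), mode='edge')
    return np.hstack([padded[i:i + len(features)]
                      for i in range(2 * context + 1)])

# (n_frames, n_feats) -> (n_frames, (2 * context + 1) * n_feats)
```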

Page 34: Automatic Speech Recognition (ASR): A Brief Overview.

Linear Discriminant Analysis (LDA)

All variables for several frames are collected into one vector x = (x1, …, x5); a linear transformation maps it to a shorter vector y = (y1, y2).

The transformation is chosen to maximize the ratio: between-class variance / within-class variance.
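
A minimal LDA sketch on stacked-frame vectors, assuming rows of X are training vectors and y holds integer class labels (e.g., phone or HMM-state labels); it maximizes the between-class / within-class variance ratio described above:

```python
import numpy as np

def lda_transform(X, y, n_dims=2):
    """Project X onto directions maximizing between-class vs. within-class scatter."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))   # within-class scatter
    Sb = np.zeros((d, d))   # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        Sw += (Xc - mean_c).T @ (Xc - mean_c)
        diff = (mean_c - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Generalized eigenproblem Sb v = lambda Sw v (pinv keeps it usable if Sw is singular).
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    W = eigvecs[:, order[:n_dims]].real
    return X @ W          # the lower-dimensional discriminant features y
```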

Page 35: Automatic Speech Recognition (ASR): A Brief Overview.

Multi-layer perceptron
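
The MLP figure is not reproduced here; as a minimal sketch, a single-hidden-layer perceptron mapping a (stacked) feature vector to per-class posteriors, with randomly initialized weights purely for illustration:

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """One sigmoid hidden layer, softmax output interpreted as class posteriors."""
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))    # hidden activations
    z = W2 @ h + b2
    e = np.exp(z - z.max())                     # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 9 * 13, 100, 40            # e.g., 9 stacked cepstral frames -> 40 phones
W1, b1 = 0.01 * rng.normal(size=(n_hid, n_in)), np.zeros(n_hid)
W2, b2 = 0.01 * rng.normal(size=(n_out, n_hid)), np.zeros(n_out)
posteriors = mlp_posteriors(rng.normal(size=n_in), W1, b1, W2, b2)   # sums to 1
```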

Page 36: Automatic Speech Recognition (ASR): A Brief Overview.

Buried Markov Models

Page 37: Automatic Speech Recognition (ASR): A Brief Overview.

Multi-stream analysis

• Multi-band systems

• Multiple temporal properties

• Multiple data-driven temporal filters

Page 38: Automatic Speech Recognition (ASR): A Brief Overview.

Multi-band analysis

Page 39: Automatic Speech Recognition (ASR): A Brief Overview.

Temporally distinct features

Page 40: Automatic Speech Recognition (ASR): A Brief Overview.

Combining streams

Page 41: Automatic Speech Recognition (ASR): A Brief Overview.

Another novel approach: Articulator dynamics

• Natural representation of context

• Production apparatus has mass, inertia

• Difficult to accurately model

• Can approximate with simple dynamics

Page 42: Automatic Speech Recognition (ASR): A Brief Overview.

Hidden Dynamic Models

“We hold these truths to be self-evident: that speech is produced by an underlying dynamic system, that it is endowed by its production system with certain inherent dynamic qualities, among these are compactness, continuity, and the pursuit of target values for each phone class, that to exploit these characteristics Hidden Dynamic Models are instituted among men. We … solemnly publish and declare, that these phone classes are and of right ought to be free and context independent states … And for the support of this declaration, with a firm reliance on the acoustic theory of speech production, we mutually pledge our lives, our fortunes, and our sacred honor.”

John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop, with apologies to Thomas Jefferson ...

(See http://www.clsp.jhu.edu/ws98/projects/dynamic/)

Page 43: Automatic Speech Recognition (ASR): A Brief Overview.

Hidden Dynamic Models

(Block diagram: target values, segmentation, target switch, filter, neural network, speech pattern.)

Page 44: Automatic Speech Recognition (ASR): A Brief Overview.

Sources of Optimism

• Comparatively new research lines

• Many examples of improvements

• Moore’s Law → much more processing

• Points toward joint development of

front end and statistical components

Page 45: Automatic Speech Recognition (ASR): A Brief Overview.

Summary

• 2002 ASR based on 50+ years of research

• Core algorithms → mature systems, 10-30 yrs

• Deeply difficult, but tasks can be chosen

that are easier in SOME dimension

• Much more yet to do

