Automatic Speech Recognition (ASR): A Brief Overview
Radio Rex – 1920’s ASR
Statistical ASR
• i_best = argmax_i P(M_i | X)
  = argmax_i P(X | M_i) P(M_i)
  (1st term, acoustic model; 2nd term, language model)
• P(X | M_i) ≈ P(X | Q_i) [Viterbi approx.]
  where Q_i is the best state sequence in M_i,
  approximated by a product of local likelihoods
  (Markov, conditional independence assumptions)
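The argmax decision above can be sketched in a few lines. This is a minimal illustration, not a real recognizer: the per-model scores below are made-up numbers standing in for log acoustic and log language-model scores.

```python
import math

# Hypothetical scores for one utterance X (log domain):
# log P(X | M_i) from the acoustic model, log P(M_i) from the language model.
log_acoustic = {"yes": -12.0, "no": -15.0, "maybe": -13.5}
log_lm       = {"yes": math.log(0.5), "no": math.log(0.4), "maybe": math.log(0.1)}

def best_hypothesis(log_acoustic, log_lm):
    """argmax_i  log P(X|M_i) + log P(M_i)  (Bayes rule; P(X) is constant)."""
    return max(log_acoustic, key=lambda m: log_acoustic[m] + log_lm[m])

print(best_hypothesis(log_acoustic, log_lm))   # "yes"
```

Working in the log domain turns the product of acoustic and language scores into a sum, which is how real decoders avoid underflow.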
Automatic Speech Recognition
Speech Production/Collection
Pre-processing
Feature Extraction
Hypothesis Generation
Cost Estimator
Decoding
Simplified Model of Speech Production
• Periodic or random source: vocal vibration or turbulence (fine spectral structure)
• Filters: vocal tract, nasal tract, radiation (spectral envelope)
Pre-processing
Speech → Room Acoustics → Microphone → Linear Filtering → Sampling & Digitization
Issues: noise and reverberation, and their effect on modeling
Framewise Analysis of Speech
Frame 1 → Feature Vector X1
Frame 2 → Feature Vector X2
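Framewise analysis can be sketched as slicing the waveform into overlapping windows. The frame length and hop below (25 ms windows every 10 ms at 16 kHz) are typical choices, not values taken from these slides.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a waveform into overlapping frames (25 ms windows every
    10 ms at a 16 kHz sampling rate). Each frame later yields one
    feature vector (X1, X2, ...)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.random.randn(16000)   # one second of stand-in "speech" at 16 kHz
frames = frame_signal(x)
print(frames.shape)          # (98, 400)
```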
Feature Extraction
Spectral Analysis → Auditory Model / Orthogonalize (cepstrum)
Issues: design for discrimination, insensitivity to scaling and simple distortions
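A minimal sketch of the orthogonalize-to-cepstrum step: take the log magnitude spectrum of a windowed frame and invert the transform. This is the plain real cepstrum, a simplification of the auditory-model front ends (e.g., PLP) mentioned elsewhere in the slides; the frame length and coefficient count are illustrative.

```python
import numpy as np

def real_cepstrum(frame, n_coeffs=13):
    """Real cepstrum of one windowed frame: inverse DFT of the log
    magnitude spectrum. Low-order coefficients capture the spectral
    envelope (vocal tract); higher-order ones the fine structure."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    log_spec = np.log(spectrum + 1e-10)   # floor to avoid log(0)
    return np.fft.irfft(log_spec)[:n_coeffs]

frame = np.random.randn(400)
print(real_cepstrum(frame).shape)   # (13,)
```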
Representations are Important
Speech waveform → Network: 23% frame correct
PLP features → Network: 70% frame correct
Mel Frequency Scale
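The mel scale warps frequency to follow auditory pitch perception: roughly linear below 1 kHz, logarithmic above. A common closed form (O'Shaughnessy's formula, one of several in use) is:

```python
import math

def hz_to_mel(f):
    """Mel scale: roughly linear below 1 kHz, logarithmic above,
    mirroring how the ear resolves pitch."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, mel back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))   # ~1000 mel at 1 kHz by construction
```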
Spectral vs Temporal Processing
• Spectral processing: analysis across frequency at a given time (e.g., cepstral analysis)
• Temporal processing: processing along time at a given frequency (e.g., mean removal)
Hypothesis Generation
Issue: models of language and task
cat
dog
a dog is not a cat
a cat not is a dog
Cost Estimation
• Distances
• −Log probabilities, from discrete distributions, Gaussians, Gaussian mixtures, or neural networks
Nonlinear Time Normalization
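The classic mechanism for nonlinear time normalization is dynamic time warping: align two feature sequences of different lengths by the minimum-cost monotonic warp. A minimal sketch (local Euclidean cost, standard step pattern; real systems add slope constraints):

```python
import numpy as np

def dtw_cost(a, b):
    """Dynamic time warping: minimum cumulative cost of aligning
    sequence a to sequence b with a monotonic, continuous warp."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.array([[0.0], [1.0], [2.0]])
b = np.array([[0.0], [1.0], [1.0], [2.0]])
print(dtw_cost(a, b))   # 0.0 — b is a time-stretched copy of a
```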
Decoding
Pronunciation Models
Language Models
Most likely words give the largest product
P(acoustics | words) P(words)
P(words) = ∏ P(word | history)
• bigram: history is the previous word
• trigram: history is the previous 2 words
• n-gram: history is the previous n−1 words
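A maximum-likelihood bigram model can be sketched from counts. The toy corpus reuses the hypothesis-generation example sentences; real systems smooth these counts to handle unseen bigrams.

```python
from collections import Counter

def bigram_probs(corpus):
    """ML bigram model: P(w | h) = count(h, w) / count(h),
    with <s> marking the start of each sentence."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {hw: c / unigrams[hw[0]] for hw, c in bigrams.items()}

lm = bigram_probs(["a dog is not a cat", "a cat is not a dog"])
print(lm[("is", "not")])   # 1.0 — "is" is always followed by "not" here
print(lm[("a", "dog")])    # 0.5
```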
ASR System Architecture
Speech Signal → Signal Processing (cepstrum) → Acoustic Probability Estimator (HMM state likelihoods, e.g., “z” = 0.81, “th” = 0.15, “t” = 0.03) → Decoder → Recognized Words (“zero” “three” “two”)
The decoder consults the Pronunciation Lexicon and the Language Model.
HMMs for Speech
• Math from Baum and others, 1966-1972
• Applied to speech by Baker in the
original CMU Dragon System (1974)
• Developed by IBM (Baker, Jelinek, Bahl, Mercer, …) (1970-1993)
• Extended by others in the mid-1980’s
Hidden Markov Model (graphical form)
States: q1 → q2 → q3 → q4
Observations: x1, x2, x3, x4 (each x_t emitted by the corresponding q_t)
Hidden Markov Model (state machine form)
States q1, q2, q3 with emission probabilities P(x | q1), P(x | q2), P(x | q3)
and transition probabilities P(q2 | q1), P(q3 | q2), P(q4 | q3)
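The Viterbi approximation from the statistical-ASR slide finds the single best state sequence through such a model. A small sketch with made-up transition, emission, and initial probabilities (here the "emissions" are precomputed log likelihoods per frame):

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init):
    """Best state sequence through an HMM: P(X|M) ≈ P(X|Q).
    log_trans[i, j] = log P(q_j | q_i);  log_emit[t, j] = log P(x_t | q_j)."""
    T, N = log_emit.shape
    delta = log_init + log_emit[0]          # best score ending in each state
    back = np.zeros((T, N), dtype=int)      # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # N x N predecessor scores
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]            # trace back the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

log_trans = np.log(np.array([[0.7, 0.3], [0.1, 0.9]]))
log_init  = np.log(np.array([0.9, 0.1]))
log_emit  = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]))
print(viterbi(log_trans, log_emit, log_init))   # [0, 0, 1]
```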
Markov Model
States: q1 → q2
P(x1, x2 | q1, q2) = P(q1) P(x1 | q1) P(q2 | q1) P(x2 | q2)
HMM Training Steps
• Initialize estimators and models
• Estimate “hidden” variable probabilities
• Choose estimator parameters to maximize
model likelihoods
• Assess and repeat steps as necessary
• A special case of Expectation
Maximization (EM)
Progress in 3 Decades
• From digits to 60,000 words
• From single speakers to many
• From isolated words to continuous
speech
• From no products to many products,
some systems actually saving LOTS
of money
Real Uses
• Telephone: phone company services
(collect versus credit card)
• Telephone: call centers for query
information (e.g., stock quotes,
parcel tracking)
• Dictation products: continuous
recognition, speaker dependent/adaptive
But:
• Still <97% on “yes” for telephone
• Unexpected rate of speech causes doubling
or tripling of error rate
• Unexpected accent hurts badly
• Performance on unrestricted speech at 70%
(with good acoustics)
• Don’t know when we know
• Few advances in basic understanding
Why is ASR Hard?
• Natural speech is continuous
• Natural speech has disfluencies
• Natural speech is variable over:
global rate, local rate, pronunciation
within speaker, pronunciation across
speakers, phonemes in different
contexts
Why is ASR Hard? (continued)
• Large vocabularies are confusable
• Out-of-vocabulary words are inevitable
• Recorded speech is variable over: room acoustics, channel characteristics, background noise
• Large training times are not practical
• User expectations are for equal to or greater than “human performance”
ASR Dimensions
• Speaker dependent, independent
• Isolated, continuous, keywords
• Lexicon size and difficulty
• Task constraints, perplexity
• Adverse or easy conditions
• Natural or read speech
Telephone Speech
• Limited bandwidth (F vs S)
• Large speaker variability
• Large noise variability
• Channel distortion
• Different handset microphones
• Mobile and handsfree acoustics
Hot Research Problems
• Speech in noise
• Multilingual conversational speech (EARS)
• Portable (e.g., cellular) ASR
• Question answering
• Understanding meetings – or at least browsing them
Hot Research Approaches
• New (multiple) features and models
• New statistical dependencies
• Multiple time scales
• Multiple (larger) sound units
• Dynamic/robust pronunciation models
• Long-range language models
• Incorporating prosody
• Incorporating meaning
• Non-speech modalities
• Understanding confidence
Multi-frame analysis
• Incorporate multiple frames as
a single observation
• LDA the most common approach
• Neural networks
• Bayesian networks (graphical models,
including Buried Markov Models)
Linear Discriminant Analysis (LDA)
All variables for several frames are stacked into one vector x (x1, …, x5) and projected to a lower-dimensional y (y1, y2): y = Wx
The transformation W maximizes the ratio: between-class variance / within-class variance
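A sketch of Fisher's criterion, maximizing between-class over within-class variance via the eigenvectors of S_w⁻¹ S_b. The data here are synthetic (two Gaussian classes standing in for stacked multi-frame feature vectors), and the pseudo-inverse is a simplification of the regularized solvers used in practice.

```python
import numpy as np

def lda_directions(X, y, n_dims=2):
    """Fisher LDA: directions maximizing the ratio of between-class
    to within-class variance, via the top eigenvectors of pinv(Sw) @ Sb."""
    mean = X.mean(axis=0)
    Sw = np.zeros((X.shape[1], X.shape[1]))   # within-class scatter
    Sb = np.zeros_like(Sw)                    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(vals.real)[::-1]
    return vecs.real[:, order[:n_dims]]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 5)), rng.normal(3, 1, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
W = lda_directions(X, y)            # 5-dim input -> 2-dim output, as in the figure
print((X @ W).shape)                # (100, 2)
```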
Multi-layer perceptron
Buried Markov Models
Multi-stream analysis
• Multi-band systems
• Multiple temporal properties
• Multiple data-driven temporal filters
Multi-band analysis
Temporally distinct features
Combining streams
Another novel approach: Articulator dynamics
• Natural representation of context
• Production apparatus has mass, inertia
• Difficult to accurately model
• Can approximate with simple dynamics
Hidden Dynamic Models
“We hold these truths to be self-evident: that speech is produced by an underlying dynamic system, that it is endowed by its production system with certain inherent dynamic qualities, among these are compactness, continuity, and the pursuit of target values for each phone class, that to exploit these characteristics Hidden Dynamic Models are instituted among men. We … solemnly publish and declare, that these phone classes are and of right ought to be free and context-independent states … And for the support of this declaration, with a firm reliance on the acoustic theory of speech production, we mutually pledge our lives, our fortunes, and our sacred honor.”
John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop, with apologies to Thomas Jefferson ...
(See http://www.clsp.jhu.edu/ws98/projects/dynamic/)
Hidden Dynamic Models
Block diagram: target values → target switch (driven by the segmentation) → filter → neural network → speech pattern
Sources of Optimism
• Comparatively new research lines
• Many examples of improvements
• Moore’s Law → much more processing
• Points toward joint development of
front end and statistical components
Summary
• 2002 ASR based on 50+ years of research
• Core algorithms → mature systems, 10-30 yrs
• Deeply difficult, but tasks can be chosen
that are easier in SOME dimension
• Much more yet to do