T-61.184 Automatic Speech Recognition: From Theory to Practice

http://www.cis.hut.fi/Opinnot/T-61.184/
September 20, 2004

Prof. Bryan Pellom
Department of Computer Science
Center for Spoken Language Research
University of Colorado
[email protected]
Page 2: T-61.184 Automatic Speech Recognition: From Theory to Practice

Today

Introduction to Speech Production and Phonetics

Short Video on Spectrogram Reading

Quick Review of Probability and Statistics

Formulation of the Speech Recognition Problem

Talk about Homework #1

Page 3: T-61.184 Automatic Speech Recognition: From Theory to Practice

Speech Production and Phonetics

Peter Ladefoged, “A Course In Phonetics,” Harcourt Brace Jovanovich, ISBN 0-15-500173-6

Excellent introductory reference to this material

Page 4: T-61.184 Automatic Speech Recognition: From Theory to Practice

Speech Production Anatomy

Figure from Spoken Language Processing (Huang, Acero, Hon)

Page 5: T-61.184 Automatic Speech Recognition: From Theory to Practice

Speech Production Anatomy

Vocal Tract
Consists of the pharyngeal and oral cavities

Articulators
Components of the vocal tract which move to produce various speech sounds
Include: vocal folds, velum, lips, tongue, teeth

Page 6: T-61.184 Automatic Speech Recognition: From Theory to Practice

Source-Filter Representation of Speech Production

Production viewed as an acoustic filtering operation

Larynx and lungs provide input or source excitation

Vocal and nasal tracts act as a filter, shaping the spectrum of the resulting signal

Page 7: T-61.184 Automatic Speech Recognition: From Theory to Practice

Describing Sounds

The study of speech sounds and their production, classification and transcription is known as phonetics

A phoneme is an abstract unit that can be used for writing a language down in a systematic or unambiguous way

Sub-classifications of phonemes
Vowels: air passes freely through the resonators
Consonants: air is partially or totally obstructed in one or more places as it passes through the resonators

Page 8: T-61.184 Automatic Speech Recognition: From Theory to Practice

Time-Domain Waveform Example

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

“two plus seven is less than ten”

Page 9: T-61.184 Automatic Speech Recognition: From Theory to Practice

Wide-Band Spectrogram

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

“two plus seven is less than ten”

(Figure axes: frequency vs. time)

Page 10: T-61.184 Automatic Speech Recognition: From Theory to Practice

Phonetic Alphabets

Allow us to describe the primitive sounds that make up a language

Each language will have a unique set of phonemes

Useful for speech recognition since words can be represented by sequences of phonemes as described by a phonetic alphabet.

Page 11: T-61.184 Automatic Speech Recognition: From Theory to Practice

International Phonetic Alphabet (IPA)

Phonetic representation standard which describes sounds in most/all world languages

IPA last published in 1993 and updated in 1996

Issue: character set difficult to manipulate on a computer…

http://www2.arts.gla.ac.uk/IPA/ipa.html

Page 12: T-61.184 Automatic Speech Recognition: From Theory to Practice

IPA Consonants

Page 13: T-61.184 Automatic Speech Recognition: From Theory to Practice

IPA Vowels

Page 14: T-61.184 Automatic Speech Recognition: From Theory to Practice

American English Phonemes (IPA)

Table from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

Page 15: T-61.184 Automatic Speech Recognition: From Theory to Practice

Alternative Phonetic Alphabets

ARPAbet
English-only ASCII representation
Phoneme units represented by 1-2 letters
Similar representation used by CMU Sphinx-II recognizer, http://www.speech.cs.cmu.edu/cgi-bin/cmudict

SAMPA
Speech Assessment Methods Phonetic Alphabet
Computer-readable representation
Maps symbols of the IPA into ASCII codes

Page 16: T-61.184 Automatic Speech Recognition: From Theory to Practice

CMU Sphinx-II Phonetic Symbols

Phone  Example    Phone  Example    Phone  Example
AA     odd        EY     ate        P      pee
AE     at         F      fee        PD     lip
AH     hut        G      green      R      read
AO     ought      GD     bag        S      sea
AW     cow        HH     he         SH     she
AX     abide      IH     it         T      tea
AXR    user       IX     acid       TD     lit
AY     hide       IY     eat        TH     theta
B      be         JH     gee        TS     bits
BD     dub        K      key        UH     hood
CH     cheese     KD     lick       UW     two
D      dee        L      lee        V      vee
DD     dud        M      me         W      we
DH     thee       N      note       Y      yield
DX     matter     NG     ping       Z      zee
EH     ed         OW     oat        ZH     seizure
ER     hurt       OY     toy        SIL    (silence)

Page 17: T-61.184 Automatic Speech Recognition: From Theory to Practice

Example Words and Corresponding CMU Dictionary Transcriptions

basement B EY S M AX N TD

Bryan B R AY AX N

perfect P AXR F EH KD TD

speech S P IY CH

recognize R EH K AX G N AY Z
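
The transcriptions above can be stored as a pronunciation lexicon that maps each word to its phone sequence. The following is a minimal sketch, not the CMU dictionary format or the Sphinx-II API: the names `LEXICON` and `transcribe` are hypothetical, and the entries are copied from this slide only.

```python
# Hypothetical mini-lexicon; the five entries are the ones listed on the slide.
LEXICON = {
    "basement": ["B", "EY", "S", "M", "AX", "N", "TD"],
    "bryan": ["B", "R", "AY", "AX", "N"],
    "perfect": ["P", "AXR", "F", "EH", "KD", "TD"],
    "speech": ["S", "P", "IY", "CH"],
    "recognize": ["R", "EH", "K", "AX", "G", "N", "AY", "Z"],
}


def transcribe(words):
    """Concatenate the phone sequences of the given words (case-insensitive)."""
    phones = []
    for word in words:
        phones.extend(LEXICON[word.lower()])
    return phones


print(" ".join(transcribe(["recognize", "speech"])))
```

A word missing from the lexicon raises `KeyError` here; a real system would need some fallback for out-of-vocabulary words.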

Page 18: T-61.184 Automatic Speech Recognition: From Theory to Practice

Classifications of Speech Sounds

Voiced vs. voiceless
Voiced if the vocal cords vibrate

Nasal vs. oral
Nasal if air travels through the nasal cavity and the oral cavity is closed

Consonant vs. vowel
Consonants: obstruction in the air stream above the glottis. The glottis is defined as the space between the vocal cords.

Lateral vs. non-lateral
Non-lateral if the air stream passes through the middle of the oral cavity (rather than along the sides of the oral cavity)

Page 19: T-61.184 Automatic Speech Recognition: From Theory to Practice

Consonants and Vowels

Consonants are characterized by:
Place of articulation
Manner of articulation
Voicing

Vowels are characterized by:
Lip position
Tongue height
Tongue advancement

Page 20: T-61.184 Automatic Speech Recognition: From Theory to Practice

Places of Articulation

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

Page 21: T-61.184 Automatic Speech Recognition: From Theory to Practice

Places of Articulation

Bilabial: made with the two lips {P,B,M}
Labio-dental: lower lip and upper front teeth {F,V}
Dental: tongue tip or blade and upper front teeth {TH,DH}
Alveolar: tongue tip or blade and alveolar ridge {T,D,N,…}
Retroflex: tongue tip and back of the alveolar ridge {R}
Palato-alveolar: tongue blade and back of the alveolar ridge {SH}
Palatal: front of the tongue and hard palate {Y,ZH}
Velar: back of the tongue and soft palate {K,G,NG}

Page 22: T-61.184 Automatic Speech Recognition: From Theory to Practice

Manners of Articulation

Stop: complete obstruction with sudden (explosive) release

Nasal: airflow stopped in the oral cavity, soft palate down, airflow is through the nasal tract

Fricative: articulators close together, turbulent airflow produced

Page 23: T-61.184 Automatic Speech Recognition: From Theory to Practice

Manners of Articulation

Retroflex (Liquid): tip of the tongue is curled back slightly (/r/)

Lateral (Liquid): obstruction of the air stream at a point along the center of the oral tract, with incomplete closure between one or both sides of the tongue and the roof of the mouth (/l/)

Glide: vowel-like, but initial position within a syllable (/y/, /w/)

Page 24: T-61.184 Automatic Speech Recognition: From Theory to Practice

American English Consonants by Place and Manner of Articulation

(Table axes: Place, Manner)

Page 25: T-61.184 Automatic Speech Recognition: From Theory to Practice

American English Unvoiced Fricatives

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

Page 26: T-61.184 Automatic Speech Recognition: From Theory to Practice

Voiced vs. Unvoiced Fricatives

Voiced fricatives tend to be shorter than unvoiced fricatives

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

Page 27: T-61.184 Automatic Speech Recognition: From Theory to Practice

American English Unvoiced Stops

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

Page 28: T-61.184 Automatic Speech Recognition: From Theory to Practice

Voiced vs. Unvoiced Stops

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

Page 29: T-61.184 Automatic Speech Recognition: From Theory to Practice

Stop Consonant Formant Transitions

Figure from http://hitchcock.dlt.asu.edu/media5/a_spanias/speech-recognition/real-lectures/PDF/LECT04-5.PDF

Page 30: T-61.184 Automatic Speech Recognition: From Theory to Practice

Nasal vs. Oral Articulation

velum

/M/,/N/,/NG/

Page 31: T-61.184 Automatic Speech Recognition: From Theory to Practice

Spectrogram of Nasals

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

Page 32: T-61.184 Automatic Speech Recognition: From Theory to Practice

American English Semivowels

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

Page 33: T-61.184 Automatic Speech Recognition: From Theory to Practice

Semivowels

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

Page 34: T-61.184 Automatic Speech Recognition: From Theory to Practice

Affricates

Figure from MIT Course Notes: 6.345 Automatic Speech Recognition, Spring 2003

Page 35: T-61.184 Automatic Speech Recognition: From Theory to Practice

Describing Vowels

Velum position: nasal vs. non-nasal

Lip shape: rounded vs. unrounded

Tongue height: high, mid, low
Correlated to first formant position

Tongue advancement: front, central, back
Correlated to second formant position

Page 36: T-61.184 Automatic Speech Recognition: From Theory to Practice

American English Vowels

IY “heed”    IH “hid”      EH “head”    AE “had”
AA “hod”     AO “hawed”    UH “hood”    UW “who'd”

Page 37: T-61.184 Automatic Speech Recognition: From Theory to Practice

Vowel Chart

(Figure: vowels plotted by tongue height (high, mid, low) against tongue advancement (front, central, back). Phones shown: IY, IH, EH, AE, AX, AA, AO, UH, UW, AY, EY, OY, OW, AW. Diphthongs shown in green.)

Page 38: T-61.184 Automatic Speech Recognition: From Theory to Practice

The Vowel Space

1. IY
2. IH
3. EH
4. AE
5. UH
6. AA
7. AO
8. UW
9. (non-English “u”)
10. AX

(Figure: the ten numbered vowels plotted with First Formant Frequency, F1 (Hz) on one axis and Second Formant Frequency, F2 (Hz) on the other)

Page 39: T-61.184 Automatic Speech Recognition: From Theory to Practice

Coarticulation

http://hctv.humnet.ucla.edu/departments/linguistics/VowelsandConsonants

Notice position of 2nd formant onset for these words: “fie”, “thigh”, “sigh”, “shy”

F AY    TH AY    S AY    SH AY

Page 40: T-61.184 Automatic Speech Recognition: From Theory to Practice

Sound Classification Summary

Vowels Semivowels Consonants

Nasals

Stops Fricatives

Affricates

Page 41: T-61.184 Automatic Speech Recognition: From Theory to Practice

Spectrogram Reading Video

“Speech as Eyes See It” (12 minute video)

1977-1978 video by Ron Cole and Victor Zue

After 2000-3000 hours of training: phonemes and words can be transcribed from a spectrogram alone

80-90% agreement on segments

Provided insight into the speech recognition problem during the 1970’s

Page 42: T-61.184 Automatic Speech Recognition: From Theory to Practice

Review: Probability & Statistics for Speech Recognition

Page 43: T-61.184 Automatic Speech Recognition: From Theory to Practice

Relative-Frequency and Probability

Relative frequency of “A”:
Experiment is performed N times
With 4 possible outcomes: A, B, C, D
N_A is the number of times event A occurs:

$$N_A + N_B + N_C + N_D = N, \qquad r(A) = \frac{N_A}{N}$$

P(A) defined as “the probability of event A”
Relative frequency with which event A would occur if an experiment was repeated many times

Page 44: T-61.184 Automatic Speech Recognition: From Theory to Practice

Probability

Example (cont'd)

$$N_A + N_B + N_C + N_D = N$$

$$r(A) + r(B) + r(C) + r(D) = 1$$

$$P(A) = \lim_{N \to \infty} r(A)$$

$$P(A) + P(B) + P(C) + P(D) = 1$$
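
As an illustration of the limit P(A) = lim r(A), the sketch below simulates an experiment with four equally likely outcomes and reports the relative frequencies. The equal-probability assumption, the function name `relative_frequencies`, and the fixed seed are choices made for this example, not part of the lecture.

```python
import random


def relative_frequencies(n_trials, outcomes=("A", "B", "C", "D"), seed=0):
    """Run the experiment n_trials times and return r(X) = N_X / N per outcome."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    counts = {o: 0 for o in outcomes}
    for _ in range(n_trials):
        counts[rng.choice(outcomes)] += 1
    return {o: counts[o] / n_trials for o in outcomes}


for n in (10, 1000, 100000):
    r = relative_frequencies(n)
    # r(A) drifts toward 0.25 as N grows; the frequencies always sum to 1
    print(n, round(r["A"], 4), round(sum(r.values()), 4))
```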

Page 45: T-61.184 Automatic Speech Recognition: From Theory to Practice

Mutually Exclusive Events

For mutually exclusive (and collectively exhaustive) events A, B, C, …, M:

$$0 \le P(A) \le 1$$

$$P(A) + P(B) + P(C) + \cdots + P(M) = 1$$

P(A) = 0 represents an impossible event
P(A) = 1 represents a certain event

Page 46: T-61.184 Automatic Speech Recognition: From Theory to Practice

Joint and Conditional Probability

P(AB) is the joint probability of event A and B both occurring

$$P(AB) = \lim_{N \to \infty} \frac{N_{AB}}{N}$$

P(A|B) is the conditional probability of event A given that event B has occurred:

$$P(A|B) = \frac{P(AB)}{P(B)} = \lim_{N \to \infty} \frac{N_{AB}/N}{N_B/N}$$
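
The count-based definitions can be tried on a small table of trials. This is only a sketch with made-up data: each trial records whether A and B occurred, and `joint_and_conditional` is a hypothetical helper that uses finite counts in place of the limit.

```python
def joint_and_conditional(trials):
    """trials: list of (a_occurred, b_occurred) pairs.

    Returns (P(AB), P(A|B)) estimated from counts N_AB, N_B and N.
    """
    n = len(trials)
    n_b = sum(1 for a, b in trials if b)
    n_ab = sum(1 for a, b in trials if a and b)
    p_ab = n_ab / n
    p_a_given_b = (n_ab / n) / (n_b / n)  # same as N_AB / N_B
    return p_ab, p_a_given_b


# Made-up data: 10 trials, B occurs in 4 of them, A and B together in 2.
trials = ([(True, True)] * 2 + [(False, True)] * 2
          + [(True, False)] * 1 + [(False, False)] * 5)
print(joint_and_conditional(trials))
```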

Page 47: T-61.184 Automatic Speech Recognition: From Theory to Practice

Marginal Probability

$$P(B) = \sum_{k=1}^{n} P(A_k B) = \sum_{k=1}^{n} P(B|A_k)\,P(A_k)$$

(Figure: event B overlapping a partition of the sample space into A_1, A_2, A_3, A_4, A_5)

Probability of an event occurring across all conditions
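
A short numeric sketch of the sum over conditions, assuming the events A_k partition the sample space so that their priors sum to 1. The function name `marginal` and the numbers are illustrative only.

```python
def marginal(likelihoods, priors):
    """P(B) = sum_k P(B|A_k) * P(A_k), with the A_k forming a partition."""
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors P(A_k) must sum to 1")
    return sum(l * p for l, p in zip(likelihoods, priors))


# Illustrative numbers: three conditions A_1..A_3.
priors = [0.5, 0.3, 0.2]        # P(A_k)
likelihoods = [0.1, 0.4, 0.8]   # P(B|A_k)
print(marginal(likelihoods, priors))  # 0.05 + 0.12 + 0.16
```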

Page 48: T-61.184 Automatic Speech Recognition: From Theory to Practice

Bayes’ Theorem

$$P(AB) = P(B|A)\,P(A)$$

$$P(A_i|B) = \frac{P(A_i B)}{P(B)} = \frac{P(B|A_i)\,P(A_i)}{P(B)}$$

$$P(B) = \sum_{k=1}^{n} P(A_k B) = \sum_{k=1}^{n} P(B|A_k)\,P(A_k)$$

Page 49: T-61.184 Automatic Speech Recognition: From Theory to Practice

Bayes’ Rule

Bayes’ Rule allows us to update prior beliefs with new evidence,

$$P(W_i|O) = \frac{P(O|W_i)\,P(W_i)}{\sum_j P(O|W_j)\,P(W_j)}$$

P(W_i|O): Posterior Probability
P(O|W_i): Likelihood
P(W_i): Prior Probability
Denominator: Evidence, P(O)
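
To make the roles of prior, likelihood and evidence concrete, here is a small sketch that normalizes prior times likelihood over a set of hypotheses. The hypothesis labels and probabilities are invented for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(priors, likelihoods):
    """Return P(W_i|O) for every hypothesis W_i.

    priors:      dict W_i -> P(W_i)
    likelihoods: dict W_i -> P(O|W_i)
    """
    joint = {w: likelihoods[w] * priors[w] for w in priors}
    evidence = sum(joint.values())  # P(O) = sum_j P(O|W_j) P(W_j)
    return {w: joint[w] / evidence for w in joint}


# Invented two-hypothesis example.
priors = {"yes": 0.6, "no": 0.4}
likelihoods = {"yes": 0.2, "no": 0.5}
print(posterior(priors, likelihoods))
```

In this example the evidence reverses the prior preference: "no" starts out less probable but ends up with posterior 0.625, because it explains the observation better.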

Page 50: T-61.184 Automatic Speech Recognition: From Theory to Practice

Statistically Independent Events

Occurrence of event A does not influence occurrence of event B:

$$P(AB) = P(A)\,P(B)$$

$$P(A|B) = P(A)$$

$$P(B|A) = P(B)$$

Page 51: T-61.184 Automatic Speech Recognition: From Theory to Practice

Random Variables

Used to describe events in which the number of possible outcomes is infinite

Values of the outcomes can not be predicted with certainty

Distribution of outcome values is known

Page 52: T-61.184 Automatic Speech Recognition: From Theory to Practice

Probability Distribution Functions

The probability of the event that the random variable X is less than or equal to the allowed value x:

$$F_X(x) = P(X \le x)$$

(Figure: F_X(x) plotted against x, rising from 0 to 1)

Page 53: T-61.184 Automatic Speech Recognition: From Theory to Practice

Probability Distribution Functions

Properties:

$$0 \le F_X(x) \le 1, \qquad -\infty < x < \infty$$

$$F_X(-\infty) = 0 \ \text{and} \ F_X(\infty) = 1$$

$$F_X(x) \ \text{is nondecreasing as } x \text{ increases}$$

$$P(x_1 < X \le x_2) = F_X(x_2) - F_X(x_1)$$

Page 54: T-61.184 Automatic Speech Recognition: From Theory to Practice

Probability Density Functions (PDF)

Derivative of probability distribution function,

$$f_X(x) = \frac{dF_X(x)}{dx}$$

Interpretation:

$$f_X(x)\,dx = P(x < X \le x + dx)$$

Page 55: T-61.184 Automatic Speech Recognition: From Theory to Practice

Probability Density Functions (PDF)

Properties

$$f_X(x) \ge 0, \qquad -\infty < x < \infty$$

$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1$$

$$F_X(x) = \int_{-\infty}^{x} f_X(u)\,du$$

$$\int_{x_1}^{x_2} f_X(x)\,dx = P(x_1 < X \le x_2)$$
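
The last property can be checked numerically. The sketch below assumes a standard normal density (a choice made only for this example), integrates it with a simple trapezoid rule, and compares the result with the difference of the distribution function computed from `math.erf`.

```python
import math


def pdf(x):
    """Standard normal density f_X(x) (assumed distribution for this example)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)


def cdf(x):
    """Standard normal distribution function F_X(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def integrate(f, a, b, steps=10000):
    """Trapezoid-rule approximation of the integral of f from a to b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, steps))
    return total * h


x1, x2 = -1.0, 1.0
print(integrate(pdf, x1, x2))   # area under the density between x1 and x2
print(cdf(x2) - cdf(x1))        # F_X(x2) - F_X(x1)
```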

Page 56: T-61.184 Automatic Speech Recognition: From Theory to Practice

Mean and Variance

Expectation or Mean of a random variable X

$$\mu = E(X) = \int_{-\infty}^{\infty} x\,f_X(x)\,dx$$

Variance of a random variable X

$$\sigma^2 = Var(X) = E[(X - \mu)^2]$$

$$\sigma^2 = Var(X) = E(X^2) - [E(X)]^2$$

Page 57: T-61.184 Automatic Speech Recognition: From Theory to Practice

Mean and Variance

Variance properties (if X and Y are independent)

$$Var(X + Y) = Var(X) + Var(Y)$$

$$Var(aX) = a^2\,Var(X)$$

$$Var(a_1 X_1 + \cdots + a_n X_n + b) = a_1^2\,Var(X_1) + \cdots + a_n^2\,Var(X_n)$$

Page 58: T-61.184 Automatic Speech Recognition: From Theory to Practice

Covariance and Correlation

Covariance of X and Y:

$$Cov(X, Y) = E[(X - \mu_x)(Y - \mu_y)]$$

$$Cov(X, Y) = Cov(Y, X)$$

Correlation of X and Y:

$$\rho_{XY} = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}, \qquad -1 \le \rho_{XY} \le 1$$
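
A sample-based sketch of these two quantities, replacing the expectations with averages over paired observations (the 1/N estimator is an assumption of this example). The function names and data are illustrative.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def covariance(xs, ys):
    """Average of (x - mean_x)(y - mean_y) over paired samples."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def correlation(xs, ys):
    """Cov(X, Y) / (sigma_X * sigma_Y); always lies in [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # exactly 2 * xs, so the correlation is 1
print(covariance(xs, ys), correlation(xs, ys))
```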

Page 59: T-61.184 Automatic Speech Recognition: From Theory to Practice

Gaussian (Normal) Distribution

$$f(X = x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$

$$\mu = E[x] = \frac{1}{N} \sum_{n=1}^{N} x_n$$

$$\sigma^2 = E(X^2) - [E(X)]^2 = \frac{1}{N} \sum_{n=1}^{N} x_n^2 - \left[\frac{1}{N} \sum_{n=1}^{N} x_n\right]^2$$
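
The sample formulas for the mean and variance, together with the density itself, translate directly into a few lines of code. This is a minimal sketch; the function names and the data are made up for the example.

```python
import math


def estimate_mean_variance(xs):
    """mu = (1/N) sum x_n;  sigma^2 = (1/N) sum x_n^2 - mu^2."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum(x * x for x in xs) / n - mu * mu
    return mu, var


def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


samples = [1.0, 2.0, 3.0, 4.0, 5.0]
mu, var = estimate_mean_variance(samples)
print(mu, var)                   # mean 3, variance 55/5 - 9 = 2
print(gaussian_pdf(mu, mu, var)) # density at the mean
```

The E(X^2) - [E(X)]^2 form needs only running sums of x and x^2, but it can lose precision when the mean is large relative to the spread.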

Page 60: T-61.184 Automatic Speech Recognition: From Theory to Practice

Gaussian (Normal) Distribution

(Figure: f(X = x | µ, σ²) plotted against x from -3 to 3, with µ = 0 and σ² = 1)

Page 61: T-61.184 Automatic Speech Recognition: From Theory to Practice

Multivariate Distributions

Characterize more than one random variable at a time

Example: Features for speech recognition
In speech recognition we typically compute 39-dimensional feature vectors (more about that later)
100 feature vectors per second of audio

Often want to compute the likelihood of the observed features given a known (estimated) distribution which is being used to model some part of a phoneme (more about that later!).

Page 62: T-61.184 Automatic Speech Recognition: From Theory to Practice

Multivariate Gaussian Distribution

$$f(X = x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)$$

x: Observed vector of random variables (features)
µ: Distribution mean vector
Σ: Distribution covariance matrix
|Σ|: Determinant of covariance matrix

Page 63: T-61.184 Automatic Speech Recognition: From Theory to Practice

Diagonal Covariance Assumption

Most speech recognition systems assume diagonal covariance matrices

Data sparseness issue:

$$\Sigma = \begin{bmatrix} \sigma_{11}^2 & 0 & 0 & 0 \\ 0 & \sigma_{22}^2 & 0 & 0 \\ 0 & 0 & \sigma_{33}^2 & 0 \\ 0 & 0 & 0 & \sigma_{44}^2 \end{bmatrix}$$

$$|\Sigma| = \prod_{n=1}^{d} \sigma_{nn}^2$$

Page 64: T-61.184 Automatic Speech Recognition: From Theory to Practice

Diagonal Covariance Assumption

Inverting a diagonal matrix involves simply inverting the elements along the diagonal:

$$\Sigma^{-1} = \begin{bmatrix} \frac{1}{\sigma_{11}^2} & 0 & 0 & 0 \\ 0 & \frac{1}{\sigma_{22}^2} & 0 & 0 \\ 0 & 0 & \frac{1}{\sigma_{33}^2} & 0 \\ 0 & 0 & 0 & \frac{1}{\sigma_{44}^2} \end{bmatrix}$$

Page 65: T-61.184 Automatic Speech Recognition: From Theory to Practice

Multivariate Gaussians with Diagonal Covariance Matrices

Assuming a diagonal covariance matrix,

$$f(X = x \mid \mu, \Sigma) = C \cdot \exp\left(-\frac{1}{2} \sum_{d=1}^{n} \frac{(x[d] - \mu[d])^2}{\sigma^2[d]}\right)$$

$$C = \frac{1}{(2\pi)^{n/2} \left(\prod_{d=1}^{n} \sigma^2[d]\right)^{1/2}}, \qquad \sigma^2[d] = \sigma_{dd}^2$$
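
With a diagonal covariance the density needs no matrix inverse or determinant, only per-dimension variances. The sketch below evaluates it in the log domain, which is a common practical choice (and an assumption of this example, not something stated on the slide) to avoid underflow with 39-dimensional vectors. `diag_gaussian_logpdf` is a hypothetical name.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """log f(x | mean, diag(var)) for equal-length sequences x, mean, var."""
    n = len(x)
    # log C = -0.5 * (n * log(2*pi) + sum_d log sigma^2[d])
    log_c = -0.5 * (n * math.log(2.0 * math.pi) + sum(math.log(v) for v in var))
    # sum_d (x[d] - mu[d])^2 / sigma^2[d]
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return log_c - 0.5 * quad


x = [0.5, -1.0, 2.0]
mean = [0.0, 0.0, 1.0]
var = [1.0, 4.0, 0.25]
print(diag_gaussian_logpdf(x, mean, var))
```

Because the covariance is diagonal, this value equals the sum of the three one-dimensional Gaussian log densities.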

Page 66: T-61.184 Automatic Speech Recognition: From Theory to Practice

Multivariate Mixture Gaussians

Distribution is governed by several Gaussian density functions,
Sum of Gaussians (w_m = mixture weight)

$$f(x) = \sum_{m=1}^{M} w_m\,N(x; \mu_m, \Sigma_m)$$

$$f(x) = \sum_{m=1}^{M} \frac{w_m}{(2\pi)^{n/2}\,|\Sigma_m|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_m)^T \Sigma_m^{-1} (x - \mu_m)\right)$$
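
A sketch of the mixture density, restricted to diagonal covariances so it stays short and dependency-free (full covariance matrices would need a determinant and inverse). All names and parameter values here are invented for illustration.

```python
import math


def diag_gaussian_pdf(x, mean, var):
    """Gaussian density with diagonal covariance; var holds the variances."""
    norm = math.prod(2.0 * math.pi * v for v in var) ** -0.5
    quad = sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var))
    return norm * math.exp(-0.5 * quad)


def gmm_pdf(x, weights, means, variances):
    """f(x) = sum_m w_m * N(x; mu_m, Sigma_m), with the weights summing to 1."""
    return sum(w * diag_gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))


# Invented 1-D mixture with three components.
weights = [0.5, 0.3, 0.2]
means = [[-4.0], [0.0], [3.0]]
variances = [[1.0], [0.5], [2.0]]
for point in (-4.0, 0.0, 3.0):
    print(point, gmm_pdf([point], weights, means, variances))
```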

Page 67: T-61.184 Automatic Speech Recognition: From Theory to Practice

Multiple Mixtures (1-D case)

Example: 3 mixtures used to model underlying random process of 3 Gaussians

(Figure: f(x) plotted against x, with x ranging from -7.5 to 6.5)

Page 68: T-61.184 Automatic Speech Recognition: From Theory to Practice

Speech Recognition Problem Formulation

Page 69: T-61.184 Automatic Speech Recognition: From Theory to Practice

Problem Description

Given a sequence of observations (evidence) from an audio signal,

$$O = o_1 o_2 \cdots o_T$$

Determine the underlying word sequence,

$$W = w_1 w_2 \cdots w_m$$

Number of words (m) unknown, observation sequence is variable length (T)

Page 70: T-61.184 Automatic Speech Recognition: From Theory to Practice

Problem Formulation

Goal: Minimize the classification error rate

Solution: Maximize the Posterior Probability

$$\hat{W} = \arg\max_{W} P(W|O)$$

Solution requires optimization over all possible word strings!

Page 71: T-61.184 Automatic Speech Recognition: From Theory to Practice

Problem Formulation

Using Bayes Rule,

$$P(W|O) = \frac{P(O|W)\,P(W)}{P(O)}$$

Since P(O) does not impact optimization,

$$\hat{W} = \arg\max_{W} P(W|O) = \arg\max_{W} P(O|W)\,P(W)$$
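
For a tiny, explicitly enumerated set of candidate word strings, the arg max can be computed directly. The sketch below works with log probabilities; the candidate strings and all scores are invented, and `decode` is a hypothetical name. A real recognizer cannot enumerate candidates this way.

```python
import math


def decode(candidates):
    """Return the W maximizing log P(O|W) + log P(W).

    candidates: dict W -> (log P(O|W), log P(W)); P(O) is dropped because it
    is the same for every W.
    """
    return max(candidates, key=lambda w: candidates[w][0] + candidates[w][1])


# Invented scores for two competing hypotheses.
candidates = {
    "recognize speech": (math.log(0.020), math.log(0.010)),
    "wreck a nice beach": (math.log(0.025), math.log(0.001)),
}
print(decode(candidates))
```

Here the second hypothesis has the higher acoustic likelihood, but the language model prior P(W) tips the decision toward the first.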

Page 72: T-61.184 Automatic Speech Recognition: From Theory to Practice

Problem Formulation

Let’s assume words can be represented by a sequence of states, S,

$$\hat{W} = \arg\max_{W} P(O|W)\,P(W) = \arg\max_{W} \sum_{S} P(O|S)\,P(S|W)\,P(W)$$

Words → Phonemes → States
States represent smaller pieces of phonemes

Page 73: T-61.184 Automatic Speech Recognition: From Theory to Practice

Problem Formulation

Optimize:

$$\hat{W} = \arg\max_{W} \sum_{S} P(O|S)\,P(S|W)\,P(W)$$

Practical Realization,

O: Observation (feature) sequence
P(O|S): Acoustic Model
P(S|W): Lexicon / Pronunciation Model
P(W): Language Model

Page 74: T-61.184 Automatic Speech Recognition: From Theory to Practice

Problem Formulation

Optimization desires most likely word sequence given observations (evidence)

Cannot evaluate all possible word / state sequences (too many possibilities!)

We need:
To define a representation for modeling states (HMMs…)
A means for “approximately” searching for the best word / state sequence given the evidence (Viterbi Algorithm)
And a few other tricks up our sleeves to make it FASTER!

Page 75: T-61.184 Automatic Speech Recognition: From Theory to Practice

Hidden Markov Models (HMMs)

Observation vectors are assumed to be “generated” by a Markov Model

HMM: A finite-state machine in which, each time t that a state j is entered, an observation is emitted with probability density b_j(o_t)

Transition from state i to state j modeled with probability a_ij

(Figure: three-state left-to-right HMM with states S0, S1, S2, self-transitions a_00, a_11, a_22 and forward transitions a_01, a_12, drawn above the observation sequence o_1, o_2, o_3, o_4, …, o_t, o_{t+1}, o_{t+2}, …, o_T)

Page 76: T-61.184 Automatic Speech Recognition: From Theory to Practice

“Beads-on-a-String” HMM Representation

(Figure: the word SPEECH represented as a chain of phoneme models S, P, IY, CH, drawn above the observations o_1, o_2, …, o_7)

Page 77: T-61.184 Automatic Speech Recognition: From Theory to Practice

Modeled Probability Distribution

(Figure: three-state HMM with states S0, S1, S2 and transitions a_00, a_01, a_11, a_12, a_22; each state has its own output distribution b_0(o_t), b_1(o_t), b_2(o_t))

Page 78: T-61.184 Automatic Speech Recognition: From Theory to Practice

HMMs with Mixtures of Gaussians

Multivariate Mixture-Gaussian distribution,

$$b_j(o_t) = \sum_{m=1}^{M} \frac{w_{jm}}{(2\pi)^{n/2}\,|\Sigma_{jm}|^{1/2}} \exp\left(-\frac{1}{2}(o_t - \mu_{jm})^T \Sigma_{jm}^{-1} (o_t - \mu_{jm})\right)$$

Model parameters are (1) means, (2) variances, and (3) mixture weights

Sum of Gaussians can model complex distributions

Page 79: T-61.184 Automatic Speech Recognition: From Theory to Practice

Hidden Markov Model Based ASR

Observation sequence assumed to be known

Probability of a particular state sequence can be computed

Underlying state sequence is unknown, “hidden”

(Figure: three-state HMM with states S0, S1, S2 and transitions a_00, a_01, a_11, a_12, a_22, drawn above the observation sequence o_1 … o_T)
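
One way to recover a likely hidden state sequence is the Viterbi algorithm named earlier in the lecture. The sketch below is a simplified illustration, not the course's recognizer: it assumes a three-state left-to-right HMM, a single 1-D Gaussian per state instead of a mixture over feature vectors, and a path that always starts in state 0. All parameter values are invented.

```python
import math

NEG_INF = float("-inf")


def safe_log(p):
    return math.log(p) if p > 0.0 else NEG_INF


def log_gauss(o, mu, var):
    """log b_j(o) for a single 1-D Gaussian output density."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (o - mu) ** 2 / var)


def viterbi(obs, log_a, means, variances):
    """Return (best state path, its log score); the path starts in state 0."""
    n_states = len(means)
    # delta[j]: best log score of any partial path ending in state j
    delta = [NEG_INF] * n_states
    delta[0] = log_gauss(obs[0], means[0], variances[0])
    backptrs = []
    for o in obs[1:]:
        new_delta, ptr = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] + log_a[i][j])
            ptr.append(best_i)
            new_delta.append(delta[best_i] + log_a[best_i][j]
                             + log_gauss(o, means[j], variances[j]))
        delta = new_delta
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda j: delta[j])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, max(delta)


# Invented parameters: a_ij (rows: from-state), state means and variances.
a = [[0.6, 0.4, 0.0],
     [0.0, 0.7, 0.3],
     [0.0, 0.0, 1.0]]
log_a = [[safe_log(p) for p in row] for row in a]
means, variances = [0.0, 5.0, 10.0], [1.0, 1.0, 1.0]
obs = [0.1, -0.2, 4.8, 5.3, 9.9, 10.2]
path, score = viterbi(obs, log_a, means, variances)
print(path)
```

Working with log probabilities turns the products of a_ij and b_j(o_t) into sums and keeps long observation sequences from underflowing.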

Page 80: T-61.184 Automatic Speech Recognition: From Theory to Practice

Components of a Speech Recognizer

(Block diagram: Speech → Feature Extraction → Optimization / Search → Text + Timing + Confidence; the Optimization / Search block draws on the Acoustic Model, Lexicon and Language Model)

Page 81: T-61.184 Automatic Speech Recognition: From Theory to Practice

Topics for Next Time

An introduction to hearing

Speech Detection

Frame-based Speech Analysis

Feature Extraction Methods for Speech Recognition

