
Speech Recognition
Lecture 1: Introduction

Eugene Weinstein
Google, NYU Courant Institute
[email protected]

Slide Credit: Mehryar Mohri


Logistics

Prerequisites: basics of analysis of algorithms and probability. No specific knowledge of signal processing is required.

Textbooks: there is no single textbook covering all the material presented in this course. Three suggested textbooks are listed on the website (available on reserve). Lecture slides are available electronically.

Workload: 3-4 homework assignments, 1 project (your choice). Comfort working with open-source software in a shell-based environment is expected.


Logistics

Assignment 0 out today, due September 19th.

Homeworks must be submitted before lecture time on the due date. Late submissions will incur a score deduction based on the degree of lateness.

Electronic (preferred) and paper submissions accepted. Electronic submissions must be made to both the instructor and the grader.

Grader: Philip Gross, [email protected].

Office Hours: Thursday 7-8 PM, WWH 328.


Objectives

Computer science view of automatic speech recognition (ASR) (no signal processing focus).

Essential algorithms for large-vocabulary speech recognition.

Emphasis on general algorithms:

• automata and transducer algorithms.

• acoustic, language, and pronunciation modeling.

• statistical learning algorithms.


Topics

• introduction, formulation, components, features.

• weighted automata algorithms.

• statistical language modeling.

• acoustic models.

• pronunciation models, decision trees, context-dependent models.


Topics

• search algorithms, transducer optimizations, Viterbi decoder.

• N-best algorithms, lattice generation, rescoring.

• adaptation.

• practical applications.


This Lecture

Speech recognition problem

Acoustic features

Statistical formulation

Speech Recognition Problem

Definition: find accurate written transcription of human speech.

• transcriptions may be in words, phonemes, syllables, or other units.

Accuracy: typically measured by the error rate (computed via edit-distance) between the reference transcription and the sequence output by the model.
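To make the error-rate computation concrete, here is a minimal sketch of word-level edit distance in Python (the function name and example strings are illustrative, not from the slides):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit (Levenshtein) distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + sub)      # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat down"))  # 1 insertion / 3 words ≈ 0.33
```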


ASR Characteristics

Vocabulary size: small (digit recognition, 10), medium (Resource Management, 1,000), large (Broadcast News, 100,000), very large (voice search, 1M+).

Speaker-dependent or speaker-independent.

Domain-specific or unconstrained, e.g., travel reservation systems vs. modern spoken-dialog systems.

Isolated (pause between units) or continuous.

Read or spontaneous, e.g., dictation and news broadcasts vs. conversational speech.


Other Related Problems

Speaker verification.

Speaker identification.

Spoken-dialog systems.

Detection of voice features, e.g., gender, age, dialect, emotion, height, weight!

Speech synthesis.


Speech Recognition Is Difficult

Highly variable: the same words pronounced by the same person under the same conditions typically lead to different waveforms.

• source variation: speaking rate, volume, accent, dialect, pitch, coarticulation.

• channel variation: microphone (type, position), noise (background, distortion).

Key problem: robustness to such variations.


Example - YouTube Transcription


Example: Voice Search


Unconstrained Spoken-Dialog Systems


History

1922: Radio Rex, a toy single-word recognizer ("Rex").

1939: voder and vocoder (mechanical synthesizer), Dudley (Bell Labs).

1952: isolated digit recognition, single speaker (Bell Labs).

1950s: 10 syllables of a single speaker, Olson and Belar (RCA Labs).

1950s: speaker-independent 10-vowel recognizer (MIT Lincoln Labs).

See (Juang and Rabiner, 2005).


History

1960s: Linear Predictive Coding (LPC), Atal and Itakura.

1969: John Pierce’s negative comments about ASR (Bell Labs).

1970s: Advanced Research Projects Agency (ARPA) funds a speech understanding program. CMU's Harpy system, based on automata, achieved reasonable accuracy on a 1,000-word task.


History

1980s: n-gram models. ARPA Resource Management, Wall Street Journal, and ATIS tasks. Delta and delta-delta cepstra, mel cepstra.

mid-1980s: hidden Markov models (HMMs) become the preferred technique for speech recognition.

1990s: discriminative training, vocal tract normalization, speaker adaptation. Very large-vocabulary speech recognition, e.g., a 1M-name recognizer (Bell Labs) and a 500,000-word North American Business News (NAB) recognizer.

History

mid-1990s: FSM library; weighted transducers become a major component of almost all modern speech recognition and understanding systems. Dictation systems: Dragon, IBM speaker-dependent system.

2000s: Broadcast News; conversational speech, e.g., Switchboard and CallHome; real-time large-vocabulary systems; unconstrained spoken-dialog systems, e.g., HMIHY.

2009-present: Google voice search (1M+ vocabulary) in 46 languages; neural networks.

This Lecture

Speech recognition problem

Acoustic features

Statistical formulation

Speech Production

Speech is produced by exhaling air from the lungs.

Vowels: the vocal folds vibrate at a fundamental frequency (pitch).

The sound is determined by the position of the articulators (tongue, teeth, lips).

Fricatives: turbulent air flow.

Plosives: constriction, then release, in the vocal tract.

See (Flanagan, 1965).

Feature Selection

Short-time Fourier analysis:

$$\log\left|\int x(t)\,w(t-\tau)\,e^{-i\omega t}\,dt\right|$$

Idea: find a smooth approximation eliminating large variations over short frequency intervals.

[Figure: short-time (25 ms Hamming window) spectrum of /ae/, power (dB) vs. frequency (Hz).]
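A minimal numpy sketch of the short-time analysis above: take one 25 ms Hamming-windowed frame and compute its log-power spectrum (the sample rate and the use of random samples as a stand-in for speech are assumptions):

```python
import numpy as np

sr = 16000                                  # sample rate in Hz (assumed)
frame_len = int(0.025 * sr)                 # 25 ms window
x = np.random.randn(sr)                     # stand-in for one second of speech samples

frame = x[:frame_len] * np.hamming(frame_len)           # x(t) * w(t - tau)
spectrum = np.fft.rfft(frame)                           # discrete analogue of the integral
log_power_db = 20 * np.log10(np.abs(spectrum) + 1e-10)  # power (dB)
freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)          # frequency axis (Hz)
```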


Speech Spectrogram


Mel Frequency Cepstral Coefficients

Refinement: non-linear frequency scale approximating human perception of the distance between frequencies, e.g., the mel frequency scale

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right).$$

(Stevens and Volkman, 1940); see (Molau et al., 2001).

[Figure 2 (Molau et al., 2001): Comparison of the traditional MFCC computation (|FFT|² → VTN warping → mel-frequency warping → filterbank → logarithm → DCT) with the integrated approach investigated there (|FFT|² → logarithm → DCT with integrated VTN and mel-frequency warping).]

3.1. Traditional Filterbank Approach

Mel-frequency warping and the filterbank can be implemented easily in the frequency domain (see Figure 3). One method is to transform the power spectrum, i.e., to compute a Mel-warped spectrum by interpolation from the original discrete-frequency power spectrum. The advantage is that the following triangular filters all have the same shape and can be placed uniformly at the Mel-warped spectrum. On the other hand, the discretization may be especially critical due to the large dynamic range of the power spectrum.

[Figure 3 (Molau et al., 2001): Schematic plot of different triangular filterbank implementations, mel frequency f_mel(f) = 2595 · lg(1 + f/700 Hz) vs. original frequency f. The filters are either uniformly distributed at the Mel-warped spectrum, or non-uniformly at the original spectrum. In the latter case, they should be asymmetric as well.]

Another way is to place the triangular filters non-uniformly at the unwarped spectrum and thereby implicitly incorporate Mel-frequency scaling [1]. However, discretization errors may then occur if the spectral resolution is not appropriate. The lowest filters could be placed at very few spectral lines only, and the maximum of one of the filters may fall just in between two spectral lines. In addition, the filters should not be triangular and symmetric anymore, but bend according to the shape of the Mel-function at the position of the filter.

Last but not least, it is not clear how many filters are required and which filter shape is optimal. Triangular filters are occasionally replaced by trapezoidal or more complex-shaped ones derived from auditory models, and we sometimes observed better word error rates when using filters with cosine shape.

In all cases the logarithm of the filterbank output is cosine transformed to obtain MFCCs.

3.2. Computing MFCCs Directly On The Power Spectrum

We have investigated an alternative method to compute Mel-frequency warped cepstral coefficients directly on the power spectrum and thereby avoid possible problems of the standard approach.

Ignoring any spectral warping for a moment, cepstral coefficients can be derived by Eq. (1). Depending on whether a filterbank is used or not, the spectral term in Eq. (1) stands for either the filterbank outputs or the power spectrum.

The sequential application of a monotone invertible frequency warping function and DCT can be expressed as in Eq. (2).

To incorporate warping directly into the cosine transformation, we change the integration variable and use the derivative of the warping function (Eq. 3). The continuous integral is later approximated in the standard way by a discrete sum (Eq. 4).

One specific type of frequency warping is the Mel-frequency scaling, which is usually carried out according to formula (5) with the sampling frequency [6].

For integration into the cosine transformation, the Mel-warping function needs to be normalized (Eq. 6) in order to meet the normalization criterion.

Replacing the warping function in Eq. (4) by its normalized form leads to a compact implementation of MFCC computation with only a few lines of code. A look-up table for constants like the derivative and the cosine term can be precomputed; all that remains is a matrix multiplication on the logarithm of the power spectrum. Figure 4 shows …


Cepstral Coefficients

Let $m_j$ denote the mel-filterbank values derived from the Fourier transform.

Definition: the 13 cepstral coefficients are the energy and the first 12 coefficients of the expansion.

Other coefficients: 13 first-order (delta-cepstra) and 13 second-order (delta-delta cepstra) differentials.

5.4 Filterbank Analysis

analysis since this provides a much more straightforward route to obtaining the desired non-linear frequency resolution. However, filterbank amplitudes are highly correlated and hence the use of a cepstral transformation in this case is virtually mandatory if the data is to be used in an HMM-based recogniser with diagonal covariances.

HTK provides a simple Fourier transform based filterbank designed to give approximately equal resolution on a mel-scale. Fig. 5.3 illustrates the general form of this filterbank. As can be seen, the filters used are triangular and they are equally spaced along the mel-scale which is defined by

$$\mathrm{Mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right) \qquad (5.13)$$

To implement this filterbank, the window of speech data is transformed using a Fourier transform and the magnitude is taken. The magnitude coefficients are then binned by correlating them with each triangular filter. Here binning means that each FFT magnitude coefficient is multiplied by the corresponding filter gain and the results accumulated. Thus, each bin holds a weighted sum representing the spectral magnitude in that filterbank channel. As an alternative, the Boolean configuration parameter USEPOWER can be set true to use the power rather than the magnitude of the Fourier transform in the binning process.

[Fig. 5.3: Mel-Scale Filter Bank — triangular filters m_1, …, m_j, …, m_P equally spaced along the mel-scale over frequency, each collecting the energy in its band.]

Normally the triangular filters are spread over the whole frequency range from zero up to the Nyquist frequency. However, band-limiting is often useful to reject unwanted frequencies or avoid allocating filters to frequency regions in which there is no useful signal energy. For filterbank analysis only, lower and upper frequency cut-offs can be set using the configuration parameters LOFREQ and HIFREQ. For example,

LOFREQ = 300
HIFREQ = 3400

might be used for processing telephone speech. When low and high pass cut-offs are set in this way, the specified number of filterbank channels are distributed equally on the mel-scale across the resulting pass-band such that the lower cut-off of the first filter is at LOFREQ and the upper cut-off of the last filter is at HIFREQ.

If mel-scale filterbank parameters are required directly, then the target kind should be set to MELSPEC. Alternatively, log filterbank parameters can be generated by setting the target kind to FBANK. Most often, however, cepstral parameters are required and these are indicated by setting the target kind to MFCC, standing for Mel-Frequency Cepstral Coefficients (MFCCs). These are calculated from the log filterbank amplitudes {m_j} using the Discrete Cosine Transform

$$c_i = \sqrt{\frac{2}{N}} \sum_{j=1}^{N} m_j \cos\!\left(\frac{\pi i}{N}(j - 0.5)\right) \qquad (5.14)$$

where N is the number of filterbank channels set by the configuration parameter NUMCHANS. The required number of cepstral coefficients is set by NUMCEPS as in the linear prediction case. Liftering can also be applied to MFCCs using the CEPLIFTER configuration parameter (see equation 5.12).

See (Young et al., 2004)
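Equation (5.14) is only a few lines of numpy; a sketch, with hypothetical filterbank outputs standing in for real log amplitudes m_j:

```python
import numpy as np

def cepstra_from_filterbank(log_mel, num_ceps=12):
    """HTK eq. (5.14): c_i = sqrt(2/N) * sum_j m_j * cos(pi*i/N * (j - 0.5))."""
    N = len(log_mel)
    i = np.arange(1, num_ceps + 1)[:, None]      # cepstral indices i = 1..num_ceps
    j = np.arange(1, N + 1)[None, :]             # filterbank channels j = 1..N
    return np.sqrt(2.0 / N) * (np.cos(np.pi * i / N * (j - 0.5)) @ log_mel)

log_mel = np.log(1.0 + np.random.rand(26))       # hypothetical 26-channel log amplitudes
print(cepstra_from_filterbank(log_mel))          # c_1 .. c_12
```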


Mel Filterbank

MFCCs:

• signal first transformed using the mel frequency bands (mel filterbank).

• cosine transform yields cepstral coefficients.

Central idea: produce a sequence of feature vectors that is smooth, with uncorrelated dimensions.

• Omit higher coefficients to smooth the signal and to remove the influence of pitch.

• Typically, normalize the signal mean and variance.


MFCC Computation Summary

Pre-emphasis: boost energy of higher frequencies.

Pipeline: Pre-emphasis → Window → Fourier Transform → Mel FB + log → DCT → Deltas.

Add energy features and their deltas: ~39-dimensional signal.
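Putting the chain together, a compact numpy sketch of the pipeline above; the frame sizes, filter count, FFT size, and pre-emphasis coefficient are typical values assumed here, not prescribed by the slides (delta computation is omitted for brevity):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(x, sr=16000, n_filters=26, n_ceps=12, frame_ms=25, hop_ms=10, nfft=512):
    x = np.append(x[0], x[1:] - 0.97 * x[:-1])            # pre-emphasis
    flen, hop = sr * frame_ms // 1000, sr * hop_ms // 1000
    frames = [x[s:s + flen] * np.hamming(flen)            # window
              for s in range(0, len(x) - flen + 1, hop)]
    power = np.abs(np.fft.rfft(frames, nfft)) ** 2        # Fourier transform -> power
    # triangular filters equally spaced on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, left:center] = (np.arange(left, center) - left) / max(center - left, 1)
        fbank[i, center:right] = (right - np.arange(center, right)) / max(right - center, 1)
    log_mel = np.log(power.dot(fbank.T) + 1e-10)          # Mel FB + log
    # DCT (HTK eq. 5.14), keeping the first n_ceps coefficients
    i_idx = np.arange(1, n_ceps + 1)[:, None]
    j_idx = np.arange(1, n_filters + 1)[None, :]
    dct = np.sqrt(2.0 / n_filters) * np.cos(np.pi * i_idx / n_filters * (j_idx - 0.5))
    return log_mel.dot(dct.T)                             # one feature vector per frame

feats = mfcc(np.random.randn(16000))                      # stand-in for 1 s of audio
```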


This Lecture

Speech recognition problem

Acoustic features

Statistical formulation

This Lecture

Speech recognition problem

Acoustic features

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Problem

Data: sample drawn i.i.d. from a set $X$ according to some distribution $D$:
$$x_1, \ldots, x_m \in X.$$

Problem: find a distribution $p$ out of a set $P$ that best estimates $D$.

Maximum Likelihood

Likelihood: probability of observing the sample under distribution $p$, which, given the independence assumption, is
$$\Pr[x_1, \ldots, x_m] = \prod_{i=1}^{m} p(x_i).$$

Principle: select the distribution $p \in P$ maximizing the sample probability,
$$p^\star = \operatorname*{argmax}_{p \in P} \prod_{i=1}^{m} p(x_i), \quad \text{or} \quad p^\star = \operatorname*{argmax}_{p \in P} \sum_{i=1}^{m} \log p(x_i).$$

Example: Bernoulli Trials

Problem: find the most likely Bernoulli distribution, given a sequence of coin flips
$$H, T, T, H, T, H, T, H, H, H, T, T, \ldots, H.$$

Bernoulli distribution: $p(H) = \theta$, $p(T) = 1 - \theta$.

Likelihood:
$$l(p) = \log \theta^{N(H)} (1-\theta)^{N(T)} = N(H)\log\theta + N(T)\log(1-\theta).$$

Solution: $l$ is differentiable and concave;
$$\frac{dl(p)}{d\theta} = \frac{N(H)}{\theta} - \frac{N(T)}{1-\theta} = 0 \;\Longrightarrow\; \theta = \frac{N(H)}{N(H) + N(T)}.$$
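A quick numeric check of the closed-form solution (the flip sequence is made up):

```python
flips = "HTTHTHTHHHTT"                    # hypothetical coin-flip sample
n_h, n_t = flips.count("H"), flips.count("T")
theta = n_h / (n_h + n_t)                 # theta = N(H) / (N(H) + N(T))
print(theta)                              # 0.5 for this sequence
```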


Example: Gaussian Distribution

Problem: find the most likely Gaussian distribution, given a sequence of real-valued observations
$$3.18,\; 2.35,\; .95,\; 1.175,\; \ldots$$

Normal distribution:
$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

Likelihood:
$$l(p) = -\frac{1}{2} m \log(2\pi\sigma^2) - \sum_{i=1}^{m} \frac{(x_i-\mu)^2}{2\sigma^2}.$$

Solution: $l$ is differentiable and concave;
$$\frac{\partial l(p)}{\partial \mu} = 0 \;\Longrightarrow\; \mu = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \frac{\partial l(p)}{\partial \sigma^2} = 0 \;\Longrightarrow\; \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} x_i^2 - \mu^2.$$
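The closed-form estimates in numpy, using the sample values from the slide (the list is truncated there, so this is only illustrative):

```python
import numpy as np

x = np.array([3.18, 2.35, 0.95, 1.175])   # observed sample
mu = x.mean()                              # mu = (1/m) * sum x_i
sigma2 = (x ** 2).mean() - mu ** 2         # sigma^2 = (1/m) * sum x_i^2 - mu^2
print(mu, sigma2)
```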


Properties

Problems:

• the underlying distribution may not be among those searched.

• overfitting: the number of examples is too small relative to the number of parameters.

Maximum A Posteriori (MAP)

Principle: select the most likely hypothesis $h \in H$ given the sample $S$, with some prior distribution $\Pr[h]$ over the hypotheses:
$$h^\star = \operatorname*{argmax}_{h \in H} \Pr[h \mid S] = \operatorname*{argmax}_{h \in H} \frac{\Pr[S \mid h]\,\Pr[h]}{\Pr[S]} = \operatorname*{argmax}_{h \in H} \Pr[S \mid h]\,\Pr[h].$$

Note: for a uniform prior, MAP coincides with maximum likelihood.
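A minimal sketch of MAP selection over a finite hypothesis set: three candidate Bernoulli parameters with a non-uniform prior (all numbers are invented for illustration):

```python
import numpy as np

flips = "HHTHHHTH"                        # hypothetical sample S (6 heads, 2 tails)
n_h, n_t = flips.count("H"), flips.count("T")

thetas = np.array([0.25, 0.5, 0.75])      # hypothesis set H
prior = np.array([0.1, 0.8, 0.1])         # Pr[h], favoring the fair coin

# log Pr[S | h] + log Pr[h]; Pr[S] is constant in h and can be dropped
log_post = n_h * np.log(thetas) + n_t * np.log(1 - thetas) + np.log(prior)
print("MAP:", thetas[np.argmax(log_post)])                  # 0.5, the prior wins
print("ML: ", thetas[np.argmax(log_post - np.log(prior))])  # 0.75, the data alone
```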


This Lecture

Speech recognition problem

Acoustic features

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

General Ideas

Probabilistic formulation: given a spoken utterance, find the most likely transcription.

Decomposition: the mapping from spoken utterances to word sequences is decomposed into intermediate units:

word seq.:     w1 w2 w3 w4
phoneme seq.:  p1 p2 p3 p4 p5 p6 p7 p8 p9 p10
CD phone seq.: c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13 c14
observ. seq.:  o1 o2 o3 o4 o5 o6 o7 o8 o9 o10 o11 o12 o13 o14 o15 o16


Statistical Formulation

Observation sequence produced by the signal processing system: $o = o_1 \ldots o_m$.

Sequence of words over the alphabet $\Sigma$: $w = w_1 \ldots w_k$.

Formulation (maximum a posteriori decoding):
$$\hat{w} = \operatorname*{argmax}_{w \in \Sigma^*} \Pr[w \mid o] = \operatorname*{argmax}_{w \in \Sigma^*} \frac{\Pr[o \mid w]\,\Pr[w]}{\Pr[o]} = \operatorname*{argmax}_{w \in \Sigma^*} \underbrace{\Pr[o \mid w]}_{\text{acoustic \& pronunciation model}}\; \underbrace{\Pr[w]}_{\text{language model}}.$$

(Bahl, Jelinek, and Mercer, 1983)
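A toy illustration of the decomposition: with invented acoustic scores Pr[o | w] and language-model scores Pr[w] for two acoustically similar transcriptions, decoding reduces to an argmax over their product:

```python
import math

# invented scores for a single observation sequence o
acoustic = {"wreck a nice beach": 0.012, "recognize speech": 0.010}   # Pr[o | w]
lm = {"wreck a nice beach": 0.0001, "recognize speech": 0.002}        # Pr[w]

best = max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(lm[w]))
print(best)   # "recognize speech": the language model resolves the ambiguity
```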


Components

Acoustic and pronunciation model:
$$\Pr[o \mid w] = \sum_{d,c,p} \Pr[o \mid d]\,\Pr[d \mid c]\,\Pr[c \mid p]\,\Pr[p \mid w].$$

• $\Pr[o \mid d]$: observation seq. given distribution seq. (acoustic model).

• $\Pr[d \mid c]$: distribution seq. given CD phone seq. (acoustic model).

• $\Pr[c \mid p]$: CD phone seq. given phoneme seq.

• $\Pr[p \mid w]$: phoneme seq. given word seq.

Language model: $\Pr[w]$, a distribution over word sequences.


Notes

Formulation does not match the way speech recognition errors are typically measured: edit-distance between hypothesis and reference transcription.


This Lecture

Speech recognition problem

Acoustic features

Statistical formulation

• Maximum likelihood and maximum a posteriori

• Statistical formulation of speech recognition

• Components of a speech recognizer

Acoustic Observations

Discretization

• time: local spectral analysis of the speech waveform at regular intervals, $t = t_1, \ldots, t_m$, with $t_{i+1} - t_i = 10$ ms (typically).

Parameter vectors

• magnitude: $o = o_1 \ldots o_m$, with $o_i \in \mathbb{R}^N$, $N = 39$ (typically).

Note: other perceptual information, e.g., visual information, is ignored.

Acoustic Model

Three-state hidden Markov models (HMMs):

[Diagram: left-to-right HMM transducer over states 0-3; each state $i$ carries a self-loop $d_i{:}\varepsilon$, with transitions $0 \to 1$ ($d_0{:}\varepsilon$), $1 \to 2$ ($d_1{:}\varepsilon$), and $2 \to 3$ ($d_2{:}ae_{b,d}$).]

Distributions:

• Full-covariance multivariate Gaussians:
$$\Pr[x] = \frac{1}{(2\pi)^{N/2}\,|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}.$$

• Diagonal-covariance Gaussian mixtures.

• Semi-continuous, tied mixtures.

(Rabiner and Juang, 1993)
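A numpy sketch of the full-covariance Gaussian log-density above (the dimension and parameters are placeholders):

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """log of (2*pi)^(-N/2) |Sigma|^(-1/2) exp(-0.5 (x-mu)^T Sigma^{-1} (x-mu))."""
    N = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = diff.dot(np.linalg.solve(sigma, diff))   # (x-mu)^T Sigma^{-1} (x-mu)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + maha)

N = 39                                    # typical feature dimension
mu, sigma = np.zeros(N), np.eye(N)        # identity covariance = diagonal special case
print(log_gaussian(np.zeros(N), mu, sigma))   # -0.5 * 39 * log(2*pi)
```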


Context-Dependent Model

Idea:

• phoneme pronunciation depends on the environment (allophones, coarticulation).

• modeling phones in context → better accuracy.

Context-dependent rules:

• Context-dependent units: ae / b _ d → ae_{b,d}.

• Allophonic rules: t / V _ V → dx.

• Complex contexts: regular expressions.

(Lee, 1990; Young et al., 1994)

Pronunciation Dictionary

Phonemic transcription

• Example: the word data in American English.

data → D ey dx ax (0.32)
data → D ey t ax (0.08)
data → D ae dx ax (0.48)
data → D ae t ax (0.12)

Representation: weighted transducer

[Diagram: states 0-4 with arcs 0 →(d:ε/1.0) 1; 1 →(ey:ε/0.4) 2 and 1 →(ae:ε/0.6) 2; 2 →(dx:ε/0.8) 3 and 2 →(t:ε/0.2) 3; 3 →(ax:data/1.0) 4 (final).]
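The same weighted lexicon can be held directly as a mapping from words to weighted pronunciations; a sketch built from the table above (the transducer additionally shares common prefixes and suffixes across entries):

```python
# word -> list of (phoneme sequence, probability), from the table above
lexicon = {
    "data": [
        (("D", "ey", "dx", "ax"), 0.32),
        (("D", "ey", "t", "ax"), 0.08),
        (("D", "ae", "dx", "ax"), 0.48),
        (("D", "ae", "t", "ax"), 0.12),
    ],
}

best_pron, p = max(lexicon["data"], key=lambda entry: entry[1])
print(best_pron, p)   # ('D', 'ae', 'dx', 'ax') 0.48
```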


Language Model

Definition: probabilistic model for sequences of words $w = w_1 \ldots w_k$.

• By the chain rule,
$$\Pr[w] = \prod_{i=1}^{k} \Pr[w_i \mid w_1 \ldots w_{i-1}].$$

Modeling simplifications:

• Clustering of histories: $(w_1, \ldots, w_{i-1}) \mapsto c(w_1, \ldots, w_{i-1})$.

• Example: $n$th-order Markov assumption,
$$\forall i,\; \Pr[w_i \mid w_1 \ldots w_{i-1}] = \Pr[w_i \mid h_i], \quad |h_i| \le n - 1.$$
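A minimal maximum-likelihood bigram model (n = 2, so each history h_i is the single previous word), trained on a made-up two-sentence corpus; a real model would need smoothing for unseen n-grams:

```python
from collections import Counter

corpus = [["<s>", "how", "are", "you", "</s>"],
          ["<s>", "how", "old", "are", "you", "</s>"]]

bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
histories = Counter(w for s in corpus for w in s[:-1])

def prob(word, history):
    """MLE estimate Pr[w_i | w_{i-1}] = count(history, word) / count(history)."""
    return bigrams[(history, word)] / histories[history]

def sentence_prob(sentence):
    p = 1.0
    for i in range(1, len(sentence)):
        p *= prob(sentence[i], sentence[i - 1])   # chain rule with bigram histories
    return p

print(sentence_prob(["<s>", "how", "are", "you", "</s>"]))  # 0.5: "are"/"old" split after "how"
```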


Recognition Cascade

Combination of components:

observ. seq. → [HMM] → CD phone seq. → [CD Model] → phoneme seq. → [Pron. Model] → word seq. → [Lang. Model] → word seq.

Viterbi approximation:
$$\hat{w} = \operatorname*{argmax}_{w} \sum_{d,c,p} \Pr[o \mid d]\,\Pr[d \mid c]\,\Pr[c \mid p]\,\Pr[p \mid w]\,\Pr[w] \;\approx\; \operatorname*{argmax}_{w}\, \max_{d,c,p} \Pr[o \mid d]\,\Pr[d \mid c]\,\Pr[c \mid p]\,\Pr[p \mid w]\,\Pr[w].$$
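A toy illustration of why the Viterbi approximation is an approximation: summing over the hidden sequences and maximizing over them can pick different transcriptions (the per-path scores are invented):

```python
# invented joint scores Pr[o, d, c, p | w] * Pr[w], one entry per hidden path
paths = {
    "data": [0.020, 0.018, 0.015],    # probability mass spread over several paths
    "later": [0.040, 0.001, 0.001],   # one dominant path
}

exact = max(paths, key=lambda w: sum(paths[w]))    # exact MAP decoding: 'data' (0.053)
viterbi = max(paths, key=lambda w: max(paths[w]))  # Viterbi approximation: 'later' (0.040)
print(exact, viterbi)
```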


Speech Recognition Problems

Learning: how to create accurate models for each component?

Search: how to efficiently combine models and determine best transcription?

Representation: compact data structures for the computational representation of the models.

⇒ a common representation and algorithmic framework based on weighted transducers (next lectures).

References

• Bahl, L. R., Jelinek, F., and Mercer, R. (1983). A Maximum Likelihood Approach to Continuous Speech Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 5(2):179-190.

• Flanagan, J. L. (1965). Speech Analysis and Perception. Springer-Verlag, Berlin, 2nd edition.

• Juang, B.-H., and Rabiner, L. R. (2005). Automatic Speech Recognition - A Brief History of the Technology. Elsevier Encyclopedia of Language and Linguistics, second edition.

• Jelinek, F. (1998). Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA.

• Lee, K.-F. (1990). Context-Dependent Phonetic Hidden Markov Models for Continuous Speech Recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 38(4):599-609.

• Molau, S., Pitz, M., Schlüter, R., and Ney, H. (2001). Computing Mel-frequency cepstral coefficients on the power spectrum. In IEEE International Conference on Acoustics, Speech, and Signal Processing.


• Rabiner, L., and Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice Hall.

• Stevens, S. S., and Volkman, J. (1940). The relation of pitch to frequency. American Journal of Psychology, 53:329.

• Young, S., Odell, J., and Woodland, P. (1994). Tree-Based State-Tying for High Accuracy Acoustic Modelling. In Proceedings of the ARPA Human Language Technology Workshop. Morgan Kaufmann, San Francisco.

• Young, S., Evermann, G., Gales, M. J. F., Hain, T., Kershaw, D., Moore, G., Odell, J. J., Ollason, D., Povey, D., Valtchev, V., and Woodland, P. C. (2004). The HTK Book (for HTK Version 3.3). University of Cambridge.

