Overview of Speech Models
September 5, 2017 · Professor Marie Meteer · Brandeis University

Transcript
Page 1: Overview of Speech Models - Brandeis cs136a/CS136a_Slides/CS136a_Lect2_Speech...

+Overview of Speech Models

September 5, 2017 Professor Marie Meteer Brandeis University

Page 2:

+ Timeline, 1970s-2010: speech recognition labs and companies

Research labs:
- SRI: Decipher (1995) (research)
- MIT: Voyager (1989), Jupiter (research)
- AT&T → Interactions
- Cambridge University: HTK
- CMU: Hearsay & Harpy → Sphinx (1990) → Sphinx open source
- BBN: HWIM → Byblos (research); Hark (1993) → AVOKE (not sold)
- JHU: Kaldi

Commercial lineage:
- Kurzweil OCR → Xerox (1992) → Visioneer → ScanSoft → Nuance
- Dragon (1982) → L&H
- eScription → Nuance
- Kurzweil AI (1982) → L&H
- 1987: L&H bankrupt; assets → ScanSoft
- AlTech (1994), renamed SpeechWorks (1998), bought by ScanSoft
- Nuance (1994) goes public; "merger" with ScanSoft
- Philips (1994): speech recognition for medical dictation → Nuance
- Voice Control Systems (1996) → Philips
- Voice Signal → Nuance
- IBM speech research: 2009, IP → Nuance
- Loquendo
- Microsoft, Google
- Yap → Amazon A to Z

Notes:
- Nuance takeovers begin; industry consolidation
- Early research groups still researching
- The big players come in: Microsoft, Google, Amazon
- BBN product in UFA
- Nuance powers Siri
- Kaldi, a new research recognizer, goes open source
- Caveat: map is not to scale ;-)

6/6/16 © MM Consulting 2016

Page 3:

+ Today
- The speech problem
- Units of speech
- Hidden Markov Models
- Phonetic HMMs
- Recognition architecture
  - Training
  - Decoding

Page 4:

+ Friday: Next level of detail
- Where do the features come from?
- How are the transition and observation probabilities learned?
- What is the grammar/language model?
- How are the models applied in recognition (decoding), and how does the Viterbi algorithm help?
- What were the core improvements circa 1995?
- Topics that have been overcome by time:
  - Effects of training and grammar; speaker-dependent vs. speaker-independent recognition; real-time speech recognition
- Topics we'll return to:
  - Adaptation and out-of-vocabulary (OOV) words

Page 5:

+ 1995: First significant breakthrough in speech recognition
Hidden Markov Models:
- Mathematical framework
- Ability to model time and spectral variability simultaneously
- Ability to automatically estimate parameters given data
- No longer need to hand-segment into phonemes
- Segmentation and modeling done in one step
- Data driven → standard scientific procedures
- Empirical!

Page 6:

+ The speech problem
- Continuous-time signal →
- Sequence of discrete entities

ih t S iy z iy t u r eh k o g n iy z s p iy ch
It's easy to recognize speech

- Or did she say: It's easy to wreck a nice beach?

Page 7:

+ Challenge: Variability
- Linguistic
  - Can say many different things
  - Phonetics, phonology, syntax, semantics, discourse
- Speaker
  - Physical characteristics of the speaker
  - Co-articulation (the mouth has to transition between sounds)
  - Native language/dialect
- Channel
  - Background noise
  - Transmission channel (microphone/telephone quality)

Page 8:

+ The Problem of Segmentation... or... Why Speech Recognition Is So Difficult

m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r

(user:Roberto (attribute:telephone-num value:7360474))

MY NUMBER | IS SEVEN | THREE NINE | ZERO TWO | SEVEN FOUR

NP NP VP

Page 9:

+ A look at the speech sounds: grey whales

Words: Grey Whales
Phonemes: g r ey w ey l z
Triphones: -g r | g r ey | r ey w | ey w ey | w ey l | ey l z | l z-
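The triphone expansion above can be sketched in a few lines of code. The phone list and the "-" boundary marker come from the slide; the function itself is illustrative, not any particular toolkit's implementation:

```python
def triphones(phones):
    """Pair each phone with its left and right neighbor ('-' at utterance edges)."""
    padded = ["-"] + phones + ["-"]
    return [(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]

tri = triphones(["g", "r", "ey", "w", "ey", "l", "z"])
# first triphone: ('-', 'g', 'r'); last: ('l', 'z', '-')
```

Each of the seven phones yields one triphone, so a word's triphone count equals its phone count; only the contexts change.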

Page 10:

+ Contextual variability
- "ey" in grey, "ey" in whales
- g initial, g in "big gray", g in "pink gray" (look at these live in WaveSurfer)

Page 11:

+ Back to HMMs
- Why a Markov model?
  - A Markov chain models sequences → speech is sequential and can be modeled as a sequence of states specified by the dictionary
  - Transitions in the chain are probabilistic → probabilities model uncertainty well
- Why a Hidden Markov model?
  - Output symbols are a probabilistic distribution over all labels
  - The actual sequence of states for a particular output is "hidden"
  - There is one sequence that is the most probable to have generated the output symbols

Page 12:

+ Hidden Markov Models, formally

- States: Q = q1, q2, …, qN
- Observations: O = o1, o2, …, oN
  - Each observation is a symbol from a vocabulary V = {v1, v2, …, vV}
- Transition probabilities: transition probability matrix A = {aij}, where aij = P(qt = j | qt−1 = i), 1 ≤ i, j ≤ N
- Observation likelihoods: output probability matrix B = {bi(k)}, where bi(k) = P(Xt = ok | qt = i)
- Special initial probability vector π, where πi = P(q1 = i), 1 ≤ i ≤ N

9/5/17 Speech and Language Processing - Jurafsky and Martin
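The formal definition can be written out directly in code. This is a minimal pure-Python sketch with a toy 2-state, 3-symbol HMM whose numbers are entirely hypothetical; `forward` computes the total probability of an observation sequence by summing over all hidden state paths:

```python
# Toy HMM: N = 2 states, vocabulary of 3 symbols. All numbers are hypothetical.
pi = [0.6, 0.4]                 # initial probabilities: pi_i = P(q1 = i)
A = [[0.7, 0.3],                # transition matrix: a_ij = P(q_t = j | q_t-1 = i)
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],           # output matrix: b_i(k) = P(X_t = o_k | q_t = i)
     [0.1, 0.3, 0.6]]

def forward(obs):
    """P(O): probability of the observation sequence, summed over all state paths."""
    alpha = [pi[i] * B[i][obs[0]] for i in range(len(pi))]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(len(pi))) * B[j][o]
                 for j in range(len(pi))]
    return sum(alpha)

p = forward([0, 2, 1])          # probability of observing symbols v1, v3, v2
```

Note that the per-length-1 probabilities over all three symbols sum to 1, a quick sanity check that A, B, and π are proper distributions.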

Page 13:

+ Different types of HMM structure

Thanks to Dan Jurafsky for these slides

Bakis = left-to-right (allowing skips)

Ergodic = fully-connected

Page 14:

+ HMMs for speech
- Dictionary: SIX → S IH K S
- State sequence for every word
- Each phone has 3 subphones

Thanks to Dan Jurafsky for these slides
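The dictionary-to-state-sequence expansion can be sketched as follows. The mini-dictionary and the state-naming scheme (phone plus subphone index) are illustrative, not any particular toolkit's format:

```python
# Hypothetical mini-dictionary mapping words to phone sequences.
DICT = {"six": ["S", "IH", "K", "S"]}

def word_states(word, n_subphones=3):
    """Expand a word into its HMM state sequence:
    each phone contributes n_subphones states (beginning/middle/end)."""
    return [f"{phone}{i}" for phone in DICT[word] for i in range(n_subphones)]

states = word_states("six")
# ['S0', 'S1', 'S2', 'IH0', 'IH1', 'IH2', 'K0', 'K1', 'K2', 'S0', 'S1', 'S2']
```

Four phones times three subphones gives a 12-state left-to-right model for the word.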

Page 15:

+ HMM for digit recognition task

Thanks to Dan Jurafsky for these slides

Page 16:

+ Back to the Noisy Channel Model
- Search through the space of all possible sentences (defined by the HMM)
- Pick the one that is most probable given the waveform (based on the transition and output probabilities in the HMM)

Thanks to Dan Jurafsky for these slides

Page 17:

+ The Noisy Channel Model
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, …, ot
- Define a sentence as a sequence of words: W = w1, w2, w3, …, wn

Thanks to Dan Jurafsky for these slides

Page 18:

+ “Output symbols”: the great hack
- Markov models are “generative” models: find the most likely sequence that generates the output symbols
- “Output symbols” = observations
  - What we normally think of as input
- But speech is continuous. Where are the symbols?
- Early speech systems hacked this by “quantizing” vectors, e.g. giving each one a unique integer value
  - (More on how this hack works later; for now, just believe it)
  - This is what is described in Makhoul and Schwartz
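A minimal sketch of that quantization step, assuming a tiny hand-made codebook (real systems clustered training data into on the order of 256 codewords over much higher-dimensional feature vectors):

```python
# Hypothetical 4-entry codebook of 2-D "spectra"; the indices play the role
# of the HMM's discrete output symbols.
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

def quantize(vec):
    """Return the index of the nearest codeword (squared Euclidean distance):
    the 'unique integer value' standing in for a continuous frame."""
    def dist2(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(CODEBOOK)), key=lambda k: dist2(CODEBOOK[k]))

frames = [(0.1, 0.2), (0.9, 0.1), (0.4, 0.9)]
symbols = [quantize(f) for f in frames]   # -> [0, 1, 2]
```

After this step the continuous front-end output is just a stream of integers, which is exactly what a discrete-output HMM expects.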

Page 19:

+ Noisy Channel Model
- Probabilistic implication: pick the highest-probability word sequence Ŵ:

  Ŵ = argmax_{W ∈ L} P(W|O)

- We can use Bayes' rule to rewrite this:

  Ŵ = argmax_{W ∈ L} P(O|W) P(W) / P(O)

- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:

  Ŵ = argmax_{W ∈ L} P(O|W) P(W)

Thanks to Dan Jurafsky for these slides
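The argmax can be made concrete with the two readings of the earlier "recognize speech" example. The probabilities below are invented purely for illustration; the one real detail is that decoders work with log probabilities to avoid numerical underflow:

```python
import math

# Hypothetical P(O|W) and P(W) for two candidate transcriptions of one waveform.
candidates = {
    "it's easy to recognize speech":   (1e-40, 1e-6),   # (likelihood, prior)
    "it's easy to wreck a nice beach": (2e-40, 1e-9),
}

def decode(cands):
    """argmax over W of P(O|W) * P(W), computed as a sum of logs."""
    return max(cands, key=lambda w: math.log(cands[w][0]) + math.log(cands[w][1]))

best = decode(candidates)
```

Here the second candidate has the better acoustic score, but the language-model prior overrules it, which is the point of the combined objective.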

Page 20:

+ Noisy channel model

Ŵ = argmax_{W ∈ L} P(O|W) P(W)    (P(O|W): likelihood; P(W): prior)

Thanks to Dan Jurafsky for these slides

Page 21:

+ Speech Architecture meets Noisy Channel

Ŵ = argmax_{W ∈ L} P(O|W) P(W)    (P(O|W): likelihood; P(W): prior)

Thanks to Dan Jurafsky for these slides

Page 22:

+ M&S speech architecture (Colloquium Paper: Makhoul and Schwartz)

FIG. 3. A three-state HMM, with an output probability distribution bi(s) over the symbols A, B, C, D, E at each state.

…one state to another is probabilistic, but the production of the output symbols is deterministic. Now, given a sequence of output symbols that were generated by a Markov chain, one can retrace the corresponding sequence of states completely and unambiguously (provided the output symbol for each state was unique). For example, the sample symbol sequence B A A C B B A C C C A is produced by transitioning into the following sequence of states: 2 1 1 3 2 2 1 3 3 3 1.

Hidden Markov Models. A hidden Markov model (HMM) is the same as a Markov chain, except for one important difference: the output symbols in an HMM are probabilistic. Instead of associating a single output symbol per state, in an HMM all symbols are possible at each state, each with its own probability. Thus, associated with each state is a probability distribution over all the output symbols. Furthermore, the number of output symbols can be arbitrary. The different states may then have different probability distributions defined on the set of output symbols. The probabilities associated with states are known as output probabilities.

Fig. 3 shows an example of a three-state HMM. It has the same transition probabilities as the Markov chain of Fig. 2. What is different is that we associate a probability distribution bi(s) with each state i, defined over the set of output symbols s; in this case we have five output symbols: A, B, C, D, and E. Now, when we transition from one state to another, the output symbol is chosen according to the probability distribution corresponding to that state. Compared to a Markov chain, the output sequences generated by an HMM are what is known as doubly stochastic: not only is the transitioning from one state to another stochastic (probabilistic), but so is the output symbol generated at each state.

FIG. 5. General system for training and recognition: Speech Input → Feature Extraction → Feature Vectors → Recognition Search → Most Likely Sentence.

Now, given a sequence of symbols generated by a particular HMM, it is not possible to retrace the sequence of states unambiguously. Every sequence of states of the same length as the sequence of symbols is possible, each with a different probability. Given the sample output sequence C D A A B E D B A C C, there is no way to know for sure which sequence of states produced these output symbols. We say that the sequence of states is hidden, in that it is hidden from the observer if all one sees is the output sequence, and that is why these models are known as hidden Markov models. Even though it is not possible to determine for sure what sequence of states produced a particular sequence of symbols, one might be interested in the sequence of states that has the highest probability of having generated the given sequence.
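That highest-probability state sequence is what the Viterbi algorithm computes: replace the forward algorithm's sums with maxima and keep backpointers. A sketch on a toy 2-state HMM, with all numbers hypothetical:

```python
def viterbi(obs, pi, A, B):
    """Most probable hidden state path for an output sequence (max, not sum)."""
    n = len(pi)
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backptr = []
    for o in obs[1:]:
        step, new = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: delta[i] * A[i][j])
            step.append(best)
            new.append(delta[best] * A[best][j] * B[j][o])
        backptr.append(step)
        delta = new
    path = [max(range(n), key=lambda i: delta[i])]
    for step in reversed(backptr):      # follow backpointers to recover the path
        path.append(step[path[-1]])
    return path[::-1]

pi = [0.5, 0.5]
A = [[0.8, 0.2], [0.3, 0.7]]
B = [[0.9, 0.1], [0.2, 0.8]]    # state 0 favors symbol 0, state 1 favors symbol 1
path = viterbi([0, 0, 1, 1], pi, A, B)   # -> [0, 0, 1, 1]
```

With these emission probabilities the decoded path simply tracks which state best explains each symbol, while the transition matrix discourages needless state switching.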

Phonetic HMMs. We now explain how HMMs are used to model phonetic speech events. Fig. 4 shows an example of a three-state HMM for a single phoneme. The first stage in the continuous-to-discrete mapping that is required for recognition is performed by the analysis or feature extraction box shown in Fig. 5. Typically, the analysis consists of estimation of the short-term spectrum of the speech signal over a frame (window) of about 20 ms. The spectral computation is then updated about every 10 ms, which corresponds to a frame rate of 100 frames per second. This completes the initial discretization in time. However, the HMM, as depicted in this paper, also requires the definition of a discrete set of "output symbols." So, we need to discretize the spectrum into one of a finite set of spectra. Fig. 4 depicts a set of spectral templates (known as a codebook) that represent the space of possible spectra.
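The framing arithmetic in that paragraph is easy to check in code. The 20 ms window and 10 ms step come from the text; the 16 kHz sample rate below is an assumed example:

```python
def num_frames(n_samples, sample_rate, win_ms=20, hop_ms=10):
    """Frames produced by a win_ms analysis window advanced every hop_ms."""
    win = sample_rate * win_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return 0 if n_samples < win else 1 + (n_samples - win) // hop

frames = num_frames(16000, 16000)   # 1 s of 16 kHz audio -> 99 frames (~100/s)
```

One second of audio yields 99 full windows at a 10 ms hop, matching the paper's "about 100 frames per second."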

FIG. 4. Basic structure of a phonetic HMM: a 3-state HMM whose transition and output probabilities together comprise the model for one phone. The output probabilities are defined over a codebook of representative spectra (codes 0-255) derived by a clustering process; the model is learned by the forward-backward algorithm.

Proc. Natl. Acad. Sci. USA 92 (1995)

Observations

Language model: P(W)
Acoustic model: P(O|W)
Decoding: Ŵ = argmax_{W ∈ L} P(O|W) P(W)

Page 23:

+ Acoustic Modeling Training

Words: Nine, Night
Phonemes: N-AY-N, N-AY-T
Triphones: <s>-N-AY, N-AY-N, AY-N-<s>; <s>-N-AY, N-AY-T, AY-T-<s>

Training learns the relationship between feature vectors (observations) and triphones.

Page 24:

+ Really learning the observation and transition probabilities of the HMM

"Nine":  n0 n1 n2 → ay0 ay1 ay2 → n0 n1 n2
"Night": n0 n1 n2 → ay0 ay1 ay2 → t0 t1 t2

Page 25:

+ Embedded Training

9/5/17 CS 224S Winter 2007


Page 26:

+ Speech Recognition Architecture

Ŵ = argmax_{W ∈ L} P(O|W) P(W)

Thanks to Dan Jurafsky for these slides

Page 27:

+ Front End

Observations
State Sequence