
04/19/23 1

Automatic Speech Recognition

Julia Hirschberg, CS 6998


What is speech recognition?

Transcribing words?
Understanding meaning?


It’s hard to recognize speech...

People speak in very different ways
  Across-speaker variation
  Within-speaker variation
Speech sounds vary according to the speech context
Environment varies with respect to noise
The transcription task must handle all of this and produce a transcript of the spoken words


Success: low word error rate, WER = (S + I + D) / N * 100, where S, I, and D count substitutions, insertions, and deletions against the N reference words
  "Thesis test" vs. "This is a test": 75% WER (1 substitution + 2 deletions over 4 words)
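The WER formula above can be sketched as a word-level edit distance; this is a minimal illustration (the helper name `wer` is mine, not from the slides):

```python
def wer(reference, hypothesis):
    """Word error rate: (S + I + D) / N, via word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                     # all deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                     # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # substitution / match
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    return dp[len(ref)][len(hyp)] / len(ref) * 100

print(wer("This is a test", "Thesis test"))  # 75.0, the slide's example
```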

Progress:
  Very large training corpora
  Fast machines and cheap storage
  Bake-offs
  Market for real-time systems
  New representations and algorithms: Finite State Transducers


Varieties of Speech Recognition

Mode: isolated words vs. continuous
Style: read/prepared vs. spontaneous
Enrollment: speaker-dependent vs. speaker-independent
Vocabulary size: <20, 5K --> 60K --> ~1M
Language model: finite state, ngrams, CFGs, CSGs
Perplexity: <10 to >100
SNR: >30 dB (high) to <10 dB (low)
Input device: telephone, microphones


ASR and the Noisy Channel Model

Source --> noisy channel --> hypothesis
Find the most likely input to have generated the (observed) "noisy" sentence, i.e. the most likely sentence W in the language L given acoustic input O:

  W' = argmax_{W in L} P(W|O)

By Bayes' rule, P(x|y) = P(y|x) P(x) / P(y), so

  W' = argmax_{W in L} P(O|W) P(W) / P(O)


P(O) is the same for all hypothesized W, so

  W' = argmax_{W in L} P(O|W) P(W)

P(W) is the prior; P(O|W) is the (acoustic) likelihood
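The decoding rule can be sketched with a toy example: score each candidate W by its acoustic likelihood times its prior, in log space. The candidate sentences and all probabilities below are invented purely for illustration:

```python
import math

# Hypothetical scores for two candidate transcriptions of the same audio O.
# For each W we store (P(O|W), P(W)); P(O) is constant and can be dropped.
candidates = {
    "this is a test": (1e-5, 1e-4),
    "thesis test":    (2e-5, 1e-7),
}

def decode(candidates):
    # argmax over W of log P(O|W) + log P(W)
    return max(candidates,
               key=lambda w: math.log(candidates[w][0]) + math.log(candidates[w][1]))

print(decode(candidates))  # "this is a test": the stronger prior outweighs acoustics
```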


Simple Isolated Digit Recognition

Train 10 acoustic templates M_i: one per digit
Compare input x with each
Select the most similar template j according to some comparison function f, minimizing the difference:

  j = argmin_i f(x, M_i)
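A toy sketch of the template-matching step; a real isolated-digit system would compare frame sequences with something like dynamic time warping, but here `f` is plain Euclidean distance over a fixed-length feature vector, and all templates and numbers are illustrative:

```python
def euclidean(x, m):
    """Comparison function f(x, M_i): Euclidean distance between feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, m)) ** 0.5

def recognize(x, templates):
    """j = argmin_i f(x, M_i): pick the digit whose template is nearest."""
    return min(templates, key=lambda i: euclidean(x, templates[i]))

# Toy templates M_i, one 2-dimensional "feature vector" per digit.
templates = {0: [0.1, 0.9], 1: [0.8, 0.2]}
print(recognize([0.75, 0.3], templates))  # 1: closest to the template for digit 1
```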


Scaling Up: Continuous Speech Recognition

Collect training and test corpora of
  Speech + word transcription
  Speech + phonetic transcription (built by hand or using TTS)
  Text corpus
Determine a representation for the signal
Build probabilistic models:
  Acoustic model: signal to phones


  Pronunciation model: phones to words
  Language model: words to sentences
Select search procedures to decode new input given these trained models


Representing the Signal

What parameters (features) of the waveform
  Can be extracted automatically?
  Will preserve phonetic identity and distinguish it from other phones?
  Will be independent of speaker variability and channel conditions?
  Will not take up too much space?
...the power spectrum


Speech captured by microphone and digitized

Signal divided into frames
Power spectrum computed to represent energy in different frequency bands of the signal
  LPC spectrum, cepstra, PLP
Each frame's spectral features represented by a small set of numbers
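The framing-plus-power-spectrum pipeline can be sketched in a few lines. This uses a direct DFT rather than any real ASR front end, and the frame size, step, and test signal are all illustrative:

```python
import cmath
import math

def frames(signal, size, step):
    """Split a digitized signal into (possibly overlapping) frames."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, step)]

def power_spectrum(frame):
    """Power per frequency bin via a direct DFT (O(n^2), fine for a sketch)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):  # keep only non-negative frequencies
        s = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        spectrum.append(abs(s) ** 2)
    return spectrum

# A 100 Hz sine sampled at 800 Hz: energy should concentrate in one band.
signal = [math.sin(2 * math.pi * 100 * t / 800) for t in range(64)]
spec = power_spectrum(frames(signal, size=16, step=8)[0])
peak = max(range(len(spec)), key=spec.__getitem__)
print(peak)  # bin 2, since 100 Hz * 16 samples / 800 Hz = 2
```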


Why it works: different phonemes have different spectral characteristics
Why it doesn't work: phonemes can have different properties in different acoustic contexts, spoken by different people, ...


Acoustic Models

Models likelihood of a phone given spectral features and prior context
Usually represented as an HMM:
  A set of states representing phones or other subword units
  Transition probabilities on states: how likely is it to see one phone after another?
  Observation/output likelihoods: how likely is a given spectral feature vector to be observed in state i?


Train initial model on small hand-labeled corpus to get initial estimates of transition and observation probabilities

Tune parameters on large corpus with only transcription

Iterate until no further improvement


Pronunciation Model

Models likelihood of a word given a network of candidate phone hypotheses (weighted phone lattice)
Must handle allophones: the /t/ in butter vs. but
Lexicon may be an HMM or a simple dictionary
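A dictionary-style lexicon with allophonic variants can be sketched as follows; the ARPAbet-like phone symbols (including "dx" for the flapped /t/ of butter) and the entries themselves are illustrative, not a real pronunciation dictionary:

```python
# Toy pronunciation lexicon: each word maps to one or more phone strings,
# so allophonic variants (flapped vs. released /t/) both reach the same word.
lexicon = {
    "butter": ["b ah dx er", "b ah t er"],
    "but":    ["b ah t"],
}

def words_for(phones, lexicon):
    """Return every word whose listed pronunciations include this phone string."""
    return [w for w, prons in lexicon.items() if phones in prons]

print(words_for("b ah dx er", lexicon))  # ['butter']: the flapped variant still maps to "butter"
```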


Language Models

Models likelihood of a word sequence given candidate word hypotheses
Grammars: finite state or CFG
Ngrams: corpus-trained; smoothing issues
Out of Vocabulary (OOV) problem
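A corpus-trained bigram model with add-one (Laplace) smoothing, one simple answer to both the smoothing and OOV issues, might look like this; the tiny corpus and the `<unk>` convention are illustrative:

```python
from collections import Counter

# Train bigram counts on a (toy) corpus.
corpus = "i want a train i want a ticket".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab = set(corpus) | {"<unk>"}  # reserve a slot for out-of-vocabulary words

def p_bigram(w1, w2):
    """P(w2 | w1) with add-one smoothing; unseen words map to <unk>."""
    w1 = w1 if w1 in unigrams else "<unk>"
    w2 = w2 if w2 in vocab else "<unk>"
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(p_bigram("want", "a") > p_bigram("want", "train"))  # True: "want a" was seen
print(p_bigram("want", "zebra") > 0)                      # True: OOV still gets mass
```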


Search

Find the best hypothesis given
  A lattice of subword units (AM)
  Segmentations of all paths into possible words (PM)
  Probabilities of word sequences (LM)
Huge search space:
  Viterbi decoding
  Beam search
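Viterbi decoding can be sketched over a toy two-state HMM; the states, transition, and observation probabilities below are invented for illustration (real decoders work over far larger lattices, in log space as here):

```python
import math

# Toy HMM: two phone-like states emitting discretized "feature" symbols.
states = ["ah", "iy"]
start = {"ah": 0.6, "iy": 0.4}
trans = {"ah": {"ah": 0.7, "iy": 0.3}, "iy": {"ah": 0.4, "iy": 0.6}}
emit = {"ah": {"lo": 0.8, "hi": 0.2}, "iy": {"lo": 0.3, "hi": 0.7}}

def viterbi(obs):
    """Return the single most likely state path for the observation sequence."""
    # best[s] = (log prob of the best path ending in s, that path)
    best = {s: (math.log(start[s]) + math.log(emit[s][obs[0]]), [s]) for s in states}
    for o in obs[1:]:
        best = {
            s: max(
                ((lp + math.log(trans[prev][s]) + math.log(emit[s][o]), path + [s])
                 for prev, (lp, path) in best.items()),
                key=lambda t: t[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda t: t[0])[1]

print(viterbi(["lo", "lo", "hi"]))  # ['ah', 'ah', 'iy']
```

A beam search would differ only in pruning: keep just the k highest-scoring entries of `best` at each step instead of one per state.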


Challenges for Transcription

Robustness to channel characteristics and noise
Portability to new applications
Adaptation: to speakers, to environments
LMs: simple ngrams need help
Confidence measures
OOV words
New speaking styles/genres
New applications


Challenges for Understanding

Recognizing communicative 'problems':
  ASR errors
  User corrections
  Disfluencies and self-repairs


An Unsuccessful Dialogue

S: Hi, this is AT&T Amtrak schedule system. This is TOOT. How may I help you?

U: I'd like to find a train from Ch- to Chicago from Baltimore.

S: Sorry, I can't understand you. Please repeat your utterance.

U: I'd like to find a train from Baltimore to Chicago.
S: I heard you say "leave today around 5 thirty in the morning." Which city do you want to go to?
U: I would like to leave Saturday at eight in the morning.


S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago.
S: You can say the arrival city name, such as "New York City."
U: Baltimore to Chicago. (I hate it.)
U: Train from Baltimore ...
S: ...I heard you s...
U: to Chicago.... Hello?
S: You can say the name of your departure city, such as "New York City."
U: Baltimore.


U: Bal-ti-more....
S: You can say...
U: Baltimore.
S: Do you want to continue and find a new set of trains?
U: Yes.
S: You can say "yes" or "I am done here" to find a new set of trains.
U: Yes.
S: Thank you for using AT&T Amtrak train time table system. See you next time.
U: I nev-


Disfluencies and Self-Repairs

Disfluencies abound in spontaneous speech: one every 4.6 s in radio call-in (Blackmer & Mitton '91)
  Hesitation: Ch- change strategy.
  Filled pause: Um Baltimore.
  Self-repair: Ba- uh Chicago.
Hard to recognize:
  Ch- change strategy. --> to D C D C today ten fifteen.
  Um Baltimore. --> From Baltimore ten.
  Ba- uh Chicago. --> For Boston Chicago.


Possibilities for Understanding

Recognizing speaker emotion
Identifying speech acts: okay
Locating topic boundaries for topic tracking, audio browsing, speech data mining


Next Week


