Speech Recognition Introduction I
LML Speech Recognition 2008
Page 1: Speech Recognition Introduction I

Speech Recognition
Introduction I

E.M. Bakker

Page 2: Speech Recognition Introduction I

Speech Recognition

Some Applications

An Overview

General Architecture

Speech Production

Speech Perception

Page 3: Speech Recognition Introduction I

Speech Recognition

Goal: Automatically extract the string of words spoken from the speech signal

Speech Signal → Speech Recognition → Words (“How are you?”)

Page 4: Speech Recognition Introduction I

Speech Recognition

Goal: Automatically extract the string of words spoken from the speech signal

Speech Signal → Speech Recognition → Words (“How are you?”)

How is SPEECH produced? => Characteristics of the Acoustic Signal

Page 5: Speech Recognition Introduction I

Speech Recognition

Goal: Automatically extract the string of words spoken from the speech signal

Speech Signal → Speech Recognition → Words (“How are you?”)

How is SPEECH perceived? => Important Features

Page 6: Speech Recognition Introduction I

Speech Recognition

Goal: Automatically extract the string of words spoken from the speech signal

Speech Signal → Speech Recognition → Words (“How are you?”)

What LANGUAGE is spoken? => Language Model

Page 7: Speech Recognition Introduction I

Speech Recognition

Goal: Automatically extract the string of words spoken from the speech signal

Speech Signal → Speech Recognition → Words (“How are you?”)

What is in the BOX?

Input Speech → Acoustic Front-end → Acoustic Models P(A|W) + Language Model P(W) → Search → Recognized Utterance

Page 8: Speech Recognition Introduction I

Important Components of a General SR Architecture

Speech Signals

Signal Processing Functions

Parameterization

Acoustic Modeling (Learning Phase)

Language Modeling (Learning Phase)

Search Algorithms and Data Structures

Evaluation

Page 9: Speech Recognition Introduction I

Recognition Architectures: A Communication Theoretic Approach

Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel

Observable: Message, Words, Sounds, Features

Speech Recognition Problem: P(W|A), where A is the acoustic signal and W the words spoken.

Objective: minimize the word error rate.
Approach: maximize P(W|A) during training.

Components:

• P(A|W) : acoustic model (hidden Markov models, mixtures)

• P(W) : language model (statistical, finite state networks, etc.)

The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).

Bayesian formulation for speech recognition:

• P(W|A) = P(A|W) P(W) / P(A), where A is the acoustic signal and W the words spoken
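The Bayesian formulation can be sketched as a toy decoder. Since P(A) is the same for every hypothesis W, the recognizer only needs to maximize P(A|W) P(W). All probabilities below are invented for illustration; they stand in for real acoustic- and language-model scores:

```python
# Toy decoder: W* = argmax_W P(A|W) * P(W).  P(A) is constant over W,
# so it can be dropped from the maximization.  Numbers are made up.

acoustic_model = {            # P(A|W): likelihood of the observed signal
    "how are you": 0.020,
    "who are you": 0.025,
    "how art thou": 0.015,
}
language_model = {            # P(W): prior probability of the word sequence
    "how are you": 0.40,
    "who are you": 0.25,
    "how art thou": 0.01,
}

def decode(hypotheses):
    """Return the hypothesis maximizing P(A|W) * P(W)."""
    return max(hypotheses, key=lambda w: acoustic_model[w] * language_model[w])

best = decode(acoustic_model.keys())
print(best)   # "how are you": 0.020*0.40 = 0.008 beats 0.025*0.25 = 0.00625
```

Note how the language model overrules the slightly higher acoustic score of “who are you”: this is exactly why P(W) is combined with P(A|W) rather than decoding on acoustics alone.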

Page 10: Speech Recognition Introduction I

Recognition Architectures

Input Speech → Acoustic Front-end → Acoustic Models P(A|W) + Language Model P(W) → Search → Recognized Utterance

• Acoustic Front-end: the signal is converted to a sequence of feature vectors based on spectral and temporal measurements.

• Acoustic Models P(A|W): acoustic models represent sub-word units, such as phonemes, as finite-state machines in which states model spectral structure and transitions model temporal structure.

• Search: search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence. The language model P(W) predicts the next set of words and controls which models are hypothesized.

Page 11: Speech Recognition Introduction I

ASR Architecture

Common Base Classes

Configuration and Specification

Speech Database, I/O

Feature Extraction

Recognition: Searching Strategies

Evaluators

Language Models

HMM Initialisation and Training

Page 12: Speech Recognition Introduction I

Signal Processing Functionality

Acoustic Transducers

Sampling and Resampling

Temporal Analysis

Frequency Domain Analysis

Cepstral Analysis

Linear Prediction and LP-Based Representations

Spectral Normalization

Page 13: Speech Recognition Introduction I

Acoustic Modeling: Feature Extraction

Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Energy + Mel-Spaced Cepstrum
→ Time Derivative → Delta Energy + Delta Cepstrum
→ Time Derivative → Delta-Delta Energy + Delta-Delta Cepstrum

• Incorporate knowledge of the nature of speech sounds in the measurement of the features.

• Utilize rudimentary models of human perception.

• Measure features 100 times per second.

• Use a 25 msec window for frequency domain analysis.

• Include absolute energy and 12 spectral measurements.

• Use time derivatives to model spectral change.
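The windowing numbers above (25 msec windows, 100 measurements per second) can be sketched in code. This is an illustrative framing routine, not the lecture's actual front-end; the 16 kHz sample rate is an assumption:

```python
# Split a waveform into 25 ms analysis windows every 10 ms, so features
# are measured 100 times per second, and compute each frame's log energy.
import math

def frame_signal(samples, sample_rate=16000, win_ms=25, hop_ms=10):
    """Yield overlapping frames of `samples` (a list of floats)."""
    win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    for start in range(0, len(samples) - win + 1, hop):
        yield samples[start:start + win]

def log_energy(frame, floor=1e-10):
    """Log of the frame's total energy, floored to avoid log(0)."""
    return math.log(max(sum(x * x for x in frame), floor))

# One second of a 440 Hz tone at 16 kHz.
signal = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
frames = list(frame_signal(signal))
print(len(frames))   # 98 full 25 ms windows fit in 1 s at a 10 ms hop
```

A real front-end would apply a tapered (e.g. Hamming) window to each frame before the Fourier transform; the 10 ms hop is what yields the "100 times per second" rate.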

Page 14: Speech Recognition Introduction I

Acoustic Modeling

Dynamic Programming

Markov Models

Parameter Estimation

HMM Training

Continuous Mixtures

Decision Trees

Limitations and Practical Issues of HMM

Page 15: Speech Recognition Introduction I

Acoustic Modeling: Hidden Markov Models

• Acoustic models encode the temporal evolution of the features (spectrum).

• Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.

• Phonetic model topologies are simple left-to-right structures.

• Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.

• Sharing model parameters is a common strategy to reduce complexity.
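The left-to-right topology can be made concrete with a small sketch. The 3-state model, 2-symbol alphabet, and all probabilities below are invented for illustration; real acoustic models emit continuous feature vectors via Gaussian mixtures rather than discrete symbols:

```python
# A 3-state left-to-right HMM: each state may only loop on itself or
# advance to the next state, and the forward recursion computes P(O | model).

A = [                   # transition matrix A[i][j]
    [0.6, 0.4, 0.0],    # state 0: self-loop or advance to state 1
    [0.0, 0.7, 0.3],    # state 1: self-loop or advance to state 2
    [0.0, 0.0, 1.0],    # state 2: absorbing final state
]
pi = [1.0, 0.0, 0.0]    # always start in the leftmost state

B = [                   # toy emission probabilities B[state][symbol]
    [0.9, 0.1],
    [0.5, 0.5],
    [0.1, 0.9],
]

def forward(observations):
    """P(observations | model) via the forward recursion."""
    alpha = [pi[s] * B[s][observations[0]] for s in range(3)]
    for o in observations[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(3)) * B[j][o]
                 for j in range(3)]
    return sum(alpha)

print(forward([0, 0, 1, 1]))   # likelihood of a short 4-frame observation
```

The self-loops are what absorb duration variation: a slow speaker simply spends more frames looping in each state, which is the "time-warping" property mentioned above.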

Page 16: Speech Recognition Introduction I

Acoustic Modeling: Parameter Estimation

• Closed-loop, data-driven modeling supervised by a word-level transcription.

• The expectation-maximization (EM) algorithm is used to improve the parameter estimates.

• Computationally efficient training algorithms (Forward-Backward) have been crucial.

• Batch-mode parameter updates are typically preferred.

• Decision trees are used to optimize parameter-sharing, system complexity, and the use of additional linguistic knowledge.

Training proceeds by successive mixture splitting:

• Initialization
• Single Gaussian Estimation
• 2-Way Split
• Mixture Distribution Reestimation
• 4-Way Split
• Reestimation
• …
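The first two steps of this schedule can be sketched directly. This is an illustrative 1-D version, not the course's trainer; the split offset of 0.2 standard deviations is a common but here assumed choice:

```python
# Estimate a single Gaussian by maximum likelihood, then perform a 2-way
# split by perturbing the mean, doubling the number of mixture components.
import math

def estimate_gaussian(data):
    """Maximum-likelihood mean and variance of a 1-D dataset."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    return mean, var

def split_component(mean, var, eps=0.2):
    """2-way split: offset the mean by +/- eps standard deviations,
    halving the weight; variances are copied from the parent."""
    offset = eps * math.sqrt(var)
    return [(0.5, mean - offset, var), (0.5, mean + offset, var)]

data = [1.0, 1.2, 0.8, 3.0, 3.1, 2.9]        # toy bimodal data
mean, var = estimate_gaussian(data)           # mean = 2.0
mixture = split_component(mean, var)          # two equal-weight components
print(mixture)
```

After such a split, EM reestimation pulls the two perturbed means toward the two data clusters; repeating split-then-reestimate yields the 4-way, 8-way, ... growth in the schedule above.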

Page 17: Speech Recognition Introduction I

Language Modeling

Formal Language Theory

Context-Free Grammars

N-Gram Models and Complexity

Smoothing
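Of the topics listed, smoothing is easiest to show in a few lines. Below is an add-one (Laplace) smoothing sketch, one of the simplest smoothing techniques; the vocabulary and counts are made up for illustration:

```python
# Add-one (Laplace) smoothed bigram probabilities: unseen bigrams get a
# small nonzero probability instead of zero.
from collections import Counter

vocab = ["I", "think", "you", "know"]
counts = Counter({("I", "think"): 2, ("you", "know"): 3})

def p_laplace(prev, nxt):
    """Add-one smoothed P(nxt | prev) over the toy vocabulary."""
    history_total = sum(c for (p, _), c in counts.items() if p == prev)
    return (counts[(prev, nxt)] + 1) / (history_total + len(vocab))

print(p_laplace("I", "think"))   # (2+1)/(2+4) = 0.5
print(p_laplace("I", "you"))     # (0+1)/(2+4) ≈ 0.167 — unseen but nonzero
```

Without smoothing, any test sentence containing one unseen bigram would get probability zero; add-one smoothing fixes that crudely, and more refined schemes (e.g. backoff) redistribute the probability mass more carefully.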

Page 18: Speech Recognition Introduction I

Language Modeling

Page 19: Speech Recognition Introduction I

Language Modeling: N-Grams

Unigrams (SWB):
• Most Common: “I”, “and”, “the”, “you”, “a”
• Rank-100: “she”, “an”, “going”
• Least Common: “Abraham”, “Alastair”, “Acura”

Bigrams (SWB):
• Most Common: “you know”, “yeah SENT!”, “!SENT um-hum”, “I think”
• Rank-100: “do it”, “that we”, “don’t think”
• Least Common: “raw fish”, “moisture content”, “Reagan Bush”

Trigrams (SWB):
• Most Common: “!SENT um-hum SENT!”, “a lot of”, “I don’t know”
• Rank-100: “it was a”, “you know that”
• Least Common: “you have parents”, “you seen Brooklyn”
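Estimating such N-gram statistics is simple counting. Here is a minimal maximum-likelihood bigram sketch on a tiny invented corpus (Switchboard-style counts are not reproduced here):

```python
# Maximum-likelihood bigram model: P(next | prev) = count(prev, next) / count(prev).
from collections import Counter, defaultdict

corpus = "you know I think you know that I think".split()

bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def p_bigram(prev, nxt):
    """Maximum-likelihood estimate P(nxt | prev); 0 for unseen histories."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][nxt] / total if total else 0.0

print(p_bigram("you", "know"))   # "you" is always followed by "know" here: 1.0
print(p_bigram("know", "I"))     # "know" is followed by "I" half the time: 0.5
```

On real corpora these raw estimates assign zero to any unseen bigram, which is why the smoothing techniques mentioned earlier are essential.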

Page 20: Speech Recognition Introduction I

LM: Integration of Natural Language

• Natural-language constraints can be easily incorporated.

• Lack of punctuation and the size of the search space pose problems.

• Speech recognition typically produces a word-level, time-aligned annotation.

• Time alignments for other levels of information are also available.

Page 21: Speech Recognition Introduction I

Search Algorithms and Data Structures

Basic Search Algorithms

Time-Synchronous Search

Stack Decoding

Lexical Trees

Efficient Trees

Page 22: Speech Recognition Introduction I

Dynamic Programming-Based Search

• Search is time synchronous and left-to-right.

• Arbitrary amounts of silence must be permitted between each word.

• Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
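A toy time-synchronous Viterbi search illustrates the dynamic-programming step: at each frame, every state keeps only its best-scoring predecessor path. The states, transitions, and emission scores below are all invented for illustration:

```python
# Time-synchronous, left-to-right Viterbi search over a 3-frame "utterance".

states = ["sil", "h", "aw"]            # hypothetical sub-word units
trans = {                              # toy transition probabilities
    ("sil", "sil"): 0.5, ("sil", "h"): 0.5,
    ("h", "h"): 0.4, ("h", "aw"): 0.6,
    ("aw", "aw"): 1.0,
}
emit = [                               # toy P(frame_t | state) per frame
    {"sil": 0.8, "h": 0.1, "aw": 0.1},
    {"sil": 0.2, "h": 0.7, "aw": 0.1},
    {"sil": 0.1, "h": 0.2, "aw": 0.7},
]

def viterbi():
    """Return the best state path; each state keeps one surviving path."""
    score = {"sil": emit[0]["sil"]}    # must start in silence
    path = {"sil": ["sil"]}
    for frame in emit[1:]:             # advance frame by frame (time-synchronous)
        new_score, new_path = {}, {}
        for (src, dst), p in trans.items():
            if src not in score:
                continue
            cand = score[src] * p * frame[dst]
            if cand > new_score.get(dst, 0.0):   # keep only the best predecessor
                new_score[dst] = cand
                new_path[dst] = path[src] + [dst]
        score, path = new_score, new_path
    best = max(score, key=score.get)
    return path[best]

print(viterbi())   # best path: ['sil', 'h', 'aw']
```

The per-frame pruning to one survivor per state is what keeps the search tractable despite the many word start/stop hypotheses mentioned above; real decoders add beam pruning on top.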

Page 23: Speech Recognition Introduction I

Recognition Architectures

Input Speech → Acoustic Front-end → Acoustic Models P(A|W) + Language Model P(W) → Search → Recognized Utterance

• Acoustic Front-end: the signal is converted to a sequence of feature vectors based on spectral and temporal measurements.

• Acoustic Models P(A|W): acoustic models represent sub-word units, such as phonemes, as finite-state machines in which states model spectral structure and transitions model temporal structure.

• Search: search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence. The language model P(W) predicts the next set of words and controls which models are hypothesized.

Page 24: Speech Recognition Introduction I

Speech Recognition

Goal: Automatically extract the string of words spoken from the speech signal

Speech Signal → Speech Recognition → Words (“How are you?”)

How is SPEECH produced?
