Introduction to Automatic Speech and Speaker Recognition

Page 1: Introduction to Automatic Speech and Speaker Recognition

Introduction to Automatic Speech and Speaker Recognition

Sadaoki Furui
Tokyo Institute of Technology, Department of Computer Science
[email protected]

Page 2: Introduction to Automatic Speech and Speaker Recognition

Major speech recognition applications

• Conversational systems for accessing information services (e.g. automatic flight status or stock quote information systems)

• Systems for transcribing, understanding and summarizing ubiquitous speech documents (e.g. broadcast news, meetings, lectures, presentations, congressional records, court records, and voicemails)

Page 3: Introduction to Automatic Speech and Speaker Recognition

Radio Rex – 1920s ASR

A sound-activated toy dog named "Rex" (from the Elmwood Button Co.) could be called by name from his doghouse.

Page 4: Introduction to Automatic Speech and Speaker Recognition

Front view of the spoken digit recognizer (J. Suzuki & K. Nakata, Radio Research Labs, Japan, 1961)

Page 5: Introduction to Automatic Speech and Speaker Recognition

Photograph of the Japanese spoken digit recognizer (K. Nagata, Y. Kato and S. Chiba, NEC Labs, Japan, 1963)

Page 6: Introduction to Automatic Speech and Speaker Recognition

"Julie" doll with speech synthesis and recognition technology, produced by Worlds of Wonder in conjunction with Texas Instruments (1987)

Page 7: Introduction to Automatic Speech and Speaker Recognition

Now


Page 8: Introduction to Automatic Speech and Speaker Recognition

Speech chain

Figure: the speech chain linking a talker and a listener. On the talker's side, a linguistic process produces motor commands that drive the articulators (physiological process); the resulting speech wave is the physical (acoustic) process; on the listener's side, the ear and auditory nerve (physiological process) feed a linguistic process. The talker also monitors his or her own speech through auditory feedback. The two linguistic ends are discrete, while the acoustic link is continuous.

Page 9: Introduction to Automatic Speech and Speaker Recognition

Structure of the speech production and recognition system based on information transmission theory

Figure (transmission theory): an information source feeds a channel whose output is read by a decoder. Figure (speech recognition process): text generation produces the word sequence W; speech production followed by acoustic processing (the acoustic channel) yields the observation X; linguistic decoding in the speech recognition system produces the estimate Ŵ.

Ŵ = argmax_W P(W|X) = argmax_W P(X|W) P(W) / P(X)
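As a minimal sketch of this decision rule (not part of the original slides): P(X) is the same for every candidate W, so the recognizer can rank hypotheses by log P(X|W) + log P(W). The scoring functions acoustic_logprob and lm_logprob below are assumed placeholders for the acoustic model and the language model.

```python
def decode(X, candidate_word_sequences, acoustic_logprob, lm_logprob):
    """Return the word sequence W maximizing P(X|W) * P(W).

    P(X) is constant across candidates, so it drops out of the argmax.
    acoustic_logprob(X, W) and lm_logprob(W) are assumed to return
    log P(X|W) and log P(W), supplied by the acoustic and language models.
    """
    return max(candidate_word_sequences,
               key=lambda W: acoustic_logprob(X, W) + lm_logprob(W))
```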

Page 10: Introduction to Automatic Speech and Speaker Recognition

Mechanism of state-of-the-art speech recognizers

Figure: the speech input passes through acoustic analysis to give the feature-vector sequence x1 ... xT. A global search then maximizes

P(x1 ... xT | w1 ... wk) · P(w1 ... wk)

over word sequences w1 ... wk, drawing on the phoneme inventory, the pronunciation lexicon and the language model, and outputs the recognized word sequence.

Page 11: Introduction to Automatic Speech and Speaker Recognition

Speech energy, sound spectrogram, waveform and pitch

Figure: power (up to 80 dB), spectrogram (0 to 6 kHz), waveform and pitch contours of an utterance, plotted against time from 0.0 to 0.9 s.

Page 12: Introduction to Automatic Speech and Speaker Recognition

Feature vector (short-time spectrum) extraction from speech

Figure: a time window of a fixed frame length is shifted along the speech wave at a fixed frame period, and a feature vector is extracted from each frame.

Page 13: Introduction to Automatic Speech and Speaker Recognition

Block diagram of a typical speech analysis procedure

The stages, with typical parameter values:
• Low-pass filter: cut-off frequency = 8 kHz
• A/D conversion (sampling and quantization): sampling frequency = 16 kHz, quantization = 16 bits
• Analysis frame extraction: frame length = 30 ms, frame interval = 10 ms
• Windowing (Hamming, Hanning, etc.): window length = frame length
• Spectral analysis (FFT, LPC, etc.)
• Feature extraction: parametric representation as excitation parameters and vocal tract parameters

The figure also shows examples of the speech wave at each stage.
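A rough sketch of the frame extraction, windowing and FFT stages, using the parameter values listed above; the NumPy-based implementation itself is illustrative, not the one used in the lecture.

```python
import numpy as np

def short_time_spectra(signal, fs=16000, frame_len_ms=30, frame_shift_ms=10):
    """Cut a 16-kHz signal into 30-ms frames every 10 ms, apply a Hamming
    window, and return the FFT magnitude spectrum of each frame."""
    frame_len = int(fs * frame_len_ms / 1000)      # 480 samples
    frame_shift = int(fs * frame_shift_ms / 1000)  # 160 samples
    window = np.hamming(frame_len)
    n_frames = max(0, 1 + (len(signal) - frame_len) // frame_shift)
    spectra = []
    for i in range(n_frames):
        frame = signal[i * frame_shift : i * frame_shift + frame_len]
        spectra.append(np.abs(np.fft.rfft(frame * window)))
    return np.array(spectra)
```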

Page 14: Introduction to Automatic Speech and Speaker Recognition

Linear separable equivalent circuit model of the speech production mechanism

Figure: a source G(ω), a pulse train for voiced sounds or noise for unvoiced sounds, controlled by the fundamental period, the voiced/unvoiced decision and the amplitude, excites a vocal tract articulation equivalent filter H(ω) described by spectral envelope parameters, producing the speech wave S(ω).

S(ω) = G(ω) • H(ω)

Page 15: Introduction to Automatic Speech and Speaker Recognition

Spectral structure of speech

Figure: the short-time speech spectrum (log magnitude versus frequency f) is decomposed into the spectral fine structure, with harmonics spaced at the fundamental frequency F0, and the spectral envelope, whose resonance peaks are the formants; together they make up what we are hearing.

Page 16: Introduction to Automatic Speech and Speaker Recognition

Relationship between logarithmic spectrum and cepstrum

Figure: in the logarithmic spectrum, the spectral fine structure is a fast periodic function of frequency f, while the spectral envelope is a slow periodic function of f. After an IDFT, the two components concentrate at different positions along the quefrency axis τ of the cepstrum, so they can be separated.

Page 17: Introduction to Automatic Speech and Speaker Recognition

Block diagram of cepstrum analysis for extracting the spectral envelope and fundamental period

Sampled sequence → time window → |DFT| → logarithmic transform → IDFT (cepstrum). A cepstral window (liftering) selects the low-quefrency elements, which a DFT turns back into the spectral envelope; peak extraction on the high-quefrency elements gives the fundamental period.
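A minimal NumPy sketch of this cepstrum pipeline for a single windowed frame; the 2-ms liftering cut-off is an illustrative value, not one given on the slide.

```python
import numpy as np

def cepstrum_analysis(frame, fs=16000, lifter_cutoff_ms=2.0):
    """|DFT| -> log -> IDFT gives the cepstrum; liftering separates the
    spectral envelope (low quefrency) from the fundamental period
    (peak in the high-quefrency region)."""
    log_spectrum = np.log(np.abs(np.fft.rfft(frame)) + 1e-10)
    cepstrum = np.fft.irfft(log_spectrum)
    cutoff = int(fs * lifter_cutoff_ms / 1000)          # quefrency split point
    low = np.zeros_like(cepstrum)
    low[:cutoff] = cepstrum[:cutoff]
    low[-(cutoff - 1):] = cepstrum[-(cutoff - 1):]      # cepstrum is symmetric
    envelope = np.fft.rfft(low).real                    # smoothed log spectrum
    high = cepstrum[cutoff : len(cepstrum) // 2]
    t0 = cutoff + int(np.argmax(high))                  # peak -> fundamental period
    return envelope, t0 / fs                            # envelope, period in seconds
```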

Page 18: Introduction to Automatic Speech and Speaker Recognition

Cepstrum and delta-cepstrum coefficients

Figure: along the parameter (vector) trajectory, the instantaneous vector is the cepstrum and the transitional (velocity) vector is the delta-cepstrum.
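A minimal sketch of computing the delta (velocity) cepstrum as a regression slope over neighbouring frames; the window half-width of 2 frames is a common illustrative choice rather than a value stated on the slide.

```python
import numpy as np

def delta(features, half_width=2):
    """Delta (transitional) coefficients of a (frames x dims) feature matrix,
    computed as a least-squares slope over +/- half_width neighbouring frames."""
    T = len(features)
    padded = np.pad(features, ((half_width, half_width), (0, 0)), mode="edge")
    num = sum(k * (padded[half_width + k : half_width + k + T]
                   - padded[half_width - k : half_width - k + T])
              for k in range(1, half_width + 1))
    den = 2 * sum(k * k for k in range(1, half_width + 1))
    return num / den
```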

Page 19: Introduction to Automatic Speech and Speaker Recognition

MFCC-based front-end processor

Speech → FFT → FFT-based spectrum → mel-scale triangular filters → log → DCT → cepstral coefficients, to which Δ and Δ² (delta and delta-delta) coefficients are appended to form the acoustic vector.
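A rough sketch of the mel filterbank, log and DCT stages applied to one frame's power spectrum from the earlier analysis; the filter count (24) and number of cepstral coefficients (12) are illustrative values, and the Δ and Δ² coefficients would be appended using the delta() sketch above.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(power_spectrum, fs=16000, n_filters=24, n_ceps=12):
    """MFCCs of one frame: mel-scale triangular filters applied to the
    FFT-based power spectrum, log of the filterbank energies, then a DCT."""
    n_bins = len(power_spectrum)                        # n_fft // 2 + 1
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bin_idx = np.floor(mel_to_hz(mel_points) / (fs / 2.0) * (n_bins - 1)).astype(int)
    fbank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):                          # triangular filters
        lo, mid, hi = bin_idx[i], bin_idx[i + 1], bin_idx[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_energies = np.log(fbank @ power_spectrum + 1e-10)
    return dct(log_energies, type=2, norm="ortho")[:n_ceps]
```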

Page 20: Introduction to Automatic Speech and Speaker Recognition

Structure of phoneme HMMs

Figure: each phoneme is modelled by a left-to-right HMM with three states (1, 2, 3), transition probabilities on the self-loops and forward arcs, and output probability densities b1(x), b2(x), b3(x) over the feature vectors. The models for phonemes k−1, k and k+1 are concatenated along the time axis.
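A minimal sketch of how a left-to-right HMM like this assigns a likelihood to a feature-vector sequence via the forward algorithm; the transition matrix and the output-density function are illustrative placeholders rather than the values in the figure.

```python
import numpy as np

def forward_log_likelihood(frames, log_trans, log_output):
    """log P(x_1 ... x_T | HMM) by the forward algorithm.

    log_trans[i, j]  : log transition probability from state i to state j
    log_output(j, x) : log output density b_j(x) of state j for frame x
    The model is assumed to start in state 0 and may end in any state."""
    n_states = log_trans.shape[0]
    alpha = np.full(n_states, -np.inf)
    alpha[0] = log_output(0, frames[0])
    for x in frames[1:]:
        alpha = np.array([np.logaddexp.reduce(alpha + log_trans[:, j]) + log_output(j, x)
                          for j in range(n_states)])
    return float(np.logaddexp.reduce(alpha))
```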

Page 21: Introduction to Automatic Speech and Speaker Recognition

An example of FSN (Finite State Network) grammar

Figure: a finite state network with numbered grammar states accepting word sequences such as "I WANT ONE BOOK", "I WANT THREE BOOKS" or "I NEED A NEW COAT", built over the vocabulary:
1. I  2. WANT  3. NEED  4. THREE  5. ONE  6. A  7. AN  8. BOOK  9. BOOKS  10. COAT  11. COATS  12. NEW  13. OLD

Page 22: Introduction to Automatic Speech and Speaker Recognition

Statistical language modeling

Probability of the word sequence w1^k = w1 w2 ... wk:

P(w1^k) = Π i=1..k P(wi | w1 w2 ... wi−1) = Π i=1..k P(wi | w1^(i−1))

P(wi | w1^(i−1)) = N(w1^i) / N(w1^(i−1))

where N(w1^i) is the number of occurrences of the string w1^i in the given training corpus.

Approximation by Markov processes:
Bigram model: P(wi | w1^(i−1)) = P(wi | wi−1)
Trigram model: P(wi | w1^(i−1)) = P(wi | wi−2 wi−1)

Smoothing of the trigram by the deleted interpolation method:
P(wi | wi−2 wi−1) = λ1 P(wi | wi−2 wi−1) + λ2 P(wi | wi−1) + λ3 P(wi)
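A minimal sketch of estimating these relative-frequency n-gram probabilities and combining them by deleted interpolation; the sentence markers and the fixed lambda weights are illustrative (on the slide, the lambdas would be estimated on held-out data rather than fixed).

```python
from collections import Counter

def train_interpolated_trigram(corpus, lambdas=(0.6, 0.3, 0.1)):
    """Build P(w_i | w_{i-2}, w_{i-1}) = lambda1*P_tri + lambda2*P_bi + lambda3*P_uni,
    where each component is a relative frequency N(history, w) / N(history)."""
    uni, bi, tri = Counter(), Counter(), Counter()
    uni_ctx, bi_ctx = Counter(), Counter()              # history counts
    total = 0
    for sentence in corpus:                             # corpus: list of word lists
        words = ["<s>", "<s>"] + list(sentence) + ["</s>"]
        for i in range(2, len(words)):
            uni[words[i]] += 1
            bi[(words[i - 1], words[i])] += 1
            tri[(words[i - 2], words[i - 1], words[i])] += 1
            uni_ctx[words[i - 1]] += 1
            bi_ctx[(words[i - 2], words[i - 1])] += 1
            total += 1
    lam1, lam2, lam3 = lambdas                          # trigram, bigram, unigram weights

    def prob(w, w_prev2, w_prev1):
        p_tri = tri[(w_prev2, w_prev1, w)] / bi_ctx[(w_prev2, w_prev1)] if bi_ctx[(w_prev2, w_prev1)] else 0.0
        p_bi = bi[(w_prev1, w)] / uni_ctx[w_prev1] if uni_ctx[w_prev1] else 0.0
        p_uni = uni[w] / total if total else 0.0
        return lam1 * p_tri + lam2 * p_bi + lam3 * p_uni

    return prob
```

For example, prob = train_interpolated_trigram([["this", "is", "speech"]]) followed by prob("speech", "this", "is") returns the smoothed probability of "speech" given the history "this is".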

Page 23: Introduction to Automatic Speech and Speaker Recognition

Overview of statistical speech recognition

Figure: the speech waveform passes through front-end parameterization to give the parameterized observation X. The word sequence W ("this is speech") is mapped by the pronouncing dictionary to the phone sequence "th ih s ih z s p iy ch", which selects the acoustic models; the recognizer combines the language model probability P(W) with the acoustic likelihood P(X|W).

Page 24: Introduction to Automatic Speech and Speaker Recognition

Complete Hidden Markov Model of a simple grammar

Figure: a complete HMM for a two-word grammar. From the silence model S(0), the word YES (phonemes 'YE' and 'S', states S(1) to S(6)) is entered with P(wt = YES | wt−1 = sil) = 0.2 and the word NO (phonemes 'N' and 'O', states S(7) to S(12)) with P(wt = NO | wt−1 = sil) = 0.2; after either word the model returns to silence with probability 1, i.e. P(wt = sil | wt−1 = YES) = P(wt = sil | wt−1 = NO) = 1. State-to-state transitions follow P(st | st−1), and observations Y are emitted with probabilities such as P(Y | st = s(12)).
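A minimal Viterbi sketch of finding the most likely state path through a composed HMM like this one; the transition matrix and output-probability function are illustrative placeholders, not the exact model in the figure.

```python
import numpy as np

def viterbi(frames, log_trans, log_output, start_state=0):
    """Most likely state sequence and its log score for an observation sequence.

    log_trans[i, j]  : log transition probability from state i to state j
    log_output(j, x) : log output probability of frame x in state j"""
    n_states = log_trans.shape[0]
    delta = np.full(n_states, -np.inf)
    delta[start_state] = log_output(start_state, frames[0])
    backptr = []
    for x in frames[1:]:
        scores = delta[:, None] + log_trans             # scores[i, j]
        best_prev = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.array([log_output(j, x) for j in range(n_states)])
        backptr.append(best_prev)
    path = [int(delta.argmax())]
    for best_prev in reversed(backptr):
        path.append(int(best_prev[path[-1]]))
    return list(reversed(path)), float(delta.max())
```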

Page 25: Introduction to Automatic Speech and Speaker Recognition

A unigram grammar network, where the unigram probability P(Wi) is attached as the transition probability from the starting state S to the first state of each word HMM W1, W2, ..., WN.

Page 26: Introduction to Automatic Speech and Speaker Recognition

A bigram grammar network, where the bigram probability P(wj|wi) is attached as the transition probability from word wi to word wj. Every ordered word pair is connected, including self-transitions such as P(w1|w1) and P(wN|wN).

Page 27: Introduction to Automatic Speech and Speaker Recognition

A trigram grammar network, where the trigram probability P(wk|wi, wj) is attached to the transition from grammar state (wi, wj) to the next word wk. Illustrated here is a two-word vocabulary, so there are four grammar states in the trigram network.

Figure: the four grammar states (w1, w1), (w1, w2), (w2, w1), (w2, w2) and the trigram-labelled transitions between them, e.g. P(w1|w1, w1), P(w2|w1, w1), P(w1|w2, w1) and so on.

Page 28: Introduction to Automatic Speech and Speaker Recognition

System diagram of a generic speech recognizer based on statistical models, including training and decoding processes and the main knowledge sources.

Figure: in the training part, a speech corpus with manual transcriptions feeds feature extraction and HMM training to produce the acoustic models, while a text corpus is normalized and used for N-gram estimation to produce the language model; a training lexicon links the two. In the decoding part, the speech sample Y passes through the acoustic front-end to give X, and the decoder combines P(W) from the language model, P(H|W) from the recognizer lexicon and P(X|H) from the acoustic models to output the recognized word sequence W*.

Page 29: Introduction to Automatic Speech and Speaker Recognition

HMM-based speech synthesis system

Figure: in the training part, spectral and excitation parameters are extracted from a labelled speech database and used to train context-dependent HMMs. In the synthesis part, text analysis converts the input text into labels, parameters are generated from the HMMs, and excitation generation followed by the synthesis filter produces the synthesized speech from the excitation and spectral parameters.

Page 30: Introduction to Automatic Speech and Speaker Recognition

Main causes of acoustic variation in speech

Figure: factors acting on the speech signal on its way to the speech recognition system:
• Speaker: voice quality, pitch, gender, dialect
• Speaking style: stress/emotion, speaking rate, Lombard effect
• Phonetic/prosodic context
• Task/context: man-machine dialogue, dictation, free conversation, interview
• Microphone: distortion, electrical noise, directional characteristics
• Noise: other speakers, background noise, reverberations
• Channel: distortion, noise, echoes, dropouts

Page 31: Introduction to Automatic Speech and Speaker Recognition

Progress of speech recognition technology since 1980

Figure: progress plotted against vocabulary size (2, 20, 200, 2000, 20000 words, up to unrestricted) and speaking style (isolated words, connected speech, read speech, fluent speech, spontaneous speech, natural conversation), with contours for 1980, 1990 and 2000. Example applications include voice commands, digit strings, name dialing, directory assistance, form fill by voice, word spotting, office dictation, car navigation, system-driven dialogue, two-way dialogue, transcription, and network agents with intelligent messaging.

Page 32: Introduction to Automatic Speech and Speaker Recognition

Various speech applications

Figure: core software products and applications plotted against required resources and performance: voice control, hands-free operation, voice games, personal navigation, portable devices, cell phone operation, home appliances, car information, home servers, information retrieval, contents retrieval, automated call centers, public information terminals, expert systems, multimedia contents and robots.

Page 33: Introduction to Automatic Speech and Speaker Recognition

Speaker recognition

• Speaker verification: confirm an identity claim (banking transactions, database access services, security control for confidential information)

• Speaker identification: determine which of the registered speakers produced the speech (criminal investigations)

• Text-dependent methods

• Text-independent methods

• Intersession variability (variability over time) of speech waves and spectra

• Spectral/likelihood equalization (normalization)

Page 34: Introduction to Automatic Speech and Speaker Recognition

Applications of speaker recognition technology

• Access control: for physical facilities, computer networks, websites and automated password reset services.

• Transaction authentication: for telephone banking and remote electronic and mobile purchases (e- and m-commerce).

• Law enforcement: home-parole monitoring, prison call monitoring and corroborating aural/spectral inspections of voice samples for forensic analysis.

• Speech data management: label incoming voice mail with the speaker's name for browsing and/or action; annotate recorded meetings or video with speaker labels for quick indexing and filing.

• Personalization: store and retrieve personal settings/preferences for a multi-user site or device; use speaker characteristics for directed advertisements or services.

Page 35: Introduction to Automatic Speech and Speaker Recognition

Principal structure of speaker recognition systems

Figure: the speech wave passes through feature extraction. During training, reference models/templates are built for each speaker; during recognition, a similarity/distance measure between the extracted features and the reference models yields the recognition results.

Page 36: Introduction to Automatic Speech and Speaker Recognition

Basic structure of speaker recognition systems: (a) speaker identification

Figure: the speech wave passes through feature extraction, the features are compared for similarity against the reference template or model of each registered speaker (#1, #2, ..., #N), and maximum selection over the similarities gives the identification result (speaker ID).

Page 37: Introduction to Automatic Speech and Speaker Recognition

Basic structure of speaker recognition systems (b) Speaker verification

Figure: given a claimed speaker ID (#M), the speech wave passes through feature extraction, the similarity to the reference template or model of speaker #M is computed, and a threshold-based decision yields the verification result (accept/reject).
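A minimal sketch contrasting the two decision rules, assuming a similarity(features, model) scoring function, pre-trained per-speaker reference models and a decision threshold, all of which are illustrative placeholders.

```python
def identify_speaker(features, reference_models, similarity):
    """Speaker identification: maximum selection over the similarities to
    every registered speaker's reference template or model."""
    return max(reference_models,
               key=lambda spk: similarity(features, reference_models[spk]))

def verify_speaker(features, claimed_id, reference_models, similarity, threshold):
    """Speaker verification: accept the claimed identity (#M) only if the
    similarity to that speaker's reference model exceeds the threshold."""
    return similarity(features, reference_models[claimed_id]) >= threshold
```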

Page 38: Introduction to Automatic Speech and Speaker Recognition

Past and future

• Speech recognition technology has made very significant progress in the past 50+ years with the help of computer technology.

• The majority of technological changes have been directed toward increasing the robustness of recognition.

• However, there still remain many unsolved problems.

• A much greater understanding of the human speech process is required before automatic speech recognition systems can approach human performance.

• Significant advances will come from extended knowledge processing in the framework of statistical pattern recognition.

