Scientific Challenges for Speech Recognition Adoption

Li Deng

Microsoft Research, Redmond

February 14, 2004

AAAS, Seattle, WA

Introduction

• Human speech communication and computer speech recognition

• Historic progress of computer speech recognition

– Adoption of speech recognition requires much lower error rate

• Dominant approach: Hidden Markov Model (HMM)

• Research on approaches beyond HMM

• Example: Hidden dynamic modeling approach

• Exploiting linguistic and speech sciences with rigorous mathematical framework (dynamic system theory)

• Illustration: Phonetic reduction --- difficulty for HMM but pervasive in conversational speech

• Summary

Components in Human Speech Communication

[Figure: speech acoustics in a closed-loop chain between SPEAKER and LISTENER. Labels: message; motor/articulators; ear/auditory reception; decoded message; cognitive processing?]

Historical Progress: Speech Recognition (DARPA/NIST)

[Figure: historical progress chart; annotation: human recognition error rate.]

Dominant Approach: HMM

• Hidden Markov model (HMM) as the underlying mathematical representation for speech features: piecewise stationarity
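The piecewise-stationarity assumption can be illustrated with a minimal sketch. It assumes a left-to-right HMM with one scalar Gaussian per state; the function name, the state means, and the stay probability are hypothetical and chosen only for illustration, not taken from the talk.

```python
import random

def sample_left_to_right_hmm(means, stay_prob, std, n_frames, seed=0):
    """Sample a state path and scalar observations from a left-to-right HMM.

    Within a state, every frame is drawn from the same Gaussian, so the
    observation statistics are constant until the state changes.
    """
    rng = random.Random(seed)
    state = 0
    states, observations = [], []
    for _ in range(n_frames):
        states.append(state)
        observations.append(rng.gauss(means[state], std))
        # Either stay in the current state or advance to the next one.
        if state < len(means) - 1 and rng.random() > stay_prob:
            state += 1
    return states, observations

# Hypothetical feature means for three states.
MEANS = [300.0, 700.0, 400.0]
states, observations = sample_left_to_right_hmm(MEANS, 0.8, 20.0, 30)
for s, o in zip(states, observations):
    print(s, round(o, 1))
```

In this sketch the mean of the observations jumps at each state boundary and is flat in between, which is the sense in which the representation is piecewise stationary.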

Dominant Approach: HMM

• Many desirable properties for speech recognition

• Has led to many deployable systems

• But encounters many difficulties for casual-style speech

• Characterized by many as “missing science”

• “Flat & shallow structure” for speech

• Hard to represent deep & dynamic structure established from speech science

Approaches Beyond HMM

• Segmental model (or Acoustic Trajectory model) (since 1988)

• Hidden dynamic model (since 1998)

• Max-entropy direct model (since 2003)

• Detection-based approach (since 2003)

• Approach based on high-dimensional space representation

Speech Communication: Closed-Loop Chain

[Figure: speech acoustics in a closed-loop chain between SPEAKER and LISTENER. Labels: message; motor/articulators; ear/auditory reception; decoded message; cognitive processing?]

Human Speech Perception (decoder)

[Figure: LISTENER side of the chain. Labels: message; motor/articulators; internal model; decoded message; ear/auditory reception.]

LISTENER

• Convert speech acoustic waves into efficient & robust auditory representation

• This processing is largely independent of phonological units

• Involves processing stages in cochlea (ear), cochlear nucleus, SOC, IC, …, all the way to A1 cortex

• Two principal roles: 1) combat environmental acoustic distortion; 2) provide temporal landmarks to aid decoding

• Key properties: 1) critical-band frequency scale, logarithmic compression; 2) adaptive frequency selectivity, cross-channel correlation; 3) sharp response to transient sounds (CN); 4) modulation in independent frequency bands; 5) binaural noise suppression, etc.
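Two of the key properties above, the critical-band frequency scale and logarithmic compression, can be sketched numerically. This is only an illustration: it assumes the Zwicker-Terhardt approximation of the Bark scale and a decibel-style compression, neither of which is specified in the talk, and the function names are hypothetical.

```python
import math

def hz_to_bark(f_hz):
    """Approximate critical-band rate (Bark) for a frequency in Hz.

    Uses the Zwicker-Terhardt approximation, an assumption of this sketch.
    """
    return 13.0 * math.atan(0.00076 * f_hz) + 3.5 * math.atan((f_hz / 7500.0) ** 2)

def log_compress(power, floor=1e-10):
    """Compress a non-negative power value to decibels, with a floor to avoid log(0)."""
    return 10.0 * math.log10(max(power, floor))

# Equal steps in Hz shrink on the Bark scale as frequency increases.
for f in (250.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(f, round(hz_to_bark(f), 2))
```

The warping gives low frequencies finer resolution than high frequencies, and the compression reduces the dynamic range of the band energies.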

Human Speech Production (source; encoder)

[Figure: SPEAKER side of the chain. Labels: message; motor/articulators; speech acoustics.]

SPEAKER

Phonology (higher level):

• Symbolic encoding of linguistic message

• Discrete representation by phonological features

• Loosely-coupled multiple feature tiers

• Overcome beads-on-a-string phone model

• Theories of distinctive features, feature geometry & articulatory phonology

• Account for partial/full sound deletion/modification in casual speech

Phonetics (lower level):

• Convert discrete linguistic features to continuous acoustics

• Mediated by motor control & articulatory dynamics

• Mapping from articulatory variables to VT area function to acoustics

• Account for co-articulation and reduction (target undershoot), etc.

Phonetic Reduction

[Figure: “yo-yo” spoken in formal style and in casual style.]

$$ z_n = 2 r_s z_{n-1} - r_s^2 z_{n-2} + (1 - r_s)^2 T_s + w_n $$
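Target undershoot can be reproduced with a small simulation of a second-order, target-directed recursion of the form z_n = 2 r z_{n-1} - r^2 z_{n-2} + (1 - r)^2 T + w_n. The sketch below is noise-free (w_n = 0) and makes several assumptions: the coefficient symbol r, its value, the target values, and the segment durations are all hypothetical and are not taken from the slides.

```python
def simulate_trajectory(targets, durations, r, z0=0.0):
    """Run the noise-free recursion z_n = 2r z_{n-1} - r^2 z_{n-2} + (1-r)^2 T.

    targets   -- one target value per segment
    durations -- number of frames spent in each segment
    r         -- coefficient in (0, 1); larger values give slower movement
    """
    z_prev1 = z_prev2 = z0
    trajectory = []
    for target, n_frames in zip(targets, durations):
        for _ in range(n_frames):
            z = 2.0 * r * z_prev1 - r * r * z_prev2 + (1.0 - r) ** 2 * target
            trajectory.append(z)
            z_prev2, z_prev1 = z_prev1, z
    return trajectory

# Same target sequence, long segments ("formal") vs. short segments ("casual").
TARGETS = [0.0, 1.0, 0.0]
formal = simulate_trajectory(TARGETS, [40, 40, 40], r=0.85)
casual = simulate_trajectory(TARGETS, [8, 8, 8], r=0.85)
print("peak, formal:", max(formal))
print("peak, casual:", max(casual))
```

With long segments the trajectory has time to approach the middle target; with short segments it turns back well before reaching it, which is the undershoot behavior associated with phonetic reduction.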

Conversational Speech --- Full of Reduction

[Figure panels: spectrogram (smoothed); model output as speech “synthesizer”; residual spectrogram; residual spectrogram (enlarged).]

Scientific Evidence

• J. Acoust. Soc. Am. June, 2000: “Rate and stress effects for formant dynamics” (Pitermann)

• Experimental measurements:

[Figure: F1 and F2 measurements for the vowel sequences /i/ /a/ /i/ and /i/ /e/ /i/.]

• Target undershoot (reduction) causes “static” phonetic confusion (esp. unstressed)

• Results (average over 4600 tokens):

[Figure labels: speaking rate; F2; /e/; /a/.]

• Parameter learning & decoding algorithms

• Intractable vs. tractable algorithms

• Approximations to make algorithm tractable

• Design of decoding algorithms to be consistent with the DP search

• ML vs. constrained discriminative learning (min speaking efforts)

• …..

A Happy Model Family & Associated Algorithms

[Figure: model family tree. Labels: model root; switching state-space model; switching hidden trajectory model; discrete-valued hidden dynamics; continuous-valued hidden dynamics; 1st-order dynamics; 2nd-order dynamics; time-invariant targets; time-varying targets; point targets; distributed targets.]
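As a rough guide to what a member of this family looks like, a generic scalar switching state-space model with first-order, target-directed dynamics can be written as below. This is a textbook-style form given under stated assumptions, not necessarily the formulation used in the talk: the symbols r_s (segment-dependent coefficient), T_s (segment target), h (static nonlinear mapping), and the noise terms w_t and v_t are all introduced here for illustration.

$$
\begin{aligned}
x_t &= r_s\, x_{t-1} + (1 - r_s)\, T_s + w_t, \\
o_t &= h(x_t) + v_t .
\end{aligned}
$$

Here x_t is the continuous-valued hidden dynamic variable, o_t is the observed acoustic feature, and the discrete segment index s switches the parameters over time. Setting x_t = x_{t-1} with w_t = 0 gives x_t = T_s, so the hidden variable moves toward the target of the current segment.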

Preliminary Results

ASR Tasks: Relative Word Error Rate Reduction (columns: Model-Space, Feature-Space)

Connected Digits: 13% (monophone vs. triphone HMM)

TIMIT: 5%

Switchboard (23 hr. training): 5%; 2% (strong LM); 7% (weak LM)

• Highly simplified implementation of hidden dynamic model of speech

• Model-space and feature-space recognizer implementation
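For readers unfamiliar with the metric, relative word error rate (WER) reduction is the drop in WER divided by the baseline WER. The sketch below uses hypothetical baseline and new WER values chosen only to show the arithmetic; they are not results from the talk, and the function name is an assumption.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: a baseline WER of 40.0% lowered to 38.0%.
reduction = relative_wer_reduction(40.0, 38.0)
print(f"{reduction:.1%}")
```

A 2-point absolute drop from a 40% baseline is a 5% relative reduction, so relative figures should be read together with the baseline they refer to.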

Four major stages in the architectural design of a novel speech recognizer

Stage I

• Input: Distinctive-feature representation of an utterance

• Mediating process: Temporal overlapping mechanism

• Output: Overlapping articulatory gestures, represented by a set of discrete meta-states

• Domain: Phonology

• Properties: Account for partial or full sound deletion or modification for the pronunciation variation in casual speech; also account for contextual variation at the pronunciation level

Stage II

• Input: Discrete articulatory state sequence

• Mediating process: Symbolic-to-numerical mapping

• Output: Segmental target sequence, represented as a left-to-right, constrained switching random process

• Domain: Interface between phonology and phonetics

• Properties: Account for compensatory articulation, or different ways of activating articulators to achieve similar acoustic effects or auditory perception; targets are used as the control signal directing the dynamic system governing speech articulation

Stage III

• Input: Segmental target sequence

• Mediating process: Explicit or recursive trajectory modeling

• Output: Continuous, smooth, and target-directed trajectories for articulatory or vocal tract resonance variables

• Domain: Phonetics

• Properties: Account for variability of speech due to reduced speaking effort or increased speaking rate (phonetic reduction) and due to increased effort (e.g., Lombard effect); also account for coarticulation at the physical level due to inertia in articulation

Stage IV

• Input: Articulatory or vocal tract resonance vector

• Mediating process: Static, numerical-to-numerical, nonlinear mapping

• Output: Acoustic or auditory feature vector computable or measurable directly from speech waveforms

• Domain: Phonetics

• Properties: Account for differences in different speakers’ speech production organs and the distorting effects due to acoustic environments

Summary

• Historical progress shows steady error rate reduction by HMM-based recognizers

• But error rates remain many times higher than in human speech recognition

• For mainstream adoption of speech recognition, lowering current error rate is critical, esp. for natural free-style speech

• One direction: Research on incorporating speech science and reliable knowledge sources into speech recognition models and algorithms

• Much work remains to bridge speech science and technology for ultimate success in speech recognition adoption

End & Backup Slides
