Page 1: Scientific Challenges for Speech Recognition Adoption Li Deng Microsoft Research, Redmond February 14, 2004 AAAS, Seattle, WA.

Scientific Challenges for Speech Recognition Adoption

Li Deng

Microsoft Research, Redmond

February 14, 2004

AAAS, Seattle, WA

Page 2:

Introduction

• Human speech communication and computer speech recognition
• Historic progress of computer speech recognition
– Adoption of speech recognition requires much lower error rate
• Dominant approach: Hidden Markov Model (HMM)
• Research on approaches beyond HMM
• Example: Hidden dynamic modeling approach
• Exploiting linguistic and speech sciences with rigorous mathematical framework (dynamic system theory)
• Illustration: Phonetic reduction, a difficulty for HMM but pervasive in conversational speech
• Summary

Page 3:

Components in Human Speech Communication

[Figure: speech acoustics in a closed-loop chain. SPEAKER: message, motor/articulators. LISTENER: ear/auditory reception, cognitive processing (?), decoded message.]

Page 4:

Historical Progress: Speech Recognition (DARPA/NIST)

[Figure: historical progress chart (DARPA/NIST), with the human recognition error rate marked]

Page 5:

Dominant Approach: HMM

• Hidden Markov model (HMM) as the underlying mathematical representation for speech features: piecewise stationarity
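To make the piecewise-stationarity assumption concrete, here is a minimal sketch of Viterbi decoding for a toy HMM with one-dimensional Gaussian emissions. Everything in it (the two states, their means and variances, the transition probabilities, and the observation values) is an illustrative assumption, not something taken from the talk. The decoded path is constant within each segment and jumps between segments, which is the "piecewise stationary" picture of speech features.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, means, variances, log_trans, log_init):
    """Most likely state sequence for a Gaussian-emission HMM (log domain)."""
    n_states = len(means)
    # delta[j]: best log score of any path that ends in state j at the current frame
    delta = [log_init[j] + log_gauss(obs[0], means[j], variances[j])
             for j in range(n_states)]
    backptr = []
    for x in obs[1:]:
        new_delta, ptrs = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] + log_trans[i][j])
            ptrs.append(best_i)
            new_delta.append(delta[best_i] + log_trans[best_i][j]
                             + log_gauss(x, means[j], variances[j]))
        delta = new_delta
        backptr.append(ptrs)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda j: delta[j])
    path = [state]
    for ptrs in reversed(backptr):
        state = ptrs[state]
        path.append(state)
    return path[::-1]

# Hypothetical toy data: features hover near 0, then near 3, then near 0 again
obs = [0.1, -0.2, 0.0, 2.9, 3.1, 3.0, 0.2, -0.1]
means, variances = [0.0, 3.0], [0.25, 0.25]
log_trans = [[math.log(0.9), math.log(0.1)],
             [math.log(0.1), math.log(0.9)]]
log_init = [math.log(0.5), math.log(0.5)]

path = viterbi(obs, means, variances, log_trans, log_init)
print(path)  # [0, 0, 0, 1, 1, 1, 0, 0]
```

Each state models its frames with a fixed distribution, so smooth movement between segments is not represented; this is the limitation the later slides on hidden dynamics address.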

Page 6:

Dominant Approach: HMM

• Many desirable properties for speech recognition
• Have led to many deployable systems
• But encounters many difficulties for casual-style speech
• Characterized by many as “missing science”
• “Flat & shallow structure” for speech
• Hard to represent deep & dynamic structure established from speech science

Page 7:

Approaches Beyond HMM

• Segmental model (or Acoustic Trajectory model) (since 1988)

• Hidden dynamic model (since 1998)

• Max-entropy direct model (since 2003)

• Detection-based approach (since 2003)

• Approach based on high-dimensional space representation

Page 8:

Speech Communication: Closed-Loop Chain

[Figure: speech acoustics in a closed-loop chain. SPEAKER: message, motor/articulators. LISTENER: ear/auditory reception, cognitive processing (?), decoded message.]

Page 9:

Human Speech Perception (decoder)

[Figure: speech chain with the LISTENER highlighted; labels: message, motor/articulators, ear/auditory reception, internal model, decoded message]

• Convert speech acoustic waves into efficient & robust auditory representation
• This processing is largely independent of phonological units
• Involves processing stages in cochlea (ear), cochlear nucleus, SOC, IC, …, all the way to A1 cortex
• Two principal roles: 1) combat environmental acoustic distortion; 2) provide temporal landmarks to aid decoding
• Key properties: 1) critical-band freq scale, logarithmic compression; 2) adapt freq selectivity, cross-channel correlation; 3) sharp response to transient sounds (CN); 4) modulation in independent frequency bands; 5) binaural noise suppression, etc.
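As a rough illustration of the first key property (critical-band frequency scale plus logarithmic compression), the sketch below maps frequency in Hz to the Bark critical-band scale and applies a log to an energy value. The slide gives no formula; the Zwicker and Terhardt approximation used here and the function names are assumptions chosen for illustration only.

```python
import math

def hz_to_bark(freq_hz):
    """Approximate critical-band rate (Bark) for a frequency in Hz.

    Uses the Zwicker & Terhardt approximation (an assumption; the talk
    does not specify a particular formula).
    """
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

def log_compress(energy, floor=1e-10):
    """Logarithmic compression of a band energy, floored to avoid log(0)."""
    return math.log(max(energy, floor))

# The same 100 Hz step covers far more of the critical-band scale at low
# frequencies than at high frequencies.
low_step = hz_to_bark(1100.0) - hz_to_bark(1000.0)
high_step = hz_to_bark(8100.0) - hz_to_bark(8000.0)
print(low_step > high_step)  # True
```

The warped scale gives fine resolution where the ear has it (low frequencies), and the log compresses the large dynamic range of band energies.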

Page 10:

Human Speech Production (source; encoder)

[Figure: speech chain with the SPEAKER highlighted; labels: message, motor/articulators, speech acoustics]

Phonology (higher level):
• Symbolic encoding of linguistic message
• Discrete representation by phonological features
• Loosely-coupled multiple feature tiers
• Overcome beads-on-a-string phone model
• Theories of distinctive features, feature geometry & articulatory phonology
• Account for partial/full sound deletion/modification in casual speech

Phonetics (lower level):
• Convert discrete linguistic features to continuous acoustics
• Mediated by motor control & articulatory dynamics
• Mapping from articulatory variables to VT area function to acoustics
• Account for co-articulation and reduction (target undershoot), etc.

Page 11:

Phonetic Reduction

yo-yo (formal) yo-yo (casual)

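$$ z_n = 2 r_s z_{n-1} - r_s^2 z_{n-2} + (1 - r_s)^2 T_s + w_n $$

(The coefficient symbol did not survive in the transcript and is written here as $r_s$; $T_s$ is the target of segment $s$ and $w_n$ is noise.)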
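The following is a minimal sketch of how target-directed dynamics can produce phonetic reduction. It assumes the slide's dynamics take the critically damped second-order form z_n = 2 r z_{n-1} - r^2 z_{n-2} + (1 - r)^2 T + w_n, with the noise term w_n dropped. The coefficient value, the F2-like target values in Hz, and the segment durations are all hypothetical choices for illustration, not values from the talk.

```python
def simulate_trajectory(targets, durations, r=0.85, z0=0.0):
    """Noise-free second-order target-directed dynamics.

    Each segment pulls the hidden variable z toward its own target; with a
    short segment, z does not have time to get there (target undershoot).
    """
    z_prev2 = z_prev1 = z0
    trajectory = []
    for target, duration in zip(targets, durations):
        for _ in range(duration):
            z = 2 * r * z_prev1 - r * r * z_prev2 + (1 - r) ** 2 * target
            trajectory.append(z)
            z_prev2, z_prev1 = z_prev1, z
    return trajectory

# Hypothetical F2-like targets (Hz) for a high-low-high vowel sequence
targets = [2000.0, 1000.0, 2000.0]

formal = simulate_trajectory(targets, [40, 40, 40], z0=2000.0)  # slow, careful
casual = simulate_trajectory(targets, [40, 8, 40], z0=2000.0)   # fast middle vowel

formal_min = min(formal)
casual_min = min(casual)
print(formal_min < casual_min)  # True: the casual version undershoots the 1000 Hz target
```

With the long middle segment the trajectory comes close to the 1000 Hz target, while the short segment stops well above it. A model that classifies frames by static values would see the undershot vowel as a different sound, which is the difficulty the slide attributes to reduction.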

Page 12:

Conversational Speech: Full of Reduction

[Figure panels: spectrogram (smoothed); model output as speech “synthesizer”; residual spectrogram; residual spectrogram (enlarged)]

Page 13:

Scientific Evidence

• J. Acoust. Soc. Am., June 2000: “Rate and stress effects for formant dynamics” (Pitermann)
• Experimental measurements:

[Figure: measured F1 and F2 points for the sequences /i/ /a/ /i/ and /i/ /e/ /i/]

• Target undershoot (reduction) causes “static” phonetic confusion (esp. unstressed)
• Results (average over 4600 tokens):

[Figure: F2 of /e/ and /a/ plotted against speaking rate]

Page 14:

A Happy Model Family & Associated Algorithms

• Parameter learning & decoding algorithms
• Intractable vs. tractable algorithms
• Approximations to make algorithm tractable
• Design of decoding algorithms to be consistent with the DP search
• ML vs. constrained discriminative learning (min speaking efforts)
• …

[Figure: model family tree. Labels: model root; switching state-space model; switching hidden trajectory model; discrete-valued vs. continuous-valued hidden dynamics; 1st-order vs. 2nd-order dynamics; time-invariant vs. time-varying targets; point vs. distributed targets]

Page 15:

Preliminary Results

Relative word error rate reduction by ASR task (columns on the slide: Model-Space, Feature-Space)

Connected Digits: 13% (monophone vs. triphone HMM)
TIMIT: 5%
Switchboard (23 hr. training): 5%; 2% (strong LM); 7% (weak LM)

• Highly simplified implementation of hidden dynamic model of speech
• Model-space and feature-space recognizer implementation
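For readers unfamiliar with the metric, the sketch below shows how a relative word error rate reduction is usually computed: the drop in WER divided by the baseline WER. The function name and the example WER values are hypothetical; the talk reports only the relative reductions, not the underlying absolute error rates.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction: (baseline - new) / baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical numbers: a drop from 40.0% to 38.0% absolute WER
reduction = relative_wer_reduction(40.0, 38.0)
print(f"{reduction:.0%}")  # 5%
```

A 5% relative reduction is therefore a much smaller absolute change than "5 points of WER", which matters when comparing results across tasks with very different baseline error rates.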

Page 16:

Stage I
Input: Distinctive-feature representation of an utterance
Mediating process: Temporal overlapping mechanism
Output: Overlapping articulatory gestures, represented by a set of discrete meta-states
Domain: Phonology
Properties: Account for partial or full sound deletion or modification for the pronunciation variation in casual speech; also account for contextual variation at the pronunciation level

Stage II
Input: Discrete articulatory state sequence
Mediating process: Symbolic-to-numerical mapping
Output: Segmental target sequence, represented as a left-to-right, constrained switching random process
Domain: Interface between phonology and phonetics
Properties: Account for compensatory articulation, or different ways of activating articulators to achieve similar acoustic effects or auditory perception; targets are used as the control signal directing the dynamic system governing speech articulation

Stage III
Input: Segmental target sequence
Mediating process: Explicit or recursive trajectory modeling
Output: Continuous, smooth, and target-directed trajectories for articulatory or vocal tract resonance variables
Domain: Phonetics
Properties: Account for variability of speech due to reduced speaking effort or increased speaking rate (phonetic reduction) and due to increased effort (e.g., Lombard effect); also account for coarticulation at the physical level due to inertia in articulation

Stage IV
Input: Articulatory or vocal tract resonance vector
Mediating process: Static, numerical-to-numerical, nonlinear mapping
Output: Acoustic or auditory feature vector computable or measurable directly from speech waveforms
Domain: Phonetics
Properties: Account for differences in different speakers’ speech production organs and the distorting effects due to acoustic environments

Four major stages in the architectural design of a novel speech recognizer

Page 17:

Summary

• Historical progress shows steady error rate reduction by HMM-based recognizers

• But the errors are many times higher than human speech recognition

• For mainstream adoption of speech recognition, lowering current error rate is critical, esp. for natural free-style speech

• One direction: Research on incorporating speech science and reliable knowledge sources into speech recognition models and algorithms

• Much work remains to bridge speech science and technology for ultimate success in speech recognition adoption

Page 18:

End &

Backup Slides

