Scientific Challenges for Speech Recognition Adoption
Li Deng
Microsoft Research, Redmond
February 14, 2004
AAAS, Seattle, WA
Introduction
• Human speech communication and computer speech recognition
• Historic progress of computer speech recognition
– Adoption of speech recognition requires much lower error rate
• Dominant approach: Hidden Markov Model (HMM)
• Research on approaches beyond HMM
• Example: Hidden dynamic modeling approach
• Exploiting linguistic and speech sciences with rigorous mathematical framework (dynamic system theory)
• Illustration: Phonetic reduction, a difficulty for HMM but pervasive in conversational speech
• Summary
Components in Human Speech Communication
[Diagram: speech acoustics in a closed-loop chain. SPEAKER side: message, motor/articulators. LISTENER side: ear/auditory reception, decoded message. Loop label: cognitive processing?]
Historical Progress: Speech Recognition (DARPA/NIST)
[Chart; the human recognition error rate is marked for comparison]
Dominant Approach: HMM
• Hidden Markov model (HMM) as the underlying mathematical representation for speech features: piecewise stationarity
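A minimal sketch of what piecewise stationarity looks like in practice, assuming a left-to-right HMM with one-dimensional Gaussian emissions. The function name, the state means, the shared variance, and the stay probability are illustrative assumptions, not values from the talk. Each state models a stretch of frames with a fixed mean, so decoding amounts to cutting the feature sequence into stationary pieces.

```python
import math

def viterbi_left_to_right(obs, means, var=1.0, stay_prob=0.8):
    """Most likely state path for a left-to-right HMM with 1-D Gaussian emissions."""
    n_states = len(means)
    log_stay = math.log(stay_prob)
    log_move = math.log(1.0 - stay_prob)
    neg_inf = float("-inf")

    def log_emit(x, mean):
        # Log-density of a Gaussian with the given mean and shared variance.
        return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

    # delta[s]: best log-score of any path that ends in state s at the current frame.
    delta = [neg_inf] * n_states
    delta[0] = log_emit(obs[0], means[0])  # every path starts in state 0
    backptr = []
    for x in obs[1:]:
        new_delta, ptr = [], []
        for s in range(n_states):
            stay = delta[s] + log_stay
            move = delta[s - 1] + log_move if s > 0 else neg_inf
            ptr.append(s if stay >= move else s - 1)
            new_delta.append(max(stay, move) + log_emit(x, means[s]))
        delta = new_delta
        backptr.append(ptr)

    # Trace back from the best-scoring final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

# Toy feature track with three roughly constant stretches (illustrative values).
features = [0.1, -0.2, 0.0, 5.1, 4.9, 5.2, 9.8, 10.1]
print(viterbi_left_to_right(features, means=[0.0, 5.0, 10.0]))
```

Within each decoded segment the model predicts a constant mean, which is the "flat" behaviour that makes smooth, target-directed trajectories hard to represent.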
Dominant Approach: HMM
• Many desirable properties for speech recognition
• Have led to many deployable systems
• But encounters many difficulties for casual-style speech
• Characterized by many as “missing science”
• “Flat & shallow structure” for speech
• Hard to represent deep & dynamic structure established from speech science
Approaches Beyond HMM
• Segmental model (or Acoustic Trajectory model) (since 1988)
• Hidden dynamic model (since 1998)
• Max-entropy direct model (since 2003)
• Detection-based approach (since 2003)
• Approach based on high-dimensional space representation
Speech Communication: Closed-Loop Chain
[Diagram: speech acoustics in a closed-loop chain. SPEAKER side: message, motor/articulators. LISTENER side: ear/auditory reception, decoded message. Loop label: cognitive processing?]
Human Speech Perception (decoder)
[Diagram, LISTENER side highlighted: message, motor/articulators, internal model, ear/auditory reception, decoded message]
• Convert speech acoustic waves into efficient & robust auditory representation
• This processing is largely independent of phonological units
• Involves processing stages in cochlea (ear), cochlear nucleus, SOC, IC, …, all the way to A1 cortex
• Two principal roles: 1) combat environmental acoustic distortion; 2) provide temporal landmarks to aid decoding
• Key properties: 1) critical-band freq scale, logarithmic compression; 2) adapt freq selectivity, cross-channel correlation; 3) sharp response to transient sounds (CN); 4) modulation in independent frequency bands; 5) binaural noise suppression, etc.
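The first key property, a critical-band frequency scale with logarithmic compression, can be sketched as follows. This is only an illustration: it assumes Traunmüller's (1990) closed-form approximation of the Bark scale and a plain decibel compression, and the function names and the floor value are hypothetical choices rather than anything specified in the talk.

```python
import math

def hz_to_bark(freq_hz):
    """Traunmueller's approximation of the Bark (critical-band) scale."""
    return 26.81 * freq_hz / (1960.0 + freq_hz) - 0.53

def log_compress(power, floor=1e-10):
    """Logarithmic (dB) compression of a non-negative power value.

    The floor avoids log(0) for silent frames; its value is an assumption.
    """
    return 10.0 * math.log10(max(power, floor))

# Equal steps in Hz shrink on the Bark axis as frequency rises.
for f in (250.0, 1000.0, 4000.0):
    print(f"{f:6.0f} Hz -> {hz_to_bark(f):5.2f} Bark")
```

The warped axis spends more resolution on low frequencies, and the log compression turns large power ratios into modest additive differences.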
Human Speech Production (source; encoder)
[Diagram, SPEAKER side highlighted: message, motor/articulators, speech acoustics]
Phonology (higher level):
• Symbolic encoding of linguistic message
• Discrete representation by phonological features
• Loosely-coupled multiple feature tiers
• Overcome beads-on-a-string phone model
• Theories of distinctive features, feature geometry & articulatory phonology
• Account for partial/full sound deletion/modification in casual speech
Phonetics (lower level):
• Convert discrete linguistic features to continuous acoustics
• Mediated by motor control & articulatory dynamics
• Mapping from articulatory variables to VT area function to acoustics
• Account for co-articulation and reduction (target undershoot), etc.
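As a rough illustration of the last mapping step (vocal tract area function to acoustics), the simplest textbook case is a uniform tube closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / (4L). The sketch below assumes that idealization only; the function name, the 17.5 cm tract length, and the speed of sound are illustrative assumptions, and real area functions are non-uniform and need a fuller acoustic model.

```python
def uniform_tube_resonances(length_cm, n_resonances=3, speed_of_sound_cm_s=35000.0):
    """Resonances (Hz) of a uniform tube closed at one end: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
            for k in range(1, n_resonances + 1)]

# A 17.5 cm tube is a common stand-in for a neutral adult vocal tract.
print(uniform_tube_resonances(17.5))
```

Changing the tube length shifts all resonances together, which is one reason the same phonological target yields different acoustics for different speakers.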
Phonetic Reduction
[Spectrograms: “yo-yo” (formal) vs. “yo-yo” (casual)]
$$ z_n = 2 r_s z_{n-1} - r_s^2 z_{n-2} + (1 - r_s)^2 T_s + w_n $$
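A minimal sketch of how target undershoot can arise, assuming a second-order, target-directed recursion of the form z_n = 2 r z_{n-1} - r^2 z_{n-2} + (1 - r)^2 T + w_n, where T is the target of the current segment and w_n is noise. The coefficient name r, the F2 target values, the frame counts, and the function name are all illustrative assumptions. The same targets are run with long segments (formal style) and short segments (casual style).

```python
import random

def simulate_trajectory(targets, durations, r, z0, noise_std=0.0, seed=0):
    """Iterate z_n = 2 r z_{n-1} - r^2 z_{n-2} + (1 - r)^2 T + w_n segment by segment."""
    rng = random.Random(seed)
    z_prev2 = z_prev1 = z0  # start at rest at z0
    trajectory = []
    for target, n_frames in zip(targets, durations):
        for _ in range(n_frames):
            w = rng.gauss(0.0, noise_std)  # zero when noise_std is 0
            z = 2 * r * z_prev1 - r * r * z_prev2 + (1 - r) ** 2 * target + w
            trajectory.append(z)
            z_prev2, z_prev1 = z_prev1, z
    return trajectory

# Hypothetical F2 targets (Hz) for an /i/ /a/ /i/ sequence; illustrative only.
targets = [2200.0, 1200.0, 2200.0]
formal = simulate_trajectory(targets, [40, 40, 40], r=0.85, z0=2200.0)
casual = simulate_trajectory(targets, [8, 8, 8], r=0.85, z0=2200.0)
print(f"lowest F2, formal: {min(formal):.0f} Hz")
print(f"lowest F2, casual: {min(casual):.0f} Hz")
```

With long segments the trajectory nearly reaches the /a/ target, while with short segments it turns back well before getting there. The targets themselves are unchanged; only the time available differs, which is the sense in which reduction is a dynamic effect rather than a change of phonetic category.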
Conversational Speech: Full of Reduction
Spectrogram (smoothed)
Model output as speech “synthesizer”
Residual spectrogram
Residual spectrogram (enlarged)
Scientific Evidence
• J. Acoust. Soc. Am., June 2000: “Rate and stress effects for formant dynamics” (Pitermann)
• Experimental measurements:
[Figure: measured F1 and F2 for /i/ /a/ /i/ and /i/ /e/ /i/ sequences]
• Target undershoot (reduction) causes “static” phonetic confusion (esp. unstressed)
• Results (averaged over 4600 tokens):
[Figure: F2 of /e/ and /a/ plotted against speaking rate]
• Parameter learning & decoding algorithms
• Intractable vs. tractable algorithms
• Approximations to make algorithm tractable
• Design of decoding algorithms to be consistent with the DP search
• ML vs. constrained discriminative learning (min speaking efforts)
• …..
A Happy Model Family & Associated Algorithms
[Diagram: model family tree. Node labels: model root; switching state-space model; switching hidden trajectory model; discrete-valued vs. continuous-valued hidden dynamics; 1st-order vs. 2nd-order dynamics; time-invariant vs. time-varying targets; point targets vs. distributed targets; remaining branches elided (…)]
Preliminary Results
• Highly simplified implementation of hidden dynamic model of speech
• Model-space and feature-space recognizer implementation
Relative Word Error Rate Reduction by ASR task (columns: Model-Space, Feature-Space)
Connected Digits: 13% (monophone vs. triphone HMM)
TIMIT: 5%
Switchboard (23 hr. training): 5%; 2% (strong LM); 7% (weak LM)
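For readers unfamiliar with the metric, relative word error rate reduction is the drop in WER expressed as a fraction of the baseline WER. The short sketch below shows the arithmetic; the function name and the two WER values are hypothetical and chosen only to illustrate the calculation, not taken from the experiments.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative word error rate reduction, as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer

# Hypothetical WERs (percent), used only to illustrate the arithmetic.
baseline, improved = 40.0, 38.0
print(f"{100 * relative_wer_reduction(baseline, improved):.1f}% relative reduction")
```

A 5% relative reduction therefore means very different absolute gains depending on how high the baseline error rate is.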
Stage I
Input: Distinctive-feature representation of an utterance
Mediating process: Temporal overlapping mechanism
Output: Overlapping articulatory gestures, represented by a set of discrete meta-states
Domain: Phonology
Properties: Account for partial or full sound deletion or modification for the pronunciation variation in casual speech; also account for contextual variation at the pronunciation level

Stage II
Input: Discrete articulatory state sequence
Mediating process: Symbolic-to-numerical mapping
Output: Segmental target sequence, represented as a left-to-right, constrained switching random process
Domain: Interface between phonology and phonetics
Properties: Account for compensatory articulation, or different ways of activating articulators to achieve similar acoustic effects or auditory perception; targets are used as the control signal directing the dynamic system governing speech articulation

Stage III
Input: Segmental target sequence
Mediating process: Explicit or recursive trajectory modeling
Output: Continuous, smooth, and target-directed trajectories for articulatory or vocal tract resonance variables
Domain: Phonetics
Properties: Account for variability of speech due to reduced speaking effort or increased speaking rate (phonetic reduction) and due to increased effort (e.g., Lombard effect); also account for coarticulation at the physical level due to inertia in articulation

Stage IV
Input: Articulatory or vocal tract resonance vector
Mediating process: Static, numerical-to-numerical, nonlinear mapping
Output: Acoustic or auditory feature vector computable or measurable directly from speech waveforms
Domain: Phonetics
Properties: Account for differences in different speakers’ speech production organs and the distorting effects due to acoustic environments

Four major stages in the architectural design of a novel speech recognizer
Summary
• Historical progress shows steady error rate reduction by HMM-based recognizers
• But error rates are still many times higher than those of human speech recognition
• For mainstream adoption of speech recognition, lowering current error rate is critical, esp. for natural free-style speech
• One direction: Research on incorporating speech science and reliable knowledge sources into speech recognition models and algorithms
• Much work remains to bridge speech science and technology for ultimate success in speech recognition adoption
End & Backup Slides