Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information

Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704, USA
http://www.icsi.berkeley.edu/~steveng

Ken W. Grant
Army Audiology and Speech Center, Walter Reed Army Medical Center
Washington, D.C. 20307, USA
http://www.wramc.amedd.army.mil/departments/aasc/avlab
Acknowledgements and Thanks

Technical Assistance: Takayuki Arai, Rosaria Silipo
Research Funding: U.S. National Science Foundation
BACKGROUND

What's the Big Deal with Speech Reading?

Superior recognition and intelligibility under many conditions
Provides phonetic-segment information that is potentially redundant with acoustic information (vowels)
Provides segmental information that complements acoustic information (consonants)
Directs auditory analyses to the target signal: who, where, when, what (spectral)
Audio-Visual vs. Audio-Only Recognition

[Figure: percent correct recognition (0-100%) as a function of speech-to-noise ratio (-15 to +20 dB) for six conditions: NH auditory consonants, NH audiovisual consonants, HI auditory sentences, HI auditory consonants, HI audiovisual consonants, and ASR sentences. NH = normal hearing; HI = hearing impaired.]

The visual modality provides a significant gain in speech processing
Particularly under low signal-to-noise-ratio conditions
And for hearing-impaired listeners

Figure courtesy of Ken Grant
Articulatory Information via Visual Cues

[Figure: percent information transmitted relative to total information received, by articulatory feature: place of articulation 93%; voicing, manner, and other features only 0-4% each.]

Place of articulation is the feature most reliably conveyed by visual cues

Figure courtesy of Ken Grant
Are Auditory & Visual Processing Independent?

Key issues pertaining to early versus late integration models of bi-modal information:
Most contemporary models favor late integration of information
However ... there is preliminary evidence (Sams et al., 1991) that silent speechreading can activate auditory cortex in humans (but Bernstein et al., 2002 say "nay")
The superior colliculus (an upper brainstem nucleus) may also serve as a site of bimodal integration, or at least interaction (Stein and colleagues)
Time Constraints Underlying A/V Integration

What are the temporal factors underlying integration of audio-visual information for speech processing?
Two sets of data are examined:
Spectro-temporal integration – audio-only signals
Audio-visual integration using sparse spectral cues and speechreading
In each experiment the cues (acoustic and/or visual) are desynchronized and the impact on word intelligibility is measured (for English sentences)
EXPERIMENT OVERVIEW

Spectro-temporal Integration

Time course of integration:
Within (the acoustic) modality – four narrow spectral slits; the central slits are desynchronized relative to the lateral slits
Across modalities – two acoustic slits (the lateral channels) plus speechreading video information; the video and audio streams are desynchronized relative to each other
Auditory-Visual Asynchrony – Paradigm

Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits) of the same material

[Figure: the video either leads or lags the audio by 40–400 ms; the baseline condition is synchronous A/V.]
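The desynchronization step can be sketched as follows. This is a minimal illustration, not the authors' code; the helper name, sample rate, and sample values are hypothetical.

```python
# Sketch of the A/V asynchrony manipulation: shift the audio track
# relative to the video by a signed offset (40-400 ms in the study).
# Positive offsets delay the audio, so the VIDEO LEADS; negative
# offsets advance the audio, so the AUDIO LEADS.

def desynchronize(audio, sample_rate_hz, offset_ms):
    """Return the audio shifted by offset_ms relative to the video."""
    shift = round(sample_rate_hz * offset_ms / 1000.0)
    if shift >= 0:
        # Pad the front: audio is delayed, video leads.
        return [0.0] * shift + audio
    # Trim the front and pad the back: audio is advanced, audio leads.
    return audio[-shift:] + [0.0] * (-shift)

audio = [1.0, 2.0, 3.0, 4.0]          # toy 4-sample "track" at 1 kHz
video_leads = desynchronize(audio, 1000, 2)    # video leads by 2 ms
audio_leads = desynchronize(audio, 1000, -2)   # audio leads by 2 ms
print(video_leads)
print(audio_leads)
```

The same shift applied symmetrically in both directions is what makes the asymmetric intelligibility results below so striking.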
Auditory-Visual Integration – Preview

When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals
When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as long as 200 ms
Why? Why? Why?
We’ll return to these data shortly
But first, let’s take a look at audio-alone speech intelligibility data in order to gain some perspective on the audio-visual case
The audio-alone data come from earlier studies by Greenberg and colleagues using TIMIT sentences
AUDIO-ALONE EXPERIMENTS

Audio (Alone) Spectral Slit Paradigm

9 subjects
Can listeners decode spoken sentences using just four narrow (1/3-octave) channels ("slits") distributed across the spectrum? – YES (cf. next slide)
The edge of each slit was separated from its nearest neighbor by an octave
What is the intelligibility of each slit alone and in combination with others?

[Figure: word intelligibility for slit combinations (89%, 60%, 13%) versus the four single slits (2%, 9%, 9%, 4%).]
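The slit geometry described above (1/3-octave bands whose edges are separated from their neighbors by an octave) can be checked numerically. This is a sketch assuming the standard base-2 definition of fractional-octave band edges, applied to the four center frequencies used in the audio-alone experiments:

```python
# Verify that 1/3-octave slits centered at 334, 841, 2120, and 5340 Hz
# leave roughly one octave (a frequency ratio of 2) between the upper
# edge of one slit and the lower edge of the next.

CENTER_FREQS_HZ = [334, 841, 2120, 5340]

def third_octave_edges(cf):
    """Lower and upper edges of a 1/3-octave band centered at cf (Hz)."""
    return cf * 2 ** (-1 / 6), cf * 2 ** (1 / 6)

edges = [third_octave_edges(cf) for cf in CENTER_FREQS_HZ]
for (_, hi_edge), (lo_edge, _) in zip(edges, edges[1:]):
    # A ratio near 2.0 means the inter-slit gap is one octave.
    print(f"gap = {lo_edge / hi_edge:.3f} x (octave ratio)")
```

The center frequencies are thus spaced by a ratio of about 2^(4/3), i.e., one octave of gap plus two half-bandwidths.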
Word Intelligibility - Single and Multiple Slits
[Figure: word intelligibility for single and multiple slits; slits 1–4 have center frequencies of 334, 841, 2120, and 5340 Hz.]
Word Intelligibility – Single Slits

The intelligibility associated with any single slit is only 2 to 9%
The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits
Word Intelligibility - 4 Slits
Word Intelligibility - 2 Slits
Word Intelligibility - 2 Slits
Slit Asynchrony Affects Intelligibility

Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility
The effect of asynchrony on intelligibility is relatively symmetrical
These data are from a different set of subjects than those participating in the study described earlier, hence the slightly different numbers for the baseline conditions
Intelligibility and Slit Asynchrony

Desynchronizing the two central slits relative to the lateral ones has a pronounced effect on intelligibility
Asynchrony greater than 50 ms results in intelligibility lower than baseline
AUDIO-VISUAL EXPERIMENTS
Focus on Audio-Leading-Video Conditions

When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals
These data are next compared with data from the previous slide to illustrate the similarity in the slope of the function
Comparison of A/V and Audio-Alone Data

The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition
Such similarity in the slopes associated with intelligibility for both experiments suggests that the underlying mechanisms may be similar
The intelligibility of the audio-alone signals is higher than that of the A/V signals because slits 2+3 are highly intelligible by themselves
Focus on Video-Leading-Audio Conditions

When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms
These data are rather strange, implying some form of "immunity" against intelligibility degradation when the video channel leads the audio
We'll consider a variety of interpretations in a few minutes
Auditory-Visual Integration – the Full Monty

The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from that of the audio-leading-video conditions
WHY? WHY? WHY?
There are several interpretations of these data – we'll consider them on the following slides
INTERPRETATION OF THE DATA
Possible Interpretations of the Data – 1

The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ... light travels faster than sound; the video signal therefore arrives in advance of the audio signal, and consequently the brain is far better adapted to video-leading-audio situations than to the reverse.

Some problems with this interpretation (at least by itself):
The speed of light is ca. 186,300 miles per second (effectively instantaneous)
The speed of sound is ca. 1129 feet per second (at sea level, 70° F, etc.)
Subjects in this study were wearing headphones
Therefore the time disparity between audio and visual signals was short (perhaps a few milliseconds)
(Let's put this potential interpretation aside for a few moments)
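The headphone objection is simple arithmetic. A sketch, using an assumed speed of sound of 343 m/s (the metric equivalent of the ~1129 ft/s quoted above) and treating light arrival as instantaneous, shows how the natural video lead grows with talker distance:

```python
# How far does the visual signal lead the audio at various talker
# distances?  Light is effectively instantaneous; sound is not.

SPEED_OF_SOUND_M_PER_S = 343.0  # assumed; ~1129 ft/s at sea level, 70 F

def av_disparity_ms(distance_m):
    """Milliseconds by which vision naturally leads audition."""
    return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0

for d in (1, 3, 10, 30):
    print(f"{d:>2} m  ->  video leads by {av_disparity_ms(d):.1f} ms")
```

At conversational distances of a meter or two the lead is only a few milliseconds, which is why the headphone presentation (near-zero disparity) undercuts this interpretation on its own; only at tens of meters does the natural lead approach the 80-200 ms range at issue.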
Possible Interpretations of the Data – 2

The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ... visual information is processed in the brain much more slowly than auditory information, so the video data actually arrive after the audio data in the current experimental situation. Thus, when the video channel leads the audio channel, the external asynchrony compensates for an internal (neural) asynchrony, and the auditory and visual information arrive relatively in synch with each other.

Some problems with this interpretation:
Even if we grant the premise (visual processing lagging auditory processing), this interpretation would merely imply that the intelligibility-degradation functions associated with the audio-leading and video-leading conditions should be parallel (but offset from each other)
However, the data do not correspond to this pattern
Possible Interpretations of the Data – 3

The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ... the brain has evolved under conditions where the visual signal arrives before the audio signal, but where the time disparity between the two modalities varies from situation to situation. Under such conditions the brain must be tolerant of audio-visual asynchrony, since it is so common.

Some problems with this interpretation:
If the brain were merely tolerant of audio-visual asynchrony, why would the audio-leading-video condition be so much more vulnerable to asynchronies less than 200 ms?
There must be some other factor (or set of factors) associated with this perceptual integration asymmetry. What would it (they) be?
Possible Interpretations of the Data – 4

The video-leading-audio conditions are more robust in the face of asynchrony than the audio-leading-video conditions because ... certain information in the video component of the speech signal evolves over a relatively long interval of time (e.g., 200 ms) and is thus relatively immune to asynchronous combination with information contained in the audio channel. What might this information be?

The visual component of the speech signal is most closely associated with place-of-articulation information (cf. Grant and Walden, 1996)
In the (audio) speech signal, place-of-articulation information usually evolves over two or three phonetic segments (i.e., a syllable in length)
This syllabic interval for place-of-articulation cues would suit information encoded in a modality (in this instance, visual) that exhibits a variable degree of asynchrony with the auditory modality
BUT ... the data imply that the modality arriving first determines the mode (and hence the time constant of processing) for combining information across sensory channels
VARIABILITY AMONG SUBJECTS
One Further Wrinkle to the Story ...

Perhaps the most intriguing property of the experimental results concerns the intelligibility patterns associated with individual subjects
For eight of the nine subjects, the condition associated with the highest intelligibility was one in which the video signal led the audio
The optimal asynchrony (in terms of intelligibility) varies from subject to subject, but generally lies between 80 and 120 ms
Auditory-Visual Integration – by Individual Subjects

Variation across subjects: the video signal leading is better than synchronous presentation for 8 of 9 subjects
These data are complex, but the implications are clear:
Audio-visual integration is a complicated, poorly understood process, at least with respect to speech intelligibility
SUMMARY
Audio-Video Integration – Summary

Spectrally sparse audio and speechreading information each provide minimal intelligibility when presented alone
This same information can, when combined across modalities, provide good intelligibility (63% average accuracy)
When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony
When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms
For eight of nine subjects, the highest intelligibility is associated with conditions in which the video signal leads the audio (often by 80–120 ms)
There are many potential interpretations of the data
The interpretation currently favored by the presenter posits a relatively long (200 ms) integration buffer for audio-visual integration when the brain is confronted exclusively (even for short intervals) with speechreading information (as occurs when the video signal leads the audio)
The data further suggest that place-of-articulation cues evolve over syllabic intervals of ca. 200 ms and could therefore inform models of speech processing in general
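The asymmetry the summary describes can be caricatured in a few lines. This is my own toy sketch, not the authors' model: it simply encodes the reported pattern (a ~200 ms tolerance when video leads, a steep decline otherwise), with the baseline accuracy and decline rate as assumed placeholder values.

```python
# Toy piecewise caricature of the reported A/V asynchrony asymmetry:
# intelligibility is preserved out to ~200 ms when the video leads
# (the posited integration buffer), but degrades steadily with
# asynchrony whenever the audio leads.

BASELINE_ACCURACY = 0.63   # synchronous-A/V word accuracy from the deck
BUFFER_MS = 200.0          # assumed video-lead integration buffer
DECLINE_PER_MS = 0.002     # assumed decline rate outside the buffer

def predicted_intelligibility(asynchrony_ms, video_leads):
    """Predicted word accuracy for a given asynchrony (toy model)."""
    if video_leads and asynchrony_ms <= BUFFER_MS:
        return BASELINE_ACCURACY            # within the buffer: preserved
    excess = asynchrony_ms - (BUFFER_MS if video_leads else 0.0)
    return max(0.0, BASELINE_ACCURACY - DECLINE_PER_MS * excess)

print(predicted_intelligibility(120, video_leads=True))   # preserved
print(predicted_intelligibility(120, video_leads=False))  # degraded
```

A model like this reproduces the qualitative asymmetry but not the individual-subject finding that a moderate video lead can actually beat synchrony; capturing that would require the buffer to be offset, not merely tolerant.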
That’s All
Many Thanks for Your Time and Attention