Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information

Transcript
Page 1

Speech Intelligibility Derived from Asynchronous Processing of Auditory-Visual Information

Steven Greenberg
International Computer Science Institute
1947 Center Street, Berkeley, CA 94704, USA
http://www.icsi.berkeley.edu/~steveng
[email protected]

Ken W. Grant
Army Audiology and Speech Center
Walter Reed Army Medical Center
Washington, D.C. 20307, USA
http://www.wramc.amedd.army.mil/departments/aasc/avlab
[email protected]

Page 2

Acknowledgements and Thanks

Technical Assistance: Takayuki Arai, Rosaria Silipo

Research Funding: U.S. National Science Foundation

Page 3

BACKGROUND

Page 4

Superior recognition and intelligibility under many conditions

Provides phonetic-segment information that is potentially redundant with acoustic information – Vowels

Provides segmental information that complements acoustic information – Consonants

Directs auditory analyses to the target signal – who, where, when, what (spectral)

What’s the Big Deal with Speech Reading?

Page 5

[Figure: Percent Correct Recognition (0–100%) as a function of Speech-to-Noise Ratio (−15 to +20 dB) for six conditions: NH auditory consonants, NH audiovisual consonants, HI auditory consonants, HI audiovisual consonants, HI auditory sentences, and ASR sentences]

Audio-Visual vs. Audio-Only Recognition

NH = Normal Hearing; HI = Hearing Impaired

The visual modality provides a significant gain in speech processing

Particularly under low signal-to-noise-ratio conditions

And for hearing-impaired listeners

Figure courtesy of Ken Grant

Page 6

Articulatory Information via Visual Cues

[Figure: Percent Information Transmitted relative to Total Information Received for the features voicing, manner, place, and other. Place of articulation: 93%; the remaining features: 0–4%.]

Place of Articulation Most Important

Figure courtesy of Ken Grant
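
The percentages above reflect a feature-based information-transmission analysis of consonant confusions, in the tradition of Miller and Nicely (1955). As a minimal sketch of that computation (not the authors' analysis code; the confusion counts below are invented for illustration):

```python
import numpy as np

def transmitted_information(counts):
    """Mutual information T(x;y), in bits, from a stimulus-by-response
    confusion-count matrix (rows = stimuli, columns = responses)."""
    p = counts / counts.sum()              # joint probabilities
    px = p.sum(axis=1, keepdims=True)      # stimulus marginals
    py = p.sum(axis=0, keepdims=True)      # response marginals
    nz = p > 0                             # skip empty cells (0 log 0 = 0)
    return float(np.sum(p[nz] * np.log2(p[nz] / (px * py)[nz])))

def relative_transmission(counts):
    """Transmitted information as a fraction of the stimulus entropy."""
    px = counts.sum(axis=1) / counts.sum()
    stimulus_entropy = -np.sum(px * np.log2(px))
    return transmitted_information(counts) / stimulus_entropy

# Invented visual-only confusions for a three-way place contrast
# (bilabial / alveolar / velar): place survives speechreading well.
place = np.array([[48.0, 1.0, 1.0],
                  [2.0, 40.0, 8.0],
                  [1.0, 9.0, 40.0]])
print(f"Relative place information: {relative_transmission(place):.0%}")
```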

Page 7

Key issues pertaining to:

Early versus late integration models of bi-modal information

Most contemporary models favor late integration of information

However …. there is preliminary evidence (Sams et al., 1991) that silent speechreading can activate auditory cortex (in humans) (but Bernstein et al., 2002 say “nay”)

Superior colliculus (an upper brainstem nucleus) may also serve as a site of bimodal integration (or at least interaction; Stein and colleagues)

Are Auditory & Visual Processing Independent?

Page 8

What are the temporal factors underlying integration of audio-visual information for speech processing?

Two sets of data are examined:

Spectro-temporal integration – audio-only signals

Audio-visual integration using sparse spectral cues and speechreading

In each experiment the cues (acoustic and/or visual) are desynchronized and the impact on word intelligibility measured (for English sentences)

Time Constraints Underlying A/V Integration

Page 9

EXPERIMENT OVERVIEW

Page 10

Time course of integration:

Within (the acoustic) modality – four narrow spectral slits; the central slits desynchronized relative to the lateral slits (see the sketch following this slide)

Across modalities – two acoustic slits (the lateral channels) plus speechreading (video) information; the video and audio streams desynchronized relative to each other

Spectro-temporal Integration
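
A minimal sketch of the within-modality manipulation, assuming the slit waveforms have already been extracted (hypothetical code, not the authors' stimulus-generation software; the 16 kHz sampling rate is an assumption):

```python
import numpy as np

def shift_signal(x, shift_ms, fs=16000):
    """Delay x by shift_ms (advance it if negative), zero-padding so the
    output length matches the input."""
    n = int(round(abs(shift_ms) * fs / 1000.0))
    if n == 0:
        return x.copy()
    if shift_ms > 0:
        return np.concatenate([np.zeros(n), x[:-n]])
    return np.concatenate([x[n:], np.zeros(n)])

def desynchronized_slits(central, lateral, asynchrony_ms, fs=16000):
    """Shift the summed central slits (2+3) relative to the summed
    lateral slits (1+4), then recombine into a single waveform."""
    return shift_signal(central, asynchrony_ms, fs) + lateral
```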

Page 11

Auditory-Visual Asynchrony - Paradigm

Video of spoken (Harvard/IEEE) sentences, presented in tandem with a sparse spectral representation (low- and high-frequency slits) of the same material

Video leads: 40 – 400 ms; Audio leads: 40 – 400 ms

Baseline condition: SYNCHRONOUS A/V
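
A sketch of how the cross-modal conditions might be generated: the audio track is shifted against a fixed video timeline (illustrative only; the sampling rate and the particular step sizes are assumptions, not the study's reported values):

```python
import numpy as np

FS = 16000  # assumed audio sampling rate (Hz)

def offset_audio(audio, asynchrony_ms):
    """Positive asynchrony: video leads (audio is delayed);
    negative asynchrony: audio leads (audio is advanced)."""
    n = int(round(abs(asynchrony_ms) * FS / 1000.0))
    if asynchrony_ms >= 0:
        return np.concatenate([np.zeros(n), audio])  # video leads
    return audio[n:]                                 # audio leads

# An illustrative grid spanning the reported 40-400 ms range in both
# directions, plus the synchronous baseline.
conditions_ms = [-400, -200, -120, -40, 0, 40, 120, 200, 400]
```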

Page 12

Auditory-Visual Integration - Preview

When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals

When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as long as 200 ms

Why? Why? Why?

We’ll return to these data shortly

But first, let’s take a look at audio-alone speech intelligibility data in order to gain some perspective on the audio-visual case

The audio-alone data come from earlier studies by Greenberg and colleagues using TIMIT sentences

(The audio-visual experiment involved 9 subjects)

Page 13

AUDIO-ALONE EXPERIMENTS

Page 14

The edge of each slit was separated from its nearest neighbor by an octave

Can listeners decode spoken sentences using just four narrow (1/3 octave) channels (“slits”) distributed across the spectrum? – YES (cf. next slide)

What is the intelligibility of each slit alone and in combination with others?

Audio (Alone) Spectral Slit Paradigm

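
One plausible way to realize such slits in code (a sketch under assumed parameters, not the original signal-processing chain: Butterworth band-pass filters, a 16 kHz sampling rate, and the four center frequencies reported on the next slide):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

CENTER_FREQS_HZ = [334, 841, 2120, 5340]  # from the next slide

def third_octave_slit(x, cf, fs=16000, order=4):
    """Band-pass x to a 1/3-octave band centered at cf."""
    lo, hi = cf * 2 ** (-1 / 6), cf * 2 ** (1 / 6)  # edges: +/- 1/6 octave
    sos = butter(order, [lo, hi], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, x)

def four_slit_stimulus(x, fs=16000):
    """Sum of all four narrow slits (the audio-alone 4-slit condition)."""
    return sum(third_octave_slit(x, cf, fs) for cf in CENTER_FREQS_HZ)
```

Summing subsets of the slits (e.g., only the lateral pair) would generate the single- and two-slit conditions in the same way.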

Page 15

Word Intelligibility - Single and Multiple Slits

[Figure: word intelligibility for single slits and for slit combinations. Single slits (center frequencies 334, 841, 2120, and 5340 Hz): 2%, 9%, 9%, and 4% of words correct, respectively. Slit combinations: 89%, 60%, and 13% (presumably all four slits together, the two central slits, and the two lateral slits, respectively; cf. Page 24).]

Page 16

Word Intelligibility - Single Slits

The intelligibility associated with any single slit is only 2 to 9%. The mid-frequency slits exhibit somewhat higher intelligibility than the lateral slits.

Page 17

Word Intelligibility - 4 Slits

Page 18

Word Intelligibility - 2 Slits

Page 19

Word Intelligibility - 2 Slits

Page 20

Slit Asynchrony Affects Intelligibility

Desynchronizing the slits by more than 25 ms results in a significant decline in intelligibility

The effect of asynchrony on intelligibility is relatively symmetrical

These data are from a different set of subjects than those participating in the study described earlier - hence slightly different numbers for the baseline conditions

Page 21

Intelligibility and Slit Asynchrony

Desynchronizing the two central slits relative to the lateral ones has a pronounced effect on intelligibility

Asynchrony greater than 50 ms results in intelligibility lower than baseline

Page 22

AUDIO-VISUAL EXPERIMENTS

Page 23

Focus on Audio-Leading-Video Conditions

When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals

These data are next compared with data from the previous slide to illustrate the similarity in the slope of the function

Page 24

Comparison of A/V and Audio-Alone Data

The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition

The similarity in the slopes of the intelligibility functions in the two experiments suggests that the underlying mechanisms may be similar

The intelligibility of the audio-alone signals is higher than that of the A/V signals because slits 2+3 are highly intelligible by themselves

Page 25

When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms

These data are rather strange, implying some form of “immunity” against intelligibility degradation when the video channel leads the audio

We’ll consider a variety of interpretations in a few minutes

Focus on Video-Leading-Audio Conditions

Page 26

The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from the audio-leading-video conditions

WHY? WHY? WHY?

There are several interpretations of these data – we’ll consider several on the following slide

Auditory-Visual Integration - the Full Monty

Page 27

INTERPRETATION OF THE DATA

Pages 28–35

The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

Light travels faster than sound – therefore the video signal arrives in advance of the audio signal, and consequently the brain is adapted to dealing with video-leading-audio situations much more than vice versa

Some problems with this interpretation (at least by itself) …..

The speed of light is ca. 186,300 miles per second (effectively instantaneous)

The speed of sound is ca. 1,129 feet per second (at sea level, 70° F, etc.)

Subjects in this study were wearing headphones

Therefore the time disparity between the audio and visual signals was short (perhaps a few milliseconds)

(Let’s put this potential interpretation aside for a few moments)

Possible Interpretations of the Data – 1
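
The arithmetic behind this rebuttal is easy to check: treating light as instantaneous, the natural video lead grows linearly with talker distance (a quick back-of-the-envelope calculation using the speed of sound quoted above):

```python
SPEED_OF_SOUND_FT_PER_S = 1129.0  # at sea level, 70 deg F, as quoted above

def natural_video_lead_ms(distance_ft):
    """Milliseconds by which the visual signal precedes the acoustic
    signal for a talker at the given distance."""
    return 1000.0 * distance_ft / SPEED_OF_SOUND_FT_PER_S

for d in (3, 10, 50, 226):  # feet
    print(f"{d:4d} ft -> video leads by {natural_video_lead_ms(d):6.1f} ms")
# Even at 50 ft the natural lead is only ~44 ms; matching the 200 ms
# tolerance observed here would require a talker more than 200 ft away.
```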

Pages 36–40

The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

Visual information is processed in the brain much more slowly than auditory information; therefore the video data actually arrive after the audio data in the current experimental situation. Thus, when the video channel leads the audio channel, the asynchrony compensates for an internal (neural) asynchrony, and the auditory and visual information arrive relatively in synch with each other

Some problems with this interpretation ….

Even if we assume the validity of this assumption (visual processing lagging auditory processing), this interpretation would merely imply that the intelligibility-degradation functions associated with the audio-leading and video-leading conditions should be parallel (but offset from each other)

However, the data do not correspond to this pattern

Possible Interpretations of the Data – 2

Page 41

Auditory-Visual Integration

The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from the audio-leading-video conditions

WHY? WHY? WHY?

Pages 42–46

The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

The brain has evolved under conditions where the visual signal arrives prior to the audio signal, but where the time disparity between the two modalities varies from situation to situation. Under such conditions the brain must be tolerant of audio-visual asynchrony, since asynchrony is so ubiquitous

Some problems with this interpretation ….

If the brain were merely tolerant of audio-visual asynchrony, then why would the audio-leading-the-video condition be so much more vulnerable to asynchronies less than 200 ms?

There must be some other factor (or set of factors) associated with this perceptual integration asymmetry. What would it (they) be?

Possible Interpretations of the Data – 3

Pages 47–52

The video-leading-audio conditions are more robust in the face of asynchrony than audio-leading-video conditions because ….

There is certain information in the video component of the speech signal that evolves over a relatively long interval of time (e.g., 200 ms) and is thus relatively immune to asynchronous combination with information contained in the audio channel. What might this information be?

The visual component of the speech signal is most closely associated with place-of-articulation information (Grant and Walden, 1996)

In the (audio) speech signal, place-of-articulation information usually evolves over two or three phonetic segments (i.e., a syllable in length)

This syllable interval pertaining to place-of-articulation cues would be appropriate for information that is encoded in a modality (in this instance, visual) that exhibits a variable degree of asynchrony with the auditory modality

BUT ….. the data imply that the modality arriving first determines the mode (and hence the time constant of processing) for combining information across sensory channels

Possible Interpretations of the Data – 4

Page 53

VARIABILITY AMONG SUBJECTS

Pages 54–56

Perhaps the most intriguing property of the experimental results concerns the intelligibility patterns associated with individual subjects

For eight of the nine subjects, the condition associated with the highest intelligibility was one in which the video signal led the audio

The length of optimal asynchrony (in terms of intelligibility) varies from subject to subject, but is generally between 80 and 120 ms

One Further Wrinkle to the Story ….

Page 57

Variation across subjects

Video signal leading is better than synchronous for 8 of 9 subjects

Auditory-Visual Integration - by Individual Ss

These data are complex, but the implications are clear.

Audio-visual integration is a complicated, poorly understood process, at least with respect to speech intelligibility

Page 58

SUMMARY

Pages 59–66

Spectrally sparse audio and speechreading information provide minimal intelligibility when presented alone, in the absence of the other modality

This same information can, when combined across modalities, provide good intelligibility (63% average accuracy)

When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony

When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms

For eight out of nine subjects, the highest intelligibility is associated with conditions in which the video signal leads the audio (often by 80–120 ms)

There are many potential interpretations of the data

The interpretation currently favored by the presenter posits a relatively long (200 ms) integration buffer for audio-visual integration when the brain is confronted exclusively (even for short intervals) with speechreading information (as occurs when the video signal leads the audio)

The data further suggest that place-of-articulation cues evolve over syllabic intervals of ca. 200 ms and could therefore inform models of speech processing in general

Audio-Video Integration – Summary
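
One way to make the favored interpretation concrete is a toy asymmetric-window model: intelligibility holds near baseline while the asynchrony stays inside an integration buffer that is wide (ca. 200 ms) when the video leads and narrow when the audio leads. The sketch below is purely illustrative; none of its parameter values are fitted to the actual data:

```python
def toy_intelligibility(asynchrony_ms, baseline=0.63, floor=0.05,
                        video_lead_window_ms=200.0,
                        audio_lead_window_ms=40.0,
                        falloff_ms=200.0):
    """Toy asymmetric-buffer model of A/V integration. All parameter
    values are assumptions chosen for illustration, not fitted estimates.
    Negative asynchrony = video leads; positive = audio leads."""
    window = (video_lead_window_ms if asynchrony_ms < 0
              else audio_lead_window_ms)
    excess = max(0.0, abs(asynchrony_ms) - window)  # time past the buffer
    survival = max(0.0, 1.0 - excess / falloff_ms)  # linear decline beyond it
    return floor + (baseline - floor) * survival

for a in (-400, -200, -100, 0, 40, 120, 400):
    print(f"{a:5d} ms: predicted intelligibility {toy_intelligibility(a):.2f}")
```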

Page 67

That’s All

Many Thanks for Your Time and Attention

