Frequency Band-Importance Functions for Auditory and Auditory-Visual Speech Recognition
Ken W. Grant
Walter Reed Army Medical Center
Washington, D.C. 20307-5001
Background
• Speech recognition involves broadband listening.
• Information is not uniformly distributed across the frequency spectrum.
– different cues (spectral and temporal) of different relative value reside at different frequencies.
– in general, more importance is placed at mid-frequencies around 1500-3000 Hz.
– probably related to place-of-articulation cues (F2/F3 transitions).
Background (continued)
• How can we determine the relative importance or “weights” that listeners place on various frequency regions?
• Doherty and Turner, 1996; Turner et al., 1998
– correlational procedure (Lutfi, 1995; Richards and Zhu, 1994) applied to speech recognition.
– partition speech into a number of spectral bands.
– perturb each band so that the amount of information in each band can be correlated with a listener’s performance.
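The correlational procedure outlined above can be sketched roughly as follows. This is a minimal illustration, not the procedure as implemented by Doherty and Turner: the function names `pearson` and `band_weights`, the simulated listener, its "true" weights, the perturbation range, and the rule of treating negative correlations as zero weight are all assumptions made for the example. Each trial carries one perturbation value per band and a correct/incorrect outcome; the correlation between the two, computed band by band, is normalized so the weights sum to 1.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation; with a 0/1 variable this is the point-biserial r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def band_weights(band_levels, correct):
    """Normalized band weights from trial-by-trial data.

    band_levels: list of trials, each a list of per-band perturbation values
                 (assumed here to be levels in dB)
    correct:     list of 0/1 outcomes, one per trial
    """
    n_bands = len(band_levels[0])
    r = [pearson([trial[b] for trial in band_levels], correct)
         for b in range(n_bands)]
    # Assumption for this sketch: a negative correlation counts as zero weight.
    r = [max(v, 0.0) for v in r]
    total = sum(r)
    return [v / total for v in r]


# Hypothetical simulated listener who relies most on band 2.
rng = random.Random(0)
true_w = [0.1, 0.5, 0.2, 0.2]
levels, correct = [], []
for _ in range(5000):
    trial = [rng.uniform(-10.0, 10.0) for _ in true_w]
    # Weighted sum of band perturbations plus internal noise.
    decision = sum(w * s for w, s in zip(true_w, trial)) + rng.gauss(0.0, 2.0)
    levels.append(trial)
    correct.append(1 if decision > 0 else 0)

weights = band_weights(levels, correct)
print([round(w, 2) for w in weights])
```

With enough trials, the band the simulated listener relies on most should receive the largest estimated weight, which is the property the method depends on.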
Correlation Method for Speech
[Figure: speech divided into four bands (Band 1, Band 2, Band 3, Band 4); x-axis: Frequency (Hz).]
Background (continued)
• Is the relative importance of different frequency regions altered by the presence of visual speech cues?
• Past results using isolated spectral bands of speech show that low-frequency speech provides more benefit to speechreading than other spectral regions (Grant and Walden, 1996).
Background (continued)
[Figure: percent correct (0-100) for auditory (A) and auditory-visual (AV) presentation in each of 12 filter conditions.]
Filter conditions (Hz): 1. 250-505; 2. 645-955; 3. 1130-1515; 4. 1720-2140; 5. 2600-3255; 6. 4200-5720; 7. 250-795; 8. 1130-1930; 9. 3255-5720; 10. 250-1130; 11. 1130-2355; 12. 2600-5720.
From Grant and Walden (1996). JASA, 100, 2415-2424.
Background (continued)
• Evidence from electrophysiological studies shows that visual speech cues fundamentally alter the way the auditory cortex responds to sound input (Calvert, 1977; van Wassenhove et al., 2005).
– reduction in N1-P2 amplitude.
– latency shift in N2 peak for highly visible consonants.
Visual Speech Alters Neural Processing of Auditory Speech
[Figure: responses recorded at electrode CPz.]
From van Wassenhove, Grant, and Poeppel (2005). PNAS, 102, 1181-1186.
Goals
• Determine relative importance of different frequency regions for auditory and auditory-visual speech.
• Minimize band-on-band interactions by partitioning the speech signal into widely spaced narrow bands.
Spectral Slits - Sentences
[Figure: sentence intelligibility for four spectral slits (slit numbers 1-4) with center frequencies (CF) of 334, 841, 2120, and 5340 Hz. Scores shown in the first group of panels: 89%, 60%, 13%. Scores shown in the second group of panels: 2%, 9%, 9%, 4%.]
From Greenberg, Arai, and Silipo (1998). Proc. ICSLP, Sydney, Dec. 1-4.
Spectral Slits - Consonants
[Figure: consonant recognition for four spectral slits (slit numbers 1-4) with center frequencies (CF) of 334, 841, 2120, and 5340 Hz. Scores shown in the first group of panels: 91%, 76%, 63%. Scores shown in the second group of panels: 21%, 22%, 48%, 50%.]
Spectral Slits - Consonants
[Figure: four spectral slits (slit numbers 1-4) with center frequencies (CF) of 334, 841, 2120, and 5340 Hz. Scores shown: 8.2%, 7.4%, 8.6%, 7.6%.]
• Individual band scores are too high for AV testing. AV scores would be at ceiling.
• Different amounts of masking noise are needed for each band.
• Goal in selecting noise levels was to:
– make each band roughly equal in intelligibility.
– make the combination of all 4 bands roughly 40% intelligible.
Correlation Method for Speech
[Figure: illustration of the correlation method, with numeric values between -14 and +5 shown against four bands: 298-375, 750-945, 1890-2381, and 4762-6000 Hz; x-axis: Frequency (Hz).]
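As a quick arithmetic check on the "widely spaced narrow bands" design, the band edges listed on this slide can be compared with the center frequencies shown on the spectral-slit slides. The sketch below assumes that the two sets of numbers describe the same four bands and that "center" means the geometric mean of the band edges; neither assumption is stated in the slides.

```python
import math

# Band edges (Hz) from this slide; center frequencies (Hz) from the
# spectral-slit slides. Pairing them up is an assumption of this sketch.
bands = [(298, 375), (750, 945), (1890, 2381), (4762, 6000)]
slide_cf = [334, 841, 2120, 5340]

# Geometric center of each band (assumed definition of "center").
centers = [math.sqrt(lo * hi) for lo, hi in bands]
# Width of each band in octaves.
widths = [math.log2(hi / lo) for lo, hi in bands]
# Gap between the top of one band and the bottom of the next, in octaves.
gaps = [math.log2(bands[i + 1][0] / bands[i][1])
        for i in range(len(bands) - 1)]

for (lo, hi), c, w, cf in zip(bands, centers, widths, slide_cf):
    print(f"{lo}-{hi} Hz: center {c:.0f} Hz (slide CF {cf} Hz), "
          f"width {w:.2f} octave")
print("gaps between adjacent bands (octaves):", [round(g, 2) for g in gaps])
```

Under these assumptions, each band comes out about one-third of an octave wide, and adjacent bands are separated by about one octave.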
Band Importance (Audio Alone)
[Figure: normalized band importance (y-axis, 0 to 0.6) versus band number (1-4) for two audio-alone conditions, A = 44.3% and A = 70.9%.]
Band Importance (A versus AV)
[Figure: normalized band importance (y-axis, 0 to 0.6) versus band number (1-4) for A = 44.3%, A = 70.9%, and AV = 78.1%.]
Discussion – Audio Alone
• Frequency-importance functions for auditory-alone conditions show that listeners consistently weighted band 2 the greatest.
• Relative importance changed slightly when the overall intelligibility of the auditory condition was increased.
– band 2 still given the greatest weight.
– relative weights for bands 3 and 4 are swapped.
Discussion – Audiovisual
• When visual speech cues are present, listeners place more importance on low frequencies.
• Results are consistent with past studies using isolated spectral bands of speech.
– low-frequency speech provides cues for voicing, which are highly complementary with speechreading.
– mid-to-high-frequency speech provides cues for place of articulation, which are highly redundant with speechreading.
Conclusions - Questions
• For robust speech recognition, information must be extracted from many different spectral regions.
• The presence or absence of visual speech cues alters the importance of different spectral regions for the listener.
• For listening conditions where low-frequency speech cues are compromised (noise, reverberation, hearing loss), enhancement of the low frequencies of speech may be advantageous, especially in situations where visual cues are available.