Frequency Band-Importance Functions for Auditory and Auditory-Visual Speech Recognition

Ken W. Grant
Walter Reed Army Medical Center
Washington, D.C. 20307-5001

Background

• Speech recognition involves broadband listening.

• Information is not uniformly distributed across the frequency spectrum.

– different cues (spectral and temporal) of different relative value reside at different frequencies.

– in general, more importance is placed at mid-frequencies around 1500-3000 Hz.

– probably related to place-of-articulation cues (F2/F3 transitions)

Background (continued)

• How can we determine the relative importance or “weights” that listeners place on various frequency regions?

• Doherty and Turner, 1996; Turner et al., 1998

– correlational procedure (Lutfi, 1995; Richards and Zhu, 1994) applied to speech recognition.

– partition speech into a number of spectral bands.

– perturb each band so that the amount of information in each band can be correlated with a listener’s performance.
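The correlational procedure described above can be sketched in a few lines. This is a minimal illustration, not the analysis used in the cited studies: the simulated listener, its logistic decision rule, the perturbation range, and the "true" weights are all hypothetical, and clipping negative correlations before normalizing the weights to sum to 1 is an assumption made here for simplicity.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def band_weights(perturbations, correct):
    """Correlate each band's per-trial perturbation with trial correctness (0/1).

    Assumption: negative correlations are clipped to 0 and the result is
    normalized so the weights sum to 1.
    """
    n_bands = len(perturbations[0])
    r = [pearson([trial[b] for trial in perturbations], correct)
         for b in range(n_bands)]
    r = [max(v, 0.0) for v in r]
    total = sum(r)
    if total == 0:
        return [1.0 / n_bands] * n_bands
    return [v / total for v in r]


def simulate_listener(true_weights, n_trials, seed=0):
    """Hypothetical listener: weighted sum of band perturbations -> P(correct)."""
    rng = random.Random(seed)
    perturbations, correct = [], []
    for _ in range(n_trials):
        # Independent random perturbation per band (arbitrary units, assumed range).
        trial = [rng.uniform(-10.0, 5.0) for _ in true_weights]
        evidence = sum(w * x for w, x in zip(true_weights, trial))
        # Logistic decision rule; +2.5 centers the evidence when weights sum to 1.
        p_correct = 1.0 / (1.0 + math.exp(-0.5 * (evidence + 2.5)))
        perturbations.append(trial)
        correct.append(1 if rng.random() < p_correct else 0)
    return perturbations, correct


if __name__ == "__main__":
    trials, scores = simulate_listener([0.10, 0.50, 0.25, 0.15], n_trials=5000)
    print([round(w, 2) for w in band_weights(trials, scores)])
```

Because the perturbations are drawn independently for each band, a band whose perturbation covaries more strongly with correctness is one the simulated listener relies on more, which is the logic behind reading the correlations as relative weights.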

Correlation Method for Speech

[Figure: speech spectrum partitioned into Band 1, Band 2, Band 3, and Band 4 along a frequency (Hz) axis.]

Background (continued)

• Is the relative importance of different frequency regions altered by the presence of visual speech cues?

• Past results using isolated spectral bands of speech show that low-frequency speech provides more benefit to speechreading than other spectral regions (Grant and Walden, 1996).

Background (continued)

[Figure: percent correct (0 to 100) for auditory (A) and auditory-visual (AV) presentation in each of 12 filter conditions.]

Filter conditions (Hz): 1. 250-505; 2. 645-955; 3. 1130-1515; 4. 1720-2140; 5. 2600-3255; 6. 4200-5720; 7. 250-795; 8. 1130-1930; 9. 3255-5720; 10. 250-1130; 11. 1130-2355; 12. 2600-5720

From Grant and Walden (1996). JASA, 100, 2415-2424.

Background (continued)

• Evidence from electrophysiological studies shows that visual speech cues fundamentally alter the way the auditory cortex responds to sound input (Calvert, 1977; van Wassenhove et al., 2005).

– reduction in N1-P2 amplitude.

– latency shift in N2 peak for highly visible consonants.

Visual Speech Alters Neural Processing of Auditory Speech

[Figure: responses recorded at electrode CPz.]

From van Wassenhove, Grant, and Poeppel (2005). PNAS, 102, 1181-1186.

Goals

• Determine relative importance of different frequency regions for auditory and auditory-visual speech.

• Minimize band-on-band interactions by partitioning the speech signal into widely spaced narrow bands.

Spectral Slits - Sentences

[Figure: sentence intelligibility for spectral slits 1-4, with center frequencies (CF) of 334, 841, 2120, and 5340 Hz. Values shown in the first panel: 89%, 60%, 13%. Values shown in the second panel: 2%, 9%, 9%, 4%.]

From Greenberg, Arai, and Silipo (1998). Proc. ICSLP, Sydney, Dec. 1-4.

Spectral Slits - Consonants

[Figure: consonant recognition for spectral slits 1-4, with center frequencies (CF) of 334, 841, 2120, and 5340 Hz. Values shown in the first panel: 91%, 76%, 63%. Values shown in the second panel: 21%, 22%, 48%, 50%.]

Spectral Slits - Consonants

[Figure: spectral slits 1-4, with center frequencies (CF) of 334, 841, 2120, and 5340 Hz. Values shown: 8.2%, 7.4%, 8.6%, 7.6%.]

• Individual band scores are too high for AV testing. AV scores would be at ceiling.

• Different amounts of masking noise needed for each band.

• Goal in selecting noise levels was to:

– make each band roughly equal in intelligibility.

– make the combination of all 4 bands roughly 40% intelligible.

Correlation Method for Speech

[Figure: example per-trial perturbation values, ranging from -14 to +5, applied independently to each of four bands.]

Band frequencies (Hz): 298-375, 750-945, 1890-2381, 4762-6000
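As a quick check on what "widely spaced narrow bands" means for the four bands listed above, the sketch below computes each band's width in octaves, its geometric center, and the gap to the next band. The interpretation that the bands are roughly one-third octave wide and separated by one-octave gaps is inferred here from the listed edges; the slides do not state it.

```python
import math

# Band edges in Hz, as listed on the slide.
BANDS_HZ = [(298, 375), (750, 945), (1890, 2381), (4762, 6000)]


def width_octaves(lo, hi):
    """Bandwidth expressed in octaves."""
    return math.log2(hi / lo)


def geometric_center(lo, hi):
    """Geometric mean of the band edges, in Hz."""
    return math.sqrt(lo * hi)


def gaps_octaves(bands):
    """Gap, in octaves, between each band's upper edge and the next band's lower edge."""
    return [math.log2(nxt[0] / cur[1]) for cur, nxt in zip(bands, bands[1:])]


if __name__ == "__main__":
    for lo, hi in BANDS_HZ:
        print(f"{lo}-{hi} Hz: width {width_octaves(lo, hi):.3f} oct, "
              f"center {geometric_center(lo, hi):.0f} Hz")
    print("gaps (oct):", [round(g, 3) for g in gaps_octaves(BANDS_HZ)])
```

The geometric centers fall within about 1% of the slit center frequencies given elsewhere in the slides (334, 841, 2120, and 5340 Hz).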

Band Importance (Audio Alone)

[Figure: normalized band importance (0 to 0.6) versus band number (1-4) for two audio-alone conditions, A = 44.3% and A = 70.9%.]

Band Importance (A versus AV)

[Figure: normalized band importance (0 to 0.6) versus band number (1-4). Legend labels: A = 44.3%, AV = 78.1%, and A = 70.9%.]

Discussion – Audio Alone

• Frequency-importance functions for auditory-alone conditions show that listeners consistently gave band 2 the greatest weight.

• Relative importance changed slightly when the overall intelligibility of the auditory condition was increased.

– band 2 still given the greatest weight.

– relative weights for bands 3 and 4 are swapped.

Discussion – Audiovisual

• When visual speech cues are present, listeners place more importance on low frequencies.

• Results are consistent with past studies using isolated spectral bands of speech.

– low-frequency speech provides cues for voicing, which is highly complementary with speechreading.

– mid-to-high-frequency speech provides cues for place of articulation, which is highly redundant with speechreading.

Conclusions - Questions

• For robust speech recognition, information must be extracted from many different spectral regions.

• The presence or absence of visual speech cues alters the importance of different spectral regions for the listener.

• For listening conditions where low-frequency speech cues are compromised (noise, reverberation, hearing loss), enhancement of the low frequencies of speech may be advantageous, especially in situations where visual cues are available.