HCS 7367 Speech Perception - University of Texas at Dallas

  • 1

    HCS 7367 Speech Perception

    Dr. Peter Assmann, Fall 2012

    Long-term spectrum of speech

    [Figure: long-term spectrum of speech. Sound pressure level (dB), 0 to 120, vs. frequency (Hz), 62.5 Hz to 32 kHz. Curves: absolute threshold (normal listeners), females, males; panels for connected (conversational) speech and vowels. Source citation truncated in extraction ("... 2000)").]

    Audibility of speech

    Types of hearing loss

    • Conductive loss

    • Sensorineural loss

    • Audibility/distortion

    • Effect of noise

    Source: www.brainconnection.com

    Pure-tone audiogram

    [Figure: pure-tone audiograms for the left ear and right ear, showing a normal audiogram and a conductive loss. Hearing loss in dB (ANSI-1996), -10 to 130, vs. frequency, 250 Hz to 4 kHz. Legend: bone conduction thresholds, air conduction thresholds.]

    Source: www.bcm.tmc.edu/oto/studs/aud.html

  • 2

    Pure-tone audiogram

    [Figure: pure-tone audiograms for the left ear and right ear, showing a sensorineural loss and a mixed loss. Hearing loss in dB (ANSI-1996), -10 to 130, vs. frequency, 250 Hz to 4 kHz. Legend: bone conduction thresholds, air conduction thresholds.]

    Source: www.bcm.tmc.edu/oto/studs/aud.html

    Speech audiometry

    • Nonsense syllables, real words, words in sentences

    • Threshold for recognizing 50% of test items

    • Percentage of items correctly reported

    • Speech tests provide a valid measure of hearing handicap.

    • Poor speech scores may indicate hearing loss of retrocochlear origin.

    Sensorineural hearing loss

    • Listeners with cochlear hearing loss have difficulty recognizing speech when background noise is present.

    Reduced audibility

    Supra-threshold “distortions”

      • Impaired frequency selectivity

      • Loudness recruitment

    Speech recognition in noise

    • Speech reception threshold, SRT (Plomp & Mimpen, 1969): the speech-to-noise ratio required to achieve a specific level of intelligibility, typically 50%

    • Effects of speech materials

    • Effects of type of masker (e.g., speech-shaped noise vs. a single competing talker)

    • Effects of spatial separation of target and masker
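
    As a rough illustration of the 50% criterion, the sketch below reads an SRT off a measured psychometric function by linear interpolation. The function name and the data points are hypothetical, and this is a simplification rather than the adaptive procedure used in the cited work.

```python
def srt_db(snrs_db, percent_correct, criterion=50.0):
    """Linearly interpolate the SNR at which the score crosses the criterion.

    Assumes the SNRs are in ascending order and the scores rise with SNR.
    """
    points = list(zip(snrs_db, percent_correct))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 <= criterion <= y1 and y1 != y0:
            return x0 + (criterion - y0) * (x1 - x0) / (y1 - y0)
    raise ValueError("criterion not bracketed by the data")

# Hypothetical scores (percent of items correct) at five speech-to-noise ratios.
snrs = [-12.0, -9.0, -6.0, -3.0, 0.0]
scores = [5.0, 20.0, 40.0, 70.0, 95.0]
srt = srt_db(snrs, scores)  # SNR giving 50% correct
print(f"SRT = {srt:.1f} dB")
```

    A "deficit in SRT" is then the difference between the value obtained for a hearing-impaired listener and the value for normal-hearing listeners under the same conditions.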

    Speech recognition in noise

    Masker type          | Listening situation                          | Deficit in SRT
    Speech-shaped noise  | Speech+masker in front, unaided              | 2.5 - 7.0 dB
    Speech-shaped noise  | Speech+masker in front, aided                | 2.5 - 6.0 dB
    Single talker        | Speech+masker in front, unaided              | 6.0 - 12.0 dB
    Single talker        | Speech+masker in front, aided                | 4.0 - 10.0 dB
    Single talker        | Speech+masker in front, spatially separated  | 12.0 - 19.0 dB

    Source: Moore, BCJ (2003) Speech Communication

    Articulation Index

    • How much does audibility contribute to difficulty understanding speech in noise?

    • Articulation Index (AI) estimates the contribution of audibility (and other factors) to speech intelligibility

  • 3

    Articulation Index

    1. Divides the speech and masker spectrum into a small number of frequency bands

    2. Estimates the audibility of speech in each band, weighted by its relative importance for intelligibility

    3. Derives overall intelligibility by summing the contributions of each band.
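
    The three steps can be sketched as follows. This is a minimal illustration, assuming one common simplification in which band audibility is the fraction of a 30 dB speech dynamic range (extending 15 dB above the band's average speech level) that lies above the noise. The band levels and importance weights are hypothetical, not values from the lecture.

```python
def articulation_index(speech_db, noise_db, importance,
                       dynamic_range_db=30.0, offset_db=15.0):
    """Importance-weighted sum of band audibilities, between 0 and 1."""
    total = 0.0
    for s, n, w in zip(speech_db, noise_db, importance):
        # Step 2: audibility of speech in this band, clipped to [0, 1].
        audibility = (s - n + offset_db) / dynamic_range_db
        audibility = min(1.0, max(0.0, audibility))
        # Step 3: add this band's weighted contribution.
        total += w * audibility
    return total

# Step 1: a small number of bands (hypothetical levels in dB per band).
speech = [60.0, 55.0, 50.0, 45.0]
noise = [40.0, 50.0, 55.0, 60.0]
weights = [0.2, 0.3, 0.3, 0.2]  # assumed band importances, summing to 1

ai = articulation_index(speech, noise, weights)
print(f"AI = {ai:.2f}")
```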

    Articulation Index

    • Most studies show that speech intelligibility is worse than predicted by the AI for hearing-impaired listeners, especially for moderate or severe hearing loss.

    Articulation Index

    • Conclusion: factors other than audibility must be responsible for the difficulties experienced by hearing-impaired listeners understanding speech in noise.

    • What else?

      Frequency selectivity

      Temporal resolution

    Frequency Selectivity

    • Frequency selectivity is the ability to resolve the spectral components of complex sounds.

    • Reduced frequency selectivity may lead to difficulty in understanding speech in noise.

    Auditory filters

    • Fletcher (1940) suggested that the peripheral auditory system could be modeled as a bank of linear bandpass filters with continuously overlapping center frequencies.

    [Figure: bank of overlapping bandpass filters; gain vs. frequency.]

    Auditory filters

    • Each point along the basilar membrane corresponds to a filter with a different center frequency, with center frequencies increasing roughly logarithmically from the apex to the base.

  • 4

    Auditory filters

    • About half of the length of the human basilar membrane is devoted to the lowest kHz (F1 range of speech) with the majority of neural fibers responding best to low-to-mid-frequencies.

    Critical Bandwidth

    • Fletcher (1940) band-widening experiment

    The threshold for detecting a pure tone in the presence of a bandpass noise masker increases as the noise bandwidth increases, until the width of the band exceeds the critical bandwidth of the auditory filter.

    [Figure: tone detection threshold vs. noise masker bandwidth.]

    Critical Bandwidth

    • Sources of evidence for critical bandwidth:

      Band-widening experiments (Fletcher, 1940)

      Loudness summation (Zwicker et al., 1957)

      Two-tone masking (Zwicker, 1954)

      Discrimination of partials within complex tones (Plomp and Mimpen, 1968): only the lowest 5-8 partials can be reliably discriminated.

    Critical Bandwidth

    • Fletcher (1940) made the simplifying assumption that the auditory filter could be modeled as a rectangle, with flat top and vertical slopes.

    [Figure: rectangular auditory filter of width CB; gain vs. frequency.]

    Power spectrum model of masking

    • Fletcher suggested that only a narrow band of frequencies in the region of the tone contribute to masking.

    • He called this the critical bandwidth (CB).

    [Figure: auditory filter; gain vs. frequency.]

    Power spectrum model of masking

    • But threshold changes gradually as the noise bandwidth increases, suggesting auditory filters with sloping rather than rectangular skirts (Patterson, 1976).

    [Figure: auditory filter with sloping skirts, CB marked; gain vs. frequency.]

  • 5

    Power spectrum model of masking

    • Detection of a probe tone in the presence of a noise masker depends on the relative power of the probe and the noise passed by the auditory filter centered on the tone (Patterson, 1976).

    [Figure: auditory filter centered on the tone, with the noise masker; frequency axis.]

    Power spectrum model of masking

    • Noise power is often specified as the power in a band of frequencies 1 Hz wide. This is called noise power density, designated $N_0$.

    • The total power in a band of noise is calculated as $W N_0$, where $W$ is the noise bandwidth in Hz.

    Power spectrum model of masking

    • When the noise just masks the tone, the ratio of the power in the tone, $P$, to the power in the noise is a constant, $K$:

    $P / (W N_0) = K$ and $W = P / (K N_0)$
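
    A worked example of the second relation, under stated assumptions: the masked threshold and noise spectrum level below are hypothetical, and K = 1 (tone power equal to the noise power passed by the filter at threshold) is assumed for illustration only.

```python
def critical_bandwidth_hz(tone_threshold_db, noise_spectrum_level_db, k=1.0):
    """Solve W = P / (K * N0) with P and N0 given in dB re the same reference."""
    p = 10 ** (tone_threshold_db / 10)          # tone power P (linear units)
    n0 = 10 ** (noise_spectrum_level_db / 10)   # noise power density N0 (per Hz)
    return p / (k * n0)

# Hypothetical measurement: tone just masked at 60 dB in noise of 40 dB per Hz.
w = critical_bandwidth_hz(60.0, 40.0)
print(f"Estimated bandwidth W = {w:.0f} Hz")
```

    In decibel terms the same calculation is 10 log10(W) = P(dB) - N0(dB) - 10 log10(K), so a 20 dB difference with K = 1 corresponds to a bandwidth of 100 Hz.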

    Power spectrum model of masking

    • Assumptions:

      Only frequencies within the passband of the auditory filter contribute to masking.

      Detection is based on a single auditory filter, centered on the frequency of the tone.

      Listeners ignore short-term fluctuations in the noise, and do not rely on phase differences between signal and noise.

    Notched noise method

    [Figure: tone centered in a spectral notch between low-pass (LP) and high-pass (HP) noise bands, with the auditory filter; Patterson (1976).]

    Off-frequency listening

    Tone detection can be improved by shifting the filter center frequency to maximize SNR.

    [Figure: tone, HP noise, and shifted auditory filter.]

  • 6

    Notched noise method

    • Patterson (1976) estimated auditory filter shapes from the function relating tone threshold to notch width.

    • The derived filters have a rounded top and steep skirts, with bandwidths 10-15% of filter center frequency.

    [Figure: derived auditory filter shape. Relative amplitude (dB), -50 to 0, vs. frequency, ticks 400 to 1600 (axis labelled kHz in the source).]

    Simulation of reduced frequency selectivity

    [Figure: derived auditory filter shapes, normal and impaired (3 × normal bandwidth). Relative amplitude (dB), -50 to 0, vs. frequency, ticks 400 to 1600 (axis labelled kHz in the source).]

    [Figure: frequency response of gammatone filter bank. Filter gain (dB), -60 to 0, vs. frequency, 0 to 5 kHz; labelled filter with Fc = 1094 Hz, ERB = 143 Hz.]

    Auditory filter shapes as a function of frequency

    Auditory filter shapes as a function of level

    [Figure: output level (dB), 0 to 90, vs. frequency, 0 to 2000 Hz.]

    Equivalent Rectangular Bandwidth

    • The equivalent rectangular bandwidth (or ERB) of a filter is the bandwidth of a rectangular filter which has the same power output as that filter, when the input is white noise.

    [Figure: auditory filter shape with its ERB marked. Relative amplitude (dB), -50 to 0, vs. frequency, ticks 400 to 1600 (axis labelled kHz in the source).]
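
    The definition can be checked numerically: for a rectangular filter with the same peak gain, equal power output for white noise means the ERB is the area under the power response divided by its peak. The sketch below assumes a symmetric rounded-exponential, roex(p), filter shape as a stand-in for the derived filter; the shape, the slope parameter p = 30, and the function names are illustrative assumptions.

```python
import math

def roex_power_gain(f_hz, fc_hz, p):
    """Assumed roex(p) weighting: W(g) = (1 + p*g) * exp(-p*g), g = |f - fc| / fc."""
    g = abs(f_hz - fc_hz) / fc_hz
    return (1.0 + p * g) * math.exp(-p * g)

def erb_hz(power_gain, f_lo, f_hi, step_hz=0.5):
    """Area under the power response divided by its peak (rectangle-rule sum)."""
    n = int((f_hi - f_lo) / step_hz)
    gains = [power_gain(f_lo + i * step_hz) for i in range(n + 1)]
    return sum(gains) * step_hz / max(gains)

fc, p = 1000.0, 30.0
numeric = erb_hz(lambda f: roex_power_gain(f, fc, p), 0.0, 3000.0)
analytic = 4.0 * fc / p  # closed-form ERB of the roex(p) shape
print(f"numeric ERB = {numeric:.1f} Hz, analytic ERB = {analytic:.1f} Hz")
```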

    Equivalent Rectangular Bandwidth

    • ERB, the equivalent rectangular bandwidth of the estimated auditory filter, is about 10-15% of the filter center frequency.

    [Figure: ERB vs. center frequency on logarithmic axes; ERB ticks 0.02 to 2, center-frequency ticks 0.05 to 10 (axes labelled Hz in the source).]
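
    The slides do not give a formula for the ERB. One widely used approximation for normal-hearing listeners is that of Glasberg and Moore (1990), ERB = 24.7 (4.37 F + 1) with F in kHz. The sketch below assumes that formula; it gives about 143 Hz at 1094 Hz, in line with the gammatone filter label (Fc = 1094 Hz, ERB = 143 Hz).

```python
def erb_glasberg_moore_hz(fc_hz):
    """ERB in Hz at centre frequency fc_hz (assumed Glasberg & Moore, 1990, formula)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

# ERB as a percentage of centre frequency at a few example frequencies.
for fc in (500, 1000, 2000, 4000):
    erb = erb_glasberg_moore_hz(fc)
    print(f"{fc} Hz: ERB = {erb:.0f} Hz ({100 * erb / fc:.1f}% of fc)")
```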

  • 7

    Cochlear frequency-place map

    • Greenwood (1961) developed a function to relate the characteristic frequency (CF) at each place on the cochlea to the distance (x) of that place from the apex.
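
    The slide does not state the function itself. Its usual form is F = A (10^(a x) - k). The sketch below assumes the commonly quoted human constants (A = 165.4, a = 2.1, k = 0.88, with x expressed as a proportion of basilar membrane length from the apex); these values are not taken from the lecture.

```python
def greenwood_cf_hz(x, A=165.4, a=2.1, k=0.88):
    """Characteristic frequency (Hz) at proportional distance x from the apex.

    x = 0 is the apex and x = 1 the base; constants are assumed human values.
    """
    return A * (10 ** (a * x) - k)

# CF at the apex, the midpoint, and the base of the cochlea.
for x in (0.0, 0.5, 1.0):
    print(f"x = {x:.1f}: CF = {greenwood_cf_hz(x):.0f} Hz")
```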

    ERB Scale

    • One ERB unit corresponds to a distance of about 0.89 mm along the basilar membrane.

    [Figure: human data.]

    ERB-rate scale

    • The ERB-rate scale is a warped frequency scale modeling changes in the ERB of the auditory filter as a function of frequency.

    [Figure: ERB-rate as a function of frequency. ERB-rate (ERB), 0 to 30, vs. frequency, 0 to 4000 Hz.]
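
    A minimal sketch of the warping, assuming the Glasberg and Moore (1990) expression for ERB-rate, 21.4 log10(4.37 F + 1) with F in kHz. The formula is not given in the slides, but its values fall within the 0 to 30 range shown for 0 to 4000 Hz.

```python
import math

def erb_rate(f_hz):
    """Number of ERBs below frequency f_hz (assumed Glasberg & Moore, 1990, formula)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

# Equal steps in Hz become progressively smaller steps in ERB-rate.
for f in (0, 1000, 2000, 3000, 4000):
    print(f"{f} Hz -> {erb_rate(f):.1f} ERB")
```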

    Excitation patterns

    • Auditory excitation patterns show the composite output of a bank of simulated auditory filters as a function of filter center frequency.

    [Figure: filter output vs. filter center frequency.]

    Excitation patterns

    • Excitation patterns provide a good model of auditory frequency selectivity and masking: frequency components that are resolved by the auditory system produce distinct peaks in the excitation pattern.

    Excitation patterns

    [Diagram: excitation-pattern model with stages labelled outer and middle ears, cochlear filtering, energy detectors, and CNS; frequency axis in ERB-rate.]

  • 8

    Excitation patterns

    [Figure: excitation pattern of a 500 Hz pure tone, two panels. Amplitude (dB), -60 to 0, vs. frequency, 0.2 to 4.0 kHz.]

    Excitation patterns

    Complex tone, equal amplitude harmonics

    [Figure: excitation pattern of the complex tone. Amplitude (dB), -60 to 0, vs. frequency, 0.2 to 4.0 kHz.]

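    Excitation patterns

    Vowel: /æ/, F0 = 200 Hz

    [Figure: excitation pattern of the vowel, with labelled harmonics at 200, 400, 600, and 800 Hz, F2 at 1450 Hz, and F3 at 2450 Hz.]

    [Figure: auditory filterbank spectrogram. Frequency, ticks 0.1 to 2.0 (axis labelled Hz in the source), vs. time, 0 to 700 ms.]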

    Simulation studies

    • Simulation of reduced frequency selectivity (spectral smearing of the short-term speech spectrum) results in lowered intelligibility for listeners with normal hearing, particularly in noise (ter Keurs et al., 1993; Baer & Moore, 1994)

    Simulation of reduced frequency selectivity

    [Figure: derived auditory filter shapes, normal and impaired (3 × normal bandwidth). Relative amplitude (dB), -50 to 0, vs. frequency, ticks 400 to 1600 (axis labelled kHz in the source).]

  • 9

    Effects of reduced frequency selectivity on vowel /æ/

    [Figure: excitation patterns of the vowel (F0 = 200 Hz) for auditory filters 1 ×, 2 ×, and 3 × normal bandwidth. Amplitude (dB), -60 to 0, vs. frequency, 0.2 to 4.0 kHz. Labelled harmonics at 200, 400, 600, and 800 Hz, F2 at 1450 Hz, and F3 at 2450 Hz.]

    Distortion of spectral shape

    • Broader auditory filters produce a “smeared” excitation pattern: reduced prominence of peaks, smaller peak-to-valley ratios.

    • Introduction of noise fills up the valleys between the spectral peaks and reduces the distinctiveness of the spectral profile.
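
    The smearing effect can be illustrated with a small excitation-pattern calculation. This is a sketch under several assumptions not taken from the slides: symmetric roex(p) filters on a linear frequency axis, normal ERBs from the Glasberg and Moore (1990) formula, and a hypothetical complex tone with ten equal-amplitude harmonics of 200 Hz. Broadening the filters by a factor of 3 shrinks the peak-to-valley ratio around the second harmonic.

```python
import math

def roex_gain(f_hz, fc_hz, broadening=1.0):
    """Power gain of an assumed roex(p) filter whose ERB is broadening x normal."""
    erb = broadening * 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)
    p = 4.0 * fc_hz / erb
    g = abs(f_hz - fc_hz) / fc_hz
    return (1.0 + p * g) * math.exp(-p * g)

def excitation_db(fc_hz, components, broadening=1.0):
    """Output level (dB) of the filter centred on fc_hz for (freq, power) components."""
    power = sum(pw * roex_gain(f, fc_hz, broadening) for f, pw in components)
    return 10.0 * math.log10(power)

# Hypothetical stimulus: harmonics 1 to 10 of 200 Hz, unit power each.
harmonics = [(200.0 * n, 1.0) for n in range(1, 11)]

def peak_to_valley_db(broadening):
    peak = excitation_db(400.0, harmonics, broadening)    # on the 2nd harmonic
    valley = excitation_db(500.0, harmonics, broadening)  # midway to the 3rd
    return peak - valley

for b in (1.0, 2.0, 3.0):
    print(f"{b:.0f} x normal: peak-to-valley = {peak_to_valley_db(b):.1f} dB")
```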

    Distortion of temporal structure

    • Broader auditory filters alter the temporal fine structure of the output.

      • Increased contribution of adjacent components

      • Increase in within-channel modulation

      • Diminished differences between adjacent channels

    Effects of reduced frequency selectivity on temporal structure

    [Figure: filterbank output waveforms for normal (× 1) and broadened (normal × 3) auditory filters. Filter center frequency (Hz), 0 to 4515 in steps of about 301 Hz, vs. time, 0 to 300.]

    Loudness Recruitment

    • When a sound is increased in level above absolute threshold, the rate of growth of loudness is greater than normal.

    • At levels >90-100 dB SPL, loudness returns to normal (sound appears equally loud to hearing-impaired and normal listeners).

    Loudness Recruitment

    • Loudness recruitment is associated with reduced dynamic range (range between absolute threshold and highest comfortable level).

    • Recruitment may reduce the ability to “listen in the dips” in a fluctuating masker, such as a competing voice.

    • Recruitment distorts loudness relationships among components of speech sounds.

  • 10

    A glimpsing model of speech perception in noise

    Martin Cooke

    Journal of the Acoustical Society of America, Vol. 119, No. 3, pp. 1562–1573, March 2006

    “Glimpsing” speech in noise

    • “speech is a highly modulated signal in time and frequency, regions of high energy are typically sparsely distributed.”

    [Figure: auditory spectrogram. Frequency, ticks 0.1 to 2.0 (axis labelled Hz in the source), vs. time, 0 to 700 ms.]

    “Glimpsing” speech in noise

    • “The information conveyed by the spectro-temporal energy spectrum of clean speech is redundant… Redundancy allows speech to be identified based on relatively sparse evidence.”

    [Figure: auditory spectrogram. Frequency, ticks 0.1 to 2.0 (axis labelled Hz in the source), vs. time, 0 to 700 ms.]

    Glimpsing speech in noise

    • Can listeners take advantage of “glimpses”?

      Direct attention to spectrotemporal regions where the S+N mixture is dominated by the target speech

      ASR system trained to recognize consonants in noise

      Maskers differed in “glimpse size”

      ASR model developed to exploit non-uniform distribution of SNR in different time-frequency bands

      Conclusion: model + listeners benefit from glimpsing.

    Speech + noise mixtures

    • Some regions dominated by target voice

    • Local SNR varies across time and frequency

    • Where the target voice dominates, the problem of source segregation is solved because the signal is effectively “clean” speech.

    • Clean speech is highly redundant; it remains intelligible after 50% or more of its energy is removed by gating and/or filtering.

    STEP model

    • Auditory excitation pattern (Moore, 2003)

      Spectrogram-like representation

      Reflects non-uniform frequency selectivity in different frequency bands

      Incorporates a sliding time window reflecting temporal analysis by the auditory system

      Relative audibility at different frequencies

      Loudness model

  • 11

    Missing data ASR

    • HMM-based speech recognizer

    • “Missing-data” models

      Glimpses only

        • Ignore missing information (in masked regions)

      Glimpses-plus-background

        • Try to fill in missing information (based on masked regions)

    Sparseness and redundancy

    • Glimpses = spectrotemporal regions where signal exceeds masker by ~3 dB.

    [Figure: panels showing the target, the masker (single-talker masker, eight-talker masker, speech-shaped noise), and the resulting glimpses.]
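
    The glimpse criterion can be sketched as a comparison of local target and masker levels in each time-frequency cell. The small arrays below are hypothetical levels in dB (rows are frequency channels, columns are time frames), and the function name is illustrative rather than taken from the paper.

```python
def glimpse_proportion(target_db, masker_db, threshold_db=3.0):
    """Fraction of time-frequency cells whose local SNR exceeds threshold_db."""
    total = 0
    glimpsed = 0
    for t_row, m_row in zip(target_db, masker_db):
        for t, m in zip(t_row, m_row):
            total += 1
            if t - m > threshold_db:  # local SNR in dB for this cell
                glimpsed += 1
    return glimpsed / total

# Hypothetical 2-channel x 3-frame levels for target speech and masker.
target = [[60.0, 40.0, 55.0],
          [30.0, 62.0, 45.0]]
masker = [[50.0, 50.0, 50.0],
          [50.0, 50.0, 50.0]]

proportion = glimpse_proportion(target, masker)
print(f"Proportion of cells glimpsed: {proportion:.2f}")
```

    Lowering threshold_db admits more cells as glimpses, at the cost of including cells where the masker contributes more of the energy.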

    Syllable identification accuracy as a function of the number of competing voices. The level of the target speech (monosyllabic nonsense words) was held constant at 95 dB. (After Miller 1947).

    Results

    FIG. 4. The correlation between intelligibility and proportion of the target speech in which the local SNR exceeds 3 dB. Each point represents a noise condition, and proportions are means across all tokens in the test set. The best linear fit is also shown. The correlation between listeners and these putative glimpses is 0.955.

  • 12

    Conclusions

    • Best model:

      Uses information in glimpses and counterevidence in the masked regions

      Glimpses constrained to a minimum area

      Treats all regions with local SNR > -5 dB as potential glimpses

    Conclusions

    • A higher “glimpse threshold” (e.g. local SNR > 0 dB) produces fewer glimpses, but this provides less distorted information than a lower threshold (e.g. -5 dB).

    Conclusions

    • Limitation: local SNR must be known in advance. Is there a way to estimate the local SNR directly from the mixture?

    • Tracking problem: how to integrate glimpses over time?

    Brungart et al. (2001)

    [Figure: 2-talker correct responses (%) vs. target-to-masker ratio (dB), -12 to 12. Conditions: same talker; different talker, same sex; different talker, different sex; modulated noise.]

