Liverpool University
The Department
• Centre for Cognitive Neuroscience, Department of Psychology, Liverpool University
• Overall Aim: Understanding Human Information Processing
Expertise
• Auditory Scene Analysis (ASA)
  – Perception experiments
  – Modelling
• Speech Perception
• Audio-Visual Integration
  – Models of AV information fusion
  – Applying these models to ASA
Work at Liverpool
Task 1.3: Active/passive speech perception
Research Question: Do human listeners actively predict the time course of background noise to aid speech recognition?
Current state: Perceptual evidence for ‘predictive scene analysis’ (Elvira Perez will explain all)
Planned work: Database of environmental noise to test computational models
Work at Liverpool
Task 1.4: Envelope Information & Binaural Processing (Patti Adank, Aug. 03-July 04)
Research Question: What features do listeners use to track a target speaker in the presence of competing signals?
Current state: Tested the hypothesis that ‘jitter’ is a stream segregation or stream formation cue; tech report finalised in July 04
Work at Liverpool
Task 2.2: Reliability of auditory cues in multi-cue scenarios
Research Question: How are cues perceptually integrated? Combination of experimentation and modelling.
Current state: Experimental data and models on audio-visual motion signal integration (non-HOARSE)
Ongoing work: MLE models for speech feature integration (Elvira)
Planned work: Collaboration with Patras (John Worley) on location and pitch segregation cue integration
Work at Liverpool
Task 4.1: Informing speech recognition
Research Question: How can data derived from perception experiments be applied to machine learning?
Current state: Just starting to ‘predict’ environmental noises (using Aurora noises); recording a database of natural scenes for analysis and modelling (with Sheffield)
… over to Elvira
Environmental Noise
• Two-pronged approach
  – Elvira: is there perceptual evidence for active noise modelling in listeners?
  – Georg (+ Sheffield): noise modelling based on a database
Baseline Data
• Typical noise databases are not very representative
  – Size severely limited (e.g. Aurora)
  – Unrealistic scenarios (fighter jets, foundries)
• Database of environmental noise
  – Transport noises: A320-200, ICE, Saab 9-3, …
  – Social places: departure lounges, hotel lobby, pub
  – Private journeys: urban walk, country walk
  – Buildings: offices, corridors
  – …
• Aim is to have about 10-20 mins of representative data for typical situations.
Recordings
• Soundman OKMII binaural microphones
• Sony D3 DAT recorder
• 48kHz stereo recordings
• Digital transfer to PC
Analysis
• Previous work
  – Auditory filterbank (linear, Mel scale, 32 channels)
  – Linear prediction (within-channel case sketched below)
    • Within channels
    • Across channels
• Planned work
  – Auditory filterbank
  – Non-linear prediction using neural networks
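As a concrete illustration of the within-channel prediction step, here is a minimal Python sketch. It assumes a pre-computed [channels x frames] envelope matrix from a mel-spaced filterbank and fits an ordinary least-squares linear predictor per channel; the function and variable names are illustrative, not the project's actual code.

```python
# Minimal sketch of within-channel linear prediction of noise envelopes.
# Assumes a filterbank has already been applied; `env` is a stand-in
# [channels x frames] envelope matrix (names are illustrative only).
import numpy as np

def lp_predict_channel(x, order=8):
    """Least-squares linear prediction of x[t] from its `order` past samples."""
    X = np.stack([x[i:len(x) - order + i] for i in range(order)], axis=1)
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    pred = X @ coeffs
    return pred, y - pred            # prediction and prediction error

rng = np.random.default_rng(0)
env = np.abs(rng.standard_normal((32, 1000)))    # 32-channel stand-in envelopes
err_var = [lp_predict_channel(ch)[1].var() for ch in env]
print("mean within-channel prediction error variance:", float(np.mean(err_var)))
```

The across-channel and non-linear (neural network) variants would replace the per-channel least-squares fit with a joint or non-linear predictor over the same envelope representation.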
Using Envelope Information for ASA (Patti Adank)
• Background
  – Brungart & Darwin (resource allocation task?): two simultaneous sentences, track one
• Segregation benefits from
  – Pitch differences
  – Speaker differences
• Key question: operational definition of speaker characteristics
Speaker characteristics
• Vocal tract shape
  – Difficult to quantify / extract computationally
• Speaking style (intonation, stress, accent, …)
  – Difficult to extract measures for very short segments
• Voice characteristics
  – F0 (of course…)
  – Shimmer (amplitude modulation)
  – Jitter (roughness; random GCI variation)
  – Breathiness (open quotient during voiced speech)
• All relatively easy to extract computationally (jitter is sketched below)
• All relatively easy to control in speech re-synthesis
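Since the slide claims these measures are relatively easy to extract, here is a minimal sketch of one of them, local jitter, computed from a sequence of glottal cycle durations; the helper is illustrative, not the analysis code used in the project.

```python
# Local jitter: mean absolute difference between consecutive glottal periods,
# relative to the mean period (expressed here as a fraction).
import numpy as np

def local_jitter(periods_s):
    periods = np.asarray(periods_s, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

# Illustrative example: a ~100 Hz voice with small random cycle-length variation
rng = np.random.default_rng(1)
periods = 0.010 + rng.normal(0.0, 1e-4, size=200)
print(f"jitter = {100 * local_jitter(periods):.2f} %")
```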
No. 1 choice: Jitter
• Dan Ellis
  – Computational model of segregation by glottal closure instant
  – Model groups coincident energy in an auditory filterbank
• Could ‘jitter’ be useful for segregation?
Jitter as a primary segregation cue
• Double vowel experiment:
  – 5 synthetic vowels (Assmann & Summerfield)
  – Synthesized with a range of
    • 5 pitch levels
    • 5 jitter levels
• Results
  – Pitch difference aids segregation
  – Jitter difference does not
[Figure: mean ± 1 SE percent correct as a function of % jitter (baseline 0%, 0.5%, 1%, 2%, 4%)]
Jitter analogous to location cues
• Location cues are not primary segregation cues
  – Segregate on pitch first, then
  – Use location cues for stream formation
• Experiment
  – Brungart & Darwin (e.g. 2001) task: e.g. “Ready Tiger go to White One now” and “Ready Arrow go to Red Four now”, but
  – Speech resynthesized using Praat
    • Same speaker, different sentences
  – Jitter does not aid stream formation
[Figure: mean ± 1 SE percent correct colour/number combination as a function of % jitter (0%, 3%, 6%, 9%, 12%, 15%)]
Informing Speech Recognition
• Jitter is not the No. 1 candidate for informing speech recognition…
Task 2.2 Reliability of auditory cues in multi-cue scenarios.
• Ernst & Banks (Nature 2002)– Maximum likelihood estimation good model
for visual/somatosensory cue integration
– Adapted this for AV integration: mouse catching experiment: MLE good modelHofbauer et al., JPP: HPP 2004
– Want to look at speech cue integration in collaboration with Sheffield
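For reference, a minimal sketch of Ernst & Banks style maximum-likelihood integration: with independent Gaussian cue noise, the optimal combined estimate weights each cue by its inverse variance. The numbers below are hypothetical.

```python
# Reliability-weighted (maximum-likelihood) cue combination under the usual
# independent-Gaussian-noise assumptions.
import numpy as np

def mle_combine(estimates, variances):
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = (1.0 / variances) / np.sum(1.0 / variances)   # inverse-variance weights
    combined = float(np.sum(weights * estimates))            # fused estimate
    combined_var = float(1.0 / np.sum(1.0 / variances))      # variance of fused estimate
    return combined, combined_var, weights

# Hypothetical auditory and visual estimates of the same event
est, var, w = mle_combine(estimates=[10.0, 12.0], variances=[1.0, 4.0])
print(est, var, w)   # the more reliable cue dominates the combined estimate
```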
Hypothesis
• If listeners organise formants by continuity, then
– the /o/ should lead to /m/ , while
– the /e/ should lead to /n/, with the secondformant of the nasal remaining unassigned
• if proximity is a cue then there should bea changeover at around 1400 Hz
[Figure: schematic stimulus with formants at 375, 800, 2000 and 2700 Hz; time axis 100 and 200 ms]
Formants as a representation?
If sequential grouping of formants explains the perceptual change from /m/ to /n/ for high vowel F2s, then transitions should ‘undo’ this change.
[Schematic: formant frequency vs. time]
Transitions in /vm/ syllables
• Synthetic /v-m/ segments as before, but 0, 2.5, 5, 10, 20ms formant transitions
• 7 fluent German speakers, 200 trials each
• Experimental results fit the prediction
[Figure: proportion heard as /em/ as a function of transition duration (0-20 ms)]
Transitions ??
• ‘Formant transitions’ of 5 ms have an effect
• Synthetic speech was synthesized at 100 Hz
• Formant transitions of 5 ms: half a glottal period??
  – Confirmed that the transition has to coincide with the energetic part of the glottal period
• Do subjects use a ‘transition’ or just energy in the appropriate band (1-2 kHz)?
Formant transitions?
• Take the /em/ stimulus without transitions (heard as /en/)
• Add a chirp in place of the F2 transition (0, 5, 10, 20, 40 ms); see the sketch below
  – down chirp: FM sinusoid, 2 kHz to 1 kHz
  – control: FM sinusoid, 1 kHz to 2 kHz
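A minimal sketch of how such replacement chirps could be generated. The durations and the 2 kHz to 1 kHz sweep are taken from the slide; the 48 kHz sample rate and linear FM are assumptions, and the helper name is my own.

```python
# Linear FM "chirp" replacements for the F2 transition.
import numpy as np
from scipy.signal import chirp

FS = 48000   # assumed sample rate (matches the 48 kHz recordings earlier in the talk)

def make_chirp(duration_ms, f_start, f_end, fs=FS):
    t = np.arange(int(round(duration_ms * 1e-3 * fs))) / fs
    return chirp(t, f0=f_start, t1=t[-1], f1=f_end, method='linear')

down_chirp = make_chirp(20, 2000, 1000)   # "down" chirp: 2 kHz -> 1 kHz
up_chirp   = make_chirp(20, 1000, 2000)   # control: 1 kHz -> 2 kHz
```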
[Figure: schematic vowel-nasal stimulus, formant frequency (Hz) vs. time; formants at 375, 800, 2000 and 2700 Hz; vowel 100 ms, nasal 200 ms; up and down chirps replacing the F2 transition]
• ASA: the chirp should be segregated
  – listeners should hear ‘vowel-nasal’ plus chirp
  – listeners should find it difficult to report the ‘time of chirp’
Model prediction
Down Chirp
• 7 listeners, 200 trials each.
• Result:
  – the chirp is perceived by listeners
  – and integrated into the percept: /en/ is heard as /em/
[Figure: p(/m/) as a function of DOWN chirp duration (0, 10, 20, 30, 40 ms)]
What does it all mean
• Subjects
  – Hear /em/ when the chirp is added (any chirp!)
  – Hear the chirp as a separate sound
  – Can identify the direction of the chirp
• Chirps are able to replace the formant transition
  – Spectral and fine time structure are different
  – Up-direction is inconsistent with the expected F2
Multiresolution scene analysis
• Speech recognition does not require detail
• Scene analysis does…
MLE framework
• Propose to test an MLE model for ASA cue integration
• Cue integration as a weighted sum of component probabilities (see the formula below)
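One way to write that weighted sum (the symbols are my own; the weights are assumed to reflect cue reliability and to sum to one):

\[ P(\mathrm{/m/}) \;=\; \sum_i w_i\, P_i(\mathrm{/m/}), \qquad \sum_i w_i = 1, \]

where \(P_i(\mathrm{/m/})\) is the probability of /m/ given cue \(i\) alone and \(w_i\) its weight.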
[Schematic: frequency vs. time; annotation: “ASA says: ignore this bit”]
Hypothetical Example
[Schematic: three hypothetical stimuli, frequency vs. time]
• Labial transition: p(/m/) = 0.8, weight 0.7; formant structure: p(/m/) = 0.7, weight 0.3 → /m/
• Velar transition: p(/n/) = 0.8, weight 0.7; formant structure: p(/m/) = 0.7, weight 0.3 → /n/
• Unknown transition: p(/n/ or /m/) = 0, weight 0.0; formant structure: p(/m/) = 0.7, weight 1.0 → /m/
MLE experiment (Elvira)
[Schematic: three stimulus spectrograms, frequency vs. time]
Taking it further (back)
• Transition cues: prior probability high for speech, low for non-speech
• Localisation cues: prior probability is low
What does it all mean
• Duplex perception is
  – Nothing special
  – Entirely consistent with a probabilistic scene analysis viewpoint
• Could imagine a fairly high impact publication on this topic
• Training activity on ‘Data fusion’?
Where to go from here
• Would like to collaborate on principled testing of these (and related) ideas
  – Sheffield?? IDIAP??
• Is this any different from missing data recognition?
  – Bochum??
• Want to ‘warm up’ duplex perception?
  – Most useful: a hands-on modeller
EEG / MEG Study
• We argue that
  – Scene analysis informs speech perception
  – Therefore we would expect non-speech signals to be processed/evaluated before speech is recognised
• EEG / MEG data should show
  – Differential processing of speech / non-speech signals
  – Perhaps an effect of the chirps on the latency of the speech-driven auditory evoked potential (field)
• We have
  – A really neat stimulus
  – The /em/ and /en/ signals can be listened to as speech or as non-speech
  – Non-speech changes speech identity
(very!) Preliminary data
• Four conditions
  – /em/ with 20 ms formant transitions
  – /em/ with no formant transitions (/en/ percept)
  – /em/ with no formant transition + 20 ms up chirp (/em/ percept)
  – /em/ with no formant transition + 20 ms down chirp (/em/ percept)
• Two tasks
  – Identify the /em/s
  – Identify signals containing chirps
• 16-channel EEG recordings, 200 stimuli each (epoch averaging sketched below)
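A generic epoch-averaging sketch, illustrative only and not the actual analysis pipeline: cut epochs from -100 to 600 ms around each stimulus onset, baseline-correct on the pre-stimulus interval, and average to obtain evoked responses like those plotted below. The sampling rate and event handling are assumptions.

```python
# Generic ERP epoch extraction and averaging (illustrative sketch).
import numpy as np

FS = 500                       # assumed sampling rate (Hz)
PRE, POST = 0.1, 0.6           # epoch window: -100 to 600 ms

def evoked(eeg, onsets, fs=FS, pre=PRE, post=POST):
    """eeg: [channels x samples]; onsets: stimulus onset sample indices."""
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = np.stack([eeg[:, o - n_pre:o + n_post] for o in onsets])
    baseline = epochs[:, :, :n_pre].mean(axis=2, keepdims=True)
    return (epochs - baseline).mean(axis=0)      # [channels x time] average

# e.g. 16 channels of simulated data, 200 stimuli
rng = np.random.default_rng(2)
eeg = rng.standard_normal((16, 200000))
onsets = np.arange(1000, 199000, 990)[:200]
erp = evoked(eeg, onsets)
print(erp.shape)   # (16, 350) -> 16 channels, 700 ms at 500 Hz
```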
Predictions
• If ‘speech is special’ then should see significant task dependent differences
• May also see significant differences between stimuli leading to the same percept
  – Effect of chirp might delay speech recognition?
• Here we go:
[Figure: evoked responses at T7 (LHS) and T8 (RHS), -100 to 600 ms; dotted = non-speech task, solid = speech task]
[Figure: evoked responses at TP7 (LHS) and TP8 (RHS), -100 to 600 ms; dotted = non-speech task, solid = speech task]
[Figure: evoked responses at F1 and F2, -100 to 600 ms; dotted = non-speech task, solid = speech task]
[Figure: evoked responses at O1 and O2 (control), -100 to 600 ms; dotted = non-speech task, solid = speech task]
No evidence for differences in early (sensory) processing.
EEG Conclusions
• (very!) preliminary data looks very promising
• Need to get more subjects
• Refine the paradigm (sequence currently too fast)
  – Would an MMN study be appropriate?
• Would like to
  – Look at source localisation (MEG Helsinki, fMRI Liverpool)
  – Get more channels (MEG Helsinki)