Liverpool University
The Department
• Centre for Cognitive Neuroscience, Department of Psychology, Liverpool University
• Overall Aim: Understanding Human Information Processing
Expertise
• Auditory Scene Analysis (ASA)
  – Perception experiments
  – Modelling
• Speech Perception
• Audio-Visual Integration
  – Models of AV information fusion
  – Applying these models to ASA
Work at Liverpool
Task 1.3: Active/passive speech perception
Research Question: Do human listeners actively predict the time course of background noise to aid speech recognition?
Current state: Perceptual evidence for ‘predictive scene analysis’ (Elvira Perez will explain all)
Planned work: Database of environmental noise to test computational models
Work at Liverpool
Task 1.4: Envelope Information & Binaural Processing (Patti Adank, Aug. 03-July 04)
Research Question: What features do listeners use to track a target speaker in the presence of competing signals?
Current state: Tested the hypothesis that ‘jitter’ is a stream segregation or stream formation cue; tech report finalised in July 04
Work at Liverpool
Task 2.2: Reliability of auditory cues in multi-cue scenarios
Research Question: How are cues perceptually integrated? Combination of experimentation and modelling.
Current state: Experimental data and models on audio-visual motion signal integration (non-HOARSE)
Ongoing work: MLE models for speech feature integration (Elvira)
Planned work: Collaboration with Patras (John Worley) on location and pitch segregation cue integration
Work at Liverpool
Task 4.1: Informing speech recognition
Research Question: How can data derived from perception experiments be applied to machine learning?
Current state: Just starting to ‘predict’ environmental noises (using Aurora noises); recording a database of natural scenes for analysis and modelling (with Sheffield)
… over to Elvira
Environmental Noise
• Two-pronged approach
  – Elvira: is there perceptual evidence for active noise modelling in listeners?
  – Georg (+ Sheffield): noise modelling based on a database
Baseline Data
• Typical noise databases are not very representative
  – Size severely limited (e.g. Aurora)
  – Unrealistic scenarios (fighter jets, foundries)
• Database of environmental noise
  – Transport noises: A320-200, ICE, Saab 9-3, …
  – Social places: departure lounges, hotel lobby, pub
  – Private journeys: urban walk, country walk
  – Buildings: offices, corridors
  – …
• Aim is to have about 10-20 mins of representative data for typical situations.
Recordings
• Soundman OKMII binaural microphones
• Sony D3 DAT recorder
• 48kHz stereo recordings
• Digital transfer to PC
Analysis
• Previous work
  – Auditory filterbank (linear, Mel scale, 32 channels)
  – Linear prediction (within-channel case sketched below)
    • Within channels
    • Across channels
• Planned work
  – Auditory filterbank
  – Non-linear prediction using neural networks
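As a concrete illustration of the within-channel prediction step, here is a minimal Python sketch. It assumes a pre-computed [channels x frames] envelope matrix from a mel-spaced filterbank and fits an ordinary least-squares linear predictor per channel; the function and variable names are illustrative, not the project's actual code.

```python
# Minimal sketch of within-channel linear prediction of noise envelopes.
# Assumes a filterbank has already been applied; `env` is a stand-in
# [channels x frames] envelope matrix (names are illustrative only).
import numpy as np

def lp_predict_channel(x, order=8):
    """Least-squares linear prediction of x[t] from its `order` past samples."""
    X = np.stack([x[i:len(x) - order + i] for i in range(order)], axis=1)
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    pred = X @ coeffs
    return pred, y - pred            # prediction and prediction error

rng = np.random.default_rng(0)
env = np.abs(rng.standard_normal((32, 1000)))    # 32-channel stand-in envelopes
err_var = [lp_predict_channel(ch)[1].var() for ch in env]
print("mean within-channel prediction error variance:", float(np.mean(err_var)))
```

The across-channel and non-linear (neural network) variants would replace the per-channel least-squares fit with a joint or non-linear predictor over the same envelope representation.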
Using Envelope Information for ASA (Patti Adank)
• Background
  – Brungart & Darwin (resource allocation task?): two simultaneous sentences, track one
• Segregation benefits from
  – Pitch differences
  – Speaker differences
• Key question: operational definition of speaker characteristics
Speaker characteristics
• Vocal tract shape
  – Difficult to quantify / extract computationally
• Speaking style (intonation, stress, accent, …)
  – Difficult to extract measures for very short segments
• Voice characteristics
  – F0 (of course…)
  – Shimmer (amplitude modulation)
  – Jitter (roughness; random GCI variation)
  – Breathiness (open quotient during voiced speech)
• All relatively easy to extract computationally (jitter is sketched below)
• All relatively easy to control in speech re-synthesis
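Since the slide claims these measures are relatively easy to extract, here is a minimal sketch of one of them, local jitter, computed from a sequence of glottal cycle durations; the helper is illustrative, not the analysis code used in the project.

```python
# Local jitter: mean absolute difference between consecutive glottal periods,
# relative to the mean period (expressed here as a fraction).
import numpy as np

def local_jitter(periods_s):
    periods = np.asarray(periods_s, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

# Illustrative example: a ~100 Hz voice with small random cycle-length variation
rng = np.random.default_rng(1)
periods = 0.010 + rng.normal(0.0, 1e-4, size=200)
print(f"jitter = {100 * local_jitter(periods):.2f} %")
```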
No. 1 choice: Jitter
• Dan Ellis
  – Computational model of segregation by glottal closure instant
  – Model groups coincident energy in an auditory filterbank
• Could ‘jitter’ be useful for segregation?
Jitter as a primary segregation cue
• Double vowel experiment:
  – 5 synthetic vowels (Assmann & Summerfield)
  – Synthesized with a range of
    • 5 pitch levels
    • 5 jitter levels
• Results
  – Pitch difference aids segregation
  – Jitter difference does not
[Figure: mean ± 1 SE percent correct as a function of % jitter (baseline 0%, 0.5%, 1%, 2%, 4%)]
Jitter analogous to location cues
• Location cues are not primary segregation cues
  – Segregate on pitch first, then
  – Use location cues for stream formation
• Experiment
  – Brungart & Darwin (e.g. 2001) task: e.g. “Ready Tiger go to White One now” and “Ready Arrow go to Red Four now”, but
  – Speech resynthesized using Praat
    • Same speaker, different sentences
  – Jitter does not aid stream formation
[Figure: mean ± 1 SE percent correct colour/number combination as a function of % jitter (0%, 3%, 6%, 9%, 12%, 15%)]
Informing Speech Recognition
• Jitter is not the No. 1 candidate for informing speech recognition…
Task 2.2 Reliability of auditory cues in multi-cue scenarios.
• Ernst & Banks (Nature 2002)– Maximum likelihood estimation good model
for visual/somatosensory cue integration
– Adapted this for AV integration: mouse catching experiment: MLE good modelHofbauer et al., JPP: HPP 2004
– Want to look at speech cue integration in collaboration with Sheffield
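For reference, a minimal sketch of Ernst & Banks style maximum-likelihood integration: with independent Gaussian cue noise, the optimal combined estimate weights each cue by its inverse variance. The numbers below are hypothetical.

```python
# Reliability-weighted (maximum-likelihood) cue combination under the usual
# independent-Gaussian-noise assumptions.
import numpy as np

def mle_combine(estimates, variances):
    estimates = np.asarray(estimates, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = (1.0 / variances) / np.sum(1.0 / variances)   # inverse-variance weights
    combined = float(np.sum(weights * estimates))            # fused estimate
    combined_var = float(1.0 / np.sum(1.0 / variances))      # variance of fused estimate
    return combined, combined_var, weights

# Hypothetical auditory and visual estimates of the same event
est, var, w = mle_combine(estimates=[10.0, 12.0], variances=[1.0, 4.0])
print(est, var, w)   # the more reliable cue dominates the combined estimate
```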
Hypothesis
• If listeners organise formants by continuity, then
– the /o/ should lead to /m/ , while
– the /e/ should lead to /n/, with the secondformant of the nasal remaining unassigned
• if proximity is a cue then there should bea changeover at around 1400 Hz
[Figure: schematic stimulus with formants at 375, 800, 2000 and 2700 Hz; time axis 100 and 200 ms]
Formants as a representation?
If sequential grouping of formants explains the perceptual change from /m/ to /n/ for high vowel F2s, then transitions should ‘undo’ this change.
[Schematic: formant frequency vs. time]
Transitions in /vm/ syllables
• Synthetic /v-m/ segments as before, but 0, 2.5, 5, 10, 20ms formant transitions
• 7 fluent German speakers, 200 trials each
• Experimental results fit the prediction
[Figure: proportion heard as /em/ as a function of transition duration (0-20 ms)]
Transitions ??
• ‘Formant transitions’ of 5 ms have an effect
• Synthetic speech was synthesized at 100 Hz
• Formant transitions of 5 ms: half a glottal period??
  – Confirmed that the transition has to coincide with the energetic part of the glottal period
• Do subjects use a ‘transition’ or just energy in the appropriate band (1-2 kHz)?
Formant transitions?
• Take the /em/ stimulus without transitions (heard as /en/)
• Add a chirp in place of the F2 transition (0, 5, 10, 20, 40 ms); see the sketch below
  – down chirp: FM sinusoid, 2 kHz to 1 kHz
  – control: FM sinusoid, 1 kHz to 2 kHz
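A minimal sketch of how such replacement chirps could be generated. The durations and the 2 kHz to 1 kHz sweep are taken from the slide; the 48 kHz sample rate and linear FM are assumptions, and the helper name is my own.

```python
# Linear FM "chirp" replacements for the F2 transition.
import numpy as np
from scipy.signal import chirp

FS = 48000   # assumed sample rate (matches the 48 kHz recordings earlier in the talk)

def make_chirp(duration_ms, f_start, f_end, fs=FS):
    t = np.arange(int(round(duration_ms * 1e-3 * fs))) / fs
    return chirp(t, f0=f_start, t1=t[-1], f1=f_end, method='linear')

down_chirp = make_chirp(20, 2000, 1000)   # "down" chirp: 2 kHz -> 1 kHz
up_chirp   = make_chirp(20, 1000, 2000)   # control: 1 kHz -> 2 kHz
```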
[Figure: schematic vowel-nasal stimulus, formant frequency (Hz) vs. time; formants at 375, 800, 2000 and 2700 Hz; vowel 100 ms, nasal 200 ms; up and down chirps replacing the F2 transition]
• ASA: the chirp should be segregated
  – listeners should hear ‘vowel-nasal’ plus chirp
  – listeners should find it difficult to report the ‘time of chirp’
Model prediction
Down Chirp
• 7 listeners, 200 trials each.
• Result:
  – the chirp is perceived by listeners
  – and integrated into the percept: /en/ is heard as /em/
[Figure: p(/m/) as a function of DOWN chirp duration (0, 10, 20, 30, 40 ms)]
What does it all mean
• Subjects
  – Hear /em/ when the chirp is added (any chirp!)
  – Hear the chirp as a separate sound
  – Can identify the direction of the chirp
• Chirps are able to replace the formant transition
  – Spectral and fine time structure are different
  – Up-direction is inconsistent with the expected F2
Multiresolution scene analysis
• Speech recognition does not require detail
• Scene analysis does…
MLE framework
• Propose to test an MLE model for ASA cue integration
• Cue integration as a weighted sum of component probabilities (see the formula below)
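One way to write that weighted sum (the symbols are my own; the weights are assumed to reflect cue reliability and to sum to one):

\[ P(\mathrm{/m/}) \;=\; \sum_i w_i\, P_i(\mathrm{/m/}), \qquad \sum_i w_i = 1, \]

where \(P_i(\mathrm{/m/})\) is the probability of /m/ given cue \(i\) alone and \(w_i\) its weight.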
[Schematic: frequency vs. time; annotation: “ASA says: ignore this bit”]
Hypothetical Example
[Schematic: three hypothetical stimuli, frequency vs. time]
• Labial transition: p(/m/) = 0.8, weight 0.7; formant structure: p(/m/) = 0.7, weight 0.3 → /m/
• Velar transition: p(/n/) = 0.8, weight 0.7; formant structure: p(/m/) = 0.7, weight 0.3 → /n/
• Unknown transition: p(/n/ or /m/) = 0, weight 0.0; formant structure: p(/m/) = 0.7, weight 1.0 → /m/
MLE experiment (Elvira)
[Schematic: three stimulus spectrograms, frequency vs. time]
Taking it further (back)
• Transition cues: prior probability high for speech, low for non-speech
• Localisation cues: prior probability is low
What does it all mean
• Duplex perception is
  – Nothing special
  – Entirely consistent with a probabilistic scene analysis viewpoint
• Could imagine a fairly high impact publication on this topic
• Training activity on ‘Data fusion’?
Where to go from here
• Would like to collaborate on principled testing of these (and related) ideas
  – Sheffield?? IDIAP??
• Is this any different from missing data recognition?
  – Bochum??
• Want to ‘warm up’ duplex perception?
  – Most useful: a hands-on modeller
EEG / MEG Study
• We argue that
  – Scene analysis informs speech perception
  – Therefore we would expect non-speech signals to be processed/evaluated before speech is recognised
• EEG / MEG data should show
  – Differential processing of speech / non-speech signals
  – Perhaps an effect of the chirps on the latency of the speech-driven auditory evoked potential (field)
• We have
  – A really neat stimulus
  – The /em/ and /en/ signals can be listened to as speech or as non-speech
  – Non-speech changes speech identity
(very!) Preliminary data
• Four conditions
  – /em/ with 20 ms formant transitions
  – /em/ with no formant transitions (/en/ percept)
  – /em/ with no formant transition + 20 ms up chirp (/em/ percept)
  – /em/ with no formant transition + 20 ms down chirp (/em/ percept)
• Two tasks
  – Identify the /em/s
  – Identify signals containing chirps
• 16-channel EEG recordings, 200 stimuli each (epoch averaging sketched below)
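A generic epoch-averaging sketch, illustrative only and not the actual analysis pipeline: cut epochs from -100 to 600 ms around each stimulus onset, baseline-correct on the pre-stimulus interval, and average to obtain evoked responses like those plotted below. The sampling rate and event handling are assumptions.

```python
# Generic ERP epoch extraction and averaging (illustrative sketch).
import numpy as np

FS = 500                       # assumed sampling rate (Hz)
PRE, POST = 0.1, 0.6           # epoch window: -100 to 600 ms

def evoked(eeg, onsets, fs=FS, pre=PRE, post=POST):
    """eeg: [channels x samples]; onsets: stimulus onset sample indices."""
    n_pre, n_post = int(pre * fs), int(post * fs)
    epochs = np.stack([eeg[:, o - n_pre:o + n_post] for o in onsets])
    baseline = epochs[:, :, :n_pre].mean(axis=2, keepdims=True)
    return (epochs - baseline).mean(axis=0)      # [channels x time] average

# e.g. 16 channels of simulated data, 200 stimuli
rng = np.random.default_rng(2)
eeg = rng.standard_normal((16, 200000))
onsets = np.arange(1000, 199000, 990)[:200]
erp = evoked(eeg, onsets)
print(erp.shape)   # (16, 350) -> 16 channels, 700 ms at 500 Hz
```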
Predictions
• If ‘speech is special’ then should see significant task dependent differences
• May also see significant differences between stimuli leading to the same percept
  – Effect of chirp might delay speech recognition?
• Here we go:
[Figure: evoked responses at T7 (LHS) and T8 (RHS), -100 to 600 ms; dotted = non-speech task, solid = speech task]
[Figure: evoked responses at TP7 (LHS) and TP8 (RHS), -100 to 600 ms; dotted = non-speech task, solid = speech task]
[Figure: evoked responses at F1 and F2, -100 to 600 ms; dotted = non-speech task, solid = speech task]
[Figure: evoked responses at O1 and O2 (control), -100 to 600 ms; dotted = non-speech task, solid = speech task]
No evidence for differences in early (sensory) processing.
EEG Conclusions
• (very!) preliminary data looks very promising
• Need to get more subjects
• Refine the paradigm (sequence currently too fast)
  – Would an MMN study be appropriate?
• Would like to
  – Look at source localisation (MEG Helsinki, fMRI Liverpool)
  – Get more channels (MEG Helsinki)