Sound, mixtures, learning @ OSU - Dan Ellis 2002-08-10 - 1/37
Sound, Mixtures, and Learning
Dan Ellis <[email protected]>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Electrical Engineering, Columbia University
http://labrosa.ee.columbia.edu/
Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Fragment Recognition
4. Alarm Sound Detection
5. Future Work
Auditory Scene Analysis
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects ‘natural scene’ properties
- subjective, not absolute
[Figure: spectrogram of a sound scene (time/s 0-12 vs. frq/Hz 0-4000, level/dB) with its Analysis into sources: Voice (evil), Stab, Rumble, Strings, Choir, Voice (pleasant)]
Sound, mixtures, and learning
• Sound
- carries useful information about the world
- complements vision
• Mixtures
- ... are the rule, not the exception
- medium is ‘transparent’, sources are many
- must be handled!
• Learning
- the ‘speech recognition’ lesson: let the data do the work
- like listeners
[Figure: three spectrograms (time/s 0-1 vs. freq/kHz 0-4): Speech + Noise = Speech + Noise]
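The additive picture above (speech plus noise gives the mixture) can be sketched numerically. This is only an illustration, not how the slide's figure was made: the helper `mix_at_snr`, the 200 Hz tone standing in for speech, and the Gaussian samples standing in for noise are all assumptions.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(s * s for s in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`, then add.

    Returns (mixture, scaled_noise). Hypothetical helper for illustration.
    """
    target_noise_power = power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


# Toy stand-ins: a 200 Hz tone for "speech", Gaussian samples for "noise".
rate = 8000
speech = [math.sin(2 * math.pi * 200 * t / rate) for t in range(rate)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(rate)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=0.0)
achieved_snr_db = 10.0 * math.log10(power(speech) / power(scaled_noise))
```

Because the medium is ‘transparent’, the mixture is just the sample-wise sum; nothing in `mixture` marks which part came from which source.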
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman’90)
• Received waveform is a mixture
- two sensors, N signals ... underconstrained
• Disentangling mixtures as the primary goal?
- perfect solution is not possible
- need experience-based constraints
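A small numeric sketch of why two sensors and N > 2 signals is underconstrained. The instantaneous linear mixing model and the particular 2 x 3 matrix are assumptions made for illustration; real acoustic mixing also involves delays and filtering.

```python
def mix(sources, mixing):
    """Instantaneous linear mixing: each sensor is a weighted sum of sources."""
    return [sum(w * s for w, s in zip(row, sources)) for row in mixing]


# Assumed 2 x 3 mixing matrix: two sensors, three sources (one time instant).
mixing = [
    [1.0, 1.0, 1.0],
    [1.0, 2.0, 3.0],
]

sources_a = [1.0, 2.0, 3.0]
# (1, -2, 1) is mixed to zero at both sensors, so adding any multiple of it
# changes the sources without changing what the sensors receive.
null_direction = [1.0, -2.0, 1.0]
sources_b = [s + 5.0 * n for s, n in zip(sources_a, null_direction)]

observed_a = mix(sources_a, mixing)
observed_b = mix(sources_b, mixing)
```

Both source sets give identical sensor values, so the observations alone cannot choose between them; some additional constraint has to do that.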
Human Auditory Scene Analysis
(Bregman 1990)
• How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Figure: block diagram: Frequency analysis; Onset map, Harmonicity map, Position map; Grouping mechanism; Source properties (after Darwin, 1996)]
Cues to simultaneous grouping
• Elements + attributes
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
[Figure: spectrogram, time/s (0-9) vs. freq/Hz (0-8000)]
The effect of context
• Context can create an ‘expectation’: i.e. a bias towards a particular interpretation
• e.g. Bregman’s “old-plus-new” principle: a change in a signal will be interpreted as an added source whenever possible
- a different division of the same energy depending on what preceded it
[Figure: spectrogram, time/s (0.0-1.2) vs. frequency/kHz]
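One way to make the old-plus-new principle concrete is a per-band subtraction. This is a sketch under strong assumptions (per-band energies add linearly, and the ‘old’ sound continues unchanged); the function name `old_plus_new` is hypothetical and this is not a model taken from the slides.

```python
def old_plus_new(old_spectrum, total_spectrum):
    """Attribute energy beyond the continuing 'old' sound to an added source.

    Spectra are per-band energies (linear, not dB). Energies are assumed to
    add, and the residual is floored at zero.
    """
    return [max(total - old, 0.0)
            for old, total in zip(old_spectrum, total_spectrum)]


total = [4.0, 9.0, 4.0, 6.0]            # energy after the change

# Context 1: a steady sound was already present before the change.
old = [4.0, 4.0, 4.0, 4.0]
new_source = old_plus_new(old, total)

# Context 2: silence preceded, so all of the energy is heard as one new source.
new_from_silence = old_plus_new([0.0, 0.0, 0.0, 0.0], total)
```

The same `total` energy is divided differently in the two contexts, which is the point of the sub-bullet above.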
Computational Auditory Scene Analysis (CASA)
• Goal: automatic sound organization; systems to ‘pick out’ sounds in a mixture
- ... like people do
• E.g. voice against a noisy background
- to improve speech recognition
• Approach:
- psychoacoustics describes grouping ‘rules’
- ... just implement them?
[Figure: CASA box producing Object 1 description, Object 2 description, Object 3 description, ...]
The Representational Approach
(Brown & Cooke 1993)
• Implement psychoacoustic theory
- ‘bottom-up’ processing
- uses common onset & periodicity cues
• Able to extract voiced speech:
[Figure: block diagram: input mixture; Front end; signal features (maps: onset, period, frq.mod); Object formation; discrete objects; Grouping rules; Source groups]
[Figure: two spectrograms, time/s (0.2-1.0) vs. frq/Hz (100-3000), labeled brn1h.aif and brn1h.fi.aif]
Restoration in sound perception
• Auditory ‘illusions’ = hearing what’s not there
• The continuity illusion
• Sine-wave speech (SWS)
- duplex perception
• How to model in CASA?
[Figure: two spectrograms: "ptshort" (time/s vs. f/Hz) and "S1-env.pf:0" (time/s vs. f/Bark)]
Adding top-down constraints
Perception is not direct but a search for plausible hypotheses
• Data-driven (bottom-up)...
- objects irresistibly appear
vs. Prediction-driven (top-down)
- match observations with parameters of a world-model
- need world-model constraints...
[Figure: data-driven block diagram: input mixture; Front end; signal features; Object formation; discrete objects; Grouping rules; Source groups]
[Figure: prediction-driven block diagram: input mixture; Front end; signal features; Compare & reconcile; prediction errors; Hypothesis management; hypotheses (Periodic components, Noise components); Predict & combine; predicted features]
Approaches to sound mixture recognition
• Recognize combined signal
- ‘multicondition training’
- combinatorics...
• Separate signals
- e.g. CASA, ICA
- nice, if you can do it
• Segregate features into fragments
- then missing-data recognition
Aside: Evaluation
• Evaluation is a big problem for CASA
- what is the goal, really?
- what is a good test domain?
- how do you measure performance?
• SNR improvement
- not easy given only before-after signals: correspondence problem
- can do with fixed filtering mask; rewards removing signal as well as noise
• ASR improvement
- recognizers typically very sensitive to artefacts
• ‘Real’ task?
- mixture corpus with specific sound events...
Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
- standard ASR
- approaches to speech + noise
3. Fragment Recognition
4. Alarm Sound Detection
5. Future Work
Speech recognition & mixtures
• Speech recognizers are the most successful and sophisticated acoustic recognizers to date
• ‘State of the art’ word-error rates (WERs):
- 2% (dictation)
- 30% (phone conversations)
[Figure: recognizer block diagram: sound; Feature calculation; feature vectors; Acoustic classifier (Acoustic model parameters); phone probabilities; HMM decoder (Word models, e.g. "s ah t"; Language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")); phone / word sequence; Understanding/application...; model parameters learned from DATA]
Learning acoustic models
• Goal: describe p(X|M) with e.g. GMMs
• Separate models for each class
- generalization as blurring
• Training data labels from:
- manual annotation
- ‘best path’ from earlier classifier (Viterbi)
- EM: joint estimation of labels & pdfs
[Figure: Labeled training examples {x_n, ω_xn}; Sort according to class; Estimate conditional pdf for class ω1; p(x|ω1)]
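The sort-by-class, estimate-per-class-pdf recipe in the figure can be sketched as follows. This is a deliberately reduced version: a single 1-D Gaussian per class stands in for a GMM, and the data, labels and function names are made up for illustration.

```python
import math
from collections import defaultdict
from statistics import mean, pvariance


def train_class_models(examples):
    """Sort labeled scalar examples by class and fit one Gaussian per class."""
    by_class = defaultdict(list)
    for x, label in examples:
        by_class[label].append(x)
    return {label: (mean(xs), pvariance(xs)) for label, xs in by_class.items()}


def log_likelihood(x, model):
    """log p(x | class) under a single Gaussian given as (mean, variance)."""
    mu, var = model
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)


# Hypothetical 1-D training set: (feature value, class label).
examples = [(0.9, "w1"), (1.1, "w1"), (1.0, "w1"),
            (2.9, "w2"), (3.2, "w2"), (2.9, "w2")]
models = train_class_models(examples)

# Classify a new value by the class whose pdf explains it best.
best = max(models, key=lambda label: log_likelihood(1.2, models[label]))
```

The fitted variance is what lets a class model cover values near, but not identical to, its training examples (the ‘generalization as blurring’ bullet).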
Speech + noise mixture recognition
• Background noise is biggest (?) problem facing current ASR
• Feature invariance approach: design features to reflect only speech
- e.g. normalization, mean subtraction
• Ideally, models of clean speech will match speech in noise
- ... although training on noisy examples can’t hurt
• Static noise is relatively easy
- but: non-static noise?
• Alternative: more complex models of the signal
- separate models for speech and ‘rest’
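As an example of the normalization mentioned above, mean subtraction over time removes any constant offset in log-spectral or cepstral features, which is how a fixed linear channel appears in those domains. The feature values below are invented and `subtract_mean` is a hypothetical helper; this is a sketch, not a full front end.

```python
def subtract_mean(frames):
    """Remove the per-dimension average over time from a list of frames."""
    n = len(frames)
    means = [sum(column) / n for column in zip(*frames)]
    return [[v - m for v, m in zip(frame, means)] for frame in frames]


# Hypothetical log-domain features: 3 frames x 2 dimensions.
clean = [[1.0, 4.0], [2.0, 6.0], [3.0, 8.0]]

# A fixed channel adds the same offset to every frame in the log domain.
channel_offset = [0.5, -2.0]
filtered = [[v + c for v, c in zip(frame, channel_offset)] for frame in clean]

normalized_clean = subtract_mean(clean)
normalized_filtered = subtract_mean(filtered)
```

After normalization the two versions coincide, so a clean-speech model sees the same features either way. Additive noise does not reduce to a constant offset in these domains, which is one reason the later slides look for other approaches.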
HMM decomposition
(e.g. Varga & Moore 1991, Roweis 2000)
• Total signal model has independent state sequences for 2+ component sources
• New combined state space q' = {q1, q2}
- new observation pdfs p(X|q1,q2) for each combination
[Figure: state sequences of model 1 and model 2 against observations / time]
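For two independent chains, the transition structure over the combined state (q1, q2) is the Kronecker product of the individual transition matrices. The sketch below builds it for made-up 2-state and 3-state models; it does not address the harder part, the observation pdfs p(X|q1,q2) for each combination.

```python
from itertools import product


def combined_transitions(trans1, trans2):
    """Transition matrix over joint states (q1, q2) for two independent chains.

    Entry [(i, j)][(k, l)] = trans1[i][k] * trans2[j][l], i.e. the Kronecker
    product of the two transition matrices.
    """
    n1, n2 = len(trans1), len(trans2)
    states = list(product(range(n1), range(n2)))
    matrix = [[trans1[i][k] * trans2[j][l] for (k, l) in states]
              for (i, j) in states]
    return states, matrix


# Hypothetical transition matrices: 2-state model 1, 3-state model 2.
model1 = [[0.9, 0.1],
          [0.2, 0.8]]
model2 = [[0.5, 0.5, 0.0],
          [0.0, 0.5, 0.5],
          [0.5, 0.0, 0.5]]

states, joint = combined_transitions(model1, model2)
```

Here 2 x 3 = 6 joint states need 36 transition entries; the product grows multiplicatively with every additional source or state.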
Problems with HMM decomposition
• O(qk)N is exponentially large...
• Feature normalization no longer holds!
- each source has a different gain → model at various SNRs?
- models typically don’t use overall energy C0
- each source has a different channel H[k]
• Modeling every possible sub-state combination is inefficient, inelegant and impractical
Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Fragment Recognition
- separating signals vs. separating features
- missing data recognition
- recognizing multiple sources
4. Alarm Sound Detection
5. Future Work
Fragment Recognition
(Jon Barker & Martin Cooke, Sheffield)
• Signal separation is too hard! Instead:
- segregate features into partially-observed sources
- then classify
• Made possible by ‘missing data’ recognition
- integrate over uncertainty in observations for optimal posterior distribution
• Goal: relating clean speech models P(X|M) to speech + noise mixture observations
- ... and making it tractable
Comparing different segregations
• Standard classification chooses between models M to match source features X:
$$M^* = \arg\max_M P(M|X) = \arg\max_M P(X|M) \cdot \frac{P(M)}{P(X)}$$
• Mixtures → observed features Y, segregation S, all related by $P(X|Y,S)$
- spectral features allow clean relationship
[Figure: Observation Y(f), Segregation S and Source X(f) plotted against freq]
• Joint classification of model and segregation:
$$P(M,S|Y) = P(M) \int P(X|M) \cdot \frac{P(X|Y,S)}{P(X)} \, dX \cdot P(S|Y)$$
- integral collapses in several cases...
Calculating fragment matches
• P(X|M) - the clean-signal feature model
• P(X|Y,S)/P(X) - is X ‘visible’ given segregation?
• Integration collapses some channels...
• P(S|Y) - segregation inferred from observation
- just assume uniform, find S for most likely M
- use extra information in Y to distinguish S’s, e.g. harmonicity, onset grouping
• Result:
- probabilistically-correct relation between clean-source models P(X|M) and inferred contributory source P(M,S|Y)
$$P(M,S|Y) = P(M) \int P(X|M) \cdot \frac{P(X|Y,S)}{P(X)} \, dX \cdot P(S|Y)$$
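A sketch of how the integral over X can collapse channel by channel. The assumptions here are not stated on the slide: a diagonal-Gaussian clean model, non-negative spectral energies, and the ‘bounded’ treatment of masked channels (source energy somewhere between 0 and the observed value) commonly used in missing-data recognition. The P(X) normalization and the P(S|Y) term are left out.

```python
import math


def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def gaussian_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def masked_log_likelihood(observed, mask, means, sigmas):
    """Log-likelihood of one frame under a diagonal-Gaussian clean model.

    mask[f] True: channel f belongs to the source, so X(f) = Y(f) and the
    integral over X collapses to the pdf evaluated at the observation.
    mask[f] False: channel f is dominated by something else; the source
    energy is only assumed to lie in [0, Y(f)], so integrate the pdf there.
    """
    total = 0.0
    for y, present, mu, sigma in zip(observed, mask, means, sigmas):
        if present:
            total += math.log(gaussian_pdf(y, mu, sigma))
        else:
            mass = gaussian_cdf(y, mu, sigma) - gaussian_cdf(0.0, mu, sigma)
            total += math.log(max(mass, 1e-300))
    return total


# Hypothetical frame: channel 1 is far louder than the clean model expects.
observed = [5.0, 9.0, 2.0]
means, sigmas = [5.0, 3.0, 2.0], [1.0, 1.0, 1.0]

ll_all_source = masked_log_likelihood(observed, [True, True, True], means, sigmas)
ll_masked = masked_log_likelihood(observed, [True, False, True], means, sigmas)
```

Treating the loud channel as belonging to another source gives a much better match to the clean model than forcing the model to explain it, which is the kind of comparison between segregations S that the joint classification relies on.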
Speech fragment decoder results
• Simple P(S|Y) model forces contiguous regions to stay together
- big efficiency gain when searching S space
• Clean-models-based recognition rivals trained-in-noise recognition
[Figure: processing chain: "1754" + noise; SNR mask; Fragments; Fragment Decoder; "1754"]
[Figure: AURORA 2000 - Test Set A, WER / % (0-90) vs. SNR / dB (-5 to 20, clean), curves: MD Soft SNR, HTK clean training, HTK multicondition]
Multi-source decoding
• Search for more than one source
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
- locally coherent regions
• Theoretical vs. practical limits
[Figure: observations Y(t) with masks and states S1(t), q1(t) and S2(t), q2(t)]
Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Fragment Recognition
4. Alarm Sound Detection
- sound
- mixtures
- learning
5. Future Work
Alarm sound detection
• Alarm sounds have particular structure
- people ‘know them when they hear them’
- clear even at low SNRs
• Why investigate alarm sounds?
- they’re supposed to be easy
- potential applications...
• Contrast two systems:
- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)
[Figure: spectrogram of alarm examples hrn01, bfr02, buz01 (time/s 0-25 vs. freq/kHz 0-4, level/dB), labeled s0n6a8+20]
Alarms: Sound (representation)
• Standard system: Mel Cepstra
- have to model alarms in noise context: each cepstral element depends on whole signal
• Contrast system: Sinusoid groups
- exploit sparse, stable nature of alarm sounds
- 2D-filter spectrogram to enhance harmonics
- simple magnitude threshold, track growing
- form groups based on common onset
• Sinusoid representation is already fragmentary
- does not record non-peak energies
[Figure: three spectrogram panels, time/sec (1-2.5) vs. freq/Hz (0-5000)]
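The last step of the sinusoid front end, forming groups based on common onset, might look like the sketch below. The track format, the 50 ms tolerance and the chaining rule (compare each onset with the previous one) are assumptions for illustration, not details given on the slide.

```python
def group_by_onset(tracks, tolerance=0.05):
    """Group sinusoid tracks whose onset times lie within `tolerance` seconds.

    `tracks` is a list of (onset_time_s, frequency_hz) pairs. Tracks are
    sorted by onset and a new group starts whenever the gap to the previous
    onset exceeds the tolerance.
    """
    groups = []
    for onset, freq in sorted(tracks):
        if groups and onset - groups[-1][-1][0] <= tolerance:
            groups[-1].append((onset, freq))
        else:
            groups.append([(onset, freq)])
    return groups


# Hypothetical tracks: two harmonic stacks starting about 0.5 s apart.
tracks = [(1.00, 800.0), (1.01, 1600.0), (1.02, 2400.0),
          (1.50, 1000.0), (1.52, 2000.0)]
groups = group_by_onset(tracks)
```

Each resulting group is a candidate fragment: a set of spectral peaks assumed to come from one source, with nothing recorded about the energy between the peaks.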
Alarms: Mixtures
• Effect of varying SNR on representations:
- sinusoid peaks have ~ invariant properties
[Figure: Sine track groups vs. Cepstra (normalized) over time/s (0-25), at 60 dB SNR, 10 dB SNR and 0 dB SNR]
Alarms: Learning
• Standard: train MLP on noisy examples
• Alternate: learn distributions of group features
- duration, frequency deviation, amp. modulation...
- underlying models are clean (isolated)
- recognize in different contexts...
[Figure: block diagram: Sound mixture; Feature extraction (PLP cepstra); Neural net acoustic classifier; Alarm probability; Median filtering; Detected alarms]
[Figure: scatter plots of Alarm vs. non alarm groups: Spectral moment vs. Spectral centroid, and Magnitude SD vs. Inverse Frequency SD]
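The final stage of the standard system, median filtering of the per-frame alarm probability, can be sketched as below. The window width, the 0.5 threshold and the probability values are assumptions; the slide only names the stage.

```python
from statistics import median


def median_filter(values, width=5):
    """Sliding median with an odd window; edges use the samples available."""
    half = width // 2
    return [median(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]


def detect(probabilities, threshold=0.5, width=5):
    """Per-frame alarm decisions from smoothed classifier outputs."""
    return [p > threshold for p in median_filter(probabilities, width)]


# Hypothetical per-frame alarm probabilities with one spurious spike (frame 2)
# and one sustained alarm (frames 6-10).
probs = [0.1, 0.1, 0.9, 0.1, 0.1, 0.2, 0.8, 0.9, 0.9, 0.8, 0.9, 0.1, 0.1]
decisions = detect(probs)
alarm_frames = [i for i, is_alarm in enumerate(decisions) if is_alarm]
```

The isolated spike is suppressed while the sustained run survives, which trades a little timing resolution for fewer single-frame insertions.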
Alarms: Results
• Both systems commit many insertions at 0dB SNR, but in different circumstances:
[Figure: Restaurant + alarms (snr 0 ns 6 al 8): spectrogram (time/sec vs. freq/kHz 0-4) with MLP classifier output and Sound object classifier output]
Noise      Neural net system        Sinusoid model system
           Del       Ins   Tot      Del       Ins   Tot
1 (amb)    7 / 25    2     36%      14 / 25   1     60%
2 (bab)    5 / 25    63    272%     15 / 25   2     68%
3 (spe)    2 / 25    68    280%     12 / 25   9     84%
4 (mus)    8 / 25    37    180%     9 / 25    135   576%
Overall    22 / 100  170   192%     50 / 100  147   197%
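The Tot column appears to be deletions plus insertions divided by the number of true alarms, which is why it can exceed 100%. That reading is inferred from the numbers rather than stated on the slide; the short check below reproduces the per-noise totals from the Del and Ins columns.

```python
def total_error_percent(deletions, insertions, n_alarms):
    """Total error as (missed alarms + false alarms) / true alarms, in percent."""
    return round(100.0 * (deletions + insertions) / n_alarms)


# (deletions, insertions, number of alarms) copied from the table.
neural_net = {"amb": (7, 2, 25), "bab": (5, 63, 25),
              "spe": (2, 68, 25), "mus": (8, 37, 25)}
sinusoid = {"amb": (14, 1, 25), "bab": (15, 2, 25),
            "spe": (12, 9, 25), "mus": (9, 135, 25)}

nn_totals = {k: total_error_percent(*v) for k, v in neural_net.items()}
sin_totals = {k: total_error_percent(*v) for k, v in sinusoid.items()}
```

Read this way, the neural net system's errors are dominated by insertions in babble and speech, while the sinusoid system's insertions are concentrated in music.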
Alarms: Summary
• Sinusoid domain
- feature components belong to 1 source
- simple ‘segregation’ (grouping) model
- alarm model as properties of group
- robust to partial feature observation
• Future improvements
- more complex alarm class models
- exploit repetitive structure of alarms
Outline
1. Auditory Scene Analysis
2. Speech Recognition & Mixtures
3. Fragment Recognition
4. Alarm Sound Detection
5. Future Work
- generative models & inference
- model acquisition
- ambulatory audio
Future work
• CASA as generative model parameterization:
[Figure: Generation model: Source models M1, M2 (Model dependence Θ); Source signals Y1, Y2; Channel parameters C1, C2; Received signals X1, X2; Observations O]
[Figure: Analysis structure: Observations O; Fragment formation; Mask allocation {Ki}; Model fitting {Mi}; Likelihood evaluation p(X|Mi,Ki)]
Learning source models
• The speech recognition lesson: use the data as much as possible
- what can we do with unlimited data feeds?
• Data sources
- clean data corpora
- identify near-clean segments in real sound
- build up ‘clean’ views from partial observations?
• Model types
- templates
- parametric/constraint models
- HMMs
• Hierarchic classification vs. individual characterization...
Personal Audio Applications
• Smart PDA records everything
• Only useful if we have index, summaries
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
• Meeting data, ambulatory audio
Summary
• Sound
- carries important information
• Mixtures
- need to segregate different source properties
- fragment-based recognition
• Learning
- information extracted by classification
- models guide segregation
• Alarm sounds
- simple example of fragment recognition
• General sounds
- recognize simultaneous components
- acquire classes from training data
- build index, summary of real-world sound