Scene Analysis for Speech and Audio Recognition

Dan Ellis <[email protected]>
Laboratory for Recognition and Organization of Speech and Audio (LabROSA)
Columbia University, New York
http://labrosa.ee.columbia.edu/
2003-04-16

1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
3. Recognizing Speech in Noise
4. Using Models in Parallel
5. The Listening Machine
Sound, Mixtures & Learning
• Sound
- carries useful information about the world
- complements vision
• Mixtures
- ... are the rule, not the exception
- medium is ‘transparent’ with many sources
- must be handled!
• Learning
- the speech recognition lesson: let the data do the work
- ... like listeners do
[Figure: spectrogram of an everyday sound mixture, 0-12 s, 0-4000 Hz, level -60 to 0 dB]
The problem with recognizing mixtures
“Imagine two narrow channels dug up from the edge of a lake, with handkerchiefs stretched across each one. Looking only at the motion of the handkerchiefs, you are to answer questions such as: How many boats are there on the lake and where are they?”
(after Bregman, 1990)
• Auditory Scene Analysis: describing a complex sound in terms of high-level sources/events
- ... like listeners do
• Hearing is ecologically grounded
- reflects natural scene properties = constraints
- subjective, not absolute
Auditory Scene Analysis
(Bregman 1990)
• How do people analyze sound mixtures?
- break mixture into small elements (in time-freq)
- elements are grouped into sources using cues
- sources have aggregate attributes
• Grouping ‘rules’ (Darwin, Carlyon, ...):
- cues: common onset/offset/modulation, harmonicity, spatial location, ...
[Diagram, after Darwin 1996: frequency analysis feeds onset, harmonicity, and position maps; a grouping mechanism combines them into source properties]
Cues to simultaneous grouping
• Elements + attributes
• Common onset
- simultaneous energy has common source
• Periodicity
- energy in different bands with same cycle
• Other cues
- spatial (ITD/IID), familiarity, ...
[Figure: spectrogram, 0-9 s, 0-8000 Hz, illustrating common-onset and periodicity cues]
The effect of context
• Context can create an ‘expectation’: i.e. a bias towards a particular interpretation
• Bregman’s old-plus-new principle:
- a change is preferably interpreted as addition
• E.g. the continuity illusion
[Figure: continuity-illusion stimulus, alternating tone and noise bursts, 0-1.4 s, up to 4 kHz]
Approaches to sound mixture recognition
• Separate signals, then recognize
- e.g. CASA, ICA
- nice, if you can do it
• Recognize combined signal
- ‘multicondition training’
- combinatorics...
• Recognize with parallel models
- full joint-state space?
- divide signal into fragments, then use missing-data recognition
Independent Component Analysis (ICA)
(Bell & Sejnowski 1995 etc.)
• Drive a parameterized separation algorithm to maximize independence of outputs
• Advantages:
- mathematically rigorous, minimal assumptions
- does not rely on prior information from models
• Disadvantages:
- may converge to local optima...
- separation, not recognition
- does not exploit prior information from models
[Diagram: sources s1, s2 mixed through a11, a12, a21, a22 into observations m1, m2; unmixing parameters adapted along -δMutInfo/δa]
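To make the independence-maximization idea concrete, here is a minimal sketch separating a synthetic two-channel instantaneous mixture. It uses scikit-learn's FastICA, a different estimator from Bell & Sejnowski's infomax gradient in the diagram above but driven by the same independence objective; the signals and mixing matrix are invented for illustration.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 8000)
s1 = np.sign(np.sin(2 * np.pi * 5 * t))    # source 1: square wave
s2 = np.sin(2 * np.pi * 13 * t)            # source 2: sinusoid
S = np.c_[s1, s2]

A = np.array([[1.0, 0.6],                  # mixing matrix a11..a22
              [0.4, 1.0]])
M = S @ A.T                                # observed mixtures m1, m2

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(M)               # recovered sources
```

Note the standard ICA indeterminacies: the recovered sources come back in arbitrary order and scale.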
Outline

1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
- Data-driven
- Top-down constraints
3. Recognizing Speech in Noise
4. Using Models in Parallel
5. The Listening Machine
Computational Auditory Scene Analysis: The Representational Approach
(Cooke & Brown 1993)
• Direct implementation of psych. theory
- ‘bottom-up’ processing
- uses common onset & periodicity cues
• Able to extract voiced speech:
[Diagram: input mixture → front end (signal features/maps: onset, period, freq. modulation) → object formation (discrete time-frequency objects) → grouping rules → source groups]
[Figure: spectrograms, 0.2-1.0 s, 100-3000 Hz, of the input mixture (brn1h.aif) and the extracted voiced speech (brn1h.fi.aif)]
Adding top-down constraints
• Perception is not direct, but a search for plausible hypotheses
• Data-driven (bottom-up)...
- objects irresistibly appear
vs. Prediction-driven (top-down)
- match observations with parameters of a world-model
- need world-model constraints...
[Diagram: bottom-up chain (input mixture → front end signal features → object formation → grouping rules → source groups) vs. prediction-driven chain (input mixture → front end → compare & reconcile prediction errors → hypothesis management → predict & combine periodic and noise components → predicted features)]
Prediction-Driven CASA
(Ellis 1996)
• Explain a complex sound with basic elements
[Figure: prediction-driven analysis of a 10-s city-street recording (level -70 to -40 dB) into noise clouds (Noise1; Noise2, Click1), and wefts (Wefts1-4, Weft5, Wefts6,7, Weft8, Wefts9-12); listeners' identifications: Horn1 10/10, Crash 10/10, Horn2 5/10, Truck 7/10, Horn3 5/10, Squeal 6/10, Horn4 8/10, Horn5 10/10]
Aside: Evaluation
• Evaluation is a big problem for CASA
- what is the goal, really?
- what is a good test domain?
- how do you measure performance?
• SNR improvement
- tricky to derive from before/after signals: correspondence problem
- can do with a fixed filtering mask, but this rewards removing signal as well as noise (see the sketch after this list)
• Speech Recognition (ASR) improvement
- recognizers typically very sensitive to artefacts
• ‘Real’ task?
- mixture corpus with specific sound events...
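To make the fixed-mask SNR measure concrete, a small sketch follows; it assumes access to the pre-mix target and noise (which is what sidesteps the correspondence problem), and the function and parameter names are assumptions, not an established evaluation tool.

```python
# Fixed-filtering-mask SNR: apply the same time-frequency mask to the
# pre-mix target and noise, then compare the surviving energies.
import numpy as np
from scipy.signal import stft

def masked_snr_db(target, noise, mask, fs=8000, nperseg=256):
    """`mask` is a (freq, frames) array matching the STFT grid."""
    _, _, T = stft(target, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)
    sig_e = np.sum(np.abs(mask * T) ** 2)
    noi_e = np.sum(np.abs(mask * N) ** 2)
    return 10 * np.log10(sig_e / noi_e)

# The pitfall from the slide: a mask that deletes everything except a
# few strong target regions raises this ratio even though it is also
# throwing away target signal.
```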
Outline

1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
3. Recognizing Speech in Noise
- Conventional ASR
- Tandem modeling
4. Using Models in Parallel
5. The Listening Machine
Recognizing Speech in Noise
• Standard speech recognition structure:
• How to handle additive noise?
- just train on noisy data: ‘multicondition training’ (see the sketch after the diagram below)
[Diagram: sound → feature calculation → feature vectors → acoustic classifier (acoustic model parameters) → phone probabilities → HMM decoder (word models, e.g. “s ah t”; language model, e.g. p("sat"|"the","cat"), p("saw"|"the","cat")) → phone/word sequence → understanding/application]
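As a concrete illustration of multicondition training data preparation, the sketch below mixes each clean utterance with noise at a randomly chosen SNR; the function name and SNR grid are assumptions in the spirit of the Aurora setup, not its actual recipe.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean/noise power ratio equals snr_db."""
    noise = noise[:len(clean)]              # assume noise is long enough
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

rng = np.random.default_rng(0)
snr_grid = [20, 15, 10, 5, 0]               # typical multicondition SNRs
# noisy = [mix_at_snr(x, noises[rng.integers(len(noises))],
#                     snr_grid[rng.integers(len(snr_grid))])
#          for x in clean_utterances]       # hypothetical corpora
```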
Tandem speech recognition
(with Hermansky, Sharma & Sivadas/OGI, Singh/CMU, ICSI)
• Neural net estimates phone posteriors; but Gaussian mixtures model finer detail
• Combine them!
• Train net, then train GMM on net output
- GMM is ignorant of net output ‘meaning’ (a toy sketch follows the diagrams below)
[Diagram: three architectures compared.
Hybrid connectionist-HMM ASR: input sound → feature calculation → speech features → neural net classifier → phone probabilities → Noway decoder → words.
Conventional ASR (HTK): input sound → feature calculation → speech features → Gaussian mixture models → subword likelihoods → HTK decoder → words.
Tandem modeling: input sound → feature calculation → speech features → neural net classifier → phone probabilities → Gaussian mixture models → subword likelihoods → HTK decoder → words.]
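A toy sketch of the tandem chain follows, assuming frame-level acoustic features with phone labels: train a net to estimate phone posteriors, decorrelate the log-posteriors, then fit Gaussian mixtures to the result. scikit-learn components stand in for the MLP and HTK tools of the actual system, and the arrays are random placeholders for a labeled corpus.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 13))          # stand-in acoustic features
y = rng.integers(0, 5, size=2000)        # stand-in frame phone labels

net = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300,
                    random_state=0).fit(X, y)
post = net.predict_proba(X)              # phone posterior estimates
# log + PCA decorrelation before the Gaussian stage
feats = PCA(n_components=4).fit_transform(np.log(post + 1e-8))

# One GMM per phone class, trained on the net-output features;
# the GMM never sees what the net outputs 'mean'.
gmms = {c: GaussianMixture(n_components=2, random_state=0)
           .fit(feats[y == c]) for c in np.unique(y)}
```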
Tandem system results
• It works very well (‘Aurora’ noisy digits):

System / features     Avg. WER (20-0 dB)   WER ratio to baseline
HTK / mfcc            13.7%                100%
Neural net / mfcc      9.3%                 84.5%
Tandem / mfcc          7.4%                 64.5%
Tandem / msg+plp       6.4%                 47.2%
[Plot: WER (%, log scale, 1-100) vs. SNR (clean, 20, 15, 10, 5, 0, -5 dB, averaged over 4 noises) for the Aurora99 systems; average WER ratio to baseline: HTK GMM 100%, Hybrid connectionist 84.6%, Tandem 64.5%, Tandem + PC 47.2%]
Inside Tandem systems: What’s going on?
• Visualizations of the net outputs
• Neural net normalizes away noise?
- ... just a successful way to build a classifier?
[Figure: “one eight three” (MFP_183A), clean vs. 5 dB SNR in ‘Hall’ noise; panels for each condition over 0-1 s: spectrogram (0-4 kHz), cepstral-smoothed mel spectrum, hidden-layer linear outputs, and phone posterior estimates]
Tandem vs. other approaches
• 50% of word errors corrected over baseline
• Beat a ‘bells and whistles’ system that used many large-vocabulary techniques
[Bar chart: Aurora 2 Eurospeech 2001 evaluation; average relative improvement (%) over the multicondition baseline for each site’s submission: Columbia, Philips, UPC Barcelona, Bell Labs, IBM, Motorola 1, Motorola 2, Nijmegen, ICSI/OGI/Qualcomm, ATR/Griffith, AT&T, Alcatel, Siemens, UCLA, Microsoft, Slovenia, Granada]
Outline

1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
3. Recognizing Speech in Noise
4. Using Models in Parallel
- HMM decomposition/factoring
- Speech fragment decoding
5. The Listening Machine
Using Models in Parallel: HMM decomposition
(e.g. Varga & Moore 1991, Gales & Young 1996)
• Independent state sequences for 2+ component source models
• New combined state space q' = {q1, q2}
- need pdfs for each combination (see the sketch below)
[Diagram: two independent state sequences (model 1, model 2) evolving against a shared observation sequence, with emissions p(X | q1, q2)]
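A minimal sketch of constructing the combined space for two 2-state source models: independence makes the joint transition matrix a Kronecker product, and the combined emission means here use the log-domain max approximation (each channel follows the louder source), an assumption standing in for full pdfs p(X | q1, q2).

```python
import numpy as np

A1 = np.array([[0.9, 0.1], [0.2, 0.8]])    # source-1 transitions
A2 = np.array([[0.7, 0.3], [0.4, 0.6]])    # source-2 transitions
A_joint = np.kron(A1, A2)                  # (4, 4) over q' = (q1, q2)

mu1 = np.array([[0.0, 2.0], [3.0, 1.0]])   # log-spectral means, 2 channels
mu2 = np.array([[1.0, 0.0], [2.0, 2.5]])
# max approximation: each channel follows the louder source
mu_joint = np.maximum(mu1[:, None, :], mu2[None, :, :]).reshape(4, -1)
```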
“One microphone source separation”
(Roweis 2000, Manuel Reyes)
• State sequences → t-f estimates → mask (refiltering step sketched below)
- 1000 states/model (→ 10^6 transition probs.)
- simplify by modeling subbands (coupled HMM)?
[Figure: two speakers’ original voices, their mixture, per-speaker HMM state-mean sequences, and the resulting resynthesis masks (spectrograms, 0-6 s, 0-4000 Hz)]
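The refiltering step of this scheme is simple to sketch: apply a time-frequency mask to the mixture STFT and invert. The HMM state inference that would produce the mask is omitted, and the mask and array names below are assumed for illustration.

```python
import numpy as np
from scipy.signal import stft, istft

def refilter(mixture, mask, fs=8000, nperseg=256):
    """Apply a (freq, frames) t-f mask to `mixture` and resynthesize."""
    _, _, Z = stft(mixture, fs=fs, nperseg=nperseg)
    _, x_hat = istft(Z * mask, fs=fs, nperseg=nperseg)
    return x_hat

# e.g. keep only bins where speaker 1's predicted log-spectrum wins
# (hypothetical arrays logspec1, logspec2 on the same grid as Z):
# mask1 = (logspec1 > logspec2).astype(float)
# voice1 = refilter(mixture, mask1)
```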
Speech Fragment Recognition
(Jon Barker & Martin Cooke, Sheffield)
• Signal separation is too hard! Instead:
- segregate features into partially-observed sources
- then classify
• Made possible by missing-data recognition
- integrate over uncertainty in observations for optimal posterior distribution
• Goal: relate clean speech models P(X|M) to speech-plus-noise mixture observations
- ... and make it tractable
Comparing different segregations
• Standard classification chooses between models M to match source features X:

    M* = argmax_M P(M|X) = argmax_M P(X|M) · P(M) / P(X)

• Mixtures → observed features Y, segregation S, all related by P(X|Y,S)
- spectral features allow clean relationship
[Diagram: observation Y(f) plus segregation S delimit the possible source spectrum X(f)]
• Joint classification of model and segregation:

    P(M,S|Y) = P(M) · ∫ [ P(X|M) · P(X|Y,S) / P(X) ] dX · P(S|Y)

- integral collapses in several cases...
Calculating fragment matches

    P(M,S|Y) = P(M) · ∫ [ P(X|M) · P(X|Y,S) / P(X) ] dX · P(S|Y)

• P(X|M) - the clean-signal feature model
• P(X|Y,S)/P(X) - is X ‘visible’ given segregation?
• Integration collapses some bands...
• P(S|Y) - segregation inferred from observation
- just assume uniform, find S for most likely M
- or: use extra information in Y to distinguish S’s, e.g. harmonicity, onset grouping
• Result: a probabilistically-correct relation between clean-source models P(X|M) and the inferred, recognized source + segregation P(M,S|Y)
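As a sketch of how the per-band integral collapses for a diagonal-Gaussian state model: reliable (‘present’) bands contribute an ordinary likelihood, while masked bands are integrated out under the bounded-marginalization assumption that the hidden source level cannot exceed the observed mixture level. The prior factor P(X) and the sum over segregations are omitted for brevity; names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def missing_data_loglik(y, present, mu, sigma):
    """y: observed log-spectrum; present: boolean mask of reliable bands;
    mu, sigma: per-band Gaussian parameters of one clean-speech state."""
    # reliable bands: ordinary Gaussian likelihood
    ll = np.sum(norm.logpdf(y[present], mu[present], sigma[present]))
    # masked bands: integrate p(x | M) over x <= y (counter-evidence)
    cdf = norm.cdf(y[~present], mu[~present], sigma[~present])
    return ll + np.sum(np.log(cdf + 1e-30))
```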
Speech fragment decoder results
• Simple P(S|Y) model forces contiguous regions to stay together
- big efficiency gain when searching S space
• Clean-models-based recognition rivals trained-in-noise recognition
"1754" + noise
SNR mask
Fragments
FragmentDecoder "1754"
-5 0 5 10 15 20 clean0
10
20
30
40
50
60
70
80
90 AURORA 2000 - Test Set A
WE
R /
%
SNR / dB
MD Soft SNRHTK clean training
HTK multicondition
Multi-source decoding
• Search for more than one source
• Mutually-dependent data masks
• Use e.g. CASA features to propose masks
- locally coherent regions
- more powerful than Roweis masks
• Huge practical advantage over full search
[Diagram: observations Y(t) jointly explained by segregation masks S1(t), S2(t) and state sequences q1(t), q2(t)]
Outline

1. Sound, Mixtures & Learning
2. Computational Auditory Scene Analysis
3. Recognizing Speech in Noise
4. Using Models in Parallel
5. The Listening Machine
- Everyday sound
- Alarms
- Music
The Listening Machine
• Smart PDA records everything
• Only useful if we have index, summaries
- monitor for particular sounds
- real-time description
• Scenarios
- personal listener → summary of your day
- future prosthetic hearing device
- autonomous robots
• Meeting data, ambulatory audio
Alarm sound detection
(Ellis 2001)
• Alarm sounds have particular structure
- people ‘know them when they hear them’
- clear even at low SNRs
• Why investigate alarm sounds?
- they’re supposed to be easy
- potential applications...
• Contrast two systems:
- standard, global features, P(X|M)
- sinusoidal model, fragments, P(M,S|Y)
[Figure: spectrograms of example alarm sounds hrn01, bfr02, buz01 (0-25 s, 0-4 kHz, level -40 to 20 dB)]
Alarms: Results
• Both systems commit many insertions at 0 dB SNR, but in different circumstances:
[Figure: restaurant noise + alarms (snr 0, ns 6, al 8); spectrogram (0-4 kHz) with MLP classifier output vs. sound-object classifier output]
Noise       Neural net system           Sinusoid model system
            Del       Ins   Tot         Del       Ins   Tot
1 (amb)     7 / 25    2     36%         14 / 25   1     60%
2 (bab)     5 / 25    63    272%        15 / 25   2     68%
3 (spe)     2 / 25    68    280%        12 / 25   9     84%
4 (mus)     8 / 25    37    180%         9 / 25   135   576%
Overall     22 / 100  170   192%        50 / 100  147   197%
Music Applications
• Music as a complex, information-rich sound
• Applications of separation & recognition:
- note/chord detection & classification
- singing detection (→ genre identification ...)
[Figure: “DYWMB”: alignments to MIDI note 57 mapped back to the original audio (note-by-note spectrogram excerpts)]
[Figure: singing detection over time for Track 117 - Aimee Mann (dynvox=Aimee, unseg=Aimee), with per-artist outputs for Boards of Canada, Sugarplastic, Belle & Sebastian, Mercury Rev, Cornelius, Richard Davies, DJ Shadow, Mouse on Mars, The Flaming Lips, Aimee Mann, Wilco, XTC, Beck, Built to Spill, Jason Falkner, Oval, Arto Lindsay, Eric Matthews, The Moles, The Roots, Michael Penn, and the true voice label]
Summary
• Sound
- ... contains much valuable information at many levels
- intelligent systems need to use this information
• Mixtures
- ... are an unavoidable complication when using sound
- look in the right time-frequency place to find points of dominance
• Learning
- need to acquire constraints from the environment
- recognition/classification as the real task
References

A. Bregman. Auditory Scene Analysis, MIT Press, 1990.
A. Bell and T. Sejnowski. “An information-maximization approach to blind separation and blind deconvolution,” Neural Computation, 7: 1129-1159, 1995. http://citeseer.nj.nec.com/bell95informationmaximization.html
A. Berenzweig, D. Ellis, S. Lawrence. “Using Voice Segments to Improve Artist Classification of Music,” Proc. AES-22 Intl. Conf. on Virt., Synth., and Ent. Audio, Espoo, Finland, June 2002. http://www.ee.columbia.edu/~dpwe/pubs/aes02-aclass.pdf
A. Berenzweig, D. Ellis, S. Lawrence. “Anchor Space for Classification and Similarity Measurement of Music,” Proc. ICME-03, Baltimore, July 2003. http://www.ee.columbia.edu/~dpwe/pubs/icme03-anchor.pdf
M. Cooke and G. Brown. “Computational auditory scene analysis: Exploiting principles of perceived continuity,” Speech Communication, 13: 391-399, 1993.
D. Ellis. Prediction-driven computational auditory scene analysis, Ph.D. dissertation, MIT, 1996. http://www.ee.columbia.edu/~dpwe/pubs/pdcasa.pdf
D. Ellis. “Detecting Alarm Sounds,” Proc. Workshop on Consistent & Reliable Acoustic Cues CRAC-01, Denmark, Sept. 2001. http://www.ee.columbia.edu/~dpwe/pubs/crac01-alarms.pdf
M. Gales and S. Young. “Robust continuous speech recognition using parallel model combination,” IEEE Tr. Speech and Audio Proc., 4(5): 352-359, Sept. 1996. http://citeseer.nj.nec.com/gales96robust.html
H. Hermansky, D. Ellis and S. Sharma. “Tandem connectionist feature extraction for conventional HMM systems,” Proc. ICASSP, Istanbul, June 2000. http://citeseer.nj.nec.com/hermansky00tandem.html