Post on 23-Jun-2020
LabROSA: Laboratory for the Recognition and Organization of Speech and Audio
Estimating Single-Channel Source Separation Masks
Relevance Vector Machine Classifiers vs. Pitch-Based Masking
Ron J. Weiss, Daniel P. W. Ellis
{ronw,dpwe}@ee.columbia.edu
LabROSA, Columbia University
Estimating Single-Channel Source Separation Masks – p. 1
Single-Channel Source Separation
[Figure: spectrograms (frequency in Hz vs. time in seconds): Speech + Babble noise = Mixture (10 dB SNR)]
• Given a monaural signal composed of multiple sources
• e.g. multiple speakers, speech + music, speech + background noise
• Want to separate the constituent sources
• Useful for noise-robust speech recognition and hearing aids
Missing Data Masks
[Figure: spectrogram of the Mixture and the corresponding mask marking regions where speech energy dominates (frequency in Hz vs. time in seconds)]
• Leverage the sparsity of audio sources: only one source is likely to have a significant amount of energy in any given time-frequency cell
• If we can decide which cells are dominated by the source of interest (i.e. which have local SNR greater than some threshold), we can filter out noise-dominated cells ("refiltering" [3])
• Create a binary mask that labels each cell of the spectrogram as missing or reliable
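The mask construction above can be sketched directly; a minimal NumPy illustration, assuming (hypothetically) that we have separate magnitude spectrograms of the speech and the noise, so the ground-truth local SNR is computable:

```python
import numpy as np

def ideal_binary_mask(speech_spec, noise_spec, threshold_db=0.0):
    """Label each time-frequency cell reliable (True) when the local
    SNR of the speech exceeds the threshold, missing (False) otherwise."""
    local_snr_db = 20.0 * np.log10(np.abs(speech_spec) /
                                   (np.abs(noise_spec) + 1e-12))
    return local_snr_db > threshold_db

def refilter(mixture_spec, mask):
    """'Refiltering': zero out the noise-dominated cells of the mixture."""
    return mixture_spec * mask

# Toy example: 4 frequency bands x 3 frames (magnitudes, made up).
speech = np.array([[2.0, 0.1, 3.0],
                   [0.2, 1.5, 0.1],
                   [1.0, 1.0, 0.5],
                   [0.1, 0.1, 0.1]])
noise = np.full_like(speech, 0.5)
mask = ideal_binary_mask(speech, noise)
filtered = refilter(speech + noise, mask)
```

In practice the clean sources are unknown, which is exactly why the mask must be estimated (by a classifier or by pitch cues) rather than computed this way.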
Mask Estimation As Classification [4]
• Goal is to classify each spectrogram cell as reliable (dominated by the speech signal) or not
• Separate classifier for each frequency band
• Train on speech mixed with a variety of different noise signals (babble noise, white noise, speech-shaped noise, etc.) at a variety of different levels (-5 to 10 dB SNR)
• Features: raw spectrogram frames
  • current frame + previous 5 frames (~40 ms) of context
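A hedged sketch of this feature construction; the function name `stack_context` and the zero-padding of the first frames are assumptions, not from the slides:

```python
import numpy as np

def stack_context(spectrogram, n_context=5):
    """Build one feature vector per frame from the current frame plus the
    previous `n_context` frames (earliest frames are zero-padded).
    spectrogram: (n_freq, n_frames); returns (n_frames, n_freq * (n_context + 1))."""
    n_freq, n_frames = spectrogram.shape
    padded = np.concatenate(
        [np.zeros((n_freq, n_context)), spectrogram], axis=1)
    feats = [padded[:, t:t + n_context + 1].ravel() for t in range(n_frames)]
    return np.array(feats)

spec = np.arange(12.0).reshape(3, 4)   # 3 bands, 4 frames
X = stack_context(spec, n_context=5)   # each row: 6 frames x 3 bands = 18 dims
```

One such feature matrix would be built per frequency band's classifier, paired with that band's reliable/missing labels.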
The Relevance Vector Machine [5]
• Bayesian treatment of the SVM
• Kernel classifier of the form:
y(z | w, v) = Σ_n w_n K(z, v_n) + w_0

• z = data point to be classified
• v_n = nth support vector
• w_n = weight associated with the nth support vector
RVM Versus SVM
• Pros
  • Huge improvement in sparsity over the SVM (~50 relevance vectors vs. ~450 support vectors per classifier on this task), so classification is faster
  • Wrap y in a sigmoid squashing function to estimate the posterior probability of class membership:

    P(t = 1 | z, w, v) = 1 / (1 + e^(-y(z | w, v)))

  • Masks are no longer strictly binary: the RVM can estimate the probability that each spectrogram cell is reliable
• Cons
  • RVM training is significantly slower
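The kernel expansion and sigmoid squashing can be sketched as follows; the RBF kernel and all of the numbers are illustrative assumptions, since the slides do not specify K or any trained weights:

```python
import numpy as np

def rbf_kernel(z, v, gamma=1.0):
    """Gaussian (RBF) kernel -- an assumed choice; the slides do not fix K."""
    return np.exp(-gamma * np.sum((z - v) ** 2))

def rvm_decision(z, weights, rel_vectors, bias, gamma=1.0):
    """y(z | w, v) = sum_n w_n K(z, v_n) + w_0."""
    return sum(w * rbf_kernel(z, v, gamma)
               for w, v in zip(weights, rel_vectors)) + bias

def reliable_posterior(z, weights, rel_vectors, bias, gamma=1.0):
    """Soft mask value for one cell: P(t = 1 | z) = sigmoid(y(z))."""
    y = rvm_decision(z, weights, rel_vectors, bias, gamma)
    return 1.0 / (1.0 + np.exp(-y))

# Tiny example with two hypothetical relevance vectors and weights.
rvs = [np.array([0.0, 0.0]), np.array([1.0, 1.0])]
ws = [1.5, -0.5]
p = reliable_posterior(np.array([0.0, 0.0]), ws, rvs, bias=0.0)
```

Thresholding p at 0.5 recovers a hard (binary) mask; using p directly gives the soft mask compared in the experiments.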
CASA Pitch-based Masking [1]
[Diagram: Noisy Speech → Cochlear Filtering / Correlogram → Autocorrelogram Compensation → Cross-channel Integration / HMM Pitch Tracking → Pitch Tracks]
• Most energy in speech signals is associated with the pseudo-periodic segments of vowel sounds
• Get envelopes of auditory filter outputs
• Find strong periodicities in the short-time autocorrelation of each envelope
• Sum across channels to find a single dominant periodicity
• Channels whose autocorrelation indicates energy at this period are added to the target mask
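A rough sketch of the periodicity-selection step, assuming precomputed filter-channel envelopes and a restricted lag search range in place of the full cochlear-filterbank, compensation, and HMM-tracking front end:

```python
import numpy as np

def autocorr(x):
    """Short-time autocorrelation (biased estimate, non-negative lags)."""
    n = len(x)
    x = x - x.mean()
    return np.array([np.dot(x[:n - lag], x[lag:]) for lag in range(n)])

def pitch_channel_mask(envelopes, min_lag=2, support=0.5):
    """envelopes: (n_channels, n_samples) auditory-filter envelopes.
    Sum per-channel autocorrelations, pick the dominant lag (period),
    and keep channels whose own autocorrelation has energy at that lag."""
    acs = np.array([autocorr(env) for env in envelopes])
    summary = acs.sum(axis=0)
    lag = min_lag + int(np.argmax(summary[min_lag:]))   # dominant period
    in_mask = acs[:, lag] > support * acs[:, 0]         # per-channel test
    return in_mask, lag

# Toy case: one "voiced" channel with period 25 samples, one noise channel.
t = np.arange(200)
periodic = np.sin(2 * np.pi * t / 25)
noise = np.random.default_rng(0).standard_normal(200) * 0.1
mask, lag = pitch_channel_mask(np.stack([periodic, noise]), min_lag=15)
```

Here `min_lag` stands in for a plausible pitch-period search range; the threshold `support` is likewise an assumed simplification.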
Missing Data Reconstruction [2]
• What if a significant part of the signal is missing?
• Want to fill in the blanks in the spectrogram of the mixed signal
• Do MMSE reconstruction of the missing dimensions using a signal model of spectrogram frames: a GMM trained on clean speech
• Marginalize over missing dimensions to do inference
P(z_d | k) = P(r_d) N(z_d | μ_{k,d}, σ_{k,d}) + (1 − P(r_d)) ∫ N(z_d | μ_{k,d}, σ_{k,d}) dz_d
• The MMSE estimator reconstructs by mixing the observed signal and the GMM reconstruction based on the probability that each cell is reliable:

x_d = E[x_d | z] = P(r_d) z_d + (1 − P(r_d)) Σ_k P(k | z) μ_{k,d}
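The estimator might be sketched as follows; as a simplification not in the slides, P(k | z) is computed by simply dropping (rather than bound-marginalizing) the unreliable dimensions of a diagonal-covariance GMM:

```python
import numpy as np

def gauss_logpdf(x, mu, sigma):
    """Elementwise univariate Gaussian log-density."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def mmse_reconstruct(z, p_reliable, means, sigmas, priors):
    """z: observed frame (D,); p_reliable: P(r_d) per dimension;
    means, sigmas: (K, D) diagonal GMM; priors: (K,).
    Returns the MMSE mix of observation and GMM reconstruction."""
    reliable = p_reliable >= 0.5
    log_post = np.log(priors) + np.array(
        [gauss_logpdf(z[reliable], means[k, reliable],
                      sigmas[k, reliable]).sum() for k in range(len(priors))])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()                      # P(k | z)
    gmm_recon = post @ means                # sum_k P(k|z) mu_k
    return p_reliable * z + (1 - p_reliable) * gmm_recon

# Toy 2-component, 3-dimensional model (all numbers illustrative).
means = np.array([[1.0, 2.0, 3.0], [5.0, 6.0, 7.0]])
sigmas = np.ones_like(means)
priors = np.array([0.5, 0.5])
z = np.array([1.0, 2.0, -4.0])              # last dimension corrupted
p_r = np.array([1.0, 1.0, 0.0])             # last dimension marked missing
x_hat = mmse_reconstruct(z, p_r, means, sigmas, priors)
```

The reliable dimensions pick out component 0, so the missing dimension is filled from that component's mean while the observed dimensions pass through.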
Experiments
• Speech signal: single male speaker from an audio book recording
• Training noise signals: babble noise, speech-shaped noise, factory noise 1
• Out-of-model noise signals used for testing: car noise, white noise, factory noise 2, music
• RVM trained on 20 s of speech + noise
• 512-component GMM trained on 80 s of clean speech
Experiments - Results
[Figure: reconstruction SNR (dB) vs. mixed-signal SNR (dB), speech corrupted by factory2 noise; curves: Baseline, GT refiltering, GT GMM recon, RVM HM refilt, RVM HM recon, RVM SM refilt, RVM SM recon, CASA refilt, CASA recon]
Experiments - Results
• GMM reconstruction outperforms simple refiltering, since the GMM reconstruction can fill in the blanks
• Soft masks give about a 1 dB improvement over hard masks
• CASA masks are not as good as RVM masks
• Still room for improvement in mask estimation, based on performance using ground truth masks
Experiments - Results
[Figure: left, false rate (%) vs. mixed-signal SNR (dB): RVM false accept rate, RVM false reject rate, CASA false accept rate, CASA false reject rate; right, distortion level (dB) vs. mixed-signal SNR (dB): RVM mask added noise / deleted signal, CASA mask added noise / deleted signal; speech corrupted by factory2 noise]
• The false accept rate of the CASA masks is much higher than that of the RVM masks
• The major problem with the CASA mask is added noise; the signal it deletes is not very significant in terms of signal energy
• The RVM mask deletes a significant amount of signal energy
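The false-accept/false-reject and added-noise/deleted-signal measures discussed here could plausibly be computed as follows; this is a sketch, since the slides do not give exact definitions:

```python
import numpy as np

def mask_error_rates(est_mask, gt_mask, mix_energy):
    """est_mask, gt_mask: boolean mask arrays; mix_energy: per-cell energy.
    False accepts admit noise-dominated cells (added noise); false
    rejects drop speech-dominated cells (deleted signal)."""
    fa = est_mask & ~gt_mask
    fr = ~est_mask & gt_mask
    fa_rate = fa.mean()                 # fraction of cells falsely accepted
    fr_rate = fr.mean()                 # fraction of cells falsely rejected
    added_noise = mix_energy[fa].sum()  # energy admitted in error
    deleted_signal = mix_energy[fr].sum()  # energy dropped in error
    return fa_rate, fr_rate, added_noise, deleted_signal

# Toy 2x2 masks and energies (illustrative numbers).
gt = np.array([[True, False], [True, True]])
est = np.array([[True, True], [False, True]])
energy = np.array([[4.0, 1.0], [9.0, 1.0]])
fa_rate, fr_rate, added, deleted = mask_error_rates(est, gt, energy)
```

Note how a single high-energy false reject (the 9.0 cell) dominates the deleted-signal measure even though the false-reject *rate* equals the false-accept rate, mirroring the distinction drawn above.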
Experiments - Results
[Figure: MI(masks; ground truth) in bits vs. mixed-signal SNR (dB), factory2 noise; curves: RVM+CASA, RVM, CASA]
• The RVM mask is significantly more informative about the ground truth mask than the CASA mask
• Some information in the CASA mask is not captured by the RVM mask
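A minimal sketch of estimating the mutual information between a binary mask and the ground truth from their joint empirical distribution (the estimator choice is an assumption; the slides do not specify one):

```python
import numpy as np

def binary_mi(a, b):
    """Mutual information (bits) between two binary masks, estimated
    from their joint empirical distribution over cells."""
    a = a.ravel().astype(int)
    b = b.ravel().astype(int)
    mi = 0.0
    for i in (0, 1):
        for j in (0, 1):
            p_ij = np.mean((a == i) & (b == j))
            p_i, p_j = np.mean(a == i), np.mean(b == j)
            if p_ij > 0:
                mi += p_ij * np.log2(p_ij / (p_i * p_j))
    return mi

gt = np.array([0, 0, 1, 1])
identical = np.array([0, 0, 1, 1])     # perfect mask: MI = mask entropy
independent = np.array([0, 1, 0, 1])   # uninformative mask: MI = 0
mi_same = binary_mi(gt, identical)
mi_indep = binary_mi(gt, independent)
```

MI of the joint (RVM, CASA) pair against the ground truth, as plotted above, would extend the same idea to a four-valued joint mask variable.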
Experiments - Results
[Figure: reconstruction SNR (dB) vs. mixed-signal SNR (dB) for RVM hard mask GMM reconstruction (left) and CASA hard mask GMM reconstruction (right); noise types: factory1, speech noise, babble, white noise, factory2, music, volvo, white, pink; Baseline shown for reference]
• Clear SNR boost when the mixed signal is at low SNR
• RVM clearly outperforms the CASA system
• Both systems perform poorly on music noise
  • RVM not trained on highly pitched interference
  • CASA system can't distinguish between voiced speech and musical instrument harmonics
Spectrograms
[Figure: spectrograms (freq in kHz vs. time in s, level in dB): Speech + factory2 noise; Clean speech signal; Ground truth mask; RVM mask, RVM refiltering, RVM GMM reconstruction; CASA mask, CASA refiltering, CASA GMM reconstruction]
References
[1] K. S. Lee and D. P. W. Ellis. Voice activity detection in personal audio recordings using autocorrelogram compensation. In Proc. Interspeech ICSLP-06, Pittsburgh, PA, 2006. Submitted.
[2] B. Raj and R. Singh. Reconstructing spectral vectors with uncertain spectrographic masks for robust speech recognition. In IEEE Workshop on Automatic Speech Recognition and Understanding, pages 27–32, November 2005.
[3] S. T. Roweis. Factorial models and refiltering for speech separation and denoising. In Proceedings of EuroSpeech, 2003.
[4] M. L. Seltzer, B. Raj, and R. M. Stern. Classifier-based mask estimation for missing feature methods of robust speech recognition. In Proceedings of ICSLP, 2000.
[5] M. Tipping. The relevance vector machine. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 652–658. MIT Press, 2000.