A Two-Stage Mask Estimation Approach to Robust Speaker Verification
(Mask Estimation and Refinement for MFT-based Robust Speaker Verification)
Yali Zhao, Lei Xie, and Zhonghua Fu
Shaanxi Provincial Key Laboratory of Speech and Image Information Processing
School of Computer Science, Northwestern Polytechnical University, Xi'an, China
Portland, Oregon, USA, September 9–13, 2012
Objective: Propose a two-stage mask estimation approach to robust speaker verification (SV) in noisy environments.
Scenario: Semi-blind, i.e., the location of the target speaker is fixed while the locations of all interferers are unknown.
Steps:
∙ Use a dual microphone and a semi-blind degenerate unmixing estimation technique (DUET) to estimate an initial binary mask
∙ Refine the mask based on the time and frequency histograms of the initial mask
Main parts of the system:
∙ STFT
∙ Speaker Feature
∙ Spatial Feature
∙ Mask Estimation
∙ Mask Refinement
∙ Speaker Verification: Use marginalization to deal with the missing features and do missing-feature speaker recognition

Mask refinement
∙ The initial binary mask estimated from acoustic and spatial features contains many errors, especially in adverse environments
∙ Speech is sparse in the T-F domain
Based on the above two facts, we can draw the following two conclusions:
∙ For some types of colored noise, the power concentrates continuously in several subbands, which makes the surviving bins within those subbands unreliable
∙ Isolated bins with weak power are also possibly unreliable
We refine the initial mask by keeping those T-F portions with high confidence. Two histograms are produced from the estimated initial binary mask:
· Frequency histogram
· Time histogram
A refining matrix is defined to keep only the high-confidence regions.
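The poster does not give the exact form of the refining matrix, so the following is only a minimal sketch of how histogram-based refinement could work. The function name `refine_mask`, the two thresholds, and the specific keep/drop rules are assumptions made for illustration: subbands that survive in almost every frame are treated as colored-noise bands, and frames with very few surviving bins are treated as isolated, low-confidence bins.

```python
import numpy as np

def refine_mask(mask, freq_thresh=0.8, time_thresh=0.1):
    """Refine a binary T-F mask of shape (n_freq, n_frames).

    Hypothetical rules (not taken from the poster):
      - drop a subband if it survives in more than `freq_thresh` of all
        frames (continuous survival contradicts speech sparsity);
      - drop a frame if fewer than `time_thresh` of its bins survive
        (isolated bins are considered unreliable).
    """
    mask = np.asarray(mask, dtype=bool)
    n_freq, n_frames = mask.shape

    # Both histograms are computed from the initial mask.
    freq_hist = mask.sum(axis=1)  # surviving bins per frequency subband
    time_hist = mask.sum(axis=0)  # surviving bins per time frame

    keep_freq = freq_hist <= freq_thresh * n_frames
    keep_time = time_hist >= time_thresh * n_freq

    # Refining matrix: True only where both subband and frame are trusted.
    refining = keep_freq[:, None] & keep_time[None, :]
    return mask & refining
```

The thresholds would need tuning on held-out data; the values above are placeholders rather than settings reported by the authors.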
Binary mask estimation
It is a classification problem to estimate the binary mask.
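As an illustration of treating mask estimation as per-bin classification in the semi-blind setting, the sketch below labels a T-F bin as target when its DUET-style inter-microphone delay estimate is close to the delay of the fixed target position. This is a simplification and not the poster's classifier: it uses the delay cue only, and the names `spatial_mask`, `target_delay_s`, and `tol_s` are hypothetical.

```python
import numpy as np

def spatial_mask(X_left, X_right, freqs_hz, target_delay_s, tol_s=5e-5):
    """Label each T-F bin as target (True) or interference (False).

    X_left, X_right: complex STFTs, shape (n_freq, n_frames).
    freqs_hz: center frequency of each STFT row in Hz.
    target_delay_s: assumed known delay between the microphones for the
        fixed target position (the semi-blind assumption).
    tol_s: hypothetical tolerance on the delay estimate.
    """
    eps = 1e-12
    # DUET model: X_right / X_left = a * exp(-j * omega * delay)
    ratio = (X_right + eps) / (X_left + eps)
    omega = 2.0 * np.pi * np.asarray(freqs_hz, dtype=float)[:, None]
    delay = -np.angle(ratio) / np.maximum(omega, eps)
    return np.abs(delay - target_delay_s) <= tol_s
```

With a 4 cm microphone spacing and a sound speed of roughly 343 m/s, the largest possible delay is about 117 microseconds, so the phase difference stays below pi up to 4 kHz and the delay estimate does not wrap at an 8 kHz sampling rate.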
Web: http://lxie.nwpu-aslp.org
Email: lxie@nwpu.edu.cn
Corpus: clean speech from the AURORA-2 corpus
∙ We use the “TRAIN” set for UBM training
∙ All 104 speakers from the “TESTA” set are used as the target speakers
∙ Noise types include female speech, pure music, car noise and white noise
Experimental data with spatial information: recorded in a quiet room
∙ The clean speech from AURORA-2 is played through the loudspeaker at the target position
∙ The noise is played through the loudspeaker at the 12 interfering positions
[Figure: recording layout showing two microphones (L, R) spaced 4 cm apart, a loudspeaker as the target source, a loudspeaker as the interference source at numbered positions 1 to 12, and distance labels of 50 cm and 100 cm.]
∙ Speech signal: 8 kHz, 16 bit
∙ Frame: 20 ms with 10 ms shift
∙ Speaker feature: 24 conventional linear-frequency log power spectra components
∙ Speaker recognizer: GMM-UBM (64 Gaussian components)
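Marginalization with a diagonal-covariance GMM amounts to evaluating each Gaussian only over the feature dimensions that the mask marks as reliable. The sketch below shows this for a single frame; the function name and argument layout are assumptions, and the poster's actual scoring code is not shown in the source.

```python
import numpy as np

def marginal_loglik(x, reliable, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM,
    using only the dimensions flagged as reliable.

    x: (D,) feature vector, e.g. log power spectra components.
    reliable: (D,) boolean mask for this frame.
    weights: (K,), means: (K, D), variances: (K, D).
    """
    r = np.asarray(reliable, dtype=bool)
    if not r.any():
        return 0.0  # no reliable evidence: the frame contributes nothing

    diff = x[r] - means[:, r]
    log_comp = (np.log(weights)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances[:, r]), axis=1)
                - 0.5 * np.sum(diff ** 2 / variances[:, r], axis=1))

    # Log-sum-exp over mixture components for numerical stability.
    m = log_comp.max()
    return m + np.log(np.sum(np.exp(log_comp - m)))
```

In a GMM-UBM setup, a verification score would typically be the difference between this quantity under the target speaker model and under the UBM, summed over frames; that scoring detail is a common convention and is assumed here rather than stated on the poster.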
Robust speaker verification systems
Systems and methods:
∙ (1) MFCC: 39 MFCC_E_D_A_Z (by HCopy from HTK toolkit)
∙ Local SNR criterion
∙ Proposed mask estimation method
∙ Ideal mask (target speech and noise are known)
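When the target speech and the noise are both known, a common way to build a reference mask is to threshold the local SNR in each T-F bin. The sketch below follows that convention; whether the systems listed above use exactly this rule, and the 0 dB threshold, are assumptions.

```python
import numpy as np

def ideal_binary_mask(speech_power, noise_power, lc_db=0.0):
    """Binary mask from known speech and noise power spectrograms.

    A bin is kept (True) when its local SNR in dB exceeds `lc_db`.
    Both inputs have shape (n_freq, n_frames).
    """
    eps = 1e-12  # avoids division by zero and log of zero
    speech_power = np.asarray(speech_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    local_snr_db = 10.0 * np.log10((speech_power + eps) / (noise_power + eps))
    return local_snr_db > lc_db
```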
The following table shows the results of different robust speaker verification systems.
Main Conclusions:
∙ The MCMM method shows inferior performance compared with the MFCC baseline in the presence of a speech interferer
∙ In the white-noise-corrupted environment, the performance of our method is a little worse than that of MCMM
∙ The DET curves show that mask refinement achieves performance improvement in music, car noise and white noise conditions
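For readers who want to reproduce DET-style comparisons, the points of a DET curve are the miss and false-alarm rates obtained by sweeping a decision threshold over the verification scores. The sketch below is a generic illustration with hypothetical function names; it is not the evaluation code used for the poster.

```python
import numpy as np

def det_points(target_scores, impostor_scores):
    """Miss and false-alarm rates at every observed score threshold."""
    tgt = np.asarray(target_scores, dtype=float)
    imp = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([tgt, imp]))
    # A trial is accepted when its score is >= the threshold.
    p_miss = np.array([(tgt < t).mean() for t in thresholds])
    p_fa = np.array([(imp >= t).mean() for t in thresholds])
    return thresholds, p_miss, p_fa

def equal_error_rate(target_scores, impostor_scores):
    """Approximate EER: the point where miss and false-alarm rates are closest."""
    _, p_miss, p_fa = det_points(target_scores, impostor_scores)
    i = int(np.argmin(np.abs(p_miss - p_fa)))
    return 0.5 * (p_miss[i] + p_fa[i])
```

Plotting `p_miss` against `p_fa` on normal-deviate axes gives the usual DET presentation.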