Yali Zhao, Lei Xie and Zhonghua Fu · The following...

Date post: 07-Oct-2020
A Two-Stage Mask Estimation Approach to Robust Speaker Verification
(Mask Estimation and Refinement for MFT-based Robust Speaker Verification)

Yali Zhao, Lei Xie and Zhonghua Fu
Shaanxi Provincial Key Laboratory of Speech and Image Information Processing
School of Computer Science, Northwestern Polytechnical University, Xi'an, China
Web: http://lxie.nwpu-aslp.org
Email: lxie@nwpu.edu.cn

Portland, Oregon, USA
September 9–13, 2012

Objective
Propose a two-stage mask estimation approach to robust speaker verification (SV) in noisy environments.

Scenario
Semi-blind, i.e., the location of the target speaker is fixed while the locations of all interferers are unknown.

Steps
∙ Use a dual microphone and a semi-blind degenerate unmixing estimation technique (DUET) to estimate an initial binary mask
∙ Refine the mask based on the time and frequency histograms of the initial mask

Main parts of the system:
∙ STFT
∙ Speaker Feature
∙ Spatial Feature
∙ Mask Estimation
∙ Mask Refinement
∙ Speaker Verification: use marginalization to deal with the missing features and do missing feature speaker recognition
[Figure: system block diagram. Labels surviving extraction: Ω_tar, M, O, b, r, λ, x.]

Binary mask estimation
∙ Estimating the binary mask is a classification problem

Mask refinement
∙ The initial binary mask estimated from acoustic and spatial features contains many errors, especially in adverse environments
∙ Speech is sparse in the T-F domain
∙ Based on the above two facts, we can draw the following two conclusions:
  · Isolated bins with weak power are possibly unreliable
  · For some types of color noise, the power concentrates continuously in several subbands, which makes the surviving bins within those subbands unreliable as well
∙ We refine the initial mask by keeping those T-F portions with high confidence
∙ Two histograms are produced from the estimated initial binary mask:
  · Frequency histogram
  · Time histogram
∙ A refining matrix is defined to keep only the high confidence regions

Experiments
Corpus: clean speech from the AURORA-2 corpus
∙ We use the "TRAIN" set for UBM training
∙ All 104 speakers from the "TESTA" set are used as the target speakers
∙ Noise types include female speech, pure music, car noise and white noise

Experimental data with spatial information: recorded in a quiet room
∙ The clean speech from AURORA-2 is played through the loudspeaker at the target position
∙ The noise is played through the loudspeaker at the 12 interfering positions
∙ Speech signal: 8 kHz, 16 bit
∙ Frame: 20 ms with 10 ms shift
∙ Speaker feature: 24 conventional linear frequency log power spectra components
[Figure: recording setup. Two microphones (L, R) 4 cm apart, a loudspeaker as target source, and a loudspeaker as interference source at positions numbered 1 to 12; distances labeled 50 cm and 100 cm.]

Speaker recognizer: GMM-UBM (64 Gaussian components)

Robust speaker verification systems
Systems and methods:
∙ (1) MFCC: 39 MFCC_E_D_A_Z (by HCopy from the HTK toolkit)
∙ Local SNR criterion
∙ Proposed mask estimation method
∙ Ideal mask (target speech and noise are known)

The following table shows the results of different robust speaker verification systems.

Main Conclusions:
∙ The MCMM method shows inferior performance compared with the MFCC baseline in the presence of a speech interferer
∙ In the white noise corrupted environment, the performance of our method is a little worse than that of MCMM
∙ The DET curves show that mask refinement achieves performance improvement in music, car noise and white noise conditions
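The refinement step described above (two histograms computed from the initial binary mask, then a refining matrix that keeps only high-confidence regions) might be sketched as follows. The poster does not state the thresholds or the exact decision rule, so everything named here is an illustrative assumption rather than the authors' method: the function `refine_mask`, the parameters `freq_ratio` and `time_ratio`, and the rule that a T-F bin is kept only when both its frequency row and its time frame contain a large enough share of surviving bins.

```python
def refine_mask(mask, freq_ratio=0.5, time_ratio=0.5):
    """Refine a binary T-F mask using its frequency and time histograms.

    mask[f][t] is 0 or 1; rows are frequency bins, columns are frames.
    freq_ratio and time_ratio are hypothetical thresholds (not from the poster).
    """
    n_freq = len(mask)
    n_time = len(mask[0]) if n_freq else 0

    # Frequency histogram: number of surviving bins in each frequency row.
    freq_hist = [sum(row) for row in mask]
    # Time histogram: number of surviving bins in each frame (column).
    time_hist = [sum(mask[f][t] for f in range(n_freq)) for t in range(n_time)]

    # Assumed rule: a row or frame is "high confidence" when the share of
    # surviving bins in it reaches the corresponding threshold.
    keep_freq = [1 if n_time and h / n_time >= freq_ratio else 0 for h in freq_hist]
    keep_time = [1 if n_freq and h / n_freq >= time_ratio else 0 for h in time_hist]

    # Refining matrix: 1 only where both the row and the frame are kept.
    refining = [[keep_freq[f] * keep_time[t] for t in range(n_time)]
                for f in range(n_freq)]

    # Apply the refining matrix to the initial mask element by element.
    refined = [[mask[f][t] * refining[f][t] for t in range(n_time)]
               for f in range(n_freq)]
    return refined, freq_hist, time_hist


# Toy initial mask: 4 frequency bins by 4 frames, with one isolated bin at (2, 3).
initial = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
refined, freq_hist, time_hist = refine_mask(initial, freq_ratio=0.5, time_ratio=0.5)
for row in refined:
    print(row)
```

In this toy case the isolated bin in the third row is removed because both its row and its frame fall below the assumed thresholds, while the dense regions of the mask are left unchanged.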