
Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition

Yanmin Qian, et al. "Very Deep Convolutional Neural Networks for Noise Robust Speech Recognition." IEEE Transactions on Audio, Speech, and Language Processing. Accepted for publication in a future issue.

Presented by Peidong Wang, 09/09/2016

Content

• Abstract
• Review of Convolutional Neural Networks
• Model Description
• Experiments
• Conclusion


Abstract

• ASR: Previous attempts to increase the number of CNN layers from 2 to 3 gave a degradation in accuracy.
• CV: Recent work on images shows that classification accuracy can be improved by increasing the number of convolutional layers with a carefully tuned architecture.
• ASR: The Very Deep Convolutional Neural Network uses up to 10 convolutional layers and achieves a WER of 8.81% on Aurora 4, the best published result.


Review of Convolutional Neural Networks

• A Conventional Convolutional Neural Network (CNN)

From: Slides in CSE 5526 Neural Networks

Review of Convolutional Neural Networks

• Convolution and Pooling (Subsampling)
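As a minimal illustration of the two operations named above (a sketch by this summary, not code from the slides or the paper), a 1-D "valid" convolution followed by non-overlapping max pooling can be written as:

```python
# Illustrative sketch: a 1-D valid convolution followed by non-overlapping
# max pooling, the two basic CNN building blocks reviewed on this slide.

def conv1d_valid(x, w):
    """Valid convolution: output length = len(x) - len(w) + 1."""
    k = len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(len(x) - k + 1)]

def max_pool1d(x, size):
    """Non-overlapping max pooling: output length = len(x) // size."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

features = [1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 1.0, 2.0]
edge = [1.0, -1.0]                        # a toy 2-tap difference filter
conv_out = conv1d_valid(features, edge)   # length 8 - 2 + 1 = 7
pooled = max_pool1d(conv_out, 2)          # length 7 // 2 = 3
```

Real CNNs apply many such filters in parallel (producing feature maps) and operate in 2-D over time and frequency, but the size arithmetic is the same per axis.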


Model Description

• Context Window Extension
• A typical size of the input features in speech recognition is 11x40, where 11 denotes the number of frames in a window and 40 denotes the dimension of the FBank features. [*]
• Using this context window size, convolutions can be performed in time 5 times with a filter size of 3, as in the following figure (vd6).

[*] Added by the presenter


Model Description

• Context Window Extension (cont'd)
• In Very Deep Convolutional Neural Networks (VDCNNs), the context window size is extended to 17 (and further to 21), which allows 8 (and 10) convolutions to be performed in time, respectively.
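The window sizes on this slide follow from simple arithmetic: each valid convolution with a 3-wide filter shrinks the time axis by 2 frames, so n frames support (n - 1) / 2 convolutions before the axis collapses to a single frame. A small sketch (illustrative only, not from the paper):

```python
# Each valid convolution with a filter of size k shrinks an axis by k - 1.
# With k = 3, n frames support (n - 1) // 2 convolutions in time.

def num_time_convs(frames, filter_size=3):
    count = 0
    while frames >= filter_size:
        frames -= filter_size - 1  # valid convolution: n -> n - (k - 1)
        count += 1
    return count

for window in (11, 17, 21):
    print(window, num_time_convs(window))  # 11 -> 5, 17 -> 8, 21 -> 10
```

This reproduces the counts in the slides: an 11-frame window permits 5 time convolutions, 17 frames permit 8, and 21 frames permit 10.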


Model Description

• Feature Dimension Extension
• Based on 40-dim FBank features, at most 6 convolutions and 2 poolings can be performed in frequency, leading to the vd6 model.
• In VDCNN, the FBank features are extended to 64-dim, so that 4 more convolutions can be performed in frequency.
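The frequency-axis budget can be checked with the same shrinkage arithmetic: a valid 3-wide convolution removes 2 bins and a 2-wide pooling halves the axis. The particular interleaving of convolutions and poolings below is an assumption for illustration, not the paper's exact layer schedule:

```python
# Sketch with an ASSUMED op ordering: valid conv (size -> size - 2),
# 2-wide pooling (size -> size // 2). 40 bins fit 6 convs + 2 pools,
# while 64 bins leave room for 4 additional convolutions.

def apply_ops(size, ops):
    for op in ops:
        size = size - 2 if op == "conv" else size // 2
        if size < 1:
            return None  # the schedule does not fit this input size
    return size

vd6_freq = ["conv", "conv", "pool", "conv", "conv", "pool", "conv", "conv"]
vd10_freq = vd6_freq + ["conv"] * 4

print(apply_ops(40, vd6_freq))    # 6 convs + 2 pools fit into 40 bins
print(apply_ops(64, vd10_freq))   # 10 convs + 2 pools fit into 64 bins
print(apply_ops(40, vd10_freq))   # None: 40 bins cannot absorb 10 convs
```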


Model Description

• Feature Dimension Extension (cont'd)
• Finally, the input extension is performed in both time and frequency, leading to a 17x64 input. The resulting model is named vd10.


Model Description

• Feature Dimension Extension (cont'd)
• The full-ext model further extends the number of time frames to 21 so that 2 more convolution operations can be performed in time, giving 10 convolution operations in both time and frequency.


Model Description

• Feature Dimension Extension (cont'd)
• To confirm that the performance gain does not come merely from the extended input features, a model with the same wider input features (17x64) but shallow convolutional layers is developed.


Model Description

• Pooling in Time
• You may have noticed that the VDCNN models all use pooling in frequency and do no pooling in time.
• To investigate whether pooling in time is helpful, vd10-tpool is designed.


Model Description

• Padding in Feature Maps
• In most work on CNNs for speech recognition, the convolutions are performed without padding.
• Padding preserves the size of the feature maps and makes better use of the border information.


Model Description

• Padding in Feature Maps (cont'd)
• Model vd10-fpad pads only in frequency, allowing more pooling operations in frequency.
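Why padding buys extra poolings can be seen from the size arithmetic (an illustrative sketch, not the paper's exact layer schedule): with "same" padding a 3-wide convolution keeps the axis size unchanged, so only the poolings shrink it, leaving a larger map and thus room for more halvings.

```python
# Illustrative comparison: with "same" padding each 3-wide convolution
# keeps the size unchanged, so the feature map stays larger and supports
# more 2-wide poolings than the unpadded (valid) case.

def size_after(size, n_convs, n_pools, padded):
    for _ in range(n_convs):
        if not padded:
            size -= 2  # valid convolution with a 3-wide filter
        # with "same" padding the size is unchanged
    for _ in range(n_pools):
        size //= 2
    return size

print(size_after(64, 10, 2, padded=False))  # unpadded: 64 -> 44 -> 11
print(size_after(64, 10, 2, padded=True))   # padded:   64 -> 64 -> 16
```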


Model Description

• Padding in Feature Maps (cont'd)
• Padding in both dimensions is also applied, which is indicated as vd10-fpad-tpad.
• In this model, considering that pooling is a necessary approach to reduce the feature map size, pooling in time is also applied.


Model Description

• Complete Figure

Model Description

• 1-Channel vs. 3-Channel Input Feature Maps
• VDCNNs use a one-channel feature map as input, i.e., the static FBank features.
• Most work in speech recognition, however, uses three-channel features (static, ∆, and ∆∆).
• The number of input channels is compared for VDCNN.


Model Description

• 1-Channel vs. 3-Channel Input Feature Maps (cont'd)
• It is interesting to find that 1-channel-based VDCNNs are better than the models using 3 channels.
• One possible explanation would be that the information in the dynamic features may be better extracted from the raw static features directly by the VDCNN.

Model Description

• 1-Channel vs. 3-Channel Input Feature Maps (cont'd)
• Another explanation may be as follows.

Model Description

• Model Parameter Size
• It is observed that although the number of convolutional layers is increased significantly in the proposed VDCNN, the total parameter size is smaller than that of the baseline CNN and DNN.
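The claim above is plausible from parameter-count arithmetic. The layer sizes below are hypothetical round numbers chosen for illustration (they are not the paper's configuration): a stack of small 3x3 convolutions carries far fewer weights than even a single wide fully connected layer of the kind found in DNN acoustic models.

```python
# Sketch with HYPOTHETICAL layer sizes: why many small convolutional
# layers can still hold fewer parameters than one wide dense layer.

def conv_params(in_ch, out_ch, k_t, k_f):
    return in_ch * out_ch * k_t * k_f  # weights of one conv layer (no bias)

def dense_params(n_in, n_out):
    return n_in * n_out                # weights of one fully connected layer

# ten 3x3 conv layers with 64 channels (illustrative sizes)
deep_conv = conv_params(1, 64, 3, 3) + 9 * conv_params(64, 64, 3, 3)
# one 2048 -> 2048 fully connected layer, a common DNN hidden-layer width
one_dense = dense_params(2048, 2048)

print(deep_conv, one_dense)  # the 10-layer conv stack is far smaller
```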


Model Description

• Convergence of Very Deep CNNs
• The VDCNN converges faster than the other model types, in terms of the number of epochs. [*]
• Accordingly, although VDCNNs need more computation in each iteration (9.5 times more than the baseline CNN), they take comparable time for model training.

[*] Added by the presenter


Model Description

• Noise Robustness of Very Deep CNNs

Model Description

• Noise Robustness of Very Deep CNNs (cont'd)
• To better understand how the VDCNN processes noisy speech, each condition (A, B, C, or D) of this frame is propagated through the best performing model, vd10-fpad-tpad.
• The outputs of the 1st convolutional layer and the 6th convolutional layer for A, B, C, and D are plotted in the next figures.


Model Description

• Noise Robustness of Very Deep CNNs (cont'd)
• To further verify the observation, the differences between the noisy feature maps and the clean feature maps are measured for all convolutional layers.
• Using data in the test set, we compute the averaged mean square error (MSE) to evaluate the differences between the three noisy conditions and the clean condition.
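The comparison described above can be sketched as follows (hypothetical toy data; the real computation averages over test utterances and the 2-D feature maps of each layer):

```python
# Sketch of the noisy-vs-clean comparison: mean square error between a
# noisy feature map and the corresponding clean one (flattened to 1-D
# for simplicity; values are made up for illustration).

def mse(noisy, clean):
    """Mean square error between two equally sized feature maps."""
    n = len(noisy)
    return sum((a - b) ** 2 for a, b in zip(noisy, clean)) / n

clean_map = [0.2, 0.5, 0.1, 0.9]
noisy_map = [0.3, 0.4, 0.1, 0.7]
print(mse(noisy_map, clean_map))
```

A smaller MSE at deeper layers would indicate that the network progressively normalizes away the noise, which is the behavior the slides examine.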

Model Description

• Noise Robustness of Very Deep CNNs (cont'd)
• The MSE values after all operations are shown below.

Model Description

• Noise Robustness of Very Deep CNNs (cont'd)
• The MSE values for the different CNN models.


Experiments

• Experimental Setup
• The GMM-HMM system is built with Kaldi.
• All neural network models, including DNN/CNN/LSTM, are trained using CNTK.
• The standard testing pipeline in the Kaldi recipes is used for decoding and scoring.
• A similar structure (IBM-VGG) designed by researchers at IBM and NYU is also constructed for comparison.

Experiments

• Evaluation on Aurora 4
• Aurora 4 is a medium vocabulary task based on the Wall Street Journal corpus (WSJ0).
• The training set contains 14,276 utterances.
• Four conditions, A, B, C, and D, as mentioned before.


Experiments

• Evaluation on AMI
• The AMI corpus contains around 100 hours of meeting recordings.
• The signal was captured and synchronized with multiple microphones, such as individual head microphones (IHM, close-talk) and microphone arrays (single distant microphone (SDM) and multiple distant microphones (MDM)).
• MDM was processed by a standard beamforming algorithm to generate a single-channel data set.

Experiments

• Evaluation on AMI (cont'd)
• The size of the input features is investigated.

Experiments

• Evaluation on AMI (cont'd)
• The effect of the other designs is also investigated.

Experiments

• Evaluation on AMI (cont'd)
• To better explain the superiority of VDCNNs, we examine some related feature maps.

Experiments

• Evaluation on AMI (cont'd)
• The same single synchronized frame is propagated.



Conclusion

• Features of VDCNN
• The sizes of the filters and pooling templates are small.
• The input feature maps are large.
• Other designs, such as pooling in time, padding, and input feature map selection, are adjusted.
• On Aurora 4, it achieves a WER of 8.81% (state of the art).
• On AMI, its accuracy is competitive with an LSTM.

Thank You!
