+ All Categories
Transcript
Page 1: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

VeryDeepConvolutionalNeuralNetworksforNoiseRobust

SpeechRecognition

Yanmin Qian,etal.“VeryDeepConvolutionalNeuralNetworksforNoiseRobustSpeechRecognition.” IEEETransactionsonAudio,Speech,andLanguageProcessing.Acceptedforpublicationforafutureissue.

Presented by PeidongWang09/09/2016

1

Page 2: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Content

• Abstract• ReviewofConvolutionalNeuralNetworks• ModelDescription• Experiments• Conclusion

2

Page 3: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Content

• Abstract• ReviewofConvolutionalNeuralNetworks• ModelDescription• Experiments• Conclusion

3

Page 4: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Abstract

• ASR: PreviousattemptsincreasingthenumberofCNNlayersfrom2to3gaveadegradation.• CV:Recentworkinimageshowsthattheaccuracyofimageclassificationcanbeimprovedbyincreasingthenumberofconvolutionallayerswithcarefullytunedarchitecture.• ASR:VeryDeepConvolutionalNeuralNetworksusesupto10convolutionallayersandgetsaWERof8.81%onAurora4,whichisthebestpublishedresult.

4

Page 5: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Content

• Abstract• ReviewofConvolutionalNeuralNetworks• ModelDescription• Experiments• Conclusion

5

Page 6: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ReviewofConvolutionalNeuralNetworks

• AConventionalConvolutionalNeuralNetwork(CNN)

6

From:SlidesinCSE5526NeuralNetworks

Page 7: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ReviewofConvolutionalNeuralNetworks

• ConvolutionandPooling(Subsampling)

7

Page 8: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Content

• Abstract• ReviewofConvolutionalNeuralNetworks• ModelDescription• Experiments• Conclusion

8

Page 9: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• ContextWindowExtension• Atypicalsizeofinputfeaturesinspeechrecognitionis11x40,where11denotesthenumberofframesinawindow,40denotesthedimensionofFBankfeatures.[*]

• Usingthiscontextwindowsize,convolutionscanbeperformedintime5timeswithafiltersizeof3,asinthefollowingfigure(vd6).

9

[*]addedbythepresenter

Page 10: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• ContextWindowExtension(cont’d)

10

Page 11: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• ContextWindowExtension(cont’d)• InVeryDeepConvolutionalNeuralNetworks(VDCNNs),thecontextwindowsizeisextendedto17(andfurtherto21),whichallows8(and10)convolutionstobeperformedintime,respectively.

11

Page 12: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• ContextWindowExtension(cont’d)

12

Page 13: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• ContextWindowExtension(cont’d)

13

Page 14: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• FeatureDimensionExtension• Basedon40-dimFBankfeatures,atmost6convolutionsand2poolingscanbeperformedinfrequency,leadingtothevd6model.• InVDCNN,theFBankfeaturesareextendedto64-dim,sothat4moreconvolutionscanbeperformedinfrequency.

14

Page 15: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• FeatureDimensionExtension(cont’d)

15

Page 16: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• FeatureDimensionExtension(cont’d)• Finallytheinputextensionisperformedinbothtimeandfrequency,leadingtoa17x64input.Theresultingmodelisnamedvd10.

16

Page 17: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• FeatureDimensionExtension(cont’d)

17

Page 18: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• FeatureDimensionExtension(cont’d)• Thefull-ext modelfurtherextendsthenumberoftimeframesto21sothat2moreconvolutionoperationscanbeperformedintime,giving10convolutionoperationsinbothtimeandfrequency.

18

Page 19: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• FeatureDimensionExtension(cont’d)

19

Page 20: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• FeatureDimensionExtension(cont’d)• Toconfirmthattheperformancegainisnotfromtheextendedinputfeatures,amodelwiththesamewiderinputfeatures(17x64)butshallowconvolutionallayersisdeveloped.

20

Page 21: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• FeatureDimensionExtension(cont’d)

21

Page 22: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• PoolinginTime• YoumayhavenoticedthattheVDCNNmodelsallusepoolinginfrequencyanddonopoolingintime.• Toinvestigatewhetherpoolingintimeishelpful,vd10-tpoolisdesigned.

22

Page 23: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• PoolinginTime(cont’d)

23

Page 24: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• PoolinginTime(cont’d)

24

Page 25: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• PaddinginFeatureMaps• InmostworkonCNNsforspeechrecognition,theconvolutionsareperformedwithoutpadding.• Paddingcansavethesizeoffeaturemapsandbetterutilizetheborderinformation.

25

Page 26: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• PaddinginFeatureMaps(cont’d)

26

Page 27: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• PaddinginFeatureMaps(cont’d)•Modelvd10-fpadpadsonlyinfrequency,allowingmorepoolingoperationsinfrequency.

27

Page 28: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• PaddinginFeatureMaps(cont’d)

28

Page 29: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• PaddinginFeatureMaps(cont’d)• Paddinginbothdimensionsisalsoapplied,whichisindicatedasvd10-fpad-tpad.• Inthismodel,consideringthatpoolingisanecessaryapproachtoreducethefeaturemapsize,poolingintimeisalsoapplied.

29

Page 30: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• PaddinginFeatureMaps(cont’d)

30

Page 31: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• PaddinginFeatureMaps(cont’d)

31

Page 32: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• CompleteFigure

32

Page 33: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• CompleteFigure(cont’d)

33

Page 34: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• 1Channelvs.3ChannelsBasedInputFeatureMaps• VDCNNsuseonechannelfeaturemapasinput,i.e.thestaticFBankfeature.•Mostworkinspeechrecognition,however,usesthree-channelfeatures(static,∆,and∆∆).• ThenumberofinputchannelsarecomparedforVDCNN.

34

Page 35: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• 1Channelvs.3ChannelsBasedInputFeatureMaps(cont’d)

35

Page 36: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• 1Channelvs.3ChannelsBasedInputFeatureMaps(cont’d)• Itisinterestingtofindthat1channelbaseVDCNNsarebetterthanthemodelsusing3channels.• OnepossibleexplanationwouldbethattheinformationinthedynamicfeaturesmaybebetterextractedfromtherawstaticfeaturesdirectlybyVDCNN.

36

Page 37: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• 1Channelvs.3ChannelsBasedInputFeatureMaps(cont’d)• Anotherexplanationmaybeasfollows.

37

Page 38: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

•ModelParameterSize• ItisobservedthatalthoughthenumberofconvolutionallayersisincreasedsignificantlyintheproposedVDCNN,thetotalparametersizeissmallerthanthebaselineCNNandDNN.

38

Page 39: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

•ModelParameterSize(cont’d)

39

Page 40: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• ConvergenceofVeryDeepCNNs• TheVDCNNconvergesfasterthanothermodeltypes,intermsofthenumberofepochs[*].• Accordingly,althoughVDCNNsneedmorecomputationsineachiteration(9.5timesmorecomputationscomparedtothebaselineCNN),theVDCNNstakecomparabletimeformodeltraining.

40

[*]addedbythepresenter

Page 41: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• ConvergenceofVeryDeepCNNs(cont’d)

41

Page 42: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• NoiseRobustnessofVeryDeepCNNs

42

Page 43: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• NoiseRobustnessofVeryDeepCNNs(cont’d)• TobetterunderstandhowVDCNNprocessesnoisyspeech,eachcondition(A,B,CorD)ofthisframeispropagatedthroughthebestperformingmodelvd10-fpad-tpad.• Theoutputsofthe1st convolutionallayerandthe6thconvolutionallayerforA,B,CandDareplottedinthenextfigures.

43

Page 44: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• NoiseRobustnessofVeryDeepCNNs(cont’d)

44

Page 45: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• NoiseRobustnessofVeryDeepCNNs(cont’d)• Tofurtherverifytheobservation,thedifferencesbetweennoisyfeaturemapsandcleanfeaturemapsaremeasuredforallconvolutionallayers.• Usingdatainthetest,wecomputetheaveragedmeansquareerror(MSE)toevaluatethedifferencesbetweenthethreenoisyconditionsandthecleancondition.

45

Page 46: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• NoiseRobustnessofVeryDeepCNNs(cont’d)• TheMSEvaluesafteralloperationsareshowbelow.

46

Page 47: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

ModelDescription

• NoiseRobustnessofVeryDeepCNNs(cont’d)• TheMSEvaluesfordifferentCNNmodels.

47

Page 48: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Content

• Abstract• ReviewofConvolutionalNeuralNetworks• ModelDescription• Experiments• Conclusion

48

Page 49: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Experiments

• ExperimentalSetup• TheGMM-HMMsystemisbuiltwithKaldi.• Allneuralnetworkmodels,includingDNN/CNN/LSTM,aretrainedusingCNTK.• ThestandardtestingpipelineinKaldirecipesareusedfordecodingandscoring.• Asimilarstructure(IBM-VGG)designedbyresearchersinIBMandNYUisalsoconstructedforcomparison.

49

Page 50: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Experiments

• EvaluationonAurora4• Aurora4isamediumvocabularytaskbasedontheWallStreetJournal(WSJ0).• Trainingsetscontain14276utterances.• Fourconditions,A,B,CandD,asmentionedbefore.

50

Page 51: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Experiments

• EvaluationonAurora4(cont’d)

51

Page 52: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Experiments

• EvaluationonAMI• AMIcorpuscontainsaround100hoursofmeetingrecords.• Thesignalwascapturedandsynchronizedwithmultiplemicrophonessuchasindividualheadmicrophones(IHM,close-talk)andmicrophonearrays(singledistantmicrophone(SDM)andmultipledistantmicrophones(MDM)).•MDMwasprocessedbyastandardbeamformingalgorithmtogenerateasinglechanneldataset.

52

Page 53: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Experiments

• EvaluationonAMI(cont’d)• Thesizeofinputfeaturesisinvestigated.

53

Page 54: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Experiments

• EvaluationonAMI(cont’d)• Theeffectofotherdesignsarealsoinvestigated.

54

Page 55: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Experiments

• EvaluationonAMI(cont’d)• TobetterexplainthesuperiorityofVDCNNs,weusesomerelatedfeaturemaps.

55

Page 56: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Experiments

• EvaluationonAMI(cont’d)• Onesamesinglesynchronizedframeispropagated.

56

Page 57: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Experiments

• EvaluationonAMI(cont’d)

57

Page 58: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Content

• Abstract• ReviewofConvolutionalNeuralNetworks• ModelDescription• Experiments• Conclusion

58

Page 59: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Conclusion

• FeaturesofVDCNN• Thesizesoffiltersandpoolingtemplatesaresmall.• Theinputfeaturemapsarelarge.• Otherdesignsuchaspoolingintime,padding,andinputfeaturemapsselectionareadjusted.• OnAurora4,itachievesaWERof8.81%(state-of-art).• OnAMI,itsaccuracyiscompetitivetoanLSTM.

59

Page 60: Very Deep Convolutional Neural Networks for Noise …...convolutional layers with carefully tuned architecture. •ASR: Very Deep Convolutional Neural Networks uses up to 10 convolutional

Thank You!

60


Top Related