Post on 05-Jul-2020

CS 6956: Deep Learning for NLP

Convolutional Neural Networks for Language

Features from text

Example: Sentiment classification

The goal: Is the sentiment of a sentence positive, negative or neutral?

The film is fun and is host to some truly excellent sequences

Approach: Train a multiclass classifier. What features?
Some words and n-grams are informative, while some are not.

We need to:
1. Identify informative local information
2. Aggregate it into a fixed-size vector representation

Convolutional Neural Networks

Designed to
1. Identify local predictors in a larger input
2. Pool them together to create a feature representation
3. And possibly repeat this in a hierarchical fashion

In the NLP context, it helps identify predictive n-grams for a task

Overview

• Convolutional Neural Networks: A brief history
• The two operations in a CNN
  – Convolution
  – Pooling
• Convolution + Pooling as a building block
• CNNs in NLP
• Recurrent networks vs Convolutional networks

Convolutional Neural Networks: Brief history

First arose in the context of vision

• Hubel and Wiesel, 1950s/60s: The mammalian visual cortex contains neurons that respond to small regions and specific patterns in the visual field
  – David H. Hubel and Torsten Wiesel: Nobel Prize in Physiology or Medicine, 1981
• Fukushima 1980, Neocognitron: Directly inspired by Hubel and Wiesel
  – Key idea: locality of features in the visual cortex is important; integrate them locally and propagate them to further layers
  – Two operations:
    1. a convolutional layer that reacts to specific patterns, and
    2. a down-sampling layer that aggregates information
• LeCun 1989-today, Convolutional Neural Network: A supervised version
  – Related to convolution kernels in computer vision
  – Very successful on handwriting recognition and other computer vision tasks
• Has become better over recent years with more data, computation
  – Krizhevsky et al 2012: Image classification on ImageNet
  – The de facto feature extractor for computer vision

Convolutional Neural Networks: Brief history

• Introduced to NLP by Collobert et al, 2011
  – Used as a feature extraction system for semantic role labeling
• Since then, several other applications such as sentiment analysis, question classification, etc.
  – Kalchbrenner et al 2014, Kim 2014

CNN terminology

Shows its computer vision and signal processing origins

• Filter
  – A function that transforms an input matrix/vector into a scalar feature
  – A filter is a learned feature detector (sometimes also called a feature map, though that term more often refers to the output the filter produces)
• Channel
  – In computer vision, color images have red, green and blue channels
  – In general, a channel represents a “view of the input” that captures information about an input independent of other channels
    • For example, different kinds of word embeddings could be different channels
    • Channels could themselves be produced by previous convolutional layers
• Receptive field
  – The region of the input that a filter currently focuses on

What is a convolution?

Let’s see this using an example for vectors.

We will generalize this to matrices and beyond, but the general idea remains the same.

What is a convolution?

An example using vectors

A vector $\mathbf{x}$: 2 3 1 3 2 1

Filter $\mathbf{f}$ of size $n$: 1 2 1
Here, the filter size is 3

The output is also a vector:

$$\text{output}_i = \sum_{j=1}^{n} f_j \cdot x_{i + j - 1 - \lfloor n/2 \rfloor}$$

For the size-3 filter here, this is $\text{output}_i = f_1 x_{i-1} + f_2 x_i + f_3 x_{i+1}$, with $x_0 = x_7 = 0$ (padding).

The filter moves across the vector.

At each position, the output is the dot product of the filter with a slice of the vector of that size.

Padding at the beginning: 0
Padding at the end: 0

Position 1: (0, 2, 3) · (1, 2, 1) = 7
Position 2: (2, 3, 1) · (1, 2, 1) = 9
Position 3: (3, 1, 3) · (1, 2, 1) = 8
Position 4: (1, 3, 2) · (1, 2, 1) = 9
Position 5: (3, 2, 1) · (1, 2, 1) = 8
Position 6: (2, 1, 0) · (1, 2, 1) = 4

Output: 7 9 8 9 8 4
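The sliding dot product from the vector example can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `convolve_1d` is made up for this sketch, and it assumes an odd filter size with zero padding of half the filter width on each side, as in the example.

```python
def convolve_1d(x, f):
    """Slide filter f over x, zero-padding so the output has len(x) entries."""
    n = len(f)
    pad = n // 2  # zeros added on each side (assumes n is odd)
    padded = [0] * pad + list(x) + [0] * pad
    # At each position, take the dot product of f with a slice of size n.
    return [
        sum(f[j] * padded[i + j] for j in range(n))
        for i in range(len(x))
    ]


x = [2, 3, 1, 3, 2, 1]
f = [1, 2, 1]
print(convolve_1d(x, f))  # [7, 9, 8, 9, 8, 4]
```

Note that the filter is applied without flipping (strictly a cross-correlation), which is the usual convention in neural network libraries; for the symmetric filter used here the distinction makes no difference.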

What is a convolution?

The same idea applies to matrices as well

[Figure: an input matrix, a filter, and the result of the convolution]

The filter moves across the matrix.

At each position, the output is the dot product of the filter with a slice of the matrix of that size.
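The matrix case can be sketched the same way. The numbers in the original figure are not recoverable from the transcript, so the input matrix and filter below are made-up values chosen only for illustration; the function name `convolve_2d` is likewise hypothetical, and the sketch assumes no padding and a stride of 1.

```python
def convolve_2d(matrix, filt):
    """Unpadded 2D filtering: dot product of filt with each same-sized slice."""
    rows, cols = len(matrix), len(matrix[0])
    k_rows, k_cols = len(filt), len(filt[0])
    out = []
    for r in range(rows - k_rows + 1):
        out_row = []
        for c in range(cols - k_cols + 1):
            # Element-wise product of the filter and the slice, then sum.
            total = sum(
                filt[a][b] * matrix[r + a][c + b]
                for a in range(k_rows)
                for b in range(k_cols)
            )
            out_row.append(total)
        out.append(out_row)
    return out


# Hypothetical toy input (not the values from the slide figure).
image = [
    [1, 0, 2, 1],
    [0, 1, 1, 0],
    [2, 1, 0, 1],
]
filt = [
    [1, 0],
    [0, 1],
]
result = convolve_2d(image, filt)
print(result)  # [[2, 1, 2], [1, 1, 2]]
```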

Pooling: An aggregation operation

• A convolution produces a vector/matrix that captures properties of each window
• Pooling combines this information to produce a down-sampled version of the vector/matrix
  – Typically using the maximum or the average value within a window
• Intuition
  – A filter is a feature detector that discovers how well each window matches a feature of interest
  – The most important features should be recognized regardless of their location
  – Answer: Pool the information from different windows together

What is pooling?

An example using vectors

A vector $\mathbf{x}$: 2 3 1 3 2 1
Filter $\mathbf{f}$ of size $n$: 1 2 1
The output is also a vector: 7 9 8 9 8 4

The pooling operation can be applied using a window as well

Example 1: Max pooling with window size 3
9 9 9 9
(the windows are (7, 9, 8), (9, 8, 9), (8, 9, 8) and (9, 8, 4))

Example 2: Average pooling with window size 3
8 8.67 8.33 7

Example 3: Max pooling with window size = length of the vector
9

Important note: There are no learned parameters for the pooling operation. It is a deterministic operation.
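The three pooling examples can be reproduced with a small helper. This is a sketch under the assumption that the pooling window moves one step at a time with no padding; the names `pool` and `mean` are chosen here for illustration. Note that the last window, (9, 8, 4), has maximum 9.

```python
def pool(values, window, reduce_fn):
    """Apply reduce_fn to every contiguous window (stride 1, no padding)."""
    return [
        reduce_fn(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


def mean(xs):
    return sum(xs) / len(xs)


conv_out = [7, 9, 8, 9, 8, 4]

max_pooled = pool(conv_out, 3, max)               # [9, 9, 9, 9]
avg_pooled = pool(conv_out, 3, mean)              # roughly [8, 8.67, 8.33, 7]
global_max = pool(conv_out, len(conv_out), max)   # [9]

print(max_pooled, [round(v, 2) for v in avg_pooled], global_max)
```

Nothing in `pool` is learned: the same input always gives the same output, which is the sense in which pooling is a deterministic operation.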

Typical kinds of pooling

• Max pooling
  – Take the maximum value of the results of the convolution
• Average pooling
  – Uses the average to pool instead of the max
• K-max pooling
  – Take the top k values (for a fixed k)
  – Generalization of max pooling
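K-max pooling is the least familiar of the three, so here is a minimal sketch. It assumes the variant that keeps the selected values in their original order (as commonly described for sentence models); the slide itself only says "take the top k values", so the ordering choice and the function name `k_max_pool` are assumptions of this sketch.

```python
def k_max_pool(values, k):
    """Keep the k largest values, preserving their original order."""
    # Indices of the k largest values; ties go to the earlier position.
    top = sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]
    return [values[i] for i in sorted(top)]


conv_out = [7, 9, 8, 9, 8, 4]
print(k_max_pool(conv_out, 1))  # [9]  (same as global max pooling)
print(k_max_pool(conv_out, 3))  # [9, 8, 9]
```

With k = 1 this reduces to max pooling over the whole vector, which is why k-max pooling is a generalization of max pooling.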

Convolution + Pooling = one layer

• Input: a matrix. Convolution will operate over windows of this matrix.
  – This could be extended to general tensors as well
• The window size defines the receptive field
  – We will refer to the window as $x_i$
• A filter is defined by some parameters (that will be learned)
  – In general, a matrix $u$ of the same shape as the window and a bias $b$
• Convolution: Iterate over all windows and apply the filter
  – Typically has a non-linearity $g$ (e.g. ReLU)

$$p_i = g(u \cdot x_i + b)$$

• Pooling: Aggregate the $p_i$'s into a down-sampled version, sometimes a single number
• Typically, there are many filters, each of which is pooled independently
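Putting the pieces together, one convolution + pooling layer can be sketched as follows. This is an illustrative sketch only: the weights are hand-picked rather than learned, each window is flattened into a single vector so that u · x_i is an ordinary dot product, pooling is a max over all windows, and the names `relu` and `conv_pool_layer` are made up here.

```python
def relu(z):
    return max(0.0, z)


def conv_pool_layer(rows, filters, window):
    """rows: list of input vectors (for example, one per word).
    filters: list of (u, b) pairs, where u is a flat weight list of
    length window * dim and b is a scalar bias.
    Returns one max-pooled feature per filter."""
    features = []
    for u, b in filters:
        activations = []
        for i in range(len(rows) - window + 1):
            # x_i: the window starting at row i, flattened into one vector.
            x_i = [v for row in rows[i:i + window] for v in row]
            # p_i = g(u . x_i + b) with g = ReLU.
            p_i = relu(sum(w * v for w, v in zip(u, x_i)) + b)
            activations.append(p_i)
        # Each filter is pooled independently; pooling has no parameters.
        features.append(max(activations))
    return features


# Hypothetical toy input: four 2-dimensional rows.
rows = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# Hypothetical hand-picked filters (in practice these are learned).
filters = [
    ([1.0, 0.0, 0.0, 1.0], 0.0),
    ([-1.0, -1.0, -1.0, -1.0], 0.5),
]
features = conv_pool_layer(rows, filters, window=2)
print(features)  # [2.0, 0.0]
```

The output has one entry per filter regardless of how many rows the input has, which is what makes the layer usable as a fixed-size feature extractor.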

Hyperparameters

• Filter sizes: How big should the filter be?
  – Typically, 3x3, 5x5, etc.
• Stride: How does the filter move along the input?
  – It could skip some steps, or not.
• How many filters should there be?
• Padding: Should there be padding or not? If so, should the padding be zeros or random?
• How big should the pooling window be?
• What kind of pooling: Average, Max, L2 norm?
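Filter size, stride and padding together determine how many positions the filter visits, and hence the size of the convolution output. The helper below uses the standard formula for one dimension; the function name is made up for this sketch, and it assumes the padded input is at least as long as the filter.

```python
def conv_output_length(input_len, filter_size, stride=1, padding=0):
    """Number of positions a filter visits along one dimension.

    padding is the number of padded entries added on EACH side.
    Assumes input_len + 2 * padding >= filter_size.
    """
    return (input_len + 2 * padding - filter_size) // stride + 1


# The vector example: length 6, filter size 3, one zero on each side.
print(conv_output_length(6, 3, padding=1))  # 6
# Without padding the output shrinks.
print(conv_output_length(6, 3))             # 4
# A stride of 2 skips every other position.
print(conv_output_length(6, 3, stride=2))   # 2
```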

Example: LeNet

An example network that uses these building blocks

LeNet-5 was proposed by LeCun et al., 1998 for handwriting recognition
Had several levels of convolution-pooling

Convolutional Neural Networks in NLP

• Goal: To represent a sequence of words as a feature vector
• Approach:
  – Represent the sequence of words by sequence(s) of embeddings
  – Convolve with several filters
  – Pool across the sequence to get a feature vector of a fixed dimensionality

Convolutional Neural Networks in NLP

Goal: To represent a sequence of words as a feature vector

Suppose we want to classify this sentence: I ate cake today

1. Word embeddings: represent each word (I, ate, cake, today) by its embedding
2. Padding: add padding vectors at the ends of the sequence
3. Apply a filter: slide it across the padded sequence, one window at a time
4. Convolution with one filter, then pooling across the sentence (often max pooling) to get one feature
5. Convolution with many filters, then pooling across the sentence (often max pooling) gets a feature vector

There can be several filters (sometimes called kernels, or feature maps)

Convolution + pooling example

1. Each word is embedded into a 2d vector; the window concatenates them
2. A 6x3 filter with a tanh non-linearity
3. Max pooling over each dimension to produce a 3-dimensional vector
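The three steps above can be sketched directly. The sketch assumes a window of 3 words (so that three concatenated 2d embeddings give a 6-dimensional window, matching the 6x3 filter) and no padding; all embedding and weight values are made-up toy numbers, and `sentence_features` is a hypothetical name.

```python
import math


def sentence_features(embeddings, weights, biases, window=3):
    """embeddings: one d-dimensional vector per word.
    weights: a (window * d) x m matrix, i.e. m filters side by side.
    Returns an m-dimensional feature vector for the whole sentence."""
    num_filters = len(biases)
    conv = []
    for i in range(len(embeddings) - window + 1):
        # Step 1: concatenate the embeddings in the window.
        x_i = [v for e in embeddings[i:i + window] for v in e]
        # Step 2: apply the filter matrix, then the tanh non-linearity.
        conv.append([
            math.tanh(
                sum(x_i[r] * weights[r][c] for r in range(len(x_i)))
                + biases[c]
            )
            for c in range(num_filters)
        ])
    # Step 3: max pooling over each dimension (column) across all windows.
    return [max(row[c] for row in conv) for c in range(num_filters)]


# Hypothetical 2d embeddings for "I ate cake today".
embeddings = [[0.1, 0.3], [0.5, -0.2], [-0.4, 0.7], [0.2, 0.2]]
# Hypothetical 6x3 filter matrix and biases (learned in practice).
weights = [
    [0.5, -0.1, 0.0],
    [0.0, 0.2, 0.3],
    [-0.3, 0.4, 0.1],
    [0.1, 0.0, -0.2],
    [0.2, 0.1, 0.0],
    [0.0, -0.5, 0.6],
]
biases = [0.0, 0.0, 0.0]

features = sentence_features(embeddings, weights, biases)
print(len(features))  # 3
```

A longer sentence produces more windows but still a 3-dimensional output, because the max is taken over windows.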

Examples of convolution + pooling

Think of convolutions as feature extractors

[Figure from Goldberg 2017]
• A narrow convolution (i.e. without any padding) in the vector concatenation notation
• A wide convolution (i.e. with padding) in the vector stacking notation
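The difference between narrow and wide convolutions shows up in which windows the filter gets to see. The sketch below assumes the common convention that a wide convolution pads k-1 positions on each side for a window of size k; other padding amounts are possible, and the function name and the `<pad>` token are made up for illustration.

```python
def windows(tokens, k, wide=False, pad="<pad>"):
    """Return all k-token windows of a sentence.

    A narrow convolution uses the sentence as is; a wide convolution
    (under the convention assumed here) pads k-1 tokens on each side.
    """
    if wide:
        tokens = [pad] * (k - 1) + list(tokens) + [pad] * (k - 1)
    return [list(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


sentence = ["I", "ate", "cake", "today"]

narrow = windows(sentence, 3)           # 4 - 3 + 1 = 2 windows
wide = windows(sentence, 3, wide=True)  # 4 + 3 - 1 = 6 windows

print(len(narrow), narrow[0])  # 2 ['I', 'ate', 'cake']
print(len(wide), wide[0])      # 6 ['<pad>', '<pad>', 'I']
```

With padding, words at the edges of the sentence appear in as many windows as words in the middle.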

Features from text

• If we want to classify text, we need to represent it in some feature space
• We have (at least) two ways to get features from text using a neural network:
  – Recurrent Neural Networks
  – Convolutional Neural Networks

RNNs vs CNNs

• RNNs model non-Markovian dependencies
  – Can look at (effectively) infinite windows around a target word
  – Can capture sequential patterns in such windows
• CNNs capture informative n-grams
  – Also gappy n-grams
  – But also account for local ordering patterns
• How do they compare?
  – Both are trained end-to-end with a task loss
  – RNNs (specifically, BiRNNs) are more popular today…
• … but this can change
  – CNNs allow for more parallelism, and so may be better suited for certain hardware/software improvements

RNNs and CNNs as building blocks

Think of them as Lego bricks for constructing larger architectures

Both are computation graphs
Mix and match with other computation graphs to create larger neural networks

General tools that can be used with other ideas that we have seen and will see

E.g.: contextual embeddings, attention, etc.