CS224d: Deep NLP Lecture 11: Advanced Recursive Neural Networks Richard Socher [email protected]
Transcript
Page 1

CS224d: Deep NLP

Lecture 11: Advanced Recursive Neural Networks

[email protected]

Page 2

• PSet 2: please read the instructions for submissions
• Please follow Piazza for questions and announcements
• Because of some ambiguities in PSet 2, we will be lenient in grading. TF is a super useful skill.
• If you have a re-grade question or request, please come to office hours or send a message on Piazza.
• To improve learning and your experience, we will publish solutions to PSets.

Page 3

Recursive Neural Networks
• Focused on compositional representation learning of hierarchical structure, features and predictions
• Different combinations of:
1. Training Objective
2. Composition Function
3. Tree Structure

[Diagram: child vectors c1 and c2 are composed by W into a parent vector p; a score s is computed from p with the scoring vector V.]
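A minimal sketch of these pieces, with hypothetical dimensions and parameter names (tanh composition and a linear score, matching the diagram above):

```python
import numpy as np

d = 4                                        # word/phrase vector dimension (illustrative)
rng = np.random.default_rng(0)
W = rng.standard_normal((d, 2 * d)) * 0.1    # composition matrix
b = np.zeros(d)                              # composition bias
V = rng.standard_normal(d) * 0.1             # scoring vector

def compose(c1, c2):
    """Combine two child vectors into a parent: p = tanh(W [c1; c2] + b)."""
    return np.tanh(W @ np.concatenate([c1, c2]) + b)

def score(p):
    """Plausibility score of the proposed constituent: s = V . p."""
    return V @ p

c1, c2 = rng.standard_normal(d), rng.standard_normal(d)
print(score(compose(c1, c2)))
```

Changing the training objective, the compose function, or the tree structure the composition is applied to gives the model variants discussed in the rest of the lecture.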

Page 4

Overview

Last lecture: Recursive Neural Networks

This lecture: Different RNN composition functions and NLP tasks
1. Standard RNNs: Paraphrase detection
2. Matrix-Vector RNNs: Relation classification
3. Recursive Neural Tensor Networks: Sentiment Analysis
4. Tree LSTMs: Phrase Similarity

Next lecture
• Review for Midterm. Going over common problems/questions from office hours. Please prepare questions.

Page 5

Applications and Models
• Note: All models can be applied to all tasks
• More powerful models are needed for harder tasks
• Models get increasingly more expressive and powerful:
1. Standard RNNs: Paraphrase detection
2. Matrix-Vector RNNs: Relation classification
3. Recursive Neural Tensor Networks: Sentiment Analysis
4. Tree LSTMs: Phrase Similarity

Page 6

Paraphrase Detection

Pollack said the plaintiffs failed to show that Merrill and Blodget directly caused their losses
Basically, the plaintiffs did not show that omissions in Merrill's research caused the claimed losses

The initial report was made to Modesto Police December 28
It stems from a Modesto police report

Page 7

How to compare the meaning of two sentences?

Page 8

RNNs for Paraphrase Detection

Unsupervised RNNs and a pair-wise sentence comparison of nodes in parsed trees (Socher et al., NIPS 2011)
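The pair-wise comparison can be pictured as a matrix of distances between every pair of node vectors from the two parse trees, pooled down to a fixed size before classification. A rough sketch under assumed names (the node vectors would come from the unfolding recursive autoencoder, and the pooled grid size is a hyperparameter):

```python
import numpy as np

def pairwise_distances(nodes_a, nodes_b):
    """Euclidean distance between every node vector of tree A and every node vector of tree B.

    nodes_a: (n_a, d) array, nodes_b: (n_b, d) array  ->  (n_a, n_b) matrix.
    """
    diff = nodes_a[:, None, :] - nodes_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def dynamic_min_pool(dist, out_rows=15, out_cols=15):
    """Min-pool the variable-size distance matrix down to a fixed grid for a classifier.

    Assumes both trees have at least out_rows / out_cols nodes; the paper's dynamic
    pooling handles smaller trees by duplicating rows and columns.
    """
    row_chunks = np.array_split(np.arange(dist.shape[0]), out_rows)
    col_chunks = np.array_split(np.arange(dist.shape[1]), out_cols)
    return np.array([[dist[np.ix_(r, c)].min() for c in col_chunks] for r in row_chunks])
```

The flattened pooled grid is then fed to a small classifier that decides paraphrase vs. not paraphrase.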


Page 9

RNNs for Paraphrase Detection
Experiments on the Microsoft Research Paraphrase Corpus (Dolan et al. 2004)

Method                                              Acc.   F1
Rus et al. (2008)                                   70.6   80.5
Mihalcea et al. (2006)                              70.3   81.3
Islam et al. (2007)                                 72.6   81.3
Qiu et al. (2006)                                   72.0   81.6
Fernando et al. (2008)                              74.1   82.4
Wan et al. (2006)                                   75.6   83.0
Das and Smith (2009)                                73.9   82.3
Das and Smith (2009) + 18 Surface Features          76.1   82.7
F. Bu et al. (ACL 2012): String Re-writing Kernel   76.3   --
Unfolding Recursive Autoencoder (NIPS 2011)         76.8   83.6

The dataset is problematic; a better evaluation is introduced later.

Page 10

RNNs for Paraphrase Detection

Page 11

Recursive Deep Learning
1. Standard RNNs: Paraphrase Detection
2. Matrix-Vector RNNs: Relation classification
3. Recursive Neural Tensor Networks: Sentiment Analysis
4. Tree LSTMs: Phrase Similarity

Page 12

Compositionality Through Recursive Matrix-Vector Spaces

One way to make the composition function more powerful was by untying the weights W

But what if words act mostly as an operator, e.g. "very" in "very good"?

Proposal: A new composition function

p = tanh(W [c1; c2] + b)

Page 13

Compositionality Through Recursive Matrix-Vector Recursive Neural Networks

Standard RNN:  p = tanh(W [c1; c2] + b)
MV-RNN:        p = tanh(W [C2 c1; C1 c2] + b)
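In the MV-RNN, every word and phrase carries both a vector c and an operator matrix C, and each child is first transformed by its sibling's matrix before the usual composition. A minimal sketch with illustrative dimensions (the matrix composition shown here is one simple choice and its parameter name W_M is my own):

```python
import numpy as np

d = 4
rng = np.random.default_rng(1)
W = rng.standard_normal((d, 2 * d)) * 0.1    # composes the two transformed vectors
W_M = rng.standard_normal((d, 2 * d)) * 0.1  # composes the two operator matrices
b = np.zeros(d)

def mv_compose(c1, C1, c2, C2):
    """MV-RNN composition: vector p = tanh(W [C2 c1; C1 c2] + b), matrix P = W_M [C1; C2].

    Because each child's vector is multiplied by its sibling's matrix, words like
    "very" can act as operators that modify their neighbor's meaning.
    """
    p = np.tanh(W @ np.concatenate([C2 @ c1, C1 @ c2]) + b)
    P = W_M @ np.vstack([C1, C2])            # (d, 2d) @ (2d, d) -> (d, d)
    return p, P
```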


Page 14

Predicting Sentiment Distributions
Good example for non-linearity in language

Page 15

MV-RNN for Relationship Classification

Relationship            Sentence with labeled nouns for which to predict relationships
Cause-Effect(e2,e1)     Avian [influenza]e1 is an infectious disease caused by type A strains of the influenza [virus]e2.
Entity-Origin(e1,e2)    The [mother]e1 left her native [land]e2 about the same time and they were married in that city.
Message-Topic(e2,e1)    Roadside [attractions]e1 are frequently advertised with [billboards]e2 to attract tourists.

Page 16

Sentiment Detection

Sentiment detection is crucial to business intelligence, stock trading, …

Page 17

Sentiment Detection and Bag-of-Words Models

Most methods start with a bag of words + linguistic features/processing/lexica

But such methods (including tf-idf) can't distinguish:

+ white blood cells destroying an infection
− an infection destroying white blood cells
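A quick way to see the problem: the two phrases above contain exactly the same words, so any order-insensitive representation gives them identical features (illustrative snippet):

```python
from collections import Counter

a = "white blood cells destroying an infection"
b = "an infection destroying white blood cells"

# Identical multisets of tokens, hence identical bag-of-words (and tf-idf) vectors.
print(Counter(a.split()) == Counter(b.split()))   # True
```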

Page 18

Sentiment Detection and Bag-of-Words Models
• It is often assumed that sentiment is "easy"
• Detection accuracy for longer documents: ~90%
• Lots of easy cases (… horrible … or … awesome …)
• For the dataset of single-sentence movie reviews (Pang and Lee, 2005), accuracy never reached above 80% for >7 years
• Harder cases require actual understanding of negation and its scope + other semantic effects

Page 19

Data: Movie Reviews

Stealing Harvard doesn't care about cleverness, wit or any other kind of intelligent humor.

There are slow and repetitive parts but it has just enough spice to keep it interesting.

Page 20

Two missing pieces for improving sentiment
1. Compositional Training Data
2. Better Compositional Model

Page 21

1. New Sentiment Treebank

Page 22

1. New Sentiment Treebank
• Parse trees of 11,855 sentences
• 215,154 phrases with labels
• Allows training and evaluating with compositional information

Page 23

Better Dataset Helped All Models
• But hard negation cases are still mostly incorrect
• We also need a more powerful model!

[Bar chart: positive/negative full-sentence classification accuracy (y-axis 75-84%) for BiNB, RNN, and MV-RNN, comparing training with sentence labels against training with the Treebank.]

Page 24

Better Dataset Helped
• This improved performance for full-sentence positive/negative classification by 2-3%
• Yay!
• But a more in-depth analysis shows: hard negation cases are still mostly incorrect
• We also need a more powerful model!

Page 25

2. New Compositional Model
• Recursive Neural Tensor Network
• More expressive than previous RNNs
• Idea: Allow more interactions of vectors

Page 26

2. New Compositional Model
• Recursive Neural Tensor Network

Page 27

2. New Compositional Model
• Recursive Neural Tensor Network

Page 28

Recursive Neural Tensor Network
Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank, Socher et al. 2013
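The composition described in that paper adds a bilinear tensor term on top of the standard affine layer. A minimal sketch with illustrative dimensions (tanh nonlinearity; parameter names are my own):

```python
import numpy as np

d = 4
rng = np.random.default_rng(2)
V = rng.standard_normal((d, 2 * d, 2 * d)) * 0.01  # one (2d x 2d) tensor slice per output dimension
W = rng.standard_normal((d, 2 * d)) * 0.1
b = np.zeros(d)

def rntn_compose(c1, c2):
    """RNTN composition: p_k = tanh(a^T V[k] a + (W a)_k + b_k), with a = [c1; c2].

    The bilinear term lets the two children interact multiplicatively, which the
    plain affine layer W a + b cannot express.
    """
    a = np.concatenate([c1, c2])                   # (2d,)
    bilinear = np.einsum('i,kij,j->k', a, V, a)    # a^T V[k] a for every slice k
    return np.tanh(bilinear + W @ a + b)
```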

Page 29

Details: Tensor Backpropagation Training
• Main new matrix derivative needed for a tensor:

[Excerpt shown on the slide from Petersen & Pedersen, The Matrix Cookbook (version November 15, 2012), p. 10, covering derivatives of inverses, eigenvalues, and first-order matrix forms. The first-order identities are:

∂(x^T a)/∂x = ∂(a^T x)/∂x = a              (69)
∂(a^T X b)/∂X = a b^T                      (70)
∂(a^T X^T b)/∂X = b a^T                    (71)
∂(a^T X a)/∂X = ∂(a^T X^T a)/∂X = a a^T    (72)

Identity (72) is the one needed for the gradient of each tensor slice.]

Page 30

Details: Tensor Backpropagation Training
• Minimizing cross-entropy error:
• Standard softmax error message:
• For each slice, we have the update:
• Main backprop rule to pass error down from parent:
• Finally, add errors from parent and current softmax:

Page 31

Positive/Negative Results on Treebank

[Bar chart: positive/negative full-sentence classification accuracy (y-axis 74-86%) for BiNB, RNN, MV-RNN, and RNTN, comparing training with sentence labels against training with the Treebank.]

Classifying sentences: accuracy improves to 85.4

Page 32

Fine-Grained Results on Treebank

Page 33

Negation Results

Page 34

Negation Results
• Most methods capture that negation often makes things more negative (see Potts, 2010)
• Analysis on a negation dataset
• Accuracy:

Page 35

Results on Negating Negatives
• But how about negating negatives?
• No flips, but positive activation should increase!

not bad

Page 36

Results on Negating Negatives
• Evaluation: Positive activation should increase

Page 37

Page 38

Visualizing Deep Learning: Word Embeddings

Page 39

LSTMs
• Remember LSTMs?
• Historically only over temporal sequences
• We used …

Page 40

Tree LSTMs
• We can use those ideas in grammatical tree structures!
• Paper: Tai et al. 2015: Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks
• Idea: Sum the child vectors in a tree structure
• Each child has its own forget gate
• Same softmax on h
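A hedged sketch of the Child-Sum Tree-LSTM cell these bullets describe (the parameter names in the params dict are assumptions; the gating structure follows Tai et al. 2015):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def child_sum_tree_lstm_node(x, child_h, child_c, params):
    """One Child-Sum Tree-LSTM node.

    x: input word vector at this node; child_h / child_c: lists of the children's
    hidden and cell states; params: dict holding the W_*, U_*, b_* weights.
    """
    # Sum the children's hidden states (the "sum the child vectors" idea).
    h_sum = np.sum(child_h, axis=0) if child_h else np.zeros_like(params['b_i'])

    i = sigmoid(params['W_i'] @ x + params['U_i'] @ h_sum + params['b_i'])   # input gate
    o = sigmoid(params['W_o'] @ x + params['U_o'] @ h_sum + params['b_o'])   # output gate
    u = np.tanh(params['W_u'] @ x + params['U_u'] @ h_sum + params['b_u'])   # candidate

    c = i * u
    for h_k, c_k in zip(child_h, child_c):
        # Each child gets its own forget gate, conditioned on that child's hidden state.
        f_k = sigmoid(params['W_f'] @ x + params['U_f'] @ h_k + params['b_f'])
        c = c + f_k * c_k

    h = o * np.tanh(c)   # the same softmax classifier as before is applied to h
    return h, c
```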


Page 41

Results on Stanford Sentiment Treebank

Method                                    Fine-grained   Binary
RAE (Socher et al., 2013)                 43.2           82.4
MV-RNN (Socher et al., 2013)              44.4           82.9
RNTN (Socher et al., 2013)                45.7           85.4
DCNN (Blunsom et al., 2014)               48.5           86.8
Paragraph-Vec (Le and Mikolov, 2014)      48.7           87.8
CNN-non-static (Kim, 2014)                48.0           87.2
CNN-multichannel (Kim, 2014)              47.4           88.1
DRNN (Irsoy and Cardie, 2014)             49.8           86.6
LSTM                                      45.8           86.7
Bidirectional LSTM                        49.1           86.8
2-layer LSTM                              47.5           85.5
2-layer Bidirectional LSTM                46.2           84.8
Constituency Tree LSTM (no tuning)        46.7           86.6
Constituency Tree LSTM                    50.6           86.9

Table 2 (Tai et al., 2015): Test set accuracies on the Stanford Sentiment Treebank. Fine-grained: 5-class sentiment classification. Binary: positive/negative sentiment classification. Results are given for Tree-LSTM models with and without fine-tuning of word representations.


Page 42

Semantic Similarity
• Better than binary paraphrase classification!
• Dataset from a competition:
SemEval-2014 Task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness [and textual entailment]

Relatedness score   Example
1.6                 A: "A man is jumping into an empty pool"
                    B: "There is no biker jumping in the air"
2.9                 A: "Two children are lying in the snow and are making snow angels"
                    B: "Two angels are making snow on the lying children"
3.6                 A: "The young boys are playing outdoors and the man is smiling nearby"
                    B: "There is no boy playing outdoors and there is no man smiling"
4.9                 A: "A person in a black jacket is doing tricks on a motorbike"
                    B: "A man in a black jacket is doing tricks on a motorbike"

Table 1: Examples of sentence pairs with their gold relatedness scores (on a 5-point rating scale).

Entailment label   Example
ENTAILMENT         A: "Two teams are competing in a football match"
                   B: "Two groups of people are playing football"
CONTRADICTION      A: "The brown horse is near a red barrel at the rodeo"
                   B: "The brown horse is far from a red barrel at the rodeo"
NEUTRAL            A: "A man in a black jacket is doing tricks on a motorbike"
                   B: "A person is riding the bicycle on one wheel"

Table 2: Examples of sentence pairs with their gold entailment labels.
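To predict such relatedness scores from two sentence vectors (e.g., Tree-LSTM root states), Tai et al. (2015) combine the elementwise product and absolute difference of the two vectors and classify over score bins. A hedged sketch with assumed parameter names:

```python
import numpy as np

def relatedness_score(h_l, h_r, params, K=5):
    """Predict a relatedness score on a 1..K scale from two sentence vectors.

    h_l, h_r: sentence representations (e.g., Tree-LSTM root hidden states);
    params: dict of assumed weights W_mul, W_abs, b_h, W_p, b_p.
    """
    h_mul = h_l * h_r           # elementwise product: angle-like information
    h_abs = np.abs(h_l - h_r)   # absolute difference: distance-like information

    hidden = 1.0 / (1.0 + np.exp(-(params['W_mul'] @ h_mul +
                                   params['W_abs'] @ h_abs + params['b_h'])))
    logits = params['W_p'] @ hidden + params['b_p']
    p_hat = np.exp(logits - logits.max())
    p_hat /= p_hat.sum()                 # softmax over the K score bins
    return np.arange(1, K + 1) @ p_hat   # expected score on the 1..K scale
```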


Page 43

Semantic Similarity Results (correlation and MSE)


Method                                      r        ρ        MSE
Mean vectors                                0.8046   0.7294   0.3595
DT-RNN (Socher et al., 2014)                0.7863   0.7305   0.3983
SDT-RNN (Socher et al., 2014)               0.7886   0.7280   0.3859
Illinois-LH (Lai and Hockenmaier, 2014)     0.7993   0.7538   0.3692
UNAL-NLP (Jimenez et al., 2014)             0.8070   0.7489   0.3550
Meaning Factory (Bjerva et al., 2014)       0.8268   0.7721   0.3224
ECNU (Zhao et al., 2014)                    0.8414   --       --
LSTM                                        0.8477   0.7921   0.2949
Bidirectional LSTM                          0.8522   0.7952   0.2850
2-layer LSTM                                0.8411   0.7849   0.2980
2-layer Bidirectional LSTM                  0.8488   0.7926   0.2893
Constituency Tree LSTM                      0.8491   0.7873   0.2852
Dependency Tree LSTM                        0.8627   0.8032   0.2635

Table 3 (Tai et al., 2015): Test set results on the SICK semantic relatedness subtask. The evaluation metrics are Pearson's r, Spearman's ρ, and mean squared error. Results are grouped as: (1) the paper's own baselines; (2) SemEval 2014 submissions; (3) sequential LSTM variants.


Pearson’sr,Spearman’sρ

Page 44

Semantic Similarity Results, Pearson Correlation

[Figure 4 from Tai et al. (2015): Pearson correlation r between predicted similarities and gold ratings vs. mean sentence length, for the LSTM, Bidirectional LSTM, Constituency Tree-LSTM, and Dependency Tree-LSTM. The Dependency Tree-LSTM gives the highest correlations, is the most robust to sentence length, and outperformed the sequential LSTM variants on this task.]


Page 45

Next lecture: Midterm review session
• Go over materials with different viewpoints
• Come with questions!

