CS388: Natural Language Processing
Lecture 6: Neural Networks
Greg Durrett
Administrivia
‣ Mini 1 graded, posted on Canvas
‣ Project 1 due in 9 days
‣ Xi Ye (88.0 F1), Quang Duong (87.3 F1), Uday Kusupati (87.2 F1); 6 students in the 86 range, rest are 85 or below
‣ Test F1s << dev F1
‣ Changing thresholds / imbalanced classification
‣ POS/chunk features
‣ Small bug fixed in BadNerModel (no impact on the code you write)
‣ Someone got 86.3 with only 7 features total; the classifier is a dictionary
This Lecture
‣ Feedforward neural networks + backpropagation
‣ Neural network basics
‣ Applications
‣ Neural network history
‣ Beam search: in a few lectures
‣ Implementing neural networks (if time)
History: NN “dark ages”
‣ Convnets: applied to MNIST by LeCun in 1998
‣ LSTMs: Hochreiter and Schmidhuber (1997)
‣ Henderson (2003): neural shift-reduce parser, not SOTA
2008-2013: A glimmer of light…
‣ Collobert and Weston 2011: “NLP (almost) from scratch”
‣ Feedforward neural nets induce features for sequential CRFs (“neural CRF”)
‣ 2008 version was marred by bad experiments, claimed SOTA but wasn’t; 2011 version tied SOTA
‣ Socher 2011-2014: tree-structured RNNs working okay
‣ Krizhevsky et al. (2012): AlexNet for vision
2014: Stuff starts working
‣ Sutskever et al. + Bahdanau et al.: seq2seq for neural MT (LSTMs work for NLP?)
‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets work for NLP?)
‣ Chen and Manning transition-based dependency parser (even feedforward networks work well for NLP?)
‣ 2015: explosion of neural nets for everything under the sun
Why didn’t they work before?
‣ Datasets too small: for MT, not really better until you have 1M+ parallel sentences (and really need a lot more)
‣ Optimization not well understood: good initialization, per-feature scaling + momentum (Adagrad/Adadelta/Adam) work best out-of-the-box
‣ Regularization: dropout is pretty helpful
‣ Inputs: need word representations to have the right continuous semantics
‣ Computers not big enough: can’t run for enough iterations
Neural Net Basics
Neural Networks
‣ How can we do nonlinear classification? Kernels are too slow…
‣ Want to learn intermediate conjunctive features of the input
‣ Linear classification: argmax_y w⊤f(x, y)
‣ Example: "the movie was not all that good" calls for a conjunctive feature like I[contains "not" & contains "good"]
Neural Networks: XOR
‣ Let’s see how we can use neural nets to learn a simple nonlinear function
‣ Inputs: x1, x2 (generally x = (x1, ..., xm))
‣ Output: y (generally y = (y1, ..., yn)); here y = x1 XOR x2

x1  x2 | x1 XOR x2
 0   0 | 0
 0   1 | 1
 1   0 | 1
 1   1 | 0

[Figure: the four points plotted in the (x1, x2) plane, labeled by their XOR value]
Neural Networks: XOR
‣ A linear function y = a1x1 + a2x2 cannot represent XOR: a linear decision boundary gets at best something like “or”
‣ Adding a nonlinear term fixes this: y = a1x1 + a2x2 + a3 tanh(x1 + x2)
‣ (tanh looks like the action potential in a neuron)

[Figure: XOR truth table and the linear “or” decision boundary in the (x1, x2) plane]
Neural Networks: XOR
‣ y = a1x1 + a2x2 (linear): cannot fit XOR
‣ y = a1x1 + a2x2 + a3 tanh(x1 + x2): can
‣ For example: y = -x1 - x2 + 2 tanh(x1 + x2) (checked numerically below)

[Figure: XOR truth table and the resulting nonlinear decision surface in the (x1, x2) plane, with the “or” boundary for comparison]
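As a quick numerical check (not from the slides), we can evaluate this concrete function on the four XOR inputs; it scores (0,1) and (1,0) above (0,0) and (1,1), so a simple threshold solves XOR:

import numpy as np

def xor_net(x1, x2):
    # the hand-built function from the slide: y = -x1 - x2 + 2 tanh(x1 + x2)
    return -x1 - x2 + 2 * np.tanh(x1 + x2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor_net(x1, x2), 3))
# prints roughly 0.0, 0.523, 0.523, -0.072;
# thresholding at, say, 0.25 reproduces XOR exactly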
Neural Networks: XOR
‣ Back to the sentiment example: x1 = I[contains "not"], x2 = I[contains "good"]
‣ y = -2x1 - x2 + 2 tanh(x1 + x2)

"the movie was not all that good"

[Figure: the sentence’s (x1, x2) feature values plotted in the plane with the decision surface]
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Linear model: y = w · x + b
Neural network: y = g(w · x + b), or with multiple hidden units, y = g(Wx + b)
‣ W warps the space, + b shifts it, and g applies a nonlinear transformation
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/

[Figure: linear classifier vs. neural network decision boundaries]
‣ The neural network separates the classes… possible because we transformed the space!
Deep Neural Networks
Adopted from Chris Dyer

y = g(Wx + b)   (output of first layer)
z = g(Vy + c) = g(Vg(Wx + b) + c)
(input x → first layer y → second layer z)

‣ “Feedforward” computation (not recurrent)
‣ Check: what happens if there is no nonlinearity, i.e., z = V(Wx + b) + c? Is this more powerful than basic linear models? (See the sketch below.)
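A minimal sketch (my own, not from the slides) that answers the check: with no nonlinearity, the two layers collapse into a single linear map, so the model is no more powerful than a basic linear one. The sizes below are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(4)
V, c = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

z_two_layer = V @ (W @ x + b) + c           # "deep" model with no nonlinearity
z_one_layer = (V @ W) @ x + (V @ b + c)     # an equivalent single linear layer
print(np.allclose(z_two_layer, z_one_layer))   # True: still just a linear model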
Deep Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Feedforward Networks, Backpropagation
Logistic Regression with NNs
‣ Single scalar probability: P(y|x) = exp(w⊤f(x, y)) / Σ_{y'} exp(w⊤f(x, y'))
‣ Compute scores for all possible labels at once (returns vector): P(y|x) = softmax([w⊤f(x, y)]_{y ∈ Y})
‣ softmax: exps and normalizes a given vector: softmax(p)_i = exp(p_i) / Σ_{i'} exp(p_{i'}) (see the sketch below)
‣ Weight vector per class; W is [num classes x num feats]: P(y|x) = softmax(Wf(x))
‣ Now one hidden layer: P(y|x) = softmax(Wg(Vf(x)))
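A direct transcription of the softmax definition above (a sketch; subtracting the max is a standard numerical-stability trick, not something the slide mentions):

import numpy as np

def softmax(p):
    # softmax(p)_i = exp(p_i) / sum_i' exp(p_i')
    p = p - np.max(p)          # shifting by a constant doesn't change the result
    exps = np.exp(p)
    return exps / exps.sum()

print(softmax(np.array([1.0, 2.0, 3.0])))   # ~[0.09, 0.245, 0.665], sums to 1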
Neural Networks for Classification

P(y|x) = softmax(Wg(Vf(x)))

[Diagram: f(x) (n features) → V (d x n matrix) → nonlinearity g (tanh, relu, …) → z (d hidden units) → W (num_classes x d matrix) → softmax → P(y|x) (num_classes probs)]
Training Neural Networks

z = g(Vf(x))    P(y|x) = softmax(Wz)

‣ Maximize log likelihood of training data:
L(x, i*) = log P(y = i*|x) = log (softmax(Wz) · e_{i*}) = Wz · e_{i*} - log Σ_j exp(Wz) · e_j
‣ i*: index of the gold label
‣ e_i: 1 in the ith row, zero elsewhere. Dot by this = select ith index
Computing Gradients
‣ Gradient with respect to W:

L(x, i*) = Wz · e_{i*} - log Σ_j exp(Wz) · e_j

∂L(x, i*)/∂W_ij = z_j - P(y = i|x) z_j    if i = i*
∂L(x, i*)/∂W_ij = -P(y = i|x) z_j         otherwise

‣ Looks like logistic regression with z as the features!
Neural Networks for Classification

P(y|x) = softmax(Wg(Vf(x)))

[Diagram: f(x) → V → g → z → W → softmax → P(y|x), with ∂L/∂W annotated at the output layer]
Computing Gradients: Backpropagation

z = g(Vf(x))    (activations at hidden layer)

L(x, i*) = Wz · e_{i*} - log Σ_j exp(Wz) · e_j

‣ Gradient with respect to V: apply the chain rule (worked numerically in the sketch below)
err(root) = e_{i*} - P(y|x)    (dim = m)
∂L(x, i*)/∂z = err(z) = W⊤ err(root)    (dim = d)
[some math…]
∂L(x, i*)/∂V_ij = (∂L(x, i*)/∂z) (∂z/∂V_ij)
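Here is a compact sketch of the forward and backward passes described above, writing the gradients exactly as on the slides (err(root) = e_i* - P(y|x), err(z) = W⊤ err(root)), with g = tanh so that ∂z/∂a = 1 - z². The layer sizes and random inputs are made up for illustration, and a finite-difference check on one entry of W confirms the formula.

import numpy as np

def softmax(p):
    exps = np.exp(p - p.max())
    return exps / exps.sum()

rng = np.random.default_rng(0)
n, d, m = 5, 4, 3                          # num feats, hidden units, classes
f_x = rng.standard_normal(n)               # feature vector f(x)
V = rng.standard_normal((d, n))
W = rng.standard_normal((m, d))
i_star = 1                                 # index of the gold label

# Forward pass (remember z for the backward pass)
z = np.tanh(V @ f_x)                       # z = g(V f(x))
probs = softmax(W @ z)                     # P(y|x) = softmax(Wz)

# Backward pass
e_istar = np.eye(m)[i_star]
err_root = e_istar - probs                 # err(root) = e_i* - P(y|x)
dW = np.outer(err_root, z)                 # dL/dW_ij = (e_i* - P(y|x))_i * z_j
err_z = W.T @ err_root                     # err(z) = W^T err(root)
dV = np.outer(err_z * (1 - z ** 2), f_x)   # chain rule through tanh: dz/da = 1 - z^2

# Finite-difference check of one entry of dW
def loss(W_):
    return np.log(softmax(W_ @ z)[i_star])
W_pert = W.copy()
W_pert[0, 0] += 1e-6
print(np.isclose((loss(W_pert) - loss(W)) / 1e-6, dW[0, 0], atol=1e-4))   # True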
Backpropagation: Picture

P(y|x) = softmax(Wg(Vf(x)))

[Diagram: f(x) → V → g → z → W → softmax → P(y|x), with ∂L/∂W and err(root) at the output and err(z) at the hidden layer]

‣ Can forget everything after z, treat it as the output and keep backpropping
Backpropagation: Takeaways
‣ Gradients of output weights W are easy to compute: looks like logistic regression with hidden layer z as the feature vector
‣ Can compute derivative of loss with respect to z to form an “error signal” for backpropagation
‣ Easy to update parameters based on “error signal” from the next layer; keep pushing the error signal back as backpropagation
‣ Need to remember the values from the forward computation
Applications

NLP with Feedforward Networks
Botha et al. (2017)

"… Fed raises interest rates in order to …"

‣ Part-of-speech tagging with FFNNs: what should f(x) be?
‣ Word embeddings for each word form the input: concatenate emb(previous word), emb(current word), emb(next word), plus other words, feats, etc. (here emb(raises), emb(interest), emb(rates) when tagging "interest"; see the sketch after this list)
‣ ~1000 features here: a smaller feature vector than in sparse models, but every feature fires on every example
‣ Weight matrix learns position-dependent processing of the words
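Below is a minimal sketch of how such an input vector could be assembled; the embedding dimension, toy vocabulary, and lookup table are my own placeholders, and the real Botha et al. (2017) model uses more features than this:

import numpy as np

emb_dim = 50
vocab = {"Fed": 0, "raises": 1, "interest": 2, "rates": 3}
emb_table = np.random.randn(len(vocab), emb_dim)   # stand-in for learned embeddings

def emb(word):
    return emb_table[vocab[word]]

# f(x) for tagging "interest": concatenate previous, current, and next word embeddings
f_x = np.concatenate([emb("raises"), emb("interest"), emb("rates")])
print(f_x.shape)   # (150,): a small dense vector where every feature fires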
NLP with Feedforward Networks
Botha et al. (2017)
‣ Hidden layer mixes these different signals and learns feature conjunctions
NLP with Feedforward Networks
Botha et al. (2017)
‣ Multilingual tagging results (table in the slides)
‣ Gillick used LSTMs; this is smaller, faster, and better
Sentiment Analysis
Iyyer et al. (2015)
‣ Deep Averaging Networks: feedforward neural network on average of word embeddings from input (a sketch of this idea follows below)
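A rough PyTorch sketch of the deep averaging idea (layer sizes are illustrative, and the actual DAN of Iyyer et al. (2015) adds word dropout and multiple hidden layers):

import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size, emb_dim, hid, num_classes):
        super(DAN, self).__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(nn.Linear(emb_dim, hid), nn.Tanh(),
                                nn.Linear(hid, num_classes))

    def forward(self, word_indices):
        # average the word embeddings of the input, then run a feedforward net on the average
        avg = self.emb(word_indices).mean(dim=0)
        return self.ff(avg)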
Sentiment Analysis

[Table: sentiment accuracies comparing bag-of-words models (Wang and Manning, 2012) with tree RNNs / CNNs / LSTMs (Kim, 2014) and DANs (Iyyer et al., 2015)]
Coreference Resolution
Clark and Manning (2015), Wiseman et al. (2015)
‣ Feedforward networks identify coreference arcs: e.g., is "He" in "He later gave a speech…" coreferent with "President Obama" in "President Obama signed…"?
Implementation Details

Computation Graphs
‣ Computing gradients is hard!
‣ Automatic differentiation: instrument code to keep track of derivatives (tiny Pytorch example below)

y = x * x   ⇒ (codegen) ⇒   (y, dy) = (x * x, 2 * x * dx)

‣ Computation is now something we need to reason about symbolically
‣ Use a library like Pytorch or Tensorflow. This class: Pytorch
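In Pytorch this bookkeeping happens automatically; here is a tiny example of the y = x * x case above (requires_grad marks the variable we want the derivative for):

import torch

x = torch.tensor(3.0, requires_grad=True)
y = x * x          # forward pass: the graph records how y was built from x
y.backward()       # backward pass: autograd fills in dy/dx
print(x.grad)      # tensor(6.) = 2 * x, matching the hand-derived 2 * x * dx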
Computation Graphs in Pytorch

P(y|x) = softmax(Wg(Vf(x)))

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))

‣ Define the forward pass for the network
Computation Graphs in Pytorch

P(y|x) = softmax(Wg(Vf(x)))

ffnn = FFNN(inp, hid, out)

def make_update(input, gold_label):
    ffnn.zero_grad()   # clear gradient variables
    probs = ffnn.forward(input)
    loss = torch.neg(torch.log(probs)).dot(gold_label)
    loss.backward()
    optimizer.step()   # optimizer (e.g., from torch.optim) is assumed to be constructed elsewhere

‣ gold_label: ei*, a one-hot vector of the gold label (e.g., [0, 1, 0])
Training a Model

Define a computation graph
For each epoch:
    For each batch of data:
        Compute loss on batch
        Autograd to compute gradients and take step
Decode test set
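Putting this recipe together with the FFNN and make_update code from the previous slides gives something like the sketch below; the sizes, number of epochs, data iterables, and the choice of Adam are placeholders, not settings from the slides:

import torch
import torch.optim as optim

ffnn = FFNN(inp, hid, out)                            # define a computation graph
optimizer = optim.Adam(ffnn.parameters(), lr=1e-3)    # hypothetical optimizer choice

for epoch in range(num_epochs):
    for input, gold_label in training_batches:        # for each batch of data
        make_update(input, gold_label)                # loss + autograd + optimizer step

predictions = [torch.argmax(ffnn.forward(x)) for x in test_inputs]   # decode test set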
Batching
‣ Batching data gives speedups due to more efficient matrix operations
‣ Need to make the computation graph process a batch at the same time (see the note below)
‣ Batch sizes from 1-100 often work well

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    ...
    probs = ffnn.forward(input)   # [batch_size, num_classes]
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)   # elementwise *, summed over the batch
    ...
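One detail to watch when batching the earlier FFNN (my observation, not stated on the slide): with a [batch_size, num_feats] input, nn.Softmax(dim=0) would normalize across the batch rather than across classes, so the softmax should run over the last dimension, and the loss can use an elementwise product with the one-hot labels:

import torch
import torch.nn as nn

softmax = nn.Softmax(dim=-1)                 # normalize over classes, batched or not
scores = torch.randn(32, 3)                  # [batch_size, num_classes] scores from W g(V f(x))
probs = softmax(scores)
gold = nn.functional.one_hot(torch.randint(0, 3, (32,)), num_classes=3).float()
loss = torch.sum(torch.neg(torch.log(probs)) * gold)   # batched version of the loss above
print(probs.sum(dim=-1))                     # each row sums to 1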
Next Time
‣ More implementation details: practical training techniques
‣ Word representations / word vectors
‣ word2vec, GloVe