Lecture 6: Neural Networks
Alan Ritter (many slides from Greg Durrett)
This Lecture
‣ Feedforward neural networks + backpropagation
‣ Neural network basics
‣ Applications
‣ Neural network history
‣ Implementing neural networks (if time)
History: NN “dark ages”
‣ Convnets: applied to MNIST by LeCun in 1998
‣ LSTMs: Hochreiter and Schmidhuber (1997)
‣ Henderson (2003): neural shift-reduce parser, not SOTA
2008-2013: A glimmer of light…
‣ Collobert and Weston 2011: “NLP (almost) from scratch”
‣ Feedforward neural nets induce features for sequential CRFs (“neural CRF”)
‣ 2008 version was marred by bad experiments, claimed SOTA but wasn’t; 2011 version tied SOTA
‣ Socher 2011-2014: tree-structured RNNs working okay
‣ Krizhevsky et al. (2012): AlexNet for vision
2014: Stuff starts working
‣ Sutskever et al. + Bahdanau et al.: seq2seq for neural MT (LSTMs work for NLP?)
‣ Kim (2014) + Kalchbrenner et al. (2014): sentence classification / sentiment (convnets work for NLP?)
‣ Chen and Manning transition-based dependency parser (even feedforward networks work well for NLP?)
‣ 2015: explosion of neural nets for everything under the sun
Why didn’t they work before?
‣ Datasets too small: for MT, not really better until you have 1M+ parallel sentences (and really need a lot more)
‣ Optimization not well understood: good initialization, per-feature scaling + momentum (Adagrad / Adadelta / Adam) work best out-of-the-box
‣ Regularization: dropout is pretty helpful
‣ Inputs: need word representations to have the right continuous semantics
‣ Computers not big enough: can’t run for enough iterations
Neural Net Basics
Neural Networks
‣ How can we do nonlinear classification? Kernels are too slow…
‣ Want to learn intermediate conjunctive features of the input
‣ Linear classification: argmax_y w^⊤ f(x, y)
‣ Example: for “the movie was not all that good”, a useful conjunctive feature is I[contains not & contains good]
Neural Networks: XOR
‣ Let’s see how we can use neural nets to learn a simple nonlinear function
‣ Inputs: x1, x2 (generally x = (x1, …, xm))
‣ Output: y (generally y = (y1, …, yn)); here y = x1 XOR x2

x1  x2  y = x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0
Neural Networks: XOR

x1  x2  x1 XOR x2
0   0   0
0   1   1
1   0   1
1   1   0

‣ y = a1 x1 + a2 x2: a linear function cannot fit XOR ✗
‣ y = a1 x1 + a2 x2 + a3 tanh(x1 + x2): adding a nonlinear “or”-like term works (tanh looks like an action potential in a neuron)
Neural Networks: XOR
‣ y = a1 x1 + a2 x2 cannot represent XOR ✗
‣ y = a1 x1 + a2 x2 + a3 tanh(x1 + x2) can: the tanh term acts like an “or” of the inputs
‣ One concrete solution: y = −x1 − x2 + 2 tanh(x1 + x2)
(check: y(0,0) = 0, y(0,1) = y(1,0) ≈ 0.52, y(1,1) ≈ −0.07)
Neural Networks: XOR
‣ Same idea on the sentiment example: x1 = I[contains not], x2 = I[contains good]
y = −x1 − x2 + 2 tanh(x1 + x2)
‣ For “the movie was not all that good”, both features fire, so the sentence lands in the XOR = 0 region
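The construction above can be checked numerically. A minimal sketch in plain Python (the 0.25 decision threshold is an arbitrary choice for illustration):

```python
import math

def xor_net(x1, x2):
    # One tanh hidden unit makes XOR representable:
    # y = -x1 - x2 + 2*tanh(x1 + x2)
    return -x1 - x2 + 2 * math.tanh(x1 + x2)

# The four binary inputs, thresholded at 0.25, recover XOR:
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, int(xor_net(x1, x2) > 0.25))
```

The raw outputs are 0, 0.52, 0.52, and −0.07, so any threshold strictly between 0 and 0.52 separates the XOR-true inputs from the rest.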
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
‣ Linear model: y = w · x + b
‣ Add a nonlinearity: y = g(w · x + b). The affine part shifts and warps the space; g applies a nonlinear transformation
‣ With several hidden units (vectorized): y = g(Wx + b)
Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
‣ A linear classifier cannot separate these classes, but a neural network can… possible because we transformed the space!
Deep Neural Networks
Adopted from Chris Dyer
‣ First layer: y = g(Wx + b)   (y is the output of the first layer)
‣ Second layer: z = g(Vy + c)
‣ Composed: z = g(V g(Wx + b) + c)
‣ “Feedforward” computation (not recurrent)
‣ Check: what happens if there is no nonlinearity? Then z = V(Wx + b) + c = (VW)x + (Vb + c), which is just another linear model: no more powerful than basic linear models
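The answer to the check can be verified numerically: with the nonlinearity removed, the two layers collapse into a single linear map. A sketch assuming NumPy, with arbitrary small dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 3)), rng.standard_normal(4)
V, c = rng.standard_normal((2, 4)), rng.standard_normal(2)
x = rng.standard_normal(3)

# Two layers with the nonlinearity g removed...
z_two_layer = V @ (W @ x + b) + c
# ...equal one linear model with weights VW and bias Vb + c
z_one_layer = (V @ W) @ x + (V @ b + c)

print(np.allclose(z_two_layer, z_one_layer))  # True
```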
Deep Neural Networks
Taken from http://colah.github.io/posts/2014-03-NN-Manifolds-Topology/
Feedforward Networks, Backpropagation
Logistic Regression with NNs
‣ Single scalar probability: P(y|x) = exp(w^⊤ f(x, y)) / Σ_{y′} exp(w^⊤ f(x, y′))
‣ Compute scores for all possible labels at once (returns vector): P(y|x) = softmax([w^⊤ f(x, y)]_{y∈Y})
‣ softmax: exponentiates and normalizes a given vector: softmax(p)_i = exp(p_i) / Σ_{i′} exp(p_{i′})
‣ Weight vector per class, W is [num classes × num feats]: P(y|x) = softmax(W f(x))
‣ Now one hidden layer: P(y|x) = softmax(W g(V f(x)))
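A minimal softmax sketch in NumPy (subtracting the max is a standard numerical-stability trick, not required by the math):

```python
import numpy as np

def softmax(p):
    # Exponentiate and normalize; subtracting max(p) avoids overflow
    # and does not change the result.
    e = np.exp(p - np.max(p))
    return e / e.sum()

probs = softmax(np.array([1.0, 2.0, 3.0]))
# probs sums to 1 and preserves the ordering of the scores
```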
Neural Networks for Classification
P(y|x) = softmax(W g(V f(x)))
f(x): n features → V: d × n matrix → g: nonlinearity (tanh, relu, …) → z: d hidden units → W: num_classes × d matrix → softmax → P(y|x): num_classes probs
Training Neural Networks
‣ Maximize log likelihood of training data
‣ i*: index of the gold label
‣ e_i: 1 in the ith row, zero elsewhere. Dotting by this = selecting the ith index
z = g(V f(x)),   P(y|x) = softmax(Wz)
L(x, i*) = log P(y = i*|x) = log(softmax(Wz) · e_{i*})
L(x, i*) = Wz · e_{i*} − log Σ_j exp(Wz) · e_j
Computing Gradients
L(x, i*) = Wz · e_{i*} − log Σ_j exp(Wz) · e_j
‣ Gradient with respect to W:
∂L(x, i*)/∂W_ij = z_j − P(y = i|x) z_j   if i = i*
∂L(x, i*)/∂W_ij = −P(y = i|x) z_j        otherwise
‣ Looks like logistic regression with z as the features!
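The piecewise gradient above can be written as a single outer product, ∂L/∂W = (e_{i*} − P(y|x)) z^⊤, and checked against a finite difference. A sketch assuming NumPy (3 classes and 4 hidden units picked arbitrarily):

```python
import numpy as np

def softmax(p):
    e = np.exp(p - p.max())
    return e / e.sum()

def loss(W, z, i_star):
    # L(x, i*) = (Wz)_{i*} - log sum_j exp((Wz)_j)
    s = W @ z
    return s[i_star] - np.log(np.exp(s).sum())

rng = np.random.default_rng(0)
W, z, i_star = rng.standard_normal((3, 4)), rng.standard_normal(4), 1

# Row i of the gradient is (1[i == i*] - P(y=i|x)) * z
e = np.zeros(3); e[i_star] = 1.0
grad = np.outer(e - softmax(W @ z), z)

# Finite-difference check on one entry
eps = 1e-6
Wp = W.copy(); Wp[0, 2] += eps
num = (loss(Wp, z, i_star) - loss(W, z, i_star)) / eps
print(abs(num - grad[0, 2]) < 1e-4)  # True
```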
Neural Networks for Classification
P(y|x) = softmax(W g(V f(x)))
f(x) → V → g → z → W → softmax → P(y|x), with ∂L/∂W computed at the output layer
Computing Gradients: Backpropagation
z = g(V f(x)): activations at hidden layer
‣ Gradient with respect to V: apply the chain rule
L(x, i*) = Wz · e_{i*} − log Σ_j exp(Wz) · e_j
err(root) = e_{i*} − P(y|x)   (dim = m, the number of classes)
∂L(x, i*)/∂z = err(z) = W^⊤ err(root)   (dim = d)
[some math…]
∂L(x, i*)/∂V_ij = (∂L(x, i*)/∂z) · (∂z/∂V_ij)
Backpropagation: Picture
P(y|x) = softmax(W g(V f(x)))
f(x) → V → g → z → W → softmax → P(y|x); the error signals err(root) and err(z) flow backward through the same graph, giving ∂L/∂W and then gradients for earlier layers
‣ Can forget everything after z, treat it as the output, and keep backpropping
Backpropagation: Takeaways
‣ Gradients of output weights W are easy to compute: looks like logistic regression with hidden layer z as the feature vector
‣ Can compute derivative of loss with respect to z to form an “error signal” for backpropagation
‣ Easy to update parameters based on the “error signal” from the next layer; keep pushing the error signal back as backpropagation
‣ Need to remember the values from the forward computation
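The whole pipeline above (forward pass, error signals, chain rule for V) fits in a few lines of NumPy. This sketch, with arbitrary dimensions, checks the backpropagated gradient against a finite difference:

```python
import numpy as np

def softmax(p):
    e = np.exp(p - p.max())
    return e / e.sum()

rng = np.random.default_rng(1)
n, d, m = 5, 4, 3                      # features, hidden units, classes
V = rng.standard_normal((d, n))
W = rng.standard_normal((m, d))
fx = rng.standard_normal(n)            # f(x)
i_star = 2                             # gold label index

# Forward pass: remember z for the backward pass
z = np.tanh(V @ fx)
P = softmax(W @ z)

# Backward pass: push the error signal back through W, then through tanh
e = np.zeros(m); e[i_star] = 1.0
err_root = e - P                       # dim = m
err_z = W.T @ err_root                 # dim = d
grad_V = np.outer(err_z * (1 - z**2), fx)   # tanh'(a) = 1 - tanh(a)^2

# Finite-difference check on one entry of V
def loss(Vm):
    s = W @ np.tanh(Vm @ fx)
    return s[i_star] - np.log(np.exp(s).sum())

eps = 1e-6
Vp = V.copy(); Vp[1, 3] += eps
print(abs((loss(Vp) - loss(V)) / eps - grad_V[1, 3]) < 1e-4)  # True
```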
Applications
NLP with Feedforward Networks
Botha et al. (2017)
‣ Part-of-speech tagging with FFNNs: e.g., tag “interest” in “Fed raises interest rates in order to …”
‣ Word embeddings for each word form the input f(x): emb(raises) (previous word), emb(interest) (curr word), emb(rates) (next word), plus other words, feats, etc.
‣ ~1000 features here: smaller feature vector than in sparse models, but every feature fires on every example
‣ Weight matrix learns position-dependent processing of the words
NLP with Feedforward Networks
Botha et al. (2017)
‣ Hidden layer mixes these different signals and learns feature conjunctions
NLP with Feedforward Networks
Botha et al. (2017)
‣ Multilingual tagging results:
‣ Gillick used LSTMs; this model is smaller, faster, and better
Sentiment Analysis
‣ Deep Averaging Networks: feedforward neural network on average of word embeddings from input
Iyyer et al. (2015)
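The DAN idea can be sketched in a few lines. Everything here (toy vocabulary, embedding size, layer sizes, random weights) is a hypothetical setup for illustration, not the Iyyer et al. model:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "movie": 1, "was": 2, "good": 3}   # toy vocabulary
emb = rng.standard_normal((len(vocab), 8))            # 8-dim embeddings
V = rng.standard_normal((4, 8))                       # hidden layer
W = rng.standard_normal((2, 4))                       # 2 output classes

def dan_scores(words):
    # Deep Averaging Network: average the word vectors, then feedforward
    avg = emb[[vocab[w] for w in words]].mean(axis=0)
    return W @ np.tanh(V @ avg)

scores = dan_scores(["the", "movie", "was", "good"])  # shape (2,)
```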
Sentiment Analysis
‣ Results table (not preserved here) comparing bag-of-words models (Wang and Manning, 2012) against tree RNNs / CNNs / LSTMs (Kim, 2014; Iyyer et al., 2015)
Coreference Resolution
‣ Feedforward networks identify coreference arcs
Clark and Manning (2015), Wiseman et al. (2015)
‣ Example: does “He” in “He later gave a speech…” refer back to “President Obama” in “President Obama signed…”?
Implementation Details
Computation Graphs
‣ Computing gradients is hard!
‣ Automatic differentiation: instrument code to keep track of derivatives

y = x * x   →(codegen)→   (y, dy) = (x * x, 2 * x * dx)

‣ Computation is now something we need to reason about symbolically
‣ Use a library like PyTorch or TensorFlow. This class: PyTorch
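The (y, dy) instrumentation sketched above is essentially forward-mode automatic differentiation, which can be mimicked with a small "dual number" class. This is a toy illustration of the idea, not how PyTorch works internally (PyTorch uses reverse-mode autodiff over a recorded graph):

```python
class Dual:
    """A (value, derivative) pair; arithmetic propagates both,
    like the (y, dy) = (x * x, 2 * x * dx) code generation above."""
    def __init__(self, val, dval=0.0):
        self.val, self.dval = val, dval

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dval + other.dval)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: d(uv) = u*dv + v*du
        return Dual(self.val * other.val,
                    self.val * other.dval + other.val * self.dval)

x = Dual(3.0, 1.0)      # seed dx/dx = 1
y = x * x               # y = x^2
print(y.val, y.dval)    # 9.0 6.0
```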
Computation Graphs in PyTorch
P(y|x) = softmax(W g(V f(x)))

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))

‣ Defines the forward pass for P(y|x)
Computation Graphs in PyTorch
P(y|x) = softmax(W g(V f(x)))

ffnn = FFNN()

def make_update(input, gold_label):
    ffnn.zero_grad()  # clear gradient variables
    probs = ffnn.forward(input)
    loss = torch.neg(torch.log(probs)).dot(gold_label)
    loss.backward()
    optimizer.step()

‣ gold_label = e_{i*}: one-hot vector of the label (e.g., [0, 1, 0])
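The snippet above references an `optimizer` that is never constructed on this slide. A minimal end-to-end setup might look like the following; the layer sizes, learning rate, and the choice of Adam are all arbitrary assumptions for illustration:

```python
import torch
import torch.nn as nn

class FFNN(nn.Module):
    def __init__(self, inp, hid, out):
        super(FFNN, self).__init__()
        self.V = nn.Linear(inp, hid)
        self.g = nn.Tanh()
        self.W = nn.Linear(hid, out)
        self.softmax = nn.Softmax(dim=0)

    def forward(self, x):
        return self.softmax(self.W(self.g(self.V(x))))

ffnn = FFNN(10, 4, 2)   # hypothetical sizes: 10 feats, 4 hidden, 2 classes
optimizer = torch.optim.Adam(ffnn.parameters(), lr=1e-3)

x = torch.rand(10)
gold_label = torch.tensor([0.0, 1.0])   # one-hot gold label e_{i*}
ffnn.zero_grad()
probs = ffnn(x)
loss = torch.neg(torch.log(probs)).dot(gold_label)
loss.backward()
optimizer.step()
```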
Training a Model
Define a computation graph
For each epoch:
    For each batch of data:
        Compute loss on batch
        Autograd to compute gradients and take step
Decode test set
Batching
‣ Batching data gives speedups due to more efficient matrix operations
‣ Need to make the computation graph process a batch at the same time

def make_update(input, gold_label):
    # input is [batch_size, num_feats]
    # gold_label is [batch_size, num_classes]
    ...
    probs = ffnn.forward(input)  # [batch_size, num_classes]
    loss = torch.sum(torch.neg(torch.log(probs)) * gold_label)
    ...

‣ Batch sizes from 1-100 often work well
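A runnable batched version might look like this sketch with arbitrary sizes. Note `Softmax(dim=1)` so the softmax runs over classes rather than over the batch, and an elementwise multiply with the one-hot rows replaces the single-example dot product:

```python
import torch
import torch.nn as nn

# Hypothetical sizes: batch of 3 examples, 5 features, 2 classes
model = nn.Sequential(nn.Linear(5, 4), nn.Tanh(), nn.Linear(4, 2),
                      nn.Softmax(dim=1))             # softmax over classes
inputs = torch.rand(3, 5)                            # [batch_size, num_feats]
gold = torch.tensor([[1., 0.], [0., 1.], [1., 0.]])  # one-hot rows

probs = model(inputs)                                # [batch_size, num_classes]
# Sum of per-example negative log likelihoods of the gold labels
loss = torch.sum(torch.neg(torch.log(probs)) * gold)
```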
Next Time
‣ More implementation details: practical training techniques
‣ Word representations / word vectors
‣ word2vec, GloVe