Reading Comprehension
Bidirectional Attention Flow
Seo et al. (2016)
Each passage word now “knows about” the query
QANet
‣ One of many models building on BiDAF in more complex ways
Yu et al. (2018)
‣ Similar structure to BiDAF, but with transformer layers (next lecture) instead of LSTMs
SQuAD SOTA: Fall 2018
‣ nlnet, QANet, r-net: dueling super complex systems (much more complex than BiDAF…)
‣ BiDAF: 73 EM / 81 F1
SQuAD 2.0 SOTA: Spring 2019
‣ Harder variant of SQuAD
‣ Since spring 2019: SQuAD performance is dominated by large pre-trained models like BERT
Adversarial Examples
‣ Can construct adversarial examples that fool these systems: add one carefully chosen sentence and performance drops to below 50%
Jia and Liang (2017)
‣ Still “surface-level” matching, not complex understanding
‣ Other challenges: recognizing when answers aren’t present, doing multi-step reasoning
Pre-training / ELMo
What is pre-training?
‣ “Pre-train” a model on a large dataset for task X, then “fine-tune” it on a dataset for task Y
‣ Key idea: X is somewhat related to Y, so a model that can do X will have some good neural representations for Y as well
‣ GloVe can be seen as pre-training: learn vectors with the skip-gram objective on large data (task X), then fine-tune them as part of a neural network for sentiment / any other task (task Y); see the sketch after this list
‣ ImageNet pre-training is huge in computer vision: learn generic visual features for recognizing objects
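A minimal PyTorch sketch of that GloVe pattern, assuming a hypothetical `glove_vectors` array (vocab × dim) loaded from pre-trained GloVe files; the averaging classifier is illustrative, not a prescribed architecture:

```python
import torch
import torch.nn as nn

class SentimentNet(nn.Module):
    """Task Y (e.g., sentiment) model initialized from task X (GloVe) vectors."""
    def __init__(self, glove_vectors, num_classes=2):
        super().__init__()
        # Start from pre-trained vectors; freeze=False means we fine-tune them
        self.embed = nn.Embedding.from_pretrained(
            torch.tensor(glove_vectors, dtype=torch.float), freeze=False)
        self.classifier = nn.Linear(self.embed.embedding_dim, num_classes)

    def forward(self, word_ids):  # (seq_len,) tensor of word indices
        # Average the (fine-tuned) embeddings over the sentence, then classify
        return self.classifier(self.embed(word_ids).mean(dim=0))
```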
GloVe is insufficient
‣ GloVe uses a lot of data, but in a weak way
‣ Having a single embedding for each word is wrong
‣ Identifying discrete word senses is hard and doesn’t scale: it’s hard to identify how many senses each word has
‣ Instead: take a powerful language model, train it on large amounts of data, then use those representations in downstream tasks
they hit the balls / they dance at balls
‣ How can we make our word embeddings more context-dependent?
Context-dependent Embeddings
Peters et al. (2018)
‣ Train a neural language model to predict the next word given previous words in the sentence, then use the hidden states (outputs) at each step as word embeddings (see the sketch below)
they hit the balls / they dance at balls
‣ This is the key idea behind ELMo: language models can allow us to form useful word representations in the same way word2vec did
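A minimal sketch of that idea, assuming a hypothetical already-trained `lm_embed` / `lm_lstm` pair (not the real ELMo interface):

```python
import torch
import torch.nn as nn

vocab_size, dim = 10000, 512
lm_embed = nn.Embedding(vocab_size, dim)       # trained as part of the LM
lm_lstm = nn.LSTM(dim, dim, batch_first=True)  # trained to predict the next word

word_ids = torch.tensor([[5, 42, 7, 99]])      # e.g., "they hit the balls"
with torch.no_grad():
    hidden_states, _ = lm_lstm(lm_embed(word_ids))
# hidden_states[0, t] embeds word t *in context*: "balls" gets a different
# vector after "they hit the" than after "they dance at"
```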
ELMo
‣ CNN over each word => RNN
[Architecture figure: “John visited Madagascar yesterday”, each word encoded by a CharCNN (2048 CNN filters projected down to 512-dim), feeding 4096-dim LSTMs trained to predict the next word; the representation of “visited” is the corresponding hidden state (plus vectors from another LM running backwards)]
Peters et al. (2018)
*getting this model right took years
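A rough shape-level sketch of the encoder in that figure, using the slide’s dimensions (2048 char-CNN filters projected to 512, 4096-dim LSTMs, forward plus backward LMs); the filter width, character vocabulary size, and single-layer LSTMs are illustrative assumptions, not the published configuration:

```python
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    """Encode each word from its characters: embed, convolve, max-pool, project."""
    def __init__(self, num_chars=262, char_dim=16, num_filters=2048, out_dim=512):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size=3, padding=1)
        self.proj = nn.Linear(num_filters, out_dim)  # 2048 filters -> 512-dim

    def forward(self, char_ids):                        # (words, max_chars)
        x = self.char_embed(char_ids).transpose(1, 2)   # (words, char_dim, max_chars)
        x = torch.relu(self.conv(x)).max(dim=2).values  # max-pool over characters
        return self.proj(x)                             # (words, 512)

class BiLMSketch(nn.Module):
    """Forward and backward LSTM LMs over the char-CNN word vectors."""
    def __init__(self, vocab_size=800000, word_dim=512, lstm_dim=4096):
        super().__init__()
        self.word_encoder = CharCNNWordEncoder(out_dim=word_dim)
        self.fwd_lstm = nn.LSTM(word_dim, lstm_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(word_dim, lstm_dim, batch_first=True)
        self.next_word = nn.Linear(lstm_dim, vocab_size)  # LM head, used in pre-training

    def forward(self, char_ids):                       # (1, words, max_chars)
        w = self.word_encoder(char_ids.squeeze(0)).unsqueeze(0)
        h_fwd, _ = self.fwd_lstm(w)                    # left-to-right LM states
        h_bwd, _ = self.bwd_lstm(torch.flip(w, dims=[1]))  # LM running backwards
        h_bwd = torch.flip(h_bwd, dims=[1])
        # Each word's representation concatenates both directions
        return torch.cat([h_fwd, h_bwd], dim=-1)       # (1, words, 2 * lstm_dim)
```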
Training ELMo
‣ Data: 1B Word Benchmark (Chelba et al., 2014)
‣ Pre-training time: 2 weeks on 3 NVIDIA GTX 1080 GPUs
‣ Much lower time cost if we used V100s / Google’s TPUs, but still hundreds of dollars in compute cost to train once
‣ Larger BERT models trained on more data (next week) cost $10k+
‣ Pre-training is expensive, but fine-tuning is doable
How to apply ELMo?
[Figure: “they dance at balls” → ELMo embeddings → some neural network → task predictions (sentiment, etc.)]
‣ Take those embeddings and feed them into whatever architecture you want to use for your task
‣ Frozen embeddings (most common): update the weights of your network but keep ELMo’s parameters frozen
‣ Fine-tuning: backpropagate all the way into ELMo when training your model (a sketch of both options follows)
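A minimal sketch of both options, assuming a pretrained module like the `BiLMSketch` above (this is not AllenNLP’s actual ELMo API; `rep_dim` matches that sketch’s 2 × 4096 output):

```python
import torch.nn as nn

class TaskModel(nn.Module):
    def __init__(self, pretrained_elmo, rep_dim=8192, num_classes=5, fine_tune=False):
        super().__init__()
        self.elmo = pretrained_elmo
        # Frozen embeddings (most common): no gradient updates into ELMo;
        # fine_tune=True instead backpropagates all the way into ELMo
        for p in self.elmo.parameters():
            p.requires_grad = fine_tune
        # "Whatever architecture you want": here just a linear classifier
        self.head = nn.Linear(rep_dim, num_classes)

    def forward(self, char_ids):
        reps = self.elmo(char_ids)           # (1, words, rep_dim) contextual embeddings
        return self.head(reps.mean(dim=1))   # average over words, then classify
```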
Results: Frozen ELMo
‣ Massive improvements, beating models handcrafted for each task
Peters et al. (2018)
[Results table annotations: a five-class version of sentiment from A1-A2; QA; a task (sort of) like dependency parsing]
‣ These are mostly text analysis tasks; other pre-training approaches are needed for text generation like translation
Why is language modeling a good objective?
‣ An “impossible” problem, but bigger models seem to do better and better at distributional modeling (no upper limit yet)
‣ Successfully predicting next words requires modeling lots of different effects in text
Probing ELMo
‣ From each layer of the ELMo model, attempt to predict something: POS tags, word senses, etc. (a minimal probe sketch follows below)
‣ Higher accuracy => ELMo is capturing that thing more strongly
Peters et al. (2018)
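A minimal probe sketch, assuming hypothetical tensors `elmo_layer_reps` (per-token activations from one frozen layer, shape (tokens, dim)) and `pos_tags` (gold labels, shape (tokens,)); only the linear probe is trained:

```python
import torch
import torch.nn as nn

def probe_layer(elmo_layer_reps, pos_tags, num_tags=45, epochs=10):
    probe = nn.Linear(elmo_layer_reps.shape[-1], num_tags)  # the only trained part
    opt = torch.optim.Adam(probe.parameters())
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(elmo_layer_reps), pos_tags)
        loss.backward()
        opt.step()
    # Higher accuracy => this layer captures POS more strongly
    return (probe(elmo_layer_reps).argmax(-1) == pos_tags).float().mean()
```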
Analysis
Peters et al. (2018)
Takeaways
‣ Learning a large language model can be an effective way of generating “word embeddings” informed by their context
‣ Pre-training on massive amounts of data can improve performance on tasks like QA
‣ Next class: transformers and BERT