Posted: 21-Jan-2018
Uploaded by: Universitat Politècnica de Catalunya
Attention
DLAI – Marta R. Costa-jussà
Slides adapted from Graham Neubig's lectures
"What advancements excite you most in the field? I am very excited by the recently introduced attention models, due to their simplicity and due to the fact that they work so well. Although these models are new, I have no doubt that they are here to stay, and that they will play a very important role in the future of deep learning."
Ilya Sutskever, Research Director and co-founder of OpenAI
Outline
1. Sequence modeling & sequence-to-sequence models [wrap-up from the previous RNNs session]
2. Attention-based mechanism
3. Attention varieties
4. Attention improvements
5. Applications
6. "Attention is all you need"
7. Summary
Sequence modeling
Model the probability of sequences of words.
From the previous lecture… we model sequences with RNNs:
[Figure: an RNN language model reads <s>, I'm, fine, . and outputs p(I'm), p(fine | I'm), p(. | fine), EOS.]
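The chain-rule decomposition above can be sketched in a few lines. The conditional probabilities below are made-up illustration values, not the output of a trained RNN; the point is only that the sequence probability is the product of per-step conditionals.

```python
import math

# Toy conditional probabilities p(word | previous word); purely illustrative,
# standing in for the distributions an RNN language model would produce.
cond_probs = {
    ("<s>", "I'm"): 0.2,
    ("I'm", "fine"): 0.5,
    ("fine", "."): 0.6,
    (".", "</s>"): 0.9,
}

def sequence_log_prob(words):
    """Sum log p(w_t | w_{t-1}) over the sequence (chain rule in log space)."""
    return sum(math.log(cond_probs[(prev, cur)])
               for prev, cur in zip(words, words[1:]))

prob = math.exp(sequence_log_prob(["<s>", "I'm", "fine", ".", "</s>"]))
# 0.2 * 0.5 * 0.6 * 0.9 = 0.054
```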
Sequence-to-sequence models
[Figure: an encoder reads "how are you ?" into a single thought/context vector; the decoder generates "¿ Cómo estás ?" token by token, from <s> to EOS.]
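The key property of this architecture is that the whole source collapses into one fixed-size vector. Below is a minimal sketch of the loop, with toy arithmetic standing in for trained RNN cells and embeddings (every function here is a hypothetical stand-in, not a real model):

```python
VOCAB = ["¿", "Cómo", "estás", "?", "</s>"]

def encode(source_tokens):
    """Fold the entire source into a single fixed-size context vector."""
    context = [0.0, 0.0]
    for tok in source_tokens:
        h = (sum(map(ord, tok)) % 100) / 100.0   # toy deterministic "embedding"
        context = [0.5 * context[0] + h, 0.5 * context[1] - h]
    return context

def decode(context, max_len=4):
    """Generate target tokens conditioned only on the one context vector."""
    out, state = [], list(context)
    for _ in range(max_len):
        out.append(VOCAB[int(abs(state[0]) * 1000) % len(VOCAB)])  # toy "argmax"
        state = [state[1], 0.9 * state[0] + 0.1]                   # toy update
    return out

translation = decode(encode(["how", "are", "you", "?"]))
```

Because the decoder sees only that single vector, long inputs get squeezed through the same bottleneck regardless of their length.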
Any problem with these models?
2. Attention-based mechanism
Motivation in the case of MT
Attention
[Figure: the decoder attends over all encoder states and combines them.]
Attention allows using multiple vectors, based on the length of the input.
Attention: key ideas
• Encode each word in the input and output sentence into a vector
• When decoding, perform a linear combination of these vectors, weighted by "attention weights"
• Use this combination in picking the next word
Attention computation I
• Use a "query" vector (the decoder state) and "key" vectors (all encoder states)
• For each query-key pair, calculate a weight
• Normalize the weights to add to one using a softmax

Example: raw scores a1 = 2.1, a2 = -0.1, a3 = 0.3, a4 = -1.0; after the softmax, weights a1 = 0.5, a2 = 0.3, a3 = 0.1, a4 = 0.1.
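The scoring-plus-softmax step can be sketched as follows, reusing the slide's raw scores (the slide's weights 0.5/0.3/0.1/0.1 are illustrative, so the exact softmax output differs):

```python
import math

def softmax(scores):
    """Normalize raw scores into attention weights that sum to one."""
    m = max(scores)                          # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Raw query-key scores for one decoder step (the slide's example values).
raw_scores = [2.1, -0.1, 0.3, -1.0]
weights = softmax(raw_scores)                # highest score -> largest weight
```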
Attention computation II
• Combine the value vectors (usually the encoder states, like the key vectors) by taking the weighted sum, using the weights a1 = 0.5, a2 = 0.3, a3 = 0.1, a4 = 0.1 from the previous step.
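The weighted sum can be sketched directly; the value vectors below are hypothetical two-dimensional stand-ins for encoder states:

```python
def attend(weights, values):
    """Weighted sum of the value vectors: the context for this decoder step."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # hypothetical states
weights = [0.5, 0.3, 0.1, 0.1]                             # the slide's weights
context = attend(weights, values)
# context is [0.5*1 + 0.1*1, 0.3*1 + 0.1*1] = [0.6, 0.4]
```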
Attention score functions (q is the query and k is the key)

Multi-layer perceptron (Bahdanau et al., 2015): a(q, k) = tanh(W[q; k]). Flexible, often very good with large data.
Bilinear (Luong et al., 2015): a(q, k) = qᵀWk.
Dot product (Luong et al., 2015): a(q, k) = qᵀk. No parameters! But requires the sizes to be the same.
Scaled dot product (Vaswani et al., 2017): a(q, k) = qᵀk / √|k|. Scale by the size of the vector.
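The four score functions can be sketched in plain Python. The parameter matrix W and vector v are placeholders for learned weights; note that the multi-layer-perceptron score is written here in its common form v·tanh(W[q; k]), which reduces the tanh vector to a scalar.

```python
import math

def dot(q, k):
    return sum(qi * ki for qi, ki in zip(q, k))

def dot_product(q, k):                     # Luong et al., 2015: no parameters
    return dot(q, k)

def scaled_dot_product(q, k):              # Vaswani et al., 2017
    return dot(q, k) / math.sqrt(len(k))

def bilinear(q, W, k):                     # Luong et al., 2015: q^T W k
    return dot(q, [dot(row, k) for row in W])

def mlp_score(q, k, W, v):                 # Bahdanau et al., 2015 (additive)
    hidden = [math.tanh(dot(row, q + k)) for row in W]   # tanh(W[q; k])
    return dot(v, hidden)

q, k = [1.0, 2.0], [0.5, -1.0]
s_dot = dot_product(q, k)                              # 0.5 - 2.0 = -1.5
s_scaled = scaled_dot_product(q, k)                    # -1.5 / sqrt(2)
s_bilinear = bilinear(q, [[1.0, 0.0], [0.0, 1.0]], k)  # identity W -> same as dot
s_mlp = mlp_score(q, k, [[0.1] * 4, [0.2] * 4], [1.0, -1.0])
```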
Attention integration
[Figures showing how the attention context vector is integrated into the decoder.]
3. Attention varieties
Hard attention
Instead of a soft interpolation, make a zero-one decision about where to attend (Xu et al., 2015).
Monotonic attention
This approach "softly" prevents the model from assigning attention probability before where it attended at a previous time step, by taking the attention at the previous time step into account.
Intra-attention / self-attention
Each element in the sentence attends to other elements of the SAME sentence → context-sensitive encodings!
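A minimal self-attention sketch (scaled dot-product scores, no learned projections; the three 2-d state vectors are hypothetical):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def self_attention(states):
    """Each position queries ALL positions of the same sequence, yielding one
    context-sensitive vector per input element."""
    dim = len(states[0])
    outputs = []
    for q in states:                                   # every element is a query
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in states]                     # keys = all positions
        w = softmax(scores)
        outputs.append([sum(wi * s[i] for wi, s in zip(w, states))
                        for i in range(dim)])          # values = all positions
    return outputs

states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # hypothetical encodings
contextual = self_attention(states)
```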
Multiple sources
Attend to multiple sentences (Zoph et al., 2015).
Attend to a sentence and an image (Huang et al., 2016).
Multi-headed attention I
Multiple attention "heads" focus on different parts of the sentence.
a(q, k) = qᵀk / √|k|
Multi-headed attention II
Multiple attention "heads" focus on different parts of the sentence, e.g. multiple independently learned heads (Vaswani et al., 2017).
a(q, k) = qᵀk / √|k|
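One way to see why independently parameterized heads help: give each head its own projection (here a toy diagonal one), and the same query attends to different positions per head. This is a sketch of the idea, not the full Vaswani et al. formulation (which uses learned dense projections and a final output layer).

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def head(query, keys, values, proj):
    """One attention head with a toy diagonal projection `proj`."""
    p = lambda v: [a * b for a, b in zip(v, proj)]
    q = p(query)
    scores = [sum(a * b for a, b in zip(q, p(k))) / math.sqrt(len(q))
              for k in keys]
    w = softmax(scores)
    return [sum(wi * p(v)[i] for wi, v in zip(w, values))
            for i in range(len(values[0]))]

keys = values = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 1.0]
h1 = head(query, keys, values, proj=[2.0, 0.1])  # this head favors position 0
h2 = head(query, keys, values, proj=[0.1, 2.0])  # this head favors position 1
multi_head = h1 + h2   # concatenated head outputs (a linear layer follows in practice)
```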
4. Improvements in attention (in the context of MT)
Coverage
Problem: neural models tend to drop or repeat content. In MT:
1. Over-translation: some words are unnecessarily translated multiple times;
2. Under-translation: some words are mistakenly left untranslated.
SRC: Señor Presidente, abre la sesión.
TRG: Mr President Mr President Mr President.
Solution: model how many times words have been covered, e.g. by maintaining a coverage vector to keep track of the attention history (Tu et al., 2016).
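One simple way to operationalize the coverage idea is to accumulate the attention mass each source word has received and subtract it from the raw scores at later steps. This is a toy linear-penalty sketch, not Tu et al.'s actual parameterization:

```python
def update_coverage(coverage, attention):
    """Accumulate how much attention each source word has received so far."""
    return [c + a for c, a in zip(coverage, attention)]

def penalized_scores(scores, coverage, penalty=1.0):
    """Discourage re-attending to source words that are already covered."""
    return [s - penalty * c for s, c in zip(scores, coverage)]

coverage = [0.0, 0.0, 0.0]
# Step 1: attention falls mostly on source word 0.
coverage = update_coverage(coverage, [0.8, 0.1, 0.1])
# Step 2: the raw scores again favor word 0, but the coverage penalty
# shifts the maximum to the under-covered word 1.
adjusted = penalized_scores([2.0, 1.9, 0.5], coverage)
```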
Incorporating Markov properties
Intuition: attention from the last time step tends to be correlated with attention at this time step.
Approach: add information about the last attention when making the next decision.
Bidirectional training
- Background: it is established that, for latent-variable translation models, the alignments improve if both directional models are combined (Koehn et al., 2005).
- Approach: joint training of two directional models.
Supervised training
Sometimes we can get "gold standard" alignments a priori:
◦ Manual alignments
◦ Pre-trained with a strong alignment model
Train the model to match these strong alignments.
5. Applications
Chatbots
A chatbot is a computer program that conducts a conversation.

Human: what is your job
Enc-dec: i'm a lawyer
Human: what do you do ?
Enc-dec: i'm a doctor .

[Figure: an encoder-decoder with attention reads "what is your job" and generates "I'm a lawyer".]
Natural language inference
Other NLP tasks
Text summarization: the process of shortening a text document with software to create a summary with the major points of the original document.
Question answering: automatically producing an answer to a question given a corresponding document.
Semantic parsing: mapping natural language into a logical form that can be executed on a knowledge base and return an answer.
Syntactic parsing: the process of analysing a string of symbols, either in natural language or in computer languages, conforming to the rules of a formal grammar.
Image captioning I
[Figure: a CNN encoder processes the image; an attention-based decoder generates "a cat on the mat" word by word.]
Image captioning II
Other computer vision tasks with attention
Visual question answering: given an image and a natural language question about the image, the task is to provide an accurate natural language answer.
Video caption generation: attempts to generate a complete and natural sentence, enriching the single label as in video classification, to capture the most informative dynamics in videos.
Speech recognition / translation
6. "Attention is all you need"
Slides based on https://research.googleblog.com/2017/08/transformer-novel-neural-network.html
Motivation
The sequential nature of RNNs makes it difficult to take advantage of modern computing devices such as TPUs (Tensor Processing Units).
Transformer
Example: "I arrived at the bank after crossing the river"
Transformer I
[Figure: the Transformer's encoder and decoder stacks.]
Transformer II
Transformer results
Attention weights
7. Summary
RNNs and attention
RNNs are used to model sequences.
Attention is used to enhance the modeling of long sequences.
The versatility of these models allows them to be applied to a wide range of applications.
Implementations of encoder-decoder: LSTM, CNN
Attention-based mechanisms
Soft vs hard: soft attention weights all pixels; hard attention crops the image and forces attention only on the kept part.
Global vs local: a global approach always attends to all source words; a local one only looks at a subset of source words at a time.
Intra vs external: intra-attention is within the encoder's input sentence; external attention is across sentences.
One large encoder-decoder
• Text, speech, images… is everything converging to a single paradigm?
• If you know how to build a neural MT system, you may easily learn how to build a speech-to-text recognition system...
• Or you may train them together to achieve zero-shot AI.
* And other references in this research direction…
Quiz
1. Mark all statements that are true:
A. Sequence modeling only refers to language applications
B. The attention mechanism can be applied to an encoder-decoder architecture
C. Neural machine translation systems require recurrent neural networks
D. If we want to have a fixed representation (thought vector), we cannot apply attention-based mechanisms

2. Given the query vector q = [], key vector 1 k1 = [] and key vector 2 k2 = []:
A. What are attention weights 1 & 2 when computing the dot product?
B. And when computing the scaled dot product?
C. To which key vector are we giving more attention?
D. What is the advantage of computing the scaled dot product?