Slav Petrov, on behalf of the Language Team @ Google Research
Natural Language Representations and Challenges
Outline
State-of-the-art in Natural Language Understanding in 2017
→ Custom (Recurrent) Architectures
Oct. 2018: One Model with Task-specific Tuning in Minutes
Question Answering (SQuAD 1.1)
Pre-Training in NLP
[Figure: context-free word embeddings, e.g. king → [-0.5, -0.9, 1.4, …] and queen → [-0.6, -0.8, -0.2, …], each scored against a context such as "king wore a crown" / "queen wore a crown" ([0.3, 0.2, -0.8, …]) via an inner product.]
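The idea can be sketched in a few lines (the vectors are illustrative toy values, not real embedding weights): a word and a context are each represented as vectors, and their compatibility is their inner product.

```python
# Toy sketch of scoring word-context pairs with an inner product.
# All vectors are made-up illustrative values, not trained embeddings.

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

king = [-0.5, -0.9, 1.4]
queen = [-0.6, -0.8, -0.2]
wore_a_crown = [0.3, 0.2, -0.8]  # a single vector standing in for the context

# Both words are scored against the same context vector:
print(inner_product(king, wore_a_crown))   # score for "king wore a crown"
print(inner_product(queen, wore_a_crown))  # score for "queen wore a crown"
```

The key property is that every occurrence of a word gets the same vector, regardless of context.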
● Problem: context-free embeddings give "bank" the same vector in "open a bank account" and "on the river bank"
History of Contextual Representations
● Train an LSTM language model, then fine-tune on a classification task
[Figure: left, an LSTM reads <s> open a … and is trained to predict the next token (open, a, bank, …); right, the same LSTM, fine-tuned, reads "very funny movie …" and outputs POSITIVE.]
History of Contextual Representations
● Train separate left-to-right and right-to-left LMs, then apply them as "pre-trained embeddings"
[Figure: a forward LSTM reads <s> open a and predicts open a bank; a backward LSTM reads the same sequence right-to-left; their outputs are fed into an existing model architecture.]
History of Contextual Representations
● Train a deep (12-layer) Transformer LM, then fine-tune on a classification task
[Figure: Transformer layers read <s> open a and predict open a bank; after fine-tuning on classification, the model outputs POSITIVE.]
Unidirectional vs. Bidirectional Models
[Figure: two two-layer models over "<s> open a", predicting "open a bank". Left: unidirectional context, where each position attends only to earlier positions and the representation is built incrementally. Right: bidirectional context, where every position attends to every position, so words can "see themselves".]
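The contrast can be sketched as attention masks over a hypothetical 4-token sequence: a unidirectional model uses a causal (lower-triangular) mask, while a bidirectional model attends everywhere.

```python
# Sketch: attention masks for a sequence of n tokens.
# mask[i][j] == 1 means position i may attend to position j.

n = 4

# Unidirectional (causal): position i sees only positions <= i,
# so the representation is built incrementally, left to right.
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Bidirectional: every position sees every position, which is why
# words can "see themselves" under a plain language-model objective.
bidirectional = [[1] * n for _ in range(n)]

for row in causal:
    print(row)
```

The "see themselves" problem is exactly what motivates the masked-LM objective on the next slide.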
Masked Language Model (Fill-in-the-blank)
● Solution: mask out k% of the input words, then predict the masked words
○ We always use k = 15%
● Too little masking: too expensive to train
● Too much masking: not enough context
the man went to the [MASK] to buy a [MASK] of milk
→ store, gallon
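The masking step can be sketched as follows (a simplified version: the full BERT recipe also sometimes replaces a chosen token with a random word or keeps it unchanged, which this sketch omits):

```python
import random

def mask_tokens(tokens, k=0.15, seed=0):
    """Replace ~k of the tokens with [MASK]; return masked tokens and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < k:
            masked.append("[MASK]")
            targets[i] = tok  # the model is trained to predict these
        else:
            masked.append(tok)
    return masked, targets

sentence = "the man went to the store to buy a gallon of milk".split()
masked, targets = mask_tokens(sentence)
print(masked, targets)
```

Only the masked positions contribute to the loss, which is why too little masking makes training expensive: most positions produce no learning signal.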
Next Sentence Prediction
● To learn relationships between sentences, predict whether Sentence B is the actual sentence that follows Sentence A, or a random sentence
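Constructing NSP training pairs can be sketched like this (a hypothetical helper, not the actual BERT data pipeline): half the time B is the true next sentence, half the time a randomly sampled one.

```python
import random

def make_nsp_examples(documents, seed=0):
    """documents: list of documents, each a list of sentences.
    Returns (sentence_a, sentence_b, label) training pairs."""
    rng = random.Random(seed)
    all_sentences = [s for doc in documents for s in doc]
    examples = []
    for doc in documents:
        for a, actual_next in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                examples.append((a, actual_next, "IsNext"))
            else:
                # (a real pipeline would avoid sampling the true next sentence)
                examples.append((a, rng.choice(all_sentences), "NotNext"))
    return examples

docs = [["the man went to the store.", "he bought a gallon of milk."],
        ["penguins are flightless birds.", "they live mostly in the southern hemisphere."]]
for a, b, label in make_nsp_examples(docs):
    print(label, "|", a, "->", b)
```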
Next Sentence Prediction
● Use 30,000 WordPiece vocabulary on input
● Each token is the sum of three embeddings
● Single sequence is much more efficient
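The "sum of three embeddings" can be sketched with toy lookup tables (random toy values and made-up tokens, not BERT's trained parameters): each input position's vector is the elementwise sum of a token embedding, a segment (sentence A/B) embedding, and a position embedding.

```python
import random

# Toy sketch of BERT's input representation. All tables hold random
# illustrative values, not trained parameters.
dim = 4
rng = random.Random(0)

def new_vec():
    return [rng.uniform(-1, 1) for _ in range(dim)]

token_emb = {tok: new_vec() for tok in ["[CLS]", "my", "dog", "[SEP]", "he", "barks"]}
segment_emb = {"A": new_vec(), "B": new_vec()}
position_emb = [new_vec() for _ in range(8)]

tokens = ["[CLS]", "my", "dog", "[SEP]", "he", "barks", "[SEP]"]
segments = ["A", "A", "A", "A", "B", "B", "B"]

# Elementwise sum of the three embeddings at each position:
inputs = [[t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[i])]
          for i, (tok, seg) in enumerate(zip(tokens, segments))]
print(len(inputs), "positions,", dim, "dims each")
```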
Transformer Architecture
● Multi-headed self-attention
● Feed-forward layers
● Layer norm and residuals
● Positional embeddings
From One-Hot Vectors to Word Embeddings & Self-Attention
[Figure, built up across several slides: the tokens "on … river bank" start as one-hot vectors, are mapped to dense embeddings (rows like [1.4 … 3.7]), projected into query, key, and value vectors, and combined with (self-)attention weights (e.g. 0.1, 0.2, 0.7) into contextual representations.]
The Annotated Transformer, The Illustrated Transformer, The Illustrated BERT
Transformer vs LSTM
● Self-attention == no locality bias
○ Long-distance context has "equal opportunity"
● Single multiplication per layer == efficiency on TPU
○ Effective batch size is number of words, not sequences
[Figure: Transformer vs LSTM. In the Transformer, all token vectors X_i_j across the batch are multiplied by W at once; in the LSTM, each timestep's multiplication by W must wait for the previous timestep.]
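The efficiency point can be sketched as follows (a toy pure-Python "matmul"; on a TPU the flattened pass would be one large matrix multiplication): the Transformer applies W to every token of every sequence at once, while the LSTM must walk each sequence step by step because step t depends on step t-1's state.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

W = [[1.0, 0.0], [0.0, 2.0]]  # toy weight matrix

# Batch of 2 sequences, each 3 token vectors of dim 2:
batch = [[[1, 0], [0, 1], [1, 1]],
         [[2, 0], [0, 2], [2, 2]]]

# Transformer-style: flatten all tokens and apply W in one batched pass.
# The effective batch size is the number of words (6), not sequences (2).
flat = [x for seq in batch for x in seq]
transformer_out = [matvec(W, x) for x in flat]

# LSTM-style: within a sequence, each step must wait for the previous one.
lstm_out = []
for seq in batch:
    h = [0.0, 0.0]
    outs = []
    for x in seq:
        h = [hi + wi for hi, wi in zip(h, matvec(W, x))]  # stand-in recurrence
        outs.append(h)
    lstm_out.append(outs)
print(len(flat), "token vectors processed in one pass")
```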
Basic BERT Recipe
GLUE Benchmark
MultiNLI
Premise: Hills and mountains are especially sanctified in Jainism.
Hypothesis: Jainism hates nature.
Label: Contradiction

CoLA
Sentence: The wagon rumbled down the road. Label: Acceptable
Sentence: The car honked down the road. Label: Unacceptable
SWAG: Zellers, Bisk, Schwartz, Choi, EMNLP 2018
SQuAD: Rajpurkar, Zhang, Lopyrev, Liang, EMNLP 2016
Results: Commonsense Reasoning and Question Answering
Ablation Experiments
● Masked LM (compared to left-to-right LM) is very important on some tasks; Next Sentence Prediction is important on other tasks
● Left-to-right model does very poorly on word-level tasks (SQuAD), although this is mitigated by adding a BiLSTM
● More data (and training longer) helps ⇒ performance has not yet asymptoted
● Bigger model helps a lot
More is Better
Try It Out, Get Faster Training with TPUs
Do I Need Full BERT Models for All My Tasks?
Houlsby, Giurgiu, Jastrzebski, Morrone, de Laroussilhe, Gesmundo, Attariyan, Gelly, arXiv Feb 2019
Recently: Even Better Pretraining
BERT vs. XLNet
                   BERT                        XLNet
Objective          Masked LM + Next Sentence   Autoregressive LM
Masking            Random 15%                  Last ⅙ in permutation order
Position Encoding  Absolute                    Relative
Data               13 GB of text               126 GB of text
Other BERT-inspired work
What does BERT learn - Tenney et al., ACL 2019
Relation learning - Baldini-Soares et al., ACL 2019
Passage representations - Lee et al., ACL 2019
What does BERT know about language?
Tenney et al., ICLR 2019, ACL 2019
● What linguistic relationships?
● Where in the model are they computed?
Classifier-based probing: project model activations into space of linguistic annotations (graph edges)
BERT Rediscovers the Classical NLP Pipeline
Probing Representations
High weights for POS in lower layers, then constituents, dependencies, and SRL, followed by entities and coreference as we move up the stack!
BERT improves coref over ELMo (84->91, or 39% relative)
We can trace hypotheses on individual sentences!
“he smoked toronto in the playoffs with six hits, seven walks …”
Representing Entities and Relations
[Figure: sentence pairs with entity mentions, optionally replaced by BLANKs, are encoded with BERT; representations of sentences expressing the same relation should match (≈), while others should not (≉).]
Baldini-Soares et al., ACL 2019
Matching The Blanks Results
few-shot relation extraction
Baldini-Soares et al., ACL 2019
Open Retrieval Question Answering
[Figure: input questions such as "What does the zip in zip code stand for?" (output: Zone Improvement Plan) and "How many districts are in Alabama?" (output: 7) are answered via latent retrieval over Wikipedia.]
Goal: Learn to efficiently read Wikipedia without any retrieval data.
Motivation: Best known recipe for latent retrieval is TF-IDF filtering + brute force. We can do better.
Key Insight: Pre-train an unsupervised ScaM neural retriever. This enables efficient end-to-end fine-tuning with standard latent variable learning methods.
Lee et al., ACL 2019
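The retrieval step can be sketched as scoring each evidence passage by the inner product of query and passage vectors, then ranking. Here a toy bag-of-words "encoder" stands in for the learned neural retriever, and the passages are illustrative:

```python
from collections import Counter

def encode(text):
    """Toy stand-in for a learned dense encoder: bag-of-words counts."""
    return Counter(text.lower().split())

def score(query_vec, doc_vec):
    # Inner product of sparse vectors
    return sum(query_vec[w] * doc_vec[w] for w in query_vec)

passages = [
    "the zip in zip code stands for zone improvement plan",
    "alabama is divided into 67 counties and 7 congressional districts",
]
query = "what does the zip in zip code stand for"

q = encode(query)
ranked = sorted(passages, key=lambda p: score(q, encode(p)), reverse=True)
print(ranked[0])
```

In the real system both encoders are learned, so fine-tuning can move the retrieval scores end-to-end rather than relying on fixed word overlap.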
Passage: Zebras have four gaits: walk, trot, canter and gallop. They are generally slower than horses, but their great stamina helps them outrun predators. When chased, a zebra will zig-zag from side to side, making it more difficult for the predator to attack...

Pseudo evidence: Zebras have four gaits: walk, trot, canter and gallop. When chased, a zebra will zig-zag from side to side, making it more difficult for the predator to attack...

Pseudo query: They are generally slower than horses, but their great stamina helps them outrun predators.

In-batch negative example: Poe capitalized on the success of "The Raven" by following it up with his essay "The Philosophy of Composition" (1846), in which he detailed the poem's creation….

In-batch negative example: Gagarin was further selected for an elite training group known as the Sochi Six, from which the first cosmonauts of the Vostok programme would be chosen...
Open Retrieval Question Answering
Inverse Cloze Task (ICT): Given a sentence (pseudo-query), predict the context (pseudo-evidence)
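ICT example construction can be sketched with a hypothetical helper: remove one sentence from a passage to serve as the pseudo-query; the remaining sentences are the pseudo-evidence.

```python
import random

def make_ict_example(sentences, seed=0):
    """Inverse Cloze Task: one sentence becomes the pseudo-query,
    the rest of the passage becomes the pseudo-evidence."""
    rng = random.Random(seed)
    i = rng.randrange(len(sentences))
    pseudo_query = sentences[i]
    pseudo_evidence = " ".join(sentences[:i] + sentences[i + 1:])
    return pseudo_query, pseudo_evidence

passage = [
    "Zebras have four gaits: walk, trot, canter and gallop.",
    "They are generally slower than horses, but their great stamina helps them outrun predators.",
    "When chased, a zebra will zig-zag from side to side.",
]
query, evidence = make_ict_example(passage)
print("Pseudo query:", query)
print("Pseudo evidence:", evidence)
```

Passages from other examples in the same batch serve as the negatives, so no retrieval supervision is needed.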
Progress towards one of the hardest binary text classification tasks today: answering yes/no questions such as "Can you be charged for the same crime in two different states?" against Wikipedia.
Realistic Challenge Sets
Natural Questions - Kwiatkowski et al., TACL 2019
Yes/No Questions - Clark et al., NAACL 2019
Identifying Commands - Elkahky et al., EMNLP 2018
Natural Questions Motivation
Kwiatkowski et al. TACL 2019
Question: The success of the Britain Can Make It exhibition led to the planning of what exhibition in 1951?
Evidence: ... The success of this exhibition led to the planning of the Festival of Britain (1951). By 1948 most of the collections had been returned to the museum.
Answer: Festival of Britain
Question: Can you make and receive calls in airplane mode?
Evidence: Airplane mode, …. suspends radio-frequency signal transmission by the device, thereby disabling Bluetooth, telephony, and Wi-Fi. GPS may or may not be disabled, because it does not involve transmitting radio waves.
Answer: No
Goal: Provide academia with the first question answering dataset that represents a real question answering problem.
Previous question answering datasets are contrived. E.g. SQuAD's questions often paraphrase evidence text.
Answering real user queries requires much deeper language understanding and world knowledge.
Many questions have multiple acceptable answers: "last hurricane in Massachusetts" has a formal meaning (eye of the storm in MA) and a different colloquial meaning (hurricane-force winds in MA). NQ embraces this acceptable variability. Solutions should model the full distribution of possible answers.
[Figure: example comparison of the SQuAD dataset and Natural Questions.]
Challenge: Many Correct Answers
NQ annotators are encouraged to pick the first good answer.
In practice we sometimes get many different answer locations for the same question.
Question: name the substance used to make the filament of bulb
Defining Correctness
Wrong annotations are often the result of annotators trying to find an answer when the evidence isn't sufficient.
When all annotators agree that there is enough evidence available to answer a question, the annotations are overwhelmingly correct.
[Chart, for long answers and short answers: x-axis, proportion of annotations that are non-null for a question; y-axis, expectation that a non-null annotation's question is in this bucket; broken down by annotation type: Correct (C), Correct but debatable (C_d), or Wrong (W).]
Natural Questions - Status
● First ever release of Google queries.
● 300k training items, 16k for evaluation. Upper bound of 87% on long answers, 76% on short.
● Leaderboard seeing good activity; the task is quite a bit harder than SQuAD.
Williams, Nangia, Bowman, 2018
Prior Approaches to Testing Inference/Reasoning Abilities
Clark, Lee, Chang, Kwiatkowski, Collins, Toutanova, NAACL 2019
BoolQ: Naturally Occurring Yes-No Questions
Real Problems that Naturally Require Inference to Solve
[Figure: a question and a passage are mapped to an answer.]
Collecting Passages
BoolQ
Pipeline from Natural Questions (Kwiatkowski et al., 2019): Document Selection → Paragraph Selection → Answer Selection
[Figure: "Are there blue whales in the Atlantic Ocean?" → Yes / No]
Test Set Results
Sample Efficiency: MultiNLI > BERT for Small Data
Noun-Verb Ambiguity
“lives” / Noun → /laɪvz/
“lives” / Verb → /lɪvz/
flies → NOUN
Mark → VERB
Elkahky, Webster, Andor, Pitler, EMNLP 2018
Certain insects can damage plumerias, such as mites, flies, or aphids. (NOUN)
Mark which area you want to distress. (VERB)
“A Challenge Set and Methods for Noun-Verb Ambiguity”, EMNLP 2018
Accuracy on Noun-Verb Disambiguation
Pronunciation of Homographs Accuracy
Webster, Recasens, Axelrod, Baldridge, TACL 2019
Kwiatkowski, Palomaki, Redfield, Collins, Parikh, Alberti, Epstein, Polosukhin, Kelcey, Devlin, Lee, Toutanova, Jones, Chang, Dai, Uszkoreit, Le, Petrov, TACL 2019
Released Datasets with “In-the-Wild” Natural Challenges
[Figure: examples from the released datasets: "Mark" tagged VERB; "Are there blue whales in the Atlantic Ocean?" → YES]
Summary