Slav Petrov, on behalf of the Language Team @ Google Research
Natural Language Representations and Challenges
Outline
State-of-the-art in Natural Language Understanding in 2017
→ Custom (Recurrent) Architectures
Oct. 2018: One Model with Task-specific Tuning in Minutes
Question Answering (SQuAD 1.1)
Pre-Training in NLP
[Figure: context-free word embeddings, e.g. king → [-0.5, -0.9, 1.4, …] and queen → [-0.6, -0.8, -0.2, …], each scored against a context such as "king wore a crown" / "queen wore a crown" ([0.3, 0.2, -0.8, …]) via an inner product.]
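The idea can be sketched in a few lines (the vectors are illustrative toy values, not real embedding weights): a word and a context are each represented as vectors, and their compatibility is their inner product.

```python
# Toy sketch of scoring word-context pairs with an inner product.
# All vectors are made-up illustrative values, not trained embeddings.

def inner_product(u, v):
    return sum(a * b for a, b in zip(u, v))

king = [-0.5, -0.9, 1.4]
queen = [-0.6, -0.8, -0.2]
wore_a_crown = [0.3, 0.2, -0.8]  # a single vector standing in for the context

# Both words are scored against the same context vector:
print(inner_product(king, wore_a_crown))   # score for "king wore a crown"
print(inner_product(queen, wore_a_crown))  # score for "queen wore a crown"
```

The key property is that every occurrence of a word gets the same vector, regardless of context.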
● Problem: context-free embeddings give "bank" the same vector in "open a bank account" and "on the river bank"
History of Contextual Representations
● Train an LSTM language model, then fine-tune on a classification task
[Figure: left, an LSTM reads <s> open a … and is trained to predict the next token (open, a, bank, …); right, the same LSTM, fine-tuned, reads "very funny movie …" and outputs POSITIVE.]
History of Contextual Representations
● Train separate left-to-right and right-to-left LMs, then apply them as "pre-trained embeddings"
[Figure: a forward LSTM reads <s> open a and predicts open a bank; a backward LSTM reads the same sequence right-to-left; their outputs are fed into an existing model architecture.]
History of Contextual Representations
● Train a deep (12-layer) Transformer LM, then fine-tune on a classification task
[Figure: Transformer layers read <s> open a and predict open a bank; after fine-tuning on classification, the model outputs POSITIVE.]
Unidirectional vs. Bidirectional Models
[Figure: two two-layer models over "<s> open a", predicting "open a bank". Left: unidirectional context, where each position attends only to earlier positions and the representation is built incrementally. Right: bidirectional context, where every position attends to every position, so words can "see themselves".]
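The contrast can be sketched as attention masks over a hypothetical 4-token sequence: a unidirectional model uses a causal (lower-triangular) mask, while a bidirectional model attends everywhere.

```python
# Sketch: attention masks for a sequence of n tokens.
# mask[i][j] == 1 means position i may attend to position j.

n = 4

# Unidirectional (causal): position i sees only positions <= i,
# so the representation is built incrementally, left to right.
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Bidirectional: every position sees every position, which is why
# words can "see themselves" under a plain language-model objective.
bidirectional = [[1] * n for _ in range(n)]

for row in causal:
    print(row)
```

The "see themselves" problem is exactly what motivates the masked-LM objective on the next slide.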
Masked Language Model (Fill-in-the-blank)
● Solution: mask out k% of the input words, then predict the masked words
○ We always use k = 15%
● Too little masking: too expensive to train
● Too much masking: not enough context
the man went to the [MASK] to buy a [MASK] of milk
→ store, gallon
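The masking step can be sketched as follows (a simplified version: the full BERT recipe also sometimes replaces a chosen token with a random word or keeps it unchanged, which this sketch omits):

```python
import random

def mask_tokens(tokens, k=0.15, seed=0):
    """Replace ~k of the tokens with [MASK]; return masked tokens and targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < k:
            masked.append("[MASK]")
            targets[i] = tok  # the model is trained to predict these
        else:
            masked.append(tok)
    return masked, targets

sentence = "the man went to the store to buy a gallon of milk".split()
masked, targets = mask_tokens(sentence)
print(masked, targets)
```

Only the masked positions contribute to the loss, which is why too little masking makes training expensive: most positions produce no learning signal.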
Next Sentence Prediction
● To learn relationships between sentences, predict whether Sentence B is the actual sentence that follows Sentence A, or a random sentence
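Constructing NSP training pairs can be sketched like this (a hypothetical helper, not the actual BERT data pipeline): half the time B is the true next sentence, half the time a randomly sampled one.

```python
import random

def make_nsp_examples(documents, seed=0):
    """documents: list of documents, each a list of sentences.
    Returns (sentence_a, sentence_b, label) training pairs."""
    rng = random.Random(seed)
    all_sentences = [s for doc in documents for s in doc]
    examples = []
    for doc in documents:
        for a, actual_next in zip(doc, doc[1:]):
            if rng.random() < 0.5:
                examples.append((a, actual_next, "IsNext"))
            else:
                # (a real pipeline would avoid sampling the true next sentence)
                examples.append((a, rng.choice(all_sentences), "NotNext"))
    return examples

docs = [["the man went to the store.", "he bought a gallon of milk."],
        ["penguins are flightless birds.", "they live mostly in the southern hemisphere."]]
for a, b, label in make_nsp_examples(docs):
    print(label, "|", a, "->", b)
```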
Next Sentence Prediction
● Use 30,000 WordPiece vocabulary on input
● Each token is the sum of three embeddings
● Single sequence is much more efficient
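The "sum of three embeddings" can be sketched with toy lookup tables (random toy values and made-up tokens, not BERT's trained parameters): each input position's vector is the elementwise sum of a token embedding, a segment (sentence A/B) embedding, and a position embedding.

```python
import random

# Toy sketch of BERT's input representation. All tables hold random
# illustrative values, not trained parameters.
dim = 4
rng = random.Random(0)

def new_vec():
    return [rng.uniform(-1, 1) for _ in range(dim)]

token_emb = {tok: new_vec() for tok in ["[CLS]", "my", "dog", "[SEP]", "he", "barks"]}
segment_emb = {"A": new_vec(), "B": new_vec()}
position_emb = [new_vec() for _ in range(8)]

tokens = ["[CLS]", "my", "dog", "[SEP]", "he", "barks", "[SEP]"]
segments = ["A", "A", "A", "A", "B", "B", "B"]

# Elementwise sum of the three embeddings at each position:
inputs = [[t + s + p for t, s, p in zip(token_emb[tok], segment_emb[seg], position_emb[i])]
          for i, (tok, seg) in enumerate(zip(tokens, segments))]
print(len(inputs), "positions,", dim, "dims each")
```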
Transformer Architecture
● Multi-headed self-attention
● Feed-forward layers
● Layer norm and residuals
● Positional embeddings
From One-Hot Vectors to Word Embeddings & Self-Attention
[Figure, built up across several slides: the tokens "on … river bank" start as one-hot vectors, are mapped to dense embeddings (rows like [1.4 … 3.7]), projected into query, key, and value vectors, and combined with (self-)attention weights (e.g. 0.1, 0.2, 0.7) into contextual representations.]
The Annotated Transformer, The Illustrated Transformer, The Illustrated BERT
Transformer vs LSTM
● Self-attention == no locality bias
○ Long-distance context has "equal opportunity"
● Single multiplication per layer == efficiency on TPU
○ Effective batch size is number of words, not sequences
[Figure: Transformer vs LSTM. In the Transformer, all token vectors X_i_j across the batch are multiplied by W at once; in the LSTM, each timestep's multiplication by W must wait for the previous timestep.]
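The efficiency point can be sketched as follows (a toy pure-Python "matmul"; on a TPU the flattened pass would be one large matrix multiplication): the Transformer applies W to every token of every sequence at once, while the LSTM must walk each sequence step by step because step t depends on step t-1's state.

```python
def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

W = [[1.0, 0.0], [0.0, 2.0]]  # toy weight matrix

# Batch of 2 sequences, each 3 token vectors of dim 2:
batch = [[[1, 0], [0, 1], [1, 1]],
         [[2, 0], [0, 2], [2, 2]]]

# Transformer-style: flatten all tokens and apply W in one batched pass.
# The effective batch size is the number of words (6), not sequences (2).
flat = [x for seq in batch for x in seq]
transformer_out = [matvec(W, x) for x in flat]

# LSTM-style: within a sequence, each step must wait for the previous one.
lstm_out = []
for seq in batch:
    h = [0.0, 0.0]
    outs = []
    for x in seq:
        h = [hi + wi for hi, wi in zip(h, matvec(W, x))]  # stand-in recurrence
        outs.append(h)
    lstm_out.append(outs)
print(len(flat), "token vectors processed in one pass")
```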
Basic BERT Recipe
GLUE Benchmark
MultiNLI
Premise: Hills and mountains are especially sanctified in Jainism.
Hypothesis: Jainism hates nature.
Label: Contradiction

CoLA
Sentence: The wagon rumbled down the road. Label: Acceptable
Sentence: The car honked down the road. Label: Unacceptable
SWAG: Zellers, Bisk, Schwartz, Choi, EMNLP 2018
SQuAD: Rajpurkar, Zhang, Lopyrev, Liang, EMNLP 2016
Results: Commonsense Reasoning and Question Answering
Ablation Experiments
● Masked LM (compared to left-to-right LM) is very important on some tasks; Next Sentence Prediction is important on other tasks
● Left-to-right model does very poorly on word-level tasks (SQuAD), although this is mitigated by adding a BiLSTM
● More data (and training longer) helps ⇒ performance has not yet asymptoted
● Bigger model helps a lot
More is Better
Try It Out, Get Faster Training with TPUs
Do I Need Full BERT Models for All My Tasks?
Houlsby, Giurgiu, Jastrzebski, Morrone, de Laroussilhe, Gesmundo, Attariyan, Gelly, arXiv Feb 2019
Recently: Even Better Pretraining
BERT vs. XLNet
                   BERT                        XLNet
Objective          Masked LM + Next Sentence   Autoregressive LM
Masking            Random 15%                  Last ⅙ in permutation order
Position Encoding  Absolute                    Relative
Data               13 GB of text               126 GB of text
Other BERT-inspired work
What does BERT learn - Tenney et al., ACL 2019
Relation learning - Baldini-Soares et al., ACL 2019
Passage representations - Lee et al., ACL 2019
What does BERT know about language?
Tenney et al., ICLR 2019, ACL 2019
● What linguistic relationships?
● Where in the model are they computed?
Classifier-based probing: project model activations into space of linguistic annotations (graph edges)
BERT Rediscovers the Classical NLP Pipeline
Probing Representations
High weights for POS in lower layers, then constituents, dependencies, and SRL, followed by entities and coreference as we move up the stack!
BERT improves coref over ELMo (84->91, or 39% relative)
We can trace hypotheses on individual sentences!
“he smoked toronto in the playoffs with six hits, seven walks …”
Representing Entities and Relations
[Figure: sentence pairs with entity mentions, optionally replaced by BLANKs, are encoded with BERT; representations of sentences expressing the same relation should match (≈), while others should not (≉).]
Baldini-Soares et al., ACL 2019
Matching The Blanks Results
few-shot relation extraction
Baldini-Soares et al., ACL 2019
Open Retrieval Question Answering
[Figure: input questions such as "What does the zip in zip code stand for?" (output: Zone Improvement Plan) and "How many districts are in Alabama?" (output: 7) are answered via latent retrieval over Wikipedia.]
Goal: Learn to efficiently read Wikipedia without any retrieval data.
Motivation: Best known recipe for latent retrieval is TF-IDF filtering + brute force. We can do better.
Key Insight: Pre-train an unsupervised ScaM neural retriever. This enables efficient end-to-end fine-tuning with standard latent variable learning methods.
Lee et al., ACL 2019
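The retrieval step can be sketched as scoring each evidence passage by the inner product of query and passage vectors, then ranking. Here a toy bag-of-words "encoder" stands in for the learned neural retriever, and the passages are illustrative:

```python
from collections import Counter

def encode(text):
    """Toy stand-in for a learned dense encoder: bag-of-words counts."""
    return Counter(text.lower().split())

def score(query_vec, doc_vec):
    # Inner product of sparse vectors
    return sum(query_vec[w] * doc_vec[w] for w in query_vec)

passages = [
    "the zip in zip code stands for zone improvement plan",
    "alabama is divided into 67 counties and 7 congressional districts",
]
query = "what does the zip in zip code stand for"

q = encode(query)
ranked = sorted(passages, key=lambda p: score(q, encode(p)), reverse=True)
print(ranked[0])
```

In the real system both encoders are learned, so fine-tuning can move the retrieval scores end-to-end rather than relying on fixed word overlap.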
Passage: Zebras have four gaits: walk, trot, canter and gallop. They are generally slower than horses, but their great stamina helps them outrun predators. When chased, a zebra will zig-zag from side to side, making it more difficult for the predator to attack...

Pseudo evidence: Zebras have four gaits: walk, trot, canter and gallop. When chased, a zebra will zig-zag from side to side, making it more difficult for the predator to attack...

Pseudo query: They are generally slower than horses, but their great stamina helps them outrun predators.

In-batch negative example: Poe capitalized on the success of "The Raven" by following it up with his essay "The Philosophy of Composition" (1846), in which he detailed the poem's creation….

In-batch negative example: Gagarin was further selected for an elite training group known as the Sochi Six, from which the first cosmonauts of the Vostok programme would be chosen...
Open Retrieval Question Answering
Inverse Cloze Task (ICT): Given a sentence (pseudo-query), predict the context (pseudo-evidence)
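ICT example construction can be sketched with a hypothetical helper: remove one sentence from a passage to serve as the pseudo-query; the remaining sentences are the pseudo-evidence.

```python
import random

def make_ict_example(sentences, seed=0):
    """Inverse Cloze Task: one sentence becomes the pseudo-query,
    the rest of the passage becomes the pseudo-evidence."""
    rng = random.Random(seed)
    i = rng.randrange(len(sentences))
    pseudo_query = sentences[i]
    pseudo_evidence = " ".join(sentences[:i] + sentences[i + 1:])
    return pseudo_query, pseudo_evidence

passage = [
    "Zebras have four gaits: walk, trot, canter and gallop.",
    "They are generally slower than horses, but their great stamina helps them outrun predators.",
    "When chased, a zebra will zig-zag from side to side.",
]
query, evidence = make_ict_example(passage)
print("Pseudo query:", query)
print("Pseudo evidence:", evidence)
```

Passages from other examples in the same batch serve as the negatives, so no retrieval supervision is needed.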
Progress towards one of the hardest binary text classification tasks today: answering yes/no questions such as "Can you be charged for the same crime in two different states?" against Wikipedia.
Realistic Challenge Sets
Natural Questions - Kwiatkowski et al., TACL 2019
Yes/No Questions - Clark et al., NAACL 2019
Identifying Commands - Elkahky et al., EMNLP 2018
Natural Questions Motivation
Kwiatkowski et al. TACL 2019
Question: The success of the Britain Can Make It exhibition led to the planning of what exhibition in 1951?
Evidence: ... The success of this exhibition led to the planning of the Festival of Britain (1951). By 1948 most of the collections had been returned to the museum.
Answer: Festival of Britain
Question: Can you make and receive calls in airplane mode?
Evidence: Airplane mode, …. suspends radio-frequency signal transmission by the device, thereby disabling Bluetooth, telephony, and Wi-Fi. GPS may or may not be disabled, because it does not involve transmitting radio waves.
Answer: No
Goal: Provide academia with the first question answering dataset that represents a real question answering problem.
Previous question answering datasets are contrived. E.g. SQuAD's questions often paraphrase evidence text.
Answering real user queries requires much deeper language understanding and world knowledge.
Many questions have multiple acceptable answers: "last hurricane in Massachusetts" has a formal meaning (eye of the storm in MA) and a different colloquial meaning (hurricane-force winds in MA). NQ embraces this acceptable variability. Solutions should model the full distribution of possible answers.
[Figure: example comparison of the SQuAD dataset and Natural Questions.]
Challenge: Many Correct Answers
NQ annotators are encouraged to pick the first good answer.
In practice we sometimes get many different answer locations for the same question.
Question: name the substance used to make the filament of bulb
Defining Correctness
Wrong annotations are often the result of annotators trying to find an answer when the evidence isn't sufficient.
When all annotators agree that there is enough evidence available to answer a question, the annotations are overwhelmingly correct.
[Chart, for long answers and short answers: x-axis, proportion of annotations that are non-null for a question; y-axis, expectation that a non-null annotation's question is in this bucket; broken down by annotation type: Correct (C), Correct but debatable (C_d), or Wrong (W).]
Natural Questions - Status
● First ever release of Google queries.
● 300k training items, 16k for evaluation. Upper bound of 87% on long answers, 76% on short.
● Leaderboard seeing good activity; the task is quite a bit harder than SQuAD.
Williams, Nangia, Bowman, 2018
Prior Approaches to Testing Inference/Reasoning Abilities
Clark, Lee, Chang, Kwiatkowski, Collins, Toutanova, NAACL 2019
BoolQ: Naturally Occurring Yes-No Questions
Real Problems that Naturally Require Inference to Solve
[Figure: a question and a passage are mapped to an answer.]
Collecting Passages
BoolQ
Pipeline from Natural Questions (Kwiatkowski et al., 2019): Document Selection → Paragraph Selection → Answer Selection
[Figure: "Are there blue whales in the Atlantic Ocean?" → Yes / No]
Test Set Results
Sample Efficiency: MultiNLI > BERT for Small Data
Noun-Verb Ambiguity
“lives” / Noun → /laɪvz/
“lives” / Verb → /lɪvz/
flies → NOUN
Mark → VERB
Elkahky, Webster, Andor, Pitler, EMNLP 2018
Certain insects can damage plumerias, such as mites, flies, or aphids. (NOUN)
Mark which area you want to distress. (VERB)
“A Challenge Set and Methods for Noun-Verb Ambiguity”, EMNLP 2018
Accuracy on Noun-Verb Disambiguation
Pronunciation of Homographs Accuracy
Webster, Recasens, Axelrod, Baldridge, TACL 2019
Kwiatkowski, Palomaki, Redfield, Collins, Parikh, Alberti, Epstein, Polosukhin, Kelcey, Devlin, Lee, Toutanova, Jones, Chang, Dai, Uszkoreit, Le, Petrov, TACL 2019
Released Datasets with “In-the-Wild” Natural Challenges
[Figure: examples from the released datasets: "Mark" tagged VERB; "Are there blue whales in the Atlantic Ocean?" → YES]
Summary