VL-BERT: Pre-training of Generic Visual-Linguistic Representations
Weijie Su*, Xizhou Zhu*, Yue Cao, Bin Li, Lewei Lu, Furu Wei, Jifeng Dai
University of Science and Technology of China; Microsoft Research Asia
An Example of Visual-Linguistic Tasks
Question: Why is [person1] pointing a gun at [person2]?
Answer: [person1] and [person3] are robbing the bank, and [person2] is the bank manager.
Previous Paradigm
[Figure: the previous paradigm. An ImageEncoder encodes the image into an ImageFeature, a TextEncoder encodes the question "Why is [person1] pointing a gun at [person2]?" into a TextFeature, and a task-specific Combiner fuses the two into a Prediction.]
Problem (I): High Design Cost
[Figure: the same pipeline, with the Combiner highlighted: it is different for each task and hard to design.]
Problem (II): Overfitting
[Figure: the same pipeline. The ImageEncoder and TextEncoder are pre-trained separately, without joint pre-training, and the Combiner is trained from scratch, leaving the full model prone to overfitting.]
Inspiration
• The Transformer is a unified and powerful architecture in NLP
• MLM-based pre-training in BERT enhances its capability
• It can aggregate and align the features embedded in words
Solution
[Figure: the task-specific pipeline is replaced by a single model, VL-BERT: a task-agnostic architecture plus visual-linguistic pre-training.]
Comparison between our VL-BERT and other concurrent works on pre-training generic visual-linguistic representations.
A lot of concurrent work appeared in just three weeks!
Model Architecture
[Figure: VL-BERT architecture. The input sequence is [CLS] kitten drink from [MASK] [SEP] [IMG] [IMG] [END], grouped as Caption | Image Regions | Image. Each element is the sum of four embeddings: Token Embedding, Visual Feature Embedding, Segment Embedding (A A A A A A C C C), and Sequence Position Embedding (1 2 3 4 5 6 7 7 8). Visual features come from Fast(er) R-CNN over Regions of Interest: an Appearance Feature and a Geometry Embedding are fused by a Fully Connected layer, and a masked region is zeroed out in the image. On top, Visual-Linguistic BERT is trained with Masked Language Modeling with Visual Clues (predicting "bottle") and Masked RoI Classification with Linguistic Clues (predicting [Cat]).]
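To make the four-way sum concrete, here is a minimal PyTorch sketch of the embedding layer (a paraphrase of the figure, not the released code; all dimensions and module names are illustrative):

```python
import torch.nn as nn

class VLBERTEmbeddings(nn.Module):
    """Each input element is the sum of four embeddings, as in the figure."""
    def __init__(self, vocab_size=30522, hidden=768, max_pos=512,
                 num_segments=3, visual_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden)
        self.segment_emb = nn.Embedding(num_segments, hidden)  # segments A / B / C
        self.position_emb = nn.Embedding(max_pos, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)       # visual feature embedding

    def forward(self, token_ids, segment_ids, position_ids, visual_feats):
        # token_ids / segment_ids / position_ids: (batch, seq_len)
        # visual_feats: (batch, seq_len, visual_dim) -- the RoI feature for
        # [IMG] elements, the whole-image feature for text elements
        return (self.token_emb(token_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(position_ids)
                + self.visual_proj(visual_feats))
```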
Modification (I): Add Image Regions to the Input Sequence
[Figure: the same architecture, highlighting the image-region elements ([IMG] ... [IMG] [END]) appended after the caption in the input sequence.]
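A minimal sketch of how such a sequence could be assembled (token names and positions follow the figure; the helper itself is hypothetical):

```python
def build_input_sequence(caption_tokens, num_rois):
    """Append one [IMG] element per RoI after the caption, as in the figure."""
    tokens = ["[CLS]"] + caption_tokens + ["[SEP]"] + ["[IMG]"] * num_rois + ["[END]"]
    # Segment A for the text part, C for the visual part.
    segments = ["A"] * (len(caption_tokens) + 2) + ["C"] * (num_rois + 1)
    # Text gets increasing positions; all image regions share one position,
    # since RoIs have no natural order; [END] gets the next position.
    n_text = len(caption_tokens) + 2
    positions = list(range(1, n_text + 1)) + [n_text + 1] * num_rois + [n_text + 2]
    return tokens, segments, positions

# e.g. build_input_sequence(["kitten", "drink", "from", "[MASK]"], 2)
# -> tokens    ['[CLS]','kitten','drink','from','[MASK]','[SEP]','[IMG]','[IMG]','[END]']
#    positions [1, 2, 3, 4, 5, 6, 7, 7, 8]
```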
Modification (II): Add Visual Feature Embedding
[Figure: the same architecture, highlighting the Visual Feature Embedding added to every input element.]
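A sketch of the visual feature embedding for one element, following the figure: an Appearance Feature from Fast(er) R-CNN plus a Geometry Embedding of the box, fused by a Fully Connected layer. The dimensions are assumptions, and the plain linear projection of the box coordinates stands in for whatever geometry encoding the full model uses:

```python
import torch
import torch.nn as nn

class VisualFeatureEmbedding(nn.Module):
    """Appearance feature + geometry embedding -> fully connected (see figure)."""
    def __init__(self, appearance_dim=2048, geometry_dim=2048, hidden=768):
        super().__init__()
        # Stand-in geometry encoder: embed (x1/W, y1/H, x2/W, y2/H) linearly.
        self.geometry_proj = nn.Linear(4, geometry_dim)
        self.fc = nn.Linear(appearance_dim + geometry_dim, hidden)

    def forward(self, appearance_feat, box):
        # appearance_feat: (batch, appearance_dim) RoI feature from Fast(er) R-CNN
        # box: (batch, 4) corner coordinates normalized by image width/height
        geometry = self.geometry_proj(box)
        return self.fc(torch.cat([appearance_feat, geometry], dim=-1))
```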
Pre-training Datasets
• Visual-Linguistic Corpus: Conceptual Captions
• Harvested from the Internet
• ~3M image-text pairs
• Text-only Corpus: English Wikipedia & BooksCorpus
• Improves generalization to long and complex sentences
Pre-training Task #1: Masked Language Modeling with Visual Clues
[Figure: VL-BERT receives the caption "kitten drink from [MASK]" together with the image regions and predicts the masked word "bottle".]
P.S. For samples from the text-only corpus, this task degenerates to the original MLM in BERT.
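A minimal sketch of the masking and loss for this task: standard BERT-style MLM, except the visual elements stay visible as clues and only text positions are maskable. The model interface, `text_mask`, and `mask_token_id` are assumptions:

```python
import torch
import torch.nn.functional as F

def mlm_with_visual_clues(model, token_ids, visual_feats, text_mask,
                          mask_token_id, p=0.15):
    """Mask random caption tokens; predict them from the rest plus visual clues."""
    rand = torch.rand(token_ids.shape, device=token_ids.device)
    masked = (rand < p) & text_mask                  # only text positions are maskable
    labels = token_ids.masked_fill(~masked, -100)    # loss only at masked positions
    inputs = token_ids.masked_fill(masked, mask_token_id)
    logits = model(inputs, visual_feats)             # (batch, seq_len, vocab_size)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten(),
                           ignore_index=-100)
```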
Pre-training Task #2: Masked RoI Classification with Linguistic Clues
[Figure: one image region is zeroed out; given the caption "kitten drink from bottle" and the remaining regions, VL-BERT predicts the masked region's category, [Cat].]
P.S. This task is not used for the text-only corpus.
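A sketch of this task: the masked region's pixels are zeroed out in the raw image before feature extraction, so the model must rely on linguistic clues and the other regions; the target category comes from the detector. The model interface, `extract_roi_features`, and `cls_head` are assumed helpers, not the released code:

```python
import torch
import torch.nn.functional as F

def masked_roi_classification(model, token_ids, image, rois, roi_labels,
                              cls_head, p=0.15):
    """Zero out random RoIs in the raw image; predict their detector-given category."""
    masked = torch.rand(len(rois)) < p
    img = image.clone()                               # (C, H, W)
    for (x1, y1, x2, y2), m in zip(rois.long(), masked):
        if m:
            img[:, y1:y2, x1:x2] = 0                  # hide the region's pixels so no
                                                      # extracted feature can leak them
    visual_feats = extract_roi_features(img, rois)    # Fast(er) R-CNN (assumed helper)
    hidden = model(token_ids, visual_feats)           # (seq_len, hidden) final states
    roi_hidden = hidden[-len(rois) - 1:-1]            # outputs of the [IMG] elements
    logits = cls_head(roi_hidden[masked])             # classify only the masked RoIs
    return F.cross_entropy(logits, roi_labels[masked])
```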
Results on Downstream Tasks
[Bar charts: performance on VCR, VQA, and RefCOCO+ for the task-specific Baseline, VL-BERT Base w/o pre-training, VL-BERT Base, and VL-BERT Large.]
Results on Downstream Tasks
[Bar charts: the same results, highlighting the gap to the baseline.]
Our generic representation surpasses the task-specific baseline by a large margin.
Results on Downstream Tasks
[Bar charts: the same results, highlighting the gains from pre-training.]
Pre-training further enhances the capability.
Conclusion
• A new pre-trainable generic representation for VL tasks
• The pre-training procedure can better align VL clues
• Future work: seek better pre-training tasks to benefit more downstream tasks (e.g., image caption generation)