PIQA: Reasoning about Physical Commonsense in Natural Language

Presented by Shailesh M Pandey

Bisk, Yonatan, et al. “PIQA: Reasoning about Physical Commonsense in Natural Language.” arXiv:1911.11641 (2019).
Outline
1. Motivation
2. Dataset
   2.1. Collection
   2.2. Cleaning
   2.3. Statistics
3. Experiments
   3.1. Results
4. Analysis
   4.1. Quantitative
   4.2. Qualitative
5. Critique
Motivation
● Modeling physical common sense knowledge is an essential step toward true AI completeness.
● Can AI systems learn to reliably answer physical common sense questions without experiencing the physical world?
○ Physical common sense properties are rarely reported directly in text.
● Before PIQA, there was no extensive evaluation of SOTA models on questions that require physical common sense knowledge.
Dataset
● Task: given a question and two candidate answers, choose the more appropriate one.
● Question: states a desired post-condition (the goal).
● Answer: a procedure for accomplishing the goal (the solution).
Dataset - Collection
● Qualification HIT for annotators
○ Must identify well-formed (goal, solution) pairs >80% of the time.
● Annotators were given prompts derived from instructables.com
○ Drawn from six categories: costume, outside, craft, home, food, and workshop.
○ The prompts remind annotators about less prototypical uses of everyday objects.
● Annotators were asked to construct two component tasks
○ Articulate the goal and its solution.
○ Perturb the solution subtly so that it becomes invalid.
Dataset - Cleaning
● Removed examples with low annotator agreement
○ A side effect: correct examples that require expert knowledge are also removed.
● Used AFLite to perform systematic data bias reduction
○ Fine-tuned BERT-Large on 5k examples.
○ Computed the corresponding embeddings of the remaining instances.
○ Used an ensemble of linear classifiers (trained on random subsets) to determine whether the embeddings alone are strong indicators of the correct answer.
○ Discarded instances whose embeddings are highly indicative of the target label (walkthrough and sketch below).
AFLite (Adversarial Filtering Lite)
[Animated walkthrough, consolidated from several slides:]
● Toy dataset of 3-bit instances with binary labels: 000→1, 001→1, 010→1, 011→1, 100→0, 101→0, 110→0, 111→0.
● Repeatedly train a linear classifier on a random subset of the data and record its predictions on the held-out instances.
● Each instance accumulates a list of out-of-sample predictions; its predictability score is the fraction of those predictions that match its label:

  Instance  Label  Predictions  Score
  000       1      [ 1 0 ]      0.5
  001       1      [ 1 ]        1.0
  010       1      [ ]          -
  011       1      [ 0 ]        0.0
  100       0      [ 0 0 ]      1.0
  101       0      [ ]          -
  110       0      [ ]          -
  111       0      [ ]          -

● Threshold = 0.75: instances scoring above the threshold (here 001 and 100) are discarded as too easy, and the procedure repeats on the remaining data.

Sakaguchi, Keisuke, et al. 2020. “WinoGrande: An Adversarial Winograd Schema Challenge at Scale.” In AAAI.
Examples
[Example (goal, solution) pairs from the dataset were shown here.]
Dataset - Statistics
● Number of QA pairs
○ Training: >16k
○ Development: ~2k
○ Testing: ~3k
● Average number of words
○ Goal: 7.8
○ Correct solution: 21.3
○ Incorrect solution: 21.3
Dataset - Statistics
● Nearly identical distribution of correct and incorrect solution lengths.
● At least 85% overlap between the words used in correct and incorrect solutions (one plausible computation is sketched below).
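The slide does not define the overlap measure; this is a hedged sketch of one plausible per-pair reading (shared words over the smaller solution's vocabulary), not the paper's definition:

# Hypothetical per-pair word-overlap measure; the exact definition is assumed.
def word_overlap(correct: str, incorrect: str) -> float:
    a, b = set(correct.lower().split()), set(incorrect.lower().split())
    return len(a & b) / min(len(a), len(b))  # shared words / smaller vocabulary

print(word_overlap("pour the water into the pan",
                   "pour the milk into the pan"))  # 0.8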
Experiments
● For each choice, the model is given the goal, the solution, and a [CLS] token.
● The hidden state corresponding to [CLS] is extracted.
● A linear transformation is applied to each hidden state, followed by a softmax over the two options (sketched below).
● Trained using a cross-entropy loss.
● Examples are truncated at 150 tokens, which affects ~1% of the data.
● Human performance was calculated by majority vote on the development set.
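A minimal PyTorch sketch of this scoring setup, assuming a HuggingFace RoBERTa encoder; the class name and token layout are illustrative, and the authors' exact implementation may differ:

import torch.nn as nn
from transformers import RobertaModel

class PIQAChoiceScorer(nn.Module):
    """Scores each (goal, solution) pair; a softmax over the two choices follows."""
    def __init__(self, name="roberta-large"):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained(name)
        # Linear transformation from the [CLS] hidden state to a scalar score.
        self.score = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # input_ids: (batch, 2, seq_len), one encoded goal+solution per choice
        b, n, l = input_ids.shape
        out = self.encoder(input_ids.view(b * n, l),
                           attention_mask=attention_mask.view(b * n, l))
        cls = out.last_hidden_state[:, 0]   # hidden state at <s> (RoBERTa's [CLS])
        return self.score(cls).view(b, n)   # logits over the two options

# Training: softmax over the two options via cross-entropy on the logits,
# e.g. loss = nn.CrossEntropyLoss()(logits, correct_choice_indices)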
Quantitative Analysis
● Two solution choices that differ only in a single edited phrase necessarily test common sense understanding of that phrase.
● ~60% of the data involves a 1-2 word edit distance between solutions (the distance being measured is sketched below).
● Dataset difficulty generally increases with the edit distance between the solution pairs.
Quantitative Analysis
● RoBERTa struggles to understand certain flexible relations:
○ ‘before’, ‘after’, ‘top’, and ‘bottom’.
● It performs worse than average on solution pairs differing in ‘water’, even with ~300 such training examples.
● It performs much better on certain nouns, such as ‘spoon’.
Quantitative Analysis
● ‘water’ is prevalent but highly versatile:
○ it is substituted with a wide variety of household items.
● ‘spoon’ has fewer common replacements, which suggests RoBERTa understands these simpler affordances.
Qualitative Analysis
● RoBERTa distinguishes prototypical correct solutions from clearly ridiculous trick solutions.
● It struggles with subtle relations and non-prototypical situations.
Critique
● Tries to advance a crucial ‘grounding’ problem
○ A benchmark for testing the physical understanding of new models.
○ An evaluation of the physical common sense of SOTA models; unsurprisingly, these models don't perform very well.
● A good effort at creating an unbiased dataset
○ No ‘annotate for a smart robot’ instruction to the workers.
○ Thorough cleaning of the dataset: agreement scores and AFLite.
● A reasonably good analysis of RoBERTa's performance on the dataset.
Critique
● An intelligent model should perform well on this benchmark, but is the converse true?
○ What if we pre-train RoBERTa on text from instructables.com?
● Should we expect models trained on text to have physical understanding?
○ How would a text-trained model know that squeezing and then releasing a bottle creates suction?
○ Should the focus have been on ‘grounded’ models instead, e.g. VQA models?
● Is the dataset easy simply because there are only two choices?
● The paper does not report a few important dataset statistics:
○ What is the distribution of words in incorrect solutions? Is it similar to that of the correct solutions?
○ How many examples were actually removed during cleaning?
● Is a majority vote a good indicator of human performance?
○ What is the average score of a single annotator?
○ Should the dataset include questions where the majority vote gets it wrong?
Questions?