Open Information Extraction
Mausam
“The Internet is the world’s largest library. It’s just that all the books are on the floor.”
- John Allen Paulos
> 1 Trillion URLs (Google, 2008)
Information Overload
Paradigm Shift: from retrieval to reading
Answers read off the World Wide Web:
Who won Bigg Boss 7? → Gauhar Khan
What sports teams are based in Arizona? → Phoenix Suns, Arizona Cardinals, …
Paradigm Shift: from retrieval to reading
Quick view of today's news, read off the World Wide Web:
Science Report
Finding: beer that doesn't give a hangover
Researcher: Ben Desbrow
Country: Australia
Organization: Griffith Health Institute
Paradigm Shift: from retrieval to reading
Which US West coast companies are hiring for a software engineer position? → Google, Microsoft, Facebook, …
What is Machine Reading?
Text → Assertions → Inferences
Information Extraction (IE)
IE(sentence) = (relation instance, probability)
"Edison was the inventor of the phonograph." → InventorOf(Edison, phonograph), 0.9
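Viewed as an interface, an IE system maps a sentence to relation instances with confidences. A minimal Python sketch of that output type (the names are illustrative, not from any particular system):

```python
from dataclasses import dataclass

@dataclass
class Extraction:
    """A single IE output: a binary relation instance plus a confidence."""
    relation: str      # e.g., "InventorOf"
    arg1: str          # e.g., "Edison"
    arg2: str          # e.g., "phonograph"
    confidence: float  # e.g., 0.9

# IE(sentence) = (relation instance, probability)
example = Extraction("InventorOf", "Edison", "phonograph", 0.9)
```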
“You shall know a word by the company it keeps” (Firth, 1957)
Context Clues
• …Baltimore mayor…
• …Baltimore international airport…
• cities such as Chicago, Baltimore, and..
Where do clues come from?
How to Scale IE?
1970s-1980s: heuristic, hand-crafted clues
• Facts from earnings announcements
• Narrow domains; brittle clues
1990s: IE as supervised learning
“Mary was named to the post of CFO, succeeding Joe who retired abruptly.”
Does "IE as supervised learning" scale to reading the Web?
No.
Critique of IE = supervised learning
• Relation specific
• Genre specific
• Hand-crafted training examples
Does not scale to the Web!
Semi-Supervised Learning
• Few hand-labeled examples per relation!
• Limit on the number of relations
  – relations are pre-specified
• Still does not scale to the Web
Lessons from DB/KR Research
• Declarative KR is expensive & difficult
• Formal semantics is at odds with
– Broad scope
– Distributed authorship
• KBs are brittle: “can only be used for tasks whose knowledge needs have been anticipated in advance” (Halevy IJCAI ‘03)
Machine Reading at Web Scale
• A “universal schema” is impossible
• Global consistency is like world peace
• Ontological “glass ceiling”
– Limited vocabulary
– Pre-determined predicates
– Swamped by reading at scale!
Motivation
• General purpose
  – hundreds of thousands of relations
  – thousands of domains
• Scalable: computationally efficient
  – huge body of text on the Web and elsewhere
• Scalable: minimal manual effort
  – large-scale human input impractical
• Knowledge needs not anticipated in advance
  – rapidly retargetable
Open IE Guiding Principles
• Domain independence
– Training for each domain/fact type not feasible
• Coherence
– Readability important for human interactions
• Scalability
– Ability to process a large number of documents fast
Open Information Extraction
"When Saddam Hussain invaded Kuwait in 1990, the international…"
→ (Saddam Hussain, invaded, Kuwait)
Open IE: extracting information from natural language text, for all relations in all domains, in a few passes.
(Google, acquired, Youtube)
(Oranges, contain, Vitamin C)
(Edison, invented, phonograph)
…
Open Information Extraction
"Edison was the inventor of the phonograph."
→ (Edison, was the inventor of, phonograph)
Open IE
• Avoid hand-labeling sentences
• Single pass over corpus
• No pre-specified vocabulary
  – Challenge: map relation phrase to canonical relation
  – E.g., "was the inventor of" → invented
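One simple way to picture the canonicalization step is a lookup from relation phrases to canonical relations; a real system would learn such synonym clusters from data (cf. the synonymy work in the trajectory below) rather than hard-code them. A hypothetical sketch:

```python
# Hypothetical, hand-written synonym table for illustration only;
# a deployed system learns these clusters rather than listing them.
CANONICAL = {
    "was the inventor of": "invented",
    "is the inventor of": "invented",
    "came up with": "invented",
}

def canonicalize(relation_phrase: str) -> str:
    """Map a surface relation phrase to a canonical relation (fallback: itself)."""
    return CANONICAL.get(relation_phrase.lower().strip(), relation_phrase)

assert canonicalize("was the inventor of") == "invented"
```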
Open vs. Traditional IE

             Traditional IE                Open IE
Input:       Corpus + Hand-labeled Data    Corpus + Existing resources
Relations:   Specified in Advance          Discovered Automatically
Complexity:  O(D * R)                      O(D)
             (D documents, R relations)    (D documents)
Output:      Relation-specific             Relation-independent
TextRunner
First Web-scale Open IE system (Banko, IJCAI '07)
1,000,000,000 distinct extractions
Peak of 0.9 precision (but low recall)
Trajectory of Open IE
2003: KnowItAll "web reading" project
2007: TextRunner: 1,000,000,000 extractions
2008-9: Synonymy, horn-clause inference
2010-11: ReVerb, ontology mapping, relation properties
2012: OLLIE, event templates, multi-doc summarization
Open IE Example
"The chapter was founded by FANHS, which is headquartered in Seattle."
1. Identify Candidate Args
2. Identify Relation Phrase
Training signal: heuristics over unlabeled text [Banko & Etzioni 07]; labeled examples from Wikipedia [Wu & Weld 10]
Models: Patterns [Banko & Etzioni 07]; CRF [Banko & Etzioni 07, Wu & Weld 10]; Markov Logic Network [Zhu et al. 09]
TR Problem #1: Incoherent Extractions
The guide contains dead links and omits sites. → (The guide, contains omits, sites)
Extendicare agreed to buy Arbor for about $432M in assumed debt. → (Arbor, for assumed, debt)
≈ 15% of TextRunner's extractions
TR Problem #2: Uninformative Extractions
Homer made a deal with the devil. → (Homer, made, deal)
Existing systems miss verb + noun constructions:
Jane is an expert in physics → extracts "is", misses "is an expert in"
Robocop takes place in Detroit → extracts "takes", misses "takes place in"
Obama gave a speech on energy → extracts "gave", misses "gave a speech on"
Relation Frequency in TextRunner
[Bar chart: percentage of TextRunner extractions for the most frequent relations: is, has, makes, gives, takes, gets]
Paris is the capital of France
Iran has a role in Afghan talks
Apple made a deal with Google
TCP/IP gave rise to the internet
ReVerb
Identify Relations from Verbs.
1. Find the longest phrase matching a simple syntactic constraint (sketched in code below).

Syntactic Constraint: the relation phrase must start with a verb and match the pattern:
  V       discovered         V = verb | particle | adv
  V P     died from          P = prep | particle | inf. marker
  V W* P  played a role in   W = noun | adj | adv | det | pron
or multiple contiguous matches to the pattern: "wants to find a solution for"

The constraint rejects incoherent relation phrases such as:
The guide contains dead links and omits sites. → (The guide, contains omits, sites) ✗
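The syntactic constraint can be approximated as a regular expression over Penn Treebank POS tags. A minimal sketch, assuming the sentence is already tagged; the tag groupings are illustrative and coarser than ReVerb's actual implementation:

```python
import re

# Penn Treebank tag classes (illustrative groupings, not ReVerb's exact lexicon)
V = r"VB[DGNPZ]?"                            # verb
W = r"(?:NNP?S?|JJ[RS]?|RB[RS]?|PRP\$?|DT)"  # noun | adj | adv | pron | det
P = r"(?:IN|RP|TO)"                          # prep | particle | inf. marker

# One unit is V, V P, or V W* P; a relation phrase is 1+ contiguous units.
UNIT = rf"{V}(?:(?: {W})* {P})?"
RELATION = re.compile(rf"{UNIT}(?: {UNIT})*(?=\s|$)")

def longest_relation(tags):
    """Longest sub-sequence of POS tags matching the pattern, as a token span."""
    best = None
    for i in range(len(tags)):
        m = RELATION.match(" ".join(tags[i:]))
        if m:
            n = len(m.group(0).split())
            if best is None or n > (best[1] - best[0]):
                best = (i, i + n)
    return best

# "Obama gave a speech on energy": the VBD DT NN IN span is "gave a speech on"
print(longest_relation(["NNP", "VBD", "DT", "NN", "IN", "NN"]))  # (1, 5)
```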
Lexical Constraint
Problem: "overspecified" relation phrases
"Obama is offering only modest greenhouse gas reduction targets at the conference."
Solution: a valid relation phrase must take many distinct arguments in a large corpus (see the sketch below):
  "is offering only modest … targets at" → ≈ 1 distinct argument pair: (Obama, the conference)
  "is the patron saint of" → ≈ 100s of distinct argument pairs: (Anne, mothers), (George, England), (Hubbins, quality footwear), …
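Enforcing the constraint amounts to counting distinct (arg1, arg2) pairs per candidate relation phrase over a large extraction run and keeping phrases above a threshold. A minimal sketch (the threshold of 20 is illustrative, not ReVerb's actual setting):

```python
from collections import defaultdict

# Candidate extractions: (arg1, relation_phrase, arg2)
extractions = [
    ("Anne", "is the patron saint of", "mothers"),
    ("George", "is the patron saint of", "England"),
    ("Hubbins", "is the patron saint of", "quality footwear"),
    ("Obama", "is offering only modest greenhouse gas reduction targets at",
     "the conference"),
]

# Count distinct argument pairs per relation phrase
distinct_args = defaultdict(set)
for arg1, rel, arg2 in extractions:
    distinct_args[rel].add((arg1, arg2))

MIN_DISTINCT_ARGS = 20  # illustrative threshold
valid = {rel for rel, pairs in distinct_args.items()
         if len(pairs) >= MIN_DISTINCT_ARGS}
```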
Sample of ReVerb Relations
inhibits tumor growth in, has a PhD in, joined forces with, is a person who studies, voted in favor of, won an Oscar for, has a maximum speed of, died from complications of, mastered the art of, gained fame as, granted political asylum to, is the patron saint of, was the first person to, identified the cause of, wrote the book on
Number of Relations
DARPA MR Domains             <50
NYU, Yago                    <100
NELL                         ~500
DBpedia 3.2                  940
PropBank                     3,600
VerbNet                      5,000
Wikipedia InfoBoxes, f > 10  ~5,000
TextRunner                   100,000+
ReVerb                       1,500,000+
Coverage of the ReVerb Model
Limitations of the model (sampled 300 Web sentences from (Wu & Weld)):
• 85% of verb-based relations satisfy the ReVerb constraint
• 8% Non-Contiguous: "X was founded in 1995 by Y", "X is produced and hosted by Y", "X shut Y down"
• 4% Not Between Args: "Discovered by Y, X …", "… the Y that X discovered"
• 3% Does Not Match POS Pattern: "X has a lot of faith in Y", "X to attack Y"
ReVerb Extraction Algorithm
1. Identify the longest relation phrases satisfying the constraints
2. Heuristically identify arguments for each relation phrase (a toy version below)
"Hudson was born in Hampstead, which is a suburb of London."
→ (Hudson, was born in, Hampstead)
→ (Hampstead, is a suburb of, London)
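A toy version of the argument heuristic: take the nearest noun-phrase chunk to the left of a matched relation phrase as arg1 and the nearest to the right as arg2 (real ReVerb uses more careful heuristics):

```python
def extract(tokens, np_chunks, rel_span):
    """Toy argument heuristic: nearest NP chunk to the left of the relation
    phrase becomes arg1, nearest NP chunk to the right becomes arg2.
    np_chunks and rel_span are (start, end) token spans."""
    rel_start, rel_end = rel_span
    left = [c for c in np_chunks if c[1] <= rel_start]
    right = [c for c in np_chunks if c[0] >= rel_end]
    if not left or not right:
        return None
    arg1 = max(left, key=lambda c: c[1])   # closest NP on the left
    arg2 = min(right, key=lambda c: c[0])  # closest NP on the right
    text = lambda span: " ".join(tokens[span[0]:span[1]])
    return (text(arg1), text(rel_span), text(arg2))

tokens = ["Hudson", "was", "born", "in", "Hampstead"]
print(extract(tokens, np_chunks=[(0, 1), (4, 5)], rel_span=(1, 4)))
# ('Hudson', 'was born in', 'Hampstead')
```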
Experiments: Relation Phrases (Etzioni, Fader, Christensen, Soderland, Mausam – IJCAI '11)
[Precision-yield curves for ReVerb omitted]
ReVerb Error Analysis
[Error-analysis chart omitted]
ReVerb Summary
• Semantically tractable subset of language
• ReVerb: simple model of relation phrases
• Superfast, highly scalable
• Code/data available at openie.cs.washington.edu
Motivating Examples
"The assassination of Franz Ferdinand, improbable as it may seem, began WWI."
(it, began, WWI)
"Republicans in the Senate filibustered an effort to begin debate on the jobs bill."
(the Senate, filibustered, an effort)
"The plan would reduce the number of teenagers who begin smoking."
(The plan, would reduce the number of, teenagers)
Analysis – arg1 substructure
Basic Noun Phrases (65%): "Chicago was founded in 1833" (pattern: NN, JJ NN, etc.)
Prepositional Attachments (19%): "The forest in Brazil is threatened by ranching." (pattern: NP PP NP)
List (15%): "Google and Apple are headquartered in Silicon Valley." (pattern: NP, (NP,)* CC NP)
Relative Clause (<1%): "Chicago, which is located in Illinois, has three million residents." (pattern: NP (that|WP|WDT)? NP? VP NP)
Analysis – arg2 substructure
Basic Noun Phrases (60%): "Calcium prevents osteoporosis" (pattern: NN, JJ NN, etc.)
Prepositional Attachments (18%): "Barack Obama is one of the presidents of the United States" (pattern: NP PP NP)
List (15%): "A galaxy consists of stars and stellar remnants" (pattern: NP, (NP,)* CC NP)
Independent Clause (8%): "Scientists estimate that 80% of oil remains a threat." (pattern: (that|WP|WDT)? NP? VP NP)
Relative Clause (6%): "The shooter killed a woman who was running from the scene." (pattern: NP (that|WP|WDT)? NP? VP NP)
ArgLearner: Argument Extraction Methodology
• Break the problem into four parts, each a classifier choosing a boundary position in the token sequence around the relation phrase (… TOK TOK TOK rel TOK TOK …); a minimal sketch follows this list:
  – Identify arg1 right bound
  – Identify arg1 left bound
  – Identify arg2 left bound
  – Identify arg2 right bound
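A minimal sketch of the boundary-classification framing using scikit-learn; the features and the tiny training set are illustrative stand-ins for ArgLearner's actual feature set (which includes POS, chunk, and lexical context):

```python
from sklearn.linear_model import LogisticRegression

def boundary_features(pos_tags, rel_start, i):
    """Toy features for 'is token i the arg1 left bound?' (illustrative only)."""
    return [
        float(rel_start - i),                          # distance to relation phrase
        1.0 if pos_tags[i].startswith("NN") else 0.0,  # noun at position i
        1.0 if pos_tags[i] == "DT" else 0.0,           # determiner at position i
        1.0 if i == 0 else 0.0,                        # sentence-initial
    ]

# Tiny synthetic training set: (pos_tags, rel_start, position, is_left_bound)
train = [
    (["DT", "NN", "VBD", "IN", "NNP"], 2, 0, 1),
    (["DT", "NN", "VBD", "IN", "NNP"], 2, 1, 0),
    (["NNP", "VBD", "IN", "NNP"], 1, 0, 1),
]
X = [boundary_features(tags, rs, i) for tags, rs, i, _ in train]
y = [label for *_, label in train]

arg1_left_clf = LogisticRegression().fit(X, y)
# One such classifier is trained per boundary (arg1 left/right, arg2 left/right);
# at extraction time, pick the candidate position with the highest probability.
```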
Evaluation on Web Text
[Yield chart omitted]
Processing time per sentence: ReVerb (0.015 sec), ArgLearner (1.6x), speed-tuned parser (3.2x), accuracy-tuned parser (31x)
ReVerb+Arglearner: Error Analysis
• Last night at CES (Consumer Electronics Show), Steve Balmer, the CEO of Microsoft, held a press conference.
• The first in our list is Stephen Googleheim, born in Virginia in 1953 to Swedish parents.
• After winning the Superbowl, the Giants are now the top dogs of the NFL.
• …is that it makes Judaism different from Christianity and Islam
• Ahmadinejad was elected as the new President of Iran.
OLLIE: Open Language Learning for Information Extraction

[Architecture]
Learning: ReVerb → Seed Tuples → Bootstrapper → Training Data → Open Pattern Learning → Pattern Templates
Extraction: Sentence → Pattern Matching (against the Pattern Templates) → Tuples → Context Analysis → Ext. Tuples
Bootstrapping Approach
[Venn diagram: ReVerb's verb-based relations sit inside all verb-based relations, which sit inside the broader space of syntactic relations; semantic relations extend beyond]
The same coaching relation, expressed in progressively harder ways:
Federer is coached by Paul Annacone.
Now coached by Paul Annacone, Federer has …
Paul Annacone, the coach of Federer, …
Federer hired Annacone as his new coach.
Bootstrapping
High-quality ReVerb extractions + ClueWeb sentences → extraction lemmas (seeds)
(Ahmadinejad, is the current president of, Iran) → ahmadinejad, president, iran
Matching training sentence: "Ahmadinejad, who is the president of Iran, is a puppet for the Ayatollahs."
Pattern Templates
{arg1} ↓rcmod↓ {rel:NN:President} ↓prep_of↓ {arg2}
→ (arg1, be President of, arg2)

Open Pattern Templates
Can we generalize this pattern to all relations (beyond Presidents)?
Example: (Obama, is President of, US) from "US President Obama gives us hope."
{arg2} ↑nn↑ {arg1} ↓nn↓ {rel:NN}
→ (arg1, be {rel} of, arg2)
Unconstrained, this over-generates: (Department, be Police of, NY) ✗, (Department, be NY of, Police) ✗
Open Pattern Templates (Syntactic)
• Syntactic checks
  – Relation in the middle of the pattern
  – No nn/amod edges
  – Prepositions in relation/pattern match
Syntactic Generalization:
{arg1} ↓rcmod↓ {rel:NN:President} ↓prep_of↓ {arg2} → (arg1, be President of, arg2)
generalizes to
{arg1} ↓rcmod↓ {rel:NN} ↓prep↓ {arg2} → (arg1, be {rel} {prep}, arg2)
Open Pattern Templates (Semantic)
• Patterns that fail the syntactic checks are not always applicable
  … however they are not completely useless
Example: (Obama, is President of, US) from "US President Obama gives us hope."
{arg2} ↑nn↑ {arg1} ↓nn↓ {rel:NN} → (arg1, be {rel} of, arg2)
  unconstrained ✗
  with a lexical constraint: rel in {president, chairman, CEO, …}
  Type Generalization: rel in Person-Nouns
Skipping over Nodes
(Ahmadinejad, is President of, Iran) from "Ahmadinejad was elected as the president of Iran"
{arg1} ↑nsubjpass↑ {slot:VBN} ↓prep_as↓ {rel:NNP} ↓prep_of↓ {arg2}
→ (arg1, be {rel} of, arg2), slot in {elect, select, choose, …}
Pattern Extractor
Goal: "I learned that the 2012 Sasquatch music festival is scheduled for May 25th until May 28th—all day."
→ (the 2012 Sasquatch music festival, is scheduled for, May 25th)
Matching pattern template: {arg1} <nsubjpass< {rel:VBN} >prep> {arg2}
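A hedged sketch of applying such a template with an off-the-shelf dependency parser. This uses spaCy, whose label set differs from the Stanford-style labels in the slides (a prep edge plus a pobj edge instead of a collapsed prep_for), and hard-codes just this one template:

```python
import spacy  # requires: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

def match_nsubjpass_prep(sentence):
    """Match {arg1} <nsubjpass< {rel:VBN} >prep> {arg2} against a parse."""
    doc = nlp(sentence)
    triples = []
    for token in doc:
        if token.tag_ != "VBN":
            continue
        subjects = [c for c in token.children if c.dep_ == "nsubjpass"]
        preps = [c for c in token.children if c.dep_ == "prep"]
        for arg1 in subjects:
            for prep in preps:
                for pobj in (c for c in prep.children if c.dep_ == "pobj"):
                    rel = f"{token.lemma_} {prep.text}"  # e.g., "schedule for"
                    triples.append((arg1.text, rel, pobj.text))
    return triples

print(match_nsubjpass_prep("The festival is scheduled for May 25th."))
# e.g., [('festival', 'schedule for', 'May')] — heads only; real systems
# expand each head to its full argument span.
```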
Motivating Examples
“Early astronomers believed that the earth is the center of the universe.”
(earth, is the center of, universe)
“If he wins five key states, Romney will be elected President.”
(Romney, will be elected, President)
Context Analysis
"Early astronomers believed that the earth is the center of the universe."
→ [(earth, is the center of, universe), Attribution: early astronomers]
"If he wins five key states, Romney will be elected President."
→ [(Romney, will be elected, President), Modifier: if he wins five key states]
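A toy sketch of the idea: flag an attribution verb governing the extraction's clause, or a conditional marker attached to it. The verb list and the use of spaCy are illustrative, not OLLIE's actual implementation:

```python
import spacy  # requires: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")
ATTRIBUTION_VERBS = {"believe", "say", "claim", "think"}  # illustrative list

def context_flags(sentence):
    """Return context annotations to attach to extractions from this sentence:
    attributions ("X believed that ...") and conditional modifiers ("If ...")."""
    doc = nlp(sentence)
    flags = []
    for token in doc:
        # Attribution: an extraction inside a clausal complement of "believe" etc.
        if token.lemma_ in ATTRIBUTION_VERBS and any(
                c.dep_ == "ccomp" for c in token.children):
            subj = [c.text for c in token.children if c.dep_ == "nsubj"]
            flags.append(("Attribution", " ".join(subj) or token.lemma_))
        # Conditional: an if-clause modifying the main assertion
        if token.dep_ == "mark" and token.lower_ == "if":
            flags.append(("Modifier", "conditional (if-clause)"))
    return flags

print(context_flags("Early astronomers believed that the earth is flat."))
# e.g., [('Attribution', 'astronomers')]
```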
Ranking Function
• Supervised learning
• Features
– Frequency of pattern in training set
– Lexical/POS features
– Length/coverage features
– …
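A minimal sketch of such a confidence function: featurize each extraction and fit a logistic model on labeled extractions. The specific features below are illustrative stand-ins for those listed above:

```python
from sklearn.linear_model import LogisticRegression

def features(extraction, pattern_freq, sent_len):
    """Illustrative ranking features for one extraction."""
    arg1, rel, arg2 = extraction
    n_covered = len(arg1.split()) + len(rel.split()) + len(arg2.split())
    return [
        float(pattern_freq),      # frequency of the matched pattern in training
        float(len(rel.split())),  # relation phrase length
        n_covered / sent_len,     # fraction of the sentence covered
    ]

# X: feature vectors for labeled extractions; y: 1 = correct, 0 = incorrect
X = [[120.0, 3.0, 0.8], [2.0, 1.0, 0.2], [45.0, 4.0, 0.6]]
y = [1, 0, 1]
ranker = LogisticRegression().fit(X, y)

conf = ranker.predict_proba(
    [features(("Romney", "will be elected", "President"),
              pattern_freq=45, sent_len=10)])[0, 1]
```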
Evaluation (Mausam, Schmitz, Bart, Soderland, Etzioni – EMNLP '12)
[Precision-yield curves: OLLIE vs. ReVerb vs. WOE-parse; precision 0.5-1.0, yield 0-600]
Noun-based Relations
Relation            OLLIE    ReVerb   incr.
is the capital of   8,566    146      59x
is president of     21,306   1,970    11x
is professor at     8,334    400      21x
is scientist of     730      5        146x
[Charts on the contributions of Semantic Patterns and Context Analysis omitted]
Summary
• Bootstrapping based on ReVerb
  – Look for args as well as relations when bootstrapping
• Generalization
  – Syntactic and semantic generalizations of learned patterns
• Context around an extraction
  – Obtains higher precision than ReVerb
• Syntactically different ways of expressing a relation
  – Obtains much higher recall than ReVerb
• Code
  – Available at http://ollie.cs.washington.edu