454 Project Ideas
Administrivia
Office Hours: 11-noon, Fridays in 588, or by email
Project proposals due today
Not binding (at least not yet)
To be elaborated
In-person project reviews next week
HW 1 – due next Tues @ noon
Autonomously Semantifying Wikipedia
Fei Wu
Dept. Computer Science & Eng.
University of Washington
(Joint work with Dan Weld)
Motivation
Semantic Web [Berners-Lee 01] is great.
Web content machine readable
Software agents find, share and integrate information
Chicken-egg problem: Semantic Data <-> Applications
Bootstrapping: Automatically Semantifying Data
Idea: “Semantify” Wikipedia
Wikipedia [http://wikipedia.org]
Comprehensive (1.7 million English articles)
High-quality
Important: 6th most popular web-site & growing
Benefits:
User-tagged data (links, infobox, lists, categories, etc.)
Large, but not too large
Wikipedia Challenges
Much natural-language text
Missing data
Inconsistency
Low information redundancy
Kylin: Autonomously Semantifying Wikipedia
Totally autonomous with no additional human effort
Information extraction from both semi-structured and unstructured data
“Kylin: a mythical hooved Chinese chimerical creature that is said to appear in conjunction with the arrival of a sage.” (Wikipedia)
[Wu & Weld CIKM-07]
Outline
Semantics in Wikipedia: Opportunities, Challenges
Kylin System: Infobox Generation, Link Creation
Conclusion
Semantics in Wikipedia
Infobox Link List Category Redirection Disambiguation ……
{{Infobox U.S. County
| county = Clearfield County
| state = Pennsylvania
| seal =
| map = Map of Pennsylvania highlighting Clearfield County.svg
| map size = 225
| founded = [[March 26]], [[1804]]
| seat = [[Clearfield, Pennsylvania|Clearfield]]
| area = 2,988 [[km²]] (1,154 [[square mile|mi²]])
| area water = 17 km² (6 mi²)
| area percentage = 0.56%
| census yr = 2000
| pop = 83,382
| density = 28
}}
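To make the infobox markup concrete, here is a minimal sketch of turning one template call into attribute/value pairs. The helper names `split_top_level` and `parse_infobox` are hypothetical, and the parser only handles the simple case shown on the slide (pipes nested inside [[...]] or {{...}} are kept with their value); it is not a full wikitext parser.

```python
def split_top_level(text, sep="|"):
    """Split on `sep`, ignoring separators nested inside [[...]] or {{...}}."""
    parts, depth, current, i = [], 0, [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in ("[[", "{{"):
            depth += 1
            current.append(pair)
            i += 2
        elif pair in ("]]", "}}"):
            depth -= 1
            current.append(pair)
            i += 2
        elif text[i] == sep and depth == 0:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(text[i])
            i += 1
    parts.append("".join(current))
    return parts


def parse_infobox(wikitext):
    """Return (template_name, {attribute: raw_value}) for one {{...}} template call."""
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("expected a single {{...}} template call")
    fields = split_top_level(body[2:-2])
    name = fields[0].strip()
    attributes = {}
    for field in fields[1:]:
        if "=" not in field:
            continue  # skip empty or positional parameters
        key, value = field.split("=", 1)
        attributes[key.strip()] = value.strip()
    return name, attributes


sample = ("{{Infobox U.S. County| county = Clearfield County| state = Pennsylvania "
          "| seat = [[Clearfield, Pennsylvania|Clearfield]] | pop = 83,382 | density = 28||}}")
name, attrs = parse_infobox(sample)
```

Note that the link in the seat value contains a pipe of its own, which is why the split tracks bracket depth instead of calling `str.split("|")`.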
Self-Supervised Learning of Infoboxes
Infobox Challenges
Incompleteness
US County: ~50% of articles have infoboxes
Inconsistency
Manual process -> contradictions between text & infobox
16% of US County articles had an error (revision)
Schema Drift
U.S. County (1428), US County (574), Counties (50), County (19)
Attribute drift & duplication
Rare attributes: only 29% used by 30% or more articles
Infobox Challenges (Continued)
Type-free System
Deliberate low-tech design
“King county” has the following attributes:
Land area = 2126 sq miles
Land area (km) = 5506 sq km
Irregular lists
Some separate information in items
Others use tables with different schemata
Others are hierarchical
Examples: List of cities & towns in US; Places in Florida; List of counties in Florida
Infobox Challenges (Continued)
Infoboxes are themselves hierarchical
Country leader: instead of a name, has a nested element listing the title (e.g. “king”), with the name at a lower level
Semantics in Wikipedia
Infobox Link List Category Redirection Disambiguation
Why are these useful?
Why challenging?
Semantics in Wikipedia
Infobox Link List Category Redirection Disambiguation
“Seattle, Washington”
Challenge: crappy
• flattened
• “to be merged” since 3/06
Semantics in Wikipedia
Infobox Link List Category Redirection Disambiguation
Opportunities
Semantic source
Training dataset
Challenges
Missing data
Inconsistency
Kylin: Autonomously Semantifying Wikipedia
Infobox Generation
Preprocessor: Schema Refinement
Free edit -> schema drift
Duplicate templates: U.S. County (1428), US County (574), Counties (50), County (19)
Low usage of attributes
Duplicate attributes: “Census Yr”, “Census Estimate Yr”, “Census Est.”, “Census Year”
Kylin:
Strict name match
????
>15% occurrences
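The two Kylin choices listed above can be sketched as follows. This is an illustration under assumptions: “strict name match” is read here as exact equality of template names, the threshold is read as “attribute appears in more than 15% of the class's infoboxes”, and the function names and toy data are hypothetical.

```python
from collections import Counter


def infoboxes_of_class(templates, class_name):
    """Strict name match: keep infoboxes whose template name equals class_name exactly."""
    return [attrs for name, attrs in templates if name == class_name]


def frequent_attributes(infoboxes, min_fraction=0.15):
    """Keep attributes that occur in more than `min_fraction` of the infoboxes."""
    usage = Counter()
    for attributes in infoboxes:
        usage.update(set(attributes))  # count each attribute once per article
    total = len(infoboxes)
    return {name for name, count in usage.items() if count / total > min_fraction}


# Toy data (illustrative only): ten "U.S. County" infoboxes and one "US County".
templates = [("U.S. County", {"seat": "x", "pop": "1"}) for _ in range(8)]
templates.append(("U.S. County", {"seat": "x", "census est.": "2005"}))
templates.append(("U.S. County", {"seat": "x", "pop": "1", "census yr": "2000"}))
templates.append(("US County", {"seat": "y", "population": "2"}))  # name differs, not matched

county_boxes = infoboxes_of_class(templates, "U.S. County")
kept = frequent_attributes(county_boxes)
```

In the toy data the two census attributes each appear in 1 of 10 infoboxes, so they fall below the threshold and only the commonly used attributes remain in the refined schema.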
[Figure: U.S. County Infobox; axis scale 0 to 1]
Preprocessor -> Classifier -> Extractor -> Infobox
Preprocessor: Training Dataset Construction
Clearfield County was created on 1804 from parts of Huntingdon and Lycoming Counties but was administered as part of Centre County until 1812.
Its county seat is Clearfield.
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
Problems:
Missing data
Noise
As of 2005, the population density was 28.2/km².
Steps:
1. Segment into sentences
2. Find unique match (heuristics)
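The two steps above can be sketched in a few lines. This is a simplified illustration, not Kylin's actual heuristics: the sentence splitter is a naive regex, a “match” is plain substring containment, and the function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence segmentation on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_training_set(article_text, infobox):
    """Label a sentence with an attribute when it is the only one containing the value."""
    sentences = split_sentences(article_text)
    labeled = []
    for attribute, value in infobox.items():
        matches = [s for s in sentences if value in s]
        if len(matches) == 1:  # unique match only; ambiguous or missing values are skipped
            labeled.append((attribute, value, matches[0]))
    return labeled


article = ("Clearfield County was created on 1804 from parts of Huntingdon and Lycoming "
           "Counties but was administered as part of Centre County until 1812. "
           "Its county seat is Clearfield. "
           "As of 2005, the population density was 28.2/km².")
infobox = {"founded": "1804", "seat": "Clearfield", "pop": "83,382"}
sentences = split_sentences(article)
labeled = build_training_set(article, infobox)
```

The example shows both problems named on the slide: “83,382” never appears in the text (missing data), and “Clearfield” appears in two sentences (noise), so with this naive matcher only the founding date yields a training example.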
Classifier
Document Classifier (1 per article type)
List & Category
Fast
Precision (98.5%) with no learning!
Recall (68.8%)
Sentence Classifier (1 per article type x attribute)
Trained on preprocessor output
Features: bag of words, POS tags
Maximum Entropy Classifier with Bagging: multi-class, multi-label, missing data
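As a rough illustration of a bagged maximum-entropy sentence classifier, the sketch below trains one binary logistic-regression model (the two-class case of maximum entropy) per bootstrap resample and averages their probabilities. It is an assumption-laden toy: it uses bag-of-words features only (no POS tags), handles a single attribute, and the training sentences are invented for the example.

```python
import math
import random
import re


def features(sentence):
    """Binary bag-of-words features (POS tags are omitted in this sketch)."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def train_logistic(examples, epochs=30, lr=0.1):
    """Binary maximum-entropy (logistic regression) model trained with SGD."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for words, label in examples:
            score = bias + sum(weights.get(w, 0.0) for w in words)
            error = label - 1.0 / (1.0 + math.exp(-score))
            bias += lr * error
            for w in words:
                weights[w] = weights.get(w, 0.0) + lr * error
    return weights, bias


def train_bagged(sentences, labels, n_models=10, seed=0):
    """Train one model per bootstrap resample of the training set."""
    rng = random.Random(seed)
    data = [(features(s), y) for s, y in zip(sentences, labels)]
    return [train_logistic([rng.choice(data) for _ in data]) for _ in range(n_models)]


def predict_proba(models, sentence):
    """Average the positive-class probability over all bagged models."""
    words = features(sentence)
    probs = []
    for weights, bias in models:
        score = bias + sum(weights.get(w, 0.0) for w in words)
        probs.append(1.0 / (1.0 + math.exp(-score)))
    return sum(probs) / len(probs)


# Hypothetical training data for a "county seat" sentence classifier.
train_sentences = [
    "Its county seat is Clearfield",
    "The county seat is Bellefonte",
    "Land area covers 2972 square kilometers",
    "Water area covers 17 square kilometers",
]
train_labels = [1, 1, 0, 0]
models = train_bagged(train_sentences, train_labels)
p_seat = predict_proba(models, "Its county seat is Lancaster")
p_area = predict_proba(models, "Total area covers 500 square kilometers")
```

A real system would train one such classifier per (article type, attribute) pair, as the slide says, and a sentence could receive several attribute labels at once.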
Extractor
Input: a sentence predicted to contain an attribute: “After considerable debate, the county was incorporated on September 13, 1852”
Output: <founding date, September 13, 1852>
Landscape of Extraction Techniques
Any of these models can be used to capture words, formatting or both.
Example sentence for each technique: “Abraham Lincoln was born in Kentucky.”
Lexicons: member of a list (Alabama, Alaska, …, Wisconsin, Wyoming)?
Classify Pre-segmented Candidates: classifier, which class?
Sliding Window: classifier, which class? Try alternate window sizes.
Boundary Models: classifier, which class (BEGIN, END)?
Finite State Machines: most likely state sequence?
Context Free Grammars: most likely parse? (tags NNP NNP V V P NP; constituents NP, PP, VP, VP, S)
…and beyond
Slides from Cohen & McCallum
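The two simplest techniques on this slide, lexicon membership and sliding windows, combine naturally: enumerate every window up to some length and test it against the lexicon. The sketch below assumes whitespace tokenization and an abbreviated list of state names; both function names are hypothetical.

```python
def sliding_windows(tokens, max_len=3):
    """Yield (start, end, window_tokens) for every window of 1..max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end, tokens[start:end]


def lexicon_matches(sentence, lexicon, max_len=3):
    """Return the text of every window that is a member of the lexicon."""
    tokens = sentence.rstrip(".").split()
    matches = []
    for _, _, window in sliding_windows(tokens, max_len):
        text = " ".join(window)
        if text in lexicon:
            matches.append(text)
    return matches


# Abbreviated lexicon of U.S. state names (the full list is elided on the slide).
states = {"Alabama", "Alaska", "Kentucky", "Wisconsin", "Wyoming"}
sentence = "Abraham Lincoln was born in Kentucky."
windows = list(sliding_windows(sentence.rstrip(".").split()))
found = lexicon_matches(sentence, states)
```

Replacing the membership test with a trained classifier over window features gives the “Sliding Window” technique proper; the window enumeration stays the same.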
Extraction by Sliding Window
E.g. looking for seminar location
GRAND CHALLENGES FOR MACHINE LEARNING
Jaime Carbonell
School of Computer Science
Carnegie Mellon University
3:30 pm
7500 Wean Hall
Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.
CMU UseNet Seminar Announcement
Slides from Cohen & McCallum
A “Naïve Bayes” Sliding Window Model [Freitag 1997]
… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …
prefix: w_{t-m} … w_{t-1}; contents: w_t … w_{t+n}; suffix: w_{t+n+1} … w_{t+n+m}
Estimate Pr(LOCATION|window) using Bayes rule
Try all “reasonable” windows (vary length, position)
Assume independence for length, prefix words, suffix words, content words
Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)
If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.
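The following is a minimal sketch of a Naive Bayes sliding-window scorer in the spirit of the slide, not a reimplementation of [Freitag 1997]. Assumptions made for the example: two classes (field vs. not field), every non-labeled window in a training document counts as a negative, one shared smoothed distribution over all feature types, and two invented training announcements.

```python
import math
from collections import Counter, defaultdict


def window_features(tokens, start, end, context=2):
    """Features of a candidate window: its length plus prefix, content and suffix words."""
    feats = [("len", end - start)]
    feats += [("prefix", w) for w in tokens[max(0, start - context):start]]
    feats += [("content", w) for w in tokens[start:end]]
    feats += [("suffix", w) for w in tokens[end:end + context]]
    return feats


def all_windows(tokens, max_len=4):
    """Enumerate (start, end) for every window of 1..max_len tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_len, len(tokens)) + 1):
            yield start, end


class NaiveBayesWindow:
    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)  # class -> feature -> count
        self.vocabulary = set()

    def train(self, documents):
        """documents: list of (tokens, (start, end)) with one labeled field span each."""
        for tokens, span in documents:
            for window in all_windows(tokens):
                label = window == span
                self.class_counts[label] += 1
                for feat in window_features(tokens, *window):
                    self.feature_counts[label][feat] += 1
                    self.vocabulary.add(feat)

    def log_joint(self, feats, label):
        """log Pr(label) + sum of log Pr(feature | label), with add-one style smoothing."""
        total = sum(self.feature_counts[label].values()) + len(self.vocabulary) + 1
        score = math.log(self.class_counts[label] / sum(self.class_counts.values()))
        for feat in feats:
            score += math.log((self.feature_counts[label][feat] + 1) / total)
        return score

    def posterior(self, tokens, start, end):
        """Pr(field | window) by Bayes rule over the two classes."""
        feats = window_features(tokens, start, end)
        pos, neg = self.log_joint(feats, True), self.log_joint(feats, False)
        return 1.0 / (1.0 + math.exp(neg - pos))

    def extract(self, tokens, threshold=0.5):
        """Return the best-scoring window if its posterior clears the threshold."""
        best = max(all_windows(tokens), key=lambda w: self.posterior(tokens, *w))
        if self.posterior(tokens, *best) >= threshold:
            return tokens[best[0]:best[1]]
        return None


# Hypothetical training announcements; the span (6, 10) marks the location tokens.
train_docs = [
    ("Time : 3:30 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split(), (6, 10)),
    ("Time : 1:00 pm Place : Doherty Hall Rm 1212 Speaker : Jaime Carbonell".split(), (6, 10)),
]
model = NaiveBayesWindow()
model.train(train_docs)
test_tokens = "Time : 2:00 pm Place : Wean Hall Rm 7500 Speaker : Tom Mitchell".split()
p_location = model.posterior(test_tokens, 6, 10)  # window "Wean Hall Rm 7500"
p_other = model.posterior(test_tokens, 0, 1)      # window "Time"
candidate = model.extract(test_tokens)
```

With only two training documents the posteriors are rough, so the threshold passed to `extract` would need tuning on real data; the point is the decomposition into length, prefix, content and suffix evidence.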
Slides from Cohen & McCallum
“Naïve Bayes” Sliding Window Results
Domain: CMU UseNet Seminar Announcements
Field: F1
Person Name: 30%
Location: 61%
Start Time: 98%
Slides from Cohen & McCallum
State of the Art Performance
Named entity recognition
Person, Location, Organization, …
F1 in high 80’s or low- to mid-90’s
Binary relation extraction
Contained-in (Location1, Location2)
Member-of (Person1, Organization1)
F1 in 60’s or 70’s or 80’s
Wrapper induction
Extremely accurate performance obtainable
Human effort (~30min) required on each site
Slides from Cohen & McCallum
CRF Extractor
Conditional Random Fields Model [Lafferty 01]
Attribute value extraction: sequential data labeling
CRF model for each attribute independently
2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
Relabel – filter false negative training examples
Preprocessor: Water_area
Classifier: Water_area; Land_area
Precision + Recall -
Pipeline – prune irrelevant sentences
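The relabel step can be sketched as a filter over one attribute's training data. The reading assumed here is that a sentence the preprocessor did not label with the attribute, but the sentence classifier does, is a likely false negative and is dropped from the negatives; the function name and label format are hypothetical.

```python
def relabel_negatives(sentences, preprocessor_labels, classifier_labels, attribute):
    """Build training sets for one attribute's extractor, dropping likely false negatives.

    preprocessor_labels and classifier_labels hold one set of attribute names per sentence.
    """
    positives, negatives = [], []
    for sentence, heuristic, predicted in zip(sentences, preprocessor_labels, classifier_labels):
        if attribute in heuristic:
            positives.append(sentence)
        elif attribute in predicted:
            continue  # classifier disagrees with the heuristic: likely a false negative
        else:
            negatives.append(sentence)
    return positives, negatives


sentences = [
    "2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.",
    "Its county seat is Clearfield.",
]
preprocessor_labels = [{"water_area"}, {"seat"}]
classifier_labels = [{"water_area", "land_area"}, {"seat"}]
land_pos, land_neg = relabel_negatives(
    sentences, preprocessor_labels, classifier_labels, "land_area")
```

In the slide's example the area sentence was matched only to Water_area by the preprocessor, so without this filter it would be used as a negative example when training the Land_area extractor.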
Infobox Generation Experiments
2007.02.06 Wikipedia Dump Data
Dataset
4 popular classes:
U.S. County (1245), Actor (3819), Airline (791), University (4025)
50 random test articles per class
Kylin performance
Kylin performance (detailed view)
U.S.County (better than manual labeling)
Strict expression
Number-typed
Abbeville County is a county located in the U.S. state of South Carolina.
The county has a total area of 2,988 square kilometers (1,154 mi²). 2,972 km² (1,147 mi²) of it is land and 17 km² (7 mi²) of it (0.56%) is water.
Kylin performance (detailed view)
Former U.S. President Dwight D. Eisenhower served as President of the University.
The College began first in 1855 as a one room schoolhouse.
UCL was founded in 1826 under the name “University of London”.
The college opened in 1973 with the Charlestown campus.
University (worse than manual labeling)
Flexible expression:
Global context:
Implicit: e.g. student counts at 3 campuses sum to the total student number
Effect of Relabel, Pipeline
Default Project
Reimplement Kylin (or build on Fei’s code)
Improve it
See how much information we can extract
Post on web: DBpedia
Merge back into Wikipedia?
Bot issues
Associate javascript
Extraction from the Greater WWW
Self-verify accuracy by external extraction
Add infobox facts which are missing from articles
Extensions
Semi-automated bot interface
Firefox plugin
Displays improved infobox – user checks & says ok
Safer than a bot
For general Wikipedia authors
Extraction in real-time & error checking
Attribute values
Guide towards best schema & attribute
Typing & microformats
Extensions
Other Wikipedia issues
Learn author reputation
Watch for changes
Look for framing or biased language
Recognize vandalism
Auto-generate disambiguation pages
Extract events & create a timeline view
Citation assistance: identify correspondence between text and citation
Semiautomatic article generation
Extensions
Where could this be applied besides Wikipedia?
Broader Questions
Internet enables generation of structured content
How to integrate methods? Overwrite, training data, ???