Instructor: Alan Ritter CSE 5539: Web Information Extraction.

Post on 28-Dec-2015


Motivation

• Data Analytics / Big Data
  – Companies have lots of data lying around
  – Computing cycles are cheap
  – Using data to get insights:
    • Business, Healthcare, Science, Government, Politics
• Challenge: Most of the world’s data is unstructured
  – Text
  – Speech
  – Images

Structured Data

Bigger Unstructured Data

Extracting Knowledge from Text

The Web News

Text Extractors

Structured Data

Example: Information Extraction from Twitter

“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250”

PRODUCT RELEASE

COMPANY   PRODUCT  DATE      PRICE  REGION
Nintendo  3DS      March 27  $250   North America

Example: Information Extraction from Twitter

Samsung Galaxy S5 Coming to All Major U.S. Carriers Beginning April 11th

PRODUCT RELEASE

COMPANY   PRODUCT    DATE      PRICE  REGION
Samsung   Galaxy S5  April 11  ?      U.S.
Nintendo  3DS        March 27  $250   North America
…         …          …         …      …

News

Example Applications

• Question Answering / Structured Queries
  – Which companies are releasing new smartphones / new products in Europe this Spring?
  – Alert me anytime a new smartphone is announced in the U.S.
• Data Mining
  – Analyze trends in product releases across different industries
  – Is there a correlation between price and date of release?

Knowledge Graphs: Things, not strings!

[Figure: example knowledge graph]
CSE 5539 -- Course offered at -- Ohio State Univ.
CSE 5539 -- Instructor -- Alan Ritter
Ohio State Univ. -- Located In -- Columbus OH

Data Sources

Available Data Sources

All of these databases are sparsely populated and out of date. We need to extract this type of knowledge from text!!!!

Traditional Information Extraction

Example Text from MUC-4 (1992) [Cowie and Wilks]

Example Output from MUC-4 (1992) [Cowie and Wilks]

Approaches

• Initially: Rule Based
  – Basically just write a bunch of regular expressions
• Machine Learning (Freitag 1998), (Soderland 1999), (Mooney 1999)
  – Annotate training / dev / test documents
  – Train machine learning models
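
To make the rule-based approach concrete, here is a minimal sketch in Python that fills a few slots of the PRODUCT RELEASE schema with hand-written regular expressions. The patterns, the slot names in `PATTERNS`, and the `extract` helper are assumptions made up for illustration; they are not taken from any system discussed in the course.

```python
import re

TWEET = ("Yess! Yess! Its official Nintendo announced today that they Will "
         "release the Nintendo 3DS in north America march 27 for $250")

MONTHS = ("january|february|march|april|may|june|july|august|"
          "september|october|november|december")

# Hypothetical hand-written patterns, one per slot of the schema.
PATTERNS = {
    "DATE": re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}}\b", re.IGNORECASE),
    "PRICE": re.compile(r"\$\d+(?:\.\d{2})?"),
    "REGION": re.compile(r"\b(?:north america|europe|u\.s\.)", re.IGNORECASE),
}


def extract(text):
    """Return the first match for each slot, or None when a pattern misses."""
    record = {}
    for slot, pattern in PATTERNS.items():
        match = pattern.search(text)
        record[slot] = match.group(0) if match else None
    return record


print(extract(TWEET))
```

Rules like these are brittle: on the Samsung headline ("... Beginning April 11th") the DATE pattern above returns nothing, because the trailing word boundary fails before "th". Each new phrasing needs another hand-written rule, which is part of the motivation for the machine learning approaches.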

Extraction by Sliding Window

GRAND CHALLENGES FOR MACHINE LEARNING

Jaime Carbonell School of Computer Science Carnegie Mellon University

3:30 pm 7500 Wean Hall

Machine learning has evolved from obscurity in the 1970s into a vibrant and popular discipline in artificial intelligence during the 1980s and 1990s. As a result of its success and growth, machine learning is evolving into a collection of related disciplines: inductive concept acquisition, analytic learning in problem solving (e.g. analogy, explanation-based learning), learning theory (e.g. PAC learning), genetic algorithms, connectionist learning, hybrid systems, and so on.

CMU UseNet Seminar Announcement

E.g. looking for seminar location

[Slide from William Cohen]

A “Naïve Bayes” Sliding Window Model [Freitag 1997]

… 00 : pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun …

prefix: w_{t-m} … w_{t-1}    contents: w_t … w_{t+n}    suffix: w_{t+n+1} … w_{t+n+m}

Estimate Pr(LOCATION|window) using Bayes rule

Try all “reasonable” windows (vary length, position)

Assume independence for length, prefix words, suffix words, content words

Estimate from data quantities like: Pr(“Place” in prefix|LOCATION)

If P(“Wean Hall Rm 5409” = LOCATION) is above some threshold, extract it.

[Slide from William Cohen]
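
The sliding-window recipe above can be sketched in a few lines of Python. This is a simplified illustration, not Freitag's actual model: it pools the length, prefix, content, and suffix features into a single smoothed multinomial, and the training announcements, the window sizes `M` and `MAX_LEN`, and all function names are assumptions invented for the example.

```python
import math
from collections import Counter

M = 2        # prefix/suffix width in tokens (assumption for this sketch)
MAX_LEN = 4  # longest window treated as "reasonable" (assumption)

# Hypothetical toy training data: (tokens, (start, end)) of the gold LOCATION.
EXAMPLES = [
    ("Time : 3:30 pm Place : Wean Hall Rm 5409 Speaker : Sebastian Thrun".split(), (6, 10)),
    ("Place : Doherty Hall Rm 1112 Speaker : Jaime Carbonell".split(), (2, 6)),
]


def windows(tokens):
    """Enumerate every candidate window (start, end) up to MAX_LEN tokens."""
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + MAX_LEN, len(tokens)) + 1):
            yield start, end


def features(tokens, start, end):
    """Length, prefix words, content words, suffix words of one window."""
    feats = [("len", end - start)]
    feats += [("prefix", w.lower()) for w in tokens[max(0, start - M):start]]
    feats += [("content", w.lower()) for w in tokens[start:end]]
    feats += [("suffix", w.lower()) for w in tokens[end:end + M]]
    return feats


def train(examples):
    """Count features separately for LOCATION windows and all other windows."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    vocab = set()
    for tokens, gold in examples:
        for start, end in windows(tokens):
            label = (start, end) == gold
            totals[label] += 1
            for feat in features(tokens, start, end):
                counts[label][feat] += 1
                vocab.add(feat)
    return counts, totals, vocab


def posterior(model, tokens, start, end):
    """Pr(LOCATION | window) via Bayes rule with add-one smoothing."""
    counts, totals, vocab = model
    log_score = {}
    for label in (True, False):
        denom = sum(counts[label].values()) + len(vocab)
        score = math.log(totals[label] / (totals[True] + totals[False]))
        for feat in features(tokens, start, end):
            score += math.log((counts[label][feat] + 1) / denom)
        log_score[label] = score
    top = max(log_score.values())
    unnorm = {k: math.exp(v - top) for k, v in log_score.items()}
    return unnorm[True] / (unnorm[True] + unnorm[False])


def rank_windows(model, tokens, top_k=3):
    """Score all windows and return the top_k as (probability, text) pairs."""
    scored = [(posterior(model, tokens, s, e), " ".join(tokens[s:e]))
              for s, e in windows(tokens)]
    return sorted(scored, reverse=True)[:top_k]


model = train(EXAMPLES)
# Hypothetical unseen announcement line.
for prob, text in rank_windows(model, "Place : Wean Hall Rm 7500 Speaker : Tom Mitchell".split()):
    print(f"{prob:.3f}  {text}")
```

To turn the ranking into an extractor, keep only the windows whose probability exceeds a chosen threshold. With this little training data the probabilities are not meaningful in absolute terms, so the threshold would have to be tuned on held-out documents.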

“Naïve Bayes” Sliding Window Results

Domain: CMU UseNet Seminar Announcements

Field        F1
Person Name  30%
Location     61%
Start Time   98%

[Slide from William Cohen]

IE with Hidden Markov Models

Given a sequence of observations:

Yesterday Pedro Domingos spoke this example sentence.

and a trained HMM (states: person name, location name, background):

Find the most likely state sequence (Viterbi): $\arg\max_{s} P(s, o)$

Any words said to be generated by the designated “person name” state extract as a person name:

Person name: Pedro Domingos

[Slide from William Cohen]
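
As a rough illustration of the Viterbi step, the sketch below decodes the example sentence with a three-state HMM. The start, transition, and emission probabilities are hand-set, hypothetical numbers chosen for the example (a real model would estimate them from annotated data), and `UNSEEN` is an assumed floor probability for words a state has never emitted.

```python
import math

STATES = ["background", "person", "location"]

# Hypothetical, hand-set parameters; a trained HMM would estimate these.
START = {"background": 0.8, "person": 0.1, "location": 0.1}
TRANS = {
    "background": {"background": 0.8, "person": 0.1, "location": 0.1},
    "person": {"background": 0.4, "person": 0.5, "location": 0.1},
    "location": {"background": 0.4, "person": 0.1, "location": 0.5},
}
EMIT = {
    "background": {"yesterday": 0.2, "spoke": 0.2, "this": 0.2,
                   "example": 0.2, "sentence": 0.1},
    "person": {"pedro": 0.4, "domingos": 0.4},
    "location": {"seattle": 0.5},
}
UNSEEN = 1e-4  # assumed floor probability for unseen (state, word) pairs


def log_emit(state, word):
    return math.log(EMIT[state].get(word.lower(), UNSEEN))


def viterbi(words):
    """Return the most likely state sequence, argmax_s P(s, o)."""
    # Each column maps state -> (best log prob of a path ending here, previous state).
    table = [{s: (math.log(START[s]) + log_emit(s, words[0]), None)
              for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev = max(STATES,
                       key=lambda p: table[-1][p][0] + math.log(TRANS[p][s]))
            score = table[-1][prev][0] + math.log(TRANS[prev][s])
            column[s] = (score + log_emit(s, word), prev)
        table.append(column)
    # Follow the back-pointers from the best final state.
    state = max(STATES, key=lambda s: table[-1][s][0])
    path = [state]
    for column in reversed(table[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


def extract(words, target="person"):
    """Join the words that Viterbi assigns to the target state."""
    return " ".join(w for w, s in zip(words, viterbi(words)) if s == target)


sentence = "Yesterday Pedro Domingos spoke this example sentence".split()
print(extract(sentence))
```

Working in log space avoids numerical underflow on long sentences, since the products of many small probabilities become sums.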

Finite State Models

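[Figure: relationships among finite state models]
Naïve Bayes -> (Conditional) -> Logistic Regression
HMMs -> (Conditional) -> Linear-chain CRFs
Generative directed models -> (Conditional) -> General CRFs
Naïve Bayes -> (Sequence) -> HMMs -> (General Graphs) -> Generative directed models
Logistic Regression -> (Sequence) -> Linear-chain CRFs -> (General Graphs) -> General CRFs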

Various Annotated Datasets for Event / Relation Extraction

• ACE
  – Automatic Content Extraction
  – Newswire
  – Successor to MUC
• GENIA
  – Medline abstracts
  – Similar extraction task in the Biomedical domain

Schemas -> Triples

“Yess! Yess! Its official Nintendo announced today that they Will release the Nintendo 3DS in north America march 27 for $250”

PRODUCT RELEASE

COMPANY   PRODUCT  DATE      PRICE  REGION
Nintendo  3DS      March 27  $250   North America

Relation Extraction:

Manufacturer(3DS, Nintendo)
ReleaseDate(3DS, March 27)
Price(3DS, $250)
…
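
The step from a schema row to binary relations can be sketched as follows. The `SLOT_TO_RELATION` mapping and the `record_to_triples` helper are hypothetical names for this example; only the three relations listed on the slide are mapped, and the REGION slot is deliberately left unmapped.

```python
# One filled PRODUCT RELEASE record, as in the table on the slide.
RECORD = {
    "COMPANY": "Nintendo",
    "PRODUCT": "3DS",
    "DATE": "March 27",
    "PRICE": "$250",
    "REGION": "North America",
}

# Hypothetical mapping from schema slots to relation names.
SLOT_TO_RELATION = {
    "COMPANY": "Manufacturer",
    "DATE": "ReleaseDate",
    "PRICE": "Price",
}


def record_to_triples(record, subject_slot="PRODUCT"):
    """Turn one schema row into (relation, subject, object) triples.

    Slots that are missing or unknown ("?") produce no triple.
    """
    subject = record[subject_slot]
    return [
        (relation, subject, record[slot])
        for slot, relation in SLOT_TO_RELATION.items()
        if record.get(slot) not in (None, "?")
    ]


for relation, subject, obj in record_to_triples(RECORD):
    print(f"{relation}({subject}, {obj})")
```

One advantage of triples over a fixed schema is visible in the Samsung row, where the price is unknown: the missing slot simply yields no `Price` triple rather than an empty cell.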

Open Information Extraction (Banko et al. 2007)

Demo (TextRunner)

• http://openie.allenai.org/


Distant (weak) Supervision for Relation Extraction, e.g. [Mintz et al. 2009]

Person           Birth Location
Barack Obama     Honolulu
Mitt Romney      Detroit
Albert Einstein  Ulm
Nikola Tesla     Smiljan
…                …

“Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...”

“Birth notices for Barack Obama were published in the Honolulu Advertiser…”

“Born in Honolulu, Barack Obama went on to become…”…

(Barack Obama, Honolulu)

(Mitt Romney, Detroit)

(Albert Einstein, Ulm)
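
The labeling heuristic behind distant supervision can be sketched like this: any sentence that mentions both entities of a known pair is treated as a positive training example for the relation. The relation name `BirthLocation`, the `label_sentences` helper, and the last sentence in `SENTENCES` are assumptions added for illustration; exact string matching stands in for real entity recognition and linking.

```python
# Seed knowledge base of (person, birth location) pairs from the table.
KB = [
    ("Barack Obama", "Honolulu"),
    ("Mitt Romney", "Detroit"),
    ("Albert Einstein", "Ulm"),
    ("Nikola Tesla", "Smiljan"),
]

SENTENCES = [
    "Barack Obama was born on August 4, 1961 at … in the city of Honolulu ...",
    "Birth notices for Barack Obama were published in the Honolulu Advertiser…",
    "Born in Honolulu, Barack Obama went on to become…",
    # Hypothetical sentence that mentions a person but no known birth location.
    "Albert Einstein published four papers in 1905.",
]


def label_sentences(kb, sentences, relation="BirthLocation"):
    """Label every sentence containing both entities of a KB pair."""
    labeled = []
    for sentence in sentences:
        for person, place in kb:
            if person in sentence and place in sentence:
                labeled.append((relation, person, place, sentence))
    return labeled


for relation, person, place, sentence in label_sentences(KB, SENTENCES):
    print(f"{relation}({person}, {place}) <- {sentence}")
```

The labels are noisy by construction. The second sentence matches only because "Honolulu" appears inside the newspaper name "Honolulu Advertiser", yet it is still labeled as a positive example, so the downstream classifier has to tolerate such false positives.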

Demo (NELL)

• http://rtw.ml.cmu.edu/rtw/kbbrowser/

Demo (Literome)

• http://literome.azurewebsites.net/

Knowledge Base Population Subtasks

• Entity Recognition/Classification/Linking
• Relation Extraction
• Event Extraction
• Knowledge Base Inference

Applications

• Google knowledge graph
• Facebook graph search
• Biomedical knowledge bases
• -> Your application domain here
  – Geoscience knowledge graph?
  – Patent knowledge graph?
  – Cybersecurity knowledge graph?

Research Groups at Other Places

Why learn about this stuff?

Paper Selection Form! (please fill out before next class)

https://goo.gl/AghZ1f

Administrative Details

• Course Webpage
  – http://aritter.github.io/courses/5539_fall15.html