© 2006 IBM Corporation

Towards Declarative Information Extraction: The Almaden Story

Shivakumar Vaithyanathan, IBM Almaden

Acknowledgements to: Frederick Reiss, Rajasekar Krishnamurthy, Sriram Raghavan and Huaiyu Zhu


Lots of Text, Many Applications!

Free-text, semi-structured, streaming …

– Web pages, emails, news articles, call-center records, business reports, spreadsheets, research papers, blogs, wikis, tags, instant messages, …

High-impact applications

– Business intelligence, personal information management, enterprise search, Web communities, Web search and advertising, scientific data management, e-government, medical records management, …

Growing rapidly

– Just look at your inbox!

(Adapted from SIGMOD ’06 tutorial by Ramakrishnan, Doan, and Vaithyanathan)


Information Extraction

Distill structured data from unstructured and semi-structured text

For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..

(from Cohen’s IE tutorial, 2003)

Select Name From PEOPLE Where Organization = 'Microsoft'

Result:
Bill Gates
Bill Veghte

Exploit the extracted data in your applications
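To make the "exploit the extracted data" step concrete, here is a minimal Python sketch that loads the three extracted rows from the slide into an in-memory SQLite table and runs the slide's query. The table layout, the column names, and the helper name `names_at` are assumptions made for illustration; the truncated organization value is kept exactly as it appears on the slide.

```python
import sqlite3

# Rows copied from the slide's example table. The third organization is
# truncated on the slide ("Free Soft..") and is kept as shown.
EXTRACTED_PEOPLE = [
    ("Bill Gates", "CEO", "Microsoft"),
    ("Bill Veghte", "VP", "Microsoft"),
    ("Richard Stallman", "Founder", "Free Soft.."),
]


def names_at(organization):
    """Load the extracted tuples into an in-memory table and query it."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE people (name TEXT, title TEXT, organization TEXT)"
        )
        conn.executemany("INSERT INTO people VALUES (?, ?, ?)", EXTRACTED_PEOPLE)
        rows = conn.execute(
            "SELECT name FROM people WHERE organization = ? ORDER BY rowid",
            (organization,),
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


print(names_at("Microsoft"))
```

Once extraction has produced tuples, the application side is ordinary relational querying; the hard part is producing the tuples, which is the subject of the rest of the talk.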


Historical Perspective

MUC (Message Understanding Conference) – 1987 to 1997

– Competition-style conferences organized by DARPA

– Shared data sets and performance metrics
  • News articles, radio transcripts, military telegraphic messages

Classical IE Tasks

– Entity and Relationship/Link extraction

– Event detection, sentiment mining etc.

– Entity resolution/matching

Several IE systems were built during this period

– FRUMP [DeJong82], CIRCUS/AutoSlog [Riloff93], FASTUS [Appelt96], LaSIE/GATE, TextPro, PROTEUS, OSMX [Embley05]

Not the focus of this talk


Project Avatar - Evolution of IE: The Almaden Story

2004: Custom Code
2005: RAP (CPSL-style cascading grammar system)
2006: RAP++ (RAP + extensions outside the scope of grammars)
2007: System T (algebraic information extraction system)

Evolutionary triggers
– Large number of annotators
– Diverse data sets, complex extraction tasks
– Performance, expressivity


From: Michael D. Baselice <[email protected]>

…. Enron is in search of support with the electric deregulation fight out there. Dick Woodward or Dave Fogarty can be reached at 650-340-0470. If you have any questions, please call. .…

From: Tom Briggs <[email protected]>

call Jeff Dasovich at 415-782-7822. He is an excellent source and has a long, sordid histroy in california's Energy market.

<Person> “(\s+at\s+)|((\w+\s+){1,3}reached\s+at\s+)” <PhoneNumber>

Circa 2004: Custom Code

Extract Person’s Phone Number

Example emails from the Enron collection
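The pattern on this slide can be tried at the character level with a short Python sketch. The `PERSON` and `PHONE` sub-patterns below are simplified assumptions (the slide treats `<Person>` and `<PhoneNumber>` as existing annotations), and a leading `\s+` is added to the second alternative because this sketch matches raw characters rather than tokens.

```python
import re

# Assumed stand-ins for the <Person> and <PhoneNumber> annotators.
PERSON = r"(?P<person>[A-Z][a-z]+\s[A-Z][a-z]+)"
PHONE = r"(?P<phone>\d{3}-\d{3}-\d{4})"
# The connector from the slide: "at", or up to three words then "reached at".
CONNECTOR = r"(?:\s+at\s+|\s+(?:\w+\s+){1,3}reached\s+at\s+)"
PERSON_PHONE = re.compile(PERSON + CONNECTOR + PHONE)

EMAIL_1 = (
    "Enron is in search of support with the electric deregulation fight "
    "out there. Dick Woodward or Dave Fogarty can be reached at "
    "650-340-0470. If you have any questions, please call."
)
EMAIL_2 = "call Jeff Dasovich at 415-782-7822. He is an excellent source"


def person_phones(text):
    """Return (person, phone) pairs matched by the combined pattern."""
    return [(m.group("person"), m.group("phone"))
            for m in PERSON_PHONE.finditer(text)]


print(person_phones(EMAIL_1))
print(person_phones(EMAIL_2))
```

On the first email this sketch pairs the number only with Dave Fogarty: Dick Woodward is more than three words away from "reached", so a single pattern misses him. That limitation resurfaces later in the talk as the need to retain overlapping results.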


Circa 2005: Moving to RAP

Trigger: Large number of annotators

As the number of annotators increased, custom code was not an option. We needed a cleaner abstraction.

CPSL: Nice abstraction for specifying rules over annotations and text.

RAP (CPSL-style cascading grammar system): Almaden CPSL execution engine


Cascading Grammars

Multiple levels of grammar rules to perform extraction

At each level, rules are written in the formalism of finite-state grammars

– Input text viewed as a sequence of tokens

– Rules expressed as regular expression patterns over the lexical features of these tokens

Common Pattern Specification Language (Appelt & Onyshkevych, 1998)

– A language specification to abstract rules and matching semantics from specific implementations

– Several known implementations
  • TextPro: reference implementation of CPSL by Doug Appelt
  • JAPE (Java Annotation Pattern Engine)
    – Part of the GATE NLP framework
    – Under active consideration for commercial use by several companies


Cascading Grammar Reality

Set of simple grammar rules for person name recognition

Pre-processing step: Tokenization of the document text

  Tokenize(Document Text) → Sequence of <Token>

Level 1: Rules that look for patterns in each token to produce corresponding annotations

  Token[~ “Mr. | Mrs. | Dr. | …”] → Salutation
  Token[~ “Ph.D | MBA | …”] → Qualification
  Token[~ “[A-Z][a-z]*”] → CapsWord
  Token[~ “Michael | Richard | Smith | …”] → PersonDict

Level 2: Rules that look for patterns involving Level-1 annotations to identify Persons

  PersonDict PersonDict → Person (e.g., Richard Smith)
  Salutation CapsWord CapsWord → Person (e.g., Dr. Laura Haas)
  CapsWord CapsWord Token[~“,”]? Qualification → Person (e.g., Laura Haas, Ph.D)
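The two-level cascade can be sketched in a few lines of Python. This is an illustrative toy, not RAP or any CPSL engine: the tokenizer, the `Comma` label, and the choice to report every rule match (rather than apply CPSL's matching semantics) are assumptions, and the optional comma in the third rule is expanded into two rules.

```python
import re

SALUTATIONS = {"Mr.", "Mrs.", "Dr."}
QUALIFICATIONS = {"Ph.D", "MBA"}
PERSON_DICT = {"Michael", "Richard", "Smith"}


def tokenize(text):
    """Pre-processing: split on whitespace, keeping commas as separate tokens."""
    return re.findall(r"[^\s,]+|,", text)


def level1(token):
    """Level 1: label a single token by its lexical features."""
    labels = set()
    if token in SALUTATIONS:
        labels.add("Salutation")
    if token in QUALIFICATIONS:
        labels.add("Qualification")
    if re.fullmatch(r"[A-Z][a-z]*", token):
        labels.add("CapsWord")
    if token in PERSON_DICT:
        labels.add("PersonDict")
    if token == ",":
        labels.add("Comma")
    return labels


# Level 2: each rule is a sequence of Level-1 labels that yields a Person.
LEVEL2_RULES = [
    ["PersonDict", "PersonDict"],
    ["Salutation", "CapsWord", "CapsWord"],
    ["CapsWord", "CapsWord", "Comma", "Qualification"],
    ["CapsWord", "CapsWord", "Qualification"],
]


def find_persons(text):
    """Return every token sequence that matches some Level-2 rule."""
    tokens = tokenize(text)
    labels = [level1(t) for t in tokens]
    persons = []
    for start in range(len(tokens)):
        for rule in LEVEL2_RULES:
            window = labels[start:start + len(rule)]
            if len(window) == len(rule) and all(
                wanted in have for wanted, have in zip(rule, window)
            ):
                persons.append(tuple(tokens[start:start + len(rule)]))
    return persons


print(find_persons("Richard Smith"))
print(find_persons("Dr. Laura Haas"))
print(find_persons("Laura Haas, Ph.D"))
```

Running it on "Dr. Richard Smith" yields two overlapping Person matches (one from the Salutation rule, one from the dictionary rule), which previews the overlap problems discussed on the following slides.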


Applying RAP to more complex tasks

Complex tasks

– Extracting signature from emails


RAP: Problems encountered

1. Lack of support for complex operations
   Aggregation, dictionary evaluation, combining annotations with character-level regular expressions

2. Handling overlapping output annotations
   Multiple rules for a concept may generate overlapping annotations
   Sometimes, final results may overlap

3. Overlapping input annotations
   Input to a grammar is a linear sequence of annotations

4. Performance
   Annotators for complex tasks are very expensive to execute


Complex operations

Example signature block:

  Laura Haas, PhD
  Distinguished Engineer and Director, Computer Science
  Almaden Research Center
  408-927-1700
  http://www.almaden.ibm.com/cs

Annotations in the block: Person, Organization, Phone, URL

Signature constraints:

– Start with Person
– At least 1 Phone
– At least 2 of {Phone, Organization, URL, Email, Address}
– End with one of these.
– Within 250 characters


Approximate Signature Extraction using a Grammar

Macro:

Contact = Phone|Organization|URL| Email|Address

Rule:

Person (.{,125} Contact){2,?} → Signature

Problems:

– Cannot guarantee at least 1 Phone.
  • Leads to false positives

– Cannot enforce the limit of 250 on the total character count.
  • Leads to both false positives and false negatives.

Expressing aggregations is hard in grammars
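For contrast, the signature constraints are easy to state once counting is available. The Python sketch below checks them over a list of `(label, begin, end)` spans; the span representation and the function name `is_signature` are assumptions made for illustration, not part of RAP or System T.

```python
from collections import Counter

CONTACT_TYPES = {"Phone", "Organization", "URL", "Email", "Address"}

SIGNATURE_TEXT = (
    "Laura Haas, PhD\n"
    "Distinguished Engineer and Director, Computer Science\n"
    "Almaden Research Center\n"
    "408-927-1700\n"
    "http://www.almaden.ibm.com/cs"
)


def span(label, phrase, text=SIGNATURE_TEXT):
    """Build a (label, begin, end) span by locating phrase in text."""
    begin = text.index(phrase)
    return (label, begin, begin + len(phrase))


def is_signature(annotations, max_chars=250):
    """Check the slide's constraints on one candidate group of spans."""
    if not annotations:
        return False
    ordered = sorted(annotations, key=lambda a: a[1])
    first, last = ordered[0], ordered[-1]
    counts = Counter(label for label, _, _ in ordered if label in CONTACT_TYPES)
    total_chars = max(end for _, _, end in ordered) - first[1]
    return (
        first[0] == "Person"                 # start with Person
        and last[0] in CONTACT_TYPES         # end with a contact item
        and counts["Phone"] >= 1             # at least 1 Phone
        and sum(counts.values()) >= 2        # at least 2 contact items
        and total_chars <= max_chars         # within 250 characters
    )


candidate = [
    span("Person", "Laura Haas"),
    span("Organization", "Almaden Research Center"),
    span("Phone", "408-927-1700"),
    span("URL", "http://www.almaden.ibm.com/cs"),
]
print(is_signature(candidate))
```

The count and length checks are exactly the parts that the grammar rule above can only approximate.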


Circa 2006: Moved to RAP++

– Modules for aggregation (e.g., Count)


Grammar-based systems: Problems encountered

1. Lack of support for complex operations
   Aggregation, dictionary evaluation, combining annotations with character-level regular expressions

2. Handling overlapping output annotations
   All overlapping annotations need to be retained
   Overlapping annotations need to be consolidated

3. Overlapping input annotations
   Input to a CPSL grammar is a linear sequence of annotations

4. Performance
   Annotators for complex tasks are expensive to execute
   It takes 9 hours to execute the Band Review annotator over 4.5 million blogs


Issue 2 – All overlapping annotations need to be retained

Valid results may overlap

Dick Woodward or Dave Fogarty can be reached at 650-340-0470.

Work-around: Multiple rules

The number of rules can explode!


Circa 2006: Moved to RAP++

– Modules for aggregation (e.g., Count)

– Workarounds to retain all overlapping annotations


Overlapping annotations may need to be consolidated

… Dr. John Smith …

Rules that match overlapping spans of this text:

– Salutation PersonDict
– PersonDict PersonDict

Two spans need to be merged

Consolidation cannot be expressed in a grammar
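Outside a grammar, merging overlapping spans takes only a few lines. The Python sketch below unions overlapping `(begin, end)` character spans; reading the two rules as matching "Dr. John" and "John Smith" is an assumption about the slide's figure, and the union policy is only one possible consolidation rule.

```python
def consolidate(spans):
    """Merge overlapping (begin, end) spans into their union; end is exclusive."""
    merged = []
    for begin, end in sorted(spans):
        if merged and begin < merged[-1][1]:
            # Overlaps the previous span: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


def find_span(text, phrase):
    """Locate phrase in text and return its (begin, end) span."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase))


TEXT = "... Dr. John Smith ..."
salutation_match = find_span(TEXT, "Dr. John")     # assumed Salutation PersonDict match
dictionary_match = find_span(TEXT, "John Smith")   # assumed PersonDict PersonDict match

merged = consolidate([salutation_match, dictionary_match])
print([TEXT[b:e] for b, e in merged])
```

Spans that merely touch or are disjoint are left alone, so unrelated Person matches elsewhere in the document are unaffected.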


Circa 2006: Moved to RAP++

– Modules for aggregation (e.g., Count)

– Consolidation modules

– Workarounds to generate overlapping output


Applying RAP++ to diverse data sets

Two severe problems surfaced

– Because of complex workflows, results were largely dictated by low-level decisions

– Performance


…….I went to see the OTIS concert last night. T’ was SO MUCH FUN I really had a blast …

….there were a bunch of other bands …. I loved STAB (….). they were a really weird ska band and people were running around and …

Weblogs: Identify Bands and Reviews


[Figure: a Concert annotation identified by a concert instance pattern, and a Review annotation formed as a Block of several review patterns]


Outline of the BandReview Extraction Task

BandReview (example text)

went to the Switchfoot concert at the Roxy. It was pretty fun,… The lead singer/guitarist was really good, and even though there was another guitarist (an Asian guy), he ended up playing most of the guitar parts, which was really impressive. The biggest surprise though is that I actually liked the opening bands. …I especially liked the first band

ConcertInstance

– “went to the Switchfoot concert at the Roxy”
– “went to AJCO n Band concert”
– “performance by local funk band Saaraba”

ReviewInstance

– “lead singer/guitarist was really good”
– “Liked the opening bands”
– “Liked the first band”
– “Kurt Ralske played guitar”

ReviewGroup

– “Lead singer/guitarist was really good, and even … I actually liked the opening bands. … Well they were none of those. I especially liked the first band”


Example low-level decision

Example rule from the Band Review

  <ProperNoun> <within 30 characters> <Instrument>

– ProperNoun: regular expression match, [A-Z]\w+(\s[A-Z]\w+)?
– Instrument: dictionary match, d1|d2|…dn

Example text

  John Pipe plays the guitar
  Marco Benevento on the Hammond organ

[Figure: three ProperNoun and three Instrument annotations over these two phrases, some of them overlapping]


Sequencing Overlapping Annotations

Possible options

– Pre-specified disambiguation rules (e.g., pick earlier annotation)

– Supply tie-breaking rules for every possible overlap scenario

– Let implementation make an internal non-deterministic choice (as in JAPE, RAP, ..)

John Pipe plays the guitar

Two possible sequencings of the overlapping ProperNoun and Instrument annotations:

– ProperNoun Token Token Instrument
– Token Instrument Token Token Instrument

Which of the two should we pick?

Over 4.5M blog entries, a choice one way or the other on a single rule would change the number of annotations by +/- 25%.
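The two sequencings can be derived mechanically. The Python sketch below enumerates every maximal set of non-overlapping annotations and renders uncovered tokens as `Token`. The token-index spans (ProperNoun over "John Pipe", Instrument over "Pipe" and over "guitar") are an assumed encoding of the slide's example, and brute-force enumeration is used only because the input is tiny.

```python
from itertools import combinations

TOKENS = ["John", "Pipe", "plays", "the", "guitar"]
# (label, first_token, end_token_exclusive); an assumed encoding of the slide.
ANNOTATIONS = [("ProperNoun", 0, 2), ("Instrument", 1, 2), ("Instrument", 4, 5)]


def overlaps(a, b):
    """True when two token spans share at least one token."""
    return a[1] < b[2] and b[1] < a[2]


def compatible(subset):
    """True when no two annotations in the subset overlap."""
    return all(not overlaps(a, b) for a, b in combinations(subset, 2))


def render(tokens, subset):
    """Turn a non-overlapping subset into a linear label sequence."""
    starts = {a[1]: a for a in subset}
    labels, i = [], 0
    while i < len(tokens):
        if i in starts:
            labels.append(starts[i][0])
            i = starts[i][2]
        else:
            labels.append("Token")
            i += 1
    return labels


def sequencings(tokens, annotations):
    """Return the label sequence for every maximal non-overlapping subset."""
    results = []
    for size in range(len(annotations), -1, -1):
        for subset in combinations(annotations, size):
            if not compatible(subset):
                continue
            unused = [a for a in annotations if a not in subset]
            # Skip subsets that could still absorb another annotation.
            if any(compatible(subset + (a,)) for a in unused):
                continue
            results.append(render(tokens, subset))
    return results


for labels in sequencings(TOKENS, ANNOTATIONS):
    print(" ".join(labels))
```

Either sequence is a legitimate input to the next grammar level, which is why an arbitrary internal choice can shift results so much.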


Performance

Token[~ “John | Smith | …”]+ → Name

Token[~ “[1-9]\d{2}-\d{4}”] → Phone

Name Token[~ “at”] Phone → PersonPhone

[Figure: the same document of filler text shown after each grammar pass: … John Smith at 555-1212 …, then … <Name> at 555-1212 …, then … <Name> at <Phone> …, then … <PersonPhone> …]

Each level in a cascading grammar looks at each character in each document


2007: System T

Scalable algebraic information extraction system


Brief introduction to algebra for Intra-document IE

Each Operator in the algebra…

– …operates on one or more tuples of annotations

– …produces tuples of annotations

“Document at a time” execution model

– Algebra expression is defined over
  • the current document d
  • annotations defined over d

Algebra expression is evaluated over each document in the corpus individually


Basic Single-Argument Operator

[Figure: an Operator, configured by Parameters, takes an Input Tuple (Document) and produces Output Tuple 1 (Document, Annotation 1) and Output Tuple 2 (Document, Annotation 2)]


Algebra expression for the Rule from Band Review
(Reiss, Raghavan, Krishnamurthy, Zhu and Vaithyanathan, ICDE 2008)

Rule: <ProperNoun> <within 30 characters> <Instrument>
(ProperNoun: regular expression match; Instrument: dictionary match)

[Figure: operator tree with a Join (followed within 30 characters) over a Regular expression operator (ProperNoun) and a Dictionary operator (Instrument)]


Revisit Problem of Sequencing Annotations

  John Pipe plays the guitar
  Marco Benevento on the Hammond organ

[Figure: three ProperNoun and three Instrument annotations over these two phrases, some of them overlapping]


[Figure: evaluating the algebra expression over the example text; every tuple also carries the document (doc)]

  John Pipe plays the guitar
  Marco Benevento on the Hammond organ

Regex output (ProperNoun): John Pipe, Marco Benevento, Hammond

Dictionary output (Instrument): Pipe, guitar, Hammond organ

Join output (ProperNoun <0-30 chars> Instrument): (John Pipe, guitar), (Marco Benevento, Hammond organ)


The custom-code extensions of RAP++ are also addressed in the algebra

The interaction between custom modules and the grammar is hard to maintain

In the algebra world they are addressed more cleanly

– Aggregation

– Consolidation


How is aggregation handled?

Example signature block:

  Laura Haas, PhD
  Distinguished Engineer and Director, Computer Science
  Almaden Research Center
  408-927-1700
  http://www.almaden.ibm.com/cs

Annotations in the block: Person, Organization, Phone, URL

Signature constraints:

– Start with Person
– At least 1 Phone
– At least 2 of {Phone, Organization, URL, Email, Address}
– End with one of these.
– Within 250 characters


Block Operator ()

[Figure: sample document of filler text with four Input annotations marked]


Block Operator ()

[Figure: the same sample document, with the four Input annotations grouped into one Block]

– Constraint on distance between inputs

– Constraint on number of inputs
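The two constraints can be illustrated with a small Python sketch of a block-style grouping. It assumes non-overlapping `(begin, end)` input spans and reports only maximal runs; the function name `block` and its parameters are hypothetical, and the actual System T operator may define blocks differently.

```python
def block(spans, max_gap, min_count):
    """Group spans into maximal runs and return one covering span per run.

    A run continues while the gap between one span's end and the next
    span's begin is at most max_gap characters. Runs with fewer than
    min_count spans are dropped.
    """
    blocks, run = [], []
    for current in sorted(spans):
        if run and current[0] - run[-1][1] > max_gap:
            # Gap too large: close the current run and start a new one.
            if len(run) >= min_count:
                blocks.append((run[0][0], run[-1][1]))
            run = []
        run.append(current)
    if len(run) >= min_count:
        blocks.append((run[0][0], run[-1][1]))
    return blocks


inputs = [(0, 5), (10, 15), (18, 25), (100, 110)]
print(block(inputs, max_gap=10, min_count=2))
```

With these inputs the first three spans form one block and the distant fourth span is dropped because it cannot reach the minimum count on its own.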


Back to signature

[Figure: operator plan that combines Union (over Organization, Phone and URL annotations), Block, and Join (with Person) to produce Signature]

Cleaner and potentially faster


Cost Modeling

Our experience: Extraction operators dominate

– 90+ percent of execution time for most plans

Most of the costs that matter in databases don’t matter in information extraction

– I/O costs
– Join costs
– Sorting costs
– Aggregation costs


And finally performance

Token[~ “John | Smith | …”]+ → Name

Token[~ “[1-9]\d{2}-\d{4}”]+ → Phone

Name Token[~ “at”] Phone → PersonPhone

[Figure: the same document of filler text shown after each grammar pass: … John Smith at 555-1212 …, then … <Name> at 555-1212 …, then … <Name> at <Phone> …, then … <PersonPhone> …]


Why performance should improve

RAP (one pass over the document per rule):

  …John Smith at 555-1212…
  Apply Name Rule
  …<Name> at 555-1212…
  Apply Phone Rule
  …<Name> <Name> at <Phone>…
  Apply PersonPhone
  …<PersonPhone>…

Algebra:

  …John Smith at 555-1212…
  Dictionary: John, Smith
  Regex: 555-1212
  Join: John Smith at 555-1212

Smaller number of passes over the data


More Optimization

The algebraic approach allows further performance improvements via query optimization

– Common technique in relational databases

We extend the formalism to the text domain

– Text-specific algebraic transformations, cost model, and selectivity estimation


Optimization Example

<ProperNoun> <within 30 characters> <Instrument>

[Figure: filler text containing "John Pipe played the guitar", with the regex match and the dictionary match marked 0-30 characters apart]


Plan A: Join of <ProperNoun> and <Instrument> (followed within 30 characters)

Plan B: Find <ProperNoun>, then find <Instrument> within 30 characters (consider text to the right)

Plan C: Find <Instrument>, then find <ProperNoun> within 30 characters (consider text to the left)
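A rough Python sketch of the Plan B idea follows: find each proper noun first, then run the dictionary only on the short window to its right instead of on the whole document. The dictionary entries, the example sentence, and the function name `plan_b` are illustrative assumptions; this is not System T's implementation, and Plan C would be the mirror image starting from the instrument.

```python
import re

NAME = re.compile(r"[A-Z]\w+(\s[A-Z]\w+)?")
INSTRUMENTS = ["hammond organ", "guitar", "pipe"]  # hypothetical dictionary
INSTRUMENT = re.compile(
    r"\b(?:" + "|".join(re.escape(entry) for entry in INSTRUMENTS) + r")\b",
    re.IGNORECASE,
)
LONGEST = max(len(entry) for entry in INSTRUMENTS)


def plan_b(text, max_gap=30):
    """Pair each name with instruments that start within max_gap chars after it."""
    pairs = []
    for name in NAME.finditer(text):
        # Only this window can hold a qualifying instrument. The extra
        # character lets the trailing \b see what follows the longest entry.
        window_end = min(len(text), name.end() + max_gap + LONGEST + 1)
        for instr in INSTRUMENT.finditer(text, name.end(), window_end):
            if instr.start() - name.end() <= max_gap:
                pairs.append((name.group(), instr.group()))
    return pairs


TEXT = "John Pipe plays the guitar. Marco Benevento on the Hammond organ."
print(plan_b(TEXT))
```

The saving comes from the dictionary scan touching only a bounded window per name, which matters when names are rare relative to document length.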


Optimization Example

[Figure: bar chart of running time (sec), axis 0 to 180, comparing Plan A and Plan C]


Improvements even on named-entity annotators

[Figure: bar chart of running time (sec), axis 0 to 600, comparing RAP and System-T]


Full Band Review Annotator

[Figure: bar chart of annotator running time (sec), axis 0 to 30000, comparing GRAMMAR, ALGEBRA (Baseline) and ALGEBRA (Optimized)]


Closely related work (Shen, Doan, Naughton, Ramakrishnan, VLDB 2007)

[Figure: progression of approaches]

– Regular Expressions and Custom Code
– Cascading Grammars: CPSL, AFST
– Workflows: UIMA, GATE
– System T, DBLife

DBLife: In the context of Project Cimple. Search for “cimple wisc”


Delving deeper into System T versus DBLife

[Figure: comparison of optimization techniques across System T and DBLife: Restricted Span Evaluation, Shared Dictionary Matching, Conditional Evaluation, Pushing Down Text Properties, Scoping Extractions, Pattern Matching]


We have started: AQL

SQL-like language for defining annotators

Declarative

– Define basic patterns and the relationships between them

– Let the system worry about the order of operations


AQL Example

Rule: <ProperNoun> <within 30 characters> <Instrument>
(ProperNoun: regular expression match; Instrument: dictionary match)

  select CombineSpans(name.match, instrument.match) as annot,
         name.match as name,
         instrument.match as instr
  from   Regex(/[A-Z]\w+(\s[A-Z]\w+)?/, DocScan.text) name,
         Dictionary("instr.dict", DocScan.text) instrument
  where  Follows(0, 30, name.match, instrument.match);
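To show what the query computes, here is a plain-Python analogue that evaluates it as a naive join over all name and instrument spans. The helper names mirror the AQL functions only for readability; the dictionary contents stand in for the unseen `instr.dict` file and are assumptions, and none of this reflects how System T executes AQL.

```python
import re


def regex_spans(pattern, text):
    """Analogue of Regex(...): all (begin, end) spans matching the pattern."""
    return [m.span() for m in re.finditer(pattern, text)]


def dictionary_spans(entries, text):
    """Analogue of Dictionary(...): case-insensitive whole-word matches."""
    ordered = sorted(entries, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(e) for e in ordered) + r")\b"
    return [m.span() for m in re.finditer(pattern, text, re.IGNORECASE)]


def follows(min_gap, max_gap, first, second):
    """Analogue of Follows(...): second begins min_gap..max_gap chars after first ends."""
    return min_gap <= second[0] - first[1] <= max_gap


def combine_spans(first, second):
    """Analogue of CombineSpans(...): span from start of first to end of second."""
    return (first[0], second[1])


def name_instrument_pairs(text, instruments):
    names = regex_spans(r"[A-Z]\w+(\s[A-Z]\w+)?", text)
    instrs = dictionary_spans(instruments, text)
    return [
        {
            "annot": text[slice(*combine_spans(n, i))],
            "name": text[slice(*n)],
            "instr": text[slice(*i)],
        }
        for n in names
        for i in instrs
        if follows(0, 30, n, i)
    ]


TEXT = "John Pipe plays the guitar. Marco Benevento on the Hammond organ."
INSTRUMENT_ENTRIES = ["pipe", "guitar", "hammond organ"]  # stand-in for instr.dict
for row in name_instrument_pairs(TEXT, INSTRUMENT_ENTRIES):
    print(row)
```

Because both inputs keep all of their matches, the overlapping "Pipe" and "Hammond" candidates never need to be sequenced; the join predicate simply filters them out.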


AQL System Architecture

– AQL Language: Specify annotator semantics declaratively

– Optimizer: Choose an efficient execution plan that implements semantics

– Operator Runtime


Where are we?

Preliminary indications that at least two (seemingly) different directions are coming together

How do applications interact with such a system?

What are the next steps?


Backup Slides

