Date post: 05-Jan-2016
India Research Lab

© Copyright IBM Corporation 2006

Entity Annotation using operations on the Inverted Index

Ganesh Ramakrishnan, with Sreeram Balakrishnan and Sachindra Joshi

IBM India Research Lab

Problem: Entity Annotation

Extract all instances of entities of type E from an unstructured source S.
- Company names, Designation, Person names, Date, Time

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME              TITLE    ORGANIZATION
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  founder  Free Soft..

Document-at-a-time Based Approach

[Figure: Document collection → A single non-annotated document → Feature Collection (Tokenizer, POS Lookup, Gazetteer Lookup, etc.) → Instance Extractor (ML / Hand-built rules) → Annotated document → Annotated document collection]

A few rule-based annotators exist, e.g. GATE. We have built a rule-based annotator at IRL.

Example: Rules for identifying ORGANIZATIONs

How to identify?

- B.P. Marsh Plc
- The U.S.B. Holding Co.
- U.S.B. Holding Group

Example rule for identifying ORGANIZATION instances

[Figure: the rule applied to "The U.S.B. Holding Co.", with its parts labelled: regular expression macros, and a dictionary attribute OR a part-of-speech tag]
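The rule itself appears only as a figure and is not reproduced in this transcript, so the following is a minimal sketch under assumptions rather than the authors' rule. It assumes a pattern of an optional "The", one or more tokens matched by a hypothetical regular expression macro (capitalized words or dotted initials), and a final token found in a hypothetical ORGANIZATION:CLUE dictionary. The names `CAP_OR_INITIALS`, `ORG_CLUES` and `find_organizations` are illustrative only.

```python
import re

# Hypothetical regex macro: a capitalized word, or dotted initials such as "U.S.B."
CAP_OR_INITIALS = re.compile(r"[A-Z][a-z]+|(?:[A-Z]\.)+")

# Hypothetical ORGANIZATION:CLUE dictionary, filled from the examples on the slides.
ORG_CLUES = {"Plc", "Co.", "Group"}


def find_organizations(tokens, clues=ORG_CLUES):
    """Return inclusive (start, end) token spans matching: The? MACRO+ CLUE."""
    spans = []
    i = 0
    while i < len(tokens):
        j = i
        if tokens[j] == "The":  # optional leading "The"
            j += 1
        k = j
        # Consume one or more macro tokens that are not themselves clue words.
        while (k < len(tokens)
               and CAP_OR_INITIALS.fullmatch(tokens[k])
               and tokens[k] not in clues):
            k += 1
        if k > j and k < len(tokens) and tokens[k] in clues:
            spans.append((i, k))  # the span ends at the clue token
            i = k + 1
        else:
            i += 1
    return spans


tokens = "shares of The U.S.B. Holding Co. rose".split()
print(find_organizations(tokens))
```

Run document-at-a-time, every document repeats the same regex and dictionary checks for each occurrence of a token, and any change to `ORG_CLUES` means scanning all documents again.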

Problems with Document-at-a-time Based Approach on large corpora

Repeated computations for multiple occurrences of the same token:
- Dictionary lookups
- Regular expression matches

Large overheads while:
- Re-annotating a corpus after changing dictionary entries
  The user realizes that “Group” is too generic a word to be included as an ORGANIZATION:CLUE and wants to remove its entry.


- Re-annotating a corpus with a slight modification in rules
  The user realizes that the optional “The” at the beginning introduces too many wrong annotations and modifies the rule.

The rule with the optional “The” at the beginning removed

Problems with Document-at-a-time Based Approach on large corpora

Large overheads while:
- Making incremental annotation updates by adding new rules
  The user wants a new rule that identifies “C.B. Fairlie Holding & Finance Limited”.

A new rule to capture an interspersed conjunction

Problems with Document-at-a-time Based Approach on large corpora

Large overheads while:
- Making incremental annotation updates by adding new rules
  The user wants a new rule that identifies acquiring organizations: “AT&T Wireless, Inc.” (that purchased Alaska Communications System in 1995).

A new rule to identify acquiring organizations

Post-context specifier

An alternative approach: Operating on the Inverted Index

Inverted Index
- A compact representation of the collection
- Captures redundancies/repetition information

Structure of Index

Example: The company said that it will acquire the other company

Index terms: the, company, said, that, it, will, acquire, other

Each term points to a posting list of (sid, first, last) entries:
- sid: a sentence identifier
- first: beginning position of an occurrence
- last: end position of the same occurrence

Index entries: Basic Entities, Orthographic properties, Dictionary Features
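The index structure can be sketched in a few lines. This is an illustration under assumptions, not the authors' implementation: it tokenizes on whitespace, lower-cases tokens, numbers sentences and positions from 0, and indexes only basic tokens (no orthographic or dictionary features). The function name `build_index` is hypothetical.

```python
from collections import defaultdict


def build_index(sentences):
    """Map each lower-cased token to a posting list of (sid, first, last) triples."""
    index = defaultdict(list)
    for sid, sentence in enumerate(sentences):
        for pos, token in enumerate(sentence.split()):
            # A single token begins and ends at the same position.
            index[token.lower()].append((sid, pos, pos))
    return dict(index)


index = build_index(["The company said that it will acquire the other company"])
for term, postings in index.items():
    print(term, postings)
```

Because sentences and positions are visited in order, each posting list comes out sorted, and a repeated token such as "the" is stored once with two postings.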

An alternative approach: Operating on the Inverted Index

Many applications build an inverted index on the annotated corpus anyway.
- We directly update the inverted index with annotation entries

Our approach: Index Based Entity Annotation

Complexity Analysis for Document based Approach

Problem: Find all annotations of length at most a given bound.
Solution: Given a regular expression R, convert it into a DFA $D_R$.

Complexity:

$$C_D = \sum_{i=1}^{N} cnt(S_i) = W + \sum_{i=2}^{N} cnt(S_i)$$

where
- $cnt(S_i)$ = number of times $S_i$ is visited
- $S_i$ = a state in $D_R$
- $W$ = total number of tokens
- $N$ = number of states in $D_R$
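To make the cost expression concrete, the sketch below counts state visits for a small DFA. It is an assumption-laden illustration: the DFA arcs and the token stream are made up, and the scan restarts the DFA at every token position, which is one reading consistent with the start state contributing W visits.

```python
from collections import Counter

# Hypothetical DFA: S1 -a-> S2, S2 -b-> S3, S2 -c-> S3, S3 -c-> S4.
ARCS = [("S1", "a", "S2"), ("S2", "b", "S3"), ("S2", "c", "S3"), ("S3", "c", "S4")]


def count_state_visits(tokens, arcs, start="S1"):
    """Restart the DFA at every token and count how often each state is visited."""
    delta = {(src, tok): dst for src, tok, dst in arcs}
    cnt = Counter()
    for begin in range(len(tokens)):
        state = start
        cnt[state] += 1  # the start state is visited once per token
        pos = begin
        while pos < len(tokens) and (state, tokens[pos]) in delta:
            state = delta[(state, tokens[pos])]
            cnt[state] += 1
            pos += 1
    return cnt


tokens = "a b c x a c c a b".split()
cnt = count_state_visits(tokens, ARCS)
cost_document_based = sum(cnt.values())  # C_D = W + sum of cnt(S_i) for i >= 2
print(dict(cnt), cost_document_based)
```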

Operations on Index

merge(L, L’): returns a posting list where each entry in the returned posting list occurs either in posting list L or L’ or in both.

Cost: $|L| + |L'|$

consint(L, L’): returns a posting list where each entry in the posting list points to a token sequence which consists of two consecutive subsequences @sa and @sb such that L has a pointer to @sa and L’ has a pointer to @sb.

Cost: $\min\left(|L| + |L'|,\ 2|L|\log_2\frac{|L'|}{|L|} + 1,\ 2|L'|\log_2\frac{|L|}{|L'|} + 1\right)$
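The two operations can be sketched as follows, assuming posting lists are sorted Python lists of (sid, first, last) triples. This is not the authors' implementation: for brevity `consint` uses a hash lookup instead of the linear or binary-search strategies that the cost expressions describe.

```python
def merge(l1, l2):
    """Union of two sorted posting lists: entries occurring in l1, l2, or both."""
    out, i, j = [], 0, 0
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:
            out.append(l1[i])
            i += 1
            j += 1
        elif l1[i] < l2[j]:
            out.append(l1[i])
            i += 1
        else:
            out.append(l2[j])
            j += 1
    out.extend(l1[i:])
    out.extend(l2[j:])
    return out


def consint(l1, l2):
    """Sequences made of an l1 entry immediately followed by an l2 entry."""
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    out = set()
    for sid, first, last in l1:
        for end in starts.get((sid, last + 1), []):
            out.add((sid, first, end))
    return sorted(out)


the = [(0, 0, 0), (0, 7, 7)]
company = [(0, 1, 1), (0, 9, 9)]
other = [(0, 8, 8)]
print(consint(the, company))                  # "the company"
print(consint(consint(the, other), company))  # "the other company"
```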

Implementing a DFA using Index

With each pair of state $s \neq S_1$ and length $k$, associate $list_{s,k}$.
$list_{s,k}$ is a posting list of token sequences of length $k$ which end in state $s$.
Iteratively compute $list_{s,k}$ from its predecessor states.

Example DFA: S1 -a-> S2, S2 -b-> S3, S2 -c-> S3, S3 -c-> S4

$$list_{s_3,2} = merge(consint(list_{s_2,1}, L(b)),\ consint(list_{s_2,1}, L(c)))$$

$$list_{s_4,3} = consint(list_{s_3,2}, L(c))$$
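The iterative computation can be sketched for the four-state example DFA. This is a minimal illustration under assumptions: posting lists are Python lists of (sid, first, last) triples, the token stream is made up, and the helper names (`index_tokens`, `lists_by_state`) and the set-based `merge`/`consint` are simplifications, not the authors' code.

```python
def index_tokens(tokens, sid=0):
    """Posting lists of (sid, first, last) for one tokenized sentence."""
    index = {}
    for pos, tok in enumerate(tokens):
        index.setdefault(tok, []).append((sid, pos, pos))
    return index


def merge(l1, l2):
    return sorted(set(l1) | set(l2))


def consint(l1, l2):
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    return sorted({(sid, first, end)
                   for sid, first, last in l1
                   for end in starts.get((sid, last + 1), [])})


# Arcs of the example DFA: S1 -a-> S2, S2 -b-> S3, S2 -c-> S3, S3 -c-> S4.
ARCS = [("S1", "a", "S2"), ("S2", "b", "S3"), ("S2", "c", "S3"), ("S3", "c", "S4")]


def lists_by_state(index, arcs, start="S1", max_len=3):
    """Compute list[(s, k)]: sequences of length k that end in state s."""
    lists = {}
    for k in range(1, max_len + 1):
        for src, tok, dst in arcs:
            postings = index.get(tok, [])
            if src == start:
                # Arcs leaving the start state consume the first token only.
                step = postings if k == 1 else []
            else:
                step = consint(lists.get((src, k - 1), []), postings)
            lists[(dst, k)] = merge(lists.get((dst, k), []), step)
    return lists


index = index_tokens("a b c x a c c a b".split())
lists = lists_by_state(index, ARCS)
print(lists[("S3", 2)])
print(lists[("S4", 3)])
```

Each list for length k is derived only from lists of length k - 1 and the token posting lists, so the original documents are never read.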

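Complexity Analysis for Index based Approach

$$\sum_{k \ge 1} |list_{S_i,k}| = cnt(S_i)$$

$$C_I = \sum_{i=2}^{N} \Big[ 2\log(|prev(S_i)|) + \sum_{s \in dest(S_i)} 2\log(\rho_{si}) \Big]\, cnt(S_i)$$

where
- $\rho_{si}$ = ratio of the posting list sizes
- $|prev(S_i)|$ = number of incoming arcs into $S_i$

Observation:

$$\frac{C_I}{C_D} = \frac{\sum_{i=2}^{N} \Big[ 2\log(|prev(S_i)|) + \sum_{s \in dest(S_i)} 2\log(\rho_{si}) \Big]\, cnt(S_i)}{W + \sum_{i=2}^{N} cnt(S_i)}$$

$$\frac{C_I}{C_D} = \frac{\sum_{i=2}^{N} \Big[ 2\log(|prev(S_i)|) + \sum_{s \in dest(S_i)} 2\log(\rho_{si}) \Big]\, fcnt(S_i)}{1 + \sum_{i=2}^{N} fcnt(S_i)}$$

with $fcnt(S_i) = cnt(S_i)/W$.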

Example: Simple Dictionary Match

Let tokens in T be drawn from {a, b, …, z}.
Let D be a dictionary {a, e, i, o, u}.
A simple 2-state DFA that matches D is: S1 → S2 on any of a, e, i, o, u.

Ratio of document based match to index based match:

$$\frac{C_D}{C_I} = \frac{1 + fcnt(S_2)}{2\log(5)\, fcnt(S_2)}$$

Desirable: $fcnt(S_2)$, the fraction $cnt(S_2)/W$, below 0.27
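The 0.27 figure can be checked numerically. The sketch below is a worked calculation under stated assumptions: logarithms are base 2, and the index-based cost per visit of S2 is taken as 2·log2(5), i.e. the merge term for five incoming arcs, with no consint term because S2 has no outgoing arcs. The function names are illustrative.

```python
import math


def cost_ratio(fcnt_s2, fan_in=5):
    """C_D / C_I for the two-state dictionary DFA under the assumptions above."""
    return (1 + fcnt_s2) / (2 * math.log2(fan_in) * fcnt_s2)


def break_even(fan_in=5):
    """Value of fcnt(S2) at which both approaches cost the same (ratio = 1)."""
    return 1 / (2 * math.log2(fan_in) - 1)


print(round(break_even(), 4))  # break-even fraction of tokens that hit the dictionary
print(cost_ratio(0.1))         # dictionary words are rare: index-based is cheaper
print(cost_ratio(0.5))         # dictionary words are common: document-based is cheaper
```

Under these assumptions the break-even point is about 0.27: the index-based match wins when fewer than roughly 27% of the tokens are dictionary words.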

Index based Annotation using Regular Expressions

NFA to DFA conversion may cause explosion of states

Scan regular expression from left to right and build AND/OR graph recursively

Compute the posting list using the AND/OR graph by propagating lists from the leaves to the root node

Handling ? and Kleene Operators

Each node contains two binary properties:
- isOpt: 1 if the regular expression is of the form R? (selfRecursion = ? or *)
- selfLoop: 1 if the regular expression matched is of the form R+, one or more times (selfRecursion = * or +)
- For R* both the properties are set

New Operations

consint(L,L’): Generated list has isOpt set iff both the arguments have isOpt set.

merge(L,L’): Generated list has isOpt set if any of the arguments have isOpt set.

consint(L,+): Returns a posting list such that each entry points to a bounded number of subsequences in L.
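A minimal sketch of these extended operations follows. Only the isOpt propagation rules come from the slide; everything else is an assumption for illustration: the `PList` class, the explicit `max_reps` bound for consint(L,+), and the choice to let an optional argument be skipped by also keeping the other argument's entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PList:
    entries: tuple        # sorted (sid, first, last) triples
    is_opt: bool = False  # True if the matched expression may be empty


def _adjacent(left, right):
    """Pairs where a right entry starts immediately after a left entry ends."""
    starts = {}
    for sid, first, last in right:
        starts.setdefault((sid, first), set()).add(last)
    return {(sid, first, end)
            for sid, first, last in left
            for end in starts.get((sid, last + 1), ())}


def merge(a, b):
    # isOpt is set if any argument has it set.
    return PList(tuple(sorted(set(a.entries) | set(b.entries))), a.is_opt or b.is_opt)


def consint(a, b):
    found = _adjacent(a.entries, b.entries)
    if a.is_opt:                    # assumption: an optional left part may be skipped
        found |= set(b.entries)
    if b.is_opt:                    # assumption: an optional right part may be skipped
        found |= set(a.entries)
    # isOpt is set iff both arguments have it set.
    return PList(tuple(sorted(found)), a.is_opt and b.is_opt)


def consint_plus(a, max_reps):
    """Sequences of 1 to max_reps consecutive entries of a (the R+ case)."""
    found = set(a.entries)
    frontier = set(a.entries)
    for _ in range(max_reps - 1):
        frontier = _adjacent(frontier, a.entries)
        if not frontier:
            break
        found |= frontier
    return PList(tuple(sorted(found)), a.is_opt)


the_opt = PList(((0, 0, 0), (0, 7, 7)), is_opt=True)  # posting list for "the?"
company = PList(((0, 1, 1), (0, 9, 9)))
print(consint(the_opt, company))
```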

Computing Regular Expression using AND/OR Graph

Compute posting lists with each node from bottom up.

For each AND node use consint operation with the posting list of children nodes.

For each OR node use merge operation
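The bottom-up evaluation can be sketched as a recursive walk over a hand-built AND/OR graph. This is an illustration under assumptions: the graph is written directly as nested tuples rather than derived by scanning a regular expression, the isOpt and selfLoop properties are ignored, and all names are hypothetical.

```python
from functools import reduce


def build_index(sentences):
    index = {}
    for sid, sentence in enumerate(sentences):
        for pos, token in enumerate(sentence.lower().split()):
            index.setdefault(token, []).append((sid, pos, pos))
    return index


def merge(l1, l2):
    return sorted(set(l1) | set(l2))


def consint(l1, l2):
    starts = {}
    for sid, first, last in l2:
        starts.setdefault((sid, first), []).append(last)
    return sorted({(sid, first, end)
                   for sid, first, last in l1
                   for end in starts.get((sid, last + 1), [])})


def evaluate(node, index):
    """Propagate posting lists from the leaves up to the given node."""
    kind, payload = node
    if kind == "tok":
        return index.get(payload, [])
    child_lists = [evaluate(child, index) for child in payload]
    if kind == "and":   # AND node: children must occur consecutively
        return reduce(consint, child_lists)
    if kind == "or":    # OR node: any child may match
        return reduce(merge, child_lists)
    raise ValueError(f"unknown node kind: {kind}")


# Graph for the expression: the (company | other)
graph = ("and", [("tok", "the"),
                 ("or", [("tok", "company"), ("tok", "other")])])

index = build_index(["The company said that it will acquire the other company"])
print(evaluate(graph, index))
```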

Experimental Results

Data sets:
- Enron email: 2.3 GB
- Reuters+20NG: 93 MB

8 rules for 4 annotations:
- Person name, company name, location and date

Data set   GATE      Index based   Speedup Factor
Enron      4974343   374926        13.26
Reuter+    752287    92238         8.15

A greater speedup is achieved on the larger corpus.
Incremental annotations achieve even larger performance gains.

Data set   GATE      Index based   Speedup Factor
Enron      1479954   62227         23.78
Reuter+    661157    17929         36.87
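The speedup factors can be recomputed from the raw figures as a quick consistency check. The sketch below assumes the speedup factor is simply the GATE figure divided by the index-based figure; the units of the raw figures are not stated on the slide, and the dictionary names are illustrative.

```python
import math

# (GATE, index based) figures copied from the two tables; units are not stated.
FIRST_TABLE = {"Enron": (4974343, 374926), "Reuter+": (752287, 92238)}
SECOND_TABLE = {"Enron": (1479954, 62227), "Reuter+": (661157, 17929)}


def speedup(gate, index_based):
    return gate / index_based


def truncated(value, digits=2):
    """Cut off (rather than round) after the given number of decimals."""
    scale = 10 ** digits
    return math.floor(value * scale) / scale


for label, table in (("first table", FIRST_TABLE), ("second table", SECOND_TABLE)):
    for name, (gate, index_based) in table.items():
        print(label, name, truncated(speedup(gate, index_based)))
```

The reported factors agree with these quotients when cut off after two decimals.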

THANK YOU

