Date post: 02-Jan-2016
Upload: amelia-leonard
November 2003 CSA4050: Information Extraction I 1

CSA4050: Advanced Topics in NLP

Information Extraction I

What is Information Extraction?


Sources

• R. Gaizauskas and Y. Wilks, Information Extraction: Beyond Document Retrieval. Technical Report CS-97-10, Department of Computer Science, University of Sheffield, 1997.


What is Information Extraction?

• IE: the analysis of unrestricted text in order to extract information about pre-specified types of entity, relationship and event.

• Typically, text is newspaper text or newswire feed.

• Typically, prespecified structure is a class-like object with different data fields.


An Example of Information Extraction

19 March – A bomb went off near a power tower in San Salvador, leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb, allegedly detonated by urban guerrilla commandos, blew up a power tower in the northwestern part of San Salvador.

Template Structure:
  IncidentType: bombing
  Date: March 19
  Location: San Salvador
  Perpetrator: urban guerrilla commandos
  Target: power tower
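The "class-like object with different data fields" can be sketched directly in code. The following is a minimal illustration only: the class name `IncidentTemplate`, the field names, and the helper `filled_slots` are hypothetical, and the slot values are copied by hand from the example rather than extracted automatically.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncidentTemplate:
    """Hypothetical container for the slots shown in the template above."""
    incident_type: Optional[str] = None
    date: Optional[str] = None
    location: Optional[str] = None
    perpetrator: Optional[str] = None
    target: Optional[str] = None

    def filled_slots(self) -> dict:
        """Return only the slots that have been given a value."""
        return {name: value for name, value in vars(self).items() if value is not None}


# Slot values copied by hand from the example; a real IE system would extract them.
bombing = IncidentTemplate(
    incident_type="bombing",
    date="March 19",
    location="San Salvador",
    perpetrator="urban guerrilla commandos",
    target="power tower",
)
```

Leaving every slot optional reflects the fact that a text may not mention every piece of information the template asks for.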


• Different levels of structure can be envisaged.
  – Named Entities
  – Relationships
  – Events
  – Scenarios


Examples of Named Entities

• People
  – John Smith, J. Smith, Smith, John, Mr. Smith

• Locations
  – EU, The Hague, SLT, Piazza Tuta

• Organisations
  – IBM, The Mizzi Group, University of Malta

• Numerical Quantities
  – Lm 10, forty per cent, 40%, $10


Examples of Relationships between Named Entities

George Bush_1 is [President_2 of the United States_3]_4

  – nation(3)
  – president(1,3)
  – coref(1,4)


Examples of Events

• Financial Events
  – Takeover bids
  – Changes of management

• Socio/Political Events
  – Terrorist attacks
  – Traffic accidents

• Geographical Events
  – Natural Disasters


Some Differences between IE and IR

• IE extracts relevant information from documents.

• IE emerged from research into rule-based systems in computational linguistics (CL).

• IE is typically based on some kind of linguistic analysis of the source text.

• Information Retrieval (IR) retrieves relevant documents from a collection.

• IR is mostly influenced by information theory, probability, and statistics.

• IR typically uses a bag-of-words model of the source text.


Why Linguistic Analysis is Necessary

• Active/Passive distinction
  – BNC Holdings named Ms G. Torretta to succeed Mr. N. Andrews as new chairperson
  – Nicholas Andrews was named by Gina Torretta as chairperson of BNC Holdings

• Use of different phrases to mean the same thing
  – Ms. Gina Torretta took the helm at BNC Holdings. She succeeds Nick Andrews
  – G Torretta succeeds N Andrews as chairperson at BNC Holdings

• Establishing coreferences
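A small sketch of why a bag-of-words view is not enough for these cases. The sentence pair and the helper `bag_of_words` are hypothetical and chosen only for illustration: two sentences with the roles of the people swapped produce identical bags, so a model that ignores word order cannot tell who named whom.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Lower-case, split on whitespace, and count tokens; word order is discarded."""
    return Counter(sentence.lower().split())


# Hypothetical sentence pair with the roles of the two people swapped.
a = "Torretta named Andrews as chairperson"
b = "Andrews named Torretta as chairperson"

# The bags are identical although the sentences describe different events.
same_bag = bag_of_words(a) == bag_of_words(b)
```

Distinguishing the two readings requires at least some syntactic analysis, for example identifying the subject and object of "named".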


Brief History

• 1960-80 N. Sager, Linguistic String Project: automatically induced information formats for radiology reports.

• 1970s R. Schank: scripts.

• 1982 G. DeJong, FRUMP: “sketchy scripts” used to process UPI newswire stories in domains such as earthquakes and labour strikes; systematic evaluation.

• 1983 J-P Zarri: analysis of historical texts by translating text into a semantic metalanguage.

• 1986 ATRANS (S. Lytinen et al.): script-based system for analysis of money transfer messages between banks.

• 1992 Carnegie Group, JASPER: skims company press releases to fill in templates concerning earnings and dividends.


Message Understanding Conferences

• Conferences aimed at comparing the performance of a number of systems working on IE from naval messages.

• Sponsored by DARPA and organised by the US Naval Command centre, San Diego.
  – Progressively more difficult tasks.
  – Progressively more refined evaluation measures.


MUC Tasks

• MUC-1: tactical naval operations reports on ship sightings and engagements. No task definition; no evaluation criteria.

• MUC-3: newswire stories about terrorist attacks. 18-slot templates to be filled. Formal evaluation criteria supplied.

• MUC-6: specific subtasks including named entity recognition, coreference identification, and scenario template extraction.


IE Subtasks

• Named Entity recognition (NE) – Finds and classifies names, places etc.

• Coreference Resolution (CO) – Identifies identity relations between entities in texts.

• Template Element construction (TE) – Adds descriptive information to NE results (using CO).

• Template Relation construction (TR) – Finds relations between TE entities.

• Scenario Template production (ST) – Fits TE and TR results into specified event scenarios.


Evaluation: the IR Starting Point

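[Figure: two overlapping sets, "selected" and "target". Items only in "selected" are false positives, items in both sets are true positives, and items only in "target" are false negatives.]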


Evaluation Metrics

• Starting points are those used for IR, namely recall and precision.

                 Relevant          Not Relevant
Retrieved        tp (true pos)     fp (false pos)
Not Retrieved    fn (false neg)    tn (true neg)


IR Measures: Precision and Recall

• Precision: fraction of retrieved docs that are relevant = P(relevant|retrieved)

Precision P = tp/(tp + fp)

• Recall: fraction of relevant docs that are retrieved = P(retrieved|relevant)

Recall R = tp/(tp + fn)
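The two formulas translate directly into code. This is a minimal sketch: the function names and the example counts are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here, not something the slides specify.

```python
def precision(tp: int, fp: int) -> float:
    """P = tp / (tp + fp); 0.0 if nothing was retrieved (assumed convention)."""
    retrieved = tp + fp
    return tp / retrieved if retrieved else 0.0


def recall(tp: int, fn: int) -> float:
    """R = tp / (tp + fn); 0.0 if nothing is relevant (assumed convention)."""
    relevant = tp + fn
    return tp / relevant if relevant else 0.0


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
tp, fp, fn = 30, 10, 20
p = precision(tp, fp)  # 30 / 40
r = recall(tp, fn)     # 30 / 50
```

Note that tn (true negatives) appears in neither formula, which is why both measures remain informative when the non-relevant set is very large.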


F-Measure

• Whatever method is chosen to establish P and R there is a trade-off between them.

• For this reason researchers often use a measure which combines the two.

• F = 1 / (α/P + (1 - α)/R) is commonly used, where α is a factor which determines the weighting between P and R.

• When α = 0.5 the formula reduces to the harmonic mean = 2PR/(P+R)

• Clearly F is weighted towards P as α approaches 1, and towards R as α approaches 0.
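The weighted formula can be checked numerically. The sketch below assumes a hypothetical function name `f_measure` and hypothetical values P = 0.75 and R = 0.6; returning 0.0 when either input is zero is an assumed convention to avoid division by zero.

```python
def f_measure(p: float, r: float, alpha: float = 0.5) -> float:
    """Weighted harmonic mean F = 1 / (alpha/P + (1 - alpha)/R)."""
    if p == 0.0 or r == 0.0:
        return 0.0  # assumed convention; the formula itself is undefined here
    return 1.0 / (alpha / p + (1.0 - alpha) / r)


# Hypothetical precision and recall values.
balanced = f_measure(0.75, 0.6)                    # alpha = 0.5, equals 2PR/(P+R)
precision_heavy = f_measure(0.75, 0.6, alpha=0.9)  # moves towards P = 0.75
```

With α = 1 the recall term vanishes and F equals P exactly, which is the limiting case of the last bullet above.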


Harmonic Mean

                                                                    

x     y     arithmetic mean    geometric mean    harmonic mean
50    50    50                 50                50
40    60    50                 49                48
30    70    50                 46                42
20    80    50                 40                32
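The figures in the table can be reproduced with a few lines of code. This is a sketch with a hypothetical helper `means`; values are rounded to the nearest integer, which matches the table as given.

```python
from math import sqrt


def means(x: float, y: float) -> tuple:
    """Return the (arithmetic, geometric, harmonic) means of two positive numbers."""
    arithmetic = (x + y) / 2
    geometric = sqrt(x * y)
    harmonic = 2 * x * y / (x + y)
    return arithmetic, geometric, harmonic


# One row per table line: (x, y, arithmetic, geometric, harmonic), rounded.
rows = [
    (x, 100 - x) + tuple(round(m) for m in means(x, 100 - x))
    for x in (50, 40, 30, 20)
]
```

The arithmetic mean stays at 50 in every row, while the harmonic mean falls fastest as x and y diverge. This is the property that makes it suitable for combining P and R: a system cannot score well by maximising one at the expense of the other.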


Evaluation Metrics for IE

• For IE, these measures need to be related to the activity of slot-filling:
  – Slot fills can be correct, partially correct, incorrect, missing, or spurious.
  – These differences permit the introduction of finer-grained measures of correctness that include overgeneration, undergeneration, and substitution.
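The slot-fill categories above can be turned into scores roughly as follows. The slides do not give the formulas, so this sketch assumes one commonly cited MUC-style formulation, in which a partially correct fill earns half credit; the class `SlotCounts`, the function `slot_scores`, and the example counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SlotCounts:
    """Counts of slot fills in each category for one evaluation run."""
    correct: int
    partial: int
    incorrect: int
    missing: int
    spurious: int


def ratio(numerator: float, denominator: float) -> float:
    """Safe division; 0.0 when the denominator is zero (assumed convention)."""
    return numerator / denominator if denominator else 0.0


def slot_scores(c: SlotCounts) -> dict:
    """MUC-style scores under the assumptions stated in the lead-in."""
    possible = c.correct + c.partial + c.incorrect + c.missing   # fills in the answer key
    actual = c.correct + c.partial + c.incorrect + c.spurious    # fills the system produced
    attempted = c.correct + c.partial + c.incorrect              # fills present in both
    credit = c.correct + 0.5 * c.partial                         # half credit: an assumption
    return {
        "recall": ratio(credit, possible),
        "precision": ratio(credit, actual),
        "undergeneration": ratio(c.missing, possible),
        "overgeneration": ratio(c.spurious, actual),
        "substitution": ratio(c.incorrect + 0.5 * c.partial, attempted),
    }


# Hypothetical counts for illustration only.
scores = slot_scores(SlotCounts(correct=6, partial=2, incorrect=1, missing=1, spurious=2))
```

Under these assumptions, undergeneration and overgeneration isolate the missing and spurious fills, while substitution captures fills that were attempted but wrong.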


Recall

• Recall is a measure of how much relevant information a system has extracted from text.

• It is the ratio of how much information is correctly extracted to how much information there is to be extracted, i.e.

Recall = (count of correct facts extracted) / (count of possible facts)


Precision

• Precision is a measure of how accurate a system is in extracting information.

• It is the ratio of how much correct information is actually extracted to how much information is extracted, i.e.

Precision = (count of correct facts extracted) / (count of facts extracted)


Bare Bones Architecture (from Appelt and Israel 1999)

[Figure: architecture diagram. Main stages: Tokenisation; Morphological & Lexical Processing; Syntactic Analysis; Discourse Analysis. Associated tasks shown alongside: Word segmentation; POS Tagging; Word Sense Tagging; Preparsing; Parsing; Coreference.]


Generic IE System (Hobbs 1993)

[Figure: module diagram. Modules shown:]

• Text Zoner
• Preprocessor
• Filter
• Preparser
• Parser
• Fragment Combiner
• Semantic Interpreter
• Lexical Disambiguator
• Coreference Resolution
• Template Generator
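A pipelined architecture of this kind can be sketched as a chain of functions, each taking the document state and returning an enriched copy. The two stages below are hypothetical toy stand-ins, not implementations of the modules named in the diagram; only the chaining pattern is the point.

```python
from functools import reduce
from typing import Callable, List

Stage = Callable[[dict], dict]


def text_zoner(doc: dict) -> dict:
    """Toy stand-in: treat blank-line-separated blocks as zones."""
    zones = [zone for zone in doc["text"].split("\n\n") if zone.strip()]
    return {**doc, "zones": zones}


def preprocessor(doc: dict) -> dict:
    """Toy stand-in: whitespace tokenisation of each zone."""
    return {**doc, "tokens": [zone.split() for zone in doc["zones"]]}


def run_pipeline(stages: List[Stage], doc: dict) -> dict:
    """Feed the output of each stage into the next, in order."""
    return reduce(lambda state, stage: stage(state), stages, doc)


result = run_pipeline(
    [text_zoner, preprocessor],
    {"text": "A bomb went off.\n\nNo casualties reported."},
)
```

Because each stage only reads what earlier stages added, further modules can be appended to the list without changing the existing ones.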


Large Scale IE: LaSIE

• General-purpose IE research system geared towards MUC-6 tasks.

• Pipelined system with three principal processing tasks:
  – Lexical preprocessing
  – Parsing and semantic interpretation
  – Discourse interpretation


LaSIE: Processing Stages

• Lexical preprocessing: reads, tokenises, and tags raw input text.

• Parsing and semantic interpretation: chart parser; best-parse selection; construction of predicate/argument structure

• Discourse interpretation: adds information from predicate-argument representation to a world model in the form of a hierarchically structured semantic net


LaSIE Parse Forest

• It is rare for the analysis to contain a unique, spanning parse.

• Selection of the best parse is carried out by choosing the sequence of non-overlapping, semantically interpretable categories that covers the most words and consists of the fewest constituents.
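The selection criterion (most words covered, then fewest constituents) can be sketched as a small dynamic programme over word positions. This is an illustration of the criterion only, not LaSIE's actual implementation: the function `best_parse`, the span representation, and the example constituents are hypothetical, and the "semantically interpretable" filter is assumed to have been applied already.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end, label), with end exclusive


def best_parse(n_words: int, spans: List[Span]) -> List[Span]:
    """Pick non-overlapping spans covering the most words, then the fewest spans."""
    # best[i] = (words covered, -spans used, chosen spans) for the first i words.
    best = [(0, 0, [])]
    for i in range(1, n_words + 1):
        candidate = best[i - 1]  # option: leave word i-1 uncovered
        for start, end, label in spans:
            if end == i:
                covered, neg_count, chosen = best[start]
                option = (covered + end - start, neg_count - 1, chosen + [(start, end, label)])
                # Compare on (coverage, -count): more words first, then fewer spans.
                if option[:2] > candidate[:2]:
                    candidate = option
        best.append(candidate)
    return best[n_words][2]


# Hypothetical chart fragments over a five-word sentence, with no spanning parse.
fragments = [(0, 2, "NP"), (2, 3, "V"), (3, 5, "NP"), (2, 5, "VP"), (1, 4, "X")]
selected = best_parse(5, fragments)
```

Here both NP + V + NP and NP + VP cover all five words, so the tie is broken in favour of the sequence with fewer constituents.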


LaSIE Discourse Model


Example Applications of IE

• Finance

• Medicine

• Law

• Police

• Academic Research


Future Trends

• Better performance: higher precision & recall

• User (not expert) defined IE: minimisation of role of expert

• Integration with other technologies (e.g. IR)

• Multilingual IE

