CS276B Text Information Retrieval, Mining, and Exploitation
Lecture 13: Text Mining II (Feb 27, 2003)
(includes slides borrowed from J. Allan, G. Doddington, G. Neumann, M. Venkataramani, and D. Radev)
Today’s Topics
First story detection (FSD)
Summarization
Coreference resolution
First Story Detection
Automatically identify the first story on a new event from a stream of text
Topic Detection and Tracking (TDT): “bake-off” sponsored by US government agencies
Applications:
Intelligence services
Finance: be the first to trade a stock
Examples
2002 Presidential Elections
Thai Airbus Crash (11.12.98)
On topic: stories reporting details of the crash, injuries and deaths; reports on the investigation following the crash; policy changes due to the crash (new runway lights were installed at airports).
Euro Introduced (1.1.1999)
On topic: stories about the preparation for the common currency (negotiations about exchange rates and financial standards to be shared among the member nations); official introduction of the Euro; economic details of the shared currency; reactions within the EU and around the world.
First Story Detection
Other technologies don’t work for this:
Information retrieval
Text classification
Why? There is no supervised topic training (as there is in topic tracking)
[Diagram: stream of stories over time, colored by topic (Topic 1, Topic 2); the first story of each topic is marked, later stories are not]
The First-Story Detection Task
To detect the first story that discusses a topic, for all topics.
Definitions
Event: a reported occurrence at a specific time and place, and its unavoidable consequences. Examples: specific elections, accidents, crimes, natural disasters.
Activity: a connected set of actions that have a common focus or purpose. Examples: campaigns, investigations, disaster relief efforts.
Topic: a seminal event or activity, along with all directly related events and activities
Story: a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single topic.
TDT Tasks
First story detection (FSD): detect the first story on a new topic
Topic tracking: once a topic has been detected, identify subsequent stories about it
Standard text classification task
However, very small training set (initially: 1!)
First Story Detection (FSD)
First story detection is an unsupervised learning task.
On-line vs. retrospective:
On-line: flag the onset of new events from live news feeds as stories come in
Retrospective: detection consists of identifying first stories looking back over a longer period
Lack of advance knowledge of new events, but have access to unlabeled historical data as a contrast set
FSD input: stream of stories in chronological order, simulating a real-time incoming document stream
FSD output: YES/NO decision per document
Patterns in Event Distributions
News stories discussing the same event tend to be temporally proximate
A time gap between bursts of topically similar stories is often an indication of different events
Different earthquakes Airplane accidents
A significant vocabulary shift and rapid changes in term frequency are typical of stories reporting a new event, including previously unseen proper nouns
Events are typically reported in a relatively brief time window of 1-4 weeks
Similar Events over Time
TDT: The Corpus
TDT evaluation corpora consist of text and transcribed news from the 1990s.
A set of target events (e.g., 119 in TDT2) is used for evaluation
Corpus is tagged for these events (including first story)
TDT2 consists of 60,000 news stories (Jan-June 1998); about 3,000 are “on topic” for one of 119 topics
Stories are arranged in chronological order
Ideas?
Approach 1: KNN
On-line processing of each incoming story
Compute similarity to all previous stories, e.g. via:
Cosine similarity
Language model
Prominent terms
Extracted entities
If similarity is below threshold: new story
If similarity is above threshold for a previous document d: assign to the topic of d
The optimal threshold can be chosen based on historical data
The threshold is not topic specific! (sketch below)
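A minimal sketch of this scheme in Python, assuming stories arrive as pre-computed term vectors (numpy arrays) and using plain cosine similarity; the function names and the threshold value 0.2 are illustrative choices, not from the lecture:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v) / denom if denom else 0.0

def fsd_knn(story_vectors, threshold=0.2):
    """Yield one YES/NO first-story decision per incoming story.

    The single global threshold would be tuned on historical data;
    it is deliberately not topic specific.
    """
    seen = []
    for v in story_vectors:
        # First story iff no previous story is similar enough.
        yield all(cosine(v, old) < threshold for old in seen)
        seen.append(v)
```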
Variant: Single Pass Clustering
Assign each incoming document to one of a set of topic clusters
A topic cluster is represented by its centroid (vector average of members)
For each incoming story, compute the similarity s with each centroid
As before: if s > θ, add the document to the corresponding cluster; if s < θ: first story! (sketch below)
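A sketch of the single-pass variant, reusing cosine() from the previous sketch; θ = 0.2 is again an assumed, empirically tuned value:

```python
import numpy as np

def single_pass_cluster(story_vectors, theta=0.2):
    """Single-pass clustering FSD: each topic cluster is its centroid.

    Returns one (cluster_id, is_first_story) pair per story.
    """
    centroids, sizes, decisions = [], [], []
    for v in story_vectors:
        sims = [cosine(v, c) for c in centroids]
        if sims and max(sims) > theta:
            k = int(np.argmax(sims))
            # Update the centroid as the running vector average of members.
            centroids[k] = (centroids[k] * sizes[k] + v) / (sizes[k] + 1)
            sizes[k] += 1
            decisions.append((k, False))
        else:
            # No cluster is close enough: first story on a new topic.
            centroids.append(np.asarray(v, dtype=float))
            sizes.append(1)
            decisions.append((len(centroids) - 1, True))
    return decisions
```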
Approach 2: KNN + Time
Only consider documents in a (short) time window
Compute similarity in a time-weighted fashion, e.g.:
sim_time(d, d_i) = (i/m) · sim(d, d_i)
m: number of documents in window, d_i: ith document in window (oldest first), so older stories are down-weighted
Time weighting significantly increases performance.
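A sketch of the windowed, time-weighted variant; the linear decay (i+1)/m is one plausible reading of the slide’s formula, and the window size and threshold are assumptions:

```python
from collections import deque

def fsd_time_window(story_vectors, window=200, threshold=0.2):
    """FSD restricted to a sliding window of recent stories.

    The newest document in the window gets weight 1, the oldest 1/m,
    so old stories count less toward "not new". Reuses cosine().
    """
    seen = deque(maxlen=window)  # keeps only the last `window` stories
    for v in story_vectors:
        m = len(seen)
        score = max((((i + 1) / m) * cosine(v, d_i)
                     for i, d_i in enumerate(seen)), default=0.0)
        yield score < threshold  # True -> first story
        seen.append(v)
```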
FSD - Results
UMass, CMU: Single-Pass Clustering
FSD Error vs. Classification Error
Discussion
Hard problem
Becomes harder the more topics need to be tracked. Why?
Second story detection is much easier than first story detection
Example: retrospective detection of the first 9/11 story is easy; on-line detection is hard
Summarization
What is a Summary?
Informative summary
Purpose: replace the original document
Example: executive summary
Indicative summary
Purpose: support a decision: do I want to read the original document, yes/no?
Example: headline, scientific abstract
Why Automatic Summarization?
The algorithm for reading in many genres is: 1) read the summary; 2) decide whether it is relevant or not; 3) if relevant: read the whole document.
Summary is gate-keeper for large number of documents.
Information overload
Often the summary is all that is read.
Example from last quarter: summaries of search engine hits
Human-generated summaries are expensive.
Summary Length (Reuters)
Goldstein et al. 1999
Summary Compression (Reuters)
Goldstein et al. 1999
Summarization Algorithms
Natural language understanding / generation: build a knowledge representation of the text, generate sentences summarizing its content; hard to do well
Keyword summaries: display the most significant keywords; easy to do; hard to read, poor representation of content
Sentence extraction: extract key sentences; medium hard; summaries often don’t read well, but good representation of content
Sentence Extraction
Represent each sentence as a feature vector
Compute a score based on the features
Select the n highest-ranking sentences
Present them in the order in which they occur in the text (sketch below)
Postprocessing to make the summary more readable/concise:
Eliminate redundant sentences
Resolve anaphors/pronouns
Delete subordinate clauses, parentheticals
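The extraction step itself is tiny once sentences are scored; a generic sketch, where score is a hypothetical per-sentence scoring function (e.g., the Naive Bayes score sketched later):

```python
def extract_summary(sentences, score, n=3):
    """Return the n top-scoring sentences in original document order."""
    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]  # restore document order
```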
Oracle ConText
Sentence Extraction: Example
SIGIR ’95 paper on summarization by Kupiec, Pedersen, and Chen
Trainable sentence extraction
The proposed algorithm is applied to its own description (the paper)
Sentence Extraction: Example
Feature Representation
Fixed-phrase feature: certain phrases indicate a summary, e.g. “in summary”
Paragraph feature: paragraph-initial/final sentences are more likely to be important
Thematic word feature: repetition is an indicator of importance
Uppercase word feature: uppercase often indicates named entities (Taylor)
Sentence length cut-off: a summary sentence should be > 5 words
Feature Representation (cont.)
Sentence length cut-off: summary sentences have a minimum length
Fixed-phrase feature: true for sentences with an indicator phrase, e.g. “in summary”, “in conclusion”
Paragraph feature: paragraph initial/medial/final
Thematic word feature: do any of the most frequent content words occur?
Uppercase word feature: is an uppercase thematic word introduced? (sketch below)
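A rough sketch of computing these five features for one sentence; the indicator-phrase list follows the slide, while the non-initial-capital test for named entities and the exact thematic-word set are stand-in assumptions:

```python
INDICATOR_PHRASES = ("in summary", "in conclusion")

def sentence_features(sentence, para_position, thematic_words):
    """para_position: 'initial', 'medial', or 'final' in its paragraph;
    thematic_words: set of the document's most frequent content words."""
    tokens = sentence.split()
    return {
        "length_ok": len(tokens) > 5,          # sentence length cut-off
        "fixed_phrase": any(p in sentence.lower() for p in INDICATOR_PHRASES),
        "paragraph": para_position,
        "thematic": any(t.lower() in thematic_words for t in tokens),
        "uppercase": any(t[0].isupper() for t in tokens[1:]),  # crude NE proxy
    }
```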
Training
Hand-label sentences in training set (good/bad summary sentences)
Train classifier to distinguish good/bad summary sentences
Model used: Naïve Bayes
Can rank sentences according to score and show the top n to the user (scoring sketched below).
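The scoring side, sketched as a Naive Bayes log-score over the feature dictionary above; the prior and per-feature likelihood tables would be estimated from the hand-labeled sentences, and the 1e-6 floor for unseen values is an assumed smoothing choice:

```python
import math

def nb_score(features, prior, likelihoods):
    """log P(summary) + sum over features of log P(value | summary).

    This is the Naive Bayes numerator; it is constant in P(features),
    so ranking by it ranks sentences by P(summary | features).
    """
    score = math.log(prior)
    for f, value in features.items():
        score += math.log(likelihoods.get(f, {}).get(value, 1e-6))
    return score
```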
Evaluation
Compare extracted sentences with sentences in abstracts
Evaluation
Baseline (choose the first n sentences): 24%
Overall performance (42-44%) not very good
However, there is more than one good summary.
Multi-Document (MD) Summarization
Summarize more than one document
Why is this harder?
But the benefit is large (can’t scan 100s of docs)
To do well, need to adopt a more specific strategy depending on the document set
Other components needed for a production system, e.g., manual postediting
DUC: government-sponsored bake-off
200- or 400-word summaries
Longer -> easier
Types of MD Summaries
Single event/person tracked over a long time period
Elizabeth Taylor’s bout with pneumonia
Give extra weight to the character/event
May need to include the outcome (dates!)
Multiple events of a similar nature
Marathon runners and races
More broad brush, ignore dates
An issue with related events
Gun control
Identify key concepts and select sentences accordingly
Determine MD Summary Type
First, determine which type of summary to generate
Compute all pairwise similarities
Very dissimilar articles -> multi-event (marathon)
Mostly similar articles:
Is the most frequent concept a named entity?
Yes -> single event/person (Taylor)
No -> issue with related events (gun control)
(decision rule sketched below)
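A sketch of this decision rule, assuming the documents come as term vectors (at least two) and that named-entity detection of the most frequent concept is done elsewhere; the cut-off θ = 0.3 is illustrative:

```python
def md_summary_type(doc_vectors, top_concept_is_named_entity, theta=0.3):
    """Pick a multi-document summarization strategy; reuses cosine()."""
    pairs = [(a, b) for i, a in enumerate(doc_vectors)
             for b in doc_vectors[i + 1:]]
    avg_sim = sum(cosine(a, b) for a, b in pairs) / len(pairs)
    if avg_sim < theta:
        return "multi-event"             # e.g., marathon races
    if top_concept_is_named_entity:
        return "single-event-or-person"  # e.g., Elizabeth Taylor
    return "issue-with-related-events"   # e.g., gun control
```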
MultiGen Architecture (Columbia)
Generation
Ordering according to date
Intersection: find concepts that occur repeatedly in a time chunk
Sentence generator
Processing
Selection of good summary sentences
Elimination of redundant sentences
Replacement of anaphors/pronouns with the noun phrases they refer to (needs coreference resolution)
Deletion of non-central parts of sentences
Performance (Columbia System)
(1) Precision and recall on “model units” (facts)
(2) Coherence, grammaticality, readability
Newsblaster (Columbia)
Query-Specific Summarization
So far, we’ve looked at generic summaries. A generic summary makes no assumption about the reader’s interests.
Query-specific summaries are specialized for a single information need, the query.
Summarization is much easier if we have a description of what the user wants.
Recall from last quarter:
Google-type excerpts simply show keywords in context (toy sketch below)
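A toy version of such an excerpt, showing each query term with a fixed amount of surrounding context (the window width is arbitrary):

```python
def kwic_excerpt(text, query_terms, width=40):
    """Keyword-in-context snippet: each term plus `width` chars of context."""
    lower = text.lower()
    snippets = []
    for term in query_terms:
        i = lower.find(term.lower())
        if i >= 0:  # term occurs: cut a window around its first occurrence
            start = max(0, i - width)
            snippets.append("..." + text[start:i + len(term) + width].strip() + "...")
    return " ".join(snippets)
```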
Genre
Some genres are easy to summarize:
Newswire stories: inverted pyramid structure; the first n sentences are often the best summary of length n
Some genres are hard to summarize:
Poems
Long documents (novels, the Bible)
Trainable summarizers are genre-specific.
Non-Text Summaries
Summarization is also important for non-text:
Speech (phone conversations, radio)
Video (surveillance, TV)
Similar techniques are used. Text is easier to scan than speech/video.
Discussion
Correct parsing of the document format is critical; need to know headings, sequence, etc.
Limits of current technology: some good summaries require natural language understanding
Example: President Bush’s nominees for ambassadorships:
Contributors to Bush’s campaign
Veteran diplomats
Others
Coreference Resolution
Coreference
Two noun phrases referring to the same entity are said to corefer.
Example: Transcription from RL95-2 is mediated through an ERE element at the 5′-flanking region of the gene.
Coreference resolution is important for many text mining tasks:
Information extraction
Summarization
First story detection
Types of Coreference
Noun phrases: Transcription from RL95-2 … the gene …
Pronouns: They induced apoptosis.
Possessives: … induces their rapid dissociation …
Demonstratives: This gene is responsible for Alzheimer’s
Preferences in Pronoun Interpretation
Recency: John has an Integra. Bill has a Legend. Mary likes to drive it.
Grammatical role: John went to the Acura dealership with Bill. He bought an Integra.
Non-ambiguity: John and Bill went to the Acura dealership. He bought an Integra.
Repeated mention: John needed a car to go to his new job. He decided that he wanted something sporty. Bill went to the Acura dealership with him. He bought an Integra.
Copyright: D. Radev
Preferences in Pronoun Interpretation
Parallelism: Mary went with Sue to the Acura dealership. Sally went with her to the Mazda dealership.
??? Mary went with Sue to the Acura dealership. Sally told her not to buy anything.
Verb semantics: John telephoned Bill. He lost his pamphlet on Acuras. John criticized Bill. He lost his pamphlet on Acuras.
Algorithm for Coreference Resolution
Two steps: discourse model update and pronoun resolution.
Salience values are introduced when a noun phrase that evokes a new entity is encountered.
Salience factors: set empirically.
Salience Weights (Lappin&Leass)
Sentence recency: 100
Subject emphasis: 80
Existential emphasis: 70
Accusative emphasis: 50
Indirect object and oblique complement emphasis: 40
Non-adverbial emphasis: 50
Head noun emphasis: 80
Lappin&Leass (cont’d)
Recency: weights are cut in half after each sentence is processed.
Examples:
An Acura Integra is parked in the lot.
There is an Acura Integra parked in the lot.
John parked an Acura Integra in the lot.
John gave Susan an Acura Integra.
In his Acura Integra, John showed Susan his new CD player.
Algorithm (Lappin&Leass)
1. Collect the potential referents (up to four sentences back).
2. Remove potential referents that do not agree in number or gender with the pronoun.
3. Remove potential referents that do not pass intrasentential syntactic coreference constraints.
4. Compute the total salience value of the referent by adding any applicable values for role parallelism (+35) or cataphora (-175).
5. Select the referent with the highest salience value. In case of a tie, select the closest referent in terms of string position.
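A loose sketch of steps 1, 2, 4, and 5 in Python (step 3’s syntactic filters and real agreement checking are stubbed out); mentions are simplified to dicts, and the weights refer to the salience table above:

```python
ROLE_WEIGHTS = {"subject": 80, "existential": 70, "accusative": 50,
                "indirect": 40, "non_adverbial": 50, "head_noun": 80}

def agrees(pronoun, referent):
    # Stub for step 2: number and gender agreement.
    return (pronoun["number"] == referent["number"]
            and pronoun["gender"] in (referent["gender"], "any"))

def resolve(pronoun, referents, recency=100, parallel=35, cataphora=-175):
    """Resolve one pronoun against candidate referent mentions.

    Each referent dict carries 'sentence' (index), 'position' (string
    offset), 'roles', 'number', 'gender'; the pronoun dict carries the
    same plus its single grammatical 'role'.
    """
    cur, best, best_score = pronoun["sentence"], None, float("-inf")
    for r in referents:
        age = cur - r["sentence"]
        if not (0 <= age <= 4) or not agrees(pronoun, r):  # steps 1-2
            continue
        score = recency + sum(ROLE_WEIGHTS[role] for role in r["roles"])
        score /= 2 ** age                 # recency: halve per sentence
        if pronoun["role"] in r["roles"]:
            score += parallel             # step 4: role parallelism
        if age == 0 and r["position"] > pronoun["position"]:
            score += cataphora            # step 4: referent follows pronoun
        closer = best is not None and (
            abs(pronoun["position"] - r["position"])
            < abs(pronoun["position"] - best["position"]))
        if score > best_score or (score == best_score and closer):
            best, best_score = r, score   # step 5: highest salience wins
    return best
```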
Example
John saw a beautiful Acura Integra at the dealership last week. He showed it to Bob. He bought it.
Referent     Rec   Subj  Exist  Obj/IndObj  NonAdv  HeadN  Total
John         100    80     -         -        50      80    310
Integra      100     -     -        50        50      80    280
dealership   100     -     -         -        50      80    230
Example (cont’d)
Referent     Phrases                         Value
John         {John}                          165
Integra      {a beautiful Acura Integra}     140
dealership   {the dealership}                115
Example (cont’d)
Referent     Phrases                         Value
John         {John, he1}                     475
Integra      {a beautiful Acura Integra}     140
dealership   {the dealership}                115
Example (cont’d)
Referent     Phrases                          Value
John         {John, he1}                      475
Integra      {a beautiful Acura Integra, it}  400
dealership   {the dealership}                 115
Example (cont’d)
Referent     Phrases                          Value
John         {John, he1}                      475
Integra      {a beautiful Acura Integra, it}  400
Bob          {Bob}                            270
dealership   {the dealership}                 115
Example (cont’d)
Referent     Phrases                           Value
John         {John, he1}                       237.5
Integra      {a beautiful Acura Integra, it1}  200
Bob          {Bob}                             135
dealership   {the dealership}                  57.5
Observations
Lappin & Leass: tested on computer manuals; 86% accuracy on unseen data.
Centering (Grosz, Joshi, Weinstein): introduces the additional concept of a “center”.
Centering has not been automatically tested on actual data.
MUC Information Extraction: State of the Art c. 1997
NE: named entity recognition
CO: coreference resolution
TE: template element construction
TR: template relation construction
ST: scenario template production
Resources
[4] UMass at TDT 2000. Allan, Lavrenko, Frey, Khandelwal (UMass, 2000)
[6] Learning Approaches for Detecting and Tracking News Events. Yang, Carbonell, Brown (CMU, 1999)
A Study on Retrospective and On-line Event Detection. Yang, Pierce, Carbonell
Newsblaster: http://www.cs.columbia.edu/nlp/newsblaster/
A Trainable Document Summarizer (1995). Julian Kupiec, Jan Pedersen, Francine Chen. Research and Development in Information Retrieval (SIGIR)
The Columbia Multi-Document Summarizer for DUC 2002. K. McKeown, D. Evans, A. Nenkova, R. Barzilay, V. Hatzivassiloglou, B. Schiffman, S. Blair-Goldensohn, J. Klavans, S. Sigelman (Columbia University)
Coreference, a detailed discussion of the term: http://www.ldc.upenn.edu/Projects/ACE/PHASE2/Annotation/guidelines/EDT/coreference.shtml