A Trainable Document Summarizer
Julian Kupiec, Jan Pedersen & Francine Chen
ACM SIGIR ’95


Presented by Mat Kelly
CS895 – Web-based Information Retrieval
Old Dominion University
November 22, 2011

The Automatic Creation of Literature Abstracts
H. P. Luhn
IBM Journal of R&D, 1958

Luhn’s Objectives

• Exploration into automatic methods of obtaining abstracts

• Selects sentences that are most representative of pertinent info

• Citations of author’s own statements constitute “auto-abstract”

Which sentences are best?

• Establish a significance factor
– Frequency of word occurrence is a measure of word significance
– Relative position of significant words within a sentence is a measure for determining the significance of the sentence
• Why does this work?
– Writer repeats certain words as he elaborates

Over-Simplification

• Method does not differentiate words with same stem
– Letter-by-letter analysis to determine probability of same stem
• While an author will opt for synonymous word choices, s/he will eventually run out and resort to repetition.

polic = { policing, policy, police }

Premise

• No consideration given to meaning of words.
• Instead, the closer certain words are associated, the more specifically an aspect of the subject is being treated
• Where the greatest number of frequently occurring different words are found close to each other, the probability is high that the information is most representative of the article.
• Criterion is the relationship of significant words to each other rather than their distribution over the whole sentence.
• Consider only portions of sentences that are bracketed by significant words; disregard words beyond the limit when considering the current bracket.
• Useful limit found is 4-5 non-significant words between significant words

Computing Significance Factor

1. Determine extent of cluster by bracketing
2. Count # of significant words in cluster
3. Divide square of this # by total # of words in cluster

Tested on 50 articles of 300-4500 words each, compared against 100-person manual generation

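[Figure: a portion of a sentence bracketed by significant words; four significant words (*) fall within a bracketed span of seven words (1 to 7)]

A portion of a sentence is bracketed
If significant words are not more than 4 apart, whole sentence is cited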
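The bracketing and scoring steps above can be sketched in Python. This is a minimal illustration rather than Luhn's actual procedure: the function name, the `significant` set and the `max_gap` parameter are assumptions, with `max_gap=4` taken from the "not more than 4 apart" limit on the slide.

```python
def significance_factor(words, significant, max_gap=4):
    """Return the best cluster score for one sentence.

    words: list of tokens in the sentence.
    significant: set of lowercased significant words (assumed to be given).
    max_gap: most non-significant words allowed between two significant
             words of the same cluster (assumption: 4).
    """
    positions = [i for i, w in enumerate(words) if w.lower() in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]   # the cluster is bracketed by significant words
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= max_gap:
            count += 1            # close enough: same cluster
        else:
            # close the current cluster: (significant count)^2 / cluster length
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))
```

For a seven-word cluster containing four significant words, this evaluates 4²/7, roughly 2.3.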

• Resolving power depends on total # of words in article and decreases as total # of words increases
• Overcome by running on subdivisions of article, highest ranking sentences combined to form abstract
– Divisions might already exist with paper’s organization
– Otherwise, divided arbitrarily and overlapping

Procedures

• Abstracts prepared by first punching the text on cards(!)
• Pronouns & prepositions deleted by a lookup routine
• Rest of words sorted alphabetically
• Words with common beginnings consolidated (rudimentary form of stemming)
– Produced errors up to 5% but did not affect results
• Words with low frequency removed; remaining words were marked as significant
• Sentence significance then computed with previous formula
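A rough Python sketch of the word-selection steps in this procedure. The stop list, the fixed-prefix "stemming" and the frequency threshold are all hypothetical stand-ins; the slides do not give Luhn's actual lookup list, consolidation rule or cutoff.

```python
from collections import Counter

# Hypothetical stop list standing in for the pronoun/preposition lookup.
STOP_WORDS = {"the", "a", "of", "in", "to", "and", "he", "she", "it",
              "by", "on", "for", "is"}

def crude_stem(word, prefix_len=5):
    """Stand-in for consolidating words with common beginnings:
    keep only a fixed-length prefix (assumption, not Luhn's rule)."""
    return word[:prefix_len]

def significant_stems(text, min_count=2, prefix_len=5):
    """Return stems that occur at least min_count times after stop-word removal."""
    tokens = [t.strip(".,;:()\"'").lower() for t in text.split()]
    stems = [crude_stem(t, prefix_len) for t in tokens
             if t and t not in STOP_WORDS]
    counts = Counter(stems)
    # Low-frequency stems are dropped; the rest are treated as significant.
    return {stem for stem, c in counts.items() if c >= min_count}
```

With a prefix length of 5, "police", "policy" and "policing" all collapse to "polic", which also shows how such consolidation can merge unrelated words.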

Abstract Creation with Result

• Apply cutoff value of sentence significance
• Fixed number of sentences required irrespective of document length
• Sentences could be weighted by assigning premium value to predetermined set of words if article is of special interest
• If no sentences meet threshold, reject article as too general for purpose of auto-abstracting
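The selection rules above could look like this in Python. It is only a sketch: `select_sentences`, `cutoff` and `max_sentences` are illustrative names, and returning `None` to signal a rejected article is an assumption.

```python
def select_sentences(scored, cutoff=None, max_sentences=None):
    """scored: list of (sentence, significance) pairs in document order.

    cutoff: minimum significance a sentence must reach, if given.
    max_sentences: fixed number of sentences to keep, if given.
    Returns the chosen sentences in document order, or None when no
    sentence qualifies (article treated as too general to abstract).
    """
    kept = [(i, sent, value) for i, (sent, value) in enumerate(scored)
            if cutoff is None or value >= cutoff]
    if max_sentences is not None:
        # Keep the highest-scoring sentences, then restore document order.
        kept = sorted(kept, key=lambda t: t[2], reverse=True)[:max_sentences]
        kept.sort(key=lambda t: t[0])
    if not kept:
        return None
    return [sent for _, sent, _ in kept]
```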

Example

Two major recent developments have called the attention of chemists, physiologists, physicists and other scientists to mental diseases: It has been found that extremely minute quantities of chemicals can induce hallucinations and bizarre psychic disturbances in normal people, and mood-altering drugs (tranquilizers, for instance) have made long-institutionalized people amenable to therapy. (4.0)

This poses new possibilities for studying brain chemistry changes in health and sickness and their alleviation, the California researchers emphasized. (5.4) The new studies of brain chemistry have provided practical therapeutic results and tremendous encouragement to those who must care for mental patients. (5.4)

Generated Abstract

Conclusions

• Method proved feasible
• Highly reliable, consistent and stable, unlike manual creation
• Possibility that author’s style causes inferior sentences to be promoted
• Method helps to realize savings in human effort

Significant Words → Significant Sentences → Inclusion in Abstract

Kupiec’s Objective

• Motive: provide intermediate point between document title and full text (i.e. abstract)

• Documents as short as 20% of the original can be as informative as the full text*

• Extracts can be non-unique
• Combination of numerous methods (including Luhn’s) would have the best performance.

* A.H. Morris, G.M. Kasper, and D.A. Adams. The effects and limitations of automated text condensing on reading comprehension performance. Information Systems Research, pages 17-35, March 1992

A Statistical Classification Problem
• Have training set of documents with manually extracted abstracts
• Develop a classification function that estimates the probability that a given sentence is included in the abstract
• From this, generate new abstracts by ranking sentences according to this probability and selecting a user-specified # of top-scoring sentences.

[Diagram: Given Sentences → Feature 1, Feature 2, …, Feature n (each contributes to S’s score) → Determine P() of abstract inclusion using Bayesian classifier → SCORE → Inclusion Threshold]
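The train-then-rank idea can be sketched in Python with counting-based estimates. This assumes a naive Bayes form with independent, discrete features; the function names and the handling of unseen feature values are illustrative assumptions, not the paper's implementation.

```python
from collections import Counter

def train(examples):
    """examples: list of (features: dict, in_summary: bool), one per sentence."""
    n = len(examples)
    n_pos = sum(1 for _, in_summary in examples if in_summary)
    seen_all, seen_pos = Counter(), Counter()
    for features, in_summary in examples:
        for item in features.items():   # item = (feature name, discrete value)
            seen_all[item] += 1
            if in_summary:
                seen_pos[item] += 1
    return n, n_pos, seen_all, seen_pos

def score(features, model):
    """Naive Bayes score: P(s in S) * prod_j P(F_j | s in S) / P(F_j)."""
    n, n_pos, seen_all, seen_pos = model
    if n_pos == 0:
        return 0.0
    result = n_pos / n                               # P(s in S)
    for item in features.items():
        if seen_all[item] == 0:
            continue  # assumption: ignore feature values never seen in training
        p_given_summary = seen_pos[item] / n_pos     # P(F_j | s in S)
        p_overall = seen_all[item] / n               # P(F_j)
        result *= p_given_summary / p_overall
    return result

def top_sentences(sentences, feature_list, model, how_many):
    """Rank sentences by score and keep the user-specified number."""
    ranked = sorted(zip(sentences, feature_list),
                    key=lambda pair: score(pair[1], model), reverse=True)
    return [sentence for sentence, _ in ranked[:how_many]]
```

Because independence is only assumed, the product is best read as a ranking score rather than a calibrated probability.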

• Evaluation criterion: classification success rate/precision
• Requires corpus (expensive)
– Acquired from non-profit Engineering Information Co. – used as basis for experiments
• All previous methods assume that documents exist in isolation

Features
Experimentally Obtained
• Sentence Length Cutoff – short sentences are not usually included in summaries; threshold of 5 words
• Fixed-Phrase – sentences containing phrases from a fixed list, and those after headings such as “Summary”, “Conclusions”, etc., are likely to be in summaries
• Paragraph – consider first 10 ¶ and last 5 ¶
• Thematic Word – score sentences according to their inclusion of words within the theme
• Uppercase Word – e.g. proper names, scored similarly to thematic words; the sentence in which such a word first appears scores double that of later occurrences
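A simplified Python sketch of how such discrete features might be computed per sentence. Everything here is hypothetical: the phrase list, the feature names, the tokenization and the exact tests are illustrative stand-ins, and the doubling rule for first uppercase occurrences is left out for brevity.

```python
# Hypothetical phrase list; the list used in the paper is not given in the slides.
FIXED_PHRASES = ("in conclusion", "in summary")

def sentence_features(sentence, para_index, n_paragraphs, thematic_words):
    """Return a dict of discrete features for one sentence.

    para_index: 0-based index of the paragraph containing the sentence.
    thematic_words: set of lowercased thematic words (assumed to be given).
    """
    words = sentence.split()
    lowered = sentence.lower()
    if para_index < 10:
        position = "first"            # within the first 10 paragraphs
    elif para_index >= n_paragraphs - 5:
        position = "last"             # within the last 5 paragraphs
    else:
        position = "other"
    return {
        "long_enough": len(words) > 5,   # length cutoff of 5 words
        "fixed_phrase": any(p in lowered for p in FIXED_PHRASES),
        "paragraph": position,
        "thematic": any(w.strip(".,").lower() in thematic_words for w in words),
        # Assumption: skip the sentence-initial word, which is always capitalized.
        "uppercase": any(w[0].isupper() for w in words[1:]),
    }
```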

Classifier

• For each sentence s, determine the probability that it will be included in summary S given k features
– Since all features are discrete, the equation can be put in terms of probabilities rather than likelihoods.
– Results in a simple Bayesian classification function that assigns each sentence s a score, used to select sentences for inclusion in the summary
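Written out, the classification function takes the usual Bayes-rule form. The notation here is assumed for illustration: s is a sentence, S the summary, and F_1, …, F_k the k features; the second line additionally assumes the features are statistically independent.

```latex
\begin{align*}
P(s \in S \mid F_1, \dots, F_k)
  &= \frac{P(F_1, \dots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, \dots, F_k)} \\
  &= \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\; P(s \in S)}{\prod_{j=1}^{k} P(F_j)}
\end{align*}
```

P(s ∈ S) is a constant, and P(F_j | s ∈ S) and P(F_j) can be estimated from the training set by counting occurrences.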

About the Corpus

• Articles without abstracts; abstracts were created manually after the fact

• 188 document/summary pairs from 21 publications in scientific/technical domains

• Summary avg length is 3 sentences

Sentence Matching
• Using manually created abstracts, match to sentences in original document
• Direct match – verbatim or with minor modifications
• Direct join – 2 or more sentences used to make summary sentence
• Unmatchable – suspected fabrication without using sentences in document
• Incomplete –
– Some overlap exists but content is not preserved in summary
– Summary sentence includes content from original but contains other information that is not covered by a direct join

[Diagram: abstract sentences linked to document sentences by direct match and direct join]

Evaluation

• Insufficient data for separate test corpus, used cross-validation strategy for evaluation

• Documents from a journal were selected for testing one at a time, all other document summary pairs were used for training

• Results were summed over journals
• Unmatchable/incomplete sentences were excluded from training and testing, leaving 498 unique sentences
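One way to read this cross-validation strategy is as a leave-one-document-out loop with results summed per journal. The sketch below assumes that reading; `train_fn` and `eval_fn` are hypothetical callables supplied by the caller.

```python
from collections import defaultdict

def cross_validate(pairs, train_fn, eval_fn):
    """pairs: list of (journal, document, summary) tuples.

    train_fn(training_pairs) -> model
    eval_fn(model, document, summary) -> (n_correct, n_total)
    Returns {journal: (n_correct, n_total)} summed over held-out documents.
    """
    totals = defaultdict(lambda: [0, 0])
    for i, (journal, document, summary) in enumerate(pairs):
        # Hold out one document; train on every other document/summary pair.
        training = pairs[:i] + pairs[i + 1:]
        model = train_fn(training)
        correct, total = eval_fn(model, document, summary)
        totals[journal][0] += correct
        totals[journal][1] += total
    return {journal: tuple(counts) for journal, counts in totals.items()}
```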

Evaluating Performance

• Fraction of manual summary sentences that can be reproduced is limited by text excerpting: (451+19)/568 = 83%

                                  # Sentences   Fraction of Corpus
Direct Sentence Matches           451           79%
Direct Joins                      19            3%
Unmatchable Sentences             50            9%
Incomplete Single Sentences       21            4%
Incomplete Joins                  27            5%
Total Manual Summary Sentences    568

• Sentence produced is correct if:
– It has a direct sentence match and is present in the manual summary
– or –
– It is in the manual summary as part of a direct join and all other components of the join have been produced
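The correctness criterion above can be expressed as a small Python predicate. The data representation (sentence ids, a set of direct matches, a list of frozensets for joins) is an assumption made for illustration.

```python
def is_correct(produced, direct_matches, direct_joins, selected):
    """Decide whether one produced sentence counts as correct.

    produced: id of a document sentence chosen by the summarizer.
    direct_matches: set of document sentence ids with a direct match
                    in the manual summary.
    direct_joins: list of frozensets, each the document sentence ids
                  that together form one manual summary sentence.
    selected: set of all sentence ids the summarizer produced.
    """
    if produced in direct_matches:
        return True
    # Part of a join: every component of that join must have been produced.
    return any(produced in join and join <= selected for join in direct_joins)
```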

Distribution of Correspondence in Training Corpus

[Figure: percent of sentences correct (y-axis, 0 to 100) versus number of sentences (x-axis, 1 to 40)]

Results

• Of 568 sentences:
– 195 direct matches + 6 direct joins = 201 correctly identified summary sentences (35% replication)
• Manual summary generation has only 25% overlap between people and 55% for the same person over time.
• 211/498 (42%) sentences correctly identified by the summarizer
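The percentages on this slide follow directly from the counts. A short Python check, where the variable names are illustrative and the inputs are the figures quoted in the slides:

```python
# Counts quoted in the slides.
total_summary_sentences = 568
correct_matches, correct_joins = 195, 6
identified, unique_sentences = 211, 498

# Share of all manual summary sentences that were replicated.
replication = (correct_matches + correct_joins) / total_summary_sentences

# Share of the matchable (unique) sentences that were identified.
fraction_identified = identified / unique_sentences

print(f"replication: {replication:.1%}")          # about 35%
print(f"identified:  {fraction_identified:.1%}")  # about 42%
```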

Conclusions

• For summaries 25% the size of the document
– 84% of the sentences selected by professionals were also selected by the summarizer
• For smaller summaries, improvement of 74% observed vs. simply presenting the beginning of the document.

Comparing the Processes

• Luhn: Significant Words → Significant Sentences → Inclusion in Abstract
• Kupiec: Given Sentences → features, each contributing to S’s score (|Sentence| < 5; ↑ After fixed phrase; Prior. 1st & Last ¶s; ↑ Thematic Words; ↑ Capitalized, non-unit words) → Determine P() of abstract inclusion using Bayesian classifier → SCORE → Inclusion Threshold

References

• H.P. Luhn. The automatic creation of literature abstracts. IBM J. Res. Develop., 2:159-165, 1958

• Julian Kupiec, Jan Pedersen, and Francine Chen. A trainable document summarizer. In Proc. of the 18th Annual International ACM/SIGIR Conference, pages 68-73, Seattle, WA, 1995
