
TEXT SUMMARIZATION

Abstract

This chapter describes research and development on the automated creation of summaries of one or more texts. It presents an overview of the principal approaches in summarization, describes the design, implementation, and performance of various summarization systems, and reviews methods of evaluating summaries.

The Nature of Summaries

Early experimentation in the late 1950s and early 1960s suggested that text summarization by computer was feasible though not straightforward (Luhn; Edmundson). After a hiatus of some decades, progress in language processing, coupled with great increases in computer memory and speed, and the growing presence of on-line text (in corpora and especially on the web) renewed interest in automated text summarization.

Despite encouraging results, some fundamental questions remain unaddressed. For example, no one seems to know exactly what a summary is. In this chapter we use summary as a generic term and define it as follows:


Definition: a summary is a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s).

'Text' here includes multimedia documents, on-line documents, hypertexts, etc. Of the many types of summary that have been identified (Spärck Jones; Hovy and Lin), indicative summaries (that provide an idea of what the text is about without giving any content) and informative ones (that do provide some shortened version of the content) are often referenced. Extracts are summaries created by reusing portions (words, sentences, etc.) of the input text verbatim, while abstracts are created by re-generating the extracted content.

The next section outlines the principal approaches to automated text summarization in general, and the section after it reviews particular techniques used in several summarization systems. Problems unique to multi-document summarization are discussed after that. Finally, although the evaluation of summaries (and of summarization) is not yet well understood, we review approaches to evaluation in the closing section.

The Stages of Automated Text Summarization

Researchers in automated text summarization have identified three distinct stages (Spärck Jones; Hovy and Lin; Mani and Maybury). Most systems today embody the first stage only.

The first stage, topic identification, produces the simplest type of summary. (We define topic as a particular subject that we write about or discuss.) Whatever criterion of importance is used, once the system has identified the most important unit(s) (words, sentences, paragraphs, etc.), it can either simply list them (thereby creating an extract) or display them diagrammatically (thereby creating a schematic summary). Typically, topic identification is achieved using several complementary techniques. We discuss topic identification below.

In many genres, humans' summaries reflect their own interpretation: fusion of concepts, evaluation, and other processing. This stage generally occurs after topic identification. Since the result is something new, not explicitly contained in the input, this stage requires that the system have access to knowledge separate from the input. Given the difficulty of building domain knowledge, few existing systems perform interpretation, and no system includes more than a small domain model. We discuss interpretation below.


The results of interpretation are usually unreadable abstract representations, and even extracts are seldom coherent, due to dangling references, omitted discourse linkages, and repeated or omitted material. Systems therefore include a stage of summary generation to produce human-readable text. In the case of extracts, generation may simply mean 'smoothing' the extracted pieces into a coherent, densely phrased text. We discuss generation below.

Review of Summarization Methods

Stage 1: Topic identification

To perform this stage, almost all systems employ several independent modules. Each module assigns a score to each unit of input (word, sentence, or longer passage); then a combination module combines the scores for each unit to assign a single integrated score to it; finally, the system returns the n highest-scoring units, according to the summary length requested by the user.
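To make the pipeline concrete, here is a minimal sketch (not from the chapter; the module names, weights, and scores are invented for illustration) of how several per-unit module scores might be combined and the n highest-scoring sentences returned:

```python
# Illustrative sketch: combine per-sentence scores from several independent
# modules with fixed weights, then return the n highest-scoring sentences
# in their original document order.

def combine_and_select(module_scores, weights, n):
    # module_scores: list of dicts, one per module, mapping sentence index -> score
    combined = {}
    for scores, weight in zip(module_scores, weights):
        for idx, score in scores.items():
            combined[idx] = combined.get(idx, 0.0) + weight * score
    top = sorted(combined, key=combined.get, reverse=True)[:n]
    return sorted(top)  # present the extract in text order

# Example with two invented modules: position favours early sentences,
# frequency favours sentence 5.
position = {0: 1.0, 1: 0.8, 2: 0.6, 5: 0.1}
frequency = {0: 0.2, 2: 0.3, 5: 0.9}
print(combine_and_select([position, frequency], weights=[0.5, 0.5], n=2))  # -> [0, 5]
```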

An open issue is the size of the unit of text that is scored for extraction. Most systems focus on one sentence at a time. However, Fukushima, Ehara, and Shirai show that extracting subsentence-size units produces shorter summaries with more information. On the other hand, Strzalkowski et al. show that including certain sentences immediately adjacent to important sentences increases coherence (fewer dangling pronoun references, etc.).

The performance of topic identification modules is usually measured using Recall and Precision scores (see the section on evaluation below). Given an input text, a human's extract, and a system's extract, these scores quantify how closely the system's extract corresponds to the human's. For each text, we let correct = the number of sentences extracted by both the system and the human; wrong = the number of sentences extracted by the system but not by the human; and missed = the number of sentences extracted by the human but not by the system. Then

Precision = correct / (correct + wrong)

Recall = correct / (correct + missed)

so that Precision reflects how many of the system's extracted sentences were good, and Recall reflects how many good sentences the system missed.
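In code, with extracts represented as sets of sentence indices, the computation is simply (a sketch, not tied to any particular system):

```python
# Recall and Precision of a system extract against a single human extract,
# with extracts given as sets of sentence indices.

def precision_recall(system, human):
    system, human = set(system), set(human)
    correct = len(system & human)   # extracted by both system and human
    wrong = len(system - human)     # extracted by the system only
    missed = len(human - system)    # extracted by the human only
    precision = correct / (correct + wrong) if system else 0.0
    recall = correct / (correct + missed) if human else 0.0
    return precision, recall

print(precision_recall({1, 4, 7, 9}, {1, 4, 8}))  # -> (0.5, 0.666...)
```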

Positional criteria. Thanks to regularities in the text structure of many genres, certain locations of the text (headings, titles, first paragraphs, etc.) tend to contain important information. The simple method of taking the lead (first paragraph) as summary often outperforms other methods, especially with newspaper articles (Brandow, Mitze, and Rau). Some variation of the position method appears in Baxendale; Edmundson; Donlan; Kupiec, Pedersen, and Chen; Teufel and Moens; and Strzalkowski et al. Kupiec et al. and Teufel and Moens both list this as the single best individual method for news, scientific, and technical articles.

In order to automatically determine the best positions, and to quantify their utility, Lin and Hovy define the genre- and domain-oriented Optimum Position Policy (OPP) as a ranked list of sentence positions that on average produce the highest yields for extracts, and describe an automated procedure to create OPPs given texts and extracts.
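As a rough sketch of the idea (the policy shown is invented for illustration, not one of Lin and Hovy's published OPPs), sentences can be scored by how early their position appears in the ranked policy; the plain 'lead' baseline is the special case in which the policy is simply [0, 1, 2, ...]:

```python
# Position-based scoring driven by an Optimum-Position-Policy-style ranked
# list of sentence positions (best-yielding positions first).

def opp_scores(num_sentences, policy):
    rank = {pos: r for r, pos in enumerate(policy)}
    worst = len(policy)
    return {i: 1.0 - rank.get(i, worst) / worst for i in range(num_sentences)}

# Invented policy for a hypothetical news-like genre: early positions first.
print(opp_scores(6, policy=[0, 1, 4]))
# -> {0: 1.0, 1: 0.666..., 2: 0.0, 3: 0.0, 4: 0.333..., 5: 0.0}
```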

Cue phrase indicator criteria. Since in some genres certain words and phrases ('significant', 'in this paper we show') explicitly signal importance, sentences containing them should be extracted. Teufel and Moens report encouraging joint recall and precision figures, using a manually built list of cue phrases in a genre of scientific texts.

Each cue phrase has a (positive or negative) 'goodness score', also assigned manually. In later work, Teufel and Moens expand their method to argue that rather than single sentences, these cue phrases signal the nature of the multi-sentence rhetorical blocks of text in which they occur (such as Purpose/Problem, Background, Solution/Method, Conclusion/Claim).

Word and phrase frequency criteria. Luhn used Zipf's Law of word distribution (a few words occur very often, fewer words occur somewhat often, and many words occur infrequently) to develop the following extraction criterion: if a text contains some words unusually frequently, then sentences containing these words are probably important.
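A minimal sketch of this criterion (the stop-word list and the 'unusually frequent' threshold are simplified for illustration):

```python
# Luhn-style scoring: sentences containing words that occur unusually often
# in this text score higher.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "for"}

def frequency_scores(sentences, min_count=2):
    token_lists = [[w.lower().strip(".,;:") for w in s.split()] for s in sentences]
    content = [[w for w in toks if w and w not in STOPWORDS] for toks in token_lists]
    freq = Counter(w for toks in content for w in toks)
    # A sentence's score is the summed frequency of its 'unusually frequent' words.
    return {i: sum(freq[w] for w in toks if freq[w] >= min_count)
            for i, toks in enumerate(content)}
```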

The systems of Luhn; Edmundson; Kupiec, Pedersen, and Chen; Teufel and Moens; Hovy and Lin; and others employ various frequency measures, and report varying levels of recall and precision using word frequency alone. But both Kupiec et al. and Teufel and Moens show that word frequency in combination with other measures is not always better. Witbrock and Mittal compute a statistical model describing the likelihood that each individual word in the text will appear in the summary, in the context of certain features (part-of-speech tag, word length, neighbouring words, average sentence length, etc.). The generality of this method (also across languages) makes it attractive for further study.

Query and title overlap criteria. A simple but useful method is to score each sentence by the number of desirable words it contains. Desirable words are, for example, those contained in the text's title or headings (Kupiec, Pedersen, and Chen; Teufel and Moens; Hovy and Lin), or in the user's query, for a query-based summary (Buckley and Cardie; Strzalkowski et al.; Hovy and Lin). The query method is a direct descendant of IR techniques (see the chapter on information retrieval elsewhere in this volume).

Cohesive or lexical connectedness criteria. Words can be connected in various ways, including repetition, coreference, synonymy, and semantic association as expressed in thesauri. Sentences and paragraphs can then be scored based on the degree of connectedness of their words; more-connected sentences are assumed to be more important. This method yields a wide range of performance, from low (using a very strict measure of connectedness) to considerably higher, the latter with Buckley and Cardie's use of sophisticated IR technology and Barzilay and Elhadad's lexical chains (Salton et al.; Mitra, Singhal, and Buckley; Mani and Bloedorn; Buckley and Cardie; Barzilay and Elhadad). Mani and Bloedorn represent the text as a graph in which words are nodes and arcs represent adjacency, coreference, and lexical similarity.
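A crude stand-in for such connectedness measures (modelling only word repetition, not coreference or thesaurus relations) might look like this:

```python
# Score each sentence by how many other sentences share at least one content
# word with it; repetition is the only connection type modelled here.

def connectedness_scores(sentences, stopwords=frozenset({"the", "a", "of", "and"})):
    bags = [{w.lower().strip(".,;:") for w in s.split()} - stopwords
            for s in sentences]
    return {i: sum(1 for j, other in enumerate(bags) if j != i and bags[i] & other)
            for i in range(len(bags))}
```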

Discourse structure criteria. A sophisticated variant of connectedness involves producing the underlying discourse structure of the text and scoring sentences by their discourse centrality, as shown by Marcu. Using a GSAT-like algorithm to learn the optimal combination of scores from centrality, several of the above-mentioned measures, and scores based on the shape and content of the discourse tree, Marcu's system does almost as well as people for Scientific American texts.

Combination of various module scores. In all cases, researchers have found that no single method of scoring performs as well as humans do to create extracts. However, since different methods rely on different kinds of evidence, combining them improves scores significantly. Various methods of automatically finding a combination function have been tried; all seem to work, and there is no obvious best strategy.

In their landmark work, Kupiec, Pedersen, and Chen train a Bayesian classifier by computing the probability that any sentence will be included in a summary, given the features paragraph position, cue phrase indicators, word frequency, upper-case words, and sentence length (since short sentences are generally not included in summaries). They find that, individually, the paragraph position feature gives the highest precision, the cue phrase indicators somewhat less (though the two joined together outperform either alone), and so on, with individual scores decreasing and the combined five-feature score the best overall.
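In outline, and assuming feature independence, such a classifier can be written in the familiar naive Bayes form, estimating for each sentence s the probability that it belongs in the summary S given its features F1, ..., Fk:

$$P(s \in S \mid F_1, \ldots, F_k) \;\approx\; \frac{P(s \in S)\,\prod_{j=1}^{k} P(F_j \mid s \in S)}{\prod_{j=1}^{k} P(F_j)}$$

Sentences are then ranked by this probability and the top-scoring ones selected for the extract.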

Also using a Bayesian classifier, Aone et al. find that even within a single genre, different newspapers require different features to achieve the same performance.

Using SUMMARIST, Lin compares eighteen different features, a naive combination of them, and an optimal combination obtained using the machine learning algorithm C4.5 (Quinlan). These features include most of those mentioned above, as well as features signalling the presence in each sentence of proper names, dates, quantities, pronouns, and quotes. The performances of the individual methods and of the naive and learned combination functions are graphed in the figure below, showing extract length against F-score (joint recall and precision). As expected, the top scorer is the learned combination function. The second-best score is achieved by query term overlap (though in other topics the query method did not do as well). The third-best score (up to a certain summary length) is achieved equally by word frequency, the lead method, and the naive combination function. The curves in general indicate that to be most useful, summaries should be neither too long nor too short relative to the original; even the best-scoring lengths achieved only moderate F-scores.
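The F-score plotted there is the standard combination of Recall and Precision, their harmonic mean:

$$F_1 = \frac{2 \cdot \textit{Precision} \cdot \textit{Recall}}{\textit{Precision} + \textit{Recall}}$$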

Stage 2: Interpretation or topic fusion

As described above, the stage of interpretation is what distinguishes extract-type summarization systems from abstract-type systems. During interpretation, the topics identified as important are fused, represented in new terms, and expressed using a new formulation, using concepts or words not found in the original text.

No system can perform interpretation without prior knowledge about the domain; by definition, it must interpret the input in terms of something extraneous to the text. But acquiring enough (and deep enough) prior domain knowledge is so difficult that summarizers to date have only attempted it in a small way.

At first glance, the template representations used in information extraction, or other interpretative structures in terms of which to represent stories for summarization, hold some promise (DeJong; Lehnert; Rau and Jacobs). But the difficulty of building such structures and filling them makes large-scale summarization impractical at present.

Fig.: Summary length vs. F-score for individual and combined methods of scoring sentences in SUMMARIST (SUMMAC Q&A, Topic 271; x axis: summary length, y axis: F1 measure). [Figure not reproduced.]


Taking a more formal approach, Hahn and Reimer develop operators that condense knowledge representation structures in a terminological logic through conceptual abstraction. To date, no parser has been built to produce the knowledge structures from text, and no generator to produce language from the results.

Taking a leaf from IR, Hovy and Lin use topic signatures (sets of words and relative strengths of association, each set related to a single headword) to perform topic fusion. By automatically constructing these signatures (using a large collection of texts from the Wall Street Journal and tf.idf to identify for each topic the set of words most relevant to it) they overcome the knowledge paucity problem. They use these topic signatures both during topic identification (to score sentences by signature overlap) and during topic interpretation (to substitute the signature head for the sentence(s) containing enough of its words). The effectiveness of signatures to perform interpretation has not yet been shown.
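A hedged sketch of signature construction in this spirit (the weighting and corpus handling are simplified; this is not the published SUMMARIST code):

```python
# Build a topic signature: rank words by a tf.idf-like weight computed from
# topic-relevant documents against a background collection, and keep the top
# 'size' words. Documents are lists of lower-cased tokens.
import math
from collections import Counter

def topic_signature(topic_docs, background_docs, size=20):
    tf = Counter(w for doc in topic_docs for w in doc)   # term frequency within the topic
    df = Counter()                                       # document frequency in the background
    for doc in background_docs:
        df.update(set(doc))
    n = len(background_docs)
    weight = {w: tf[w] * math.log(n / (1 + df[w])) for w in tf}
    return sorted(weight, key=weight.get, reverse=True)[:size]

# Sentences can then be scored by overlap with the signature word set, and
# interpretation can replace matched sentences with the signature's headword.
```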

Interpretation remains blocked by the problem of domain knowledge acquisition. Before summarization systems can produce abstracts, this problem will have to be solved.

Stage 3: Summary generation

The third major stage of summarization is generation. When the summary content has been created through abstracting and/or information extraction, it exists within the computer in internal notation, and thus requires the techniques of natural language generation, namely text planning, sentence (micro-)planning, and sentence realization. For more on this topic see the chapter on natural language generation elsewhere in this volume.

However, as mentioned above, extract summaries require no generation stage. In this case, though, various disfluencies tend to result when sentences (or other extracted units) are simply extracted and printed, whether they are printed in order of importance score or in text order. A process of 'smoothing' can be used to identify and repair typical disfluencies, as first proposed in Hirst et al. The most typical disfluencies that arise include repetition of clauses or NPs (where the repair is to aggregate the material into a conjunction), repetition of named entities (where the repair is to pronominalize), and inclusion of less important material such as parentheticals and discourse markers (where the repair is to eliminate them). In the context of summarization, Mani, Gates, and Bloedorn describe a summary revision program that takes as input simple extracts and produces shorter and more readable summaries.

Text compression is another promising approach. Knight and Marcu's prize-winning paper describes using the EM algorithm to train a system to compress the syntactic parse tree of a sentence in order to produce a single, shorter one, with the idea of eventually shortening two sentences into one, three into two (or one), and so on. Banko, Mittal, and Witbrock train statistical models to create headlines for texts by extracting individual words and ordering them appropriately.

Jing and McKeown make the extract-summary point from the generation perspective. They argue that summaries are often constructed from the source document by a process of cut and paste (fragments of document sentences are combined into summary sentences), and hence that a summarizer need only identify the major fragments of sentences to include and then weave them together grammatically. To prove this claim, they train a hidden Markov model to identify where in the document each (fragment of each) summary sentence resides. Testing with human-written abstracts of newspaper articles, Jing and McKeown determine that only a small percentage of summary sentences have no matching sentences in the document.

In an extreme case of cut and paste, Witbrock and Mittal (see the discussion of frequency criteria above) extract a set of words from the input document and then order the words into sentences using a bigram language model.
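An illustrative sketch of that last idea (greedy rather than the published search procedure, and assuming a bigram_logprob function trained elsewhere):

```python
# Greedily order a bag of extracted words into a word sequence using a bigram
# language model. bigram_logprob(prev, word) is assumed to return a log
# probability; a trained model would be supplied in practice.

def order_words(words, bigram_logprob, start_symbol="<s>"):
    remaining = list(words)
    ordered, prev = [], start_symbol
    while remaining:
        best = max(remaining, key=lambda w: bigram_logprob(prev, w))
        ordered.append(best)
        remaining.remove(best)
        prev = best
    return ordered
```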

Multi-Document Summarization

Summarizing a single text is difficult enough. But summarizing a collection of thematically related documents poses several additional challenges. In order to avoid repetitions, one has to identify and locate thematic overlaps. One also has to decide what to include of the remainder, to deal with potential inconsistencies between documents, and, when necessary, to arrange events from various sources along a single timeline. For these reasons, multi-document summarization is much less developed than its single-document cousin.

Various methods have been proposed to identify cross-document overlaps. SUMMONS (Radev), a system that covers most aspects of multi-document summarization, takes an information extraction approach. Assuming that all input documents are parsed into templates (whose standardization makes comparison easier), SUMMONS clusters the templates according to their contents, and then applies rules to extract items of major import. In contrast, Barzilay, McKeown, and Elhadad parse each sentence into a syntactic dependency structure (a simple parse tree) using a robust parser and then match trees across documents, using paraphrase rules that alter the trees as needed.

To determine what additional material should be included, Carbonell, Geng, and Goldstein first identify the units most relevant to the user's query, using methods such as those described for query-based summaries above, and then estimate the 'marginal relevance' of all remaining units using a measure called Maximum Marginal Relevance (MMR).
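MMR is usually formulated as follows: from the set R of candidate units, given the query Q and the set S of units already selected, pick

$$\mathrm{MMR} \;=\; \arg\max_{D_i \in R \setminus S} \Big[\, \lambda\, \mathrm{Sim}_1(D_i, Q) \;-\; (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \,\Big]$$

so that with λ = 1 the criterion is pure query relevance, while smaller values of λ increasingly penalize redundancy with material already selected.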


SUMMONS deals with cross-document overlaps and inconsistencies using a series of rules to order templates as the story unfolds, identify information updates (e.g. increasing death tolls), identify cross-template inconsistencies (decreasing death tolls), and finally produce appropriate phrases or data structures for the language generator.

Multi-document summarization poses interesting challenges beyond single documents (Goldstein et al.; Fukumoto and Suzuki; Kubota Ando et al.). An important study (Marcu and Gerber) shows that for the newspaper article genre, even some very simple procedures provide essentially perfect results. For example, taking the first two or three paragraphs of the most recent text of a series of texts about an event provides a summary equally coherent and complete as that produced by human abstracters. Obviously, this cannot be true of more complex types of summary, such as biographies of people or descriptions of objects. Further research is required on all aspects of multi-document summarization before it can become a practical reality.

Evaluating Summaries

How can you evaluate the quality of a summary? The growing body of literature on this interesting question suggests that summaries are so task and genre specific and so user oriented that no single measurement covers all cases. Below we first describe a few evaluation studies and then develop some theoretical background.

Previous evaluation studies

As discussed in the chapter on evaluation elsewhere in this volume, many NLP evaluators distinguish between black-box and glass-box evaluation. Taking a similar approach for summarization systems, Spärck Jones and Galliers define intrinsic evaluations as measuring output quality (only) and extrinsic evaluations as measuring user assistance in task performance.

Most existing evaluations of summarization systems are intrinsic. Typically, the evaluators create a set of ideal summaries, one for each test text, and then compare the summarizer's output to it, measuring content overlap (often by sentence or phrase recall and precision, but sometimes by simple word overlap). Since there is no 'correct' summary, some evaluators use more than one ideal per test text, and average the score of the system across the set of ideals. Comparing system output to some ideal was performed by, for example, Edmundson; Paice; Ono, Sumita, and Miike; Kupiec, Pedersen, and Chen; Marcu; and Salton et al. To simplify evaluating extracts, Marcu and Goldstein et al. independently developed an automated method to create extracts corresponding to abstracts.

A second intrinsic method is to have evaluators rate systems' summaries according to some scale (readability, informativeness, fluency, coverage); see Brandow, Mitze, and Rau for one of the larger studies.

Extrinsic evaluation is easy to motivate; the major problem is to ensure that the metric applied correlates well with task performance efficiency. Examples of extrinsic evaluation can be found in Morris, Kasper, and Adams for GMAT testing, Miike et al. for news analysis, and Mani and Bloedorn for information retrieval.

The largest extrinsic evaluation to date is the TIPSTER-SUMMAC study (Firmin Hand and Sundheim; Firmin and Chrzanowski), involving some eighteen systems (research and commercial) in three tests. In the Categorization task, testers classified a set of TREC texts and their summaries created by various systems. After classification, the agreement between the classifications of texts and their corresponding summaries is measured; the greater the agreement, the better the summary captures that which causes the full text to be classified as it is. In the Ad Hoc task, testers classified query-based summaries as Relevant or Not Relevant to the query. The agreement of texts and summaries classified in each category reflects the quality of the summary. Space constraints prohibit full discussion of the results; some interesting findings are that, for newspaper texts, all extraction systems performed equally well (and no better than the lead method) for generic summarization, and that IR methods produced the best query-based extracts. Still, despite the fact that all the systems performed extracts only, thereby simplifying much of the scoring process to IR-like recall and precision measures against human extracts, the wealth of material and the variations of analysis contained in Firmin and Chrzanowski underscore how little is still understood about summarization evaluation. This conclusion is strengthened in a fine paper by Donaway, Drummey, and Mather, who show how summaries receive different scores with different measures, or when compared to different (but presumably equivalent) ideal summaries created by humans.

Recognizing these problems, Jing et al. compare several evaluation methods, intrinsic and extrinsic, on the same extracts. With regard to inter-human agreement, they find fairly high consistency in the news genre, as long as the summary (extract) length is fixed as relatively short (there is some evidence that other genres will deliver less consistency (Marcu)). With regard to summary length, they find great variation. Comparing three systems, and comparing five humans, they show that the humans' ratings of systems, and the perceived ideal summary length, fluctuate as summaries become longer.


Two basic measures

Much of the complexity of summarization evaluation arises from the fact that it is difficult to specify what one really needs to measure, and why, without a clear formulation of what precisely the summary is trying to capture. We outline some general considerations here.

In general, to be a summary, the summary must obey two¹ requirements:

• it must be shorter than the original input text;
• it must contain the important information of the original (where importance is defined by the user), and not other, totally new, information.

One can define two measures to capture the extent to which a summary S conforms to these requirements with regard to a text T:

Compression Ratio: CR = (length S) / (length T)

Retention Ratio: RR = (info in S) / (info in T)

However we choose to measure the length and the information content, we can say that a good summary is one in which CR is small (tending to zero) while RR is large (tending to unity). We can characterize summarization systems by plotting the ratios of the summaries produced under varying conditions. For example, panel (a) of the figure below shows a fairly normal growth curve: as the summary gets longer (grows along the x axis), it includes more information (grows also along the y axis), until it equals the original. Panel (b) shows a more desirable situation: at some special point, the addition of just a little more text to the summary adds a disproportionately large amount of information. Panel (c) shows another: quite early, most of the important material is included in the summary; as the length grows, the added material is less interesting.
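For example (with purely illustrative numbers), a 150-word summary of a 1,500-word article has CR = 0.1; if, by whatever measure of information content is adopted, it preserves 70 per cent of the article's information, then RR = 0.7, placing it in the desirable region of small CR and large RR.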

In both the latter cases, summarization is useful.

Measuring length. Measuring length is straightforward; one can count the number of words, letters, sentences, etc. For a given genre and register, there is a fairly good correlation among these metrics, in general.

¹ Ideally, it should also be a coherent, readable text, though a list of keywords or text fragments can constitute a degenerate summary. Readability is measured in several standard ways, for purposes of language learning, machine translation, and other NLP applications.

Fig.: Compression Ratio (CR) vs. Retention Ratio (RR); three panels, (a), (b), and (c), each plotting RR against CR. [Figure not reproduced.]


Measuring information content. Ideally, one wants to measure not information content, but interesting information content only. Although it is very difficult to define what constitutes interestingness, one can approximate measures of information content in several ways. We describe four here.

The Expert Game. Ask experts to underline and extract the most interesting or informative fragments of the text. Measure recall and precision of the system's summary against the human's extract, as outlined in the discussion of topic identification above.

The Classification Game. Two variants of this extrinsic measure were implemented in the TIPSTER-SUMMAC evaluation (Firmin Hand and Sundheim; Firmin and Chrzanowski); see the discussion of previous evaluation studies above.

The Shannon Game. In information theory (Shannon), the amount of information contained in a message is measured by -p log p, where p is, roughly speaking, the probability of the reader guessing the message (or each piece thereof, individually). To measure the information content of a summary S relative to that of its corresponding text T, assemble three sets of testers. Each tester must create T, guessing letter by letter. The first set reads T before starting, the second set reads S before starting, and the third set reads nothing. For each set, record the number of wrong guesses g_wrong and total guesses g_total, and compute the ratio R = g_wrong / g_total. The quality of S can be computed by comparing the three ratios. R_none quantifies how much a tester could guess from world knowledge (and should hence not be attributed to the summary), while R_T quantifies how much a tester still has to guess, even with 'perfect' prior knowledge. The closer R_S is to R_T, the better the summary.²
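As a purely illustrative example: if the testers who read nothing guess wrongly 400 times out of 1,000 (R_none = 0.4), those who read T first guess wrongly 100 times out of 1,000 (R_T = 0.1), and those who read S first guess wrongly 150 times out of 1,000 (R_S = 0.15), then S has recovered most of the predictive benefit of having read the full text.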

The Question Game. This measure approximates the information content of S by determining how well it allows readers to answer questions drawn up about T. Before starting, one or more people create a set of questions based on what they consider the principal content (author's view or query-based) of T. Then the testers answer these questions three times in succession: first without having read either S or T, second after having read S, and third after having read T. After each round, the number of questions answered correctly is tallied. The quality of S can be computed by comparing the three tallies, as above. The closer the testers' score for S is to their score for T, the better the summary. The TIPSTER-SUMMAC summarization evaluation (Firmin Hand and Sundheim) contained a tryout of the Question Game.

Further Reading and Relevant Resources

Mani provides a thorough overview of the field, and Mani and Maybury provide a most useful collection of twenty-six papers about summarization, including many of the most influential. Recent workshop proceedings are Hovy and Radev; Hahn et al.; Goldstein and Lin; and the DUC workshop proceedings. Useful URLs are at http://www.cs.columbia.edu/~radev/summarization/.

² In a small experiment using the Shannon Game, the author found an order of magnitude difference between the three contrast sets.

Acknowledgements

Thanks to Chin-Yew Lin, Daniel Marcu, Hao Liu, Mike Junk, Louke van Wensveen, Thérèse Firmin Hand, Sara Shelton, and Beth Sundheim.

References

Aone, C., M. E. Okurowski, J. Gorlinsky, and B. Larsen. 'A scalable summarization system using robust NLP'. In Mani and Maybury (eds.).

Banko, M., V. O. Mittal, and M. J. Witbrock. 'Headline generation based on statistical translation'. Proceedings of the Annual Conference of the Association for Computational Linguistics (ACL) (Hong Kong).

Barzilay, R. and M. Elhadad. 'Using lexical chains for text summarization'. In Mani and Maybury (eds.).

—— K. R. McKeown, and M. Elhadad. 'Information fusion in the context of multi-document summarization'. Proceedings of the Conference of the Association for Computational Linguistics (ACL) (College Park, Md.).

Baxendale, P. B. 'Machine-made index for technical literature: an experiment'. IBM Journal.

Brandow, R., K. Mitze, and L. Rau. 'Automatic condensation of electronic publishing publications by sentence selection'. Information Processing and Management. Also in Mani and Maybury (eds.).

Buckley, C. and C. Cardie. 'Using EMPIRE and SMART for high-precision IR and summarization'. Proceedings of the TIPSTER Text Phase III Workshop. San Diego, USA.

Carbonell, J., Y. Geng, and J. Goldstein. 'Automated query-relevant summarization and diversity-based reranking'. Proceedings of the IJCAI Workshop on AI in Digital Libraries. San Mateo, Calif.: Morgan Kaufmann.

DeJong, G. J. Fast Skimming of News Stories: The FRUMP System. Ph.D. thesis, Yale University.

Donaway, R. L., K. W. Drummey, and L. A. Mather. 'A comparison of rankings produced by summarization evaluation measures'. Proceedings of the NAACL Workshop on Text Summarization (Seattle).

Donlan, D. 'Locating main ideas in history textbooks'. Journal of Reading.

DUC. Proceedings of the Document Understanding Conference (DUC) Workshop on Multi-Document Summarization Evaluation, at the SIGIR Conference. New Orleans, USA. http://www.itl.nist.gov/iad/ . /projects/duc/index.html.

—— Proceedings of the Document Understanding Conference (DUC) Workshop on Multi-Document Summarization Evaluation, at the ACL Conference. Philadelphia, USA (forthcoming).


Edmundson, H. P. 'New methods in automatic extraction'. Journal of the ACM. Also in Mani and Maybury (eds.).

Firmin, T. and M. J. Chrzanowski. 'An evaluation of text summarization systems'. In Mani and Maybury (eds.).

Firmin Hand, T. and B. Sundheim. 'TIPSTER-SUMMAC summarization evaluation'. Proceedings of the TIPSTER Text Phase III Workshop, Washington, DC.

Fukumoto, F. and Y. Suzuki. 'Extracting key paragraph based on topic and event detection: towards multi-document summarization'. Proceedings of the NAACL Workshop on Text Summarization (Seattle).

Fukushima, T., T. Ehara, and K. Shirai. 'Partitioning long sentences for text summarization'. Journal of the Society of Natural Language Processing of Japan (in Japanese).

Goldstein, J. and C.-Y. Lin (eds.). Proceedings of the NAACL Workshop on Text Summarization. Pittsburgh, USA.

—— M. Kantrowitz, V. Mittal, and J. Carbonell. 'Summarizing text documents: sentence selection and evaluation metrics'. Proceedings of the International ACM Conference on Research and Development in Information Retrieval (SIGIR) (Berkeley).

—— V. Mittal, J. Carbonell, and M. Kantrowitz. 'Multi-document summarization by sentence extraction'. Proceedings of the NAACL Workshop on Text Summarization (Seattle).

Hahn, U. and U. Reimer. 'Knowledge-based text summarization: salience and generalization operators for knowledge base abstraction'. In Mani and Maybury (eds.).

—— C.-Y. Lin, I. Mani, and D. Radev (eds.). Proceedings of the NAACL Workshop on Text Summarization (Seattle).

Hirst, G., C. DiMarco, E. H. Hovy, and K. Parsons. 'Authoring and generating health-education documents that are tailored to the needs of the individual patient'. Proceedings of the International Conference on User Modelling (UM) (Sardinia). http://um.org.

Hovy, E. H. and C.-Y. Lin. 'Automating text summarization in SUMMARIST'. In Mani and Maybury (eds.).

—— and D. Radev (eds.). Proceedings of the AAAI Spring Symposium on Intelligent Text Summarization. Stanford, Calif.: AAAI Press.

Jing, H. and K. R. McKeown. 'The decomposition of human-written summary sentences'. Proceedings of the International ACM Conference on Research and Development in Information Retrieval (SIGIR) (Berkeley).

—— R. Barzilay, K. McKeown, and M. Elhadad. 'Summarization evaluation methods: experiments and results'. In Hovy and Radev (eds.).

Knight, K. and D. Marcu. 'Statistics-based summarization - step one: sentence compression'. Proceedings of the Conference of the American Association for Artificial Intelligence (AAAI) (Austin, Tex.).

Kubota Ando, R., B. K. Boguraev, R. J. Byrd, and M. S. Neff. 'Multi-document summarization by visualizing topical content'. Proceedings of the NAACL Workshop on Text Summarization (Seattle).

Kupiec, J., J. Pedersen, and F. Chen. 'A trainable document summarizer'. Proceedings of the Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR) (Seattle). Also in Mani and Maybury (eds.).

Lehnert, W. G. 'Plot units and narrative summarization'. Cognitive Science. See also 'Plot units: a narrative summarization strategy', in Mani and Maybury (eds.).


Lin, C.-Y. 'Training a selection function for extraction'. Proceedings of the International Conference on Information and Knowledge Management (CIKM) (Kansas City).

—— and E. H. Hovy. 'Identifying topics by position'. Proceedings of the Applied Natural Language Processing Conference (ANLP) (Washington).

Luhn, H. P. 'The automatic creation of literature abstracts'. IBM Journal of Research and Development. Also in Mani and Maybury (eds.).

Mani, I. Automatic Summarization. Amsterdam: John Benjamins.

—— and E. Bloedorn. 'Multi-document summarization by graph search and matching'. Proceedings of AAAI (Providence).

—— B. Gates, and E. Bloedorn. 'Improving summaries by revising them'. Proceedings of the Conference of the Association for Computational Linguistics (ACL) (College Park, Md.).

—— and M. Maybury (eds.). Advances in Automatic Text Summarization. Cambridge, Mass.: MIT Press.

Marcu, D. The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts. Ph.D. thesis, University of Toronto.

—— 'Improving summarization through rhetorical parsing tuning'. Proceedings of the COLING-ACL Workshop on Very Large Corpora (Montreal).

—— 'The automatic construction of large-scale corpora for summarization research'. Proceedings of the International ACM Conference on Research and Development in Information Retrieval (SIGIR) (Berkeley).

—— and L. Gerber. 'An inquiry into the nature of multidocument abstracts, extracts, and their evaluation'. Proceedings of the Workshop on Text Summarization at the NAACL Conference (Pittsburgh).

Miike, S., E. Itoh, K. Ono, and K. Sumita. 'A full-text retrieval system with dynamic abstract generation function'. Proceedings of the Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR).

Mitra, M., A. Singhal, and C. Buckley. 'Automatic text summarization by paragraph extraction'. Proceedings of the Workshop on Intelligent Scalable Summarization at the ACL/EACL Conference (Madrid).

Morris, A., G. Kasper, and D. Adams. 'The effects and limitations of automatic text condensing on reading comprehension performance'. Information Systems Research.

Ono, K., K. Sumita, and S. Miike. 'Abstract generation based on rhetorical structure extraction'. Proceedings of the International Conference on Computational Linguistics (COLING) (Kyoto).

Paice, C. D. 'Constructing literature abstracts by computer: techniques and prospects'. Information Processing and Management.

Quinlan, J. R. 'Induction of decision trees'. Machine Learning.

Radev, D. R. Generating Natural Language Summaries from Multiple On-line Sources: Language Reuse and Regeneration. Ph.D. thesis, Columbia University.

Rau, L. S. and P. S. Jacobs. 'Creating segmented databases from free text for text retrieval'. Proceedings of the Annual ACM Conference on Research and Development in Information Retrieval (SIGIR) (New York).

Salton, G., A. Singhal, M. Mitra, and C. Buckley. 'Automatic text structuring and summarization'. Information Processing and Management. Also in Mani and Maybury (eds.).


Shannon, C. 'Prediction and entropy of printed English'. Bell System Technical Journal.

Spärck Jones, K. 'Automatic summarizing: factors and directions'. In Mani and Maybury (eds.).

—— and J. R. Galliers. Evaluating Natural Language Processing Systems: An Analysis and Review. New York: Springer.

Strzalkowski, T., G. Stein, J. Wang, and B. Wise. 'A robust practical text summarizer'. In Mani and Maybury (eds.).

Teufel, S. and M. Moens. 'Sentence extraction as a classification task'. Proceedings of the ACL Workshop on Intelligent Text Summarization (Madrid).

—— —— 'Argumentative classification of extracted sentences as a first step toward flexible abstracting'. In Mani and Maybury (eds.).

Witbrock, M. and V. Mittal. 'Ultra-summarization: a statistical approach to generating highly condensed non-extractive summaries'. Proceedings of the ACM Conference on Research and Development in Information Retrieval (SIGIR) (Berkeley).

