
A Context-Based Word Indexing Model for Document Summarization

Pawan Goyal, Laxmidhar Behera, Senior Member, IEEE, and Thomas Martin McGinnity, Senior Member, IEEE

Abstract—Existing models for document summarization mostly use the similarity between sentences in the document to extract the most salient sentences. The documents as well as the sentences are indexed using traditional term indexing measures, which do not take the context into consideration. Therefore, the sentence similarity values remain independent of the context. In this paper, we propose a context sensitive document indexing model based on the Bernoulli model of randomness. The Bernoulli model of randomness has been used to find the probability of the cooccurrences of two terms in a large corpus. A new approach using the lexical association between terms to give a context sensitive weight to the document terms has been proposed. The resulting indexing weights are used to compute the sentence similarity matrix. The proposed sentence similarity measure has been used with the baseline graph-based ranking models for sentence extraction. Experiments have been conducted over the benchmark DUC data sets and it has been shown that the proposed Bernoulli-based sentence similarity model provides consistent improvements over the baseline IntraLink and UniformLink methods [1].

Index Terms—Lexical association, text summarization, document indexing


1 INTRODUCTION

DOCUMENT summarization is an information retrieval task, which aims at extracting a condensed version of the original document [2]. A document summary is useful since it can give an overview of the original document in a shorter period of time. Readers may decide whether or not to read the complete document after going through the summary. For example, readers first look at the abstract of a scientific article before reading the complete paper. Search engines also use text summaries to help users make relevance decisions [3].

The main goal of a summary is to present the main ideas in a document/set of documents in a short and readable paragraph. Summaries can be produced either from a single document or from many documents [4]. The task of producing a summary from many documents is called multidocument summarization [5], [6], [7], [8], [9], [10]. Summarization can also be specific to the information needs of the user, and is then called "query-biased" summarization [11], [12], [13]. For instance, the QCS system (query, cluster, and summarize, [12]) retrieves relevant documents in response to a query, clusters these documents by topic and produces a summary for each cluster. Opinion summarization [14], [15], [16], [17] is another application of text summarization. Topic summarization deals with the evolution of topics in addition to providing the informative sentences [18].

This paper focuses on sentence extraction-based single document summarization. Most of the previous studies on the sentence extraction-based text summarization task use a graph-based algorithm to calculate the saliency of each sentence in a document, and the most salient sentences are extracted to build the document summary. The sentence extraction techniques give an indexing weight to the document terms and use these weights to compute the sentence similarity [1] and/or document centroid [19] and so on. The sentence similarity calculation remains central to the existing approaches. The indexing weights of the document terms are utilized to compute the sentence similarity values. However, very elementary document features are used to allocate an indexing weight to the document terms, which include the term frequency, document length, occurrence of a term in a background corpus and so on. Therefore, the indexing weight remains independent of the other terms appearing in the document, and the context in which the term occurs is overlooked in assigning its indexing weight. This results in "context independent document indexing." To the authors' knowledge, no other work in the existing literature addresses the problem of "context independent document indexing" for the document summarization task.

A document contains both content-carrying (topical) terms and background (nontopical) terms. The traditional indexing schemes cannot distinguish between these terms, which is reflected in the sentence similarity values. A context sensitive document indexing model gives a higher weight to the topical terms as compared to the nontopical terms and, thus, influences the sentence similarity values in a positive manner.

. P. Goyal is with INRIA Paris-Rocquencourt, Domaine de Voluceau - Rocquencourt, B.P. 105 - 78153 Le Chesnay, France. E-mail: [email protected].

. L. Behera is with the Department of Electrical Engineering, Indian Institute of Technology Kanpur, UP 208016, India, and the Intelligent Systems Research Centre, School of Computing and Intelligent Systems, University of Ulster, Londonderry, NI, UK, BT48 7JL. E-mail: [email protected], [email protected].

. T.M. McGinnity is with the Intelligent Systems Research Centre, School of Computing and Intelligent Systems, University of Ulster, Magee campus, Londonderry BT48 7JL, Northern Ireland, United Kingdom. E-mail: [email protected].

Manuscript received 25 Jan. 2012; revised 15 May 2012; accepted 19 May 2012; published online 25 May 2012. Recommended for acceptance by J. Zobel. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TKDE-2012-01-0063. Digital Object Identifier no. 10.1109/TKDE.2012.114.

In this paper, we address the problem of "context independent document indexing" using the lexical association between document terms. In a document, the content-carrying words will be highly associated with each other, while the background terms will have very low association with the other terms in the document. The association between terms is captured in this paper by the lexical association, computed through a corpus analysis.

The main motivation behind using the lexical association is the central assumption that the context in which a word appears provides useful information about its meaning [20]. Cooccurrence measures observe the distributional patterns of a term with other terms in the vocabulary and have applications in many tasks pertaining to natural language understanding, such as word classification [21], knowledge acquisition [22], word sense disambiguation [23], information retrieval [24], sentence retrieval [25], and word clustering [26]. In this paper, we derive a novel term association metric using the Bernoulli model of randomness. Multivariate Bernoulli models have previously been applied to document indexing and information retrieval [27], [28]. We use the Bernoulli model of randomness to find the probability of the cooccurrences of two terms in a corpus and use classical semantic information theory to quantify the information contained in the cooccurrences of these two terms.

The lexical association metric thus derived is used to propose a context-sensitive document indexing model. The idea is implemented using a PageRank-based algorithm [29] to iteratively compute how informative each document term is. Sentence similarity calculated using the context sensitive indexing should reflect the contextual similarity between two sentences. This will allow two sentences to have different similarity values depending on the context. The hypothesis is that an improved sentence similarity measure would lead to improvements in document summarization.

The text summarization experiments have been performed on the single document summarization task over the DUC01 and DUC02 data sets. It has been shown that the proposed model consistently improves the performance of the baseline sentence extraction algorithms under various settings and, thus, can be used as an enhancement over the baseline models. The theoretical foundations along with the empirical results confirm that the proposed model advances the state of the art in document summarization.

The main contributions of this paper are summarized as follows:

1. We propose the novel idea of using context-sensitive document indexing to improve the sentence extraction-based document summarization task.

2. We implement the idea by using the lexical association between document terms in a PageRank-based framework. A novel term association metric using the Bernoulli model of randomness has been derived for this purpose. Empirical evidence has been provided to show that, using the derived lexical association metric, the average lexical association between the terms in a target summary is higher compared to the association between the terms in a document.

3. Experiments have been conducted over the benchmark Document Understanding Conference (DUC) data sets to empirically validate the effectiveness of the proposed model.

The remainder of this paper is organized as follows: Section 2 discusses the related work in the field of document summarization. The proposed lexical association-based context sensitive indexing model is discussed in Section 3, along with the derivation of the term association metric. Experiments and results over the DUC data sets are reported in Section 4, where the proposed approach is compared to the baseline model in various settings. Discussions with one specific document as an example are reported in Section 5, where summaries obtained through various approaches are shown. Conclusions and future work are provided in Section 6.

2 RELATED WORK

Text summarization can either be "abstractive" or "extractive." The abstraction-based models mostly provide the summary by sentence compression and reformulation [30], [31], [32], allowing summarizers to increase the overall information without increasing the summary length [33], [34]. However, these models require complex linguistic processing. Sentence extraction models, on the other hand, use various statistical features from the text to identify the most central sentences in a document/set of documents. Radev et al. [19] proposed a centroid-based summarization model. They used the words having tf-idf scores (indexing weights) above a threshold to define the centroid as a pseudodocument. Those sentences containing more words from the centroid were assumed to be central. Erkan and Radev [35] proposed LexRank to compute sentence importance based on the concepts of eigenvector centrality and degree centrality. They used the hypothesis that the sentences that are similar to many of the other sentences in a cluster are more salient to the document topic. Sentence similarity measures based on cosine similarity were exploited for computing the adjacency matrix. Once the document graph is constructed using the similarity values, the "degree centrality" of a sentence $s_i$ is defined as the number of sentences similar to $s_i$, with similarity value above a threshold. Eigenvector centrality is computed iteratively using the LexRank algorithm, which is an adaptation of the PageRank algorithm. Mihalcea and Tarau [36] proposed TextRank, another iterative graph-based ranking framework for text summarization, and showed that other graph-based algorithms can be derived from this model.

Researchers have used a combination of statistical and linguistic features, such as term frequency [37], sentence position [38], [39], [40], [41], cue words [42], topic signature [43], lexical chains [44] and so on for computing a saliency score of the sentences. Ko and Seo [45] combined two consecutive sentences into a pseudosentence (bigram). These bigrams were supposed to be contextually informative. First, the bigrams are extracted using the sentence extraction module. Sentences are then extracted from these bigrams using another sentence extraction task. Alguliev and Alyguliev [46] used quadratic-type integer linear programming to cluster sentences and used the clusters to discover latent topical sections and information-rich sentences.

Wan and Xiao [1] proposed to use a small number of neighborhood documents to improve the sentence extraction model. Given a document $D_i$, they used its neighboring documents to construct an expanded document set $Xp_i$. The PageRank-based algorithm was applied and both the local and global information was used to compute the saliency scores of sentences. The model works in two steps.

1. Neighborhood construction.
2. Summary extraction using the neighborhood knowledge.

Given a document $D_i \in D$, the model finds $n$ neighboring documents for the document $D_i$. These documents constitute the neighborhood knowledge context for $D_i$, and $D_i$ is said to be expanded to a small document set $Xp_i$. Using the expanded document set, both the within-document sentence relationships (local information) and the cross-document sentence relationships (global information) can be used for the summarization process.

An adaptation of the graph-based ranking algorithm PageRank is used to compute the importance of a sentence within a graph in a recursive manner, using the connectivity information of the graph. Given the expanded document set $Xp_i$, an undirected graph $G = (S, E)$ is used to reflect the connectivity between sentences in the document set. $S = \{s_i \mid 1 \le i \le |S|\}$ is the set of sentences in $Xp_i$ and $E$ is a matrix of size $|S| \times |S|$, such that an element $e_{jk} \in E$ stores the similarity between sentences $s_j$ and $s_k$ in $S$. The similarity value is calculated using the cosine similarity measure.

An adjacency matrix $M = (M_{j,k})_{|S| \times |S|}$ is used to describe $G$ as

$$M_{j,k} = \begin{cases} \lambda \cdot sim(s_j, s_k), & j \neq k, \\ 0, & \text{otherwise}, \end{cases} \qquad (1)$$

where $\lambda$ denotes a confidence value, which is set to 1 if the link is a within-document link and to $sim_{doc}(D_l, D_m)$ if $s_j$ and $s_k$ come from different documents $D_l$ and $D_m$, where $sim_{doc}$ denotes the document similarity measure. $M$ is normalized to $\tilde{M}$ so that each row sums to 1. Based on the global affinity graph $G$, the importance score (denoting how informative a sentence is) $IFScore(s_j)$ for sentence $s_j$ is calculated in a recursive manner using the PageRank-based algorithm as follows:

$$IFScore(s_j) = \mu \cdot \sum_{\forall k \neq j} IFScore(s_k) \cdot \tilde{M}_{kj} + \frac{1-\mu}{|S|}, \qquad (2)$$

where $\mu$ is a damping factor. Equation (2) is iteratively applied until convergence is achieved. The convergence criterion is the difference between the importance scores of sentences in two successive iterations.
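The recursion in (2) can be sketched as follows. This is a minimal illustration rather than the authors' implementation; it assumes a precomputed row-normalized sentence similarity matrix, and the function and variable names are ours.

```python
import numpy as np

def sentence_scores(M_tilde, mu=0.85, tol=1e-4):
    """Iterative PageRank-style sentence scoring as in equation (2).
    M_tilde is the row-normalized |S| x |S| sentence similarity matrix,
    with zeros on the diagonal so that the k != j constraint holds."""
    S = M_tilde.shape[0]
    scores = np.ones(S) / S
    while True:
        new_scores = mu * (M_tilde.T @ scores) + (1 - mu) / S
        if np.abs(new_scores - scores).sum() < tol:  # convergence over successive iterations
            return new_scores
        scores = new_scores
```

The most salient sentences are then those with the highest scores, extracted until the summary length limit is reached.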

The above algorithm was named differently for different settings, as described below:

UniformLink. Both the within-document and cross-document relations are used.
InterLink. Only the cross-document relationships are used.
IntraLink. Only the within-document relationships are used.

From (2), it is clear that the only input document feature that this graph-based sentence extraction algorithm accepts is the normalized sentence similarity matrix $\tilde{M}$. This matrix is constructed using the cosine similarity measure between the sentences. However, the sentence vectors are constructed using the tf-idf-based indexing scheme, which is context independent and does not take the topicality of the document words into account.

None of the models described in this section address the problem of "context insensitive document indexing." This paper proposes to use the knowledge derived from the underlying corpus to give a context-sensitive indexing weight to the document terms. Sentence similarity will be calculated using the indexing weights thus obtained.

3 EXPLORING LEXICAL ASSOCIATION FOR TEXT SUMMARIZATION

Given a document $D_i$, the terms encountered in it can either be topical or nontopical. While it is difficult to decide about the topicality of a term only on the basis of a single document, as suggested by the distributional hypothesis [20], the patterns of term cooccurrence over a larger data set can be helpful. Lexical association measures use the term cooccurrence knowledge extracted from a large corpus. Nontopical terms appear randomly across all the documents while topical terms appear in bursts. Therefore, computed on a sufficiently large corpus, the lexical association value between two topical terms should be higher than the lexical association between two nontopical terms or between a pair of topical and nontopical terms.

To motivate the discussion, let us consider two arbitrary documents $D_1$ and $D_2$. Two sentences of each document are shown as follows:

$D_1: S_{11} = \{\text{started, career, engineering}\}$, $S_{12} = \{\text{shifted, engineering, humanities}\}$;

$D_2: S_{21} = \{\text{engineering, application, scientific, principles}\}$, $S_{22} = \{\text{engineering, design, build, machines}\}$.

The first document is discussing a person's life. He started his career in engineering and, later on, shifted to humanities. The term "engineering" is not a content-carrying term in this document. The second document, on the other hand, talks about "engineering." The two sentences, $S_{21}$ and $S_{22}$, attempt to define the field of engineering. The term "engineering" is clearly a topical term in $D_2$. By using any of the traditional indexing schemes, "engineering" will be given approximately the same indexing weight in both the documents. Therefore, the similarity between $S_{11}$ and $S_{12}$ will be nearly the same as the similarity between $S_{21}$ and $S_{22}$. However, "engineering" is topically related to $D_2$ and is a background term in $D_1$. If an indexing scheme can distinguish between the "topical" and "nontopical" terms, "engineering" will receive a much lower indexing weight in $D_1$ than in $D_2$, resulting in a decrease in the similarity value between $S_{11}$ and $S_{12}$ and an increase in the similarity value between $S_{21}$ and $S_{22}$.

Before developing an algorithm for the identification of the "topical" and "nontopical" terms, it is important to reflect upon how we decide that the term "engineering" is "topical" in $D_2$ and "nontopical" in $D_1$. In $D_2$, the term "engineering" appears with many other terms such as "application, design, machines, scientific, principles," which are associated with the term "engineering." In $D_1$, however, the term "engineering" seems to be related only slightly to the term "career" and not to other terms such as "humanities, shifted, started." Our knowledge of "word association" is based on the knowledge we have captured about the world. For computational purposes, this knowledge can be discovered by corpus analysis.

We now present the underlying hypotheses of our approach in this section.

. A document summary is centered around the topical terms (content-carrying terms) encountered in the document. In other words, H1: "The ratio of topical words is higher in a summary of a document than in the original document."

. Topical terms appear in bursts while the nontopical terms appear randomly across all the documents. H2: "For a carefully chosen lexical association metric, the lexical association between two topical terms should be higher than the lexical association between two nontopical terms or a pair of topical and nontopical terms." This lexical association can be calculated using a large corpus.

. Once the lexical association is calculated, we can construct the document graph, with the terms appearing in the document as the vertices and the lexical association between these terms as the edges of the graph. H3: "A PageRank-based algorithm can be used to determine the context-sensitive indexing weights, resulting in performance improvement for a document summarization task."

The cooccurrence patterns in a corpus can be used to derive the lexical association measure. Assuming that the terms are distributed according to the Bernoulli distribution, divergence from the randomness behavior can provide a measure of the lexical association. Before going into the derivation, we define the notation.

3.1 Notations

We consider a set of N documents. Let these documents have s unique words, which will be used to index these documents, thus called "index terms." Let $T = \{t_1, t_2, \ldots, t_s\}$ be the set of these index terms. Let the set of N documents be $D = \{D_1, D_2, \ldots, D_N\}$. Let $f_{ij}$ be the frequency with which term $t_j$ occurs in document $D_i$ and $N_j$ be the number of documents in which the term $t_j$ occurs at least once. $N_j$ is also called the document frequency of term $t_j$. We will denote the probability of term $t_i$ appearing in the corpus by $p_i$. Let $N_{ij}$ denote the number of documents in which terms $t_i$ and $t_j$ cooccur.

3.2 Bernoulli Model of Randomness: Derivation of the Term Association Metric

Let us consider the distribution of terms $t_i$ and $t_j$ in a corpus. The term $t_j$ appears in $N_j$ documents. Assuming the terms to be distributed as per the Bernoulli distribution, the probability $p_i$ of the term $t_i$ appearing in a document is given by

$$p_i = \frac{N_i}{N}. \qquad (3)$$

Consider the $N_j$ documents in which term $t_j$ occurs. Term $t_i$ occurs in $N_{ij}$ documents out of these $N_j$ documents and does not occur in $N_j - N_{ij}$ documents. Therefore, the probability of $N_{ij}$ cooccurrences in $N_j$ documents is given by the Bernoulli distribution

$$Prob(N_{ij}) = B(N, N_j, N_{ij}) \qquad (4)$$

$$= \binom{N_j}{N_{ij}} p_i^{N_{ij}} q_i^{N_j - N_{ij}}, \qquad (5)$$

where $q_i = 1 - p_i$.

Equation (5) quantifies the probability that term $t_i$ has $N_{ij}$ cooccurrences in $N_j$ documents. As per classical semantic information theory, the quantity of information associated is the logarithm of the reciprocal of this probability, expressed in bits [47]. Therefore, the information content in the $N_{ij}$ cooccurrences of term $t_i$ in $N_j$ documents can be expressed as

$$Inf(N_{ij}) = -\log_2(Prob(N_{ij})) \qquad (6)$$

$$= -\log_2\left( \binom{N_j}{N_{ij}} p_i^{N_{ij}} q_i^{N_j - N_{ij}} \right), \qquad (7)$$

where (6) is the quantification of the "surprise" of the $N_{ij}$ cooccurrences. Equation (7) requires the computation of factorials and, therefore, Stirling's approximation [48] can be used to approximate the factorials included in the computation. According to Stirling's approximation,

$$n! \approx \sqrt{2\pi n}\left(\frac{n}{e}\right)^n. \qquad (8)$$

Proceeding along the same lines as the derivation proposed by Amati and Rijsbergen [27],1 we can derive the information content in the $N_{ij}$ cooccurrences as

$$Inf(N_{ij}) = 0.5\log_2\left(2\pi N_{ij}\left(1 - \frac{N_{ij}}{N_j}\right)\right) + N_{ij}\log_2\frac{p_{co}}{p_i} + (N_j - N_{ij})\log_2\frac{1 - p_{co}}{1 - p_i}, \qquad (9)$$

where $p_i = \frac{N_i}{N}$ and $p_{co} = \frac{N_{ij}}{N_j}$.

Equation (9) quantifies the self-information of the $N_{ij}$ cooccurrences of term $t_i$ in $N_j$ documents. We propose to use this information as the "Bernoulli lexical association measure" to give a context-sensitive indexing weight to the document terms.

1. Amati and Rijsbergen [27] used the Bernoulli model of randomness to propose an indexing scheme, while the authors propose a cooccurrence measure. The derivation proceeds along similar lines but in a completely different context.
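For illustration, the measure in (9) can be computed directly from document-frequency counts. The following is a minimal sketch under our own assumptions: the function name and the handling of the boundary cases $N_{ij} = 0$ and $N_{ij} = N_j$ (where the Stirling-based form is undefined) are not specified in the paper.

```python
import math

def bernoulli_association(N, N_i, N_j, N_ij):
    """Bernoulli lexical association: self-information of N_ij cooccurrences
    of term t_i within the N_j documents containing t_j (equation (9))."""
    if N_ij == 0 or N_ij == N_j:
        return 0.0  # boundary cases, handled here by convention
    p_i = N_i / N        # probability of t_i appearing in a document, equation (3)
    p_co = N_ij / N_j    # observed cooccurrence rate of t_i in t_j's documents
    return (0.5 * math.log2(2 * math.pi * N_ij * (1 - N_ij / N_j))
            + N_ij * math.log2(p_co / p_i)
            + (N_j - N_ij) * math.log2((1 - p_co) / (1 - p_i)))

# Hypothetical counts: t_i in 50 of 1,000 documents, t_j in 40, cooccurring in 12
print(bernoulli_association(N=1000, N_i=50, N_j=40, N_ij=12))
```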

We return to our hypothesis H1: "The ratio of topical words is higher in a summary of a document than in the original document." Is this hypothesis empirically supported for the proposed measure, i.e., are the words in a human generated summary of a document more lexically associated than those in the original document? To investigate this question, empirical evidence was sought from the DUC data sets, which are the benchmark data sets for the evaluation of text summarization.

For each data set, the Bernoulli lexical association between the indexing terms was calculated using (9). For each document, the average lexical association between the document terms was calculated (stop words were not used). Similarly, the average lexical association was computed for the target summary (reference summary) provided in the corpus.2 For example, if a document/summary has M words (excluding the stop words), the average lexical association (avLex3) for the document/summary was calculated as

$$avLex = \frac{\sum_{i=1}^{M} \sum_{j=1, j \neq i}^{M} A_{ij}}{M(M-1)}, \qquad (10)$$

where $A_{ij}$ corresponds to the lexical association between the ith and jth words in the document/summary, which was calculated a priori using the whole data set. For the DUC01 data set, the avLex for the target summary was 2.76 as compared to 2.49 for the original documents. Similarly, for the DUC02 data set, the avLex for the target summary was 3.29 as compared to 2.89 for the original documents. The difference was statistically significant to a large degree. Fig. 1 compares the distribution of average lexical association of the document and target summary using the Bernoulli measure for the two DUC data sets. The sample size for each data set is the number of unique documents in that data set, i.e., 303 for DUC01 and 533 for DUC02.
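A short sketch of the avLex computation in (10), assuming the pairwise association values have been precomputed and stored in a symmetric lookup table (the dictionary representation is our own):

```python
def av_lex(words, assoc):
    """Average pairwise lexical association of a document or summary,
    given its list of non-stop words and a dict mapping word pairs to
    precomputed association values (equation (10))."""
    M = len(words)
    if M < 2:
        return 0.0
    total = sum(assoc.get((words[i], words[j]), 0.0)
                for i in range(M) for j in range(M) if i != j)
    return total / (M * (M - 1))
```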

As is evident from Fig. 1, the distribution is shifted toward the right for the target summary, when compared with the original document. Though the lexical association of summary documents shows a slight shift toward the left also, it is small as compared to the shift toward higher lexical association. We quantified the probability mass corresponding to the left and right shift for both the data sets. For DUC01, the probability mass corresponding to the shift toward the left is 0.192 and for the shift toward the right is 0.439. For DUC02, the probability mass is 0.098 and 0.576 corresponding to the shift toward the left and the shift toward the right, respectively. Therefore, using the Bernoulli association measure, a summary document obtains a higher average lexical association than the original document in most cases. Since a document summary is expected to contain a higher proportion of topical terms, as compared to the original document, it can be stated that the Bernoulli model-based lexical association can distinguish between a topical and a nontopical term. Thus, the hypothesis H2 can be modified as H2: "The Bernoulli model-based lexical association measure can distinguish between a topical and nontopical term and can be used to give a context-sensitive indexing weight to the document terms."

At this point, we study the behavior of some other lexical association measures to justify that the proposed Bernoulli lexical association measure is a much better fit for H2. The following measures are considered:

following measures are considered:

. Point-wise mutual information (PMI) [49]:

$$A_{ij}^{PMI} = \log\left(\frac{N N_{ij}}{N_i N_j}\right). \qquad (11)$$

. Mutual information (MI) [50]:

$$A_{ij}^{MI} = \sum_{\chi_i = 0,1}\, \sum_{\chi_j = 0,1} p(\chi_i, \chi_j)\log\frac{p(\chi_i, \chi_j)}{p(\chi_i)\,p(\chi_j)}, \qquad (12)$$

where the binary variables $\chi_i$ and $\chi_j$ indicate the presence/absence of terms $t_i$ and $t_j$.

Fig. 1. Comparison of the distribution of average lexical association of document and target summary for the Bernoulli measure for (a) DUC01 and (b) DUC02 data sets.

TABLE 1. Average Lexical Association, Computed over the DUC01 Data Set Using Various Association Measures

TABLE 2. Average Lexical Association, Computed over the DUC02 Data Set Using Various Association Measures

2. If a document has more than one target summary, an average was taken over the average lexical association for each summary.

3. We will also use avLexDoc and avLexSum to denote the corresponding measures for a document and a summary, respectively.
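For comparison, both baseline measures can be computed from the same document-frequency counts. The following is a minimal sketch; the decomposition of the joint counts into the four presence/absence cells follows the standard 2x2 contingency table, and the treatment of zero counts is our own convention, not taken from the paper.

```python
import math

def pmi(N, N_i, N_j, N_ij):
    """Point-wise mutual information between two terms (equation (11))."""
    if N_ij == 0:
        return float("-inf")  # undefined for terms that never cooccur
    return math.log((N * N_ij) / (N_i * N_j))

def mutual_information(N, N_i, N_j, N_ij):
    """Mutual information over the presence/absence indicators chi_i, chi_j
    (equation (12)); zero-probability cells contribute nothing (0 log 0 = 0)."""
    mi = 0.0
    for chi_i in (0, 1):
        for chi_j in (0, 1):
            # document counts for the four presence/absence combinations
            if chi_i and chi_j:
                n = N_ij
            elif chi_i:
                n = N_i - N_ij
            elif chi_j:
                n = N_j - N_ij
            else:
                n = N - N_i - N_j + N_ij
            p_joint = n / N
            p_i = (N_i if chi_i else N - N_i) / N
            p_j = (N_j if chi_j else N - N_j) / N
            if p_joint > 0:
                mi += p_joint * math.log(p_joint / (p_i * p_j))
    return mi
```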

Tables 1 and 2 compare the average lexical association of the documents and summaries, computed over the DUC01 and DUC02 data sets, respectively, for various lexical association measures. The second and third columns denote the average lexical association computed using (10) over the document and target summary, respectively, using the corresponding lexical association measures. The Student's t-test and the Wilcoxon signed rank test were performed to determine whether the difference was statistically significant. The significance values (p-values) corresponding to the t-test and the Wilcoxon signed rank test are reported in the fourth and fifth columns of these tables, respectively. The last column shows the ratio of the average of the average lexical association of the target summaries ($(avLexSum)_{av}$) and the average of the average lexical association of the documents ($(avLexDoc)_{av}$).

From Tables 1 and 2, it is clear that by using the PMI measure, the lexical association between document terms is higher than that between the summary terms. Therefore, the PMI measure may not be a suitable choice for the possible application in document summarization, which will be verified later in Section 4. Using the MI and Bernoulli measures, on the other hand, the average lexical association between the terms in a human summary is higher than that in the original document. As verified by the two different statistical tests, the difference is statistically significant using both these association measures and, therefore, the hypothesis holds true for both the MI and Bernoulli measures. However, the significance level as well as the ratio of average lexical association between the target summary and the original document is much higher for the Bernoulli measure as compared to the MI measure. Thus, the proposed Bernoulli measure is a better fit for H2.

3.3 Context-Based Word Indexing

Given the lexical association measure between two terms in a document from hypothesis H2, the next task is to calculate the context sensitive indexing weight of each term in a document using hypothesis H3. A graph-based iterative algorithm is used to find the context sensitive indexing weight of each term. Given a document $D_i$, a document graph $G$ is built. Let $G = (V, E)$ be an undirected graph to reflect the relationships between the terms in the document $D_i$. $V = \{v_j \mid 1 \le j \le |V|\}$ denotes the set of vertices, where each vertex is a term appearing in the document. $E$ is a matrix of dimensions $|V| \times |V|$. Each edge $e_{jk} \in E$ corresponds to the lexical association value between the terms corresponding to the vertices $v_j$ and $v_k$. The lexical association of a term with itself is set to 0. To use the PageRank-based algorithm, $E$ is normalized as $\tilde{E} = (\tilde{E}_{j,k})_{|V| \times |V|}$ to make the sum of each row equal to 1. $\tilde{E}$ is defined as

$$\tilde{E}_{j,k} = \begin{cases} \dfrac{e_{jk}}{\sum_{k=1}^{|V|} e_{jk}}, & j \neq k, \\ 0, & \text{otherwise}. \end{cases} \qquad (13)$$

Based on the graph $G$ and the normalized association matrix $\tilde{E}$, the context-sensitive indexing weight of each word $v_j$ in a document $D_i$, denoted by $indexWt(v_j)$, is to be calculated. It can be found in a recursive way using the PageRank-based algorithm. Algorithm 3.1 describes the pseudocode of the proposed algorithm. $\mu$ is the damping factor. For implementation, $indexWt(v_j)$ is initialized to 1.0 for all the document terms. $memoWt(v_j)$ is a buffer that stores the indexing weights of the previous iteration. The convergence of the iterative algorithm is achieved when the difference between the scores computed at two successive iterations falls below a given threshold $\epsilon$.

Algorithm 3.1. CONTEXTBASEDWORDINDEXING($\tilde{E}$, $\mu$, $\epsilon$)

    initialize $indexWt[v_j] \leftarrow 1$ for all $j$, error $E \leftarrow \infty$
    while $E \ge \epsilon$
        $E \leftarrow 0$
        for $j \leftarrow 1$ to $|V|$
            $memoWt[v_j] \leftarrow indexWt[v_j]$
            $indexWt[v_j] \leftarrow \mu \cdot \sum_{\forall k \neq j} indexWt[v_k] \cdot \tilde{E}_{kj} + \frac{1-\mu}{|V|}$
            $E \leftarrow E + (indexWt[v_j] - memoWt[v_j])^2$
        $E \leftarrow \sqrt{E}$
    return $indexWt$
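A vectorized sketch of Algorithm 3.1 is given below. The numpy formulation is ours: it normalizes the raw association matrix as in (13), assumes the diagonal of the matrix is zero, and updates all weights from the previous iteration rather than in place.

```python
import numpy as np

def context_based_word_indexing(E, mu=0.85, eps=1e-4):
    """Iterative, PageRank-style computation of context-sensitive word
    weights from a |V| x |V| lexical association matrix E (Algorithm 3.1)."""
    V = E.shape[0]
    row_sums = E.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0      # guard against isolated terms
    E_tilde = E / row_sums             # equation (13): row-normalized matrix
    index_wt = np.ones(V)
    err = np.inf
    while err >= eps:
        memo_wt = index_wt.copy()
        # indexWt[v_j] = mu * sum_k indexWt[v_k] * E~_kj + (1 - mu) / |V|
        index_wt = mu * (E_tilde.T @ memo_wt) + (1 - mu) / V
        err = np.sqrt(((index_wt - memo_wt) ** 2).sum())
    return index_wt
```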

3.4 Sentence Similarity Using the Context-Based Indexing

The model described above gives a context-sensitive indexing weight to each document term. The next step is to use these indexing weights to calculate the similarity between any two sentences. Given a sentence $s_j$ in the document $D_i$, the sentence vector is built using the $indexWt()$ values computed as per Algorithm 3.1. The sentence vector $\vec{s}_j$ is constructed such that if a term $v_k$ appears in $s_j$, it is given a weight $indexWt(v_k)$; otherwise, it is given a weight 0. The similarity between two sentences $s_j$ and $s_l$ is computed using the dot product, i.e., $sim(s_j, s_l) = \vec{s}_j \cdot \vec{s}_l$.
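A minimal sketch of the resulting dot-product similarity, assuming the weights are held in a dictionary keyed by term (our own representation):

```python
def sentence_similarity(sent_j, sent_l, index_wt):
    """Dot product of two sentence vectors built from the context-sensitive
    indexing weights; terms absent from a sentence contribute weight 0."""
    shared = set(sent_j) & set(sent_l)
    return sum(index_wt.get(w, 0.0) ** 2 for w in shared)
```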

Besides using the new sentence similarity measure, the paradigm presented in Wan and Xiao [1] and described in Section 2 is used for calculating the score of the sentences. The proposed method will be denoted by "bern," corresponding to the "Bernoulli" measure. Section 4 reports the experiments using the proposed model.

4 EXPERIMENTS

4.1 Evaluation Setup

The benchmark data sets from the DUC are used to evaluate the text summarization systems. These data sets are distributed through TREC. The data sets from DUC01 [51] and DUC02 [52] are used for the experiments.4 The aim of the tasks in DUC01 and DUC02 was to evaluate generic summaries for a document with an approximate length of 100 words. The data sets used are English news articles, collected from TREC-9 for a single document summarization task. The DUC01 data set contains 303 unique documents, which can broadly be categorized into 30 news topics. The DUC02 data set contains 533 unique documents, categorized into 59 news topics. These data sets also contain the reference summary for each document.

4. Consistent with the experiments reported in Wan and Xiao [1].

While there are many approaches to evaluate the quality of system generated summaries [53], [54], ROUGE [55], [56] is the most widely used toolkit for the evaluation of system generated summaries. ROUGE measures the summary quality by counting the overlapping units such as n-grams, word sequences and word pairs between the candidate summary (system generated) and the reference summary. An n-gram recall measure (ROUGE-N) computed through the ROUGE toolkit is given as

$$\text{ROUGE-}N = \frac{\sum_{S \in \{RefSum\}} \sum_{n\text{-gram} \in S} Count_{match}(n\text{-gram})}{\sum_{S \in \{RefSum\}} \sum_{n\text{-gram} \in S} Count(n\text{-gram})}, \qquad (14)$$

where $n$ refers to the length of the n-gram, $Count(n\text{-gram})$ is the number of n-grams in the reference summaries and $Count_{match}(n\text{-gram})$ is the maximum number of n-grams cooccurring in the candidate summary and the reference summary. Of all the scores reported by the ROUGE toolkit, the ROUGE-1 (unigram-based) and ROUGE-2 (bigram-based) scores have been shown to be in close agreement with human judgment [55].
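A simplified sketch of the ROUGE-N recall in (14); the tokenization and the clipping of matched n-gram counts are our own simplification of what the ROUGE toolkit does.

```python
from collections import Counter

def rouge_n(candidate_tokens, reference_token_lists, n=1):
    """ROUGE-N recall: clipped n-gram overlap between a candidate summary
    and one or more reference summaries (equation (14))."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate_tokens)
    match, total = 0, 0
    for ref_tokens in reference_token_lists:
        ref = ngrams(ref_tokens)
        total += sum(ref.values())
        match += sum(min(count, cand[gram]) for gram, count in ref.items())
    return match / total if total else 0.0
```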

In the DUC01 data set, only one reference summary is provided for each document. The DUC02 data set, on the other hand, contains multiple reference summaries for each document. The "-l" option in the ROUGE toolkit is used to truncate the system generated summaries to a length of 100 words. The ROUGE-1 (unigram-based) and ROUGE-2 (bigram-based) scores returned by the toolkit are reported.

4.2 Sensitivity Analysis

For the context-based word indexing algorithm, the parameters $\epsilon$ (threshold) and $\mu$ (damping factor) were determined through experiments over the DUC02 data set, with the Bernoulli lexical association measure used with the IntraLink model. The parameter $\epsilon$ was varied in the range $\{0.1, 0.01, 0.001, 0.0001\}$ and the parameter $\mu$ was varied in the range $[0.05, 0.95]$ with a step size of 0.1. The sensitivity of the results with respect to the parameter $\epsilon$ is shown in Table 3. Fig. 2 demonstrates the sensitivity of the ROUGE-1 and ROUGE-2 scores with respect to the parameter $\mu$.

From Table 3, the results were quite insensitive to the choice of $\epsilon$ in the range $\{0.1, 0.01, 0.001, 0.0001\}$. For the rest of the experiments, $\epsilon$ was set to 0.0001 as it gave the best results. For $\epsilon = 0.0001$, on average, 5.6 iterations were required for each document for the proposed algorithm to converge. From Fig. 2, the results were not very sensitive to the parameter $\mu$ in the range $(0.65, 0.95)$. $\mu = 0.85$ was chosen for the rest of the experiments as it gave the best results.

4.3 Comparison of Various Systems

For comparison, only the IntraLink and UniformLink models are used in this paper. IntraLink is the simplest model, which uses a single document. UniformLink was shown to perform superior to InterLink by Wan and Xiao [1], while both UniformLink and InterLink are nearly the same in terms of the complexity involved.

First, we compare the performance of various lexical association measures for document summarization. Table 4 compares the performance of various lexical association measures over the DUC01 and DUC02 data sets with IntraLink as the baseline model.

From Table 4, it is clear that the proposed Bernoulli-based lexical association measure outperforms the PMI and MI measures. While the PMI measure gave slight improvements over DUC02, the performance was worse than the baseline for DUC01. This clearly corresponds to the empirical evidence shown in Tables 1 and 2. The MI measure provided improvements over the baseline for both data sets, with the only exception being the ROUGE-2 score for the DUC01 data set. However, the improvements were much higher for the Bernoulli measure, clearly validating the choice of the Bernoulli measure for the hypothesis H2 in this paper.

Fig. 2. Sensitivity analysis with respect to the damping factor $\mu$ for DUC02 IntraLink+Bernoulli for (a) ROUGE-1 and (b) ROUGE-2 scores.

TABLE 3. Sensitivity Analysis with Respect to the Threshold $\epsilon$ for DUC02 IntraLink+Bernoulli

TABLE 4. Comparison of Various Lexical Association Measures with the DUC Data Sets

For the UniformLink model, document similarities can be calculated either via the cosine similarity or using the indexing weights obtained using the association measure in the proposed model. "+neB" will denote the computation of neighboring documents using the association measure. "-neB" will denote the use of cosine similarity for the neighborhood construction. The comparison on both the DUC01 and DUC02 data sets is shown in Table 5. Please note that the baseline model "UniformLink" is the same as "UniformLink-neB," since it uses the cosine similarity for the computation of neighboring documents. We do not report the experiments with the variation "UniformLink+neB" since it does not use the context-based word indexing for the sentence similarity computation.

As is clear from Table 5, both the systems "bern+neB" and "bern-neB" outperform the baseline UniformLink model. The improvements are more visible on the ROUGE-2 score than on ROUGE-1. "bern+neB" will be selected as the proposed enhancement for the UniformLink model as the improvements obtained were higher. It also makes the system consistent, as only the context-based word indexing weights will be used for all the computations.

Once the systems "bern" and "bern+neB" are selected as the enhancements over the IntraLink and UniformLink models, respectively, Tables 6 and 7 compare the proposed enhancements when applied with the IntraLink and UniformLink methods over the DUC01 and DUC02 data sets, respectively. For the UniformLink method, the number of neighboring documents is set to $n = 10$ for all experiments, as reported in [1]. The proposed model is denoted by adding "bern" and "bern+neB" to the corresponding baseline models "IntraLink" and "UniformLink," respectively. For each system, the ROUGE-1 and ROUGE-2 scores, as returned by the ROUGE toolkit, are provided, along with the 95 percent confidence interval, shown in square brackets.

The results in Tables 6 and 7 lead to the following observations:

. Using the word indexing by the Bernoulli cooccurrence measure always outperforms the corresponding baseline model.

. For the DUC01 data set, the IntraLink system performed better than the UniformLink system (the difference is visible on the ROUGE-1 scores). On the other hand, for the DUC02 data set, the UniformLink system achieved a better performance. Applying the Bernoulli model also gives the same results.

. For both the data sets, the Bernoulli model applied in the simplest setting (IntraLink+bern) outperforms both the baseline systems (IntraLink/UniformLink) for both the ROUGE-1 and ROUGE-2 results.

. The improvements provided by the proposed model are much more visible on ROUGE-2 than on ROUGE-1. ROUGE-2 measures the bigram-based similarity and, therefore, resembles more closely the syntactic similarity between two summaries.


TABLE 5. Comparison Results with the DUC Data Sets for the UniformLink Model

TABLE 6. Comparison Results with DUC01

TABLE 7. Comparison Results with DUC02

TABLE 8. Summary of Typical Participating Systems in DUC2002


We now compare the performance of our system to the actual participating systems in DUC2002. Table 8 gives a short summary of the typical participating systems in DUC2002 with high ROUGE scores. Table 9 shows the comparison results of our system with these five systems.

From Table 9, we can see that our system performs comparably to systems 31, 29, and 27, which were among the best participating systems in DUC2002. Though the performance of systems 28 and 21 is better than our system, it is to be noted that systems 28 and 21 used a supervised model for sentence extraction, and supervised techniques have been shown to perform better than the unsupervised methods.

5 DISCUSSION

The experimental results shown in the previous section, along with the empirical evidence shown in Fig. 1, validate the hypotheses proposed in this paper. In this section, a document from the DUC data sets is taken as an example to show how the proposed method produces a better summary than the baseline models. First, the original document from the DUC data set is shown (see Fig. 3) accompanied by the corresponding reference summary (see Fig. 4), as also provided with the data set. Then the summaries obtained by both the baseline models, IntraLink (see Fig. 5) and UniformLink (see Fig. 6), are shown. Finally, the summary obtained by applying the proposed Bernoulli model over the simplest baseline (i.e., IntraLink+bern) is shown (see Fig. 7).


TABLE 9. Comparison with Participating Systems in DUC2002

Fig. 3. The original document AP890704-0043 in DUC2001.

Fig. 4. The target summary for AP890704-0043.

Fig. 5. Summary produced by IntraLink for AP890704-0043.


The summaries shown above clearly reflect the advantage offered by the proposed Bernoulli-based word indexing model. The first two sentences in Fig. 7 are very similar to the first and third sentences in the manual summary (see Fig. 4). However, these sentences do not appear in the summary provided by IntraLink. In the summary provided by UniformLink, only the first sentence appears, at the third position. These two sentences contain a lot of contextual words such as "communist," "rebels," "violence," "police," and so on. Since the proposed indexing model gives an indexing weight using the lexical association with all other words, the weights of the contextual words are increased, which is reflected in their sentence similarity values. Therefore, these sentences become more central in the "IntraLink+Bernoulli" method than in the baseline models, which are used without context-based word indexing.

6 CONCLUSIONS AND FUTURE WORK

In this paper, the Bernoulli model of randomness has been used to develop a graph-based ranking algorithm for calculating how informative each of the document terms is. We proposed three hypotheses, which were used for the development. Hypothesis H1 is based on the intuition that a document summary contains the most salient information in a text and, therefore, the terms in a summary should be more lexically associated with each other than in the original document. This hypothesis was translated into an empirical relation, "The average lexical association between the terms in document summaries should be higher than the average lexical association between the terms in the original documents." The authors conjectured that if a lexical association measure follows H2, the average lexical association will be higher in summaries than in the documents, i.e., it will follow H1. The authors also conjectured that if a lexical association measure follows H2, it will follow H3, that is, it can be used to give a context sensitive indexing weight to the document terms. This hypothesis was correlated with the ROUGE scores.

The Bernoulli model of randomness was used to derive a novel lexical association measure and it was compared with two different lexical association measures, PMI and MI. Below, we attach the subscripts H1, H2, and H3 to the different measures to denote the degree to which they satisfy the empirical relation associated with that particular hypothesis. From Tables 1, 2, and 4,

$$(Bernoulli)_{H1} > (MI)_{H1} > (PMI)_{H1},$$

$$(Bernoulli)_{H3} > (MI)_{H3} > (PMI)_{H3},$$

and thus, the hypotheses H1 and H3 correlate for the three association measures and it looks rational enough to use these results as evidence that

$$(Bernoulli)_{H2} > (MI)_{H2} > (PMI)_{H2}.$$

Although PMI does not support the empirical relation associated with H1, the other two measures, MI and Bernoulli, do support this relation. From Table 4, we can see that the results with PMI for the empirical relation associated with H3 were also not conclusive, and it gave worse performance over DUC01. Because the empirical relations associated with H1 and H3 are consistent, the fact that PMI does not support H1 is more likely to indicate that PMI does not follow the hypothesis H2.

Using the proposed Bernoulli association measure, the lexical association between the terms in a target summary is higher compared to the association between the terms in a document. Thus, the proposed measure satisfies Hypothesis H1. It has been used along with the PageRank algorithm to give a context-sensitive indexing weight to the document terms to validate Hypothesis H3. The indexing weights thus obtained have been used to calculate the sentence similarity values. The underlying assumption in H3 was that the sentence similarity thus obtained would be context sensitive and, therefore, should provide improvements in the sentence extraction task for document summarization. The concept of topical and nontopical terms was used to modify the indexing weights of the document terms. Analysis of some of the documents and the corresponding summaries brought out the specific advantage offered by the proposed Bernoulli model-based context sensitive indexing.

The experiments performed using the benchmark DUC data sets confirm that the new context-based word indexing gives better performance than the baseline models. The interesting observation was that the Bernoulli model for word indexing, when applied with the simplest IntraLink model, performs better than all the baseline models. The proposed model is general and, since the sentence similarity measure is central to any sentence extraction model, is applicable to any of the sentence extraction techniques. It was empirically verified that the proposed Bernoulli lexical association measure outperforms the PMI and MI measures for the context sensitive word indexing used for document summarization.

Fig. 6. Summary produced by UniformLink for AP890704-0043.

Fig. 7. Summary produced by IntraLink+Bernoulli for AP890704-0043.

As per the overall results shown in Tables 6 and 7, although the maximum increase seen by any of the techniques is around 1 percent absolute gain in ROUGE-1 and ROUGE-2, it is noteworthy that the improvements are quite consistent. Using the IntraLink+bern model, around 6 percent relative gain in the ROUGE-2 score was obtained for both the DUC data sets. It is also to be noted that the authors have proposed a general enhancement model which can be used in combination with the existing enhancement models to provide additional improvements. Since the improvements are consistent (though not very large), they can only have a positive impact on a human user of the systems. As has been shown through an example summary, the proposed model gives a higher weight to the content-carrying terms and, as a result, the sentences are presented in such a way that the most informative sentences appear at the top of the summary, making a positive impact on the quality of the summary.

The model performed better than all the baseline models. However, the cooccurrences were calculated only on the given data set, which was actually very small (303 and 533 documents for DUC01 and DUC02, respectively). Further work would calculate the lexical association over a large corpus such as the BNC corpus and investigate whether it leads to additional improvements. It will also be instructive to perform experiments using this context sensitive document indexing approach for the information retrieval task. Additionally, it will be interesting to see the applicability of the proposed Bernoulli lexical association measure in other natural language applications such as word classification, word sense disambiguation and knowledge acquisition.

ACKNOWLEDGMENTS

This research is supported under the Centre of Excellence in Intelligent Systems project, funded by the Northern Ireland Integrated Development Fund and InvestNI.

REFERENCES

[1] X. Wan and J. Xiao, "Exploiting Neighborhood Knowledge for Single Document Summarization and Keyphrase Extraction," ACM Trans. Information Systems, vol. 28, pp. 8:1-8:34, http://doi.acm.org/10.1145/1740592.1740596, June 2010.

[2] K.S. Jones, "Automatic Summarising: Factors and Directions," Advances in Automatic Text Summarization, pp. 1-12, MIT Press, 1998.

[3] L.L. Bando, F. Scholer, and A. Turpin, "Constructing Query-Biased Summaries: A Comparison of Human and System Generated Snippets," Proc. Third Symp. Information Interaction in Context, pp. 195-204, http://doi.acm.org/10.1145/1840784.1840813, 2010.

[4] X. Wan, "Towards a Unified Approach to Simultaneous Single-Document and Multi-Document Summarizations," Proc. 23rd Int'l Conf. Computational Linguistics, pp. 1137-1145, http://portal.acm.org/citation.cfm?id=1873781.1873909, 2010.

[5] X. Wan, "An Exploration of Document Impact on Graph-Based Multi-Document Summarization," Proc. Conf. Empirical Methods in Natural Language Processing, pp. 755-762, http://portal.acm.org/citation.cfm?id=1613715.1613811, 2008.

[6] Q.L. Israel, H. Han, and I.-Y. Song, "Focused Multi-Document Summarization: Human Summarization Activity vs. Automated Systems Techniques," J. Computing Sciences in Colleges, vol. 25, pp. 10-20, http://portal.acm.org/citation.cfm?id=1747137.1747140, May 2010.

[7] C. Shen and T. Li, "Multi-Document Summarization via the Minimum Dominating Set," Proc. 23rd Int'l Conf. Computational Linguistics, pp. 984-992, http://portal.acm.org/citation.cfm?id=1873781.1873892, 2010.

[8] X. Wan and J. Yang, "Multi-Document Summarization Using Cluster-Based Link Analysis," Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 299-306, http://doi.acm.org/10.1145/1390334.1390386, 2008.

[9] D. Wang, T. Li, S. Zhu, and C. Ding, "Multi-Document Summarization via Sentence-Level Semantic Analysis and Symmetric Matrix Factorization," Proc. 31st Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 307-314, http://doi.acm.org/10.1145/1390334.1390387, 2008.

[10] S. Harabagiu and F. Lacatusu, "Using Topic Themes for Multi-Document Summarization," ACM Trans. Information Systems, vol. 28, pp. 13:1-13:47, http://doi.acm.org/10.1145/1777432.1777436, July 2010.

[11] H. Daume III and D. Marcu, "Bayesian Query-Focused Summarization," Proc. 21st Int'l Conf. Computational Linguistics and the 44th Ann. Meeting of the Assoc. for Computational Linguistics, pp. 305-312, http://dx.doi.org/10.3115/1220175.1220214, 2006.

[12] D.M. Dunlavy, D.P. O'Leary, J.M. Conroy, and J.D. Schlesinger, "QCS: A System for Querying, Clustering and Summarizing Documents," Information Processing and Management, vol. 43, pp. 1588-1605, http://portal.acm.org/citation.cfm?id=1284916.1285163, Nov. 2007.

[13] R. Varadarajan, V. Hristidis, and T. Li, "Beyond Single-Page Web Search Results," IEEE Trans. Knowledge and Data Eng., vol. 20, no. 3, pp. 411-424, Mar. 2008.

[14] L.-W. Ku, L.-Y. Lee, T.-H. Wu, and H.-H. Chen, "Major Topic Detection and Its Application to Opinion Summarization," Proc. 28th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval, pp. 627-628, http://doi.acm.org/10.1145/1076034.1076161, 2005.

[15] E. Lloret, A. Balahur, M. Palomar, and A. Montoyo, "Towards Building a Competitive Opinion Summarization System: Challenges and Keys," Proc. Human Language Technologies: The 2009 Ann. Conf. of the North Am. Ch. Assoc. for Computational Linguistics, Companion Vol.: Student Research Workshop and Doctoral Consortium, pp. 72-77, http://portal.acm.org/citation.cfm?id=1620932.1620945, 2009.

[16] J.G. Conrad, J.L. Leidner, F. Schilder, and R. Kondadadi, "Query-Based Opinion Summarization for Legal Blog Entries," Proc. 12th Int'l Conf. Artificial Intelligence and Law, pp. 167-176, http://doi.acm.org/10.1145/1568234.1568253, 2009.

[17] H. Nishikawa, T. Hasegawa, Y. Matsuo, and G. Kikui, "Opinion Summarization with Integer Linear Programming Formulation for Sentence Extraction and Ordering," Proc. 23rd Int'l Conf. Computational Linguistics: Posters, pp. 910-918, http://portal.acm.org/citation.cfm?id=1944566.1944671, 2010.

[18] C.C. Chen and M.C. Chen, "TSCAN: A Content Anatomy Approach to Temporal Topic Summarization," IEEE Trans. Knowledge and Data Eng., vol. 24, no. 1, pp. 170-183, Jan. 2012.

[19] D.R. Radev, H. Jing, M. Styś, and D. Tam, "Centroid-Based Summarization of Multiple Documents," Information Processing and Management, vol. 40, pp. 919-938, http://portal.acm.org/citation.cfm?id=1036118.1036121, Nov. 2004.

[20] Z. Harris, Mathematical Structures of Language. Wiley, 1968.

[21] K. Morita, E.-S. Atlam, M. Fuketra, K. Tsuda, M. Oono, and J.-i. Aoe, "Word Classification and Hierarchy Using Co-Occurrence Word Information," Information Processing and Management, vol. 40, pp. 957-972, http://portal.acm.org/citation.cfm?id=1036118.1036123, Nov. 2004.

[22] T. Yoshinari, E.-S. Atlam, K. Morita, K. Kiyoi, and J.-i. Aoe, "Automatic Acquisition for Sensibility Knowledge Using Co-Occurrence Relation," Int'l J. Computer Applications in Technology, vol. 33, pp. 218-225, http://portal.acm.org/citation.cfm?id=1477782.1477797, Dec. 2008.

GOYAL ET AL.: A CONTEXT-BASED WORD INDEXING MODEL FOR DOCUMENT SUMMARIZATION 1703

Page 12: A Context-Based Word Indexing Model for Document Summarization

[23] B. Andreopoulos, D. Alexopoulou, and M. Schroeder, “WordSense Disambiguation in Biomedical Ontologies with Term Co-Occurrence Analysis and Document Clustering,” Int’l J. DataMining and Bioinformatics, vol. 2, pp. 193-215, http://portal.acm.org/citation.cfm?id=1413934.1413935, Sept. 2008.

[24] P. Goyal, L. Behera, and T. McGinnity, “Query RepresentationThrough Lexical Assoc. for Information Retrieval,” IEEE Trans.Knowledge and Data Eng., vol. 24, no. 12, pp. 2260-2273, Dec. 2011.

[25] K. Cai, C. Chen, and J. Bu, “Exploration of Term Relationship forBayesian Network Based Sentence Retrieval,” Pattern RecognitionLetters, vol. 30, no. 9, pp. 805-811, 2009.

[26] H. Li, “Word Clustering and Disambiguation Based on Co-Occurrence Data,” Nat’l Language Eng., vol. 8, pp. 25-42, http://portal.acm.org/citation.cfm?id=973860.973863, Mar. 2002.

[27] G. Amati and C.J. Van Rijsbergen, “Probabilistic Models ofInformation Retrieval Based on Measuring the Divergence fromRandomness,” ACM Trans. Information Systems, vol. 20, pp. 357-389, http://doi.acm.org/10.1145/582415.582416, Oct. 2002.

[28] D.E. Losada and L. Azzopardi, “Assessing Multivariate BernoulliModels for Information Retrieval,” ACM Trans. InformationSystems, vol. 26, pp. 17:1-17:46, http://doi.acm.org/10.1145/1361684.1361690, June 2008.

[29] L. Page, S. Brin, R. Motwani, and T. Winograd, “The PageRankCitation Ranking: Bringing Order to the Web,” technical report,Stanford Digital Library Technologies Project, http://citeseer.ist.psu.edu/page98pagerank.html, 1998.

[30] J. Turner and E. Charniak, “Supervised and UnsupervisedLearning for Sentence Compression,” Proc. 43rd Ann. Meeting onAssoc. for Computational Linguistics, pp. 290-297, http://dx.doi.org/10.3115/1219840.1219876, 2005.

[31] J. Clarke and M. Lapata, “Discourse Constraints for DocumentCompression,” Computational Linguistics, vol. 36, pp. 411-441, 2010.

[32] H. Daume III and D. Marcu, “A Noisy-Channel Model forDocument Compression,” Proc. 40th Ann. Meeting on Assoc. forComputational Linguistics, pp. 449-456, http://dx.doi.org/10.3115/1073083.1073159, 2002.

[33] C.-Y. Lin, “Improving Summarization Performance by SentenceCompression: A Pilot Study,” Proc. Sixth Int’l Workshop InformationRetrieval with Asian Languages, pp. 1-8, http://dx.doi.org/10.3115/1118935.1118936, 2003.

[34] D. Zajic, B.J. Dorr, J. Lin, and R. Schwartz, “Multi-CandidateReduction: Sentence Compression as a Tool for DocumentSummarization Tasks,” Information Processing and Management,vol. 43, pp. 1549-1570, http://portal.acm.org/citation.cfm?id=1284916.1285161, Nov. 2007.

[35] G. Erkan and D.R. Radev, “LexRank: Graph-Based LexicalCentrality as Salience in Text Summarization,” J. ArtificialIntelligence Research, vol. 22, pp. 457-479, http://portal.acm.org/citation.cfm?id=1622487.1622501, Dec. 2004.

[36] R. Mihalcea and P. Tarau, “A Language Independent Algorithmfor Single and Multiple Document Summarization,” Proc. Int’lJoint Conf. Natural Language Processing (IJCNLP), http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.1125, 2005.

[37] H.P. Luhn, “The Automatic Creation of Literature Abstracts,” IBMJ. Research and Development, vol. 2, pp. 159-165, http://dx.doi.org/10.1147/rd.22.0159, Apr. 1958.

[38] C.-Y. Lin and E. Hovy, “Identifying Topics by Position,” Proc. FifthConf. Applied Natural Language Processing, pp. 283-290, http://dx.doi.org/10.3115/974557.974599, 1997.

[39] E. Hovy and C.-Y. Lin, “Automated Text Summarization and theSummarist System,” Proc. Workshop Held at Baltimore, Maryland(TIPSTER ’98), pp. 197-214, http://dx.doi.org/10.3115/1119089.1119121, 1998.

[40] R. Katragadda, P. Pingali, and V. Varma, “Sentence PositionRevisited: A Robust Light-Weight Update Summarization ‘Base-line’ Algorithm,” Proc. Third Int’l Workshop Cross Lingual Informa-tion Access: Addressing the Information Need of Multilingual Societies,pp. 46-52, http://portal.acm.org/citation.cfm?id=1572433.1572440, 2009.

[41] Y. Ouyang, W. Li, Q. Lu, and R. Zhang, “A Study on PositionInformation in Document Summarization,” Proc. 23rd Int’l Conf.Computational Linguistics: Posters, pp. 919-927, http://portal.acm.org/citation.cfm?id=1944566.1944672, 2010.

[42] H.P. Edmundson, “New Methods in Automatic Extracting,”J. ACM, vol. 16, pp. 264-285, http://doi.acm.org/10.1145/321510.321519, Apr. 1969.

[43] C.-Y. Lin and E. Hovy, “The Automated Acquisition of TopicSignatures for Text Summarization,” Proc. 18th Conf. ComputationalLinguistics, pp. 495-501, http://dx.doi.org/10.3115/990820.990892, 2000.

[44] R. Barzilay and M. Elhadad, “Using Lexical Chains for TextSummarization,” Proc. ACL Workshop Intelligent Scalable TextSummarization, pp. 10-17, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.6428, 1997.

[45] Y. Ko and J. Seo, “An Effective Sentence-Extraction TechniqueUsing Contextual Information and Statistical Approaches for TextSummarization,” Pattern Recognition Letters, vol. 29, pp. 1366-1371,http://portal.acm.org/citation.cfm?id=1371261.1371371, July2008.

[46] R. Alguliev and R. Alyguliev, “Summarization of Text-BasedDocuments with a Determination of Latent Topical Sections andInformation-Rich Sentences,” Automatic Control and ComputerSciences, vol. 41, pp. 132-140, http://dx.doi.org/10.3103/S0146411607030030, 2007.

[47] J. Hintikka, “On Semantic Information,” Physics, Logic, and History,pp. 147-172, Springer, 1970.

[48] G. Marsaglia and J.C.W. Marsaglia, “A New Derivation ofStirling’s Approximation of n!” Am. Math. Monthly, vol. 97,pp. 826-829, http://portal.acm.org/citation.cfm?id=96077.96084,Sept. 1990.

[49] K.W. Church and P. Hanks, “Word Association Norms, MutualInformation, and Lexicography,” Computational Linguistics, vol. 16,no. 1, pp. 22-29, 1990.

[50] M. Karimzadehgan and C. Zhai, “Estimation of StatisticalTranslation Models Based on Mutual Information for Ad HocInformation Retrieval,” Proc. 33rd Int’l ACM SIGIR Conf. Researchand Development in Information Retrieval, pp. 323-330, http://doi.acm.org/10.1145/1835449.1835505, 2010.

[51] P. Over, “Introduction to DUC-2001: An Intrinsic Evaluation ofGeneric News Text Summarization Systems,” Proc. DUC WorkshopText Summarization, 2001.

[52] P. Over and W. Liggett, “Introduction to DUC: An IntrinsicEvaluation of Generic News Text Summarization Systems,” Proc.DUC Workshop Text Summarization, 2002.

[53] I. Mani, G. Klein, D. House, L. Hirschman, T. Firmin, and B.Sundheim, “Summac: A Text Summarization Evaluation,” Nat’lLanguage Eng., vol. 8, pp. 43-68, http://portal.acm.org/citation.cfm?id=973860.973864, Mar. 2002.

[54] A. Nenkova, R. Passonneau, and K. McKeown, “The PyramidMethod: Incorporating Human Content Selection Variation inSummarization Evaluation,” ACM Trans. Speech Language Proces-sing, vol. 4, no. 2, pp. 1-23, http://doi.acm.org/10.1145/1233912.1233913, May 2007.

[55] C.-Y. Lin and E. Hovy, “Automatic Evaluation of Summariesusing N-Gram Co-Occurrence Statistics,” Proc. Conf. North Am. Ch.Assoc. for Computational Linguistics on Human Language Technology,pp. 71-78, http://dx.doi.org/10.3115/1073445.1073465, 2003.

[56] C.-Y. Lin, G. Cao, J. Gao, and J.-Y. Nie, “An Information-TheoreticApproach to Automatic Evaluation of Summaries,” Proc. MainConf. Human Language Technology Conf. North Am. Chapter of theAssoc. of Computational Linguistics, pp. 463-470, http://dx.doi.org/10.3115/1220835.1220894, 2006.

1704 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 25, NO. 8, AUGUST 2013

Page 13: A Context-Based Word Indexing Model for Document Summarization

Pawan Goyal received the BTech degree in electrical engineering from the Indian Institute of Technology Kanpur, India, in 2007 and the PhD degree from the Intelligent Systems Research Centre, Faculty of Computing and Engineering, University of Ulster, Magee campus, in 2011. He is currently a postdoctoral researcher at INRIA Paris-Rocquencourt. His research interests include information retrieval, data mining, and computational linguistics.

Laxmidhar Behera (S’92-M’03-SM’03) received the BSc and MSc degrees in engineering from NIT Rourkela, India, in 1988 and 1990, respectively, and the PhD degree from IIT Delhi, India. He is currently a professor in the Department of Electrical Engineering, IIT Kanpur, India. He joined the Intelligent Systems Research Centre, University of Ulster, United Kingdom, as a reader on sabbatical from IIT Kanpur from 2007-2009. He was an assistant professor at BITS Pilani from 1995-1999 and pursued his postdoctoral studies at the German National Research Center for Information Technology, GMD, Sankt Augustin, Germany, from 2000-2001. He has also worked as a visiting researcher/professor at FHG, Germany, and ETH Zurich, Switzerland. He has more than 170 papers to his credit, published in refereed journals and presented in conference proceedings. His research interests include intelligent control, robotics, neural networks, and cognitive modeling. He is a senior member of the IEEE.

Thomas Martin McGinnity (M’82-SM’10) received the first class honours degree in physics and a doctorate from the University of Durham. He is a chartered engineer. He holds the post of professor of intelligent systems engineering within the Faculty of Computing and Engineering, University of Ulster, United Kingdom. He is currently a director of the Intelligent Systems Research Centre, which encompasses the research activities of approximately 100 researchers. He was formerly an associate dean of the faculty and a director of both the university’s technology transfer company, Innovation Ulster, and a spin-out company, Flex Language Services. His current research interests include the creation of bioinspired intelligent computational systems which explore and model biological signal processing, particularly in relation to cognitive robotics, and computational neuroscience modeling of neurodegeneration. He is the author or coauthor of more than 250 research papers, and has been awarded both a Senior Distinguished Research Fellowship and a Distinguished Learning Support Fellowship in recognition of his contribution to teaching and research. He is a senior member of the IEEE and a fellow of the IET.
