Generating Reading Orders over Document Collections

Georgia Koutrika, Lei Liu, Steven Simske

HP Laboratories
HPL-2014-5R1

Keyword(s): reading sequence mining

Abstract: Given a document collection, existing systems allow users to browse the collection or perform searches that return lists of documents ranked based on their relevance to the user query. While these approaches work fine when a user is trying to locate specific documents, they are insufficient when users need to access the pertinent documents in some logical order, for example for learning or editorial purposes. We present a system that automatically organizes a collection of documents in a tree from general to more specific documents, and allows a user to choose a reading sequence over the documents. This is a novel approach to content consumption that departs from the typical ranked lists of documents based on their relevance to a user query and from static navigational interfaces. We present a set of algorithms that solve the problem and we evaluate their performance as well as the reading trees generated.

External Posting Date: March 6, 2015 [Fulltext]. Approved for External Publication
Internal Posting Date: March 6, 2015 [Fulltext]

Copyright 2015 Hewlett-Packard Development Company, L.P.

Generating Reading Orders over Document Collections

Georgia Koutrika 1, Lei Liu 2, Steve Simske 3

HP Labs, Palo Alto, USA
1 [email protected]  2 [email protected]  3 [email protected]

Abstract—Given a document collection, existing systems allow users to browse the collection or perform searches that return lists of documents ranked based on their relevance to the user query. While these approaches work fine when a user is trying to locate specific documents, they are insufficient when users need to access the pertinent documents in some logical order, for example for learning or editorial purposes. We present a system that automatically organizes a collection of documents in a tree from general to more specific documents, and allows a user to choose a reading sequence over the documents. This is a novel approach to content consumption that departs from the typical ranked lists of documents based on their relevance to a user query and from static navigational interfaces. We present a set of algorithms that solve the problem and we evaluate their performance as well as the reading trees generated.

I. INTRODUCTION

There are two primary ways to access the contents of a document collection. A navigational interface provides an organized view of the contents. A search interface allows one to ask a query and see a dynamically generated ranked list of documents. Both these approaches may work fine when a user is trying to locate specific documents (e.g., 'the homepage of University X?') but they are insufficient when users need to access the pertinent documents in some logical order, for example for learning, research or editorial purposes.

Search engines hide document relationships while navigational interfaces capture only fixed relationships that do not dynamically adapt to a user's specific need. In both cases, to determine which resources to read and in what order, users need to manually sift through the documents returned by the system. This can be a tedious process: an individual may read a significant number of documents, in the wrong order, until it is understood how they relate to each other, and then possibly re-read them in the right order to fully grasp their contents.

In this paper, we propose organizing documents in a reading order from general to more specific documents. This type of reading order is extremely useful in several contexts and applications. For instance, it can benefit researchers looking for papers on a particular topic, editors selecting articles to publish on a web site, IT personnel trying to figure out which documentation they should read first, and so forth.

We propose automatically determining the reading order of a set of documents from general to more specific documents based on their underlying relationships. We define two types of specificity relationships between two documents d1 and d2. If they cover the same topics at the same level, they are considered equivalent. If they are about related topics but d2 is more focused on particular topics than d1, then d2 is more specific than d1. Equivalent documents can be read in any order (or a reader can choose to read any of them). When a document is more specific than another document, then the former should follow the latter in reading order. For example, a document about classification methods should follow a document that is an introduction to data mining.

We quantitatively define specificity relationships with the help of two metrics: document generality and document overlap. We measure document overlap and generality based on the documents' topical relationships. In other words, document specificity relations are determined by the document topical relationships. Our approach is domain-independent as it requires no external knowledge about the documents. We propose using the entropy as a measure of document generality and the weighted Jaccard for document overlap. Note that our reading order framework is independent of the metrics used.

We organize a set of documents in a reading tree that captures document specificity relationships. We define the reading order problem as the problem of computing a complete reading tree over a set of documents. Intuitively, a complete reading tree is a hierarchical tree where each node corresponds to one or more documents that are more general than the documents in any of their child nodes, and the edges capture the sequencing patterns between documents in the respective nodes. A reading tree follows a table-of-contents paradigm and hence it is easily understood by people.

Our approach to solving the reading order problem comprises a topic model for modeling the topic relationships between documents, a topic calibration method for refining the results of the topic model, and a tree generation algorithm that, as we prove, builds complete reading trees. The topic calibration method makes the whole approach more robust by correcting errors generated by the topic model. This method leverages the content similarity of documents to propagate document-topic scores between strongly similar documents on the grounds that, due to their similarity, they are more likely to share topics.

Evaluating the reading trees is non-trivial. Since this is a novel problem, there is no ground truth. Furthermore, we need to find a suitable performance metric. For this purpose, we propose a new metric for comparing reading trees whose objective is to measure the relative order of two documents and not how identical two reading trees are. In addition, we use three strategies to evaluate the generated reading trees: (a) we generate ground truth from Wikipedia, (b) we ask experts to generate reading trees for given sets of documents and we use these trees as another type of ground truth, and (c) we ask users to evaluate reading trees generated on different types of documents: research papers, Wikipedia pages, and news.

Generating reading trees provides a natural organization of any document collection and enables sequential reading. In particular, it can be applied in many areas, such as (a) education, for organizing online educational material, (b) patent searching, for helping users that are not familiar with a domain or problem (e.g., legal professionals) to understand which patents have more general coverage than others, and (c) research, for organizing publications to help researchers learn a new topic.

Contributions. The contributions of our work are as follows:

• We introduce the concept of ordering documents based on specificity relationships, which we define based on two topical measures, the document generality and topic overlap. We propose using the entropy for measuring the generality score of a document among other documents.
• We define the concept of a complete reading tree and we formally define the problem of determining the reading order of a set of documents.
• We present a reading tree construction algorithm that leverages the document topic relationships for building complete reading trees over a set of documents.
• To compensate for possible errors of the topic model, we propose a topic model calibration method that is designed to minimize an objective function similar to the one used for semi-supervised learning with local and global consistency.
• We evaluate the performance of the algorithms and the effect of the various parameters on the form of the reading trees and execution times.
• We introduce an evaluation metric that is suitable for comparing reading trees and we apply different strategies for evaluating the generated reading trees.

II. RELATED WORK

A document collection can be accessed through navigational interfaces and search engines. Navigational interfaces provide a fixed view over a known collection. Existing search engines, such as Google and Bing, rank documents based on their relevance to a query. These ranked lists do not suggest any reading order among the documents. Google's advanced search interface¹ organizes search results into three reading levels: basic, intermediate, and advanced. However, the reading level classification provides only a very coarse document ordering.

Clustering-based search engines, such as Carrot², cluster search results, allowing the users to elaborate their initial query by focusing on a specific cluster. There is substantial prior work on hierarchical document clustering methods, including techniques for iteratively partitioning (merging) the dissimilar (similar) documents [1], incremental clustering [2], and probabilistic taxonomy modeling [3]. Hierarchical clustering trees are inherently different from our reading trees: any level of a clustering tree is a segmentation of all documents, whereas any level of a reading tree maps a unique subset of the input documents. Hence, in a clustering tree, a document appears at all levels of the hierarchy. On the other hand, a reading tree defines a partial order on the input documents.

¹ https://www.google.com/advanced_search (last accessed October 2014)
² http://search.carrot2.org/

Instead of clustering the documents into groups that are assigned to the nodes of a tree, label tree learning methods aim at classifying documents in multiple classes [4], [5], [6]. Classes are automatically organized into a label tree that captures the relationships between them. A label tree generates a segmentation of the training documents but it does not specify any order among these documents.

Efforts to understand a corpus of documents include corpus summarization approaches [7], [8]. Ordering documents in some meaningful way has been studied in the context of news articles [9], [10]. The method described in [10] starts with two news articles and automatically finds a topic-based coherent chain of articles linking them together. Incident threading is the process of identifying incidents in a news stream and connecting them through contextual links, such as 'consequence of' or 'follow-up' [9]. These approaches work with particular types of documents (e.g., articles that describe strong events [9]). Our approach is general-purpose and domain-independent, and it aims at ordering a corpus of documents based on the document specificity relationships, which is a novel approach to corpus analysis.

A related line of research is the generation of personalized curriculum sequences in e-learning systems [11], [12], [13], [14]. Curriculum sequencing can be seen as a two-part process: deciding on relevant topics based on the current student model, which captures the student's goals and performance, and then selecting the best course module(s) to show to the student [15]. The main shortcoming of these methods is that they are designed either for a specific course [13], [14] or for settings where the domain knowledge is available [11], [12]. Therefore, they cannot be extended to ordering random document sets.

The concept of 'specialization/generality' is inherent in navigation hierarchies (e.g., IS-A hierarchies in ontologies), where the direction of the edges usually implies that an entry closer to the root is more general than a descendant deeper in the hierarchy. However, quantifying how general one entry is relative to another one is not trivial. Most existing techniques extract these relationships either (a) from the information content of the terms in an ontology (computed over a large corpus) [16] or (b) from the structure (e.g., density, depth) of the hierarchy itself [17]. A third way leverages the keywords contained in the entries to decide the degree of generality based on the relative content of two documents [18], [19].

We also rely on the textual content of documents but we model documents using topics. We do not assume a pre-existing hierarchy. Our objective is to automatically organize a collection of documents in a reading tree where each node corresponds to one or more similar documents and the edges capture general-to-more-specific sequencing patterns between documents. To the best of our knowledge, there is no prior work for generating such reading sequences over documents.

III. SPECIFICITY RELATIONSHIPS BETWEEN DOCUMENTS

There are different types of document relationships that can determine the order in which two documents should be read. For example, two documents on the same topic may be ordered based on their importance: the most important document should be read first. Another example is a prerequisite relationship: a document on probabilities should be read before a document on Bayesian networks. In this paper, we focus on the specificity relationships between documents, which are determined based on their contents.

We consider a set D of documents. d refers to a single document in D (di when more than one document is referred to). We identify two types of specificity relationship, sequencing and equivalence. We first give an intuitive definition. Later we define them more formally.

A specificity equivalence relation di ↔ dj signifies that di and dj cover the same topics at the same level. We say that di and dj are equivalent based on specificity.

A specificity sequencing relation di → dj signifies that di and dj have some overlap but dj is more specific or focused than di. We say that di precedes dj based on specificity.

As an example, consider the following documents: d1 is an introduction to data mining, d2 describes classification methods, and d3 is another introductory paper on data mining. Both d1 and d3 cover the same topic to a similar extent and hence they have a specificity equivalence relation, d1 ↔ d3. Consequently, one can choose to read any of them. However, d2 is more focused, i.e., it has a specificity sequencing relation to the other documents: d1 → d2 and d3 → d2.

We define two specificity metrics:

• The generality score of a document is a function g: D → R that computes the generality of a document. It holds that di is more general than dj iff g(di) ≥ g(dj).
• The overlap score of a pair of documents is a function o: D × D → [0, 1] that computes the degree of their commonality. A score equal to 0 means no overlap, while 1 means maximum overlap.

We now formally define specificity relations as follows:

Equivalence: di ↔ dj iff |g(di) − g(dj)| ≤ κ ∧ o(di, dj) ≥ τ    (1)

Sequencing: di → dj iff g(di) > g(dj) ∧ o(di, dj) > 0 ∧ (|g(di) − g(dj)| > κ ∨ o(di, dj) < τ)    (2)

τ defines the minimum overlap between two equivalent documents and κ defines the maximum difference of their generality scores.
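To make the two definitions concrete, the following minimal Python sketch expresses relations (1) and (2) as predicates. It assumes generality and overlap functions g and o (instantiated in Section III-A) and caller-supplied thresholds; the function names are ours, for illustration only.

# Minimal sketch of the specificity relations (1) and (2).
# g(d) and o(d1, d2) are the generality and overlap functions
# defined in Section III-A; kappa and tau are the thresholds.

def equivalent(d1, d2, g, o, kappa, tau):
    """Equivalence (1): close generality and sufficient overlap."""
    return abs(g(d1) - g(d2)) <= kappa and o(d1, d2) >= tau

def precedes(d1, d2, g, o, kappa, tau):
    """Sequencing (2): d1 is more general, the documents overlap,
    and they are not equivalent."""
    return (g(d1) > g(d2) and o(d1, d2) > 0
            and (abs(g(d1) - g(d2)) > kappa or o(d1, d2) < tau))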

A. Measures of document generality and overlap

We measure document overlap and generality based on the documents' topical relationships. Topic models [20] represent documents as mixtures of topics, where a topic is a probability distribution over words. A topic model aims at discovering the hidden thematic structure of a collection of documents by finding how topics are assigned to documents, and how topics are described by words in the documents. Representing a document using topics rather than document keywords allows capturing implicit relationships between documents, not just the explicit similarity of their common words.

Given a collection D of n documents, a topic model generates s topics, t1, ..., ts, that describe this collection. Fn×s is the document-topic matrix that captures how the s topics are assigned to the n documents. We denote by Fi the topic vector associated with di. Fim ∈ [0, 1], with i ≤ n and m ≤ s, is the probability that topic tm is assigned to document di.

Based on the above, the generality score is defined as a function over Fn×s, i.e., g: Fn×s → R. Similarly, the overlap score of a pair of documents is defined as o: Fn×s × Fn×s → [0, 1].

Document generality. We can measure the document generality based on the document's entropy over the topics. The basic intuition behind the entropy is that the higher a document's entropy is, the more topics the document covers and hence the more general it is. Using the Shannon entropy, the generality score g(di) of document di is computed as follows:

g(di) = H(di) = −Σm Fim log(Fim)    (3)

Document overlap. The overlap of two documents can be computed as their weighted Jaccard score [21]. The weighted Jaccard extends the classic Jaccard index, which is defined as the size of the intersection divided by the size of the union of the topic sets assigned to each document, by taking into account their topic probabilities. The overlap score can be computed as follows:

o(di, dj) = Jaccard(di, dj) = (Fi · Fj) / (|Fi|² + |Fj|² − Fi · Fj)    (4)

Higher overlap scores indicate more common topics between the documents.
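Equations (3) and (4) are straightforward to compute from the rows of the document-topic matrix. The sketch below is one possible NumPy rendering; the paper's worked examples use base-10 logarithms, so np.log10 is used here, and the small epsilon guarding log(0) is our addition.

import numpy as np

def generality(F_i, eps=1e-12):
    """Equation (3): entropy of a document's topic vector.
    Base-10 logs reproduce the scores in the paper's examples."""
    F_i = np.asarray(F_i, dtype=float)
    return float(-np.sum(F_i * np.log10(F_i + eps)))  # eps guards log(0)

def overlap(F_i, F_j):
    """Equation (4): weighted Jaccard score of two topic vectors."""
    F_i = np.asarray(F_i, dtype=float)
    F_j = np.asarray(F_j, dtype=float)
    dot = F_i @ F_j
    return float(dot / (F_i @ F_i + F_j @ F_j - dot))

On the vectors of Table II below, these functions reproduce the reported scores, e.g., o(d1, d3) ≈ 0.336 and g(d1) ≈ 0.649.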

Example 1. To illustrate the document generality and overlap using the entropy and Jaccard score, respectively, we consider 6 Wikipedia documents related to "Machine Learning" as shown in Table I, and we use the content of the corresponding Wikipedia page as input to the topic model. Table II shows the assignment of five topics to the six documents. To get an idea of the meaning of each topic, we show the top five terms for each topic in Table III. Consider, for example, documents d1 and d3. d1 covers all topics almost to the same extent whereas d3 focuses on topic2. Their overlap is o(d1, d3) = 0.33605 and their generality scores are g(d1) = 0.6494 > g(d3) = 0.53066. Hence, d1 ('Machine Learning') is more general than d3 ('Support Vector Machine'), which is actually true. As another case, documents d2 and d5 have very small overlap, o(d2, d5) = 0.2289. On the other hand, their generality scores are close, g(d2) = 0.53748 and g(d5) = 0.60333. This indicates that d2 ('Supervised Learning') and d5 ('Unsupervised Learning') will end up at different branches of the reading tree but probably at the same level, which reflects their actual relation. □

TABLE I
DOCS

ID   Doc Name
d1   Machine Learning
d2   Supervised Learning
d3   Support Vector Machine
d4   Linear Regression
d5   Unsupervised Learning
d6   Data Clustering

TABLE II
DOC-TOPIC MAPPING

ID   topic1   topic2   topic3   topic4   topic5
d1   0.2334   0.1317   0.2602   0.3108   0.0636
d2   0.0486   0.5544   0.1657   0.0554   0.1756
d3   0.0542   0.5756   0.1568   0.0594   0.1536
d4   0.1587   0.0495   0.0331   0.0274   0.7311
d5   0.2887   0.0594   0.286    0.3203   0.0453
d6   0.2959   0.0586   0.2859   0.3152   0.0441

TABLE III
TOPIC-TERM MAPPING

Topic    Terms (top 5)
topic1   analysis; learning; variable; unsupervised; component
topic2   learning; space; set; vector; training
topic3   data; classification; number; problem; set
topic4   learning; cluster; PCA; decomposition; principal
topic5   linear; regression; input; function; models

Note that other metrics for measuring document generality and overlap are possible. For example, instead of the Shannon entropy, we could use the residual entropy (entropy of non-common terms). Given a set of estimated topic probabilities for each document, the overlap of two documents can be compared using similarity measures such as the cosine similarity or the unnormalized dot product. The algorithms we present are independent of how generality and overlap are measured.

IV. THE READING ORDER PROBLEM

A reading graph R(V, E) over a document set D is a directed acyclic graph whose nodes correspond to the input documents and edges capture specificity relations among the documents. In particular, a node vi ∈ V maps a non-empty set Di ⊆ D of equivalent documents. An edge vi → vj between nodes vi and vj signifies that documents belonging to the corresponding document set Di precede documents belonging to the respective set Dj.

A complete reading tree over a document set D is a reading graph R(V, E) with the following properties:
(a) For each node vi ∈ V with Di being its corresponding set of documents, it holds that a document d ∈ D maps to vi iff d ↔ di, for all di ∈ Di.
(b) For each pair of nodes vi, vj ∈ V with Di and Dj being their sets of documents, and an edge vi → vj, it holds that for each pair of documents di ∈ Di and dj ∈ Dj, it is di → dj.
(c) For each pair of nodes vi, vj ∈ V with Di and Dj being their sets of documents, and an edge vi → vj, it holds that there is no other node vk such that vi → vk → vj.

A reading sequence dm1 → dm2 → ··· → dmk of documents dmi ∈ D, i = 1 ... k, maps to a path vl1 → vl2 → ··· → vlk on the graph with vli ∈ V, i = 1 ... k, such that dmi ∈ Dli of node vli, i = 1 ... k.

Figure 1(a) shows an example reading tree over a set of six documents. Several reading sequences may be derived over the document relationships. The figure shows an example reading sequence: d1 → d4 → d6. Furthermore, there may be more than one way to represent the same set of documents as a complete reading graph. For example, consider documents d1, d2 and d3, which have some overlap and d1 ↔ d2 and d2 ↔ d3 but not d1 ↔ d3. Figure 1(b) shows two possible complete reading trees. Finally, multiple reading trees may be needed to cover a document collection. For instance, documents with no overlap will be mapped to different reading trees.

Fig. 1. Example reading trees.

The reading order problem is defined as follows:

READING ORDER COMPUTATION. Given a document collection D, find all complete reading trees over D.

V. SYSTEM ARCHITECTURE

The system is depicted in Figure 2. The input is a set D of documents and the output is one or more reading trees mapping the documents in D.

Pre-processing. Pre-processing removes noisy words and stop words, performs stemming, and transforms each document to a term vector. The output of this step is a document-term matrix that maps all documents to terms.
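As an illustration of this step, the sketch below builds a document-term matrix with scikit-learn and NLTK. The paper does not name specific tools, so the tokenizer, stop-word list, and Porter stemmer here are assumptions.

# Requires: nltk.download("stopwords") once before first use.
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()
stops = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, drop stop words and non-alphabetic tokens, then stem.
    toks = [t for t in text.lower().split() if t.isalpha() and t not in stops]
    return " ".join(stemmer.stem(t) for t in toks)

docs = ["An introduction to data mining",
        "A survey of classification methods"]

# Term-vector construction: X is the n x m document-term matrix.
vectorizer = CountVectorizer(preprocessor=preprocess)
X = vectorizer.fit_transform(docs)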

Topic Model. The topic model transforms the original document-term matrix into a document-topic matrix, where each cell indicates the document-topic membership score. We extract the topics that occur in the documents using the Latent Dirichlet Allocation model with Gibbs sampling [22].

Topic Model Calibration. Since topical relationships determine the document relationships, we perform topic model calibration to correct errors generated by the topic model, such as topic misassignments. This module leverages the content similarity of documents by comparing the term representations of the documents and propagating document-topic scores between strongly similar documents on the grounds that, due to their similarity, they are more likely to share similar topics.

Reading Tree Generation. The Reading Tree Generation algorithm uses the document-topic scores learnt from the score propagation module to determine the specificity relations between the documents and build the reading tree(s).

VI. ALGORITHMS

The input to the system is a set D of n documents. At the end of the preprocessing phase, a document-term matrix Xn×m is generated where n is the number of documents and m is the number of terms. For a document di, its term representation is a vector denoted xi that maps to the ith row of the matrix Xn×m.

Fig. 2. System architecture. A set of documents goes through Pre-processing (producing a document-term matrix), the Topic Model (producing a document-topic matrix), Topic Model Calibration (Similarity Graph Construction followed by Score Propagation, producing an improved document-topic matrix), and Reading Tree Generation (producing the reading tree(s)).

A. Topic Model

Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes that run through them, how those themes are connected to each other, and how they change over time [20], [22], [23], [24]. The idea behind a topic model is that when a document is about a particular topic, some words should appear more frequently. Hence, documents are mixtures of topics, where each topic is a probability distribution over words.

The topic model algorithm generates a document-topic matrix Fn×s, where each matrix element captures the document-topic membership as a probability. In this paper, we use Latent Dirichlet Allocation (LDA) with Gibbs sampling [22]. In this model, the number s of topics to be generated is given as input to the algorithm and it depends on the document set. A small number of topics could provide a broad overview of the document structure whereas a large number could provide fine-grained topics at the cost of computational time. We treat the number of topics as one of the parameters of our algorithms and analyze its impact in the experiments.
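As one possible realization, the sketch below derives the document-topic matrix F with gensim's LDA. Note the substitution: gensim trains LDA with variational inference rather than the Gibbs sampling of [22], and the tiny corpus and hyperparameters are placeholders.

import numpy as np
from gensim import corpora, models

# Tokenized documents from the pre-processing step (illustrative).
texts = [["data", "mining", "introduction"],
         ["classification", "methods", "data"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

s = 10  # number s of topics, a tunable parameter of the approach
lda = models.LdaModel(corpus, num_topics=s, id2word=dictionary, passes=20)

# Build the dense n x s document-topic matrix F from the model output.
n = len(corpus)
F = np.zeros((n, s))
for i, bow in enumerate(corpus):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        F[i, topic_id] = prob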

B. Topic Model Calibration

Despite their power, topic models have a number of known issues. Among those, two are important for our problem setting: the same or very similar documents may end up with topics that are not exactly the same [25], and topics may be missed or misassigned [26]. The topic model calibration aims at ameliorating these issues by using the explicit textual similarities of documents to influence the topic assignment. This method allows the topics of a document to be influenced by the topics of its most similar neighbors. We first compute the document similarity graph that captures the explicit similarities of the documents based on their keywords. Then we present a topic score propagation method that smooths the initial topic distribution of a document based on those of its neighbors on the similarity graph.

1) Similarity Graph Construction: Given the document-term matrix Xn×m for the document set D, we can compute an n×n kernel matrix K for all pairs of documents in X, where each entry Kij is a numeric value reflecting the similarity between two documents di and dj. Any kernel method can be used for this purpose. Without loss of generality, we use the cosine similarity as K(xi, xj) = (xi · xj) / (‖xi‖ × ‖xj‖).

The document similarity graph G = (V′, E′, W) for the set D is an undirected weighted graph where V′ is the set of nodes, each node mapping to a document di ∈ D, E′ is the set of the edges, and W is an n×n adjacency matrix. For each pair of documents di and dj, there is an edge eij ∈ E′ connecting the respective nodes with weight Wij equal to K(xi, xj).

Considering all possible connections between documents may lead to a very dense graph that cannot be processed efficiently. Furthermore, the topic model calibration is performed among similar documents, hence edges with low weights should not be included in the similarity graph. Therefore, we use the concept of the ε-neighborhood graph construction, which is widely used in tasks such as semi-supervised learning [27] and spectral clustering [28].

Given a document set D, the ε-neighborhood document similarity graph Gε is a document similarity graph such that for each pair of documents di and dj in D, there is an edge eij in the graph connecting the respective nodes with weight Wij = K(xi, xj) if and only if K(xi, xj) ≥ ε. ε is a parameter that controls the sparseness of the graph: the higher the value of ε, the sparser the graph, as only the links that correspond to the more similar documents are present. In this paper, we assign ε = 0.3. For simplicity, hereafter we will use the term document similarity graph G = (V′, E′, W), implying that it is an ε-neighborhood document similarity graph.
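A direct NumPy sketch of the ε-neighborhood construction with the cosine kernel follows; the zeroed diagonal (no self-loops) and the dense-matrix representation are our implementation choices.

import numpy as np

def epsilon_graph(X, eps=0.3):
    """Build the adjacency matrix W of the epsilon-neighborhood
    document similarity graph from a dense n x m document-term matrix."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)   # unit-length rows
    K = Xn @ Xn.T                          # cosine similarity kernel
    W = np.where(K >= eps, K, 0.0)         # keep only edges with K >= eps
    np.fill_diagonal(W, 0.0)               # no self-loops
    return W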

Apart from content similarity, links between documents, such as web links and citations, also indicate that the documents are related to some extent. When such links exist, we could build a document similarity graph by combining both content and link information [29].

Algorithm 1 Score Propagation
Input: G = (V′, E′, W), F^(0), β and MaxIter
Output: F: topic probabilistic scores of nodes in Gε
  Build W using kernel matrix K(x, x)
  Calculate W̄ = D^(−1/2) W D^(−1/2)
  for i = 1 to MaxIter
    Update topic scores F using equation (7) as: F = β W̄ F + (1 − β) F^(0)
  end for

2) Topic Score Propagation: Both the document-topic matrix and the similarity graph are input to the score propagation module, whose target is to propagate the document-topic assignments over the graph. Score propagation is inspired by label propagation, which leverages the idea that strongly connected nodes should share similar class information to label nodes in a graph [27], [30]. In our setting, strongly similar documents have a high chance to share similar topics, and our target is to learn the topics in the similarity graph rather than perform a network node classification task.

The basic idea of score propagation is the following: after getting the initial topic distribution across the documents using the topic model algorithm, we propagate the topic probabilistic scores over the document similarity graph G = (V′, E′, W), so the potential topics of a document take into consideration the topic probabilistic scores of its neighbors (which, in turn, depend on the scores of their respective neighbors, and so on). The algorithm iteratively updates the topic probabilistic scores of a node in the document-topic matrix Fn×s based on the weighted average of the scores of its neighbors.

Given the document-topic matrix Fn×s and the document similarity graph G = (V′, E′, W), our score propagation algorithm is formally designed to minimize an objective function similar to the one used for semi-supervised learning with local and global consistency [27]. Our objective function is:

Q(F) = (1/2) Σij Wij [Fi/√Dii − Fj/√Djj]² + (μ/2) Σi (Fi − Fi^(0))²

where Fi is a vector that maps to the ith row of the matrix Fn×s and captures the topic distribution for document di, and Fi^(0) corresponds to the initial topic probabilistic scores determined by the topic model. Wij is the similarity weight for two documents di and dj. Finally, D is a diagonal matrix whose diagonal elements are given by Dii = Σj Wij.

Intuitively, the first term in the objective function ensures that the topic probabilistic scores for any pair of nodes connected by a highly weighted link should not differ substantially. The second term ensures that the scores of the nodes should not deviate significantly from their initial values. The parameter μ controls the tradeoff between the two terms.

Let us now see how we can solve the objective function and arrive at the update formula for the matrix F. First, our objective function can be rewritten in the following form:

Q(F) = (1/2) Σij Wij [Fi²/Dii − 2 FiFj/(√Dii √Djj) + Fj²/Djj] + (μ/2) Σi (Fi − Fi^(0))²
     = (1/2) [Σi Fi² − 2 Σij Fi (Wij/(√Dii √Djj)) Fj + Σj Fj²] + (μ/2) Σi (Fi − Fi^(0))²    (5)

Now, we can express the objective function in matrix notation as follows:

Q(F) = Fᵀ(I − D^(−1/2) W D^(−1/2))F + (μ/2) ‖F − F^(0)‖²
     = Fᵀ L̄ F + (μ/2) ‖F − F^(0)‖²    (6)

where L̄ = I − D^(−1/2) W D^(−1/2) = D^(−1/2) (D − W) D^(−1/2) is the normalized Laplacian of the graph.

To optimize the objective function, we take its partial derivative with respect to F and set it to zero:

∂Q(F)/∂F = (I − D^(−1/2) W D^(−1/2))F + μ(F − F^(0))
         = (1 + μ)F − D^(−1/2) W D^(−1/2) F − μF^(0) = 0

The preceding equation can be rewritten in the form of an iterative update formula:

F = (1/(1 + μ)) W̄ F + (μ/(1 + μ)) F^(0) = β W̄ F + (1 − β) F^(0)    (7)

where W̄ = D^(−1/2) W D^(−1/2) is the normalized adjacency matrix and β = 1/(1 + μ) is a damping factor that controls the tradeoff between biasing the scores according to the graph structure as opposed to the initial score matrix F^(0). If β = 0, the topic scores are equal to the initial values obtained from the topic model. On the other hand, if β = 1, the topic score of a node depends only on the scores of its neighbors.

Since β determines the contribution of the neighbors of a node in a graph (while 1 − β determines the contribution of the node itself), β operates in a fashion similar to the damping factor used in the PageRank formula. For Web graphs, the parameter β is often set to 0.85 [31]. We use the same value throughout our experiments.

The score propagation method is shown in Algorithm 1.
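Algorithm 1 translates almost line for line into NumPy. The sketch below is our rendering of update (7) with the symmetric normalization computed explicitly; the degree clipping that avoids division by zero for isolated nodes is our addition.

import numpy as np

def propagate_scores(W, F0, beta=0.85, max_iter=20):
    """Score propagation (Algorithm 1): iterate update (7),
    F = beta * W_norm @ F + (1 - beta) * F0."""
    d = W.sum(axis=1)                        # node degrees D_ii
    d_inv_sqrt = 1.0 / np.sqrt(np.clip(d, 1e-12, None))
    # Symmetric normalization: W_norm = D^(-1/2) W D^(-1/2)
    W_norm = (W * d_inv_sqrt[:, None]) * d_inv_sqrt[None, :]
    F = F0.copy()
    for _ in range(max_iter):                # iterate update (7)
        F = beta * (W_norm @ F) + (1.0 - beta) * F0
    return F

With the ε-neighborhood graph W from Section VI-B1 and the matrix F from the topic model, the calibrated scores are obtained as propagate_scores(W, F).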

C. Reading Tree Generation

The tree generation process progressively weaves the reading order for a set of documents by determining the specificity relations among the documents. It takes as input the set of documents, the document-topic assignments F, and two parameters: τ, which defines the minimum overlap between two equivalent documents, and κ, which defines the maximum difference of their generality scores. Both parameters are used to define the specificity relations (Section III). We use the formulas (3) and (4) for computing document generality and overlap, respectively, but the algorithm is really independent of how document overlap and generality are estimated.

The algorithm builds a complete reading tree in an iterative way. At each round, it handles a subset of similar documents that need to be connected to the tree already created. The algorithm's intention is to grow a sub-tree out of this set of documents and connect it to the existing tree. For this purpose, it first creates the root of this new sub-tree by putting together the most general documents from the set in consideration that are also equivalent. The remaining documents of this set are clustered such that they have some overlap with the root and among them. Each cluster becomes a new set of documents out of which the algorithm will further create new nodes and edges. This process repeats until no more tree growing is possible and there are no documents unprocessed. Initially, the set of documents under consideration contains all documents and the current root node is a dummy node. The algorithm is outlined in Algorithm 2.

The algorithm starts by computing the generality score for each input document and the generality difference matrix E where Eij = |g(di) − g(dj)| indicates the generality difference of documents di and dj. All documents are then ordered in descending order of their generality score. A dummy node is created and this node becomes the root of the tree. It is also the first node from where the tree will start to expand (called the expansion node). The core operations of the algorithm are described as the function DynOrder, which is executed repeatedly but for a different subset of the input documents and growing the tree from a different expansion node each time. Its objective is to build a tree out of its input documents and connect it to the expansion node vr. Steps 1-4 are responsible for the creation of the root node of this tree. This node, vs, groups the more general documents from the current set of documents that have the required topic overlap and generality closeness based on the equivalence condition (1).

To build this node, the process starts with the most general document d. Subsequently, it considers documents in descending order of generality. As long as a document dj has a close generality score to d, and its overlap with all the documents already selected for the node vs satisfies the equivalence criterion, this document dj is also added to the set of documents for the node vs. The node vs becomes the root node for the remaining documents in the current set and it is connected to the node vr at step 5.

The remaining steps are responsible for re-organizing the rest of the documents in groups such that each group will be used to further grow the tree in a different direction. Step 7 creates a set C of all documents that have non-zero overlap with at least the most general document d of the node vs just created. The reason is that any document in any node at any level under vs should have even a distant relation to the documents of vs. Step 8 divides C into sets Dc, such that the documents contained in each set have non-zero overlap among them. The reason is that the nodes of a tree should have some relatedness. The node vs becomes the new expansion node. The function DynOrder is called for each set Dc to grow a new reading tree. Each of these trees is connected to vs.

For simplicity in presentation, Algorithm 2 shows the case of constructing one reading tree. The algorithm builds more than one tree if required. The critical point is the output of step 7. If this is an empty set but there are still documents whose reading order has not been determined, then the algorithm starts a new set of rounds building a new tree that is connected to the dummy root node for the remaining documents.

Algorithm 2 Reading Tree Generation
Input: doc set D, doc-topic matrix F, generality difference threshold κ, overlap threshold τ
Output: Tree Structure
  foreach di in D
    Calculate generality score g(di)
  end for
  Calculate generality score difference matrix E
  Create a dummy root node vr
  DynOrder(D, F, E, κ, τ, vr)

Function DynOrder
Input: doc set D, doc-topic matrix F, generality score difference matrix E, generality difference threshold κ, overlap threshold τ, expansion node vr
Output: Tree Structure
  while D ≠ ∅ do
    1. Create a set S containing the most general d in D
    2. Select the next most general document dj in D
    3. while o(d, dj) > τ and g(d) − g(dj) < κ
         if o(di, dj) > τ ∀di ∈ S then add dj to S
         Select the next most general document dj in D
       end while
    4. S contains the most general equivalent documents and it is mapped to a new node vs
    5. Connect node vs to the expansion node vr
    6. Remove from D all documents belonging to S
    7. C ← { dj in D | o(d, dj) > 0 }
    8. Divide C into clusters Dc s.t. for each Dc it holds o(di, dj) > 0, ∀di, dj ∈ Dc
    9. foreach cluster Dc
         DynOrder(Dc, F, E, κ, τ, vs)
       end for
  end while

THEOREM. The algorithm builds complete reading trees.

Proof. We first show that steps 1-4 create a node of equivalent documents. Each pair of documents in the node satisfies the topic overlap condition. Furthermore, each document dj satisfies the generality condition w.r.t. the most general document d, i.e., g(d) − g(dj) < κ. This condition is sufficient for making sure that all pairs of documents in the node satisfy the generality condition since they are selected in descending order of their generality score. It is obvious that the first time these steps are executed, a node containing all documents equivalent to d is created. We now show that each subsequent iteration of the steps also groups all equivalent nodes together, satisfying property (a) of a complete reading tree.

Let S be the set of documents of a node and d be its most general document. Let us assume that there is a document d′ ∉ S such that g(d) − g(d′) < κ and o(di, d′) > τ ∀di ∈ S. However, step 8 groups together all documents that have non-zero overlap (o(di, dj) > 0). For each such group Dc, steps 1-4 place all documents that satisfy the equivalence condition in the same node. Hence, d′ would have been included in S.

Consequently, the algorithm builds all nodes of equivalent documents. We now need to show that it also builds all sequencing relations satisfying properties (b) and (c) of the complete reading tree definition. Step 5 connects a newly created node vs to the current expansion node vr.

The first time the algorithm runs, vr is the dummy root. In subsequent rounds, vr has been created earlier by steps 1-4. Let R be the documents for vr, and S for vs. Since the algorithm examines documents in decreasing order of generality, it holds that g(d) > g(d′) for any pair of documents d ∈ R, d′ ∈ S. If g(d) − g(d′) > κ, then it is easily shown that the sequencing relation between d and d′ holds. Let us assume that g(d) − g(d′) < κ. Then, d′ could not have had enough overlap with the documents in R, otherwise it would have been grouped there. Hence, for the nodes vr, vs, their corresponding documents are in a sequencing relation. Hence property (b) of the complete reading graph holds.

Property (c) is easily shown. Let us take the simple case of di → dj. If there was a document dk such that di → dk → dj, then it would be g(di) > g(dk) > g(dj). But then dk would have been picked before dj. □

Example 2. Let us illustrate the tree generation algorithm with the help of Figure 3. Figure 3(a) shows the document-topic matrix:

ID   t1     t2     t3     t4
d1   0.5    0.5    0      0
d2   0      0      0.5    0.5
d3   0.25   0.25   0.25   0.25
d4   0.25   0.25   0.25   0.25
d5   1      0      0      0

(a) Doc-topic matrix

(b) Algorithm progress: root creation and document clustering over rounds 1, 2, 3, and the final round

Fig. 3. Illustration of the tree generation algorithm.

The document generality scores are as follows: g(d3) = g(d4) = −4 × 0.25 × log(0.25) = 0.602059, while g(d1) = g(d2) = −2 × 0.5 × log(0.5) = 0.30102, and g(d5) = −1 × log(1) = 0. Let us use κ = 0.2, τ = 0.8. Figure 3(b) shows the algorithm progress and its final output.

d3 is selected as it has the largest entropy score. d4 is also selected because g(d3) − g(d4) = 0 < κ and o(d3, d4) = 1 > τ. d3, d4 together become the root node of the tree. The remaining documents form two groups based on their overlap: d1 and d5 comprise one group, and d2 comprises another, because o(d1, d5) = 0.5, o(d1, d2) = 0 and o(d5, d2) = 0. The tree will grow two branches, one from each group, by repeating the same process. For the cluster of d1 and d5, we select d1 as the most general document among the two. However, g(d1) − g(d5) = 0.30102 > κ, hence the new node that will be added under the node {d3, d4} contains only d1. The algorithm continues iterating through these steps. □
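To make the control flow of Algorithm 2 concrete, here is a compact Python rendering of DynOrder. The node representation, the helper names, and the greedy grouping used at step 8 (connected components of the overlap relation rather than full mutual overlap) are our simplifications, not the paper's code.

def dyn_order(docs, g, o, kappa, tau, parent, children):
    """Sketch of DynOrder. docs: document ids in descending order of
    generality; g: dict of generality scores; o(i, j): overlap score;
    children: dict mapping each tree node to its list of child nodes."""
    while docs:
        d = docs[0]
        S, i = [d], 1
        # Steps 1-4: gather the most general, mutually equivalent documents.
        while i < len(docs) and o(d, docs[i]) > tau and g[d] - g[docs[i]] < kappa:
            if all(o(di, docs[i]) > tau for di in S):
                S.append(docs[i])
            i += 1
        node = tuple(S)
        children.setdefault(parent, []).append(node)   # step 5: connect to parent
        rest = [x for x in docs if x not in S]         # step 6
        C = [x for x in rest if o(d, x) > 0]           # step 7
        # Step 8 (simplified): cluster C into groups of overlapping documents.
        clusters = []
        for x in C:
            for cl in clusters:
                if any(o(x, y) > 0 for y in cl):
                    cl.append(x)
                    break
            else:
                clusters.append([x])
        for cl in clusters:                            # step 9: grow each branch
            dyn_order(cl, g, o, kappa, tau, node, children)
        docs = [x for x in rest if x not in C]         # leftover docs: new subtree
    return children

Called as dyn_order(['d3', 'd4', 'd1', 'd2', 'd5'], g, o, 0.2, 0.8, 'root', {}) with g and o computed from the matrix of Example 2, this sketch reproduces the final tree of Figure 3: the root node (d3, d4) with children (d1) and (d2), and (d5) under (d1).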

D. Complexity Analysis

We now analyze the time complexity of the methods. The topic model complexity is O(n · s · Iter) [32], where Iter is the number of iterations, n is the number of documents and s is the number of topics to be generated. For large document sets, s and Iter are relatively small constant values, thus the complexity of the topic model can be seen as O(n).

The topic model calibration has two parts. The complexity of graph construction through pairwise similarity calculation is O(n²), where n is the number of documents. We could reduce the complexity of the ε-neighborhood graph construction algorithm from quadratic to linear, O(n), with the method proposed in [33]. Regarding the score propagation component, the time complexity is O(MaxIter · n² · s), where MaxIter is the maximum number of iterations. In this work, we found that our score propagation could converge within a small number of iterations (less than 20), so we set MaxIter = 20. Then, the overall complexity of score propagation for the dense graph is O(n²). However, in this paper, two documents are connected only if their similarity is larger than ε = 0.3, which leads to a sparse graph (the matrix W of Algorithm 1 is sparse). Consequently, the time complexity can be rewritten as O(MaxIter · n · c · s), where c is the average degree of the nodes in the graph, which is a much smaller value than n.

The complexity of reading tree generation is O(n² · d), where d is the depth of the tree. In general, it is d ≈ log n, so the complexity for tree construction can be O(n² log n). We could reduce the time complexity to O(n log n) by using a more efficient algorithm for the clustering component.

To summarize, the overall complexity of our approach is O(n² log n) and could be further reduced to O(n log n).

VII. EXPERIMENTS

A. Datasets and Setup

We use a variety of datasets. To understand the effect of the tree generation parameters and how our methods scale, we use DB, a set of 25000 research papers from recent consecutive years of TKDE, VLDB, CIKM, and SIGMOD, and NEWS, a set of 20000 news articles from 20 newsgroups. The latter is the 20 Newsgroups benchmark³, a popular dataset for experiments in text applications of machine learning. For evaluating the output of our approach, we use DB, NEWS, and WIKIPEDIA. By using different datasets, we also see how well our method performs for different data and for different applications. All the experiments were performed on a single PC with an Intel Core i7 CPU at 2.6 GHz and 8 GB of memory, running the Windows 7 operating system.

B. Tree Generation Parameters

The tree shape. Our reading trees are shaped by three parameters: (a) the number s of topics used to describe the document set, (b) the minimum topic overlap τ between two equivalent documents, and (c) the maximum difference κ of their generality scores. By controlling these parameters, we can grow different trees that capture document relations in a more or less fine-grained way. At the same time, these parameters affect the execution times of the various algorithms. In this section, we present experiments on the impact of the parameters.

We have performed several experiments with different combinations of the parameters and for different subsets of the DB and NEWS datasets. Here, we discuss representative experiments with parameters set as follows: s ∈ {10, 20, 30, 40, 50}, τ ∈ {0.5, 0.7, 0.9} and κ ∈ {0.001, 0.005, 0.01}.

To see how the parameters affect the features of the tree, we measured the number of nodes created in a tree. Figure 4 reports the results on a VLDB sample of 400 papers spanning approximately 5 consecutive years (the trends are similar across different datasets). In Figure 4(a), we observe a smooth uplift in the number of nodes created with an increasing number of topics and the same κ. On the other hand, a smaller κ creates a significantly larger number of nodes. This is intuitive because the generality score captures noise across topics for a document. More topics help capture noise more precisely. So κ is a critical knob for the tree shape.

³ http://qwone.com/~jason/20Newsgroups/

Fig. 4. Impact of Parameters on the Tree: (a) Topics-generality (τ = 0.9); (b) Overlap-topics (κ = 0.005).

Figure 4(b) is quite different. Interestingly, the number of nodes is not proportional to the number of topics. We create the largest number of nodes with an average number of topics and an average overlap threshold τ. Too few topics provide a very coarse description of the documents, making several of them look similar. Too many topics make it hard to understand which documents are really similar, and hence many documents may be placed in a single node. Hence, in both cases, we do a poor job recognizing document relationships.

From the two graphs of Figure 4, we observe that the stricter the generality constraint is, the more nodes are created in the tree. On the other hand, the number of topics in combination with the overlap constraint can affect the number of tree nodes.

The number of nodes in the tree shows how a tree is shaped between two extremes: all documents being grouped in a single node versus each document being a different node. Another parameter that describes the shape of the tree is branching. Branching essentially shows how a tree is shaped between being very deep (e.g., a chain) or very shallow. In our experiments, we found that a better way to capture the tree shape, instead of using the branching factor of a tree, is by measuring the number of leaf nodes. The reason is that the branching factor may vary across a tree and an average value is not a good indicator. Our experiments on how the tree generation parameters affect the number of tree leaf nodes have shown that the number of topics and the overlap constraint are the defining factors: stricter overlap or fewer topics lead to deeper trees. We do not show the corresponding graphs due to space considerations.

Execution time. We now examine the effect of the parameters on the execution times. The score propagation time is negligible in all experiments, and the topic model calibration time is basically equal to the similarity graph computation time. Hence, we ignore score propagation in the results below. Figure 5 reports times for two datasets: the left column shows VLDB (400 papers) and the right column shows CIKM (150 papers). We observe that in all graphs the topic model time does not vary a lot compared to the other components' times, and it increases relatively slowly with the number of topics. The similarity graph computation always comprises an important part of the total execution time and depends on the number of documents: for VLDB, the average time is 100 seconds, while for CIKM it is 30 seconds.

Of all the execution times, the tree generation time is the most important. The reason is that the topic model computation and the similarity graph computation can be performed offline for a document collection while the tree generation can be performed online for the desired subset of documents depending on the application. The tree generation is quite efficient. The generation time grows with the number of topics and the overlap τ but it shrinks with the generality parameter κ. Increasing the topics or τ means that more documents may be placed at the same node, hence more pair-wise comparisons are needed to check the document overlap. On the other hand, shrinking κ shrinks the number of documents per node and hence the number of pair-wise comparisons.

Our discussion above has focused on research papers. These are long documents. Figure 6(a) shows how the size of the document affects execution times based on two datasets of 500 documents each: TKDE contains papers from the journal, and NEWS contains short news articles. We observe that similarity graph times are affected by the length of the document: long papers contain many terms, generating a big document-term matrix. Tree generation is also longer for the longer TKDE documents because the tree created has 190 nodes, almost twice the size of the NEWS tree.

In the experiments above, we have considered the full document for each paper. In practice, the introduction and related work sections are sufficient to build the reading tree and the execution times are significantly improved. We performed experiments to compare the outputs in both scenarios: the generated topics and trees were equally good (in Section VII-D we explain how we compare different trees).

C. Scalability

We tested how our approach scales with the number of documents for the DB and NEWS datasets. Generally, we observed that the overall execution time is less than 1 minute for 1000 documents, a relatively large number: 50 seconds for DB when processing the whole article, and 6 seconds for NEWS.

Figure 6(b) shows execution times for three sets of documents from the NEWS dataset with sizes 1000, 10000 and 20000, respectively. We observe that as the number of documents increases, the topic model and the similarity graph computation are the most time-consuming components.

In practical scenarios, it does not make sense to compute a reading tree over thousands or millions of documents and present it to a user. In interactive scenarios, such as when organizing the results of a search (e.g., on research papers), the reading tree will be generated on the fly for a reasonable number of documents (e.g., the top 200). In these interactive scenarios, our approach is very efficient.

Fig. 5. Impact of Parameters on the Time: (a) VLDB (κ = 0.001, τ = 0.9); (b) CIKM (κ = 0.001, τ = 0.9); (c) VLDB (s = 30, τ = 0.9); (d) CIKM (s = 30, τ = 0.9); (e) VLDB (s = 30, κ = 0.005); (f) CIKM (s = 30, κ = 0.005). Each panel reports the similarity graph, topic model, and tree generation times.

There may be scenarios where it is desired to compute a reading tree over thousands of documents and store it, in order to show parts of it depending on the user or the context. Then such computation can be performed offline. As a trade-off between disk space and time requirements, one may decide to store only the output of the topic model and the similarity graph and compute a reading tree on the fly based on the set of documents that a user has selected in an application. For example, for short documents (or snippets) in Figure 6(b), tree generation is still less than 1 minute for 10000 documents.

D. Reading Tree Evaluation

1) Evaluation Metric: In order to compare and evaluate the structures generated by our algorithms, we need a way to compare trees. Edit distance metrics, initially introduced for string comparison, have been used to compare ordered trees [34]. Ordered labeled trees are trees in which the left-to-right order among siblings is significant. A distance between two trees is computed by considering an optimal mapping between two trees as the minimum cost of a sequence of elementary operations that converts one tree into the other. An alternative to mapping and tree editing is tree alignment [35].

Our reading order problem is different, and thus we are not interested in how identical two trees are. We care about the relative ordering of each pair of documents. To quantify the tree difference based on the pairwise document orderings, we first build the adjacency matrix A for a tree structure using the following formula: Aij = 1/numhops(di → dj) if there is a directed path from di to dj; otherwise Aij = 0. Aij is the element of the adjacency matrix corresponding to documents di and dj, and numhops(di → dj) is the number of hops from document di to dj.

Fig. 6. Impact of documents on Time (s = 30, τ = 0.9, κ = 0.005): (a) document type; (b) number of documents. Each panel reports the similarity graph, topic model, and tree generation times.

To measure the difference between two tree structures over a set of documents, represented by matrices A and Â, we use the mean squared error (MSE), which is defined as:

MSE(A, Â) = (1/n) Σi,j (Aij − Âij)²
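Given each tree as a parent map, the metric is easy to realize; the sketch below derives hop counts by walking from each document up to its ancestors and is our rendering of the definition, with one document per node for simplicity.

import numpy as np

def tree_adjacency(parent, docs):
    """Build A with A[i, j] = 1 / numhops(d_i -> d_j) for every
    ancestor d_i of d_j; parent maps each document to its parent."""
    idx = {d: k for k, d in enumerate(docs)}
    A = np.zeros((len(docs), len(docs)))
    for d in docs:
        hops, p = 1, parent.get(d)
        while p is not None:                 # walk up to every ancestor
            A[idx[p], idx[d]] = 1.0 / hops
            hops, p = hops + 1, parent.get(p)
    return A

def tree_mse(A, A_hat):
    """Mean squared error between two tree adjacency matrices."""
    n = A.shape[0]
    return float(np.sum((A - A_hat) ** 2) / n)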

Figure 7 illustrates an example of how to compare tree structures using MSE. In this example, A is the adjacency matrix of the ground truth, and Âresult1 (Âresult2) is the adjacency matrix of the tree result1 (result2). Since we have MSE(A, Âresult1) = 0.2469 > MSE(A, Âresult2) = 0.1229, the tree structure result2 is better than result1 because, compared with the ground truth, documents are mostly placed in the right order. There are two sources for the mean squared error of result2. We observe that document d4's relative order is different in result2 compared to the ground truth. In the ground truth, it is A(d1, d4) = 0.5 as the number of hops from document d1 to d4 is two, while in result2, Âresult2(d1, d4) = 1 as document d1 becomes the direct parent node of document d4. Moreover, the fact that document d2 is a parent node of document d4 is completely missed in result2.

2) Using Wikipedia for Ground Truth: There is no ground truth on the actual reading order for any set of documents. As one strategy for evaluating our approach, we use the Wikipedia hierarchy of categories to approximate the reading order for Wikipedia pages. For example, "Cluster Analysis" is a subcategory of "Machine Learning". We assume that all the article pages belonging to "Machine Learning" are more general and should be read before articles in "Cluster Analysis".

However, Wikipedia's hierarchy of categories is far from perfect: it contains errors and cycles. To build our ground truth, we start from the "Machine Learning" category and expand three steps away in the category structure, building a hierarchy with no cycles. After removing empty categories and articles, we have 118 categories, from which we randomly select two articles out of all the pages in each category, resulting in 236 articles in total.
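As a sketch of this procedure, a bounded breadth-first expansion with a visited set suffices to cut the cycles. Here subcategories(c) and articles(c) are assumed lookups over a local copy of the Wikipedia category graph; they are not part of our system.

import random
from collections import deque

def build_category_hierarchy(root, subcategories, max_depth=3):
    # Expand the category graph up to max_depth steps from the root,
    # skipping any category already seen, which breaks cycles.
    parent = {root: None}
    queue = deque([(root, 0)])
    while queue:
        cat, depth = queue.popleft()
        if depth == max_depth:
            continue
        for sub in subcategories(cat):
            if sub not in parent:
                parent[sub] = cat
                queue.append((sub, depth + 1))
    return parent  # acyclic hierarchy rooted at `root`

# Sample two article pages per non-empty category for the ground truth:
# pages = {c: random.sample(articles(c), 2)
#          for c in build_category_hierarchy("Machine Learning", subcategories)
#          if len(articles(c)) >= 2}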

We feed these articles to our system and we compare its output to the ground truth using different parameter combinations, τ = {0.5, 0.7, 0.9}, κ = {0.001, 0.005, 0.01}, and 20 topics. Table IV summarizes the results. The low MSE suggests that our reading trees generate good reading orders. For τ = 0.7, κ = 0.005, we obtain the best performance with the minimum MSE = 0.1214.

Our actual algorithm performance may be better than these MSE scores indicate, because the category hierarchy of Wikipedia does not provide a perfect ordering.

[Fig. 7. Example of Mapping Tree Structure to Adjacency Matrix. Three trees over documents d1-d5 are shown with their adjacency matrices: the ground truth (d1 → d2, d3; d2 → d4, d5), result1 (d2 → d4, d5; d5 → d1, d3), and result2 (d1 → d2, d3, d4; d2 → d5).]

TABLE IV
COMPARISON TO THE WIKIPEDIA GROUND TRUTH

            τ = 0.5   τ = 0.7   τ = 0.9
κ = 0.001   0.1739    0.1279    0.1558
κ = 0.005   0.1366    0.1214    0.1625
κ = 0.01    0.1632    0.1738    0.1456

For example, "Machine Learning Researchers" is a subcategory of "Machine Learning". Pages under "Machine Learning" cover machine learning algorithms and applications, whereas the pages under "Machine Learning Researchers" cover each researcher's background, personal history, and so on.

3) Using Experts for Ground Truth Creation: As a different evaluation strategy, we also asked experts to generate the ground truth for different topics (Data Mining, Biology, Math, and Arts) and sets of Wikipedia pages. For each set of documents, we asked the opinion of two experts. Of course, it is not possible to ask experts to create very large graphs, so each tree created by an expert covers around 20 documents. We also observed that, on the same set of documents, two experts may create different trees. This is reasonable, since there is an amount of subjectivity in the problem. Our purpose is not to recreate the exact same tree but to capture as many of the same reading orders as possible, as measured by MSE.

We then compared our reading trees against the experts' ground truth. Since for each set we had two trees as ground truth, we compared the tree we generated against each one of them and took the average MSE as the final MSE for the generated tree. The average MSE across the four topics is 0.18 for s = 10, τ = 0.9, κ = 0.001. We observed that the Math tree had the poorest performance (MSE = 0.3). We believe that in Math other types of document relationships, e.g., prerequisites, may be more frequent than specificity relations.

4) User Study: A third way to evaluate the results is to conduct a user study, where feedback is collected from users who evaluate the generated reading trees manually. This evaluation method is effective only when the size of the tree is small, and is hard to apply when the output tree structure is large.

We created user studies for three scenarios: (a) a research scenario, where the purpose is to see research papers in some logical order, (b) a news editor scenario, where the tree is presented to an editor for selecting articles to place in a news publication, and (c) a search scenario, where the purpose is to search pages related to a topic. For each scenario, we use DB, NEWS, and the WIKIPEDIA pages for data mining, respectively, and we pre-generated the reading graphs for all the documents per case. We intentionally used the same configuration, s = 40, τ = 0.9, κ = 0.001, for all datasets because we wanted to see how good a single configuration would be for different data.

For scenario (a), we asked 3 CS researchers and 2 students, for (b) we asked 2 editors, and for (c) we asked 10 non-CS researchers. For each scenario, we selected random subtrees from the graph containing at most 15 documents, and we asked the users to evaluate them by counting how many documents they thought were in the wrong order. Each user evaluated 4 trees and each tree was evaluated by one user. The tree was represented graphically. Each node contained a short description of its document, and clicking the description link opened the whole document, allowing the user to inspect it.

The average number of misplaced documents per tree reported was 2.2 (DB), 2.9 (NEWS), and 2.5 (WIKIPEDIA). The respective percentages over the document set size were 14.6% (DB), 19.33% (NEWS), and 16.66% (WIKIPEDIA). We observed that the percentage was highest for NEWS. It is likely that the selected configuration was not optimal for this collection. Still, the results were useful to the editors.

E. Topic Model Calibration

As a final note, we would like to discuss the effectiveness of the score propagation method we use for topic model calibration. The topic model is a probabilistic model, and hence every time it is executed over exactly the same inputs, its output may be different. In this paper, we utilize topic model calibration to compensate for the errors and variations caused by the topic model. In order to demonstrate its effectiveness, we compare the trees generated with and without score propagation.

For this purpose, we ran the tree generation algorithm 11 times with all meaningful parameter combinations of s = {5, 10, 15, 20, 25, 30, 35, 40} and κ = {0.01, 0.03, 0.05, 0.07, 0.09, 0.11, 0.13, 0.15}. We compare the tree generated in each of the last 10 runs with the tree generated in the first run (which serves as a kind of ground truth). Figure 8(a) shows the average MSE score and Figure 8(b) shows the worst. Each point corresponds to one of the 10 repetitions (shown on the x-axis) and is the average MSE value over all parameter combinations (shown on the y-axis).

With score propagation, both the worst and the average performance of the tree generation method are better. The score propagation acts as a smoothing filter, compensating for the variations in the output of the topic model.
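The check itself is a small evaluation loop. In the sketch below, generate_tree stands for a hypothetical handle to the tree generation pipeline (configured with or without score propagation) that returns a parent-to-children map; adjacency_matrix and tree_mse are the helpers sketched earlier.

def stability_curves(generate_tree, docs, params, rounds=10):
    # Fix the first run per parameter setting as a reference, then measure
    # how much each repeated run deviates from it, as in Figure 8.
    references = {p: adjacency_matrix(generate_tree(docs, *p), docs) for p in params}
    avg_curve, worst_curve = [], []
    for _ in range(rounds):
        scores = [tree_mse(references[p],
                           adjacency_matrix(generate_tree(docs, *p), docs))
                  for p in params]
        avg_curve.append(sum(scores) / len(scores))   # Figure 8(a)
        worst_curve.append(max(scores))               # Figure 8(b)
    return avg_curve, worst_curve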

[Fig. 8. The effect of Score Propagation. Panel (a) plots the average MSE and panel (b) the worst MSE over rounds 1-10, with and without score propagation.]

VIII. CONCLUSION AND FUTURE WORK

In this paper, we proposed generating reading trees to organize a collection of documents from general to more specific content. We proposed a set of algorithms for topic model calibration and tree generation. We evaluated the impact of the various parameters of the problem on the tree form and on the performance of the approach. As there is no ground truth for our problem, we applied different methods for judging the results of the reading tree generation.

As future work, we would like to examine methods for incrementally growing and refining a reading tree based on a subset of known documents. Other future research directions include considering other types of document relationships for document sequencing, and personalizing reading trees for different users.

REFERENCES

[1] Y. Zhao, G. Karypis, and U. Fayyad, "Hierarchical clustering algorithms for document datasets," Data Min. Knowl. Discov., vol. 10, no. 2, Mar. 2005.

[2] N. Sahoo, J. Callan, R. Krishnan, G. Duncan, and R. Padman, "Incremental hierarchical clustering of text documents," in 15th ACM Int'l Conference on Information and Knowledge Management, ser. CIKM '06, 2006, pp. 357-366.

[3] Q. Ho, J. Eisenstein, and E. P. Xing, "Document hierarchies from text and links," in Proceedings of the 21st International Conference on World Wide Web, 2012, pp. 739-748.

[4] S. Bengio, J. Weston, and D. Grangier, "Label embedding trees for large multi-class tasks," in Neural Information Processing Systems, 2010, pp. 163-171.

[5] L. Liu, P. M. Comar, S. Saha, P.-N. Tan, and A. Nucci, "Recursive NMF: Efficient label tree learning for large multi-class problems," in ICPR, 2012, pp. 2148-2151.

[6] L. Liu and P.-N. Tan, "A framework for co-classification of articles and users in Wikipedia," in Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, vol. 1, 2010, pp. 212-215.

[7] B. Shaparenko and T. Joachims, "Information genealogy: Uncovering the flow of ideas in non-hyperlinked document databases," in KDD, 2007, pp. 619-628.

[8] R. Sipos, A. Swaminathan, P. Shivaswamy, and T. Joachims, "Temporal corpus summarization using submodular word coverage," in Proceedings of the 21st ACM International Conference on Information and Knowledge Management, ser. CIKM '12. New York, NY, USA: ACM, 2012, pp. 754-763. [Online]. Available: http://doi.acm.org/10.1145/2396761.2396857

[9] A. Feng and J. Allan, "Incident threading for news passages," in Proceedings of the 18th ACM Conference on Information and Knowledge Management, ser. CIKM '09. New York, NY, USA: ACM, 2009, pp. 1307-1316. [Online]. Available: http://doi.acm.org/10.1145/1645953.1646118

[10] D. Shahaf and C. Guestrin, "Connecting two (or less) dots: Discovering structure in news articles," ACM Trans. Knowl. Discov. Data, vol. 5, no. 4, pp. 24:1-24:31, Feb. 2012. [Online]. Available: http://doi.acm.org/10.1145/2086737.2086744

[11] H.-S. Chung and J.-M. Kim, "Ontology design for creating adaptive learning path in e-learning environment," in Proceedings of the International MultiConference of Engineers, 2012.

[12] R. Pirrone, G. Pilato, and R. Rizzo, "Learning path generation by domain ontology transformation," in AI*IA 2005: Advances in Artificial Intelligence. Springer-Verlag, 2005, pp. 359-369.

[13] M. K. Stern and B. P. Woolf, "Curriculum sequencing in a web-based tutor," in Proceedings of Intelligent Tutoring Systems, 1998.

[14] T. Y. Tang and G. McCalla, "Smart recommendation for evolving e-learning system," International Journal on E-Learning, vol. 4, no. 1, 2005.

[15] P. Brusilovsky, "A framework for intelligent knowledge sequencing and task sequencing," in Intelligent Tutoring Systems, 1992, pp. 499-506.

[16] P. Resnik, "Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language," Journal of Artificial Intelligence Research, vol. 11, pp. 95-130, 1999.

[17] R. Rada, H. Mili, E. Bicknell, and M. Blettner, "Development and application of a metric on semantic nets," IEEE Transactions on Systems, Man and Cybernetics, vol. 19, no. 1, pp. 17-30, 1989.

[18] K. S. Candan, M. E. Donderler, T. Hedgpeth, J. W. Kim, Q. Li, and M. L. Sapino, "SEA: Segment-enrich-annotate paradigm for adapting dialog-based content for improved accessibility," ACM Trans. Inf. Syst., vol. 27, no. 3, pp. 15:1-15:45, May 2009. [Online]. Available: http://doi.acm.org/10.1145/1508850.1508853

[19] J. Kim, K. Candan, and M. E. Donderler, "Topic segmentation of message hierarchies for indexing and navigation support," in WWW, 2005.

[20] D. M. Blei, "Probabilistic topic models," Commun. ACM, vol. 55, no. 4, pp. 77-84, Apr. 2012.

[21] G. Grefenstette, Explorations in Automatic Thesaurus Discovery. Kluwer Academic Publishers, 1994.

[22] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.

[23] T. Hofmann, "Probabilistic latent semantic analysis," in Proc. of Uncertainty in Artificial Intelligence, UAI, 1999, pp. 289-296.

[24] M. Steyvers and T. Griffiths, "Probabilistic topic models," in Latent Semantic Analysis: A Road to Meaning, T. Landauer, D. McNamara, S. Dennis, and W. Kintsch, Eds., 2006.

[25] H. Daumé III, "Markov random topic fields," in Artificial Intelligence and Statistics, 2009.

[26] D. Ramage, D. Hall, R. Nallapati, and C. D. Manning, "Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora," in Conference on Empirical Methods in Natural Language Processing, 2009, pp. 248-256.

[27] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf, "Learning with local and global consistency," in Advances in Neural Information Processing Systems, 2004, pp. 321-328.

[28] M. Hein, J.-Y. Audibert, and U. von Luxburg, "Graph Laplacians and their convergence on random neighborhood graphs," Journal of Machine Learning Research, vol. 8, pp. 1325-1368, 2007.

[29] S. Zhu, K. Yu, Y. Chi, and Y. Gong, "Combining content and link for classification using matrix factorization," in ACM SIGIR Conference, 2007.

[30] L. Liu, S. Saha, R. Torres, J. Xu, P.-N. Tan, A. Nucci, and M. Mellia, "Detecting malicious clients in ISP networks using HTTP connectivity graph and flow information," in Advances in Social Networks Analysis and Mining (ASONAM), 2014 IEEE/ACM International Conference on, 2014, pp. 150-157.

[31] S. Brin and L. Page, "The anatomy of a large-scale hypertextual web search engine," Computer Networks, vol. 30, no. 1-7, pp. 107-117, 1998.

[32] D. Newman, P. Smyth, and M. Steyvers, "Scalable parallel topic models," Journal of Intelligence Community Research and Development, 2006.

[33] T. Uno, M. Sugiyama, and K. Tsuda, "Efficient construction of neighborhood graphs by the multiple sorting method," CoRR, vol. abs/0904.3151, 2009.

[34] E. Tanaka and K. Tanaka, "The tree-to-tree editing problem," International Journal of Pattern Recognition and Artificial Intelligence, vol. 2, no. 2, pp. 221-224, 1988.

[35] T. Jiang, L. Wang, and K. Zhang, "Alignment of trees: An alternative to tree edit," in Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching, 1994, pp. 75-86.

