Post on 20-Mar-2016
A Ranking Scheme for XML Information Retrieval
Based on Benefit and Reading Effort
Toshiyuki Shimizu (Kyoto University), Masatoshi Yoshikawa (Kyoto University)
ICADL 2007, 12th December

XML-IR Systems
Growing demand for XML Information Retrieval (XML-IR) systems
  We can identify meaningful document fragments by encoding documents in XML
    ex) Sections, subsections, and paragraphs in scholarly articles
  Browsing only the document fragments relevant to a certain topic
The simplest form of query for XML-IR is just a set of keywords
  Simple, intuitively understandable, yet useful form of query, especially for unskilled end-users
Active research area, as in INEX*
* INitiative for the Evaluation of XML Retrieval (http://inex.is.informatik.uni-duisburg.de/)

Results of XML-IR Systems
Document fragment (element) with relevance degree (score)
ex) Query term was “XML”
<?xml version="1.0"?>
<article>
  <sec>
    <p>XML labeling</p>
    <p>The structure of XML is a tree, and each node in the XML is labeled.</p>
    <p>We can get tag name of each XML element.</p>
  </sec>
  <sec>
    <p>Tree index</p>
    <p>XML index is constructed using the labels</p>
  </sec>
</article>
[Figure: element tree of the example document with relevance scores]
  article e0: 0.56
    sec e1: 0.64
      p e2: 0.4
      p e3: 0.9
      p e4: 0.33
    sec e5: 0.35
      p e6: (no score shown)
      p e7: 0.80

Naïve XML-IR System
Thorough strategy of INEX 2005
  Simply retrieves relevant elements from all elements and ranks them in order of relevance score
  Ranked result: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)
[Figure: the example element tree with relevance scores]
Thorough is intended for system evaluation
User behavior when browsing search results must be considered
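
The Thorough ranking above amounts to a plain sort by relevance score that ignores nesting. A minimal Python sketch, using the scores shown for the example tree; the names `scores` and `thorough_ranking` are illustrative and do not come from the slides, and e6 is left out because no score is shown for it.

```python
# Relevance scores of the example elements, as shown on the slide.
# The names `scores` and `thorough_ranking` are illustrative only.
scores = {
    "e0": 0.56, "e1": 0.64, "e2": 0.4, "e3": 0.9,
    "e4": 0.33, "e5": 0.35, "e7": 0.8,
}


def thorough_ranking(scores):
    """Rank every scored element by descending relevance, ignoring nesting."""
    return sorted(scores, key=scores.get, reverse=True)


ranking = thorough_ranking(scores)
print(ranking)
```

Note that e1 and e0 appear in the list even though they contain e3, which is the overlap problem the following slides address.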

Problems of Thorough Retrieval for XML-IR
Nesting elements
  Browsing both elements of a nested pair is redundant
    Ancestor element ea first, then descendant element ed: ed has already been fully seen
    Descendant element ed first, then ancestor element ea: ea has already been partially seen
Element size
  Elements retrieved by XML-IR systems vary widely in size
    Large elements, such as article (whole document)
    Small elements, such as p (paragraph)
  The total output size of the top-k elements cannot be controlled simply by giving an integer k

Overview of our Approach
Introduction of the concepts of benefit and reading effort
  Users can control the total output size
  Systems can retrieve non-overlapping elements

Properties of Benefit and Reading Effort (1/2)
Benefit
  The benefit of an element is the amount of gain about the query obtained by reading the element
  Assumption 1: The benefit of an element is greater than or equal to the sum of the benefits of its child elements
    Information complementation among sibling elements
    ex) For two query terms A and B: e6 contains topics about A, and e7 contains topics about B.
        The benefit of e5 seems to be greater than the sum of the benefits of e6 and e7
[Figure: sec e5 with child p elements e6 and e7]

Properties of Benefit and Reading Effort (2/2)
Reading Effort
  The reading effort of an element is the cost of reading the content of the element
  Assumption 2: The reading effort of an element is less than or equal to the sum of the reading efforts of its child elements
    Readability of continuous reading
    ex) Users can read the same content more easily by reading e5 than by reading e6 and e7 separately
[Figure: sec e5 with child p elements e6 and e7]
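
Assumptions 1 and 2 can be checked mechanically on a tree. The sketch below does so for the running example; the (reading effort, benefit) pairs are read off the example tree used in this deck, while the names `values`, `children`, and `satisfies_assumptions` are illustrative assumptions.

```python
# (reading effort, benefit) per element, read off the running example tree.
# Variable and function names are illustrative, not from the slides.
values = {
    "e0": (50, 28), "e1": (28, 18), "e2": (5, 2), "e3": (10, 9),
    "e4": (15, 5), "e5": (23, 8), "e6": (15, 0), "e7": (10, 8),
}
children = {"e0": ["e1", "e5"], "e1": ["e2", "e3", "e4"], "e5": ["e6", "e7"]}


def satisfies_assumptions(values, children):
    """Return True if every parent satisfies Assumption 1 and Assumption 2."""
    for parent, kids in children.items():
        effort, benefit = values[parent]
        # Assumption 1: parent benefit >= sum of child benefits.
        if benefit < sum(values[k][1] for k in kids):
            return False
        # Assumption 2: parent reading effort <= sum of child reading efforts.
        if effort > sum(values[k][0] for k in kids):
            return False
    return True


print(satisfies_assumptions(values, children))
```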

Overview of our Approach
Introduction of the concepts of benefit and reading effort
  Users can control the total output size
  Systems can retrieve non-overlapping elements
Flexible retrieval
  Users specify a threshold for the total amount of reading effort
  The system returns relevant elements that provide larger benefit and that can be read within the specified reading effort

Flexible Retrieval
Systems calculate benefit and reading effort
A variant of the knapsack problem
ex) Threshold of reading effort: 15
[Figure: example element tree labeled with reading effort / benefit]
  article e0: 50 / 28
    sec e1: 28 / 18
      p e2: 5 / 2
      p e3: 10 / 9
      p e4: 15 / 5
    sec e5: 23 / 8
      p e6: 15 / 0
      p e7: 10 / 8
Retrieve {e2, e3} (Total benefit: 11)

Flexible Retrieval
Systems calculate benefit and reading effort
A variant of the knapsack problem
ex) Threshold of reading effort: 20
[Figure: the same element tree labeled with reading effort / benefit]
Retrieve {e3, e7} (Total benefit: 17)
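
The two examples above can be reproduced by exhaustive search over non-overlapping element sets. This is a brute-force sketch for the eight-element example only, not the retrieval algorithm of the talk; the (reading effort, benefit) pairs are read off the example tree, and all names (`values`, `parent`, `best_set`, and so on) are illustrative assumptions.

```python
from itertools import combinations

# (reading effort, benefit) per element, read off the example tree.
values = {
    "e0": (50, 28), "e1": (28, 18), "e2": (5, 2), "e3": (10, 9),
    "e4": (15, 5), "e5": (23, 8), "e6": (15, 0), "e7": (10, 8),
}
parent = {"e1": "e0", "e5": "e0", "e2": "e1", "e3": "e1",
          "e4": "e1", "e6": "e5", "e7": "e5"}


def ancestors(e):
    """All ancestors of element e."""
    found = set()
    while e in parent:
        e = parent[e]
        found.add(e)
    return found


def overlaps(a, b):
    """Two elements overlap when one is an ancestor of the other."""
    return a in ancestors(b) or b in ancestors(a)


def best_set(threshold):
    """Exhaustively find the non-overlapping set with the largest benefit."""
    best, best_benefit = (), 0
    for size in range(1, len(values) + 1):
        for combo in combinations(sorted(values), size):
            if any(overlaps(a, b) for a, b in combinations(combo, 2)):
                continue
            effort = sum(values[e][0] for e in combo)
            benefit = sum(values[e][1] for e in combo)
            if effort <= threshold and benefit > best_benefit:
                best, best_benefit = combo, benefit
    return set(best), best_benefit


print(best_set(15))
print(best_set(20))
```

Exhaustive enumeration is exponential in the number of elements, so it is only practical for toy inputs like this one.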

Search Result Continuity
The running example violates search result continuity
  The content of the element set for reading effort r must be contained in the content of the element set for reading effort r' if r <= r'
The optimal solution
  Finding it is NP-hard (a variant of the knapsack problem)
  It may violate search result continuity
Greedy retrieval algorithm
[Figure: the same element tree labeled with reading effort / benefit]
ex) Reading effort 15: retrieve {e2, e3} (benefit: 11)
    Reading effort 20: retrieve {e3, e7} (benefit: 17)
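
The continuity condition can be expressed as a containment test on covered content. In the sketch below, content is modeled as the set of leaf paragraphs under the retrieved elements, which is a simplifying assumption (text attached directly to non-leaf elements is ignored); the names `children`, `content`, and `is_continuous` are illustrative.

```python
# Child lists of the example tree; leaves are the p elements.
children = {"e0": ["e1", "e5"], "e1": ["e2", "e3", "e4"], "e5": ["e6", "e7"]}


def content(elements):
    """Leaf elements covered by a result set (an element covers its subtree)."""
    covered = set()
    stack = list(elements)
    while stack:
        e = stack.pop()
        kids = children.get(e, [])
        if kids:
            stack.extend(kids)
        else:
            covered.add(e)
    return covered


def is_continuous(smaller_result, larger_result):
    """True if the result for the smaller effort is contained in the larger one."""
    return content(smaller_result) <= content(larger_result)


# Optimal sets for reading effort 15 and 20 in the running example.
print(is_continuous({"e2", "e3"}, {"e3", "e7"}))
# A nested replacement keeps continuity: e1 contains e3.
print(is_continuous({"e3", "e7"}, {"e1", "e7"}))
```

The first check fails because e2 is dropped when the threshold grows from 15 to 20, which is the violation the slide refers to.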

Retrieval Algorithm
Based on the result of the Thorough strategy*
  Adjust the benefit and reading effort of elements that nest with a retrieved element, and rerank
  Remove content that overlaps through nesting
Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)
* Simply retrieves relevant elements from all elements and ranks them in order of relevance
[Figure: the same element tree labeled with reading effort / benefit and relevance scores]

Retrieval Algorithm
Threshold of reading effort: 40
Result of Thorough: e3 (0.9), e7 (0.8), e1 (0.64), e0 (0.56), e2 (0.4), e5 (0.35), e4 (0.33)
Retrieve e3 (0.9)
Adjust e1, e0
  e1: reading effort 18, benefit 9, score 0.5
  e0: reading effort 40, benefit 19, score 0.48
Our result: e3 (0.9)
  Amount of benefit: 9
  Amount of reading effort: 10

Retrieval Algorithm
Threshold of reading effort: 40
Retrieve e7 (0.8)
Adjust and rerank e5, e0
  e5: reading effort 13, benefit 0, score 0
  e0: reading effort 30, benefit 11, score 0.37
Our result: e3 (0.9), e7 (0.8)
  Amount of benefit: 17
  Amount of reading effort: 20

Retrieval Algorithm
Threshold of reading effort: 40
Retrieve e1 (0.5), which replaces the nested e3 in the result
Adjust and rerank e0
  e0: reading effort 12, benefit 2, score 0.17
Our result: e1 (0.5), e7 (0.8)
  Amount of benefit: 26
  Amount of reading effort: 38

Retrieval Algorithm
Threshold of reading effort: 40
Remaining candidates: e0 (0.17), e5 (0)
  Retrieving e0 (reading effort 12) would raise the total from 38 to 50, which exceeds the threshold
Our result: e1 (0.5), e7 (0.8)
  Amount of benefit: 26
  Amount of reading effort: 38
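
The walk-through above can be condensed into a short greedy loop. This is a sketch under stated assumptions rather than the authors' implementation: the score is taken to be benefit divided by reading effort (which matches the numbers on the slides), the loop stops at the first top-ranked element that does not fit the threshold, and all names (`values`, `parent`, `greedy_retrieve`) are illustrative.

```python
# (reading effort, benefit) per element, read off the example tree.
values = {
    "e0": (50, 28), "e1": (28, 18), "e2": (5, 2), "e3": (10, 9),
    "e4": (15, 5), "e5": (23, 8), "e6": (15, 0), "e7": (10, 8),
}
parent = {"e1": "e0", "e5": "e0", "e2": "e1", "e3": "e1",
          "e4": "e1", "e6": "e5", "e7": "e5"}


def ancestors(e):
    """Ancestors of e, nearest first."""
    chain = []
    while e in parent:
        e = parent[e]
        chain.append(e)
    return chain


def greedy_retrieve(threshold):
    """Greedy retrieval with adjustment of nesting elements (sketch)."""
    effort = {e: v[0] for e, v in values.items()}
    benefit = {e: v[1] for e, v in values.items()}
    result, covered = [], set()
    total_effort = total_benefit = 0
    while True:
        # Skip covered elements and those whose adjusted effort is used up.
        candidates = [e for e in values if e not in covered and effort[e] > 0]
        if not candidates:
            break
        best = max(candidates, key=lambda e: benefit[e] / effort[e])
        if total_effort + effort[best] > threshold:
            break  # assumption: stop at the first element that does not fit
        # Results nested inside the new element are replaced by it.
        result = [r for r in result if best not in ancestors(r)] + [best]
        total_effort += effort[best]
        total_benefit += benefit[best]
        # Subtract the retrieved amounts from every ancestor (rerank next pass).
        for a in ancestors(best):
            effort[a] -= effort[best]
            benefit[a] -= benefit[best]
        covered.add(best)
        covered.update(e for e in values if best in ancestors(e))
    return result, total_benefit, total_effort


print(greedy_retrieve(40))
```

Stopping at the first element that does not fit, instead of skipping it and trying lower-ranked elements, keeps each result a prefix of the result for any larger threshold, so search result continuity holds in this sketch.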

Evaluation Metrics
Based on benefit and reading effort
  b/e graph (benefit/effort graph)
Comparison with BTIL (Best Thorough Input List)
  The BTIL system is the system that uses the actual benefit and reading effort
    Actual benefit is calculated using manually constructed assessments (e.g. INEX)
  We can observe the relative effectiveness in benefit while changing the specified threshold of reading effort
  Use the same values of reading effort for the implemented system and the BTIL system

For the threshold value 30 of reading effort
  Implemented system retrieves {e3, e7}: obtained actual benefit is 10
  BTIL system retrieves {e3, e6}: obtained actual benefit is 23
[Figure: two element trees, one with calculated and one with actual reading effort / benefit]
  Element      Calculated (reading effort / benefit)   Actual (reading effort / benefit)
  article e0   50 / 28                                 50 / 35
  sec e1       28 / 18                                 28 / 20
  p e2         5 / 2                                   5 / 0
  p e3         10 / 9                                  10 / 10
  p e4         15 / 5                                  15 / 10
  sec e5       23 / 8                                  23 / 13
  p e6         15 / 0                                  15 / 13
  p e7         10 / 8                                  10 / 0
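
The comparison at threshold 30 comes down to summing actual benefit over each system's result set. A minimal sketch, with the actual benefit values read off the example tree; the ratio at the end is only one possible way to express relative effectiveness and is an assumption, as are the names used.

```python
# Actual benefit per element, read off the "actual" tree of the example.
actual_benefit = {
    "e0": 35, "e1": 20, "e2": 0, "e3": 10,
    "e4": 10, "e5": 13, "e6": 13, "e7": 0,
}


def obtained_benefit(result):
    """Actual benefit a user obtains from a set of non-overlapping elements."""
    return sum(actual_benefit[e] for e in result)


implemented = obtained_benefit({"e3", "e7"})  # result of the implemented system
btil = obtained_benefit({"e3", "e6"})         # result of the BTIL system

# Assumed summary measure: fraction of the BTIL benefit that was obtained.
relative = implemented / btil
print(implemented, btil, round(relative, 2))
```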

Examples of b/e Graph using INEX 2005 Test Collection (1/2)
INEX 2005 test collection: XML document set, Topics, Assessments
Calculate actual benefit and reading effort from Assessments
  ex (Exhaustivity): Highly exhaustive (HE) 1, Partially exhaustive (PE) 0.5, Not exhaustive (NE) 0
  rsize: relevant text length (in number of characters)
  size: element length (in number of characters)
  e.benefit = ex * rsize
  e.reading_effort = size
  (the slide also lists three parameters; their symbols did not survive in this transcript)
We implemented a system using tf-ief
  ief stands for inverse element frequency
  The system satisfies the Assumptions for benefit and reading effort
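
As a rough illustration of how the assessment values map to actual benefit and reading effort, the sketch below applies the two formulas with the slide's parameters left out, since their symbols and roles are not recoverable here. The sample element (PE, 400 relevant characters out of 1000) is made up, and the function names are illustrative.

```python
# Exhaustivity values from the slide: HE = 1, PE = 0.5, NE = 0.
EXHAUSTIVITY = {"HE": 1.0, "PE": 0.5, "NE": 0.0}


def actual_benefit(ex_label, rsize):
    """Benefit from exhaustivity and relevant text length (parameters omitted)."""
    return EXHAUSTIVITY[ex_label] * rsize


def actual_reading_effort(size):
    """Reading effort from element length in characters (parameters omitted)."""
    return size


# Hypothetical element: partially exhaustive, 400 relevant of 1000 characters.
print(actual_benefit("PE", 400), actual_reading_effort(1000))
```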

Examples of b/e Graph using INEX 2005 Test Collection (2/2)
[Figure: b/e graphs for Topic 207 and Topic 206]
We can observe the relative effectiveness of the implemented systems against the BTIL system

Conclusions and Future Works
Conclusions
  Introduction of benefit and reading effort
    Handling nesting elements
    Variety of element size
  Algorithm for flexible retrieval
    Result elements change depending on the specified reading effort
  System evaluation
Future Works
  Introduction of switching effort
    Cost of switching a result item in the results list
    Retrieving numerous results increases the cost of browsing
  Integration with user interface