Adaptive XML Search
Dr Wilfred Ng
Department of Computer Science
The Hong Kong University of Science and Technology
Outline
Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work
Motivation
Why Do We Need an XML Search Engine?
Different nature of HTML and XML data
HTML data:
- Hyperlink-intensive
- Declarative language
- Tags have no semantic meaning
XML data:
- Self-describing tags
- Extra structural information
- XML search engines can retrieve more accurate fragments
Why Do We Need an XML Search Engine?
Web searching:
- Document paradigm
- Matching keywords vs. documents
- Returns links to whole documents (web pages)
XML searching:
- Query keywords may be tags or data values
- The structure of XML documents is diverse, e.g. DBLP and Shakespeare
- Does not return the whole document, which may be 100 MB or larger; returns fragments instead
DBLP
<dblp>
<incollection mdate="2002-01-03" key="books/acm/kim95/AnnevelinkACFHK95">
<author>Jurgen Annevelink</author>
<title>Object SQL - A Language for the Design and Implementation of Object Databases.</title>
<pages>42-68</pages>
<year>1995</year>
<booktitle>Modern Database Systems</booktitle>
<url>db/books/collections/kim95.html</url>
</incollection>
….
Shakespeare
<SPEECH>
  <SPEAKER>OCTAVIUS CAESAR</SPEAKER>
  <LINE>No, my most wronged sister; Cleopatra</LINE>
  <LINE>Hath nodded him to her. He hath given his empire</LINE>
  <LINE>Up to a whore; who now are levying</LINE>
  <LINE>The kings o' the earth for war; he hath assembled</LINE>
  <LINE>Bocchus, the king of Libya; Archelaus,</LINE>
  <LINE>Of Cappadocia; Philadelphos, king</LINE>
  <LINE>Of Paphlagonia; the Thracian king, Adallas;</LINE>
  <LINE>King Malchus of Arabia; King of Pont;</LINE>
  <LINE>Herod of Jewry; Mithridates, king</LINE>
  <LINE>Of Comagene; Polemon and Amyntas,</LINE>
  <LINE>The kings of Mede and Lycaonia,</LINE>
  <LINE>With a more larger list of sceptres.</LINE>
</SPEECH>
Research Ideas
In the Information Retrieval community, many ranking techniques have been developed:
- Weighted keywords
- Vector space model
Searching and ranking XML as plain text using IR techniques is possible, but:
- It is too simple
- It does not exploit the advantages of XML data
Better accuracy can be achieved using features of XML data:
- Structure
- Tag semantics
Outline
Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work
Key-Tag Search
Key-Tag Query vs. XQuery
Keywords in a Web search engine vs. SQL: the goals of a key-tag query and XQuery are different.
Key-Tag Query:
- Simple
- Easy to understand
- Flexible
XQuery:
for $x in doc("some.xml")
where $x/author[. ftcontains 'Mary']
return $x/title
Key-Tag Query:
<author>Mary</author>
XQuery is too complicated for ordinary users! Will users input such a complex XQuery in search engines?
Key-Tag Search Query
For example: <author>Mary</author>

Tag     Key
author  Mary
title   XML
year    2007

Tag     Key
*       Mary
*       XML
*       2007

Tag     Key
author  *
title   *
year    *

Tag     Key
*       Mary
*       XML
year    *
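A key-tag query such as the tables above can be viewed as a list of (tag, key) pairs, where * acts as a wildcard. A minimal parsing sketch (the function name and regex are illustrative, not taken from the system):

```python
import re

def parse_key_tag_query(query):
    """Split a key-tag query string into (tag, key) pairs.

    A '*' tag matches any tag; an empty or '*' key matches any value.
    """
    pairs = []
    # match well-formed <tag>key</tag> pairs with matching close tags
    for tag, key in re.findall(r"<([^>]+)>([^<]*)</\1>", query):
        pairs.append((tag.strip(), key.strip() or "*"))
    return pairs

pairs = parse_key_tag_query("<author>Mary</author><title>XML</title>")
# pairs == [("author", "Mary"), ("title", "XML")]
```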
Key-Tag Query Semantics
A fragment is considered a result candidate if at least one key-tag is found in it.
If F1 and F2 both contain the same instance of a key-tag and F1 is a subtree of F2, F1 is chosen as the only answer.
For example, given the query <b>b</b> and the fragment
<b>
  <c><b>b</b></c>
</b>
F1: <b>b</b> (the inner element) is a subtree of F2: <b><c><b>b</b></c></b>, so F1 will be the answer.
By contrast, for the fragments
F1: <a>
      <b>b</b>        (B1)
    </a>
F2: <a>
      <c><b>b</b>     (B2)
      </c>
    </a>
B1 and B2 are different instances of the key-tag, so the subtree rule does not apply.
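The subtree rule (when two candidates contain the same key-tag instance, return only the smaller one) can be sketched with Python's standard ElementTree; the function name is illustrative, not the system's:

```python
import xml.etree.ElementTree as ET

def answer_fragments(xml_text, tag, key):
    """Fragments containing an instance of <tag>key</tag>; when one
    candidate is a subtree of another containing the same instance,
    only the smaller one is kept."""
    root = ET.fromstring(xml_text)
    # key-tag instances: elements named `tag` whose text equals `key`
    instances = [e for e in root.iter(tag) if (e.text or "").strip() == key]
    # every element whose subtree contains an instance is a candidate
    candidates = [e for e in root.iter()
                  if any(i in set(e.iter()) for i in instances)]
    # keep a candidate only if no other candidate is its proper descendant
    return [c for c in candidates
            if not any(o in (set(c.iter()) - {c}) for o in candidates)]

answers = answer_fragments("<b><c><b>b</b></c></b>", "b", "b")
# the single answer is the inner <b>b</b>: F1 beats the enclosing F2
```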
Outline
Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work
Multi-Ranker Model
Introduction to MRM
Handles diversified XML documents and user preferences
Multi-Ranker Model
Architecture (from the slide's diagram):
- Adaptive Ranking Level (AR): adaptive rankers AR1, AR2, ..., ARn; each carries a weight vector, e.g. W1 = (w11, w12, w13, w14), trained by RSSF from user profiles
- Standard Ranking Level (XR): the four rankers STR, DAT, DFT, CUS
- Feature Ranking Level: similarity features (Keyword, Access, Path, Element, Order, Category) and granularity features (Sibling, Children, Distance+, Distance-, Tag, Attribute)
Adaptive Ranking Level (AR)
AR maintains a feature vector Φ, which adapts to the four XRs; the vector is weighted and trained by RSSF: Φ = (STR, DAT, DFT, CUS).
The adaptive ranking of fragments is calculated by W · Φ, where W is generated by RSSF (introduced later).
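A minimal sketch of the adaptive score, assuming it is the dot product of W and the four XR scores; the weights and scores below are made-up placeholders, not trained values:

```python
def adaptive_rank(weights, xr_scores):
    """Adaptive ranking score: dot product W * phi over the four XRs
    (STR, DAT, DFT, CUS)."""
    assert len(weights) == len(xr_scores) == 4
    return sum(w * s for w, s in zip(weights, xr_scores))

# hypothetical trained weights W and per-fragment XR scores phi
W = [0.4, 0.3, 0.2, 0.1]
phi = [0.8, 0.5, 0.6, 0.2]     # STR, DAT, DFT, CUS scores of a fragment
score = adaptive_rank(W, phi)  # 0.4*0.8 + 0.3*0.5 + 0.2*0.6 + 0.1*0.2 = 0.61
```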
Standard Ranking Level (XR)
Four XRs:
- Structure ranker (STR): ranks XML fragments based on their structure
- Data ranker (DAT): ignores the structure and ranks the XML fragments by their textual data
- System default ranker (DFT): a balance of the structure and data rankers
- Customized ranker (CUS): the system administrator selects low-level features for tuning; in our experiments, the low-level features are randomly picked
Feature Ranking Level
Similarity Features: Keyword, Access, Path, Element, Order, Category
For example, Q = {<author>Mary</author>, <title>XML</title>}
Keyword similarity = (log₂(1 + 0.5) · 0.5) / ((1 + 0.5) · 0.5)
Path similarity = 3/4
Access similarity = 3/7
Element similarity = 2/7
Order in Q: author > title
Ancestor order similarity = 0
Sibling order similarity = 1/4
Sibling order in F: author > title, author > year, title > year, first > last
Predefined categories:
- Academic category: {article, title, author}
- Sport category: {team, player, match, year}
...
Category vector for Q: <2/3, 0>
Category vector for F: <1, 1/4>
Category similarity = distance = sqrt((1/3)² + (1/4)²) = 0.4167
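The category similarity above can be reproduced as the Euclidean distance between the category vectors of Q and F. A sketch using the slide's predefined categories (the helper name is illustrative):

```python
import math

def category_vector(tags, categories):
    """Fraction of each predefined category's tags present in `tags`."""
    return [len(set(tags) & set(cat_tags)) / len(cat_tags)
            for cat_tags in categories]

categories = [
    ["article", "title", "author"],       # academic category
    ["team", "player", "match", "year"],  # sport category
]
q_vec = category_vector(["author", "title"], categories)                     # [2/3, 0]
f_vec = category_vector(["article", "title", "author", "year"], categories)  # [1, 1/4]
similarity = math.dist(q_vec, f_vec)
# sqrt((1 - 2/3)**2 + (1/4 - 0)**2) ~= 0.4167
```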
Feature Ranking Level
Granularity Features: Sibling, Children, Distance+, Distance-, Tag, Attribute
These involve statistical data in the database.
For example, Q = {<author>Mary</author>, <title>XML</title>}
- Number of fragments whose roots are dblp
- Number of tags whose parent is dblp
- Distance+: the length of the path from the root to the farthest leaf, e.g. dblp/article/author/first: length = 4
- Distance-: the length of the path from the root to the nearest leaf, e.g. dblp/article/title: length = 3
- Number of tags in F: 7
- Number of attributes in F: 0
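Several granularity features are simple counts over a fragment. A sketch using Python's ElementTree on a tiny DBLP-like fragment (the helper name and fragment are illustrative):

```python
import xml.etree.ElementTree as ET

def granularity_features(fragment):
    """Tag count, attribute count, and longest/shortest root-to-leaf
    path lengths (Distance+ / Distance-) of an XML fragment."""
    elem = ET.fromstring(fragment)
    tags = sum(1 for _ in elem.iter())
    attrs = sum(len(e.attrib) for e in elem.iter())

    def leaf_depths(e, depth=1):
        children = list(e)
        if not children:
            return [depth]
        return [d for c in children for d in leaf_depths(c, depth + 1)]

    depths = leaf_depths(elem)
    return {"tags": tags, "attributes": attrs,
            "distance_plus": max(depths), "distance_minus": min(depths)}

frag = ("<dblp><article><author><first>Mary</first></author>"
        "<title>XML</title></article></dblp>")
feats = granularity_features(frag)
# {'tags': 5, 'attributes': 0, 'distance_plus': 4, 'distance_minus': 3}
```

Here dblp/article/author/first gives Distance+ = 4 and dblp/article/title gives Distance- = 3, matching the slide's example paths.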
Highlights of MRM
Highly flexible:
- Adding or removing features or XRs is straightforward
- Only requires updating the feature vector
"Ranking Level Independence":
- Analogous to data independence in the relational model
Outline
Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work
Features of RSSF
Input: a set of labeled fragments. Output: a trained ranker.
Naïve Bayes is a successful algorithm for learning to classify text documents; it requires only a small amount of training data, both positive and negative samples.
In our setting, we only have labeled (positive) and unlabeled data, so we extend Naïve Bayes with a spying technique to obtain the negative training samples.
The RSSF
Ranking SVM Techniques
Find a weight vector that makes the inequalities hold: F1 < F2 < F3
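A Ranking SVM learns a weight vector w such that w·φ(F1) < w·φ(F2) < w·φ(F3) for the preferred order; a standard way to do this reduces the ordering to binary classification over pairwise feature differences. A minimal sketch of that pairwise transform (names and feature values are illustrative):

```python
def pairwise_training_set(ranked_fragments):
    """Turn an ordered list of feature vectors (worst to best) into
    pairwise-difference examples for a linear SVM: for every ordered
    pair Fi < Fj, emit (phi(Fj) - phi(Fi), +1) and the mirror (-diff, -1)."""
    pairs = []
    for i in range(len(ranked_fragments)):
        for j in range(i + 1, len(ranked_fragments)):
            diff = [b - a for a, b in
                    zip(ranked_fragments[i], ranked_fragments[j])]
            pairs.append((diff, +1))
            pairs.append(([-d for d in diff], -1))
    return pairs

# F1 < F2 < F3, each with a 2-dimensional feature vector
data = pairwise_training_set([[0.1, 0.2], [0.4, 0.3], [0.9, 0.5]])
# 3 ordered pairs -> 6 training examples for the binary SVM
```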
Voting Spy Naïve Bayes
The training data is split into positive, unclassified (unlabeled), and negative sets.
Spies P1, P2, and P3 are drawn from the positive set and planted into the unlabeled set; a Naïve Bayes classifier is trained for each spy. When training completes, each classifier produces its own estimated-negative set.
Voting Spy Naïve Bayes
Each spy classifier votes on the unlabeled fragments: P1 estimates {F11, F12, F14} as negative, P2 estimates {F11, F12}, and P3 estimates {F11, F13}.
The final estimated negative is the fragment voted by every spy: F11.
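The voting step above can be sketched as follows: each spy run yields its own estimated-negative set, and only fragments receiving enough votes (here, a unanimous vote) become final negatives. The spy sets mirror the slide's example; the Naïve Bayes training itself is omitted:

```python
def vote_negatives(spy_estimates, threshold=1.0):
    """Combine per-spy estimated-negative sets: keep a fragment as a
    final negative if at least `threshold` fraction of spies voted it."""
    votes = {}
    for estimated in spy_estimates:
        for frag in estimated:
            votes[frag] = votes.get(frag, 0) + 1
    needed = threshold * len(spy_estimates)
    return {f for f, v in votes.items() if v >= needed}

# estimated-negative sets from spies P1, P2, P3 (as on the slide)
spy_estimates = [{"F11", "F12", "F14"}, {"F11", "F12"}, {"F11", "F13"}]
final = vote_negatives(spy_estimates)
# final == {"F11"}: only F11 was voted negative by every spy
```

Lowering `threshold` (e.g. to 2/3) relaxes the vote, which is what the voting-threshold experiment later varies.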
Outline
Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work
Effect of Varying Voting Threshold
(Figure: X-axis: voting threshold; Y-axis: relative average rank of labeled fragments, i.e. new average rank / original average rank)
Effectiveness of Low-Level Features on XR
In this experiment, we remove individual low-level features from the STR and DAT rankers and measure the new precision.
The queries we use can be found in the appendix of the proposal.
Processing Time
Comparison with TopX
TopX is a search engine for XML data, available online; a state-of-the-art XML search engine.
We measure MAP and precision@k:
- MAP: mean average precision — the average precision over 100 recall points for each query, then the average over all queries
- precision@k: the number of relevant results in the top k, divided by k
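The two metrics can be stated precisely: precision@k divides the relevant results among the top k by k, and average precision for one query averages precision@k over the ranks where a relevant result appears. A minimal sketch with hypothetical relevance lists (the slide's variant interpolates over 100 recall points; the rank-based form below is the common simplification):

```python
def precision_at_k(relevance, k):
    """relevance: list of 0/1 relevance flags in ranked order."""
    return sum(relevance[:k]) / k

def average_precision(relevance):
    """Mean of precision@k over the ranks k holding a relevant result."""
    hits = [precision_at_k(relevance, k + 1)
            for k, rel in enumerate(relevance) if rel]
    return sum(hits) / len(hits) if hits else 0.0

def mean_average_precision(runs):
    """MAP: average of per-query average precision."""
    return sum(average_precision(r) for r in runs) / len(runs)

# two hypothetical queries' ranked result lists (1 = relevant)
runs = [[1, 0, 1, 0], [0, 1, 1, 1]]
print(round(mean_average_precision(runs), 4))  # → 0.7361
```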
Outline
Motivation
Key-Tag Search
Multi-Ranker Model
Ranking Support Vector Machine in Voting SpyNB Framework (RSSF)
Experiments
Conclusions and Ongoing Work
Further Remarks
Searching and ranking XML data are important, since existing Web search engines cannot handle them well.
We present an effective approach to adaptive XML searching and ranking, extending traditional IR techniques by considering the different features of XML data.
Ongoing Work – INEX 2007
The Initiative for the Evaluation of XML Retrieval (INEX): a community which aims to provide large test data and scoring methods for researchers to evaluate their retrieval systems.
It has been getting attention recently; we participated in INEX in 2006 and 2007.
The INEX 2007 collection is a Wikipedia XML corpus with a set of 659,388 XML documents.
We are running experiments using their data and queries.
Ongoing Work – Merging
Displaying a list of fragments one by one to the user may not be adequate in the XML setting:
- Fragments may be scattered over the list
- Duplicated fragments appear in different structures
- Users must refine a search query to obtain more and better results
Idea: make use of the schema information (DTD), consider the fragments as entities, and merge them in a concise way.
My Publications
- Ho-Lam LAU and Wilfred NG. A Multi-Ranker Model for Adaptive XML Searching. Accepted and to appear: VLDB Journal, (2007).
- Ho-Lam LAU and Wilfred NG. Towards an Adaptive Information Merging Using Selected XML Fragments. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 1013-1019, (2007).
- James CHENG and Wilfred NG. A Development of Hash-Lookup Trees to Support Querying Streaming XML. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 768-780, (2007).
- Wilfred NG and James CHENG. An Efficient Index Lattice for XML Query Evaluation. International Conference on Database Systems for Advanced Applications, DASFAA 2007, Lecture Notes in Computer Science Vol. 4443, Bangkok, Thailand, pp. 753-767, (2007).
- Wilfred NG and Ho-Lam LAU. A Co-Training Framework for Searching XML Documents. Information Systems, 32(3), pp. 477-503, (2007).
- Yin YANG, Wilfred NG, Ho-Lam LAU and James CHENG. An Efficient Approach to Support Querying Secure Outsourced XML Information. Conference on Advanced Information Systems Engineering, CAiSE 2006, Lecture Notes in Computer Science Vol. 4007, Luxembourg, pp. 157-171, (2006).
- Wilfred NG and Ho-Lam LAU. Effective Approaches for Watermarking XML Data. 10th International Conference on Database Systems for Advanced Applications, DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 68-80, (2005).
- Ho-Lam LAU and Wilfred NG. A Unifying Framework for Merging and Evaluating XML Information. 10th International Conference on Database Systems for Advanced Applications, DASFAA 2005, Lecture Notes in Computer Science Vol. 3453, Beijing, China, pp. 81-94, (2005).