Post on 22-Dec-2015
RMIT University at INEX 2004: Heterogeneous Track Experiments
Jovan Pehcevski
Email: jovanp@cs.rmit.edu.au
School of Computer Science and Information Technology, RMIT University,
Melbourne, Australia.
Overview
Research questions
Collection statistics
Topics
Retrieval systems
    Zettair (using two similarity measures)
    Hybrid (Zettair with eXist, using two retrieval heuristics)
Runs: all automatic, title-only runs
    #1: Zettair (Okapi BM25)
    #2: Zettair (Pivoted Cosine)
    #3: Hybrid (MpE heuristic)
    #4: Hybrid (PME heuristic)
Results
    Efficiency
    Effectiveness (for the IEEE collection)
Final thoughts
Research Questions
The goal of the Heterogeneous track at INEX 2004 is to set up a test collection (a heterogeneous XML document collection, suitable retrieval topics, and relevance assessments that correspond to these topics) and to explore new retrieval challenges.
Our group at RMIT focuses on answering the following questions:
    For CO queries, what methods are feasible for determining elements that would be reasonable answers?
    Should the data be organised (and indexed) as a single heterogeneous collection, or is it better to treat this collection as a set of homogeneous sub-collections?
Methods for mapping structural criteria from one DTD to another are NOT considered in this work.
Heterogeneous collection
The heterogeneous XML collection at INEX 2004 consists of the following sub-collections:
    QMULDCSDBPub - Publications database of QMUL Department of Computer Science
    BibDBPub - BibTeX converted to XML by the IS group at the University of Duisburg-Essen
    HCIBIB - Human-Computer Interaction Resources, bibliography from www.hcibib.org
    Berkeley - library catalog records of books in the area of computer and information science from Berkeley
    DBLP - from the Digital Bibliography & Library Project in Trier
    CompuScience - from the Computer Science database of FIZ Karlsruhe
    IEEE - IEEE Computer Society publications in the period between 1995 and 2002
Collection statistics
We analyse and pre-process each sub-collection to determine the concept of a Document
Collection      Size (MB)   Number of Documents   Document (tag name)
QMULDCSDBPub         1.2                  2024    DOCUMENT
BibDBPub             2.4                  3465    entry
HCIBIB              32.4                 26399    entry
Berkeley            34.4                 12800    USMARC
DBLP               239.3                501102    article | book | phdthesis | mastersthesis | proceedings | inproceedings | incollection | www
CompuScience       338.6                250987    article | book | inbook | dissertation | proceedings | inproceedings | incollection | techreport | misc
IEEE               494.5                 12107    article
Het Collection    1142.8                808884    (all the distinct tags above)
Topics
Four types of retrieval topics are considered for the Heterogeneous track at INEX 2004:
CO (Content-Only) - plain queries, no structural constraints and target elements (10 topics)
    Example: XML information retrieval
BCAS (Basic Content-And-Structure) - queries using single structural and content-based constraints to enable synonym matches (1 topic)
    Example: //article[about(., XML information retrieval)]
CCAS (Complex Content-And-Structure) - queries using complex structural and content-based constraints to enable a wide range of path transformations and partial mappings (13 topics)
    Example: //article[about(.//sec, XML information retrieval)]
ECCAS (Extended Complex Content-And-Structure) - queries using probability likelihood of a structural constraint (0 topics)
    Example: //article(0.8)[about(.//sec(0.5), XML information retrieval)]
CCAS Topic Example
<inex_topic topic_id="3" query_type="CCAS">
<title>//article[about(.//abs, Web usage mining) or about(.//sec, "Web mining" traversal navigation patterns)]</title>
<content_description>We are looking for documents that describe capturing and mining Web usage, in particular the traversal and navigation patterns; motivations include Web site redesign and maintenance.</content_description>
<structure_description>Article is a tag identifying a document, which can also be represented as a book tag, an inproceedings (or incollection) tag, an entry tag, etc. Abs is a tag identifying abstract of a document, which can be represented as an abstract tag, an abs tag, etc. Sec is a tag identifying an informative document component, such as section or paragraph. It can also be represented as sec, ss1, ss2, p, ip1 or other similar tags.</structure_description>
<narrative>To be relevant, a document must describe methods for capturing and analysing web usage, in particular traversal and navigation patterns. The motivation is using Web usage mining for site reconfiguration and maintenance, as well as providing recommendations to the user. Methods that are not explicitly applied to the Web but could apply are still relevant. Capturing browsing actions for pre-fetching is not relevant.</narrative>
<keywords>Web usage mining, Web log analysis, browsing pattern, navigation pattern, traversal pattern, Web statistics, Web design, Web maintenance, user recommendations</keywords>
</inex_topic>
Retrieval Systems
Our runs use two systems:
    Zettair - a compact and fast full-text search engine
    Hybrid - a modular system using best retrieval features from Zettair and eXist (a native XML database), and a top-up module to identify the appropriate units of retrieval
Unconstrained, plain text queries are used by each retrieval system. For each topic, the structural constraints and the target element are removed. Terms from the <title> are used to formulate the queries.
The systems use two different strategies to index the terms in the heterogeneous XML collection.
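The query formulation step described above (drop the structural constraints and the target element, keep the title terms) could look roughly like the sketch below. It is only an illustration, not the code used for the runs: the function name `title_to_query` and the regular expression are assumptions, and it handles only the `about(path, terms)` clauses that appear in the topic examples.

```python
import re

# Assumption: every content constraint has the form about(<path>, <terms>)
# and the terms themselves contain no parentheses.
ABOUT_CLAUSE = re.compile(r"about\(\s*[^,]*,\s*([^)]*)\)")


def title_to_query(title: str) -> str:
    """Turn a topic <title> into an unconstrained plain-text query."""
    clauses = ABOUT_CLAUSE.findall(title)
    if not clauses:
        # CO titles carry no structure, so the title already is the query.
        return " ".join(title.split())
    # Keep only the content terms of each about() clause, in order.
    return " ".join(" ".join(clause.split()) for clause in clauses)


co_title = "XML information retrieval"
ccas_title = ('//article[about(.//abs, Web usage mining) or '
              'about(.//sec, "Web mining" traversal navigation patterns)]')

print(title_to_query(co_title))
print(title_to_query(ccas_title))
```

Phrases in quotes are passed through unchanged, so a phrase-aware engine can still treat them as phrase queries.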
Zettair
From zetta (10^21) and IR
A scalable, fast search engine server
    Supports ranked, simple Boolean, and phrase queries
    Indexes HTML, XML, plain text, and TREC-formatted documents
    Usable as a C and Python library
    Native support for TREC experiments (not yet for INEX)
    Documented. Includes easy-to-follow examples
BSD license
Emphasis on simplicity and efficiency
    One executable does everything
Under continued development
Ported to Mac OS X, FreeBSD, MS Windows, Linux, Solaris
Available from www.seg.rmit.edu.au/zettair
Zettair Indexing
With Zettair, the seven homogeneous XML collections are indexed as a single heterogeneous XML collection
Single-pass, sort-merge scheme
Document-ordered, word position inverted indexes
Efficient, variable-byte index compression
Indexed the HET collection (1.14 GB) in under 5 minutes on a single AUD$2000 Intel P4 machine
    Throughput: 230 MB/minute
Fast configurable parser. Handles badly-formed HTML:
    Validates each tag by matching < with > within a character
    HTML comments are not indexed but are validated
    Entity references translated
    No support for internationalised text
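To illustrate the variable-byte index compression listed above, here is a minimal sketch of one common variable-byte convention: seven payload bits per byte, low-order bits first, with the high bit set on every byte except the last. Whether Zettair uses exactly this byte layout is an assumption; the function names are hypothetical.

```python
def vbyte_encode(number: int) -> bytes:
    """Encode one non-negative integer in 7-bit groups, low-order group first."""
    out = bytearray()
    while number >= 0x80:
        out.append((number & 0x7F) | 0x80)  # high bit set: more bytes follow
        number >>= 7
    out.append(number)                      # high bit clear: last byte
    return bytes(out)


def vbyte_decode(data: bytes) -> list:
    """Decode a concatenation of variable-byte integers."""
    values, number, shift = [], 0, 0
    for byte in data:
        number |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(number)
            number, shift = 0, 0
    return values


# Document-ordered postings are usually stored as gaps between document
# numbers, so that most values fit in a single byte.
doc_ids = [3, 7, 300]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
encoded = b"".join(vbyte_encode(g) for g in gaps)
print(len(encoded), vbyte_decode(encoded))
```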
Zettair Querying
B-tree vocabulary bulk-loaded at index construction time
For a 1.14 GB collection, average query time is 10 milliseconds (without explicit caching or other optimisations)
Single-threaded, blocking I/O, and relatively unoptimised
Provides query-biased summaries of documents (see Tombros and Sanderson, “Advantages of query biased summaries in information retrieval”, SIGIR 1998)
Supports Pivoted Cosine and Okapi BM25 similarity measures
    Working on further measures
    Measures can be manipulated externally
Zettair Querying…
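The Pivoted Cosine similarity measure is:
\[ \mathrm{sim}(Q, D) = \frac{1}{W_D \times W_Q} \sum_{t \in Q \cap D} \left(1 + \log_e f_{d,t}\right) \times \log_e\left(1 + \frac{N}{f_t}\right) \]
where:
\[ W_D = (1.0 - s) + s \times \frac{W_d}{W_{AL}} \]
and:
\[ W_Q = \sqrt{\sum_{t \in Q} \left[\log_e\left(1 + \frac{N}{f_t}\right)\right]^2} \]
W_d = document length
W_AL = average document length
s = 0.25 (the slope)
N = number of docs in collection
f_t = collection frequency (# of docs that t occurs in)
f_{d,t} = within-document frequency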
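As a worked illustration of the Pivoted Cosine measure, the following sketch computes the score of one document for one query, with s = 0.25 as on the slide. It is not Zettair's implementation; the function and parameter names are hypothetical, and skipping query terms that do not occur in the collection is an assumption made to avoid dividing by zero.

```python
import math


def pivoted_cosine(query_terms, doc_tf, doc_len, avg_doc_len,
                   num_docs, doc_freq, s=0.25):
    """Pivoted Cosine score of one document for one query.

    doc_tf maps term -> f_{d,t}; doc_freq maps term -> f_t.
    """
    # W_D: pivoted document length normalisation.
    w_d = (1.0 - s) + s * (doc_len / avg_doc_len)
    # log_e(1 + N / f_t) for each distinct query term seen in the collection
    # (assumption: unseen terms are ignored).
    idf = {t: math.log(1.0 + num_docs / doc_freq[t])
           for t in set(query_terms) if doc_freq.get(t, 0) > 0}
    # W_Q: query weight.
    w_q = math.sqrt(sum(v * v for v in idf.values()))
    if w_q == 0.0:
        return 0.0
    # Sum over the terms shared by the query and the document.
    total = sum((1.0 + math.log(doc_tf[t])) * idf[t]
                for t in idf if doc_tf.get(t, 0) > 0)
    return total / (w_d * w_q)


# A document of average length scores higher than one twice as long.
average = pivoted_cosine(["xml"], {"xml": 1}, 100, 100, 100, {"xml": 10})
longer = pivoted_cosine(["xml"], {"xml": 1}, 200, 100, 100, {"xml": 10})
print(average, longer)
```

With a single query term occurring once, the term weight cancels against W_Q, so the two scores differ only by the length normalisation W_D (1.0 versus 1.25).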
Zettair Querying…
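The Okapi BM25 similarity measure is:
\[ \mathrm{sim}(Q, D) = \sum_{t \in Q} w_t \times \frac{(k_1 + 1)\, f_{d,t}}{K + f_{d,t}} \times \frac{(k_3 + 1)\, f_{q,t}}{k_3 + f_{q,t}} \]
where:
\[ w_t = \log_e\left(\frac{N - f_t + 0.5}{f_t + 0.5}\right) \]
and:
\[ K = k_1 \times \left((1 - b) + b \times \frac{W_d}{W_{AL}}\right) \]
W_d = document length
W_AL = average document length
k_1 = 1.2
k_3 = 1000 (effectively infinite)
b = 0.75
N = number of docs in collection
f_{q,t} = query-term frequency
f_{d,t} = within-document frequency
f_t = collection frequency (# of docs that t occurs in)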
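The Okapi BM25 measure can be sketched in the same way, using the parameter values given on the slide (k1 = 1.2, k3 = 1000, b = 0.75). This is an illustration under assumptions, not Zettair's code: the function and parameter names are hypothetical, and terms missing from the document or the collection are simply skipped.

```python
import math


def okapi_bm25(query_tf, doc_tf, doc_len, avg_doc_len, num_docs, doc_freq,
               k1=1.2, k3=1000.0, b=0.75):
    """Okapi BM25 score of one document for one query.

    query_tf maps term -> f_{q,t}; doc_tf maps term -> f_{d,t};
    doc_freq maps term -> f_t.
    """
    # K: document length normalisation.
    big_k = k1 * ((1.0 - b) + b * doc_len / avg_doc_len)
    score = 0.0
    for term, f_qt in query_tf.items():
        f_dt = doc_tf.get(term, 0)
        f_t = doc_freq.get(term, 0)
        if f_dt == 0 or f_t == 0:
            continue  # assumption: such terms contribute nothing
        w_t = math.log((num_docs - f_t + 0.5) / (f_t + 0.5))
        score += (w_t
                  * ((k1 + 1.0) * f_dt) / (big_k + f_dt)
                  * ((k3 + 1.0) * f_qt) / (k3 + f_qt))
    return score


bm25_score = okapi_bm25({"xml": 1}, {"xml": 1}, 100, 100, 100, {"xml": 10})
print(bm25_score)
```

For a term that occurs once in the query and once in a document of average length, both frequency factors equal 1, so the score reduces to w_t.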
Hybrid
Utilising best features from Zettair and eXist
With eXist, the seven homogeneous XML collections are indexed separately, but queries can span across the XML collections
The Hybrid system uses a “fetch and browse” approach, where heterogeneous Documents are first retrieved and ranked by Zettair (the fetch phase), and the most specific elements from the highly ranked Documents are then extracted by eXist (the browse phase)
The system also uses a retrieval module that identifies and ranks Coherent Retrieval Elements (CREs) (more on next slides)
Coherent Retrieval Elements
Definition:
A Coherent Retrieval Element (CRE) is an element that contains at least two matching elements (extracted by eXist), or at least two other Coherent Retrieval Elements, or a combination of a matching element and a Coherent Retrieval Element.
In plain words:
The list of matching elements, extracted by eXist, is a document-ordered list (see Table 1 on the next slide). The list is processed by considering a pair of elements, starting from the first element down to the last. In each step, a CRE is identified as the most specific ancestor of the two matching elements that constitute this pair.
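The pairwise pass described above can be sketched as follows. This is a simplified illustration, not the retrieval module itself: the function names and the example paths are hypothetical, and it covers only the step of taking the most specific common ancestor of each adjacent pair in the document-ordered list.

```python
def most_specific_ancestor(path_a: str, path_b: str) -> str:
    """Deepest element that is a proper ancestor of both absolute paths."""
    steps_a = path_a.strip("/").split("/")
    steps_b = path_b.strip("/").split("/")
    # Stop one step short of the shorter path so the result is an ancestor
    # of both elements rather than one of the elements themselves.
    limit = min(len(steps_a), len(steps_b)) - 1
    shared = []
    for a, b in zip(steps_a[:limit], steps_b[:limit]):
        if a != b:
            break
        shared.append(a)
    return "/" + "/".join(shared)


def identify_cres(matching_paths):
    """Walk adjacent pairs of a document-ordered list of matching elements."""
    cres = []
    for first, second in zip(matching_paths, matching_paths[1:]):
        cre = most_specific_ancestor(first, second)
        if cre != "/" and cre not in cres:
            cres.append(cre)
    return cres


# Hypothetical matching elements, in document order.
matches = [
    "/article[1]/bdy[1]/sec[2]/ss1[1]/p[1]",
    "/article[1]/bdy[1]/sec[2]/ss1[1]/p[2]",
    "/article[1]/bdy[1]/sec[2]/p[5]",
    "/article[1]/bdy[1]/sec[4]/p[1]",
]
for cre in identify_cres(matches):
    print(cre)
```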
Ranking the CREs
To determine the final ranks of CREs, the retrieval module uses a combination of the following heuristics:
    The number of times a CRE appears in the absolute path of each extracted element in the eXist list of matching elements - more matches (M) or fewer matches (m)
    The length of the absolute path of the CRE, taken from the root element - longer path (P) or shorter path (p)
    The ordering of the XPath sequence in the absolute path of the CRE - nearer to beginning (B) or nearer to end (E)
For the INEX 2003 test set, MpE yields the best performance, although PME is more suitable for some metrics
Ranking the CREs…
Article         Answer element                     Matches   Length   Sequence
ic/1999/w4095   /article[1]                             12        1   1
ic/1999/w4095   /article[1]/bdy[1]                       9        2   11
ic/1999/w4095   /article[1]/bdy[1]/sec[4]                4        3   114
ic/1999/w4095   /article[1]/bdy[1]/sec[2]                5        3   112
ic/1999/w4095   /article[1]/bm[1]/app[1]                 3        3   111
ic/1999/w4095   /article[1]/bdy[1]/sec[2]/ss1[1]         2        4   1121
ic/1999/w4095   /article[1]/bm[1]/app[1]/sec[2]          2        4   1112
Table 2. Ranked list of Coherent Retrieval elements (using the MpE heuristic)
Ranking the CREs…
Table 3. Ranked list of Coherent Retrieval elements (using the PME heuristic)
Article         Answer element                     Matches   Length   Sequence
ic/1999/w4095   /article[1]/bdy[1]/sec[2]/ss1[1]         2        4   1121
ic/1999/w4095   /article[1]/bm[1]/app[1]/sec[2]          2        4   1112
ic/1999/w4095   /article[1]/bdy[1]/sec[2]                5        3   112
ic/1999/w4095   /article[1]/bdy[1]/sec[4]                4        3   114
ic/1999/w4095   /article[1]/bm[1]/app[1]                 3        3   111
ic/1999/w4095   /article[1]/bdy[1]                       9        2   11
ic/1999/w4095   /article[1]                             12        1   1
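One possible reading of the PME combination is a lexicographic sort: longer path first (P), then more matches (M), then the sequence nearer to the end (E). The sketch below applies that reading to the seven rows shown above and arrives at the order of Table 3. Treating the heuristics as sort keys applied in this order is an assumption, as are the helper names; the slides do not say how the module actually combines them, and the sketch is checked against the PME table only.

```python
import re

# (answer element, matches) for article ic/1999/w4095, taken from the tables.
ROWS = [
    ("/article[1]", 12),
    ("/article[1]/bdy[1]", 9),
    ("/article[1]/bdy[1]/sec[4]", 4),
    ("/article[1]/bdy[1]/sec[2]", 5),
    ("/article[1]/bm[1]/app[1]", 3),
    ("/article[1]/bdy[1]/sec[2]/ss1[1]", 2),
    ("/article[1]/bm[1]/app[1]/sec[2]", 2),
]


def path_length(path: str) -> int:
    """Number of steps in an absolute path such as /article[1]/bdy[1]."""
    return path.count("/")


def sequence(path: str) -> tuple:
    """Positional indices along the path, e.g. (1, 1, 2) for .../sec[2]."""
    return tuple(int(n) for n in re.findall(r"\[(\d+)\]", path))


def rank_pme(rows):
    """Assumed PME order: longer path, then more matches, then later sequence."""
    return sorted(rows,
                  key=lambda row: (path_length(row[0]), row[1], sequence(row[0])),
                  reverse=True)


for path, matches in rank_pme(ROWS):
    print(path, matches)
```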
Runs
Four runs: automatic, title-only
    Zettair_BM25, using Zettair with Okapi BM25 similarity measure
    Zettair_PCosine, using Zettair with Pivoted Cosine similarity measure
    Hybrid_MpE, using the hybrid system with MpE heuristic combination
    Hybrid_PME, using the hybrid system with PME heuristic combination
The two hybrid runs use Zettair with Pivoted Cosine similarity measure
We use each of the above runs in each topic category (except ECCAS), resulting in 12 runs in total*
* Our official INEX 2004 submission had 9 runs, since Hybrid_MpE was not initially considered
Efficiency Results
The following efficiency results apply for Zettair only
HET collection indexed on a single $2000 Intel P4 machine
    808884 documents, 1.14 GB of text
    5 minutes to index, at 230 MB/minute
    10 milliseconds per query to search (on average)
No stopping or stemming
Limited accumulators with “continue” strategy
Interesting statistics:
    Full text index size, with full word positions, was 38.4% of the collection size (438.5 MB)
    Distinct terms: 1.94 million
    Term occurrences: 1.06 billion
Efficiency Results…
Detailed statistics (per collection):
Collection      Size (MB)   Index size (MB)   Index time (sec)   Distinct terms
QMULDCSDBPub         1.2              0.73               0.43             8816
BibDBPub             2.4              1.2                0.67            19200
HCIBIB              32.4             16.8                5.26           115321
Berkeley            34.4              9.2                4.49           126761
DBLP               239.3            128.1               65.98          1021698
CompuScience       338.6            128.8               89.95           389266
IEEE               494.5            162                102.63           694894
Het Collection    1142.8            438.5              282.12          1935988
Effectiveness Results
The following results consider the IEEE collection only
                   CO Topics          CCAS Topics
Run                MAP      P@10      MAP      P@10
Zettair_BM25       0.0123   0.0875    0.0771   0.1500
Zettair_PCosine    0.0122   0.0500    0.0887   0.1667
Hybrid_MpE         0.0420   0.0875    0.1251   0.1167
Hybrid_PME         0.0227   0.0625    0.0484   0.0500
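For readers unfamiliar with the two column headings, the sketch below shows the textbook definitions of precision at 10 (P@10) and average precision, whose mean over topics gives MAP. It assumes binary relevance and a flat ranked list; the official INEX 2004 evaluation may quantise relevance differently, so this is not claimed to be the procedure behind the figures in the table. The function names and the toy data are hypothetical.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant result."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(topics):
    """topics is a list of (ranked list, relevant set) pairs, one per topic."""
    return sum(average_precision(r, rel) for r, rel in topics) / len(topics)


# Toy example: relevant items appear at ranks 1 and 3.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "c"}
print(precision_at_k(ranked, relevant, k=2))
print(average_precision(ranked, relevant))
```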
Effectiveness Results…
Quantitative, rather than qualitative, analysis for the IEEE collection (although we will perform a detailed qualitative, query-and-run oriented analysis once Het relevance assessments are ready)
With P@10 for the IEEE collection, the hybrid runs are (on average) NOT substantially better than the full text runs
CO topics
    Okapi better than Pivoted Cosine
    MpE heuristic better than PME heuristic
    Hybrid_MpE is best, although with P@10 Zettair_BM25 is competitive
CCAS topics
    Pivoted Cosine better than Okapi
    MpE heuristic (again) better than PME heuristic
    Hybrid_MpE is best (with MAP), but Zettair_PCosine is best (with P@10)
With P@10, for either the CO or the CCAS topic type, the best Zettair run is equal to or better than the best Hybrid run
Final Thoughts
Four very different runs, exploring different similarity measures and retrieval heuristics (Okapi BM25 versus Pivoted Cosine, MpE heuristic versus PME heuristic)
Surprises in the results
    Plain full-text search engine very competitive
    More evaluation and follow up after INEX 2004
Research questions
For CO queries, what methods are feasible for determining elements that would be reasonable answers?
The MpE heuristic in the CRE module appears to be a feasible method
Should the data be organised (and indexed) as a single heterogeneous collection, or is it better to treat this collection as a set of homogeneous sub-collections?
Indexing the data as a single heterogeneous collection appears to be both an efficient and an effective choice