Post on 02-Jan-2016
1
Flexible and Efficient Toolbox for Information Retrieval
MIRACLE group
José Miguel Goñi-Menoyo (UPM)
José Carlos González-Cristóbal (UPM-Daedalus)
Julio Villena-Román (UC3M-Daedalus)
2
Our approach
New Year’s Resolution: work with all languages in CLEF
  adhoc, image, web, geo, iclef, qa…
Wish list:
  Language-dependent stuff
  Language-independent stuff
  Versatile combination
  Fast
  Simple for non computer scientists
  Not to reinvent the wheel again every year!
Approach: Toolbox for information retrieval
3
Agenda
Toolbox
2005 Experiments
2005 Results
2006 Homework
4
Toolbox Basics
Toolbox made of small one-function tools
Processing as a pipeline (borrowed from Unix):
  Each tool combination leads to a different run approach
Shallow I/O interfaces:
  tools in several programming languages (C/C++, Java, Perl, PHP, Prolog…),
  with different design approaches, and
  from different sources (own development, downloading, …)
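The pipeline idea can be illustrated with a minimal Python sketch. The tool names (`tokenize`, `lowercase`, `make_stop_filter`) and the `pipeline` helper are hypothetical and are not the MIRACLE tools themselves; the only point is that each tool does one thing and that a different combination of tools yields a different run.

```python
import re
from typing import Callable, Iterable, List

# A "tool" is any function from a list of strings to a list of strings.
Tool = Callable[[List[str]], List[str]]


def tokenize(items: List[str]) -> List[str]:
    """Split each input string into word tokens."""
    return [tok for item in items for tok in re.findall(r"\w+", item)]


def lowercase(items: List[str]) -> List[str]:
    """Lowercase every token."""
    return [tok.lower() for tok in items]


def make_stop_filter(stop_words: Iterable[str]) -> Tool:
    """Build a tool that drops the given stop-words."""
    stops = set(stop_words)

    def stop_filter(items: List[str]) -> List[str]:
        return [tok for tok in items if tok not in stops]

    return stop_filter


def pipeline(*tools: Tool) -> Tool:
    """Chain tools so the output of one is the input of the next."""

    def run(items: List[str]) -> List[str]:
        for tool in tools:
            items = tool(items)
        return items

    return run


# Two tool combinations, hence two different run approaches.
keep_stops = pipeline(tokenize, lowercase)
drop_stops = pipeline(tokenize, lowercase, make_stop_filter({"the", "of"}))

print(drop_stops(["The history of the toolbox"]))  # → ['history', 'toolbox']
```

Because every tool shares the same shallow interface, a tool written in another language could be slotted in as long as it reads and writes the same kind of token stream.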
5
MIRACLE Tools
Tokenizer:
  pattern matching
    isolate punctuation
    split sentences, paragraphs, passages
  identifies some entities
    compounds, numbers, initials, abbreviations, dates
  extracts indexing terms
  own-development (written in Perl) or “outsourced”
Proper noun extraction
  Naive algorithm: Uppercase words unless stop-word, stop-clef or verb/adverb
Stemming: generally “outsourced”
Transforming tools: lowercase, accents and diacritical characters are normalized, transliteration
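The naive proper noun rule above might look like the following Python sketch. The function name and the contents of the exclusion set are hypothetical. "Stop-clefs" are taken here to mean words that recur in CLEF topic statements, which is an assumption, and the verb/adverb check is reduced to a plain word list because the slide does not say which linguistic resource was used.

```python
import re
from typing import List, Set


def extract_proper_nouns(text: str, excluded: Set[str]) -> List[str]:
    """Return capitalized words that are not in the exclusion set.

    `excluded` stands in for the union of stop-words, stop-clefs and
    known verbs/adverbs, all stored in lowercase.
    """
    # [^\W\d_]+ matches runs of letters, including accented ones.
    words = re.findall(r"[^\W\d_]+", text)
    return [w for w in words if w[0].isupper() and w.lower() not in excluded]


# Hypothetical exclusion list for an English topic.
excluded_words = {"find", "documents", "relevant"}
topic = "Find documents about elections in Bulgaria and Hungary"
print(extract_proper_nouns(topic, excluded_words))  # → ['Bulgaria', 'Hungary']
```

The exclusion list matters mostly for sentence-initial words such as "Find", which are capitalized without being proper nouns.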
6
More MIRACLE Tools
Filtering tools:
  stop-words and stop-clefs
  phrase pattern filter (for topics)
Automatic translation issues: “outsourced” to available on-line resources or desktop applications
  Bultra (En→Bu)
  Webtrance (En→Bu)
  AutTrans (Es→Fr, Es→Pt)
  MoBiCAT (En→Hu)
  Systran
  BabelFish Altavista
  Babylon
  FreeTranslation
  Google Language Tools
  InterTrans
  WordLingo
  Reverso
Semantic expansion
  EuroWordNet
  own resources for Spanish
The philosopher's stone: indexing and retrieval system
7
Indexing and Retrieval System
Implements boolean, vectorial and probabilistic BM25 retrieval models
  Only BM25 was used in CLEF 2005
  Only OR operator was used for terms
Native support for UTF-8 (and other) encodings
  No transliteration scheme is needed
  Good results for Bulgarian
More efficiency achieved than with previous engines
  Several orders of magnitude in indexing time
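As a rough illustration of BM25 scoring with OR semantics, here is a small Python sketch. It is not the MIRACLE engine: the function name, the parameter values `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all assumptions, since the slides do not give the exact formula or settings.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_scores(query: List[str], docs: Dict[str, List[str]],
                k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document that contains at least one query term (OR)."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in docs.values() for term in set(tokens))

    scores: Dict[str, float] = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # OR semantics: missing terms add nothing
            # Assumed IDF variant that never goes negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scores[doc_id] = score
    return scores


collection = {
    "d1": ["bulgarian", "election", "results"],
    "d2": ["hungarian", "election"],
    "d3": ["weather", "report"],
}
ranking = bm25_scores(["bulgarian", "election"], collection)
for doc_id in sorted(ranking, key=ranking.get, reverse=True):
    print(doc_id, round(ranking[doc_id], 3))
```

With the OR operator a document matching only one query term is still retrieved, but it ranks below documents that match more or rarer terms.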
8
Trie-based index
calm, cast, coating, coat, money, monk, month
9
1st course implementation: linked arrays
calm, cast, coating, coat, money, monk, month
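A textbook "linked arrays" trie can be sketched in Python as follows, using the example words from the slide. The class and function names are hypothetical, and a lowercase a-z alphabet is assumed. Each node holds one cell per letter, so the sketch also counts how many cells end up used and how many stay empty.

```python
from typing import List, Optional, Tuple

ALPHABET_SIZE = 26  # assumption: lowercase a-z only


class ArrayTrieNode:
    """One node = one array with a cell for every letter of the alphabet."""

    def __init__(self) -> None:
        self.children: List[Optional["ArrayTrieNode"]] = [None] * ALPHABET_SIZE
        self.is_word = False


def insert(root: ArrayTrieNode, word: str) -> None:
    node = root
    for ch in word:
        idx = ord(ch) - ord("a")
        if node.children[idx] is None:
            node.children[idx] = ArrayTrieNode()
        node = node.children[idx]
    node.is_word = True


def contains(root: ArrayTrieNode, word: str) -> bool:
    node = root
    for ch in word:
        node = node.children[ord(ch) - ord("a")]
        if node is None:
            return False
    return node.is_word


def count_cells(root: ArrayTrieNode) -> Tuple[int, int]:
    """Return (used cells, empty cells) over all nodes of the trie."""
    used = empty = 0
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            if child is None:
                empty += 1
            else:
                used += 1
                stack.append(child)
    return used, empty


array_trie = ArrayTrieNode()
for w in ["calm", "cast", "coating", "coat", "money", "monk", "month"]:
    insert(array_trie, w)

used_cells, empty_cells = count_cells(array_trie)
print(used_cells, empty_cells)
```

Lookups cost one array access per character, but nearly all allocated cells stay empty for a word list this sparse.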
10
Efficient tries: avoiding empty cells
abacus, abet, ace, baby, be, beach, bee
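The slides do not say which technique the engine uses to avoid empty cells. One common option, shown here purely as an assumption, is to store only the children that exist, for example in a dictionary per node. The class and method names below are hypothetical.

```python
from typing import Dict, List, Optional


class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self) -> None:
        # Only existing children are stored, so no cell is ever empty.
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


class SparseTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _find(self, prefix: str) -> Optional[TrieNode]:
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def __contains__(self, word: str) -> bool:
        node = self._find(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix: str) -> List[str]:
        """Return all stored words starting with `prefix`, in sorted order."""
        results: List[str] = []

        def walk(node: TrieNode, path: str) -> None:
            if node.is_word:
                results.append(path)
            for ch in sorted(node.children):
                walk(node.children[ch], path + ch)

        start = self._find(prefix)
        if start is not None:
            walk(start, prefix)
        return results

    def stored_cells(self) -> int:
        """Count child links; every one of them is in use."""
        total, stack = 0, [self.root]
        while stack:
            node = stack.pop()
            total += len(node.children)
            stack.extend(node.children.values())
        return total


sparse_trie = SparseTrie()
for w in ["abacus", "abet", "ace", "baby", "be", "beach", "bee"]:
    sparse_trie.insert(w)

print(sparse_trie.with_prefix("be"))  # → ['be', 'beach', 'bee']
```

The trade-off is a hash lookup per character instead of a direct array access, in exchange for memory that grows with the number of stored prefixes rather than with the alphabet size.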
11
Basic Experiments
S: Standard sequence (tokenization, filtering, stemming, transformation)
N: Non stemming
R: Use of narrative field in topics
T: Ignore narrative field
r1: Pseudo-relevance feedback (with 1st retrieved document)
P: Proper noun extraction (in topics)
SR, ST, r1SR, NR, NT, NP
12
Paragraph indexing
H: Paragraph indexing
  docpars (document paragraphs) are indexed instead of docs
  term → doc1#1, doc69#5 …
  combination of docpars relevance:
    $\mathrm{rel}_N = \mathrm{rel}_{mN} + \frac{\alpha}{n} \sum_{j \neq m} \mathrm{rel}_{jN}$
    $n$ = paragraphs retrieved for doc $N$
    $\mathrm{rel}_{jN}$ = relevance of paragraph $j$ of doc $N$
    $m$ = paragraph with maximum relevance
    $\alpha$ = 0.75 (experimental)
HR, HT
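The combination rule for paragraph relevance can be sketched in Python as below. The function name is hypothetical, and paragraph identifiers are assumed to follow the `doc#paragraph` pattern shown on the slide. The document score is the best paragraph score plus α/n times the sum of the remaining retrieved paragraphs.

```python
from collections import defaultdict
from typing import Dict, List


def combine_paragraph_scores(par_scores: Dict[str, float],
                             alpha: float = 0.75) -> Dict[str, float]:
    """Fold paragraph-level relevance into one relevance value per document."""
    by_doc: Dict[str, List[float]] = defaultdict(list)
    for par_id, score in par_scores.items():
        doc_id = par_id.partition("#")[0]  # "doc1#3" -> "doc1"
        by_doc[doc_id].append(score)

    combined: Dict[str, float] = {}
    for doc_id, scores in by_doc.items():
        n = len(scores)            # paragraphs retrieved for this doc
        best = max(scores)         # rel of paragraph m
        rest = sum(scores) - best  # sum over j != m
        combined[doc_id] = best + alpha / n * rest
    return combined


retrieved = {"doc1#1": 4.0, "doc1#3": 2.0, "doc69#5": 3.0}
print(combine_paragraph_scores(retrieved))
# doc1: 4.0 + 0.75 / 2 * 2.0 = 4.75, doc69: 3.0
```

A document with a single retrieved paragraph keeps that paragraph's score unchanged, while further matching paragraphs add a damped bonus.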
13
Combined experiments
“Democratic system”: documents with good score in many experiments are likely to be relevant
a: Average:
  Merging of several experiments, adding relevance
x: WDX - asymmetric combination of two experiments:
  First (more relevant) non-weighted D documents from run A
  Rest of documents from run A, with W weight
  All documents from run B, with X weight
  Relevance re-sorting
  Mostly used for combining base runs with proper nouns runs
aHRSR, aHTST, xNP01HR1, xNP01r1SR1
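Both combination schemes can be sketched in a few lines of Python. The function names are hypothetical, and one detail is an assumption: the slide does not say how a document found in both runs is scored under WDX, so the sketch simply adds its two weighted scores.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Run = Dict[str, float]  # document id -> relevance


def _ranked(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort by decreasing relevance, breaking ties by document id."""
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def average_runs(runs: List[Run]) -> List[Tuple[str, float]]:
    """'a' combination: add relevance across runs, then average."""
    total: Dict[str, float] = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            total[doc_id] += score
    return _ranked({doc_id: s / len(runs) for doc_id, s in total.items()})


def wdx(run_a: Run, run_b: Run, w: float, d: int, x: float) -> List[Tuple[str, float]]:
    """'x' combination: top D docs of A unweighted, rest of A times W, all of B times X."""
    combined: Dict[str, float] = defaultdict(float)
    for rank, (doc_id, score) in enumerate(_ranked(run_a)):
        combined[doc_id] += score if rank < d else w * score
    for doc_id, score in run_b.items():
        combined[doc_id] += x * score  # assumption: overlapping docs are summed
    return _ranked(combined)  # relevance re-sorting


base_run = {"d1": 1.0, "d2": 0.8, "d3": 0.6}
proper_noun_run = {"d3": 1.0, "d4": 0.5}
print([doc for doc, _ in wdx(base_run, proper_noun_run, w=0.5, d=1, x=1.0)])
# → ['d3', 'd1', 'd4', 'd2']
```

In the example, `d3` climbs to the top because it scores in both runs, which is the "democratic" effect the slide describes.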
14
Multilingual merging
Standard approaches for merging:
  No normalization and relevance re-sorting
  Standard normalization and relevance re-sorting
  Min-max normalization and relevance re-sorting
Miracle approach for merging:
  The number of docs selected from a collection (language) is proportional to the average relevance of its first N docs (N=1, 10, 50, 125, 250, 1000). Then one of the standard approaches is used
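The merging strategy might be sketched as follows in Python. The function names are hypothetical, and several details are assumptions because the slide leaves them open: quotas are rounded down, the average is taken over raw scores, and min-max normalization is the standard approach applied afterwards.

```python
from typing import Dict, List, Tuple

Run = Dict[str, float]  # document id -> relevance


def min_max(run: Run) -> Run:
    """Rescale scores to [0, 1]; a constant run maps to 1.0."""
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in run}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in run.items()}


def quotas(runs: Dict[str, Run], n_first: int, total: int) -> Dict[str, int]:
    """Docs per language, proportional to the mean score of its first N docs."""
    averages = {}
    for lang, run in runs.items():
        top = sorted(run.values(), reverse=True)[:n_first]
        averages[lang] = sum(top) / len(top)
    norm = sum(averages.values())
    # Assumption: fractional quotas are rounded down.
    return {lang: int(total * avg / norm) for lang, avg in averages.items()}


def merge(runs: Dict[str, Run], n_first: int, total: int) -> List[Tuple[str, float]]:
    """Select each language's quota, normalize, then re-sort by relevance."""
    quota = quotas(runs, n_first, total)
    merged: List[Tuple[str, float]] = []
    for lang, run in runs.items():
        ranked = sorted(min_max(run).items(), key=lambda kv: (-kv[1], kv[0]))
        merged.extend(ranked[:quota[lang]])
    return sorted(merged, key=lambda kv: (-kv[1], kv[0]))


per_language = {
    "en": {"en1": 10.0, "en2": 8.0, "en3": 2.0},
    "fr": {"fr1": 5.0, "fr2": 4.0, "fr3": 1.0},
}
print(quotas(per_language, n_first=2, total=3))  # → {'en': 2, 'fr': 1}
print([doc for doc, _ in merge(per_language, n_first=2, total=3)])
```

A collection whose top documents score higher on average contributes more documents to the merged list, and the chosen normalization decides how the selected documents interleave.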
15
Results
We performed…
… countless experiments!
(just for the adhoc task)
16
Monolingual Bulgarian
Stemmer (UTF-8): Neuchâtel
Rank: 4th
17
Bilingual English→Bulgarian
(83% monolingual)
En→Bu: Bultra, Webtrance
Rank: 1st
18
Monolingual Hungarian
Stemmer: Neuchâtel
Rank: 3rd
19
Bilingual English→Hungarian
(87% monolingual)
En→Hu: MoBiCAT
Rank: 1st
20
Monolingual French
Stemmer: Snowball
Rank: >5th
21
Bilingual English→French
(79% monolingual)
En→Fr: Systran
Rank: 5th
22
Bilingual Spanish→French
(81% monolingual)
Es→Fr: ATrans, Systran
(Rank: 5th)
23
Monolingual Portuguese
Stemmer: Snowball
Rank: >5th (4th)
24
Bilingual English→Portuguese
(55% monolingual)
En→Pt: Systran
Rank: 3rd
25
Bilingual Spanish→Portuguese
(88% monolingual)
Es→Pt: ATrans
(Rank: 2nd)
26
Multilingual-8 (En, Es, Fr)
Rank: 2nd [Fr, En], 3rd [Es]
27
Conclusions and homework
Toolbox = “imagination is the limit”
  Focus on interesting linguistic things instead of boring text manipulation
  Reusability (half of the work is done for next year!)
Keys for good results:
  Fast IR engine is essential
  Native character encoding support
  Topic narrative
  Good translation engines make the difference
Homework:
  further development on system modules, fine tuning
  Spanish, French, Portuguese…