Flexible and Efficient Toolbox for Information Retrieval MIRACLE group

José Miguel Goñi-Menoyo (UPM)
José Carlos González-Cristóbal (UPM-Daedalus)
Julio Villena-Román (UC3M-Daedalus)

Our approach
New Year’s Resolution: work with all languages in CLEF
  adhoc, image, web, geo, iclef, qa…
Wish list:
  Language-dependent stuff
  Language-independent stuff
  Versatile combination
  Fast
  Simple for non computer scientists
Not to reinvent the wheel again every year!
Approach: Toolbox for information retrieval

Agenda
  Toolbox
  2005 Experiments
  2005 Results
  2006 Homework

Toolbox Basics
Toolbox made of small one-function tools
Processing as a pipeline (borrowed from Unix):
  Each tool combination leads to a different run approach
Shallow I/O interfaces: tools in several programming languages (C/C++, Java, Perl, PHP, Prolog…), with different design approaches, and from different sources (own development, downloading, …)
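The pipeline idea can be illustrated with a minimal Python sketch. This is not the MIRACLE code: the tool names (`tokenize`, `lowercase`, `drop_stopwords`), the `pipeline` helper and the tiny stop-word set are all hypothetical, chosen only to show how different tool combinations yield different runs.

```python
from functools import reduce
from typing import Callable, List, Set

# A "tool" is assumed to be any function from a token list to a token list.
Tool = Callable[[List[str]], List[str]]


def tokenize(text: str) -> List[str]:
    """Hypothetical tokenizer: split on whitespace only."""
    return text.split()


def lowercase(tokens: List[str]) -> List[str]:
    """Transforming tool: lowercase every token."""
    return [t.lower() for t in tokens]


def drop_stopwords(stopwords: Set[str]) -> Tool:
    """Build a filtering tool for a given (illustrative) stop-word set."""
    def tool(tokens: List[str]) -> List[str]:
        return [t for t in tokens if t not in stopwords]
    return tool


def pipeline(*tools: Tool) -> Tool:
    """Chain tools left to right, like a Unix pipe."""
    def run(tokens: List[str]) -> List[str]:
        return reduce(lambda acc, tool: tool(acc), tools, tokens)
    return run


# Two different tool combinations give two different run approaches.
run_a = pipeline(lowercase, drop_stopwords({"the", "of"}))
run_b = pipeline(lowercase)
```

Because every tool shares the same shallow interface (tokens in, tokens out), adding or removing a step changes the run without touching the other tools.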

MIRACLE Tools
Tokenizer:
  pattern matching
  isolate punctuation
  split sentences, paragraphs, passages
  identifies some entities: compounds, numbers, initials, abbreviations, dates
  extracts indexing terms
  own-development (written in Perl) or “outsourced”
Proper noun extraction
  Naive algorithm: Uppercase words unless stop-word, stop-clef or verb/adverb
Stemming: generally “outsourced”
Transforming tools: lowercase, accents and diacritical characters are normalized, transliteration
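As a rough illustration of two of these tools, the sketch below implements the naive proper noun rule and a lowercase-plus-diacritics normalization in Python. The original tools were written in Perl or outsourced, so everything here is an assumption: the function names are hypothetical, and the `excluded` set stands in for the stop-word, stop-clef and verb/adverb lists.

```python
import unicodedata
from typing import Iterable, List, Set


def strip_diacritics(text: str) -> str:
    """Remove accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(token: str) -> str:
    """Transforming tool: strip diacritics, then lowercase."""
    return strip_diacritics(token).lower()


def proper_nouns(tokens: Iterable[str], excluded: Set[str]) -> List[str]:
    """Naive rule: keep capitalized tokens unless their lowercase form is excluded.

    `excluded` is a hypothetical merged list of stop-words, stop-clefs,
    verbs and adverbs (lowercased).
    """
    return [t for t in tokens if t[:1].isupper() and t.lower() not in excluded]
```

For example, `normalize("Goñi-Menoyo")` gives `"goni-menoyo"`, and a sentence-initial "The" is dropped from the proper noun candidates as long as "the" is in the excluded set.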

More MIRACLE Tools
Filtering tools:
  stop-words and stop-clefs
  phrase pattern filter (for topics)
Automatic translation issues: “outsourced” to available on-line resources or desktop applications
  Bultra (En→Bu)
  Webtrance (En→Bu)
  AutTrans (Es→Fr, Es→Pt)
  MoBiCAT (En→Hu)
  Systran
  BabelFish Altavista
  Babylon
  FreeTranslation
  Google Language Tools
  InterTrans
  WordLingo
  Reverso
Semantic expansion
  EuroWordNet
  own resources for Spanish
The philosopher's stone: indexing and retrieval system
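The filtering tools on this slide can be sketched as follows. The slide does not give the actual patterns or word lists, so the regular expressions and the `STOP_WORDS` and `STOP_CLEFS` sets below are invented for illustration; the sketch assumes that stop-clefs are words typical of topic statements and that the phrase pattern filter removes boilerplate phrases from topics.

```python
import re
from typing import List

# Hypothetical boilerplate phrases found in topic statements.
PHRASE_PATTERNS = [
    re.compile(r"\bfind documents (about|on|that)\b", re.IGNORECASE),
    re.compile(r"\brelevant documents (will|should)\b", re.IGNORECASE),
]

# Illustrative lists only; real lists would be language-specific and much longer.
STOP_WORDS = {"the", "of", "in", "a"}
STOP_CLEFS = {"relevant", "documents", "report", "discuss"}


def filter_topic(text: str) -> List[str]:
    """Remove boilerplate phrases, then drop stop-words and stop-clefs."""
    for pattern in PHRASE_PATTERNS:
        text = pattern.sub(" ", text)
    tokens = re.findall(r"\w+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS and t not in STOP_CLEFS]
```

With these assumed lists, the topic "Find documents about the floods in Europe" is reduced to the terms "floods" and "europe".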

Indexing and Retrieval System
Implements boolean, vectorial and probabilistic BM25 retrieval models
  Only BM25 was used in CLEF 2005
  Only OR operator was used for terms
Native support for UTF-8 (and other) encodings
  No transliteration scheme is needed
  Good results for Bulgarian
More efficiency achieved than with previous engines
  Several orders of magnitude in indexing time
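A minimal sketch of BM25 ranking with OR semantics is given below for orientation. It is not the MIRACLE engine: the parameter values `k1=1.2` and `b=0.75`, the particular IDF variant and the in-memory data layout are all assumptions, since the slides do not specify them.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bm25_rank(query: List[str], docs: Dict[str, List[str]],
              k1: float = 1.2, b: float = 0.75) -> List[Tuple[str, float]]:
    """Rank documents by BM25; a document matches if it has ANY query term (OR)."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for tokens in docs.values():
        df.update(set(tokens))

    scores: Dict[str, float] = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue  # OR semantics: missing terms simply add nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Documents that share no term with the query never enter the result list, while documents matching more (or rarer) query terms rise to the top.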

Trie-based index
  Example words: calm, cast, coating, coat, money, monk, month

1st coarse implementation: linked arrays
  Example words: calm, cast, coating, coat, money, monk, month

Efficient tries: avoiding empty cells
  Example words: abacus, abet, ace, baby, be, beach, bee
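To make the data structure concrete, here is a small trie sketch in Python. It does not reproduce the linked-array layout of the MIRACLE index; it uses a dictionary per node, which stores only the children that actually exist and is therefore one simple way of avoiding empty cells. The class and method names are hypothetical.

```python
from typing import Dict, List, Optional


class TrieNode:
    """One node per prefix; only existing child letters are stored."""
    __slots__ = ("children", "is_word")

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_word = False


class Trie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, text: str) -> Optional[TrieNode]:
        """Follow `text` from the root; return None if the path does not exist."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word: str) -> bool:
        node = self._walk(word)
        return node is not None and node.is_word

    def with_prefix(self, prefix: str) -> List[str]:
        """Return all stored words starting with `prefix`, sorted."""
        start = self._walk(prefix)
        if start is None:
            return []
        found, stack = [], [(start, prefix)]
        while stack:
            node, text = stack.pop()
            if node.is_word:
                found.append(text)
            for ch, child in node.children.items():
                stack.append((child, text + ch))
        return sorted(found)
```

Loading the example words from the first slide, "coat" and "coating" share the path c-o-a-t, and a prefix query for "mon" returns money, monk and month.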

Basic Experiments
S: Standard sequence (tokenization, filtering, stemming, transformation)
N: Non stemming
R: Use of narrative field in topics
T: Ignore narrative field
r1: Pseudo-relevance feedback (with 1st retrieved document)
P: Proper noun extraction (in topics)

SR, ST, r1SR, NR, NT, NP
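The r1 experiment uses only the first retrieved document for pseudo-relevance feedback. The slides do not say how expansion terms are chosen, so the sketch below makes an assumption: it adds the most frequent terms of the top document that are not already in the query. The function name and the `n_terms` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def expand_with_first_document(query: List[str], ranked_doc_ids: List[str],
                               docs: Dict[str, List[str]],
                               n_terms: int = 5) -> List[str]:
    """Append up to `n_terms` frequent terms from the top-ranked document."""
    if not ranked_doc_ids:
        return list(query)  # nothing retrieved, nothing to learn from
    top_tokens = docs[ranked_doc_ids[0]]
    candidates = Counter(t for t in top_tokens if t not in query)
    expansion = [term for term, _ in candidates.most_common(n_terms)]
    return list(query) + expansion
```

The expanded query is then run again through the same retrieval step to produce the final ranking.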

Paragraph indexing
H: Paragraph indexing
  docpars (document paragraphs) are indexed instead of docs
  term → doc1#1, doc69#5 …
  combination of docpars relevance:
    $rel_N = rel_{mN} + \frac{\alpha}{n} \sum_{j \neq m} rel_{jN}$
    $n$ = paragraphs retrieved for doc N
    $rel_{jN}$ = relevance of paragraph j of doc N
    $m$ = paragraph with maximum relevance
    $\alpha$ = 0.75 (experimental)

HR, HT
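The combination formula on this slide translates directly into code. In the sketch below, the function names and the `doc#paragraph` identifier parsing are illustrative assumptions; the formula itself and the value 0.75 come from the slide.

```python
from collections import defaultdict
from typing import Dict, List


def combine_docpars(paragraph_scores: List[float], alpha: float = 0.75) -> float:
    """Best paragraph score plus alpha/n times the sum of the other paragraphs."""
    if not paragraph_scores:
        raise ValueError("a document needs at least one retrieved paragraph")
    n = len(paragraph_scores)
    best = max(paragraph_scores)
    rest = sum(paragraph_scores) - best  # sum over j != m
    return best + alpha / n * rest


def score_documents(docpar_scores: Dict[str, float],
                    alpha: float = 0.75) -> Dict[str, float]:
    """Group 'doc#paragraph' identifiers by document and combine their scores."""
    by_doc: Dict[str, List[float]] = defaultdict(list)
    for docpar_id, score in docpar_scores.items():
        doc_id, _, _ = docpar_id.partition("#")
        by_doc[doc_id].append(score)
    return {doc_id: combine_docpars(scores, alpha) for doc_id, scores in by_doc.items()}
```

For a document with paragraph scores 4, 2 and 2, the result is 4 + (0.75 / 3) × 4 = 5, so the best paragraph dominates and the others contribute a damped bonus.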

Combined experiments
“Democratic system”: documents with good score in many experiments are likely to be relevant
a: Average:
  Merging of several experiments, adding relevance
x: WDX - asymmetric combination of two experiments:
  First (more relevant) non-weighted D documents from run A
  Rest of documents from run A, with W weight
  All documents from run B, with X weight
  Relevance re-sorting
  Mostly used for combining base runs with proper nouns runs

aHRSR, aHTST, xNP01HR1, xNP01r1SR1
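Both combination schemes can be sketched in a few lines of Python. Several details are assumptions not stated on the slide: runs are represented as dictionaries from document id to relevance, a document missing from a run contributes zero, the average divides the summed relevance by the number of runs, and in WDX a document present in both runs receives the sum of its two weighted scores.

```python
from typing import Dict, List, Tuple

Run = Dict[str, float]  # document id -> relevance score


def average_runs(runs: List[Run]) -> Run:
    """'a' combination: add relevance across runs, then divide by the run count."""
    totals: Run = {}
    for run in runs:
        for doc_id, score in run.items():
            totals[doc_id] = totals.get(doc_id, 0.0) + score
    return {doc_id: total / len(runs) for doc_id, total in totals.items()}


def wdx(run_a: Run, run_b: Run, w: float, d: int, x: float) -> List[Tuple[str, float]]:
    """'x' combination: top D docs of A unweighted, rest of A times W, all of B times X."""
    ranked_a = sorted(run_a.items(), key=lambda kv: kv[1], reverse=True)
    combined: Run = {}
    for position, (doc_id, score) in enumerate(ranked_a):
        weight = 1.0 if position < d else w
        combined[doc_id] = weight * score
    for doc_id, score in run_b.items():
        combined[doc_id] = combined.get(doc_id, 0.0) + x * score
    # Relevance re-sorting of the merged list.
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

The asymmetry protects the top of run A: its first D documents keep their full score, while run B (for example a proper nouns run) can only promote documents further down.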

Multilingual merging
Standard approaches for merging:
  No normalization and relevance re-sorting
  Standard normalization and relevance re-sorting
  Min-max normalization and relevance re-sorting
Miracle approach for merging:
  The number of docs selected from a collection (language) is proportional to the average relevance of its first N docs (N=1, 10, 50, 125, 250, 1000). Then one of the standard approaches is used
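The merging strategy can be sketched as below, pairing the proportional selection with min-max normalization as the standard approach. This is only an illustration under stated assumptions: each language run is a non-empty dictionary of positive scores, quotas are rounded to the nearest integer, and the names `quotas` and `merge` are hypothetical.

```python
from typing import Dict, List, Tuple

Run = Dict[str, float]  # document id -> relevance score


def min_max(run: Run) -> Run:
    """Rescale scores of one run to the range [0, 1]."""
    lo, hi = min(run.values()), max(run.values())
    span = hi - lo
    return {doc_id: (score - lo) / span if span else 0.0
            for doc_id, score in run.items()}


def quotas(runs: Dict[str, Run], n: int, total: int) -> Dict[str, int]:
    """Docs per language, proportional to the average score of its first n docs."""
    averages = {}
    for lang, run in runs.items():
        top = sorted(run.values(), reverse=True)[:n]
        averages[lang] = sum(top) / len(top)
    norm = sum(averages.values())
    return {lang: round(total * avg / norm) for lang, avg in averages.items()}


def merge(runs: Dict[str, Run], n: int, total: int) -> List[Tuple[str, float]]:
    """Select each language's quota, normalize with min-max, re-sort by relevance."""
    quota = quotas(runs, n, total)
    pool: List[Tuple[str, float]] = []
    for lang, run in runs.items():
        ranked = sorted(min_max(run).items(), key=lambda kv: kv[1], reverse=True)
        pool.extend(ranked[:quota[lang]])
    return sorted(pool, key=lambda kv: kv[1], reverse=True)[:total]
```

A collection whose top documents score well on average is thus allowed to contribute more documents to the merged list than a collection with weak top results.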

Results
We performed… countless experiments!
(just for the adhoc task)

Monolingual Bulgarian
  Stemmer (UTF-8): Neuchâtel
  Rank: 4th

Bilingual English→Bulgarian
  (83% monolingual)
  En→Bu: Bultra, Webtrance
  Rank: 1st

Monolingual Hungarian
  Stemmer: Neuchâtel
  Rank: 3rd

Bilingual English→Hungarian
  En→Hu: MoBiCAT
  Rank: 1st

Monolingual French
  Stemmer: Snowball
  Rank: >5th

Bilingual English→French
  En→Fr: Systran
  Rank: 5th

Bilingual Spanish→French
  Es→Fr: ATrans, Systran
  (Rank: 5th)

Monolingual Portuguese
  Stemmer: Snowball
  Rank: >5th (4th)

Bilingual English→Portuguese
  En→Pt: Systran
  Rank: 3rd

Bilingual Spanish→Portuguese
  Es→Pt: ATrans
  (Rank: 2nd)

Multilingual-8 (En, Es, Fr)
  Rank: 2nd [Fr, En], 3rd [Es]

Conclusions and homework
Toolbox = “imagination is the limit”
Focus on interesting linguistic things instead of boring text manipulation
Reusability (half of the work is done for next year!)
Keys for good results:
  Fast IR engine is essential
  Native character encoding support
  Topic narrative
  Good translation engines make the difference
Homework:
  further development on system modules, fine tuning
  Spanish, French, Portuguese…
