
CS344: Introduction to Artificial Intelligence

Date post: 23-Feb-2016
Description:
CS344: Introduction to Artificial Intelligence. Vishal Vachhani, M.Tech, CSE. Lecture 34-35: CLIR and Ranking in IR. Road map: cross-lingual IR (motivation, CLIA architecture, CLIA demo) and ranking (various ranking methods, Nutch/Lucene ranking, learning a ranking function).

Cross lingual Information Access

CS344: Introduction to Artificial Intelligence

Vishal Vachhani
M.Tech, CSE
Lecture 34-35: CLIR and Ranking in IR

Road Map
- Cross Lingual IR
  - Motivation
  - CLIA architecture
  - CLIA demo
- Ranking
  - Various ranking methods
  - Nutch/Lucene ranking
  - Learning a ranking function
  - Experiments and results

Cross Lingual IR

Motivation
- Information unavailability in some languages
- Language barrier

Definition: Cross-language information retrieval (CLIR) is a subfield of information retrieval dealing with retrieving information written in a language different from the language of the user's query (Wikipedia).

Example: A user may ask a query in Hindi but retrieve relevant documents written in English.

Why CLIR?

[Figure labels: Query in Tamil; System; search; English Document; Marathi Document; Snippet Generation and Translation]

Cross Lingual Information Access (CLIA)
- A web portal supporting monolingual and cross-lingual IR in 6 Indian languages and English
- Domain: Tourism
- It supports:
  - Summarization of web documents
  - Snippet translation into the query language
  - Template-based information extraction
- The CLIA system is publicly available at http://www.clia.iitb.ac.in/clia-beta-ext

CLIA Demo

Various Ranking Methods
- Vector space model
  - Lucene, Nutch, Lemur, etc.
- Probabilistic ranking model
  - Classical Spärck Jones ranking (log-odds ratio)
  - Language model
- Ranking using machine learning algorithms
  - SVM, learning to rank, SVM-MAP, etc.
- Link analysis based ranking
  - PageRank, Hubs and Authorities, OPIC, etc.
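To make the vector space model entry concrete, here is a minimal sketch of tf-idf weighting with cosine similarity. It is not Lucene's or Nutch's scoring formula; the idf variant (log(N/df) + 1), the function names, and the three toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (a dict) per tokenised document, plus the idf table."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed idf variant: log(N/df) + 1, so terms found in every document keep a non-zero weight.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = tfidf_vectors(docs)
    q = {t: c * idf.get(t, 0.0) for t, c in Counter(query).items()}
    scores = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda s: (-s[0], s[1]))

# Hypothetical tourism-flavoured toy collection.
docs = [
    "taj mahal agra tourism".split(),
    "goa beach tourism".split(),
    "cricket score update".split(),
]
ranking = rank("agra tourism".split(), docs)
```

With this toy collection the first document matches both query terms, the second matches one, and the third matches none, so it receives a score of zero.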

Nutch Ranking
- CLIA is built on top of Nutch, an open-source web search engine.
- It is based on the vector space model.

Link analysis
- Calculates the importance of pages using the web graph
  - Node: pages
  - Edge: hyperlinks between pages
- Motivation: a link analysis based score is hard to manipulate using spamming techniques
- Plays an important role in the web IR scoring function
  - PageRank
  - Hubs and Authorities
  - Online Page Importance Computation (OPIC)
- The link analysis score is used along with the tf-idf based score
- We use the OPIC score as a factor in CLIA.
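The cash-and-history idea behind OPIC can be sketched in a few lines. This is an illustrative reading of the general algorithm, not Nutch's scoring plugin and not the CLIA code. The round-robin visiting order, the handling of pages without out-links, and the toy graph are assumptions.

```python
def opic(out_links, rounds=200):
    """Estimate page importance with a simplified OPIC-style cash/history scheme.

    out_links maps each page to the list of pages it links to.
    """
    pages = sorted(out_links)
    n = len(pages)
    cash = {p: 1.0 / n for p in pages}      # every page starts with equal cash
    history = {p: 0.0 for p in pages}       # cash a page has collected and passed on so far
    for _ in range(rounds):
        for p in pages:                     # assumption: round-robin visits
            c = cash[p]
            history[p] += c                 # record the cash before giving it away
            cash[p] = 0.0
            # Assumption: a page with no out-links shares its cash with every page.
            targets = out_links[p] or pages
            for q in targets:
                cash[q] += c / len(targets)
    total = sum(history.values()) + 1.0     # the cash in circulation always sums to 1
    return {p: (history[p] + cash[p]) / total for p in pages}

# Hypothetical four-page web graph; "d" has no in-links.
graph = {"a": ["c"], "b": ["a", "c"], "c": ["a", "b"], "d": ["c"]}
scores = opic(graph)
```

In this graph page "c" collects links from every other page and ends up with the highest estimate, while "d", which nothing links to, falls towards zero as more rounds are run.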

Learning a ranking function
- How much weight should be given to the different parts of a web document when ranking?
- A ranking function can be learned using the following method
  - Machine learning algorithms: SVM, maximum entropy
  - Training
    - A set of queries, with some relevant and non-relevant documents for each query
    - A set of features to capture the similarity of documents and query
    - In short, learn the optimal weights of the features
  - Ranking
    - Use the trained model to generate a score by combining the feature scores for the set of documents in which the query words appear
    - Sort the documents by score and display them to the user
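The training and ranking steps above can be sketched with a linear model trained on document pairs. The slides name SVM and maximum entropy but do not give the training procedure, so the pairwise hinge loss with plain subgradient descent below is a stand-in (in the spirit of a ranking SVM), and the three feature columns and all values are made up for illustration.

```python
def train_pairwise(queries, n_features, epochs=50, lr=0.1, reg=0.01):
    """Learn feature weights so relevant docs outscore non-relevant docs of the same query.

    queries: one list per query of (feature_vector, is_relevant) pairs.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for docs in queries:
            pos = [f for f, rel in docs if rel]
            neg = [f for f, rel in docs if not rel]
            for p in pos:
                for q in neg:
                    diff = [a - b for a, b in zip(p, q)]
                    margin = sum(wi * di for wi, di in zip(w, diff))
                    # Subgradient step on reg/2 * |w|^2 + max(0, 1 - margin).
                    for i in range(n_features):
                        grad = reg * w[i] - (diff[i] if margin < 1.0 else 0.0)
                        w[i] -= lr * grad
    return w

def rank_documents(w, docs):
    """Score (name, feature_vector) pairs with the learned weights; best first."""
    scored = [(sum(wi * fi for wi, fi in zip(w, f)), name) for name, f in docs]
    return [name for _, name in sorted(scored, key=lambda s: -s[0])]

# Hypothetical features: [title match, body match, link score].
training = [
    [([1.0, 0.2, 0.1], True), ([0.0, 0.9, 0.3], False)],
    [([0.8, 0.5, 0.2], True), ([0.1, 0.6, 0.9], False), ([0.0, 0.1, 0.1], False)],
]
weights = train_pairwise(training, n_features=3)
order = rank_documents(weights, [("x", [0.9, 0.1, 0.1]), ("y", [0.1, 0.9, 0.2])])
```

In this toy data the relevant documents are the ones with a strong title match, so the title feature receives the largest weight and document "x" is ranked ahead of "y".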

Extended Features for Web IR
- Content based features
  - tf, IDF, length, co-ord, etc.
- Link analysis based features
  - OPIC score
  - Domain based OPIC score
- Standard IR algorithm based features
  - BM25 score
  - Lucene score
  - LM based score
- Language categories based features
  - Named entity
  - Phrase based features
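As an illustration of one of the standard IR features listed above, here is a minimal sketch of Okapi BM25. The slides do not say which BM25 variant or parameters were used; the values k1 = 1.2 and b = 0.75 and the non-negative idf form are common choices assumed here, and the documents are invented.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per tokenised document for the given query terms."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            # Assumed idf form: log((N - df + 0.5) / (df + 0.5) + 1), which is never negative.
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            # Length normalisation: longer-than-average documents are penalised via b.
            norm = tf[t] + k1 * (1.0 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1.0) / norm
        scores.append(s)
    return scores

# Hypothetical documents of equal length.
docs = [
    "temple temple madurai".split(),
    "temple hotel madurai".split(),
    "beach hotel goa".split(),
]
scores = bm25_scores(["temple"], docs)
```

Because the three documents have the same length, the score here depends only on how often the query term occurs: twice, once, and not at all.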

Content based Features

Details of features

Feature No   Description
1            Length of body
2            Length of title
3            Length of URL
4            Length of anchor
5-14         C1-C10 for title of the page
15-24        C1-C10 for body of the page
25-34        C1-C10 for URL of the page
35-44        C1-C10 for anchor of the page
45           OPIC score
46           Domain based classification score

Details of features (cont.)

Feature No   Description
48           BM25 score
49           Lucene score
50           Language modeling score
51-54        Named entity weight for title, body, anchor, URL
55-58        Multi-word weight for title, body, anchor, URL
59-62        Phrasal score for title, body, anchor, URL
63-66        Co-ord factor for title, body, anchor, URL
71           Co-ord factor for H1 tag of web document

Experiments and results

Ranking method                                          (unlabeled)  (unlabeled)  (unlabeled)  MAP
Nutch Ranking                                           0.2267       0.2267       0.2667       0.2137
DIR with Title + content                                0.6933       0.64         0.5911       0.3444
DIR with URL + content                                  0.72         0.62         0.5333       0.3449
DIR with Title + URL + content                          0.72         0.6533       0.56         0.36
DIR with Title + URL + content + anchor                 0.73         0.66         0.58         0.3734
DIR with Title + URL + content + anchor + NE feature    0.76         0.63         0.6          0.4

Thanks
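The results above report MAP (mean average precision). The slides do not show how it was computed, so the sketch below uses the standard definition: average precision for a query is the mean of the precision values at each rank where a relevant document appears, divided by the number of relevant documents, and MAP is the mean over queries. The document ids are hypothetical.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list given the set of relevant doc ids."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k          # precision at rank k
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries.
runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),   # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6"}),                     # AP = (1/2) / 1
]
map_score = mean_average_precision(runs)
```

Relevant documents that never appear in the ranked list still count in the denominator, so a system is penalised for missing them.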

Feature formulation

Feature   Description
C1        Term frequency (tf)
C2        SIGIR feature
C3        Normalized tf
C4        SIGIR feature
C5        Inverse doc frequency (IDF)
C6        SIGIR feature
C7        SIGIR feature
C8        Tf*IDF
C9        SIGIR feature
C10       SIGIR feature
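The four named features in the table (C1, C3, C5 and C8) can be sketched for a single field such as the title. The exact formulas are not given in the transcript, so the choices below are assumptions: values are summed over query terms, normalized tf is tf divided by field length, and idf is log(N/df). The function name and the statistics in the example are hypothetical.

```python
import math

def content_features(query, field_tokens, doc_freq, n_docs):
    """Compute C1 (tf), C3 (normalized tf), C5 (idf) and C8 (tf*idf) for one field."""
    length = len(field_tokens)
    c1 = c3 = c5 = c8 = 0.0
    for term in query:
        tf = field_tokens.count(term)
        # Assumed idf: log(N / df); terms with no document frequency contribute 0.
        idf = math.log(n_docs / doc_freq[term]) if doc_freq.get(term) else 0.0
        c1 += tf
        c3 += tf / length if length else 0.0
        c5 += idf
        c8 += tf * idf
    return {"C1": c1, "C3": c3, "C5": c5, "C8": c8}

# Hypothetical title field and collection statistics.
features = content_features(
    query=["agra", "fort"],
    field_tokens=["agra", "fort", "agra", "tour"],
    doc_freq={"agra": 100, "fort": 10},
    n_docs=1000,
)
```

Running the same function on the body, URL and anchor tokens would give the per-field groups of features described in the feature list, under the same assumptions.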

