
Document ranking using Qprp with concept of Multi-Dimensional Subspace

Date post: 27-Aug-2014
Upload: prakash-dubey
Transcript
Page 1: Document ranking using Qprp with concept of Multi-Dimensional Subspace

PRESENTATION ON PROJECT TOPIC: “DOCUMENT RANKING USING QPRP WITH CONCEPT OF MULTI-DIMENSIONAL SUBSPACE”

Presented By:
• Prakash Kumar Dubey (08)
• Nanhen Gaurav (07)
• Dilip Chauhan (27)

Guided By:
• Mr. Sourish Dhar (Dept. of IT)
• Mr. Bhagaban Swain (Dept. of IT)

Page 2: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Overview
• Introduction.
• Architecture of IR.
• Classical models of IR.
• Quantum probability.
• Document ranking using qPRP.
• Proposed solution.
• Implementation and Data collection.
• Conclusion.
• Future work.

Page 3: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Information Retrieval

• Information Retrieval (IR) is the task of searching for relevant information in large collections of data.

• Examples of IR:

Q: Give me articles about Laloo Prasad Yadav and the fodder scam.
R: Evidence regarding Laloo Prasad Yadav's involvement in the fodder scam (text retrieval).

Q: What does a brain tumor look like on a CT scan?
R: A picture of a brain tumor (image retrieval).

• Not to be confused with Data Retrieval.

Page 4: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Main Components

There are five main components of the basic information retrieval system:

i. Crawling.
ii. Indexing.
iii. User’s Query.
iv. Ranking.
v. Relevance Feedback.

Page 5: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Basic Architecture of IR

Page 6: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

Crawling: The system browses the document collection and fetches documents.
i. Selection Policy.
ii. Revisit Policy.
iii. Politeness Policy.

Indexing: The system builds an index of the documents.
i. Tokenization.
ii. Stop-word Eliminator.
iii. Stemmer.
iv. Inverted Index.

Page 7: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

Ranking: When the user gives a query, the index is consulted to get the most relevant documents. The relevant documents are then ranked by importance.

Relevance Feedback: A classical way of refining search engine rankings, e.g. for the ambiguous query "Matrix" (maths or movie).

Three types of relevance feedback:
* Explicit.
* Implicit.
* Pseudo.

Page 8: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Theoretical Models in IR

Theoretical models give us different ways of solving IR-related problems. An IR model is defined as a 4-tuple [D, Q, F, R(qi, dj)], where:

D: the document collection.
Q: the collection of queries from the users.
F: the framework for modeling document representations, queries and their relationships.
R(qi, dj): the ranking function, which associates a score with the pair (qi, dj).

Page 9: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Classical Models Of IR

The three main classical models of Information Retrieval are:

• Boolean Model
• Vector Space Model
• Probabilistic Model

Page 10: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Boolean Model

The model is based on set theory and Boolean algebra.

Each document is considered a bag of index terms (words or phrases from the document that are important to establish its meaning).

A query is an expression built with Boolean connectives such as AND, OR and NOT.

A retrieved document must match the given query completely, and the result set is not ordered.

Page 11: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Boolean Query Example

Suppose we have 3 documents:
Doc1: Cricket is the most popular game of India.
Doc2: Ricky Ponting is the most successful captain of cricket Australia.
Doc3: India is ranked 5th in the latest ICC test cricket ranking.

If a user wants to know about Indian cricket, a simple query is: India AND Cricket AND NOT Australia.

An inverted index is formed. India is present in documents {1,3}, Cricket in documents {1,2,3} and Australia in document {2}. So finally {1,3} is selected.
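
The lookup in this example can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function name `build_inverted_index` and the whitespace tokenizer are assumptions, and the query is read as India AND Cricket AND NOT Australia, which is the reading consistent with the stated result {1,3}.

```python
from collections import defaultdict

# The three example documents from the slide.
docs = {
    1: "Cricket is the most popular game of India.",
    2: "Ricky Ponting is the most successful captain of cricket Australia.",
    3: "India is ranked 5th in the latest ICC test cricket ranking.",
}


def build_inverted_index(documents):
    """Map each lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().replace(".", " ").split():
            index[token].add(doc_id)
    return index


index = build_inverted_index(docs)

# AND is set intersection, NOT is set difference (assumed query reading).
result = (index["india"] & index["cricket"]) - index["australia"]
print(sorted(result))
```

Because the Boolean model only manipulates sets, the result carries no order, which is the limitation the next slide lists.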

Page 12: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Pros and Cons…

Advantages:
i. Simple, efficient and easy to implement.
ii. Very precise in nature: the user gets exactly what was asked for.

Disadvantages:
i. Partial matches are not retrieved, which in many cases is not suitable. Retrieved documents are not ranked.
ii. Given a large set of documents, it retrieves either too many or very few documents.
iii. The query does not capture synonymous terms.
iv. The model does not use term weights.

Page 13: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Vector Space Model

In this model each document is represented as a vector of index terms. The model can fetch partial matches.

We do not consider only the presence or absence of terms, so in the vector model the term weights are not binary.

Queries are also represented as vectors.

The relevance of a document is measured by the cosine similarity between the document vector and the query vector.

$\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$

Page 14: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Some Important Terms

Modelling as a clustering method. Fixing the term weights:

i. Term Frequency (tf)

ii. Inverse Document Frequency (idf)

Page 15: Document ranking using Qprp with concept of Multi-Dimensional Subspace

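Similarity Measure Between Two Vectors

The most widely used method to measure the similarity between two vectors is cosine similarity.

The cosine similarity of a document $d_j$ and a query $q$ is given by:

$$\mathrm{sim}(d_j, q) = \cos\theta = \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2}\,\sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

Here, $\theta$ = angle between the two vectors; $w_{i,j}$ = weight of the ith term in the jth document; $w_{i,q}$ = weight assigned to the ith term of the query.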
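
As a minimal sketch, cosine similarity between a document weight vector and a query weight vector can be computed as follows. The function name `cosine_similarity` and the choice to return 0 for an all-zero vector are assumptions made for this illustration.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(doc_weights, query_weights))
    norm_d = math.sqrt(sum(w * w for w in doc_weights))
    norm_q = math.sqrt(sum(w * w for w in query_weights))
    if norm_d == 0.0 or norm_q == 0.0:
        # Assumption: an empty vector is treated as having no similarity.
        return 0.0
    return dot / (norm_d * norm_q)


# A document sharing one of its two weighted terms with the query.
print(cosine_similarity([3.0, 4.0], [3.0, 0.0]))
```

The value depends only on the angle between the vectors, so a long document is not favoured merely for its length.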

Page 16: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

The retrieved documents are those $d_j$ for which $\mathrm{sim}(d_j, q)$ is greater than a threshold value.

The threshold can be lowered if, for some query, the highest similarity is on the lower side, which allows partial matches to be retrieved.

[Figure: document vector $d_j$ and query vector $q$; the value of $\cos\theta$ increases as the angle between them decreases.]

Page 17: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Pros and Cons…

Advantages:
i. Partial matching is possible.
ii. Ranking of retrieved results according to cosine similarity is possible.

Disadvantages:
i. Index terms are considered to be mutually independent, which does not allow the model to capture the semantics of the query or document.
ii. It cannot denote the “clear logic view” like the Boolean model.

Page 18: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Probabilistic Model

We try to capture the information retrieval process within a probabilistic framework.

The idea is to retrieve documents according to their probability of being relevant.

Several versions of the probabilistic model are available.

We will use the version of Robertson and Spärck Jones.

Page 19: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Probabilistic Model (Why)

Other models are for the most part empirical: success is measured by experimental results, and few properties are provable.

The Probabilistic Ranking Principle is provable in terms of “minimization of risk”.

Information Retrieval deals with uncertain information, and it makes an uncertain guess of whether a document satisfies the query.

Probability theory provides a principled foundation for such reasoning under uncertainty.

By contrast, the vector space model ranks documents according to their similarity to the query.

Page 20: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Probability Ranking Principle

Collection of Documents

User issues a query

A Set of documents needs to be returned

Question: In what order to present documents to user ?


Page 21: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Probability Ranking Principle

Question: In what order to present documents to user ?

Intuitively, want the “best” document to be first, second best - second, etc…

Need a formal way to judge the “goodness” of documents w.r.t. queries.

Idea: Probability of relevance of the document w.r.t. query


Page 22: Document ranking using Qprp with concept of Multi-Dimensional Subspace

The Probabilistic Ranking Principle


If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of that data.

What is the probability of this document being relevant given this query?

Page 23: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Probabilistic Ranking Principle Definition

All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$.

Let $R$ be the set of documents known to be relevant to query $q$.

Let $\bar{R}$ be the set of non-relevant documents.

Let $P(R \mid d_j)$ be the probability that the document $d_j$ is relevant to the query $q$.

Let $P(\bar{R} \mid d_j)$ be the probability that the document $d_j$ is non-relevant to the query $q$.

Page 24: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

Here we want to rank the documents d with respect to the query q according to their probability of relevance.

Mathematically, the scoring function is given by: P(R = 1 | d, q)

R is an indicator variable: it takes the value 1 if the document d is relevant w.r.t. the query q, and 0 if d is non-relevant w.r.t. q.

Page 25: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Probability Ranking Principle

Let x be a document in the collection.

Let R represent relevance of a document w.r.t. a given (fixed) query and let NR represent non-relevance.

We need to find p(R|x), the probability that a retrieved document x is relevant:

$$p(R \mid x) = \frac{p(x \mid R)\, p(R)}{p(x)}, \qquad p(NR \mid x) = \frac{p(x \mid NR)\, p(NR)}{p(x)}$$

p(x|R), p(x|NR): the probability that if a relevant (non-relevant) document is retrieved, it is x.

p(R), p(NR): the prior probability of retrieving a relevant (non-relevant) document.

Page 26: Document ranking using Qprp with concept of Multi-Dimensional Subspace

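Probability Ranking Principle

$$p(R \mid x) = \frac{p(x \mid R)\, p(R)}{p(x)}, \qquad p(NR \mid x) = \frac{p(x \mid NR)\, p(NR)}{p(x)}$$

Ranking Principle (Bayes’ Decision Rule):

If p(R|x) > p(NR|x) then x is relevant, otherwise x is not relevant.

The similarity sim(dj, q) of the document dj to the query q is defined as the ratio:

$$\mathrm{sim}(d_j, q) = \frac{P(R \mid d_j)}{P(\bar{R} \mid d_j)}$$

Using Bayes’ rule,

$$\mathrm{sim}(d_j, q) = \frac{P(d_j \mid R)\, P(R)}{P(d_j \mid \bar{R})\, P(\bar{R})}$$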

Page 27: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Binary Independence Model

The Binary Independence Model is used for calculating the probability of relevance.

“Binary” because documents and queries are represented as binary (Boolean) term incidence vectors $\vec{x} = (x_1, \ldots, x_n)$, where $x_i = 1$ iff term i is present in document x.

“Independence” means that terms are independent of each other.

Page 28: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

Three assumptions are made by the Binary Independence Model (BIM):

1. The documents are independent of each other.
2. The terms in a document are independent of each other.
3. The terms not present in the query are equally likely to occur in any document, i.e. they do not affect the retrieval process.

Page 29: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Okapi BM25 Ranking Function

The probabilistic IR model is very generic in nature.

Many versions of probabilistic IR exist that are used in practice.

The Okapi BM25 algorithm is based on probabilistic IR.

It takes term frequency (tf) and document length into account.
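
To make the role of term frequency and document length concrete, here is a sketch of a commonly used form of the BM25 scoring function. It is not taken from the BM25 library used in this project: the function name `bm25_score`, the parameter defaults k1 = 1.2 and b = 0.75, and the "+1" inside the logarithm (which keeps the idf non-negative) are assumptions for illustration.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one tokenized document against a tokenized query.

    doc_freq maps a term to the number of documents containing it.
    k1 and b are assumed default values, not values from the slides.
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        n_t = doc_freq.get(term, 0)
        idf = math.log((num_docs - n_t + 0.5) / (n_t + 0.5) + 1.0)
        # Length normalization: longer-than-average documents are penalized.
        norm = tf + k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score


doc_freq = {"cricket": 2}
short_twice = ["cricket", "india", "cricket"]
short_once = ["cricket", "india", "game"]
long_once = ["cricket"] + ["filler"] * 9

for doc in (short_twice, short_once, long_once):
    print(bm25_score(["cricket"], doc, doc_freq, num_docs=3, avg_doc_len=3.0))
```

In this toy run the document that mentions the query term twice scores highest, and the long document with a single mention scores lowest.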

Page 30: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Disadvantages of PRP

The PRP does not hold when its assumptions fail:

Calibration: the IR system's estimate of the probability of relevance may not match the user's assessment of relevance.

Independent Relevance: the PRP assumes that the relevance of documents is independent of each other.

Certainty in Estimation: the probability of relevance of a document is reported by the IR system as a scalar.

Page 31: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Quantum Probability

Quantum probability theory naturally includes interference effects between events.

We assume that this interference reflects the inter-dependency of the relevance of documents.

The outcome is a more sophisticated principle, the Quantum Probability Ranking Principle (qPRP).

To understand the difference between Kolmogorovian and quantum probability theory in terms of the relevance of documents, we will use the double slit experiment.

Page 32: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Double Slit Experiment

[Figure: setting of the double slit experiment.]

Page 33: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

[Figure: distribution of $p_A$ and $p_B$ in the double slit experiment.]

[Figure: distribution of $p^K_{AB}$ in the double slit experiment as estimated by Kolmogorovian probability.]

[Figure: distribution of $p_{AB}$ as measured in the double slit experiment.]

Page 34: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

Kolmogorovian probability theory:

$$p^K_{AB} = p(x \mid A) + p(x \mid B) = p_A + p_B$$

Quantum probability theory:

$$p^Q_{AB} = p_A + p_B + 2\sqrt{p_A}\sqrt{p_B}\cos\theta_{AB} = p_A + p_B + I_{AB}$$

where $\theta_{AB} = \theta_A - \theta_B$ and $I_{AB}$ is the quantum interference term.

In reality:

$$p_{AB} \neq p_A + p_B = p^K_{AB}$$
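
The effect of the interference term can be checked numerically. The sketch below evaluates both estimates for chosen values of p_A, p_B and the phase difference; the function names are hypothetical and the numbers are illustrative only.

```python
import math


def kolmogorov_probability(p_a, p_b):
    """Classical estimate: the two contributions simply add."""
    return p_a + p_b


def quantum_probability(p_a, p_b, theta_ab):
    """Quantum estimate: the classical sum plus the interference term I_AB."""
    interference = 2.0 * math.sqrt(p_a) * math.sqrt(p_b) * math.cos(theta_ab)
    return p_a + p_b + interference


# Phase difference 0 gives constructive interference, pi gives destructive
# interference, and pi/2 makes the interference term vanish.
for theta in (0.0, math.pi / 2, math.pi):
    print(theta, quantum_probability(0.25, 0.25, theta))
```

Only when the cosine of the phase difference is zero do the two theories agree, which is why the phase difference carries the dependence between the two events.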

Page 35: Document ranking using Qprp with concept of Multi-Dimensional Subspace

An Analogy with Document Ranking

The particle corresponds to the user, who is characterized by an information need.

Each slit corresponds to a document, e.g. 2 slits mean 2 documents.

The event of a particle passing from the left of the screen to the right is comparable to the user examining the set of documents.

p(x|A,B) is analogous to p(S|dA, dB), where S is the event of stopping the search with the user being satisfied.

Page 36: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

Fig:- Analogy between Double Slit Experiment and Document Ranking Process in IR.

Page 37: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Ranking Document Within Analogy.


Fig- IR analogous of the previous figure

Page 38: Document ranking using Qprp with concept of Multi-Dimensional Subspace

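Cont…

Kolmogorovian probability: $p^K_{AB} = p_A + p_B$

The following equalities can be defined:

$$\operatorname*{argmax}_{B \in \mathcal{B}} p_{AB} = \operatorname*{argmax}_{B \in \mathcal{B}} p^K_{AB} = \operatorname*{argmax}_{B \in \mathcal{B}} (p_A + p_B) = \operatorname*{argmax}_{B \in \mathcal{B}} p_B$$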

Page 39: Document ranking using Qprp with concept of Multi-Dimensional Subspace

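Cont…

Quantum probability: $p^Q_{AB} = p_A + p_B + I_{AB}$

The following equalities can be defined:

$$\operatorname*{argmax}_{B \in \mathcal{B}} p_{AB} = \operatorname*{argmax}_{B \in \mathcal{B}} p^Q_{AB} = \operatorname*{argmax}_{B \in \mathcal{B}} (p_A + p_B + I_{AB}) = \operatorname*{argmax}_{B \in \mathcal{B}} (p_B + I_{AB})$$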

Page 40: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Ranking The First Document

Kolmogorovian and quantum probability theory give the same estimate, i.e. $p^K_{AB} = p^Q_{AB} = p_A$.

Page 41: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Ranking Subsequent Documents


Slit A and B are kept fixed and the 3rd slit is varied among the slits of set Ƈ

Page 42: Document ranking using Qprp with concept of Multi-Dimensional Subspace

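Cont…

Kolmogorovian probability: $p^K_{ABC} = p_A + p_B + p_C$

The following equalities can be defined:

$$\operatorname*{argmax}_{C \in \mathcal{C}} p_{ABC} = \operatorname*{argmax}_{C \in \mathcal{C}} p^K_{ABC} = \operatorname*{argmax}_{C \in \mathcal{C}} (p_A + p_B + p_C) = \operatorname*{argmax}_{C \in \mathcal{C}} p_C$$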

Page 43: Document ranking using Qprp with concept of Multi-Dimensional Subspace

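Cont…

Quantum probability:

$$p^Q_{ABC} = p_A + p_B + p_C + 2\sqrt{p_A}\sqrt{p_B}\cos(\theta_A - \theta_B) + 2\sqrt{p_A}\sqrt{p_C}\cos(\theta_A - \theta_C) + 2\sqrt{p_B}\sqrt{p_C}\cos(\theta_B - \theta_C)$$

$$p^Q_{ABC} = p_A + p_B + p_C + I_{AB} + I_{AC} + I_{BC}$$

The following equalities can be defined:

$$\operatorname*{argmax}_{C \in \mathcal{C}} p_{ABC} = \operatorname*{argmax}_{C \in \mathcal{C}} p^Q_{ABC} = \operatorname*{argmax}_{C \in \mathcal{C}} (p_A + p_B + p_C + I_{AB} + I_{AC} + I_{BC}) = \operatorname*{argmax}_{C \in \mathcal{C}} (p_C + I_{AC} + I_{BC})$$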

Page 44: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Quantum Probability Ranking Principle (qPRP)

Assumptions:

I. Ranking is performed sequentially.
II. Empirical data is best described using quantum probabilities.
III. It is assumed that the documents that have been ranked before may influence further relevance assessments.
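
The sequential ranking implied by these assumptions can be sketched as a greedy loop: at each step, pick the document that maximizes its own probability of relevance plus its interference with every document already ranked, as in the argmax expressions for two and three slits. This is a minimal sketch, not the project's `sortQPRP()` method; the names `qprp_rank` and `cos_lookup`, the relevance values and the phase cosines are made up for illustration.

```python
import math


def qprp_rank(relevance, cos_theta):
    """Greedy qPRP ordering.

    relevance maps a document id to its estimated probability of relevance.
    cos_theta(a, b) returns the cosine of the phase difference between a and b.
    """
    remaining = set(relevance)
    ranking = []
    while remaining:
        def gain(doc):
            interference = sum(
                2.0 * math.sqrt(relevance[doc]) * math.sqrt(relevance[prev])
                * cos_theta(doc, prev)
                for prev in ranking
            )
            return relevance[doc] + interference

        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(remaining), key=gain)
        ranking.append(best)
        remaining.remove(best)
    return ranking


relevance = {"d1": 0.9, "d2": 0.8, "d3": 0.5}
# Assumed phase cosines: d1 and d2 interfere destructively, all other pairs not at all.
cos_table = {frozenset(("d1", "d2")): -1.0}


def cos_lookup(a, b):
    return cos_table.get(frozenset((a, b)), 0.0)


print(qprp_rank(relevance, cos_lookup))
print(qprp_rank(relevance, lambda a, b: 0.0))
```

With all interference set to zero the loop reduces to sorting by probability of relevance, i.e. the ordering of the classical PRP.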

Page 45: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Interpretation of Interference in qPRP

Quantum interference is central in the formalization of qPRP. Once interference is expressed in terms of IR, these questions may arise:

1. What does quantum interference mean in qPRP and in IR?
2. How does the quantum interference term influence document ranking?

Page 46: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Estimating Interference in qPRP

$$I_{d_A d_B} = 2\sqrt{P(R \mid q, d_A)}\sqrt{P(R \mid q, d_B)}\cos\theta_{d_A d_B} \approx 2\sqrt{P(R \mid q, d_A)}\sqrt{P(R \mid q, d_B)}\,\beta\, f_{sim}(d_A, d_B)$$

The $\cos\theta_{d_A d_B}$ factor in the interference term is approximated by $\beta\, f_{sim}(d_A, d_B)$,

where

$f_{sim}$* is a function used to compute the similarity between $d_A$ and $d_B$, and

$\beta$ is a real-valued parameter.

Note (*): Different similarity functions can be used, e.g. cosine similarity, Jaccard similarity, etc.
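
A minimal sketch of this estimate, using Jaccard similarity as one of the possible choices for the similarity function. The function names and the default β = -1 (so that similar documents interfere destructively) are assumptions made for this illustration; the slides do not state a value for β.

```python
import math


def jaccard_similarity(terms_a, terms_b):
    """Size of the shared vocabulary divided by the size of the joint vocabulary."""
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def estimate_interference(p_a, p_b, similarity, beta=-1.0):
    """Approximate I = 2 * sqrt(p_a) * sqrt(p_b) * beta * f_sim (beta is assumed)."""
    return 2.0 * math.sqrt(p_a) * math.sqrt(p_b) * beta * similarity


sim = jaccard_similarity(["india", "cricket"], ["cricket", "australia"])
print(sim, estimate_interference(0.25, 0.25, sim))
```

With a negative β, two documents that are both probably relevant and highly similar receive a large negative interference, so the second one is pushed down the ranking.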

Page 47: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Constructing Document Representation

• We associate each document with a vector.
• The vector is defined on the vector space made up of the terms present in the documents.
• Each term in the collection is considered a dimension of the vector space.
• Different strategies can be employed to compute the components of the term vector for a document, for example: binary schema, TF-IDF, BM25, etc.

Page 48: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Proposed Solution

We do not find any major drawbacks in the qPRP approach.

qPRP can be thought of as a new model for IR.

The existing qPRP approach treats terms present in different sections of a document equally.

Our belief is that representing the document as a multidimensional subspace will give better results.

We cannot give equal weight to terms present in the title and in the body.

Page 49: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Reason of considering Document as Multidimensional Objects

Writers write the different parts of a document with different views:
Title: gives an idea of the content of the document in 3-7 words.
1st Paragraph or Abstract: an overview of the whole document.
Body: the content of the document.
Conclusion:

Writers also set terms in different fonts and sizes, e.g. key terms in italics.

Considering documents as multidimensional will allow building a “truly” interactive IR system.

Page 50: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Reason of considering Document as Multidimensional Objects

Complex aspects of the retrieval process benefit from a more sophisticated representation of documents and queries.

It reduces the length of subspaces.

Hence if words appear in any segment, the document is more likely to satisfy the user.

Page 51: Document ranking using Qprp with concept of Multi-Dimensional Subspace

How is a document represented as a multidimensional subspace?

In the previous representation of a document:

[Figure: a document with a Title (“School of Tech.”), an Abstract or 1st paragraph, a Body and a Conclusion.]

Doc 1 = (0, 1, 1, 1, 0, 0, 1, 1, 1, 1)

Page 52: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Document Fragments

To represent a document as a multidimensional subspace, we need to divide the document into fragments.

Choice 1: Use a single fragment, the document itself.

Choice 2: Use the different sections of the document (i.e. title, abstract, etc.) as fragments.

Choice 3: Use paragraphs as fragments, as they seem to be an appropriate size to correspond to an Information Need (IN).

Choice 4: Use sentences as fragments.

Page 53: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Fragments as Document Section

[Figure: a document with a Title (“School of Tech.”), an Abstract or 1st paragraph, a Body and a Conclusion.]

Doc 1 =

Title | Abstract | Body | Conclusion
1 | 1 | 1 | 1
0 | 0 | 0 | 0
1 | 1 | 1 | 1
0 | 0 | 1 | 0
1 | 0 | 0 | 1
0 | 1 | 1 | 0
1 | 1 | 0 | 1
0 | 0 | 1 | 0
1 | 1 | 1 | 1
0 | 1 | 0 | 0
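
A term-by-fragment binary matrix of this kind can be built as sketched below, with one row per vocabulary term and one column per fragment. The function name `fragment_matrix`, the sample text and the vocabulary are invented for illustration and are not taken from the data set.

```python
def fragment_matrix(fragments, vocabulary):
    """Return rows of 0/1 flags: row = term, column = fragment (in dict order)."""
    token_sets = [set(text.lower().split()) for text in fragments.values()]
    return [[1 if term in tokens else 0 for tokens in token_sets]
            for term in vocabulary]


# Hypothetical document split into the four sections named on the slide.
doc = {
    "title": "school of tech",
    "abstract": "the school offers technology courses",
    "body": "technology and ranking of documents",
    "conclusion": "ranking works",
}
vocabulary = ["school", "technology", "ranking"]

matrix = fragment_matrix(doc, vocabulary)
for term, row in zip(vocabulary, matrix):
    print(term, row)
```

Each column is the binary term vector of one fragment, so the document is described by several vectors instead of one.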

Page 54: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Fragment as Paragraph & Sentence

A document can be represented as a set of information needs (INs), each being represented as a vector.

We can decompose a paragraph or sentence into text excerpts that are associated with one or more INs.

In the same way, a query can be broken down into INs.

Page 55: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Representation for each Segmentation

Three weighting schemes are used:
1. Term Frequency-Inverse Document Frequency (TF-IDF)
2. Term Frequency (TF)
3. Binary (term presence/absence)

TF-IDF causes substantial overhead, so we can use TF and binary.

Page 56: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Implementing Multidimensional Subspace with qPRP

To decide the ranking between two documents, we know from qPRP that

$$p^Q_{AB} = p_A + p_B + 2\sqrt{p_A}\sqrt{p_B}\cos\theta_{AB}$$

Different parts of a document have different weightage. There are two approaches for implementing the MD subspace with qPRP:
1. Implementing with the whole formula.
2. Implementing only with the similarity function.

Page 57: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Implementing with whole formula

The formula is applied to each section of the document independently.

The value calculated for each fragment is multiplied by the respective weightage, and the weighted values of all fragments are added.

The same similarity function can be used for every fragment. Suppose we have two documents A and B and assign the weightages

Title = 0.2, Abstract = 0.3, Body = 0.3, Conclusion = 0.2

$$p^Q_{AB} = w_{title}\,(p^Q_{AB})_{title} + w_{abstract}\,(p^Q_{AB})_{abstract} + w_{body}\,(p^Q_{AB})_{body} + w_{conclusion}\,(p^Q_{AB})_{conclusion}$$

where each $w$ is the weightage of the corresponding fragment.
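
The weighted combination can be sketched as follows, using the four weightages given on the slide. Everything else is an assumption for illustration: the function names, the per-fragment inputs (probabilities and phase cosines) and the example numbers.

```python
import math

# Weightages from the slide.
WEIGHTS = {"title": 0.2, "abstract": 0.3, "body": 0.3, "conclusion": 0.2}


def quantum_probability(p_a, p_b, cos_theta):
    """p_A + p_B plus the interference term for one fragment."""
    return p_a + p_b + 2.0 * math.sqrt(p_a) * math.sqrt(p_b) * cos_theta


def weighted_quantum_probability(per_fragment, weights=WEIGHTS):
    """per_fragment maps a fragment name to (p_a, p_b, cos_theta)."""
    return sum(weights[name] * quantum_probability(*per_fragment[name])
               for name in weights)


# Hypothetical per-fragment estimates for two documents A and B.
per_fragment = {
    "title": (0.25, 0.25, -1.0),
    "abstract": (0.25, 0.25, 0.0),
    "body": (0.25, 0.25, 0.0),
    "conclusion": (0.25, 0.25, 1.0),
}
print(weighted_quantum_probability(per_fragment))
```

Because the weightages sum to 1, the combined value stays on the same scale as the per-fragment values.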

Page 58: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Implementing only with Similarity Function

Only the similarity function is applied per document fragment, rather than the whole formula.

Calculate the similarity function between the respective fragments of the documents, weight each value and add all of them:

$$\theta_{AB} = w_{title}\,(\theta_{AB})_{title} + w_{abstract}\,(\theta_{AB})_{abstract} + w_{body}\,(\theta_{AB})_{body} + w_{conclusion}\,(\theta_{AB})_{conclusion}$$

Different types of formula can be used for calculating the similarity between multidimensional subspaces.
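
A sketch of this second approach, assuming cosine similarity is the per-fragment similarity function and that each fragment is already available as a term-weight vector. The names `cosine` and `weighted_fragment_similarity`, the weights and the vectors are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def weighted_fragment_similarity(doc_a, doc_b, weights):
    """Weighted sum of the similarities between corresponding fragments."""
    return sum(w * cosine(doc_a[name], doc_b[name]) for name, w in weights.items())


weights = {"title": 0.2, "abstract": 0.3, "body": 0.3, "conclusion": 0.2}
doc_a = {"title": [1, 0, 1], "abstract": [1, 1, 0], "body": [0, 1, 1], "conclusion": [1, 0, 0]}
doc_b = {"title": [1, 0, 1], "abstract": [0, 0, 1], "body": [0, 1, 1], "conclusion": [0, 1, 0]}

print(weighted_fragment_similarity(doc_a, doc_b, weights))
```

Here the two documents agree on title and body and differ elsewhere, so the combined similarity is roughly the sum of those two weightages.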

Page 59: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Metrics for measuring extent of Interference

The subspace similarity sims(Sa, Sb) between the p-dimensional subspace Sa and the r-dimensional subspace Sb is defined as:

sims(Sa,Sb) = 1-

This formula can also be used to calculate the similarity between two semantic spaces.

Page 60: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Implementation and Results

For the implementation of the project and an efficient evaluation of the results, we have used the following prerequisites.

Software requirements:
• Windows 7
• Microsoft Office 2010 (for the project report)
• JDK 1.6.0 (compiler) or higher version
• Notepad++ (with WebEdit)

Data set requirement:
• Ad-Hoc standard data set

Page 61: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

Hardware requirements:
• 3 GB RAM.
• 5 GB free hard disk space.
• Intel Core i5 processor or higher version.

Package requirements:
• Lucene 2.4.0
• BM25 implementation.
• Apache Commons Math 2.2.0

Page 62: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Data Collection

The FIRE Ad-Hoc collection of the year 2010 has been used.

The queries have also been taken from the same collection.

The data set contains around 1,30,000 documents, comprising news from the leading newspaper “The Telegraph” for the period 2004-07.

We have divided each document into 3 fragments, i.e. <title></title>, <fp></fp> and <sp></sp>.

Page 63: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Why fragments?

The fragments are made so as to bring in the concept of multiple subspaces. In our case the number of subspaces is 3. The reasons for choosing these three fragments in this order are:
• Titles are the most important part of any document.
• The inverted pyramid is the model for news writing.

So the title is kept at the top and the main content of the document is divided into two parts:
• First paragraph.
• Second paragraph.

Page 64: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Implementation of the proposed solution

We have divided our implementation process into 3 modules:
• Indexing of the data set.
• Searching the indexed documents using cosine similarity.
• Searching using a quantum-based similarity measure.

For implementing the proposed solution we have chosen the following libraries:
• DOM Parser (built into Java).
• Apache Lucene 2.4.0.
• Apache Commons Math Library 2.2.0.
• BM25 Implemented Library.

Page 65: Document ranking using Qprp with concept of Multi-Dimensional Subspace

UML Diagram

Class diagram used for indexing:

Indexer
  Attributes: -IndexWriter, -Document
  Methods: +getIndexWriter(boolean), +closeIndexWriter(), +indexDocument(TryDOM), +recursion(File), +rebuildIndexes(String)

TryDOM
  Attributes: -Document, -NodeList
  Methods: +buildDocument(File), +String getName(), +String getDocNum(), +String getTitle(), +String getFirstPara(), +String getSecondPara(), +String getWholeDocument()

Main
  Methods: +public static void main(String[])

Page 66: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Class diagram used for searching

Class Diagram for Searcher (Part 1)

SearchFrame
  Attributes: +String, +Jpanel, +JTextField, +Jbutton
  Methods: -actionPerformed(ActionEvent)

Main
  Methods: public static void main()

Page 67: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

Class Diagram for Searcher (Part 2)

DocVector
  Attributes: +SparseRealVector, +Map
  Methods: +DocVector(Map<Str,Int> terms), +setEntry(String term, int freq), +normalize()

MySearcher
  Attributes: #HashMap, #ArrayList, #IndexSearcher, #Document, #double tempScore, #int tempDoc, #int num, #IndexReader
  Methods: MySearcher(), ScoreDoc[] getProbableRelvDoc(String, String), HashMap sortQPRP(String, String), double getSimilarity(int,int), double testSimilarityUsingCosine(int,int,str)

Page 68: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Explanation (Indexer)

Indexing starts from the Main class, which takes as input the directory path where the documents to be indexed are kept.

The Main class instantiates the Indexer class and calls its method rebuildIndexes(String), passing the given directory path to it, which in turn calls recursion(File). All the files available in the directory are indexed recursively by this function.

Each file is parsed by the TryDOM class and then passed to indexDocument(TryDOM) to be indexed.

Page 69: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Explanation (Searcher)

Once indexing is done, the next step is to search the indexed documents for the given query, which is done by the MySearcher class.

The program starts from the Main class, which instantiates the SearchFrame, and a GUI pops up.

The GUI takes two inputs: the query and the name of the file where the result is stored.

Clicking the search button instantiates the MySearcher class and calls the method sortQPRP(). It calls getProbableRelvDoc() to get the top k results using the BM25 model; sortQPRP() then rearranges the results according to the qPRP model.

Page 70: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Collaboration diagram for indexer.

[Figure: collaboration diagram between Main, Indexer and TryDOM, with the messages rebuildIndexes(), buildDocument(), getTitle(), getFirstPara(), getSecondPara(), getName(), getDocNum() and getWholeDocument().]

Page 71: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Collaboration diagram for searcher.

[Figure: collaboration diagram between Main, SearchFrame, MySearcher and DocVector, with the messages sortQPRP(), result(), setEntry() and normalize().]

Page 72: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Lucene Index Structure

Documents in Lucene are stored as objects in the index. We need to convert the data into Document objects and store them in the index. We break the data into different parts and store them in the Document object as Field objects.

Fields of each Document: Doc ID, Title, First Paragraph, Second Paragraph, DocNum, Document Name.

Page 73: Document ranking using Qprp with concept of Multi-Dimensional Subspace

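Evaluation Measures

To evaluate the results we need to consider two dimensions.

Recall: a measure of the ability of the system to present all relevant documents. Mathematically, recall = (number of relevant documents retrieved) / (total number of relevant documents).

Precision: a measure of the ability of the system to present only relevant documents. Mathematically, precision = (number of relevant documents retrieved) / (total number of documents retrieved).

Recall and precision are set-based measures.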
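
Both measures follow directly from set operations, as in this small sketch. The function name `precision_recall` and the document ids are illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two of them found.
print(precision_recall([1, 2, 3, 4], [2, 4, 6]))
```

Neither value depends on the order of the retrieved documents, which is why ranked lists need an additional measure.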


Page 74: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

To evaluate a ranked list, precision is plotted against recall. Whenever a new non-relevant document is retrieved, the recall value stays the same but the precision decreases.

Page 75: Document ranking using Qprp with concept of Multi-Dimensional Subspace

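Mean Average Precision (MAP)

In recent years the TREC community has been using MAP. It provides a single figure across recall levels. To calculate Mean Average Precision the following formula is used:

$$\mathrm{MAP}(Q) = \frac{1}{|Q|} \sum_{j=1}^{|Q|} \frac{1}{m_j} \sum_{k=1}^{m_j} \mathrm{Precision}(R_{jk})$$

where the set of relevant documents for a query $q_j \in Q$ is $\{d_1, \ldots, d_{m_j}\}$ and $R_{jk}$ is the set of ranked retrieval results from the top result until you get to document $d_k$.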
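
As a sketch, average precision can be computed from the ranks at which the relevant documents appear. The rank lists below are the ones this presentation reports for Query 121; the function names are hypothetical, and the sketch assumes every relevant document appears in the list, so m_j equals the number of listed ranks.

```python
def average_precision(relevant_ranks):
    """Mean of precision values taken at each rank where a relevant document occurs."""
    ranks = sorted(relevant_ranks)
    if not ranks:
        return 0.0
    # The i-th relevant document (1-based) found at rank r gives precision i / r.
    return sum((i + 1) / rank for i, rank in enumerate(ranks)) / len(ranks)


def mean_average_precision(rank_lists):
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in rank_lists) / len(rank_lists)


prp_q121 = [11, 16, 5, 20, 4, 1, 3, 19, 6, 13]
cosine_q121 = [11, 10, 13, 7, 1, 8, 15, 14, 16]

print(round(average_precision(prp_q121), 3))
print(round(average_precision(cosine_q121), 3))
```

Under that assumption the two values agree, after rounding, with the average precision figures reported for Query 121 later in the presentation.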


Page 76: Document ranking using Qprp with concept of Multi-Dimensional Subspace

• First we retrieved the top 150 results using the BM25 model.
• Then we re-sorted the results according to qPRP using cosine similarity.
• We noted down the ranked lists given by both models.
• We calculated the recall and precision of both lists whenever a new relevant document was retrieved in the list.
• We plotted the histogram using recall-precision.

Page 77: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Ranking of relevant documents

Query No. | Relevant Document Ranking (PRP) | Relevant Document Ranking (qPRP)
77 | 78, 98, 41, 69, 16, 132, 47, 134, 135 | 60, 48, 47, 46, 42, 52, 44, 49, 54
79 | 18 | 20
85 | 26, 38, 27, 1, 44, 6, 42, 2, 22, 104 | 52, 30, 25, 1, 45, 26, 87, 36, 42, 48
88 | 27, 6, 49, 44, 7, 59, 13, 20, 12, 23, 21, 43, 56 | 35, 11, 82, 54, 2, 21, 29, 3, 39, 24, 26, 18, 28
100 | 5, 19, 13, 12, 14, 33, 6, 3, 4, 29, 1, 22, 9, 18, 15, 23 | 4, 21, 12, 11, 31, 33, 10, 3, 6, 20, 1, 27, 14, 15, 7, 8
102 | 2, 37, 129, 84, 62, 147 | 2, 13, 18, 16, 17, 20
103 | 1, 17, 30, 12, 130 | 1, 6, 7, 5, 143
112 | 18, 3, 37, 14, 6, 8, 1, 73, 24 | 7, 5, 16, 3, 6, 2, 1, 13, 9
121 | 11, 16, 5, 20, 4, 1, 3, 19, 6, 13 | 11, 10, 13, 7, 1, 8, 15, 14, 16
122 | 6, 1, 4, 10, 23, 9 | 3, 1, 16, 7, 21, 14

Page 78: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Comparison of Precision for PRP and qPRP (cosine) at the same recall values (Query: 100)

[Figure: histogram of Precision(PRP) and Precision(Cosine) at 16 recall levels from 0.062 to 1; the precision axis runs from 0 to 1.2.]

Page 79: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Comparison of Precision for PRP and qPRP (cosine) at the same recall values (Query: 112)

[Figure: histogram of Precision(PRP) and Precision(Cosine) at 9 recall levels from 0.111 to 1; the precision axis runs from 0 to 1.2.]

Page 80: Document ranking using Qprp with concept of Multi-Dimensional Subspace

MAP comparison with respect to the queries

[Figure: histogram of MAP(PRP) and MAP(Cosine) for queries Q77, Q79, Q85, Q88, Q100, Q102, Q103, Q112, Q121 and Q122; the vertical axis runs from 0 to 0.9.]

Page 81: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Ranking of relevant documents using qPRP (using quantum-based similarity)

Model Name | Document Ranking | Average Precision
PRP | 11, 16, 5, 20, 4, 1, 3, 19, 6, 13 | 0.66
qPRP (using cosine similarity) | 11, 10, 13, 7, 1, 8, 15, 14, 16 | 0.508
qPRP (using quantum-based similarity) | 11, 15, 19, 3, 1, 2, 18, 5, 12 | 0.68

Page 82: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Conclusion

We have calculated the Mean Average Precision (MAP) for both models using a set of queries. We obtained the following MAP values:

Model Name | MAP
PRP | 0.347060
qPRP (using cosine similarity) | 0.396237

The difference between them is 0.049177. The result obtained for qPRP is 14.1% more precise than that of PRP. The result that we have obtained is better in most cases, but for a few queries the result of PRP is better than that of qPRP.

Page 83: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Future Work

From the above results we deduce that qPRP can be used to rank the Ad-Hoc data set. The following directions can be pursued to get even better results:

• An alternative document representation can be used. For example, we may divide subspaces on the basis of the most informative terms. The most informative terms can be deduced from the font, or from terms appearing near a query term in the document.

• A different similarity measure can be used. For example, one may use the similarity in paper.

Page 84: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Cont…

• The similarity can be found by capturing the meaning of the document. For capturing the meaning of a document we may use the HAL representation.

Azzopardi, Leif, Probabilistic Hyperspace Analogue to Language

• One can also test the solution that we have proposed under the section “Implementing with whole formula”.

Page 85: Document ranking using Qprp with concept of Multi-Dimensional Subspace

Thank You..
