Date post: 19-Dec-2015
Set-Based Model: A New Approach for Information Retrieval

Bruno Pôssas, Nivio Ziviani, Wagner Meira Jr., Berthier Ribeiro-Neto

Department of Computer Science

Federal University of Minas Gerais, Brazil

LATIN - Lab for Treating Information -- Federal University of Minas Gerais, Brazil

Introduction

Vector space model (VSM)

• Query terms and documents are represented as weighted vectors in a vector space

• Query answers are documents whose representative vectors have high similarity to the query vector

• Term weighting scheme: TF x IDF
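
As a point of reference for the rest of the deck, the VSM ranking described here can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the toy documents, the query, the raw-count TF component and the natural-log IDF are all assumptions chosen for brevity.

```python
import math
from collections import Counter

# Toy collection (an assumption for illustration): three tokenized documents.
docs = [["A", "C", "T"], ["C", "D"], ["C", "D", "T"]]
query = ["A", "T"]

n_docs = len(docs)
# Document frequency: in how many documents each term occurs.
df = Counter(term for doc in docs for term in set(doc))
# IDF component; a term that occurs in every document gets weight 0.
idf = {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf_vector(tokens):
    """Sparse TF x IDF vector; terms unseen in the collection are ignored."""
    return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}

def cosine(u, v):
    """Cosine between two sparse vectors (0.0 if either has zero length)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

query_vector = tfidf_vector(query)
scores = [cosine(query_vector, tfidf_vector(doc)) for doc in docs]
# Document indices ordered from most to least similar to the query.
ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
```

Each term contributes to the score independently of the others, which is the independence assumption the following slides set out to relax.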

Motivation

In VSM, index terms are assumed to be mutually independent

• Linear weighting function

• Not realistic but easy to compute

Our hypothesis:

Exploiting the correlation among index terms might improve retrieval effectiveness

Our Goal

Propose a new model for computing index term weights, based on set theory

• Terms → Sets of terms (termsets)

• Correlation among index terms

• High retrieval effectiveness while keeping computational costs small

Exploit the intuition that occurrences of related terms are often close to each other

Related Work

Correlation among index terms

• Raghavan and Yu (1979)

• van Rijsbergen (1977), Harper and van Rijsbergen (1978)

• Wong et al. (1985 and 1987)

• Common limitations:

    • Expensive to compute dependency factors

    • Exhaustive application of term co-occurrences hurts overall effectiveness and performance

Association rule mining

• Zaki (2000)

Termsets

$T = \{t_1, t_2, \ldots, t_t\}$ is the set of $t$ unique terms of a collection of documents $D$.

An n-termset $s$ is an ordered set of $n$ terms, such that $s \subseteq T$.

$ds$ is the frequency of a termset $s$.

$S$ is the set of $2^t$ unique termsets that may appear in a document (power set of $T$).

Termsets: Example

D = {d1, d2, d3}

T = {A, C, D, T}

S = {sA, sC, …, sAC, sAD, …, sACDT}

Collection D:

d1 = {A, C, T}

d2 = {C, D}

d3 = {C, D, T}

sA = {A} (1-termset), dsA = 1

sCD = {C, D} (2-termset), dsCD = 2

sCDT = {C, D, T} (3-termset), dsCDT = 1
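
The frequencies in this example can be checked with a short sketch. It assumes, as the numbers on the slide suggest, that the frequency of a termset is the number of documents containing all of its terms; the names `collection` and `termset_frequency` are illustrative only.

```python
from itertools import combinations

# Example collection from the slide: each document is a set of terms.
collection = {
    "d1": {"A", "C", "T"},
    "d2": {"C", "D"},
    "d3": {"C", "D", "T"},
}

def termset_frequency(termset, collection):
    """Number of documents that contain every term of the termset."""
    return sum(1 for terms in collection.values() if termset <= terms)

# Vocabulary T and all non-empty termsets over it (S minus the empty set).
vocabulary = sorted(set().union(*collection.values()))
all_termsets = [
    frozenset(combo)
    for n in range(1, len(vocabulary) + 1)
    for combo in combinations(vocabulary, n)
]

ds_A = termset_frequency({"A"}, collection)
ds_CD = termset_frequency({"C", "D"}, collection)
ds_CDT = termset_frequency({"C", "D", "T"}, collection)
```

With four terms there are 2^4 = 16 termsets including the empty set, so `all_termsets` holds 15 entries; enumerating them all only works because the vocabulary is tiny.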

Termsets: Definitions

Frequent termset: a termset with frequency greater than or equal to a given minimal frequency.

Closed termset: a frequent termset that is the largest among the termsets that occur in the same set of documents, i.e., no proper superset of it occurs in exactly that set of documents.

The use of closed termsets significantly reduces the number of termsets taken into consideration.
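
The two definitions can be made concrete with a brute-force sketch over the deck's three-document example collection. It is exponential in the vocabulary size and is meant only to illustrate the definitions, not the enumeration algorithm the authors use; the function names are hypothetical.

```python
from itertools import combinations

# Example collection used throughout the deck.
collection = {
    "d1": {"A", "C", "T"},
    "d2": {"C", "D"},
    "d3": {"C", "D", "T"},
}

def frequent_termsets(collection, min_freq):
    """Map each termset with frequency >= min_freq to its set of documents."""
    vocabulary = sorted(set().union(*collection.values()))
    frequent = {}
    for n in range(1, len(vocabulary) + 1):
        for combo in combinations(vocabulary, n):
            termset = frozenset(combo)
            docs = frozenset(d for d, terms in collection.items() if termset <= terms)
            if len(docs) >= min_freq:
                frequent[termset] = docs
    return frequent

def closed_termsets(frequent):
    """Keep termsets that have no proper superset with the same document set."""
    return {
        s: docs
        for s, docs in frequent.items()
        if not any(s < other and docs == other_docs
                   for other, other_docs in frequent.items())
    }

frequent = frequent_termsets(collection, min_freq=1)
closed = closed_termsets(frequent)
closed_names = {"".join(sorted(s)) for s in closed}
```

With a minimal frequency of 1, the example collection has 11 frequent termsets but only 5 closed ones (C, CD, CT, ACT, CDT), which shows the reduction the slide refers to.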

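Termsets: Example

Collection D: d1 = {A, C, T}, d2 = {C, D}, d3 = {C, D, T}

[Figure: termset lattice for collection D, with the frequency of each termset]

{ } (empty set)

1-termsets: A: 1, C: 3, D: 2, T: 2

2-termsets: AC: 1, AT: 1, CD: 2, CT: 2, DT: 1

3-termsets: ACT: 1, CDT: 1

Legend: Empty set, Frequent Termset, Closed Termset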

Set-Based Model

Documents and queries are described by sets of closed termsets, instead of terms.

Closed termsets provide all elements of the TF x IDF scheme.

Computational cost is linear in the number of documents in the collection.

Set-Based Model: Termset Weights

Extension of a TF x IDF scheme

• $sf_{i,j}$: number of occurrences of $s_i$ in $d_j$

• $ds_i$: number of occurrences of $s_i$ in $D$

• $ids_i$: inverse frequency of occurrence of $s_i$ in $D$

$$
w^{*}_{i,j} = sf_{i,j} \times ids_i = (1 + \log sf_{i,j}) \times \log \frac{N}{ds_i}
$$

SBM ≡ VSM, if only 1-termsets are considered
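
A small sketch of the termset weight, assuming it has the form (1 + log sf) × log(N / ds) with natural logarithms. The function name `termset_weight` and the handling of zero counts are assumptions made for this illustration.

```python
import math

def termset_weight(sf, ds, n_docs):
    """Weight of a termset in a document.

    sf     -- occurrences of the termset in the document
    ds     -- occurrences (document frequency) of the termset in the collection
    n_docs -- number of documents N in the collection
    """
    if sf <= 0 or ds <= 0:
        # Assumption: a termset absent from the document or collection weighs 0.
        return 0.0
    return (1.0 + math.log(sf)) * math.log(n_docs / ds)

# A termset seen once in the document and in 2 of 3 documents overall.
w_common = termset_weight(sf=1, ds=2, n_docs=3)
# A termset seen three times in the document and in 1 of 3 documents overall.
w_rare = termset_weight(sf=3, ds=1, n_docs=3)
# A termset present in every document carries no discriminating power.
w_everywhere = termset_weight(sf=1, ds=3, n_docs=3)
```

Rarer termsets and termsets that occur more often in the document get larger weights, exactly as terms do under TF x IDF.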

Set-Based Model: Similarity Calculation

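$$
sim(q, d_j) = \frac{d_j \cdot q}{|d_j| \times |q|} = \frac{\sum_{s \in C_q} w^{*}_{s,j} \times w^{*}_{s,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
$$

[Figure: vector space with axes sA, sAT and sT, showing documents d1 and d2 and query Q]

Normalization uses just terms instead of termsets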
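
The similarity can be sketched as a sparse dot product over termset weights divided by norms computed from term weights only, following the note on normalization. The reading of the formula, the function name and all the weights below are assumptions made for illustration, not values from the experiments.

```python
import math

def set_based_similarity(q_termset_w, d_termset_w, q_term_w, d_term_w):
    """Termset dot product divided by the product of term-based vector norms."""
    # Numerator: only termsets of the query that also occur in the document count.
    dot = sum(w * d_termset_w.get(s, 0.0) for s, w in q_termset_w.items())
    # Denominator: norms use single-term weights, not termset weights.
    q_norm = math.sqrt(sum(w * w for w in q_term_w.values()))
    d_norm = math.sqrt(sum(w * w for w in d_term_w.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)

# Hypothetical weights for a query and one document.
q_termsets = {frozenset({"A"}): 1.0, frozenset({"A", "T"}): 2.0}
d_termsets = {frozenset({"A", "T"}): 3.0}
q_terms = {"A": 3.0, "T": 4.0}   # norm 5
d_terms = {"A": 6.0, "T": 8.0}   # norm 10

score = set_based_similarity(q_termsets, d_termsets, q_terms, d_terms)
```

Here only the termset {A, T} is shared, so the score is (2 × 3) / (5 × 10) = 0.12. Keeping the norms term-based means they can be precomputed exactly as in the VSM.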

Set-Based Model: Query Mechanism

SBM Algorithm:

1. Obtain the 1-termsets from query terms;

2. Enumerate all closed termsets from 1-termsets;

3. Calculate similarities between query and documents using the closed termsets;

4. Normalize document similarities;

5. Select the k largest document similarities.
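
The five steps can be strung together in a compact sketch on a three-document toy collection. Several simplifications are assumed here: documents are plain sets of terms (so every termset occurs at most once per document), closed termsets are found by brute force over subsets of the query terms, and the weight is taken to be log(N / ds). The function `rank` and its parameters are hypothetical.

```python
import math
from itertools import combinations

# Toy collection (an assumption for illustration); each document is a set of terms.
collection = {"d1": {"A", "C", "T"}, "d2": {"C", "D"}, "d3": {"C", "D", "T"}}

def rank(query_terms, collection, k, min_freq=1):
    """Return the k (document, similarity) pairs with the largest similarity.

    min_freq must be at least 1 so that every kept termset occurs somewhere.
    """
    n_docs = len(collection)

    def docs_with(termset):
        return frozenset(d for d, terms in collection.items() if termset <= terms)

    def ids(docs):
        return math.log(n_docs / len(docs))

    # Steps 1-2: build termsets from the query terms and keep the closed ones.
    terms = sorted(set(query_terms))
    frequent = {}
    for n in range(1, len(terms) + 1):
        for combo in combinations(terms, n):
            docs = docs_with(frozenset(combo))
            if len(docs) >= min_freq:
                frequent[frozenset(combo)] = docs
    closed = {s: docs for s, docs in frequent.items()
              if not any(s < o and docs == o_docs for o, o_docs in frequent.items())}

    # Step 3: accumulate the product of document and query termset weights.
    scores = {}
    for docs in closed.values():
        weight = ids(docs)  # sf = 1 in both query and document, so 1 + log sf = 1
        for d in docs:
            scores[d] = scores.get(d, 0.0) + weight * weight

    # Step 4: normalize using term (1-termset) weights only.
    def term_norm(term_set):
        total = 0.0
        for t in term_set:
            docs = docs_with(frozenset([t]))
            if docs:
                total += ids(docs) ** 2
        return math.sqrt(total)

    q_norm = term_norm(set(query_terms))
    normalized = {}
    for d, score in scores.items():
        denom = term_norm(collection[d]) * q_norm
        normalized[d] = score / denom if denom else 0.0

    # Step 5: keep the k largest similarities.
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)[:k]

top = rank(["A", "T"], collection, k=2)
```

For the query {A, T}, the termset {A} is dropped because {A, T} occurs in exactly the same document, so only {T} and {A, T} contribute. Document d1 ranks first, followed by d3; d2 shares no termset with the query and is never scored.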

Experimental Results

Reference Collection    CFC      WSJ        TReC-3

# Documents             1,240    173,252    1,078,166

# Distinct Terms        2,105    230,902    1,016,709

# Queries               100      300        300

Query Size (avg.)       3.82     18.88      22.43

Size (MB)               1.9      509        3,225

TReC-3: Recall x Precision

Average Precision

Collection    Average Precision (%)        SBM Gain (%)

              VSM      GVSM     SBM        vs. VSM    vs. GVSM

CFC           22.42    24.47    26.56      18.47      8.54

WSJ           31.76    34.27    41.78      31.55      21.91

TReC-3        32.58    *        44.59      36.86      *

* GVSM could not be evaluated for the TReC-3 collection due to the exponential cost of the min-term build phase
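
The gain columns appear to be the relative improvement of SBM over each baseline, 100 × (SBM − baseline) / baseline. This reading is inferred from the numbers rather than stated on the slide; the short check below reproduces the reported gains from the average precision figures.

```python
# Average precision (%) per collection, copied from the table above.
average_precision = {
    "CFC": {"VSM": 22.42, "GVSM": 24.47, "SBM": 26.56},
    "WSJ": {"VSM": 31.76, "GVSM": 34.27, "SBM": 41.78},
    "TReC-3": {"VSM": 32.58, "SBM": 44.59},  # GVSM not evaluated
}

def relative_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline

gains = {
    name: {
        model: round(relative_gain(values["SBM"], score), 2)
        for model, score in values.items()
        if model != "SBM"
    }
    for name, values in average_precision.items()
}
```

For example, on CFC the gain over VSM is 100 × (26.56 − 22.42) / 22.42 ≈ 18.47%.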

Average Precision at 10

Collection    Average Precision at 10 (%)    SBM Gain (%)

              VSM      GVSM     SBM          vs. VSM    vs. GVSM

CFC           10.97    12.93    16.02        46.03      23.90

WSJ           12.71    16.58    19.17        50.82      15.62

TReC-3        13.66    *        21.42        56.80      *

* GVSM could not be evaluated for the TReC-3 collection due to the exponential cost of the min-term build phase

Computational Efficiency

Collection    Avg. Response Time (s)         Increase (%)

              VSM       GVSM      SBM        GVSM     SBM

CFC           0.0023    0.0056    0.0025     243.5    8.7

WSJ           0.4286    2.0143    0.6296     469.9    46.9

TReC-3        1.2732    *         2.2930     *        80.1

* GVSM could not be evaluated for the TReC-3 collection due to the exponential cost of the min-term build phase

Conclusions and Future Work

SBM exploits correlations among index terms, improving retrieval effectiveness while remaining efficient.

Future work:

• Investigate the behavior of SBM when applied to larger collections.

• Extend SBM to take into account the proximity information of index terms.

Termsets: Complexity

Time Complexity:

• Worst Case: $O(2^{|q|} \cdot N)$

• Avg. Case: $O(c \cdot N)$

Space Complexity, Worst Case: $O(r \cdot 2^{l} \cdot N)$

|q| = query size,

c = number of closed termsets,

N = number of documents,

r = number of maximal termsets,

l = length of the largest termset.

TReC-3: Number of Closed Termsets

Collection    Worst Case      Average Case

CFC           14.12           3.14

WSJ           456,419.21      3,217.28

TReC-3        5,650,707.18    4,081.25

The average case scenario is significantly smaller than the worst case scenario.

TReC-3: Minimal Frequency

Trade-off between precision, the number of termsets taken into consideration and performance

[Figure: Avg. Precision (%) and Avg. Response Time (s) plotted against Minimal Frequency (# docs), for minimal frequencies from 0 to 30]

Termsets: Enumeration

An incremental algorithm that employs a very powerful pruning strategy.

1. Enumeration of (n+1)-termsets from n-termsets: union of all pairs (si, sj) that have the same prefix.

2. Evaluation of whether a frequent termset s being verified is closed: check whether any current termset has s as its closure; such termsets are discarded.
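
The prefix-based enumeration can be sketched as a level-wise join in which each termset carries its list of documents, and the document list of a union is the intersection of the two lists. This is an illustration under stated assumptions rather than the authors' algorithm: termsets are kept as sorted tuples, the closure check is a simple filter applied after enumeration instead of an incremental pruning step, and all function names are hypothetical.

```python
def enumerate_frequent(term_docs, min_freq):
    """Level-wise enumeration of frequent termsets with their document lists.

    term_docs maps each term to the frozenset of documents containing it.
    min_freq is assumed to be at least 1.
    """
    level = {(t,): docs for t, docs in term_docs.items() if len(docs) >= min_freq}
    frequent = dict(level)
    while level:
        keys = sorted(level)
        next_level = {}
        for i, a in enumerate(keys):
            for b in keys[i + 1:]:
                if a[:-1] != b[:-1]:
                    break  # keys are sorted, so no later key shares a's prefix
                docs = level[a] & level[b]  # documents containing the union
                if len(docs) >= min_freq:
                    next_level[a + (b[-1],)] = docs
        frequent.update(next_level)
        level = next_level
    return frequent

def keep_closed(frequent):
    """Discard every termset whose closure is a larger termset."""
    return {
        s: docs for s, docs in frequent.items()
        if not any(set(s) < set(o) and docs == o_docs
                   for o, o_docs in frequent.items())
    }

# Document lists of the 1-termsets in the deck's example collection.
term_docs = {
    "A": frozenset({"d1"}),
    "C": frozenset({"d1", "d2", "d3"}),
    "D": frozenset({"d2", "d3"}),
    "T": frozenset({"d1", "d3"}),
}
frequent = enumerate_frequent(term_docs, min_freq=1)
closed = keep_closed(frequent)
```

Starting from A, the join produces AC and AT (both with document list {d1}) and then ACT, which is the only closed termset of that branch; A, AC and AT are discarded because ACT occurs in the same document.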

Termsets: Example

Collection D: d1 = {A, C, T}, d2 = {C, D}, d3 = {C, D, T}

1-termsets: lsA = {d1}, lsC = {d1, d2, d3}, lsD = {d2, d3}, lsT = {d1, d3}

2-termsets: lsAC = {d1}, lsAT = {d1}

3-termsets: lsACT = {d1}

Closed termset: lsACT = {d1}

