Toward Consistent Evaluation of Relevance Feedback Approaches in Multimedia Retrieval Xiangyu Jin, James French, Jonathan Michel July, 2005
Transcript
Page 1

Toward Consistent Evaluation of Relevance Feedback Approaches in Multimedia Retrieval

Xiangyu Jin, James French, Jonathan Michel

July, 2005

Page 2

Outline

Motivation & Contributions
RF (Relevance Feedback) in MMIR
PE (Performance Evaluation) Problems
Rank Normalization
Experimental Results
Conclusions

Page 3

Motivation

RF in MMIR is a cross-disciplinary research area:

(1). CV & PR [Rui 98] [Porkaew 99]

(2). Text IR [Rocchio 71] [Williamson 78]

(3). DB & DM [Ishikawa 98][Wu 00][Kim 03]

(4). HCI & Psychology ...

These groups come from different backgrounds, follow different traditions, and apply different evaluation standards, which makes it hard to:

(1). study relations among them
(2). compare their performance fairly

Page 4

(1). Different testbeds

Dataset: COREL [Muller 03], TRECVid

-”Every evaluation is done on a different image subset thus making comparison impossible”

Groundtruth

manually judged [Kim 03], TRECVid: pure human labeling

auto-judged [Rui 98] [Porkaew 99]: MARS used as the reference system

semi-auto-judged [Liu 01]: MSRA MiAlbum, system-assisted human labeling

Motivation

Page 5

(2). Different methodology

System-oriented vs. user-oriented

User-oriented methods are not ideal for comparison since user experience varies from person to person and from time to time.

Normalized rank vs. non-normalized rank

Rank normalization is generally accepted in text IR [Williamson 78], but not in MMIR.

Motivation

Page 6

Problem 1. It is hard to study the relations among RF approaches.

Contribution 1. Briefly summarize RF algorithms according to how they implement multi-point query retrieval, so that each approach can be treated as a special case of the same framework and their intrinsic relations can be studied.

Problem 2. It is hard to compare RF performance fairly.

Contribution 2. Critique the PE practice in the listed works, demonstrate how to fairly compare three typical RF approaches on large-scale testbeds (both text and image), and show that improper PE methodology can lead to different conclusions.

Problems & Contributions

Page 7

Where are we?

Motivation & Contributions
RF (Relevance Feedback) in MMIR
PE (Performance Evaluation) Problems
Rank Normalization
Experimental Results
Conclusions

Page 8

RF in MMIR (framework)

Both documents and queries can be abstracted as points in some space.

[Figure: relevant (rel-doc) and irrelevant (irel-doc) documents shown as points in the space]

General RF model in distance-based IR

Page 9

RF in MMIR (framework)

The distance between each pair of points is defined by some distance function D(q,d) (assume D is a metric).


General RF model in distance-based IR

Page 10

RF in MMIR (framework)

Retrieval can be interpreted as retrieving the document points in the neighborhood of the query point (nearest-neighbor search).


General RF model in distance-based IR

Page 11

RF in MMIR (framework)

RF can be interpreted as a process that moves and reshapes the query region so that it fits the region of the space the user is interested in.


General RF model in distance-based IR

Page 12

RF in MMIR (framework)

General RF model in distance-based IR

[Diagram: Query Set → Search Engine → Results, with Feedback Examples fed back into the Query Set]

Feedback examples are used to modify the query points in the query set. The search engine can handle multiple query points, so the search results change as the query set changes.

Page 13

RF in MMIR (framework)

General RF model in distance-based IR


In the above discussion, D handles only a single query point. We need to extend D(q,d) to D'(Q,d) so that it can handle a query set Q.

Two possible solutions are: Combine Queries & Combine Distances.

Page 14

RF in MMIR (framework)

General RF model in distance-based IR


Assumptions:

(1). Our focus in RF research is on how to handle multiple query points, i.e., given D, how to construct D'.

(2). Assume the retrieval result is presented as a ranked list.

(3). Users select feedback examples from the retrieval result.

Page 15

RF in MMIR (framework)

Combine Queries Approach

[Diagram: Query Set → single synthetic query → Search Engine → Results]

A single query point is generated from the query set by some algorithm, and the synthetic query is then issued to the search engine.

Page 16

RF in MMIR (framework)

Combine Queries Approach

D'(Q,d) = D(f(Q), d)

f is a function that maps Q to a single query point q, e.g.

f(Q) = (1/|Q|) * Σ_{qi ∈ Q} wi * qi

where qi is a query point in the query set and wi is its corresponding weight.

[Figure: query points q1–q4 combined into a single synthetic query q, which is compared against document d]

Page 17

RF in MMIR (framework)

Combine Queries Approach

Combine-queries feedback modifies the query region by the following mechanisms:

(1). Move the query center by f so that query region is moved.

(2). Modify distance function D so that the query region is reshaped.

Usually the distance function is defined as a squared distance D(q,d) = (q−d)^T M (q−d), where M is the distance matrix.

• Query-point-movement (QPM) [Rocchio 71]: M is an identity matrix

• Re-weighting (Standard Deviation Approach) [Rui 98]: M is a diagonal matrix

• MindReader [Ishikawa 98]: M is a symmetric matrix where det(M)=1
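To make the three special cases above concrete, here is a minimal sketch (not code from the paper; the function names and numeric values are illustrative assumptions) of the quadratic-form distance with an identity, diagonal, and full symmetric M, together with the weighted-centroid combination f(Q) from the previous slide:

    import numpy as np

    def quad_distance(q, d, M):
        """Quadratic-form distance D(q, d) = (q - d)^T M (q - d)."""
        diff = q - d
        return float(diff @ M @ diff)

    def combine_queries(Q, w):
        """Combine-queries feedback: f(Q) = (1/|Q|) * sum_i w_i * q_i."""
        Q, w = np.asarray(Q, float), np.asarray(w, float)
        return (w[:, None] * Q).sum(axis=0) / len(Q)

    Q = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])  # feedback query points (illustrative)
    w = np.array([1.0, 1.0, 1.0])                       # per-point weights
    d = np.array([2.0, 2.5])                            # one document point
    q = combine_queries(Q, w)                           # single synthetic query

    M_qpm = np.eye(2)                      # QPM [Rocchio 71]: identity M -> circular region
    M_rw  = np.diag([2.0, 0.5])            # re-weighting [Rui 98]: diagonal M -> axis-aligned ellipse
    M_mr  = np.array([[2.0, 0.9],
                      [0.9, 1.0]])         # MindReader [Ishikawa 98]: full symmetric M ...
    M_mr /= np.sqrt(np.linalg.det(M_mr))   # ... rescaled here so that det(M) = 1

    for name, M in [("QPM", M_qpm), ("re-weighting", M_rw), ("MindReader", M_mr)]:
        print(name, quad_distance(q, d, M))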

Page 18

RF in MMIR (framework)

Combine Distances Approach

[Diagram: Query Set → Search Engine → Mid-results → Fusion → Results]

Each query point is issued as a separate search. The mid-results are then combined in post-processing with some merging algorithm.

Page 19

RF in MMIR (framework)

Combine Distances Approach

The distances are combined using a weighted power mean:

D'(Q,d) = ( (1/|Q|) * Σ_{qi ∈ Q} wi * D(qi,d)^α )^(1/α)

Define D'(Q,d) = 0 if both α < 0 and D(qi,d) = 0 for some i.

[Figure: document d compared against each query point q1–q4 separately]

The movement of the query center and the modification of the distance function are a hidden process.

Page 20

RF in MMIR (framework)

Combine Distances Approach

D'(Q,d) = ( (1/|Q|) * Σ_{qi ∈ Q} wi * D(qi,d)^α )^(1/α)

Query-expansion [Porkaew 99]: α = 1

FALCON [Wu 00]:

α > 0: fuzzy AND merge;
α < 0: fuzzy OR merge
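A minimal sketch (my own illustration, not the authors' code) of this merge: the per-query-point distances are combined by the weighted power mean, with α = 1 reproducing Q-Expansion's arithmetic mean and a negative α giving FALCON's fuzzy OR behaviour, including the zero-distance rule defined on the previous slide. The distance values are illustrative.

    import numpy as np

    def combine_distances(dists, weights, alpha):
        """Weighted power mean:
        D'(Q, d) = ((1/|Q|) * sum_i w_i * D(q_i, d)^alpha)^(1/alpha).
        For alpha < 0, D'(Q, d) is defined as 0 whenever some D(q_i, d) = 0."""
        dists = np.asarray(dists, float)
        weights = np.asarray(weights, float)
        if alpha < 0 and np.any(dists == 0):
            return 0.0
        mean = (weights * dists ** alpha).sum() / len(dists)
        return float(mean ** (1.0 / alpha))

    dists = [1.0, 4.0, 9.0]   # D(q_i, d) for three query points (illustrative values)
    w = [1.0, 1.0, 1.0]

    print(combine_distances(dists, w, alpha=1))    # Q-Expansion: arithmetic mean (fuzzy AND family)
    print(combine_distances(dists, w, alpha=-1))   # fuzzy OR: dominated by the closest query point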

Page 21

RF in MMIR (framework)

Mixed Approach

Q-Cluster [Kim 03]

Feedback examples are clustered.

(1). The cluster centers (denoted as a set C) are used for combine-distances feedback (by FALCON's fuzzy OR merge).

(2). Each cluster center ci uses its own distance function Di, trained using MindReader on the query points in cluster i.

D'(Q,d) = ( (1/|C|) * Σ_{ci ∈ C} wi * Di(ci,d)^α )^(1/α)

Extremely complex!
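A rough sketch of the idea only (Q-Cluster's actual adaptive clustering and MindReader training are more involved): assume the feedback examples have already been partitioned into clusters, use the regularized inverse covariance of each cluster as a stand-in for its trained matrix Di (rescaled so det = 1), and merge the per-cluster distances with a fuzzy OR power mean. All names and values here are illustrative assumptions.

    import numpy as np

    def cluster_distance(cluster, d):
        """Quadratic-form distance from a cluster's center; the (regularized) inverse
        covariance of the cluster stands in for a MindReader-style trained matrix."""
        X = np.asarray(cluster, float)
        center = X.mean(axis=0)
        cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # regularize
        M = np.linalg.inv(cov)
        M /= np.linalg.det(M) ** (1.0 / X.shape[1])                # rescale so det(M) = 1
        diff = d - center
        return float(diff @ M @ diff)

    def q_cluster_like(clusters, d, alpha=-1.0):
        """Fuzzy OR merge (negative-alpha power mean, uniform weights) of per-cluster distances."""
        dists = np.array([cluster_distance(c, d) for c in clusters])
        if np.any(dists == 0):
            return 0.0
        return float(np.mean(dists ** alpha) ** (1.0 / alpha))

    clusters = [np.array([[1.0, 1.0], [1.3, 0.9], [0.9, 1.2]]),   # "small size" feedback examples
                np.array([[5.0, 5.0], [5.3, 4.7], [4.8, 5.3]])]   # "large size" feedback examples
    doc = np.array([1.1, 1.0])
    print(q_cluster_like(clusters, doc))   # dominated by the nearer ("small size") cluster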

Page 22

RF in MMIR (example)

An illustrative example

Suppose we have a 2D database of people's weights and heights.

Name   Weight   Height
Steve  120      180
Mike   180      120
…      …        …

[Figure: Steve and Mike plotted as points in the weight–height plane]

Page 23

RF in MMIR (example)

Combine Queries Approaches

Query-point-movement: Rocchio’s method

[Figure: weight–height plane showing the initial query region moving to the new query region]

M is the identity matrix, so the query region is a circle.

Page 24

RF in MMIR (example)

Combine Queries Approaches

Re-weighting: Standard Deviation Approach

[Figure: weight–height plane with a narrow band as the target region]

Band query: we want to find "tall" people, whose height is around 6 feet. The region the user is interested in is a narrow band. No matter how you move the circle, you cannot fit the query region.

Page 25

RF in MMIR (example)

Combine Queries Approaches

[Figure: axis-aligned ellipse fitting the band-shaped target region]

Solution: extend M to be a diagonal matrix, so that the query region is an ellipse aligned to the axes. The larger the variance along an axis, the smaller the weight of that axis.

Re-weighting: Standard Deviation Approach

Page 26

RF in MMIR (example)

Combine Queries Approaches

Diagonal query: we want to find people of "good shape," whose height/weight ratio varies within a range. Since re-weighting can only form ellipses aligned to the axes, it cannot fit this region well.

[Figure: diagonal band in the weight–height plane]

MindReader

Page 27

RF in MMIR (example)

Combine Queries Approaches

Solution: extend M to be a symmetric matrix with det(M) = 1. Now the query region can be an arbitrarily oriented ellipse.

[Figure: rotated ellipse fitting the diagonal band]

MindReader

Page 28

RF in MMIR (example)

Combine Distances Approaches

Triangle query: the region the user is interested in is arbitrarily shaped.

[Figure: triangular target region in the weight–height plane]

Query-expansion in MARS

Page 29

RF in MMIR (example)

Combine Distances Approaches

Solution: implicitly change the distance function.

Query-expansion in MARS

[Figure: combined region formed from the individual query points]

D'(Q,d) = (1/|Q|) * Σ_{qi ∈ Q} wi * D(qi,d)

This is the special case where α = 1 (arithmetic mean).

Page 30

RF in MMIR (example)

Combine Distances Approaches

Disjoint query: the user's region of interest is not contiguous. Suppose the user is interested in two types of people, either small or large.

This is common in a MMDB, where low-level features cannot reflect high-level semantic clusters.

[Figure: two separate target regions in the weight–height plane]

FALCON

Page 31

RF in MMIR (example)

Combine Distances Approaches

Solution: use small circles (not necessarily circles) around each query point and combine them into a non-contiguous region.

FALCON

α is negative:

D'(Q,d) = ( (1/|Q|) * Σ_{qi ∈ Q} wi * D(qi,d)^α )^(1/α)

[Figure: union of small regions around each query point covering both target regions]

Page 32

RF in MMIR (example)

Mixed Approach

Idea: use small ellipses combined into a non-contiguous region. Use MindReader to construct each ellipse and FALCON's fuzzy OR merge to combine them.

[Figure: union of small ellipses covering the two target regions]

Q-Cluster

Page 33

Where are we?

Motivation & Contributions
RF (Relevance Feedback) in MMIR
PE (Performance Evaluation) Problems
Rank Normalization
Experimental Results
Conclusions

Page 34

PE Problems

There are many kinds of PE problems; we list only a few:

(1). Dataset

(2). Comparison

(3). Impractical parameter settings

We give examples only from the previously listed works, but this does not mean that ONLY these works have PE problems.

Page 35

PE Problems

Dataset Problems

(1). Unverified assumptions in a simulated environment

An algorithm is proposed based on some assumptions.

E.g., re-weighting [Rui 98] requires that ellipse queries exist; MindReader [Ishikawa 98] requires that diagonal queries exist.

In the MARS works [Rui 98] [Porkaew 99], the groundtruth is generated by their own retrieval system with an arbitrary distance function, so an "ellipse" query already exists. It is not astonishing that re-weighting outperforms Rocchio's method in this environment.

We are not arguing that these approaches are not useful, but such PE tells us very little (since the outcome is a high-probability event).

Page 36

Dataset Problems

(2). Real data that is not typical of the application

Small scale (1K images), highly structured (strong linear relations), low dimensional (2D).

E.g., MindReader [Ishikawa 98] is evaluated on a highly structured 2D dataset (the Montgomery County dataset), where the task favors their approach.

MMIR usually employs very high-dimensional features, and only a few dozen examples are available as feedback. In this case it is extremely hard to mine the relations among hundreds of dimensions from so few training examples, and it is risky to learn a more "free" and "powerful" distance function.

It is quite possible that the user's intention is overwhelmed by noise and the wrong knowledge is learned.

PE Problems

Page 37

Comparison Problems

(1). Misused comparison

Analogy: an author proposes a modification to quicksort, but instead of comparing the new sort with quicksort, he compares it to bubble sort.

For example, Q-Cluster is a modification of FALCON's fuzzy OR merge, but [Kim 03] compared it to Q-Expansion and QPM.

It is not astonishing that their approach performs much better, since the COREL database favors any fuzzy-OR-like approach.

PE Problems

Page 38

Comparison Problems

(2). Unfair comparison

How to treat the training samples (feedback examples) in the evaluation?

E.g., FALCON shifts them to the head of the list; Rocchio does not. But any method can do this in post-processing!

Directly comparing approaches that process the feedback examples inconsistently results in an unfair comparison. The FALCON [Wu 00] and Q-Cluster [Kim 03] papers both have this problem.

PE Problems

Page 39

Impractical Parameter Settings

These settings assume a "diligent" user.

(1). Asking the user to judge too many results

[Re-weighting] asks the user to look through the top 1100 retrieved results to find feedback examples.

(2). Asking the user to click/select too many times

[Kim 03] and [Porkaew 99] feed back all relevant images in the top 100 retrieved results.

Remember, COREL has only 100 relevant images for each query. This could explain their conclusion that improvement appears mostly in the FIRST iteration!

(3). Running too many feedback iterations

[Wu 00] runs feedback for over 30 iterations.

PE Problems

Page 40

Where are we?

Motivation & Contributions
RF (Relevance Feedback) in MMIR
PE (Performance Evaluation) Problems
Rank Normalization
Experimental Results
Conclusions

Page 41

Rank Normalization

Rank normalization: re-rank the retrieval result according to the feedback examples.

Although rank normalization is generally accepted in text IR [Williamson 78], it seldom receives enough attention in MMIR.

Rank-shifting: shift the feedback examples to the head of the refined result, even if they do not appear there. Easy to implement; fair for cross-system comparison; unfair for cross-iteration comparison.

Rank-freezing [Williamson 78]: freeze the ranks of the feedback examples during the refinement process. Hard to implement; fair for cross-system comparison; fair for cross-iteration comparison.
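A minimal sketch (not the authors' code) of the two schemes, using the toy ranked lists from the following slides; the integers stand in for document IDs:

    def rank_shift(refined, feedback):
        """Rank-shifting: put the feedback examples at the head of the refined list
        (even if the refined list did not retrieve them); the rest keep refined order."""
        return list(feedback) + [d for d in refined if d not in feedback]

    def rank_freeze(previous, refined, feedback):
        """Rank-freezing: feedback examples keep their ranks from the previous result;
        the remaining positions are filled with non-feedback docs in refined order."""
        frozen = {previous.index(d): d for d in feedback}          # position -> doc
        rest = iter(d for d in refined if d not in feedback)
        return [frozen[i] if i in frozen else next(rest) for i in range(len(refined))]

    previous = [3, 1, 4, 5, 7, 9, 6, 2, 8]   # previous result
    refined  = [2, 3, 8, 4, 5, 9, 1, 6, 7]   # refined result before normalization
    feedback = [3, 4, 6]                     # feedback examples (rel-docs)

    print(rank_shift(refined, feedback))              # [3, 4, 6, 2, 8, 5, 9, 1, 7]
    print(rank_freeze(previous, refined, feedback))   # [3, 2, 4, 8, 5, 9, 6, 1, 7]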

Page 42

Rank Normalization

Previous result: 3 1 4 5 7 9 6 2 8 (feedback examples / rel-docs: 3, 4, 6)

Rank-Shifting

Page 43

Rank Normalization

Previous result: 3 1 4 5 7 9 6 2 8 (rel-docs: 3, 4, 6)

Refined result (before rank-shifting): 2 3 8 4 5 9 1 6 7

Rank-Shifting

Page 44

Rank Normalization

Previous result: 3 1 4 5 7 9 6 2 8 (rel-docs: 3, 4, 6)

Refined result (before rank-shifting): 2 3 8 4 5 9 1 6 7

Refined result (after rank-shifting): the feedback examples 3, 4, 6 are moved to the head first …

Rank-Shifting

Page 45

Rank Normalization

Previous result: 3 1 4 5 7 9 6 2 8 (rel-docs: 3, 4, 6)

Refined result (before rank-shifting): 2 3 8 4 5 9 1 6 7

Refined result (after rank-shifting): 3 4 6 2 8 5 9 1 7

Rank-Shifting

Page 46

Rank Normalization

Previous result: 3 1 4 5 7 9 6 2 8 (rel-docs: 3, 4, 6)

Refined result (before rank-freezing): 2 3 8 4 5 9 1 6 7

Refined result (after rank-freezing): the feedback examples 3, 4, 6 stay frozen at their previous ranks …

Rank-Freezing

Page 47

Rank Normalization

Previous result: 3 1 4 5 7 9 6 2 8 (rel-docs: 3, 4, 6)

Refined result (before rank-freezing): 2 3 8 4 5 9 1 6 7

Refined result (after rank-freezing): 3 2 4 8 5 9 6 1 7

Rank-Freezing

Page 48

Where are we?

Motivation & Contributions
RF (Relevance Feedback) in MMIR
PE (Performance Evaluation) Problems
Rank Normalization
Experimental Results
Conclusions

Page 49

Experimental Results

Testbed

(1). CBIR

Basic CBIR in 3K image DB (COREL)

(2). Text-IR

Lucene in TREC-3

Page 50

Experimental Results

Feedback approaches

(1). QPM, Query-point-movement

(2). AND, FALCON with α=1 (Q-Expansion)

(3). OR, FALCON with α=-1

Rank-normalization

(1). Without rank-normalization

(2). Rank-shifting

(3). Rank-freezing

Page 51

CBIR testbed

(1). OR>QPM=AND (multi-cluster)

(2). Without rank normalization, we would exaggerate OR merge's performance. Rank-shifting would exaggerate the performance improvement across iterations.

[Figure: average precision vs. feedback iteration on the CBIR testbed (values shown: 0.31, 0.23)]

Page 52

TEXT IR testbed

(1). OR=QPM=AND (single-cluster)

(2). Without rank normalization, we would exaggerate OR merge's performance. Rank-shifting would exaggerate the performance improvement across iterations.

(3). OR merge converges more slowly.

[Figure: average precision vs. feedback iteration on the text IR testbed (values shown: 0.46, 0.38)]

Page 53

Where are we?

Motivation & Contributions
RF (Relevance Feedback) in MMIR
PE (Performance Evaluation) Problems
Rank Normalization
Experimental Results
Conclusions

Page 54

Conclusions

(1). For RF approaches

QPM is quite similar to AND merge.

If the relevant documents are scattered across several clusters in the space, OR merge is preferred; if the relevant documents are clustered together, QPM is preferred since it has a smaller computational cost and converges faster.

(2). For rank-normalization issue

If cross-system comparison is needed, either approach can be used.

If cross-iteration comparison is needed, rank-freezing is required.

Page 55

References

[Rui 98] Y Rui and TS Huang and S Mehrotra: Relevance Feedback Techniques in Interactive Content-Based Image Retrieval. Storage and Retrieval for Image and Video Databases (SPIE 1998). (1998) 25-36.

[Ishikawa 98] Y Ishikawa and R Subramanya and C Faloutsos: MindReader: Querying Databases Through Multiple Examples. Proc. of VLDB'98. (1998) 218-227.

[Rocchio 71] JJ Rocchio: Relevance Feedback in Information Retrieval. In G Salton ed.: The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall (1971) 313-323.

[Wu 00] L Wu and C Faloutsos and K Sycara and TR Payne: FALCON: Feedback adaptive loop for content-based retrieval. Proc. of VLDB'00. (2000) 297-306.

[Kim 03] D Kim and C Chung: Qcluster: Relevance Feedback Using Adaptive Clustering for Content-based Image Retrieval. Proc. of ACM SIGMOD'03. (2003) 599-610.

[Porkaew 99] K Porkaew and M Ortega and S Mehrotra: Query reformulation for content based multimedia retrieval in MARS. Proc. of ICMCS'99. (1999) 747-751.

Page 56

References

[Williamson 78] RE Williamson: Does relevance feedback improve document retrieval performance? ACM SIGIR'78. (1978) 151-170.

[Yan 03] R Yan and R Jin and A Hauptmann: Multimedia Search with Pseudo-Relevance Feedback. Proc. of CIVR'03. (2003)

[Westerveld 03] Thijs Westerveld and Arjen P. de Vries: Experimental result analysis for a generative probabilistic image retrieval model. Proc. of SIGIR'03. (2003) 135-142.

[Muller 03] Henning Muller and Stephane Marchand-Maillet and Thierry Pun: The Truth about Corel - Evaluation in Image Retrieval. Proc. of CIVR '02. (2002) 38-49.

[Liu 01] W Liu and Z Su and S Li and Y Sun and H Zhang: Performance Evaluation Protocol for Content-Based Image Retrieval Algorithms/Systems. CVPR Workshop on Empirical Evaluation Methods in Computer Vision. (2001).

Page 57

End

Page 58

Backup slides

Page 59

Experimental Results

Testbed

(1). CBIR

DB: 34 COREL categories (3.4 K images)

Groundtruth: COREL groundtruth (images inside the same category are considered relevant)

Retrieval system: basic CBIR (based on global visual features)

Queries: 6 images randomly selected from each category (204 queries in total)

In [French CIVR'04] we showed that this testbed is a good representative of a much larger 60K COREL testbed for RF experiments.

Page 60

Experimental Results

Testbed

(2). TEXT IR

DB: TREC-3 ad hoc (750K documents)

Groundtruth: TREC's qrels (by pooling)

Retrieval system: Lucene with default settings

Queries: manually created for TREC topics 151-200, one per topic (50 queries in total)

Page 61

Other settings

(1). How many documents (from the top of the rank list) are used for performance evaluation?

For CBIR, 150. For text IR, 1000. Average precision is used as the PE metric.

(2). How many documents (from the top of the rank list) are shown to the user for feedback selection?

150.

(3). How many documents is the user supposed to feed back (in the system-oriented approach), assuming the user makes selections in order?

Up to 8 (if at least 8 are available), in sequential order.
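As a concrete reading of these settings, here is a minimal sketch (assumed details, not the authors' evaluation code) of computing average precision over the top k of a ranked list:

    def average_precision(ranked, relevant, k):
        """Mean of precision@i at each rank i <= k where a relevant document appears,
        normalized by the total number of relevant documents."""
        hits, precision_sum = 0, 0.0
        for i, doc in enumerate(ranked[:k], start=1):
            if doc in relevant:
                hits += 1
                precision_sum += hits / i
        return precision_sum / len(relevant) if relevant else 0.0

    # e.g. the CBIR setting: evaluate the top 150 of the rank list
    print(average_precision([3, 4, 6, 2, 8, 5, 9, 1, 7], relevant={3, 4, 6, 7}, k=150))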

Experimental Results

