Which Vertical Search Engines are Relevant?
Understanding Vertical Relevance Assessments for Web Queries
Ke Zhou1, Ronan Cummins2, Mounia Lalmas3, Joemon M. Jose1
1University of Glasgow 2University of Greenwich 3Yahoo! Labs Barcelona
WWW 2013, Rio de Janeiro
Motivation: Aggregated Search
• Diverse search verticals (image, video, news, etc.) are available on the web.
• Aggregating (embedding) vertical results into "general web" results has become the de facto approach in commercial web search engines.
[Figure: vertical search engines are aggregated into general web search via vertical selection]
Motivation: Evaluation of Aggregated Search
• Evaluation is based solely on vertical selection.
• Compare the system prediction set against the user annotation set (see the sketch below).
• Annotation is gathered
  – Explicitly (by assessing)
  – Implicitly (by deriving it from search logs)
[Figure: an assessor judges the topic "yoga poses" and ranks the systems: System C > System B > System A]
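To make the comparison concrete, here is a minimal sketch of scoring a system's predicted vertical set against a user annotation set; the choice of set precision/recall/F1 and the example query data are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch: comparing a system's predicted vertical set against a
# user annotation set. Set precision/recall/F1 is an illustrative choice,
# not necessarily the measure used in the paper.

def set_precision_recall_f1(predicted: set, annotated: set):
    """Score a predicted set of relevant verticals against user annotations."""
    if not predicted or not annotated:
        return 0.0, 0.0, 0.0
    overlap = len(predicted & annotated)
    precision = overlap / len(predicted)
    recall = overlap / len(annotated)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Hypothetical example for the query "yoga poses":
predicted = {"image", "video"}           # system prediction
annotated = {"image", "video", "wiki"}   # user annotation
print(set_precision_recall_f1(predicted, annotated))  # (1.0, 0.667, 0.8)
```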
Motivation: Assessor — which vertical search engines are relevant?
• The definition of the relevance of a vertical, given a query, remains complex.
  – Different work makes different assumptions.
  – The underlying assumptions made may have a major effect on the evaluation of a SERP.
• We want to understand the different vertical assessment processes and investigate their impact.
Problem and Previous Work: (RQ1) Assumptions — user perspective
• Pre-retrieval:
  – Vertical Orientation: before issuing the query, the user thinks about which verticals might provide better results.
• Post-retrieval:
  – After viewing the search results, the user considers which vertical provides better results.
• Influencing factors
  – Vertical orientation (type preference)
  – Within-vertical ranking
• Serendipity
  – Visual attractiveness
[Figure: pre-retrieval user need vs. post-retrieval user perspective]
Problem and Previous Work: (RQ2) Assumptions — dependency of relevance
• Inter-dependent approach:
  – The quality of verticals is relative: they depend on each other.
• Web-anchor approach:
  – The quality of the "general web" results serves as a reference criterion for deciding relevance (see the sketch below).
• Context
  – Does the context (results returned from other verticals) affect a user's perception of the relevance of the vertical of interest?
• Utility vs. Effort
[Figure: inter-dependent approach vs. web-anchor approach]
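As a toy illustration of the web-anchor assumption: a vertical is judged against the quality of the general-web block rather than against the other verticals. The scoring scale, scores, and margin below are invented purely for the example.

```python
# Toy sketch of the web-anchor assumption: a vertical counts as relevant
# relative to the quality of the "general web" results. The quality scores
# and the margin parameter are hypothetical, purely for illustration.

def web_anchor_relevant(vertical_quality: float, web_quality: float,
                        margin: float = 0.0) -> bool:
    """A vertical is relevant if it is at least as useful as the
    general-web results (optionally by some margin)."""
    return vertical_quality >= web_quality + margin

qualities = {"image": 0.8, "news": 0.4, "shopping": 0.2}
web_quality = 0.5
relevant = {v for v, q in qualities.items()
            if web_anchor_relevant(q, web_quality)}
print(relevant)  # {'image'}
```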
Problem and Previous Work: (RQ3) Assumptions — assessment grade
• Binary (pairwise) preference
• Multi-grade preference
• SERP (three possible slots)
  – ToP: top of the page
  – MoP: middle of the page
  – BoP: bottom of the page
  – NS: not shown
• Is the binary (pairwise) preference information provided by a population of users able to predict the "perfect" embedding position of a vertical?
[Figure: binary preference (ToP or NS) vs. multi-grade preference (ToP, MoP, BoP or NS), shown as slots running from the top to the end of the SERP]
Experimental Design: Overview
• Manipulated (Independent) Variables
  – Search Tasks
  – Verticals of Interest
  – User Perspective (Study 1: RQ1)
  – Dependency of Relevance (Study 2: RQ2)
  – Assessment Grade (Study 3: RQ3)
• Dependent Variables (see the sketch below)
  – Inter-assessor Agreement, measured by Fleiss' Kappa (K_F)
  – Vertical Relevance Correlation, measured by Spearman Correlation
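For reference, a minimal sketch of how the two dependent measures can be computed with standard libraries; the toy judgements and scores below are invented, and only the measures themselves (Fleiss' Kappa, Spearman correlation) come from the slides.

```python
# Minimal sketch of the two dependent measures, using standard libraries.
# The toy data is invented; the paper's exact aggregation may differ.
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Inter-assessor agreement: rows = (query, vertical) items, columns = assessors,
# cells = the grade each assessor assigned (e.g. 0 = not relevant, 1 = relevant).
judgements = np.array([
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 0],
])
table, _ = aggregate_raters(judgements)   # items x categories count table
print("Fleiss' kappa:", fleiss_kappa(table))

# Vertical relevance correlation: two relevance scores per vertical,
# obtained under two different assessment conditions.
scores_condition_a = [0.9, 0.4, 0.7, 0.1, 0.5]
scores_condition_b = [0.8, 0.5, 0.6, 0.2, 0.3]
rho, p = spearmanr(scores_condition_a, scores_condition_b)
print("Spearman rho:", rho, "p-value:", p)
```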
Experimental Design: Details
• Crowd-sourced Data Collection
  – We hired crowd-sourced workers on Amazon Mechanical Turk to make the assessments.
• Verticals
  – Cover a variety of 11 verticals employed by three major commercial search engines.
  – Use existing commercial vertical search engines.
• Search Tasks
  – 44 tasks covering a variety of (vertical) intents.
  – Drawn from an existing aggregated search collection (TREC).
• Quality Control (see the sketch below)
  – 4 assessment points for one manipulation.
  – Trap HITs (assessment pages with results for other queries).
  – Trap search tasks (assessment pages with an explicit vertical request).
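A minimal sketch of how trap questions can be used to filter out unreliable workers; the pass threshold and the record layout are assumptions made for illustration.

```python
# Sketch of quality control with trap questions: workers who fail too many
# traps are discarded. The 100% pass threshold and the record layout are
# assumptions for illustration, not taken from the paper.

def reliable_workers(responses, min_trap_accuracy=1.0):
    """responses: list of dicts with keys 'worker', 'is_trap', 'correct'."""
    passed, total = {}, {}
    for r in responses:
        if r["is_trap"]:
            total[r["worker"]] = total.get(r["worker"], 0) + 1
            passed[r["worker"]] = passed.get(r["worker"], 0) + int(r["correct"])
    return {w for w in total if passed[w] / total[w] >= min_trap_accuracy}

responses = [
    {"worker": "w1", "is_trap": True,  "correct": True},
    {"worker": "w2", "is_trap": True,  "correct": False},
    {"worker": "w1", "is_trap": False, "correct": True},
]
print(reliable_workers(responses))  # {'w1'}
```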
Experimental Design: Study 1
• Manipulated variable: User Perspective (RQ1); the remaining variables and the dependent measures are as in the overview above.
Experimental Results — Study 1: pre-retrieval vs. post-retrieval
• Both pre-retrieval and post-retrieval inter-assessor agreement is moderate, and assessors have a similar level of difficulty assessing in both settings.
• Vertical relevance is moderately (but significantly) correlated (0.53) between pre-retrieval and post-retrieval.
• Highly relevant verticals derived pre-retrieval and post-retrieval overlap significantly (see the sketch below).
  – Almost 60% of queries overlap on at least 2 of the 3 top verticals.
• There is a bias towards visually salient verticals in post-retrieval search utility.
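A small sketch of the top-3 overlap statistic reported above; counting the fraction of queries whose top-3 lists share at least 2 verticals is our reading of the slide, and the per-query rankings below are hypothetical.

```python
# Sketch of the top-k vertical overlap between two assessment conditions.
# The counting scheme is our reading of the slide, not a verbatim definition.

def topk_overlap(ranking_a, ranking_b, k=3):
    """Number of verticals shared by the top-k of two rankings."""
    return len(set(ranking_a[:k]) & set(ranking_b[:k]))

# Hypothetical per-query top-3 rankings under the two perspectives:
pre_retrieval  = {"q1": ["image", "video", "wiki"],
                  "q2": ["news", "blog", "image"]}
post_retrieval = {"q1": ["image", "wiki", "news"],
                  "q2": ["shopping", "news", "video"]}

share = sum(topk_overlap(pre_retrieval[q], post_retrieval[q]) >= 2
            for q in pre_retrieval) / len(pre_retrieval)
print(f"{share:.0%} of queries agree on at least 2 of the top-3 verticals")
```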
Experimental Results — Study 1: topical relevance vs. pre-retrieval orientation
• Vertical relevance between pre-retrieval orientation and post-retrieval is moderately correlated (0.53).
• Vertical relevance between topical relevance and post-retrieval is weakly correlated (0.36).
• The impact of pre-retrieval orientation on post-retrieval search utility is greater than that of post-retrieval topical relevance.
[Figure: pre-retrieval orientation and topical relevance, the latter computed from item-level R/N judgements as nDCG(v_i) and nDCG(w), both feeding into post-retrieval search utility; see the sketch below]
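To ground the nDCG(v_i) vs. nDCG(w) labels in the figure, here is a minimal nDCG sketch over binary item judgements; the binary gains, example judgements, and cutoff are illustrative assumptions.

```python
# Minimal nDCG sketch for comparing a vertical's topical relevance against
# the general-web results, as in the figure's nDCG(v_i) vs. nDCG(w).
# Binary gains and the rank cutoff are illustrative assumptions.
import math

def dcg(gains):
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains, k=10):
    ideal = sorted(gains, reverse=True)
    denom = dcg(ideal[:k])
    return dcg(gains[:k]) / denom if denom > 0 else 0.0

# Hypothetical R/N judgements (1 = relevant, 0 = not) on returned items:
vertical_gains = [1, 0, 1]        # results of vertical v_i
web_gains      = [0, 1, 1, 0, 1]  # "general web" results
print("nDCG(v_i):", ndcg(vertical_gains))
print("nDCG(w):  ", ndcg(web_gains))
```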
Experimental Design: Study 2
• Manipulated variable: Dependency of Relevance (RQ2); the remaining variables and the dependent measures are as in the overview above.
Experimental Results — Study 2 (Dependency of Relevance)
• Inter-assessor agreement is moderate for both approaches, with little difference in agreement between them.
• The vertical relevance correlation between the inter-dependent and web-anchor approaches is moderate (0.573).
• The overlap of the top three relevant verticals between the two approaches is quite high.
  – More than 70% of queries overlap on 2 of the 3 top verticals.
• The web-anchor approach provides a better trade-off between utility and effort.
Experimental Results — Study 2 (Dependency of Relevance), continued
• Not much difference is observed when using different anchors (different observed topical relevance levels).
• Context matters:
  – The context of the other verticals can diminish the utility of a vertical.
  – Examples: ("Answer", "Wiki"), ("Books", "Scholar"), etc.
Experimental Design: Study 3
• Manipulated variable: Assessment Grade (RQ3); the remaining variables and the dependent measures are as in the overview above.
Experimental Results — Study 3 (Assessment Grade)
• Deriving the "perfect" embedding position from multi-graded assessments.
• Thresholding for binary assessment (user-type simulation; see the sketch below):
  – Risk-seeking
  – Risk-medium
  – Risk-averse
[Figure: verticals with majority-preference scores (1.0, 0.75, 0.5, 0.25, 0.0, ...) mapped to SERP slots (ToP, MoP, BoP, NS) under risk-seeking, risk-medium and risk-averse thresholds]
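A sketch of the thresholding step: the slide only names the three user risk types, so the numeric thresholds below are invented for illustration.

```python
# Sketch of simulating user risk types when mapping a vertical's
# majority-preference score onto a SERP slot. The three risk levels come
# from the slide; the numeric thresholds are invented for illustration.

THRESHOLDS = {
    # (min score for ToP, MoP, BoP); anything below the last maps to NS.
    "risk-seeking": (0.50, 0.25, 0.05),
    "risk-medium":  (0.75, 0.50, 0.25),
    "risk-averse":  (0.90, 0.75, 0.50),
}

def slot(score: float, risk: str) -> str:
    top, mop, bop = THRESHOLDS[risk]
    if score >= top:
        return "ToP"
    if score >= mop:
        return "MoP"
    if score >= bop:
        return "BoP"
    return "NS"

# Majority-preference scores like those on the slide:
for s in (1.0, 0.75, 0.5, 0.25, 0.0):
    print(s, {r: slot(s, r) for r in THRESHOLDS})
```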
Experimental Results — Study 3 (Assessment Grade), continued
• Inter-assessor agreement is moderate, and different users have different risk levels.
• Vertical Relevance Correlation
  – Most of the binary approaches correlate significantly with the multi-graded ground truth; however, the correlations are mostly modest.
  – The risk-medium thresholding approach performs best.
Final take-out
• Study 1
  – Assessing for aggregated search is difficult.
  – Highly relevant verticals overlap significantly between the pre-retrieval and post-retrieval user perspectives.
  – Vertical (type) orientation is more important than topical relevance.
• Study 2
  – The anchor-based approach might be better than the inter-dependent approach with respect to the utility-effort trade-off.
  – Context matters.
• Study 3
  – The binary approach can be used to determine the "perfect" embedding position of the verticals, and it performs relatively well without requiring many assessments.
Conclusions
• We compared different vertical relevance assessment processes and analyzed their impact.
• Our work has implications for "how" and "what" evaluation design decisions affect the actual evaluation.
• This work also creates a need to re-interpret previous evaluation efforts in this area.
Questions?
• Thanks!