
Overview of the NTCIR-13 OpenLiveQ Task

Makoto P. Kato (Kyoto University) [email protected]

Takehiro Yamamoto (Kyoto University) [email protected]

Tomohiro Manabe (Yahoo Japan Corporation) [email protected]

Akiomi Nishida (Yahoo Japan Corporation) [email protected]

Sumio Fujita (Yahoo Japan Corporation) [email protected]

ABSTRACT

This is an overview of the NTCIR-13 OpenLiveQ task. This task aims to provide an open live test environment of Yahoo Japan Corporation's community question-answering service (Yahoo! Chiebukuro) for question retrieval systems. The task was simply defined as follows: given a query and a set of questions with their answers, return a ranked list of questions. Submitted runs were evaluated both offline and online. In the online evaluation, we employed optimized multileaving, a multileaving method that showed higher efficiency than the other methods in our preliminary experiment. We describe the details of the task, data, and evaluation methods, and then report official results at NTCIR-13 OpenLiveQ.

1. INTRODUCTION

Community Question Answering (cQA) services are Internet services in which users can ask a question and obtain answers from other users. Users can obtain information relevant to their search intents not only by asking questions in cQA, but also by searching for questions that are similar to their intents. Finding answers to questions similar to a search intent is an important information-seeking strategy, especially when the search intent is very specific or complicated. While a lot of work has addressed the question retrieval problem [5, 1, 6], there are still several important problems to be tackled:

Ambiguous/underspecified queries: Most of the existing work mainly focused on specific queries. However, many queries used in cQA services are as short as Web search queries, and, accordingly, ambiguous/underspecified. Thus, question retrieval results also need diversification so that users with different intents can be satisfied.

Diverse relevance criteria: The notion of relevance used in traditional evaluation frameworks is usually topical relevance, which can be measured by the degree of match between topics implied by a query and those written in a document. In contrast, real question searchers have a wide range of relevance criteria such as freshness, concreteness, trustworthiness, and conciseness. Thus, traditional relevance assessment may not be able to measure the real performance of question retrieval systems.

In order to address these problems, we propose a new task called Open Live Test for Question Retrieval (OpenLiveQ), which provides an open live test environment of Yahoo! Chiebukuro1 (a Japanese version of Yahoo! Answers) for question retrieval systems. Participants can submit ranked lists of questions for a particular set of queries, and receive evaluation results based on real user feedback. Involving real users in evaluation can solve the problems mentioned above: we can consider the diversity of search intents and relevance criteria by utilizing real queries and feedback from users who are engaged in real search tasks.

Our realistic evaluation framework would bring novel challenges for participants and insights into the gap between evaluation in laboratory settings and that in production environments. More specifically, we expect that (1) participants can propose methods to consider different types of intents behind a query, and to diversify search results so that they can satisfy as many search intents as possible; (2) participants can address the problem of diverse relevance criteria by utilizing several properties of questions; and (3) participants can evaluate their systems with real users in Yahoo! Chiebukuro.

The remainder of the paper is organized as follows. Section 2 describes the OpenLiveQ task in detail. Section 3 introduces the data distributed to OpenLiveQ participants. Section 4 explains the evaluation methodology applied to the OpenLiveQ task. Section 5 reports the official evaluation results, and Section 6 concludes the paper.

2. TASK

The OpenLiveQ task is simply defined as follows: given a query and a set of questions with their answers, return a ranked list of questions. The task consists of three phases:

1. Offline Training Phase: Participants are given training data including a list of queries, a set of questions for each query, and clickthrough data (see Section 3 for details). They can develop and tune their question retrieval systems based on the training data.

2. Offline Test Phase: Participants are given only a list of queries and a set of questions for each query. They are required to submit a ranked list of questions for each query by a deadline. We evaluate submitted results by using graded relevance for each question, and decide which question retrieval systems can be evaluated in the online test phase.

1 http://chiebukuro.yahoo.co.jp/


3. Online Test Phase: Selected question retrieval systems are evaluated in a production environment of Yahoo Japan Corporation. Multileaved comparison methods [4] are used in the online evaluation.

As the open live test was conducted on a Japanese service, the language scope is limited to Japanese. To lower this barrier, we supported participants by providing a tool for feature extraction so that Japanese NLP is not required for participation.

3. DATA

This section explains the data used in the OpenLiveQ task.

3.1 Queries

We sampled 2,000 queries from a Yahoo! Chiebukuro search query log, and used 1,000 queries for training and the rest for testing. Before sampling the queries from the query log, we applied several filtering rules to remove queries that were not desirable for the OpenLiveQ task.

First, we filtered out time-sensitive queries. Participants were given a fixed set of questions for each query, and were requested to submit a ranked list of those questions. Since the frozen set of questions was presented to real users during the online evaluation phase, it was not desirable that the relevance of each question changed over time. Thus, we filtered out queries that were highly time-sensitive, and only used the time-insensitive queries in the OpenLiveQ task.

The procedure for filtering out the time-sensitive queries is as follows. Letting n^q_recent be the number of questions for query q posted from July 16th, 2017 to September 16th, 2017, and n^q_past be the number of questions of q from January 1st, 2013 to July 15th, 2017, we removed queries such that n^q_recent / n^q_past > 1.0 as time-sensitive queries.
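To make the ratio rule concrete, here is a minimal sketch of such a filter, assuming the per-query question counts for the two time windows have already been extracted; the function and field names are illustrative and not part of the released data or tools.

from typing import Dict, List

def remove_time_sensitive(queries: Dict[str, Dict[str, int]],
                          threshold: float = 1.0) -> List[str]:
    """Keep only queries whose recent-to-past question ratio is <= threshold.

    `queries` maps each query string to counts such as
    {"n_recent": ..., "n_past": ...}, where n_recent counts questions posted
    in the recent window and n_past counts questions in the older window,
    as defined in Section 3.1.
    """
    kept = []
    for q, counts in queries.items():
        n_recent = counts["n_recent"]
        n_past = counts["n_past"]
        # Guard against division by zero: a query with no past questions but
        # some recent ones is treated as time-sensitive and dropped.
        if n_past == 0:
            if n_recent == 0:
                kept.append(q)
            continue
        if n_recent / n_past <= threshold:
            kept.append(q)
    return kept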

Second, we filtered out porn queries. We judged a query to be a porn query if more than 10% of the questions retrieved by the query belonged to the porn-related category of Yahoo! Chiebukuro.

After removing time-sensitive and porn-related queries, we further manually removed queries that were related to ethical, discrimination, or privacy issues. The organizers checked each of the queries and its questions, and filtered out a query if at least one of the organizers judged that it involved any of the issues above.

Finally, we sampled 2,000 queries from the remaining queries and used them in the OpenLiveQ task.

3.2 Questions

We input each query to the Yahoo! Chiebukuro search system as of December 1-9, 2016, recorded the top 1,000 questions, and used them as the questions to be ranked. Information about all the questions as of December 1-9, 2016 was distributed to the OpenLiveQ participants, and includes:

• Query ID (a query by which the question was retrieved),

• Rank of the question in a Yahoo! Chiebukuro search result for the query of Query ID,

• Question ID,

• Title of the question,

• Snippet of the question in a search result,

• Status of the question (accepting answers, accepting votes, or solved),

• Last update time of the question,

• Number of answers for the question,

• Number of page views of the question,

• Category of the question,

• Body of the question, and

• Body of the best answer for the question.

The total number of questions is 1,967,274. As was mentioned earlier, participants were required to submit a ranked list of those questions for each test query.

3.3 Clickthrough Data

Clickthrough data are available for some of the questions. Based on the clickthrough data, one can estimate the click probability of the questions, and understand what kinds of users click on a certain question. The clickthrough data were collected from August 24, 2016 to November 23, 2016.

The clickthrough data include

• Query ID (a query by which the question was retrieved),

• Question ID,

• Most frequent rank of the question in a Yahoo! Chiebukuro search result for the query of Query ID,

• Clickthrough rate,

• Fraction of male users among those who clicked on the question,

• Fraction of female users among those who clicked on the question,

• Fraction of users under 10 years old among those who clicked on the question,

• Fraction of users in their 10s among those who clicked on the question,

• Fraction of users in their 20s among those who clicked on the question,

• Fraction of users in their 30s among those who clicked on the question,

• Fraction of users in their 40s among those who clicked on the question,

• Fraction of users in their 50s among those who clicked on the question, and

• Fraction of users over 60 years old among those who clicked on the question.

The clickthrough data contain click statistics of a question identified by Question ID when a query identified by Query ID was submitted. The rank of the question can change even for the same query. This is why the third value indicates the most frequent rank of the question. The number of query-question pairs in the clickthrough data is 440,163.
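As one illustration of how these records might be turned into a ranking feature, the sketch below defines an assumed record type for the fields listed above and builds a (Query ID, Question ID) → clickthrough-rate lookup; the class and field names are my own and do not describe the official file format.

from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple

@dataclass
class ClickthroughRecord:
    """One row of the clickthrough data listed above (field names are illustrative)."""
    query_id: str
    question_id: str
    most_frequent_rank: int
    ctr: float                               # clickthrough rate
    male_fraction: float = 0.0
    female_fraction: float = 0.0
    age_fractions: Dict[str, float] = field(default_factory=dict)

def ctr_table(records: Iterable[ClickthroughRecord]) -> Dict[Tuple[str, str], float]:
    """Build a (query_id, question_id) -> CTR lookup for use as a ranking feature."""
    return {(r.query_id, r.question_id): r.ctr for r in records}

# Usage: ctr_table(records).get((query_id, question_id), 0.0) falls back to 0.0
# for query-question pairs that do not appear in the clickthrough data.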


Figure 1: Screenshot of the relevance judgment system.

4. EVALUATION

This section introduces the offline evaluation, in which runs were evaluated with relevance judgment data, and the online evaluation, in which runs were evaluated with real users by means of multileaving.

4.1 Offline Evaluation

The offline test is carried out before the online test explained later, and determines the participants whose systems are evaluated in the online test, based on the offline test results. The offline evaluation was conducted in a similar way to traditional ad-hoc retrieval tasks, in which results are evaluated by relevance judgments and evaluation metrics such as nDCG (normalized discounted cumulative gain), ERR (expected reciprocal rank), and Q-measure. During the offline test period, participants could submit their results once per day through our Web site2, and obtain evaluation results right after the submission.

To simulate the online test in the offline test, we conducted relevance judgments with the following instruction: "Suppose you input <query> and received a set of questions as shown below. Please select all the questions on which you want to click". Assessors were not presented with the full content of each question, and were requested to evaluate questions on a page similar to the real SERP in Yahoo! Chiebukuro. This type of relevance judgment is different from traditional ones, and is expected to yield results similar to those of the online test. Five assessors were assigned to each query, and the relevance grade of each question was estimated as the number of assessors who selected the question in the relevance judgment. For example, the relevance grade was 2 if two out of five assessors marked a question. We used Lancers3, a Japanese crowd-sourcing service, for the relevance judgment. Figure 1 shows a screenshot of the system used for the relevance judgment, where assessors can click on either the title or the blue square for voting.

Only the nDCG@10 score for each submitted run was displayed on our website. The top 10 teams in terms of nDCG@10 were invited to the online evaluation.
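Since the relevance grades are vote counts between 0 and 5, nDCG@10 can be computed directly from them. A minimal sketch follows; the linear gain and log2 discount are a common textbook definition and an assumption here, because the paper does not spell out its exact nDCG variant.

import math
from typing import Dict, List

def dcg(grades: List[int], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k graded items (1-based log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))

def ndcg_at_10(ranked_question_ids: List[str],
               grade_by_question: Dict[str, int]) -> float:
    """nDCG@10 for one query: submitted ranking vs. ideal ranking of the same grades."""
    grades = [grade_by_question.get(q, 0) for q in ranked_question_ids]
    ideal = sorted(grade_by_question.values(), reverse=True)
    ideal_dcg = dcg(ideal)
    return dcg(grades) / ideal_dcg if ideal_dcg > 0 else 0.0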

2 http://www.openliveq.net/
3 http://www.lancers.jp/

Algorithm 1: Optimized Multileaving (OM)
Require: input rankings I, number of output rankings m, and number of items in each output ranking l

O ← {}
for k = 1, ..., m do
    for i = 1, ..., l do
        select j randomly
        r ← 1
        while I_{j,r} ∈ O_k do
            r ← r + 1
        end while
        if r ≤ |I_j| then
            O_{k,i} ← I_{j,r}
        end if
    end for
    O ← O ∪ {O_k}
end for
return O
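A small executable sketch of the construction in Algorithm 1, with rankings represented as Python lists of item IDs; the seeding and the handling of an exhausted input ranking (the slot is simply left empty) are my reading of the listing, not part of the official implementation.

import random
from typing import List, Sequence

def optimized_multileave_rankings(inputs: Sequence[Sequence[str]],
                                  m: int, l: int,
                                  seed: int = 0) -> List[List[str]]:
    """Construct m output rankings of (at most) l items from the input rankings,
    following the sampling scheme of Algorithm 1."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(m):
        combined: List[str] = []
        for _ in range(l):
            j = rng.randrange(len(inputs))   # select an input ranking at random
            r = 0
            # skip items already placed in the combined ranking
            while r < len(inputs[j]) and inputs[j][r] in combined:
                r += 1
            if r < len(inputs[j]):
                combined.append(inputs[j][r])
            # if the chosen input ranking is exhausted, this slot stays empty,
            # mirroring Algorithm 1, where O_{k,i} is set only when r <= |I_j|
        outputs.append(combined)
    return outputs

Calling it with the n submitted rankings for a query and, for example, l = 10 yields candidate combined rankings of up to ten questions each.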


4.2 Online Evaluation

Submitted results were evaluated by multileaving [4]. Ranked lists of questions were combined into a single SERP, presented to real users during the online test period, and evaluated on the basis of the observed clicks. Based on our preliminary experiment [2], we opted to use optimized multileaving (OM) in the multileaved comparison. Results submitted in the offline test period were used as-is in the online test. Note that some questions could be excluded in the online test if they were deleted for some reason before or during the online test.

A multileaving method takes a set of rankings I = {I_1, I_2, ..., I_n} and returns a set of combined rankings O = {O_1, O_2, ..., O_m}, where each combined ranking O_k consists of l items. The i-th items of an input ranking I_j and an output ranking O_k are denoted by I_{j,i} and O_{k,i}, respectively. When a user issues a query, we return an output ranking O_k with probability p_k and observe user clicks on O_k. If O_{k,i} is clicked by the user, we give a credit δ(O_{k,i}, I_j) to each input ranking I_j. Each multileaving method consists of a way to construct O from I, a probability p_k for each output ranking O_k, and a credit function δ. The original multileaving methods decide which input ranking is better for each input ranking pair every time an output ranking is presented, whereas we opt to accumulate the credits through all the presentations and to measure the effectiveness of each input ranking on the basis of the sum of the credits, mainly because this approach should provide more informative evaluation results.
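The bookkeeping described in this paragraph can be pictured with the sketch below; the credit function δ is passed in as a callable (the concrete δ and f used by OM are given further down), and representing the click log as (presented output ranking, clicked position) pairs is a simplification for illustration.

from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

# delta(clicked_rank, clicked_item, input_ranking) -> credit for that input ranking
CreditFn = Callable[[int, str, Sequence[str]], float]

def accumulate_credits(inputs: Sequence[Sequence[str]],
                       clicks: List[Tuple[Sequence[str], int]],
                       delta: CreditFn) -> Dict[int, float]:
    """Sum credits per input ranking over all presentations, as described above.

    `clicks` holds (presented output ranking, 0-based clicked position) pairs.
    The input ranking with the largest accumulated credit is judged best.
    """
    credit_sum: Dict[int, float] = defaultdict(float)
    for output, pos in clicks:
        item = output[pos]
        for j, input_ranking in enumerate(inputs):
            credit_sum[j] += delta(pos + 1, item, input_ranking)  # ranks are 1-based
    return dict(credit_sum)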

OM [4] is a multileaving method that generates output rankings by Algorithm 1, and computes the presentation probabilities p_k that maximize the sensitivity of the output rankings while ensuring no bias. The sensitivity of an output ranking is its power to discriminate effectiveness differences between input rankings when the output ranking is presented to users. Intuitively, the sensitivity is high if random clicks on an output ranking give a similar amount of credit to each input ranking. High sensitivity is desirable as it leads to fast convergence of evaluation results. The bias of output rankings measures the difference between the expected credits of input rankings for random clicks. If the bias is high, a certain input ranking can be considered better than the others even if only random clicks are given. Thus, multileaving methods should reduce the bias as much as possible.



The sensitivity can be maximized by minimizing insensitivity, defined as the variance of credits given by an output ranking O_k through rank-dependent random clicks [4]:

\sigma_k^2 = \sum_{j=1}^{n} \left( \left( \sum_{i=1}^{l} f(i)\, \delta(O_{k,i}, I_j) \right) - \mu_k \right)^2,   (1)

where f(i) is the probability with which a user clicks on the i-th item. We follow the original work [4] and use f(i) = 1/i, and δ(O_{k,i}, I_j) = 1/i if O_{k,i} ∈ I_j and 1/(|I_j| + 1) otherwise. The mean credit μ_k of the output ranking O_k is computed as \mu_k = (1/n) \sum_{j=1}^{n} \sum_{i=1}^{l} f(i)\, \delta(O_{k,i}, I_j). Since each output ranking O_k is presented to users with probability p_k, OM should minimize the expected insensitivity \mathrm{E}[\sigma_k^2] = \sum_{k=1}^{m} p_k \sigma_k^2.
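A short sketch of Eq. (1) for a single output ranking, using f(i) = 1/i and the credit function exactly as defined above (ranks are 1-based; membership is tested against the items of the input ranking).

from typing import Sequence

def credit(i: int, item: str, input_ranking: Sequence[str]) -> float:
    """delta(O_{k,i}, I_j) as defined above: 1/i if the item occurs in I_j,
    otherwise 1/(|I_j| + 1). The rank i is 1-based and refers to O_k."""
    return 1.0 / i if item in input_ranking else 1.0 / (len(input_ranking) + 1)

def insensitivity(output: Sequence[str], inputs: Sequence[Sequence[str]]) -> float:
    """sigma_k^2 of Eq. (1) for one output ranking O_k under f(i) = 1/i."""
    totals = []
    for input_ranking in inputs:
        total = sum((1.0 / i) * credit(i, item, input_ranking)
                    for i, item in enumerate(output, start=1))
        totals.append(total)
    mu = sum(totals) / len(totals)   # mean credit over the n input rankings
    return sum((t - mu) ** 2 for t in totals)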

The bias of output rankings O is defined via the difference between the expected credits of the input rankings. The expected credit of an input ranking I_j given rank-independent random clicks on the top-1, top-2, ..., top-r items is defined as follows:

\mathrm{E}[g(I_j, r)] = \sum_{k=1}^{m} p_k \sum_{i=1}^{r} \delta(O_{k,i}, I_j).   (2)

If the expected credits of the input rankings differ, the evaluation results obtained by multileaving are biased. Thus, the original version of OM imposes the constraint that the expected credits of all the input rankings must be the same, i.e. (\forall r, \exists c_r, \forall j)\; \mathrm{E}[g(I_j, r)] = c_r.

According to the paper by Schuth et al. [4] and their publicly available implementation4, their version of OM first tries to satisfy the constraint that makes the bias zero, and then maximizes the sensitivity given that the bias constraint is satisfied. However, we found that the bias constraint could not be satisfied in more than 90% of the cases in our experiment, i.e. we could not find any solution that satisfied the bias constraint. Hence, we propose using a more practical implementation of OM that minimizes a linear combination of the sum of biases and the expected insensitivity as follows:

\min_{p_k}\; \alpha \sum_{r=1}^{l} \lambda_r + \sum_{k=1}^{m} p_k \sigma_k^2   (3)

subject to \sum_{k=1}^{m} p_k = 1, 0 \le p_k \le 1\; (k = 1, \ldots, m), and

\forall r, \forall j, j',\; -\lambda_r \le \mathrm{E}[g(I_j, r)] - \mathrm{E}[g(I_{j'}, r)] \le \lambda_r,

where α is a hyper-parameter that controls the balance between the bias and the insensitivity, and λ_r is the maximum difference of the expected credits over all pairs of input rankings. If λ_r is close to zero, the expected credits of the input rankings are close, and accordingly, the bias becomes small. Our implementation is publicly available5.
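Because the objective and all constraints in Eq. (3) are linear in p_k and λ_r, the problem can be solved as a linear program. The sketch below uses scipy.optimize.linprog and reuses the credit helper from the earlier sketch; it only illustrates the formulation and is not the organizers' released implementation (see footnote 5).

import itertools
import numpy as np
from scipy.optimize import linprog

def solve_presentation_probabilities(outputs, inputs, sigma2, alpha=1.0):
    """Solve the linear program of Eq. (3) for the presentation probabilities p_k.

    outputs: list of m output rankings (lists of item IDs)
    inputs:  list of n input rankings
    sigma2:  list of m insensitivity values (Eq. (1)), precomputed
    Decision variables are x = [p_1..p_m, lambda_1..lambda_l].
    """
    m, n = len(outputs), len(inputs)
    l = max(len(o) for o in outputs)

    # prefix[k, j, r-1] = sum_{i<=r} delta(O_{k,i}, I_j): per-ranking credit prefix sums
    prefix = np.zeros((m, n, l))
    for k, out in enumerate(outputs):
        for j, inp in enumerate(inputs):
            acc = 0.0
            for i, item in enumerate(out, start=1):
                acc += credit(i, item, inp)
                prefix[k, j, i - 1] = acc
            prefix[k, j, len(out):] = acc  # pad shorter outputs with their final value

    # Objective: sum_k p_k * sigma2_k + alpha * sum_r lambda_r
    cost = np.concatenate([np.asarray(sigma2, dtype=float), np.full(l, alpha)])

    # Bias constraints: for all r and ordered pairs (j, j'),
    # sum_k p_k (prefix[k,j,r] - prefix[k,j',r]) - lambda_r <= 0
    rows, rhs = [], []
    for r in range(l):
        for j, jp in itertools.permutations(range(n), 2):
            row = np.zeros(m + l)
            row[:m] = prefix[:, j, r] - prefix[:, jp, r]
            row[m + r] = -1.0
            rows.append(row)
            rhs.append(0.0)

    res = linprog(
        cost,
        A_ub=np.array(rows), b_ub=np.array(rhs),
        A_eq=np.array([[1.0] * m + [0.0] * l]), b_eq=[1.0],
        bounds=[(0.0, 1.0)] * m + [(0.0, None)] * l,
        method="highs",
    )
    return res.x[:m]  # the presentation probabilities p_k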

The online evaluation at Yahoo! Chiebukuro was conducted between May 9, 2017 and August 8, 2017. The total number of impressions used for the online evaluation is 410,812.

4 https://github.com/djoerd/mirex
5 https://github.com/mpkato/interleaving

Figure 3: Cumulated credits in the online evaluation (x-axis: runs, including Baseline (as is), Baseline (#Ans), SLOLQ, OKSAT, KUIDL, TUA1, cdlab, ORG, YJRS, and Erler; y-axis: credit).

5. EVALUATION RESULTS

Figures 2(a), 2(b), and 2(c) show the results of the offline evaluation in terms of nDCG@10, ERR@10, and Q-measure. Baseline runs are indicated in red. The first baseline (ORG 4) produced exactly the same ranked list as that used in production. The other baselines used a linear feature-based model optimized by coordinate ascent [3], with different parameter settings. One of the linear feature-based models (ORG 7) was the best baseline method in terms of all the metrics. The best baseline was outperformed by some teams: OKSAT, YJRS, and cdlab in terms of nDCG@10 and ERR@10, and YJRS and Erler in terms of Q-measure.

Figure 3 shows the cumulated credits in the online evaluation. In the online evaluation, the best run from each team was selected: KUIDL 18, TUA1 19, Erler 22, SLOLQ 54, cdlab 83, YJRS 86, and OKSAT 88. Moreover, we included the best baseline method (ORG 7), the current production system (ORG 4), and a simple baseline that ranks questions by the number of answers obtained (not used in the offline evaluation). YJRS and Erler outperformed the best baseline (ORG), while there is no statistically significant difference between them, as explained later. The online evaluation showed different results from those obtained in the offline evaluation. In particular, OKSAT performed well in the online evaluation, while it showed relatively low performance among the submitted runs in the offline evaluation.

Figure 4 shows the number of run pairs between which t-tests did not find significant differences, where the x-axis indicates the number of days passed. As this is a multiple comparison, p-values were corrected by Bonferroni's method. The multileaving evaluation could find statistically significant differences for most of the run pairs (37/45 = 82.2%) within 10 days. After 20 days had passed, significant differences were found for 41/45 = 91.1% of run pairs. After 64 days, only three run pairs remained insignificant: KUIDL-TUA1, KUIDL-OKSAT, and YJRS-Erler.
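A sketch of the significance testing described here; the paper does not specify the exact samples the t-tests are applied to, so the per-run credit samples below (for example, daily credit increments during the online test) are an assumption.

from itertools import combinations
from typing import Dict, List, Sequence, Tuple
from scipy.stats import ttest_ind

def insignificant_pairs(credit_samples: Dict[str, Sequence[float]],
                        alpha: float = 0.05) -> List[Tuple[str, str]]:
    """Return run pairs whose difference is not significant after Bonferroni correction.

    `credit_samples` maps a run name to repeated credit observations
    (e.g. daily credit increments accumulated during the online test).
    """
    pairs = list(combinations(credit_samples, 2))
    insignificant = []
    for a, b in pairs:
        _, p = ttest_ind(credit_samples[a], credit_samples[b])
        if min(1.0, p * len(pairs)) >= alpha:   # Bonferroni-corrected p-value
            insignificant.append((a, b))
    return insignificant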

6. CONCLUSIONS

In this paper, we described the details of the NTCIR-13 OpenLiveQ task, and reported the official results of the submitted runs.


Figure 2: Offline evaluation. (a) nDCG@10, (b) ERR@10, (c) Q-measure (bar charts: x-axis, run IDs; y-axis, the corresponding metric).

Figure 4: Number of insignificant run pairs as a function of days passed (x-axis: days passed; y-axis: number of insignificant run pairs).

Our findings are summarized as follows:

• There was a large difference between the offline evaluation, where assessors voted for relevant questions, and the online evaluation, where real users voted for relevant questions via clicks.

• The current online evaluation method, multileaving, could seemingly handle a dozen runs, while it could not evaluate a hundred runs within a few months.

• Systems developed by participants could achieve large improvements over the current question retrieval system in terms of the online evaluation, while there is still room for improvement over a strong baseline using learning to rank.

Those findings motivated us to organize the second round of OpenLiveQ (OpenLiveQ-2) at NTCIR-14, in which we will introduce the following changes:

• We will employ a log-based offline evaluation method that turned out to align well with online evaluation results according to our recent study [2].

• We will improve multileaving methods so that they can evaluate a hundred runs within a few months.

7. ACKNOWLEDGMENTS

We thank the NTCIR-13 OpenLiveQ participants for their effort in submitting runs. We appreciate the significant efforts made by Yahoo Japan Corporation in providing valuable search data and an open live test environment.

8. REFERENCES

[1] X. Cao, G. Cong, B. Cui, and C. S. Jensen. A generalized framework of exploring category information for question retrieval in community question answer archives. In WWW, pages 201–210, 2010.



[2] T. Manabe, A. Nishida, M. P. Kato, T. Yamamoto, and S. Fujita. A comparative live evaluation of multileaving methods on a commercial cQA search. In SIGIR, pages 949–952, 2017.

[3] D. Metzler and W. B. Croft. Linear feature-based models for information retrieval. Information Retrieval, 10(3):257–274, 2007.

[4] A. Schuth, F. Sietsma, S. Whiteson, D. Lefortier, and M. de Rijke. Multileaved comparisons for fast online evaluation. In CIKM, pages 71–80, 2014.

[5] K. Wang, Z. Ming, and T.-S. Chua. A syntactic tree matching approach to finding similar questions in community-based QA services. In SIGIR, pages 187–194, 2009.

[6] G. Zhou, Y. Liu, F. Liu, D. Zeng, and J. Zhao. Improving question retrieval in community question answering using world knowledge. In IJCAI, pages 2239–2245, 2013.
