Overview of the NTCIR-13 OpenLiveQ Task
Makoto P. Kato, Kyoto University
Takehiro Yamamoto, Kyoto University
Tomohiro Manabe, Yahoo Japan Corporation, tomanabe@yahoo-corp.jp
Akiomi Nishida, Yahoo Japan Corporation, [email protected]
Sumio Fujita, Yahoo Japan Corporation
ABSTRACT

This is an overview of the NTCIR-13 OpenLiveQ task. This task aims to provide an open live test environment of Yahoo Japan Corporation's community question-answering service (Yahoo! Chiebukuro) for question retrieval systems. The task was simply defined as follows: given a query and a set of questions with their answers, return a ranked list of questions. Submitted runs were evaluated both offline and online. In the online evaluation, we employed optimized multileaving, a multileaving method that showed high efficiency over the other methods in our preliminary experiment. We describe the details of the task, data, and evaluation methods, and then report official results of NTCIR-13 OpenLiveQ.
1. INTRODUCTION

Community Question Answering (cQA) services are Internet services in which users can ask a question and obtain answers from other users. Users can obtain information relevant to their search intents not only by asking questions in cQA, but also by searching for questions that are similar to their intents. Finding answers to questions similar to a search intent is an important information-seeking strategy, especially when the search intent is very specific or complicated. While a lot of work has addressed the question retrieval problem [5, 1, 6], there are still several important problems to be tackled:
Ambiguous/underspecified queries: Most of the existing work mainly focused on specific queries. However, many queries used in cQA services are as short as Web search queries and, accordingly, ambiguous or underspecified. Thus, question retrieval results also need diversification so that users with different intents can be satisfied.
Diverse relevance criteria: The notion of relevance used in traditional evaluation frameworks is usually topical relevance, which can be measured by the degree of match between the topics implied by a query and those expressed in a document. In contrast, real question searchers have a wide range of relevance criteria such as freshness, concreteness, trustworthiness, and conciseness. Thus, traditional relevance assessment may not be able to measure the real performance of question retrieval systems.
In order to address these problems, we propose a new task called Open Live Test for Question Retrieval (OpenLiveQ), which provides an open live test environment of Yahoo! Chiebukuro¹ (a Japanese version of Yahoo! Answers) for question retrieval systems. Participants can submit ranked lists of questions for a particular set of queries, and receive evaluation results based on real user feedback. Involving real users in evaluation can solve the problems mentioned above: we can consider the diversity of search intents and relevance criteria by utilizing real queries and feedback from users who are engaged in real search tasks.
Our realistic evaluation framework would bring novel challenges for participants and insights into the gap between evaluation in laboratory settings and that in production environments. More specifically, we expect that (1) participants can propose methods to consider different types of intents behind a query, and to diversify search results so that they can satisfy as many search intents as possible; (2) participants can address the problem of diverse relevance criteria by utilizing several properties of questions; and (3) participants can evaluate their systems with real users in Yahoo! Chiebukuro.
The remainder of the paper is organized as follows. Section 2 describes the OpenLiveQ task in detail. Section 3 introduces the data distributed to OpenLiveQ participants. Section 4 explains the evaluation methodology applied to the OpenLiveQ task.
2. TASK

The OpenLiveQ task is simply defined as follows: given a query and a set of questions with their answers, return a ranked list of questions. The task consists of three phases:
1. Offline Training Phase: Participants are given training data including a list of queries, a set of questions for each query, and clickthrough data (see Section 3 for details). They can develop and tune their question retrieval systems based on the training data.
2. Offline Test Phase: Participants are given only a list of queries and a set of questions for each query. They are required to submit a ranked list of questions for each query by a deadline. We evaluate the submitted results by using graded relevance for each question, and decide which question retrieval systems can be evaluated in the online test phase.
¹http://chiebukuro.yahoo.co.jp/
Proceedings of the 13th NTCIR Conference on Evaluation of Information Access Technologies, December 5-8, 2017 Tokyo Japan
3. Online Test Phase: Selected question retrieval systems are evaluated in a production environment of Yahoo Japan Corporation. Multileaved comparison methods [4] are used in the online evaluation.
As the open live test was conducted on a Japanese service, the language scope is limited to Japanese. Meanwhile, we supported participants by providing a tool for feature extraction so that Japanese NLP is not required for participation.
3. DATA

This section explains the data used in the OpenLiveQ task.
3.1 Queries

We sampled 2,000 queries from a Yahoo! Chiebukuro search query log, and used 1,000 queries for training and the rest for testing. Before sampling the queries from the query log, we applied several filtering rules to remove queries that were not desirable for the OpenLiveQ task.
First, we filtered out time-sensitive queries. Participants were given a fixed set of questions for each query, and were requested to submit a ranked list of those questions. Since the frozen set of questions was presented to real users during the online evaluation phase, it was not desirable for the relevance of each question to change over time. Thus, we filtered out queries that were highly time-sensitive, and only used the time-insensitive queries in the OpenLiveQ task.
The procedure for filtering out time-sensitive queries is as follows. Letting $n^q_{\mathrm{recent}}$ be the number of questions for query $q$ posted from July 16th, 2017 to September 16th, 2017, and $n^q_{\mathrm{past}}$ be the number of questions of $q$ posted from January 1st, 2013 to July 15th, 2017, we removed as time-sensitive the queries such that $n^q_{\mathrm{recent}} / n^q_{\mathrm{past}} > 1.0$.
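As an illustration, the ratio test above can be sketched in Python (a sketch only; the function name and the per-question date-list input format are our own assumptions, not part of the task definition):

```python
from datetime import date

def is_time_sensitive(post_dates, threshold=1.0):
    """Return True if a query's recent-to-past question ratio exceeds the threshold.

    post_dates: iterable of datetime.date objects, one per question retrieved
    for the query (hypothetical input format; the paper only specifies the two
    date windows and the ratio test n_recent / n_past > 1.0).
    """
    recent_start, recent_end = date(2017, 7, 16), date(2017, 9, 16)
    past_start, past_end = date(2013, 1, 1), date(2017, 7, 15)
    n_recent = sum(recent_start <= d <= recent_end for d in post_dates)
    n_past = sum(past_start <= d <= past_end for d in post_dates)
    if n_past == 0:
        # No historical questions at all: treat the query as time-sensitive.
        return True
    return n_recent / n_past > threshold
```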
Second, we filtered out porn queries. We judged a query to be a porn query if more than 10% of the questions retrieved by the query belonged to the porn-related category of Yahoo! Chiebukuro.
After removing time-sensitive and porn-related queries, we further manually removed queries related to ethical, discrimination, or privacy issues. The organizers checked each query and its questions, and filtered out a query if at least one of the organizers judged that it had any of the issues above.
Finally, we sampled 2,000 queries from the remaining queriesand used them in the OpenLiveQ task.
3.2 Questions

We input each query to the then-current Yahoo! Chiebukuro search system as of December 1-9, 2016, recorded the top 1,000 questions, and used them as the questions to be ranked. Information about all the questions as of December 1-9, 2016 was distributed to the OpenLiveQ participants, and includes:
• Query ID (a query by which the question was retrieved),
• Rank of the question in a Yahoo! Chiebukuro search result for the query of Query ID,
• Question ID,
• Title of the question,
• Snippet of the question in a search result,
• Status of the question (accepting answers, accepting votes, or solved),
• Last update time of the question,
• Number of answers for the question,
• Page view of the question,
• Category of the question,
• Body of the question, and
• Body of the best answer for the question.
The total number of questions is 1,967,274. As mentioned earlier, participants were required to submit a ranked list of those questions for each test query.
3.3 Clickthrough Data

Clickthrough data are available for some of the questions. Based on the clickthrough data, one can estimate the click probability of the questions, and understand what kinds of users click on a certain question. The clickthrough data were collected from August 24, 2016 to November 23, 2016.
The clickthrough data include
• Query ID (a query by which the question was retrieved),
• Question ID,
• Most frequent rank of the question in a Yahoo! Chiebukuro search result for the query of Query ID,
• Clickthrough rate,
• Fraction of male users among those who clicked on the question,
• Fraction of female users among those who clicked on the question,
• Fraction of users under 10 years old among those who clicked on the question,
• Fraction of users in their 10s among those who clicked on the question,
• Fraction of users in their 20s among those who clicked on the question,
• Fraction of users in their 30s among those who clicked on the question,
• Fraction of users in their 40s among those who clicked on the question,
• Fraction of users in their 50s among those who clicked on the question, and
• Fraction of users over 60 years old among those who clicked on the question.
The clickthrough data contain click statistics of a questionidentified by Question ID when a query identified by QueryID was submitted. The rank of the question can change evenfor the same query. This is why the third value indicates themost frequent rank of the question. The number of query-question pairs in the clickthrough data is 440,163.
Figure 1: Screenshot of the relevance judgment system.
4. EVALUATION

This section introduces the offline evaluation, in which runs were evaluated with relevance judgment data, and the online evaluation, in which runs were evaluated with real users by means of multileaving.
4.1 Offline Evaluation

The offline test is carried out before the online test explained later, and determines, based on the offline results, the participants whose systems are evaluated in the online test. The offline evaluation was conducted in a similar way to traditional ad-hoc retrieval tasks, in which results are evaluated by relevance judgments and evaluation metrics such as nDCG (normalized discounted cumulative gain), ERR (expected reciprocal rank), and Q-measure. During the offline test period, participants could submit their results once per day through our Web site², and obtain evaluation results right after the submission.
To simulate the online test in the offline test, we conducted relevance judgments with the following instruction: "Suppose you input <query> and received a set of questions as shown below. Please select all the questions on which you want to click." Assessors were not presented with the full content of each question, and were requested to evaluate questions in a page similar to the real SERP in Yahoo! Chiebukuro. This type of relevance judgment is different from traditional ones, and is expected to yield results similar to those of the online test. Five assessors were assigned to each query, and the relevance grade of each question was estimated as the number of assessors who selected the question in the relevance judgment. For example, the relevance grade was 2 if two out of five assessors marked a question. We used Lancers³, a Japanese crowd-sourcing service, for the relevance judgment. Figure 1 shows a screenshot of the system used for the relevance judgment, where assessors can click on either the title or the blue square for voting.
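The grading scheme, together with an nDCG@10 computation of the kind used in the offline evaluation, can be sketched as follows (a sketch; the gain and discount functions below are common choices, not ones specified by the paper):

```python
import math

def relevance_grade(votes):
    """Relevance grade of a question: the number of the five assessors
    who selected it (votes: list of booleans, one per assessor)."""
    return sum(votes)

def ndcg_at_k(ranked_grades, k=10):
    """nDCG@k over graded relevance, using linear gain and a log2
    rank discount (one standard formulation)."""
    def dcg(grades):
        return sum(g / math.log2(i + 2) for i, g in enumerate(grades[:k]))
    ideal = dcg(sorted(ranked_grades, reverse=True))
    return dcg(ranked_grades) / ideal if ideal > 0 else 0.0
```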
Only the nDCG@10 score of each submitted run was displayed on our website. The top 10 teams in terms of nDCG@10 were invited to the online evaluation.

²http://www.openliveq.net/
³http://www.lancers.jp/

Algorithm 1: Optimized Multileaving (OM)
Require: input rankings I, number of output rankings m, and number of items in each output ranking l
 1: O ← {}
 2: for k = 1, . . . , m do
 3:     for i = 1, . . . , l do
 4:         select j randomly
 5:         r ← 1
 6:         while I_{j,r} ∈ O_k do
 7:             r ← r + 1
 8:         end while
 9:         if r ≤ |I_j| then
10:             O_{k,i} ← I_{j,r}
11:         end if
12:     end for
13:     O ← O ∪ {O_k}
14: end for
15: return O
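The ranking-construction step of Algorithm 1 can be sketched in Python as follows (a minimal sketch under our own function names; the credit computation and the presentation-probability optimization are separate steps):

```python
import random

def om_output_rankings(input_rankings, m, l, seed=0):
    """Construct m multileaved output rankings of up to l items each,
    following Algorithm 1: at each position, pick an input ranking
    uniformly at random and take its highest-ranked item that is not
    yet in the current output ranking."""
    rng = random.Random(seed)
    outputs = []
    for _ in range(m):
        out = []
        for _ in range(l):
            ranking = rng.choice(input_rankings)
            # First item of the chosen ranking not already in the output
            # (if the ranking is exhausted, nothing is added, mirroring
            # the r <= |I_j| guard in the pseudocode).
            for item in ranking:
                if item not in out:
                    out.append(item)
                    break
        outputs.append(out)
    return outputs
```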
4.2 Online Evaluation

Submitted results were evaluated by multileaving [4]. Ranked lists of questions were combined into a single SERP, presented to real users during the online test period, and evaluated on the basis of the observed clicks. Based on our preliminary experiment [2], we opted to use optimized multileaving (OM) in the multileaved comparison. Results submitted in the offline test period were used as-is in the online test. Note that some questions could be excluded in the online test if they were deleted for some reason before or during the online test.
A multileaving method takes a set of rankings I = {I_1, I_2, . . . , I_n} and returns a set of combined rankings O = {O_1, O_2, . . . , O_m}, where each combined ranking O_k consists of l items. The i-th items of an input ranking I_j and an output ranking O_k are denoted by I_{j,i} and O_{k,i}, respectively. When a user issues a query, we return an output ranking O_k with probability p_k and observe user clicks on O_k. If O_{k,i} is clicked by the user, we give a credit δ(O_{k,i}, I_j) to each input ranking I_j. Each multileaving method consists of a way to construct O from I, a probability p_k for each output ranking O_k, and a credit function δ. The original multileaving methods decide which input ranking is better for each input ranking pair every time they present an output ranking, whereas we opted to accumulate the credits through all the presentations and to measure the effectiveness of each input ranking on the basis of the sum of the credits, mainly because this approach should provide more informative evaluation results.
OM [4] is a multileaving method that generates output rankings by Algorithm 1, and computes the presentation probabilities p_k that maximize the sensitivity of the output rankings while ensuring no bias. The sensitivity of an output ranking is its power to discriminate effectiveness differences between input rankings when the output ranking is presented to users. Intuitively, the sensitivity is high if random clicks on an output ranking give a similar amount of credit to each input ranking. High sensitivity is desirable as it leads to fast convergence of evaluation results. The bias of output rankings measures the difference between the expected credits of input rankings for random clicks. If the bias is high, a certain input ranking can be considered better than the others even if only random clicks are given. Thus, multileaving methods should reduce the bias as much as possible.
The sensitivity can be maximized by minimizing the insensitivity, defined as the variance of the credits given by an output ranking O_k through rank-dependent random clicks [4]:

$$\sigma_k^2 = \sum_{j=1}^{n} \left( \left( \sum_{i=1}^{l} f(i)\,\delta(O_{k,i}, I_j) \right) - \mu_k \right)^2, \qquad (1)$$

where f(i) is the probability with which a user clicks on the i-th item. We follow the original work [4] and use f(i) = 1/i, and δ(O_{k,i}, I_j) = 1/i if O_{k,i} ∈ I_j and 1/(|I_j| + 1) otherwise. The mean credit μ_k of the output ranking O_k is computed as $\mu_k = \frac{1}{n}\sum_{j=1}^{n}\sum_{i=1}^{l} f(i)\,\delta(O_{k,i}, I_j)$. Since each output ranking O_k is presented to users with probability p_k, OM should minimize the expected insensitivity $E[\sigma_k^2] = \sum_{k=1}^{m} p_k \sigma_k^2$.
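Taking δ(O_{k,i}, I_j) as the reciprocal of the item's rank within I_j when the item appears there (one common reading of the credit function in the original OM work), the insensitivity of Eq. (1) can be computed as follows (a sketch under our own function names):

```python
def credit(item, ranking):
    """delta(O_{k,i}, I_j): reciprocal rank of the item in the input
    ranking, or 1/(|I_j| + 1) if the item does not appear in it."""
    if item in ranking:
        return 1.0 / (ranking.index(item) + 1)
    return 1.0 / (len(ranking) + 1)

def insensitivity(output, input_rankings):
    """sigma_k^2 from Eq. (1): variance over input rankings of the total
    credit under rank-dependent random clicks f(i) = 1/i."""
    totals = [
        sum(credit(item, inp) / (i + 1) for i, item in enumerate(output))
        for inp in input_rankings
    ]
    mu = sum(totals) / len(totals)
    return sum((t - mu) ** 2 for t in totals)
```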
The bias of output rankings O is defined via the difference between the expected credits of input rankings. The expected credit of an input ranking I_j given rank-independent random clicks on the top-1, top-2, . . . , and top-r items is defined as follows:

$$E[g(I_j, r)] = \sum_{k=1}^{m} p_k \sum_{i=1}^{r} \delta(O_{k,i}, I_j). \qquad (2)$$

If the expected credits of the input rankings are different, the evaluation results obtained by multileaving are biased. Thus, the original version of OM imposes the constraint that the expected credits of all the input rankings must be the same, i.e. (∀r, ∃c_r, ∀j) E[g(I_j, r)] = c_r.
According to the paper by Schuth et al. [4] and their publicly available implementation⁴, their version of OM first tries to satisfy the constraint that makes the bias zero, and then maximizes the sensitivity given that the bias constraint is satisfied. However, we found that the bias constraint could not be satisfied in more than 90% of the cases in our experiment, i.e. we could not find any solution that satisfied the bias constraint. Hence, we propose a more practical implementation of OM that minimizes a linear combination of the sum of the biases and the insensitivity as follows:
$$\min_{p_k}\; \alpha \sum_{r=1}^{l} \lambda_r + \sum_{k=1}^{m} p_k \sigma_k^2 \qquad (3)$$

subject to $\sum_{k=1}^{m} p_k = 1$, $0 \le p_k \le 1$ $(k = 1, \ldots, m)$, and

$$\forall r, \forall j, j'\colon\; -\lambda_r \le E[g(I_j, r)] - E[g(I_{j'}, r)] \le \lambda_r,$$
where α is a hyper-parameter that controls the balance between the bias and the insensitivity, and λ_r is the maximum difference of the expected credits over any pair of input rankings. If λ_r is close to zero, the expected credits of the input rankings are close, and accordingly, the bias becomes small. Our implementation is publicly available⁵.
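Since both the objective in Eq. (3) and the bias constraints are linear in the variables p_1, . . . , p_m and λ_1, . . . , λ_l, the problem can be solved with an off-the-shelf LP solver. A sketch using SciPy (our own function names and data layout, not the authors' released implementation):

```python
import numpy as np
from scipy.optimize import linprog

def solve_om_probabilities(outputs, inputs, sigma2, alpha=1.0):
    """Solve the relaxed OM objective of Eq. (3) as a linear program.
    Decision variables are [p_1..p_m, lambda_1..lambda_l]."""
    def credit(item, ranking):
        # delta: reciprocal rank in the input ranking, else 1/(|I_j|+1).
        if item in ranking:
            return 1.0 / (ranking.index(item) + 1)
        return 1.0 / (len(ranking) + 1)

    m, n, l = len(outputs), len(inputs), len(outputs[0])
    # cum[j, k, r] = sum_{i <= r} delta(O_{k,i}, I_j)
    cum = np.zeros((n, m, l))
    for j, inp in enumerate(inputs):
        for k, out in enumerate(outputs):
            running = 0.0
            for r, item in enumerate(out):
                running += credit(item, inp)
                cum[j, k, r] = running

    # Objective: sum_k sigma2_k * p_k + alpha * sum_r lambda_r
    c = np.concatenate([np.asarray(sigma2, dtype=float), alpha * np.ones(l)])

    # Bias constraints: for each r and ordered pair (j, j'),
    #   sum_k p_k (cum[j,k,r] - cum[j',k,r]) - lambda_r <= 0
    A_ub, b_ub = [], []
    for r in range(l):
        for j in range(n):
            for j2 in range(n):
                if j == j2:
                    continue
                row = np.zeros(m + l)
                row[:m] = cum[j, :, r] - cum[j2, :, r]
                row[m + r] = -1.0
                A_ub.append(row)
                b_ub.append(0.0)

    A_eq = [np.concatenate([np.ones(m), np.zeros(l)])]  # probabilities sum to 1
    bounds = [(0, 1)] * m + [(0, None)] * l
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=[1.0], bounds=bounds)
    return res.x[:m], res.x[m:]
```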
The online evaluation at Yahoo! Chiebukuro was conducted between May 9, 2017 and August 8, 2017. The total number of impressions used for the online evaluation is 410,812.
⁴https://github.com/djoerd/mirex
⁵https://github.com/mpkato/interleaving
[Figure 3: Cumulated credits in the online evaluation. Bar chart of credits (0 to 20,000) for the runs Baseline (as is), SLOLQ, OKSAT, KUIDL, TUA1, Baseline (#Ans), cdlab, ORG, YJRS, and Erler.]
5. EVALUATION RESULTS

Figures 2(a), 2(b), and 2(c) show the results of the offline evaluation in terms of nDCG@10, ERR@10, and Q-measure. Baseline runs are indicated in red. The first baseline (ORG 4) produced exactly the same ranked list as that used in production. The other baselines used a linear feature-based model optimized by coordinate ascent [3], with different parameter settings. One of the linear feature-based models (ORG 7) was the best baseline method in terms of all the metrics. The best baseline was outperformed by some teams: OKSAT, YJRS, and cdlab in terms of nDCG@10 and ERR, and YJRS and Erler in terms of Q-measure.
Figure 3 shows the cumulated credits in the online evaluation. In the online evaluation, the best run from each team was selected: KUIDL 18, TUA1 19, Erler 22, SLOLQ 54, cdlab 83, YJRS 86, and OKSAT 88. Moreover, we included the best baseline method (ORG 7), the current production system (ORG 4), and a simple baseline that ranks questions by the number of answers obtained (not used in the offline evaluation). YJRS and Erler outperformed the best baseline (ORG), although there is no statistically significant difference between them, as explained later. The online evaluation showed a different result from those obtained in the offline evaluation. In particular, OKSAT performed well in the online evaluation, while it showed relatively low performance among the submitted runs in the offline evaluation.
Figure 4 shows the number of run pairs between which t-tests did not find significant differences, where the x-axis indicates the number of days passed. As this is a multiple comparison, p-values were corrected by Bonferroni's method. The multileaving evaluation found statistically significant differences for most of the run pairs (37/45 = 82.2%) within 10 days. After 20 days, significant differences were found for 41/45 = 91.1% of the run pairs. After 64 days, only three run pairs remained insignificant: KUIDL-TUA1, KUIDL-OKSAT, and YJRS-Erler.
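The pairwise testing procedure can be sketched as follows (a sketch; the paper does not state which t-test variant was used, so an independent two-sample test is assumed, and the function names are our own):

```python
from itertools import combinations
from scipy import stats

def insignificant_pairs(credits_per_run, alpha=0.05):
    """credits_per_run: dict mapping run name -> list of per-impression credits.
    Performs all pairwise two-sample t-tests with Bonferroni correction and
    returns the run pairs whose p-value exceeds the corrected threshold."""
    pairs = list(combinations(sorted(credits_per_run), 2))
    threshold = alpha / len(pairs)  # Bonferroni correction for multiple comparison
    insignificant = []
    for a, b in pairs:
        _, p = stats.ttest_ind(credits_per_run[a], credits_per_run[b])
        if p > threshold:
            insignificant.append((a, b))
    return insignificant
```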
6. CONCLUSIONS

In this paper, we described the details of the NTCIR-13
[Figure 2: Offline evaluation. Three bar charts over all Run IDs: (a) nDCG@10 (0.0-0.4), (b) ERR@10 (0.00-0.25), and (c) Q-measure (0.0-0.7).]
[Figure 4: Number of insignificant run pairs as a function of days passed (x-axis: 20-100 days; y-axis: 2.5-22.5 pairs).]
OpenLiveQ task, and reported the official results of the submitted runs. Our findings are summarized as follows:
• There was a big difference between the offline evaluation, where assessors voted for relevant questions, and the online evaluation, where real users voted for relevant questions via clicks.
• The current online evaluation method, multileaving, could seemingly handle a dozen runs, but could not evaluate a hundred runs within a few months.
• Systems developed by participants achieved big improvements over the current question retrieval system in terms of the online evaluation, while there is still room for improvement over a strong baseline using learning to rank.
Those findings motivated us to organize the second round of OpenLiveQ (OpenLiveQ-2) in NTCIR-14, in which we will introduce the following changes:
• We will employ a log-based offline evaluation method that turned out to align well with online evaluation results according to our recent study [2].
• We will improve the multileaving methods to evaluate a hundred runs within a few months.
7. ACKNOWLEDGMENTS

We thank the NTCIR-13 OpenLiveQ participants for their effort in submitting runs. We appreciate the significant efforts made by Yahoo Japan Corporation in providing valuable search data and an open live test environment.
8. REFERENCES

[1] X. Cao, G. Cong, B. Cui, and C. S. Jensen. A generalized framework of exploring category information for question retrieval in community question answer archives. In WWW, pages 201-210, 2010.
[2] T. Manabe, A. Nishida, M. P. Kato, T. Yamamoto, and S. Fujita. A comparative live evaluation of multileaving methods on a commercial cQA search. In SIGIR, pages 949-952, 2017.
[3] D. Metzler and W. B. Croft. Linear feature-based models for information retrieval. Information Retrieval, 10(3):257-274, 2007.
[4] A. Schuth, F. Sietsma, S. Whiteson, D. Lefortier, andM. de Rijke. Multileaved comparisons for fast onlineevaluation. In CIKM, pages 71–80, 2014.
[5] K. Wang, Z. Ming, and T.-S. Chua. A syntactic tree matching approach to finding similar questions in community-based QA services. In SIGIR, pages 187-194, 2009.
[6] G. Zhou, Y. Liu, F. Liu, D. Zeng, and J. Zhao.Improving question retrieval in community questionanswering using world knowledge. In IJCAI, pages2239–2245, 2013.