
Design and Implementation of Relevance Assessments Using Crowdsourcing

Omar Alonso (1) and Ricardo Baeza-Yates (2)

(1) Microsoft Corp., Mountain View, California, USA ([email protected])

(2) Yahoo! Research, Barcelona, Spain ([email protected])

Abstract. In recent years crowdsourcing has emerged as a viable platform for conducting relevance assessments. The main reason behind this trend is that it makes it possible to conduct experiments extremely fast, with good results and at low cost. However, as in any experiment, there are several details that can make an experiment work or fail. To gather useful results, user interface guidelines, inter-agreement metrics, and justification analysis are important aspects of a successful crowdsourcing experiment. In this work we explore the design and execution of relevance judgments using Amazon Mechanical Turk as the crowdsourcing platform, introducing a methodology for crowdsourcing relevance assessments and presenting the results of a series of experiments on TREC-8 with a fixed budget. Our findings indicate that workers are as good as TREC experts, and even provide detailed feedback for certain query-document pairs. We also explore the importance of document design and presentation when performing relevance assessment tasks. Finally, we show our methodology at work with several examples that are interesting in their own right.

1 Introduction

In the world of Web 2.0 and user-generated content, one important sub-class of peer collaborative production is the phenomenon known as crowdsourcing. In crowdsourcing, potentially large jobs are broken into many small tasks that are then outsourced directly to individual workers via public solicitation. One of the best examples is Wikipedia, where each entry, or part of an entry, could be considered a task being solicited. As in the latter example, workers sometimes do it for free, motivated either because the work is fun or by some form of social reward [1,12]. However, successful examples of volunteer crowdsourcing are difficult to replicate. As a result, crowdsourcing increasingly uses financial compensation, usually as micro-payments of the order of a few cents per task. This is the model of Amazon Mechanical Turk (AMT, www.mturk.com), where many tasks can be done quickly and cheaply.

AMT is currently used as a feasible alternative for conducting all kinds of relevance experiments in information retrieval and related areas.


The lower cost of running experiments, in conjunction with the flexibility of the editorial approach at a larger scale, makes this approach very attractive for testing new ideas with a fast turnaround. In AMT, workers choose from a list of jobs on offer, which indicates the reward per task and the number of tasks available for that request. Workers can click on a link to view a brief description or a preview of each task. The unit of work to be performed is called a HIT (Human Intelligence Task). Each HIT has an associated payment and an allotted completion time; workers can see sample HITs, along with the payment and time information, before choosing whether or not to work on them. After seeing the preview, workers can choose to accept the task, where optionally a qualification exam must be passed before the task is officially assigned to them. Tasks are very diverse in size and nature, requiring from seconds to minutes to complete. The typical compensation ranges from one cent to less than a dollar per task and is usually correlated with task complexity.

However, what is not clear is how exactly to implement an experiment. First, given a certain budget, how do we spend it and how do we design the tasks? That is, in our case, how many people evaluate how many queries, looking at how many documents? Second, how should the information for each relevance assessment task be presented? What is the right interaction? How can we collect the right user feedback, considering that relevance is a personal, subjective decision? In this paper we explore these questions, providing a methodology for crowdsourcing relevance assessments and its evaluation, and giving guidelines to answer the questions above. In our analysis we consider binary relevance assessments. That is, after presenting a document to the user with some context, the possible outcome of the task is relevant or non-relevant. Ranked-list relevance assessment is out of the scope of this work, but it is a matter of future research.

This paper is organized as follows. First, in Section 2 we present an overview of the related work in this area. Second, we describe our proposed methodology in Section 3. Then, we explain the experimental setup in Section 4 and discuss our experiments in Section 5. We end with some final remarks in Section 6.

2 Related Work

There is previous work on using crowdsourcing for IR and NLP. Alonso & Mizzaro [2] compared a single topic to TREC and found that workers were as good as the original assessors, and in some cases they were able to detect errors in the golden set. Similar work by Alonso et al. [3] in the context of INEX with a larger data set shows similar results. Kazai et al. [8] propose a method for gathering relevance assessments for collections of digital books and videos. Tang & Sanderson used crowdsourcing to evaluate user preferences on spatial diversity [14]. Grady and Lease focused their work on human factors for crowdsourcing assessments [7].

The research work by Snow et al. [13] shows the quality of workers in the context of four different NLP tasks, namely affect recognition, word similarity, textual entailment, and event temporal ordering. Callison-Burch shows how AMT can be used for evaluating machine translation [6]. Mason & Watts [11] recently found that increased financial incentives increase the quantity, but not the quality, of the work performed by workers.

3 Methodology

One of the most important aspects of performing evaluations using crowdsourcing is to design the experiment carefully. Our crowdsourcing-based methodology for relevance assessments is based on three parameters that we analyze later:

P: Number of people (workers) used for each evaluation task.
T: Number of topics chosen for the relevance tasks in the target document collection.
D: Number of documents per query that will be judged for relevance.

The other typical parameter is the compensation per HIT. However, as we already have three parameters, we decided to keep it constant. We do this for two reasons. First, as already mentioned, quality does not really improve if we pay more [11]. Second, if quality did improve, we would not be able to compare relevance assessments done with different compensations.

3.1 Data Preparation

The first step of the methodology is preparing the data, similar to any relevance assessment study (a minimal sketch follows the list):

– Select the document collection.
– Select the T topics (queries).
– For each topic, select D documents. We use an even number, as it is better to have the same number of relevant and non-relevant documents in the case of binary relevance.
– Select the number of workers P that will judge each topic/document pair. We always use an odd number so that there is a majority vote in most cases. If there is a tie, an additional relevance assessment is made (the option we use) or a non-binary relevance measure can be used. Also, topics should be assigned to workers randomly, so that any possible bias is eliminated.
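The following Python sketch makes the data-preparation step concrete. It assumes a qrels dictionary mapping (topic, doc_id) pairs to binary TREC labels; the function name and data layout are illustrative, not the authors' actual tooling.

```python
import random
from collections import defaultdict

def prepare_assignments(qrels, topics, D=6, P=3, seed=42):
    """Pick D documents per topic (half relevant, half non-relevant, per the
    guideline above) and return (topic, doc_id, P) judgment units, with P
    workers requested per unit. `qrels` maps (topic, doc_id) -> 0/1 labels."""
    rng = random.Random(seed)
    by_topic = defaultdict(lambda: {0: [], 1: []})
    for (topic, doc_id), label in qrels.items():
        by_topic[topic][label].append(doc_id)

    units = []
    for topic in topics:
        relevant = rng.sample(by_topic[topic][1], D // 2)
        non_relevant = rng.sample(by_topic[topic][0], D // 2)
        docs = relevant + non_relevant
        rng.shuffle(docs)                 # do not present relevant docs first
        units.extend((topic, doc) for doc in docs)

    rng.shuffle(units)                    # random topic order reduces bias
    return [(topic, doc, P) for topic, doc in units]
```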

3.2 Interface Design

This is the most important part of the AMT experiment design: how to ask the right questions in the best possible way. At first, this looks pretty straightforward, but in practice the outcome can be disastrous if it is performed in an ad hoc manner. The first step is to follow standard guidelines for survey and questionnaire design [5] in conjunction with usability techniques.

The second step is to provide clear instructions for the experiment. In TREC, relevance assessments are performed by hired editors and the instructions are very technical. The original instructions run to four pages [15] and it was not possible to use them as is. In AMT, one cannot make any assumptions about the population, so it is better to use plain English to describe the task and avoid jargon.


We created a template, based on the TREC guidelines, that presents the instructions along with the web form for performing the task. The form contains a document with a closed question (binary relevance; yes/no) and an open-ended question for justifying the selection. A back-end process reads the qrel file and topic data, instantiates the template variables accordingly, and produces a HIT for that particular query-document pair.
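A minimal sketch of such a back-end step is shown below. The wording of the instructions and the field names are illustrative placeholders, not the exact form used in our experiments.

```python
from string import Template

HIT_TEMPLATE = Template("""
<p><b>Instructions:</b> Read the news article below and decide whether it is
relevant to the topic "<i>$topic_title</i>". No special expertise is needed.</p>
<div>$document_text</div>
<p>Is this document relevant to the topic?</p>
<input type="radio" name="relevant" value="yes"> Yes
<input type="radio" name="relevant" value="no"> No
<p>Please briefly explain your answer:</p>
<textarea name="justification" rows="3" cols="60"></textarea>
""")

def build_hit_html(topic_title, document_text):
    """Instantiate the template for a single query-document pair."""
    return HIT_TEMPLATE.substitute(topic_title=topic_title,
                                   document_text=document_text)
```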

As part of the experiment setup, we automatically generate metadata so that users can identify each task on the AMT website. As in any marketplace, we compete with other requesters for workers to get our experiment done. A clear title, description, and keywords allow potential workers to find experiments and preview the content before accepting tasks. In terms of keywords, we use a common set for all experiments (relevance, news articles, search, TREC) and then add specific terms depending on the given run. For example, in experiment E2 we use: behavioral genetics, osteoporosis, Ireland, peace talks.
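Publishing a HIT with this metadata might look like the hedged sketch below. It uses today's boto3 MTurk client as a stand-in for the AMT API available at the time, and the title, description, and duration values are assumptions for illustration; only the common keyword set comes from the text above.

```python
import boto3

# Today's boto3 MTurk client, used here only as a stand-in for the AMT API.
mturk = boto3.client("mturk", region_name="us-east-1")

def publish_hit(topic_title, hit_html, run_keywords, reward="0.04", P=5):
    """Publish one query-document judgment as a HIT with descriptive metadata."""
    question_xml = (
        '<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/'
        'AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">'
        "<HTMLContent><![CDATA[" + hit_html + "]]></HTMLContent>"
        "<FrameHeight>600</FrameHeight></HTMLQuestion>"
    )
    return mturk.create_hit(
        Title="Judge the relevance of a news article: " + topic_title,
        Description="Read a short news article and tell us whether it is "
                    "relevant to the given topic (one question plus a comment).",
        Keywords="relevance, news articles, search, TREC, " + run_keywords,
        Reward=reward,                        # kept constant across experiments
        MaxAssignments=P,                     # P workers per query-document pair
        AssignmentDurationInSeconds=600,
        LifetimeInSeconds=7 * 24 * 3600,
        Question=question_xml,
    )
```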

3.3 Filtering the Workers

A possible filter for selecting good workers is the approval rate, a metric provided by AMT that measures the overall rating of each worker. However, using very high approval rates decreases the worker population available and implies a longer time to complete the evaluation.

An alternative is to use qualifications to control which workers can perform certain HITs. A qualification test is a set of questions, similar to a HIT, that a worker must answer when requesting the qualification. A qualification test is a much better quality filter but also involves more development cycles. In the case of relevance evaluation it is somewhat difficult to test "relevance". What we propose is to generate questions about the topics so that workers can get familiar with the content before performing the tasks, even if they search online for a particular answer. In our case, the qualification test has ten multiple-choice questions worth 10 points each. The goal of the qualification test is to eliminate lazy workers and workers that perform badly on our specific task.
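A minimal grading sketch for such a test is shown below; the 60-point passing threshold matches a value we used in later runs (see Table 2), while the data structures are illustrative.

```python
def grade_qualification(answers, answer_key, pass_score=60):
    """Score a ten-question topic-familiarity test (10 points per question)
    and decide whether the worker may take our HITs."""
    score = sum(10 for q, a in answers.items() if answer_key.get(q) == a)
    return score, score >= pass_score

# Example: answering 7 of 10 questions correctly scores 70 and passes at 60.
```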

Another approach, instead of qualification tests, is to interleave assignments for which we already know the correct answer, making it easy to detect workers that select answers at random. This technique, sometimes called honey pots, is useful for checking whether workers are paying attention to the task. For example, when testing a topic one could include a document or web page that is completely unrelated to the question and expect the worker to answer correctly. If they fail, there is an indication that they are not following instructions correctly or are simply spamming.
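A sketch of such a honey-pot check follows; the tolerance of a single miss is an arbitrary assumption, not a value from our experiments.

```python
from collections import Counter

def flag_suspicious_workers(assignments, gold_labels, max_misses=1):
    """Flag workers who miss more than `max_misses` honey-pot questions.
    `assignments` holds (worker_id, hit_id, answer) tuples and `gold_labels`
    maps the planted hit_id values to their known correct answers."""
    misses = Counter()
    for worker_id, hit_id, answer in assignments:
        if hit_id in gold_labels and answer != gold_labels[hit_id]:
            misses[worker_id] += 1
    return {worker for worker, n in misses.items() if n > max_misses}
```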

AMT provides features for blocking workers from experiments and rejecting individual assignments. That said, it is easy to blame workers for bad answers when, in fact, the interface was confusing or the instructions were unclear. In the next section we will show that an iterative approach (as when designing user interfaces) that incorporates feedback early in the process works well.


3.4 Scheduling the Tasks

It is known that there is a delicate balance between the compensation and the filtering step with regard to the time that the experiment will take. A low compensation and/or a strict filtering procedure drastically reduces the number of interested workers and hence significantly increases the duration of the experiment. In fact, the distribution of the number of tasks versus completion time follows a power law, as we show later.

One solution is to split long tasks even further and submit them in parallel. This has several advantages. First, the waiting time will decrease even though the total time spent on the tasks may not. Second, the overall time spent may also decrease because shorter tasks usually attract more workers. In summary, as common sense suggests, it is better to have many small tasks in the system than one very large task. In our case the smallest task is to judge the relevance of one single document.
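As a small illustration, the judgment units from the earlier sketch could be chunked into parallel batches like this (the batch size is an arbitrary choice):

```python
def split_into_batches(units, batch_size=200):
    """Split a long list of judgment units into smaller batches that can be
    submitted in parallel; shorter batches tend to attract workers sooner."""
    return [units[i:i + batch_size] for i in range(0, len(units), batch_size)]
```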

Regarding the experiment schedule, it is better to submit shorter tasks first, so that any important implicit or explicit feedback coming from the experiment can be used to re-design the larger experiments. This is also helpful for debugging the experiment in the long run.

4 Experimental Setup

As ground truth for the relevance assessments we use TREC-8, using the LA Times and FBIS sub-collections. TREC-8 has T = 50 different topics. To cover all possible topics, we can use, for example, D = 10 documents per topic and P = 5 workers per document. Paying $0.04 for each assessment, we would need to spend $100. However, we want to study a larger space of assessments that can be done with the same amount of money.
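The budget arithmetic is straightforward; the snippet below simply restates it.

```python
# Budget check for the configuration above (all values taken from the text).
T, D, P, cost_per_assignment = 50, 10, 5, 0.04
total_assignments = T * D * P                    # 2500 query-document judgments
print(total_assignments * cost_per_assignment)   # 100.0 dollars
```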

We measure the agreement level between the raters using Cohen's kappa (κ). As pointed out by many researchers, κ is a very conservative measure of agreement and is not perfect. Because we have many raters, their number varies, and we do not know the number of TREC raters for each topic, we average the answers of the workers as the value of the crowd.
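A minimal sketch of that computation, assuming scikit-learn is available and that the crowd value is the averaged binary answer thresholded at 0.5 (the exact aggregation is only described at this level of detail in the text):

```python
from sklearn.metrics import cohen_kappa_score

def crowd_vs_trec_kappa(worker_answers, trec_labels):
    """`worker_answers` maps (topic, doc) -> list of 0/1 worker judgments;
    `trec_labels` maps (topic, doc) -> 0/1 TREC judgment. The crowd value is
    the averaged worker answer, thresholded at 0.5."""
    keys = sorted(worker_answers)
    crowd = [int(sum(worker_answers[k]) / len(worker_answers[k]) >= 0.5)
             for k in keys]
    trec = [trec_labels[k] for k in keys]
    return cohen_kappa_score(trec, crowd)
```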

For the analysis, instead of trying all possible combinations, we covered all topics with different workers. In this way, each experiment is completely independent. Before running our seven experiments, we built an experimental test case and ran it with a small number of documents and users. The goal was to make sure that workers understood the task, and to see how long it took to complete.

We used an incremental approach for running our experiments (shown in Table 1). We increased the number of topics in every experiment until we covered most of the topics. We also kept some parameters constant, such as the number of documents and workers, so that we could cover more combinations of variables. In terms of cost per assignment (one query-document pair), we settled on $0.04. The rationale was $0.02 for the (binary) answer plus $0.02 for the comment/feedback.

We tried to have a single experiment running on the system at all times.


Table 1. Breakdown of the seven experiments

Exp  #T  #D  #P  Topics                                                  Total cost
E1    1   4   1  401                                                     $0.16
E2    3   4   3  402, 403, 404                                           $1.44
E3    5   6   3  405, 407, 408, 410, 411                                 $3.60
E4    7   6   5  412, 413, 414, 415, 417, 418, 419                       $8.40
E5   10   8   5  420, 421, 422, 423, 424, 425, 426, 427, 428, 430        $16.00
E6   11   8   5  406, 429, 433, 435, 445, 437, 438, 439, 442, 444        $17.60
E7   11  10   7  437, 438, 439, 442, 444, 440, 441, 446, 448, 449, 450   $33.60

Fig. 1. Workers versus number of tasks

However, we noticed that as the experiments grew larger, the completion times were getting longer. Starting with E5, we decided to split experiments into batches so they could finish in a reasonable amount of time. This confirms that having smaller tasks is better. Another effect of this approach is that it avoids worker fatigue within a single experiment and allows some degree of parallelism.

5 Results and Discussion

Across all experiments, there were 97 unique workers. The number of unique workers that completed 5 or fewer tasks was 60, and the number of unique workers that completed exactly 1 task was 23. Figure 1 shows a graph of workers versus tasks. As we can see, this resembles a power-law distribution, as expected. Figure 2 shows a similar graph, restricted to workers with 5 or more tasks. In both cases, however, we can appreciate two power laws: one with a much flatter exponent up to twenty workers, and a much more skewed one for more than twenty workers. This double power-law behavior also appears in other settings such as word frequencies or the number of links in web pages. This could imply that there are two types of workers, with the first group being the one that profits more from the AMT system. Figure 3 shows the number of unique workers in each experiment. Clearly this number decreases when a qualification exam is used and then increases, but never to the level of the non-qualification-exam case. This is due to the size of the available work force in each case.
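The per-worker tallies underlying Figures 1 and 2 can be reproduced from the raw assignment log with a few lines; the (worker_id, hit_id, answer) tuple layout is an assumption carried over from the earlier sketches.

```python
from collections import Counter

def tasks_per_worker(assignments):
    """Tally how many tasks each worker completed; sorting the counts in
    decreasing order yields the heavy-tailed curve shown in Figure 1."""
    counts = Counter(worker_id for worker_id, _hit_id, _answer in assignments)
    return sorted(counts.values(), reverse=True)
```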


Fig. 2. Workers with 5 or more tasks versus number of tasks

Fig. 3. Unique workers vs. workers required

Table 2 shows the transaction details from AMT for all experiments. As we can see, for all experiments with no qualification test (E1-E4) we used a very high approval rate. For those with a qualification test, we lowered the approval rate a bit but relied on our own test. Experiments with no qualification test tend to go faster compared with the first one with the test in place (E5). This is expected, as some workers may not feel like taking the test and others fail it. However, once a number of workers have passed the test, the subsequent experiments tend to go faster (E6-E7). Another reason is trust: by then, workers know that we pay for the work on time and that, in the cases where we rejected work, a clear explanation was given. This shows that the word-of-mouth effect works as in any other market.

Table 3 shows more information about assignments and workers. Relevance evaluation is very subjective, so in principle we do not like to reject work just because we disagree with a particular answer. We do reject when there is a clear indication of a robot doing tasks very fast or of a worker choosing answers at random. After a number of rejections in E4, we decided to make the qualification test a prerequisite for the rest of the experiments. Still, E5 had a higher number of rejections. This is possibly due to the fact that even if a worker passes a qualification test, he or she may still choose answers at random.

To visualize the agreement between the workers and the original TREC assessments, in Figure 4 we show how the judgment converges to the correct value.


Table 2. Transaction details of every experiment

Exp        Approval rate  Qual. test  Completion time          Launched
E1         98%            No          8 min                    Sunday AM
E2         98%            No          6 min                    Sunday PM
E3         98%            No          5 hrs, 31 min            Sunday AM
E4         98%            No          4 days, 2 hrs, 45 min    Friday AM
E5 batch1  98%            60%         8 days, 4 hrs, 40 min    Thursday PM
E5 batch2  98%            60%         6 days, 5 hrs, 45 min    Friday AM
E6 batch1  95%            60%         4 days, 3 hrs, 2 min     Friday PM
E6 batch2  95%            60%         1 day, 5 hrs, 23 min     Tuesday AM
E7 batch1  96%            70%         2 days, 11 hrs, 34 min   Friday PM
E7 batch2  96%            70%         1 day, 4 hrs, 29 min     Monday PM

Table 3. Details about assignments and workers per experiment

Exp        # workers  # approved  # rejected  # answers  # comments
E1          2           4           0           4          2
E2         11          53           1          53         44
E3         26          89           1          89         78
E4         28         181          28         181        141
E5 batch1   9         160          40         158        160
E5 batch2   8         200           0         199        200
E6 batch1   6         195           0         194        195
E6 batch2   9         235           0         235        235
E7 batch1  25         420           0         419        413
E7 batch2  19         420           0         419        420

Table 4. Inter-agreement level between TREC and workers

Exp        Agreement level     Avg. topic difficulty
E1         0.00 (chance)       3.60
E2         0.66 (substantial)  3.13
E3         0.53 (moderate)     3.00
E4         0.39 (fair)         2.51
E5 batch1  0.25 (fair)         2.60
E5 batch2  0.25 (fair)         2.36
E6 batch1  0.21 (fair)         2.73
E6 batch2  0.32 (fair)         2.72
E7 batch1  0.71 (substantial)  2.08
E7 batch2  0.41 (moderate)     2.56

Here we can see that in the relevant case, the majority always agrees with the correct result. Nevertheless, the disagreement increases, as is natural, when the number of people involved in the decision grows. On the other hand, the majority does not agree in the case of non-relevant documents, showing that non-relevant cases are more difficult to judge. Hence, in practice, if the assessment seems to be non-relevant, additional workers should be included (say, two more). In both cases, using a qualification exam clearly improves the judgments.
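A small sketch of this aggregation rule follows: a majority vote over an odd number of binary votes, escalating when the outcome looks non-relevant. The two extra workers match the suggestion above; everything else is an illustrative choice.

```python
def aggregate_judgment(votes, extra_workers=2):
    """Majority vote over an odd number of binary votes (1 = relevant).
    If the current majority points to non-relevant, which we found harder to
    judge, signal that `extra_workers` additional judgments should be
    requested before finalizing the label."""
    majority_relevant = sum(votes) > len(votes) / 2
    request_more = 0 if majority_relevant else extra_workers
    return majority_relevant, request_more
```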


Fig. 4. Average R and NR judgments depending on the number of workers

After we finished running all the batches, we performed a short experiment that asked workers to rate the topics on a scale of 1 to 5 (1 = easy, 5 = very difficult). In Table 4 we show the values of κ for all experiments along with the average topic difficulty. Figure 5 shows that there is an inverse correlation between agreement and topic difficulty, with the exception of two experiments. Notice that these values also depend on the money spent in each case. Normalizing by the cost we get the red dots in the same figure, where the inverse correlation is clearer. Setting the number of workers to 5 in any experiment tends to be a good practice.

5.1 Presentation and Readability

One important factor is the effect of the user interface on the quality of the relevance assessments. To study this, we compared two different interfaces: one that helped the users by highlighting the terms of the query in the text, and another that just showed the plain text. The data preparation for this experiment consisted of taking the original document and producing two otherwise identical versions: plain and highlighted. The plain version has the visual effect of a continuous line. The highlighted version contains the topic title (up to 3 terms) highlighted in black on a yellow background. Figure 6 shows the results of this experiment.
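Producing the highlighted variant can be as simple as the sketch below; the regular-expression approach and the inline CSS are illustrative choices, not the exact markup we generated.

```python
import re

def highlight_terms(document_text, topic_title):
    """Produce the 'highlighted' variant of a document: wrap each topic-title
    term (up to 3 terms) in black text on a yellow background."""
    html = document_text
    for term in topic_title.split()[:3]:
        pattern = re.compile(r"\b(%s)\b" % re.escape(term), re.IGNORECASE)
        html = pattern.sub(
            r'<span style="background-color: yellow; color: black;">\1</span>',
            html)
    return html
```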


Fig. 5. Agreement vs. Topic Difficulty

Fig. 6. Relevance votes on highlighting vs plain versions of the same documents

In the figure we show a set of relevant documents. For each document there is the TREC vote (1) and the number of votes for the highlighted and plain versions. With the exception of two cases, highlighting does contribute to higher relevance judgments compared to plain text. This suggests that generalist workers may rely on good document presentation when assessing relevance. In this particular experiment the number of documents is not that significant; however, the results indicate that the presence of (in our case highlighted) keywords impacts assessment [9].

5.2 Feedback Analysis

One way to get user feedback is to ask an optional open-ended question at the end of the task. Table 3 shows that the number of comments per experiment increases as more HITs are in the system. In experiments E1-E3, we made comments optional and found the feedback to be very useful.


Fig. 7. Average comment length for all the experiments

We also noticed that by asking workers to write a short explanation, we can not only gain more data but also detect spammers. By looking at some of the feedback we can observe that, in certain cases, a binary scale may not be suitable and a graded version should be applied.

Figure 7 shows the average comment length in characters for all experiments. As the number of documents (and experiments) goes up, the average length tends to go up as well. However, there is a dip starting in E4. The reason is that we wanted all workers to produce comments, so we adjusted the guidelines and made them mandatory. Unfortunately, workers then simply answered "relevant" or "not relevant" to make sure they got paid for the task. To verify this, we re-launched E5 (the lowest, with an average of 13.16) and changed the instructions so that feedback was optional but earned a bonus ($0.01 per comment) if the content was good. The average comment length rose to 426.47 characters, a clear indication that the bonus technique works in this kind of situation.
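A small sketch of the kind of comment analysis described here; the set of trivial answers checked is an illustrative assumption.

```python
def comment_stats(comments):
    """Average comment length in characters, plus a count of trivial comments
    ('relevant' / 'not relevant') that add no information and may indicate a
    worker gaming a mandatory-comment requirement."""
    lengths = [len(c) for c in comments]
    trivial = sum(1 for c in comments
                  if c.strip().lower() in {"relevant", "not relevant"})
    average = sum(lengths) / len(lengths) if lengths else 0.0
    return average, trivial
```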

6 Concluding Remarks

We presented a methodology for crowdsourcing relevance assessments that consists of dividing a large document set into a series of smaller experiments and executing them separately according to a well-designed template. We start with a smaller set and later tune documents, topics, and workers as parameters while keeping the same cost per assignment across all experiments. We have demonstrated the benefit of our approach by evaluating TREC-8 with a budget of just $100.

Quality control is an important part of the experiment and should be applied at different levels, not just to workers. As presented, the quality of the instructions and the document presentation have an impact on the results. It is possible to detect bad workers, but at the same time the requester may earn a bad reputation among workers. Our experience and findings show that the bulk of the experiment design effort should go into the user interface and instructions. We showed in a small experiment that document presentation can have effects on relevance and readability, so it is not just a matter of uploading documents in some ad hoc format and letting the workers do the job.


The user interface should make the workers' tasks easier, not more difficult. As workers go through the experiments, diversity of topics in a single run can help avoid stalling the experiment due to lack of interest.

Our results, together with previous results, reinforce the advantages of crowdsourcing, in particular in the case of relevance assessments, which are usually not difficult but are tedious and large in volume. This even works in non-English languages, although experiments will take longer [4]. Overall, when it is possible to use social rewards, such as harnessing intrinsic motivation [10], the quality of the work will be good. If this is not possible, we should pay as little as possible, assuming that a large enough crowd exists to make up for the diminished quantity of individual output that low pay implies. In other words, paying more may get the work done faster, but not better.

As future work we would like to study this dimension, compensating for whole topics or other possible aggregations of relevance assessments. Another avenue of work is to test the methodology with ranked lists to evaluate search engine results. We also plan to continue working on document presentation and on improving the overall user experience to get better results.

References

1. von Ahn, L.: Games with a Purpose. IEEE Computer 39(6), 92–94 (2006)
2. Alonso, O., Mizzaro, S.: Can We Get Rid of TREC Assessors? Using Mechanical Turk for Relevance Assessment. In: SIGIR Workshop on the Future of IR Evaluation (2009)
3. Alonso, O., Schenkel, R., Theobald, M.: Crowdsourcing Assessments for XML Ranked Retrieval. In: 32nd ECIR, Milton Keynes, UK (2010)
4. Alonso, O., Baeza-Yates, R.: An Analysis of Crowdsourcing Relevance Assessments in Spanish. In: CERI 2010, Madrid, Spain (2010)
5. Bradburn, N., Sudman, S., Wansink, B.: Asking Questions: The Definitive Guide to Questionnaire Design. Jossey-Bass (2004)
6. Callison-Burch, C.: Fast, Cheap, and Creative: Evaluating Translation Quality Using Amazon's Mechanical Turk. In: Proceedings of EMNLP (2009)
7. Grady, C., Lease, M.: Crowdsourcing Document Relevance Assessment with Mechanical Turk. In: NAACL HLT Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk (2010)
8. Kazai, G., Milic-Frayling, N., Costello, J.: Towards Methods for the Collective Gathering and Quality Control of Relevance Assessments. In: 32nd SIGIR (2009)
9. Kinney, K., Huffman, S., Zhai, J.: How Evaluator Domain Expertise Affects Search Result Relevance Judgments. In: 17th CIKM (2008)
10. Malone, T.W., Laubacher, R., Dellarocas, C.: Harnessing Crowds: Mapping the Genome of Collective Intelligence. MIT Press, Cambridge (2009)
11. Mason, W., Watts, D.: Financial Incentives and the "Performance of Crowds". In: HCOMP Workshop at KDD, Paris, France (2009)
12. Nov, O., Naaman, M., Ye, C.: What Drives Content Tagging: The Case of Photos on Flickr. In: CHI, Florence, Italy (2008)
13. Snow, R., O'Connor, B., Jurafsky, D., Ng, A.Y.: Cheap and Fast - But Is It Good? Evaluating Non-Expert Annotations for Natural Language Tasks. In: EMNLP (2008)
14. Tang, J., Sanderson, M.: Evaluation and User Preference Study on Spatial Diversity. In: 32nd ECIR, Milton Keynes, UK (2010)
15. Voorhees, E.: Personal communication (2009)

