Post on 20-Jan-2015
Crowdsourcing for Human Computer Interaction Research
Ed H. Chi
Research Scientist, Google (work done while at [Xerox] PARC)
Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In CHI2008.
Example Task from Amazon MTurk
Historical Footnote
• De Prony, 1794, hired hairdressers (unemployed after the French Revolution; they knew only addition and subtraction) to create logarithmic and trigonometric tables.
• He managed the process by splitting the work into very detailed workflows.
– Grier, When Computers Were Human, 2005
Human Computation, Round 1
• Humans were the first “computers”, used for math computations
• Distributed computation:
– Clairaut, astronomy, 1758: computed the Halley's comet orbit (three body problem), dividing up the numeric computations across astronomers
– Grier, When Computers Were Human, 2005; Grier, IEEE Annals, 1998
Using Mechanical Turk for user studies

                    Traditional user studies      Mechanical Turk
Task complexity     Complex, long                 Simple, short
Task subjectivity   Subjective, opinions          Objective, verifiable
User information    Targeted demographics,        Unknown demographics,
                    high interactivity            limited interactivity

Can Mechanical Turk be usefully used for user studies?
Task
• Assess quality of Wikipedia articles
• Started with ratings from expert Wikipedians
– 14 articles (e.g., “Germany”, “Noam Chomsky”)
– 7-point scale
• Can we get matching ratings with Mechanical Turk?
Experiment 1
• Rate articles on 7-point scales:
– Well written
– Factually accurate
– Overall quality
• Free-text input: “What improvements does the article need?”
• Paid $0.05 each
Experiment 1: Good news
• 58 users made 210 ratings (15 per article) – $10.50 total
• Fast results – 44% within a day, 100% within two days
– Many completed within minutes
Experiment 1: Bad news
• Correlation between turkers and Wikipedians only marginally significant (r=.50, p=.07)
• Worse, 59% potentially invalid responses
• Nearly 75% of these done by only 8 users
                   Experiment 1
Invalid comments   49%
<1 min responses   31%
Not a good start
• Summary of Experiment 1: – Only marginal correlation with experts.
– Heavy gaming of the system by a minority
• Possible responses:
– Make sure these gamers are not rewarded
– Ban them from doing your HITs in the future
– Create a reputation system [Dolores Labs]
• Can we change how we collect user input?
Design changes
• Use verifiable questions to signal monitoring – “How many sections does the article have?”
– “How many images does the article have?”
– “How many references does the article have?”
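Checks like these can be scored automatically against ground truth computed from the article itself. The sketch below is illustrative only (the regexes, field names, and `tolerance` are assumptions, not from the talk; real article parsing would be more careful):

```python
import re

def ground_truth_counts(article_html: str) -> dict:
    """Count sections, images, and references in raw article HTML.
    Crude regex-based counts -- a sketch, not production parsing."""
    return {
        "sections": len(re.findall(r"<h[23]\b", article_html)),
        "images": len(re.findall(r"<img\b", article_html)),
        "references": len(re.findall(r'<li id="cite_note', article_html)),
    }

def passes_verifiable_check(worker_answers: dict, truth: dict,
                            tolerance: int = 1) -> bool:
    """Accept a worker's response only if every verifiable count is
    within `tolerance` of the ground truth."""
    return all(abs(worker_answers[k] - truth[k]) <= tolerance for k in truth)
```

Even when workers' counts differ slightly from yours (section counting is ambiguous), a small tolerance keeps the check useful as a monitoring signal.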
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
– “Provide 4-6 keywords that would give someone a good summary of the contents of the article”
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
• Make verifiable answers useful for completing the task
– Used tasks similar to how Wikipedians evaluate quality (organization, presentation, references)
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
• Make verifiable answers useful for completing the task
• Put verifiable tasks before subjective responses
– First do objective tasks and summarization; only then evaluate subjective quality
– Ecological validity?
Experiment 2: Results
• 124 users provided 277 ratings (~20 per article)
• Significant positive correlation with Wikipedians (r=.66, p=.01)
• Smaller proportion of malicious responses
• Increased time on task

                   Experiment 1   Experiment 2
Invalid comments   49%            3%
<1 min responses   31%            7%
Median time        1:30           4:06
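The headline statistic in both experiments is a Pearson correlation between per-article expert ratings and mean turker ratings. A stdlib-only sketch (the rating vectors below are invented for illustration, not the study's data):

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

# Hypothetical per-article means: expert rating vs. mean turker rating
experts = [6.5, 3.0, 5.0, 2.5, 4.0]
turkers = [6.0, 3.5, 5.5, 3.0, 4.5]
r = pearson_r(experts, turkers)
```

With only 14 articles, the significance test matters as much as r itself, which is why r=.50 at p=.07 in Experiment 1 did not count as success.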
Generalizing to other MTurk studies
• Combine objective and subjective questions
– Rapid prototyping: ask verifiable questions about the content/design of the prototype before subjective evaluation
– User surveys: ask common-knowledge questions before asking for opinions
• Filtering for quality
– Add a free-form response field and filter out data without answers
– Filter out results that came in too quickly
– Sort by WorkerID and look for cut-and-paste answers
– Look for suspicious outliers in the data
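Those filtering heuristics are mechanical enough to script. A rough sketch (the field names, the 60-second cutoff, and the exact duplicate test are illustrative assumptions):

```python
from collections import Counter

MIN_SECONDS = 60  # responses faster than this are suspect

def filter_responses(responses):
    """Apply the quality filters above to a list of response dicts with
    keys: worker_id, seconds, free_text. Returns (kept, rejected)."""
    # Flag exact free-text duplicates (likely cut-and-paste answers)
    text_counts = Counter(r["free_text"].strip().lower() for r in responses)
    kept, rejected = [], []
    for r in responses:
        too_fast = r["seconds"] < MIN_SECONDS
        empty = not r["free_text"].strip()
        pasted = text_counts[r["free_text"].strip().lower()] > 1
        (rejected if (too_fast or empty or pasted) else kept).append(r)
    return kept, rejected
```

Outlier detection on the ratings themselves would be a further pass; the point is that each filter is cheap relative to rerunning a contaminated study.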
Quick Summary of Tips
1. Use verifiable questions to signal monitoring
2. Make malicious answers as high cost as good-faith answers
3. Make verifiable answers useful for completing task
4. Put verifiable tasks before subjective responses

• Mechanical Turk offers the practitioner a way to access a large user pool and quickly collect data at low cost
• Good results require careful task design
Managing Quality
• Quality through redundancy: combining votes
– Majority vote [works best when workers have similar quality]
– Worker-quality-adjusted vote
– Managing dependencies
• Quality through gold data
– Advantageous with an imbalanced dataset and bad workers
• Estimating worker quality (redundancy + gold)
– Calculate the confusion matrix and see if you actually get some information from the worker
• Toolkit: http://code.google.com/p/get-another-label/
Source: Ipeirotis, WWW2011
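A minimal sketch of these vote-combination ideas: plain majority vote, worker quality estimated on gold items, and a quality-weighted vote. The data structures are assumptions for illustration, and real tools like get-another-label estimate full confusion matrices rather than a single accuracy number per worker:

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Plain majority vote over one item's labels (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def worker_accuracy_on_gold(answers, gold):
    """Estimate each worker's quality from their answers on gold items.
    `answers` maps worker -> {item: label}; `gold` maps item -> true label."""
    acc = {}
    for worker, given in answers.items():
        graded = [(item, lab) for item, lab in given.items() if item in gold]
        if graded:
            acc[worker] = sum(lab == gold[item] for item, lab in graded) / len(graded)
    return acc

def weighted_vote(labels_by_worker, weights):
    """Worker-quality-adjusted vote: each label is weighted by its worker's
    estimated accuracy; workers we know nothing about get weight 0.5."""
    scores = defaultdict(float)
    for worker, label in labels_by_worker.items():
        scores[label] += weights.get(worker, 0.5)
    return max(scores, key=scores.get)
```

The confusion-matrix refinement matters exactly in the imbalanced case the slide mentions: a worker who always answers the majority class looks accurate but carries no information.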
Coding and Machine Learning
• Integration with Machine Learning – Build automatic classification models using
crowdsourced data
Simple Solution
• Humans label training data
• Use training data to build model
Figure: data from existing crowdsourced answers is used to build an automatic model (through machine learning); each new case is fed to the model, which produces an automatic answer.
Source: Ipeirotis, WWW2011
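That label-then-train loop can be shown in miniature: resolve each item's crowd labels by majority vote, then fit a classifier on the resolved labels. The tiny Naive Bayes below stands in for whatever learner you would actually use; all names and data are illustrative:

```python
import math
from collections import Counter, defaultdict

def resolve_labels(crowd_labels):
    """Majority-vote each item's crowd labels into one training label."""
    return {item: Counter(labs).most_common(1)[0][0]
            for item, labs in crowd_labels.items()}

class TinyNaiveBayes:
    """Bag-of-words Naive Bayes with add-one smoothing -- a stand-in for
    a real learner trained on crowdsourced labels."""
    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, lab in zip(texts, labels):
            self.word_counts[lab].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        def score(lab):
            total = sum(self.word_counts[lab].values())
            s = math.log(self.class_counts[lab])
            for w in text.lower().split():
                s += math.log((self.word_counts[lab][w] + 1) /
                              (total + len(self.vocab)))
            return s
        return max(self.class_counts, key=score)
```

Once the model is good enough, it answers routine cases automatically and the crowd is reserved for cases the model is unsure about.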
Limitations of Mechanical Turk
• No control of users’ environment – Potential for different browsers, physical distractions
– General problem with online experimentation
• Not designed for user studies – Difficult to do between-subjects design
– May need some programming
• Users – Somewhat hard to control demographics, expertise
Crowdsourcing for HCI Research
• Does my interface/visualization work? – WikiDashboard: transparency vis for Wikipedia [Suh et al.]
– Replicating Perceptual Experiments [Heer et al., CHI2010]
• Coding of large amount of user data – What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi]
• Incentive mechanisms – Intrinsic vs. Extrinsic rewards: Games vs. Pay – [Horton & Chilton, 2010 for MTurk] and Satisficing – [Ariely, 2009] in general: Higher pay != Better work
Crowd Programming for Complex Tasks
• Decompose tasks into smaller tasks
– Digital Taylorism
– Frederick Winslow Taylor (1856-1915)
– 1911, “Principles of Scientific Management”
• Crowd Programming Explorations
– MapReduce models
• Kittur, A.; Smus, B.; and Kraut, R. CHI2011 EA on CrowdForge.
• Kulkarni, Can, Hartmann. CHI2011 workshop & WIP
– Little, G.; Chilton, L.; Goldman, M.; and Miller, R. C. In KDD 2010 Workshop on Human Computation.
Crowd Programming for Complex Tasks
• Crowd Programming Explorations
– Kittur, A.; Smus, B.; and Kraut, R. CHI2011 EA on CrowdForge.
– Kulkarni, Can, Hartmann. CHI2011 workshop & WIP
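The partition/map/reduce decomposition explored by CrowdForge and Turkomatic can be sketched as an ordinary higher-order function, with pure functions standing in for the human workers who would answer each HIT (everything here is a toy stand-in, not either system's API):

```python
def crowd_flow(task, partition, map_step, reduce_step):
    """CrowdForge-style flow: one partition HIT splits the task, map HITs
    handle each piece (possibly in parallel), and a reduce HIT merges the
    results. The three callables stand in for humans answering HITs."""
    pieces = partition(task)                 # e.g., ask a worker for an outline
    mapped = [map_step(p) for p in pieces]   # e.g., collect facts per section
    return reduce_step(mapped)               # e.g., merge into a final artifact

# Toy stand-ins for worker responses:
outline = lambda topic: [f"{topic}: History", f"{topic}: Geography"]
write = lambda section: f"[paragraph about {section}]"
merge = lambda paragraphs: "\n".join(paragraphs)

article = crowd_flow("France", outline, write, merge)
```

In the real systems each callable is itself a posted HIT (often replicated, with voting to pick the best response), and any step can recurse, e.g. a section outline can be partitioned again into subsections.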
“Please solve the 16-question SAT located at
http://bit.ly/SATexam”. In both cases, we paid workers
between $0.10 and $0.40 per HIT. Each “subdivide” or
“merge” HIT received answers within 4 hours; solutions
to the initial task were complete within 72 hours.
Results
The decompositions produced by Turkers while running
Turkomatic are displayed in Figure 1 (essay-writing)
and Figure 4 (SAT).
In the essay task, each “subdivide” HIT was posted
three times by Turkomatic and the best of the three
was selected by experimenters (simulating Turker
voting) to continue the solution process. The proposed
decompositions were overwhelmingly linear and chose
to break the task down either by paragraph or by
activity (for example, one Turker proposed: brainstorm,
create outline, write topic sentences, fill in facts). The
decomposition used in the final essay used two levels of
recursion. As groups of subtasks were completed,
Turkomatic passed solutions to merge workers for
reassembly. The resulting essay is complete and
coherent, although somewhat lacking in cohesion.
We allowed essay-writers to pick a topic; the chosen
one (university legacy admissions) was somewhat
specialized, but the final essay displayed a reasonably
good understanding of the topic, even if the writing
quality was often mixed. The decomposition selected
for the SAT task used only one level of recursion.
Workers divided the task into 12 subtasks consisting of
1 to 3 thematically linked questions. These were each
solved in parallel by distinct workers and the results
were given to a merge worker who produced the final
solution. The score on the overall solution was 12/17,
with the worst performance on math and grammar
questions and the best in reading and vocabulary.
Obtaining useful decompositions proved tricky for
workers – many seemed confused about the nature of
the planning task. However, once the tasks were
decomposed, solution of the constituent parts and
reassembly into an overall solution were
straightforward for Turkers to accomplish.
Evaluation: Interface
In a second informal study, we examined whether
reducing user involvement in the HIT design improved
ease of use and efficiency. We hypothesized that the
high level of abstraction enabled by automatic task
design would make it easier for requesters to
crowdsource their work.
We asked a pool of four users to try to collect answers
for a basic brainstorming task on Mechanical Turk. The
task asked our participants to generate five ideas of
topics for an essay. Participants performed this task
twice: first using Turkomatic to post tasks and obtain
results, then using Mechanical Turk’s web interface. No
instruction on either interface was provided. We
examined how long it took the user to post the task.
With Turkomatic, our users finished posting their tasks
in an average of 37 seconds. On Mechanical Turk,
where low-level task design was required, users needed
an average of 244.2 seconds to post their tasks. More
importantly, the HITs posted by two users who were
not familiar with Mechanical Turk would not have
produced any meaningful results. One user posted
minor variations of the default templates provided on
Figure 4. For the SAT task, we uploaded sixteen questions from a high school Scholastic Aptitude Test to the web and posed the following task to Turkomatic: “Please solve the 16-question SAT located at http://bit.ly/SATexam”.
In map tasks, a specified processing step is applied to each item in the partition. These tasks are ideally simple enough to be answerable by a single worker in a short amount of time. For example, a map task for article writing could ask a worker to collect one fact on a given topic in the article's outline. Multiple instances of a map task could be instantiated for each partition; e.g., multiple workers could be asked to collect one fact each on a topic in parallel.

Finally, reduce tasks take all the results from a given map task and consolidate them, typically into a single result. In the article writing example, a reduce step might take facts collected for a given topic by many workers and have a worker turn them into a paragraph.

Any of these steps can be iterative. For example, the topic for an article section defined in a first partition can itself be partitioned into subsections. Similarly, the paragraphs returned from one reduction step can in turn be reordered through a second reduction step.

Case studies
We explored as a case study the complex task of writing an encyclopedia article. Writing an article is a challenging and interdependent task that involves many different subtasks: planning the scope of the article, how it should be structured, finding and filtering information to include, writing up that information, finding and fixing grammar and spelling, and making the article coherent. These characteristics make article writing a challenging but representative test case for our approach.

To solve this problem we created a simple flow consisting of a partition, map, and reduce step. The partition step asked workers to create an article outline, represented as an array of section headings such as “History” and “Geography”. In an environment where workers would complete high effort tasks, the next step might be to have someone write a paragraph for each section. However, the difficulty and time involved in finding the information for and writing a complete paragraph for a heading is a mismatch to the low work capacity of micro-task markets. Thus we broke the task up further, separating the information collection and writing subtasks. Specifically, each section heading from the partition was used to generate map tasks in

Figure 4. Partial results of a collaborative writing task.
Source: CHI 2011 Work-in-Progress, May 7–12, 2011, Vancouver, BC, Canada
Future Directions in Crowdsourcing
• Real-time Crowdsourcing – Bigham, et al. VizWiz, UIST 2010
Figure 2: Six questions asked by participants, the photographs they took, and answers received with latency in seconds:
• What color is this pillow? (89s) "."; (105s) multiple shades of soft green, blue and gold
• What denomination is this bill? (24s) 20; (29s) 20
• Do you see picnic tables across the parking lot? (13s) no; (46s) no
• What temperature is my oven set to? (69s) it looks like 425 degrees but the image is difficult to see; (84s) 400; (122s) 450
• Can you please tell me what this can is? (183s) chickpeas; (514s) beans; (552s) Goya Beans
• What kind of drink does this can hold? (91s) Energy; (99s) no can in the picture; (247s) energy drink
the total time required to answer a question. quikTurkit also makes it easy to keep a pool of workers of a given size continuously engaged and waiting, although workers must be paid to wait. In practice, we have found that keeping 10 or more workers in the pool is doable, although costly.

Most Mechanical Turk workers find HITs to do using the provided search engine (available at mturk.com). This search engine allows users to view available HITs sorted by creation date, the number of HITs available, the reward amount, the expiration date, the title, or the time allotted for the work. quikTurkit employs several heuristics for optimizing its listing in order to obtain workers quickly. First, it posts many more HITs than are actually required at any time because only a fraction will actually be picked up within the first few minutes. These HITs are posted in batches, helping quikTurkit HITs stay near the top. Finally, quikTurkit supports posting multiple HIT variants at once with different titles or reward amounts to cover more of the first page of search results.

VizWiz currently posts a maximum of 64 times more HITs than are required, posts them at a maximum rate of 4 HITs every 10 seconds, and uses 6 different HIT variants (2 titles × 3 rewards). These choices are explored more closely in the context of VizWiz in the following section.
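quikTurkit's over-posting heuristics (post far more HITs than needed, in batches, cycling through title/reward variants) could be sketched as a schedule generator. The function and its defaults are illustrative only; actual posting through the MTurk API, and the one-batch-per-10-seconds rate limit, are left to the caller:

```python
import itertools

def hit_post_schedule(workers_needed, overpost_factor=64, batch_size=4,
                      titles=("3 Quick Visual Questions",
                              "Answer Three Questions for A Blind Person"),
                      rewards=(0.01, 0.02, 0.03)):
    """Yield batches of (title, reward) HIT specs, cycling through the
    variants, until the over-posted cap (workers_needed * factor) is hit."""
    variants = itertools.cycle((t, r) for t in titles for r in rewards)
    total = workers_needed * overpost_factor
    posted = 0
    while posted < total:
        n = min(batch_size, total - posted)
        yield [next(variants) for _ in range(n)]
        posted += n
```

Cycling through variants spreads the postings across more rows of the first search-results page, which is the whole point of the heuristic.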
FIELD DEPLOYMENT
To better understand how VizWiz might be used by blind people in their everyday lives, we deployed it to 11 blind iPhone users aged 22 to 55 (3 female). Participants were recruited remotely and guided through using VizWiz over the phone until they felt comfortable using it. The wizard interface used by VizWiz speaks instructions as it goes, and so participants generally felt comfortable using VizWiz after a single use. Participants were asked to use VizWiz at least once a day for one week. After each answer was returned, participants were prompted to leave a spoken comment.

quikTurkit used the following two titles for the jobs that it posted to Mechanical Turk: “3 Quick Visual Questions” and “Answer Three Questions for A Blind Person.” The reward distribution was set such that half of the HITs posted paid $0.01, and a quarter paid $0.02 and $0.03 each.
Asking Questions
Participants asked a total of 82 questions (see Figure 2 for participant examples and accompanying photographs). Speech recognition correctly recognized the question asked for only 13 of the 82 questions (15.8%), and 55 (67.1%) questions could be answered from the photos taken. Of the 82 questions, 22 concerned color identification, 14 were open-ended “what is this?” or “describe this picture” questions, 13 were of the form “what kind of (blank) is this?,” 12 asked for text to be read, 12 asked whether a particular object was contained within the photograph, 5 asked for a numerical answer or currency denomination, and 4 did not fit into these categories.

Problems Taking Pictures
9 (11.0%) of the images taken were too dark for the question to be answered, and 17 (21.0%) were too blurry for the question to be answered. Although a few other questions could not be answered due to the photos that were taken, photos that were too dark or too blurry were the most prevalent reason why questions could not be answered. In the next section, we discuss a second iteration on the VizWiz prototype that helps to alert users to these particular problems before sending the questions to workers.

Answers
Overall, the first answer received was correct in 71 of 82 cases (86.6%), where “correct” was defined as either being the answer to the question or an accurate description of why the worker could not answer the question with the information contained within the photo provided (i.e., “This image is too blurry”). A correct answer was received in all cases by the third answer.

The first answer was received across all questions in an average of 133.3 seconds (SD=132.7), although the latency required varied dramatically based on whether the question could actually be answered from the picture and on whether the speech recognition accurately recognized the question (Figure 4). Workers took 105.5 seconds (SD=160.3) on average to answer questions that could be answered by the provided photo compared to 170.2 seconds (SD=159.5) for those
Future Directions in Crowdsourcing
• Real-time Crowdsourcing – Bigham, et al. VizWiz, UIST 2010
• Embedding of Crowdwork inside Tools – Bernstein, et al. Soylent, UIST 2010
Future Directions in Crowdsourcing
• Real-time Crowdsourcing – Bigham, et al. VizWiz, UIST 2010
• Embedding of Crowdwork inside Tools – Bernstein, et al. Soylent, UIST 2010
• Shepherding Crowdwork – Dow et al. CHI2011 WIP
workers to persevere and accept additional tasks. We investigate these hypotheses through a prototype system, Shepherd, that demonstrates how to make feedback an integral part of crowdsourced creative work.
Understanding Opportunities for Crowd Feedback
To effectively design feedback mechanisms that achieve the goals of learning, engagement, and quality improvement, we first analyze the important dimensions of the design space for crowd feedback (Figure 2).
Timeliness: When should feedback be shown? In micro-task work, workers stay with tasks for a short while, then move on. This implies two timing options: synchronously deliver feedback when workers are still engaged in a set of tasks, or asynchronously deliver feedback after workers have completed the tasks.
Synchronous feedback may have more impact on future task performance since it arrives while workers are still thinking about the task domain. It also increases the probability that workers will continue onto similar tasks. However, synchronous feedback places a burden on the feedback providers; they have little time to review work. This implies a need for tools or scheduling algorithms that enable near real-time feedback. Asynchronous feedback gives feedback providers more time to review and comment on work.
However, workers may have forgotten about the task or feel unmotivated to review the feedback and to return to the task.
Currently, platforms like Mechanical Turk only allow asynchronous feedback with no enticement to return. Requesters can provide feedback at payment time, but at that point (typically days later), workers care more about getting paid than improving submitted work. More importantly, unless requesters have more jobs available, workers cannot act on requesters’ advice.
Specificity: How detailed should feedback be? Mechanical Turk currently allows requesters one bit of feedback—accept or reject. While additional freeform communication is possible, it is rarely used unless workers file complaints. Workers may learn more if they receive detailed and personalized feedback on each piece of work. However, this added specificity comes at a price: feedback providers must spend time authoring feedback. When feedback resources are limited, customizable templates can accelerate feedback generation and enable requesters to codify domain knowledge into pre-authored statements. However, templates could be perceived as overly general or repetitive, reducing their desired impact. Workers may need explicit incentive to read and reflect on feedback.
Source: Who should provide feedback? Crowdsourcing requesters post tasks with specific quality objectives in mind; they are a natural choice for assuming the feedback role. However, experts often underestimate the difficulty novices face in solving tasks [7] or use language or concepts that are beyond the grasp of novices [6]. Moreover, as feedback
Figure 2: Current systems (in orange) focus on asynchronous, single-bit feedback by requesters. Shepherd (in blue) investigates richer, synchronous feedback by requesters and peers.
Tutorials
• Thanks to Matt Lease: http://ir.ischool.utexas.edu/crowd/
• AAAI 2011 (w/ HCOMP 2011): Human Computation: Core Research Questions and State of the Art (E. Law & Luis von Ahn)
• WSDM 2011: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Omar Alonso and Matthew Lease)
– http://ir.ischool.utexas.edu/wsdm2011_tutorial.pdf
• LREC 2010: Statistical Models of the Annotation Process (Bob Carpenter and Massimo Poesio)
– http://lingpipe-blog.com/2010/05/17/
• ECIR 2010: Crowdsourcing for Relevance Evaluation (Omar Alonso)
– http://wwwcsif.cs.ucdavis.edu/~alonsoom/crowdsourcing.html
• CVPR 2010: Mechanical Turk for Computer Vision (Alex Sorokin and Fei-Fei Li)
– http://sites.google.com/site/turkforvision/
• CIKM 2008: Crowdsourcing for Relevance Evaluation (D. Rose)
– http://videolectures.net/cikm08_rose_cfre/
• WWW2011: Managing Crowdsourced Human Computation (Panos Ipeirotis)
– http://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation
Thanks!
• chi@acm.org
• http://edchi.net • @edchi
• Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In Proceedings of the ACM Conference on Human-factors in Computing Systems (CHI2008), pp.453-456. ACM Press, 2008. Florence, Italy.
• Aniket Kittur, Bongwon Suh, Ed H. Chi. Can You Ever Trust a Wiki? Impacting Perceived Trustworthiness in Wikipedia. In Proc. of Computer-Supported Cooperative Work (CSCW2008), pp. 477-480. ACM Press, 2008. San Diego, CA. [Best Note Award]