Post on 20-Jan-2015
Crowdsourcing for Human Computer Interaction Research
Ed H. Chi
Research Scientist, Google (work done while at [Xerox] PARC)
Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In CHI2008.
Example Task from Amazon MTurk
Historical Footnote
• De Prony, 1794, hired hairdressers (unemployed after the French Revolution; they knew only addition and subtraction) to create logarithmic and trigonometric tables.
• He managed the process by splitting the work into very detailed workflows.
– Grier, When Computers Were Human, 2005
Human Computation, Round 1
• Humans were the first “computers”, used for math computations
• Distributed computation:
– Clairaut, astronomy, 1758: computed the Halley's comet orbit (three body problem), dividing up the numeric computations across astronomers
– Grier, When Computers Were Human, 2005; Grier, IEEE Annals, 1998
Using Mechanical Turk for user studies

                    Traditional user studies      Mechanical Turk
Task complexity     Complex, long                 Simple, short
Task subjectivity   Subjective, opinions          Objective, verifiable
User information    Targeted demographics,        Unknown demographics,
                    high interactivity            limited interactivity

Can Mechanical Turk be usefully used for user studies?
Task
• Assess quality of Wikipedia articles
• Started with ratings from expert Wikipedians
– 14 articles (e.g., “Germany”, “Noam Chomsky”)
– 7-point scale
• Can we get matching ratings with Mechanical Turk?
Experiment 1
• Rate articles on 7-point scales:
– Well written
– Factually accurate
– Overall quality
• Free-text input: “What improvements does the article need?”
• Paid $0.05 each
Experiment 1: Good news
• 58 users made 210 ratings (15 per article) – $10.50 total
• Fast results – 44% within a day, 100% within two days
– Many completed within minutes
Experiment 1: Bad news
• Correlation between turkers and Wikipedians only marginally significant (r=.50, p=.07)
• Worse, 59% potentially invalid responses
• Nearly 75% of these done by only 8 users
                   Experiment 1
Invalid comments   49%
<1 min responses   31%
Not a good start
• Summary of Experiment 1: – Only marginal correlation with experts.
– Heavy gaming of the system by a minority
• Possible responses:
– Make sure these gamers are not rewarded
– Ban them from doing your HITs in the future
– Create a reputation system [Dolores Labs]
• Can we change how we collect user input?
Design changes
• Use verifiable questions to signal monitoring – “How many sections does the article have?”
– “How many images does the article have?”
– “How many references does the article have?”
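Checks like these can be scored automatically against ground truth computed from the article itself. The sketch below is illustrative only (the regexes, field names, and `tolerance` are assumptions, not from the talk; real article parsing would be more careful):

```python
import re

def ground_truth_counts(article_html: str) -> dict:
    """Count sections, images, and references in raw article HTML.
    Crude regex-based counts -- a sketch, not production parsing."""
    return {
        "sections": len(re.findall(r"<h[23]\b", article_html)),
        "images": len(re.findall(r"<img\b", article_html)),
        "references": len(re.findall(r'<li id="cite_note', article_html)),
    }

def passes_verifiable_check(worker_answers: dict, truth: dict,
                            tolerance: int = 1) -> bool:
    """Accept a worker's response only if every verifiable count is
    within `tolerance` of the ground truth."""
    return all(abs(worker_answers[k] - truth[k]) <= tolerance for k in truth)
```

Even when workers' counts differ slightly from yours (section counting is ambiguous), a small tolerance keeps the check useful as a monitoring signal.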
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
– “Provide 4-6 keywords that would give someone a good summary of the contents of the article”
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
• Make verifiable answers useful for completing the task
– Used tasks similar to how Wikipedians evaluate quality (organization, presentation, references)
Design changes
• Use verifiable questions to signal monitoring
• Make malicious answers as high cost as good-faith answers
• Make verifiable answers useful for completing the task
• Put verifiable tasks before subjective responses
– First do objective tasks and summarization; only then evaluate subjective quality
– Ecological validity?
Experiment 2: Results
• 124 users provided 277 ratings (~20 per article)
• Significant positive correlation with Wikipedians (r=.66, p=.01)
• Smaller proportion of malicious responses
• Increased time on task

                   Experiment 1   Experiment 2
Invalid comments   49%            3%
<1 min responses   31%            7%
Median time        1:30           4:06
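The headline statistic in both experiments is a Pearson correlation between per-article expert ratings and mean turker ratings. A stdlib-only sketch (the rating vectors below are invented for illustration, not the study's data):

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov / (varx * vary) ** 0.5

# Hypothetical per-article means: expert rating vs. mean turker rating
experts = [6.5, 3.0, 5.0, 2.5, 4.0]
turkers = [6.0, 3.5, 5.5, 3.0, 4.5]
r = pearson_r(experts, turkers)
```

With only 14 articles, the significance test matters as much as r itself, which is why r=.50 at p=.07 in Experiment 1 did not count as success.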
Generalizing to other MTurk studies
• Combine objective and subjective questions
– Rapid prototyping: ask verifiable questions about the content/design of the prototype before subjective evaluation
– User surveys: ask common-knowledge questions before asking for opinions
• Filtering for quality
– Add a free-form response field and filter out data without answers
– Filter out results that came in too quickly
– Sort by WorkerID and look for cut-and-paste answers
– Look for suspicious outliers in the data
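Those filtering heuristics are mechanical enough to script. A rough sketch (the field names, the 60-second cutoff, and the exact duplicate test are illustrative assumptions):

```python
from collections import Counter

MIN_SECONDS = 60  # responses faster than this are suspect

def filter_responses(responses):
    """Apply the quality filters above to a list of response dicts with
    keys: worker_id, seconds, free_text. Returns (kept, rejected)."""
    # Flag exact free-text duplicates (likely cut-and-paste answers)
    text_counts = Counter(r["free_text"].strip().lower() for r in responses)
    kept, rejected = [], []
    for r in responses:
        too_fast = r["seconds"] < MIN_SECONDS
        empty = not r["free_text"].strip()
        pasted = text_counts[r["free_text"].strip().lower()] > 1
        (rejected if (too_fast or empty or pasted) else kept).append(r)
    return kept, rejected
```

Outlier detection on the ratings themselves would be a further pass; the point is that each filter is cheap relative to rerunning a contaminated study.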
Quick Summary of Tips
1. Use verifiable questions to signal monitoring
2. Make malicious answers as high cost as good-faith answers
3. Make verifiable answers useful for completing task
4. Put verifiable tasks before subjective responses

• Mechanical Turk offers the practitioner a way to access a large user pool and quickly collect data at low cost
• Good results require careful task design
Managing Quality
• Quality through redundancy: combining votes
– Majority vote [works best when workers have similar quality]
– Worker-quality-adjusted vote
– Managing dependencies
• Quality through gold data
– Advantageous with an imbalanced dataset and bad workers
• Estimating worker quality (redundancy + gold)
– Calculate the confusion matrix and see if you actually get some information from the worker
• Toolkit: http://code.google.com/p/get-another-label/
Source: Ipeirotis, WWW2011
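A minimal sketch of these vote-combination ideas: plain majority vote, worker quality estimated on gold items, and a quality-weighted vote. The data structures are assumptions for illustration, and real tools like get-another-label estimate full confusion matrices rather than a single accuracy number per worker:

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """Plain majority vote over one item's labels (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def worker_accuracy_on_gold(answers, gold):
    """Estimate each worker's quality from their answers on gold items.
    `answers` maps worker -> {item: label}; `gold` maps item -> true label."""
    acc = {}
    for worker, given in answers.items():
        graded = [(item, lab) for item, lab in given.items() if item in gold]
        if graded:
            acc[worker] = sum(lab == gold[item] for item, lab in graded) / len(graded)
    return acc

def weighted_vote(labels_by_worker, weights):
    """Worker-quality-adjusted vote: each label is weighted by its worker's
    estimated accuracy; workers we know nothing about get weight 0.5."""
    scores = defaultdict(float)
    for worker, label in labels_by_worker.items():
        scores[label] += weights.get(worker, 0.5)
    return max(scores, key=scores.get)
```

The confusion-matrix refinement matters exactly in the imbalanced case the slide mentions: a worker who always answers the majority class looks accurate but carries no information.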
Coding and Machine Learning
• Integration with Machine Learning – Build automatic classification models using
crowdsourced data
Simple Solution
• Humans label training data
• Use training data to build model
Figure: data from existing crowdsourced answers is used to build an automatic model (through machine learning); each new case is fed to the model, which produces an automatic answer.
Source: Ipeirotis, WWW2011
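That label-then-train loop can be shown in miniature: resolve each item's crowd labels by majority vote, then fit a classifier on the resolved labels. The tiny Naive Bayes below stands in for whatever learner you would actually use; all names and data are illustrative:

```python
import math
from collections import Counter, defaultdict

def resolve_labels(crowd_labels):
    """Majority-vote each item's crowd labels into one training label."""
    return {item: Counter(labs).most_common(1)[0][0]
            for item, labs in crowd_labels.items()}

class TinyNaiveBayes:
    """Bag-of-words Naive Bayes with add-one smoothing -- a stand-in for
    a real learner trained on crowdsourced labels."""
    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, lab in zip(texts, labels):
            self.word_counts[lab].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        def score(lab):
            total = sum(self.word_counts[lab].values())
            s = math.log(self.class_counts[lab])
            for w in text.lower().split():
                s += math.log((self.word_counts[lab][w] + 1) /
                              (total + len(self.vocab)))
            return s
        return max(self.class_counts, key=score)
```

Once the model is good enough, it answers routine cases automatically and the crowd is reserved for cases the model is unsure about.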
Limitations of Mechanical Turk
• No control of users’ environment – Potential for different browsers, physical distractions
– General problem with online experimentation
• Not designed for user studies – Difficult to do between-subjects design
– May need some programming
• Users – Somewhat hard to control demographics, expertise
Crowdsourcing for HCI Research
• Does my interface/visualization work? – WikiDashboard: transparency vis for Wikipedia [Suh et al.]
– Replicating Perceptual Experiments [Heer et al., CHI2010]
• Coding of large amount of user data – What is a Question in Twitter? [Sharoda Paul, Lichan Hong, Ed Chi]
• Incentive mechanisms – Intrinsic vs. Extrinsic rewards: Games vs. Pay – [Horton & Chilton, 2010 for MTurk] and Satisficing – [Ariely, 2009] in general: Higher pay != Better work
Crowd Programming for Complex Tasks
• Decompose tasks into smaller tasks
– Digital Taylorism
– Frederick Winslow Taylor (1856-1915)
– 1911, “Principles of Scientific Management”
• Crowd Programming Explorations
– MapReduce models
• Kittur, A.; Smus, B.; and Kraut, R. CHI2011 EA on CrowdForge.
• Kulkarni, Can, Hartmann. CHI2011 workshop & WIP
– Little, G.; Chilton, L.; Goldman, M.; and Miller, R. C. In KDD 2010 Workshop on Human Computation.
Crowd Programming for Complex Tasks
• Crowd Programming Explorations
– Kittur, A.; Smus, B.; and Kraut, R. CHI2011 EA on CrowdForge.
– Kulkarni, Can, Hartmann. CHI2011 workshop & WIP
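The partition/map/reduce decomposition explored by CrowdForge and Turkomatic can be sketched as an ordinary higher-order function, with pure functions standing in for the human workers who would answer each HIT (everything here is a toy stand-in, not either system's API):

```python
def crowd_flow(task, partition, map_step, reduce_step):
    """CrowdForge-style flow: one partition HIT splits the task, map HITs
    handle each piece (possibly in parallel), and a reduce HIT merges the
    results. The three callables stand in for humans answering HITs."""
    pieces = partition(task)                 # e.g., ask a worker for an outline
    mapped = [map_step(p) for p in pieces]   # e.g., collect facts per section
    return reduce_step(mapped)               # e.g., merge into a final artifact

# Toy stand-ins for worker responses:
outline = lambda topic: [f"{topic}: History", f"{topic}: Geography"]
write = lambda section: f"[paragraph about {section}]"
merge = lambda paragraphs: "\n".join(paragraphs)

article = crowd_flow("France", outline, write, merge)
```

In the real systems each callable is itself a posted HIT (often replicated, with voting to pick the best response), and any step can recurse, e.g. a section outline can be partitioned again into subsections.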
“Please solve the 16-question SAT located at
http://bit.ly/SATexam”. In both cases, we paid workers
between $0.10 and $0.40 per HIT. Each “subdivide” or
“merge” HIT received answers within 4 hours; solutions
to the initial task were complete within 72 hours.
Results
The decompositions produced by Turkers while running
Turkomatic are displayed in Figure 1 (essay-writing)
and Figure 4 (SAT).
In the essay task, each “subdivide” HIT was posted
three times by Turkomatic and the best of the three
was selected by experimenters (simulating Turker
voting) to continue the solution process. The proposed
decompositions were overwhelmingly linear and chose
to break the task down either by paragraph or by
activity (for example, one Turker proposed: brainstorm,
create outline, write topic sentences, fill in facts). The
decomposition used in the final essay used two levels of
recursion. As groups of subtasks were completed,
Turkomatic passed solutions to merge workers for
reassembly. The resulting essay is complete and
coherent, although somewhat lacking in cohesion.
We allowed essay-writers to pick a topic; the chosen
one (university legacy admissions) was somewhat
specialized, but the final essay displayed a reasonably
good understanding of the topic, even if the writing
quality was often mixed. The decomposition selected
for the SAT task used only one level of recursion.
Workers divided the task into 12 subtasks consisting of
1 to 3 thematically linked questions. These were each
solved in parallel by distinct workers and the results
were given to a merge worker who produced the final
solution. The score on the overall solution was 12/17,
with the worst performance on math and grammar
questions and the best in reading and vocabulary.
Obtaining useful decompositions proved tricky for
workers – many seemed confused about the nature of
the planning task. However, once the tasks were
decomposed, solution of the constituent parts and
reassembly into an overall solution were
straightforward for Turkers to accomplish.
Evaluation: Interface
In a second informal study, we examined whether
reducing user involvement in the HIT design improved
ease of use and efficiency. We hypothesized that the
high level of abstraction enabled by automatic task
design would make it easier for requesters to
crowdsource their work.
We asked a pool of four users to try to collect answers
for a basic brainstorming task on Mechanical Turk. The
task asked our participants to generate five ideas of
topics for an essay. Participants performed this task
twice: first using Turkomatic to post tasks and obtain
results, then using Mechanical Turk’s web interface. No
instruction on either interface was provided. We
examined how long it took the user to post the task.
With Turkomatic, our users finished posting their tasks
in an average of 37 seconds. On Mechanical Turk,
where low-level task design was required, users needed
an average of 244.2 seconds to post their tasks. More
importantly, the HITs posted by two users who were
not familiar with Mechanical Turk would not have
produced any meaningful results. One user posted
minor variations of the default templates provided on
Figure 4. For the SAT task, we uploaded sixteen questions from a high school Scholastic Aptitude Test to the web and posed the following task to Turkomatic: “Please solve the 16-question SAT located at http://bit.ly/SATexam”.
In map tasks, a specified processing step is applied to each item in the partition. These tasks are ideally simple enough to be answerable by a single worker in a short amount of time. For example, a map task for article writing could ask a worker to collect one fact on a given topic in the article's outline. Multiple instances of a map task could be instantiated for each partition; e.g., multiple workers could be asked to collect one fact each on a topic in parallel.

Finally, reduce tasks take all the results from a given map task and consolidate them, typically into a single result. In the article writing example, a reduce step might take facts collected for a given topic by many workers and have a worker turn them into a paragraph.

Any of these steps can be iterative. For example, the topic for an article section defined in a first partition can itself be partitioned into subsections. Similarly, the paragraphs returned from one reduction step can in turn be reordered through a second reduction step.

Case studies
We explored as a case study the complex task of writing an encyclopedia article. Writing an article is a challenging and interdependent task that involves many different subtasks: planning the scope of the article, how it should be structured, finding and filtering information to include, writing up that information, finding and fixing grammar and spelling, and making the article coherent. These characteristics make article writing a challenging but representative test case for our approach.

To solve this problem we created a simple flow consisting of a partition, map, and reduce step. The partition step asked workers to create an article outline, represented as an array of section headings such as “History” and “Geography”. In an environment where workers would complete high effort tasks, the next step might be to have someone write a paragraph for each section. However, the difficulty and time involved in finding the information for and writing a complete paragraph for a heading is a mismatch to the low work capacity of micro-task markets. Thus we broke the task up further, separating the information collection and writing subtasks. Specifically, each section heading from the partition was used to generate map tasks in

Figure 4. Partial results of a collaborative writing task.
Source: CHI 2011 Work-in-Progress, May 7–12, 2011, Vancouver, BC, Canada
Future Directions in Crowdsourcing
• Real-time Crowdsourcing – Bigham, et al. VizWiz, UIST 2010
Figure 2: Six questions asked by participants, the photographs they took, and answers received with latency in seconds:
• What color is this pillow? (89s) "."; (105s) multiple shades of soft green, blue and gold
• What denomination is this bill? (24s) 20; (29s) 20
• Do you see picnic tables across the parking lot? (13s) no; (46s) no
• What temperature is my oven set to? (69s) it looks like 425 degrees but the image is difficult to see; (84s) 400; (122s) 450
• Can you please tell me what this can is? (183s) chickpeas; (514s) beans; (552s) Goya Beans
• What kind of drink does this can hold? (91s) Energy; (99s) no can in the picture; (247s) energy drink
the total time required to answer a question. quikTurkit also makes it easy to keep a pool of workers of a given size continuously engaged and waiting, although workers must be paid to wait. In practice, we have found that keeping 10 or more workers in the pool is doable, although costly.

Most Mechanical Turk workers find HITs to do using the provided search engine (available at mturk.com). This search engine allows users to view available HITs sorted by creation date, the number of HITs available, the reward amount, the expiration date, the title, or the time allotted for the work. quikTurkit employs several heuristics for optimizing its listing in order to obtain workers quickly. First, it posts many more HITs than are actually required at any time because only a fraction will actually be picked up within the first few minutes. These HITs are posted in batches, helping quikTurkit HITs stay near the top. Finally, quikTurkit supports posting multiple HIT variants at once with different titles or reward amounts to cover more of the first page of search results.

VizWiz currently posts a maximum of 64 times more HITs than are required, posts them at a maximum rate of 4 HITs every 10 seconds, and uses 6 different HIT variants (2 titles × 3 rewards). These choices are explored more closely in the context of VizWiz in the following section.
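quikTurkit's over-posting heuristics (post far more HITs than needed, in batches, cycling through title/reward variants) could be sketched as a schedule generator. The function and its defaults are illustrative only; actual posting through the MTurk API, and the one-batch-per-10-seconds rate limit, are left to the caller:

```python
import itertools

def hit_post_schedule(workers_needed, overpost_factor=64, batch_size=4,
                      titles=("3 Quick Visual Questions",
                              "Answer Three Questions for A Blind Person"),
                      rewards=(0.01, 0.02, 0.03)):
    """Yield batches of (title, reward) HIT specs, cycling through the
    variants, until the over-posted cap (workers_needed * factor) is hit."""
    variants = itertools.cycle((t, r) for t in titles for r in rewards)
    total = workers_needed * overpost_factor
    posted = 0
    while posted < total:
        n = min(batch_size, total - posted)
        yield [next(variants) for _ in range(n)]
        posted += n
```

Cycling through variants spreads the postings across more rows of the first search-results page, which is the whole point of the heuristic.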
FIELD DEPLOYMENT
To better understand how VizWiz might be used by blind people in their everyday lives, we deployed it to 11 blind iPhone users aged 22 to 55 (3 female). Participants were recruited remotely and guided through using VizWiz over the phone until they felt comfortable using it. The wizard interface used by VizWiz speaks instructions as it goes, and so participants generally felt comfortable using VizWiz after a single use. Participants were asked to use VizWiz at least once a day for one week. After each answer was returned, participants were prompted to leave a spoken comment.

quikTurkit used the following two titles for the jobs that it posted to Mechanical Turk: “3 Quick Visual Questions” and “Answer Three Questions for A Blind Person.” The reward distribution was set such that half of the HITs posted paid $0.01, and a quarter paid $0.02 and $0.03 each.
Asking Questions
Participants asked a total of 82 questions (see Figure 2 for participant examples and accompanying photographs). Speech recognition correctly recognized the question asked for only 13 of the 82 questions (15.8%), and 55 (67.1%) questions could be answered from the photos taken. Of the 82 questions, 22 concerned color identification, 14 were open-ended “what is this?” or “describe this picture” questions, 13 were of the form “what kind of (blank) is this?,” 12 asked for text to be read, 12 asked whether a particular object was contained within the photograph, 5 asked for a numerical answer or currency denomination, and 4 did not fit into these categories.

Problems Taking Pictures
9 (11.0%) of the images taken were too dark for the question to be answered, and 17 (21.0%) were too blurry for the question to be answered. Although a few other questions could not be answered due to the photos that were taken, photos that were too dark or too blurry were the most prevalent reason why questions could not be answered. In the next section, we discuss a second iteration on the VizWiz prototype that helps to alert users to these particular problems before sending the questions to workers.

Answers
Overall, the first answer received was correct in 71 of 82 cases (86.6%), where “correct” was defined as either being the answer to the question or an accurate description of why the worker could not answer the question with the information contained within the photo provided (i.e., “This image is too blurry”). A correct answer was received in all cases by the third answer.

The first answer was received across all questions in an average of 133.3 seconds (SD=132.7), although the latency required varied dramatically based on whether the question could actually be answered from the picture and on whether the speech recognition accurately recognized the question (Figure 4). Workers took 105.5 seconds (SD=160.3) on average to answer questions that could be answered by the provided photo compared to 170.2 seconds (SD=159.5) for those
Future Directions in Crowdsourcing
• Real-time Crowdsourcing – Bigham, et al. VizWiz, UIST 2010
• Embedding of Crowdwork inside Tools – Bernstein, et al. Soylent, UIST 2010
Future Directions in Crowdsourcing
• Real-time Crowdsourcing – Bigham, et al. VizWiz, UIST 2010
• Embedding of Crowdwork inside Tools – Bernstein, et al. Soylent, UIST 2010
• Shepherding Crowdwork – Dow et al. CHI2011 WIP
workers to persevere and accept additional tasks. We investigate these hypotheses through a prototype system, Shepherd, that demonstrates how to make feedback an integral part of crowdsourced creative work.
Understanding Opportunities for Crowd Feedback
To effectively design feedback mechanisms that achieve the goals of learning, engagement, and quality improvement, we first analyze the important dimensions of the design space for crowd feedback (Figure 2).
Timeliness: When should feedback be shown? In micro-task work, workers stay with tasks for a short while, then move on. This implies two timing options: synchronously deliver feedback when workers are still engaged in a set of tasks, or asynchronously deliver feedback after workers have completed the tasks.
Synchronous feedback may have more impact on future task performance since it arrives while workers are still thinking about the task domain. It also increases the probability that workers will continue onto similar tasks. However, synchronous feedback places a burden on the feedback providers; they have little time to review work. This implies a need for tools or scheduling algorithms that enable near real-time feedback. Asynchronous feedback gives feedback providers more time to review and comment on work.
However, workers may have forgotten about the task or feel unmotivated to review the feedback and to return to the task.
Currently, platforms like Mechanical Turk only allow asynchronous feedback with no enticement to return. Requesters can provide feedback at payment time, but at that point (typically days later), workers care more about getting paid than improving submitted work. More importantly, unless requesters have more jobs available, workers cannot act on requesters’ advice.
Specificity: How detailed should feedback be? Mechanical Turk currently allows requesters one bit of feedback—accept or reject. While additional freeform communication is possible, it is rarely used unless workers file complaints. Workers may learn more if they receive detailed and personalized feedback on each piece of work. However, this added specificity comes at a price: feedback providers must spend time authoring feedback. When feedback resources are limited, customizable templates can accelerate feedback generation and enable requesters to codify domain knowledge into pre-authored statements. However, templates could be perceived as overly general or repetitive, reducing their desired impact. Workers may need explicit incentive to read and reflect on feedback.
Source: Who should provide feedback? Crowdsourcing requesters post tasks with specific quality objectives in mind; they are a natural choice for assuming the feedback role. However, experts often underestimate the difficulty novices face in solving tasks [7] or use language or concepts that are beyond the grasp of novices [6]. Moreover, as feedback
Figure 2: Current systems (in orange) focus on asynchronous, single-bit feedback by requesters. Shepherd (in blue) investigates richer, synchronous feedback by requesters and peers.
Tutorials
• Thanks to Matt Lease: http://ir.ischool.utexas.edu/crowd/
• AAAI 2011 (w/ HCOMP 2011): Human Computation: Core Research Questions and State of the Art (E. Law & Luis von Ahn)
• WSDM 2011: Crowdsourcing 101: Putting the WSDM of Crowds to Work for You (Omar Alonso and Matthew Lease)
– http://ir.ischool.utexas.edu/wsdm2011_tutorial.pdf
• LREC 2010: Statistical Models of the Annotation Process (Bob Carpenter and Massimo Poesio)
– http://lingpipe-blog.com/2010/05/17/
• ECIR 2010: Crowdsourcing for Relevance Evaluation (Omar Alonso)
– http://wwwcsif.cs.ucdavis.edu/~alonsoom/crowdsourcing.html
• CVPR 2010: Mechanical Turk for Computer Vision (Alex Sorokin and Fei-Fei Li)
– http://sites.google.com/site/turkforvision/
• CIKM 2008: Crowdsourcing for Relevance Evaluation (D. Rose)
– http://videolectures.net/cikm08_rose_cfre/
• WWW2011: Managing Crowdsourced Human Computation (Panos Ipeirotis)
– http://www.slideshare.net/ipeirotis/managing-crowdsourced-human-computation
Thanks!
• chi@acm.org
• http://edchi.net • @edchi
• Aniket Kittur, Ed H. Chi, Bongwon Suh. Crowdsourcing User Studies With Mechanical Turk. In Proceedings of the ACM Conference on Human-factors in Computing Systems (CHI2008), pp.453-456. ACM Press, 2008. Florence, Italy.
• Aniket Kittur, Bongwon Suh, Ed H. Chi. Can You Ever Trust a Wiki? Impacting Perceived Trustworthiness in Wikipedia. In Proc. of Computer-Supported Cooperative Work (CSCW2008), pp. 477-480. ACM Press, 2008. San Diego, CA. [Best Note Award]