Kimberly A. Jameson1, Sean Tauber1, Prutha S. Deshpande2, Stephanie M. Chang3, and Sergio Gago3
1Institute for Mathematical Behavioral Sciences, 2Cognitive Sciences, and 3Calit2
University of California, Irvine
Crowdsourcing the transcription of archival data
UCI ColCat Project Collaborators:
Funding and Support for the archive project: Calit2 at UCI. University of California Pacific Rim Research Program, 2010-2015 (K.A. Jameson, PI). National Science Foundation 2014-2017 (#SMA-1416907, K.A. Jameson, PI). UCI’s UROP Program Awards. IRB Approvals HS#2013-9921 and 2015-9047.
Prutha Deshpande
Sean Tauber
Stephanie Chang
Sergio Gago
Nathan Benjamin
Yang Jiao
Brian Huynh
Han Ke
Ram Bhakta
Zhimin Xiang
Ian Harris
Prutha S. Deshpande (CogSci)
Sean Tauber (IMBS)
Sergio Gago (Calit2)
Stephanie M. Chang (Calit2)
• Background on an important problem in Cognitive Science.
• The domain under consideration: Color categorization.
• Creating a new database using internet-based procedures.
• Features of the internet-based research problem and solution approaches that may generalize elsewhere.
• Modeling the problem and developing appropriate analyses.
• Preliminary results from empirical tests.
• Summary.
Talk Overview
Research on how concepts are represented across linguistic groups
✶ Individual concept formation and the sharing and transmission of concepts within and across groups.
E.g., Kinship terminology …
Concept formation across language groups
E.g., Kinship terminology:
https://en.wikipedia.org/wiki/Kinship
In what ways are representations of concepts similar across individuals and language groups?
and
What are the various ways concepts vary across individuals and language groups?
How do the world’s languages map the color appearances we all see in our environments?
Basic Color Terms (1969)
Paul Kay, Brent Berlin
Basic color terms are described as “the smallest set of simple words with which the speaker can name any color.”
Image Credit: Lindsey & Brown (2006). PNAS, 102.
Basic Color Terms (1969)
(1) Found all languages tested had systems including 11 or fewer basic color words (e.g., English): red, yellow, green, blue, orange, purple, pink, brown, grey, black and white.
(2) Provided a sequence by which languages adopted subsets of the 11 basic color categories.
(Terms such as crimson, blonde and royal blue are not considered to be basic.)
IMBS workshop UC Irvine 12/04/2015
Color concept universals like this were made popular by Berlin & Kay and by several other investigators; still, there are instances where different societies have evolved different conventions for color naming ...
Image Credit: Lindsey & Brown (2006). PNAS, 102.
Berinmo (5 words)
Image Credit: Kay & Regier (2007). Cognition, 102.
Different numbers of Color Terms: n=3, n=4, n=5, n=6
Image Credit: T. Regier et al. (2007). PNAS, 104.
The World Color Survey
✶ 110 languages; 25 speakers per language.
✶ Data collection ended in 1980.
✶ Digitizing hand-coded data took more than 23 years.
✶ A very valuable site of unembellished ascii data files: http://www.icsi.berkeley.edu/wcs/data.html
World Color Survey Data — Uses a Generic Format
The existing World Color Survey (WCS) database
✶ Beginning ~2003 the WCS database was made publicly available.
✶ Has been very widely cited in the last few years.
http://www.icsi.berkeley.edu/wcs/data.html
E.g., the focus selection task: shown the chart, speakers pinpoint the “best example” of each term root they volunteered while naming.
Datafile example, “foci.txt”: the color chip selected as a category’s best exemplar.
(WCS datafiles do not include headers.) Columns are:
• Language Number
• Speaker Number
• Focus Number
• Term Abbrv.
• Coordinates of focus selection
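A minimal sketch of reading a file in this layout, assuming whitespace-delimited fields in the column order listed above; the sample lines and field names below are illustrative, not actual WCS records.

```python
from collections import namedtuple

# One focus-selection record; fields follow the column list above.
FocusRecord = namedtuple("FocusRecord", "language speaker focus term chip")

def parse_foci(lines):
    """Parse header-less, whitespace-delimited foci records."""
    records = []
    for line in lines:
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        lang, spk, foc, term, chip = fields
        records.append(FocusRecord(int(lang), int(spk), int(foc), term, chip))
    return records

# Hypothetical example lines in the described layout (not real WCS data):
sample = ["1 1 1 LF A0", "1 2 1 VE G3"]
for r in parse_foci(sample):
    print(r.language, r.speaker, r.term, r.chip)
```

Once parsed this way, focus selections can be grouped by language or speaker for analyses like the English/Korean comparison shown next.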
Focus selections in two languages:
Deshpande, P.S. (under review). Investigating Color Categorization Behaviors in Korean-English Bilinguals. UCI Undergraduate Research Journal (submitted June, 2015).
English
Korean
The WCS data is awesome, but a platform with a GUI for empirically investigating and analyzing such data would be even better, and a site with rigorous on-board research tools would also be a big plus.
We were given a chance to do this…
Jameson, K. A., Benjamin, N. A., Chang, S.M., Deshpande, P. S., Gago, S., Harris, I. G., Jiao, Y., and Tauber, S. (2015). Mesoamerican Color Survey Digital Archive. In Encyclopedia of Color Science and Technology, (Ronnier Luo, Ed.). Springer: Berlin / Heidelberg. ISBN: 978-3-642-27851-8 (Online). DOI 10.1007/978-3-642-27851-8.
Nathan: See Poster, An Affordance Based Approach to Large Data-Set Navigation.
The Robert E. MacLaury Archive
✶ ~23,000 pages of raw color categorization data that includes:
✶ 116 dialects from indigenous Mesoamerican societies (261 surveys), and
✶ ~130 additional surveys from a variety of languages (across Africa, Asia, the Americas and Europe).
R. E. MacLaury’s Dissertation: Color in MesoAmerica, Vol. I: A Theory of Composite Categorization (1986).
Book: Color and Cognition in Mesoamerica: Constructing Categories as Vantages (1997).
The Mesoamerican portion of the REM archive:
37 within Oaxaca
30 within Guatemala
33 within Mexico City
Jameson et al. (2015). ECST.
Chinantec language diversity in the MCS
(Language vitality legend: Developing, Vigorous, Endangered)
Jameson et al. (2015). ECST.
Features of our transcription problem that may be general:
✶ The data has a constrained structure and format (unlike typical historical records transcription tasks).
✶ It’s a perceptual identification/reproduction problem: e.g., identify handwritten characters/symbols in a standardized template or form and reproduce them via keyboard input.
✶ Transcription of large blocks of data can be broken into small tasks and transcribed by OCR or crowdsourcing methods.
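The last point, breaking large blocks of data into small redundant tasks, can be sketched as follows; the batch size, redundancy level, and cell naming are all assumptions chosen for illustration.

```python
def make_microtasks(cell_ids, workers_per_cell=5, batch_size=10):
    """Split a long list of cells into small batches, each assigned to several workers."""
    batches = [cell_ids[i:i + batch_size] for i in range(0, len(cell_ids), batch_size)]
    # Each batch is duplicated workers_per_cell times so responses can be cross-checked.
    return [{"batch": b, "copy": k} for b in batches for k in range(workers_per_cell)]

cells = ["chip_%03d" % i for i in range(1, 331)]  # one 330-chip chart, as in the WCS stimulus
tasks = make_microtasks(cells)
print(len(tasks))  # 33 batches x 5 redundant copies = 165 micro-tasks
```

Redundant copies of each batch are what later make consensus-based aggregation (e.g., CCT) possible.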
Yang: See Poster, Optical Character Recognition of Handwritten Tabular Data.
Focus selection task: shown the chart, speakers pinpoint the “best example” of each term root they volunteered while naming.
Problem: Convert THIS into a data-addressable file.
American English data (handwritten records; data rows continue up to chip 330).
Challenges of our transcription job:
• Concepts, and how they apply everywhere.
• There’s a classic example: color.
• There’s an existing database.
• There’s a chance to do better.
• Crowdsourcing can help greatly.
• Why OCR doesn’t work: handwriting that is not prose.
• The reason is that it’s a perceptual problem.
• Crowdsourcing lets us break the problem into pieces and solve it piecewise.
Features of our problem and approach that may apply elsewhere:
✶ The perceptual nature of our tasks differs from general information surveys or opinion-poll data: response bias is likely to be item-based rather than the usual informant-based form, perhaps allowing more than one possible decision strategy.
✶ In large-scale efforts there’s a need to automate quantification and evaluation of the “goodness” of the transcribed product.
✶ Minimizing response bias by partitioning larger tasks into smaller, distributed tasks that are answered by several subjects and reassembled into a whole lends itself to crowdsourced approaches.
✶ While crowdsourcing makes Big Data possible, an intelligent model of data aggregation (like CCT) may permit trading off “smarter” data for “bigger” data, giving a more economical approach to accurately deriving robust results using internet-based crowdsourcing methods.
National Science Foundation 2014-2017 (#SMA-1416907, K.A. Jameson, PI).
Jameson & Romney (1990). Consensus on Semiotic Models of Alphabetic Systems. J. of Quant. Anthro.
Batchelder & Romney (1988). Test theory without an answer key. Psychometrika.
Cultural consensus analyses of a cognitive-perceptual task
✶ For tasks evaluating new characters designed to extend the 26 letters of the English alphabet, consensus analyses objectively identified expert typeface designers as having higher “competence” than college undergraduates.
Automating archive transcription: Task and Judgments
Stephanie
Design 1: OCR verification (pattern recognition) - 2-AFC yes/no
Design 2: OCR verification (training data) - free response
Design 3: Crowdsource verification - 2-AFC “match/no-match”
Design 4: Naming ranges 1 - free response + confidence
Design 5: Naming ranges 2 - N-AFC + confidence
Design 6: Focus transcription 1 - free response + confidence
Design 7: Focus transcription 2 - free response
“free response” = a “reCAPTCHA” task.
Poster title: Designing Crowdsourcing Methods for the Transcription of Handwritten Documents.
E.g., internet-based transcription task:
http://colcat.calit2.uci.edu
Cultural Consensus Theory (CCT) to aggregate the data
Deshpande, Tauber, Chang, Gago & Jameson (in preparation). Digitizing a large corpus of handwritten documents using crowdsourcing and cultural consensus theory.
Prutha
— Automate piece-wise crowdsourced transcription designs for analysis with CCT to derive the correct transcription.
— Enrich the model underlying the Dichotomous Bayesian form of CCT (Oravecz et al., 2014) to handle N-alternative forced-choice data formats.
— As a result, employ smarter analyses of smaller samples, using CCT’s formal process model, that produce solutions as robust as those from large amounts of “averaged” data.
See Poster: A Cultural Consensus Theory Analysis of Crowdsourced Transcription Data.
Results: Task 4 (n=30)
Inferring the true transcription
• Mode?
• (Bayesian) Cultural Consensus Theory (CCT) (Oravecz, Vandekerckhove & Batchelder, 2014; Batchelder & Romney, 1988)
Cultural Consensus Theory (CCT)
• “Test theory without an answer key” (Batchelder & Romney, 1988)
• Allows us to infer:
• shared latent cultural knowledge (true transcription)
• individual ability
• item difficulty
• response bias
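The dichotomous model behind these quantities can be sketched as a small simulation: informant i knows item k's true answer Z_k with probability D_i (competence) and otherwise guesses “true” with bias g, and the answer key is then recovered by a competence-weighted vote. This is a rough stand-in for the full Bayesian machinery of Oravecz et al.; all numbers below are illustrative.

```python
import random

random.seed(0)
N, M = 12, 40                                       # informants, items
key = [random.random() < 0.5 for _ in range(M)]     # latent answer key Z_k
D = [random.uniform(0.6, 0.95) for _ in range(N)]   # competences D_i
g = 0.5                                             # guessing bias (shared here)

def respond(i, k):
    """Informant i answers item k: knows the key w.p. D_i, else guesses."""
    if random.random() < D[i]:
        return key[k]
    return random.random() < g  # blind guess, "true" with probability g

X = [[respond(i, k) for k in range(M)] for i in range(N)]

# Start from majority vote, then iterate: estimate each informant's
# competence as agreement with the current key, and re-vote with weights.
est = [sum(X[i][k] for i in range(N)) > N / 2 for k in range(M)]
for _ in range(10):
    comp = [sum(X[i][k] == est[k] for k in range(M)) / M for i in range(N)]
    est = [sum(comp[i] * (1 if X[i][k] else -1) for i in range(N)) > 0
           for k in range(M)]

accuracy = sum(e == z for e, z in zip(est, key)) / M
print(round(accuracy, 2))
```

With a dozen moderately competent informants the weighted vote recovers nearly the whole key, which is the intuition behind "test theory without an answer key."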
Cultural Consensus Theory (CCT)
• Usually applied to dichotomous (true/false) data.
• Other formats have been explored within the Bayesian framework, but not multiple choice / free response (to our knowledge).
• Not typically applied to perceptual identification (although, see Jameson 1990).
Graphical models: Dichotomous CCT vs. Multiple Choice CCT, showing observed data, latent parameters, and subject-wise bias.
Examples of perceptually confusable stimuli
Response bias: Individuals or items?
(subject-wise bias vs. item-wise bias)
CCT Answer Key: Task 4
Subject-wise posteriors: Answer 4 (Z4), Answer 16 (Z16), Answer 125 (Z125), Subject 0 bias (g0).
Item-wise posteriors: Answer 4 (Z4) with Item 4 bias (g4), Answer 16 (Z16) with Item 16 bias (g16), Answer 125 (Z125) with Item 125 bias (g125).
Subject-wise and item-wise model predictions: Task 4.
Subject-wise and item-wise model predictions: Task 7.
✶ CCT was designed to work on the small subject samples (6-10 informants) typical of anthropological studies.
Would the patterns of results reported for Task 4 be possible with a sample smaller than 30 participants?
Method                   | Answer Key Estimate %-correct | Mean Competence | Mean Item Difficulty
Trial 1 - 8 participants | 100%                          | 0.929           | 0.466
Trial 2 - 8 participants | 100%                          | 0.937           | 0.460
Trial 3 - 8 participants | 100%                          | 0.914           | 0.459
Trial 4 - 8 participants | 100%                          | 0.942           | 0.464
Trial 5 - 8 participants | 100%                          | 0.935           | 0.464
30 Participants          | 100%                          | 0.917           | 0.366
Can we use fewer informants?
Preliminary trends suggest 8 participants may be as informative as 30.
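One way to probe the 8-vs-30 question is to repeatedly draw 8 of the 30 informants and check how often the subsample's aggregate answer key matches the full sample's. The sketch below does this with simulated responses and a plain majority vote; the agreement rate and data are illustrative, not the actual Task 4 responses.

```python
import random

random.seed(1)
N, M = 30, 50
key = [random.random() < 0.5 for _ in range(M)]
# Simulated informants who match the true key 85% of the time.
X = [[key[k] if random.random() < 0.85 else not key[k] for k in range(M)]
     for _ in range(N)]

def majority_key(rows):
    """Aggregate a set of informants' responses by per-item majority vote."""
    return [sum(r[k] for r in rows) * 2 > len(rows) for k in range(M)]

full = majority_key(X)
agreements = []
for _ in range(100):
    sub = random.sample(X, 8)               # draw 8 of the 30 informants
    sub_key = majority_key(sub)
    agreements.append(sum(a == b for a, b in zip(sub_key, full)) / M)

print(round(sum(agreements) / len(agreements), 2))  # mean subsample agreement
```

When informants are individually reliable, small redundant subsamples reproduce the full-sample key closely, consistent with the 8-participant trials in the table above.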
Discussion points
• Two (or more) response-strategy “subcultures”?
• Confidence data can help CCT results
• Quantitative model evaluation
• Item + individual bias component?
• Automation and integration with other server-side processes (Python module vs. R, Matlab)
Results Summary:
✶ These preliminary results suggest that two novel approaches, piece-wise crowdsourcing and CCT data handling, can be used to accurately transcribe a large corpus of ethnographic data.
✶ By using internet-based methods, it appears we can avoid a 20+ year manual transcription job and derive an accurate and unbiased database of great value to investigations of concept formation across language groups.
✶ The economical way in which we modeled this perceptually based transcription problem seems likely to generalize to other internet-based tasks that require extraction and evaluation of targets embedded in distracting information, and our novel use of CCT analyses seems promising for intelligently aggregating smaller subsets of crowdsourced responses to address large data-handling problems.
Thanks for Listening!!