Applied Machine Learning
Lecture 4-2: Data collection, bias, and annotation
Selpi ([email protected])
These slides are a further development of Richard Johansson's slides
January 31, 2020
Overview
The need for DATA and how to get DATA
Data collection and bias
Manual annotation
Quality control of annotation
Review and Closing
Supervised, unsupervised, and semi-supervised learning
▶ To be able to learn, the machine needs DATA!
Scraping from websites or using open APIs
Copyright issues
▶ Published on the web ≠ freely available!
▶ There is a risk that the work you do will be wasted
  ▶ Twitter datasets
  ▶ may distribute just the URLs (as in e.g. ImageNet)
  ▶ but they may disappear
How and where do we get data?
▶ Download publicly open data from:
  ▶ UCI Machine Learning Repository
  ▶ data.europa.eu
  ▶ ...
▶ Get access to publicly accessible but regulated (to varying degrees) data from:
  ▶ Swedish Traffic Accident Data Acquisition (STRADA)
  ▶ Authors of papers who made their data accessible to users after registration (e.g., HighD)
  ▶ Kaggle, ...
▶ Pay to get some data
  ▶ SHRP2 Naturalistic Driving Data
  ▶ ...
▶ Or collect new data: this can be challenging!
Overview
The need for DATA and how to get DATA
Data collection and bias
Manual annotation
Quality control of annotation
Review and Closing
Discuss the projects used for illustrations
assumptions about data in machine learning
what's the "population"? what's "representative"?
▶ the sample is representative if what's true about the sample is also true in general
  ▶ is our sample of drivers for a certain vehicle brand representative of all drivers?
  ▶ are our images taken in good lighting conditions representative of "images in general"?
▶ depending on the type of data, it can be hard to determine whether a sample is representative in practice
▶ useful to document the composition of a dataset (a sketch of how follows below)
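
A minimal sketch in pandas of documenting a dataset's composition; the table, column names, and values here are invented for illustration:

import pandas as pd

# A made-up table of collected data points.
drivers = pd.DataFrame({
    "vehicle_brand": ["A", "A", "A", "B", "B", "C"],
    "lighting": ["day", "day", "night", "day", "day", "day"],
})

# Report the share of each group, so anyone using the dataset can judge
# whether the sample plausibly matches the population they care about.
for column in drivers.columns:
    print(drivers[column].value_counts(normalize=True))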
Example of bias
▶ in 1936, the Literary Digest polled a few million Americans about their preferred candidate in the presidential election
  ▶ result of the poll: Landon 57%, Roosevelt 43%
  ▶ result of the election: Roosevelt 62%, Landon 38%
▶ the massive polling error was caused by
  ▶ sampling bias: they polled people with a phone; poorer people were overrepresented among people without a phone
  ▶ nonresponse bias: who are the people who answer the survey?
  ▶ a similar difficulty: self-selection in web survey data
▶ What about in real driving data collection?
stratification and weighting
▶ To decide on a sampling strategy, take into account the purpose of collecting the data
▶ Illustrate different scenarios of sampling drivers w.r.t. age groups (see the sketch below)
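
A hypothetical sketch of two such scenarios in pandas (the group labels and proportions are made up; groupby(...).sample(...) needs pandas >= 1.1):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Made-up population of drivers with deliberately imbalanced age groups.
drivers = pd.DataFrame({
    "driver_id": np.arange(1000),
    "age_group": rng.choice(["18-30", "31-50", "51+"],
                            size=1000, p=[0.6, 0.3, 0.1]),
})

# Scenario 1: proportional stratified sampling; every age group keeps
# its population share in the sample.
proportional = drivers.groupby("age_group").sample(frac=0.1, random_state=0)
print(proportional["age_group"].value_counts(normalize=True))

# Scenario 2: equal allocation (30 drivers per group); estimates computed
# from this sample must be weighted back to the population proportions.
equal = drivers.groupby("age_group").sample(n=30, random_state=0)
pop_share = drivers["age_group"].value_counts(normalize=True)
sample_share = equal["age_group"].value_counts(normalize=True)
weights = equal["age_group"].map(pop_share / sample_share)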
Stratified train/test splits in scikit-learn
from sklearn.model_selection import train_test_split

X, Y = ( ... read the dataset ...)

# the keyword argument is `stratify`: class proportions in Y are then
# preserved in both the training and the test split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, stratify=Y)
▶ but this doesn't solve the problem of the difference between our sample and the real-world distribution...
availability vs. representativity
▶ sometimes, we don't have the luxury of selecting a "representative" sample: we just have to take what we can get
  ▶ observational data in medicine
  ▶ historical data
  ▶ drivers in a naturalistic driving study
▶ technical issues:
  ▶ fiction is harder to access than web-published text (news, blogs, ...)
  ▶ copyright: we get a bias if we only include free data
sampling effects and machine learning systems
▶ genre: what if there are only book reviews in our sentiment dataset?
▶ time: how well will my system work in a different year, a different season?
▶ selection: what if the skin tumor detection system was trained only on people who saw a specialist?
▶ demography: what if there are only white people, or only people without glasses, in the training data for image classifiers?
Cross-domain classification example
▶ Example of sample selection bias
▶ See notebook (a self-contained sketch follows below)
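
The lecture notebook is not reproduced here, but the following self-contained sketch shows the same phenomenon: the true concept is identical in both domains, yet a model fit on a biased sample of inputs degrades badly on the other domain (all data below is synthetic):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def make_domain(center, n=2000):
    # The true concept is the same everywhere: y = 1 when |x| > 1.5.
    # The domains differ only in where their inputs are concentrated.
    X = rng.normal(loc=center, scale=1.0, size=(n, 1))
    y = (np.abs(X[:, 0]) > 1.5).astype(int)
    return X, y

X_src, y_src = make_domain(center=+2.0)   # "source" domain used for training
X_tgt, y_tgt = make_domain(center=-2.0)   # "target" domain seen at deployment

clf = LogisticRegression().fit(X_src, y_src)
print("source accuracy:", accuracy_score(y_src, clf.predict(X_src)))
print("target accuracy:", accuracy_score(y_tgt, clf.predict(X_tgt)))

The linear model learns the boundary that works where the training inputs happen to lie (around x = 1.5) and therefore mislabels most of the target domain, whose positive examples lie below x = -1.5.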
Other aspects to consider for data collection
▶ Ethical issues (e.g., animal/human testing)
▶ Legal issues (e.g., foreign companies cannot collect GPS data to be used outside China)
▶ Budget & time
Overview
The need for DATA and how to get DATA
Data collection and bias
Manual annotation
Quality control of annotation
Review and Closing
Training data for imitating human decisions
▶ in many cases, the goal of a predictive system is to automate human decisions
▶ in practice, the human input is often missing and has to be added manually
▶ this process is called annotation
  ▶ also "labeling", "tagging", "coding", etc.
▶ in real-world scenarios, this is a substantial investment
▶ we will now discuss some practical aspects of annotation
Some types of annotation
▶ categories:
  ▶ what type of animal is this?
  ▶ is this email spam or legit?
  ▶ does this event lead to a crash or not?
▶ segmentation or tagging:
  ▶ highlight the parts of an image showing a street sign
  ▶ mark when the driver is distracted (eating, texting, talking on the phone, etc.)
  ▶ mark the segments of the text that refer to proteins
▶ graphs, trees, and other types of structures:
  ▶ biology, language, ...
Tools for annotation
▶ small projects, simple annotation: text file, Excel, directories
▶ in the long run, it usually pays off to find or develop a specialized annotation user interface
  ▶ because the type of data to annotate is complex
  ▶ because we want to keep track of annotators
▶ Example tool for annotating driving data (Fig. 67, D3.3)
example: tagging relevant audio segments
example of annotating text: names
▶ http://brat.nlplab.org/
▶ WebAnno is a similar tool: https://webanno.github.io/webanno/
example: relation annotation in biomedical text
Biases in annotation
▶ is the user interface biased?
▶ is some choice easier? is there a "default"?
▶ are the annotators paid by the hour or by quantity?
▶ boredom?
Annotation manual / specifications
▶ See SHRP2 Code book / data dictionary
▶ we need to write down a manual specifying the task in detail
▶ the clarity of the manual will influence the quality of the annotation
▶ a few useful things to include:
  ▶ the purpose of the annotation
  ▶ definitions of the concepts in the model
  ▶ ... and practical explanations of how they are applied
  ▶ a reasonable number of examples
  ▶ descriptions of common hard cases and borderline situations
example: defining an annotation task
Who should annotate and how to get annotators
▶ specialists? students with specialist training? a company specialising in this task?
▶ use software to do semi-automatic annotation? (see the sketch after this list)
▶ use crowdsourcing (e.g., non-experts instead of trained experts)?
  ▶ the most well-known framework is Amazon Mechanical Turk: http://mturk.com
  ▶ risk of cheating, ethical issues (e.g., low salary)
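
A hypothetical sketch of the semi-automatic option: a model trained on a small, manually annotated seed set proposes labels for the remaining items, and annotators only verify or correct them (the data and model below are placeholders):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder data standing in for a real annotation project.
X, y = make_classification(n_samples=500, random_state=0)
X_seed, y_seed = X[:50], y[:50]   # small, manually annotated seed set
X_rest = X[50:]                   # items still awaiting annotation

model = LogisticRegression(max_iter=1000).fit(X_seed, y_seed)
proposals = model.predict(X_rest)                 # pre-labels to verify/correct
confidence = model.predict_proba(X_rest).max(axis=1)

# Show annotators the least confident items first; confident pre-labels
# usually need only a quick check.
review_order = confidence.argsort()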
Example of unpaid crowdsourcing: reCAPTCHA
Example of unpaid crowdsourcing: A/B testing
Examples of companies in the annotation business
▶ https://www.annotell.com/ (Gothenburg)
▶ https://www.figure-eight.com/ (formerly CrowdFlower)
▶ https://appen.com
▶ https://www.cogitotech.com/
Overview
The need for DATA and how to get DATA
Data collection and bias
Manual annotation
Quality control of annotation
Review and Closing
Safeguards in crowdsourcing
▶ inspection after the fact
▶ mix annotation with checks
▶ double annotation
▶ inter-annotator agreement (to see how often the annotators agree with each other)
various inter-annotator scores in Python
▶ the StatsModels Python library includes some of these scores:
  http://www.statsmodels.org/dev/stats.html#module-statsmodels.stats.inter_rater
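
A small sketch with made-up labels, using functions from the statsmodels.stats.inter_rater module linked above:

import numpy as np
from statsmodels.stats.inter_rater import (
    aggregate_raters, cohens_kappa, fleiss_kappa)

# Two annotators labeling the same ten items as spam (1) or legit (0).
ann1 = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1])
ann2 = np.array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0])

# Cohen's kappa is computed from the 2x2 table of how often the two
# annotators' choices coincide.
table = np.zeros((2, 2))
for a, b in zip(ann1, ann2):
    table[a, b] += 1
print(cohens_kappa(table))   # kappa with standard error and interval

# With more than two annotators, tally the labels per item and use
# Fleiss' kappa (a third annotator is faked here for illustration).
ratings = np.column_stack([ann1, ann2, ann1])
counts, _ = aggregate_raters(ratings)
print(fleiss_kappa(counts))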
Overview
The need for DATA and how to get DATA
Data collection and bias
Manual annotation
Quality control of annotation
Review and Closing
Review of data collection, bias, and annotation
▶ On data collection:
  ▶ Reason about why the choice of data collection could have a big influence on machine learning performance
▶ On annotation:
  ▶ Explain the pros and cons of the different methods used for data annotation (see "Pros and cons of labelling approaches")
  ▶ Describe what could be done to control the quality of data annotation
  ▶ Explain how data annotation could influence the performance of machine learning systems
▶ On bias:
  ▶ Suggest ways to minimise the bias from data collection and annotation
Next lecture (on Friday next week)
▶ Optimisation in machine learning
▶ Logistic regression and support vector classifiers