A social scientist’s perspectives on data science
Drew Conway
NYC Data Science Meetup
March 5, 2013
http://www.flickr.com/photos/uiowa/8047195100/
Hacking Skills
Obtain Munge
I hold the following truths to be self-evident...
1. Data come from many sources
2. Data come in many form(at)s
A .zip file of PDFs ≠ data
‣Data scientists must know where to get data and how to obtain it
‣Work with big text files
$ head publicvotes-20101018_votes.dump
‣Work with APIs
$ curl 'http://search.twitter.com/search.json?q=@drewconway' > drewconway.json
Real data are messy
‣Even curated data: duplicates, missing values, date formats
‣Combine data from multiple sources/formats
‣Tools
• *NIX tools: sed, awk, grep
• Scripting languages: Perl, Python and R
$ cat ufo_awesome.tsv | grep probe | wc -l
131
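A minimal Python sketch of the same kind of munging, assuming a small made-up tab-separated sample. The rows, column names, and date formats below are illustrative and are not taken from ufo_awesome.tsv. It drops exact duplicates, marks missing values, normalizes dates, and counts matching rows in the spirit of `grep probe | wc -l`.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample standing in for a messy tab-separated export.
RAW = (
    "id\tdate\tdescription\n"
    "1\t2010-10-18\tbright light\n"
    "1\t2010-10-18\tbright light\n"   # exact duplicate row
    "2\t10/19/2010\t\n"               # missing description, different date format
    "3\t19 Oct 2010\tprobe sighting\n"
)

# Assumed date formats; real data will need its own list.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d %b %Y")


def parse_date(text):
    """Try each known format; return an ISO date string, or None if none fit."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(raw):
    """Drop exact duplicates, normalize dates, and mark empty fields as None."""
    seen = set()
    rows = []
    for row in csv.DictReader(io.StringIO(raw), delimiter="\t"):
        key = tuple(row.values())
        if key in seen:
            continue
        seen.add(key)
        row["date"] = parse_date(row["date"])
        row["description"] = row["description"] or None
        rows.append(row)
    return rows


rows = clean(RAW)

# Count rows mentioning "probe", skipping rows with a missing description.
probe_count = sum(
    1 for r in rows if r["description"] and "probe" in r["description"]
)
```

The point of the design is that each kind of mess (duplicates, gaps, formats) gets its own small, testable step.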
Hacking Skills
While 80% of effort is spent here, perhaps most straightforward to teach
Heavily tool-focused, borrow from CS/EE curricula
‣Comfort working at the command-line, with text editors
‣A language for every season!
Conveying findings in creative and compelling ways
Math & Stats Knowledge
If: Better data beats better math
Then: What methods should be taught?
How do you find structure in new data?
‣Scatter plots
‣Density plots
Data exploration that scales
‣Reduce dimensionality
‣PCA, SVD, MDS
Methods must match data
‣Text
‣Geospatial
‣Web-scale
What is the ‘best’ model?
‣Most predictive
‣Most parsimonious
‣Cross-validation
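One way to make "most predictive vs. most parsimonious" concrete is k-fold cross-validation. The sketch below is only an illustration under assumptions: the data are synthetic, the two candidate models (a constant and a straight line) are toy choices, and 5 folds is an arbitrary default.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def fit_mean(xs, ys):
    """Parsimonious model: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Least-squares line through the training points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def cv_mse(fit, xs, ys, k=5):
    """Mean squared error on held-out folds, averaged over all points."""
    squared_errors = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        squared_errors.extend((model(xs[i]) - ys[i]) ** 2 for i in fold)
    return sum(squared_errors) / len(squared_errors)


# Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise.
rng = random.Random(42)
xs = [i / 10 for i in range(50)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

mse_mean = cv_mse(fit_mean, xs, ys)
mse_line = cv_mse(fit_line, xs, ys)
```

Here the line should win on held-out error because the data really are linear; on other data the simpler model may do just as well, which is the case for parsimony.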
Explore Model
Math & Stats Knowledge
Universities are good at methods training... but what methods fit into Data Science?
Things data scientists like...
‣Illustrating the current state of the world
‣Predicting future observations
‣Classifying/ranking observations
Things social scientists like...
‣Testable theoretical models
‣Natural experiments
‣Causality
1. When applicable
2. Right tool / right job
3. Open black boxes
4. Learn limitations
Substantive Expertise
Data Science, as a discipline, is fundamentally about human behavior
Inquire Interpret
Focus on questions / not tech
‣What new questions can be asked from web-scale data?
‣Tools are a means to an end
Social science has questions
‣Markets
‣Organization
‣Decision making
How do we know when the results we get make sense, if ever?
http://www.flickr.com/photos/cawley/3242403224/
Case Study: Methods for Collecting Large-Scale Non-Expert Text Coding
Median Voter Theorem
Theorem: In a majority-rule system, the preference of the median voter will prevail
http://thomasmoreinstitute.wordpress.com/2010/04/28/the-uk-election-and-the-curse-of-the-median-voter/
Assumption: The political/ideological preferences of voters can be projected onto a single numeric dimension
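A small simulation can illustrate the theorem under exactly this assumption. The sketch below is not from the talk: the voter positions are random draws, each voter is assumed to back whichever option lies closer to their own position on the single dimension, and the number of voters is odd so the median is an actual voter.

```python
import random
from statistics import median


def majority_winner(a, b, ideal_points):
    """Pairwise vote between positions a and b.

    Each voter backs the option closer to their ideal point; exact ties abstain.
    Returns the winning position, or None if the vote is tied.
    """
    votes_a = sum(1 for v in ideal_points if abs(v - a) < abs(v - b))
    votes_b = sum(1 for v in ideal_points if abs(v - b) < abs(v - a))
    if votes_a > votes_b:
        return a
    if votes_b > votes_a:
        return b
    return None


rng = random.Random(1)

# Hypothetical electorate: 101 voters placed on a single left-right dimension.
voters = [rng.uniform(-1, 1) for _ in range(101)]
median_position = median(voters)

# Pit the median voter's position against many random challengers.
challengers = [rng.uniform(-1, 1) for _ in range(1000)]
median_always_wins = all(
    majority_winner(median_position, c, voters) == median_position
    for c in challengers
    if c != median_position
)
```

The result depends entirely on the one-dimensional assumption; with two or more dimensions a position that beats every challenger generally does not exist.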
Median Voter Theorem
http://voteview.com/blog/?p=564
How do we calculate these numbers?
We make it up...
http://www.flickr.com/photos/estherlairlandesa/4649566079/
But, we have to!
http://en.wikipedia.org/wiki/File:Obama_Health_Care_Speech_to_Joint_Session_of_Congress.jpg
http://www.flickr.com/photos/becca02/6727193557/
A tale of two disciplines
Physics: Build instrument, Measure
Political Science: Observe action, Infer
One thing we have a lot of: text
Politicians
‣Speeches
‣Constituent communication
Parties
‣Platform / manifestos
‣Position statements
Countries
‣Diplomatic cables
‣Military declarations
Expert Coding!
How expert coding (typically) works
http://en.wikipedia.org/wiki/Official_Monster_Raving_Loony_Party
Expert Code Book
1. Health & Safety: We propose to ban Self Responsibilty on
the grounds that it may be dangerous to your health.
2. M.P’s Expenses: We propose that instead of a second home
allowance M.P’s will have a caravan which will be parked
outside the Houses of Parliament. This will make it easier
as flipping a caravan is easier than flipping homes
3. Eurofit: The European Constitution which will be sorted
out by going for a long Walk. “As everyone knows that
walking is good for the constitution”
Manifesto
Party                 Year  Score
Monster Raving Loony  2010  -2
DATA!
What’s wrong with experts?
They’re slow
They’re biased
They’re expensive
They’re wrong
Can we use non-experts to code political manifestos?
How can we measure the quality/validity of non-expert codings?
Use Mechanical Turk to code many manifesto fragments.
Experimental approach
Expert codings
Texts: 18 “big 3” British party manifestos 1987-2010
Experts: 5 advanced poli. sci. graduate students + 2 tenured faculty
Coding: deliberately simple schema
Baseline data
Three experiments
No Qualification: Anyone in
Low-Threshold: 4/6 Correct
High-Threshold: 5/6 Correct
MT codings
Experimental design
Hypothesis: Stronger filter on Turkers leads to better coding
Filter: Use MT qualification test as gatekeeper
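The gatekeeping logic can be sketched in a few lines. This is an illustration only: it does not call the Mechanical Turk API, the gold answers and the worker's answers are made up, and only the thresholds (anyone, 4 of 6, 5 of 6) come from the slides.

```python
# Minimum number of correct answers (out of 6) needed to enter each experiment.
THRESHOLDS = {
    "No Qualification": 0,   # anyone in
    "Low-Threshold": 4,      # 4/6 correct
    "High-Threshold": 5,     # 5/6 correct
}

# Hypothetical gold-standard answers for a six-item qualification test.
GOLD = ["economic", "social", "neither", "economic", "social", "economic"]


def score_test(answers, gold=GOLD):
    """Count how many of a worker's answers match the gold standard."""
    return sum(1 for given, correct in zip(answers, gold) if given == correct)


def admitted(answers, experiment, gold=GOLD):
    """Return True if the worker's test score clears the experiment's threshold."""
    return score_test(answers, gold) >= THRESHOLDS[experiment]


# A hypothetical worker who gets four of six right.
worker_answers = ["economic", "social", "economic", "economic", "social", "social"]
worker_score = score_test(worker_answers)
worker_access = {name: admitted(worker_answers, name) for name in THRESHOLDS}
```

This worker would be allowed into the first two experiments but not the high-threshold one.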
How do we think about coding a manifesto fragment?
Example text coding HIT from the experiment
How do we implement this (aka, the glue)?
Expert codings
[{ "text_unit_id": ..., "sentence_text": ..., ... }, ...]
Random sample, as JSON
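A rough sketch of this step, assuming the expert codings are held as a list of dictionaries. Only the `text_unit_id` and `sentence_text` field names appear on the slide; the `expert_code` field, the sample records, and the `sample_as_json` helper are hypothetical.

```python
import json
import random

# Hypothetical expert-coded sentences; "expert_code" is an assumed field name.
expert_codings = [
    {
        "text_unit_id": i,
        "sentence_text": f"Placeholder sentence {i}.",
        "expert_code": random.Random(i).choice(["economic", "social", "neither"]),
    }
    for i in range(100)
]


def sample_as_json(records, n, seed=0):
    """Draw n records without replacement and serialize them as a JSON array.

    Only the fields a worker needs to see are kept; the expert code is left out
    of the payload (a design assumption, not something the slides specify).
    """
    rng = random.Random(seed)
    sample = rng.sample(records, n)
    payload = [
        {"text_unit_id": r["text_unit_id"], "sentence_text": r["sentence_text"]}
        for r in sample
    ]
    return json.dumps(payload)


payload = sample_as_json(expert_codings, 10)
decoded = json.loads(payload)
```

Seeding the sampler keeps the draw reproducible, which helps when the MT codings are later matched back to the expert codings by `text_unit_id`.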
EC2
S3
MT
Dynamically generate HITs
MT codings
Push HITs + retrieve results
Statistical analysis of results
Scholarship, FTW!
https://github.com/drewconway/mturk_coder_quality
What’s good about MT non-experts?
They’re fast
They’re biased?
They’re cheap
They’re wrong?
The last crowd-sourced coding job was for 600 sentences and got 4,300 sentences coded in about 20 hours (about 3.6 sentences per minute)
•We pay about $0.02 / sentence
•Typical manifesto (in British set) has 1,000 sentences
•Whole manifesto coded for $20
•By comparison, the CMP pays expert coders about €150 per manifesto; call it €0.15 or $0.20 per sentence, 10x more per sentence
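The arithmetic behind these figures, worked through as a short sketch. All inputs come from the slides except the exchange rate, which is an assumed round figure chosen to match the slide's "€.15 or $.20".

```python
# Figures quoted on the slides.
SENTENCES_CODED = 4300
HOURS_ELAPSED = 20
MT_PRICE_PER_SENTENCE_USD = 0.02
SENTENCES_PER_MANIFESTO = 1000
EXPERT_PRICE_PER_MANIFESTO_EUR = 150

# Assumption: rough exchange rate implied by equating EUR 0.15 with USD 0.20.
USD_PER_EUR = 1.33

# Throughput: sentences coded per minute.
sentences_per_minute = SENTENCES_CODED / (HOURS_ELAPSED * 60)

# Cost of coding one whole manifesto on Mechanical Turk.
mt_cost_per_manifesto_usd = MT_PRICE_PER_SENTENCE_USD * SENTENCES_PER_MANIFESTO

# Expert cost per sentence, in euros and then dollars.
expert_per_sentence_eur = EXPERT_PRICE_PER_MANIFESTO_EUR / SENTENCES_PER_MANIFESTO
expert_per_sentence_usd = expert_per_sentence_eur * USD_PER_EUR

# How many times more an expert-coded sentence costs.
cost_ratio = expert_per_sentence_usd / MT_PRICE_PER_SENTENCE_USD
```

Rounded, these give about 3.6 sentences per minute, $20 per manifesto on MT, about $0.20 per expert-coded sentence, and a ratio of roughly 10.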
                Results                               Kappa Statistic
Experiment      Sentences  # MT Coders  % Agreement   k*    Std. Error  z
No Qual.        1,315      89           0.65          0.47  0.13        22.6
Low-Threshold   1,393      56           0.7           0.54  0.12        26.7
High-Threshold  1,250      23           0.62          0.41  0.13        18.3
* A k value between 0.4-0.6 is considered “moderate” agreement
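For readers unfamiliar with kappa, here is a minimal sketch of how chance-corrected agreement differs from raw percent agreement. The slides do not say which kappa variant was used, so Cohen's kappa for two coders is shown as an assumption, and the ten codings are invented for illustration.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of items on which the two coders gave the same label."""
    matches = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return matches / len(coder_a)


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(coder_a)
    observed = percent_agreement(coder_a, coder_b)
    counts_a = Counter(coder_a)
    counts_b = Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    # Chance agreement: probability both coders pick the same label at random,
    # given each coder's own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical codings of ten sentences (E = economic, S = social, N = neither).
expert = ["E", "E", "E", "E", "S", "S", "S", "N", "N", "N"]
turker = ["E", "E", "E", "S", "S", "S", "S", "E", "N", "S"]

agreement = percent_agreement(expert, turker)
kappa = cohens_kappa(expert, turker)
```

In this toy example raw agreement is 0.7 while kappa is about 0.55, which is why the table reports both.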
Agreement by experiment
Experiment      Expert Coding  MT % Agreement
No Qual.        Economic       0.77
                Social         0.92
                Neither        0.22
Low-Threshold   Economic       0.87
                Social         0.98
                Neither        0.2
High-Threshold  Economic       0.77
                Social         0.91
                Neither        0.09
Agreement by expert-coding
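A breakdown like this can be computed by grouping sentences by the expert's code and asking how often the MT code matches within each group. The sketch below assumes that reading of the table; the codings are invented and the function name is hypothetical.

```python
from collections import defaultdict


def agreement_by_expert_code(expert_codes, mt_codes):
    """For each expert category, the share of sentences where the MT code matches."""
    totals = defaultdict(int)
    matches = defaultdict(int)
    for expert_code, mt_code in zip(expert_codes, mt_codes):
        totals[expert_code] += 1
        if expert_code == mt_code:
            matches[expert_code] += 1
    return {code: matches[code] / totals[code] for code in totals}


# Hypothetical codings of ten sentences.
expert_codes = [
    "Economic", "Economic", "Economic", "Economic",
    "Social", "Social", "Social",
    "Neither", "Neither", "Neither",
]
mt_codes = [
    "Economic", "Economic", "Economic", "Social",
    "Social", "Social", "Social",
    "Economic", "Neither", "Social",
]

by_category = agreement_by_expert_code(expert_codes, mt_codes)
```

Splitting agreement this way is what exposes a weak category that an overall agreement figure would hide.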
Results of initial MT experiments
                Results                               Kappa Statistic
Experiment      Sentences  # MT Coders  % Agreement   k*    Std. Error  z
Econ-only       942        15           0.62          0.23  0.1         4.28
Soc-only        955        32           0.6           0.17  0.09        0.95
* A k value between 0.4-0.6 is considered “moderate” agreement
Experiment      Expert Coding  MT % Agreement
Economic-only   Economic       0.92
                Neither        0.28
Social-only     Social         0.97
                Neither        0.19
Non-experts have a very hard time with a “null” coding!
Separating Social and Economic Sentences
Joint work with...
Michael Laver, NYU
Kenneth Benoit, LSE
Slava Mikhaylov, UCL
Paper: http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2260437
Presentation: http://bit.ly/nonexperts
Project Florida
Coder performance stability
[Figure panels: No Qualification, Low-threshold, High-threshold]
Performance becomes very stable after approximately 20 HITs
Party shifts: economic
Party shifts: social