Managing the Inferential Possibilities of Open Government Data
Solon Barocas and Arvind Narayanan Princeton University
Jeff Hammerbacher: Facebook “actually built a system to predict the gender of a user who did not offer their gender.”
“[W]hen someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date.”
“Selected most predictive Likes”
Trait | Selected most predictive Likes

IQ
– High: The Godfather Mozart Thunderstorms The Colbert Report Morgan Freemans Voice The Daily Show Lord Of The Rings To Kill A Mockingbird Science Curly Fries
– Low: Jason Aldean Tyler Perry Sephora Chiq Bret Michaels Clark Griswold Bebe I Love Being A Mom Harley Davidson Lady Antebellum

Satisfaction With Life
– Satisfied: Sarah Palin Glenn Beck Proud To Be Christian Indiana Jones Swimming Jesus Christ Bible Jesus Being Conservative Pride And Prejudice
– Dissatisfied: Hawthorne Heights Kickass Atreyu (Metal Band) Lamb Of God Gorillaz Science Quote Portal Stewie Griffin Killswitch Engage Ipod

Openness
– Liberal & Artistic: Oscar Wilde Charles Bukowski Sylvia Plath Leonardo Da Vinci Bauhaus Dmt The Spirit Molecule American Gods John Waters Plato Leonard Cohen
– Conservative: NASCAR Austin Collie Monster-In-Law I don’t read Justin Moore ESPN2 Farmlandia The Bachelor Oklahoma State University Teen Mom 2

Conscientiousness
– Well Organized: Law Officer National Law Enforcement Lowfares.Com Accounting Foursquare Emergency Medical Services Sunday Best Kaplan University Glock Inc Mycalendar 2010
– Spontaneous: Wes Anderson Bandit Nation Omegle Vocaloid Serial Killer Screamo Anime Vamplets Join If Ur Fat Not Dying

Extraversion
– Outgoing & Active: Beerpong Michael Jordan Dancing Socializing Chris Tucker I Feel Better Tan Modeling Cheerleading Theatre Flip Cup
– Shy & Reserved: RPGs Fanfiction.Net Programming Anime Manga Video Games Role Playing Games Minecraft Voltaire Terry Pratchet
“Computer-based personality judgments are more accurate than those made by humans”
The devil is not just in the details
• Data mining provides a path by which
  – to glean facts that have not been disclosed
  – to guess at characteristics that cannot be observed directly
  – and to avoid the need to obtain that information from others
• By automating inference and extending the frontier of the inferable, data mining introduces new routes to arrive at certain personal details, and with them a new set of practical and philosophical challenges for privacy
Anecdotes → Systematic Understanding
A Tentative Taxonomy
1. Inferences made automatically and en masse
2. Inferences drawn on the basis of criteria not commonly perceived as relevant
3. Inferences that extend the frontier of the inferable
NB: Elsewhere, I’ve tried to develop a normative theory to explain what counts as privacy-violating inference-making (“Leaps and Bounds”)
Why Open Government Data?
• The state is frequently able to collect information that other institutions cannot
  – Compelled disclosure
  – Sharing is necessary for the provision of basic services
• The state is likely to have far more comprehensive coverage of the population than any other institution
A Tentative Taxonomy
1. Inferences made automatically and en masse
2. Inferences drawn on the basis of criteria not commonly perceived as relevant
3. Inferences that extend the frontier of the inferable
A Fool’s Errand?
• Practically impossible to anticipate and head off the many different inferences that government data might support
• The very point of open government data
  – To facilitate potential uses that government is unlikely to recognize or anticipate
  – Indeed, open government data initiatives will be perceived as successful only to the extent that they support novel applications
A Familiar Problem
Re-Identification
Deductive Inference
Machine Learning
Inductive Inference (training the model)
Deductive Inference (applying the model)
Troy Raeder, Brian Dalessandro, Claudia Perlich, “Considering Privacy in Predictive Modeling Applications,” Proceedings of the Data Ethics Workshop, Conference on Knowledge Discovery and Data Mining (KDD), August 2014
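The two steps named above can be made concrete with a toy sketch. It is illustrative only: the feature names (`likes_a` and so on), the four training examples, and the presence-only log-odds scorer are all assumptions chosen for brevity, not a method taken from the cited work.

```python
import math
from collections import Counter


def induce_rule(labeled_examples):
    """Inductive step: learn a scoring rule from people whose trait is known.

    labeled_examples: list of (set_of_features, has_trait) pairs.
    Returns (prior_log_odds, {feature: log_odds_weight}).
    """
    pos = [feats for feats, y in labeled_examples if y]
    neg = [feats for feats, y in labeled_examples if not y]
    pos_counts = Counter(f for feats in pos for f in feats)
    neg_counts = Counter(f for feats in neg for f in feats)
    weights = {}
    for f in set(pos_counts) | set(neg_counts):
        # Add-one smoothing keeps both rates strictly between 0 and 1.
        p_pos = (pos_counts[f] + 1) / (len(pos) + 2)
        p_neg = (neg_counts[f] + 1) / (len(neg) + 2)
        weights[f] = math.log(p_pos / p_neg)
    prior = math.log((len(pos) + 1) / (len(neg) + 1))
    return prior, weights


def apply_rule(rule, features):
    """Deductive step: apply the learned rule to someone who disclosed nothing."""
    prior, weights = rule
    score = prior + sum(weights.get(f, 0.0) for f in features)
    return score > 0


# Hypothetical training data: feature sets plus a disclosed trait.
training = [
    ({"likes_a", "likes_b"}, True),
    ({"likes_a"}, True),
    ({"likes_c"}, False),
    ({"likes_c", "likes_d"}, False),
]
rule = induce_rule(training)
print(apply_rule(rule, {"likes_a"}))
print(apply_rule(rule, {"likes_c", "likes_d"}))
```

The person scored in the deductive step never appears in the training data; the rule induced from others is what reaches them.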
#FAIL
“We believe that the empirical study of machine learning and data mining methods often falls prey to the effects of publication bias that favors positive results over negative ones. Most, if not all, articles in conferences and journals report only positive results. This does not reflect the practice of a field where failures happen regularly.”
Christophe Giraud-Carrier and Margaret H Dunham, “On the Importance of Sharing Negative Results,” SIGKDD Explorations 12, no. 2 (December 2010): 3.
Limiting Factors
Algorithms Phenomena Data
Training Data
Labeled Examples
Feature Selection/Engineering
To what extent can we figure out which limit is at play in any given situation, and use that to roll the clock forward to think through future situations?
The Phenomena Themselves
Inferring gender from narrative text (and yet we’re very good at author identification)
From the Field
• Government data lacks the necessary granularity for effective machine learning
  – Insufficiently precise measurement
  – Purposefully coarse data
• Do anonymity-preserving techniques have the additional effect of limiting what can be learned?
• (Increasingly) well-understood trade-off with utility, but no clear sense of whether this imposes potentially desirable limits on what can be learned
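A toy calculation can show how coarsening might cap what is learnable. The sketch below assumes a hypothetical attribute (age 0 to 99) and a hypothetical trait that switches on at age 37; both are invented for illustration. It reports the best accuracy any rule could reach when it only sees the bin an age falls into.

```python
from collections import defaultdict


def best_accuracy_after_coarsening(values, labels, bin_width):
    """Upper bound on accuracy for any rule that sees only the coarsened bin.

    Within each bin the best possible guess is the majority label, so the
    bound is the share of records that agree with their bin's majority.
    """
    bins = defaultdict(lambda: [0, 0])  # bin index -> [negatives, positives]
    for value, label in zip(values, labels):
        bins[value // bin_width][int(label)] += 1
    correct = sum(max(neg, pos) for neg, pos in bins.values())
    return correct / len(values)


# Hypothetical data: one record per age, trait present from age 37 upward.
ages = list(range(100))
labels = [age >= 37 for age in ages]

for width in (1, 10, 50, 100):
    print(width, best_accuracy_after_coarsening(ages, labels, width))
```

In this toy setting the bound falls as the bins widen, which is the sense in which purposefully coarse data could limit inference. Whether real releases behave this way is the open question the slide poses.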
Machine Learning
Inductive Inference (training the model)
Deductive Inference (applying the model)
Inductive Step
• How much training data do we need?
  – What proportion of people need to disclose that they possess a certain attribute for data miners to be able to identify all the other members in the population who also have this attribute?
• What is the minimum-sized set of features necessary to induce a reliable rule?
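One way to probe these questions is by simulation. The sketch below is a thought experiment under loud assumptions, none of which come from the slides: a synthetic population with 30% trait prevalence, binary features that appear with probability 0.7 for trait holders and 0.3 for everyone else, a random share of people who disclose their status either way, and a simple naive Bayes scorer.

```python
import math
import random


def simulate(disclosure_rate, n_people=2000, n_features=5, seed=0):
    """Accuracy on non-disclosers of a rule learned only from disclosers."""
    rng = random.Random(seed)
    people = []
    for _ in range(n_people):
        has_trait = rng.random() < 0.3          # assumed prevalence
        p = 0.7 if has_trait else 0.3           # assumed feature rates
        feats = [rng.random() < p for _ in range(n_features)]
        people.append((feats, has_trait))

    disclosed, hidden = [], []
    for person in people:
        (disclosed if rng.random() < disclosure_rate else hidden).append(person)

    # Inductive step: smoothed per-feature log-odds from disclosers only.
    pos = [f for f, y in disclosed if y]
    neg = [f for f, y in disclosed if not y]
    prior = math.log((len(pos) + 1) / (len(neg) + 1))
    weights = []
    for j in range(n_features):
        p_pos = (sum(f[j] for f in pos) + 1) / (len(pos) + 2)
        p_neg = (sum(f[j] for f in neg) + 1) / (len(neg) + 2)
        # One weight if the feature is present, another if it is absent.
        weights.append((math.log(p_pos / p_neg),
                        math.log((1 - p_pos) / (1 - p_neg))))

    # Deductive step: score everyone who stayed silent.
    correct = 0
    for feats, has_trait in hidden:
        score = prior + sum(on if x else off
                            for x, (on, off) in zip(feats, weights))
        correct += (score > 0) == has_trait
    return correct / len(hidden) if hidden else float("nan")


for rate in (0.01, 0.05, 0.2, 0.5):
    print(rate, round(simulate(rate), 3))
```

Varying `disclosure_rate` speaks to the first question and varying `n_features` to the second. The numbers say nothing about real government data; they only show how the questions could be framed empirically.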
Deductive Step
• How accessible are the features that are necessary to apply the rule?
  – The government information that figures in the inductive step may be more or less difficult to observe by those who want to apply the rule
• How easily can we withhold or conceal these features?
  – Forthcoming research suggests that it depends on the learning process
    • Daizhuo Chen, Robert Moakler, Samuel P. Fraiberger, Foster Provost, “Enhancing Transparency and Control when Predicting Private Traits from Digital Indicators of Human Tastes”
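For intuition about the withholding question, consider a hypothetical linear scoring rule in which each observed feature adds a weight to a score and the trait is inferred when the score is positive. The sketch below greedily withholds the most incriminating features until the inference flips. The weights, feature names, and greedy strategy are assumptions for illustration, not the method of the cited paper.

```python
def features_to_withhold(weights, bias, features):
    """Greedily drop the highest-weight features until the score is <= 0.

    weights:  {feature: weight} for a hypothetical linear rule.
    bias:     intercept of that rule.
    features: the features currently observable for one person.
    Returns (withheld_features, inference_blocked).
    """
    score = bias + sum(weights.get(f, 0.0) for f in features)
    # Only features that push the score up are worth concealing.
    candidates = sorted(
        (f for f in features if weights.get(f, 0.0) > 0),
        key=lambda f: weights[f],
        reverse=True,
    )
    withheld = []
    for f in candidates:
        if score <= 0:
            break
        score -= weights[f]
        withheld.append(f)
    return withheld, score <= 0


# Hypothetical rule and person.
weights = {"a": 2.0, "b": 1.0, "c": 0.5, "d": -0.4}
withheld, blocked = features_to_withhold(weights, bias=-1.0,
                                         features={"a", "b", "c", "d"})
print(withheld, blocked)
```

How many features must be withheld depends on how the weight is spread across features, which is one reading of why the answer would depend on the learning process.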
Thank You
Solon Barocas
[email protected] solon.barocas.org
@s010n
Arvind Narayanan
[email protected] randomwalker.info @random_walker